We have a client who needs to run monthly reports from their financial application. When we first started working for them, the monthly report took nearly a week to complete. We noticed a huge amount of log file waits and did the classic tuning task of increasing the size and number of redo logs while placing them on alternating disks. That reduced the processing time from several days to several hours, and kept them happy for about five years. Over the years, though, the database and the processing time have both kept growing, and the report now takes about 14 hours to complete. While that is much better than it was originally, it’s becoming burdensome again.
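For reference, that original redo log change amounted to something like the following sketch. The group numbers, file paths, and sizes here are illustrative, not the client’s actual values; each group is multiplexed across the two alternating disks:

```sql
-- Add larger redo log groups, multiplexed across alternating disks
-- (paths and sizes are made up for illustration).
ALTER DATABASE ADD LOGFILE GROUP 4
  ('D:\oradata\redo04a.log', 'E:\oradata\redo04b.log') SIZE 512M;
ALTER DATABASE ADD LOGFILE GROUP 5
  ('E:\oradata\redo05a.log', 'D:\oradata\redo05b.log') SIZE 512M;

-- Cycle the logs, then drop each old undersized group once it is INACTIVE
ALTER SYSTEM SWITCH LOGFILE;
ALTER DATABASE DROP LOGFILE GROUP 1;
```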
This time, the Statspack report needed a bit more analysis. The client is running Standard Edition on 32-bit Windows, without the Diagnostic or Tuning Packs. That rules out AWR, and caps usable RAM at a couple of gigabytes. We kicked off hourly Statspack snapshots a week before the monthly run to ensure we had a good baseline, then picked a window in the middle of the run to compare against normal processing.
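Setting that up looks roughly like the sketch below, run as the PERFSTAT user. The hourly DBMS_JOB interval is one common way to schedule snapshots on older releases; spreport.sql then prompts for the two snapshot IDs to compare:

```sql
-- Take one Statspack snapshot manually
EXEC statspack.snap;

-- Schedule a snapshot at the top of every hour (pre-Scheduler style)
VARIABLE jobno NUMBER;
BEGIN
  DBMS_JOB.SUBMIT(:jobno, 'statspack.snap;', SYSDATE,
                  'TRUNC(SYSDATE + 1/24, ''HH24'')');
  COMMIT;
END;
/

-- Generate the report between two snapshot IDs (baseline vs. monthly run)
@?/rdbms/admin/spreport.sql
```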
The first thing that jumped out at us was that they were doing MUCH more work in MUCH smaller transactions during the monthly report. The Load Profile section showed redo per second going from 72K to 1.5M, while redo per transaction fell from 800K to 3K. Likewise, block changes jumped from 470/s to 8.4K/s but plummeted from 5.2K/tran to 17/tran. Logical reads stayed roughly the same per second (13.5K to 16.6K), but per transaction dropped from 150K to 35. The only way to interpret those stats is that the application is performing row-by-row processing (Tom Kyte calls this “slow-by-slow”). There is almost never a reason to do this, and it is rightly regarded as a programming bug. Unfortunately, it’s also extremely common. Like the vast majority of tuning problems, there really isn’t anything we as DBAs can do to fix it: the application needs to be changed to do bulk processing.
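To illustrate the difference (table and column names here are invented, since we can’t see the vendor’s code), slow-by-slow versus bulk processing looks like this:

```sql
-- Slow-by-slow: one row, one statement, one commit at a time.
-- Every iteration is its own tiny transaction, which is exactly
-- the redo-per-transaction pattern the Load Profile showed.
BEGIN
  FOR r IN (SELECT id, amount FROM detail_rows) LOOP
    UPDATE monthly_totals
       SET total = total + r.amount
     WHERE id = r.id;
    COMMIT;
  END LOOP;
END;
/

-- Bulk: one statement does all the work in a single transaction.
UPDATE monthly_totals m
   SET total = total + (SELECT SUM(d.amount)
                          FROM detail_rows d
                         WHERE d.id = m.id)
 WHERE EXISTS (SELECT 1 FROM detail_rows d WHERE d.id = m.id);
COMMIT;
```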
Working our way down the Statspack report to the Top 5 Timed Events, we saw db file sequential read waits jump from 5K to 295K, even though the Buffer Hit Ratio stayed in the 99%+ range. A db file sequential read is a single-block read, typically an indexed lookup, so despite the healthy hit ratio the run was doing an enormous amount of physical I/O. If we could add RAM, we might keep more of those blocks in the SGA, but we’d really like to eliminate the I/O at the source. Looking ahead to the Top SQL by reads, we saw that the top several queries all related to a table named xxx_TEMP_DETAIL. That’s trouble right there. Permanent tables used as scratch space are almost always a bad idea in Oracle. The pattern is standard practice in MS SQL Server, so sometimes a programmer with a Microsoft background, or a vendor trying to be “DBMS agnostic,” will carry the technique over to Oracle. It’s a bug. What’s worse, the table isn’t actually a temporary table, so it incurs far more overhead (full redo and undo) than a true temp table would. Again, there isn’t much we can do as DBAs to fix a poorly architected application; this code needs to do bulk processing and avoid temp tables. We can try to force better behavior by converting it to a real global temporary table, pinning it in RAM, building better indexes, or turning it into an index-organized table. Unfortunately, without knowing exactly what the application is doing, any of these could break the application (more than it already is) by changing side effects that the programmer is wrongly relying on.
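If the application tolerated it, converting the scratch table to a true global temporary table would look something like this sketch. The column list is invented, and the ON COMMIT choice depends entirely on whether the application commits mid-processing and expects the rows to survive:

```sql
-- A true global temporary table: data is private to each session,
-- generates far less redo, and is cleaned up automatically.
CREATE GLOBAL TEMPORARY TABLE xxx_temp_detail_gtt (
  detail_id   NUMBER,
  account_id  NUMBER,
  amount      NUMBER
) ON COMMIT PRESERVE ROWS;   -- or DELETE ROWS, depending on usage
```

The risk is exactly the side-effect problem mentioned above: if one session writes the rows and another session reads them, a GTT silently breaks that, because each session sees only its own data.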
Finally, in the Top SQL sorted by CPU, we saw that the third most CPU-intensive statement was the recursive update of seq$, Oracle’s internal sequence housekeeping. That means Oracle spent more time updating sequence metadata than executing all but two user statements, which is highly unusual, since sequences are normally extremely efficient. Looking at the sequence definitions, we saw that all the application-owned sequences are set to NOCACHE. This is a very bad thing: every NEXTVAL call then forces an update to the seq$ table, and there were 1.6M of these unneeded updates. This is usually caused by programmers who do not understand sequences. A sequence guarantees unique, increasing values; it does not guarantee gap-free, consecutive numbering. Programmers who want a gap-free row counter often turn off caching to try to force that behavior, but sequences cannot be used that way, and relying on it is typically a bug. On the other hand, it is possible that some tool or legacy convention sets NOCACHE everywhere and the code is actually correct. If we can turn caching on, we pave the way for a much lower transaction load (with less logging) and for parallel processing. Unfortunately, based on what we see, the application appears to rely on sequences being gap-free counters, so turning on caching might break it.
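Finding and fixing the uncached sequences is straightforward, if the application can stand it. A sketch (the schema and sequence names are placeholders):

```sql
-- Find application sequences created with NOCACHE
SELECT sequence_owner, sequence_name
  FROM dba_sequences
 WHERE cache_size = 0
   AND sequence_owner = 'APPOWNER';   -- schema name is illustrative

-- Restore the default cache: each instance pre-allocates 20 values
-- in memory instead of updating seq$ on every NEXTVAL
ALTER SEQUENCE appowner.some_seq CACHE 20;
```

Note that cached values still in memory are lost on instance shutdown or cache ageing, leaving gaps in the numbering, which is precisely why this change breaks code that treats sequences as gap-free row counters.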