Diagnosing performance problems on Oracle… Why so difficult?
Since introducing Oracle to our live environment to support a major system, we’ve been seeing a variety of problems – resolution for which can only really be described as a black art.
Why does Oracle make life so difficult for those new to the platform to find out what it’s actually doing?
We’re currently running a 2-node Oracle 11g RAC cluster on 64-bit Windows 2003 server; which seems to be the first hurdle that has to be climbed… Although Oracle are quick to state that Oracle-on-Windows (OoW )is a first class platform, and on the whole one that seems to perform just fine, something that’s become more apparent as time has gone on is that finding other people who have investigated & resolved performance issues with OoW is somewhat of an upward struggle. This extends into patching for example.
We recently identified a problem in the RAC clusterware that has been patched for all other platforms, but for some reason wasn’t ported to OoW. This came to light during an amusing conversation with Oracle’s support team who were avidly advising us to apply this patch, but were unable to provide us with a direct link to the patch to download. Only after getting them to go check the available downloads on Metalink did they realise that the particular patch wasn’t available for 11.0.x on Windows! Linux, *nix or Solaris – no problem…
Patching aside, we’ve recently been struggling with a major performance problem on our database which periodically resulted on both RAC servers essentially locking up at 100% CPU load across all cores, with the applications grinding to a standstill thanks to endless amounts of escalating locks within the database.
After what felt like many fruitless days of battling with this issue alongside working with the software vendor to determine whether the application issues we were seeing were a cause or effect of the problem, we started examining exactly what Oracle was doing at the point of an issue occurring. Eventually, with the aid of Oracle’s support team to analyse many Gb’s of traces captured with the system struggling, we were able to determine that the database was having great difficulty caching & reusing execution plans – and once identified, altered a system parameter to force Oracle to reuse execution plans. Problem seemingly solved!
Presenting symptoms:
- High CPU usage across both cluster nodes
- Unstable RAC environment with frequent node crashes, possibly due to excessively high CPU loads, forcing us to run on 1 RAC node.
- Rapid lock-state escalation resulting in all sessions being blocked & application crashes
- High % of lock wait states
- High % “Library Cache Latch” wait states
- High IO wait states
- Retention of 10’s of 1000’s of versions of query execution plans
Our first attempt at diagnosing this focused largely on the IO side of things, as storage for Oracle was not originally dedicated to Oracle but was located on a shared SAN diskgroup with some other applications. This was causing a variety of IO problems within the database as performance of the DB’s LUNs would fluctuate depending on other activities on the SAN. We decided to rule this out completely, so decided to provision a new set of SAN enclosures & disks dedicated to providing Oracle with sufficient storage, over dedicated spindles, arranged into RAID 10 for best performance. Oracle’s ASM proved its worth when the new disk became available, allowing us to keep the database fully available for users during migration of around 2.5Tb to new disks. Although the new storage helped dramatically by reducing IO contention & related wait states, it left us with the lock & Library Cache Latch wait states which between them still were able to bring the system to its knees.
To cut a long story short, it seems that the query code emitted from our applications was not lending itself to efficient re-use within Oracle as it doesn’t currently utilise bind variables, so at the peak of our problems we could query V$sqlarea and see 1000’s of versions of our top 10 queries being retained – despite them being identical bar substitution of a few parameters.
The key symptoms in all of this were actually the high % of “Library Cache Latch” wait states and the high number of execution plan versions being retained in the DB’s shared memory pool.
Inspecting v$sqlarea revealed that Oracle was not able to reuse execution plans as we were seeing retention of 1000’s of versions of plans for specific queries, along with high numbers of open & loaded versions of those plans indicating that nearly every query was being individually parsed to determine how best to execute it.
To determine this, try running something like the query below to retrieve your version counts from v$sqlarea:
select version_count, loaded_versions, open_versions, executions, sql_text
from v$sqlarea orderby version_count desc
In our case we could see 6000 or so versions of one query, with 4-5000 loaded versions, 1-2000 open versions etc. These numbers were somewhat higher than expected… should ideally have a handful of versions, <10 loaded / open versions etc.
Every one of those execution plan versions gets checked when Oracle is presented with a new query to see if they can be reused, so given that every query was being parsed… all those plan versions were being checked for virtually every transaction the DB was processing. This is what seems to have produced 100% CPU load & the “Library Cache Latch” waits, and as a result the corresponding high numbers of Lock waits.
To test our theory we tried flushing the shared memory pool during one period when the system was experiencing locking issues:
ALTER SYSTEM FLUSH SHARED_POOL;
And were able to see an instant reduction in both CPU load, retained plan versions & lock waits.
To prevent this from happening and to get Oracle to re-use queries whenever possible, Oracle Support advised us to set the CURSOR_SHARING system parameter to “FORCE”, which under certain circumstances can improve re-use of execution plans.
ALTER SYSTEM SET cursor_sharing=’FORCE’ SCOPE=BOTH;
After altering the setting & flushing the shared pool, we saw an immediate reduction in CPU load along with an instant reduction in the numbers of queries being cached & a corresponding drop in contention / wait states due to all the various types of lock state. Since introducing the setting we’re now seeing a maximum of 5-10 open versions of our most frequently hit queries despite some of them having been executed 4-500,000 times. We’ve also been able to resume operations with both cluster nodes online as neither node is now running at much more than 30-40% CPU.
This chart shows the difference in wait states within the database from before & after the CURSOR_SHARING configuration was adjusted…
Just goes to show that Oracle performance issues can be dependent on anything and sometimes wait states from something seemingly unconnected to the issues you’re seeing can actually be directly symptomatic of the underlying issue plaguing your database!