Wednesday, June 02, 2010

ORA-00600 Stellium

ORA-00600 [ARG1] [ARG2] [ARG3] [ARG4] [ARG5]
This is the kind of things a production DBA does not want to read at the alert.log file, particularly when the production database availability is compromised. At this time the only thing we would like to have is a magic wand that could make this dreaded ORA600 errors magically disappear. The truth is that so far there is no such magic wand and we all have to go through a troubleshooting procedure.

This is not a troubleshooting guide, and if you are reading this now it means either you are following my blog or you are googling after a problem you currently face and you have no clue on what is going on.

The basic troubleshooting procedure
Make sure you are at the latest patchset available for your current installed Oracle Home.
Collect and pack all the trace, and log files and raise a Service Request (SR) at Oracle Support Services (OSS).

In the mean time, try to read through the quasi-encrypted text trace and log files to find out if there is a known pattern, a readable entry point function, a particular environmental circumstance that precluded the ORA600, any thing that can help you out answer to the very basic question ... What's the root problem?

In my particular environment I had an ORA-00600 Stellium (a group of planets that from the observer's perspective they are linked together in a series of continuous conjunctions) It was a cascade of ORA-00600 errors that used to periodically flood my alert.log, trace files and incident library.

The ORA-00600 Stellium
ORA-00600 [KKPAMRFGET0]
ORA-00600 [kkdcacr: ptn_kxcp]
ORA-00600 [qerrmOFbu:invalid rowcount]
ORA-00600 [kkpamRFGet()+369]

The ORA-07445 companions
ORA-07445: exception encountered: core dump [kxccexi()+1219] [ACCESS_VIOLATION] [ADDR:0xC45] [PC:0x263EFAF] [UNABLE_TO_READ] [].

The technical environment
Oracle 11g Rel. 1 (11.1.0.6.0)
MS Windows 2003 x64 R2 Server 5.2 SP 2
PowerEdge 2950
4 CPU's
16 Gb RAM

Symptoms
The pattern found was, during a huge transactional activity on several partitioned tables containing spatial data columns suddenly all connections were broken and in order to continue with the production environment a the Oracle instance had to be restarted.

There was no specific time for this error to show up, it could be at night, at noon, on weekends, so there was actually not a specific time pattern. The key to solve this issue lied on the fact that there was too much transactional activity. After digging into the trace files I found out that there were memory issues, not related with the amount, but rather with the allocation. After the Oracle instance was running for a while (several days) on a stressful transactional activity it started to cause instability withe the memory allocation policies.

The SR Feedback
So far OSS told us the recurrence of the Ora600 errors documented at the production environment are due to a bug not yet identified so far by Oracle. The only tasks that can be done so far are to upgrade to 11.1.7.0 and wait for a patch that would be released on May 2010 that probably fixes this issue. The main argument OSS provided was the fact that the error could not be reproduced on their environment.

At this point Oracle does not have a patch or a direct procedure to avoid the kind of ORA600 errors seen at the production environment. The notes and white papers documented so far point to some issues found on the Intel x64 platform based environments. So it can happen both at Linux or Windows.

Troubleshooting
The troubleshooting path I followed in this case was:
  • Compile and pack the facts (trace and log files in an incident pack)
  • Raise a Service Request Informing Oracle about the issues
  • Track with the assigned Oracle Support Analyst
  • Downgrade the memory parameters behavior from 11g to 10g
  • Upgrade from 11.1.0.6.0 to 11.1.0.7.0
  • Upgrade Client Machines
  • Upgrade each client machine where an oracle client is installed
  • Upgrade the middle tier client that connects to the Oracle database.
  • Revert back the Oracle 11g memory allocation policy to 10g.

Three key factors
After struggling for months with these issues I was able to identify three key factors:
Upgrade to 11.1.0.7.0. Upgrading provides three advantageous points, first you don't have to wait for OSS to ask for an upgrade when raising a SR, second you prevent the root cause to be due after some known bugs, and third it is part of the best practices.
Revert back Memory allocation policy to 10g. 11g introduced a new instance parameter memory_target, which defines the total SGA + PGA memory allocated for the current instance, this way you don't have to worry about the way memory is individually distributed among the different memory component, meanwhile in 10g there were two different parameters, one named sga_target and the second pga_aggregate_target, this divided the memory allocation in two parts. In my personal belief I am strongly convinced that there was an issue with this new allocation policy on this particular Windows 2003 x64 platform. 11g for windows is the last port released by Oracle Corp., not too many big companies use Windows as the main production O.S. platform, and the new features have to go through a testing that sometimes cannot thoroughly reproduce a heavy weight production environment, so it is very likely that at the extreme conditions systems are sometimes exposed a very hidden in the deep internals bug is hit. This was the case in this environment, so I suggested to revert back to the 10g memory allocation policy.
Rewrite the application. Leaning on the original development team is great, you can manipulate the source code and have it tailored in case it is required. In this case there were some applications that wildly hit the production environment with a huge amount of transactions. In this particular point the credit goes to Anil R. who knew exactly how to rewrite the code that was still triggering an ORA-07455 error after the DB was upgraded.

After one month of activity no errors have shown up, and even though O.S.S. was not able to issue a conclusive reason why these errors showed up, the troubleshooting procedure was so far successful.

Conclusion
An ORA-00600 is still an error that lies within the Oracle jurisdiction, don't try to rewrite the oracle.exe code in case this kind of errors show up, but at least try to do your very best to abstract the 600 Black Box by identifying the key factors that trigger the error, and reading through the log and trace files. Answer the question Why this error showed up? and absolutely the very first troubleshooting point, raise a Service Request at Oracle Support Services.