This week, a client encountered a particularly nasty bug, 10194190, which caused their ASM instance processes to “spin” and throw a series of errors, ultimately crashing the instance on both of their RAC nodes.
In the ASM instance:
ORA-00490: PSP process terminated with error
PMON (ospid: 12345): terminating the instance due to error 490
In the database instance, we saw either an ORA-00240 or an ORA-03113 error, followed by an instance crash.
ORA-00240: control file enqueue held for more than 120 seconds
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
The trigger was node uptime: each node had been up for more than 248 days. On Solaris SPARC systems, a bug in the compiler can cause either a database or an ASM instance to crash once the server has been up for more than 248 days.
There are bug fixes available for versions 10.2.0.4 through 220.127.116.11, but there is no fix for any 12c database at the moment.
Larry’s Secret Number?
“248 days”, you ask? Curious, n’est-ce pas? Indeed, especially when you learn that there are a couple of other Oracle bugs out there which really seem to fixate on that number.
In a previous life, I encountered bug 4612267 while running the 10.2.0.1 Client for an EDW batch server.
It is looping on the times() function.
In addition to sqlplus, it has been reported that the netca and dbca tools also hang.
This may happen with a new installation of Instant Client 10.2.0.1.0 or Oracle 10.2.0.1.0 on UNIX platform, or it can just occur after some period of time with no other changes.
This is a known, unpublished bug.
Bug 4612267 OCI CLIENT IS IN AN INFINITE LOOP WHEN MACHINE UPTIME HITS 248 DAYS
This machine was critical to the enterprise and had to be rock solid, so once we got things stable, the project team were loath to mess with it, even though it was running 10.2.0.1.
“Not messing with it” also involved maintaining server uptime, because we had a bunch of NFS mounts attached to it, which seemed to somehow cause the server to take a very long time to reboot.
Unfortunately, the platform was a bit too “solid” and we hit bug 4612267 during a particularly busy afternoon of batch processing. As is usual with such things, it took a while to find the culprit, which turned out to be new sqlplus sessions started after 1pm that day on the EDW batch server. The server had been up 231 days.
We looked and we could see the sessions had started, but they had got no CPU time whatsoever. In typical IT fashion, we collected as much diagnostic info as we could and bounced the server. Naturally, the problem went away.
After much head-scratching, and after seeing the same issue once or twice more whenever the server uptime ranged between 60 and 248 days, we realized we were hitting the bug. As we were using a client install, we couldn’t upgrade to 10.2.0.2, and applying the patch failed for some obscure reason, despite assistance from Oracle Support. We only “fixed” the bug when we upgraded the client to 11g, which was deemed to be too much “messing about” until after our busiest period of the year was over.
Our workaround was to schedule a server reboot, which was a pain, though it was better than the alternative.
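For anyone stuck with the same workaround, here is a minimal sketch of the kind of check that could sit in front of a scheduled reboot. It is not what we actually ran: the function name `reboot_due`, the 200-day limit and the 14-day margin are all made up for illustration, and how you obtain the uptime in seconds depends on your platform.

```python
SECONDS_PER_DAY = 86_400


def reboot_due(uptime_seconds: float,
               max_uptime_days: float,
               margin_days: float = 14) -> bool:
    """Return True once uptime is within margin_days of the chosen limit.

    All names and defaults here are hypothetical; pick a limit that keeps
    the server clear of the uptime window in which the bug appears.
    """
    uptime_days = uptime_seconds / SECONDS_PER_DAY
    return uptime_days >= max_uptime_days - margin_days


# Example: with an illustrative 200-day limit, a server up for 190 days
# is inside the 14-day margin and should be booked for a reboot.
print(reboot_due(190 * SECONDS_PER_DAY, max_uptime_days=200))
```

A check like this only tells you when to plan the outage; the reboot itself still needs a maintenance window.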
I always did wonder why someone would code something which would “spin” if the server uptime was between 60 and 248 days. What does that matter to SQL*Plus? Besides, once you went beyond 248 days of uptime, you were in the clear. This is why it took three years before it was noticed.
I wonder what significance 248 days has for Oracle? Maybe it’s a code which is somehow used to obtain privileged access somewhere in Redwood City? 🙂
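For what it’s worth, there is a less conspiratorial candidate. The bug note mentions looping on the times() function, which reports time in clock ticks. If you assume (and this is purely my assumption, nothing in the bug text confirms it) that the ticks arrive 100 times a second and are kept in a signed 32-bit counter, the arithmetic lands suspiciously close to our number. A quick sketch:

```python
# Assumptions (not confirmed by any of the bug notes quoted above):
# a signed 32-bit tick counter, incremented 100 times per second.
TICKS_PER_SECOND = 100        # assumed clock tick rate
MAX_SIGNED_32 = 2**31 - 1     # largest value a signed 32-bit counter holds
SECONDS_PER_DAY = 86_400


def days_until_wrap(ticks_per_second: int = TICKS_PER_SECOND) -> float:
    """Days of uptime before a signed 32-bit tick counter overflows."""
    return MAX_SIGNED_32 / ticks_per_second / SECONDS_PER_DAY


print(f"{days_until_wrap():.2f} days")  # prints "248.55 days"
```

Under those assumptions the counter runs out of room after roughly 248.55 days, which would make 248 less of a secret number and more of an integer overflow.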