Last week, I had two storage cell reboots within five days of each other. Oddly enough, both happened during relatively quiet times on the system and with no indication of any cause recorded in the alerthistory, the syslog or the trace files.
It turns out that I was bitten by some “expected behavior”:
If an I/O operation on one or more disks fails to complete within 95 seconds, the machine will force a power cycle of that individual storage cell, to avoid ALL of the cells crashing and causing a database-wide outage.
Unfortunately, the documentation that the SR analyst gave me as a reference appears to be an internal Oracle document (and I have not yet been shown the secret handshake that gets me access), but the issue is very similar to MOS 1605255.1 – IO Hang in Single Disk Causes Cell Node Power Cell in Exadata.
In my particular case(s), I didn’t get any indication in alerthistory that a power cycle was forced because of an “IO hang” on a celldisk, but the behavior was identical otherwise.
To temporarily mitigate the risk of hitting this issue, you can proactively perform rolling restarts of each storage cell's services, one cell at a time. Of course, don't use dcli to issue this command on multiple storage cells at once, unless you want your databases to go down:
cellcli -e "alter cell restart services all"
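If you'd rather script the rolling restart than type it on every cell by hand, here is a minimal sketch of how it might look. It is not anything Oracle ships. It assumes passwordless ssh as root to each cell, the cell hostnames are hypothetical, and DRY_RUN and WAIT_SECONDS are knobs I made up for illustration. The grid disk listing after each restart is only there so you can eyeball the disk status before the loop moves on; the script does not parse it.

```shell
#!/usr/bin/env bash
# Rolling restart sketch: visit storage cells strictly one at a time.
# Assumptions (not from Oracle docs): passwordless ssh as root to each cell;
# DRY_RUN and WAIT_SECONDS are hypothetical knobs for this example.

# Run one cellcli command on one cell.
# With DRY_RUN=1, only print what would be run.
run_on_cell() {
    local cell="$1" cmd="$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "[$cell] cellcli -e \"$cmd\""
    else
        ssh "root@$cell" "cellcli -e \"$cmd\""
    fi
}

# Restart services on each cell given as an argument, sequentially.
rolling_restart() {
    local cell
    for cell in "$@"; do
        # Stop at the first failure so the next cell is never touched
        # while this one is in an unknown state.
        run_on_cell "$cell" "alter cell restart services all" || return 1
        # Show grid disk status for a manual check (output is not parsed).
        run_on_cell "$cell" "list griddisk attributes name,asmmodestatus,asmdeactivationoutcome" || return 1
        # Give the cell time to settle before moving on.
        sleep "${WAIT_SECONDS:-300}"
    done
}
```

To preview what it would do without touching anything, source the file and run something like DRY_RUN=1 WAIT_SECONDS=0 rolling_restart cel01 cel02 (both hostnames are placeholders). The loop is deliberately sequential, and it bails out on the first error, for the same reason you shouldn't reach for dcli here.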
It’s probably no surprise to hear that the fixes to “all of the disk/IO hang bugs” – I presume the analyst meant “expected behavior” – are included in April 2014’s Quarterly Full-Stack Patch 🙂