Category Archives: Storage Cells

Exadata Critical Issue EX20

A new Exadata Critical Issue – EX20 – has been announced on MOS note 1270094.1 and applies to Exadata Storage Server versions 12.1.1.1.0 and 12.1.1.1.1.

The issue is caused by bug 19211091:

CELLSRV Internal Error ORA-600 [DiskIOSched::GetCatIndex:2]

Further details can be found in MOS 1967985.1

You might hit this bug if your database resource manager plan contains sub-plans and OTHER_GROUPS is present in a sub-plan instead of the top plan.
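To make that condition concrete, here is a minimal Python sketch. It assumes you have already pulled the plan directives into a simple mapping of plan name to its group_or_subplan entries; the mapping, the function name and the sample plan and group names are all made up for illustration and are not part of any Oracle tooling.

```python
# Hypothetical, simplified model of resource plan directives:
# plan name -> list of group_or_subplan entries.
OTHER_GROUPS = "OTHER_GROUPS"


def other_groups_only_in_subplan(directives, top_plan):
    """Return True if OTHER_GROUPS is missing from the top plan
    but present in one of its (possibly nested) sub-plans."""
    if OTHER_GROUPS in directives.get(top_plan, []):
        return False
    seen = set()
    # An entry is treated as a sub-plan if it has directives of its own.
    stack = [e for e in directives.get(top_plan, []) if e in directives]
    while stack:
        plan = stack.pop()
        if plan in seen:
            continue
        seen.add(plan)
        entries = directives.get(plan, [])
        if OTHER_GROUPS in entries:
            return True
        stack.extend(e for e in entries if e in directives)
    return False


# Example data (made up for illustration).
risky = {
    "MY_PLAN": ["OLTP_GROUP", "BATCH_SUBPLAN"],
    "BATCH_SUBPLAN": ["BATCH_GROUP", "OTHER_GROUPS"],
}
safe = {
    "MY_PLAN": ["OLTP_GROUP", "BATCH_SUBPLAN", "OTHER_GROUPS"],
    "BATCH_SUBPLAN": ["BATCH_GROUP"],
}

print(other_groups_only_in_subplan(risky, "MY_PLAN"))  # True
print(other_groups_only_in_subplan(safe, "MY_PLAN"))   # False
```

The first layout is the one described above: OTHER_GROUPS lives only in a sub-plan, so the check flags it.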

The CELLSRV trace file will contain one or more entries indicating CELLSRV process failure similar to the following:

ORA-00600: internal error code, arguments: [DiskIOSched::GetCatIndex:2], [4294967295], [], [], [], [], [], [], [], [], [], []

CELLSRV encountered a fatal signal 11. LWPID: 28000 userId: 80 kernelId: 80 pthreadID: 139785595115840
Ignoring fatal signal encountered during Cellsrv state dump LWPID: 28000 userId: 80 kernelId: 80 pthreadID: 139785595115840

If CELLSRV fails on multiple cells simultaneously, then the ASM disk groups may dismount or ASM instances may crash, potentially causing databases to crash.

Typically, the Restart Server (RS) process will restart CELLSRV after it fails.  However, too many CELLSRV failures will trigger “flood control” and prevent further CELLSRV restarts.  Flood control is indicated in the trace file with entries similar to the following:

[RS] monitoring process /opt/oracle/cell/cellsrv/bin/cellrsomt (pid: 26763) returned with error: 126
[RS] Monitoring process for service CELLSRV detected a flood of restarts. Disable monitoring process.
RS-7445 [CELLSRV monitor disabled] [Detected a flood of restarts] [] [] [] [] [] [] [] [] [] []
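If you want to check a pile of trace files quickly, something along these lines may help. It is only a sketch: the two patterns are copied from the excerpts quoted in this post, the function name is my own, and real trace files may word things slightly differently.

```python
import re

# Signatures copied from the trace excerpts quoted in this post.
CRASH_RE = re.compile(
    r"ORA-00600: internal error code, arguments: \[DiskIOSched::GetCatIndex:2\]"
)
FLOOD_RE = re.compile(r"RS-7445 \[CELLSRV monitor disabled\]")


def scan_trace(lines):
    """Count EX20 crash signatures and report whether flood control was triggered."""
    crashes = 0
    flood = False
    for line in lines:
        if CRASH_RE.search(line):
            crashes += 1
        if FLOOD_RE.search(line):
            flood = True
    return {"crashes": crashes, "flood_control": flood}


sample = [
    "ORA-00600: internal error code, arguments: [DiskIOSched::GetCatIndex:2], "
    "[4294967295], [], [], [], [], [], [], [], [], [], []",
    "[RS] Monitoring process for service CELLSRV detected a flood of restarts. "
    "Disable monitoring process.",
    "RS-7445 [CELLSRV monitor disabled] [Detected a flood of restarts] "
    "[] [] [] [] [] [] [] [] [] []",
]
print(scan_trace(sample))  # {'crashes': 1, 'flood_control': True}
```

Because `scan_trace` accepts any iterable of lines, you can pass it an open file handle for whichever trace file you are inspecting.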

Workarounds
The recommended action is to upgrade to Exadata Storage Server software version 12.1.1.1.2 (or higher) or 12.1.2.1.1 (or higher).

Alternatively, you can apply patch 19211091.
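A quick way to see which cells are on an affected release is sketched below. It assumes you have collected each cell's version string yourself. The cell names and the trailing build suffix in the sample are hypothetical. A version string cannot tell you whether patch 19211091 has been applied, so treat the result as "possibly exposed".

```python
# Releases listed as affected in this post. A patched cell keeps the same
# release string, so this check cannot see patch 19211091.
EX20_AFFECTED = {"12.1.1.1.0", "12.1.1.1.1"}


def release_of(version):
    """Keep the first five dot-separated components of a version string."""
    return ".".join(version.strip().split(".")[:5])


def ex20_exposed(cells):
    """Given {cell_name: version}, return the sorted names of cells on an affected release."""
    return sorted(
        name for name, ver in cells.items() if release_of(ver) in EX20_AFFECTED
    )


cells = {  # hypothetical cell names and version strings
    "cel01": "12.1.1.1.1.140712",
    "cel02": "12.1.1.1.2",
    "cel03": "12.1.1.1.0",
}
print(ex20_exposed(cells))  # ['cel01', 'cel03']
```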

As a temporary workaround, you can disable the Resource Manager on the affected databases, modify the appropriate plan so that the OTHER_GROUPS directive is in the top plan (and not any sub-plan) and re-enable the Resource Manager:

ALTER SYSTEM SET resource_manager_plan='' SCOPE=both SID='*';

-- Plans that have been active but have no OTHER_GROUPS directive of their own
SELECT DISTINCT name
FROM v$rsrc_plan_history
WHERE name NOT IN (
  SELECT plan
  FROM dba_rsrc_plan_directives
  WHERE plan IN (
    SELECT DISTINCT name
    FROM v$rsrc_plan_history)
  AND group_or_subplan = 'OTHER_GROUPS');

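BEGIN
  -- Plan changes must be made inside a pending area and then submitted
  SYS.DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();
  SYS.DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE(
    plan             => 'MY_PLAN',
    group_or_subplan => 'OTHER_GROUPS',
    mgmt_p2          => 80,
    switch_estimate  => FALSE,
    comment          => NULL);
  SYS.DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/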

ALTER SYSTEM SET resource_manager_plan='MY_PLAN' SCOPE=both SID='*';


Exadata: why a half-rack is the “recommended minimum size”

Lots of shops dipped their toes in the Exadata water with a quarter-rack first of all.

(For those who are new to the Exadata party and don’t know of a world without elastic configurations, a quarter-rack is a machine with two compute nodes and three storage cells).

If you are / were one of those customers, you’ll probably have winced at the difference between the “raw” storage capacity and the “usable” storage capacity when you got to play with it for the first time.

While you could choose to configure your DATA and RECO diskgroups with HIGH redundancy in ASM, did you notice that you couldn’t do the same with the DBFS_DG / SYSTEM_DG?

Check out page 5 in this document about best practices for consolidation on Exadata.

“A slight HA disadvantage of an Oracle Exadata Database Machine X3-2 quarter or eighth rack is that there are insufficient Exadata cells for the voting disks to reside in any high redundancy disk group which can be worked around by expanding with 2 more Exadata cells. Voting disks require 5 failure groups or 5 Exadata cells; this is one of the main reasons why an Exadata half rack is the recommended minimum size.”

Basically, you need at least 5 storage cells for each Exadata environment if you want to have true “high availability” with your Exadata machine.

While quarter-rack machines have 3 storage cells, half-rack machines have 7 or 8 storage cells, depending on the model.

Let’s say that you have the model with 8 storage cells: if you split a half-rack machine equally, you’ll have two quarter-rack machines with 4 storage cells each, so you would need one more storage cell per machine to provide HA for the DBFS_DG / SYSTEM_DG diskgroup.
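The arithmetic is simple enough to sketch in a few lines of Python. The only inputs are the figure of 5 failure groups from the quoted paper and the cell counts mentioned in this post; the function name is mine, and it assumes one failure group per storage cell.

```python
VOTING_FAILGROUPS_HIGH = 5  # from the quoted best-practices paper


def extra_cells_needed(storage_cells, required=VOTING_FAILGROUPS_HIGH):
    """Assuming one failure group per storage cell, return how many more cells are needed."""
    return max(0, required - storage_cells)


for label, cells in [
    ("quarter rack", 3),
    ("half of an 8-cell half rack", 4),
    ("7-cell half rack", 7),
]:
    print(label, extra_cells_needed(cells))
# quarter rack 2
# half of an 8-cell half rack 1
# 7-cell half rack 0
```

The quarter-rack result matches the "2 more Exadata cells" in the quote.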

For some reason, this nugget escaped my attention until recently.  Even more reason to have a standby Exadata machine at your DR site …

Mark

 


Exadata Critical Issue EX19

Overnight, Oracle announced a new Exadata Critical Issue (EX19) which applies to storage cells running 12.1.1.1.1 or earlier of the ESS software.
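"12.1.1.1.1 or earlier" is an ordering test, which is easy to get wrong when comparing version strings as text. Here is a small Python sketch that compares them as tuples of integers instead. The function names are my own, and a version string says nothing about whether a one-off patch has been applied.

```python
LAST_AFFECTED = (12, 1, 1, 1, 1)  # "12.1.1.1.1 or earlier", per the announcement


def parse_release(version):
    """Turn a version string into a tuple of its first five numeric components."""
    return tuple(int(p) for p in version.strip().split(".")[:5])


def ex19_affected_release(version):
    """True if the release is at or below the last affected one."""
    return parse_release(version) <= LAST_AFFECTED


for v in ["11.2.3.3.1", "12.1.1.1.1", "12.1.1.1.2", "12.1.2.1.0"]:
    print(v, ex19_affected_release(v))
# 11.2.3.3.1 True
# 12.1.1.1.1 True
# 12.1.1.1.2 False
# 12.1.2.1.0 False
```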

The bug is 19695225 and more information can be found on MOS 1991445.1.

Cell disk metadata corruption and loss of cell disk content (i.e. grid disk, ASM disk) will occur if many CREATE GRIDDISK or ALTER GRIDDISK commands that modify cell disk space configuration are run over time for the same cell disk.

If CellCLI griddisk commands are typically run in parallel on all storage servers simultaneously, which is a common maintenance practice, and the issue occurs on multiple storage servers at the same time such that all redundant disk extents are lost for files in an ASM disk group, then the disk group will dismount, the database will crash, and the affected files will have to be restored from backup.

Rolling cell maintenance commands that change grid disk state, such as ALTER GRIDDISK INACTIVE and ALTER GRIDDISK ACTIVE, do not contribute to this issue.

If you have recreated or reconfigured grid disks using the CellCLI commands CREATE GRIDDISK or ALTER GRIDDISK more than 31 times since initial system deployment, then the likelihood of occurrence is high.
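For a rough feel of how close you might be to that figure, the Python sketch below counts space-changing grid disk commands in a list of CellCLI commands. It rests on several assumptions: that you have some record of the commands that were run (where it comes from is up to you), that every counted command touched the same cell disk (so the result is a crude upper bound), and that ACTIVE / INACTIVE state changes are the only ALTER GRIDDISK commands to exclude. It is no substitute for the check script attached to the MOS note.

```python
import re

THRESHOLD = 31  # figure quoted above

# Matches CREATE GRIDDISK / ALTER GRIDDISK at the start of a command.
GRIDDISK_RE = re.compile(r"^\s*(create|alter)\s+griddisk\b", re.IGNORECASE)
# State changes (ACTIVE / INACTIVE) do not contribute to the issue.
STATE_CHANGE_RE = re.compile(r"\b(in)?active\s*;?\s*$", re.IGNORECASE)


def count_space_changes(commands):
    """Count CREATE/ALTER GRIDDISK commands, ignoring ACTIVE/INACTIVE state changes."""
    n = 0
    for cmd in commands:
        if GRIDDISK_RE.search(cmd) and not STATE_CHANGE_RE.search(cmd):
            n += 1
    return n


history = [  # made-up command history for illustration
    "CREATE GRIDDISK ALL HARDDISK PREFIX=DATA, size=300G",
    "ALTER GRIDDISK ALL INACTIVE",
    "ALTER GRIDDISK ALL ACTIVE",
    "ALTER GRIDDISK DATA_CD_00_cel01 size=280G",
]
count = count_space_changes(history)
print(count, count > THRESHOLD)  # 2 False
```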

 

Risk and Detection
The risk to test and development systems is expected to be higher than production systems due to the dynamic manner in which they may be reconfigured.

To determine if your system is exposed to this issue, and how close the system is to having cell disk metadata corruption, download and run the script attached to this document on all storage servers as the root user.

Possible symptoms that cell disk metadata corruption has occurred as a result of this bug include the following:

  • ASM disk group(s) dismount and database crash following CREATE GRIDDISK or ALTER GRIDDISK.
  • ASM disk group(s) cannot be mounted following the disk group dismount.
  • Error ORA-600 [addNewSegmentsToGDisk_2] is reported in the cell alert.log.

 

The cell disk corruption cannot be repaired once it occurs.  Recovery requires recreating cell disks, grid disks, and ASM disk groups, then restoring affected databases from backup.

Perform one of the following actions to prevent bug 19695225:

  • Upgrade to Exadata Storage Server version 12.1.2.1.1 or later (Exadata 12.1.2.1.0 contains the fix to this issue, however 12.1.2.1.1 or later is the recommended version).
  • Upgrade to Exadata Storage Server version 12.1.1.1.2 or later 12.1.1.1.x.
  • Apply patch 19695225 to all Exadata Storage Servers. At the time of writing a patch is available for Exadata versions 12.1.1.1.1, 11.2.3.3.1, and 11.2.3.3.0.
  • Avoid running CellCLI commands CREATE GRIDDISK or ALTER GRIDDISK until the code fix is applied via upgrade or patch apply.

 

I think it’s a good idea to run the check script on your storage cells as root to determine whether there’s any immediate risk (probably unlikely). If necessary, consider applying the patch – but you should be planning your patching to the QFSDP April 2015 now, right? 🙂

 

Mark


DBA 3.0 – How to Become a Real-World Exadata DBA – IOUG Collaborate 2015

According to a Book of Lists survey, 41% of people’s biggest fear is “public speaking”.  To put that into perspective, “death” is the biggest fear for 19%, “flying” for 18% and “clowns” don’t even register (which does make me seriously doubt the survey’s credibility).

I gave my first public presentation at IOUG Collaborate 2015 last week in Las Vegas and I didn’t die.

Why did I make my presentation debut at the second largest Oracle event on the calendar?  Excellent question.



My Collaborate IOUG 2015 Abstract

I will be presenting DBA 3.0 or “How to Become a Real-World Exadata DBA” at Collaborate 2015 – IOUG’s annual user conference – from April 12th to 16th at the Mandalay Bay Resort and Casino in Las Vegas. I submitted this as my abstract:

“DBA resources are more scarce than ever before and it can be very difficult to allocate time on anything but keeping the lights on – even when an organization has made a (substantial) hardware investment in Exadata.

However, if Exadata is treated like any other Oracle database, the promised “extreme performance” will likely be very underwhelming to developers, users and managers and can become unwieldy for DBAs to support.

On the other hand, when an organization configures and supports Exadata properly, they can realize exponential performance improvements in key IT infrastructure, can facilitate better business decisions and may actually reduce infrastructure costs.

The customer has bought a sports car – but might not realize that they haven’t taken it out of second gear (yet).

I will talk about the evolution of Exadata and then get into the “nuts and bolts” of how to support a high-performance Exadata environment as a Production DBA.

I will discuss how to get performance improvements of up to 20x, what NOT to do as an Exadata DBA and how Exadata can become the foundation of your organization’s high-performance enterprise infrastructure.”

I hope to see you in Las Vegas!


UKOUG 2014 – Dan Norris – Exadata Security Best Practices

Dan Norris of the Maximum Availability Architecture team gave what sounded like a very interesting presentation at UKOUG 2014. There seemed to be a lot of really cool stuff at this year’s event, which is to be expected as I no longer reside in the UK!

I encourage you to take a look at the slides, but also at the interesting links he provided:

Naturally, he also quoted a plethora of My Oracle Support notes – some of the greatest hits and some which you might not have seen before:

  • Responses to common Exadata security scan findings (Doc ID 1405320.1)
  • Oracle Sun Database Machine X2-2/X2-8, X3-2/X3-8 and X4-2 Security Best Practices (Doc ID 1071314.1)
  • How to change OS user password for Cell Node, Database Node , ILOM, KVM , Infiniband Switch , GigaBit Ethernet Switch and PDU on Exadata Database Machine (Doc ID 1291766.1)
  • Exadata Database Machine and Exadata Storage Server Supported Versions (Doc ID 888828.1)
  • Information Center: Oracle Exadata Database Machine (Doc ID 1306791.2)

Happy reading!


Exadata Storage Cell Reboot: Expected Behavior

Last week, I had two storage cell reboots within five days of each other. Oddly enough, both happened during relatively quiet times on the system and with no indication of any cause recorded in the alerthistory, the syslog or the trace files.

It turns out that I was bitten by some “expected behavior”.
