My Exadata “Check List”

I’ve been maintaining a configuration “check list” for Exadata machines in Evernote for months and finally decided it was time to put it into a document.

This is meant to be used when a new Exadata machine arrives at our data center, or before some major upcoming change. Is there anything else you specifically check for?

Best Practices, Critical Issues and Support Information
— MOS Articles
=== 757552.1 (Best Practices)
=== 888828.1 (Support)
=== 1270094.1 (Critical Issues)
— Review Best Practices and add changes to “Change List”
— Review Critical Issues and add changes to “Change List”
— Review End of Support notifications and plan upgrades

exachk Healthcheck
— Record score, fail/warnings in “DBA” tables
— Add any changes to “Change List” and schedule outages
— Raise SR(s) if necessary
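
As a minimal sketch of how the run and the result capture could be scripted (the exachk install path, the report glob and the DBA_ADMIN.EXACHK_HISTORY table are illustrative assumptions; report naming varies by exachk version):

# Run exachk and record the run in a hypothetical history table
cd /opt/oracle.SupportTools/exachk            # assumed install location
./exachk -a                                   # best practice + recommended patch checks
report=$(ls -t exachk_*/exachk_*.html 2>/dev/null | head -1)
sqlplus -s / as sysdba <<EOF
INSERT INTO dba_admin.exachk_history (run_date, report_file)
VALUES (SYSDATE, '${report}');
COMMIT;
EOF
# The health score and fail/warning counts from the HTML report can then be
# added to the same row once the report has been reviewed.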

Database Configuration
— Standard database setup
=== SPFILE, memory, RAC, SQLNet
=== Maintain a list of non-default parameters and comments in “DBA tables”, especially for hidden parameters
=== Do NOT use AMM; use HugePages and set use_large_pages=ONLY
— FlashCache
=== Are there any objects (tables, indexes, partitions) that we should “pin”?
=== The storage cells can list the objects currently in the FlashCache; these need to be linked back to object_id in the database (see the sketch at the end of this section)
=== AWR allows us to determine usage of FlashCache
=== WriteBack FlashCache for OLTP processing offers the ability to WRITE to the FlashCache
=== Exadata Smart Flash Log
=== by default, in ESS 11.2.2.4.0 and later, 512MB of flash per cell is allocated as “flashlog” to help minimize redo log write waits
— HCC
=== Which tables are using HCC?
=== Which compression method/ratio is being used?
=== AWR allows us to determine usage of HCC
— SmartScans/storage cell offload
=== AWR allows us to determine usage of SmartScans
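
A sketch of the queries behind the FlashCache, Smart Scan and HCC checks above (statistic names are as exposed in V$SYSSTAT on 11.2; the KEEP example and the table names are illustrative):

sqlplus -s / as sysdba <<'EOF'
-- Example of pinning a hot object into the cell flash cache
-- ALTER TABLE app.hot_lookup STORAGE (CELL_FLASH_CACHE KEEP);

-- FlashCache and Smart Scan activity since instance startup
SELECT name, value
FROM   v$sysstat
WHERE  name IN ('cell flash cache read hits',
                'physical read total IO requests',
                'cell physical IO bytes eligible for predicate offload',
                'cell physical IO interconnect bytes returned by smart scan');

-- Which tables are HCC compressed, and how
SELECT owner, table_name, compress_for
FROM   dba_tables
WHERE  compress_for IN ('QUERY LOW','QUERY HIGH','ARCHIVE LOW','ARCHIVE HIGH');

-- HugePages usage is enforced when this is ONLY (and AMM is not in use)
SHOW PARAMETER use_large_pages
EOF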

ASM Configuration
— Where to keep the PFILE? If the SPFILE is kept in an ASM diskgroup, CRS cannot shut that diskgroup down “cleanly”. Keep a local PFILE?
— Check processes setting
— Disk group redundancy
— ASM mirrors the disks automatically; there is no hardware RAID protection across the storage cells
— If using Data Guard, “NORMAL” redundancy is acceptable, especially on a quarter-rack
— Oracle’s recommendation is “HIGH” redundancy, but that is not really viable on quarter-rack machines as it drastically reduces the usable free storage
— Use proper ASM diskgroup attributes (compatible, cell.smart_scan_capable, au_size 4M)
— Review the ASM power limit (asm_power_limit) and disk repair time (disk_repair_time) settings (see the sketch below)
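
The diskgroup attributes and rebalance/repair settings can be confirmed from the ASM instance; a sketch (the +ASM1 SID is an example):

export ORACLE_SID=+ASM1
sqlplus -s / as sysasm <<'EOF'
-- Redundancy and key attributes per diskgroup
SELECT dg.name AS diskgroup, dg.type AS redundancy, a.name, a.value
FROM   v$asm_diskgroup dg
       JOIN v$asm_attribute a ON a.group_number = dg.group_number
WHERE  a.name IN ('compatible.asm', 'compatible.rdbms',
                  'cell.smart_scan_capable', 'au_size', 'disk_repair_time')
ORDER  BY dg.name, a.name;

-- Default rebalance power
SHOW PARAMETER asm_power_limit
EOF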

General O/S Configuration (Comp Nodes and Storage Cells)
— Check the exachkcfg autostart service status
— Verify that CSS misscount = 60
— Review users, groups, hostnames, user profile, aliases
— Disk cache policy should be disabled
— Monitor ambient temperature
— Verify RAID controller battery charge and temperature
— Verify hardware and firmware on comp nodes and storage cells are consistent
— Comp nodes and storage cells using WriteBack (not WriteThrough)
— Verify ILOM power up configuration: HOST_AUTO_POWER_ON=disabled, HOST_LAST_POWER_STATE=enabled
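
A few of these can be spot-checked from the command line; a sketch, assuming the usual Grid Infrastructure and MegaCli locations (verify the exact paths and expected values against exachk for your image version):

# CSS misscount (expected to be 60)
/u01/app/11.2.0/grid/bin/crsctl get css misscount     # adjust to the local Grid home

# Disk cache policy on the RAID controller logical drives (should be Disabled)
/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL | grep -i 'disk cache policy'

# ILOM power-up policy (HOST_AUTO_POWER_ON / HOST_LAST_POWER_STATE)
ipmitool sunoem cli 'show /SP/policy'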

Comp Node-Specific Configuration
— Verify no outstanding hardware alerts using “show faulty” with “ipmitool sunoem cli”
— Portmap and nfslock services have to be running if we’re using NFS
— Use HugePages
— “Locked memory” should total 75% of physical memory (max)
— Shared memory segment max size (kernel.shmmax) = 85% of physical memory
— Sum of the processes parameters across all databases does not exceed the maximum number of semaphores
— Number of semaphores in a semaphore set (semmsl) must be at least as high as the processes parameter in ALL databases
— Verify disk controller configuration on comp nodes
— Verify physical and virtual drive configuration on comp nodes
— Verify that NUMA is NOT enabled on the comp nodes
— Verify that RAC databases use RDS and not UDP to communicate
— Set SQLNET.EXPIRE_TIME = 10 in the RDBMS home
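
A sketch of the memory, kernel and interconnect checks above, run on each comp node:

# HugePages allocated vs free/reserved
grep -i hugepages /proc/meminfo

# Kernel semaphore settings: semmsl semmns semopm semmni
cat /proc/sys/kernel/sem

# NUMA should be disabled (expect numa=off on the kernel command line)
grep -o 'numa=off' /proc/cmdline

# NFS helper services, if NFS is in use
service portmap status; service nfslock status

# The interconnect protocol compiled into the Oracle binaries should be rds
$ORACLE_HOME/bin/skgxpinfo      # available in recent 11.2 homes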

Storage Cell-Specific Configuration
— cell.conf check, especially for the NTP servers and SNMP configuration (see the sketch at the end of this section)
— Check for WriteBack/WriteThrough mode (disk cache policy)
— Check for ECC memory errors on the storage cells
— Verify celldisk and flashdisk configuration (no griddisks on Flash!)
— Confirm that total size of all griddisks fully utilizes celldisk capacity
— ipconf parameter file must be consistent with O/S configuration
— Verify that cell services are up and running
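
Most of these checks can be run across all cells from a comp node with dcli; a sketch, assuming the usual cell_group file and the celladmin user:

# NTP, SNMP and cell service status
dcli -g cell_group -l celladmin "cellcli -e list cell attributes name,ntpServers,snmpSubscriber,cellsrvStatus,msStatus,rsStatus"

# Flash cache mode (WriteBack vs WriteThrough)
dcli -g cell_group -l celladmin "cellcli -e list cell attributes name,flashCacheMode"

# Celldisk and griddisk layout (griddisks should consume the celldisks fully, and none on flash)
dcli -g cell_group -l celladmin "cellcli -e list celldisk attributes name,diskType,size"
dcli -g cell_group -l celladmin "cellcli -e list griddisk attributes name,size,status"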

InfiniBand Configuration
— Check that IB is the PRIVATE network for cluster communications
— Check for ports disabled due to excessive symbol errors
— Check IB ARP (Address Resolution Protocol) is correctly set up on comp nodes
— Verify IB cable connection quality
— Verify Ethernet cable connection quality
— Verify IB fabric topology
— Check for IB network errors
— Verify IB subnet manager is running on an IB switch
— Verify IB subnet manager is not able to run on anything except an IB switch
— Verify key parameters in the /etc/opensm/opensm.conf file
— Verify IB network throughput (infinicheck); this HAS to be run during a “quiet” time as it evaluates full network throughput (see the sketch below)
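
A sketch of the fabric checks, using the standard OFED tools and the Exadata-supplied ibdiagtools kit (paths and options vary slightly between image versions):

# Cumulative symbol/error counters across the fabric
ibcheckerrors

# Where the subnet manager is running (should be one of the IB switches)
sminfo

# Topology check from the Exadata toolkit
cd /opt/oracle.SupportTools/ibdiagtools
./verify-topology

# Full-throughput test: run ONLY in a quiet window
# ./infinicheck        (see its -h output for the host/cell list options)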

Network Configuration
— Client network should be bonded on bondeth0
— IB network should be bonded on bondib0
— Admin network cannot be bonded and must be eth0 ON ITS OWN SUBNET
— Is the 10GbE hardware enabled and in use (optical)?
— Do we run a dedicated backup or Data Guard network?
— Verify average ping times from the comp nodes and storage cells to the DNS servers
— May need to set network routes/rules for the client network/admin network, especially if OEM is in the DMZ (cannot use the client network); see the sketch below
— Monitor usage
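
A sketch of the bonding, latency and routing checks on a comp node (the DNS addresses are examples):

# Bonding state of the client network
egrep 'Bonding Mode|Slave Interface|MII Status' /proc/net/bonding/bondeth0

# Average round-trip time to each DNS server
for dns in 10.0.0.11 10.0.0.12; do
    ping -c 5 -q $dns | tail -1
done

# Any additional routes/rules set up for the admin network
ip rule show
ip route show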

Backup/Recovery, Disaster Recovery, MAA
— Baremetal restores:
=== storage cells, switches, comp nodes, LVM
— RMAN:
=== backup schedule, logs and reports (success, timing), check for block change tracking
— Configuration file backups:
=== PFILE, ASM PFILE, encryption keys, LVM, tnsnames.ora, listener.ora, sqlnet.ora, DCLI groups, hostnames, oraInventory, network interfaces, network rules and routes, RMAN scripts, crontab, cronjob scripts
— Data Guard:
=== DB params, Data Guard Broker configuration, switchover steps/checklist, status tests
=== Active Data Guard?
=== Make sure that the Data Guard Broker timeout is longer than the other timeouts (clusterware, SQL*Net, etc.); at least 90 seconds
— ZFS:
=== Use of ZFS storage appliance on InfiniBand to reduce backup/restore time
— FLASHBACK/archive/ORL/SRLs:
=== FLASHBACK DATABASE enabled?
=== Where are the Flashback logs located?
=== How long should we be able to Flashback?
=== Archive log location, backup and housekeeping
=== Forced checkpoints, archive log switches, MTTR
=== Log review: ORLs, SRLs, logfile groups and sizes, log_buffer parameter
— Recovery Time Objective:
=== How LONG will it take to restore to the appropriate point-in-time?
— Recovery Point Objective:
=== WHAT is the point-in-time?
— Recovery tests
=== restore of database and configuration files
=== restore of encryption keys
=== switchover tests
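
A sketch of the database-side checks behind this list (flashback status, block change tracking, redo/standby log sizing and the current RMAN configuration):

sqlplus -s / as sysdba <<'EOF'
SELECT flashback_on, log_mode FROM v$database;
SELECT status, filename FROM v$block_change_tracking;
SHOW PARAMETER db_flashback_retention_target

-- online and standby redo logs (groups, threads, sizes)
SELECT group#, thread#, bytes/1024/1024 AS mb, members FROM v$log ORDER BY group#;
SELECT group#, thread#, bytes/1024/1024 AS mb FROM v$standby_log ORDER BY group#;

-- fast recovery area usage
SELECT file_type, percent_space_used, percent_space_reclaimable
FROM   v$flash_recovery_area_usage;
EOF

rman target / <<'EOF'
# current configuration, including controlfile autobackup and device settings
SHOW ALL;
EOF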

Monitoring
— OEM Cloud Control
=== Agents, alerts
=== Platinum Support maintains its own OEM agent
— Storage cells:
=== alerthistory, list physicaldisk, SNMP setup, customized scripts
=== sundiag.sh for failed disks
— Switches:
=== ibcheckerrors
— Syslogs:
=== for comp nodes, storage cells, switches

Platinum Support
— Patching planning
— List components to be monitored (databases, binaries)
— Access to AMR portal
— Maintain list of “known issues” for which PS should not alert
— PS gateways?
— PS monitoring OEM agents?

Resource Management
— DBRM:
=== Consumer groups with resource limits
=== Scheduler windows to switch between plans?
=== Different resource plans
— IORM
=== Set up for machines with multiple databases
=== Is Instance Caging in use?
— NetRM
=== Available after QFSDP Jan 2014 switch firmware update
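
A sketch of enabling IORM on the cells and instance caging in each database (the objective value, cpu_count and plan name are examples; the objective attribute needs a storage server version that supports it):

# IORM on every cell
dcli -g cell_group -l celladmin "cellcli -e alter iormplan objective=auto"
dcli -g cell_group -l celladmin "cellcli -e list iormplan detail"

# Instance caging: cap CPU and activate a resource plan in each database
sqlplus -s / as sysdba <<'EOF'
ALTER SYSTEM SET cpu_count = 8 SCOPE=BOTH SID='*';
ALTER SYSTEM SET resource_manager_plan = 'DEFAULT_PLAN' SCOPE=BOTH SID='*';
EOF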

Service Management
— Set up database services and tnsnames.ora files for clients
— Monitor using listener/SCAN listener logs and AWR
— Associate services with consumer groups
— Maintain “DBA table” with connection/service information
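
A sketch of creating a service and mapping it to a consumer group (database, service, instance and group names are illustrative):

# Preferred/available instances for the service
srvctl add service -d PROD -s BATCH_SVC -r prod1 -a prod2
srvctl start service -d PROD -s BATCH_SVC

sqlplus -s / as sysdba <<'EOF'
BEGIN
  DBMS_RESOURCE_MANAGER.create_pending_area;
  DBMS_RESOURCE_MANAGER.set_consumer_group_mapping(
    attribute      => DBMS_RESOURCE_MANAGER.service_name,
    value          => 'BATCH_SVC',
    consumer_group => 'BATCH_GROUP');
  DBMS_RESOURCE_MANAGER.submit_pending_area;
END;
/
EOF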

AWR
— Snapshot frequency, retention, retained snapshot periods
— Create AWR “repository” in “DBA tables” for historical purposes
— Maintain periods for comparison (before and after upgrades)
— List of important values to review
— Regular review
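
A sketch of setting the snapshot interval/retention and preserving a comparison period as a baseline (the interval, retention and snapshot ids are examples):

sqlplus -s / as sysdba <<'EOF'
BEGIN
  -- 15-minute snapshots, 35 days' retention (both in minutes)
  DBMS_WORKLOAD_REPOSITORY.modify_snapshot_settings(interval => 15, retention => 35*24*60);

  -- keep the pre-upgrade week for later comparisons
  DBMS_WORKLOAD_REPOSITORY.create_baseline(start_snap_id => 1234,
                                           end_snap_id   => 1300,
                                           baseline_name => 'PRE_UPGRADE');
END;
/
EOF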

Auditing
— DB:
=== DBA audit trail housekeeping, create “summary” trail in “DBA tables”
=== List of audit options/statements enabled
=== Make sure the audit trail AND the FGA audit trail are NOT in the SYSTEM tablespace; they need to be in an ASSM tablespace (SYSAUX); see the sketch below
— ASM:
=== ASM audit trail housekeeping
— OS:
=== OS DB audit trail (SYSDBA) housekeeping
=== auditd logs (sudo)
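
A sketch of moving the audit trails out of SYSTEM and purging them with DBMS_AUDIT_MGMT (the 14-day retention is an example):

sqlplus -s / as sysdba <<'EOF'
BEGIN
  -- relocate AUD$ and FGA_LOG$ to SYSAUX
  DBMS_AUDIT_MGMT.set_audit_trail_location(
    audit_trail_type           => DBMS_AUDIT_MGMT.audit_trail_db_std,
    audit_trail_location_value => 'SYSAUX');

  -- initialise cleanup, then purge everything older than 14 days
  DBMS_AUDIT_MGMT.init_cleanup(
    audit_trail_type         => DBMS_AUDIT_MGMT.audit_trail_all,
    default_cleanup_interval => 24);
  DBMS_AUDIT_MGMT.set_last_archive_timestamp(
    audit_trail_type  => DBMS_AUDIT_MGMT.audit_trail_aud_std,
    last_archive_time => SYSTIMESTAMP - INTERVAL '14' DAY);
  DBMS_AUDIT_MGMT.clean_audit_trail(
    audit_trail_type        => DBMS_AUDIT_MGMT.audit_trail_aud_std,
    use_last_arch_timestamp => TRUE);
END;
/
EOF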

“DBA Tables”
— Tables containing important admin/support/monitoring data
— List of “audited” database users and activity
— List of “application” users and access
— Capacity planning: storage/safely usable space, new data growth, HCC impact, IOPS, memory
— Audit trail summary
— List of non-default and hidden parameters with “comments”
— exachk scores, dates, fail/warnings
— Create external table populated using OPatch
— Serial numbers for each server (external table)
— Product name for each server (external table)
— Associated CSI for each server
— Check for OCM “hardware configuration” set up in MOS
— Maintain list of failed disks (SRs, serial numbers, dates, etc)
— Maintain options, stack software versions, bug fixes, critical issues, upgrade dates, etc

Capacity Planning
— Leverage “DBA tables” for capacity history
— Check storage (total datafile size, total segment size, ASM diskgroup usable space via external table)
— Review HCC configuration and usage (AWR)
— Review FlashCache usage (AWR)
— Review storage cell offloading (AWR)
— Review IOPS, storage, memory, memory structures, disk throughput
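
A sketch of a nightly capacity capture into the “DBA tables” (DBA_ADMIN.CAPACITY_HISTORY is a hypothetical table name):

sqlplus -s / as sysdba <<'EOF'
INSERT INTO dba_admin.capacity_history (capture_date, metric, gb)
SELECT SYSDATE, 'DATAFILE_GB',   ROUND(SUM(bytes)/1024/1024/1024, 1) FROM dba_data_files
UNION ALL
SELECT SYSDATE, 'SEGMENT_GB',    ROUND(SUM(bytes)/1024/1024/1024, 1) FROM dba_segments
UNION ALL
SELECT SYSDATE, 'ASM_USABLE_GB', ROUND(SUM(usable_file_mb)/1024, 1)  FROM v$asm_diskgroup;
COMMIT;
EOF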

Statistics
— Assumes copious amounts of partitioning
— SQL Plan Management:
=== needs some initial “tweaking” for the optimizer to accept the “best” explain plan
=== doesn’t work well without bind variables — could end up with billions of plans which are checked by the optimizer, causing bad performance
— Incremental global stats gathering:
=== partitioned tables often only update statistics for the local partition
=== global table/index stats are NOT updated unless “incremental global stats” are enabled
=== if NOT enabled, run a weekly job to pick the “top 25” most stale tables and explicitly gather their global stats (see the sketch at the end of this section)
— Dictionary statistics:
=== we run these once a month
— System statistics
=== we do not run these without a change in hardware
=== use the EXADATA mode to gather system statistics
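
A sketch of the incremental global stats preference and the EXADATA system statistics mode (owner/table names are examples; EXADATA mode needs a version/patch level that supports it):

sqlplus -s / as sysdba <<'EOF'
BEGIN
  -- incremental global statistics for a large partitioned table
  DBMS_STATS.set_table_prefs('APP', 'SALES', 'INCREMENTAL', 'TRUE');
  DBMS_STATS.gather_table_stats('APP', 'SALES');

  -- system statistics in EXADATA mode (only after a hardware change)
  DBMS_STATS.gather_system_stats('EXADATA');
END;
/
EOF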

Housekeeping
— Keep trace files for 31 days (alert.log, css.log, syslog, etc)
— gzip the alert.log every day
— gzip the listener/SCAN logs every month
— Purge the audit trail every day (keeping 14-28 days’ data)
— Consider using an “audit archive” summary table to maintain data for longer
— Purge the O/S audit trail every day (keeping 31 days)
— Run capacity planning “capture” every day
— Run datafile “shrink” commands weekly
— Run statistics gathering regularly: global stats on stale tables (weekly), dictionary stats (monthly); see the sketch below
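
A sketch of the daily housekeeping jobs (paths, SIDs and retention values follow the list above and are examples):

# Rotate and compress the text alert.log (the XML copy stays in the ADR)
cd $ORACLE_BASE/diag/rdbms/prod/PROD1/trace
cp alert_PROD1.log alert_PROD1.log.$(date +%Y%m%d) && > alert_PROD1.log
gzip alert_PROD1.log.$(date +%Y%m%d)

# Purge ADR contents older than 31 days (44640 minutes) in every home
for h in $(adrci exec="show homes" | grep -v 'ADR Homes'); do
    adrci exec="set homepath $h; purge -age 44640"
done

# Remove OS audit files older than 31 days
find /u01/app/oracle/admin/*/adump -name '*.aud' -mtime +31 -delete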
