Oracle Troubleshooting Best Practices

Library Note Morgan's Library Page Header
The Library is currently in the process of being upgraded from Oracle Database Version to Demos are being upgraded to reflect the new Container paradigm as well as EBR (Edition Based Redefinition) and may contain references to CDBs, PDBs, and other objects you may not be familiar with such as CDB_OBJECTS_AE: Welcome to 12c.

Have you viewed the Library's main Table of Contents? If not ... [Click Here] ... scroll down the page and you will find links to the largest collection of Oracle Database demos on the net.
Why a troubleshootings page?

Because I am up to my [...] in frustration right now working with a team that does not know how to troubleshoot a problem and that is not the least bit interested in listening to those that do. They have a RAC cluster with one node that restarts and immediately shoots itself in the head.

Is it an Oracle Database issue? It isn't. But they are not even remotely close to considering that it might not be. Thus staggering amounts of expensive time are being wasted and corporate management is being deceived not through some wicked intention but simply because they rely on the blind to point out the path. When the help desk reports that an application is slow the thing every DBA is advised to do is determine where in the technology stack the issue exists. Is it the database or is it perhaps an application/app server issue, network issue, storage layer, etc. Unfortunately this good advice is seemingly gone when there is a database outage. The sanity that prevails for "measured and careful" is gone in this situation and the default assumption that the issue is the database prevails.

So for those of you facing issues related to database and instance outages these notes are for you and all based on personal experiences and the results of the outage to which I refer is now a part of a conference presentation on Root Cause Analysis I have presented in the Americas, EMEA, LA, and APAC. Perhaps the only thing as bad a guessing is to put on a blind-fold.
My Oracle Support
One of the biggest failures I observe in our community is the lack of knowledge of how Oracle Support, now "My Oracle Support" or "MOS", previously "Metalink" works. If you don't understand the rules you cannot play the game. So let's start out by learning how to work effectively with MOS.

Everyone at MOS knows the rules and they rarely violate them. What they do, which is almost as bad, is take advantage of the customer's ignorance of those rules.
When to open the SR Set a timer. If you can not get everything back online in less than 5 minutes then your first step is to open an SR with Oracle Support. If by some chance you fix the problem before they respond then close it: No harm done. If you have not then use Oracle Support to sanity check what you are thinking of doing. It is not uncommon for DBAs to make a bad situation far worse by doing the wrong thing. You want Oracle to agree with your proposed course of action and, in addition to helping you with your decision making, having them involved will keep your employer from making your life a living hell (see CYA).
What information to include When you first open your SR put in the facts. You will be prompted for most of them by the MOS form but "facts" requires, in most cases, that you know what you are running, where you are running it, and have reviewed all of the log files. The following is guidance on which log files you should review.
Non-RAC / All
1 $ORACLE_BASE/diag/<database_name><instance_name>trace/alert.log
2 oracle software owner's .bash_history file
3 root's .bash_history file
4 /var/log/messages (Linux)
5 If thin provisioning storage make sure that you don't over-provision because while the pool may show X TB you may not really have that much space. I had a situation where the failure of a backup and redo log delete filled physical space such that three separate databases froze even though the virtualized space made it appear there was sufficient room.
Additional Tests if ASM and Oracle Clusterware such as RAC
1 ASM Instance Alert Log $GRID_HOME/log/<hostname>/alert<hostname>.log
2 Clusterware Logs $GRID_HOME/log/ocssd<#>.log
3 Cluster Ready Services Daemon (crsd) CRS_HOME/log/hostname/crsd
4 Oracle Cluster Registry (OCR) records CRS_HOME/log/hostname/client
5 Oracle Process Monitor Daemon (OPROCD) /etc/oracle/hostname.oprocd.log
6 Cluster Synchronization services (CSS) CRS_HOME/log/<hostname>/cssd
7 Event Manager (EVM) CRS_HOME/log/hostname/evmd
8 RAC RACG CRS_HOME/log/<hostname>/racg
My Oracle Support: Support Engineer Quality Not all DBAs are of equal quality and not all support engineers with Oracle are either. If you have uploaded an RDA and a support engineer asks you for information that is in that RDA ask to speak to an escalation manager: Doing so is your right as a customer. Do not tolerate this "engineer" as they are either lazy or not sufficiently skilled to read what you uploaded.

If you have a support engineer that is asking you to run diagnostic test after diagnostic test, multiple system state dumps, etc. and it is not getting you anywhere or you know enough to know it won';t get you anywhere ... ask to speak to an escalation manager.

The time you are wasting is your own.

Equally true if you are someone that has not taken classes, does not read books, does not utilize the online documentation, and/or thinks you can apply your version 7.3.2 skills to then expect no sympathy from the Library. Don't open an SR ... RTFM.
Acronym Definitions
Acronym Definition
CSI Customer Service Identifier
MOS MyOracleSupport. The horrible, by improving, website that provides Oracle on-line support.
RDA Remote Diagnostic Agent
Sev Severity. An outage is Sev 1
SR Service Request
TAR Technical Assistance Request: The old name for a Service Request (SR)
Tactics in Root Cause Analysis
Term Definition
ASM Alert Log If you are using ASM then you have a separate Oracle instance that is the ASM home with its own alert log. If you do not know where it is then you need to learn that now and document it as you should be reading the alert log every day for warnings and errors.
Blocking Sessions The fact that something is blocking does not mean it is the root cause of an issue. It does mean you need to consider that it might be.
  SELECT username
  FROM gv$session
  WHERE sid=a.sid) blocker,
  a.sid, ' is blocking ', (
  SELECT username
  FROM gv$session
  WHERE sid=b.sid) blockee, b.sid
FROM gv$lock a, gv$lock b
WHERE a.block = 1
AND b.request > 0
AND a.id1 = b.id1
AND a.id2 = b.id2;

FROM dba_blockers;
Change Management If you do not have the ability to, within one or two minutes walk through changes made to your environment by system admins, network admins, storage admins, vendors, yourself, and other DBAs on your team your organization flunks by definition. Did a network engineer reconfigure the DNS server? You do not know. Did a storage admin patch the software on your SAN? You have no idea. Did a UNIX SA decide to fence CPUs? You are blind. Another member of your team change an init parameter? Perhaps. And exactly how perfect is anyone's memory about what they did yesterday much less on Friday of last week? Please don't make me laugh (or cry).

Change management is a serious topic and you need to implement change management practices that include not just "here's what I did" but also the actual keystrokes, spooled to a file, for later review.
Clusterware Logs If you have a RAC cluster, by definition, you have a clustered file system which is likely Oracle Clusterware installed which means you have log files for the nodeapps and you need to know where to find these and need to read them every day for warnings and errors. You might have an ODA or Exadata where ASM is the layer under ACFS but you have Oracle Cluterware installed. Even if you have NFS mounted a NetApp ... you have Oracle Clusterware and you have many log files in several different directories all of which must be checked.
Database Alert Log If you have not reviewed your alert log every day prior to this outage you are starting with a negative rating. You have no idea what errors are unrelated to the current situation and which ones may be warnings prior to the current outage.

Find the alert log entries for the current outage and then review the previous 48 hours to see if there are any warnings that might be related. If you see a warning or error that might be related and you do not know what it is ... use google. You do not have time to mess around with MyOracleSupport. Copy those that might be related to an electronic scratchpad and hold them.
O/S Logs /var/log/messages
RDA If you do not know what an RDA is know that I would never hire you as a DBA on any project and neither should anyone else. If you do and you do not have a current one ready to send to Oracle attached to an SR know that I'd have you in my office immediately after the outage for a serious discussion of your career path choices. If you have one ... then make sure it is updated every time you make a configuration change, zip it up, and have it ready to upload. An RDA on a big system can take hours to create. Trying to create one during an outage is a guaranteed non-starter.

Not having a current, RDA ready to use when opening an SR, should be a firing offense on any production system.
Resource Manager Are you using Resource Management (DBMS_RESOURCE_MGR)?
Services If a user or application connects using a service, as they all should, and the service is not running the user or application will not be able to connect. If you do not know what services should be running, and with RAC on what nodes, then one of your tasks is to document this using the following SQL statement when everything is fine.
SELECT inst_id, name, goal, dtp, blocked, clb_goal
FROM gv$active_services
ORDER BY 2, 1;
Waiting Sessions The fact that something is waiting for a resource does not mean it is the root cause of an issue. It does mean you need to consider that it might be.
FROM dba_waiters;

Related Topics
DBA Best Practices
DBA Best Practice Guidelines
Developer Best Practice Guidelines