Oracle Troubleshooting Best Practices

Why a troubleshooting page?

Because I am up to my [...] in frustration right now working with a team that does not know how to troubleshoot a problem and that is not the least bit interested in listening to those that do. They have a RAC cluster with one node that restarts and immediately shoots itself in the head.

Is it an Oracle issue? It isn't. But they are not even remotely close to considering that it might not be. So staggering amounts of expensive time are being wasted, and corporate management is being deceived, not through some wicked intention but simply because they rely on the blind to point out the path. When the help desk reports that an application is slow, the first thing every DBA is advised to do is determine where in the technology stack the issue exists: is it the database, or is it perhaps an application or app server issue, a network issue, the storage layer, etc.? Unfortunately this good advice seems to be forgotten when there is a database outage. The sanity that prevails for "slow" disappears, and the default assumption is that the issue is the database.

So for those of you facing issues related to database and instance outages, these notes are for you, and all are based on personal experience.
My Oracle Support
One of the biggest failures I observe in our community is the lack of knowledge of how Oracle Support (now "My Oracle Support" or "MOS", previously "Metalink") works. If you don't understand the rules you cannot play the game. So let's start out by learning how to work effectively with MOS.

Everyone at MOS knows the rules and they rarely violate them. What they do, which is equally bad, is take advantage of the customer's ignorance of those rules.
When to open the SR
Set a timer. If you cannot get everything back online in less than 3 minutes, your first step is to open an SR with Oracle Support. If by some chance you fix the problem before they respond, close it: no harm done. If you have not, use Oracle Support to sanity-check what you are thinking of doing. It is not uncommon for DBAs to make a bad situation far worse by doing the wrong thing. You want Oracle to agree with your proposed course of action and, in addition to helping with your decision making, having them involved will keep your employer from making your life a living hell (see CYA).
What information to include
When you first open your SR, put in the facts. The MOS form will prompt you for most of them, but "facts" requires, in most cases, that you know what you are running and where you are running it, and that you have reviewed all of the log files. The following is guidance on which log files you should review.
Non-RAC / All
1 $ORACLE_BASE/diag/<database_name>/<instance_name>/trace/alert.log
2 .bash_history
3 /var/log/messages
5 If you are using thin provisioning, make sure that you do not over-provision: while the pool may show X TB, you may not really have that much physical space. I had a situation where the failure of a backup and redo log delete filled the physical space even though the virtualized space made it appear there was sufficient room.
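As a rough illustration of a first pass over the log files listed above, the following Python sketch pulls out the lines that mention an ORA- error. The function names `find_ora_errors` and `scan_log` are hypothetical, and the pattern is only an assumption about what is worth flagging; it is a starting point for a review, not a substitute for reading the log.

```python
import re
from pathlib import Path

# Assumption: lines of interest contain an Oracle error code such as ORA-00600.
ORA_PATTERN = re.compile(r"\bORA-\d+")


def find_ora_errors(lines):
    """Return (line_number, text) pairs for lines that mention an ORA- error."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if ORA_PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log(path):
    """Scan one log file; return an empty list if the file does not exist."""
    log = Path(path)
    if not log.is_file():
        return []
    # errors="replace" keeps the scan going if the log has undecodable bytes.
    with log.open(errors="replace") as handle:
        return find_ora_errors(handle)


if __name__ == "__main__":
    import sys

    # Usage: python scan_alert.py /path/to/alert.log
    for log_path in sys.argv[1:]:
        for number, text in scan_log(log_path):
            print(f"{log_path}:{number}: {text}")
```

The matching lines, with their line numbers, are the kind of fact that belongs in the SR when you open it.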
Additional checks if using ASM and Oracle Clusterware, such as RAC
1 ASM Instance Alert Log: $GRID_HOME/log/<hostname>/alert<hostname>.log
2 Clusterware Logs: $GRID_HOME/log/ocssd<#>.log
3 Cluster Ready Services Daemon (crsd): $CRS_HOME/log/<hostname>/crsd
4 Oracle Cluster Registry (OCR) records: $CRS_HOME/log/<hostname>/client
5 Oracle Process Monitor Daemon (OPROCD): /etc/oracle/<hostname>.oprocd.log
6 Cluster Synchronization Services (CSS): $CRS_HOME/log/<hostname>/cssd
7 Event Manager (EVM): $CRS_HOME/log/<hostname>/evmd
8 RAC RACG: $CRS_HOME/log/<hostname>/racg
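To avoid hunting for these locations during an outage, the list above can be turned into a small checklist. The sketch below is a minimal example, assuming the GRID_HOME and CRS_HOME environment variables are set and that the log directories use the name returned by `socket.gethostname()` (on some hosts that is a fully qualified name while the directory uses the short name). The function names are hypothetical, and the Clusterware log entry is left out because its file name contains a `<#>` placeholder that is not spelled out here.

```python
import os
import socket
from pathlib import Path


def clusterware_log_paths(grid_home, crs_home, hostname):
    """Build the log locations from the list above for one host."""
    grid = Path(grid_home)
    crs_log = Path(crs_home) / "log" / hostname
    return {
        "ASM instance alert log": grid / "log" / hostname / f"alert{hostname}.log",
        "crsd": crs_log / "crsd",
        "OCR client records": crs_log / "client",
        "OPROCD": Path("/etc/oracle") / f"{hostname}.oprocd.log",
        "cssd": crs_log / "cssd",
        "evmd": crs_log / "evmd",
        "racg": crs_log / "racg",
    }


def missing_logs(paths):
    """Return the names whose file or directory does not exist, sorted."""
    return sorted(name for name, path in paths.items() if not path.exists())


if __name__ == "__main__":
    # Assumption: both variables are exported in the current shell.
    paths = clusterware_log_paths(
        os.environ.get("GRID_HOME", ""),
        os.environ.get("CRS_HOME", ""),
        socket.gethostname(),
    )
    for name, path in paths.items():
        status = "missing" if not path.exists() else "found"
        print(f"{name}: {path} ({status})")
```

A location reported as missing usually means the assumed home directory or host name is wrong for this node, so check those before concluding that a log does not exist.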
My Oracle Support: Topic 2
Not all DBAs are of equal quality, and neither are all Oracle support engineers. If you have uploaded an RDA and a support engineer asks you for information that is in that RDA, ask to speak to an escalation manager: doing so is your right as a customer. Do not tolerate this "engineer", as they are either lazy or not sufficiently skilled to read what you uploaded.

If you have a support engineer who is asking you to run diagnostic test after diagnostic test, multiple system state dumps, etc., and it is not getting you anywhere, ask to speak to an escalation manager.

The time you are wasting is your own.
Terms and Definitions
Term    Definition
CSI     Customer Support Identifier
MOS     My Oracle Support. The horrible, but improving, website that provides Oracle on-line support.
RDA     Remote Diagnostic Agent
Sev     Severity. An outage is Sev 1.
SR      Service Request
TAR     Technical Assistance Request: the old name for a Service Request (SR)
Tactics 1: Identify the Root Cause

Related Topics
DBA Best Practices
DBA Best Practice Guidelines
Developer Best Practice Guidelines