Wednesday, November 16, 2016

Oracle12c: Resolving a Hung Database


Problem
Your database is hung. Users aren’t able to log in, and existing users can’t complete their transactions. DBAs with SYSDBA privileges may also be unable to log in to the database. You need to find out what is causing the database to hang, and fix the problem.

Solution
Follow these general steps when facing a database that appears to be hung:
1. Check your alert log to see if the database has reported any errors, which may indicate why the database is hanging.
2. See if you can get an AWR or ASH report or query some of the ASH views, as explained in Chapter 5. You may notice events such as hard parses at the top of the Load Profile section of the AWR report, indicating that this is what is slowing down the database.
3. A single ad hoc query certainly has the potential to bring an entire database to its knees. See if you can identify one or more very poorly performing SQL statements that may be leading to the hung (or very poorly performing) database.
4. Check the database for blocking locks as well as latch contention (see the example queries after this list).
5. Check the server’s memory usage as well as CPU usage. Make sure the sessions aren’t stalling because you’ve sized the PGA too low.
6. Don’t overlook the fact that a scary-looking database hang may be caused by something as simple as the filling up of all archive log destinations. If the archive destination is full, the database will hang, and new user connections will fail. You can, however, still connect as the SYS user, and once you make room in the archive destination by moving some of the archived redo log files, the database becomes accessible to the users.
7. Check the Flash Recovery Area (FRA). A database also hangs when it’s unable to write Flashback Database logs to the recovery area. When the FRA fills up, the database won’t process new work and it won’t spawn new database connections. You can fix this problem by making the recovery area larger with the alter system set db_recovery_file_dest_size command.
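
For steps 4, 6, and 7, the following statements are a minimal sketch of the kind of checks you might run. The first query lists sessions currently blocked by another session, the second shows how full the recovery area is, and the final command enlarges the recovery area (the 20G value is only an example; size it for your own environment):

SQL> select sid, serial#, blocking_session, event, seconds_in_wait
     from v$session
     where blocking_session is not null;

SQL> select name, space_limit/1024/1024 limit_mb, space_used/1024/1024 used_mb
     from v$recovery_file_dest;

SQL> alter system set db_recovery_file_dest_size=20G scope=both;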

If you’re still unable to resolve the reasons for the hung database, you most likely have a truly hung database. While you’re investigating such a database, you may sometimes find yourself unable to connect and log in. In that case, use the “prelim” option to log in to the database. The prelim option doesn’t require a real database connection. Here's an example that shows how to use the prelim option to log into a database:

C:\app\ora\product\11.2.0\dbhome_1\bin>sqlplus /nolog
SQL*Plus: Release 11.2.0.1.0 Production on Sun Mar 27 10:43:31 2011
Copyright (c) 1982, 2010, Oracle. All rights reserved.
SQL> set _prelim on
SQL> connect / as sysdba
Prelim connection established

Alternatively, you can use the command sqlplus -prelim "/ as sysdba" to log in with the -prelim option. Note that you use the /nolog option to open a SQL*Plus session. You can’t execute the set _prelim on command if you’re already connected to the database. Once you establish a prelim connection as shown here, you can execute the oradebug hanganalyze command to analyze a hung database, for example:

SQL> oradebug hanganalyze 3
Statement processed.
SQL>
In an Oracle RAC environment, specify the oradebug hanganalyze command with additional options, as shown here:
SQL> oradebug setinst all

SQL> oradebug -g def hanganalyze 3

You can repeat the oradebug hanganalyze command a couple of times to generate dump files for varying process states. In addition to the dump files generated by the hanganalyze command, Oracle Support may also request a process state dump, also called a systemstate dump, to analyze hung database conditions. The systemstate dump reports on what the processes are doing and the resources they’re currently holding. You can get a systemstate dump from a non-RAC system by executing the following set of commands:

SQL> oradebug setmypid
Statement processed.
SQL> oradebug dump systemstate 266
Statement processed.
SQL>
Issue the following commands to get a systemstate dump in a RAC environment:
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug -g all dump systemstate 266

Note that unlike the oradebug hanganalyze command, you must connect to a process before taking a systemstate dump. The setmypid option specifies the process, in this case your own process. You can also specify a process ID other than yours, in which case you issue the command oradebug setospid <process_id> before issuing the dump systemstate command. If you try to issue the dump systemstate command without setting the PID, you’ll receive an error:

SQL> oradebug dump systemstate 10
ORA-00074: no process has been specified

You must take the systemstate dumps a few times, with an interval of about a minute or so between the dumps; Oracle Support usually requests several dumps taken this way rather than a single snapshot.
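
A minimal sketch of taking the repeated dumps from a single session might look like this; the host sleep command assumes a Linux/Unix server (on Windows, pause between dumps some other way):

SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug dump systemstate 266
SQL> host sleep 60
SQL> oradebug dump systemstate 266
SQL> host sleep 60
SQL> oradebug dump systemstate 266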

How It Works
A true database hang can be due to a variety of reasons, including a system that has exhausted resources such as the CPU or memory, or because several sessions are stuck waiting for a resource such as a lock. While the database can automatically resolve deadlocks between sessions (by killing one of the sessions holding a needed lock), when there’s a latch or pin on an internal kernel-level resource, Oracle is sometimes unable to automatically detect and resolve the internal deadlock—and this leads to what Oracle Support calls a “true database hang.” A true database hang is thus an internal deadlock or a cyclical dependency among multiple processes. Oracle Support will usually ask you to provide them the hanganalyze trace files and multiple systemstate dumps to enable them to diagnose the root cause of your hang.

At times like this, you may not even be able to log in to the database. Your first instinct when you realize that you can’t even log in to a database is to try shutting down and restarting, often referred to as bouncing the database. Unfortunately, while shutting down and restarting the database may “resolve” the issue, it’ll also disconnect all users—and you’re no wiser as to what exactly caused the problem. If you do decide to bounce your database, quickly generate a few hanganalyze and systemstate dumps first.

Tip As unpleasant as it may be at times, if you find that you simply can’t connect to a hung database, then collect any trace dumps you may need, and quickly bounce the database so that users can access their applications. Especially when you’re dealing with a database that’s hanging because of memory issues, bouncing the instance may get things going again quickly.

If you find that the database is completely unresponsive, and you can’t even log in to the database with the SYSDBA privilege, you can use the prelim option to log in to the database. The prelim option stands for preliminary connection, and it starts an Oracle process and attaches that process to the SGA shared memory. However, this is not a full or complete connection, but a limited connection where the structures for query execution are not set up—so you cannot even query the V$ views. However, the prelim option lets you run oradebug commands to get error dump stacks for diagnostic purposes. The output of the hanganalyze command can tell Oracle Support engineers whether your database is really hanging because sessions are waiting for some resource. The command makes internal kernel calls to find out all the sessions that are waiting for a resource and shows the relationship between the blocking and waiting sessions. The hanganalyze option that you can specify with either the oradebug command or an alter session statement produces details about hung sessions. Once you get the dump file, Oracle Support personnel can analyze it and let you know the reasons for the database hang.

You can invoke the hanganalyze command at various levels ranging from 1 to 10. Level 3 dumps processes that are in a hanging (IN_HANG) state. You normally don’t need to specify a level higher than 3, because higher levels will produce voluminous reports with too many details about the processes.

Note that we issued the oradebug command to get a systemstate dump with a level of 266. Level 266 (a combination of level 256, which produces short-stack information, and level 10) applies to Oracle releases 9.2.0.6 and onward (earlier releases used systemstate level 10). Level 266 allows you to dump the short stacks for each process, which are Oracle function calls that help Oracle development teams determine which Oracle function is causing the problem. The short-stack information also helps in matching known bugs in the code. On Solaris and Linux systems you can safely specify level 266, but on other systems it may take a long time to dump the short stacks, so you may want to stick with level 10 on those platforms. If you can identify the blocking session, you can also take a dump just for that session by using the command oradebug setospid nnnn, where nnnn is the blocking session’s operating system process ID, and then invoking the oradebug command, as shown here:

SQL> oradebug setospid 9999
SQL> oradebug unlimit
SQL> oradebug dump errorstack 3
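
If you don’t already know the operating system process ID of the blocking session, you can look it up by joining V$SESSION and V$PROCESS; this query is a sketch that assumes the blockers show up in V$SESSION’s BLOCKING_SESSION column:

SQL> select s.sid, s.username, p.spid
     from v$session s, v$process p
     where s.paddr = p.addr
     and s.sid in (select blocking_session from v$session
                   where blocking_session is not null);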

Note that you can generate the hanganalyze and systemstate dumps in a normal session (as well as in a prelim session), without using the oradebug command. You can invoke the hanganalyze command with an alter session command, as shown here:

SQL>alter session set events 'immediate trace name hanganalyze level 3';
Similarly, you can get a systemstate dump with the following command:

SQL> alter session set events 'immediate trace name SYSTEMSTATE level 10';
Session altered.
SQL>

The hanganalyze and systemstate dumps are just two of the many dumps you can collect with oradebug. Use the oradebug dumplist command to view the various error dumps you can collect:

SQL> oradebug dumplist
TRACE_BUFFER_ON
TRACE_BUFFER_OFF
LATCHES
PROCESSSTATE
SYSTEMSTATE
INSTANTIATIONSTATE
REFRESH_OS_STATS
CROSSIC
CONTEXTAREA
HANGDIAG_HEADER
HEAPDUMP
Note that while you can read some of the dump files in an editor, these files are mainly for helping Oracle Support professionals troubleshoot a database hang situation. There’s not much you can do with the dump files, especially when a database hang situation is due to an Oracle bug or a kernel-level lock, except to send them along to Oracle Support for analysis.
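
If you need to locate the directory where these trace and dump files are written, you can query V$DIAG_INFO, for example:

SQL> select name, value from v$diag_info
     where name in ('Diag Trace', 'Default Trace File');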

Oracle12c: Resolving Open Cursor Errors

Problem
You are frequently getting the Maximum Open Cursors exceeded error, and you want to resolve the error.

Solution
One of the first things you need to do when you receive the ORA-01000: “maximum open cursors exceeded” error is to check the value of the initialization parameter open_cursors. You can view the current limit for open cursors by issuing the following command:

SQL> sho parameter open_cursors

NAME                                 TYPE        VALUE
------------------------------------ ----------- ---------
open_cursors                         integer     300
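
If the limit is clearly too low for your application, you can raise it dynamically; the value 1000 below is only an example, and scope=both assumes the instance uses an spfile:

SQL> alter system set open_cursors=1000 scope=both;

System altered.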

The parameter OPEN_CURSORS sets the maximum number of cursors a session can have open at once, and you specify it to control the number of open cursors. Keeping the parameter’s value too low will result in sessions receiving the ORA-01000 error. There’s no harm in specifying a very large value for the OPEN_CURSORS parameter (unless you expect all sessions to simultaneously max out their cursors, which is unlikely), so you can usually resolve cursor-related errors simply by raising the parameter value to a large number. However, you may sometimes find that raising the value of the open_cursors parameter doesn’t “fix” the problem. In such cases, investigate which sessions are holding the open cursors by issuing the following query:

SQL> select a.value, s.username, s.sid, s.serial#, s.program, s.inst_id
     from gv$sesstat a, gv$statname b, gv$session s
     where a.statistic# = b.statistic#
     and b.inst_id = a.inst_id
     and s.sid = a.sid
     and s.inst_id = a.inst_id
     and b.name = 'opened cursors current';

The GV$OPEN_CURSOR (or V$OPEN_CURSOR) view shows all the cursors that each user session has currently opened and parsed, or cached. You can issue the following query to identify the sessions with a high number of opened and parsed or cached cursors:

SQL> select saddr, sid, user_name, address, hash_value, sql_id, sql_text
     from gv$open_cursor
     where sid in (select sid from v$open_cursor
                   group by sid
                   having count(*) > &threshold);

The query lists all sessions with an open cursor count greater than the threshold you specify. This way, you can limit the query’s output and focus just on the sessions that have opened, parsed, or cached a large number of cursors.

You can get the actual SQL code and the open cursor count for a specific session by issuing the following query:

SQL> select sql_id, substr(sql_text,1,50) sql_text, count(*)
     from gv$open_cursor
     where sid = 81
     group by sql_id, substr(sql_text,1,50)
     order by sql_id;


The output shows the SQL code for all open cursors in the session with the SID 81. You can examine all SQL statements with a high open cursor count, to see why the session was keeping a large number of cursors open.

The default value of 50 for the SESSION_CACHED_CURSORS initialization parameter may be too low for
many databases. You can check if the database is bumping against the maximum limit for session cached cursors by issuing the following statement:

SQL> select max(value) from v$sesstat  where statistic# in (select statistic# from v$statname
 where name = 'session cursor cache count');

MAX(VALUE)
----------
49

The query shows the maximum number of session cursors that have been cached in the past. Since this number (49) is virtually the same as the default value (or the value you’ve set) for the SESSION_CACHED_CURSORS parameter, you must set the parameter's value to a larger number. Session cursor caches use the shared pool. If you’re using automatic memory management, there’s nothing for you to do after you reset the SESSION_CACHED_CURSORS parameter—the database will bump up the shared pool size if necessary. You can find out how many cursors each session has in its session cursor cache by issuing the following query:

SQL> select a.value, s.username, s.sid, s.serial#
     from v$sesstat a, v$statname b, v$session s
     where a.statistic# = b.statistic#
     and s.sid = a.sid
     and b.name = 'session cursor cache count';
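
If you decide to raise the limit, note that SESSION_CACHED_CURSORS is not dynamically modifiable at the system level, so the change (sketched here with 100 purely as an example value, and assuming an spfile) takes effect only after an instance restart:

SQL> alter system set session_cached_cursors=100 scope=spfile;

System altered.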

Wednesday, February 3, 2016

Oracle and RAID

source1
source2

None: Any RAID operation involves a certain management overhead, so if your requirement is for optimum write speed and no data protection you may choose to avoid RAID.

RAID 0: This RAID level provides automatic block-level striping of data across multiple disks to balance the load across the disk array. It provides no protection from data loss!

RAID 1: This RAID level is also known as disk mirroring. The RAID controller keeps a complete copy of each disk on at least one other disk. In the event of a disk failure the RAID controller switches to one of the mirrors to prevent system failure. Double or triple mirroring can be used to provide increased levels of security.

RAID 0+1: As the name suggests, this is a combination of RAID 0 and RAID 1, providing the benefits of block-level striping across the array and the security of disk mirroring. The striping occurs across disks and the entire set is mirrored. If you can't do 1+0, this is the best RAID option for Oracle.

RAID 1+0 (or 10): As the name suggests, this is a combination of RAID 1 and RAID 0. This sounds like it should be the same as 0+1, but it is subtly different. Each disk is mirrored individually and striping occurs across all the mirrored pairs. This is the best RAID for Oracle.

RAID 5: This RAID level stripes data and parity information across 3 or more disks. The parity information, which is always stored on a separate disk to its corresponding data, allows the contents of lost blocks to be derived. The significant write overhead associated with this RAID level makes it slower than the previous methods, especially when a disk failure occurs, but it requires far fewer disks so it is very cost effective. In the past people have avoided RAID 5 for database applications, but improvements in disk speed and controller performance mean that it is a viable solution for datafiles if performance is not a major consideration.

Oracle RAID Usage

So we know what the different RAID levels do, but which level should be used for Oracle? The answers are not carved in stone, but here are some suggestions, with the RAID levels listed in order of preference.
Control Files
Preferred RAID levels: RAID 1+0, RAID 0+1, RAID 0, No RAID
Control files are updated constantly, so the quicker they can be accessed the better. Multiple control files should always be used whether you use RAID or not. When using no RAID or RAID 0 it is up to you to make sure that at least one copy of the control file is always available.

Online Redo Logs
Preferred RAID levels: RAID 1+0, RAID 0+1, RAID 0, No RAID
Similar access requirements to control files. Once again, the redo logs should always be multiplexed whether you use RAID or not.

Temporary Datafiles
Preferred RAID levels: No RAID, RAID 0
The main requirement here is rapid access, not reliability. If the datafile is lost it can simply be recreated, as there is no data to restore.

Archived Redo Logs
Preferred RAID levels: RAID 1+0, RAID 0+1, RAID 0, No RAID
Always multiplex these, even when using RAID, if you can afford the space. If space is an issue, rely on RAID to provide redundancy. It's dangerous to use no RAID or RAID 0 without manually multiplexing them. If you lose archived redo logs you may have compromised your backup and recovery.

Rollback/Undo Datafiles
Preferred RAID levels: RAID 1+0, RAID 0+1
These files require constant I/O and must be protected. They cannot be multiplexed by Oracle, so let the hardware do it for you.

Datafiles
Preferred RAID levels: RAID 1+0, RAID 0+1, or RAID 5
Datafiles with heavy I/O requirements should use RAID 1+0 (or 0+1). It is the fastest and most secure option. If price is more of a consideration than performance, you can consider RAID 5. Most I/O operations to datafiles are buffered, with the physical writes happening in the background. As a result, RAID 5 can be used to reduce costs without a significant impact on low-performance systems.

RAID and ASM

Automatic Storage Management (ASM) is Oracle's preferred solution to provide Stripe And Mirror Everything (SAME) storage. Even though ASM can provide double or triple mirroring, Oracle recommend that hardware RAID is used in preference to ASM mirroring. In that configuration, all LUNs presented to ASM are protected by hardware RAID, so ASM disk groups are defined using external redundancy and no failure groups.
The Recommendations for Storage Preparation documentation provides some helpful pointers, including the following:
  • In most cases you will only need two disk groups (DATA and FRA), where DATA holds all database-related files and FRA holds the fast recovery area, including multiplexed copies of online redo logs and controlfiles. Typically, the FRA disk group will be twice the size of the DATA disk group, since it must hold all backups.
  • Oracle recommend a minimum of 4 LUNs per disk group, with LUNs using hardware RAID and external redundancy if possible. In this case, ASM still stripes across all LUNs, but it relies on hardware RAID to provide the mirroring (see the sketch after this list).
  • All LUNs within a disk group should be the same size and have the same performance characteristics. So don't mix LUNs of different RAID levels within a disk group.
  • LUNs should be made up from disks dedicated to Oracle, not shared with other applications.
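
As a sketch of what this looks like in practice, the following statement, run from the ASM instance, creates a disk group with external redundancy; the disk group name and disk paths are purely illustrative and will differ on your system:

SQL> create diskgroup data external redundancy
     disk '/dev/oracleasm/disks/DATA1',
          '/dev/oracleasm/disks/DATA2',
          '/dev/oracleasm/disks/DATA3',
          '/dev/oracleasm/disks/DATA4';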

Conclusion

As most Oracle DBAs know, rules of thumb can be misleading but here goes:

  • If you can afford it, use RAID 1+0 for all your Oracle files and you shouldn't run into too many problems. If you are using ASM, use this RAID level for all LUNs presented to ASM.
  • To reduce costs move datafiles with lower access rates to RAID 5. If you are using ASM, this may mean defining multiple disk groups to hold different files.
  • To reduce costs further move the other datafiles to RAID 5.
  • To reduce costs further experiment with moving redo logs and rollback/undo tablespaces to RAID 5. Pick the combinations that work best for your system.
  • If cost dictates it, move all your Oracle files on to RAID 5.