Oracle Server – Enterprise Edition – Version 22.214.171.124 to 126.96.36.199 [Release 9.2 to 11.1]
The hangcheck-timer module is required to run a supported configuration in Oracle Real Application Clusters (RAC) environments on Linux with Oracle releases 9i, 10g, or 11gR1. This note outlines the requirements for configuring hangcheck-timer in an Oracle Enterprise Linux, Red Hat Linux, or SUSE Linux environment.
// From Oracle Clusterware 11gR2 onward, the hangcheck module is no longer needed.
This article is provided for product management, system architects, and system administrators involved in deploying and configuring Oracle RAC 9i, 10g, or 11gR1 in a Linux environment. It will also be useful to field engineers and consulting organizations that install and configure Oracle in a Linux RAC environment.
Starting with release 188.8.131.52, Oracle RAC environments require a new I/O fencing model, the hangcheck-timer module. This module was implemented to replace the Watchdog module, which provided similar fencing functionality. Hangcheck-timer was subsequently delivered as part of the standard kernel distribution for Linux kernel releases 2.4 and above.
Hangcheck-timer should be loaded at boot time. It monitors the Linux kernel for long operating system hangs that could affect the reliability of a RAC node. It runs in kernel mode and uses the Time Stamp Counter (TSC) to catch scheduling delays or node hangs. It does this by setting a timer and then checking, when the timer fires, whether it was delayed by more than the allowed margin of error. If the elapsed time exceeds the allowed time of (hangcheck_tick + hangcheck_margin) seconds, the machine is restarted. Hangcheck-timer will not cause reboots to occur due to CPU starvation.
Hangcheck-timer requires three configuration parameters:
- hangcheck_tick – defines how often, in seconds, the hangcheck-timer checks the node for hangs. The default value is 60 seconds.
- hangcheck_margin – defines how much margin is allowed, in seconds, between expected scheduling and real scheduling time. The default value is 180 seconds.
- hangcheck_reboot – determines if the hangcheck-timer restarts the node if the kernel fails to respond within the sum of the hangcheck_tick and hangcheck_margin parameter values. If the value of hangcheck_reboot is equal to or greater than 1, then the hangcheck-timer module restarts the system. If the hangcheck_reboot parameter is set to zero, then the hangcheck-timer module will not reboot the node, even if a hang is detected. The default value varies by kernel version. In the 2.4 kernel, the default is 1. In 2.6 kernels, the default is 0.
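The way the three parameters interact can be sketched as a small decision function. This is a simplified model of the behavior described above, not the kernel module's code; the function name `hangcheck_decision` and the returned labels are hypothetical and chosen only for illustration.

```python
def hangcheck_decision(elapsed, tick, margin, reboot):
    """Model what hangcheck-timer does when its timer fires.

    elapsed: seconds between arming the timer and the timer firing
    tick, margin, reboot: the hangcheck_tick, hangcheck_margin and
    hangcheck_reboot parameter values
    """
    # Within the allowed window: no hang is reported.
    if elapsed <= tick + margin:
        return "ok"
    # Past the window: restart only if hangcheck_reboot is 1 or greater.
    if reboot >= 1:
        return "reboot"
    # hangcheck_reboot=0: the hang is detected but the node is not restarted.
    return "log-only"


# With the 10g/11gR1 style values tick=1, margin=10 the window is 11 seconds.
print(hangcheck_decision(5, tick=1, margin=10, reboot=1))   # ok
print(hangcheck_decision(12, tick=1, margin=10, reboot=1))  # reboot
print(hangcheck_decision(12, tick=1, margin=10, reboot=0))  # log-only
```

The last call illustrates why the 2.6 kernel default of hangcheck_reboot=0 matters: the hang is noticed, but no fencing takes place.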
9i: Assuming "oracm misscount" is at its default of 220 seconds:
hangcheck_tick=30 hangcheck_margin=180 hangcheck_reboot=1
10g/11gR1: Assuming "CSS misscount" is at its default of either 30 or 60 seconds:
hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1
Always ensure that the cluster misscount setting is greater than the sum of hangcheck_tick and hangcheck_margin.
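The rule above is simple arithmetic, and the following sketch checks it against the values quoted in this note. The helper name `settings_respect_misscount` is an assumption made for illustration and is not part of any Oracle tool.

```python
def settings_respect_misscount(misscount, tick, margin):
    """True when misscount is strictly greater than tick + margin."""
    return misscount > tick + margin


# 9i: misscount 220, tick 30, margin 180 -> 30 + 180 = 210 < 220
print(settings_respect_misscount(220, tick=30, margin=180))  # True

# 10g/11gR1: misscount 30, tick 1, margin 10 -> 1 + 10 = 11 < 30
print(settings_respect_misscount(30, tick=1, margin=10))     # True

# Module defaults (tick 60, margin 180) against a misscount of 60:
# 60 + 180 = 240 is not below 60, so the defaults would violate the rule.
print(settings_respect_misscount(60, tick=60, margin=180))   # False
```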
When running Oracle Clusterware on Linux, hangcheck-timer should always be configured on each RAC cluster node. The module is required to provide I/O fencing, which ensures that no stray writes occur from an evicted node in a RAC cluster. To verify whether the hangcheck-timer module is running on a node, execute the following as the root or oracle user:
If the hangcheck-timer module is loaded (running), you will see output similar to above. When hangcheck-timer is not loaded, no output is generated and the command prompt is returned to the user.
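One way to script the same check is sketched below. It assumes the kernel lists loaded modules one per line in /proc/modules with the module name as the first field, and that the module appears there as `hangcheck_timer` (with an underscore). The function names and the sample text are hypothetical.

```python
def loaded_modules(modules_text):
    """Return the set of module names from /proc/modules-style text."""
    return {line.split()[0] for line in modules_text.splitlines() if line.strip()}


def is_hangcheck_loaded(modules_text):
    """True when hangcheck_timer appears among the loaded modules."""
    return "hangcheck_timer" in loaded_modules(modules_text)


def read_proc_modules(path="/proc/modules"):
    """Read the live module list; call this on the node being checked."""
    with open(path) as handle:
        return handle.read()


# Illustrative sample text, not output captured from a real node.
sample = "hangcheck_timer 2428 0 - Live 0x00000000\next3 123456 2 - Live 0x00000000\n"
print(is_hangcheck_loaded(sample))                          # True
print(is_hangcheck_loaded("ext3 123456 2 - Live 0x0\n"))    # False
```

On a cluster node, `is_hangcheck_loaded(read_proc_modules())` would perform the check against the running kernel.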
In an Oracle Enterprise Linux, Red Hat 4/5, or SUSE 9/10 environment the hangcheck-timer module is loaded using the modprobe command:
To ensure the module is loaded at boot time, also place the same command in the appropriate local startup script (e.g. /etc/rc.d/rc.local or /etc/init.d/boot.local). In earlier releases, hangcheck-timer was loaded using insmod in place of modprobe. Consult your release-specific documentation to determine which initialization method is required.
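As a rough illustration of what such a load command looks like, the sketch below assembles a modprobe invocation from the three parameters. It assumes the usual modprobe convention of passing module parameters as `name=value` arguments after the module name; the helper `modprobe_command` is hypothetical, and the resulting line should be checked against your release documentation before use.

```python
def modprobe_command(tick, margin, reboot=1):
    """Build a modprobe command line for hangcheck-timer (illustrative only)."""
    args = [
        "modprobe",
        "hangcheck-timer",
        f"hangcheck_tick={tick}",
        f"hangcheck_margin={margin}",
        f"hangcheck_reboot={reboot}",
    ]
    return " ".join(args)


# Values taken from the 10g/11gR1 recommendation in this note.
print(modprobe_command(tick=1, margin=10))
# modprobe hangcheck-timer hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1
```

Generating the line from one place makes it easier to keep the interactive command and the boot-time entry identical.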
Hangcheck-timer writes to the system messages log when a failure is detected and the module initiates a node restart:
- When hangcheck-timer reboots a node, it may leave the message "Hangcheck: hangcheck is restarting the machine" in /var/log/messages.
- If you see the following message in /var/log/messages: "Hangcheck: hangcheck value past margin!" this means a reboot was required but was not performed, because hangcheck_reboot was not set to 1. If this message is seen, you must reload the hangcheck module as described earlier in this note, with the hangcheck_reboot value set to 1.
- Bug:6125546 can prevent hangcheck-timer from rebooting the node in RHEL4 (fixed in 184.108.40.206 or RHEL4.6).
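The two log messages quoted above can be picked out of a messages file with a short scan. This is a minimal sketch: the message strings come from this note, while the function name, the labels, and the sample syslog lines are hypothetical.

```python
REBOOT_MSG = "Hangcheck: hangcheck is restarting the machine"
PAST_MARGIN_MSG = "Hangcheck: hangcheck value past margin!"


def classify_hangcheck_lines(lines):
    """Return (label, line) pairs for hangcheck messages found in lines."""
    findings = []
    for line in lines:
        if REBOOT_MSG in line:
            # The module initiated a node restart.
            findings.append(("rebooted", line.rstrip()))
        elif PAST_MARGIN_MSG in line:
            # A hang was detected but hangcheck_reboot was not set to 1.
            findings.append(("hang-not-rebooted", line.rstrip()))
    return findings


# Illustrative lines; real entries carry whatever prefix syslog adds.
sample_log = [
    "Jan  1 00:00:01 node1 kernel: Hangcheck: hangcheck value past margin!\n",
    "Jan  1 00:00:02 node1 kernel: unrelated message\n",
    "Jan  1 00:05:00 node1 kernel: Hangcheck: hangcheck is restarting the machine\n",
]
for label, text in classify_hangcheck_lines(sample_log):
    print(label, "|", text)
```

To scan a real file, pass an open handle, for example `classify_hangcheck_lines(open("/var/log/messages"))`, on a node where that file is readable.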
BUG:6125546 – FASTER FENCING: EXECUTE SYSREQ B IMMEDIATELY
NOTE:232355.1 – Hangcheck Timer FAQ
NOTE:259487.1 – Hangcheck-Timer Module Details
NOTE:559365.1 – Using Diagwait as a diagnostic to get more information for diagnosing Oracle Clusterware Node evictions
NOTE:567730.1 – Changes in Oracle Clusterware on Linux with the 10.2.0.4 Patchset