How to set hangcheck-timer for Linux

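A customer asked how to set the hangcheck module parameters when deploying 10g RAC, so I forwarded him a document for reference. The document below explains it clearly: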




Applies to:

Oracle Server – Enterprise Edition – Release 9.2 to 11.1
Linux x86
Linux x86-64


The hangcheck-timer module is required to run a supported configuration in Oracle Real Application Clusters environments on Linux with Oracle releases 9i, 10g, or 11gR1 RAC.  This note identifies and outlines the requirements for configuring hangcheck-timer in an Oracle Enterprise Linux, Red Hat Linux, or SUSE Linux environment.

Note: Hangcheck-timer is not required starting with Oracle Clusterware 11gR2.

// Starting with Oracle Clusterware 11gR2, the hangcheck module is no longer needed.


This article is provided for product management, system architects, and system administrators involved in deploying and configuring Oracle RAC 9i, 10g, or 11gR1 in a Linux environment. This document will also be useful to field engineers and consulting organizations to facilitate installations and configuration requirements of Oracle in a Linux RAC environment.


Starting in release and later, Oracle RAC environments required using a new I/O fencing model, named the hangcheck-timer module. This module was implemented to replace the Watchdog module, which provided similar fencing functionality. Hangcheck-timer was subsequently delivered as part of the standard kernel distribution for Linux kernel releases 2.4 and above. 

Hangcheck-timer should be loaded at boot time, and monitors the Linux kernel for long operating system hangs that could affect the reliability of a RAC node.  It runs in kernel mode and uses the Time Stamp Counter (TSC) to catch scheduling delays or node hangs.  This is done by setting a timer, then checking when the timer fires as to whether it was delayed by more than the allowed margin of error.  If the duration exceeds the allowed time of (hangcheck_tick + hangcheck_margin seconds), the machine is restarted.  Hangcheck-timer will not cause reboots to occur due to CPU starvation.

 Hangcheck-timer requires three configuration parameters:

  • hangcheck_tick – defines how often, in seconds, the hangcheck-timer checks the node for hangs. The default value is 60 seconds.
  • hangcheck_margin – defines how much margin is allowed, in seconds, between expected scheduling and real scheduling time. The default value is 180 seconds.
  • hangcheck_reboot – determines if the hangcheck-timer restarts the node if the kernel fails to respond within the sum of the hangcheck_tick and hangcheck_margin parameter values. If the value of hangcheck_reboot is equal to or greater than 1, then the hangcheck-timer module restarts the system. If the hangcheck_reboot parameter is set to zero, then the hangcheck-timer module will not reboot the node, even if a hang is detected.   The default value varies by kernel version.  In the 2.4 kernel, the default is 1.  In 2.6 kernels, the default is 0.
All hangcheck-timer default values should be explicitly overridden when loading the kernel module, based on the Oracle release as follows: 
  • 9i: Assuming the default setting of "oracm misscount" is set to 220 seconds
    hangcheck_tick=30 hangcheck_margin=180 hangcheck_reboot=1
  • 10g/11gR1: Assuming the default setting of "CSS misscount" is set to either 30 or 60 seconds:
    hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1

You must always ensure that the Cluster misscount setting is greater than the sum of the setting for hangcheck_tick + hangcheck_margin.
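The rule above is simple arithmetic, so it can be checked before the module is loaded. The following is a minimal sketch, not part of the original note; the function name check_hangcheck is hypothetical, and the misscount value has to be supplied by hand from your own cluster configuration.

```shell
# check_hangcheck TICK MARGIN MISSCOUNT  (hypothetical helper name)
# Succeeds (exit status 0) only when MISSCOUNT > TICK + MARGIN,
# which is the rule stated in this note.
check_hangcheck() {
    tick=$1
    margin=$2
    misscount=$3
    sum=$((tick + margin))
    if [ "$misscount" -gt "$sum" ]; then
        echo "OK: misscount=$misscount > tick+margin=$sum"
        return 0
    fi
    echo "BAD: misscount=$misscount <= tick+margin=$sum"
    return 1
}

check_hangcheck 30 180 220   # 9i values recommended in this note
check_hangcheck 1 10 30      # 10g/11gR1 values with the lower misscount
```

With the module defaults (tick 60, margin 180) and a misscount of 220 the check fails, because 60 + 180 = 240, which illustrates why the note asks for the defaults to be overridden explicitly.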

When running Oracle Clusterware on Linux, hangcheck-timer should always be configured on each RAC cluster node, as the functionality of this module is required to provide I/O Fencing to ensure no stray writes will occur from an evicted node in a RAC cluster.  To verify if the hangcheck-timer module is running on a node execute as the root or oracle user:

# /sbin/lsmod | grep hangcheck


hangcheck-timer         2672   0

If the hangcheck-timer module is loaded (running) you will see output similar to above. When hangcheck-timer is not loaded no output is generated, and the command prompt is returned to the user.
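For use in a health-check script, the same test can be wrapped so that it returns an exit status instead of relying on someone reading the output. This is only a sketch: the function name hangcheck_loaded is hypothetical, and it accepts both "hangcheck-timer" and "hangcheck_timer" on the assumption that some kernels list the module name with an underscore.

```shell
# hangcheck_loaded (hypothetical helper name)
# Reads lsmod-style output on stdin and succeeds if a hangcheck-timer
# entry is present. Both spellings of the module name are accepted.
hangcheck_loaded() {
    grep -Eq '^hangcheck[-_]timer[[:space:]]'
}

# Usage on a cluster node (not executed here):
#   /sbin/lsmod | hangcheck_loaded && echo "loaded" || echo "NOT loaded"
```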

In an Oracle Enterprise Linux, Red Hat 4/5, or SUSE 9/10 environment the hangcheck-timer module is loaded using the modprobe command:

# modprobe hangcheck-timer  hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1

In order to ensure the module is loaded at boot time, you should also place the same command in the appropriate local command execution directory (e.g. /etc/rc.d/rc.local, or /etc/init.d/boot.local).  In earlier releases, hangcheck-timer was loaded using insmod in place of modprobe. Consult your release specific documentation to determine which initialization method is required.
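One way to add the command to the local startup file without duplicating it on repeated runs is sketched below. The function name ensure_hangcheck_line is hypothetical, the values shown are the 10g/11gR1 ones from this note, and the target file is passed as a parameter so that the function can be tried on a scratch copy before touching the real file.

```shell
# The modprobe command from this note (10g/11gR1 values).
HANGCHECK_CMD='modprobe hangcheck-timer hangcheck_tick=1 hangcheck_margin=10 hangcheck_reboot=1'

# ensure_hangcheck_line FILE  (hypothetical helper name)
# Appends the modprobe command to FILE unless an uncommented
# "modprobe hangcheck-timer" line is already there.
ensure_hangcheck_line() {
    rcfile=$1
    if grep -Eq '^[[:space:]]*(/sbin/)?modprobe[[:space:]]+hangcheck-timer' "$rcfile" 2>/dev/null; then
        return 0
    fi
    printf '%s\n' "$HANGCHECK_CMD" >> "$rcfile"
}

# Usage as root (not executed here):
#   ensure_hangcheck_line /etc/rc.d/rc.local
```

The function only checks that a modprobe line exists; it does not verify that an existing line carries the intended parameter values, so those still need a manual look.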

Hangcheck-timer will provide message logging to the system messages log when a failure is detected, and a node restart is initiated by the module:

  • When hangcheck-timer reboots a node, it may leave the message "Hangcheck: hangcheck is restarting the machine" in /var/log/messages.
  • If you see the message "Hangcheck: hangcheck value past margin!" in /var/log/messages, a reboot was required but was not performed because hangcheck_reboot was not set to 1.  In that case, reload the hangcheck-timer module as described earlier in this note, with hangcheck_reboot set to 1.
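The two messages above can be told apart with a short log scan. This is a sketch under the assumption that the messages appear in the log exactly as quoted in this note; the function name hangcheck_log_status and its three result strings are hypothetical.

```shell
# hangcheck_log_status LOGFILE  (hypothetical helper name)
# Prints one of:
#   rebooted                   - the restart message was logged
#   margin-exceeded-no-reboot  - the margin was exceeded, no restart logged
#   none                       - neither message was found
hangcheck_log_status() {
    logfile=$1
    if grep -q 'Hangcheck: hangcheck is restarting the machine' "$logfile"; then
        echo "rebooted"
    elif grep -q 'Hangcheck: hangcheck value past margin!' "$logfile"; then
        echo "margin-exceeded-no-reboot"
    else
        echo "none"
    fi
}

# Usage as root (not executed here):
#   hangcheck_log_status /var/log/messages
```

A result of margin-exceeded-no-reboot corresponds to the second case above, where the module should be reloaded with hangcheck_reboot=1.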

Known Issues

  • Bug:6125546 which can prevent hangcheck-timer from rebooting in RHEL4 (fixed in or RHEL4.6)




NOTE:232355.1 – Hangcheck Timer FAQ
NOTE:259487.1 – Hangcheck-Timer Module Details
NOTE:559365.1 – Using Diagwait as a diagnostic to get more information for diagnosing Oracle Clusterware Node evictions
NOTE:567730.1 – Changes in Oracle Clusterware on Linux with the Patchset

There is also a document, "best practice on oracle10gRac on linux", that can be used as an implementation reference.