Reasons for package failover
Off the shelf, with no enhancement by the Event Monitoring Service (EMS) High Availability Monitors, Serviceguard has a limited repertoire of package failover triggers. The following list describes the various causes for Serviceguard to initiate a package failover. They fall into two main categories: heartbeat transmission problems at the cluster level, and package "up" criteria not being met.
Heartbeat transmission failure
Serviceguard uses a heartbeat transmission protocol and network to verify the vitality of the active nodes in the cluster. If a heartbeat fails to arrive from a particular node, the cluster performs a cluster reformation, waiting a grace period for the missing node to re-vote itself into cluster membership and retain its packages. If it fails to vote in, the reformation excludes it from the running cluster and its packages may be adopted by other active servers.
Heartbeat transmission failures may be due to:
- Cluster node rebooted itself (due to a hardware or software crash) or power failed.
A reboot may be induced by a hardware problem, a kernel bug, or by Serviceguard itself if it is unable to update the kernel-based safety timer it uses to detect hung kernels.
Though the node may rejoin the cluster automatically after reboot, the reboot time would exceed the NODE_TIMEOUT (see below) and cluster reformation time.
Keep up with patching and firmware updates and use a UPS to reduce this possibility.
- Heartbeat network connectivity to the missing server was severed.
This may be countered by configuring more than one heartbeat network and/or configuring a standby LAN NIC for the HB NIC.
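Such a redundant heartbeat layout might look like the following fragment of the cluster ASCII configuration file (node, interface, and address names are illustrative assumptions):

```
NODE_NAME           node1
  NETWORK_INTERFACE lan0
    HEARTBEAT_IP    192.168.1.1   # primary heartbeat NIC
  NETWORK_INTERFACE lan1          # standby NIC for lan0 (no IP configured)
  NETWORK_INTERFACE lan2
    HEARTBEAT_IP    192.168.2.1   # second, independent heartbeat subnet
```

With this layout a single NIC or subnet failure does not interrupt heartbeat traffic.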
- Using the factory-default NODE_TIMEOUT setting.
This value determines how long Serviceguard waits for a heartbeat packet to arrive. Kernel work runs at a higher priority than heartbeat packet generation, so if the kernel is busy with a lengthy task, such as flushing an extremely large buffer cache, heartbeat generation may be delayed beyond the NODE_TIMEOUT, triggering a cluster reformation.
Configuring the cluster with a value of 6-8 seconds usually resolves this.
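In the cluster ASCII configuration file, NODE_TIMEOUT is expressed in microseconds. A sketch of raising it from the 2-second factory default to 8 seconds (values illustrative):

```
# Cluster configuration file fragment.
HEARTBEAT_INTERVAL  2000000   # 2 seconds between heartbeat packets
NODE_TIMEOUT        8000000   # 8 seconds before declaring a node missing
```

The change takes effect once the configuration is re-applied with cmcheckconf and cmapplyconf.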
- Serviceguard network ports (hacl) may be disabled while the cluster is running, preventing further Serviceguard inter-node communication.
This may result in an unexpected abort of cmcld, the main Serviceguard daemon. Should this happen, the node would TOC (reboot) itself.
Some older versions of Serviceguard were susceptible to hung network ports after the use of port-scanning applications.
- A duplicate heartbeat IP or MAC address was configured on a different NIC on the same subnet, which may intercept HB packets.
Package "up" criteria not met
Serviceguard will halt a package on its current node (and potentially fail it over - Note 1) for the following reasons:
- The node the package is running on failed to maintain package SERVICEs.
This usually takes the form of a "monitor" service, which is tasked with validating the existence of an application-specific process critical to the availability of the package. Once that critical process ceases to exist, the monitor service self-terminates after its polling interval has expired and testing reveals the condition.
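A minimal monitor service along these lines can be sketched in shell; the function name, watched process, and polling interval are illustrative assumptions, not Serviceguard APIs:

```shell
#!/bin/sh
# Sketch of a package monitor service. Serviceguard runs such a script
# as a service command; the service "fails" when the script exits,
# which halts the package and may trigger failover.
monitor_app() {
    proc="$1"       # application-critical process to watch (assumed name)
    interval="$2"   # polling interval in seconds (assumed value)
    # Loop for as long as the process exists.
    while pgrep -x "$proc" > /dev/null 2>&1; do
        sleep "$interval"
    done
    # The process is gone: return non-zero so the service terminates
    # and Serviceguard halts the package.
    return 1
}
```

A package control script would invoke something like `monitor_app app_daemon 30` as its monitor service command.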
- A network SUBNET the package depends on has ceased to operate properly.
Serviceguard uses an OSI layer 2 (link-level) polling technique to validate the inter-NIC (primary/standby) or inter-node (no configured standby NIC) communication capabilities of every NIC it is responsible for. If both the primary and the standby (if configured) NICs fail to transmit at the link level, Serviceguard marks the NIC down at the next NETWORK_POLLING_INTERVAL check (2 seconds by default). If the NIC outage signals a failure of the SUBNET, Serviceguard will halt the package and possibly move it to an adoptive node that has not lost the SUBNET.
Subnet failures can be reduced by using standby NICs and redundant network bridges between the primary and standby NICs. Alternatively, removing SUBNET monitoring from the package configuration file causes Serviceguard to ignore subnet outages and leave the package running through such an event. This may be preferable if the SUBNET failure is intermittent, with the subnet returning to operation before a package shutdown triggered by the earlier failure has even completed.
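Whether a package reacts to subnet outages is controlled by the SUBNET entries in its configuration file; a legacy-style fragment (subnet address illustrative):

```
# Package configuration file fragment.
# With this line present, loss of the subnet halts the package;
# removing it makes Serviceguard ignore outages on this subnet.
SUBNET      192.168.1.0
```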
- Serviceguard enables the cluster administrator to prevent package failover to a particular node when that node is in a condition that makes it an unsuitable candidate to run a particular package should that package fail on its primary server. Thus, though a package may fail on its primary server, the adoptive node may be disabled from taking up the package.
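Node-level switching of this kind is administered with cmmodpkg; a sketch with illustrative package and node names (these commands require a running cluster):

```shell
# Prevent pkg1 from being adopted by node2:
cmmodpkg -d -n node2 pkg1

# Later, re-enable node2 as an adoptive node for pkg1:
cmmodpkg -e -n node2 pkg1
```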
- With the addition of the EMS HA Monitors, Serviceguard can be enhanced to initiate package failover when other system resource states or thresholds have been exceeded.
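With the HA Monitors installed, a package declares such a resource dependency in its configuration file; a sketch using an EMS disk-monitor resource (resource path and values are illustrative assumptions):

```
# Package configuration file fragment (EMS resource).
RESOURCE_NAME               /vg/vg01/pv_summary
RESOURCE_POLLING_INTERVAL   60      # seconds between EMS evaluations
RESOURCE_UP_VALUE           = UP    # package stays up while resource is UP
```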