Wednesday, June 11, 2008

Troubleshooting an HACMP Cluster

It is useful to follow guidelines for troubleshooting. You should be aware of all the diagnostic
tools available from HACMP and AIX 5L. See Chapter 1: Troubleshooting HACMP Clusters
in the Troubleshooting Guide for suggested troubleshooting guidelines, as well as for
information on tuning the cluster for best performance.
When you become aware of a problem, the first place to look for helpful diagnostic information
is the log files. Chapter 2: Using Cluster Log Files in the Troubleshooting Guide describes how
to use the various log files. This chapter also contains information on viewing and maintaining
log file parameters and instructions for redirecting log files.
If log files do not help you resolve the issue, you may need to check cluster components. See
Chapter 3: Investigating System Components and Solving Common Problems in the
Troubleshooting Guide for suggested strategies as well as for a list of solutions to common
problems that may occur in an HACMP environment.
For information specific to Reliable Scalable Cluster Technology (RSCT) subsystems, see the
following IBM publications:
• IBM Reliable Scalable Cluster Technology for AIX 5L and Linux: Group Services
Programming Guide and Reference, SA22-7888
• IBM Reliable Scalable Cluster Technology for AIX 5L and Linux: Administration Guide,
SA22-7889
• IBM Reliable Scalable Cluster Technology for AIX 5L: Technical Reference, SA22-7890
• IBM Reliable Scalable Cluster Technology for AIX 5L: Messages, GA22-7891
Administering an HACMP Cluster
Related Administrative Tasks
The tasks below, while not specifically discussed in this book, are essential for effective system
administration.
Backing Up Your System
The practice of allocating multiple copies of a logical volume can enhance high availability in
a cluster environment, but it should not be considered a replacement for regular system
backups. Although HACMP is designed to survive failures within the cluster, it cannot survive
a catastrophic failure where multiple points of failure leave data on disks unavailable.
Therefore, to ensure data reliability and to protect against catastrophic physical volume failure,
you must have a backup procedure in place and perform backups of your system on a
regular basis.
To maintain your HACMP environment, you must back up the root volume group (which
contains the HACMP software) and the shared volume groups (which contain the data for
highly available applications) regularly. HACMP is like other AIX 5L environments from this
perspective. Back up all nodes.
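As a rough sketch of the backup requirement above, the following function only prints the commands a per-node backup might consist of; it does not run them. The target directory /backup, the node name, and the volume group names are hypothetical, and the use of mksysb for the root volume group and savevg for the shared volume groups is one common AIX approach rather than a procedure prescribed by this guide. Each savevg command would need to run on a node where that volume group is currently varied on.

```shell
#!/bin/sh
# backup_plan TARGET_DIR NODE VG...
# Prints (does not run) one backup command per volume group.
# TARGET_DIR, NODE and the VG names are illustrative placeholders.
backup_plan() {
    dir=$1; node=$2; shift 2
    # The root volume group holds the HACMP software.
    echo "mksysb -i $dir/$node.rootvg.mksysb"
    # Each shared volume group holds highly available application data.
    for vg in "$@"; do
        echo "savevg -i -f $dir/$node.$vg.savevg $vg"
    done
}

backup_plan /backup nodeA sharedvg1 sharedvg2
```

Printing the plan first makes it easy to review it for every node before scheduling the real commands.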
Documenting Your System
As your HACMP system grows and changes, it differs from its initial cluster configuration. It
is your responsibility as system administrator to document all aspects of the HACMP system
unique to your environment. This responsibility includes documenting procedures concerning
the highly available applications, recording changes that you make to the configuration scripts
distributed with HACMP, documenting any custom scripts you write, recording the status of
backups, maintaining a log of user problems, and maintaining records of all hardware. This
documentation, along with the output of various display commands and cluster snapshots, will
be useful for you, as well as for IBM support, to help resolve problems.
Starting with HACMP 5.2, you can use the report feature supplied with the Online Planning
Worksheet program to generate a report of a cluster configuration, then save and print the
report to document the system.
Maintaining Highly Available Applications
As system administrator, you should understand the relationship between your applications and
HACMP. To keep the applications highly available, HACMP starts and stops the applications
that are placed under HACMP control in response to cluster events. Understanding when, how,
and why this happens is critical to keeping the applications highly available, as problems can
occur that require corrective actions.
For a discussion of strategies for making your applications highly available, see the planning
chapters and Appendix B on Applications and HACMP in the Planning Guide.
Helping Users
As the resident HACMP expert, you can expect to receive many questions from end users at
your site about HACMP. The more you know about HACMP, the better you are able to answer
these questions. If you cannot answer questions about your HACMP cluster environment,
contact your IBM support representative.
AIX 5L Files Modified by HACMP
The following AIX 5L files are modified to support HACMP. They are not distributed with
HACMP.
/etc/hosts
The cluster event scripts use the /etc/hosts file for name resolution. All cluster node IP
interfaces must be added to this file on each node.
HACMP may modify this file to ensure that all nodes have the necessary information in their
/etc/hosts file, for proper HACMP operations.
If you delete service IP labels from the cluster configuration using SMIT, we recommend that
you also remove them from /etc/hosts. This reduces the possibility of having conflicting entries
if the labels are reused with different addresses in a future configuration.
Note that DNS and NIS are disabled during HACMP-related name resolution. This is why
HACMP IP addresses must be maintained locally.
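Because name resolution depends on local entries, it can help to confirm that every cluster IP label is present in the hosts file on each node. The sketch below is one possible check, not an HACMP utility; the function name missing_labels and the sample labels are hypothetical.

```shell
#!/bin/sh
# missing_labels HOSTS_FILE LABEL...
# Prints each LABEL that has no active (uncommented) entry in HOSTS_FILE.
missing_labels() {
    hosts=$1; shift
    for label in "$@"; do
        # Strip comments, then look for the label in any name column (field 2 onward).
        sed 's/#.*//' "$hosts" |
            awk -v l="$label" '{ for (i = 2; i <= NF; i++) if ($i == l) found = 1 }
                               END { exit !found }' ||
            echo "$label"
    done
}

# Sample data standing in for /etc/hosts on one node.
sample=/tmp/hosts.sample.$$
cat > "$sample" <<'EOF'
127.0.0.1    loopback localhost
192.168.10.1 nodea_boot
192.168.10.2 nodeb_boot    # second node
EOF
missing_labels "$sample" nodea_boot nodeb_boot app_svc   # prints: app_svc
rm -f "$sample"
```

Running the same check against /etc/hosts on every node would show where an entry still needs to be added.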
/etc/inittab
The /etc/inittab file is modified in each of the following cases:
• HACMP is configured for IP address takeover
• The Start at System Restart option is chosen on the SMIT System Management
(C-SPOC) > Manage HACMP Services > Start Cluster Services panel
• Concurrent Logical Volume Manager (CLVM) is installed with HACMP
• Starting with HACMP 5.3, the /etc/inittab file contains the following entry, which runs
/usr/es/sbin/cluster/etc/rc.init:
hacmp:2:once:/usr/es/sbin/cluster/etc/rc.init
This entry starts the HACMP Communications Daemon, clcomd, and the clstrmgr
subsystem.
Modifications to the /etc/inittab File due to IP Address Takeover
The following entry is added to the /etc/inittab file for HACMP network startup with IP
address takeover:
harc:2:wait:/usr/es/sbin/cluster/etc/harc.net # HACMP network startup
When IP address takeover is enabled, the system edits /etc/inittab to change the rc.tcpip and
inet-dependent entries from run level “2” (the default multi-user level) to run level “a”. Entries
with run level “a” are processed only when the telinit command is run with that run level
specified.
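To see which entries have been moved to run level "a", the inittab fields (identifier:runlevels:action:command) can be filtered by run level. The following is a minimal sketch that reads inittab-style text on standard input; the function name entries_at_level and the sample lines are illustrative, and the sketch assumes that comment lines begin with a colon or a number sign.

```shell
#!/bin/sh
# entries_at_level LEVEL < inittab-style input
# Prints "identifier command" for every entry whose run-level field contains LEVEL.
entries_at_level() {
    awk -F: -v lvl="$1" '
        /^[:#]/ || NF < 4 { next }      # skip comment and malformed lines
        index($2, lvl) {                # field 2 may list several levels, e.g. 23456789
            cmd = $4
            for (i = 5; i <= NF; i++) cmd = cmd ":" $i   # rejoin colons inside the command
            print $1, cmd
        }
    '
}

# Sample entries, not a complete inittab.
entries_at_level a <<'EOF'
init:2:initdefault:
rctcpip:a:wait:/etc/rc.tcpip > /dev/console 2>&1
harc:2:wait:/usr/es/sbin/cluster/etc/harc.net # HACMP network startup
clinit:a:wait:/bin/touch /usr/es/sbin/cluster/.telinit
EOF
```

With the sample input, only the rctcpip and clinit entries are listed, because they are the only ones at run level "a".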
Modifications to the /etc/inittab File due to System Boot
The /etc/inittab file is used by the init process to control the startup of processes at boot time.
When the system boots, the /etc/inittab file calls the /usr/es/sbin/cluster/etc/rc.cluster script
to start HACMP. The entry is added to the /etc/inittab file if the Start at system restart option
is chosen on the SMIT System Management (C-SPOC) > Manage HACMP Services > Start
Cluster Services panel or when the system boots:
hacmp:2:once:/usr/es/sbin/cluster/etc/rc.init
This starts the HACMP Communications Daemon, clcomd, and the clstrmgr subsystem.
Because the inet daemons must not be started until after HACMP-controlled interfaces have
swapped to their service IP address, HACMP also adds the following entry to the end of the
/etc/inittab file to indicate that /etc/inittab processing has completed:
clinit:a:wait:/bin/touch /usr/es/sbin/cluster/.telinit
#HACMP for AIX These must be the last entry in run level “a” in inittab!
pst_clinit:a:wait:/bin/echo Created /usr/es/sbin/cluster/.telinit > /dev/console
#HACMP for AIX These must be the last entry in run level “a” in inittab!
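Since these entries must stay last in run level "a", a quick check after any manual edit of the file can be useful. The sketch below reports the identifier of the last entry at a given run level; the function name last_entry_at_level is hypothetical, and the sample text stands in for /etc/inittab.

```shell
#!/bin/sh
# last_entry_at_level LEVEL < inittab-style input
# Prints the identifier of the last entry whose run-level field contains LEVEL.
last_entry_at_level() {
    awk -F: -v lvl="$1" '
        /^[:#]/ || NF < 4 { next }      # skip comment and malformed lines
        index($2, lvl)    { last = $1 }
        END               { print last }
    '
}

# Sample entries standing in for the real file.
sample='rctcpip:a:wait:/etc/rc.tcpip > /dev/console 2>&1
clinit:a:wait:/bin/touch /usr/es/sbin/cluster/.telinit
pst_clinit:a:wait:/bin/echo Created /usr/es/sbin/cluster/.telinit > /dev/console'

if [ "$(printf '%s\n' "$sample" | last_entry_at_level a)" = "pst_clinit" ]; then
    echo "OK: pst_clinit is the last run level a entry"
else
    echo "WARNING: another entry follows the HACMP entries at run level a"
fi
```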
See Chapter 9: Starting and Stopping Cluster Services, for more information about the files
involved in starting and stopping HACMP.
/etc/rc.net
The /etc/rc.net file is called by cfgmgr, the AIX 5L utility that configures devices and
optionally installs device software into the system, to configure and start TCP/IP during
the boot process. It sets the hostname, default gateway, and static routes. The following entry is
added at the beginning of the file for a node on which IP address takeover is enabled:
# HACMP for AIX \
# HACMP for AIX These lines added by HACMP for AIX software
[ "$1" = "-boot" ] && shift || { # HACMP for AIX
ifconfig lo0 127.0.0.1 up; # HACMP for AIX
/bin/uname -S`hostname|sed 's/\..*$//'`; # HACMP for AIX
exit 0; # HACMP for AIX
} # HACMP for AIX
#
The HACMP entry prevents cfgmgr from reconfiguring boot and service IP addresses while
HACMP is running.
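The guard in the added lines relies on the shell's && and || chaining: when the first argument is -boot, it is consumed with shift and the rest of the file runs; otherwise the braced block runs and the script exits early. The sketch below reproduces only that control flow in a harmless function. The function name and the messages are illustrative, and the comment about who passes -boot is an assumption rather than something stated in this guide.

```shell
#!/bin/sh
# rc_net_guard mimics only the control flow of the HACMP lines in /etc/rc.net.
rc_net_guard() {
    [ "$1" = "-boot" ] && shift || {
        # No -boot flag (assumed to be how cfgmgr calls the file): stop early.
        echo "skip: interfaces left alone"
        return 0
    }
    # -boot was consumed by shift; the rest of rc.net would run from here.
    echo "configure: remaining args = $*"
}

rc_net_guard -boot en0     # prints: configure: remaining args = en0
rc_net_guard en0           # prints: skip: interfaces left alone
```

In the real file the early-exit block also brings up the loopback interface and sets the host name before exiting, as the entry above shows.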
/etc/services
The /etc/services file defines the sockets and protocols used for network services on a system.
The ports and protocols used by the HACMP components are defined here.
#clinfo_deadman 6176/tcp
#clsmuxpd 6270/tcp
#clm_lkm 6150/tcp
#clm_smux 6175/tcp
#godm 6177/tcp
#topsvcs 6178/udp
#grpsvcs 6179/udp
#emsvcs 6180/udp
#clver 6190/tcp
#clcomd 6191/tcp
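To confirm which of these services have an active entry on a node, the services file can be searched by name. The following is a small sketch that reads services-style text (name, then port/protocol) on standard input. The function name service_port is hypothetical, and the sample entries reuse port numbers from the list above, written without the leading number sign on the assumption that this is how an active entry appears.

```shell
#!/bin/sh
# service_port NAME < services-style input
# Prints the port/protocol field of the first active entry for NAME.
service_port() {
    # Drop comments first so that commented-out entries are not matched.
    sed 's/#.*//' | awk -v n="$1" '$1 == n { print $2; exit }'
}

# Sample entries standing in for /etc/services.
sample='clcomd          6191/tcp
topsvcs         6178/udp
grpsvcs         6179/udp'

for name in clcomd topsvcs grpsvcs clver; do
    port=$(printf '%s\n' "$sample" | service_port "$name")
    echo "$name ${port:-MISSING}"
done
```

In this sample, clver is reported as MISSING because no active entry for it is present.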
Note: If, in addition to HACMP, you install HACMP/XD for GLVM, the
following entry for the port number and connection protocol is
automatically added to the /etc/services file on each node on the local
and remote sites on which you installed the software: rpv
6192/tcp. This default value enables the RPV server and RPV
client to start immediately after they are configured, that is, to be in the
available state. For more information, see HACMP/XD for GLVM
Planning and Administration Guide.
/etc/snmpd.conf
Note: The default version of the snmpd.conf file for AIX 5L v5.2 and v5.3
is snmpdv3.conf.
The SNMP daemon reads the /etc/snmpd.conf configuration file when it starts up and when a
refresh or kill -1 signal is issued. This file specifies the community names and associated
access privileges and views, hosts for trap notification, logging attributes, snmpd-specific
parameter configurations, and SMUX configurations for the snmpd. The HACMP installation
process adds a clsmuxpd password to this file. The following entry is added to the end of the
file, to include the HACMP MIB supervised by the Cluster Manager:
smux 1.3.6.1.4.1.2.3.1.2.1.5 "clsmuxpd_password" # HACMP clsmuxpd
HACMP supports SNMP Community Names other than “public.” That is, HACMP functions
correctly if the SNMP Community Name in /etc/snmpd.conf has been changed from the default,
“public”, to any other name. The SNMP Community Name used by HACMP is the first name
reported by the lssrc -ls snmpd command that is not “private” or “system”.
The Clinfo service obtains the SNMP Community Name in the same manner. The Clinfo
service supports the -c option for specifying the SNMP Community Name, but its use is not
required. Using the -c option is considered a security risk because the SNMP Community Name
then appears in the output of the ps command. If it is important to keep the SNMP Community
Name protected, change the permissions on /tmp/hacmp.out, /etc/snmpd.conf, /smit.log, and
/usr/tmp/snmpd.log so that they are not world readable.
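One way to find out which of these files still need attention is to inspect the "other" read bit in the mode string that ls prints. The sketch below only reports files and changes nothing; the function name world_readable is hypothetical.

```shell
#!/bin/sh
# world_readable FILE...
# Prints each existing FILE whose "other" permission bits include read.
world_readable() {
    for f in "$@"; do
        [ -e "$f" ] || continue
        # Character 8 of the ls -ld mode string is the world-read bit.
        mode=$(ls -ld "$f" | cut -c8)
        [ "$mode" = "r" ] && echo "$f"
    done
    return 0
}

world_readable /tmp/hacmp.out /etc/snmpd.conf /smit.log /usr/tmp/snmpd.log
```

Any file that is listed could then be restricted with a command such as chmod o-r, after confirming that nothing else on the system depends on reading it.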
Note: See the AIX documentation for full information on the
/etc/snmpd.conf file. Version 3 (default for AIX 5.2 and up) has some
differences from Version 1.
/etc/snmpd.peers
The /etc/snmpd.peers file configures snmpd SMUX peers. During installation, HACMP adds
the following entry, which includes the clsmuxpd password, to this file:
clsmuxpd 1.3.6.1.4.1.2.3.1.2.1.5 "clsmuxpd_password" # HACMP clsmuxpd
/etc/syslog.conf
The /etc/syslog.conf configuration file is used to control output of the syslogd daemon, which
logs system messages. During the install process HACMP adds entries to this file that direct the
output from HACMP-related problems to certain files.
# example:
# "mail messages, at debug or higher, go to Log file. File must exist."
# "all facilities, at debug and higher, go to console"
# "all facilities, at crit or higher, go to all users"
# mail.debug /usr/spool/mqueue/syslog
# *.debug /dev/console
# *.crit *
# HACMP Critical Messages from HACMP
local0.crit /dev/console
# HACMP Informational Messages from HACMP
local0.info /usr/es/adm/cluster.log
# HACMP Messages from Cluster Scripts
user.notice /usr/es/adm/cluster.log
# HACMP/ES for AIX Messages from Cluster Daemons
daemon.notice /usr/es/adm/cluster.log
The /etc/syslog.conf file should be identical on all cluster nodes.
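A simple way to verify this is to compare checksums of the copies from each node. The sketch below assumes the files have already been collected into one directory, one file per node; the function name differing_copies and the file names are hypothetical, and how the copies are gathered is left open.

```shell
#!/bin/sh
# differing_copies REFERENCE COPY...
# Prints each COPY whose checksum differs from that of REFERENCE.
differing_copies() {
    ref=$(cksum < "$1"); shift
    for f in "$@"; do
        [ "$(cksum < "$f")" = "$ref" ] || echo "$f"
    done
}

# Sample copies standing in for /etc/syslog.conf from three nodes.
dir=/tmp/syslogcmp.$$
mkdir "$dir"
printf 'local0.crit /dev/console\n' > "$dir/nodeA"
printf 'local0.crit /dev/console\n' > "$dir/nodeB"
printf 'local0.info /usr/es/adm/cluster.log\n' > "$dir/nodeC"
differing_copies "$dir/nodeA" "$dir/nodeB" "$dir/nodeC"   # prints the nodeC path
rm -rf "$dir"
```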
/etc/trcfmt
The /etc/trcfmt file is the template file for the system trace logging and report utility, trcrpt.
The installation process adds HACMP tracing to the trace format file. HACMP tracing is
performed for the clstrmgr and clinfo daemons.
Note: HACMP 5.3 and up no longer uses the clsmuxpd daemon; the SNMP
server functions are included in the Cluster Manager—the
clstrmgr daemon.
/var/spool/cron/crontabs/root
The /var/spool/cron/crontabs/root file contains commands needed for basic system control.
The installation process adds HACMP log file rotation to the file.
HACMP Scripts
The HACMP software contains the following scripts.
Startup and Shutdown Scripts
The HACMP software uses each of the following scripts when starting and stopping cluster
services:
/usr/es/sbin/cluster/utilities/clstart
The /usr/es/sbin/cluster/utilities/clstart script, which is called by the
/usr/es/sbin/cluster/etc/rc.cluster script, invokes the AIX 5L System Resource Controller
(SRC) facility to start the cluster daemons. The clstart script starts HACMP with the options
currently specified on the System Management (C-SPOC) > Manage HACMP Services >
Start Cluster Services SMIT panel.
There is a corresponding C-SPOC version of this script that starts cluster services on each
cluster node. The /usr/es/sbin/cluster/sbin/cl_clstart script calls the HACMP clstart script.
At cluster startup, clstart looks for the file /etc/rc.shutdown. The system file /etc/rc.shutdown
can be configured to run user-specified commands during processing of the AIX 5L
/usr/sbin/shutdown command.
Newer versions of the AIX 5L /usr/sbin/shutdown command automatically call HACMP's
/usr/es/sbin/cluster/etc/rc.shutdown, and subsequently call the existing /etc/rc.shutdown (if
it exists).
Older versions of the AIX 5L /usr/sbin/shutdown command do not have this capability. In this
case, HACMP manipulates the /etc/rc.shutdown script, so that both
/usr/es/sbin/cluster/etc/rc.shutdown and the existing /etc/rc.shutdown (if it exists) are run.
Since HACMP needs to stop cluster services before the shutdown command is run, on cluster
startup, rc.cluster replaces any user supplied /etc/rc.shutdown file with the HACMP version.
The user version is saved and is called by the HACMP version prior to its own processing.
When cluster services are stopped, the clstop command restores the user's version of
rc.shutdown.
/usr/es/sbin/cluster/utilities/clstop
The /usr/es/sbin/cluster/utilities/clstop script, which is called from the SMIT Stop Cluster
Services panel, invokes the SRC facility to stop the cluster daemons with the options specified
on the Stop Cluster Services panel.
There is a corresponding C-SPOC version of this script that stops cluster services on each
cluster node. The /usr/es/sbin/cluster/sbin/cl_clstop script calls the HACMP clstop script.
Also see the notes on /etc/rc.shutdown in the section on clstart above for more information.
/usr/es/sbin/cluster/utilities/clexit.rc
If the SRC detects that the clstrmgr daemon has exited abnormally, it executes the
/usr/es/sbin/cluster/utilities/clexit.rc script to halt the system. If the SRC detects that any
other HACMP daemon has exited abnormally, it executes the clexit.rc script to stop these
processes, but does not halt the node.
You can change the default behavior of the clexit.rc script by configuring the
/usr/es/sbin/cluster/etc/hacmp.term file to be called when the HACMP cluster services
terminate abnormally. You can customize the hacmp.term file so that HACMP will take
actions specific to your installation. See the hacmp.term file for full information.
/usr/es/sbin/cluster/etc/rc.cluster
If the Start at system restart option is chosen on the System Management (C-SPOC) >
Manage HACMP Services > Start Cluster Services SMIT panel, the
/usr/es/sbin/cluster/etc/rc.cluster script is called by the /etc/inittab file to start HACMP. The
/usr/es/sbin/cluster/etc/rc.cluster script does some necessary initialization and then calls the
/usr/es/sbin/cluster/utilities/clstart script to start HACMP.
The /usr/es/sbin/cluster/etc/rc.cluster script is also used to start the clinfo daemon on a client.
A corresponding C-SPOC version of this script starts cluster services on each cluster node. The
/usr/es/sbin/cluster/sbin/cl_rc.cluster script calls the HACMP rc.cluster script.
See the man page for rc.cluster for more information.
