
Heartbeat reporting

   Dejan Muhamedagic
   <[1]dmuhamedagic@suse.de>
   v1.0

   hb_report is a utility to collect all information relevant to Heartbeat over
   the given period of time.

Quick start

   Run hb_report on one of the nodes or on the host which serves as a central
   log server. Run hb_report without parameters to see usage.

   A few examples:
     1. Several warnings were encountered during last night's backup
        (logserver is the log host):
logserver# hb_report -f 3:00 -t 4:00 /tmp/report
       collects everything from all nodes from 3am to 4am last night. The files
       are   stored   in   /tmp/report   and   compressed   to  a  tarball
       /tmp/report.tar.gz.
    2. Just found a problem during testing:
node1# date : note the current time
node1# /etc/init.d/heartbeat start
node1# nasty_command_that_breaks_things
node1# sleep 120 : wait for the cluster to settle
node1# hb_report -f time /tmp/hb1 : replace time with the time noted above

Introduction

   Managing  clusters  is  cumbersome.  Heartbeat  v2  with  its  numerous
   configuration files and multi-node clusters just adds to the complexity. No
   wonder then that most problem reports were less than optimal. This is an
   attempt to rectify that situation and make life easier for both the users
   and the developers.

On security

   hb_report is a fairly complex program. As some of you are probably going to
   run it as root, let us state a few important things you should keep in mind:
     1. Don't run hb_report as root! It is fairly simple to set things up in
        such a way that root access is not needed. I won't go into details,
        just to stress that all information collected should be readable by
        accounts belonging to the haclient group.
     2. If you still have to run this as root, well, don't use the -C option.
     3. Of course, every possible precaution has been taken not to disturb
        processes, or to touch or remove files outside the given destination
        directory. If you (by mistake) specify an existing directory, hb_report
        will bail out early. If you specify a relative path, it won't work
        either.

   The final product of hb_report is a tarball. However, the destination
   directory is not removed on any node, unless the user specifies -C. If
   you're too lazy to clean up the previous run, do yourself a favour and just
   supply a new destination directory. You've been warned. If you worry about
   the space used, just put all your directories under /tmp and set up a cron
   job to remove those directories once a week:
        for d in /tmp/*; do
                test -d $d ||
                        continue
                test -f $d/description.txt || test -f $d/.env ||
                        continue
                grep -qs 'By: hb_report' $d/description.txt ||
                        grep -qs '^UNIQUE_MSG=Mark' $d/.env ||
                        continue
                rm -r $d
        done
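
   One way to schedule that cleanup (assuming the loop above is saved as a
   script, say /usr/local/sbin/rm_hb_report_dirs, the path being just an
   example) is an /etc/cron.d entry along these lines:
# run once a week, Sunday at 4:05am
5 4 * * 0   root   /usr/local/sbin/rm_hb_report_dirs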

Mode of operation

   Cluster data collection is straightforward: just run the same procedure on
   all nodes and collect the reports. There is, apart from many small ones, one
   large complication: a central syslog destination. So, in order to allow this
   to be fully automated, we should sometimes run the procedure on the log host
   too. Actually, if there is a log host, then the best way is to run hb_report
   there.

   We use ssh for the remote program invocation. Even though it is possible to
   run hb_report without ssh by doing more of the work by hand, the overall
   user experience is much better if ssh works. Anyway, how else do you manage
   your cluster?

   Another ssh-related point: in case your security policy proscribes
   loghost-to-cluster communication over ssh, you'll have to copy the log file
   to one of the nodes and point hb_report to it.

Prerequisites

     1. ssh
        This is not strictly required, but you won't regret having
        password-less ssh. It is not too difficult to set up (see the sketch
        after this list) and will save you a lot of time. If you can't have it,
        for example because your security policy does not allow such a thing,
        or you just prefer manual work, then you will have to resort to the
        semi-manual semi-automated report generation. See below for
        instructions.
        If you need to supply a password for your passphrase/login, then please
        use the -u option.
    2. Times
       In order to find files and messages in the given period and to parse the
       -f and -t options, hb_report uses perl and one of the Date::Parse or
       Date::Manip  perl  modules.  Note  that you need only one of these.
       Furthermore,  on  nodes  which have no logs and where you don't run
       hb_report directly, no date parsing is necessary. In other words, if you
       run this on a loghost then you don't need these perl modules on the
       cluster nodes.
        On rpm-based distributions, you can find Date::Parse in perl-TimeDate;
        on Debian and its derivatives it is in libtimedate-perl.
     3. Core dumps
        To backtrace core dumps, gdb and the Heartbeat packages with debugging
        info are needed. The debug info packages may be installed at the time
        the report is created. Let's hope that you will need this only seldom.
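
   A minimal sketch of the password-less ssh setup mentioned above, run as
   whichever account you use for hb_report (the node name is just an example;
   the CTS README describes the setup in more detail):
node1# ssh-keygen -t rsa : accept the defaults, use an empty passphrase
node1# ssh-copy-id node2 : repeat for every other node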

What is in the report

    1. Heartbeat related
          + heartbeat version/release information
          + heartbeat configuration (CIB, ha.cf, logd.cf)
          + heartbeat status (output from crm_mon, crm_verify, ccm_tool)
          + pengine transition graphs (if any)
          + backtraces of core dumps (if any)
          + heartbeat logs (if any)
    2. System related
          + general platform information (uname, arch, distribution)
          + system statistics (uptime, top, ps, netstat -i, arp)
    3. User created :)
          + problem description (template to be edited)
    4. Generated
          + problem analysis (generated)

   It is preferred that Heartbeat is running at the time of the report, but it
   is not absolutely required. hb_report will also do a quick analysis of the
   collected information.

Times

   Specifying times can at times be a nuisance. That is why we have chosen to
   use one of the perl modules; they allow a certain freedom when talking
   dates. You can either read the instructions at the [2]Date::Parse examples
   page or just rely on common sense and try stuff like:
3:00          (today at 3am)
15:00         (today at 3pm)
2007/9/1 2pm  (September 1st at 2pm)

   hb_report will (probably) complain if it can't figure out what you mean.
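
   If date parsing fails altogether, a quick way to check whether one of the
   required perl modules is actually installed (a simple check, not part of
   hb_report itself):
# perl -MDate::Parse -e 1 && echo "Date::Parse found"
# perl -MDate::Manip -e 1 && echo "Date::Manip found"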

   Try to delimit the event as closely as possible in order to reduce the size
   of the report, but still leave a minute or two around it for good measure.

   Note that the -f option is required. And don't forget to quote dates when
   they contain spaces.

   It is also possible to extract a CTS test. Just prefix the test number with
   cts: in the -f option.
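
   For example (the test number here is just illustrative):
# hb_report -f cts:19 /tmp/cts19_report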

Should I send all this to the rest of the Internet?

   We make an effort to remove sensitive data from the Heartbeat configuration
   (CIB, ha.cf, and transition graphs). However, you have to tell us what is
   sensitive! Use the -p option to specify additional regular expressions to
   match variable names which may contain information you don't want to leak.
   For example:
# hb_report -f 18:00 -p "user.*" -p "secret.*" /var/tmp/report

   We look by default for variable names matching "pass.*" and the stonith_host
   ha.cf directive.

   Logs and other files are not filtered. Please filter them yourself if
   necessary.
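
   If you do need to scrub something before sending the report, plain sed is
   usually enough. A sketch (the pattern and the path are only examples):
# sed -i 's/MySecretPassword/REDACTED/g' /tmp/report/*/ha-log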

Logs

   It may be tricky to find syslog logs. The scheme used is to log a unique
   message on all nodes and then look it up in the usual syslog locations. This
   procedure  is not foolproof, in particular if the syslog files are in a
   non-standard directory. We look in /var/log /var/logs /var/syslog /var/adm
   /var/log/ha /var/log/cluster. In case we can't find the logs, please supply
   their location:
# hb_report -f 5pm -l /var/log/cluster1/ha-log -S /tmp/report_node1

   If you have different log locations on different nodes, well, perhaps you'd
   like to make them the same and make life easier for everybody.

   The log files are collected from all hosts where they are found. In case
   your syslog is configured to log both to the log server and to local files,
   and hb_report is run on the log server, you will end up with multiple logs
   with the same content.

   Files starting with "ha-" are preferred. In case syslog sends messages to
   more than one file and one of them is named ha-log or ha-debug, those will
   be favoured over syslog or messages.

   If there is no separate log for Heartbeat, possibly unrelated messages from
   other programs are included. We don't filter logs, just pick a segment for
   the period you specified.

   NB: Don't have a central log host? Read the CTS README and set one up.

Manual report collection

   So, your ssh doesn't work. In that case, you will have to run this procedure
   on all nodes. Use -S so that we don't bother with ssh:
# hb_report -f 5:20pm -t 5:30pm -S /tmp/report_node1

   If you also have a log host which is not in the cluster, then you'll have to
   copy the log to one of the nodes and tell us where it is:
# hb_report -f 5:20pm -t 5:30pm -l /var/tmp/ha-log -S /tmp/report_node1

   Furthermore, to prevent hb_report from asking you to edit the report to
   describe the problem on every node, use -D on all but one:
# hb_report -f 5:20pm -t 5:30pm -DS /tmp/report_node1

   If you reconsider and want the ssh setup, take a look at the CTS README file
   for instructions.

Analysis

   The point of analysis is to get the most important information out of
   probably several thousand lines worth of text. Perhaps this should more
   properly be named report review, as it is rather simple, but let's pretend
   that we are doing something utterly sophisticated.

   The analysis consists of the following:
     * compare files coming from different nodes; if they are equal, make one
       copy in the top level directory, remove duplicates, and create soft
       links instead
      * print errors, warnings, and lines matching -L patterns from logs (see
        the example after this list)
     * report if there were coredumps and by whom
     * report crm_verify results
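
   For example, to have lines matching an additional pattern of your own
   included in the analysis (the pattern below is arbitrary):
# hb_report -f 3:00 -t 4:00 -L "fence" /tmp/report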

The goods

    1. Common
          + ha-log (if found on the log host)
          + description.txt (template and user report)
          + analysis.txt
    2. Per node
          + ha.cf
          + logd.cf
          + ha-log (if found)
          + cib.xml (cibadmin -Ql or cp if Heartbeat is not running)
          + ccm_tool.txt (ccm_tool -p)
          + crm_mon.txt (crm_mon -1)
          + crm_verify.txt (crm_verify -V)
          + pengine/ (only on DC, directory with pengine transitions)
          + sysinfo.txt (static info)
          + sysstats.txt (dynamic info)
          + backtraces.txt (if coredumps found)
          + DC (well…)
          + RUNNING or STOPPED

   Last updated 29-Nov-2007 16:12:02 CEST

References

   1. mailto:dmuhamedagic@suse.de
   2. http://search.cpan.org/dist/TimeDate/lib/Date/Parse.pm#EXAMPLE_DATES
