
                                Faq'n Tips

    1. [1]Hey!  This doesn't look like a FAQ!  What gives?
    2. [2]Are there mailing lists for Linux-HA?
    3. [3]What is a cluster?
    4. [4]What is a resource script?
    5. [5]How to monitor various resources? If one of my resources stops
       working, heartbeat doesn't do anything unless the server crashes. How do
       I monitor resources with heartbeat?
    6. [6]If one of my ethernet connections goes away (cable severance, NIC
       failure, locusts), but my current primary node (the one with the
       services) is otherwise fine, no one can get to my services and I want to
       fail them over to my other cluster node.  Is there a way to do this?
    7. [7]Every  time my machine releases an IP alias, it loses the whole
       interface (i.e. eth0)!  How do I fix this?
    8. [8]I want a lot of IP addresses as resources (more than 8).  What's the
       best way?
    9. [9]The documentation indicates that a serial line is a good idea, is
       there really a drawback to using two ethernet connections?
   10. [10]How do I use heartbeat with an ipchains firewall?
   11. [11]I got this message ERROR: No local heartbeat. Forcing shutdown and
       then heartbeat shut itself down for no reason at all!
   12. [12]How do I tune heartbeat on a heavily loaded system to avoid
       split-brain?
   13. [13]When heartbeat starts up I get this error message in my logs:
         WARN: process_clustermsg: node [<hostname>] failed authentication
       [14]What does this mean?
   14. [15]When I try to start heartbeat I receive the message: [16]"Starting
       High-Availability services: Heartbeat failure [rc=1]. Failed."
       [17]and there is nothing in any of the log files and no messages. What
       is wrong?
   15. [18]How do I run multiple clusters on the same network segment?
   16. [19]How do I get the latest CVS version of heartbeat?
   17. [20]Heartbeat on other OSs.
   18. [21]When I try to install the linux-ha.org heartbeat RPMs, they complain
       of dependencies from packages I already have installed!  Now what?
   19. [22]I don't want heartbeat to fail over the cluster automatically.  How
       can I require human confirmation before failing over?
   20. [23]What is STONITH?  And why might I need it?
   21. [24]How do I figure out what STONITH devices are available, and how to
       configure them?
   22. [25]I want to use a shared disk, but I don't want to use STONITH.  Any
       recommendations?
   23. [26]Can heartbeat be configured in an active/active configuration? If
       so, how do I do this, since the haresources file is supposed to be the
       same on each box so I do not know how this could be done.
   24. [27]Why are my interface names getting truncated when they're brought up
       and down?
   25. [28]What is this auto_failback parameter? What happened to the old
       nice_failback parameter?
   26. [29]I am upgrading from a version of Linux-HA which supported
       nice_failback to one that supports auto_failback. How do I avoid a flash
       cut in this upgrade?
   27. [30]If nothing helps, what should I do?
   28. [31]I want to submit a patch, how do I do that?
   ______________________________________________________________________


    1. Quit  your bellyachin'!  We needed a "catch-all" document to supply
       useful information in a way that was easily referenced and would grow
       without a lot of work.  It's closer to a FAQ than anything else.
    2. Yes!  There are two public mailing lists for Linux-HA.  You can find out
       about them by visiting [32]http://linux-ha.org/contact/.
    3. There are several kinds of cluster:
       High Availability (HA) cluster - A cluster that allows a host (or hosts)
       to become highly available. This means that if one node goes down (or a
       service on that node goes down) another node can pick up the service or
       node and take over from the failed machine. [33]http://linux-ha.org
       Computing cluster - This is what a Beowulf cluster is. It allows
       distributed computing over off-the-shelf components, usually cheap
       IA32 machines. [34]http://www.beowulf.org/
       Load balancing cluster - This is what the Linux Virtual Server project
       does. In this scenario one machine load balances requests for a
       certain server (apache, for example) over a farm of servers.
       [35]www.linuxvirtualserver.org
       All of these sites have howtos etc. on them. For a general overview of
       clustering under Linux, look at the Clustering HOWTO.
    4. Resource scripts are basically (extended) System V init scripts. They
       must support stop, start, and status operations.  In the future we will
       also add support for a "monitor" operation for monitoring services as
       you requested. The IPaddr script implements this new "monitor" operation
       now (but heartbeat doesn't use that function of it). For more info see
       Resource HOWTO.
    5. Heartbeat itself was not designed for monitoring various resources. If
       you need to monitor some resources (for example, availability of WWW
       server)  you  need  some  third party software. Mon is a reasonable
       solution.
         A. Get Mon from [36]http://kernel.org/software/mon/.
         B. Get all of the required Perl modules it lists. You can find them at
            your nearest mirror or at the CPAN archive (www.cpan.org). I am not
            very familiar with Perl, so I downloaded them from the CPAN archive
            as .tar.gz packages and installed them in the usual way (perl
            Makefile.PL && make && make test && make install).
         C. Mon is software for monitoring different network resources. It can
            ping computers, connect to various ports, monitor WWW, MySQL etc.
            When a resource fails it triggers alert scripts.
         D. Unpack mon in some directory. The best starting point is the README
            file. Complete documentation is in <dir>/doc, where <dir> is the
            place you unpacked the mon package.
         E. For a fast start, do the following steps:
              a. copy all subdirs found in <dir> to /usr/lib/mon
              b. create dir /etc/mon
              c. copy auth.cf from <dir>/etc to /etc/mon
            Now mon is prepared to work. You need to create your own mon.cf
            file, which names the resources mon should watch and the actions
            mon will start when a resource fails and when it becomes
            available again. All monitoring scripts are in
            /usr/lib/mon/mon.d/. At the beginning of every script you can find
            an explanation of how to use it.
            All alert scripts are placed in /usr/lib/mon/alert.d/. Those are
            the scripts triggered when something goes wrong. If you are using
            ipvs, its homepage (www.linuxvirtualserver.org) has scripts for
            adding and removing servers from an ipvs list.
    6. Yes!  Use the ipfail plug-in.  For each interface you wish to monitor,
       specify one or more "ping" nodes or "ping groups" in your configuration.
       Each node in your cluster will monitor these ping nodes or groups.
       Should one node detect a failure in one of these ping nodes, it will
       contact the other node in order to determine whether it or the ping node
       has the problem.  If the cluster node has the problem, it will try to
       fail over its resources (if it has any).
       To use ipfail, you will need to add the following to your
       /etc/ha.d/ha.cf files:
               respawn hacluster /usr/lib/heartbeat/ipfail
               ping <IPaddr1> <IPaddr2> ... <IPaddrN>
       IPaddr1..N are your ping nodes.  See [37]Kevin's documentation for more
       details on the concepts.  NOTE: ipfail requires the auto_failback option
       to be set to on or off (not legacy).
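       As a rough illustration, the ha.cf lines above can be generated from a
       list of ping nodes. This is only a sketch: the function name is made
       up for this example, and the 192.0.2.x addresses are placeholders, not
       values from any real cluster.

```python
def ipfail_ha_cf_lines(ping_nodes, auto_failback="on"):
    """Return the ha.cf lines that enable ipfail for the given ping nodes.

    Hypothetical helper for illustration; heartbeat itself ships no such tool.
    """
    if not ping_nodes:
        raise ValueError("ipfail needs at least one ping node")
    if auto_failback not in ("on", "off"):
        # The FAQ notes that ipfail needs auto_failback on or off, not legacy.
        raise ValueError("ipfail requires auto_failback on or off")
    return [
        "auto_failback " + auto_failback,
        "respawn hacluster /usr/lib/heartbeat/ipfail",
        "ping " + " ".join(ping_nodes),
    ]


if __name__ == "__main__":
    # Placeholder addresses from the documentation-only 192.0.2.0/24 range.
    print("\n".join(ipfail_ha_cf_lines(["192.0.2.1", "192.0.2.254"])))
```

       The same lines would go into ha.cf on every node in the cluster.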
    7. This isn't a problem with heartbeat, but rather is caused by various
       versions of net-tools.  Upgrade to the most recent version of net-tools
       and it will go away.  You can test it with ifconfig manually.
    8. Instead of failing over many IP addresses, just fail over one router
       address.  On your router, do the equivalent of "route add -net
       x.x.x.0/24 gw x.x.x.2", where x.x.x.2 is the cluster IP address
       controlled by heartbeat.  Then, make every address within x.x.x.0/24
       that you wish to fail over a permanent alias of the loopback interface
       (lo) on BOTH cluster nodes.  This is done via "ifconfig lo:2 x.x.x.3
       netmask 255.255.255.255 -arp", and so on for each address.
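       With many addresses, typing the alias commands by hand gets tedious. A
       minimal sketch that prints one ifconfig command per address is shown
       here; the helper name and the choice to start numbering at lo:2 (as in
       the example above) are assumptions made for illustration.

```python
def loopback_alias_commands(addresses, first_alias=2):
    """Build one 'ifconfig lo:N <addr> ... -arp' command per service address.

    Hypothetical helper; it only builds strings and runs nothing.
    """
    commands = []
    for offset, address in enumerate(addresses):
        alias = "lo:%d" % (first_alias + offset)
        commands.append(
            "ifconfig %s %s netmask 255.255.255.255 -arp" % (alias, address)
        )
    return commands


if __name__ == "__main__":
    # Placeholder addresses; substitute the ones routed to your cluster.
    for command in loopback_alias_commands(["192.0.2.3", "192.0.2.4"]):
        print(command)
```

       Run the resulting commands on both cluster nodes, for example from a
       boot script.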
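    9. If anything makes your ethernet / IP stack fail, you may lose both
       connections, whereas a serial line does not depend on that stack.  If
       you do use two ethernet connections, you should definitely route the
       cables along different paths; how far you go depends on how important
       your data is...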
   10. To make heartbeat work with ipchains, you must accept incoming and
       outgoing traffic on UDP port 694. Add something like
       /sbin/ipchains -A output -i ethN -p udp -s <source_IP> -d <dest_IP> -j ACCEPT
       /sbin/ipchains -A input -i ethN -p udp -s <source_IP> -d <dest_IP> -j ACCEPT
       As written, these rules accept all UDP traffic between the two
       addresses, not only port 694.
   11. This can be caused by one of two things:
          + System under heavy I/O load, or
          + Kernel bug.
       For how to deal with the first cause (heavy load), please read the
       answer to the [38]next FAQ item.
       If your system was not under moderate to heavy load when it got this
       message, you probably have the kernel bug. The 2.4.18 Linux kernel had a
       bug in it which would cause it to not schedule heartbeat for very long
       periods of time when the system was idle, or nearly so. If this is the
       case, you need to get a kernel that isn't broken.
   12. "No local heartbeat" or "Cluster node returning after partition" under
       heavy load is typically caused by too small a deadtime interval. Here is
       a suggestion for how to tune deadtime:
          + Set deadtime to 60 seconds or higher
          + Set warntime to whatever you *want* your deadtime to be.
          + Run your system under heavy load for a few weeks.
          + Look at your logs for the longest time either system went without
            hearing a heartbeat.
          + Set your deadtime to 1.5-2 times that amount.
          + Set warntime to a little less than that amount.
          + Continue to monitor logs for warnings about long heartbeat times.
            If you don't do this, you may get "Cluster node ... returning after
            partition" which will cause heartbeat to restart on all machines in
            the cluster. This will almost certainly annoy you.
       Adding memory to the machine generally helps. Limiting workload on the
       machine generally helps. Newer versions of heartbeat are better about
       this than pre-1.0 versions.
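       The arithmetic in the tuning steps can be sketched as follows. Two
       things here are assumptions, not heartbeat rules: "a little less than
       that amount" is read as a fraction (0.9 by default) of the longest
       observed gap, and the function and parameter names are invented for
       this example.

```python
def suggest_timings(longest_gap_seconds, safety_factor=2.0, warn_margin=0.9):
    """Suggest (deadtime, warntime) in seconds from the longest logged gap.

    longest_gap_seconds: longest time either node went without a heartbeat.
    safety_factor: the FAQ suggests 1.5 to 2 times the longest gap.
    warn_margin: assumed reading of "a little less than that amount".
    """
    if longest_gap_seconds <= 0:
        raise ValueError("longest_gap_seconds must be positive")
    if not 1.5 <= safety_factor <= 2.0:
        raise ValueError("the FAQ suggests a factor between 1.5 and 2")
    if not 0 < warn_margin < 1:
        raise ValueError("warn_margin must be between 0 and 1")
    deadtime = longest_gap_seconds * safety_factor
    warntime = longest_gap_seconds * warn_margin
    return deadtime, warntime


if __name__ == "__main__":
    deadtime, warntime = suggest_timings(12.0)
    print("deadtime %g" % deadtime)
    print("warntime %g" % warntime)
```

       Feed it the longest gap you found in your logs after the observation
       period, then keep watching the logs as described above.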
   13. It's common to get a single mangled packet on your serial interface when
       heartbeat starts up.  This message is an indication that we received a
       mangled  packet.   It's  harmless  in  this scenario. If it happens
       continually, there is probably something else going on.
   14. It's probably a permissions problem on authkeys.  Heartbeat wants the
       file to be accessible only by its owner (mode 400, 600 or 700).
       Depending on where and when it discovers the problem, the message will
       wind up in different places.
       It tends to be in one of:
         1. stdout/stderr
         2. wherever you specified in your setup
         3. /var/log/messages
       Newer releases are better about also putting out startup messages to
       stderr in addition to wherever you have configured them to go.
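       A quick way to rule out the permissions problem is to check the mode
       yourself before starting heartbeat. The sketch below accepts exactly
       the three modes listed above; the function name is made up, and the
       path you pass in depends on where your authkeys file lives.

```python
import os
import stat

# The modes this FAQ lists as acceptable for authkeys.
ALLOWED_MODES = {0o400, 0o600, 0o700}


def authkeys_mode_ok(path):
    """Return True if the file's permission bits are one of ALLOWED_MODES."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode in ALLOWED_MODES


if __name__ == "__main__":
    import sys

    # Pass the location of your authkeys file as the only argument.
    target = sys.argv[1]
    print("ok" if authkeys_mode_ok(target) else "bad permissions, try chmod 600")
```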
   15. Use multicast and give each cluster its own multicast group. If you need
       or want to use broadcast, then run each cluster on a different port
       number.  An example of a configuration using multicast would be to have
       the following line in your ha.cf file:
            mcast eth0 224.1.2.3 694 1 0
       This  sets  eth0 as the interface over which to send the multicast,
       224.1.2.3 as the multicast group (will be same on each node in the same
       cluster), udp port 694 (heartbeat default), time to live of 1 (limit
       multicast to local network segment and not propagate through routers),
       multicast loopback disabled (typical).
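       To check that two clusters on one segment really are separated, the
       mcast line can be split into the five fields described above and
       compared. This is a sketch only: the field names are labels chosen
       for this example, and treating "same group and same port" as the
       collision condition is a simplification.

```python
from collections import namedtuple

# Field names are illustrative labels for the five values after "mcast".
McastLine = namedtuple("McastLine", "interface group port ttl loop")


def parse_mcast(line):
    """Split an ha.cf 'mcast' directive into named fields."""
    fields = line.split()
    if len(fields) != 6 or fields[0] != "mcast":
        raise ValueError("expected: mcast <dev> <group> <port> <ttl> <loop>")
    _, interface, group, port, ttl, loop = fields
    return McastLine(interface, group, int(port), int(ttl), int(loop))


def clusters_collide(line_a, line_b):
    """True if both directives use the same multicast group and UDP port."""
    a, b = parse_mcast(line_a), parse_mcast(line_b)
    return (a.group, a.port) == (b.group, b.port)


if __name__ == "__main__":
    first = "mcast eth0 224.1.2.3 694 1 0"
    second = "mcast eth0 224.1.2.4 694 1 0"  # second cluster, different group
    print(parse_mcast(first))
    print(clusters_collide(first, second))
```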
   16. There  is  a  CVS  repository  for  Linux-HA.  You  can  find it at
       cvs.linux-ha.org.  Read-only access is via login guest, password guest,
       module  name  linux-ha.  More  details  are  to  be  found  in  the
       [39]announcement email.  It is also available through the web using
       viewcvs at [40]http://cvs.linux-ha.org/viewcvs/viewcvs.cgi/linux-ha/
   17. Heartbeat now uses automake and is generally quite portable at this
       point. Join the Linux-HA-dev mailing list if you want to help port it to
       your favorite platform.
   18. Due to distribution RPM package name differences, this was unavoidable.
       If  you're  not  using STONITH, use the "--nodeps" option with rpm.
       Otherwise, use the heartbeat source to build your own RPMs.  You'll have
       the  added dependencies of autoconf >= 2.53 and libnet (get it from
       [41]http://www.packetfactory.net/libnet).  Use the heartbeat source RPM
       (preferred) or unpack the heartbeat source and from the top directory,
       run "./ConfigureMe rpm".  This will build RPMs and place them where it's
       customary for your particular distro.  It may even tell you if you are
       missing some other required packages!
   19. You configure a "meatware" STONITH device into the ha.cf file.  The
       meatware STONITH device asks the operator to go power reset the machine
       which has gone down.  When the operator has reset the machine he or she
       then issues a command to tell the meatware STONITH plug-in that the
       reset  has taken place.  Heartbeat will wait indefinitely until the
       operator acknowledges the reset has occurred.  During this time, the
       resources will not be taken over, and nothing will happen.
   20. STONITH is a form of fencing, and is an acronym standing for Shoot The
       Other Node In The Head.  It allows one node in the cluster to reset the
       other.  Fencing is essential if you're using shared disks, in order to
       protect the integrity of the disk data.  Heartbeat supports STONITH
       fencing, and resources which are self-fencing.  You need to configure
       some kind of fencing whenever you have a cluster resource which might be
       permanently damaged if both machines tried to make it active at the same
       time.  When in doubt check with the Linux-HA mailing list.
   21. To get the list of supported STONITH devices, issue this command:
       stonith -L
       To get all the gory details on exactly what these STONITH device names
       mean, and how to configure them, issue this command:
       stonith -h
   22. This is not something which heartbeat supports directly, however, there
       are a few kinds of resources which are "self-fencing".  This means that
       activating the resource causes it to fence itself off from the other
       node  naturally.  Since this fencing happens in the resource agent,
       heartbeat  doesn't  know  (and doesn't have to know) about it.  Two
       possible hardware candidates are IBM's ServeRAID-4 RAID controllers and
       ICP Vortex RAID controllers - but do your homework!!!   When in doubt
       check with the mailing list.
   23. Yes, heartbeat has supported active/active configurations since its
       first release. The key to configuring active/active clusters is to
       understand that each resource group in the haresources file is preceded
       by the name of the server which is normally supposed to run that
       service. In an "auto_failback on" (or legacy, or old-style
       "nice_failback off") configuration, when a cluster node comes up it
       will take over any resources for which it is listed as the "normal
       master" in the haresources file. Below is an example of how to do this
       for an apache/mysql configuration.
server1 10.10.10.1 mysql
server2 10.10.10.2 apache
       In this case, the IP address 10.10.10.1 should be replaced with the IP
       address you want to contact the mysql server at, and 10.10.10.2 should
       be replaced with the IP address you want people to use to contact the
       web server. Any time server1 is up, it will run the mysql service. Any
       time server2 is up, it will run the apache service. If both server1 and
       server2 are up, both servers will be active. Note that this is
       incompatible with the old nice_failback on parameter. With the new
       release which supports hb_standby foreign, you can manually fail back
       into an active/active configuration if you have auto_failback off. This
       gives administrators more flexibility to fail back in a more
       customized way at safer or more convenient times.
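       To see why the same haresources file works on both boxes, it can help
       to model who runs what. The sketch below parses the two-line example
       and shows the placement with auto_failback on. It is a toy model:
       the function names are invented, line continuations are ignored, and
       "hand orphaned groups to the first surviving node" is an assumption
       that only makes sense for a two-node cluster.

```python
def parse_haresources(text):
    """Map each preferred node name to the list of resource groups it owns."""
    owners = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        node, *resources = line.split()
        owners.setdefault(node, []).append(resources)
    return owners


def placement(owners, nodes_up):
    """Return {node: [resource groups]} for the nodes that are currently up."""
    running = {node: [] for node in nodes_up}
    for preferred, groups in owners.items():
        if preferred in running:
            target = preferred  # the "normal master" is up and keeps its groups
        elif running:
            target = sorted(running)[0]  # assumed: a surviving node takes over
        else:
            continue  # nobody is up, nothing runs
        running[target].extend(groups)
    return running


if __name__ == "__main__":
    example = "server1 10.10.10.1 mysql\nserver2 10.10.10.2 apache\n"
    owners = parse_haresources(example)
    print(placement(owners, {"server1", "server2"}))
    print(placement(owners, {"server2"}))
```

       With both nodes up each runs its own group; with only server2 up it
       runs both.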
   24. Heartbeat was written to use ifconfig to manage its interfaces.  That's
       nice for portability to other platforms, but for some reason ifconfig
       truncates interface names.  If you want fewer than 10 aliases, then you
       need to limit your interface names to 7 characters; for fewer than 100
       aliases, limit them to 6 characters.
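       Both figures are consistent with a limit of 9 characters on the full
       alias name (base name, a colon, and the alias number). That limit is
       inferred from the two numbers above rather than taken from ifconfig
       documentation, so treat this sketch, and its names, as an assumption.

```python
# Inferred from the FAQ's figures: 7 + ":" + 1 digit, or 6 + ":" + 2 digits.
ASSUMED_NAME_WIDTH = 9


def max_base_name_length(highest_alias_number):
    """Longest base interface name that leaves room for ':<alias number>'."""
    if highest_alias_number < 0:
        raise ValueError("alias numbers cannot be negative")
    digits = len(str(highest_alias_number))
    return ASSUMED_NAME_WIDTH - 1 - digits  # the 1 is for the colon


if __name__ == "__main__":
    for highest in (9, 99):
        print(highest, max_base_name_length(highest))
```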
   25. The auto_failback parameter is a replacement for the old nice_failback
       parameter. The old value nice_failback on is replaced by auto_failback
       off. The old value nice_failback off is logically replaced by the new
       auto_failback on parameter. Unlike the old nice_failback off behavior,
       the new auto_failback on allows the use of the ipfail and hb_standby
       facilities.
       During upgrades from nice_failback to auto_failback, it is sometimes
       necessary  to  set auto_failback to legacy, as described in the
       [42]upgrade procedure below.
   26. To upgrade from a pre-auto_failback version of heartbeat to one which
       supports auto_failback, the following procedures are recommended to
       avoid a flash cut on the whole cluster.
         1. Stop heartbeat on one node in the cluster.
         2. Upgrade this node. If the other node has nice_failback on in ha.cf
            then set auto_failback off in the new ha.cf file. If the other node
            in the cluster has nice_failback off then set auto_failback legacy
            in the new ha.cf file.
         3. Start the new version of heartbeat on this node.
         4. Stop heartbeat on the other node in the cluster.
         5. Upgrade this second node in the cluster with the new version of
            heartbeat. Set auto_failback to the same value you used on the
            first node in step 2.
         6. Start heartbeat on this second node in the cluster.
         7. If you set auto_failback to on or off, then you are done.
            Congratulations!
         8. If you set auto_failback legacy in your ha.cf file, then continue
            as described below...
         9. Schedule a time to shut down the entire cluster for a few seconds.
        10. At the scheduled time, stop both nodes in the cluster, and then
            change the value of auto_failback to on in the ha.cf file on both
            sides.
        11. Restart both nodes on the cluster at about the same time.
        12. Congratulations, you're done! You can now use ipfail, and can also
            use the hb_standby command to cause manual resource moves.
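       The decision in steps 2 and 7 to 10 boils down to a small lookup on
       the old nice_failback value. The sketch below just restates that
       mapping; the function name and dictionary keys are made up for this
       example.

```python
def upgrade_plan(old_nice_failback):
    """Return the auto_failback settings for the rolling upgrade.

    old_nice_failback: "on" or "off", as found in the old ha.cf.
    """
    if old_nice_failback == "on":
        # nice_failback on maps straight to auto_failback off; no second pass.
        return {"during_upgrade": "off", "final": "off",
                "needs_cluster_restart": False}
    if old_nice_failback == "off":
        # Run as legacy until both nodes are upgraded, then switch to on
        # during a short scheduled shutdown of the whole cluster.
        return {"during_upgrade": "legacy", "final": "on",
                "needs_cluster_restart": True}
    raise ValueError("nice_failback must be 'on' or 'off'")


if __name__ == "__main__":
    for old in ("on", "off"):
        print(old, upgrade_plan(old))
```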
   27. Please be sure that you have read all the documentation and searched the
       mailing list archives. If you still can't find a solution you can post
       questions to the mailing list. Please include the following:
          + What OS are you running.
          + What version (distro/kernel).
          + How  did you install heartbeat (tar.gz, rpm, src.rpm or manual
            installation)
          + Include your configuration files from BOTH machines. You can omit
            authkeys.
          + Include the parts of your logs which describe the errors.  Send
            them as text/plain attachments.
            Please  don't send "cleaned up" logs.  The real logs have more
            information in them than cleaned up versions.  Always include at
            least  a little irrelevant data before and after the events in
            question so that we know nothing was missed.  Don't edit the logs
            unless you really have some super-secret high-security reason for
            doing so.
            This means you need to attach 6 or 8 files. Include 6 if your debug
            output  goes  into  the  same file as your normal output and 8
            otherwise. For each machine you need to send:
               o ha.cf
               o haresources
               o normal logs
               o debug logs (perhaps)
   28. We love to get good patches.  Here's the preferred way:
          + If you have any questions about the patch, please check with the
            linux-ha-dev mailing list for answers before starting.
          + Make your changes against the current CVS source
          + Test them, and make sure they work ;-)
          + Produce the patch this way:
                   cvs -q diff -u >patchname.txt
          + Send an email to the linux-ha-dev mailing list with the patch as a
            [text/plain] attachment. If your mailer wants to zip it up for you,
            please fix it.
   ______________________________________________________________________

   Rev 0.0.8
   (c) 2000 Rudy Pawul [43]rpawul@iso-ne.com
   (c) 2001 Dusan Djordjevic [44]dj.dule@linux.org.yu
   (c) 2003 IBM (Author Alan Robertson [45]alanr@unix.sh)

References

   1. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#FAQ
   2. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#mailinglists
   3. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#what_is_it
   4. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#res_scr
   5. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#mon
   6. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#ipfail
   7. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#nettools
   8. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#manyIPs
   9. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#serial
  10. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#firewall
  11. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#nolocalheartbeat
  12. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#heavy_load
  13. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#serialerr
  14. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#serialerr
  15. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#authkeys
  16. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#authkeys
  17. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#authkeys
  18. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#multiple_clusters
  19. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#CVS
  20. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#other_os
  21. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#RPM
  22. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#meatware
  23. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#STONITH
  24. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#config_stonith
  25. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#self_fence
  26. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#active_active
  27. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#iftrunc
  28. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#why_auto_failback
  29. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#auto_failback_upgrade
  30. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#last_hope
  31. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#patches
  32. http://linux-ha.org/contact/
  33. http://www.linux-ha.org/
  34. http://www.beowulf.org/
  35. http://www.linuxvirtualserver.org/
  36. http://kernel.org/software/mon/
  37. http://pheared.net/devel/c/ipfail/
  38. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#heavy_load
  39. http://lists.community.tummy.com/pipermail/linux-ha-dev/1999-October/000212.html
  40. http://cvs.linux-ha.org/viewcvs/viewcvs.cgi/linux-ha/
  41. http://www.packetfactory.net/libnet
  42. file://localhost/usr/src/asianux/BUILD/Heartbeat-STABLE-2-1-STABLE-2.1.4/doc/faqntips.html#auto_failback_upgrade
  43. mailto:rpawul@iso-ne.com
  44. mailto:dj.dule@linux.org.yu
  45. mailto:alanr@unix.sh
