Chapter 39. High Availability and Failover

We define high availability as the ability for the system to continue functioning after failure of one or more of the servers.

A part of high availability is failover which we define as the ability for client connections to migrate from one server to another in event of server failure so client applications can continue to operate.

39.1. Live - Backup Groups

HornetQ allows servers to be linked together as live - backup groups where each live server can have 1 or more backup servers. A backup server is owned by only one live server. Backup servers are not operational until failover occurs, however 1 chosen backup, which will be in passive mode, announces its status and waits to take over the live servers work

Before failover, only the live server is serving the HornetQ clients while the backup servers remain passive or awaiting to become a backup server. When a live server crashes or is brought down in the correct mode, the backup server currently in passive mode will become live and another backup server will become passive. If a live server restarts after a failover then it will have priority and be the next server to become live when the current live server goes down, if the current live server is configured to allow automatic failback then it will detect the live server coming back up and automatically stop.

39.1.1. HA modes

HornetQ supports two different strategies for backing up a server shared store and replication.

Note

Only persistent message data will survive failover. Any non persistent message data will not be available after failover.

39.1.2. Data Replication

Replication is supported since version 2.3.

When using replication, the live and the backup servers do not share the same data directories, all data synchronization is done through network traffic. Therefore all (persistent) data traffic received by the live server will be duplicated to the backup.

Notice that upon start-up the backup server will first need to synchronize all existing data from the live server, before becoming capable of replacing the live server should it fail. So unlike the shared store case, a replicating backup will not be a fully operational backup right after start, but only after it finishes synchronizing the data. The time it will take for this to happen will depend on the amount of data to be synchronized and the connection speed.

Note

Synchronization occurs in parallel with current network traffic so this won't cause any blocking on current clients.

Replication will create a copy of the data at the backup. One issue to be aware of is: in case of a successful fail-over, the backup's data will be newer than the one at the live's storage. If you configure your live server to perform a Section 39.1.4, “Failing Back to live Server” when restarted, it will synchronize its data with the backup's. If both servers are shutdown, the administrator will have to determine which one has the lastest data.

The replicating live and backup pair must be part of a cluster. The Cluster Connection also defines how backup servers will find the remote live servers to pair with. Refer to Chapter 38, Clusters for details on how this is done, and how to configure a cluster connection. Notice that:

Both live and backup servers must be part of the same cluster. Notice that even a simple live/backup replicating pair will require a cluster configuration.
their cluster user and password must match

Within a cluster, there are two ways that a backup server will locate a live server to replicate from, these are:

specifying a node group. You can specify a group of live servers that a backup server can connect to. This is done by configuring backup-group-name in the main hornetq-configuration.xml. A Backup server will only connect to a live server that shares the same node group name
connecting to any live. Simply put not configuring backup-group-name will allow a backup server to connect to any live server

Note

A backup-group-name example: suppose you have 5 live servers and 6 backup servers:

live1, live2, live3: with backup-group-name=fish
live4, live5: with backup-group-name=bird
backup1, backup2, backup3, backup4: with backup-group-name=fish
backup5, backup6: with backup-group-name=bird

After joining the cluster the backups with backup-group-name=fish will search for live servers with backup-group-name=fish to pair with. Since there is one backup too many, the fish will remain with one spare backup.

The 2 backups with backup-group-name=bird (backup5 and backup6) will pair with live servers live4 and live5.

The backup will search for any live server that it is configured to connect to. It then tries to replicate with each live server in turn until it finds a live server that has no current backup configured. If no live server is available it will wait until the cluster topology changes and repeats the process.

Note

This is an important distinction from a shared-store backup, as in that case if the backup starts and does not find its live server, the server will just activate and start to serve client requests. In the replication case, the backup just keeps waiting for a live server to pair with. Notice that in replication the backup server does not know whether any data it might have is up to date, so it really cannot decide to activate automatically. To activate a replicating backup server using the data it has, the administrator must change its configuration to make a live server of it, that change backup=true to backup=false.

Much like in the shared-store case, when the live server stops or crashes, its replicating backup will become active and take over its duties. Specifically, the backup will become active when it loses connection to its live server. This can be problematic because this can also happen because of a temporary network problem. In order to address this issue, the backup will try to determine whether it still can connect to the other servers in the cluster. If it can connect to more than half the servers, it will become active, if more than half the servers also disappeared with the live, the backup will wait and try reconnecting with the live. This avoids a split brain situation.

39.1.2.1. Configuration

To configure the live and backup servers to be a replicating pair, configure both servers' hornetq-configuration.xml to have:

<!-- FOR BOTH LIVE AND BACKUP SERVERS' -->
<shared-store>false</shared-store>
.
.
<cluster-connections>
   <cluster-connection name="my-cluster">
      ...
   </cluster-connection>
</cluster-connections>

The backup server must also be configured as a backup.

<backup>true</backup>

39.1.3. Shared Store

When using a shared store, both live and backup servers share the same entire data directory using a shared file system. This means the paging directory, journal directory, large messages and binding journal.

When failover occurs and a backup server takes over, it will load the persistent storage from the shared file system and clients can connect to it.

This style of high availability differs from data replication in that it requires a shared file system which is accessible by both the live and backup nodes. Typically this will be some kind of high performance Storage Area Network (SAN). We do not recommend you use Network Attached Storage (NAS), e.g. NFS mounts to store any shared journal (NFS is slow).

The advantage of shared-store high availability is that no replication occurs between the live and backup nodes, this means it does not suffer any performance penalties due to the overhead of replication during normal operation.

The disadvantage of shared store replication is that it requires a shared file system, and when the backup server activates it needs to load the journal from the shared store which can take some time depending on the amount of data in the store.

If you require the highest performance during normal operation, have access to a fast SAN, and can live with a slightly slower failover (depending on amount of data), we recommend shared store high availability

39.1.3.1. Configuration

To configure the live and backup servers to share their store, configure all hornetq-configuration.xml:

<shared-store>true</shared-store>

Additionally, each backup server must be flagged explicitly as a backup:

<backup>true</backup>

In order for live - backup groups to operate properly with a shared store, both servers must have configured the location of journal directory to point to the same shared location (as explained in Section 15.3, “Configuring the message journal”)

Note

todo write something about GFS

Also each node, live and backups, will need to have a cluster connection defined even if not part of a cluster. The Cluster Connection info defines how backup servers announce there presence to its live server or any other nodes in the cluster. Refer to Chapter 38, Clusters for details on how this is done.

39.1.4. Failing Back to live Server

After a live server has failed and a backup taken has taken over its duties, you may want to restart the live server and have clients fail back. In case of "shared disk", simply restart the original live server and kill the new live server. You can do this by killing the process itself or just waiting for the server to crash naturally. In case of a replicating live server replaced by a remote backup you will need to also set check-for-live-server.

It is also possible to cause failover to occur on normal server shutdown, to enable this set the following property to true in the hornetq-configuration.xml configuration file like so:

<failover-on-shutdown>true</failover-on-shutdown>

By default this is set to false, if by some chance you have set this to false but still want to stop the server normally and cause failover then you can do this by using the management API as explained at Section 30.1.1.1, “Core Server Management”

You can also force the running live server to shutdown when the old live server comes back up allowing the original live server to take over automatically by setting the following property in the hornetq-configuration.xml configuration file as follows:

<allow-failback>true</allow-failback>

In replication HA mode you need to set an extra property check-for-live-server to true. If set to true, during start-up a live server will first search the cluster for another server using its nodeID. If it finds one, it will contact this server and try to "fail-back". Since this is a remote replication scenario, the "starting live" will have to synchronize its data with the server running with its ID, once they are in sync, it will request the other server (which it assumes it is a back that has assumed its duties) to shutdown for it to take over. This is necessary because otherwise the live server has no means to know whether there was a fail-over or not, and if there was if the server that took its duties is still running or not. To configure this option at your hornetq-configuration.xml configuration file as follows:

<check-for-live-server>true</check-for-live-server>

39.2. Failover Modes

HornetQ defines two types of client failover:

Automatic client failover
Application-level client failover

HornetQ also provides 100% transparent automatic reattachment of connections to the same server (e.g. in case of transient network problems). This is similar to failover, except it is reconnecting to the same server and is discussed in Chapter 34, Client Reconnection and Session Reattachment

During failover, if the client has consumers on any non persistent or temporary queues, those queues will be automatically recreated during failover on the backup node, since the backup node will not have any knowledge of non persistent queues.

39.2.1. Automatic Client Failover

HornetQ clients can be configured to receive knowledge of all live and backup servers, so that in event of connection failure at the client - live server connection, the client will detect this and reconnect to the backup server. The backup server will then automatically recreate any sessions and consumers that existed on each connection before failover, thus saving the user from having to hand-code manual reconnection logic.

HornetQ clients detect connection failure when it has not received packets from the server within the time given by client-failure-check-period as explained in section Chapter 17, Detecting Dead Connections. If the client does not receive data in good time, it will assume the connection has failed and attempt failover. Also if the socket is closed by the OS, usually if the server process is killed rather than the machine itself crashing, then the client will failover straight away.

HornetQ clients can be configured to discover the list of live-backup server groups in a number of different ways. They can be configured explicitly or probably the most common way of doing this is to use server discovery for the client to automatically discover the list. For full details on how to configure server discovery, please see Chapter 38, Clusters. Alternatively, the clients can explicitly connect to a specific server and download the current servers and backups see Chapter 38, Clusters.

To enable automatic client failover, the client must be configured to allow non-zero reconnection attempts (as explained in Chapter 34, Client Reconnection and Session Reattachment).

By default failover will only occur after at least one connection has been made to the live server. In other words, by default, failover will not occur if the client fails to make an initial connection to the live server - in this case it will simply retry connecting to the live server according to the reconnect-attempts property and fail after this number of attempts.

2 phase commit

The caveat to this rule is when XA is used either via JMS or through the core API. If 2 phase commit is used and prepare has already been called then rolling back could cause a HeuristicMixedException. Because of this the commit will throw a XAException.XA_RETRY exception. This informs the Transaction Manager that it should retry the commit at some later point in time, a side effect of this is that any non persistent messages will be lost. To avoid this use persistent messages when using XA. With acknowledgements this is not an issue since they are flushed to the server before prepare gets called.

It is up to the user to catch the exception, and perform any client side local rollback code as necessary. There is no need to manually rollback the session - it is already rolled back. The user can then just retry the transactional operations again on the same session.

HornetQ ships with a fully functioning example demonstrating how to do this, please see Section 11.1.73, “Transaction Failover”

If failover occurs when a commit call is being executed, the server, as previously described, will unblock the call to prevent a hang, since no response will come back. In this case it is not easy for the client to determine whether the transaction commit was actually processed on the live server before failure occurred.

Note

If XA is being used either via JMS or through the core API then an XAException.XA_RETRY is thrown. This is to inform Transaction Managers that a retry should occur at some point. At some later point in time the Transaction Manager will retry the commit. If the original commit has not occurred then it will still exist and be committed, if it does not exist then it is assumed to have been committed although the transaction manager may log a warning.

To remedy this, the client can simply enable duplicate detection (Chapter 37, Duplicate Message Detection) in the transaction, and retry the transaction operations again after the call is unblocked. If the transaction had indeed been committed on the live server successfully before failover, then when the transaction is retried, duplicate detection will ensure that any durable messages resent in the transaction will be ignored on the server to prevent them getting sent more than once.

Note

By catching the rollback exceptions and retrying, catching unblocked calls and enabling duplicate detection, once and only once delivery guarantees for messages can be provided in the case of failure, guaranteeing 100% no loss or duplication of messages.

39.2.1.5. Handling Failover With Non Transactional Sessions

If the session is non transactional, messages or acknowledgements can be lost in the event of failover.

If you wish to provide once and only once delivery guarantees for non transacted sessions too, enabled duplicate detection, and catch unblock exceptions as described in Section 39.2.1.3, “Handling Blocking Calls During Failover”

39.2.2. Getting Notified of Connection Failure

JMS provides a standard mechanism for getting notified asynchronously of connection failure: java.jms.ExceptionListener. Please consult the JMS javadoc or any good JMS tutorial for more information on how to use this.

The HornetQ core API also provides a similar feature in the form of the class org.hornet.core.client.SessionFailureListener

Any ExceptionListener or SessionFailureListener instance will always be called by HornetQ on event of connection failure, irrespective of whether the connection was successfully failed over, reconnected or reattached, however you can find out if reconnect or reattach has happened by either the failedOver flag passed in on the connectionFailed on SessionfailureListener or by inspecting the error code on the javax.jms.JMSException which will be one of the following:

Table 39.1. JMSException error codes

error code	Description
FAILOVER	Failover has occurred and we have successfully reattached or reconnected.
DISCONNECT	No failover has occurred and we are disconnected.

39.2.3. Application-Level Failover

In some cases you may not want automatic client failover, and prefer to handle any connection failure yourself, and code your own manually reconnection logic in your own failure handler. We define this as application-level failover, since the failover is handled at the user application level.

To implement application-level failover, if you're using JMS then you need to set an ExceptionListener class on the JMS connection. The ExceptionListener will be called by HornetQ in the event that connection failure is detected. In your ExceptionListener, you would close your old JMS connections, potentially look up new connection factory instances from JNDI and creating new connections. In this case you may well be using HA-JNDI to ensure that the new connection factory is looked up from a different server.

For a working example of application-level failover, please see Section 11.1.2, “Application-Layer Failover”.

If you are using the core API, then the procedure is very similar: you would set a FailureListener on the core ClientSession instances.