Design-High Availability - Agent Failover

RHQ High Availability - Design and Goals

RHQ 1.1.0 is scheduled to introduce RHQ High Availability (HA). The initial goal of RHQ HA is to support multiple RHQ servers configured for a single database repository. Agent load can then be partitioned amongst the available servers. Failover will occur for agents whose server goes down for any reason. So, RHQ HA will introduce scalability and fault tolerance.

The sections below describe features of RHQ High Availability that are currently planned for version 1.1.0. Although accurate at the time of writing, the content and features below are subject to change or omission.

For a quick illustration of what HA is going to accomplish, see this Flash animation: failover.swf

Server Cloud

The foundation of RHQ HA is a cloud of RHQ servers. The cloud is made up of one or more loosely coupled RHQ servers. Cloud members must:

Have a unique name.
Have a unique endpoint [non-translated address/(secure)port].
Be configured for the same database.
Have compatible RHQ versions (initially, all must be version 1.1.0).

A single-server installation is considered a 1-member HA server cloud and must therefore supply valid name and endpoint information. The RHQ installer will handle multiple scenarios: single server installation, the addition of a cloud member, and the upgrade/re-installation of an existing cloud member. The server distribution will remain a .zip file in RHQ 1.1.0; the archive will unzip into a new directory.
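
The membership rules listed above can be sketched as a small validation routine. This is only an illustration: the CloudMember type, its field names, and validate_cloud are hypothetical and are not taken from the RHQ code base. "Compatible versions" is simplified to "identical versions", matching the initial 1.1.0 requirement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloudMember:
    # Hypothetical record of what a cloud member must supply.
    name: str
    address: str
    port: int
    secure_port: int
    database_url: str
    version: str


def validate_cloud(members):
    """Return a list of rule violations; an empty list means the cloud is valid."""
    problems = []
    seen_names, seen_endpoints = set(), set()
    for m in members:
        if m.name in seen_names:
            problems.append(f"duplicate name: {m.name}")
        seen_names.add(m.name)
        # Assumption: an endpoint is the address plus both ports.
        endpoint = (m.address, m.port, m.secure_port)
        if endpoint in seen_endpoints:
            problems.append(f"duplicate endpoint: {m.address}:{m.port}")
        seen_endpoints.add(endpoint)
    if len({m.database_url for m in members}) > 1:
        problems.append("members are configured for different databases")
    if len({m.version for m in members}) > 1:
        problems.append("members run different RHQ versions")
    return problems
```

A 1-member cloud passes trivially, which matches the treatment of a single-server installation as a valid cloud.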

Load Balancing

HA will provide load balancing by partitioning agents amongst servers in the cloud. There are various factors that will contribute to the partition algorithm:

Affinity
Round robin

Affinity

An "Affinity Group" is a tag (just a unique name) created by the administrator. It can then be assigned to any number of RHQ servers and agents. For example, in an environment with several data centers it may be useful to assign each data center an affinity group (e.g. "DC1", "DC2"). Agents and servers co-located in a single data center can then set their affinity groups appropriately. The partition algorithm will then prefer to assign agents to servers within the same affinity group. Affinity is not guaranteed, however; it depends on load and availability.

A server or agent can be assigned to only one affinity group.
By default a server or agent is assigned no affinity group.
A server can be assigned an affinity group at install time or via the HA Administration Console.
An agent can be assigned an affinity group via the HA Administration Console.

Round Robin

The goal is to partition agents across servers while taking into consideration aspects such as affinity. Additionally, the partition algorithm will take into consideration failover topology. Agents will be assigned server lists (see below) in a weighted round-robin fashion.

For example, in a 3-Server cloud with no affinity, agents would be assigned server lists similar to:
A1 : S1, S2, S3
A2 : S2, S3, S1
A3 : S3, S1, S2
A4 : S1, S3, S2

If S1, S2, A1, A2, and A3 all shared an affinity group, the assigned server lists might look similar to:
A1 : S1, S2, S3
A2 : S2, S1, S3
A3 : S1, S2, S3
A4 : S3, S2, S1
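
The two examples above can be approximated with a short sketch. It is not the RHQ partition algorithm: build_server_lists and its parameters are hypothetical, weighting by compute power is ignored, and the only rule applied is "affinity servers first, then the servers that currently hold the fewest primary assignments".

```python
from collections import Counter


def build_server_lists(servers, agents, server_affinity=None, agent_affinity=None):
    """Return {agent: ordered server list}; a simplified illustration only."""
    server_affinity = server_affinity or {}
    agent_affinity = agent_affinity or {}
    primary_load = Counter()  # number of agents using each server as primary

    def by_load(tier):
        # Fewest primaries first; ties fall back to the original server order.
        return sorted(tier, key=lambda s: (primary_load[s], servers.index(s)))

    lists = {}
    for agent in agents:
        group = agent_affinity.get(agent)
        # Servers sharing the agent's affinity group always come first.
        preferred = [s for s in servers
                     if group is not None and server_affinity.get(s) == group]
        others = [s for s in servers if s not in preferred]
        ordered = by_load(preferred) + by_load(others)
        primary_load[ordered[0]] += 1
        lists[agent] = ordered
    return lists


servers = ["S1", "S2", "S3"]
agents = ["A1", "A2", "A3", "A4"]
no_affinity = build_server_lists(servers, agents)
with_affinity = build_server_lists(
    servers, agents,
    server_affinity={"S1": "DC1", "S2": "DC1"},
    agent_affinity={"A1": "DC1", "A2": "DC1", "A3": "DC1"},
)
```

With no affinity the sketch assigns the same primaries as the first example (S1, S2, S3, S1), although the order of the later entries may differ. With S1, S2, and A1 through A3 in one group, it yields the same lists as the second example.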

Compute Power

(Planned for 1.2.0)

Servers may vary in the load they can carry. If servers in a cloud vary significantly in computing power, the administrator can alter the assigned compute power to distribute load appropriately. Compute power is relative; by default, all servers in the cloud are assigned a compute power of 1. So, for example, in a 2-Server cloud where S1 is twice as powerful as S2 and you want S1 to assume twice the agent load, set ComputePower(S1)=2 and leave ComputePower(S2)=1. Compute powers are set as positive integers.
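
One way to picture the effect of compute power is a greedy assignment that always picks the server whose load, relative to its compute power, would be lowest after taking the next agent. This is a sketch under that assumption; assign_by_compute_power is a hypothetical name, and the planned RHQ implementation may weight load differently.

```python
def assign_by_compute_power(agents, compute_power):
    """Assign each agent a primary server in proportion to compute power.

    compute_power maps server name -> positive integer weight.
    """
    load = {server: 0 for server in compute_power}
    assignment = {}
    for agent in agents:
        # Relative load if this server took one more agent; ties go to the
        # server whose name sorts first.
        target = min(load, key=lambda s: ((load[s] + 1) / compute_power[s], s))
        load[target] += 1
        assignment[agent] = target
    return assignment


agents = [f"A{i}" for i in range(1, 7)]
assignment = assign_by_compute_power(agents, {"S1": 2, "S2": 1})
```

For six agents and ComputePower(S1)=2, ComputePower(S2)=1, the sketch places four agents on S1 and two on S2.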

Algorithm

A few notes on the partition algorithm.

Affinity is strong and will always be satisfied if servers with the correct affinity are available. As shown in the example above, affinity servers will always be the first entries in the server list for an agent with defined affinity. Load will be distributed amongst affinity servers while satisfying affinity. But, as also shown in the example, if no affinity servers are available the agent will fail over to an available non-affinity server.

So, although affinity can produce load imbalance across the cloud as a whole (which is possible, and potentially desirable), agents with affinity should have their load fairly well distributed amongst their affinity servers.

The repartition algorithm attempts to limit connection churn: it will attempt to maintain as many primary server assignments as possible while still balancing load.

All known servers and agents participate in the repartition. Servers that are DOWN or in MAINTENANCE_MODE are expected to return to NORMAL operation, so they are included on the assumption that their unavailability will be remedied. Long-term downtime should be handled by deleting the server and potentially reinstalling it at a later time.

Server Assignment

Perhaps the best way to understand the proposed behavior for Agent and Server assignment is to look at various use cases for how an RHQ agent determines its server. To do this, a few terms need to be defined:

Token
The Agent's "Token" is an identifier provided by the server to, and persisted by, an agent at registration time. An agent will not have a token on initial startup. The agent's token will be deleted if it is started with the --clean option.

Configured Server
The Agent's "Configured Server" is the address/port used by the agent to contact the server. It is initially defined via the setup questions and can change if the agent fails over to another server in the cloud. It can also be potentially updated via the "setconfig" or "config import" agent prompt commands. Once the agent receives a server list, the configured server is typically going to be the primary server (i.e. the first server in the list) but can fail-over to other servers in the server list.

Server List (a.k.a Failover List)
The Agent's "Server List" is provided by the server to, and persisted by, an agent. It is an ordered list of servers the agent will use for connection.

Primary Server
The Agent's "Primary Server" is the Server at the head of the Server List. This will always be the preferred server for the agent.

Registration and Connection

Agents go through a registration/connection phase when initially contacting the configured server. A successful registration will return the most recent server list to the agent. The agent will then set the configured server to the primary server and attempt to connect. If connection fails it will set the configured server to the next server in the list and try again, failing over until it succeeds or must start again from the primary. Therefore, the registration server and the connected server may not be the same server.

Agent startup logic:

If the agent:

has no assigned token

Then

removes its server list (in the unlikely case that it has one persisted)
attempts to register with the configured server
attempts to connect to servers in the returned server list, starting with the primary

if the agent name is not known to the server a new token is generated, otherwise the existing token is supplied to the agent.
if the configured server can not be contacted the agent will not be able to connect to an RHQ server and must be reconfigured or wait for the configured server to come online.

If the agent:

has a token
has no server list

Then

attempts to register with the configured server and the existing token
attempts to connect to servers in the returned server list, starting with the primary

if the configured server can not be contacted the agent will not be able to connect to an RHQ server and must be reconfigured or wait for the configured server to come online.

If the agent:

has a token
has a server list

Then

attempts to register with the configured server and the existing token
when registered, attempts to connect to servers in the returned server list, starting with the primary

the configured server will be the last connected server from the server list, unless the configuration has been updated by a user
if the configured server can not be contacted the agent will use the server list, starting with the primary, to find a server with which it can register.
When a server list is exhausted it will be reprocessed, from the head of the list, after some (configurable) delay.
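
The three startup cases above can be condensed into a sketch that decides which token to present and which servers to try for registration, in order. The function name and return shape are hypothetical. Skipping the configured server when it reappears in the server list is an assumption made here to avoid retrying a server that has just failed.

```python
def registration_plan(token, server_list, configured_server):
    """Return (token_to_present, servers_to_try_for_registration).

    token is None when the agent has never registered or was started
    with --clean; server_list may be empty or None.
    """
    if token is None:
        # Case 1: no token. Any persisted server list is discarded and
        # only the configured server can be used for registration.
        return None, [configured_server]
    if not server_list:
        # Case 2: token but no server list. Again only the configured
        # server is available.
        return token, [configured_server]
    # Case 3: token and server list. Try the configured server first, then
    # fall back to the server list, starting with the primary.
    fallback = [s for s in server_list if s != configured_server]
    return token, [configured_server] + fallback


plan = registration_plan("tok-1", ["S1", "S2", "S3"], "S2")
```

In the first two cases the agent has no fallback: if the configured server cannot be contacted, it must wait or be reconfigured, as described above.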

Failover

After successful startup the agent will be connected to a server in its server list, typically the primary server. If the agent loses its connection to its server it will perform some logic to ensure the connection loss was not just temporary (e.g. a network blip). If reconnection does not succeed, the agent will attempt to fail over to a different server, starting with the head of its server list, until a connection is made or until the server list is exhausted. When a server list is exhausted it will be reprocessed, from the head of the list, after some (configurable) delay.

Upon connection to a new server the agent will scale its workload incrementally. This is to prevent overwhelming a particular server after a large-scale failover. For example, in a 2-Server cloud, if one server goes down all agents will fail over to the remaining server.

Messages from agent-to-server that were marked for reliable delivery will be sent to the new server once a connection is established.
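
The failover loop described in this section can be sketched as follows. All names are hypothetical: try_connect stands in for whatever connection attempt the agent makes, retry_delay for the configurable delay, and max_passes exists only so that the sketch can terminate in a test. The temporary-loss check and the incremental workload scaling are not modelled.

```python
import time


def fail_over(server_list, try_connect, retry_delay=30.0,
              max_passes=None, sleep=time.sleep):
    """Walk the server list from its head until a connection succeeds.

    Returns the server that accepted the connection, or None if
    max_passes full passes over the list all failed.
    """
    passes = 0
    while max_passes is None or passes < max_passes:
        for server in server_list:
            if try_connect(server):
                return server
        # The list is exhausted: wait, then reprocess it from the head.
        passes += 1
        if max_passes is None or passes < max_passes:
            sleep(retry_delay)
    return None


# Usage: S1 and S2 are unreachable, S3 accepts the connection.
reachable = {"S3"}
connected = fail_over(["S1", "S2", "S3"], lambda s: s in reachable)
```

Passing sleep as a parameter keeps the delay configurable and lets the loop be exercised without actually waiting.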

Cloud Repartition

RHQ HA will, in certain circumstances, repartition the agents amongst the cloud servers. This will result in new server lists being generated for all known agents. It is important to note that the repartition algorithm will seek to limit connection churn, and as such will re-assign the minimal number of agents to new primary servers to accomplish the re-balancing.

A repartition does not push new server lists to connected agents. This prevents large-scale failover in large environments, which could spike a server with connection processing. Instead, agents will intermittently check for updated server lists and, if necessary, reconnect to their new primary assignments. This disperses the connection load.

Redistribution can occur for the following reasons:

a new server is added (installed) to the cloud.
an existing server is deleted from the cloud (must already be down or in maintenance mode).
an admin request via HA Administration Console
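
To illustrate what limiting connection churn can mean, the sketch below rebalances primary assignments by moving only the agents that exceed a server's fair share, plus any agents whose primary has been deleted. It is an assumption-laden illustration, not the RHQ repartition algorithm: rebalance_primaries is a hypothetical name, and affinity and compute power are ignored.

```python
def rebalance_primaries(current, servers):
    """Return new {agent: primary}, keeping existing primaries where possible.

    current maps agent -> current primary; servers lists the servers that
    take part in the repartition.
    """
    by_server = {s: [] for s in servers}
    unplaced = []  # agents that must be given a new primary
    for agent, server in current.items():
        if server in by_server:
            by_server[server].append(agent)
        else:
            unplaced.append(agent)  # primary was deleted from the cloud

    # Fair share per server; leftover slots go to the servers that already
    # hold the most agents, so fewer agents have to move.
    base, extra = divmod(len(current), len(servers))
    ranked = sorted(servers, key=lambda s: -len(by_server[s]))
    quota = {s: base + (1 if i < extra else 0) for i, s in enumerate(ranked)}

    result = {}
    for s in servers:
        for agent in by_server[s][:quota[s]]:
            result[agent] = s  # keeps its primary
        unplaced.extend(by_server[s][quota[s]:])  # surplus must move
    for s in servers:
        deficit = quota[s] - min(len(by_server[s]), quota[s])
        for _ in range(deficit):
            result[unplaced.pop()] = s
    return result


# Usage: S3 has just been added to a cloud where S1 carries most agents.
before = {"A1": "S1", "A2": "S1", "A3": "S1", "A4": "S1", "A5": "S2", "A6": "S2"}
after = rebalance_primaries(before, ["S1", "S2", "S3"])
```

In this usage only two of the six agents receive a new primary; the other four keep their existing assignments.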

Agent Behavior

New agents will be assigned a server list such that load balancing and affinity are satisfied in the same ways as if the agent had been registered during the last cloud repartition.

Connected agents check for updated server lists at (configurable) scheduled intervals. At that time if an agent is not connected to its primary server it will attempt to connect to the primary. In this fashion all agents seek to run on their assigned primary server. All agents being connected to their primary servers guarantees the best load balancing and affinity satisfaction.

Server Operation Modes

There are four server operation modes: INSTALLED, NORMAL, DOWN, MAINTENANCE.

The valid transitions are as follows:

Transition	Description
INSTALLED -> NORMAL	In general a server will move quickly from INSTALLED to NORMAL during a standard installation. This results in a cloud partition event.
NORMAL -> DOWN	A graceful shutdown. If a server crashes it will be set to DOWN by another, non-DOWN, cloud member after a period of unresponsiveness
NORMAL -> MAINTENANCE	The admin can force this transition from the HA Admin Console
DOWN -> MAINTENANCE	The admin can force this transition from the HA Admin Console
DOWN -> NORMAL	Normal startup transition
MAINTENANCE -> NORMAL	The admin can force this transition from the HA Admin Console

Note that a server started up in NORMAL or MAINTENANCE mode will maintain that mode.
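
The transition table above maps naturally onto a small lookup structure. The sketch below encodes exactly the six transitions listed; the Mode enum and can_transition are hypothetical names used for illustration.

```python
from enum import Enum


class Mode(Enum):
    INSTALLED = "INSTALLED"
    NORMAL = "NORMAL"
    DOWN = "DOWN"
    MAINTENANCE = "MAINTENANCE"


# Each key lists the modes it may move to, per the table above.
VALID_TRANSITIONS = {
    Mode.INSTALLED: {Mode.NORMAL},
    Mode.NORMAL: {Mode.DOWN, Mode.MAINTENANCE},
    Mode.DOWN: {Mode.MAINTENANCE, Mode.NORMAL},
    Mode.MAINTENANCE: {Mode.NORMAL},
}


def can_transition(current, target):
    """Return True if the table allows moving from current to target."""
    return target in VALID_TRANSITIONS[current]
```

Under this encoding, for example, a server in MAINTENANCE cannot move directly to DOWN, and nothing returns to INSTALLED.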

Server Maintenance Mode

An HA Server can be taken out of the cloud for maintenance reasons without actually being shut down. This is done via the HA Administration Console and effectively shuts down all agent communication with the server, although the server remains up and the RHQ GUI remains usable. Agents will treat this as a downed server and will apply reconnect and failover logic as needed.

A server taken down in maintenance mode comes up in maintenance mode.

HA Administration Console

The RHQ GUI will offer an HA Administration Console (HAAC), available to RHQ users with management permissions. It will be accessed via the Administration Page in the existing GUI. The Administration Console will offer the following features:

List of HA Servers
1. All configured details
2. Active agent count
3. Ability to change Operating Mode
HA Server Detail
1. Limited editing (Endpoint address, port, secure port, affinity group)
2. Agent list
List of Known Agents
1. Assign affinity group
Agent Detail
1. Server List
2. Primary Server
HA Options
1. Forced Redistribution

GUI

Note that it doesn't matter which server you connect to in the Server Cloud to use the RHQ GUI; the viewable resources and available options will be identical regardless of which you choose.

RHQ Agent

Commands

The RHQ Agent will have new commands introduced with HA:

Server List View
View the current server list for the agent. This is the failover list the agent will consult when it needs to switch to a new server.
- failover --list
Server List Regenerate
Request a new server list be generated for the agent if for some reason the current list is stale. This will also ensure the agent is talking to its primary server (if it is not, the agent will attempt to switch to it).
- failover --check
Server Switch
Request the agent (temporarily) switch to another server immediately. This is temporary because the primary server switchover thread is still running and, when it wakes up, it will switch the agent back to the primary. The only way this prompt command will switch over permanently is if you configure that switchover thread to not run (by setting its period to 0).
- failover --switch=<server>

Note that the sender status command will now tell you what the configured server currently is (so you can know what server the agent is talking to, or attempting to talk to).

Upgrade

The next version of RHQ plans to have automated agent updates. The RHQ agent will need to run the same version as the RHQ server (this is the "Prime Directive"), and will need to be re-installed with the new version. Ease of installation and agent backward compatibility are high priority goals for future versions.

Future

The following features are currently not in scope for RHQ 1.1.0 but are in plan for subsequent releases of RHQ High Availability.

Load Balancing (Future)

Relative server power (number of cores)
Agent load

Relative Server Power

If the server cloud is made up of servers with unequal compute power it makes sense to assign more agent load to the servers with more compute power.

Agent Load

RHQ agents can vary significantly in the load they put on a server based on number of inventoried resources, measurement collection (schedule) frequency, and other factors. HA will base agent assignment not on number of agents but on relative agent load.

Database Failure Handling (Future)

On database failure all RHQ servers configured for that database will, on a best effort of detection, be moved to Maintenance Mode. When the database is restored, servers that are still operating can be reset to Normal operating mode via the HAAC in the GUI.

Failover (Future)

Initially a server will have no hard limit on how much agent load can be assigned. A potential future enhancement is the ability to define various limits for server load which, when enforced, will deny agent connection requests.

Redistribution

Redistribution can occur for the following reasons:

in response to unbalanced agent load.

RHQ will periodically review the server-agent topology and decide whether redistribution is necessary. If so, RHQ will re-balance agent load across available servers.

RHQ Agent (Future)

Maintenance Mode
Put the agent into Maintenance Mode. This will suppress failover in situations where it is known that the primary may be down temporarily.

Testing

We have documented some of the HA testing we have performed.

JBoss Community Archive (Read Only)

RHQ 4.9
