This document discusses deploying and undeploying storage nodes. Like the RHQ relational database, at least one storage node must be installed and running before the RHQ server can be installed. Once the server is installed, additional storage nodes can be deployed.
An RHQ Storage Node has to be installed and running prior to installing the RHQ Server. If you run rhqctl install or rhqctl upgrade on a machine that does not already have a storage node, and you do not specify a remote storage node, a new one will be deployed. The storage node will be installed on disk, configured, and started.
Shared cluster settings will be initialized when the first storage node is imported into inventory.
Storage nodes are automatically imported into inventory.
If you deploy additional storage nodes prior to server installation, you must ensure that the shared cluster settings are the same for each node. If the settings do not match for a subsequent node, an exception is thrown when the agent reports it to the server; consequently, that node is not imported into inventory.
TODO - Discuss shared, cluster settings
The deployment process consists of several steps or phases. The names of the phases correspond to a storage node's operation modes and are as follows:
INSTALL
ANNOUNCE
BOOTSTRAP
ADD_MAINTENANCE
The API for deployment is provided by
StorageNodeManagerBean.deployStorageNode(Subject subject, StorageNode storageNode)
deployStorageNode is exposed in both the local and remote APIs. It determines which phase to start the deployment in based on the storage node's mode. Deployment is not finished until each phase completes successfully, and the process is aborted if an error occurs. Each phase should be idempotent, which makes it possible to retry or resume a deployment. The following sections describe each phase in greater detail.
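Because each phase is idempotent, resuming a deployment amounts to re-running the phase the node last reached and then every phase after it. A minimal sketch of that resume logic (the class, method, and enum here are illustrative, not the actual StorageNodeManagerBean code):

```java
import java.util.Arrays;
import java.util.List;

public class DeploymentDriver {

    // The deployment phases, in the order they run.
    public enum Phase { INSTALL, ANNOUNCE, BOOTSTRAP, ADD_MAINTENANCE }

    /**
     * Phases that still need to run, given the phase the node last reached.
     * Each phase is idempotent, so a resumed deployment safely re-runs the
     * current phase and then everything after it.
     */
    public static List<Phase> remainingPhases(Phase current) {
        Phase[] all = Phase.values();
        return Arrays.asList(all).subList(current.ordinal(), all.length);
    }
}
```

A deployment that failed during BOOTSTRAP, for example, would resume with the BOOTSTRAP and ADD_MAINTENANCE phases only.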
The storage node is imported into inventory
The StorageNode entity is created
Its mode is set to INSTALLED
Its status is INSTALLED
The mode is set to ANNOUNCE
The status changes to JOINING
Note that deployment happens automatically. A property will be added to the cluster settings to enable/disable automatic deployment.
It might seem unnecessary to first set the mode to INSTALLED since it is subsequently changed to ANNOUNCE. The INSTALL phase happens as part of the node being imported into inventory, and it is possible for multiple nodes to be imported into inventory simultaneously. Deployment, however, should be serialized. This is why the mode is updated twice.
The mode is set to ANNOUNCE.
The node's status is JOINING.
The announce resource operation is scheduled to run on each cluster node.
This operation updates the internode authentication configuration file so that existing cluster nodes will accept gossip from the new node.
If the operation fails
Set the failedOperation property of the corresponding storage node
Set the errorMessage property of the new node
The node's status changes to DOWN
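This failure bookkeeping, which repeats in the later phases, can be sketched as follows (the Node class and its fields are hypothetical stand-ins for the StorageNode entity's API, not the real entity):

```java
public class FailureHandling {

    public enum Status { INSTALLED, JOINING, NORMAL, LEAVING, DOWN }

    /** Minimal stand-in for the StorageNode entity's error-tracking fields. */
    public static class Node {
        String errorMessage;    // why deployment stalled (set on the node being deployed)
        String failedOperation; // which resource operation failed (set on the node it ran against)
        Status status = Status.JOINING;
    }

    /**
     * Record a failed resource operation that occurred while deploying newNode.
     * The failure is pinned to the node the operation ran against, while the
     * error message and DOWN status surface on the node being deployed.
     */
    public static void operationFailed(Node newNode, Node operationTarget,
                                       String operation, String error) {
        operationTarget.failedOperation = operation; // pinpoints where the failure occurred
        newNode.errorMessage = error;                // surfaced on the new node
        newNode.status = Status.DOWN;                // deployment is stalled until retried
    }
}
```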
Schedule the prepareForBootstrap operation on the new node. This operation performs a few tasks.
Shut down the storage node
Purge the data directories
Update cluster settings in cassandra.yaml
Update internode auth conf settings
Restart the node
If the operation fails,
Set the errorMessage property of the new node
Set the failedOperation property of the new node
The node's status changes to DOWN
The node is reported up (by the cluster)
Set the mode to ADD_MAINTENANCE
The status is still JOINING
When the node starts up, it goes through the bootstrap process, in which it streams data from other nodes. When the bootstrap process finishes, the node starts serving client requests. At this point, the new node is fully operational and part of the cluster as far as Cassandra is concerned. For our purposes, though, the deployment process is not yet complete.
Apply schema updates if necessary
Schedule the addNodeMaintenance operation on each node (including the new node)
If the operation fails
Set the failedOperation property of the corresponding storage node
Set the errorMessage property of the new node
The node's status changes to DOWN
The operation completes successfully on all nodes
Set the mode to NORMAL
The status changes to NORMAL
TODO - Add docs on addNodeMaintenance operation and schema changes
Undeployment is the process of removing the node from the cluster and completely uninstalling it. Like deployment, it consists of several phases, where each phase corresponds to the name of a storage node's operation mode. The phases are:
DECOMMISSION
REMOVE_MAINTENANCE
UNANNOUNCE
UNINSTALL
The API for undeployment is provided by
StorageNodeManagerBean.undeployStorageNode(Subject subject, StorageNode storageNode)
undeployStorageNode is exposed in both the local and remote APIs. It determines which phase to start the undeployment in based on the storage node's mode. Undeployment is not finished until each phase completes successfully, and the process is aborted if an error occurs. Each phase should be idempotent, which makes it possible to retry or resume an undeployment. The following sections describe each phase in greater detail.
Set the mode to DECOMMISSION
Note that this assumes the previous mode was NORMAL
The status changes to LEAVING
Apply schema updates if necessary
Schedule the decommission resource operation on the node
If the operation fails
Set the errorMessage property of the node
Set the failedOperation property of the node
The node's status changes to DOWN
The cluster reports that the node has been removed (from the cluster)
Set the mode to REMOVE_MAINTENANCE
The status is LEAVING
Note that the decommission operation performs the unbootstrap process. The node stops serving client requests and starts streaming the data it owns to other nodes. When the process finishes, the node is no longer part of the cluster as far as Cassandra is concerned. For our purposes though, undeployment is not yet complete.
Run the removeNodeMaintenance resource operation on each node (excluding the node being undeployed)
If the operation fails
Set the errorMessage property of the node
Set the failedOperation property of the node
The status of the node being undeployed changes to DOWN
The operation completes successfully on all cluster nodes
Set the mode to UNANNOUNCE
The status is LEAVING
Run the unannounce resource operation on each node (excluding the node being undeployed)
This removes the undeployed node from the internode authentication conf file
If the operation fails
Set the errorMessage property of the node
Set the failedOperation property of the node
The status of the node being undeployed changes to DOWN
The operation completes successfully on all cluster nodes
Set the mode to UNINSTALL
The status is LEAVING
Run the uninstall operation against the node being undeployed.
The operation will
Shut down the node if it is running
Delete all of its files from disk
If the operation fails
Set the errorMessage property of the node
Set the failedOperation property of the node
The status of the node being undeployed changes to DOWN
Uninventory the storage node resource
Delete the storage node entity
Undeployment has completed successfully
The previous sections assume that the node's mode is NORMAL when undeployment begins. We need to handle undeploying a node in other modes. For example, deploying a storage node may have failed, leaving the node in a mode other than NORMAL. The user might decide that it is simply easier and/or faster to deploy a node on a different machine instead of trying to resolve the issues with the failed deployment. The table below details the starting undeployment phase for a given mode.
storage node mode | starting undeployment phase
INSTALLED | UNINSTALL
ANNOUNCE | UNANNOUNCE
BOOTSTRAP | UNANNOUNCE
ADD_MAINTENANCE | DECOMMISSION
NORMAL | DECOMMISSION
DECOMMISSION | DECOMMISSION
REMOVE_MAINTENANCE | REMOVE_MAINTENANCE
UNANNOUNCE | UNANNOUNCE
UNINSTALL | UNINSTALL
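The table can be expressed directly as a lookup; a sketch (the enum and method names are illustrative, not the production code):

```java
public class UndeployStart {

    public enum Mode { INSTALLED, ANNOUNCE, BOOTSTRAP, ADD_MAINTENANCE, NORMAL,
                       DECOMMISSION, REMOVE_MAINTENANCE, UNANNOUNCE, UNINSTALL }

    /** Starting undeployment phase for a node in the given mode (see table above). */
    public static Mode startingUndeployPhase(Mode mode) {
        switch (mode) {
            case INSTALLED:          return Mode.UNINSTALL;    // never announced; just remove files
            case ANNOUNCE:
            case BOOTSTRAP:          return Mode.UNANNOUNCE;   // announced, but never joined the ring
            case ADD_MAINTENANCE:
            case NORMAL:
            case DECOMMISSION:       return Mode.DECOMMISSION; // joined; must stream its data off first
            case REMOVE_MAINTENANCE: return Mode.REMOVE_MAINTENANCE;
            case UNANNOUNCE:         return Mode.UNANNOUNCE;
            case UNINSTALL:          return Mode.UNINSTALL;
            default: throw new IllegalArgumentException("Unexpected mode: " + mode);
        }
    }
}
```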
While the three (availability, operation mode, and status) are closely related, there are some subtle distinctions that need to be explained.
Availability is a familiar concept within RHQ. It is a special measurement collected by the agent to determine whether a resource is up or down. A storage node is linked to a resource, so it undergoes availability checks like every other resource. The RHQ Storage plugin performs a simple check, verifying that it can make a JMX connection to the storage node. JMX is used only for management; it is not used for internode communication or for client requests. The next table describes what availability does and does not mean for a storage node.
Availability Type | Meaning
UP | The storage node is running and the agent can perform management operations via JMX. This does not necessarily mean that the node is serving client requests or that it is participating in internode communication.
DOWN | The agent cannot make a JMX connection to the storage node. Unless the JMX URL in the connection settings is wrong, this indicates that the storage node is not running.
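The availability check amounts to attempting a JMX connection. A simplified standalone sketch, assuming the standard RMI-based JMX service URL format (the actual plugin builds the URL from the resource's connection settings):

```java
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class AvailabilityCheck {

    /** UP if a JMX connection can be opened, DOWN otherwise. */
    public static boolean isUp(String host, int jmxPort) {
        try {
            // Standard RMI-based JMX URL; an assumption for this sketch.
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":" + jmxPort + "/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                connector.getMBeanServerConnection(); // reachable and responding
                return true;
            }
        } catch (Exception e) {
            return false; // cannot connect: report DOWN
        }
    }
}
```

Note that success here only proves the management interface is reachable; as the table above says, it implies nothing about client requests or gossip.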
The mode and status differ from availability in that they report on the node's state relative to the rest of the cluster. The status provides more general information, while the mode is more specific; multiple modes map to the same status. The following table lists the possible combinations of mode and status. The errorMessage and failedOperation properties are included as well, since they too are used to determine the status.
Mode | errorMessage and failedOperation empty? | Status | Description
UP | yes | UP | The node is serving client requests and participating in gossip (i.e., internode communication).
UP | no | UP | A resource operation against this node failed during the deployment of another node. In this scenario the failedOperation property is set on the existing node to show precisely where the failure occurred. The node is, however, still actively participating in cluster operations.
DOWN | yes | DOWN | The node is part of the cluster but not actively participating. The node may actually be shut down, or it may be running but unable to reach other cluster nodes due to a network partition.
ANNOUNCE | yes | JOINING | The node is being deployed.
ANNOUNCE | no | DOWN | Deploying the node failed. It is installed (i.e., linked to a resource in inventory) but not part of the cluster.
BOOTSTRAP | yes | JOINING | The node is being deployed.
BOOTSTRAP | no | DOWN | Deploying the node failed. It is not yet part of the cluster.
ADD_MAINTENANCE | yes | JOINING | The node is being deployed. At this point the node will actually service client requests, but maintenance tasks still need to be performed to ensure that the cluster is in a consistent state.
ADD_MAINTENANCE | no | DOWN | Deployment failed. At this point the node will actually service client requests, but maintenance tasks still need to be performed to ensure that the cluster is in a consistent state.
DECOMMISSION | yes | LEAVING | The node is being undeployed.
DECOMMISSION | no | DOWN | Undeploying the node failed. It is possible that it is still serving client requests and participating in gossip.
REMOVE_MAINTENANCE | yes | LEAVING | The node is being undeployed. It is no longer a participating member of the cluster, but it is still accepting JMX connections.
REMOVE_MAINTENANCE | no | DOWN | Undeployment failed. The node is no longer a participating member of the cluster.
UNANNOUNCE | yes | LEAVING | The node is being undeployed and is no longer a participating member of the cluster.
UNANNOUNCE | no | DOWN | Undeployment failed and the node is no longer a participating member of the cluster.
UNINSTALL | yes | LEAVING | The node is being undeployed.
UNINSTALL | no | LEAVING | Undeployment failed and the node is no longer a participating member of the cluster.
MAINTENANCE | yes | NORMAL | The node is undergoing routine maintenance but is part of the cluster and is operational.
MAINTENANCE | no | NORMAL | An error occurred during routine maintenance. The node is part of the cluster and is operational.
If both the status and the availability are DOWN, then we can reasonably assume that the node is not running. Furthermore, when they are both DOWN, we can look at the mode to know where exactly in the (un)deployment process something failed.
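The mode/status relationship in the table above can be modeled as a pure function of the mode plus whether the error properties are empty (the enum and method names here are illustrative):

```java
public class StatusDerivation {

    public enum Mode { UP, DOWN, ANNOUNCE, BOOTSTRAP, ADD_MAINTENANCE,
                       DECOMMISSION, REMOVE_MAINTENANCE, UNANNOUNCE,
                       UNINSTALL, MAINTENANCE }

    public enum Status { UP, DOWN, JOINING, LEAVING, NORMAL }

    /** Status as a function of mode plus whether errorMessage/failedOperation are empty. */
    public static Status statusOf(Mode mode, boolean errorPropsEmpty) {
        switch (mode) {
            case UP:          return Status.UP;     // a recorded failure elsewhere doesn't take it down
            case DOWN:        return Status.DOWN;
            case MAINTENANCE: return Status.NORMAL; // routine maintenance; node stays operational
            case ANNOUNCE:
            case BOOTSTRAP:
            case ADD_MAINTENANCE:                   // deployment phases
                return errorPropsEmpty ? Status.JOINING : Status.DOWN;
            case UNINSTALL:
                return Status.LEAVING;              // stays LEAVING even after a failure (see table)
            case DECOMMISSION:
            case REMOVE_MAINTENANCE:
            case UNANNOUNCE:                        // undeployment phases
                return errorPropsEmpty ? Status.LEAVING : Status.DOWN;
            default: throw new IllegalArgumentException("Unexpected mode: " + mode);
        }
    }
}
```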