JBoss Community Archive (Read Only)

RHQ 4.9

Deploying Storage Nodes

Overview

This document discusses deploying and undeploying storage nodes. As with the RHQ relational database, at least one storage node must be installed and running before the RHQ server can be installed. Once the server is installed, additional storage nodes can be deployed.

Deploying Storage Nodes before Server Installation

An RHQ Storage Node has to be installed and running prior to installing the RHQ Server. If you run rhqctl install or rhqctl upgrade on a machine that does not have a storage node or if you do not specify a remote storage node, a new one will be deployed. The storage node will be installed on disk, configured, and started.

Shared cluster settings will be initialized when the first storage node is imported into inventory.

Storage nodes are automatically imported into inventory.

If you deploy additional storage nodes prior to server installation, you must ensure that the shared cluster settings are the same for each node. If the settings of a subsequent node do not match, an exception is thrown when the agent reports that node to the server, and the node is not imported into inventory.

TODO - Discuss shared, cluster settings

Deploying Storage Nodes after Server Installation

The deployment process consists of several steps, or phases. The names of the phases correspond to a storage node's operation mode and are as follows:

  1. INSTALL

  2. ANNOUNCE

  3. BOOTSTRAP

  4. ADD_MAINTENANCE

The API for deployment is provided by

StorageNodeManagerBean.deployStorageNode(Subject subject, StorageNode storageNode)

deployStorageNode is exposed in both the local and remote APIs. It determines the phase in which to start the deployment from the storage node's mode. Deployment is not finished until each phase completes successfully, and the process is aborted if an error occurs. Each phase should be idempotent, which makes it possible to retry or resume a deployment. The following sections describe each phase in greater detail.
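The resume-from-mode behavior described here can be pictured with a minimal, self-contained sketch. The class DeploymentSketch, the Phase enum, and the deployFrom helper are hypothetical names used only for illustration; they are not the StorageNodeManagerBean implementation. Only the phase names come from this page.

```java
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical illustration only; not the actual RHQ code.
final class DeploymentSketch {

    // Deployment phases in the order listed on this page.
    enum Phase { INSTALL, ANNOUNCE, BOOTSTRAP, ADD_MAINTENANCE }

    // Runs every phase from 'start' onward, in order.
    // 'runPhase' returns true when a phase completes successfully.
    // Returns the phase that failed, or empty if deployment finished.
    static Optional<Phase> deployFrom(Phase start, Predicate<Phase> runPhase) {
        for (Phase phase : Phase.values()) {
            if (phase.ordinal() < start.ordinal()) {
                continue; // already completed in an earlier attempt
            }
            if (!runPhase.test(phase)) {
                return Optional.of(phase); // abort on the first error
            }
        }
        return Optional.empty();
    }
}
```

Because each phase is assumed to be idempotent, a retry can simply call deployFrom again with the phase that was returned from the failed attempt.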

Phase 1: INSTALL

  • The storage node is imported into inventory

  • The StorageNode entity is created

  • Its mode is set to INSTALLED

  • Its status is INSTALLED

  • The mode is set to ANNOUNCE

  • The status changes to JOINING

Note that deployment happens automatically. A property will be added to the cluster settings to disable or enable automatic deployment.

It might seem unnecessary to first set the mode to INSTALLED since it is subsequently changed to ANNOUNCE. The INSTALL phase runs as part of the node being imported into inventory, and multiple nodes can be imported into inventory simultaneously. Deployment, however, should be serialized. This is why the mode is updated twice.
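One way to picture the two mode updates is a queue that accepts concurrent imports but lets only one node leave INSTALLED at a time. The sketch below is an assumption about how such serialization could look, not a description of how RHQ does it; the class SerializedDeploymentSketch and its methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;

// Hypothetical illustration only; not the actual RHQ code.
final class SerializedDeploymentSketch {

    // Addresses of nodes that were imported and are waiting in INSTALLED mode.
    private final Queue<String> installed = new ArrayDeque<>();

    // Address of the node currently being deployed, or null if none.
    private String deploying;

    // Called when a node is imported; several imports may happen concurrently.
    synchronized void imported(String address) {
        installed.add(address);
    }

    // Moves the next waiting node to ANNOUNCE, but only if no other
    // deployment is in progress.
    synchronized Optional<String> startNext() {
        if (deploying != null || installed.isEmpty()) {
            return Optional.empty();
        }
        deploying = installed.remove();
        return Optional.of(deploying);
    }

    // Called when the current deployment has finished or been abandoned.
    synchronized void finished() {
        deploying = null;
    }
}
```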

Phase 2: ANNOUNCE

  • The mode is set to ANNOUNCE.

  • The node's status is JOINING.

  • The announce resource operation is scheduled to run on each cluster node.

    • This operation updates the internode authentication configuration file so that existing cluster nodes will accept gossip from the new node.

  • If the operation fails

    • Set the failedOperation property of the corresponding storage node

    • Set the errorMessage property of the new node

    • The node's status changes to DOWN
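The failure bookkeeping for a cluster-wide operation such as announce can be sketched as follows. The Node class, its fields, and the announce helper are hypothetical stand-ins for the StorageNode entity and the scheduled resource operation; the sketch only mirrors the rules listed above (failedOperation on the node where the operation failed, errorMessage and DOWN status on the new node).

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; not the actual RHQ code.
final class AnnounceSketch {

    // Minimal stand-in for a storage node entity.
    static final class Node {
        final String address;
        String status;
        String failedOperation; // null when no operation has failed
        String errorMessage;    // null when there is no error

        Node(String address, String status) {
            this.address = address;
            this.status = status;
        }
    }

    // Runs the announce operation on each existing cluster node.
    // 'runAnnounce' returns true when the operation succeeds on a node.
    static boolean announce(Node newNode, List<Node> clusterNodes, Predicate<Node> runAnnounce) {
        for (Node node : clusterNodes) {
            if (!runAnnounce.test(node)) {
                node.failedOperation = "announce"; // marks where the failure occurred
                newNode.errorMessage = "announce failed on " + node.address;
                newNode.status = "DOWN";
                return false; // deployment is aborted
            }
        }
        return true;
    }
}
```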

Phase 3: BOOTSTRAP

  • Schedule the prepareForBootstrap operation on the new node. This operation performs a few tasks.

    • Shut down the storage node

    • Purge the data directories

    • Update cluster settings in cassandra.yaml

    • Update internode auth conf settings

    • Restart the node

  • If the operation fails,

    • Set the errorMessage property of the new node

    • Set the failedOperation property of the new node

    • The node's status changes to DOWN

  • The node is reported up (by the cluster)

  • Set the mode to ADD_MAINTENANCE

  • The status is still JOINING

When the node starts up, it goes through the bootstrap process, in which it streams data from other nodes. When the bootstrap process finishes, the node starts serving client requests. At this point, the new node is fully operational and part of the cluster as far as Cassandra is concerned. For our purposes, though, the deployment process is not yet complete.

Phase 4: ADD_MAINTENANCE

  • Apply schema updates if necessary

  • Schedule the addNodeMaintenance operation on each node (including the new node)

  • If the operation fails

    • Set the failedOperation property of the corresponding storage node

    • Set the errorMessage property of the new node

    • The node's status changes to DOWN

  • The operation completes successfully on all nodes

  • Set the mode to NORMAL

  • The status changes to NORMAL

TODO - Add docs on addNodeMaintenance operation and schema changes

Undeploying Storage Nodes

Undeployment is the process of removing the node from the cluster and completely uninstalling it. Like deployment, it consists of several phases, where each phase corresponds to the name of a storage node's operation mode. The phases are:

  1. DECOMMISSION

  2. REMOVE_MAINTENANCE

  3. UNANNOUNCE

  4. UNINSTALL

The API for undeployment is provided by

StorageNodeManagerBean.undeployStorageNode(Subject subject, StorageNode storageNode)

undeployStorageNode is exposed in both the local and remote APIs. It determines the phase in which to start the undeployment from the storage node's mode. Undeployment is not finished until each phase completes successfully, and the process is aborted if an error occurs. Each phase should be idempotent, which makes it possible to retry or resume an undeployment. The following sections describe each phase in greater detail.

Phase 1: DECOMMISSION

  • Set the mode to DECOMMISSION

    • Note that this assumes the previous mode was NORMAL

  • The status changes to LEAVING

  • Apply schema updates if necessary

  • Schedule the decommission resource operation on the node

  • If the operation fails

    • Set the errorMessage property of the node

    • Set the failedOperation property of the node

    • The node's status changes to DOWN

  • The cluster reports that the node has been removed (from the cluster)

  • Set the mode to REMOVE_MAINTENANCE

  • The status is LEAVING

Note that the decommission operation performs the unbootstrap process. The node stops serving client requests and starts streaming the data it owns to other nodes. When the process finishes, the node is no longer part of the cluster as far as Cassandra is concerned. For our purposes though, undeployment is not yet complete.

Phase 2: REMOVE_MAINTENANCE

  • Run the removeNodeMaintenance resource operation on each node (excluding the node being undeployed)

  • If the operation fails

    • Set the errorMessage property of the node

    • Set the failedOperation property of the node

    • The status of the node being undeployed changes to DOWN

  • The operation completes successfully on all cluster nodes

  • Set the mode to UNANNOUNCE

  • The status is LEAVING

Phase 3: UNANNOUNCE

  • Run the unannounce resource operation on each node (excluding the node being undeployed)

    • This removes the undeployed node from the internode authentication conf file

  • If the operation fails

    • Set the errorMessage property of the node

    • Set the failedOperation property of the node

    • The status of the node being undeployed changes to DOWN

  • The operation completes successfully on all cluster nodes

  • Set the mode to UNINSTALL

  • The status is LEAVING

Phase 4: UNINSTALL

  • Run the uninstall operation against the node being undeployed.

  • The operation will

    • Shut down the node if it is running

    • Delete all of its files from disk

  • If the operation fails

    • Set the errorMessage property of the node

    • Set the failedOperation property of the node

    • The status of the node being undeployed changes to DOWN

  • Uninventory the storage node resource

  • Delete the storage node entity

  • Undeployment has completed successfully

Undeploying a storage node whose status is not NORMAL

The previous sections assume that the node's current status is NORMAL when the undeployment begins. We also need to handle undeploying a node with other statuses. For example, deploying a storage node may have failed, leaving the node with a mode other than NORMAL. The user might decide that it is simply easier or faster to deploy a different node on a different machine instead of trying to resolve the issues with the failed deployment. The table below shows the starting undeployment phase for each mode.

| Storage node mode | Starting undeployment phase |
| INSTALLED | UNINSTALL |
| ANNOUNCE | UNANNOUNCE |
| BOOTSTRAP | UNANNOUNCE |
| ADD_MAINTENANCE | DECOMMISSION |
| NORMAL | DECOMMISSION |
| DECOMMISSION | DECOMMISSION |
| REMOVE_MAINTENANCE | REMOVE_MAINTENANCE |
| UNANNOUNCE | UNANNOUNCE |
| UNINSTALL | UNINSTALL |
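The mapping in the table above can be expressed as a small lookup. This is a sketch only: UndeploySketch, its enums, and startingPhase are hypothetical names, and the real logic lives in undeployStorageNode.

```java
// Hypothetical illustration only; not the actual RHQ code.
final class UndeploySketch {

    // Storage node modes that appear in the table above.
    enum Mode {
        INSTALLED, ANNOUNCE, BOOTSTRAP, ADD_MAINTENANCE, NORMAL,
        DECOMMISSION, REMOVE_MAINTENANCE, UNANNOUNCE, UNINSTALL
    }

    // Undeployment phases in order.
    enum Phase { DECOMMISSION, REMOVE_MAINTENANCE, UNANNOUNCE, UNINSTALL }

    // Returns the phase in which undeployment starts for the given mode.
    static Phase startingPhase(Mode mode) {
        switch (mode) {
            case INSTALLED:
            case UNINSTALL:
                return Phase.UNINSTALL;
            case ANNOUNCE:
            case BOOTSTRAP:
            case UNANNOUNCE:
                return Phase.UNANNOUNCE;
            case REMOVE_MAINTENANCE:
                return Phase.REMOVE_MAINTENANCE;
            case ADD_MAINTENANCE:
            case NORMAL:
            case DECOMMISSION:
                return Phase.DECOMMISSION;
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }
}
```

For example, a node whose deployment failed during BOOTSTRAP never joined the cluster, so there is nothing to decommission and undeployment starts at UNANNOUNCE.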

Mode, Status, Availability

While the three are closely related, there are some subtle distinctions that need to be explained.

Availability

Availability is a familiar concept within RHQ. It is a special measurement collected by the agent to determine whether or not a resource is up or down. A storage node is linked to a resource; so, it undergoes availability checks like every other resource. The RHQ Storage plugin performs a simple check, ensuring that it can make a JMX connection to the storage node. JMX is used only for management. It is not used for internode communication or for client requests. The next table describes what availability does and does not mean for a storage node.

| Availability Type | Meaning |
| UP | The storage node is running and the agent can perform management operations via JMX. This does not necessarily mean that the node is serving client requests or that it is participating in internode communication. |
| DOWN | The agent cannot make a JMX connection to the storage node. Unless the JMX URL in the connection settings is wrong, this indicates that the storage node is not running. |
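A JMX-based availability check of this kind might look like the sketch below. It uses the standard javax.management.remote API; the class name JmxAvailabilitySketch and the isUp method are hypothetical, and the actual check in the RHQ Storage plugin may differ.

```java
import java.io.IOException;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical illustration only; not the actual RHQ Storage plugin code.
final class JmxAvailabilitySketch {

    // Returns true (UP) if a JMX connection can be established, and
    // false (DOWN) if the connection fails or the URL is malformed.
    static boolean isUp(String jmxUrl) {
        try (JMXConnector connector =
                 JMXConnectorFactory.connect(new JMXServiceURL(jmxUrl))) {
            connector.getMBeanServerConnection();
            return true;
        } catch (IOException | SecurityException e) {
            // MalformedURLException is an IOException, so a wrong URL
            // is reported as DOWN as well.
            return false;
        }
    }
}
```

Note that a wrong JMX URL and a stopped node are indistinguishable in this sketch, which matches the caveat in the DOWN row above.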

Mode and Status

The mode and status differ from availability in that they report on the node's state relative to the rest of the cluster. The status provides more general info while the mode is more specific, and multiple modes actually map to the same status flags. The following table lists the possible combinations for the mode and status. The errorMessage and failedOperation properties are included as well as they too are used to determine the status.

| Mode | errorMessage and failedOperation empty? | Status | Description |
| UP | yes | UP | The node is serving client requests and participating in gossip (i.e., internode communication). |
| UP | no | UP | A resource operation against this node failed during the deployment of another node. In this scenario the failedOperation is set on the existing node to show precisely where the failure occurred. The node is, however, still actively participating in cluster operations. |
| DOWN | yes | DOWN | The node is part of the cluster but not actively participating. The node may actually be shut down, or it may be running but unable to reach other cluster nodes due to a network partition. |
| ANNOUNCE | yes | JOINING | The node is being deployed. |
| ANNOUNCE | no | DOWN | Deploying the node failed. It is installed (i.e., linked to a resource in inventory) but not part of the cluster. |
| BOOTSTRAP | yes | JOINING | The node is being deployed. |
| BOOTSTRAP | no | DOWN | Deploying the node failed. It is not yet part of the cluster. |
| ADD_MAINTENANCE | yes | JOINING | The node is being deployed. At this point the node will actually service client requests, but maintenance tasks still need to be performed to ensure that the cluster is in a consistent state. |
| ADD_MAINTENANCE | no | DOWN | Deployment failed. At this point the node will actually service client requests, but maintenance tasks still need to be performed to ensure that the cluster is in a consistent state. |
| DECOMMISSION | yes | LEAVING | The node is being undeployed. |
| DECOMMISSION | no | DOWN | Undeploying the node failed. It is possible that it is still serving client requests and participating in gossip. |
| REMOVE_MAINTENANCE | yes | LEAVING | The node is being undeployed. The node is no longer a participating member of the cluster, but it is still accepting JMX connections. |
| REMOVE_MAINTENANCE | no | DOWN | Undeployment failed. The node is no longer a participating member of the cluster. |
| UNANNOUNCE | yes | LEAVING | The node is being undeployed and is no longer a participating member of the cluster. |
| UNANNOUNCE | no | DOWN | Undeployment failed and the node is no longer a participating member of the cluster. |
| UNINSTALL | yes | LEAVING | The node is being undeployed. |
| UNINSTALL | no | LEAVING | Undeployment failed and the node is no longer a participating member of the cluster. |
| MAINTENANCE | yes | NORMAL | The node is undergoing routine maintenance but is part of the cluster and is operational. |
| MAINTENANCE | no | NORMAL | An error occurred while undergoing routine maintenance. The node is part of the cluster and is operational. |
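The way the status follows from the mode and the two error properties can be sketched as a pure function. StatusSketch, its enums, and statusOf are hypothetical names; the sketch covers only the deployment, undeployment, and MAINTENANCE entries of the table above and leaves out the UP and DOWN entries.

```java
// Hypothetical illustration only; not the actual RHQ code.
final class StatusSketch {

    enum Mode {
        ANNOUNCE, BOOTSTRAP, ADD_MAINTENANCE,
        DECOMMISSION, REMOVE_MAINTENANCE, UNANNOUNCE, UNINSTALL,
        MAINTENANCE
    }

    enum Status { JOINING, LEAVING, NORMAL, DOWN }

    // Derives the status from the mode and the error properties.
    // A null property is treated as empty.
    static Status statusOf(Mode mode, String errorMessage, String failedOperation) {
        boolean failed = errorMessage != null || failedOperation != null;
        switch (mode) {
            case ANNOUNCE:
            case BOOTSTRAP:
            case ADD_MAINTENANCE:
                return failed ? Status.DOWN : Status.JOINING;
            case DECOMMISSION:
            case REMOVE_MAINTENANCE:
            case UNANNOUNCE:
                return failed ? Status.DOWN : Status.LEAVING;
            case UNINSTALL:
                return Status.LEAVING; // listed as LEAVING with or without an error
            case MAINTENANCE:
                return Status.NORMAL;  // listed as NORMAL with or without an error
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }
}
```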

If both the status and the availability are DOWN, then we can reasonably assume that the node is not running. Furthermore, when they are both DOWN, we can look at the mode to know where exactly in the (un)deployment process something failed.

JBoss.org Content Archive (Read Only), exported from JBoss Community Documentation Editor at 2020-03-13 08:17:15 UTC, last content change 2013-09-18 19:41:54 UTC.