Agent Features - RHQ 4.9

Persistent Configuration

The agent configuration, made up of what are called "preferences", is persisted on a per-user basis in an OS-specific way (for example, on Windows the preferences are stored in the registry; on UNIX they are stored in a directory under the user's home directory). You can define different sets of configuration preferences by setting the agent's -p command line argument to the name of the preferences node you want to use. This lets you keep several sets of configuration preferences and switch between them by passing in different -p values.

When an agent starts, it checks whether its preferences can be upgraded and, if so, upgrades them to the latest schema. This is helpful when you upgrade the agent: when the new agent starts up, it carries over the previous configuration, adds any new preferences, and deletes obsolete ones.
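
The upgrade step described above can be pictured with a minimal sketch. It is not the agent's actual implementation; the function name, the "example.*" preference names, and the default value are hypothetical and only illustrate the carry-over, add, and delete behavior.

```python
# Hypothetical schema data: the "example.*" names and the default value are
# placeholders for illustration, not real RHQ preferences or defaults.
NEW_DEFAULTS = {"example.new.preference": "default-value"}
OBSOLETE_KEYS = {"example.obsolete.preference"}


def upgrade_preferences(stored):
    """Return a copy of `stored` upgraded to the (hypothetical) latest schema."""
    # Carry over every existing preference except the obsolete ones.
    upgraded = {k: v for k, v in stored.items() if k not in OBSOLETE_KEYS}
    # Add preferences introduced by the new schema without overwriting
    # values the user has already set.
    for key, default in NEW_DEFAULTS.items():
        upgraded.setdefault(key, default)
    return upgraded
```

Values that the user already set are kept as they are; only missing preferences receive a default.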

You can pre-configure your agent so the first time it starts up, it can immediately register with a server and begin operating without manual intervention. Please see Preconfiguring the Agent for more information.

Server Auto-Detection and Polling

The agent can be configured to auto-detect its RHQ Server. It can do this in two different ways:

Multicast detection: using multicast detection technology, the agent can usually detect the RHQ Server coming online or going offline within a matter of seconds (the detection time is configurable). This requires your network to support multicast traffic; if it does not, you cannot use this method of server auto-detection.
Server polling: the agent polls the RHQ Server periodically to determine whether it is online or offline. This method of auto-detection does not require multicast traffic, but it does require the agent to periodically connect to the RHQ Server and send it a simple ping command.

Typically one or both of these mechanisms are enabled. When the agent detects that the RHQ Server has gone offline, it gets the opportunity to persist commands that are waiting to be sent and to stop its attempts to send commands. When the RHQ Server comes back online (and is auto-detected), the agent can resume.
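
As a rough illustration of the server polling mechanism, the sketch below tracks the server's state from periodic ping results and reports only the transitions. The class name, the callbacks, and the choice of OSError as the failure signal are assumptions made for this example; they are not part of the RHQ API.

```python
class ServerPoller:
    """Tracks online/offline state from ping results (hypothetical helper)."""

    def __init__(self, ping, on_online, on_offline):
        self._ping = ping              # callable: True if the server answered
        self._on_online = on_online    # called when the server comes online
        self._on_offline = on_offline  # called when the server goes offline
        self._online = None            # state is unknown until the first poll

    def poll_once(self):
        try:
            answered = bool(self._ping())
        except OSError:
            answered = False           # a failed connection counts as offline
        if answered != self._online:   # notify only when the state changes
            self._online = answered
            if answered:
                self._on_online()
            else:
                self._on_offline()
        return answered
```

A caller would invoke poll_once() on a timer. The offline callback is the natural place to persist queued commands and pause sending, and the online callback is where sending would resume.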

Throttling

The agent has several configuration preferences that govern its client-side command sender. They limit how many resources the sender can use and how "fast" it can perform some functions (called throttling). These configuration preferences have two main purposes: 1) to limit the amount of resources the agent is able to claim for itself and 2) to avoid flooding the server with a large number of commands, which could put too heavy a load on the RHQ Server and/or prevent other agents from communicating with the RHQ Server.

The following configuration preferences define the settings that enable the agent to throttle its outbound messages. Most of these settings should be configured with the other settings in mind. While they work independently, their effects are usually determined not by their own value but by related values. For example, queue-size should be set to a larger number if the command timeout is lengthened, because when commands are given more time to complete, more commands will be in the queue waiting to be sent. But if max-concurrent is raised, more commands can be dequeued at any one time, so an increase in queue-size may not be needed. Each preference sets an independent parameter within the agent, but its effect on the agent's behavior as a whole depends on the agent's other preferences.

rhq.agent.client.queue-size defines the maximum number of commands the agent can queue up for sending to the RHQ Server. The larger the number, the more memory the agent will be able to use up. Setting this to 0 effectively sets the queue to be unbounded. Please take caution when setting this to 0; if the RHQ Server is down for a long period of time, the agent may run out of memory if it attempts to queue up more commands than it has memory for.
rhq.agent.client.max-concurrent is the number of messages the agent can send at any one time. The larger the number, the more messages the agent can dequeue (thus freeing up space in the queue for more messages to come in). However, the higher this number is, the more messages will get sent to the server at the same time and may require the agent to use more CPU cycles.
rhq.agent.client.command-timeout-msecs defines the amount of time the agent will wait for the RHQ Server to respond to a command before that command is aborted. The longer this time is, the less likely the agent is to abort a command that otherwise would have succeeded (e.g. if the server just needs a lot of time to process the particular command). However, the longer this time is, the more messages have to be queued up and wait before being sent to the server.
rhq.agent.client.retry-interval-msecs is the amount of time the agent will wait before attempting to retry a command. Only those commands that are flagged for guaranteed delivery will be retried. Non-guaranteed commands (aka volatile commands) are not retried, so this setting has no effect on them.
rhq.agent.client.send-throttling, if defined, enables send throttling. When this is enabled, only a certain number of commands can be sent before the agent enters a quiet period. During the quiet period, no throttle-able commands are allowed to be sent to the server. Sending can resume after the quiet period ends. Send throttling only affects those commands configured as "throttle-able" - these are typically commands containing metric collection data (i.e. those commands that tend to be sent to the RHQ Server very frequently and in large numbers). Any other commands are not affected by the send throttle. Send throttling helps prevent message storms on the RHQ Server, which keeps the server from being flooded with incoming messages and prevents agent starvation (that is, it avoids locking out other agents from talking to the RHQ Server). The send-throttling preference defines both the maximum number of commands that can be sent and the length of the quiet period. For example, a preference value of "50:10000" means that after 50 throttle-able commands are sent, a quiet period will commence and last for 10000 milliseconds. After that time expires, 50 more commands can be sent before the next quiet period begins.
rhq.agent.client.queue-throttling, if defined, enables queue throttling. This limits the number of commands that can be dequeued in a given amount of time, called the burst period. If more dequeue requests are made during the burst period than allowed, the excess requests are blocked until the next burst period begins. For example, if this is set to "50:10000", at most 50 commands can be dequeued in any 10000 millisecond interval. If, during a burst period, a 51st dequeue request is made, that request will block until the burst period finishes (at which time a new burst period begins and the request becomes the first of the next 50 allowed dequeue requests). The purpose of queue throttling is not so much to limit the number of requests being sent to the server (although it does have that side effect) as to keep the agent from spinning the CPU too much as it attempts to dequeue and send commands as fast as it can. If an agent is using too many CPU cycles, you can throttle the queue and thus (hopefully) reduce the amount of CPU required for the agent to send its commands. Note that if you enable queue throttling, you must make sure your queue-size is large enough: because you are limiting the number of commands that can be dequeued in a specific amount of time, the queue needs enough space to hold the extra commands that get queued up.
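
To make the "50:10000" examples concrete, here is a small sketch that parses a throttling value and models both effects. It is a simplified model, not the agent's code: the function names are hypothetical, sending is assumed to be instantaneous, all commands are assumed to be ready at time 0, and arrivals are assumed to be constant per burst period.

```python
def parse_throttle(value):
    """Split a "max-commands:period-msecs" value into two integers."""
    max_commands, period_msecs = value.split(":")
    return int(max_commands), int(period_msecs)


def send_times(command_count, value):
    """Earliest send time (msecs) of each throttle-able command.

    Simplified send-throttling model: after every `max_commands` sends
    the sender sits out one quiet period.
    """
    max_commands, quiet_msecs = parse_throttle(value)
    return [(i // max_commands) * quiet_msecs for i in range(command_count)]


def backlog_after(periods, arrivals_per_period, value):
    """Commands left in the queue after a number of burst periods.

    Simplified queue-throttling model: at most `max_dequeued` commands
    leave the queue in each burst period.
    """
    max_dequeued, _ = parse_throttle(value)
    backlog = 0
    for _ in range(periods):
        backlog = max(0, backlog + arrivals_per_period - max_dequeued)
    return backlog
```

Under these assumptions, send_times(120, "50:10000") places commands 1 to 50 at 0 msecs, 51 to 100 at 10000 msecs, and the rest at 20000 msecs. With 80 commands arriving per burst period and a queue throttle of "50:10000", backlog_after(6, 80, "50:10000") gives 180 queued commands after one minute, which is the kind of growth the queue-size setting has to absorb.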

Guaranteed Delivery

Some commands that the agent sends to the RHQ Server are not critical in the grand scheme of things. For example, if a ping request fails to make it to the RHQ Server, we do not want to retry it, nor do we want to persist the command to ensure it survives an agent shutdown. These commands are called volatile commands. Volatile commands are sent once: if the RHQ Server fails to process them for whatever reason, the failure is logged and the agent drops the command and moves on to the next one it needs to send.
However, there are some commands that must make their way to the RHQ Server and the agent must ensure the RHQ Server processes them. The agent must guarantee that these commands are delivered - these are called guaranteed commands.

While the agent will do its best to guarantee the delivery of guaranteed commands, this guarantee is not 100%. That is to say, rare circumstances may arise that cause a guaranteed command to fail to get delivered (e.g. if the JVM crashes suddenly in the middle of an attempt to send a guaranteed command).
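
The difference between volatile and guaranteed commands can be sketched as follows. This is only an illustration under stated assumptions: the function and its parameters are hypothetical, ConnectionError stands in for any delivery failure, and max_attempts exists only to keep the example finite (it is not a documented agent setting). The retry interval plays the role of the rhq.agent.client.retry-interval-msecs preference.

```python
import time


def deliver(send, command, guaranteed, retry_interval_msecs,
            max_attempts=3, sleep=time.sleep):
    """Try to send `command`; return True if it was delivered."""
    # Volatile commands get exactly one attempt; guaranteed commands are
    # retried (bounded here by the illustrative `max_attempts`).
    attempts = max_attempts if guaranteed else 1
    for attempt in range(attempts):
        try:
            send(command)
            return True
        except ConnectionError:
            if attempt + 1 < attempts:
                # Wait one retry interval before the next attempt.
                sleep(retry_interval_msecs / 1000.0)
    return False
```

When this sketch returns False for a guaranteed command, a real sender would keep the command (for example by spooling it to disk) rather than dropping it, whereas a failed volatile command is simply logged and discarded.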

Guaranteed commands are retried every X milliseconds while the agent is alive and actively sending commands to the server (where X is the rhq.agent.client.retry-interval-msecs preference setting). Guaranteed commands also survive agent shutdowns. If an agent shuts down prior to being able to deliver a guaranteed command, that command is persisted to disk in what is called the command spool file. The next time the agent starts up, it will load up commands it has spooled to disk and immediately queue them for sending to the RHQ Server.
Two preferences define the behavior of this command spool file:

rhq.agent.client.command-spool-file.params defines the parameters for the spool file. The value's format is defined as "max-file-size:purge-percentage". The first number is the size, in bytes, of the maximum file size threshold. If the spool file grows larger than this, a "purge" will be triggered in order to shrink the file. The second number is the purge percentage which indicates how large the file is allowed to be after a purge. This is specified as a percentage of the first parameter - the max file size threshold. For example, if the max file size is 100000 (i.e. 100KB) and the purge percentage is 90, then when the spool file grows larger than 100KB, a purge will be triggered and the file will be compressed to no more than 90% of 100KB - which is 90KB. In effect, 10KB will be freed to allow room for new commands to be spooled. When this occurs, unused space is compressed first and if that does not free up enough space, the oldest commands in the spool file will be erased in order to make room for the newer commands.
rhq.agent.client.command-spool-file.compressed is a true or false flag. If this flag is true, the commands stored in the spool file will be compressed. This can potentially save about 30%-40% in disk space (give or take); however, it slows down the persistence mechanism considerably. The performance hit will only appear when unusual conditions occur, such as shutting down while some guaranteed commands have not been sent yet or if the RHQ Server is down. It will not affect the agent under normal conditions (while running with the RHQ Server up and successfully communicating with the agent).
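
The arithmetic behind the "max-file-size:purge-percentage" value can be worked through with a short sketch. The function names are hypothetical, and the code only reproduces the calculation described above; it does not model how the agent actually reclaims the space.

```python
def parse_spool_params(value):
    """Return (max_size_bytes, post_purge_target_bytes) for a params value."""
    max_size_str, purge_pct_str = value.split(":")
    max_size = int(max_size_str)
    purge_pct = int(purge_pct_str)
    # The purge percentage is relative to the max file size threshold.
    return max_size, max_size * purge_pct // 100


def bytes_to_free(current_size, value):
    """How many bytes a purge must reclaim, or 0 if no purge is triggered."""
    max_size, target = parse_spool_params(value)
    if current_size <= max_size:
        return 0  # a purge is triggered only above the threshold
    return current_size - target
```

For the value "100000:90", the post-purge target is 90000 bytes, so a spool file that has grown to 100500 bytes needs 10500 bytes reclaimed.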

Transports

Both the RHQ Agent and RHQ Server use the same underlying communications services (based on JBoss/Remoting technology). One feature this enables is the ability for the communications layer to use different transports simply by changing configuration preferences. The following configuration preferences define the transports used by the agent:

rhq.agent.server.transport defines the transport protocol that the agent will use to talk to the RHQ Server.
rhq.communications.connector.transport defines the transport that the agent, itself, expects the RHQ Server to use when the server wants to send messages to the agent.

The transports that are supported by RHQ are: socket (raw and unencrypted socket based transport), sslsocket (encrypted and optionally authenticated SSL transport), servlet and sslservlet. In addition to customizing the transport, you can also provide transport parameters that help define the behavior of the connection over the configured transport.

Secure Communications - Encryption and Authentication

The communications services used by the RHQ Server and RHQ Agent can secure the network traffic between the two by using SSL in order to encrypt and optionally authenticate the traffic. By simply using a transport that uses SSL, you automatically get encryption. Each RHQ Server and RHQ Agent can be optionally configured with a keystore and/or a truststore. You can configure the RHQ Server to authenticate RHQ Agents, RHQ Agents to authenticate the RHQ Server or both. By setting up the proper certificates in the proper keystores/truststores, you can set up a fully secured network of RHQ Servers and RHQ Agents. You can even define what encryption protocols you want to use to encrypt the network traffic and what algorithms you want to use within your keystores/truststores.
There are two configuration preferences that are important to consider:

rhq.communications.connector.security.client-auth-mode defines whether or not the agent's server-side components must authenticate incoming requests (that is, authenticate the RHQ Server's certificate). The client-auth-mode can be set to one of three values:
1. none means the agent will not attempt to authenticate the RHQ Server's certificate during the SSL handshake. In this case, the agent will not need a truststore file defined.
2. want means that only if the RHQ Server sends a certificate will it be authenticated. If the RHQ Server does not have a certificate (thus doesn't provide one during the SSL handshake), this anonymous connection will be accepted by the agent. The agent must have a truststore file containing all its trusted certificates (which must include the RHQ Server's public certificate).
3. need means that the agent must authenticate the RHQ Server's certificate in order for the incoming requests to be accepted. If the RHQ Server provides an untrusted certificate or if it provides no certificate at all during the SSL handshake, the agent will deny the connection request and not accept any data from that connection. The agent must have a truststore file containing all its trusted certificates (which must include the RHQ Server's public certificate).
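
The three client-auth-mode values can be summarized as a small decision function. This is a reading of the descriptions above, not the agent's code: the function name and its boolean inputs are hypothetical, and it assumes that under "want" a certificate that is presented but not trusted is rejected.

```python
def accept_incoming_connection(mode, cert_presented, cert_trusted):
    """Decide whether the agent accepts an incoming SSL connection."""
    if mode == "none":
        # No authentication is attempted, so no truststore is needed.
        return True
    if mode == "want":
        # Anonymous connections are accepted; a presented certificate
        # must be trusted (assumption, see the lead-in).
        return (not cert_presented) or cert_trusted
    if mode == "need":
        # A trusted certificate is mandatory.
        return cert_presented and cert_trusted
    raise ValueError("unknown client-auth-mode: %r" % (mode,))
```

The only difference between "want" and "need" in this sketch is the anonymous case: a connection without a certificate is accepted under "want" and denied under "need".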

rhq.agent.client.security.server-auth-mode-enabled defines whether or not the agent's client-side sender components must authenticate the RHQ Server's certificate when they send outbound requests to the RHQ Server. When the agent initiates communication with the RHQ Server (i.e. when the agent wants to send a command to the server), it must first engage in the SSL handshake, at which time the agent and server swap certificates. If server-auth-mode-enabled is true, the agent must authenticate/trust the RHQ Server's certificate; otherwise, the agent will refuse to send its command to the server. If this mode is false, the server's certificate is ignored and the agent sends its command regardless of the server's trustworthiness. When this mode is enabled, the agent must have a truststore file containing all its trusted certificates (which must include the RHQ Server's public certificate).

If the agent does not yet have a keystore containing its certificate, it will create and self-sign its own. The self-generated keystore file will, by default, be stored in the agent's data directory under the filename "keystore.dat". A keystore file is required if the agent is to engage in any SSL-based transport. If you wish, you can create and assign the agent your own custom certificate stored in your own keystore file. Simply create the keystore file, put it somewhere on the local file system that the agent can access, and define your keystore configuration preferences accordingly. The same holds true for the truststore files, except that the agent will not create any truststore files. If you wish to enable either client-auth or server-auth, you must provide the trusted certificates to the agent by putting the truststore files somewhere the agent can access and defining the appropriate truststore configuration preferences.

For more information on how to set up secure communications, see Securing Communications.

Native Code with Java Fallback

The RHQ Agent loads native code to help it with things that only a low-level native layer can perform (such as examining the operating system's process table). If the native libraries are not available for your hardware/operating system, the agent will still run and be supported. Your agent will lose some capabilities (such as being able to auto-discover resources via process table scanning); however, it will still run and function. In addition, you can manually disable the native layer in the case where the native libraries do exist but for some reason are not working properly. Disabling the native layer gets the agent working again (albeit with the reduced set of capabilities).
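
The fallback behavior can be pictured with a minimal sketch. All names here are hypothetical, and the sketch assumes that a failed native load surfaces as ImportError or OSError; it only illustrates the pattern of trying the native layer first and degrading gracefully.

```python
class FallbackSystemInfo:
    """Reduced-capability provider used when native code is unavailable."""

    is_native = False

    def process_table(self):
        # Process table scanning needs the native layer, so report that
        # the capability is unavailable instead of failing.
        return None


def create_system_info(load_native, native_disabled=False):
    """Return the native provider if possible, else the fallback."""
    if not native_disabled:
        try:
            return load_native()
        except (ImportError, OSError):
            pass  # native libraries are missing or broken: fall back
    return FallbackSystemInfo()
```

Passing native_disabled=True skips the native load entirely, which mirrors manually disabling the native layer when the libraries exist but misbehave.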