Design-AgentAutoUpdate

Agent Auto-Update

To track the progress of this feature, see: http://jira.rhq-project.org/browse/RHQ-110

Definition

"Agent Auto-Update" refers to the ability for a running agent to automatically update itself, including updates to jars, configuration files, scripts and preference values. An agent should be able to update and restart itself automatically without the need for any additional manual intervention, specifically, there should be no need for an administrator to log into the agent machine and perform tasks necessary to complete the update.

Requirements

The following are requirements that must be supported by our agent auto-update feature:

The update solution must be cross-platform - Linux, Windows, UNIX and hopefully even MacOS.
Agents must be able to update themselves while they are running. This seems obvious but is being explicitly mentioned for two reasons:
- We must make sure the agent can overwrite jar files that are currently open by the agent JVM's classloaders (on Windows XP, and perhaps other platforms, you cannot delete a file that is currently opened; altering jar file content could screw up a classloader that is currently reading class definitions from it)
- Agents not already running must be able to know immediately at startup that they need to update (and do so then)
Manual intervention must not be required to apply an update. The agent should be able to automatically download the update, apply the update and restart itself.
It must be possible to turn off the auto-update feature from both the server side and the agent side. It needs to be disabled on the agent side because the embedded agent will have this feature turned off by default; it also needs to be turned off in order to run our unit tests. It needs to be disabled on the server side to force agent auto-update off even if the agent has it enabled.

The Prime Directive

There is one important design decision we are making in our first version of agent auto-update - we call it the Prime Directive and it is this:

All agents talking to a server cloud will be of the same version and an individual server will support talking to only one specific version of an agent.

Keeping the above fact in mind, this means:

We will not support updating agents such that there ends up being different agent versions running in the RHQ environment.
We will not support servers installed in a cloud that are matched to different agent versions. All servers in a cloud must support the same version of agent.

In the future, we may decide to allow violations of this Prime Directive - we might be able to support different "minor" versions of agents; however, in our first implementation, we will not support it.

In fact, we can somehow explicitly disallow a mis-versioned agent from talking to a server. The easiest way we can do this without requiring additional out-of-band data to flow over the comm connection is to have the agent kill itself if it detects it is one version but the server is another version. Unfortunately, this will not prevent really old agents (those prior to this agent-update feature) from trying to talk to the server. To prevent that, we could add an out-of-band version string to be placed in the agent's outgoing commands' configuration and if the server gets a bad version, it will disallow the messages. NOTE: we need a way to not make this too difficult for agent developers. A rebuild of an agent in a development environment should somehow be able to stay at a particular version so it can be started and not be denied access because the new agent is not of the version the server thinks is the latest version.

Agent Update Binaries

Today we ship agent distributions as a .zip. You unzip them and you have an agent installation.

We are now going to distribute "agent update binaries" that will be packaged in our server distribution. The agent update binary will be a self-executing .jar that contains not only the full agent distribution .zip but also the update code itself. The name of the jar will include version information so it is easily discernible, such as "rhq-enterprise-agentupdate-2.2.0.jar".

The maven build module is "enterprise/agentupdate" so mvn actually builds and installs the agent update binary with the name "rhq-enterprise-agentupdate-#.#.#.jar". However, our server container build will rename it "rhq-enterprise-agent-#.#.#.jar" when it copies it into the server's download area.

The agent distribution zip files will still be called "rhq-enterprise-agent-2.2.0.zip".

Agent Update Binaries can be used either within the context of an existing agent performing an update, or completely standalone (in the case of doing a fresh provisioning of an agent where no agent existed before). Example:

java -jar rhq-enterprise-agent-2.2.0.jar --install[=<new-agent-dir>]

This will tell the Main-Class (as defined in the jar's manifest) to extract the agent in the current directory without doing anything special - do not run any update-specific code. It is as if the user simply extracted the .zip distro from the jar, and unzip'ed the agent .zip distro. If you specify a <new-agent-dir>, the agent will be installed in that directory (because it simply unzips the enclosing agent.zip, the agent will really be installed in "<new-agent-dir>/rhq-agent" since the zip is rooted at "rhq-agent").

java -jar rhq-enterprise-agent-2.2.0.jar --update[=<old-agent-dir>]

This will be the command that an existing agent will issue after it downloads the jar file. This tells the Main-Class that we want to update an existing agent, where that existing agent is installed in the given <old-agent-dir> directory. The default will be the current directory (to allow a user to copy this jar file in an agent home directory and just run "java -jar rhq-enterprise-agent-2.2.0.jar --update" and have it work).
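
The two options above could be parsed along the following lines. This is only a minimal sketch and not the actual Main-Class: the class name UpdateArgs and its Mode enum are hypothetical, and the real update code may well use gnu.getopt rather than manual string handling. An omitted or empty directory is assumed to mean the current directory.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical parser for the --install[=dir] and --update[=dir] options. */
final class UpdateArgs {
    enum Mode { INSTALL, UPDATE }

    final Mode mode;
    final Path dir;

    private UpdateArgs(Mode mode, Path dir) {
        this.mode = mode;
        this.dir = dir;
    }

    /** Parses one option; the directory defaults to the current directory ("."). */
    static UpdateArgs parse(String arg) {
        String name = arg;
        String value = ".";
        int eq = arg.indexOf('=');
        if (eq >= 0) {
            name = arg.substring(0, eq);
            value = arg.substring(eq + 1);
            if (value.isEmpty()) {
                value = "."; // assumption: "--update=" behaves like "--update"
            }
        }
        Mode mode;
        if (name.equals("--install")) {
            mode = Mode.INSTALL;
        } else if (name.equals("--update")) {
            mode = Mode.UPDATE;
        } else {
            throw new IllegalArgumentException("Unknown option: " + arg);
        }
        return new UpdateArgs(mode, Paths.get(value));
    }
}
```

For example, UpdateArgs.parse("--update=/opt/rhq-agent") yields Mode.UPDATE with the directory /opt/rhq-agent, while UpdateArgs.parse("--install") yields Mode.INSTALL with the current directory.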

Jar files and ZIP files are binary compatible. You can create a binary using "jar cvf" and unpackage it with "unzip" and you can create a binary using "zip" and unpackage it with "jar xvf" (regardless of the file extension - .jar and .zip work equally). However, you will lose UNIX execute-bits (in fact, all permission bits) if you build the package with "jar" (even if you unpackage it with "unzip") and you will lose them if you unpackage them with "jar" (even if you created it with "zip"). In other words, using "jar" either in the create or unpackage stage will cause you to lose the permission bits of the files. This is important because it means we will lose +x permissions on our *.sh UNIX scripts. We may need to still package our agent update binaries using "zip" (even if our extension to the file is still ".jar") to maintain cross-platform capabilities but warn people that if they use "jar" to unpackage it, they will need to "chmod -R +x *.sh" to get our scripts to have the proper execute permissions. The solution our first implementation will use is to take our existing .zip agent distro and package it directly in the agent update binary .jar (it will not be exploded in the .jar). This means if you want to manually install the agent, you can "jar xvf rhq-enterprise-agent-X.jar rhq-enterprise-agent-X.zip" and then "unzip rhq-enterprise-agent-X.zip" - but I'll assume people won't want to do this since "java -jar rhq-enterprise-agent-X.jar" will be easier.
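
Since the execute bits are lost whenever "jar" is involved, the update code may have to restore them itself after unpacking. The following is a rough sketch of that idea using only the standard Java file API; the class name ScriptPermissions is hypothetical. It approximates "chmod +x" on every *.sh file under a directory and reports the files it could not change so the caller can warn about them.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper that restores execute permission on shell scripts. */
final class ScriptPermissions {

    /** Marks every *.sh file under root as executable; returns the files that could not be changed. */
    static List<Path> makeScriptsExecutable(Path root) throws IOException {
        List<Path> scripts;
        try (Stream<Path> paths = Files.walk(root)) {
            scripts = paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(".sh"))
                .collect(Collectors.toList());
        }
        List<Path> failed = new ArrayList<>();
        for (Path script : scripts) {
            // ownerOnly=false grants execute permission to everyone, roughly like "chmod +x"
            if (!script.toFile().setExecutable(true, false)) {
                failed.add(script); // the caller should warn here, not abort
            }
        }
        return failed;
    }
}
```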

Jar File Layout

How the files are packaged in the agent update binary jar file will be important. We want this to be a self-executing jar file, so we must have a manifest with the appropriate Main-Class property as well as have all the update code packages in the jar such that the classloader can find it. In addition, we must package the agent distribution in here as well. Additional files are probably also going to be needed, such as perhaps a properties file with version information in it.

Here is a first stab at how the jar file will be laid out:

/rhq-agent-update-version.properties (information about this agent)
/org/rhq/enterprise/agent/update/... (all the RHQ update code here)
/abc-corp/... (third party libraries needed by the update code - gnu.getopt, etc...)
/rhq-enterprise-agent-#.#.#.zip (the actual agent distro .zip file)
/README.txt (some helpful information; will be displayed with --help)
/LICENSE (the RHQ license file)

The "rhq-agent-update-version.properties" file will look like this:

rhq-agent.latest.version=1.2.0.GA
rhq-agent.latest.build-number=12345

to indicate the version/build number of the agent, after the update is applied.

The RHQ Server's Agent Update Servlet

Deployed in the RHQ Server's portal-war will be a standalone Agent Update Servlet. It will be accessed via simple HTTP GET - which means it is accessible using simply a browser, wget or other web client. However, its main purpose is to be accessed by the RHQ Agent - the agent will be able to ask the servlet for information about the Agent Update Binary as well as to download it (see the agent's "update" prompt command).

The servlet will be mapped to more than one URI to allow for it to support different types of requests. One request type would be to download an agent update binary. Another request will be to ask it what agent version the server supports (i.e. the version of the agent that can be downloaded from the servlet). Example:

<servlet>
  <servlet-name>AgentUpdateServlet</servlet-name>
  <servlet-class>org.rhq.enterprise.gui.agentupdate.AgentUpdateServlet</servlet-class>
</servlet>
<servlet-mapping>
  <servlet-name>AgentUpdateServlet</servlet-name>
  <url-pattern>/agentupdate/version</url-pattern>
</servlet-mapping>
<servlet-mapping>
  <servlet-name>AgentUpdateServlet</servlet-name>
  <url-pattern>/agentupdate/download</url-pattern>
</servlet-mapping>

Note that because we will deploy the servlet inside of our portal-war, our servlet will be able to make MBean and SLSB calls into the server in case it needs to do things like get the version of the server and to check to see if our server has been configured to disable the serving up of agent updates.

/agentupdate/download

The Agent Update Binary will be placed in the server, in rhq.ear/rhq-downloads/rhq-agent. The servlet will assume any file with a .jar extension is the Agent Update Binary (there can be only one). When the servlet is accessed via the URL "/agentupdate/download", this jar file will be streamed to the client issuing the HTTP GET request.

/agentupdate/version

Inside the jar file, the servlet will look for a file called "rhq-agent-update-version.properties". This .properties file should include information about the agent - such as its version and build number. This agent version information, along with the server information, will be returned in an HTTP GET response when the servlet is accessed via the URL "/agentupdate/version".
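
Reading the version file out of the jar does not require unpacking it. Below is a minimal sketch of how that lookup might be done with the standard java.util.zip classes; the class name AgentVersionReader is hypothetical and the real servlet may do this differently.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.Properties;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper that extracts the agent version information from the agent update binary. */
final class AgentVersionReader {
    static final String VERSION_FILE = "rhq-agent-update-version.properties";

    /** Loads the version properties stored at the root of the given jar. */
    static Properties read(Path agentUpdateJar) throws IOException {
        try (ZipFile jar = new ZipFile(agentUpdateJar.toFile())) {
            ZipEntry entry = jar.getEntry(VERSION_FILE);
            if (entry == null) {
                throw new IOException(VERSION_FILE + " not found in " + agentUpdateJar);
            }
            Properties props = new Properties();
            try (InputStream in = jar.getInputStream(entry)) {
                props.load(in);
            }
            return props;
        }
    }
}
```

The returned Properties object would then hold keys such as "rhq-agent.latest.version" and "rhq-agent.latest.build-number".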

Limiting and Disabling Agent Downloads

In the RHQ Server, there are several settings within rhq-server.properties that allow you to limit the number of concurrent messages coming into the RHQ Server from RHQ Agents. These are specific settings to limit comm-layer messages.

We introduced a similar kind of concurrency limit for agent downloads:

rhq.server.agent-downloads-limit=45

This is set in rhq-server.properties. In the future, you may be able to set this in the RHQ Server resource's Config tab, under the Concurrency Limit config group (see RHQ-1111).

This setting does not affect the comm-layer messaging, but serves a similar purpose. The agent update servlet will disallow any more than this number of concurrent downloads occurring at any one time within a single RHQ Server. This value is configurable on a per-RHQ Server basis to allow larger/higher-throughput machines to support downloading more concurrent agent update binaries than smaller/lower-throughput machines. Note that this value has an upper-limit equal to the "rhq.server.startup.web.max-connections" limit (since this max-connections setting sets the upper bounds on the total number of concurrent web requests allowed to come into the RHQ Server's Tomcat layer).

If the maximum number of concurrent downloads is currently in process, and another download request comes in, the servlet should reply to that request with an HTTP error code of 503 "Service Unavailable" to indicate that the agent needs to wait for a bit and resubmit its request. The 503 response should include the header "Retry-After" with a value being the number of seconds the agent should wait before attempting to ask again.
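
On the agent side, handling that reply could look roughly like the sketch below. The class name RetryAfter and the 60 second fallback are assumptions made for illustration; only the delay-in-seconds form of the header is handled, since that is the form described above.

```java
/** Hypothetical agent-side helper that interprets a 503 reply from the download servlet. */
final class RetryAfter {
    /** Assumed fallback when the header is missing or cannot be parsed. */
    static final long DEFAULT_WAIT_SECONDS = 60;

    /**
     * Returns the number of seconds to wait before asking again,
     * or -1 if the status code is not 503 (no retry is being requested).
     */
    static long secondsToWait(int status, String retryAfterHeader) {
        if (status != 503) {
            return -1;
        }
        if (retryAfterHeader == null) {
            return DEFAULT_WAIT_SECONDS;
        }
        try {
            long seconds = Long.parseLong(retryAfterHeader.trim());
            return seconds >= 0 ? seconds : DEFAULT_WAIT_SECONDS;
        } catch (NumberFormatException e) {
            // The HTTP-date form of Retry-After is not handled in this sketch.
            return DEFAULT_WAIT_SECONDS;
        }
    }
}
```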

If the "rhq.server.agent-downloads-limit" setting is set to 0, then the agent update servlet will reject any download request for the agent update binary. This means if any web client or agent sends an HTTP GET request to "agentupdate/download", the servlet will immediately reply with an HTTP error code of 403 "Forbidden".

We have a global server-cloud configuration setting in the database (in RHQ_SYSTEM_CONFIG) to disable agent downloads across the entire server cloud. If, for some reason, you never want agents to download updates, you could go to the Administration > Server Configuration UI page and say "No" to the option "Enable Agent Auto-Updates". This setting goes into RHQ_SYSTEM_CONFIG and the servlet would access that setting via the System Configuration SLSB - if it is unchecked, the servlet should immediately reply with an HTTP error code of 403 "Forbidden".
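
The rules in this section can be summarized in a small gate that the servlet consults before streaming the binary. This is a sketch under assumptions, not the servlet's actual code: the class name DownloadGate and its methods are hypothetical, the limit and the cloud-wide flag are passed in rather than read from rhq-server.properties and RHQ_SYSTEM_CONFIG, and a negative limit is assumed to behave like 0.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical gate deciding whether a download request may proceed. */
final class DownloadGate {
    static final int OK = 200;
    static final int FORBIDDEN = 403;
    static final int SERVICE_UNAVAILABLE = 503;

    private final int limit;
    private final boolean enabledCloudWide;
    private final AtomicInteger active = new AtomicInteger();

    DownloadGate(int limit, boolean enabledCloudWide) {
        this.limit = limit;
        this.enabledCloudWide = enabledCloudWide;
    }

    /**
     * Returns the HTTP status for a new download request.
     * After a 200 the caller must call release() once streaming has finished.
     */
    int tryBegin() {
        if (!enabledCloudWide || limit <= 0) {
            return FORBIDDEN; // downloads are disabled
        }
        if (active.incrementAndGet() > limit) {
            active.decrementAndGet(); // over the limit; give the slot back
            return SERVICE_UNAVAILABLE;
        }
        return OK;
    }

    /** Frees the slot taken by a successful tryBegin(). */
    void release() {
        active.decrementAndGet();
    }
}
```

A servlet using this would call tryBegin(), send the 403 or 503 (with "Retry-After") when told to, and otherwise stream the jar inside a try/finally that calls release().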

Performing Version Check To Determine If Update Is Needed

The agent has an "update" prompt command that is able to ask the server what version of the agent it has by doing an HTTP GET on http://<rhq-server>:7080/agentupdate/version (or the custom version URL if configured). This will return a simple name=version set of properties. We will make this generic so we can extend it in the future. For the first implementation, I suspect the only thing this will return in the response is:

rhq-server.version=1.2.0.GA
rhq-server.build-number=1852
rhq-agent.latest.md5=<the MD5 hashcode of the binary .jar>
rhq-agent.latest.version=1.2.0.GA
rhq-agent.latest.build-number=12345
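
The "rhq-agent.latest.md5" value presumably exists so that a client can verify the binary after downloading it. A minimal sketch of such a check follows, assuming the property holds the digest as a hex string; the class name Md5Check is hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper that verifies a downloaded agent update binary against its MD5. */
final class Md5Check {

    /** Computes the MD5 digest of the file as a lowercase hex string. */
    static String md5Hex(Path file) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                digest.update(buffer, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Returns true if the file's MD5 equals the expected hex string (case-insensitive). */
    static boolean matches(Path file, String expectedMd5) throws IOException {
        return md5Hex(file).equalsIgnoreCase(expectedMd5.trim());
    }
}
```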

In the future, I can see something like the following being sent back:

rhq-server.version=1.2.0.GA
rhq-server.build-number=1852
rhq-agent.latest.md5=<the MD5 hashcode of the binary .jar>
rhq-agent.latest.version=1.2.0.GA
rhq-agent.latest.build-number=12345
rhq-server.agents-supported.versions=[2.0,2.3)
rhq-server.agents-supported.build-numbers=[,1205]

where "rhq-server.agents-supported.build-numbers" could be "*" to mean all of them. The "agents-supported" properties could be useful if we want to say: "the server supports agent version 2.0 thru 2.2 with build numbers up to 1205" or "the server supports agent version 2.0, all builds". This kind of thing would presume we can violate the Prime Directive - and since our first implementation will not do that, this will not be implemented in the first go-around.
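
If we ever do implement this, the range notation could be evaluated roughly as sketched below. Everything here is an assumption about a feature that does not exist yet: the class name SupportedRange is hypothetical, "[" and "]" are taken to mean inclusive bounds, "(" and ")" exclusive bounds, an empty bound means unbounded, and only purely numeric dotted values (e.g. "2.2" or "1205") are handled, so a qualifier such as "GA" would have to be stripped first.

```java
/** Hypothetical evaluator for values like "[2.0,2.3)", "[,1205]" or "*". */
final class SupportedRange {
    private final String lower; // null means unbounded
    private final boolean lowerInclusive;
    private final String upper; // null means unbounded
    private final boolean upperInclusive;

    private SupportedRange(String lower, boolean lowerInclusive, String upper, boolean upperInclusive) {
        this.lower = lower;
        this.lowerInclusive = lowerInclusive;
        this.upper = upper;
        this.upperInclusive = upperInclusive;
    }

    static SupportedRange parse(String spec) {
        String s = spec.trim();
        if (s.equals("*")) {
            return new SupportedRange(null, true, null, true);
        }
        if (s.isEmpty()) {
            throw new IllegalArgumentException("Empty range");
        }
        char open = s.charAt(0);
        char close = s.charAt(s.length() - 1);
        int comma = s.indexOf(',');
        if ((open != '[' && open != '(') || (close != ']' && close != ')') || comma < 0) {
            throw new IllegalArgumentException("Bad range: " + spec);
        }
        String lo = s.substring(1, comma).trim();
        String hi = s.substring(comma + 1, s.length() - 1).trim();
        return new SupportedRange(lo.isEmpty() ? null : lo, open == '[',
                                  hi.isEmpty() ? null : hi, close == ']');
    }

    boolean contains(String version) {
        if (lower != null) {
            int c = compare(version, lower);
            if (c < 0 || (c == 0 && !lowerInclusive)) {
                return false;
            }
        }
        if (upper != null) {
            int c = compare(version, upper);
            if (c > 0 || (c == 0 && !upperInclusive)) {
                return false;
            }
        }
        return true;
    }

    /** Compares dotted numeric versions segment by segment; missing segments count as 0. */
    static int compare(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }
}
```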

The version string that the servlet returns for "rhq-agent.latest.version" (and "rhq-agent.latest.build-number", if we think we need it) will be determined at servlet runtime by examining the agent update binary jar itself (in the "rhq-agent-update-version.properties" file). This version information will be cached in a file internal to the servlet, stored in the rhq-downloads/rhq-agent directory ("rhq-server-agent-versions.properties" file).

The agent doesn't need to do any of the above when it needs to automatically check to see if it needs to update. The agent will have its version checked automatically by the server during the agent's startup registration and thereafter during its connectAgent calls (in the case the agent fails over to another server different from the one it registered on). If the server's version check fails, it means the agent is attempting to violate The Prime Directive - this causes an AgentNotSupportedException to be thrown from the server to the agent.

If the agent is not supported and agent updates are disallowed (either on the server side by turning off its agent update enable flag or on the agent side via its rhq.agent.agent-update.enable preference set to false), this agent will effectively be offline. A non-supported agent will not be able to successfully send any messages to the server - and if it can't update itself, then manual intervention by an administrator is required to clear up this misconfiguration. This usually involves manually installing a new, updated agent.
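
The possible outcomes of the version check can be summarized in a few lines. This is a simplified sketch, not the server's actual check: the names UpdateDecision and Outcome are hypothetical, and it assumes that only the version string is compared (the build number is ignored).

```java
/** Hypothetical summary of what happens to an agent after the server's version check. */
final class UpdateDecision {
    enum Outcome {
        SUPPORTED,   // agent version matches; nothing to do
        MUST_UPDATE, // agent is not supported but is allowed to auto-update
        OFFLINE      // agent is not supported and cannot update; an administrator must step in
    }

    static Outcome decide(String agentVersion, String supportedVersion,
                          boolean serverUpdatesEnabled, boolean agentUpdatesEnabled) {
        if (supportedVersion.equals(agentVersion)) {
            return Outcome.SUPPORTED;
        }
        // Both the server-side flag and the agent-side preference
        // (rhq.agent.agent-update.enable) must allow the update.
        if (serverUpdatesEnabled && agentUpdatesEnabled) {
            return Outcome.MUST_UPDATE;
        }
        return Outcome.OFFLINE;
    }
}
```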

Performing the Update

After an agent determines it is out of date and needs to update, the agent must stop everything it is doing and start the update processing. The agent will spawn a separate non-daemon thread called the "RHQ Agent Update Thread". This thread will first need to shut down all internal agent components (e.g. the comm layer and the plugin container). After the agent is shut down, with only the update thread running, the agent will do the following:

1) HTTP GET "server-transport://server-endpoint:server-port/agentupdate/download" (or whatever the configured download URL is)
2) Store the agent update binary somewhere on the local filesystem.
3) Execute "java" with "-jar" option passing in the agent update binary jar filename along with any hardcoded command line options (such as "--update").
4) After the process has been executed successfully, the agent returns from this Agent Update thread - but it also spawns a daemon "kill" thread that sleeps for a few minutes and, if it's still around, calls "System.exit(1)" to make sure the Java process for the original agent dies.
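
Steps 3 and 4 above can be sketched as follows. This is an illustration under assumptions rather than the agent's real code: the class name UpdateLauncher, the kill thread's name and the choice to let the child process inherit the agent's stdout/stderr are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that launches the agent update binary and arms the kill thread. */
final class UpdateLauncher {

    /** Builds the command line: <javaHome>/bin/java -jar <updateBinary> --update=<agentHome>. */
    static List<String> buildCommand(Path javaHome, Path updateBinary, Path agentHome) {
        List<String> cmd = new ArrayList<>();
        cmd.add(javaHome.resolve("bin").resolve("java").toString());
        cmd.add("-jar");
        cmd.add(updateBinary.toString());
        cmd.add("--update=" + agentHome);
        return cmd;
    }

    /** Starts the updater as a separate process. */
    static Process launch(List<String> command) throws IOException {
        return new ProcessBuilder(command).inheritIO().start();
    }

    /** Starts a daemon thread that forces this VM to exit if it is still alive after the delay. */
    static Thread startKillThread(final long delayMillis) {
        Thread killer = new Thread(() -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                return; // interrupted; give up on killing the VM
            }
            System.exit(1);
        }, "RHQ Agent Update Kill Thread");
        killer.setDaemon(true); // a daemon thread never keeps the VM alive by itself
        killer.start();
        return killer;
    }
}
```

Because the kill thread is a daemon thread, it only matters if something else is keeping the old VM alive; if the VM exits normally first, the thread simply disappears with it.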

At this point, the new agent update binary jar has been executed. It should sleep for a bit to wait for the original agent to die (can/should we perform some checks to confirm it is dead?). It should then perform its update tasks:

1) unpackage the rhq-agent directory found in the jar file, place it in some tmp location (<rhq-home>/update?).
2) copy the original plugins to the new agent (the new agent will update the plugins at startup)
3) copy the original _env.bat/sh files to the new agent; the new agent's env.bat/sh scripts should be renamed to "_env.bat/.sh.default"
3a) try to execute "chmod +x *.sh" on non-Windows platforms - warn on failure but do not exit the update
4) copy the existing data/failover.dat file (if it exists) into the new agent's data directory
5) purge the rest of the old agent (including its data/ directory and its lib, bin, logs, etc)
6) move the new agent to the old agent's location
7) launch the new agent's script

Open question: what if the agent script writes to data/rhq-agent-command-line.dat the full command that was used to execute the agent - this would be something like "rhq-agent-wrapper.sh start" or "rhq-agent.bat -c agent-configuration --cleanconfig". Can we use that to figure out how the agent was started and can we reuse that?

8) exit

If any of this fails, the agent will probably be dead in the water because presumably it needed to update because the server has been updated and its comm layer is different. However, this isn't always the case (most times it probably won't be) so the Agent Update Thread tries to start the original agent (via AgentMain.start()) to bring it back to where it was.
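
The "carry over" parts of the update tasks (steps 2 and 4) might look like the sketch below, using the standard java.nio.file API. The class name CarryOver is hypothetical, and the sketch assumes the plugins live in a top-level "plugins" directory of the agent home, which is not spelled out above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper that copies selected files from the old agent into the new one. */
final class CarryOver {

    /** Copies data/failover.dat into the new agent if it exists; returns true if it was copied. */
    static boolean copyFailoverList(Path oldHome, Path newHome) throws IOException {
        Path source = oldHome.resolve("data").resolve("failover.dat");
        if (!Files.isRegularFile(source)) {
            return false;
        }
        Path targetDir = Files.createDirectories(newHome.resolve("data"));
        Files.copy(source, targetDir.resolve("failover.dat"), StandardCopyOption.REPLACE_EXISTING);
        return true;
    }

    /** Copies every regular file from the old plugins directory; returns the number copied. */
    static int copyPlugins(Path oldHome, Path newHome) throws IOException {
        Path source = oldHome.resolve("plugins"); // assumed directory name
        if (!Files.isDirectory(source)) {
            return 0;
        }
        Path target = Files.createDirectories(newHome.resolve("plugins"));
        int copied = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(source)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    Files.copy(file, target.resolve(file.getFileName()),
                               StandardCopyOption.REPLACE_EXISTING);
                    copied++;
                }
            }
        }
        return copied;
    }
}
```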

Distributions

We may decide to stop shipping agent-only distributions (our maven builds can still produce them, we just will not publish them). Because all servers must be paired with a particular version of agent, we may ship one distribution that includes both the server and agent, thus ensuring we keep the proper versions of servers and agents paired together. If we need to update an agent (say, we need to patch a bug in the agent), we will ship that new agent in a new server such that you will update the server which will then auto-update all agents with the new code. This will be true even if the server code itself did not change. What this does is:

Reduces the chance of users installing incompatible versions of agents and servers
Provides a mechanism for administrators to pull down agent distributions directly from the server. The agent distro will be a self-contained binary that will be available via anonymous HTTP GET from the server; thus, something like:

wget --content-disposition http://<rhq-server>:7080/agentupdate/download

will pull down an agent that can be provisioned manually if an agent does not yet exist.

Updating Agent Running As A Windows Service / UNIX Boot Time Process

http://jira.rhq-project.org/browse/RHQ-1041

The auto-update feature needs to determine how it will be affected if the agent is running as a Windows Service or UNIX boot time process.

Agent Launch Scripts

First, we need to document all the different, but valid, ways the agent can be run on production machines (i.e. those cases where agent auto-update will be an important feature to help manage that running instance) using the various launch scripts. Note that there are other ways to start the agent; however, those other ways may not be able to allow the agent to be auto-updated but since these are not to be considered "normal" ways you would want to start the agent in production anyway we won't worry about supporting them for auto-update capability. These "other" ways are typically only used by developers during development/testing. Examples of these "non-production/developer" ways to start the agent are:

rhq-agent.bat --daemon starts the agent in foreground but as a daemon process, thus not accepting keyboard input (production alternative: install as a Windows Service which uses rhq-agent-wrapper.bat)
rhq-agent.sh --daemon --output=/tmp/agent.stdout & starts the agent in background using the main script (production alternative: use the rhq-agent-wrapper.sh script from the shell console, or install/start rhq-agent-wrapper.sh as an init.d script)

The different ways the agent can be started on production machines are:

Windows
1. In foreground, started in shell console: rhq-agent.bat <options>
2. In background, started at boot time as an installed Windows Service: rhq-agent-wrapper.bat start
UNIX/Linux/MacOS
1. In foreground, started in shell console: rhq-agent.sh <options>
2. In background, started in shell console, using init.d script: rhq-agent-wrapper.sh start
3. In background, started at boot time as an installed init.d script: /etc/init.d/rhq-agent-wrapper.sh start

The following describes the behavior of the launch scripts for each of the above ways the agent can be started:

Windows
1. In foreground, started in shell console: rhq-agent.bat <options>
  The launch script rhq-agent.bat reads in the (optional) file rhq-agent-env.bat to set environment variables needed by the launch script. The fallback values will come from the environment of the process/shell launching the script. The AgentMain class will get the same <options> passed to it as was passed to the launch script unless the environment variable RHQ_AGENT_CMDLINE_OPTS was set, in which case that variable contains the options to pass to AgentMain. The launch script is started in foreground and will not exit until the agent VM exits. rhq-agent.bat makes no attempt to restart the agent regardless of the exit code of the agent VM.
2. In background, started at boot time as an installed Windows Service: rhq-agent-wrapper.bat start
  The wrapper launch script will first invoke the rhq-agent.bat launch script in "setenv" mode which means it won't actually start the agent VM but it will set some environment variables needed by the wrapper script such as the agent home directory and where the Java executable is found (this includes executing the rhq-agent-env.bat file, if it exists). Once the environment is set up, it prepares to execute the Java Service Wrapper (JSW) by defining additional environment variables and then launching the JSW executable. This JSW executable invocation merely tells Windows to start the Windows Service. The Windows Service is configured to run the JSW executable in a way that actually launches the agent. It does so by examining the rhq-agent-wrapper.conf file (which in turn includes the optional .env and .inc files) - based on the .conf file, JSW determines what Java executable to run, what Java -X and -D options to pass in and what options to send to the AgentMain class, among other things. JSW is configured to restart the agent automatically when the agent exits with a non-zero exit code. The Windows Service that is installed runs this command line:
```
C:\path\to\rhq-agent\bin\wrapper\windows-x86_32\wrapper.exe
-s C:\path\to\rhq-agent\bin\\wrapper\rhq-agent-wrapper.conf
set.RHQ_AGENT_HOME=C:\path\to\rhq-agent
set.RHQ_AGENT_INSTANCE_NAME=rhqagent-HOSTNAME
set.RHQ_AGENT_JAVA_EXE_FILE_PATH=C:\path\to\java\bin\java.exe
set.RHQ_AGENT_OS_PLATFORM=windows-x86_32
set.RHQ_AGENT_WRAPPER_LOG_DIR_PATH=C:\path\to\rhq-agent\logs
```

UNIX/Linux/MacOS
1. In foreground, started in shell console: rhq-agent.sh <options>
  This operates in a very similar way as its Windows counterpart. It reads the (optional) rhq-agent-env.sh to define some environment variables, falling back to the variables defined in the environment of the starting process/shell. The AgentMain class will get the same <options> passed to it as was passed to the launch script unless the environment variable RHQ_AGENT_CMDLINE_OPTS was set, in which case that variable contains the options to pass to AgentMain. The launch script is started in foreground and will not exit until the agent VM exits. One caveat - if the environment variable RHQ_AGENT_IN_BACKGROUND is set, this launch script will write a pid file and put the agent in background and exit the script immediately. This feature, however, is normally reserved for use by the rhq-agent-wrapper.sh script and typically should not be used when invoking rhq-agent.sh directly from the shell console. rhq-agent.sh makes no attempt to restart the agent regardless of the exit code of the agent VM.
2. In background, started in shell console, using init.d script: rhq-agent-wrapper.sh start
  This will prepare to launch the agent in background by setting some environment variables (like RHQ_AGENT_IN_BACKGROUND to the value of the pid file that should be written) and invokes rhq-agent.sh. This also processes the rhq-agent-env.sh so it will pick up mostly the same environment as running the main agent launch script rhq-agent.sh. It starts the agent in background and then immediately exits the script. rhq-agent-wrapper.sh makes no attempt to restart the agent regardless of the exit code of the agent VM.
3. In background, started at boot time as an installed init.d script: /etc/init.d/rhq-agent-wrapper.sh start
  There really is no difference in the way the agent starts using this method versus the previous method. The only difference is the location of the rhq-agent-wrapper.sh script. But, for all intents and purposes, the location of this script (be it in <agent-home>/bin or /etc/init.d) doesn't matter - the script will detect where the agent home directory is and change its working directory so it can run the agent properly. Therefore, everything is the same running a copy of this script out of /etc/init.d as it is when running the script out of the agent home bin directory. Note that if you are to copy this script to /etc/init.d (or put it in the rc.d boot structure), be sure to bring along rhq-agent-env.sh with it, if you use that file to set up the agent environment.

Restarting the Agent

While I'm sure not impossible, the variations are so great that it would be very difficult to restart the agent in the exact manner that the agent was originally started. The above scenarios show you the different ways people might want to start the agent. The agent would have to know exactly how it was started (perhaps by squirreling away its command line arguments and initial script that was used) and then re-execute that command. However, this may not always be desired - for example, if the agent was started with --cleanconfig, we may not want to restart the agent with that option again.

Therefore, we will assume that the person initially provisioning the agent into a production environment will:

set up all the required environment and command line settings in rhq-agent-env.[sh,bat]
start the agent via the rhq-agent-wrapper.[sh,bat] launcher script

In order for the agent auto-update feature to work, it requires that the above is true. In other words, the agent's runtime environment must be started using the rhq-agent-wrapper launcher script, because that is the way the agent will be restarted after it has been auto-updated.

If you start the agent using the rhq-agent.[sh,bat] scripts, the agent can still be auto-updated, but you are not guaranteed to be able to have the agent automatically restarted because it's possible the rhq-agent-env script doesn't configure the agent fully and in the same manner as when the agent was first started. This can happen, for example, when the user who started the agent had set some RHQ_AGENT_xxx environment variables in his shell directly, not in the rhq-agent-env configuration script, or the user passed in some custom arguments to the agent (like --pref or --input) that would not otherwise be passed in. In addition, unless the user passed the --daemon command line argument to the agent, the agent probably is running in foreground with keyboard input. Restarting the agent after an agent auto-update will put it into background with no keyboard input. So, while it is possible for you to run the agent using other means, you are not guaranteed to have the agent run in the exact same way with the same configuration as when the agent auto-update restarts the new agent. Therefore, to be safe, you should always run the agent using rhq-agent-wrapper scripts when starting the agent for a production environment.

This means that when running on a Windows platform, the agent should have been installed as a Windows Service. In this case, the user must also have configured the agent with any custom settings via the rhq-agent-wrapper.inc and rhq-agent-wrapper.env configuration files. If it is not installed as a Windows Service, an attempt will be made to start the agent in a console window.
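
Choosing the restart command then comes down to picking the right wrapper script for the platform. A minimal sketch, assuming the scripts live in the agent home's bin directory; the class name RestartCommand is hypothetical, and the fallback of starting the agent in a console window when no Windows Service is installed is not covered here.

```java
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper that picks the wrapper script used to restart the updated agent. */
final class RestartCommand {

    /** Returns the command "<agentHome>/bin/rhq-agent-wrapper.[bat|sh] start" for the given OS name. */
    static List<String> forPlatform(String osName, Path agentHome) {
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        String script = windows ? "rhq-agent-wrapper.bat" : "rhq-agent-wrapper.sh";
        return Arrays.asList(agentHome.resolve("bin").resolve(script).toString(), "start");
    }
}
```

A caller would typically pass System.getProperty("os.name") as the first argument.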

Testing

Testing this stuff is a bit difficult because you have to get an agent to think it's an older version than what the server is expecting. In order to artificially violate the Prime Directive, you can use a small ANT script to modify an existing agent distribution or agent update binary so that agent thinks it's a particular version. Read the comments at the top of that ANT script to learn how to use it.

It may take a long time for the auto-update process to complete. First, the agent must shut down its internals gracefully, which may take a long time. It then must download the new agent update binary and spawn another Java VM that will apply the update, and then the new agent must have time to restart and re-establish communications with the server. All of this may take several seconds to several minutes to complete.

Stamping An Agent Distribution

If you already have an agent installed, you can stamp that agent with another version. In short, you need to run this:

ant -Dagent.home.dir=<your-agent-install-directory>

If you have an RHQ SVN working copy and you built your agent distribution there (found in modules/enterprise/agent/target/rhq-agent), then you might find it even easier to use either the Windows script or the UNIX script which runs this ANT command for you. There is even an Eclipse ANT launcher definition for Eclipse users to run this directly from the IDE.

This will inject some bogus version strings into the appropriate places so the agent will think it's that version, not the one it really is. You can then run the agent normally and it should immediately begin its agent-update process.

Stamping A Server

If you want several agents to auto-upgrade themselves, it may be easier to stamp the agent update binary itself (as found in the server) as opposed to stamping each of the individual agents. If you already have a server installed, you can stamp that server's agent update binary with another version. In short, you need to run this:

ant -Dserver.home.dir=<your-server-install-directory>

Things That Have Been Tested

Here are some things that were explicitly tested.

Auto-updating an agent on Windows when that agent was running in a foreground console window.
Auto-updating an agent on Windows when that agent was running as a Windows Service.
Auto-updating an agent on Solaris 9.
Auto-updating an agent on RHEL.
A two-computer (x86_64 Linux) "cloud" hosted 2 servers and 100 agents (evenly split - that is, 1 server and 50 agents running on each box). The platforms for each agent were imported. The 2 servers were shut down, their agent update binaries were stamped, and the servers were then restarted. All 100 agents successfully updated themselves.
Killing the server while the agent was downloading the agent update binary (this required connecting to the agent with a debugger to pause it during the download so the server could be shut down at the appropriate time). The agent retries every 60 seconds until it can successfully download and launch the agent update binary.
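
The retry behavior from the last test case can be sketched roughly as follows. This is only an illustration: the class name DownloadRetry, the Callable-based download hook and the configurable interval are assumptions made for this example and are not taken from the agent code.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: keeps retrying a download task until it succeeds. */
final class DownloadRetry {
    /**
     * Calls the download task until it returns without throwing.
     * Returns the number of attempts that were needed.
     */
    static int retryUntilSuccess(Callable<Void> download, long retryIntervalMillis) {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                download.call();
                return attempts; // downloaded successfully
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted during download", ie);
            } catch (Exception e) {
                // server unreachable or download failed; wait below, then retry
            }
            try {
                Thread.sleep(retryIntervalMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting to retry", ie);
            }
        }
    }
}
```

With the interval described above, a caller would pass 60000L as the second argument and a task that downloads the agent update binary as the first.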

Future Enhancements

Automatic Pre-Configuration of Agent Update Binary

Allow the server to manipulate the out-of-box agent update binary to set settings in agent-configuration.xml. The main use-case - allow the server to set the rhq.agent.server.bind-address so all agents that download the agent update binary from this server will connect to this server when that new agent is started.

Agent Platform-detection and Agent Update Binary Generation

Right now, all agent distributions are cross-platform. They include both .bat and .sh scripts, the Java Service Wrapper binaries and configuration (which are only valid for Windows), and the Sigar libraries for all platforms.

It would be nice for agents to tell the server which platform the agent is running on, and have the server send down the agent update binary appropriate for that platform. We could have several platform-specific agent update binaries - e.g. Windows binary would have the wrapper binaries but none of the UNIX Sigar binaries.

Older Notes That May Or May Not Be Relevant Now

These are preliminary notes I jotted down prior to our initial design meeting; they may not be relevant anymore, but I will leave them here for future reference.

Here are some issues that we need to solve before we can implement this feature:

Persisted data (command-spool.dat, inventory.dat, plugin-specific data) cannot be blindly kept or deleted - the update must be able to tell the agent what persisted data it can keep and what data it should purge. Note that we might be able to always delete inventory.dat blindly because the agent can sync itself fully from the server; however, if the agent spooled data while the server was in maintenance mode (MM), we should keep and resend that spooled data if possible. But a new agent update MAY in fact invalidate the spool data (for example, if a domain object changed its serializable UID, the agent will not be able to unspool the data without getting errors; nor can the agent send the data up to the server because the server would not be able to deserialize it). Therefore, we cannot blindly keep OR delete /data files - there must be a way that we can tell the agent, for example, "you must delete your spool data but you can keep your inventory data". This is true for plugin-specific data too (can we provide a way to say, "delete the JBossAS plugin's persisted data in /data"?).
- Perhaps we add a simple piece of data to the agent's .zip (update-metadata.properties) or .jar (META-INF/MANIFEST.MF) that indicates things like "purge-inventory=true", "purge-all-data=true", "purge-file=data/some-specific-file", "purge-config=true", etc.
On Windows, you can't delete a jar file while the VM's classloaders have it open.
You cannot update/overwrite a jar file because if a VM's classloader has it open, the class definitions will get corrupted or fail to load.
How does an agent know it's got a new update waiting? Where does the agent look to find out what its current version is? Where does it look to ask the server what versions are available for download?
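
To make the purge-metadata idea above more concrete, here is a minimal sketch of how an agent might read such directives. The property keys come from the note above; everything else is an assumption for illustration, including the class name UpdateMetadata, the use of a comma-separated list for purge-file (a properties file holds only one value per key) and the rule that unknown files are kept by default.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical reader for the purge directives shipped with an update. */
final class UpdateMetadata {
    final boolean purgeAllData;
    final boolean purgeInventory;
    final boolean purgeConfig;
    final List<String> purgeFiles;

    private UpdateMetadata(Properties p) {
        purgeAllData = Boolean.parseBoolean(p.getProperty("purge-all-data", "false"));
        purgeInventory = Boolean.parseBoolean(p.getProperty("purge-inventory", "false"));
        purgeConfig = Boolean.parseBoolean(p.getProperty("purge-config", "false"));
        // Assumption: several files are listed comma-separated under one key.
        List<String> files = new ArrayList<>();
        for (String f : p.getProperty("purge-file", "").split(",")) {
            if (!f.trim().isEmpty()) {
                files.add(f.trim());
            }
        }
        purgeFiles = files;
    }

    /** Parses the content of an update-metadata.properties file. */
    static UpdateMetadata parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalArgumentException("cannot read update metadata", e);
        }
        return new UpdateMetadata(p);
    }

    /** Decides whether a file (path relative to the agent home) must be purged. */
    boolean shouldPurge(String relativePath) {
        if (purgeFiles.contains(relativePath)) {
            return true;
        }
        if (purgeInventory && relativePath.equals("data/inventory.dat")) {
            return true;
        }
        return purgeAllData && relativePath.startsWith("data/");
    }
}
```

Keeping a file unless the metadata explicitly asks for its removal matches the concern above that spooled data should survive an update whenever possible.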

Here are some ideas to think about when deciding how to implement:

Have the agent send a "version" string in the out-of-band command config properties. The server's command authenticator can throw an "UnsupportedVersionException" if the agent's version is an old one that is no longer supported. We can pass things in the exception, like a URL, so the agent can be told immediately where it can go to download a newer agent.
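
A rough sketch of this idea follows. Only the exception name comes from the note above; the fields, the VersionCheckingAuthenticator class and the notion of a fixed set of supported versions are assumptions made for illustration, and the real command authenticator interface may look quite different.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical exception that tells an outdated agent where to get a newer one. */
class UnsupportedVersionException extends RuntimeException {
    private final String downloadUrl;

    UnsupportedVersionException(String agentVersion, String downloadUrl) {
        super("Agent version " + agentVersion + " is no longer supported");
        this.downloadUrl = downloadUrl;
    }

    String getDownloadUrl() {
        return downloadUrl;
    }
}

/** Hypothetical server-side check run against the version sent by the agent. */
final class VersionCheckingAuthenticator {
    private final Set<String> supportedVersions;
    private final String downloadUrl;

    VersionCheckingAuthenticator(String downloadUrl, String... supportedVersions) {
        this.downloadUrl = downloadUrl;
        this.supportedVersions = new HashSet<>(Arrays.asList(supportedVersions));
    }

    /** Rejects agents whose version string is missing or not supported. */
    void authenticate(String agentVersion) {
        if (agentVersion == null || !supportedVersions.contains(agentVersion)) {
            throw new UnsupportedVersionException(String.valueOf(agentVersion), downloadUrl);
        }
    }
}
```

An agent that catches the exception can read getDownloadUrl() and start its update from there instead of retrying the rejected command.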

Can the agent download the update to a tmp location and call System.exit(42), and have the rhq-agent.sh/bat script see the special exit code, know that it means the agent should be updated, and restart the agent immediately? The script could also purge the old agent, unzip the new agent distro and restart it.

Have the agent distro located in a simple location on the Server so a simple HTTP GET can download it. I don't think we need to go through the agent-server comm layer to simply download the agent.
- The agent should be configurable so it can be told where to look for the updates. If we publish the updates in rhq-downloads, we must make it so Tomcat can serve them directly (I do not think we want this to go through our comm layer). In addition, I suspect we'll want to be able to serve the updates from another web server, not the RHQ Server, because it's possible that 100s or 1000s of agents will request these update files and we do not want to clobber the RHQ Servers just so they can serve files on the order of 10s of MBs in size via HTTP GET. I envision being able to put an agent zip on a web server, alongside its metadata file that indicates its version information. The agent starts up, goes to its configured "rhq.agent.update-url", grabs the metadata info file, determines if this update is applicable and pulls down/applies the update if appropriate.

Rather than ask the agent script to perform the update, can we do it in Java? We first must completely unload the agent from the VM (the PC must be stopped, all agent threads killed and AgentMain's classloader completely unloaded). We then load in a very small, simple AgentUpdate class and have it perform the update, at which time it restarts AgentMain - or asks the script to restart it via System.exit(42).

We should ship with a standalone jar (agent-update.jar) that goes in /lib (maybe /lib/update). This should be so small that we never have to update it (because if we do, we are then stuck with the problem of how to update the updater itself). I envision the AgentUpdate class just being a simple loader class that loads ANT, and we ship an ANT script in the agent distribution. That ANT script will get extracted from the zip and run. The ANT script can perform all the update stuff. Our installer embeds ANT, so we can look at its code and duplicate what it does when needing to launch ANT from within Java code.

Careful with the System.exit() idea - that would kill the server if the agent is embedded.
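
The exit-code idea and the caution about the embedded agent can be combined in a small sketch. The exit code 42 comes from the notes above; the class UpdateExit, the embedded flag and the injectable exit hook are assumptions made so the example can be exercised without actually terminating the VM.

```java
import java.util.function.IntConsumer;

/** Hypothetical helper around the "special exit code" restart idea. */
final class UpdateExit {
    /** Exit code the launch script would interpret as "update, then restart". */
    static final int UPDATE_AND_RESTART = 42;

    /**
     * Agent side: asks the launch script to apply the update and restart.
     * Returns false without exiting when the agent is embedded in the server,
     * because exiting the VM would take the server down as well.
     */
    static boolean requestUpdateAndRestart(boolean embeddedInServer, IntConsumer exitHook) {
        if (embeddedInServer) {
            return false;
        }
        exitHook.accept(UPDATE_AND_RESTART);
        return true;
    }

    /** Launcher side: the decision the script would make from the agent's exit code. */
    static boolean shouldUpdateAndRestart(int agentExitCode) {
        return agentExitCode == UPDATE_AND_RESTART;
    }
}
```

A standalone agent would call UpdateExit.requestUpdateAndRestart(false, System::exit), while an embedded agent would pass true and needs some other way to apply the update.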

Before the agent attempts to make the initial connectAgent, it could make an HTTP GET request to "server-transport://server-endpoint:server-port/agentupdate/version". The default URL used by the agent will be the "/agentupdate/version" URL of the RHQ Server it is talking to. This URL is configurable via the agent preference rhq.agent.agent-update.version-url. The download URL is also configurable, as rhq.agent.agent-update.download-url. This is to allow a deployment environment where all agents can get their updates from a separate web server, apart from the RHQ Server. The agent will take the response, compare it to the agent's own version (which it can get via the Version class) and determine if the agent needs to update. If it does not need to update, it continues. If it does need to update, then the agent must abort everything it is doing and begin the update process.
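
The URL selection and version comparison described above might look something like the following sketch. It makes several assumptions that the notes do not settle: the class name VersionCheck is hypothetical, the server response is assumed to have already been reduced to a plain dotted numeric version string such as 1.2.0, and "needs to update" is taken to mean that the agent's version is older than the one reported.

```java
/** Hypothetical version-check helper for the agent startup sequence. */
final class VersionCheck {
    /** Uses the configured version URL if set, else the server's default one. */
    static String resolveVersionUrl(String configuredUrl, String serverBaseUrl) {
        if (configuredUrl != null && !configuredUrl.trim().isEmpty()) {
            return configuredUrl.trim();
        }
        return serverBaseUrl + "/agentupdate/version";
    }

    /** Compares dotted numeric versions; missing segments count as 0. */
    static int compareVersions(String a, String b) {
        String[] as = a.trim().split("\\.");
        String[] bs = b.trim().split("\\.");
        int n = Math.max(as.length, bs.length);
        for (int i = 0; i < n; i++) {
            int x = i < as.length ? Integer.parseInt(as[i]) : 0;
            int y = i < bs.length ? Integer.parseInt(bs[i]) : 0;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return 0;
    }

    /** Assumption: an update is needed when the agent is older than the reported version. */
    static boolean isUpdateNeeded(String agentVersion, String reportedVersion) {
        return compareVersions(agentVersion, reportedVersion) < 0;
    }
}
```

Comparing segment by segment as integers avoids the trap of plain string comparison, where 1.10 would sort before 1.2. Version strings with non-numeric qualifiers are not handled by this sketch.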

JBoss Community Archive (Read Only)

RHQ 4.9
