Agent VM Health Check and Self-Healing

This is tracked via the JIRA issue http://jira.rhq-project.org/browse/RHQ-1035

We have noticed that under certain conditions (which tend to be related to the state of external resources being managed by the agent), the agent will go into a "bad" state, most notably running out of heap or perm-gen memory. These OutOfMemoryErrors (OOMs) are typically not correctable by the plugin container. Once an OOM occurs, the agent should consider itself to be in a critical condition.

When this happens, shutting down and restarting the agent corrects the problem (perhaps only temporarily, but eventually the situation should clear up so that the agent can run without hitting it again). However, this requires the user to manually log onto the agent machine, kill the agent process, and restart it. This is very cumbersome in environments that may have hundreds of agents or more; if many of them are affected, the manual restarts become very time-consuming.

The agent, however, now has the ability to periodically perform a "VM Health Check" via its VM Health Check thread. If the agent detects that its VM is critically low on memory (either heap or non-heap), it will attempt to heal itself automatically by quickly shutting down all its internal components (including the plugin container and all plugins) and then restarting them. This should free most of the memory used by the agent, the plugin container, and the plugins, heading off OutOfMemoryErrors that would otherwise likely have occurred shortly afterwards. In addition, if the agent has already hit an OOM condition, it will still attempt to heal itself.
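The kind of memory check described above can be sketched with the standard java.lang.management API. This is only an illustration, not the agent's actual code: the class name VmHealthCheckSketch, the 0.90 thresholds, and the restart hook are all assumptions made for the example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical sketch; names and thresholds are illustrative assumptions.
final class VmHealthCheckSketch {

    /**
     * Returns true when used/max exceeds the threshold.
     * MemoryUsage.getMax() is -1 when no maximum is defined (common for
     * non-heap memory); with no fixed ceiling to approach, the check is skipped.
     */
    static boolean isCriticallyLow(long used, long max, double threshold) {
        if (max <= 0) {
            return false;
        }
        return ((double) used / max) > threshold;
    }

    static boolean isCriticallyLow(MemoryUsage usage, double threshold) {
        return isCriticallyLow(usage.getUsed(), usage.getMax(), threshold);
    }

    /** One pass of the check; runs the restart hook if heap or non-heap is critically low. */
    static boolean checkOnce(double heapThreshold, double nonHeapThreshold, Runnable restartHook) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        boolean critical = isCriticallyLow(memory.getHeapMemoryUsage(), heapThreshold)
                || isCriticallyLow(memory.getNonHeapMemoryUsage(), nonHeapThreshold);
        if (critical) {
            restartHook.run(); // stand-in for shutting down and restarting internal components
        }
        return critical;
    }

    public static void main(String[] args) {
        boolean restarted = checkOnce(0.90, 0.90,
                () -> System.out.println("memory critically low: restarting internal components"));
        System.out.println("restart triggered: " + restarted);
    }
}
```

In a real agent, the restart hook would be the place where the plugin container and the other internal components are stopped and started again; here it only prints a message.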

This should help with the problem of requiring users to manually restart agents.

The VM Health Check thread is configurable: the agent has preferences starting with "rhq.agent.vm-health-check" that control this functionality. You can set its interval (the number of milliseconds the check thread pauses between VM health checks) and the thresholds that must be crossed for heap or non-heap memory to be considered critically low and thus trigger the agent reset.
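As a rough illustration of how such settings might be read and validated, the sketch below loads an interval and two thresholds from a java.util.Properties object. Only the "rhq.agent.vm-health-check" prefix comes from this page; the suffixes (interval-msecs, low-heap-mem-threshold, low-nonheap-mem-threshold), the default values, and the class name are assumptions for the example, so check the agent's configuration documentation for the actual preference names.

```java
import java.util.Properties;

// Hypothetical sketch; only the PREFIX value is taken from this page.
final class VmHealthCheckSettings {
    static final String PREFIX = "rhq.agent.vm-health-check";

    final long intervalMillis;        // pause between health checks
    final double lowHeapThreshold;    // fraction of heap in use considered critical
    final double lowNonHeapThreshold; // fraction of non-heap in use considered critical

    VmHealthCheckSettings(long intervalMillis, double lowHeapThreshold, double lowNonHeapThreshold) {
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException("interval must be positive: " + intervalMillis);
        }
        this.intervalMillis = intervalMillis;
        this.lowHeapThreshold = requireFraction(lowHeapThreshold);
        this.lowNonHeapThreshold = requireFraction(lowNonHeapThreshold);
    }

    private static double requireFraction(double value) {
        if (!(value > 0.0 && value <= 1.0)) {
            throw new IllegalArgumentException("threshold must be in (0, 1]: " + value);
        }
        return value;
    }

    /** Reads the settings, falling back to illustrative defaults for missing keys. */
    static VmHealthCheckSettings from(Properties props) {
        // The key suffixes and defaults below are assumptions, not confirmed names.
        long interval = Long.parseLong(props.getProperty(PREFIX + ".interval-msecs", "5000"));
        double heap = Double.parseDouble(props.getProperty(PREFIX + ".low-heap-mem-threshold", "0.90"));
        double nonHeap = Double.parseDouble(props.getProperty(PREFIX + ".low-nonheap-mem-threshold", "0.90"));
        return new VmHealthCheckSettings(interval, heap, nonHeap);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(PREFIX + ".interval-msecs", "60000");
        VmHealthCheckSettings settings = from(props);
        System.out.println("interval=" + settings.intervalMillis
                + " heap=" + settings.lowHeapThreshold
                + " nonHeap=" + settings.lowNonHeapThreshold);
    }
}
```

Validating the thresholds up front keeps a mistyped value (for example 90 instead of 0.90) from silently disabling the check.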

Currently (as of November 1, 2008), the agent's VM Health Check thread only checks for critically low memory/OutOfMemory conditions. In the future, the health check thread could also look for deadlocked threads and either a) attempt to kill those threads if possible or b) perform an agent shutdown/restart, as it does for the critically low memory condition.
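A deadlock check of the kind proposed above could be built on the standard ThreadMXBean API. The sketch below is an assumption about how it might look, not existing agent code; the class and method names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical sketch of a deadlock check; not part of the agent today.
final class DeadlockCheckSketch {

    /** Returns the IDs of deadlocked threads, or an empty array if there are none. */
    static long[] findDeadlockedThreadIds() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // findDeadlockedThreads() also covers java.util.concurrent locks but is
        // only usable when the JVM supports synchronizer usage monitoring.
        long[] ids = threads.isSynchronizerUsageSupported()
                ? threads.findDeadlockedThreads()
                : threads.findMonitorDeadlockedThreads();
        return (ids == null) ? new long[0] : ids; // both calls return null when no deadlock exists
    }

    public static void main(String[] args) {
        long[] ids = findDeadlockedThreadIds();
        if (ids.length == 0) {
            System.out.println("no deadlocked threads");
            return;
        }
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.getThreadInfo(ids)) {
            if (info != null) { // a thread may have terminated since detection
                System.out.println(info.getThreadName() + " is blocked on " + info.getLockName());
            }
        }
    }
}
```

Because Java offers no safe way to forcibly stop a thread (Thread.stop is deprecated), option b), restarting the agent's internal components, is likely the more practical response once a deadlock is detected.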
