Design-Availability Checking

Availability Handling - Summary of Behavioral Changes

This is a summary of behavioral changes in RHQ Availability Handling as of RHQ 4.4 / JON 3.1. For a more detailed
description of design and changes see Design and Changes.

Change: Platform Children are now Marked UNKNOWN for a DOWN Platform

When an agent is down its platform resource will be marked DOWN but all of the platform children will now be marked UNKNOWN to represent that the RHQ server is not getting updated. In the past the children were also marked as DOWN. Note that DISABLED children will be left as DISABLED.

Change: Goes DOWN Alerts will Not Fire for Platform Children of a DOWN Platform

Existing Goes DOWN alert conditions will not fire when a resource is set to UNKNOWN. So, the new availability assignment for down agents can affect existing alerting. The intent is to be more accurate and avoid false positives but if the prior behavior is desired the alert conditions should be updated to Goes NOT UP, which is a new option.
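The difference between the two conditions can be sketched as follows. This is only an illustration: the class, enum and method names are hypothetical stand-ins rather than the RHQ alerting classes, and the Goes NOT UP semantics (any new type other than UP) are an assumption.

```java
// Illustrative stand-ins only; these are not the RHQ domain or alerting classes.
final class AvailConditionSketch {
    enum Avail { UP, DOWN, UNKNOWN, DISABLED }
    enum Condition { GOES_DOWN, GOES_NOT_UP }

    // Returns true if a change from 'previous' to 'current' satisfies the condition.
    static boolean matches(Condition condition, Avail previous, Avail current) {
        if (previous == current) {
            return false; // change conditions are evaluated on an actual change only
        }
        switch (condition) {
        case GOES_DOWN:
            return current == Avail.DOWN;
        case GOES_NOT_UP:
            // Assumed semantics: any new type other than UP satisfies the condition.
            return current != Avail.UP;
        default:
            return false;
        }
    }
}
```

With this sketch, a platform child that moves from UP to UNKNOWN because its agent went down does not match GOES_DOWN but does match GOES_NOT_UP.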

Change: Group Availability is Calculated Differently

The introduction of new availability types forced changes to the way group availability is determined. Group availability is now determined with the following algorithm, evaluated top to bottom in the table below:

Member Availability	Group Availability
Empty Group	Grey / EMPTY
All Down	Red / DOWN
Some Down or Unknown	Yellow / WARN
Some Disabled	Orange / DISABLED
All UP	Green / UP

Change: Availability Checks are No Longer Fixed at 5 Minute Intervals

Previously an availability check was performed on all resources with a 5 minute interval, and all resources were checked in one pass. Now, availability checking is performed based on the Availability metric schedule. If not set in the plugin descriptor the resource type's default availability check interval is based on its category:

Server 60s (1 minute)
Service 600s (10 minutes)
Platform not applicable, platform availability is determined by agent activity, not getAvailability() calls.
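A minimal sketch of how the default interval could be resolved, assuming a descriptor value takes precedence over the category default. The class, enum and method names are hypothetical, and the -1 return value for platforms is just a convention chosen for this example.

```java
// Illustrative stand-in; not the RHQ plugin metadata code.
final class DefaultAvailIntervalSketch {
    enum Category { PLATFORM, SERVER, SERVICE }

    static final long SERVER_DEFAULT_SECS = 60L;   // 1 minute
    static final long SERVICE_DEFAULT_SECS = 600L; // 10 minutes

    // descriptorIntervalSecs is null when the plugin descriptor sets no interval.
    // Returns the check interval in seconds, or -1 for platforms, whose availability
    // is determined by agent activity rather than getAvailability() calls.
    static long intervalSecs(Category category, Long descriptorIntervalSecs) {
        if (category == Category.PLATFORM) {
            return -1L;
        }
        if (descriptorIntervalSecs != null) {
            return descriptorIntervalSecs;
        }
        return category == Category.SERVER ? SERVER_DEFAULT_SECS : SERVICE_DEFAULT_SECS;
    }
}
```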

Change: Availability Checking is Now Performed Incrementally

Availability checking now happens incrementally. The availability job runs at 30 second intervals and not every resource is checked on each pass. Instead, checking is spread out, still respecting the desired intervals as much as possible, but in a fashion that avoids the 'peak and valley' issues of the past.

Change/Upgrade Note: All Resources Now Have Built-In Availability Metric

All non-platform resources now have a built-in Availability metric. The new Availability metric schedule will be added automatically to all types in updated plugins. So, for upgrades, new versions (updated MD5) of current plugins must be deployed. Custom plugins must be rebuilt and redeployed to get the new metric schedule.

This metric allows Template/Group/Resource level override for resource availability collection intervals.

Change: Agent Avail Prompt Command Behaves Differently and Has New Option

The Avail prompt command generates either a changes-only or a full report, and that is still true. Previously, however, it always performed an avail check on every resource. With the introduction of prioritized availability checking that is no longer the case: an avail check is performed only if there is no current availability for the resource or its scheduled check time has passed. A new option, --force, can be specified to force the availability checks. Note that this option will increase execution time.

Change: Agent Max Quiet Time is now Set to 5 Minutes

Back-filling of an agent's platform resources was performed after a 15 minute period of no communication from the agent. This period is set as the AGENT_MAX_QUIET_TIME_ALLOWED system setting. This was true of an agent shut down gracefully or one that went down unexpectedly. For upgrades and new installs this will now be set to 5 minutes, reduced due to architectural improvements.

Change: Agents Shut Down Gracefully will be Back-Filled Immediately

Graceful shutdowns were previously handled by the Agent Max Quiet Time mechanism and would not be detected as DOWN for several minutes. That is no longer true; they are now back-filled immediately after a successful shutdown.

Change: Remote API change to ResourceManagerRemote.getLiveResourceAvailability()

The remote API method ResourceManagerRemote.getLiveResourceAvailability() no longer returns null for unknown, it now properly returns AvailabilityType.UNKNOWN. This may affect existing remote clients or CLI scripts.
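Remote clients that must work against both the old and the new behavior could normalize the value defensively. The sketch below uses a local stand-in enum and hypothetical helper names; it does not call the remote API itself and makes no claim about its exact return type.

```java
// Illustrative stand-in; a real client would use the RHQ domain AvailabilityType.
final class LiveAvailClientSketch {
    enum Avail { UP, DOWN, UNKNOWN, DISABLED }

    // Older servers could yield null for "unknown"; newer ones return UNKNOWN.
    // Treating both the same keeps the client code version independent.
    static Avail normalize(Avail fromServer) {
        return fromServer == null ? Avail.UNKNOWN : fromServer;
    }

    // True only when the server positively reports UP.
    static boolean isKnownUp(Avail fromServer) {
        return normalize(fromServer) == Avail.UP;
    }
}
```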

Change: Fast Availability Reporting for Resource Lifecycle Operations (Start, Stop etc)

Operations now have the ability to request an immediate availability check after completion. All of the RHQ plugins have been updated for any Start/Stop/Restart operations. So, availability should typically be updated within 60s of the operation completing and can be reflected in the UI if it is refreshed.

Design and Changes

This section describes the selected work and design changes that help address the issues described in Summary of Legacy Availability Issues.

For progress, see Bug 741450 - RFE: Improve Availability Handling (Tracker)

Add two new AvailabilityTypes

UNKNOWN
- Used to formally allow an UNKNOWN avail type to support the fact that when the agent is not up we truly don't know if the resource is up or down, because we have no reporting.
DISABLED
- This has also been called ADMIN_DOWN. See more below.

The DISABLED Availability Type

A user can now set a resource to DISABLED availability. This can be useful in several ways, for example:

During maintenance windows to avoid alerting while cycling.
When discovered resources are expected to be DOWN, such as unused network adapters, to keep them out of the down resource portlet.
For resources reporting UP but not actually servicing requests due to reasons outside their view.

A couple of notes:

Once set DISABLED only a user can restore normal availability reporting.
The agent is unaware of the DISABLED status and will continue to report availability and metrics normally, as well as perform any other actions on the resource.
Disable/Enable requires Resource DELETE permission.
The ResourceManagerRemote allows remote clients and the CLI to Disable/Enable resources.
See Design-MTBF and MTTR for how the new avail types affect calculation of these values.

Group Availability

Group availability is similar to but not exactly the same as before. Previously group availability was based on a "0..1" value calculated as a ratio of DOWN to UP members. 0 indicated all members as DOWN and 1 indicated all members as UP. Since all members were either DOWN=0 or UP=1, a simple average delivered the availability ratio.

This no longer works with DISABLED and, to a degree, UNKNOWN group members. Now group availability is determined with the following algorithm, evaluated top to bottom in the table below:

Member Availability	Group Availability
Empty Group	Grey / EMPTY
All DOWN	Red / DOWN
Some DOWN/UNKNOWN	Yellow / WARN
Some DISABLED	Orange / DISABLED
All UP	Green / UP
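The top-to-bottom evaluation in the table can be sketched as below. The class, enum and method names are hypothetical; only the ordering of the rules is taken from the table.

```java
import java.util.List;

// Illustrative stand-in; not the RHQ group availability implementation.
final class GroupAvailSketch {
    enum Member { UP, DOWN, UNKNOWN, DISABLED }
    enum Group { EMPTY, DOWN, WARN, DISABLED, UP }

    // Applies the rules in table order; the first matching rule wins.
    static Group compute(List<Member> members) {
        if (members.isEmpty()) {
            return Group.EMPTY;
        }
        int down = 0, unknown = 0, disabled = 0;
        for (Member m : members) {
            if (m == Member.DOWN) down++;
            else if (m == Member.UNKNOWN) unknown++;
            else if (m == Member.DISABLED) disabled++;
        }
        if (down == members.size()) return Group.DOWN;   // all DOWN
        if (down > 0 || unknown > 0) return Group.WARN;  // some DOWN or UNKNOWN
        if (disabled > 0) return Group.DISABLED;         // some DISABLED
        return Group.UP;                                 // all UP
    }
}
```

Because the rules are ordered, a group with one DOWN and one DISABLED member evaluates to WARN, not DISABLED.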

Notes

The remote API method ResourceManagerRemote.getLiveResourceAvailability() no longer returns null for unknown, it now properly returns AvailabilityType.UNKNOWN. This may affect existing remote clients or CLI scripts.

Backfill sets platform children UNKNOWN

When we have not heard from the agent for some period of time, we assume it is down. This has not changed. What has changed is that we no longer set the children to DOWN. Instead, we set them to UNKNOWN, since there is no avail checking and we really don't know the status.

This behavior may be debatable. DOWN is more conservative from an alerting perspective. But UNKNOWN is more realistic and will not trigger false alerts.
Note that changes in alerting allow for some more flexibility here (see below).

Fix Bug 701092 - Turning off agent doesn't cause platform to show as down for 15mins

On graceful shutdown perform "backfilling" immediately. So, platform and child resources will have avail changes immediately.

Stop using avail reports as the agent "heartbeat"

Availability for an Agent's platform resource was determined by reception of availability reports. This had several issues:

It forced avail reports when there were no actual avail changes.
- Avail report handling is heavy, even for a single resource (the platform)
It removed flexibility in avail checking on the agent since we had to send reports.
It tied suspect agent handling to avail checking intervals, resulting in a 15 minute suspect job period (very slow recognition)

A new server service level ping job has replaced the avail reporting for agent health.

disable comm layer server ping, it is superseded with new ping mechanism
clock sync check is moved from comm layer ping to new ping
ping from Agent start to agent stop. This gives us clock sync at all times
- the ping optionally requests an avail update only when the PC is up (determined by a quick call to the PC instance)
  - so, a running agent with a stopped PC will not be seen as available, and may get backfilled.
use polling only when we need to determine sending mode
- start polling at agent start
- stop polling in ping if in sending mode, then just ping
- resume polling if ping fails
- In other words, poll when we're not pinging.
Add Agent.lastAvailabilityPing field, and use instead of lastAvailabilityReport for suspect job checking
new executor added to AgentMain.start()
ping interval is 1 minute
- re-purposed the client.server-polling-interval-msecs property (was already defaulting to 1 minute)
- can not be set lower than 1 minute to ensure we don't ping too often.
drop AGENT_MAX_QUIET_TIME_ALLOWED PROPERTY_VALUE from 15 to 5 minutes
- This is an excellent decrease, although it may be able to go lower. It could perhaps be calculated as a factor of the ping interval: ( 3 * pingInterval ) may be a good setting, since it would allow for a missed or delayed ping while still being a fairly small value.
Rename AgentLastAvailabilityReportComposite to AgentLastAvailabilityPingComposite
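The ping interval floor and the suspect check based on the last ping time can be sketched as below. This is a simplified illustration under stated assumptions: the class and method names are hypothetical, and the real suspect job works against the database rather than in-memory values.

```java
// Illustrative stand-in; not the RHQ agent or server code.
final class AgentPingSketch {
    static final long MIN_PING_INTERVAL_MSECS = 60_000L; // 1 minute floor

    // The configured interval is raised to the floor so the agent never pings
    // more often than once a minute.
    static long effectivePingIntervalMsecs(long configuredMsecs) {
        return Math.max(configuredMsecs, MIN_PING_INTERVAL_MSECS);
    }

    // An agent is treated as suspect (a backfill candidate) when its last ping
    // is older than the allowed quiet time.
    static boolean isSuspect(long lastAvailabilityPingMsecs, long nowMsecs, long maxQuietTimeMsecs) {
        return nowMsecs - lastAvailabilityPingMsecs > maxQuietTimeMsecs;
    }
}
```

With a 1 minute ping and a 5 minute quiet time, an agent can miss several consecutive pings before it is considered suspect.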

Add Availability Duration Alerting

This is a new alert category. Adding dampening support to avail checking, which is inherently stateless, seemed unintuitive and also very complex. And this isn't really dampening, which deals with repeated condition evaluation; it is more a deferred condition evaluation. So a new type of alerting was added instead: in addition to Availability Change alerts, there are now Availability Duration alerts.

Semantically it means fire an alert if the avail changes to A and stays at A for M minutes.

Stopped trying to force fit avail conditions into CHANGE_TO/FROM operators. Adding dedicated AlertConditionOperators provides more flexibility and removes a few hacks in the avail alerting code.
Worked in alerting for UNKNOWN, DISABLED, and added NOT_UP for convenience
Added new Triggered job (AlertAvailabilityDurationJob) for scheduling the deferred avail type checking.
A lot of the touched files result from moving AlertConditionOperator from server jar to domain jar. This made the operator values available to coregui for some conditional formatting. Most *CacheElement file changes were due to this refactor.
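The deferred evaluation ("changes to A and stays at A for M minutes") can be sketched as a pair of pure functions. The names are hypothetical and the logic is an assumption about how such a check could work, not the AlertAvailabilityDurationJob code: a check is scheduled when the change is seen, and when it runs it fires only if the resource is still in the same avail period.

```java
// Illustrative stand-in; not the RHQ AlertAvailabilityDurationJob.
final class AvailDurationSketch {
    enum Avail { UP, DOWN, UNKNOWN, DISABLED }

    // Time at which the deferred check should run for a change seen at changeTimeMsecs.
    static long checkTimeMsecs(long changeTimeMsecs, int durationMinutes) {
        return changeTimeMsecs + durationMinutes * 60_000L;
    }

    // Evaluated when the deferred check runs. Fires only if the resource is still at
    // the watched type and its current avail period began with the triggering change.
    static boolean shouldFire(Avail watched, long triggeringChangeMsecs,
                              Avail currentType, long currentSinceMsecs) {
        return currentType == watched && currentSinceMsecs == triggeringChangeMsecs;
    }
}
```

Comparing the start of the current avail period with the triggering change keeps a resource that went DOWN, recovered, and went DOWN again from firing on the first, stale check.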

Parent avail propagation

This is not new. Avail checking is already performed top-down. Propagate a non-UP parent avail to the children. It used to be that this was always a propagation of DOWN, but now UNKNOWN or DISABLED may be propagated.
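A minimal sketch of top-down propagation, assuming a simple in-memory tree. The class and field names are hypothetical; the point is that children of a non-UP parent inherit the parent's avail type and are not probed.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in; not the RHQ inventory or availability executor code.
final class PropagationSketch {
    enum Avail { UP, DOWN, UNKNOWN, DISABLED }

    static final class Node {
        final Avail probed;               // what a getAvailability() call would report
        final List<Node> children = new ArrayList<>();
        Avail effective;                  // result of the scan
        int probes;                       // how often this node was actually probed

        Node(Avail probed) {
            this.probed = probed;
        }
    }

    // 'inherited' is null when the parent is UP (or there is no parent).
    static void scan(Node node, Avail inherited) {
        if (inherited != null) {
            node.effective = inherited;   // non-UP parent: take its avail, skip the probe
        } else {
            node.probes++;
            node.effective = node.probed;
        }
        Avail forChildren = node.effective == Avail.UP ? null : node.effective;
        for (Node child : node.children) {
            scan(child, forChildren);
        }
    }
}
```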

Prioritized Avail Checking

This is a major change that has been discussed quite a lot. Previously all resources were checked on every avail scan, by default every five minutes. This caused CPU and/or memory spikes and did not provide any way to favor checking of critical resources over the many non-critical, service-level resources. The resulting changes are:

Provide resource-level granularity for collecting avail information.
Every non-platform resource type will have a built-in metric schedule called "Availability"
The value will be in seconds
If not set in the plugin descriptor the type's default will be based on its category
- Server 60s (1 minute)
- Service 600s (10 minutes, immediate out of box perf improvement)

By making Availability a metric schedule we:

get built-in template/group/resource granularities
leverage the existing UI
leverage the existing schedule sync
If disabled the avail defers to its parent's avail

For our most critical plugins we will review and set certain types to be collected rarely or not at all (because left as described above we'd actually be increasing the number of avail checks when actually we want to reduce avail check load, while providing more effective checking).

Agent side the Availability schedule is not treated like a standard schedule and it should be noted that it is not a "getValues" metric. The schedule is specifically used for avail checking and calls to getAvailability().
Reduces the default scan period to 30 seconds (from 5 minutes). The default should basically be the minimum allowed collection interval. For now that remains the same as for other metrics, 30s.
Only perform a resource avail check (a call to resContainer.getAvailability()) if:
- we have no current avail for the resource
  - typically at agent startup or initial import
- collection interval has expired since the last avail
- an ancestor has had an avail change to UP
  - when a resource changes to UP avail, all children must be checked as well. Note that if the change is to NOT UP no check is necessary, the children all get the parent's new avail type.
Spread out avail checking in order to avoid bursts and then quiet periods. Avoid the big scan every 5 minutes (the default for services). To do this:
- schedule the first check at a random time within one collection interval of the current time. That should basically distribute the load, seeing that the scans are now at 30s intervals. Subsequent checks are then set for checkTime+interval, maintaining the distribution. So, the first check will often happen earlier than the collection interval period, but should not be later, given decent loads.
Disabled avail metrics will force no checks at all for the resource, the avail type will defer to the parent's avail type.
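The check conditions and the randomized first check can be sketched per resource as below. This is a simplified illustration with hypothetical names, not the agent's availability executor; the force flag corresponds to the avail --force prompt option, and advancing from the scheduled time rather than the actual check time is an assumption made to preserve the distribution.

```java
import java.util.Random;

// Illustrative stand-in; not the RHQ agent's availability executor.
final class AvailScheduleSketch {
    private final long intervalMsecs;
    private long nextCheckMsecs;
    private boolean hasCurrentAvail;

    AvailScheduleSketch(long nowMsecs, long intervalMsecs, Random random) {
        if (intervalMsecs <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.intervalMsecs = intervalMsecs;
        // First scheduled check lands at a random point in [now, now + interval).
        this.nextCheckMsecs = nowMsecs + (long) (random.nextDouble() * intervalMsecs);
    }

    // Called on each scan pass to decide whether getAvailability() should be called.
    boolean shouldCheck(long nowMsecs, boolean ancestorChangedToUp, boolean force) {
        return force || !hasCurrentAvail || ancestorChangedToUp || nowMsecs >= nextCheckMsecs;
    }

    // Called after a check; advances the schedule in whole intervals so the
    // original random offset, and with it the load distribution, is kept.
    void recordCheck(long nowMsecs) {
        hasCurrentAvail = true;
        while (nextCheckMsecs <= nowMsecs) {
            nextCheckMsecs += intervalMsecs;
        }
    }

    long nextCheckMsecs() {
        return nextCheckMsecs;
    }
}
```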

avail --force

Added agent prompt command option: 'avail --force', which forces calls to getAvailability(). This is not the default.

Make ResourceContainer.availability transient.

Because the field was not transient, the current availability for a resource was written to disk in the agent's inventory file. As a result, at agent startup the last known avail was set in the resource container without ever doing an avail check, and this stale avail state could be reported to the server.

Changing this means that (currently) the agent will perform an avail check for every resource at startup. This is not too unlike the past, where every avail scan checked every resource. Still, we may want to revisit this to see if we can avoid the startup spike.

One option would be to process only "high priority resources" at startup. Those with collection intervals <= some threshold (like 1 minute, the default for servers). And leave less critical resources at UNKNOWN until their first check. This could just be a configurable option, if set to 0 (the default) check all, otherwise impose the threshold.
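The effect of marking the field transient can be shown with standard Java serialization. The Container class below is a hypothetical stand-in for ResourceContainer, and a String is used for the avail value to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Illustrative stand-in for the agent's persisted resource container.
final class TransientAvailSketch {
    static final class Container implements Serializable {
        private static final long serialVersionUID = 1L;
        final String resourceName;
        transient String availability; // transient: never written to the stream

        Container(String resourceName, String availability) {
            this.resourceName = resourceName;
            this.availability = availability;
        }
    }

    // Serializes and deserializes the container, as an inventory save and load would.
    static Container roundTrip(Container container) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(container);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Container) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

After the round trip the availability field is null, so no stale value can be reported before a real check has run.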

Don't always demand full scans after calls to InventoryManager.executeAvailabilityScanImmediately()

InventoryManager.executeAvailabilityScanImmediately() would always demand a full avail scan after it completed. Don't demand this if the scan did not detect any changes, since no resource avail states were changed.

Allow resource components to request availability checks

Add ability for plugin code to request that the PC perform a call to getAvailability() as soon as possible. This is useful when the plugin has performed an action that may have affected the avail state, such as a start or stop operation.

Added new AvailabilityContext, accessible through the ResourceContext, providing avail-related calls that can be made by the plugin code.
Deprecated ResourceContext.createAvailabilityCollectorRunnable() and moved that method to the new AvailabilityContext.
- updated the AS plugins to use the new signature.
Updated All RHQ maintained plugins to use the new requestAvailabilityCheck feature on lifecycle operations.
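A lifecycle operation that requests a check once it completes might look like the sketch below. The nested AvailContext interface is a hypothetical stand-in exposing only requestAvailabilityCheck(); the component class and its operation names are likewise illustrative and do not reproduce the plugin API.

```java
// Illustrative stand-ins; not the RHQ plugin API.
final class RestartableComponentSketch {
    // Stand-in for the avail-related calls made available to plugin code.
    interface AvailContext {
        void requestAvailabilityCheck();
    }

    private final AvailContext availContext;
    private boolean running = true;

    RestartableComponentSketch(AvailContext availContext) {
        this.availContext = availContext;
    }

    // Performs a lifecycle operation, then asks for an avail check as soon as
    // possible, since the operation has probably changed the avail state.
    String invokeOperation(String name) {
        switch (name) {
        case "start":
            running = true;
            break;
        case "stop":
            running = false;
            break;
        default:
            throw new IllegalArgumentException("Unsupported operation: " + name);
        }
        availContext.requestAvailabilityCheck();
        return running ? "started" : "stopped";
    }
}
```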

Allow resource components to enable/disable themselves

The user can already DISABLE/ENABLE resources via the GUI or CLI. In some circumstances it makes sense for plugin code to be able to do the same. AS7 has the notion of managed servers that may or may not be auto-started. A managed server that is not started is not UP or DOWN, but more logically DISABLED. This can be determined via the component code.

There is now AvailabilityContext.enable() and AvailabilityContext.disable(). These should be used sparingly, as the change is made blindly by the resource component, potentially altering the user's setting.

Upgrade Notes

DB Upgrade

db support for last ping time, initialized to last avail report time
update system setting for agent quiet time from 15 to 5 minutes
rhq_availability and rhq_resource_avail availType columns move to non null, and null values are set UNKNOWN

New Availability Metric Schedules

Prioritized Avail Checking introduces the new Availability metric schedule for all types. Types are updated with the Availability metric schedule when a plugin is updated. To be updated the plugin must be newer than the existing plugin in the db. This means it must have a different MD5, and a version/timestamp no less than the existing plugin. Note that the plugin does not need to change other than in a nominal fashion, like a change in build date. Anything that will force a new MD5.

RHQ: This is not a problem for RHQ plugins, which are all built for each release.
JON: For JON 3.1 we should ensure that the plugin packs have new builds of all 3rd party plugins, like BRMS.
Custom: Custom plugins will need to be updated as needed, by the customer/developers.

If RHQ is run with a plugin that has not been updated it is ok:

The Availability metric will not exist for the types of that plugin.
The category default intervals will be assigned to resources lacking an Availability metric schedule.
The plugin can be updated at any time.

Note that new versions of plugins can override default interval setting, but non-default intervals will never be overridden on existing types.

Initial Availability and ResourceAvailability Records

New resources now immediately get initial Availability and ResourceAvailability records. They are set to UNKNOWN, and the start time is set to epoch to indicate that we have no avail history for the resource. A second Availability record, and an update to the ResourceAvailability, should typically follow quickly, when the agent sends in an actual avail for the new resource. It is assumed that existing resources at upgrade time will already have at least one Availability record and therefore there is no upgrade work to ensure that state. Note that even if this assumption is false, which is very unlikely, there is code to correct the problem at runtime if detected.
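The record sequence for a new resource can be sketched as below, assuming each availability period has a start time and an end time that stays empty while the period is open. The class and field names are hypothetical and do not mirror the Availability entity.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in; not the RHQ Availability or ResourceAvailability entities.
final class InitialAvailSketch {
    static final class Period {
        final String type;
        final long startMsecs;
        Long endMsecs; // null while the period is still open

        Period(String type, long startMsecs) {
            this.type = type;
            this.startMsecs = startMsecs;
        }
    }

    final List<Period> history = new ArrayList<>();

    InitialAvailSketch() {
        // A new resource starts at UNKNOWN since epoch, meaning "no avail history".
        history.add(new Period("UNKNOWN", 0L));
    }

    Period current() {
        return history.get(history.size() - 1);
    }

    // Applies an avail reported by the agent: closes the open period and starts a
    // new one, unless the type is unchanged.
    void report(String type, long timeMsecs) {
        Period open = current();
        if (open.type.equals(type)) {
            return;
        }
        open.endMsecs = timeMsecs;
        history.add(new Period(type, timeMsecs));
    }
}
```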

How this will be tested:

Unit tests to be added:

Existing unit tests will be updated to incorporate the changes
New unit tests will be added for the alerting features
New unit tests will be added for Availability metric schedule handling
Others as the need is identified
Integration tests that need to be performed:
Existing integration tests will be updated to incorporate the changes
Others as the need is identified

Ideas/Notes/Futures

There are of course more things we can do. This section is for notes and ideas for availability handling past what is being done as described above.

Allow Resource Components to Disable/Enable Themselves

Setting a resource disabled is performed by a client, either a user or a remote/CLI client. By setting its
AvailabilityType to DISABLED it integrates nicely into GUI views and alerting. But it is fundamentally a
server-side setting, the agent continues managing the resource unaware.

In some rare cases plugin component code may be self-aware enough to intelligently decide that the underlying resource should be disabled or enabled. For example, an AS-7 Scenario:

pilhuhn: I am asking because in as7 we may have instances of managed servers in domain mode, that are configured but not started on purpose - which is like disabled.
pilhuhn: returning down is again the wrong semantic
jshaughn: You should not return disabled from getAvailability, Disabled is to be set only by a user/client.
jshaughn: so you're saying that you want your getAvailability method to actually report disabled?
pilhuhn: suppose I would not do that and manually disable it. Then I go to the server's resource, run operation(start) and then ?
pilhuhn: I would need to go to the parent resource to click on enable to tell the server "hey I have started the managed server"
pilhuhn: In my case is the DomainController able to report that the managed server is disabled

It's not in plan to make the agent DISABLED-aware. It introduces a lot of complexity. As would adding any new agent-aware avail type.

But, it may be possible to fairly easily allow resource component code to "act like a client" and request to the server to disable or enable themselves. It could feasibly be done via a new call introduced into the AvailabilityContext (newly available due to Allow resource components to request availability checks).

There is one downside that I can think of:

A component could enable themselves, countering a disable actually performed by a user.
- I'm not sure if this is a real-world issue.
- I don't see an easy way to prevent this.
- A possible workaround might be a plugin config property on the type, allowing or preventing it from performing enablement. This would have to be managed separately by the user.

CPU throttling

The approach does not attempt to throttle CPU.

Lukas' work here is still investigative and there are some concerns that need to be addressed:
- platform support
- the actual cost of throttling
- starvation of reporting
If worked out CPU throttling should hopefully be orthogonal, and complementary.

Using multiple threads

The approach remains single threaded.

Ian has ideas about using multiple threads, although we need to ensure the added complexity is needed. Some concerns:
It's not clear how to "bucket" the multiple threads. The approach assumed category-based avail check intervals.
Disjoint reporting issues? With different resources in a hierarchy handled by different threads.
May not be necessary given benefits of new approach
- Still, something to consider in another pass, after evaluating the current changes.
a good thought from Heiko: one thread could take care of one (or more) top-level servers below platform level.

MAINTENANCE Availability Type

Perhaps DISABLED will not be descriptive enough for all of the use cases.

pilhuhn: jshaughn I'd actually differentiate between DISABLED and MAINTENANCE - for the purpose of:
- SLA computations later on
- automatic setting of maintenance mode - resources that are DISABLED should not be marked as NON-DISABLED after maintenance. A network interface that is not connected needs to stay DISABLED
jshaughn: I can see the difference although I'd like to see how DISABLED goes over before introducing another avail type. It impacts quite a lot of things, although now that DISABLED exists, another type that is treated basically the same way may not be too hard to add.
pilhuhn: k - semantics are definitively different
jshaughn: At this point though, since the DISABLED state is completely controlled by the user, it may be sufficient.
pilhuhn: and MAINT / NON-MAINT may be controlled by automatic schedulers etc
jshaughn: I don't think it's necessarily easy, it would still require a lot of touch points, new iconography, etc. but certainly not as hard as what we've done in this round

Make Agent aware of DISABLED state

pilhuhn: if the agent were aware, it could skip over those - especially if the checks would be costly if the resource is in MAINT mode and the target thus not responding in timely fashion (e.g. because backend database is constraint)
jshaughn: pilhuhn: perhaps. Although also if the component is self-aware of its actual state it may also be able to do that on its own. It's something to maybe consider but again, I'd like to see how this goes because that introduces a variety of potential sync issues

Enhance plugin requestAvailabilityCheck() support

Currently resource component code can request an avail check on itself. mod-cluster, and maybe other plugins, may benefit from enhanced support due to operations that can affect sibling or child resource avails. So, enhanced support that:

allows an "include-children" option
allows a request on the parent (also with an "include-children" option).

Summary of Legacy Availability Issues, RHQ 4.3 and earlier

Availability is the RHQ way of reporting whether a resource is up, down or in an unknown state. It is a critical component in monitoring and alerting. Unfortunately, RHQ today has serious issues with availability handling.

Current Issues

Very slow reporting of availability changes.
availability reports are not sent frequently from the agent
5 or more minutes can pass before avail is actually updated
Very disconcerting in the UI when we report a status the user knows is incorrect.
Perhaps unusable for SLA outages.
Resource cycling can actually be missed.
Availability reports are waiting behind other long duration transfers
We alert only on availability change, not availability duration.
- Only offer avail conditions "goes up" and "goes down", not "is up" and "is down"
- No applicable dampening, which confuses, and then dismays, users
No notion of "admin down"
Poor handling when the RHQ Agent goes down, even gracefully.
- Slow recognition
- Avail set to down for all monitored resources on that agent.
- This can be misleading, unknown probably makes more sense

The main blocker to doing this better is performance. We need to be able to monitor and report availability for a large number of resources, over a large number of agents, with little delay and little overhead. And for alerting purposes, we need to enhance the reporting to support avail periods.

Mechanisms in play:

The agent must ask the resource container for its avail state.
the avail checking code is plugin dependent and has no guarantees on efficiency.
the avail check may hang for a resource that is truly down.
- Note that we do support an asynch avail reporting mechanism, which is useful for fast avail checking if implemented by the plugin.
  - Use of this does add more threads
The agent must report availability to the server.
The server must respond to availability changes on the resources.
The server must perform alerting.
Tables/Domain classes
- rhq_availability (resource avail history)
  - a row for each avail period
- rhq_resource_avail (resource current avail)
AvailabilityReport (the object passed from agent to server)
Checking avail on all resources: Parents, children, grandchildren...

Ideas

Note that some of these may already be done in the code, this was a brainstorming list. Also, not everything here needs to be implemented, again, these are just ideas.

Certainly fix https://bugzilla.redhat.com/show_bug.cgi?id=701092
Different frequencies for different categories
- More frequent for servers, less frequent avail checking for services
- Or, even more granularity
No avail checking for dependent types
- deferral to the parent's avail
Detach agent avail from avail reporting
- agent ping to server to prove it's alive
server ping to agent before backfilling
Only report avail changes (means keeping/comparing against previous report agent-side)
Store lastUpdateTime on ResourceAvailability to be able to calculate the length a resource has been at the current avail type
Agent side, process avail checks top down because any parent down implies all children are down. In fact, perhaps server side we assume that we won't get down avail below a down parent, and we handle (backfill) the children server-side.
make all avail checking async on the agent. meaning, the pc maintains the last known avail and checks against that (is that possible, efficient, do we do it already)
Divide up server side work among HA servers, each handling its connected agents (in memory timers)
probability-based approaches?
Have avail reports be sent out-of band or as 1st prio message to the server
Cache ResourceAvail table / last entry in Avail table
base avail on PID (PC could always do that for processes and only call getAvail() when pid is up)
doable where?
push avail from plugin?

Relevant Configuration Properties:

Availability Report Limit (rhq.server.concurrency-limit.availability-report)
Number of availability reports that can be processed concurrently; if zero or less, there is no limit
default 25
Agent Max Quiet Time (set on "Administration>SystemConfiguration>Settings" )
Number of minutes agent has to provide avail report before being backfilled
default 15 minutes
availability-scan.period-secs
the period between availability scans on the agent
default 5 minutes
rhq.agent.plugins.availability-scan.timeout
default 5 seconds

Relevant Wiki Pages

Relevant BZs

Questions/TODO

When do you get unknown availability?

When a resource is first imported before any data has been reported for the resource
Decide whether communications between servers is required.
Allow users to denote "important" resources for small checking intervals?

Priorities

High: Agent updates to reduce CPU utilization
Medium/High: Server side avail reporting
2k Resources + 2 AS , 1min checks => 6k availabilities per minute coming in? and vice versa - 1 agent with 50 ASes