Correlation Units for RHQ

This is still work in progress

Before I talk about the correlation units and what they should do, I want to present some scenarios:

Usage Scenarios

Scenario 1: The user has a more complex system whose functioning (or non-functioning) cannot be determined from a single resource. Take a database-backed webshop as an example. No incoming connections from the web is not an error, as it could be the middle of the night when customers are asleep. All incoming connections being in use is not an error either, as it means the shop is humming along nicely. (One could argue that this case is also bad, as there are probably more users out there waiting desperately to buy something.) From the database side, all connections being in use is not an error, as we probably designed the application to open a number of pooled connections. No open connections would also be OK if no user is buying anything.
Only a combination of the metrics can really determine an error condition: if all web connections are in use and no database connection is open, it is an error, as we know that the webshop needs the database for many of its actions.
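
As a minimal sketch of that combined check (the class, method, and parameter names are made up for illustration and are not part of RHQ), the condition could look like this:

```java
// Hypothetical sketch: class, method, and parameter names are illustrative only.
final class WebshopCheckSketch {

    /**
     * Returns true only for the combined error condition of scenario 1:
     * all web connections are in use while no database connection is open.
     */
    static boolean isErrorCondition(int webInUse, int webMax, int dbOpen) {
        boolean webSaturated = webInUse >= webMax;
        boolean dbIdle = dbOpen == 0;
        return webSaturated && dbIdle;
    }

    public static void main(String[] args) {
        System.out.println(isErrorCondition(0, 100, 0));    // quiet night: false
        System.out.println(isErrorCondition(100, 100, 20)); // busy shop: false
        System.out.println(isErrorCondition(100, 100, 0));  // web full, no DB: true
    }
}
```

Neither metric alone triggers the error; only the conjunction does.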

Scenario 2: The user has set up a cluster of three or more machines. From the business point of view, it is not an error if one of the cluster nodes stops working, as the other nodes can handle the load. For the admin crew, this would not be the highest severity either. So it should be possible to define which state of an application yields which alerting state.

Scenario 3: There are often situations where an application has known errors. A reboot of the application would get it (and the other nodes in the cluster) going for another period. So, for example, if an error (e.g. a memory leak) is detected in an application, it could be necessary to reboot the JBossAS it is running on, or perhaps even some external component.

Scenario 4: This is related to scenario 2. Dashboards should allow individual groups of users to display their own view of the state of affairs. For a manager it could be OK to just see red/yellow/green to know how the application behaves, while for certain technical or business groups that definition would be totally different.

Scenario 5: The resource is shown as down. Going to its inventory tab, the agent also shows as down. At the moment we cannot tell whether the platform is gone, the agent 'just died', or some other error happened. We should be able to run additional tests in order to better isolate what is going on.

What all scenarios have in common is that they cannot be handled with the current implementation of JBoss ON 2.

Introduction to Correlation Units

Correlation Units (CUs) serve the purpose of bringing a number of distinct input metrics and other sources of data (e.g. availability, operation results, matching events, ... ) together to form one output value that can be sent further to the next processing level. The output of one CU can be fed into another one and it is possible that one CU feeds more than one target. Also the corner case of only one source feeding into a CU is allowed (e.g. to adjust or negate the input value).

images/author/download/attachments/73139330/CorrelationUnits.png

In the figure above you see a few sources of information, the correlation units, and some "drains". I chose this special name because drains can be more than just the alert subsystem:

Green / yellow / red buttons to show the state of a CU at one glance
A gauge to see a more differentiated view
The alert subsystem

Correlation consists of two parts:

Normalization
Computation rules

images/author/download/attachments/73139330/CorrelationUnits1.png

Normalization / Mapping

The first step in correlation is to bring the input values into a common format, so that the next level can work on a single type of data and no longer needs to take special cases into account. This step also makes it possible to (re-)map values.

Examples:

map three value ranges of a metric to traffic-light values (0..10 => green, 11..90 => yellow, >90 => red)
map the availability to red / green (or even yellow if the resource is 'not that important')
map an allowed resource value band to green and out-of-band values to red.
map a failed operation to red and a good one to green
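
The four mappings above could be sketched roughly as follows. This is only an illustration: the ResultState values are the ones used by the prototype, but the method names, the treatment of values between 10 and 11, and the handling of negative or NaN input are my assumptions.

```java
// Hypothetical sketch: only the RED/YELLOW/GREEN/INVALID states come from the
// prototype description; method names and edge-case choices are assumptions.
final class NormalizationSketch {

    enum ResultState { RED, YELLOW, GREEN, INVALID }

    /** Metric ranges: up to 10 is green, up to 90 is yellow, above 90 is red. */
    static ResultState normalizeMetric(double value) {
        if (Double.isNaN(value) || value < 0) {
            return ResultState.INVALID; // assumption: negative or NaN input is invalid
        }
        if (value <= 10) {
            return ResultState.GREEN;
        }
        if (value <= 90) {
            return ResultState.YELLOW; // assumption: everything in (10, 90] is yellow
        }
        return ResultState.RED;
    }

    /** Availability: up is green; down is red, or yellow for less important resources. */
    static ResultState normalizeAvailability(boolean up, boolean important) {
        if (up) {
            return ResultState.GREEN;
        }
        return important ? ResultState.RED : ResultState.YELLOW;
    }

    /** Allowed band [low, high] is green; out-of-band values are red. */
    static ResultState normalizeBand(double value, double low, double high) {
        return (value >= low && value <= high) ? ResultState.GREEN : ResultState.RED;
    }

    /** Operation result: success is green, failure is red. */
    static ResultState normalizeOperation(boolean succeeded) {
        return succeeded ? ResultState.GREEN : ResultState.RED;
    }

    public static void main(String[] args) {
        System.out.println(normalizeMetric(5));                  // GREEN
        System.out.println(normalizeMetric(50));                 // YELLOW
        System.out.println(normalizeMetric(95));                 // RED
        System.out.println(normalizeAvailability(false, false)); // YELLOW
    }
}
```

Whatever the input type, the next level only ever sees a ResultState.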

Computation rules

The computation rules take the normalized values from the previous step and transform them into the output. These could be rules such as:

if input is red or yellow output is red
if any of the inputs is < 20 then output is 0
if more than two inputs are red then output is red
if more than one input is red then output is yellow

Rules can use at least the full set of boolean and comparison operators.
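
A rough sketch of how the colour-based rules above might be evaluated. It assumes that the rules are checked in order with the first match winning, that inputs pass through unchanged when the escalation rule does not apply, and that GREEN is the fallback; none of this is fixed by the design yet, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: rule ordering and fallback values are assumptions.
final class ComputationRuleSketch {

    enum ResultState { RED, YELLOW, GREEN, INVALID }

    /** Counts how many inputs are in the given state. */
    static int count(List<ResultState> inputs, ResultState state) {
        int n = 0;
        for (ResultState s : inputs) {
            if (s == state) {
                n++;
            }
        }
        return n;
    }

    /** "if input is red or yellow output is red"; other inputs pass through (assumption). */
    static ResultState escalate(ResultState input) {
        boolean redOrYellow = input == ResultState.RED || input == ResultState.YELLOW;
        return redOrYellow ? ResultState.RED : input;
    }

    /**
     * "more than two red => red", then "more than one red => yellow".
     * First match wins; GREEN is an assumed fallback when no rule matches.
     */
    static ResultState correlateByRedCount(List<ResultState> inputs) {
        int reds = count(inputs, ResultState.RED);
        if (reds > 2) {
            return ResultState.RED;
        }
        if (reds > 1) {
            return ResultState.YELLOW;
        }
        return ResultState.GREEN;
    }

    public static void main(String[] args) {
        // A three-node cluster as in scenario 2: losing one node is still GREEN.
        List<ResultState> oneDown =
            Arrays.asList(ResultState.RED, ResultState.GREEN, ResultState.GREEN);
        List<ResultState> twoDown =
            Arrays.asList(ResultState.RED, ResultState.RED, ResultState.GREEN);
        System.out.println(correlateByRedCount(oneDown)); // GREEN
        System.out.println(correlateByRedCount(twoDown)); // YELLOW
    }
}
```

The order matters: the stricter "more than two" rule has to be checked before the "more than one" rule, or it could never fire.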

Back to scenarios

While it is more or less obvious how correlation units solve some of the scenarios, for others it is not.

Scenario 5: This cannot be solved with only one agent. What we could do is use a second agent ("BOB"), which could even be the one embedded in the server, to try out-of-band communication with the target platform and/or agent. For example, if we know that the host whose agent is marked as down ("ALICE") hosts an httpd, we could ping that httpd from BOB. If it answers correctly, we know that the platform is there and can add a correlation rule "if (ALICE agent is down && httpd from BOB is up) mark platform(ALICE) as up". This way we know that at least the entire box is not down, which could be used, for example, to calculate priorities for ops personnel.
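
A small sketch of the ALICE/BOB rule, under the assumption that BOB's httpd check can be passed in as a callback so that it only runs when ALICE's agent is actually reported down. The names and the UNKNOWN state are hypothetical; the text above only defines the "mark as up" case.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch: names and the UNKNOWN state are assumptions.
final class OutOfBandCheckSketch {

    enum PlatformState { UP, UNKNOWN }

    /**
     * Applies "if (ALICE agent is down && httpd from BOB is up) mark platform(ALICE) as up".
     * The probe from BOB is only executed when ALICE's agent is reported down.
     */
    static PlatformState platformState(boolean aliceAgentUp, BooleanSupplier httpdReachableFromBob) {
        if (aliceAgentUp) {
            return PlatformState.UP; // a reporting agent implies a running platform
        }
        if (httpdReachableFromBob.getAsBoolean()) {
            return PlatformState.UP; // agent died, but the box itself still answers
        }
        return PlatformState.UNKNOWN; // assumption: no proof either way
    }

    public static void main(String[] args) {
        // Stand-in probes; a real one would contact ALICE's httpd from BOB.
        System.out.println(platformState(false, () -> true));  // UP
        System.out.println(platformState(false, () -> false)); // UNKNOWN
    }
}
```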

Prototype

Attached is a standalone prototype, CorrelationTest.zip. It implements domain objects for correlation, a CorrelationManagerBean as the engine, and a test class, Test1.java, that implements the following correlation:

images/author/download/attachments/73139330/CorrelationTest.png

Domain model

The domain model consists mainly of three domain objects:

NormalizationRule: Rules for normalization. There needs to be one class per input type. Depending on the subclass, there may be rule items attached to it that each define one expression. This is the case for MetricNormalizationRule, for example.
CorrelationRule: Rules for correlation. Each expression stating how many red, yellow, and green inputs result in which output is stored in one CorrelationItem attached to the CorrelationRule.
NormalizedData: Normalized data coming from the rules and flowing to other rules

NormalizationRules and CorrelationRules need to be stored in the database so that they can be reloaded on system startup.

images/author/download/attachments/73139330/CorrelationDomain.png

Other classes are

ResultState: the result of a rule, one of RED, YELLOW, GREEN, and INVALID
NormalizationOperator: enum that contains possible operators for various normalizations
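
To make the normalization side of the model more concrete, here is a rough sketch. The class names follow the list above, but every field, constructor, operator constant, and method signature is my assumption for illustration and not the code in CorrelationTest.zip (MetricRuleItem in particular is a made-up name).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: fields, constructors, and signatures are assumptions.
final class DomainModelSketch {

    enum ResultState { RED, YELLOW, GREEN, INVALID }

    /** Operators a normalization rule item may apply to a metric value. */
    enum NormalizationOperator {
        LESS_THAN, LESS_OR_EQUAL, GREATER_THAN, GREATER_OR_EQUAL;

        boolean test(double value, double threshold) {
            switch (this) {
                case LESS_THAN: return value < threshold;
                case LESS_OR_EQUAL: return value <= threshold;
                case GREATER_THAN: return value > threshold;
                default: return value >= threshold;
            }
        }
    }

    /** Output of a rule, tagged with the name of the rule that produced it. */
    static final class NormalizedData {
        final String sourceRule;
        final ResultState state;

        NormalizedData(String sourceRule, ResultState state) {
            this.sourceRule = sourceRule;
            this.state = state;
        }
    }

    /** Base class; one subclass per input type. */
    abstract static class NormalizationRule<T> {
        final String name;

        NormalizationRule(String name) {
            this.name = name;
        }

        abstract NormalizedData normalize(T input);
    }

    /** One expression of a metric rule: "value <operator> threshold => result". */
    static final class MetricRuleItem {
        final NormalizationOperator operator;
        final double threshold;
        final ResultState result;

        MetricRuleItem(NormalizationOperator operator, double threshold, ResultState result) {
            this.operator = operator;
            this.threshold = threshold;
            this.result = result;
        }
    }

    static final class MetricNormalizationRule extends NormalizationRule<Double> {
        final List<MetricRuleItem> items = new ArrayList<>();

        MetricNormalizationRule(String name) {
            super(name);
        }

        MetricNormalizationRule add(NormalizationOperator op, double threshold, ResultState result) {
            items.add(new MetricRuleItem(op, threshold, result));
            return this;
        }

        /** First matching item wins; no match yields INVALID (assumption). */
        @Override
        NormalizedData normalize(Double value) {
            for (MetricRuleItem item : items) {
                if (item.operator.test(value, item.threshold)) {
                    return new NormalizedData(name, item.result);
                }
            }
            return new NormalizedData(name, ResultState.INVALID);
        }
    }
}
```

With this shape, a traffic-light rule is just three items added in order, for example LESS_OR_EQUAL 10 => GREEN, LESS_OR_EQUAL 90 => YELLOW, GREATER_THAN 90 => RED.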

Algorithm

CorrelationManagerBean does the work. Rules (Normalization and Correlation) can be inserted into a rules store, which is a JBossCache instance, where each type of rule has its own subtree.

Incoming data is handled via insert*() methods. These check the rules store for a rule matching this input. If none is present, they return. Otherwise, the rule is obtained and its normalize() method is called. The resulting NormalizedData is stored in the working memory along with the name of the rule that created it, and correlation is fired.

The working memory is another JBossCache instance where data is stored with the name of the source rule as key.

images/author/download/attachments/73139330/CorrelationFlow.png

The first thing to check when correlation is fired is whether there are rules that take input from the sending rule. If there are none, the method returns.
Otherwise, a loop over the rules found checks, for each rule, whether enough input data is available in the working memory to compute the result. If so, the result is computed and put into the working memory. Then the next round of correlation is fired for the NormalizedData item just created.

If the correlation rule found is an instance of ActionRule, then the respective action is triggered. This could be sending an alert, executing an operation, and so on.
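
The flow described in this section can be sketched in a few lines. This is not the CorrelationManagerBean from the prototype: plain in-memory maps stand in for the two JBossCache instances, rules are reduced to lambdas, an ActionRule merely records that it fired, and the sketch assumes the rule graph has no cycles. All signatures are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.DoubleFunction;
import java.util.function.Function;

// Hypothetical sketch: maps replace JBossCache, and all signatures are assumptions.
final class CorrelationEngineSketch {

    enum ResultState { RED, YELLOW, GREEN, INVALID }

    /** A correlation rule: the names of its inputs and a function computing the output. */
    static class CorrelationRule {
        final String name;
        final List<String> inputs;
        final Function<List<ResultState>, ResultState> compute;

        CorrelationRule(String name, List<String> inputs,
                        Function<List<ResultState>, ResultState> compute) {
            this.name = name;
            this.inputs = inputs;
            this.compute = compute;
        }
    }

    /** A correlation rule whose result also triggers an action (here: a log entry). */
    static class ActionRule extends CorrelationRule {
        ActionRule(String name, List<String> inputs,
                   Function<List<ResultState>, ResultState> compute) {
            super(name, inputs, compute);
        }
    }

    // Rules store and working memory; the prototype uses JBossCache for both.
    private final Map<String, DoubleFunction<ResultState>> normalizationRules = new HashMap<>();
    private final List<CorrelationRule> correlationRules = new ArrayList<>();
    private final Map<String, ResultState> workingMemory = new HashMap<>();
    final List<String> triggeredActions = new ArrayList<>();

    void addNormalizationRule(String name, DoubleFunction<ResultState> rule) {
        normalizationRules.put(name, rule);
    }

    void addCorrelationRule(CorrelationRule rule) {
        correlationRules.add(rule);
    }

    /** insert*() step: normalize the input if a rule exists, then fire correlation. */
    void insertMetric(String ruleName, double value) {
        DoubleFunction<ResultState> rule = normalizationRules.get(ruleName);
        if (rule == null) {
            return; // no rule for this input
        }
        workingMemory.put(ruleName, rule.apply(value));
        fireCorrelation(ruleName);
    }

    /** Evaluates every rule fed by 'source' whose inputs are all in the working memory. */
    private void fireCorrelation(String source) {
        for (CorrelationRule rule : correlationRules) {
            if (!rule.inputs.contains(source)) {
                continue;
            }
            List<ResultState> states = new ArrayList<>();
            for (String input : rule.inputs) {
                ResultState state = workingMemory.get(input);
                if (state != null) {
                    states.add(state);
                }
            }
            if (states.size() < rule.inputs.size()) {
                continue; // not enough input data yet
            }
            ResultState result = rule.compute.apply(states);
            workingMemory.put(rule.name, result);
            if (rule instanceof ActionRule) {
                triggeredActions.add(rule.name + "=" + result);
            }
            fireCorrelation(rule.name); // next round for the data just created
        }
    }

    ResultState stateOf(String name) {
        return workingMemory.get(name);
    }

    public static void main(String[] args) {
        CorrelationEngineSketch engine = new CorrelationEngineSketch();
        engine.addNormalizationRule("web", v -> v > 90 ? ResultState.RED : ResultState.GREEN);
        engine.addNormalizationRule("db", v -> v == 0 ? ResultState.RED : ResultState.GREEN);
        engine.addCorrelationRule(new ActionRule("shop", Arrays.asList("web", "db"),
            s -> s.get(0) == ResultState.RED && s.get(1) == ResultState.RED
                ? ResultState.RED : ResultState.GREEN));
        engine.insertMetric("web", 95); // "shop" cannot be computed yet
        engine.insertMetric("db", 0);   // now both inputs are present
        System.out.println(engine.triggeredActions);
    }
}
```

Note that a rule stays silent until all of its inputs have reported at least once, which matches the "enough input data" check described above.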

Optimizing possibilities

Correlation rules are keyed by String, and the handling of the input path of JBossCache is done via Strings too.
Using Integer / FQN objects here could help speed things up (a little). But for the prototype it was better to keep it simple (KISS).

Note on adding to RHQ

While adding this to a single RHQ release might be too big a step, we could still implement it and have a magic setting in the RHQ_SYSTEM_CONFIG table that simply disables the feeding of data into correlation for the first release. Users could then play with correlation in snapshots, while we would still be able to ship a stable release that contains the correlation code.

Work to do

Some things that come to mind

Loading and storing rules in database
Initial loading of rules at system startup
Rule editors (Normalization on resource / metric / operation)
Correlation editor (we used to have a graphical version in our old product, which was very nice to use)
Hooking ActionRule into Alert processing and operations
Writing more NormalizationRules for various types of input (Events, ...)
Write a Correlation result portlet that, when an item is clicked, shows the tree of all involved rules / inputs
Allow comparator expressions for CorrelationRuleItems ( <, ==, > , != ) in addition to the fixed number ones
