Alerting 2.0
This page describes a potential future development of the alerting functionality.
None of this is implemented right now.
This is a large effort, but it would work around quite a few restrictions we
have today. Some of the new features are:
- comparing two metrics with each other (e.g. used memory is > 80% of
  total memory)
- comparing metrics from different resources (e.g. the load balancer is
  down, or one of my 3 app servers in the cluster is down)
- a temporal component in the reasoning (e.g. if it is night and only 1 of 3
  servers is down, send an email; if it is day, always send a text message;
  default: send an email and a text message and make a phone call)
- finding an outlier in a group (e.g. the load on one of my 3 boxes in the
  cluster is higher than on the others).
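The outlier case above can be sketched in a few lines. This is a minimal
illustration, not RHQ code; the class and method names are hypothetical, and
"outlier" here simply means a member whose value exceeds the mean of the other
members by a configurable factor.

```java
// Hypothetical sketch: find a group member whose metric value is much
// higher than the mean of the remaining members.
public class GroupOutlierCheck {

    /**
     * Returns the index of a member whose value exceeds the mean of the
     * remaining members by more than 'factor', or -1 if there is none.
     */
    public static int findOutlier(double[] values, double factor) {
        for (int i = 0; i < values.length; i++) {
            double sum = 0;
            for (int j = 0; j < values.length; j++) {
                if (j != i) sum += values[j];
            }
            double meanOfOthers = sum / (values.length - 1);
            if (values[i] > meanOfOthers * factor) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        double[] load = {0.4, 0.5, 2.4}; // load on 3 cluster boxes
        System.out.println(findOutlier(load, 2.0)); // box 2 is the outlier
    }
}
```

A real rule engine (e.g. Drools Fusion) would express this declaratively over
a resource group rather than over a plain array, but the comparison itself is
the same.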
Proactive alerting: We need to be able to warn the user in advance when a
metric is about to become problematic, e.g. one day ahead. This is not the
same as just alerting when a value reaches a certain threshold. For example,
a disk drive that is between 89 and 91% full is no issue, so putting an alert
on 90% is wrong. But if we can extrapolate from some long-term samples (e.g.
the 1h or 1d values) that, at the current fill rate, the disk will be full in
3 days, we should inform the user - especially before a weekend.
There are many more aspects to this that I can detail further.
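The extrapolation idea can be sketched with a simple least-squares fit over
recent samples. This is an illustrative sketch only; the class name and the
linear model are assumptions, and a real implementation would likely use the
existing 1h/1d aggregates rather than raw samples.

```java
// Hypothetical sketch: fit a line through recent (hour, usedPercent)
// samples and predict how many hours remain until the disk hits 100%.
public class DiskFillForecast {

    /**
     * Least-squares linear fit; returns predicted hours until 100% usage,
     * or -1 if usage is flat or shrinking.
     */
    public static double hoursUntilFull(double[] hours, double[] usedPercent) {
        int n = hours.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += hours[i];
            sy += usedPercent[i];
            sxx += hours[i] * hours[i];
            sxy += hours[i] * usedPercent[i];
        }
        double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        if (slope <= 0) {
            return -1; // not filling up
        }
        double intercept = (sy - slope * sx) / n;
        double tFull = (100.0 - intercept) / slope; // time where line hits 100%
        return tFull - hours[n - 1];                // relative to latest sample
    }

    public static void main(String[] args) {
        // one sample per hour: disk grows 0.5% per hour, currently at 89%
        double[] t = {0, 1, 2, 3};
        double[] used = {87.5, 88.0, 88.5, 89.0};
        System.out.printf("full in %.0f hours%n", hoursUntilFull(t, used));
    }
}
```

With the sample data above the disk is at 89% - below any naive 90% threshold -
yet the forecast already shows it will be full in under a day.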
The basic idea, though, is to get rid of our homegrown engine and to use
Drools Fusion. We would keep our current UI, but on top of that also offer a
DSL and/or the raw .drl language for writing alert rules.
Currently we have pretty lazy collection intervals, which are fine for
long-term graphs, where you don't need per-minute resolution, but which are
unusable for serious alerting (this is done to reduce the load on RHQ - and
especially on the server/database; Cassandra is supposed to help here).
The idea now is to have much more aggressive collection intervals for
alerting, where the metric is obtained e.g. every minute and compared
against the alerting conditions inside the agent, and only forwarded to
the server if an alerting condition is met (or the normal collection
interval is due anyway).
This way one would get fast alerting, but still low(er) load on the
server and database.
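The agent-side decision described above is essentially a small filter. The
sketch below is an assumption about how it could look, not existing RHQ agent
code; all names are hypothetical.

```java
// Hypothetical sketch: collect every minute, but only forward a reading
// to the server when an alert condition matches or the normal (lazy)
// reporting interval is due.
public class AgentSideEvaluator {

    private final double threshold;    // alert condition: value > threshold
    private final long reportEveryNth; // normal (lazy) reporting interval
    private long samplesSeen = 0;

    public AgentSideEvaluator(double threshold, long reportEveryNth) {
        this.threshold = threshold;
        this.reportEveryNth = reportEveryNth;
    }

    /** Returns true if this sample should be forwarded to the server. */
    public boolean shouldForward(double value) {
        samplesSeen++;
        boolean conditionMet = value > threshold;
        boolean normalIntervalDue = samplesSeen % reportEveryNth == 0;
        return conditionMet || normalIntervalDue;
    }

    public static void main(String[] args) {
        AgentSideEvaluator eval = new AgentSideEvaluator(80.0, 10);
        System.out.println(eval.shouldForward(95.0)); // condition met -> true
    }
}
```

With e.g. a one-minute collection interval and reportEveryNth = 30, the server
only sees one reading per half hour unless a condition fires - which is the
load/latency trade-off the text describes.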
This item is of course related to the previous one, but could possibly
be addressed independently.
When an alert is triggered, we should try to bring up a panel that
shows related data for that point in time. This could be log file entries
from the seconds before, or "suspect metrics" on related resources, etc.
IT departments often need to fulfill contracts about the quality of the
service they are delivering. To do this, they need to measure the actual
quality of their service, compare it against some defined values and
afterwards create reports about the actual quality and possible violations.
Defining SLs for resources, groups and applications: Service levels consist
of definitions for the availability and performance of applications and
resources. These need to be defined and combined into packages that
correspond to a service level agreement.
Measuring availability and other SL components: Measurement data needs to
be taken, aggregated and matched against the service level definitions
from above.
Alerting about possible SLA violations: If service levels are about to be
violated, it is necessary to alert the operations people so that they can
try to remedy the service.
Reporting: Reports need to be generated that show the service level
definitions and the measured values (and, at the end, whether the SLA has
been met or not).
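The "matched against the service level definitions" step boils down to a
simple comparison once availability has been aggregated. The following is a
minimal sketch under that assumption; the names are illustrative, not RHQ API.

```java
// Hypothetical sketch: compare measured uptime over a reporting period
// against an agreed availability target.
public class SlaCheck {

    /** Measured availability in percent over a reporting period. */
    public static double availabilityPercent(long upMinutes, long totalMinutes) {
        return 100.0 * upMinutes / totalMinutes;
    }

    /** True if the measured availability meets the SLA target. */
    public static boolean slaMet(long upMinutes, long totalMinutes,
                                 double targetPercent) {
        return availabilityPercent(upMinutes, totalMinutes) >= targetPercent;
    }

    public static void main(String[] args) {
        // 30-day month = 43200 minutes; 99.9% allows ~43 minutes of downtime
        long total = 30 * 24 * 60;
        System.out.println(slaMet(total - 40, total, 99.9)); // true
        System.out.println(slaMet(total - 60, total, 99.9)); // false
    }
}
```

The "about to be violated" alerting from the list above would compare the
remaining downtime budget (here ~43 minutes/month at 99.9%) against the
downtime already consumed, rather than waiting for the period to end.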
Definition of service windows: It must be possible to schedule (recurring)
maintenance windows for resources and groups of resources. Within those
windows we would not mark resources as down - and, depending on the SLA
agreed upon, also not count that time into the final values for uptime or
outage time.
Resources/applications/groups that are set into maintenance would also be
put into maintenance mode to suppress (false) alerts during that time.
Service level windows should also be used for alert visibility. If the
resource is outside its monitoring time window (say, a test server that is
only available 8-16), there is no point in showing alerts for that resource
outside the time window; hiding them makes it easier for monitoring
personnel to see only the relevant alerts.
This should be a filter option in the alerts window.
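The visibility filter for the 8-16 test-server case above could look roughly
like this. This is a sketch of the idea only - recurring windows, time zones
and multi-day windows are left out, and the names are hypothetical.

```java
import java.time.LocalTime;

// Hypothetical sketch: decide whether an alert is visible, given a daily
// monitoring window such as 8:00-16:00 for a test server.
public class ServiceWindowFilter {

    private final LocalTime start;
    private final LocalTime end;

    public ServiceWindowFilter(LocalTime start, LocalTime end) {
        this.start = start;
        this.end = end;
    }

    /** True if the alert's timestamp falls inside the monitoring window. */
    public boolean isVisible(LocalTime alertTime) {
        return !alertTime.isBefore(start) && alertTime.isBefore(end);
    }

    public static void main(String[] args) {
        ServiceWindowFilter testServer =
                new ServiceWindowFilter(LocalTime.of(8, 0), LocalTime.of(16, 0));
        System.out.println(testServer.isVisible(LocalTime.of(10, 30))); // true
        System.out.println(testServer.isVisible(LocalTime.of(22, 0)));  // false
    }
}
```

The same predicate could back both the filter option in the alerts window and
the maintenance-mode suppression described above.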
It should be possible to get alerts on server-side events. One example
that came up recently was the addition of a new item to the discovery queue.
The alerting could be hooked into the auditing (see below), where all those
events would pass by anyway.
As seen in the current Cassandra work, there is a need for pre-canned and
possibly pre-activated alert definitions. It should be possible to encode
them in the plugin metadata. One option could be the plugin descriptor:
<server name="Foo" ... >
    <alertDefinition name="high memory">
        <condition>…</condition>
        <notification>…</notification>
    </alertDefinition>
This is a bit hairy, as the alert definition "language" is not really a DSL.
If a user has an alert on an AS and on a DataSource inside that AS, and the
AS goes down, the user currently gets two alerts: one for the AS and one for
the DataSource. It should be possible to suppress the creation of alerts for
dependent resources/alerts.
These are options that could be applied after an alert has fired, to follow
the process of how each alert has been handled.
A simple workflow inside RHQ:
- More status options
  - Noticed/Assigned
  - Ability to assign alerts to a certain user
  - Acknowledged/Resolved
- Add comments to an alert
  - Not just when it was resolved, but at any point; provides history
    information.
- View history of workflow (assignments, status changes & comments)
- Escalation paths (if an alert is not acknowledged within 15 minutes, send
  it to someone else)
- Shift plans (the user can upload an Excel sheet to determine who is on
  duty, so that the alert email is sent to the specific user on duty
  according to that plan)
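The escalation-path item can be made concrete with a small sketch: given how
long an alert has gone unacknowledged, pick who should currently be notified.
This is an assumption about the mechanics, not a proposed RHQ API; all names
are hypothetical.

```java
import java.util.List;

// Hypothetical sketch: escalate an unacknowledged alert along a chain of
// recipients, moving one step per elapsed timeout.
public class EscalationPath {

    private final List<String> chain;  // who to notify, in order
    private final long timeoutMinutes; // per-step acknowledgement timeout

    public EscalationPath(List<String> chain, long timeoutMinutes) {
        this.chain = chain;
        this.timeoutMinutes = timeoutMinutes;
    }

    /**
     * Returns who should currently be notified, given how many minutes the
     * alert has gone unacknowledged. The last person stays responsible.
     */
    public String currentRecipient(long minutesUnacknowledged) {
        int step = (int) (minutesUnacknowledged / timeoutMinutes);
        return chain.get(Math.min(step, chain.size() - 1));
    }

    public static void main(String[] args) {
        EscalationPath path =
                new EscalationPath(List.of("on-call", "team-lead", "manager"), 15);
        System.out.println(path.currentRecipient(5));  // on-call
        System.out.println(path.currentRecipient(20)); // team-lead
        System.out.println(path.currentRecipient(60)); // manager
    }
}
```

A shift plan would then resolve "on-call" to a concrete user for the current
time slot before sending the notification.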
For more complex workflow requirements:
- Ability to fire a server plugin for all alerts automatically
  - At the moment one needs to configure this per alert; it should be
    possible globally as well, or for a certain type of alerts, such as
    just red flags.
  - This could be something that starts a BPM workflow, calls a SwitchYard
    service, etc.
- Manually trigger a process from an alert
  - Could be globally configured options, same as the automatic options.