<server name="Foo" ... > <alertDefinition name="high memory"> <condition>…</condition> <notification>…</notification> </alertDefinition>
Alerting 2.0
This is about a potential future development of the alerting functionality. None of this is implemented right now. We should also write down the current capabilities so that we don't forget them in the future implementation.
This is a large effort, but it would remove quite a few of the restrictions we have today. Some of the new features are:
comparing two metrics with each other (used memory is > 80% of total memory)
comparing metrics from different resources (the load balancer is down, or one of my 3 app servers in the cluster is)
a temporal component in the reasoning (if it is night and only 1 of 3 servers is down, send an email; if it is day, always send a text message; default: send an email and a text message and make a phone call)
finding an outlier in a group (the load on one of my 3 boxes in the cluster is higher than on the others).
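As a rough illustration only (hypothetical names, not the proposed rule engine implementation), these conditions amount to checks like the following:

    // Minimal sketch (hypothetical types/names): the kinds of conditions the list above
    // asks for, expressed as plain predicates. The real implementation would live in a
    // rule engine, not in hand-written code.
    import java.util.Collection;

    public class ConditionSketches {

        /** "used memory > 80% of total memory" - comparing two metrics of one resource. */
        static boolean usedMemoryTooHigh(double usedMem, double totalMem) {
            return usedMem > 0.8 * totalMem;
        }

        /** Outlier in a group: one box's load is much higher than the average of the others. */
        static boolean isOutlier(double candidateLoad, Collection<Double> otherLoads, double factor) {
            double avgOthers = otherLoads.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
            return avgOthers > 0.0 && candidateLoad > factor * avgOthers;
        }

        /** Temporal component: pick the notification channel depending on the time of day. */
        static String notificationFor(boolean isNight, int serversDown, int serversTotal) {
            if (isNight && serversDown <= serversTotal / 3) {
                return "email";
            }
            if (!isNight) {
                return "sms";
            }
            return "email+sms+phone";
        }
    }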
Proactive alerting: We need to be able to warn the user in advance, e.g. 1 day ahead, when a metric is about to become problematic. This is not the same as just alerting when a value reaches a certain threshold. For example, a disk drive that is between 89 and 91% full is not an issue, so putting an alert on 90% is wrong. But if we can extrapolate from some long-term samples (e.g. the 1h or 1d values) that, at the current rate of filling up, the disk will be full in 3 days, we should inform the user - especially before a weekend.
There are many more aspects to this that I can detail further.
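As a rough illustration of the extrapolation idea (hypothetical, not the actual implementation):

    // Extrapolate disk usage from two aggregated samples and alert if the disk is
    // predicted to be full within a lead time of e.g. 3 days.
    public class DiskFillForecast {

        /**
         * @param usedEarlier  used bytes at the earlier sample
         * @param usedNow      used bytes at the later sample
         * @param intervalDays days between the two samples (e.g. from the 1d aggregates)
         * @param capacity     total disk capacity in bytes
         * @return days until the disk is projected to be full, or infinity if usage is not growing
         */
        static double daysUntilFull(double usedEarlier, double usedNow, double intervalDays, double capacity) {
            double ratePerDay = (usedNow - usedEarlier) / intervalDays;
            if (ratePerDay <= 0) {
                return Double.POSITIVE_INFINITY;
            }
            return (capacity - usedNow) / ratePerDay;
        }

        public static void main(String[] args) {
            // 850 GB -> 880 GB used over 3 days on a 1 TB disk: ~12 days until full, no alert yet
            double days = daysUntilFull(850e9, 880e9, 3, 1000e9);
            if (days < 3) {
                System.out.println("ALERT: disk projected to be full in " + days + " days");
            } else {
                System.out.printf("Disk projected to be full in %.1f days%n", days);
            }
        }
    }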
The basic idea, though, is to get rid of our homegrown engine and to use Drools Fusion. We would keep our current UI, but on top of it also offer a DSL and/or the raw .drl language to write alert rules in.
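If the Drools Fusion route were taken, feeding metric measurements into a stream session might look roughly like this; the session name "alerting", the entry point "MetricStream" and the MetricValue fact class are assumptions, not existing RHQ or Drools artifacts:

    import org.kie.api.KieServices;
    import org.kie.api.runtime.KieContainer;
    import org.kie.api.runtime.KieSession;
    import org.kie.api.runtime.rule.EntryPoint;

    public class FusionFeedSketch {

        /** Hypothetical event fact: one collected metric value for one resource. */
        public static class MetricValue {
            public final int resourceId;
            public final String metricName;
            public final double value;

            public MetricValue(int resourceId, String metricName, double value) {
                this.resourceId = resourceId;
                this.metricName = metricName;
                this.value = value;
            }
        }

        public static void main(String[] args) {
            KieServices ks = KieServices.Factory.get();
            KieContainer container = ks.getKieClasspathContainer();
            KieSession session = container.newKieSession("alerting"); // name would be defined in kmodule.xml

            // A rule written in the DSL or in raw .drl could then compare two metrics of the
            // same resource over this stream, e.g. "used memory > 80% of total memory".
            EntryPoint metrics = session.getEntryPoint("MetricStream");
            metrics.insert(new MetricValue(42, "totalMemory", 8_000_000_000d));
            metrics.insert(new MetricValue(42, "usedMemory", 7_000_000_000d));
            session.fireAllRules();
            session.dispose();
        }
    }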
When an alert is triggered, we should try to fire up a panel that shows related data for that point in time. This could be log file entries from the seconds before, or "suspect metrics" on related resources, etc.
IT departments often have the need to fulfill contracts about the quality of the service they are delivering. To do this, they need to measure the actual quality of their service, compare it against some defined values, and afterwards create reports about the actual quality and possible violations.
Defining SLs for resources, groups and applications: Service levels consist of definitions for the availability and performance of applications and resources. Those need to be defined and combined into packages that correspond to a service level agreement.
Measuring availabilities and other SL components: Measurement data needs to be taken, aggregated and matched against the Service Level definitions from above.
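As a rough illustration (not the actual implementation), the "match against the definition" step could boil down to something like:

    // Compute availability over a period from up/down durations and compare it to the
    // agreed target; numbers and the 99% target are made up for the example.
    public class AvailabilityCheck {

        /** Availability in percent = uptime / (uptime + downtime) * 100, ignoring maintenance windows. */
        static double availabilityPercent(long upSeconds, long downSeconds) {
            long total = upSeconds + downSeconds;
            return total == 0 ? 100.0 : 100.0 * upSeconds / total;
        }

        public static void main(String[] args) {
            long up = 29L * 24 * 3600;        // 29 days up
            long down = 1L * 24 * 3600;       // 1 day down
            double measured = availabilityPercent(up, down);  // ~96.7 %
            double target = 99.0;             // target from the SLA package
            System.out.printf("measured %.2f%%, target %.2f%% -> %s%n",
                    measured, target, measured >= target ? "SLA met" : "SLA violated");
        }
    }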
Alarming about possible SLA violations: If service levels are about to be violated, it is necessary to alert the operations people so that they can try to optimize the service.
Reporting: Reports need to be generated that show the Service Level definitions and the measured values (and, in the end, whether the SLA has been met or not).
Definition of service windows: It must be possible to schedule (recurring) maintenance windows for resources and groups of resources. Within those windows we would not mark resources as down - and, depending on the SLA agreed upon, also not count them towards the final values for uptime or outage time. Resources / applications / groups inside such a window would also be set into maintenance mode to suppress (false) alerts during that time.
Service level windows should be used for alert visibility as well. If the resource is outside its monitoring time window (say, a test server that is only available 8-16), there is no point in showing alerts for that resource outside that window; hiding them makes it easier for monitoring personnel to see only the relevant alerts.
This should be a filter option in the alerts window.
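A minimal sketch of such a visibility/accounting check, assuming a simple daily start/end window per resource (not the actual RHQ scheduler model):

    import java.time.LocalTime;

    public class MonitoringWindowFilter {

        /** True if "now" falls inside the resource's monitoring window, e.g. 8-16. */
        static boolean insideWindow(LocalTime now, LocalTime windowStart, LocalTime windowEnd) {
            return !now.isBefore(windowStart) && now.isBefore(windowEnd);
        }

        public static void main(String[] args) {
            LocalTime start = LocalTime.of(8, 0);
            LocalTime end = LocalTime.of(16, 0);

            // Outside the window: do not show the alert in the filtered alerts view,
            // and do not count the downtime against the SLA.
            System.out.println(insideWindow(LocalTime.of(18, 30), start, end)); // false
            System.out.println(insideWindow(LocalTime.of(10, 15), start, end)); // true
        }
    }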
It should be possible to get alerts on server-side events. One example that came up recently was the addition of a new item to the discovery queue. The alerting could be hooked into the auditing (see below), where all those events would pass by anyway.
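A purely hypothetical sketch of what such a hook could look like; none of these interfaces exist in RHQ today:

    // An alerting component subscribes to audit events and turns selected ones
    // (e.g. a new item in the discovery queue) into alerts.
    public class AuditAlertHookSketch {

        interface AuditEvent {
            String type();      // e.g. "discovery.queue.item.added"
            String details();
        }

        interface AuditListener {
            void onEvent(AuditEvent event);
        }

        static class AlertingAuditListener implements AuditListener {
            @Override
            public void onEvent(AuditEvent event) {
                if ("discovery.queue.item.added".equals(event.type())) {
                    // evaluate matching alert definitions and fire notifications
                    System.out.println("ALERT: " + event.details());
                }
            }
        }
    }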
As seen in the current Cassandra work, there is a need for pre-canned and possibly pre-activated alert definitions. It should be possible to encode them in the plugin metadata. One option could be the plugin descriptor:
<server name="Foo" ... > <alertDefinition name="high memory"> <condition>…</condition> <notification>…</notification> </alertDefinition>
This is a bit hairy, as the alert definition "language" is not really a DSL.
If a user has an alert on an AS and on a DataSource inside that AS, and the AS goes down, the user currently gets two alerts - one for the AS and one for the DataSource. It should be possible to suppress the creation of alerts for dependent resources/alerts.
While the above example can easily be handled by a parent-child relationship within RHQ's current hierarchical resource model, the following scenario needs more explicit relation handling:
An e-shop consists of a load balancer and two AS behind it. If the load balancer goes down, individual alerts on the AS level (e.g. "load too low", "not enough incoming connections") should be suppressed, as the failing load balancer is clearly the root cause.
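A minimal sketch of the suppression logic under an assumed explicit "depends on" relation (resource IDs and the relation model are placeholders):

    import java.util.Map;
    import java.util.Set;

    public class DependentAlertSuppression {

        /** dependsOn: resourceId -> resources it depends on (parent AS, load balancer in front of it, ...). */
        static boolean shouldSuppress(int resourceId,
                                      Map<Integer, Set<Integer>> dependsOn,
                                      Set<Integer> resourcesAlreadyDown) {
            for (int cause : dependsOn.getOrDefault(resourceId, Set.of())) {
                if (resourcesAlreadyDown.contains(cause)) {
                    return true; // the root cause is upstream, don't create a separate alert
                }
            }
            return false;
        }

        public static void main(String[] args) {
            // AS 2 and 3 sit behind load balancer 1; DataSource 4 lives inside AS 2.
            Map<Integer, Set<Integer>> dependsOn = Map.of(2, Set.of(1), 3, Set.of(1), 4, Set.of(2));
            Set<Integer> down = Set.of(1); // load balancer is down
            System.out.println(shouldSuppress(3, dependsOn, down)); // true - suppress the AS alert
            System.out.println(shouldSuppress(1, dependsOn, down)); // false - alert on the root cause
        }
    }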
These are options that could apply after the alert has fired, to follow the process of how each alert has been handled.
A simple workflow inside RHQ
More status options
Noticed/Assigned
Ability to assign alerts to a certain user
Acknowledged/Resolved
Add comments to an alert
Not just when it was resolved, but at any point. Provides history information.
View history of workflow (assignments, status changes & comments)
Escalation paths (if an alert is not acknowledged within 15 minutes, send it to someone else)
Shift plans (a user can upload an Excel sheet to determine who is on duty, so that the alert email is sent to a specific user depending on that plan)
For more complex workflow requirements
Ability to fire a server plugin for all alerts automatically
At the moment one needs to configure this per alert. It should be possible globally as well.
Or for a certain type of alert, such as just red flags.
This could be something that starts a BPM workflow, calls a SwitchYard service, etc.
Manually trigger a process from an alert
These could be globally configured options, the same as the automatic ones.
In the current implementation, alert emails always go to the user/group/role defined in the alert definition, no matter whether that user is actually on duty or not.
Alerting 2.0 needs to take on-duty plans into account and be able to route alert notifications (not only email, but also SMS, pager, ...) to the person/group that is on duty at that time.
On top of this, if that person/group cannot be reached within a certain amount of time (and thus the alert is not marked as acknowledged), it must be possible to either re-send the notification to them or send it to a backup person or group.
For this purpose we need to either implement an on-duty editor (a table with 7 columns and 24 rows) to enter the people on duty (and a second one for escalation), or, and this may be easier, just allow uploading an Excel sheet with this data.
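A sketch of the lookup behind such an editor or Excel upload, assuming a simple 7 x 24 table as the data model (names and the example plan are made up):

    import java.time.DayOfWeek;
    import java.time.LocalDateTime;

    public class OnDutyPlan {

        /** plan[dayOfWeek 0..6][hour 0..23] = user or group name to notify. */
        private final String[][] plan = new String[7][24];

        void assign(DayOfWeek day, int fromHour, int toHour, String contact) {
            for (int h = fromHour; h < toHour; h++) {
                plan[day.getValue() - 1][h] = contact;
            }
        }

        String onDutyAt(LocalDateTime when) {
            return plan[when.getDayOfWeek().getValue() - 1][when.getHour()];
        }

        public static void main(String[] args) {
            OnDutyPlan p = new OnDutyPlan();
            p.assign(DayOfWeek.MONDAY, 0, 8, "night-shift");
            p.assign(DayOfWeek.MONDAY, 8, 17, "alice");
            p.assign(DayOfWeek.MONDAY, 17, 24, "night-shift");
            // route the notification (email, SMS, pager, ...) to whoever this returns
            System.out.println(p.onDutyAt(LocalDateTime.of(2024, 1, 1, 10, 30))); // a Monday morning -> "alice"
        }
    }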
Currently we have pretty lazy collection intervals. They are fine for long-term graphs, where you don't need per-minute resolution, but they are unusable for serious alerting. (This is done to reduce the load on RHQ, especially on the server and database; Cassandra is supposed to help here.) The less frequently the data is collected, though, the longer it takes until an alert fires.
The idea is now to have much more aggressive collection intervals for alerting, where the metric is e.g. obtained every minute and compared against the alerting conditions inside the agent, and only forwarded to the server if an alerting condition is met (or the normal collection interval says it is due).
This way one would get fast alerting, but still low(er) load on the server and database.
Users want:
quick alerting after error condition
low volume of data for long term storage
Issue:
Quick schedules allow for quick alerting
Slow schedules allow for less data in DB
Solution:
Use quick schedules in agent
compare with alert criteria
throw away if no (possible) alert
Alert check is against a single normalized value
if (value > threshold) { forwardToServer(value) }
Normal forwarding every 10 min (= defined schedule time) happens anyway
Simple for the existing "1 resource per definition" case
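A sketch of that simple "1 resource per definition" case on the agent side (names are assumptions, not existing agent code):

    // Collect at the fast interval, forward immediately only when an alert condition
    // could fire, otherwise forward at the normal, slower schedule.
    public class AgentSideMetricFilter {

        private final double threshold;
        private final long normalIntervalMillis;   // e.g. 10 min - the defined schedule
        private long lastForwarded;

        AgentSideMetricFilter(double threshold, long normalIntervalMillis) {
            this.threshold = threshold;
            this.normalIntervalMillis = normalIntervalMillis;
        }

        /** Called every fast interval (e.g. every minute) with the freshly collected value. */
        void onCollected(double value, long nowMillis) {
            boolean alertCandidate = value > threshold;                          // alert check on the agent
            boolean normalScheduleDue = nowMillis - lastForwarded >= normalIntervalMillis;
            if (alertCandidate || normalScheduleDue) {
                forwardToServer(value, nowMillis);
                lastForwarded = nowMillis;
            }
            // otherwise throw the value away - low(er) load on server and database
        }

        private void forwardToServer(double value, long nowMillis) {
            System.out.printf("forwarding %.1f at %d%n", value, nowMillis);
        }
    }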
Now with Wintermute:
Definition can be N resources on X agents
Need to forward if the combined condition could be true