
Alerting 2.0

This page describes a potential future development of the alerting functionality.
None of this is implemented right now.
We should also write down the current capabilities so that we don't
forget them in the future implementation.

This is a large effort, but it would remove quite a few of the restrictions we
have today. Some of the new features are:

  • Comparing two metrics with each other (used memory is more than 80% of
    total memory).
  • Comparing metrics from different resources (if the load balancer is
    down, or one of my 3 app servers in the cluster is down).
  • A temporal component in the reasoning (at night, if only 1/3 of the servers
    is down, send an email; during the day, always send a text message; default:
    send an email and a text message and make a phone call).
  • Finding an outlier in a group (if the load on one of my 3 boxes in the
    cluster is higher than on the others).
  • Proactive alerting: We need to be able to alert the user in advance
    when a metric is about to become problematic, e.g. 1 day in advance. This is
    not the same as alerting when a value reaches a certain threshold. For
    example, a disk drive that stays between 89% and 91% full is no issue, so
    putting an alert on 90% is wrong. But if we can extrapolate from long-term
    samples (e.g. the 1h or 1d values) that, at the current fill rate, the disk
    will be full in 3 days, we should inform the user - especially before a weekend.
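The proactive case could be sketched as a straight-line extrapolation over aggregated samples. This is only an illustration under assumptions: the function names, the (hours, percent used) sample format and the 72-hour warning horizon are hypothetical and not part of any existing RHQ API, and a least-squares line is just one possible way to extrapolate.

```python
from typing import Optional, Sequence, Tuple


def hours_until_full(samples: Sequence[Tuple[float, float]],
                     capacity: float = 100.0) -> Optional[float]:
    """Estimate the hours left until `capacity` is reached.

    `samples` are (hours_since_start, percent_used) pairs, e.g. 1h or 1d
    aggregates (hypothetical format). A least-squares line is fitted and
    extended to `capacity`. Returns None if usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None  # not filling up, nothing to predict
    intercept = mean_v - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


def should_warn(samples: Sequence[Tuple[float, float]],
                horizon_hours: float = 72.0) -> bool:
    """True if the fitted trend reaches capacity within the horizon."""
    eta = hours_until_full(samples)
    return eta is not None and eta <= horizon_hours
```

With daily samples of 70%, 75% and 80%, the fitted line reaches 100% four days after the last sample, so a 3-day horizon would not warn yet while a 5-day horizon would.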

There are many more aspects to this that I can detail further.

The basic idea, though, is to get rid of our homegrown engine and to use Drools
Fusion. We would keep our current UI, but on top of it also offer a DSL and/or
the raw .drl language to write alert rules in.

Better presentation of incident data

When an alert is triggered, we should try to open a panel that
shows related data for that time. This could be log file entries
from the seconds before the alert, or "suspect metrics" on related resources, etc.

Service levels

IT departments often have to fulfill contracts about the quality
of the service they deliver. To do this, they need to measure the actual
quality of their service, compare it against defined target values and
afterwards create reports about the actual quality and possible violations.

  • Defining SL for resources, groups and applications: Service Levels consist
    of definitions for the availability and performance of applications and
    resources.
    Those need to be defined and combined into packages that correspond to
    a service level agreement.
  • Measuring availabilities and other SL components: Measurement data needs to
    be taken, aggregated and matched against the defined Service Level definitions
    from above.
  • Alerting about possible SLA violations: If service levels are about to be
    violated, it is necessary to alert the operations people so that they can try
    to optimize the service.
  • Reporting: Reports need to be generated that show the Service Level
    definitions and the measured values (and, in the end, whether the SLA has
    been met or not).
  • Definition of service windows: It must be possible to schedule (recurring)
    maintenance windows for resources and groups of resources. Within those
    windows we would not mark resources as down and, depending on the SLA
    agreed upon, would also not count that time towards the final values for
    uptime or outage time.
    Resources / Applications / Groups that are under maintenance would
    also be put into maintenance mode to suppress (false) alerts during that
    time.
  • Service level windows should also be used for alert visibility. If a resource is outside
    its monitoring time window (say a test server that is only available 8-16), there is no point
    in showing alerts for that resource outside the time window. Hiding them makes it easier
    for monitoring personnel to see only the relevant alerts.
    • This should be a filter option in the alerts window.
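How maintenance windows could enter the availability figure can be sketched as follows. This is a minimal illustration, not a proposed implementation: the interval representation (start and end in hours), the function names and the assumption that intervals within each list do not overlap each other are all hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in hours, hypothetical format


def overlap(a: Interval, b: Interval) -> float:
    """Length of the overlap of two intervals (0 if they are disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def availability(period: Interval,
                 outages: List[Interval],
                 maintenance: List[Interval]) -> float:
    """Fraction of the non-maintenance time in `period` the resource was up.

    Assumes the intervals inside `outages` do not overlap each other, and
    likewise for `maintenance`. Downtime inside a maintenance window is
    not counted as an outage.
    """
    scheduled = (period[1] - period[0]) - sum(overlap(period, m) for m in maintenance)
    if scheduled <= 0:
        return 1.0  # the whole period was maintenance
    down = 0.0
    for outage in outages:
        clipped = (max(outage[0], period[0]), min(outage[1], period[1]))
        if clipped[1] <= clipped[0]:
            continue  # outage lies outside the reporting period
        excused = sum(overlap(clipped, m) for m in maintenance)
        down += (clipped[1] - clipped[0]) - excused
    return 1.0 - down / scheduled


def sla_met(measured: float, target: float) -> bool:
    """Compare the measured availability against the agreed target."""
    return measured >= target
```

For a 720-hour month with a 4-hour maintenance window, an outage that half overlaps the window and another 5-hour outage, 7 of 716 scheduled hours count as downtime.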

Alerting on server-state/events

It should be possible to get alerts on server side events. One example
that came up recently was the addition of a new item to the discovery queue.

The alerting could be hooked into the auditing (see below) where all those
events would pass by anyway.

Alert definitions in plugins

As seen in the current Cassandra work, there is a need for pre-canned and
possibly pre-activated Alert Definitions. It should be possible to encode
them in the plugin metadata. One option could be the plugin descriptor.

This is a bit hairy, as the alert definition "language" is not really a DSL.

Explicit handling of dependent alerts

If a user has an alert on an AS and on a DataSource inside the AS, and the AS
goes down, the user currently gets two alerts: one for the AS and one for the DataSource.

It should be possible to suppress creation of alerts for dependent resources/
alerts.

While the above example can easily be handled by a parent-child relationship within RHQ's current hierarchical resource model,
the following scenario needs more explicit relation handling:

An e-shop consists of a load balancer and two AS behind the load balancer. If the load balancer goes down, individual alerts on AS level (e.g. "load too low", "not enough incoming connections") should be suppressed, as the failing load balancer is clearly the root cause.
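Both scenarios could be covered by an explicit dependency map that is consulted before alerts are created. The sketch below is hypothetical: the resource names, the `depends_on` structure and the rule "suppress an alert if any transitive dependency also has a pending alert" are assumptions for illustration, and the dependency graph is assumed to be acyclic.

```python
from typing import Dict, List, Set, Tuple

Alert = Tuple[str, str]  # (resource, message), hypothetical representation


def transitive_dependencies(resource: str,
                            depends_on: Dict[str, List[str]]) -> Set[str]:
    """All resources that `resource` depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = list(depends_on.get(resource, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    return seen


def suppress_dependent_alerts(alerts: List[Alert],
                              depends_on: Dict[str, List[str]]) -> List[Alert]:
    """Keep only alerts whose dependencies have no pending alert themselves."""
    failing = {resource for resource, _ in alerts}
    kept = []
    for resource, message in alerts:
        if transitive_dependencies(resource, depends_on) & failing:
            continue  # something this resource depends on is the likelier root cause
        kept.append((resource, message))
    return kept


# Hypothetical e-shop: both AS depend on the load balancer,
# and a DataSource depends on the AS it is deployed in.
DEPENDS_ON = {"as1": ["lb"], "as2": ["lb"], "ds1": ["as1"]}
```

With this map, alerts on `as1` and `as2` are dropped while an alert on `lb` is pending, and an alert on `ds1` is dropped while `as1` has one.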

Workflow improvements

These are actions that could be taken after an alert has fired, to track how each alert has been handled.

  • A simple workflow inside RHQ
    • More status options
      • Noticed/Assigned
        • Ability to assign alerts to a certain user
      • Acknowledged/Resolved
    • Add comments to an alert
      • Not just when it was resolved, but at any point. Provides history information.
    • View history of workflow (assignments, status changes & comments)
    • Escalation paths (if alert is not acknowledged within 15 mins, send to someone else)
    • Shift plans (user can upload an Excel sheet that determines who is on duty, so that the alert email goes to a specific user depending on that plan)
  • For more complex workflow requirements
    • Ability to fire server-plugin for all alerts automatically
      • At the moment one needs to configure this per alert. It should be possible globally as well.
        • Or for a certain type of alerts, such as just red flags.
        • This could be something that starts a BPM workflow, calls a SwitchYard service, etc.
    • Manually trigger a process from an alert
      • Could be globally configured options, same as automatic options

On-duty / Escalation handling

In the current implementation, alert emails always go to the user/group/role defined in the alert definition, regardless of whether that user is actually on duty.

Alerting 2.0 needs to take on-duty plans into account and be able to route alert notifications (not only email, but also SMS, pager, ...) to the person/group
that is on duty at this time.

On top of this, if that person/group cannot be reached within a certain amount of time (and thus the alert is not marked as acknowledged), it must be possible either to re-send the notification to them or to send it to a backup person or group instead.

For this purpose we need either to implement an on-duty editor (a table with 7 columns and 24 rows) to enter the people on duty (and a 2nd one for escalation) or, and this may be easier, to allow uploading an Excel sheet with this data.
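The 7 x 24 table and the escalation step could be modelled roughly like this. It is a sketch under assumptions: the plan is a plain dictionary keyed by (weekday, hour), the names `on_duty` and `notification_target` are hypothetical, and the 15-minute acknowledgement timeout is only an example value.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple

# (weekday, hour) -> contact; weekday 0 is Monday, as in datetime.weekday().
DutyPlan = Dict[Tuple[int, int], str]


def on_duty(plan: DutyPlan, when: datetime) -> Optional[str]:
    """Look up who is on duty in the 7 x 24 table at the given time."""
    return plan.get((when.weekday(), when.hour))


def notification_target(primary: DutyPlan,
                        backup: DutyPlan,
                        fired_at: datetime,
                        now: datetime,
                        acknowledged: bool,
                        ack_timeout: timedelta = timedelta(minutes=15)) -> Optional[str]:
    """Decide who should be notified about an alert right now.

    Returns None once the alert is acknowledged. Before the timeout the
    primary plan is used; afterwards the escalation (backup) plan, falling
    back to the primary plan if the backup table has no entry.
    """
    if acknowledged:
        return None
    if now - fired_at >= ack_timeout:
        return on_duty(backup, now) or on_duty(primary, now)
    return on_duty(primary, now)
```

An uploaded spreadsheet would only need to be converted into the same (weekday, hour) mapping, so the routing logic stays independent of how the plan is entered.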

Disjoint alerting / graphing schedules / Filtering for alerts at agent level

Currently we have fairly lazy collection intervals. They are fine for long-term graphs, where you don't need per-minute resolution, but they are unusable for serious alerting. (The long intervals keep the load on RHQ down, especially on the server and database; Cassandra is supposed to help here.) The less frequently the data is collected, though, the longer it takes until an alert fires.

The idea is to have much more aggressive collection intervals for alerting: the metric is obtained e.g. every minute and compared against the alerting conditions inside the agent, and it is only forwarded to the server if an alerting condition is met (or if the normal collection interval says so).

This way one would get fast alerting, but still low(er) load on the server and database.

Users want:

  • quick alerting after error condition
  • low volume of data for long term storage

Issue:

  • Quick schedules allow for quick alerting
  • Slow schedules allow for less data in DB

Solution:

  • Use quick schedules in agent
  • compare with alert criteria
  • throw away if no (possible) alert

Alert check is against a single normalized value:
    if (value > threshold) { forwardToServer(value); }

Normal forwarding every 10 min (= defined schedule time) happens anyway.

Simple for the existing "1 resource per definition" case.
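For this single-resource case, the agent-side decision could look roughly like the following. The class and method names, the per-minute sampling and the 600-second regular schedule are assumptions for illustration only; nothing like this exists in the agent today.

```python
from typing import Optional


class AgentSideFilter:
    """Decides per sample whether the agent forwards a value to the server.

    A value is forwarded when it violates the alert threshold or when the
    regular (slow) collection schedule is due. All other samples are
    thrown away on the agent.
    """

    def __init__(self, threshold: float, report_interval: float = 600.0) -> None:
        self.threshold = threshold
        self.report_interval = report_interval  # seconds between regular reports
        self._last_scheduled: Optional[float] = None

    def should_forward(self, value: float, now: float) -> bool:
        # The regular schedule runs independently of alert-triggered forwards.
        due = (self._last_scheduled is None
               or now - self._last_scheduled >= self.report_interval)
        if due:
            self._last_scheduled = now
        return due or value > self.threshold
```

Sampling every 60 seconds with a threshold of 80, a value of 95 at t=120 is forwarded immediately, while unremarkable values between the regular reports at t=0 and t=600 are dropped.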

Now with Wintermute:

  • A definition can span N resources on X agents
  • Need to forward if the combined condition could be true
  1. Mar 04, 2014

    My list of shortcomings:

    • Need a better UI to display and handle alerts. It's useful (often) to have a display where alerts are grouped by hostname, grouped by RHQ group, searchable by hostname, etc. Filtering by ack'd or not wasn't even in until recently.
    • ...also need a way to display and filter groups by alert type, etc.
    • Better ways to handle alert recovery. I wrote a plugin that marks the alert itself as recovered when a recovery alert happens. (https://bugzilla.redhat.com/show_bug.cgi?id=1030667) Since this comes up so often, this should be easier to configure.
    • I also wrote a plugin to alert based on the group assigned using tags on resource groups. For example, for resources in group A, send an alert to this email address, etc.
    • Dealing with metric aggregation. I created a plugin to manage alerts for resource groups: https://bugzilla.redhat.com/show_bug.cgi?id=1019472 ... It lets you do things like alert if < 20% of your servers are down, there was a 20% drop in traffic since last week for the same hour across a pool of hosts, etc.
    • Date and time threshold alerting: https://bugzilla.redhat.com/show_bug.cgi?id=958254 ... For example, if you track when job A runs, did it run in the past hour?

    I'd also like to see more done with events. RHQ could have a Splunk-like log event capturing system (using Cassandra as the back-end), with full text search and alerting. You could have the events in turn generate metrics that could fire alerts if thresholds are met.