JBoss Community Archive (Read Only)

RHQ 4.9

Wintermute

Alerting 2.0

This page describes a potential future development of the alerting functionality.
None of this is implemented right now.

This is a large effort, but it would work around quite a few restrictions we have
today. Some of the new features are described in the sections below.

There are many more aspects to this that I can detail further.

The basic idea, though, is to get rid of our homegrown engine and use Drools
Fusion instead. We would keep our current UI, but on top of it also offer a DSL
and/or the raw .drl language for writing alert rules.

Filtering for alerts at agent level

Currently we have fairly long collection intervals. They are fine for
long-term graphs, where per-minute resolution is not needed, but they
are unusable for serious alerting. The long intervals exist to reduce
the load on RHQ, especially on the server and database; Cassandra is
supposed to help here.

The idea is to have much more aggressive collection intervals for
alerting: the metric is obtained, for example, every minute and compared
against the alerting conditions inside the agent. It is only forwarded
to the server if an alerting condition is met, or if the normal
collection interval says a value is due anyway.

This way one would get fast alerting, but still low(er) load on the
server and database.

This item is of course related to the previous one, but could possibly
be addressed independently.
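
The agent-side decision described above can be sketched roughly as follows. This is only an illustration, not RHQ agent code: the class `AgentSideFilter`, its fields and the threshold used in the example are hypothetical names and values chosen for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AgentSideFilter:
    """Decides inside the agent whether a fast-collected sample is sent on.

    All names here are hypothetical; this is not an existing RHQ API.
    """
    condition: Callable[[float], bool]     # alert condition, e.g. value > 90
    report_interval: int                   # normal (slow) interval in seconds
    last_regular: float = float("-inf")    # time of the last regular report

    def should_forward(self, timestamp: float, value: float) -> bool:
        # A regular report is due if the slow interval has elapsed.
        due = timestamp - self.last_regular >= self.report_interval
        if due:
            self.last_regular = timestamp
        # Forward when the alert condition matches or a regular report is due.
        return due or self.condition(value)


def forwarded(samples: List[Tuple[int, float]],
              flt: AgentSideFilter) -> List[Tuple[int, float]]:
    """Return only those (timestamp, value) samples that reach the server."""
    return [(t, v) for t, v in samples if flt.should_forward(t, v)]


# One sample per minute for ten minutes, with a single spike at t=180.
samples = [(t, 95.0 if t == 180 else 10.0) for t in range(0, 601, 60)]
flt = AgentSideFilter(condition=lambda v: v > 90, report_interval=600)
sent = forwarded(samples, flt)
```

In this run eleven samples are collected in the agent, but only three are forwarded: the two regular reports and the spike that matches the condition.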

Better presentation of incident data

When an alert is triggered, we should try to open a panel that shows
related data for that point in time. This could be log file entries
from the seconds before the alert, or "suspect metrics" on related
resources, etc.

Service levels

IT departments often have to fulfill contracts about the quality of the
service they deliver. To do this, they need to measure the actual quality of
their service, compare it against defined target values, and afterwards create
reports about the actual quality and possible violations.

  • Defining SL for resources, groups and applications: Service Levels consist
    of definitions for the availability and performance of applications and
    resources.
    These need to be defined and combined into packages that correspond to
    a service level agreement.

  • Measuring availabilities and other SL components: Measurement data needs to
    be collected, aggregated and matched against the Service Level definitions
    from above.

  • Alerting on possible SLA violations: If service levels are about to be
    violated, it is necessary to alert the operations people so that they can try
    to optimize the service.

  • Reporting: Reports need to be generated that show the Service Level
    definitions and the measured values, and in the end whether the SLA
    has been met or not.

  • Definition of service windows: It must be possible to schedule (recurring)
    maintenance windows for resources and groups of resources. Within those
    windows we would not mark resources as down and, depending on the SLA
    agreed upon, would also not count that time in the final values for
    uptime or outage time.
    Resources / Applications / Groups that are in a maintenance window would
    also be set into maintenance mode to suppress (false) alerts during that
    time.

  • Service level windows should also be used for alert visibility. If the resource is outside
    its monitoring time window (say a test server is only available 08:00-16:00), there is no point
    in showing alerts for that resource outside the time window. Hiding them makes it easier
    for monitoring personnel to see only the relevant alerts.

    • This should be a filter option in the alerts window.
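
As a rough illustration of how measured availability could be matched against a service level while excluding maintenance windows, consider the sketch below. It is not taken from RHQ; the function names, the use of plain minute offsets and the example numbers are all assumptions made for this example.

```python
from typing import List, Tuple

# Half-open interval [start, end) in minutes since the start of the
# reporting period. Plain minute offsets are a simplification for this sketch.
Interval = Tuple[int, int]


def overlap(a: Interval, b: Interval) -> int:
    """Length of the intersection of two intervals (0 if they do not meet)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def availability(period: Interval,
                 outages: List[Interval],
                 maintenance: List[Interval]) -> float:
    """Fraction of counted time during which the resource was up.

    Time inside maintenance windows is neither counted as uptime nor as
    outage. Assumes the intervals within each list do not overlap each other.
    """
    in_period = [(max(m[0], period[0]), min(m[1], period[1])) for m in maintenance]
    windows = [m for m in in_period if m[0] < m[1]]
    counted = (period[1] - period[0]) - sum(m[1] - m[0] for m in windows)
    if counted <= 0:
        return 1.0  # the whole period was maintenance, nothing to violate
    down = 0
    for outage in outages:
        raw = overlap(outage, period)
        excused = sum(overlap(outage, m) for m in windows)
        down += raw - excused
    return (counted - down) / counted


def sla_met(measured: float, target: float) -> bool:
    """Compare a measured availability against the agreed target."""
    return measured >= target


period = (0, 1000)
maintenance = [(900, 1000)]             # scheduled service window
outages = [(100, 145), (950, 980)]      # the second one falls into maintenance
measured = availability(period, outages, maintenance)
```

Here only the first outage counts: 45 of the 900 counted minutes are downtime, so `measured` is about 0.95. An alert for an impending violation could compare this value against the agreed target while the reporting period is still running.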

Alerting on server-state/events

It should be possible to get alerts on server-side events. One example
that came up recently was the addition of a new item to the discovery queue.

The alerting could be hooked into the auditing (see below) where all those
events would pass by anyway.
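
A minimal sketch of such a hook, assuming all server-side events pass through one audit component: alert conditions subscribe to that stream and fire on matching events. The classes `AuditEvent` and `AuditTrail` and the event kind names are hypothetical and only serve this illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class AuditEvent:
    kind: str      # hypothetical event type name
    detail: str


class AuditTrail:
    """Records every server-side event and lets alert conditions listen in."""

    def __init__(self) -> None:
        self.events: List[AuditEvent] = []
        self._listeners: List[Tuple[Callable[[AuditEvent], bool],
                                    Callable[[AuditEvent], None]]] = []

    def subscribe(self,
                  predicate: Callable[[AuditEvent], bool],
                  action: Callable[[AuditEvent], None]) -> None:
        """Register an alert: `action` runs for each event matching `predicate`."""
        self._listeners.append((predicate, action))

    def record(self, event: AuditEvent) -> None:
        # Auditing happens first; alerting sees the event as it passes by.
        self.events.append(event)
        for predicate, action in self._listeners:
            if predicate(event):
                action(event)


fired: List[str] = []
trail = AuditTrail()
trail.subscribe(lambda e: e.kind == "DISCOVERY_QUEUE_ITEM_ADDED",
                lambda e: fired.append(f"ALERT: {e.detail}"))

trail.record(AuditEvent("USER_LOGIN", "admin logged in"))
trail.record(AuditEvent("DISCOVERY_QUEUE_ITEM_ADDED", "new platform 'host42'"))
```

Both events end up in the audit trail, but only the discovery queue event triggers the alert action.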

Alert definitions in plugins

As seen in the current Cassandra work, there is a need for pre-canned and
possibly pre-activated Alert Definitions. It should be possible to encode
them in the plugin metadata. One option could be the plugin descriptor.

 <server name="Foo" ... >
    <alertDefinition name="high memory">
      <condition>…</condition>
      <notification>…</notification>
    </alertDefinition>
 </server>

This is a bit hairy, as the alert definition "language" is not really a DSL.
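
To illustrate how a server could pick up such definitions at plugin deployment time, here is a small sketch that reads them from a descriptor fragment. The element names follow the proposal above; the `enabled` attribute (for "pre-activated") and the contents of `<condition>` and `<notification>` are invented placeholders, since no syntax for them has been defined.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical descriptor fragment. The condition and notification texts and
# the "enabled" attribute are placeholders made up for this sketch.
DESCRIPTOR = """
<server name="Foo">
  <alertDefinition name="high memory" enabled="true">
    <condition>heap.used &gt; 90%</condition>
    <notification>email:ops</notification>
  </alertDefinition>
</server>
"""


def load_alert_definitions(xml_text: str) -> List[Dict[str, object]]:
    """Extract the alert definitions declared directly under the root type."""
    root = ET.fromstring(xml_text)
    definitions = []
    for node in root.findall("alertDefinition"):
        definitions.append({
            "resource_type": root.get("name"),
            "name": node.get("name"),
            # Definitions are treated as inactive unless explicitly enabled.
            "enabled": node.get("enabled", "false") == "true",
            "condition": node.findtext("condition", default="").strip(),
            "notification": node.findtext("notification", default="").strip(),
        })
    return definitions


definitions = load_alert_definitions(DESCRIPTOR)
```

The condition is kept as an opaque string here, which sidesteps rather than solves the problem that the alert definition "language" is not a DSL.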

Explicit handling of dependent alerts

If a user has an alert on an AS and another on a DataSource inside that AS,
and the AS goes down, the user currently gets two alerts: one for the AS and
one for the DataSource. It should be possible to suppress the creation of
alerts for dependent resources/alerts.
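
One possible way to express this suppression, shown as a sketch only: walk up the resource hierarchy and drop an alert if any ancestor is alerting as well. The function name and the simple child-to-parent mapping are assumptions for this example, not an existing RHQ structure.

```python
from typing import Dict, List


def suppress_dependent(alerting: List[str], parent_of: Dict[str, str]) -> List[str]:
    """Keep only alerts whose ancestors are not alerting themselves.

    `alerting` lists the resources with a firing alert, `parent_of` maps a
    child resource to its parent. The hierarchy is assumed to be a tree
    (no cycles).
    """
    firing = set(alerting)
    kept = []
    for resource in alerting:
        ancestor = parent_of.get(resource)
        # Climb towards the root until an alerting ancestor is found.
        while ancestor is not None and ancestor not in firing:
            ancestor = parent_of.get(ancestor)
        if ancestor is None:
            kept.append(resource)
    return kept


parent_of = {"DataSource": "AS", "AS": "Platform"}
both_down = suppress_dependent(["AS", "DataSource"], parent_of)
only_datasource = suppress_dependent(["DataSource"], parent_of)
```

With the AS down, only the AS alert remains. If the DataSource fails on its own, its alert is still created.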

Workflow improvements

These are options that could be offered after an alert has fired, to track how each alert has been handled.

  • A simple workflow inside RHQ

    • More status options

      • Noticed/Assigned

        • Ability to assign alerts to a certain user

      • Acknowledged/Resolved

    • Add comments to an alert

      • Not just when it was resolved, but at any point. Provides history information.

    • View history of workflow (assignments, status changes & comments)

    • Escalation paths (if an alert is not acknowledged within 15 minutes, send it to someone else)

    • Shift plans (a user can upload an Excel sheet that determines who is on duty, so that the alert email goes to a specific user depending on that plan)

  • For more complex workflow requirements

    • Ability to fire server-plugin for all alerts automatically

      • At the moment one needs to configure this per alert. It should be global as well.

        • Or for a certain type of alert, such as red flags only.

        • This could be something that starts a BPM workflow, calls a SwitchYard service, etc.

    • Manually trigger a process from an alert

      • Could be globally configured options, same as automatic options
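
The simple in-RHQ workflow from the list above (status changes, comments, history and an escalation path) could look roughly like the following sketch. The `Alert` class, the status names and the 15-minute default are illustrative assumptions, not an existing RHQ data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Alert:
    name: str
    fired_at: int                      # minutes, simplified clock
    status: str = "NEW"                # NEW, ASSIGNED, ACKNOWLEDGED, RESOLVED
    assignee: Optional[str] = None
    history: List[Tuple[int, str]] = field(default_factory=list)

    def assign(self, when: int, user: str) -> None:
        self.assignee, self.status = user, "ASSIGNED"
        self.history.append((when, f"assigned to {user}"))

    def comment(self, when: int, user: str, text: str) -> None:
        # Comments are allowed at any point, not only on resolution.
        self.history.append((when, f"{user}: {text}"))

    def acknowledge(self, when: int, user: str) -> None:
        self.status = "ACKNOWLEDGED"
        self.history.append((when, f"acknowledged by {user}"))


def escalation_recipient(alert: Alert, now: int, chain: List[str],
                         timeout: int = 15) -> Optional[str]:
    """Who should be notified at `now`, or None once the alert is handled.

    Every `timeout` minutes without acknowledgement moves one step further
    along `chain`; the last entry keeps being notified after that.
    """
    if alert.status in ("ACKNOWLEDGED", "RESOLVED"):
        return None
    level = min((now - alert.fired_at) // timeout, len(chain) - 1)
    return chain[level]


chain = ["alice", "bob"]               # hypothetical on-duty order
alert = Alert("high memory", fired_at=0)
first = escalation_recipient(alert, 5, chain)
second = escalation_recipient(alert, 15, chain)
alert.comment(16, "bob", "restarting the server")
alert.acknowledge(20, "bob")
after_ack = escalation_recipient(alert, 25, chain)
```

The chain could be filled from an uploaded shift plan; that part is left out here. The `history` list gives the view of assignments, status changes and comments mentioned above.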

JBoss.org Content Archive (Read Only), exported from JBoss Community Documentation Editor at 2020-03-13 08:16:36 UTC, last content change 2013-09-18 19:41:16 UTC.