Measurement Subsystem

Data Collection

Metrics are collected on a predefined schedule, which is configured using the monitor > configure tab. As a point-in-time, collection-based monitoring system, we do not capture "live" data (we do have a mechanism for retrieving live data for a specific resource, but we do not try to capture live data across all of the resources in your inventory). Instead, we capture values that are snapshots in time. As a result, if your system generates large volumes of data that are updated many times per second, you will only be able to poll for / collect a subset of those values. For performance and scalability reasons, the shortest collection interval for any metric is 30 seconds.

Measurement Schedules

Measurement schedules can be modified for an individual resource, for an autogroup of resources, or even for a compatible group. Although the server is the authoritative source of the current values, these schedules need to be sent down to the various agents that do the actual data collection against the managed resources.

Measurement Templates

A template stores the default values for metric collection. As new resources are imported into your inventory, the templates will govern whether measurement schedules are created as enabled/disabled as well as their collection interval/frequency.

If you change a template, the UI lets you choose whether the change applies ONLY to resources imported after the change, or also to every resource of the corresponding type already in your inventory.
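
The template behavior described above can be sketched roughly as follows. This is only an illustration: the Template and Schedule classes, both function names, and the choice to clamp intervals to the 30 second minimum are assumptions made for this sketch, not the actual server-side API.

```python
from dataclasses import dataclass

# Assumption for this sketch: intervals below the 30 second minimum are clamped.
MIN_INTERVAL_MS = 30_000


@dataclass
class Template:
    """Hypothetical default for one metric of one resource type."""
    metric: str
    enabled: bool
    interval_ms: int


@dataclass
class Schedule:
    """Hypothetical per-resource schedule created from a template."""
    resource_id: int
    metric: str
    enabled: bool
    interval_ms: int


def schedule_from_template(resource_id, template):
    # A newly imported resource inherits the template's enabled flag and interval.
    return Schedule(resource_id, template.metric, template.enabled,
                    max(template.interval_ms, MIN_INTERVAL_MS))


def apply_template_change(template, existing, update_existing):
    # Schedules already in inventory are only touched when the user asked for it.
    if update_existing:
        for schedule in existing:
            if schedule.metric == template.metric:
                schedule.enabled = template.enabled
                schedule.interval_ms = max(template.interval_ms, MIN_INTERVAL_MS)
    return existing
```

With update_existing set to False, only resources imported later pick up the new defaults, which mirrors the first of the two choices the UI offers.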

Synchronization Concerns

If the agent(s) are online when a schedule (or template) is updated, they will be sent a message synchronously (with respect to the action performed in the UI) to merge the changes into their local "cache" of schedules. The agents then use these schedules to perform the actual metric collection against your managed resources. In this way, the agent can continue to collect data even if the server-side infrastructure goes down (either accidentally, for regular maintenance, or because the servers are being upgraded).

If, however, the agent(s) are offline when a schedule (or template) is updated, the servers will make a note of which agents they were not able to talk to (by updating the mtime on the corresponding resources which could not be contacted). Then, when the agent comes back online and starts to talk to the server-side again, part of its handshaking mechanism is to determine whether any of the schedules it had cached are "stale". If they are, it will get the new schedules and merge the updated values into its local cache.
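
As a rough illustration of the reconnect case, the staleness check could look something like the sketch below. It assumes that staleness is decided by comparing each resource's server-side mtime with the time the agent last synchronized that resource; the function names and the dictionary shapes are hypothetical and are not taken from the agent code.

```python
def find_stale_resources(server_mtimes, agent_last_sync):
    """Return ids of resources whose server-side mtime is newer than the
    agent's last sync time for that resource (assumed comparison rule)."""
    return sorted(rid for rid, mtime in server_mtimes.items()
                  if mtime > agent_last_sync.get(rid, 0))


def merge_schedules(agent_cache, server_schedules, stale_ids):
    """Overwrite cached schedules for stale resources only; leave the rest alone."""
    for rid in stale_ids:
        agent_cache[rid] = dict(server_schedules[rid])
    return agent_cache


# Resource 2 was modified on the server (mtime 250) after the agent's last sync (200).
stale = find_stale_resources({1: 100, 2: 250}, {1: 200, 2: 200})
cache = merge_schedules({1: {"cpu": 60000}, 2: {"cpu": 60000}},
                        {1: {"cpu": 30000}, 2: {"cpu": 120000}},
                        stale)
```

Only the stale entry is replaced, so schedules that did not change while the agent was offline stay exactly as they were cached.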

Data Insertion

Data is inserted into one of 15 metric tables, collectively called the "raw tables". Here are some facts about them:

15 raw tables suffixed with rXX, where XX is 00 -> 14 (e.g. rhq_meas_data_num_r01)
each is 12 hours "wide"
- thus, we can keep 7 historic days (plus the current / most recent 12 hr timeframe)
table rotations are not TZ sensitive (making it easier to deal with servers in different time zones), but this means that a table's 12-hour window most likely won't run from 12 to 12 local time.

Which table is actively being written into?

   // now is the current time in millis; all divisions are integer divisions
   day = now / MILLIS_PER_DAY;
   timeOfDay = now - (day * MILLIS_PER_DAY);
   table = ((day * TABLES_PER_DAY) // 2 tables per day
             + (timeOfDay / MILLIS_PER_TABLE)); // 12 hrs of millis per table
   tableIndex = (table % TABLE_COUNT); // there are 15 rotating raw tables
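
For readers who want to try the computation, here is a small runnable sketch of the same arithmetic. The constant values follow from the facts listed above (2 tables per day, 12 hours per table, 15 tables); the function names are made up for this example, and the table name prefix is taken from the rhq_meas_data_num_r01 example.

```python
MILLIS_PER_DAY = 24 * 60 * 60 * 1000
TABLES_PER_DAY = 2
MILLIS_PER_TABLE = MILLIS_PER_DAY // TABLES_PER_DAY  # 12 hours
TABLE_COUNT = 15
TABLE_PREFIX = "rhq_meas_data_num_r"  # prefix from the example table name


def raw_table_index(now_millis):
    """Index (0..14) of the raw table that covers the given timestamp."""
    day = now_millis // MILLIS_PER_DAY
    time_of_day = now_millis - day * MILLIS_PER_DAY
    table = day * TABLES_PER_DAY + time_of_day // MILLIS_PER_TABLE
    return table % TABLE_COUNT


def raw_table_name(now_millis):
    """Table name with the two-digit rXX suffix."""
    return f"{TABLE_PREFIX}{raw_table_index(now_millis):02d}"
```

Because the index depends only on the timestamp, every server computes the same active table regardless of its local time zone.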

Metric data, although physically stored in database tables, is managed semantically like a queue. We append the newest data onto the end of the queue, and delete the oldest data from the beginning.

Data Compression and Purging

Think of it like a funnel: all measurements flow into the fat end of the funnel, but over time eventually move downwards and are expelled through the thin end of the funnel. Unlike a traditional funnel, however, the objects are "compressed" into smaller objects as they make their way to the end of the funnel. It's sort of similar to the extrusion process (http://commons.wikimedia.org/wiki/File:Extrusion_process_2.png). As the measurements are forced through the pipe, they are compressed down into a different shape.

In reality, this compression happens many times and on many different levels over the "lifetime" of a metric. Every hour, a server-side job wakes up and performs a dual function: compression and purging.

Definitions

Compression - the act of selecting pieces of data from the raw tables that have crossed some time threshold and calculating the high, low, and average values from within discrete time slices.
Purging - the deletion of data that has crossed a different time threshold and is therefore considered "stale".
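
To make the compression definition concrete, here is a minimal sketch that reduces raw data points to one (low, high, average) triple per time slice. It assumes 1-hour slices aligned to the hour and simple (timestamp, value) pairs; the function name and data shapes are illustrative only and do not reflect the actual server-side job.

```python
from collections import defaultdict

HOUR_MILLIS = 60 * 60 * 1000


def compress(points, slice_millis=HOUR_MILLIS):
    """Group (timestamp_millis, value) pairs into fixed-width slices and
    return {slice_start: (low, high, average)}, ordered by slice start."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of the slice it falls in.
        buckets[ts - ts % slice_millis].append(value)
    return {start: (min(vals), max(vals), sum(vals) / len(vals))
            for start, vals in sorted(buckets.items())}


raw = [(0, 1.0), (1000, 3.0), (HOUR_MILLIS, 10.0), (HOUR_MILLIS + 5, 20.0)]
hourly = compress(raw)  # two slices: one starting at 0, one at HOUR_MILLIS
```

The sketch covers only the first level (raw to 1hr); combining already compressed rows into wider slices would need to merge lows, highs, and averages rather than raw values.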

Below is a depiction of the various levels at which the compression / purge job operates.

raw tables ---compress/purge-->
	1hr tables ---compress/purge-->
		6hr tables ---compress/purge-->
			1day tables ---compress/purge--> purged from system entirely

Data Retention Times

How long do we keep compressed data in each table?

data is purged from the raw tables after – <see "Raw Table Rotations" below>
data is purged from the 1 hr table after – 14 days
data is purged from the 6 hr table after – 31 days
data is purged from the 1 day table after – 365 days

Note: these metric purge values are not configurable today, but are stored in the RHQ_SYSTEM_CONFIG table as rows with the following keys, respectively: CAM_DATA_PURGE_1H, CAM_DATA_PURGE_6H, CAM_DATA_PURGE_1D.
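
As an illustration of how these thresholds translate into a purge decision, the sketch below maps the three configuration keys to the retention periods listed above. How the values are actually encoded in RHQ_SYSTEM_CONFIG is not covered here, so expressing them in milliseconds is an assumption, as are the function names.

```python
MILLIS_PER_DAY = 24 * 60 * 60 * 1000

# Keys are the RHQ_SYSTEM_CONFIG keys named above; the values are the listed
# retention periods, expressed in milliseconds for this sketch (an assumption).
PURGE_AFTER_MILLIS = {
    "CAM_DATA_PURGE_1H": 14 * MILLIS_PER_DAY,
    "CAM_DATA_PURGE_6H": 31 * MILLIS_PER_DAY,
    "CAM_DATA_PURGE_1D": 365 * MILLIS_PER_DAY,
}


def purge_cutoff(now_millis, key):
    """Timestamps strictly older than this cutoff are eligible for purging."""
    return now_millis - PURGE_AFTER_MILLIS[key]


def is_stale(timestamp_millis, now_millis, key):
    """True if a row with this timestamp has outlived its table's retention."""
    return timestamp_millis < purge_cutoff(now_millis, key)
```

For example, with "now" at day 400, a row in the 1 hr table stamped at day 385 is stale, while one stamped at day 390 is not.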

Raw Table Rotations

Since 1 week of data is kept in the raw tables, and since there are 15 tables each containing 12 hours of data, there is actually enough room to store 7.5 days' worth of data. This forms 7 "active" days and one "dead" 12-hr block. The dead table is simply the oldest raw table in the system at the time of purge. It is loosely calculated as follows:

   tableIndex = // see algorithm for computing table index above
   deadTableIndex = (tableIndex + 1) % TABLE_COUNT;

Since we are dealing with table rotations for the raw tables, the dead table is simply truncated (as opposed to deleting data older than some stale interval). Granted, the table rotation algorithm adds a bit more complexity to the way data is compressed / purged, but the gain is in the efficiency of the truncate operation, which executes much more quickly than a SQL delete statement would.
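
A small sketch of the dead table calculation follows. It uses a condensed but arithmetically equivalent form of the table index computation (counting 12-hour blocks directly), and the function names are invented for this example. The truncate statement is shown only to illustrate the idea, using the table naming pattern mentioned earlier.

```python
MILLIS_PER_TABLE = 12 * 60 * 60 * 1000
TABLE_COUNT = 15


def raw_table_index(now_millis):
    # Count whole 12-hour blocks, then wrap around the 15 raw tables.
    return (now_millis // MILLIS_PER_TABLE) % TABLE_COUNT


def dead_table_index(now_millis):
    # The dead table is the one "after" the active table in the rotation.
    return (raw_table_index(now_millis) + 1) % TABLE_COUNT


def truncate_statement(now_millis):
    # Illustrative only: builds the statement text, does not execute anything.
    return f"TRUNCATE TABLE rhq_meas_data_num_r{dead_table_index(now_millis):02d}"
```

One useful property falls out of this: the dead table is always the table that becomes active in the next 12-hour window, so truncating it clears space just before it is written to again.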

Note: if all servers in the system were down, no instance would be available to run this hourly job. As a result, the actual compression algorithm ALSO looks in all of the active tables for stale data to purge (and uses a traditional SQL delete statement for data older than 14 days). However, as long as at least one server in the system stays up, the compression job will run at its regularly scheduled frequency; thus, the operation against the active tables won't have anything to delete, and should execute quickly.

Metric Display

Next we deal with how we display metrics. For any given request, metrics are selected from exactly one of 4 buckets: the raw tables, the 1hr table, the 6hr table, or the 1day table. In other words, it's never the case that, within a single request, some metrics are selected from the raw tables and some from the 1hr table. The algorithm for choosing which table to select from is as follows:

Assuming that you want to view data with the following inclusive range [beginTime, endTime]:

   if ((now - RAW_PURGE_INTERVAL) < beginTime) { // raw purge interval is 7 days
      /*
       * since each raw table holds only 12 hrs of data,
       * need to select data from any table that holds
       * values with timestamps that fall between the
       * inclusive range [beginTime, endTime]
       */
   } else if ((now - purge1h) < beginTime) { // purge1h is 14 days
      // get data from 1 hr table only
   } else if ((now - purge6h) < beginTime) { // purge6h is 31 days
      // get data from 6 h table only
   } else {
      // get data from 1 day table only
   }

Examples:

viewing data from 8hr ago - will select from either 1 or 2 of the most recent raw tables
viewing data from between 12 and 13 days ago - will select from the 1hr table
viewing data from between 12 and 15 days ago - will select from the 6hr table
viewing data from between 16 and 30 days ago - will select from the 6hr table
viewing data from between 16 and 160 days ago - will select from the 1d table

As you can see, only the beginTime (the oldest timestamp in the range) is used to decide which table to pull data from.
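
The selection rule and the examples can be checked with a short sketch. The thresholds are the ones given in the pseudocode comments (7, 14, and 31 days); the function names and the bucket labels are made up for this example, and raw_tables_for_range is an assumed way of working out which 12-hour raw tables overlap a range.

```python
MILLIS_PER_HOUR = 60 * 60 * 1000
MILLIS_PER_DAY = 24 * MILLIS_PER_HOUR
MILLIS_PER_TABLE = 12 * MILLIS_PER_HOUR
TABLE_COUNT = 15

RAW_PURGE_INTERVAL = 7 * MILLIS_PER_DAY
PURGE_1H = 14 * MILLIS_PER_DAY
PURGE_6H = 31 * MILLIS_PER_DAY


def choose_bucket(now, begin_time):
    """Pick the single bucket to read from, based only on begin_time."""
    if now - RAW_PURGE_INTERVAL < begin_time:
        return "raw"
    if now - PURGE_1H < begin_time:
        return "1hr"
    if now - PURGE_6H < begin_time:
        return "6hr"
    return "1day"


def raw_tables_for_range(begin_time, end_time):
    """Indexes of the raw tables whose 12-hour windows overlap the range."""
    first = begin_time // MILLIS_PER_TABLE
    last = end_time // MILLIS_PER_TABLE
    return [block % TABLE_COUNT for block in range(first, last + 1)]


now = 1000 * MILLIS_PER_DAY  # arbitrary "now", exactly on a table boundary
bucket_8h = choose_bucket(now, now - 8 * MILLIS_PER_HOUR)
bucket_13d = choose_bucket(now, now - 13 * MILLIS_PER_DAY)
bucket_15d = choose_bucket(now, now - 15 * MILLIS_PER_DAY)
bucket_160d = choose_bucket(now, now - 160 * MILLIS_PER_DAY)
```

In this run the four buckets come out as raw, 1hr, 6hr, and 1day respectively, matching the examples. The 8hr case touches two raw tables here only because "now" sits exactly on a table boundary; ten hours later the same 8hr window fits inside a single raw table.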