RHQ 4.9

Design-StorageOptions

Introduction

I was asked to share my thoughts on how metric capture could be handled, but to do that I first need to set the frame of reference within which the other points are made.

Core Architecture

Fundamental Vision

So the first order of business is to discuss the overall architecture within which future technology choices should be made. The following is something I have shared with two folks so far, a simple diagram of a proposed JON Kernel architecture:

Capture >> Summarize >> Analyze >> Respond

It would seem to me there are four valuable aspects to management and monitoring software, as listed in the flow above. Some of these aspects are well understood (e.g. Capture), but the remaining three remain poorly implemented in existing solutions to date. Other aspects of management and monitoring (e.g. rendering, provisioning) don't seem particularly interesting; they are well known and better solved by other technology stacks, and so IMO are not part of the core value. Regarding the above, let's define these so we have a common understanding.

The following four themes could be used to gauge whether a feature is essential or not.

Capture

Regarding Capture, this is well solved today, as can be seen in the Agent Plug-ins; these are perfectly capable. Capture transfers event and metric data to Summarize for Analysis; event and metric data are assumed to be either continuous or discontinuous streams of data over the time domain; capturing, storing, and analyzing these data are further assumed to represent the principal bottleneck for the entire system. Data and white papers from Internet-scale service providers confirm this is commonly the case. Later, in the discussion on Storage, we will evaluate several options for addressing issues of capacity, latency, and scale.
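
For concreteness, a minimal sketch of the kind of datum that flows from Capture to Summarize; the class and field names here are illustrative, not the existing RHQ measurement types:

    // Illustrative only: a minimal datum as it might flow from Capture to Summarize.
    // The names are hypothetical, not the existing RHQ measurement classes.
    public final class MetricSample {
        final int scheduleId;   // the monitored resource's measurement schedule
        final long timestamp;   // epoch millis; the time-domain position of the sample
        final double value;     // the captured numeric value

        public MetricSample(int scheduleId, long timestamp, double value) {
            this.scheduleId = scheduleId;
            this.timestamp = timestamp;
            this.value = value;
        }
    }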

Summarize

Summarize is the process of data reduction that improves the performance of later analysis. There is an alternative to the big-data approaches employed by Facebook, Yahoo, and others; the technique encompasses two general ideas: (a) summarization, a data reduction technique focused on retaining limited fidelity over a time domain by representing data as a series of polynomials; with this technique data reduction could be as high as a 1000:1 ratio; and (b) compression, a data reduction technique useful for discrete data having only two fundamental values, on and off, represented as a time-domain square wave in which changes in state are stored as date-value tuples and all values between two changes in state are presumed identical to the prior state, whether on or off. With regard to the former technique, a semi-continuous curve can be calculated to represent the data set with varying degrees of fidelity, such that any intermediate point can simply be recalculated from the provided polynomials and a time-domain value.
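
As a concrete illustration of idea (b), here is a minimal sketch, assuming an in-memory store and hypothetical class names (SquareWaveCompressor, StateChange); it records only the change-points of a binary metric and reconstructs any intermediate state from the prior change:

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of square-wave compression for binary (on/off) data: only changes in
    // state are stored as (timestamp, value) tuples; every sample between two
    // change-points is assumed identical to the prior state.
    public final class SquareWaveCompressor {

        public static final class StateChange {
            final long timestamp;   // epoch millis
            final boolean on;
            StateChange(long timestamp, boolean on) {
                this.timestamp = timestamp;
                this.on = on;
            }
        }

        private final List<StateChange> changes = new ArrayList<>();

        // Record a raw sample; it is kept only if it differs from the last known state.
        public void record(long timestamp, boolean on) {
            if (changes.isEmpty() || changes.get(changes.size() - 1).on != on) {
                changes.add(new StateChange(timestamp, on));
            }
        }

        // Reconstruct the state at an arbitrary time by finding the latest change-point
        // at or before it (linear scan here; a binary search would do for large sets).
        public boolean stateAt(long timestamp) {
            boolean state = false; // assume "off" before the first observation
            for (StateChange c : changes) {
                if (c.timestamp > timestamp) break;
                state = c.on;
            }
            return state;
        }
    }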

Analyze

Regarding Analyze, there are no compelling stories out there that I am aware of. In this phase data is analyzed for patterns, and predictions (probabilities) are calculated using Summarized data: data is used to predict future events based upon data velocity (first derivative), data acceleration (second derivative), percent-variance, and other algorithms. The principal point here is that predictive, probabilistic systems gain control over operational decisions; the system is not reactive like much of the current state of the art, which tends to be operationally heavy. The output of the Analyze phase is a set of probabilities that operational events will occur, ranged 0.0 to 1.0.
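
A hypothetical sketch of the derivative-based analysis described above, using finite differences over the last three summarized samples; the class name, the projection, and especially the mapping onto a [0.0, 1.0] probability are placeholders rather than a worked-out statistical model:

    // Given the last three summarized (time, value) points, estimate velocity (first
    // derivative) and acceleration (second derivative) by finite differences, project
    // the metric forward, and map the result onto [0.0, 1.0]. The probability mapping
    // here is a naive placeholder, not a proper statistical model.
    public final class TrendAnalyzer {

        public static double exceedanceProbability(double[] t, double[] x,
                                                   double limit, double horizon) {
            int n = x.length; // requires n >= 3 and strictly increasing times
            double v1 = (x[n - 1] - x[n - 2]) / (t[n - 1] - t[n - 2]); // latest velocity
            double v0 = (x[n - 2] - x[n - 3]) / (t[n - 2] - t[n - 3]); // previous velocity
            double a  = (v1 - v0) / (t[n - 1] - t[n - 2]);             // acceleration

            // Project forward: x(t + h) ~= x + v*h + 0.5*a*h^2
            double projected = x[n - 1] + v1 * horizon + 0.5 * a * horizon * horizon;

            // Placeholder mapping: fraction of the remaining headroom the projection covers.
            double headroom = limit - x[n - 1];
            if (headroom <= 0) return 1.0; // already at or beyond the limit
            double p = (projected - x[n - 1]) / headroom;
            return Math.max(0.0, Math.min(1.0, p));
        }
    }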

Respond

The Respond aspect induces operational activity based upon probabilities that key events will occur; its inputs are named values ranged [0.0, 1.0]. In the simplified, reactive view of the world, these induce Alerts when a probability exceeds a configured threshold, much in the same way binary perfect failure detectors may be built internally on top of their more sophisticated counterparts, phi accrual failure detectors (which are probabilistic by nature). In predictive systems, however, the outputs from Analyze govern control of resources in advance of negative events occurring. The result will be a preemptively self-healing system, one that can be more naturally elastic.
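
A minimal sketch of the reactive end of Respond, assuming a hypothetical ThresholdResponder that receives the named probability values from Analyze and raises an alert when a configured threshold is crossed; the predictive, resource-governing behavior described above would sit behind a richer interface:

    // Turns a named probability from Analyze into an alert once it crosses a
    // configured threshold; purely illustrative.
    public final class ThresholdResponder {

        private final double threshold; // e.g. 0.8

        public ThresholdResponder(double threshold) {
            this.threshold = threshold;
        }

        // 'probability' is one of the [0.0, 1.0] ranged named values from Analyze.
        public void accept(String eventName, double probability) {
            if (probability >= threshold) {
                System.out.println("ALERT: " + eventName + " p=" + probability);
            }
        }
    }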

Implications

There are several implications that I see in the above.

Laser Focus

The first has to do with the team's ability to execute on features for which the broader community has yet to see any solution; it is my belief that this broader need is where the JON team can differentiate itself, not by adding features that merely result in parity with competitive solutions. The team needs to play its own game, not someone else's.

The first implication is that any existing feature that does not directly contribute to the core vision should be considered non-essential. Resources are limited, and expending resources on areas that are non-essential or non-core (however you decide to define it) limits team productivity and diminishes product value. Specifically, any feature, code, module, or service that is non-core (non-kernel) should be relegated to an adjunct repository, and the core repository should focus solely on the interesting, value-added features. Additionally, all plug-ins should also be moved to adjunct repositories. IMO, paid employees of Red Hat, or those otherwise considered core contributors, should focus solely on the core; any features peripheral to the core vision should be community-maintained or left to wither on the vine.

Bridging the Gap, or Maintaining the Legacy

Accomplishing such a great endeavor is very ambitious, and pragmatically we cannot burn the house to the ground in order to rebuild it; other projects have been guilty of this approach, the community is not very forgiving of this sort of behavior, and it damages the brand. That said, it seems to me that the product is fairly inconsistent and lacks an overall unifying concept or set of concepts. It appears the product has had a LOT of hands in it, and it shows.

So a systematic approach needs to be conceived that accommodates the larger vision while also allowing existing customers to continue using the product as they see it today. Everything discussed above, and below, can be introduced incrementally; much as one can jump a train one box car at a time, the present product could be augmented with a cleaned-up interface for Capture, then Summarize introduced afterward; when that phase is complete, the existing code (GUI server) jumps the box car onto the next implementation. From there on, after the new storage implementation is introduced, the remaining architectural elements may be introduced incrementally, and additional tools and interfaces introduced for these without disturbing what already exists. The big point is that the Core GUI should not be subject to change in the first iteration; the new storage implementation should be wrapped using the existing protocols, layered on top of the new protocols and interfaces, and the existing protocols and interfaces can then be removed at our leisure over time.

Paradigm Shift

The above represents a fundamental paradigm shift in how management and monitoring is viewed, IMO. Instead of focusing retrospectively on what changed or what has failed, analytics enable the system to predict and handle events preemptively; the system is forward-focused. This requires strong analytics, possibly only achievable with the likes of the R language, or the analytics stacks embedded in Hadoop (e.g. Mahout), which is to say that the analytics themselves would be written in R, embedded, and the remaining portions written in Java. But for analytics to act in real time on system behavior, that is, without significant latency, I suspect aggressive data summarization has to occur; data summarization can lead to other positive effects in the system (lowered table-space storage), but the principal reason for it is analytics. Many of the above capabilities are probably beyond the existing product, and perhaps even beyond Java.

Metric Storage

One of the more sensitive areas of the system will be the relation between Capture and Summarize; it faces the highest data transfer rates and arguably occupies more table-space than any other part of the system. Performance analysis shows this to be the most costly area of the product, and it is also what most limits our present ability to scale.

With that, let's drop down a few levels to discuss storage. There are several storage schemes that are applicable: main-memory (Infinispan), column-oriented databases (HBase on Hadoop), flat files (HDF5), and document-oriented databases (CouchDB). All of these lend themselves to high performance, with varying degrees of durability; the key factor is that they lend themselves to highly optimized sequential lookups. Each of these technology stacks would work reasonably well in this space, and many of the observations below apply to each of them. So before we discuss a specific technology stack, let's discuss some assumptions and observations.

Assumptions and Observations

Operational data passes across the relation between Capture and Summarize. Data fundamentally falls into one of two categories: binary (on/off) and numeric. Although binary data may be represented as numeric, it is assumed the reader understands why this would be a poor choice, as it all but eliminates some rather simple and elegant approaches to data compression. On the other hand, keeping the two categories distinct arguably makes the code more complex to write and maintain.

It should be observed that history progresses linearly, may be discontinuous, and that activity is strictly linearizable; events from the past do not occur after events from the future. Data should always arrive in order for any given resource, even though the time domain may not increase monotonically. It SHOULD BE a constraint on the system that for any given resource (when I say resource hereafter I generally mean schedule-id) there is only one channel feeding data, preventing duplicates and ordering issues; data SHOULD ONLY be pushed by one agent thread across a single TCP channel, and only one agent may be permitted to observe/manage any given resource. Since data is linearizable, the representation should favor append-only storage utilizing B+-trees or main-memory caches; history may be written or deleted, but never rewritten or reordered.
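
A small sketch of how the ordering constraint might be enforced at the storage boundary, assuming a hypothetical per-schedule, append-only log; the in-memory list stands in for whatever backing structure (B+-tree, main-memory cache, file) is actually chosen:

    import java.util.ArrayList;
    import java.util.List;

    // One log per schedule-id: appends must not go backwards in time, and history is
    // never rewritten or reordered.
    public final class ScheduleLog {

        private final List<long[]> samples = new ArrayList<>(); // {timestamp, rawBits(value)}
        private long lastTimestamp = Long.MIN_VALUE;

        public void append(long timestamp, double value) {
            if (timestamp < lastTimestamp) {
                // A single agent thread over a single channel should make this impossible.
                throw new IllegalArgumentException("out-of-order sample for this schedule");
            }
            samples.add(new long[] { timestamp, Double.doubleToRawLongBits(value) });
            lastTimestamp = timestamp;
        }
    }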

It is taken for granted that a resource may have child resources, that there are no cycles in the resource graph, and that each node of the graph produces many kinds of metric data (presently represented in measurement definitions). It is further assumed that the number of monitored nodes, hosts in this case, is relatively large; each node will have a finite number of child resources (on the order of tens), and several of these resources are leaf nodes. However, there are complex resources (e.g. JBoss servers) that have graphs of child resources themselves, and it is assumed that the count of these complex resources is small (users generally don't install thousands of JBoss servers on a single host; they generally install two or three). It is assumed that each leaf resource may have a finite number of collected data (e.g. a CPU has both temperature and percent utilization). For the sake of this discussion, it is assumed that the total count of resources is generally on the scale of 10^6, but at Internet scale it could quickly rise to a relative scale of 10^9 or greater. It is assumed, however, that regardless of scale, a single top-level resource on a host (e.g. a service, a disk, a CPU) has a relatively finite number of child resources, generally on the order of 10^3 in aggregate for complex resources; also, the general case may be that top-level resources on a given host are flat (no child resources).

In an elastic world, hosts come and go and their identities vary over time. Since the set of known hosts varies over time, and a specific set exists only at a moment in time, what is of more interest (especially in the context of the cloud) is the health and status of the aggregate "service", not of discrete components; discrete components (hosts, disks, servers, discrete services, etc.) should be autonomically managed; they are allowed, and in fact expected, to fail.

Related to that, in an elastic world fault tolerance and redundancy are paramount; catastrophic failure of any one component must never impact the overall health of the aggregate service. So first and foremost, we need to discuss market position; market position should dictate what fundamental approach is taken for metric storage. For customers with smaller private deployments (non-cloud), management and monitoring of discrete components is paramount, as individual failures may represent a significant compromise in availability or reliability; for the likes of such, life is a frantic attempt to beat out flames with a gasoline-soaked towel. What smaller customers need differs from what other customers need. Customers with larger deployments (possibly cloud) and Internet-scale deployments (cloud) tend to be more interested in the health of the aggregate service, as they may leisurely start up a new virtual server instance, replace an appliance from spares in storage, etc. The former set of customers is not particularly interesting; lots of solutions (Nagios, et al.) are available to tell you when your house is in flames, and there is no need to create a solution for a need that does not exist. What is more interesting is the latter set of customers; for these, few solutions exist for providing meaningful (human-oriented, i.e. simplified) aggregate health.

In order to maximize efficiency, one last observation, related to the prior ones, is made: sequential scans frequently occur over the metric data of a particular resource (schedule-id). There are problems with the existing implementation: a single table stores metric data for multiple resources (schedule-ids), which means the storage representation cannot be optimized for sequential scans over individual resources, and, relatedly, costly lookups (O(n log n)) of data by schedule-id are required. So in summary, these characteristics are important: (a) preserving linearizable data to improve locality of reference, and (b) avoiding costly lookups to find the data you want to analyze.

Approaches

Partition by Schedule ID

One approach that seems to make some sense is to partition data by schedule ID. At first this may seem counter-intuitive, but on further consideration it seems entirely natural. As users increase the count of monitored resources (schedule-ids), operations groups need to increase available schedule storage. A simple formula can be used to predict how much storage is required, calculated from the number of monitored resources and the time domain over which data is kept.
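
For illustration, a back-of-the-envelope sizing sketch; the sample interval, bytes per sample, and retention period below are assumed figures, not numbers from this design:

    // Estimates raw sample storage from resource count and retention window.
    public final class StorageEstimate {
        public static void main(String[] args) {
            long schedules      = 1_000_000L;   // monitored resources (schedule-ids)
            long samplesPerDay  = 24L * 60 / 5; // one sample every 5 minutes -> 288/day
            long bytesPerSample = 16L;          // e.g. 8-byte timestamp + 8-byte value
            long retentionDays  = 365L;

            long totalBytes = schedules * samplesPerDay * bytesPerSample * retentionDays;
            System.out.println(totalBytes); // ~1.7e12 bytes, roughly 1.7 TB per year, before
                                            // summarization or compression
        }
    }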

For example, suppose the implementation chosen were HBase, the column-oriented database of the Hadoop ecosystem (the column orientation matters for the linearizability constraint above). Since it is capable of creating additional columns ad hoc, and since each table can have column counts in the range of 10^6, each column could represent a single resource's metric state. As this is a column-oriented database, table scans are naturally fully sequential, which is optimal for analysis. As the count of resources exceeds the capability of a single table, additional tables could be added. This approach would require a cacheable lookup correlating the schedule id to the specific column family; this lookup should be O(1).
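
A sketch of that cacheable O(1) lookup, assuming a hypothetical ColumnDirectory that maps a schedule id to the table and column holding that resource's metric stream; no actual HBase or column-store API is used here:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Maps schedule-id -> placement (table, column). The Placement type and the way a
    // placement is derived are purely illustrative.
    public final class ColumnDirectory {

        public static final class Placement {
            final String table;
            final String column;
            Placement(String table, String column) { this.table = table; this.column = column; }
        }

        private final Map<Integer, Placement> cache = new ConcurrentHashMap<>();

        public Placement lookup(int scheduleId) {
            // computeIfAbsent keeps the amortized lookup cost at O(1); the loader below
            // derives a deterministic placement just for illustration.
            return cache.computeIfAbsent(scheduleId,
                id -> new Placement("metrics_" + (id / 1_000_000), "s" + id));
        }
    }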
