RHQ 4.9

Design - CallTime 2

This page is for working out the next version of our storage system for call time and response time data.

Requirements

The intention is to enhance call time collection to be a superset of its current functionality. This would support data collection scenarios such as workflow instrumentation or the experimental Byteman plugin grabbing data at specific tiers. The hierarchy would support a simplified trace model (not intended for full traces, but for critical component traces). The most obvious examples would be the HTTP request path, the service component layers (Seam, EJB3), and the data access tier (JPA, JDBC, Hibernate). This would cover the primary tiers necessary for production KPI monitoring.

The attached data would be numerical, summable data across a set of transactions. Examples include the number of requests with a given HTTP response code, the number of rows returned from a JDBC call, etc.
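
As a rough illustration of the shape of this data, a hierarchical tracepoint carrying summable values might look something like the sketch below. None of these class or field names exist in RHQ; they are purely hypothetical.

// Hypothetical sketch of a hierarchical call-time datum; these classes do not
// exist in RHQ, they only illustrate the shape of the collected data.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TraceNode {
    private final String tier;   // e.g. "http", "ejb3", "jdbc"
    private final String key;    // e.g. "GET /app/orders", "OrderBean.find"
    private final List<TraceNode> children = new ArrayList<TraceNode>();

    // Summable numeric values collected across a set of transactions,
    // e.g. count of HTTP 500 responses, rows returned from a JDBC call.
    private final Map<String, Long> sums = new HashMap<String, Long>();

    public TraceNode(String tier, String key) {
        this.tier = tier;
        this.key = key;
    }

    public void add(String metric, long value) {
        Long current = sums.get(metric);
        sums.put(metric, current == null ? value : current + value);
    }

    public TraceNode child(String tier, String key) {
        TraceNode node = new TraceNode(tier, key);
        children.add(node);
        return node;
    }
}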

Scalability requirements mean this has to impose low overhead on the instrumented application and have minimal storage and lookup impact on the server. Initially we are targeting a nominal capacity of 1000 monitored instances for long-term storage. This would include data purge and potentially compression if feasible.

To be most powerful, this system should support integration with the alerting system.

System Differences

One major change from the old system that this implies is detaching the call time metrics from each separate resource so they can be collected at the application level into the hierarchy. The need for a new collection system that supports tracing through different parts of the stack requires this model, and I don't believe we lose much. In fact, it may be an improvement to pull this data up to a more visible location. This model would also allow stack tracing of any JVM-based plugin.

Data Example

Instrumented Classes

Standard Deviations

In order to offer standard deviation information on the response times, I want to enhance the bucket storage and transfer entities to add the necessary fields for a running variance calculation. For all location buckets this would mean storing the current count, the mean, and the second moment (the sum of the squared differences from the mean). We can store these in our typical bucket model and merge them to show deviations over larger bucketed time periods.
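
As a sketch of what that calculation could look like (a Welford-style running accumulation with the standard parallel merge; the class and field names are illustrative, not our actual entities):

// Illustrative running-variance accumulator for a call-time bucket
// (Welford accumulation, Chan et al. merge); names are hypothetical.
public class CallTimeBucket {
    private long count;    // number of calls in the bucket
    private double mean;   // running mean of the response times
    private double m2;     // moment: sum of squared differences from the mean

    /** Add one response time (in milliseconds) to this bucket. */
    public void add(double value) {
        count++;
        double delta = value - mean;
        mean += delta / count;
        m2 += delta * (value - mean);
    }

    /** Merge another bucket into this one, e.g. to roll up into a larger time bucket. */
    public void merge(CallTimeBucket other) {
        if (other.count == 0) {
            return;
        }
        long total = count + other.count;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * ((double) count * other.count / total);
        mean += delta * other.count / total;
        count = total;
    }

    public double variance() {
        return count > 1 ? m2 / (count - 1) : 0.0;
    }

    public double stdDev() {
        return Math.sqrt(variance());
    }
}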

Persistence Options

There are a few options for persisting this data. Historically, we've stored it in the database, like other metric data, as a two-table system of keys and time-series values for those keys. This system has been reasonable so far, but we've not performed a ton of scalability testing on it. The new features will add more load on the system and so may require alternate thinking. We'll want to show time-series data for specific tracepoints or traces across calculated time periods with bucket merging, and to query against min/max/counts/std-dev, etc. as part of reporting and data viewing. We'll also want merged views of the traces across the cluster, and then variance comparisons between them.
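
For illustration, the two-table layout could be sketched roughly as the JPA entities below. The entity, table, and column names here are hypothetical and do not reflect the actual RHQ schema; the m2 field is the running-variance moment discussed above.

// Hypothetical JPA sketch of a two-table layout: one table of call-time keys,
// one table of bucketed time-series values for those keys.
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.JoinColumn;
import javax.persistence.ManyToOne;
import javax.persistence.Table;

@Entity
@Table(name = "EXAMPLE_CALLTIME_KEY")
public class CallTimeKey {
    @Id
    private long id;

    @Column(name = "CALL_DESTINATION")
    private String destination;   // the tracepoint key, e.g. a URL or query
}

@Entity
@Table(name = "EXAMPLE_CALLTIME_VALUE")
class CallTimeValue {
    @Id
    private long id;

    @ManyToOne
    @JoinColumn(name = "KEY_ID")
    private CallTimeKey key;

    @Column(name = "BEGIN_TIME")
    private long beginTime;       // start of the collection bucket

    @Column(name = "END_TIME")
    private long endTime;

    private long count;
    private double minimum;
    private double maximum;
    private double mean;
    private double m2;            // new field for the running variance
}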

As elsewhere, we could consider RRD storage models instead, but the querying and on-the-fly calculation we're talking about here would likely suffer compared to relational storage. Like the rest of the metric system, partitioned relational storage is also a viable option. To start with, I plan to perform some testing and analysis to see what the data levels will be and where the performance edges are for storage.

In Memory Considerations

Key to this system will be the ability to correlate the data collected by the instrumentation and get it out of the VM without impacting the monitored application too much. I'm going to put the results of some high-level testing here to calculate some usage scenarios using the RHQ code base.

Data

I don't yet have data on the number of paths through this info, but with contexts there could be many thousands of monitored points, which would result in a lot of data tracked in memory just to keep the key mappings. Points that match to code points don't need their contexts stored separately (i.e. we don't need to store the method name and can statically track a surrogate temporary lookup key). But for the queries, we need some way to match and aggregate them. To reduce the data we could compress the keys or, more likely, store a hash of them such as SHA-1, which (as a 40-character hex string) would reduce the storage to 40 bytes per key. In testing this reduced the in-memory mappings by about 1/6th, but adds the CPU overhead of hashing. Any hashing or lookup model would also require the full key-to-hash mappings to be stored somewhere outside the VM so lookups can still be matched.
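
For illustration, hashing a key down to a fixed-size surrogate with the JDK's built-in SHA-1 digest could look like this (a hypothetical helper, not existing RHQ code):

// Hash a call-time key to a fixed-size surrogate using the JDK's SHA-1
// implementation; stored as a 40-character hex string it costs 40 bytes
// per key (as ASCII/UTF-8), regardless of how long the original key is.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class KeyHasher {
    public static String sha1Hex(String key) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            byte[] bytes = digest.digest(key.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}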

Another alternative is to reduce what is stored by tracking more limited data, such as the primary table in a query and the type of query. This wouldn't give detailed data access info, but could at least let you know that you selected from table x 20 times in a transaction. It would incur the CPU overhead of parsing the statements.
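
A very rough, hypothetical sketch of that reduction for SQL statements (a real implementation would need proper SQL parsing):

// Reduce a SQL statement to its type and primary table (e.g. "SELECT
// RHQ_RESOURCE") instead of keeping the full query text as the key.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class SqlSummary {
    private static final Pattern[] PATTERNS = {
        Pattern.compile("^\\s*(SELECT|DELETE)\\b.*?\\bFROM\\s+([A-Za-z0-9_\\.]+)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL),
        Pattern.compile("^\\s*(INSERT)\\s+INTO\\s+([A-Za-z0-9_\\.]+)", Pattern.CASE_INSENSITIVE),
        Pattern.compile("^\\s*(UPDATE)\\s+([A-Za-z0-9_\\.]+)", Pattern.CASE_INSENSITIVE)
    };

    /** Returns e.g. "SELECT RHQ_RESOURCE", or null if the statement is not recognized. */
    public static String summarize(String sql) {
        for (Pattern p : PATTERNS) {
            Matcher m = p.matcher(sql);
            if (m.find()) {
                return (m.group(1) + " " + m.group(2)).toUpperCase();
            }
        }
        return null;
    }
}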

The other primary data in memory would be the metrics themselves. These could either be continuously tracked and read/reset by the client for bucketing, or be bucketed in the VM and then output to disk for reading by the client.
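
As a sketch of the read/reset variant (reusing the hypothetical CallTimeBucket above), the instrumentation could add into a concurrent map that the reading client atomically swaps out:

// Illustrative read-and-reset holder: the instrumentation adds to the current
// map, and the reading client atomically swaps in a fresh map so collection
// continues while the old buckets are drained out of the VM. (A record racing
// with the swap could land in the drained map; a production version would
// need to handle that window.)
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

public class CallTimeCollector {
    private final AtomicReference<ConcurrentHashMap<String, CallTimeBucket>> current =
        new AtomicReference<ConcurrentHashMap<String, CallTimeBucket>>(
            new ConcurrentHashMap<String, CallTimeBucket>());

    /** Called from the instrumented code path. */
    public void record(String key, double responseTimeMillis) {
        ConcurrentHashMap<String, CallTimeBucket> map = current.get();
        CallTimeBucket bucket = map.get(key);
        if (bucket == null) {
            CallTimeBucket fresh = new CallTimeBucket();
            CallTimeBucket existing = map.putIfAbsent(key, fresh);
            bucket = existing != null ? existing : fresh;
        }
        synchronized (bucket) {
            bucket.add(responseTimeMillis);
        }
    }

    /** Called by the reading client; returns the filled buckets and starts a new interval. */
    public Map<String, CallTimeBucket> readAndReset() {
        return current.getAndSet(new ConcurrentHashMap<String, CallTimeBucket>());
    }
}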
