This page documents the design, goals, and trade-offs of the metrics data migration tool. The tool transfers metrics-related data from SQL storage to the new Cassandra backend. Only numeric metrics are transferred at this time.
For RHQ servers with large inventories, the tool should be able to gracefully migrate large amounts of data.
The tool needs to be able to restart the process from where it was stopped. This helps in cases such as a migration failure caused by environment instability, or a user cancelling the migration because of its duration.
The tool will integrate with the RHQ upgrade procedures to make the product upgrade as simple as possible for the user.
Non Goals - Non Features
No schema updates are to be done by the tool. Dropping unnecessary tables is done by dbupgrade.
Two migration modes:
Batched mode, permanent migration
Data is migrated in batches of 10,000 rows
First move the data to the new server
Then delete the rows of data from old database
Non-batched mode
Migrate everything at once
Data could be deleted at the end
Possible to leave all old data in the db at the end of migration
In batched mode, if the tool fails, the only possible overlap is for the rows that did not get deleted from the SQL database.
the batch limit of 10,000 rows is low enough to keep the amount of double migration small
do not duplicate data already migrated in Cassandra
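The batched migrate-then-delete loop described above can be sketched as follows. This is an illustrative sketch, not the tool's actual code: the storage operations are simulated with in-memory lists, whereas the real tool would read from JDBC and write through a Cassandra driver. All names here are hypothetical.

```python
BATCH_SIZE = 10_000  # batch limit discussed above

def migrate_batched(sql_rows, cassandra_rows, batch_size=BATCH_SIZE):
    """Move rows in batches: copy a batch to the new store first, then
    delete it from the old store. If the tool dies between the two steps,
    at most one batch is re-migrated on restart (Cassandra writes are
    upserts, so re-writing the same rows is harmless)."""
    while sql_rows:
        batch = sql_rows[:batch_size]   # simulated "read batch" from SQL
        cassandra_rows.extend(batch)    # simulated "write batch" to Cassandra
        del sql_rows[:len(batch)]       # delete from SQL only after the copy
    return cassandra_rows
```

Because deletion always happens after the copy, restarting simply resumes from whatever rows remain in the SQL table.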
The tool can be configured to migrate only a specific table per run, or everything
this gives the user the opportunity to migrate data gradually if too much data was collected in SQL
In non-batched mode, the deletion process is configurable
users can turn off deletion so the migration runs faster; the data can then be removed by the user at a later time by truncating or dropping the table.
The tool computes the TTL for every single row migrated
the new TTL is based on the timestamp of the row and the expected TTL for the table
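The per-row TTL computation described above can be sketched like this: the TTL written for a migrated row is the table's expected retention minus the row's age, so the data expires at the same moment it would have been purged from SQL. The function name and millisecond/second conventions are assumptions for illustration.

```python
import time

def remaining_ttl(row_timestamp_ms, table_ttl_seconds, now_ms=None):
    """Sketch: TTL for a migrated row = table retention - row age.
    Returns 0 for rows already past their retention window (such rows
    could be skipped entirely during migration)."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    age_seconds = (now_ms - row_timestamp_ms) // 1000
    return max(table_ttl_seconds - age_seconds, 0)
```

For example, a raw-data row collected one day ago, in a table with a 14-day retention, would be written with a 13-day TTL.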
The tool will be independent of the existing installer and application.
A standalone approach will allow simple migration/transition for
larger amounts of data
cases where existing data might not be essential and thus not migrated by users
Use-cases
Oracle --> Cassandra
Postgres --> Cassandra
Edge cases to consider
high-volume of metrics
low-volume of metrics
no metrics
Negative test cases to consider
connection failures
if the test fails ... can the tool be used again, or do you need a fresh Cassandra database?
Verification ... or, "How do we determine that migration was successful?"
Output generated by the migration tool
the migration tool generates a human readable log file
the log file contains a message saying the migration was successful ... or that an error occurred
if an error occurs in the migration ... a meaningful human readable error message is displayed
# of rows read from Oracle/Postgres vs. number of rows created in Cassandra ... on a per-resource basis
Visual inspection in RHQ UI
what in RHQ should remain exactly the same after the backend switches from Oracle/Postgres to Cassandra
Unit tests on the data migration tool
Performance and baselining
How long should this take? What is an acceptable migration time?
What is the SLA for a large deployment?
If large deployments may take a long time to migrate ...
is it wise to consider a tool that could migrate things incrementally? 1 resource at a time?
will the UI or command line allow the ability to only migrate 1 or select resources at a time?
If the tool takes a long time to run and migrate the data ...there should be some visual indicator ...% complete, time remaining, etc.....
Risk areas ... If you could predict where a problem may manifest ... where would it be? For each risk area, list a possible mitigation approach.
Testing tool? Is there a need for an automated or semi-automated testing or verification tool?
A tool that accesses both backends ... and verifies metrics on a particular resource are identical.
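Such a verification tool could be sketched as below. This is only a hypothetical outline: `fetch_sql` and `fetch_cassandra` are placeholder callables (simulated here with dictionaries) that would return `{timestamp: value}` for one metric schedule from each backend.

```python
def verify_resource(schedule_ids, fetch_sql, fetch_cassandra, tolerance=0.0):
    """Compare the metric data for each schedule across both backends.
    Returns the list of schedule ids whose data points differ, either in
    the set of timestamps or in the values at matching timestamps."""
    mismatches = []
    for sid in schedule_ids:
        old, new = fetch_sql(sid), fetch_cassandra(sid)
        if set(old) != set(new):           # different timestamps present
            mismatches.append(sid)
            continue
        if any(abs(old[ts] - new[ts]) > tolerance for ts in old):
            mismatches.append(sid)         # a value drifted
    return mismatches
```

A small floating-point `tolerance` may be needed because SQL and Cassandra can store doubles with slightly different precision.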
Pick 4 plugins to be active on a system, add up the number of metrics
The plugins: Platform, JBoss AS5, JBoss Cache 3, JMX
All metrics enabled by user
System running for 1 year
Multiply by 2.5 to account for multiple subresources using the same descriptor
small = 10 agents, medium = 40 agents, large = 125 agents
Aggregate Data Retention
24 entries/day in 1h table for 14 days = 336 entries per collected metric
4 entries/day in 6h table for 31 days = 124 entries per collected metric
1 entry/day in 1d table for 365 days = 365 entries per collected metric
total = 825 data points per metric
Sizing:
time_stamp = long = 8 bytes
schedule_id = integer = 4 bytes
value,min,max = double = 8 bytes
Use 8 bytes for each data type to make things simpler
8 bytes * 5 = 40 bytes per data point
Total space = 825 * 40 = 33,000 bytes ~ 33KB per metric
Raw Retention
Collection - 2 weeks retention per data point
sizing
time_stamp = long = 8 bytes
schedule_id = integer = 4 bytes
value = 8 bytes
use 8 bytes for each data type
8 bytes * 3 = 24 bytes per single collection
Summary metrics
1 collection every 10 minutes = 6/hour = 144/day
total = 2016 data points/metric
total size = 2016 * 24 bytes = 47KB / metric
Detail metrics
1 collection every 20 minutes = 3/hour = 72/day
total = 1008 data points/metric
total size = 1008 * 24 bytes = 24KB/metric
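The raw-retention figures for summary and detail metrics follow the same pattern; here is the arithmetic spelled out (24 bytes per collection, 2-week retention, collection intervals as stated above):

```python
ROW_BYTES = 8 * 3                        # time_stamp, schedule_id, value

summary_points = 6 * 24 * 14             # every 10 min, 2 weeks of data
detail_points = 3 * 24 * 14              # every 20 min, 2 weeks of data
summary_bytes = summary_points * ROW_BYTES
detail_bytes = detail_points * ROW_BYTES
```

This gives 2016 points (48,384 bytes ~ 47 KB) per summary metric and 1008 points (24,192 bytes ~ 24 KB) per detail metric, as listed.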
Each agent:
138 summary metrics
197 detail metrics
335 total metrics
Extrapolate per agent using 2.5 factor
345 summary metrics
492 detail metrics
837 total metrics
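The extrapolation step is a straight multiplication by the 2.5 sub-resource factor, with fractional results truncated:

```python
SUBRESOURCE_FACTOR = 2.5

summary = int(138 * SUBRESOURCE_FACTOR)  # summary metrics per agent
detail = int(197 * SUBRESOURCE_FACTOR)   # detail metrics per agent (492.5 truncated)
total = summary + detail
```

This reproduces the 345 summary, 492 detail, and 837 total metrics per agent used below.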
Sizing Details
| Size | Raw Summary | Raw Detail | Total Aggregate Retention |
|--------|-----------------|-----------------|--------------------|
| small | 345 * 10 = 3450 | 197 * 10 = 1970 | 837 * 10 = 8370 |
| medium | 345 * 40 = 13800 | 197 * 40 = 7880 | 837 * 40 ~ 34K |
| large | 345 * 125 ~ 43.5K | 197 * 125 ~ 25K | 837 * 125 ~ 110K |
Totals

| Size | Number of Rows | Disk Size |
|--------|------------------|--------|
| small | 16 million rows | 474 MB |
| medium | 64 million rows | 1.9 GB |
| large | 197 million rows | 6 GB |
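As a sanity check, the small-deployment row count can be reconstructed from the per-metric point counts derived earlier. This is my own reconstruction, assuming the "small" row combines 3450 summary metrics, 1970 detail metrics, and 8370 aggregate-retention metrics across 10 agents:

```python
# rows = metrics * data points per metric
raw_rows = 3450 * 2016 + 1970 * 1008   # summary + detail raw collections
agg_rows = 8370 * 825                  # 1h/6h/1d aggregate entries
total_rows = raw_rows + agg_rows       # ~15.8 million, listed above as ~16M

# bytes = raw rows at 24 bytes, aggregate rows at 40 bytes
disk_bytes = raw_rows * 24 + agg_rows * 40
```

The result is roughly 15.8 million rows and 470-490 MB depending on the MB convention used, in line with the ~16 million rows / 474 MB listed for the small deployment.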
These estimates do not take into account RHQ installations that monitor servers with a large number of sub-resources (e.g., 100 applications on a single AS instance).
For larger deployments, or deployments with large numbers of sub-resources, multiply the numbers presented here by a factor. For example, if there are 10 times more applications to be monitored than in a typical AS install, one could pick a factor of 10 to get an upper estimate on the amount of data to be transferred.
These estimates could be refined and/or changed at any time based on community feedback.
These estimates are for guidance purposes only; actual deployments could be much smaller or much larger depending on the amount of metric data collected. Examples of factors that influence size are: system configuration, post-discovery configuration of resources, length of time since installation, and DB storage engine.
| Plugin | Summary Numeric Metrics | Detail Numeric Metrics | Total Numeric Metrics |
|--------|------|------|------|
| jboss-as-7 | 62 | 239 | 301 |
| perftest | 62 | 37 | 99 |
| platform | 59 | 62 | 121 |
| cassandra | 58 | 16 | 74 |
| jboss-as-5 | 54 | 106 | 160 |
| jboss-as | 38 | 80 | 118 |
| oracle | 22 | 512 | 534 |
| mysql | 19 | 250 | 269 |
| hadoop | 16 | 21 | 37 |
| postgres | 15 | 20 | 35 |
| rhq-server | 13 | 45 | 58 |
| jmx | 13 | 11 | 24 |
| jboss-cache-v3 | 12 | 18 | 30 |
| virt | 11 | 8 | 19 |
| tomcat | 9 | 12 | 21 |
| hardware | 9 | 6 | 15 |
| rhq-agent | 8 | 17 | 25 |
| sshd | 8 | 15 | 23 |
| iis | 6 | 66 | 72 |
| apache | 4 | 24 | 28 |
| samba | 4 | 13 | 17 |
| byteman | 4 | 0 | 4 |
| hibernate | 3 | 26 | 29 |
| jboss-cache | 3 | 19 | 22 |
| netservices | 3 | 2 | 5 |
| | 3 | 0 | 3 |
| pattern-generator | 2 | 0 | 2 |
| irc | 2 | 0 | 2 |
| database | 2 | 0 | 2 |
| hudson | 2 | 0 | 2 |
| jira | 1 | 1 | 2 |
| onewire | 1 | 0 | 1 |
| snmptrapd | 1 | 0 | 1 |
| mod-cluster | 1 | 0 | 1 |
| script2 | 0 | 4 | 4 |
| aliases | 0 | 0 | 0 |
| grub | 0 | 0 | 0 |
| nss | 0 | 0 | 0 |
| kickstart | 0 | 0 | 0 |
| filetemplate-bundle | 0 | 0 | 0 |
| script | 0 | 0 | 0 |
| ant-bundle | 0 | 0 | 0 |
| database | 0 | 0 | 0 |
| cron | 0 | 0 | 0 |
| services | 0 | 0 | 0 |
| cobbler | 0 | 0 | 0 |
| noop | 0 | 0 | 0 |
| lsof | 0 | 0 | 0 |
| augeas | 0 | 0 | 0 |
| postfix | 0 | 0 | 0 |
| raw-config-test | 0 | 0 | 0 |
| iptables | 0 | 0 | 0 |
| jdbctrace | 0 | 0 | 0 |
| sudoers | 0 | 0 | 0 |
| hosts | 0 | 0 | 0 |