Metrics Data Migration
About
This page documents the design, goals, and trade-offs of the metrics data migration tool. The tool transfers metrics-related data from SQL storage to the new Cassandra backend. Only numeric metrics are transferred at this time.
Goals
- For RHQ servers with large inventories, the tool should be able to migrate large amounts of data gracefully
- The tool needs to be able to restart the process from where it was stopped. This helps in cases such as a migration failure due to environment instability, or the user cancelling the migration because of its duration.
- The tool will integrate with the RHQ upgrade procedures to make the product upgrade as simple as possible for the user.
Non Goals - Non Features
- No schema updates are to be done by the tool. Dropping unnecessary tables is done by dbupgrade.
Design
- Two migration modes:
- Batched mode, permanent migration
- Data is migrated in batches of 10,000 rows
- First move the data to the new server
- Then delete the rows of data from old database
- Non-batched mode
- Migrate everything at once
- Data could be deleted at the end
- Possible to leave all old data in the db at the end of migration
- Batched mode, permanent migration
- In batched mode, if the tool fails, the only possible overlap is the rows of the current batch that did not get deleted from the SQL database.
- the batch limit of 10,000 is low enough to keep the amount of double migration small
- data already migrated to Cassandra is not duplicated
- The tool can be configured to migrate only a specific table per run, or everything
- this gives the user the opportunity to migrate data gradually if too much data has accumulated in SQL
- In non-batched mode, the deletion process is configurable
- users can turn off the deletion so the migration is faster; the data can then be removed by the user at a later time by truncating or dropping the tables.
- The tool computes the TTL for every single row migrated
- the new TTL is based on the timestamp of the row and the expected TTL for the table
- The tool will be independent of the existing installer and application.
- A standalone approach will allow simple migration/transition for
- larger amounts of data
- cases where existing data might not be essential and thus not migrated by users
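The batched, permanent mode described above can be sketched as follows. This is a minimal illustration using in-memory stand-ins for the SQL table and the Cassandra store; all names are hypothetical, not the tool's actual API.

```python
BATCH_SIZE = 10_000  # low enough to bound how much work a restart repeats

def migrate_batched(sql_rows, cassandra_rows, batch_size=BATCH_SIZE):
    """Move rows batch by batch: write to the new store first, then delete
    from the old one, so a failure can only leave one batch un-deleted."""
    while sql_rows:
        batch = sql_rows[:batch_size]
        for row in batch:
            # Keying by (schedule_id, time_stamp) makes a re-run of the same
            # batch overwrite rather than duplicate already-migrated data.
            key = (row["schedule_id"], row["time_stamp"])
            cassandra_rows[key] = row["value"]
        del sql_rows[:batch_size]  # delete only after the batch is written
```

Restarting after a failure simply resumes from whatever rows remain in the SQL store; at worst the last batch is written twice, which the key-based overwrite absorbs.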
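The per-row TTL computation could look like the sketch below; `compute_ttl` and its parameters are made up for illustration, based on the rule that the new TTL derives from the row's timestamp and the table's expected retention.

```python
import time

def compute_ttl(row_timestamp_ms, table_retention_seconds, now_ms=None):
    """New TTL = the table's expected retention minus the row's age.
    Returns None when the row has already outlived its retention,
    meaning it does not need to be migrated at all."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    age_seconds = (now_ms - row_timestamp_ms) // 1000
    remaining = table_retention_seconds - age_seconds
    return remaining if remaining > 0 else None
```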
Testing
- Use-cases
- Oracle --> Cassandra
- Postgres --> Cassandra
- Edge cases to consider
- high-volume of metrics
- low-volume of metrics
- no metrics
- Negative tests cases to consider
- connection failures
- if the test fails ... can the tool be run again, or is a fresh Cassandra database needed?
- Verification ... or, "How do we determine that migration was successful?"
- Output generated by the migration tool
- the migration tool generates a human-readable log file
- the log file records a message saying the migration was successful ... or that an error occurred
- if an error occurs during the migration ... a meaningful, human-readable error message is displayed
- # of rows read from Oracle/Postgres and # of rows created in Cassandra ... on a per-resource basis
- Visual inspection in RHQ UI
- exactly what in RHQ should remain exactly the same after the backend switches from Oracle/Postgres to Cassandra
- Unit tests on the data migration tool
- Performance and baselining
- How long should this take? What is an acceptable migration time?
- What is the SLA for a large deployment?
- If large deployments may take a long time to migrate ...
- is it wise to consider a tool that could migrate things incrementally? 1 resource at a time?
- will the UI or command line allow the ability to only migrate 1 or select resources at a time?
- If the tool takes a long time to run and migrate the data ...there should be some visual indicator ...% complete, time remaining, etc.....
- Risk areas ... If you could predict where a problem may manifest ... where would it be? For each risk area, list a possible mitigation approach.
- Testing tool? Is there a need for an automated or semi-automated testing or verification tool?
- A tool that accesses both backends ... and verifies metrics on a particular resource are identical.
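Such a verification tool could be as simple as a set comparison per resource. The sketch below uses dict-backed stand-ins for the two backends; the names are illustrative only.

```python
def metrics_match(old_backend, new_backend, resource_id):
    """Fetch every (schedule_id, time_stamp, value) triple for one resource
    from both backends and check that the two sets are identical."""
    old = set(old_backend.get(resource_id, []))
    new = set(new_backend.get(resource_id, []))
    return old == new
```

Comparing sets rather than ordered lists tolerates the backends returning rows in different orders, which is likely given the different storage engines.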
Migration estimates
Estimation Technique:
- Pick 4 plugins to be active on a system and add up their numbers of metrics
- The plugins: Platform, JBoss AS5, JBoss Cache 3, JMX
- All metrics enabled by the user
- System running for 1 year
- Multiply by 2.5 to account for multiple subresources using the same descriptor
- small = 10 agents, medium = 40 agents, large = 125 agents
- Aggregate Data Retention
- 24 entries/day in 1h table for 14 days = 336 entries per collected metric
- 4 entries/day in 6h table for 31 days = 124 entries per collected metric
- 1 entry/day in 1d table for 365 days = 365 entries per collected metric
- total = 825 data points per metric
- Sizing:
- time_stamp = long = 8 bytes
- schedule_id = integer = 4 bytes
- value,min,max = double = 8 bytes
- Use 8 bytes for each data type to make things simpler
- 8 bytes * 5 = 40 bytes per data point
- Total space = 825 * 40 bytes = 33KB per metric
- Raw Retention
- Collection - 2 weeks retention per data point
- sizing
- time_stamp = long = 8 bytes
- schedule_id = integer = 4 bytes
- value = 8 bytes
- use 8 bytes for each data type
- 8 bytes * 3 = 24 bytes per single collection
- Summary metrics
- 1 collection every 10 minutes = 6/hour = 144/day
- total = 2016 data points/metric
- total size = 2016 * 24 bytes = 47KB / metric
- Detail metrics
- 1 collection every 20 minutes = 3/hour = 72/day
- total = 1008 data points/metric
- total size = 1008 * 24 bytes = 24KB/metric
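The retention and sizing figures above can be re-derived with a few lines of arithmetic:

```python
# Aggregate data retention: entries per collected metric in each table.
aggregate_points = 24 * 14 + 4 * 31 + 1 * 365   # 1h, 6h, and 1d tables
print(aggregate_points)                          # 336 + 124 + 365 = 825

# Aggregate sizing: 5 fields (time_stamp, schedule_id, value, min, max),
# 8 bytes each for simplicity.
print(aggregate_points * 8 * 5)                  # 33000 bytes, ~33KB/metric

# Raw retention: 2 weeks of collections, 3 fields of 8 bytes each.
summary_points = 144 * 14                        # 1 collection / 10 min
detail_points = 72 * 14                          # 1 collection / 20 min
print(summary_points, summary_points * 24)       # 2016, 48384 (~47KB/metric)
print(detail_points, detail_points * 24)         # 1008, 24192 (~24KB/metric)
```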
Estimate
- Each agent:
- 138 summary metrics
- 197 detail metrics
- 335 total metrics
- Extrapolate per agent using 2.5 factor
- 345 summary metrics
- 492 detail metrics
- 837 total metrics
- Sizing Details
| Size | Raw Summary | Raw Detail | Total Aggregate Retention |
|---|---|---|---|
| small (10 agents) | 345 * 10 = 3450 metrics; 3450 * 2016 ~ 7,000K rows; 3450 * 47KB ~ 158MB | 197 * 10 = 1970 metrics; 1970 * 1008 ~ 2,000K rows; 1970 * 24KB ~ 46MB | 837 * 10 = 8370 metrics; 8370 * 825 ~ 7,000K rows; 8370 * 33KB ~ 270MB |
| medium (40 agents) | 345 * 40 = 13800 metrics; 13800 * 2016 ~ 28,000K rows; 13800 * 47KB ~ 640MB | 197 * 40 = 7880 metrics; 7880 * 1008 ~ 8,000K rows; 7880 * 24KB ~ 186MB | 837 * 40 ~ 34K metrics; 34K * 825 ~ 28,000K rows; 34K * 33KB ~ 1.1GB |
| large (125 agents) | 345 * 125 ~ 43.5K metrics; 43.5K * 2016 ~ 86,000K rows; 43.5K * 47KB ~ 2GB | 197 * 125 ~ 25K metrics; 25K * 1008 ~ 25,000K rows; 25K * 24KB ~ 0.5GB | 837 * 125 ~ 110K metrics; 110K * 825 ~ 86,000K rows; 110K * 33KB ~ 3.5GB |

- Totals

| Size | Number of Rows | Disk Size |
|---|---|---|
| small (10 agents) | 16 million rows | 474 MB |
| medium (40 agents) | 64 million rows | 1.9 GB |
| large (125 agents) | 197 million rows | 6 GB |
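The per-agent extrapolation and the row totals above can be checked programmatically. This is illustrative only; note that the raw-detail column follows the sizing table in using the pre-factor count of 197 detail metrics per agent.

```python
FACTOR = 2.5
summary_per_agent = int(138 * FACTOR)      # 345
detail_per_agent = 197                     # per the sizing table above
aggregate_per_agent = int(335 * FACTOR)    # 837

# Data points per metric, from the retention sections above.
POINTS = {"summary": 2016, "detail": 1008, "aggregate": 825}

def total_rows(agents):
    per_agent = {"summary": summary_per_agent,
                 "detail": detail_per_agent,
                 "aggregate": aggregate_per_agent}
    return sum(per_agent[k] * agents * POINTS[k] for k in POINTS)

for name, agents in [("small", 10), ("medium", 40), ("large", 125)]:
    print(name, total_rows(agents))
# small ~15.8M, medium ~63.4M, large ~198.1M -- matching the rounded
# 16 / 64 / 197 million rows in the totals table.
```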
Disclaimers:
- These estimates do not take into account RHQ installations that monitor servers with large numbers of sub-resources (e.g. 100 applications on a single AS instance).
- For larger deployments, or deployments with large numbers of sub-resources, simply multiply the numbers presented here by a factor. For example, if there are 10 times more applications to be monitored than in a typical AS install, one could pick a factor of 10 to get an upper estimate on the amount of data to be transferred.
- These estimates could be refined and/or changed at any time based on community feedback.
- These estimates are for guidance purposes only; actual deployments could be much smaller or much larger based on the actual amount of metric data collected. Examples of factors that influence size are: system configuration, post-discovery configuration of resources, length of time since installation, and DB storage engine.
Plugin Details
Plugin | Summary Numeric Metrics | Detail Numeric Metrics | Total Numeric Metrics |
---|---|---|---|
jboss-as-7 | 62 | 239 | 301 |
perftest | 62 | 37 | 99 |
platform | 59 | 62 | 121 |
cassandra | 58 | 16 | 74 |
jboss-as-5 | 54 | 106 | 160 |
jboss-as | 38 | 80 | 118 |
oracle | 22 | 512 | 534 |
mysql | 19 | 250 | 269 |
hadoop | 16 | 21 | 37 |
postgres | 15 | 20 | 35 |
rhq-server | 13 | 45 | 58 |
jmx | 13 | 11 | 24 |
jboss-cache-v3 | 12 | 18 | 30 |
virt | 11 | 8 | 19 |
tomcat | 9 | 12 | 21 |
hardware | 9 | 6 | 15 |
rhq-agent | 8 | 17 | 25 |
sshd | 8 | 15 | 23 |
iis | 6 | 66 | 72 |
apache | 4 | 24 | 28 |
samba | 4 | 13 | 17 |
byteman | 4 | 0 | 4 |
hibernate | 3 | 26 | 29 |
jboss-cache | 3 | 19 | 22 |
netservices | 3 | 2 | 5 |
3 | 0 | 3 | |
pattern-generator | 2 | 0 | 2 |
irc | 2 | 0 | 2 |
database | 2 | 0 | 2 |
hudson | 2 | 0 | 2 |
jira | 1 | 1 | 2 |
onewire | 1 | 0 | 1 |
snmptrapd | 1 | 0 | 1 |
mod-cluster | 1 | 0 | 1 |
script2 | 0 | 4 | 4 |
aliases | 0 | 0 | 0 |
grub | 0 | 0 | 0 |
nss | 0 | 0 | 0 |
kickstart | 0 | 0 | 0 |
filetemplate-bundle | 0 | 0 | 0 |
script | 0 | 0 | 0 |
ant-bundle | 0 | 0 | 0 |
database | 0 | 0 | 0 |
cron | 0 | 0 | 0 |
services | 0 | 0 | 0 |
cobbler | 0 | 0 | 0 |
noop | 0 | 0 | 0 |
lsof | 0 | 0 | 0 |
augeas | 0 | 0 | 0 |
postfix | 0 | 0 | 0 |
raw-config-test | 0 | 0 | 0 |
iptables | 0 | 0 | 0 |
jdbctrace | 0 | 0 | 0 |
sudoers | 0 | 0 | 0 |
hosts | 0 | 0 | 0 |