This page documents the design, goals, and trade-offs of the metrics data migration tool. The tool transfers metrics-related data from SQL storage to the new Cassandra backend. Only numeric metrics are transferred at this time.
For RHQ servers with large inventories, the tool should be able to gracefully migrate large amounts of data.
The tool needs to be able to restart the process from where it was stopped. This helps in cases such as a migration failure caused by environment instability, or a user cancelling the migration because of its duration.
The tool will integrate with the RHQ upgrade procedures to make the product upgrade as simple as possible for the user.
Non Goals - Non Features
No schema updates are to be done by the tool. Dropping unnecessary tables is done by dbupgrade.
Two migration modes:
Batched mode, permanent migration
Data is migrated in batches of 10,000 rows
First move the data to the new server
Then delete the rows of data from old database
Non-batched mode
Migrate everything at once
Data could be deleted at the end
Possible to leave all old data in the db at the end of migration
In batched mode, if the tool fails, the only possible overlap is for the rows that did not get deleted from the SQL database.
the batch limit of 10,000 rows is low enough to keep the amount of double migration small
do not duplicate data already migrated in Cassandra
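The batched migrate-then-delete loop described above can be sketched as follows. This is an illustrative sketch, not the tool's actual code: the storage operations are simulated with in-memory lists, whereas the real tool would read from JDBC and write through a Cassandra driver. All names here are hypothetical.

```python
BATCH_SIZE = 10_000  # batch limit discussed above

def migrate_batched(sql_rows, cassandra_rows, batch_size=BATCH_SIZE):
    """Move rows in batches: copy a batch to the new store first, then
    delete it from the old store. If the tool dies between the two steps,
    at most one batch is re-migrated on restart (Cassandra writes are
    upserts, so re-writing the same rows is harmless)."""
    while sql_rows:
        batch = sql_rows[:batch_size]   # simulated "read batch" from SQL
        cassandra_rows.extend(batch)    # simulated "write batch" to Cassandra
        del sql_rows[:len(batch)]       # delete from SQL only after the copy
    return cassandra_rows
```

Because deletion always happens after the copy, restarting simply resumes from whatever rows remain in the SQL table.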
The tool can be configured to migrate only a specific table per run, or everything
this gives the user the opportunity to migrate data gradually if too much data was collected in SQL
In non-batched mode, the deletion process is configurable
users can turn off deletion so the migration runs faster; the data can then be removed by the user at a later time by truncating or dropping the table.
The tool computes the TTL for every single row migrated
the new TTL is based on the timestamp of the row and the expected TTL for the table
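The per-row TTL computation described above can be sketched like this: the TTL written for a migrated row is the table's expected retention minus the row's age, so the data expires at the same moment it would have been purged from SQL. The function name and millisecond/second conventions are assumptions for illustration.

```python
import time

def remaining_ttl(row_timestamp_ms, table_ttl_seconds, now_ms=None):
    """Sketch: TTL for a migrated row = table retention - row age.
    Returns 0 for rows already past their retention window (such rows
    could be skipped entirely during migration)."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    age_seconds = (now_ms - row_timestamp_ms) // 1000
    return max(table_ttl_seconds - age_seconds, 0)
```

For example, a raw-data row collected one day ago, in a table with a 14-day retention, would be written with a 13-day TTL.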
The tool will be independent of the existing installer and application.
A standalone approach will allow simple migration/transition for
larger amounts of data
cases where existing data might not be essential and thus not migrated by users
Use-cases
Oracle --> Cassandra
Postgres --> Cassandra
Edge cases to consider
high-volume of metrics
low-volume of metrics
no metrics
Negative test cases to consider
connection failures
if the test fails ... can the tool be used again, or do you need a fresh Cassandra database?
Verification ... or, "How do we determine that migration was successful?"
Output generated by the migration tool
the migration tool generates a human readable log file
the log file contains a message saying the migration was successful ... or that an error occurred
if an error occurs in the migration ... a meaningful human readable error message is displayed
# of rows read from Oracle/Postgres vs. number of rows created in Cassandra ... on a per-resource basis
Visual inspection in RHQ UI
what in RHQ should remain exactly the same after the backend switches from Oracle/Postgres to Cassandra
Unit tests on the data migration tool
Performance and baselining
How long should this take? What is an acceptable migration time?
What is the SLA for a large deployment?
If large deployments may take a long time to migrate ...
is it wise to consider a tool that could migrate things incrementally? 1 resource at a time?
will the UI or command line allow the ability to only migrate 1 or select resources at a time?
If the tool takes a long time to run and migrate the data ...there should be some visual indicator ...% complete, time remaining, etc.....
Risk areas ... If you could predict where a problem may manifest ... where would it be? For each risk area, list a possible mitigation approach.
Testing tool? Is there a need for an automated or semi-automated testing or verification tool?
A tool that accesses both backends ... and verifies metrics on a particular resource are identical.
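Such a verification tool could be sketched as below. This is only a hypothetical outline: `fetch_sql` and `fetch_cassandra` are placeholder callables (simulated here with dictionaries) that would return `{timestamp: value}` for one metric schedule from each backend.

```python
def verify_resource(schedule_ids, fetch_sql, fetch_cassandra, tolerance=0.0):
    """Compare the metric data for each schedule across both backends.
    Returns the list of schedule ids whose data points differ, either in
    the set of timestamps or in the values at matching timestamps."""
    mismatches = []
    for sid in schedule_ids:
        old, new = fetch_sql(sid), fetch_cassandra(sid)
        if set(old) != set(new):           # different timestamps present
            mismatches.append(sid)
            continue
        if any(abs(old[ts] - new[ts]) > tolerance for ts in old):
            mismatches.append(sid)         # a value drifted
    return mismatches
```

A small floating-point `tolerance` may be needed because SQL and Cassandra can store doubles with slightly different precision.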
Pick 4 plugins to be active on a system, add up the number of metrics
The plugins: Platform, JBoss AS5, JBoss Cache 3, JMX
All metrics enabled by user
System running for 1 year
Multiply by 2.5 to account for multiple subresources using the same descriptor
small = 10 agents, medium = 40 agents, large = 125 agents
Aggregate Data Retention
24 entries/day in 1h table for 14 days = 336 entries per collected metric
4 entries/day in 6h table for 31 days = 124 entries per collected metric
1 entry/day in 1d table for 365 days = 365 entries per collected metric
total = 825 data points per metric
Sizing:
time_stamp = long = 8 bytes
schedule_id = integer = 4 bytes
value,min,max = double = 8 bytes
Use 8 bytes for each data type to make things simpler
8 bytes * 5 = 40 bytes per data point
Total space = 825 * 40 = 33,000 bytes ~ 33KB per metric
Raw Retention
Collection - 2 weeks retention per data point
sizing
time_stamp = long = 8 bytes
schedule_id = integer = 4 bytes
value = 8 bytes
use 8 bytes for each data type
8 bytes * 3 = 24 bytes per single collection
Summary metrics
1 collection every 10 minutes = 6/hour = 144/day
total = 2016 data points/metric
total size = 2016 * 24 bytes = 47KB / metric
Detail metrics
1 collection every 20 minutes = 3/hour = 72/day
total = 1008 data points/metric
total size = 1008 * 24 bytes = 24KB/metric
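The raw-retention figures for summary and detail metrics follow the same pattern; here is the arithmetic spelled out (24 bytes per collection, 2-week retention, collection intervals as stated above):

```python
ROW_BYTES = 8 * 3                        # time_stamp, schedule_id, value

summary_points = 6 * 24 * 14             # every 10 min, 2 weeks of data
detail_points = 3 * 24 * 14              # every 20 min, 2 weeks of data
summary_bytes = summary_points * ROW_BYTES
detail_bytes = detail_points * ROW_BYTES
```

This gives 2016 points (48,384 bytes ~ 47 KB) per summary metric and 1008 points (24,192 bytes ~ 24 KB) per detail metric, as listed.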
Each agent:
138 summary metrics
197 detail metrics
335 total metrics
Extrapolate per agent using 2.5 factor
345 summary metrics
492 detail metrics
837 total metrics
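The extrapolation step is a straight multiplication by the 2.5 sub-resource factor, with fractional results truncated:

```python
SUBRESOURCE_FACTOR = 2.5

summary = int(138 * SUBRESOURCE_FACTOR)  # summary metrics per agent
detail = int(197 * SUBRESOURCE_FACTOR)   # detail metrics per agent (492.5 truncated)
total = summary + detail
```

This reproduces the 345 summary, 492 detail, and 837 total metrics per agent used below.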
Sizing Details
| Size | Raw Summary | Raw Detail | Total Aggregate Retention |
|--------|-----------------|-----------------|--------------------|
| small | 345 * 10 = 3450 | 197 * 10 = 1970 | 837 * 10 = 8370 |
| medium | 345 * 40 = 13800 | 197 * 40 = 7880 | 837 * 40 ~ 34K |
| large | 345 * 125 ~ 43.5K | 197 * 125 ~ 25K | 837 * 125 ~ 110K |
Totals

| Size | Number of Rows | Disk Size |
|--------|------------------|--------|
| small | 16 million rows | 474 MB |
| medium | 64 million rows | 1.9 GB |
| large | 197 million rows | 6 GB |
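As a sanity check, the small-deployment row count can be reconstructed from the per-metric point counts derived earlier. This is my own reconstruction, assuming the "small" row combines 3450 summary metrics, 1970 detail metrics, and 8370 aggregate-retention metrics across 10 agents:

```python
# rows = metrics * data points per metric
raw_rows = 3450 * 2016 + 1970 * 1008   # summary + detail raw collections
agg_rows = 8370 * 825                  # 1h/6h/1d aggregate entries
total_rows = raw_rows + agg_rows       # ~15.8 million, listed above as ~16M

# bytes = raw rows at 24 bytes, aggregate rows at 40 bytes
disk_bytes = raw_rows * 24 + agg_rows * 40
```

The result is roughly 15.8 million rows and 470-490 MB depending on the MB convention used, in line with the ~16 million rows / 474 MB listed for the small deployment.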
These estimates do not take into account RHQ installations that monitor servers with a large number of sub-resources (e.g., 100 applications on a single AS instance).
For larger deployments, or deployments with large numbers of sub-resources, multiply the numbers presented here by a factor. For example, if there are 10 times more applications to be monitored than in a typical AS install, one could pick a factor of 10 to get an upper estimate on the amount of data to be transferred.
These estimates could be refined and/or changed at any time based on community feedback.
These estimates are for guidance purposes only; actual deployments could be much smaller or much larger depending on the amount of metric data collected. Examples of factors that influence size are: system configuration, post-discovery configuration of resources, length of time since installation, and DB storage engine.
| Plugin | Summary Numeric Metrics | Detail Numeric Metrics | Total Numeric Metrics |
|--------|------|------|------|
| jboss-as-7 | 62 | 239 | 301 |
| perftest | 62 | 37 | 99 |
| platform | 59 | 62 | 121 |
| cassandra | 58 | 16 | 74 |
| jboss-as-5 | 54 | 106 | 160 |
| jboss-as | 38 | 80 | 118 |
| oracle | 22 | 512 | 534 |
| mysql | 19 | 250 | 269 |
| hadoop | 16 | 21 | 37 |
| postgres | 15 | 20 | 35 |
| rhq-server | 13 | 45 | 58 |
| jmx | 13 | 11 | 24 |
| jboss-cache-v3 | 12 | 18 | 30 |
| virt | 11 | 8 | 19 |
| tomcat | 9 | 12 | 21 |
| hardware | 9 | 6 | 15 |
| rhq-agent | 8 | 17 | 25 |
| sshd | 8 | 15 | 23 |
| iis | 6 | 66 | 72 |
| apache | 4 | 24 | 28 |
| samba | 4 | 13 | 17 |
| byteman | 4 | 0 | 4 |
| hibernate | 3 | 26 | 29 |
| jboss-cache | 3 | 19 | 22 |
| netservices | 3 | 2 | 5 |
| | 3 | 0 | 3 |
| pattern-generator | 2 | 0 | 2 |
| irc | 2 | 0 | 2 |
| database | 2 | 0 | 2 |
| hudson | 2 | 0 | 2 |
| jira | 1 | 1 | 2 |
| onewire | 1 | 0 | 1 |
| snmptrapd | 1 | 0 | 1 |
| mod-cluster | 1 | 0 | 1 |
| script2 | 0 | 4 | 4 |
| aliases | 0 | 0 | 0 |
| grub | 0 | 0 | 0 |
| nss | 0 | 0 | 0 |
| kickstart | 0 | 0 | 0 |
| filetemplate-bundle | 0 | 0 | 0 |
| script | 0 | 0 | 0 |
| ant-bundle | 0 | 0 | 0 |
| database | 0 | 0 | 0 |
| cron | 0 | 0 | 0 |
| services | 0 | 0 | 0 |
| cobbler | 0 | 0 | 0 |
| noop | 0 | 0 | 0 |
| lsof | 0 | 0 | 0 |
| augeas | 0 | 0 | 0 |
| postfix | 0 | 0 | 0 |
| raw-config-test | 0 | 0 | 0 |
| iptables | 0 | 0 | 0 |
| jdbctrace | 0 | 0 | 0 |
| sudoers | 0 | 0 | 0 |
| hosts | 0 | 0 | 0 |