
Metrics Data Migration

About

This page documents the design, goals, and trade-offs of the metrics data migration tool. The tool will transfer metrics-related data from SQL storage to the new Cassandra backend. Only numeric metrics are to be transferred at this time.

Goals

  1. For RHQ servers with large inventories, the tool should be able to gracefully migrate large amounts of data.
  2. The tool needs to be able to restart the process from where it was stopped. This helps in cases such as a migration failing due to environment instability, or a user cancelling the migration because of its duration.
  3. The tool will integrate with the RHQ upgrade procedures to make the product upgrade as simple as possible for the user.

Non-Goals / Non-Features

  1. No schema updates are to be done by the tool. Dropping unnecessary tables is done by dbupgrade.

Design

  1. Two migration modes:
    1. Batched mode, permanent migration
      1. Data is migrated in batches of 10,000 rows
      2. First move the data to the new server
      3. Then delete the rows of data from the old database
    2. Non-batched mode
      1. Migrate everything at once
      2. Data could be deleted at the end
      3. It is possible to leave all old data in the db at the end of the migration
  2. In batched mode, if the tool fails, the only possible overlap is the rows that did not get deleted from the SQL database.
    1. the limit of 10,000 rows is low enough to keep the amount of double migration small
    2. since Cassandra writes are upserts, re-migrating those rows overwrites rather than duplicates the data already in Cassandra
  3. The tool can be configured to migrate either a single specific table per run or everything at once
    1. this gives the user the opportunity to migrate data gradually if too much data has accumulated in SQL
  4. In non-batched mode, the deletion process is configurable
    1. users can turn off the deletion so the migration is faster; the old data can then be removed by the user at a later time by truncating or dropping the tables
  5. The tool computes the TTL for every single row migrated (see the sketch after this list)
    1. the new TTL is based on the timestamp of the row and the expected TTL for the table
  6. The tool will be independent of the existing installer and application.
    1. A standalone approach will allow a simple migration/transition for
      1. larger amounts of data
      2. cases where existing data might not be essential and is thus not migrated by users
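
A minimal sketch of the batched mode and the TTL computation described above. This is illustrative only: MetricRow, fetchBatch, writeToCassandra, and deleteFromSql are hypothetical placeholders for the JDBC/CQL plumbing, not actual RHQ APIs.

    import java.util.List;

    /** A row of numeric metric data pending migration (hypothetical type). */
    interface MetricRow {
        long getTimestamp();
    }

    /**
     * Sketch of the batched, permanent migration mode. Rows already moved are
     * deleted from SQL, so a re-run after a failure simply continues with the
     * remaining rows; at most one 10,000-row batch can be migrated twice.
     */
    abstract class BatchedMigrator {

        private static final int BATCH_SIZE = 10000;

        /** Expected retention (TTL) for the table being migrated, in seconds. */
        private final long retentionSeconds;

        BatchedMigrator(long retentionSeconds) {
            this.retentionSeconds = retentionSeconds;
        }

        void migrate() {
            List<MetricRow> batch;
            while (!(batch = fetchBatch(BATCH_SIZE)).isEmpty()) {
                for (MetricRow row : batch) {
                    int ttl = computeTtl(row.getTimestamp());
                    if (ttl > 0) {                  // rows older than the retention
                        writeToCassandra(row, ttl); // period would expire immediately
                    }
                }
                deleteFromSql(batch); // delete only after the writes succeed
            }
        }

        /** New TTL = expected TTL for the table minus the age of the row. */
        private int computeTtl(long rowTimestampMillis) {
            long ageSeconds = (System.currentTimeMillis() - rowTimestampMillis) / 1000;
            return (int) (retentionSeconds - ageSeconds);
        }

        abstract List<MetricRow> fetchBatch(int limit);
        abstract void writeToCassandra(MetricRow row, int ttlSeconds);
        abstract void deleteFromSql(List<MetricRow> rows);
    }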

Testing

  1. Use-cases
    1. Oracle --> Cassandra
    2. Postgres --> Cassandra
    3. Edge cases to consider
      1. high volume of metrics
      2. low volume of metrics
      3. no metrics
    4. Negative test cases to consider
      1. connection failures
      2. if the migration fails, can the tool be run again, or is a fresh Cassandra database required?
  2. Verification, or "How do we determine that the migration was successful?"
    1. Output generated by the migration tool
      1. the migration tool generates a human-readable log file
      2. the log file contains a message saying the migration was successful, or that an error occurred
      3. if an error occurs during the migration, a meaningful, human-readable error message is displayed
      4. the number of rows read from Oracle/Postgres and the number of rows created in Cassandra, on a per-resource basis
    2. Visual inspection in the RHQ UI
      1. exactly what in RHQ should remain the same after the backend switches from Oracle/Postgres to Cassandra?
  3. Unit tests on the data migration tool
  4. Performance and baselining
    1. How long should this take? What is an acceptable migration time?
    2. What is the SLA for a large deployment?
    3. If large deployments may take a long time to migrate:
      1. is it wise to consider a tool that could migrate things incrementally, e.g. one resource at a time?
      2. will the UI or command line allow migrating only one resource, or a selected set of resources, at a time?
    4. If the tool takes a long time to run and migrate the data, there should be some visual indicator: % complete, time remaining, etc.
  5. Risk areas: if you could predict where a problem may manifest, where would it be? For each risk area, list a possible mitigation approach.
  6. Testing tool? Is there a need for an automated or semi-automated testing or verification tool?
    1. A tool that accesses both backends and verifies that the metrics for a particular resource are identical (a minimal sketch follows).
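
One possible shape for such a verification check, assuming plain JDBC on the SQL side and the DataStax Java driver on the Cassandra side. The Cassandra table name one_hour_metrics is an assumption; the SQL table name is meant to be RHQ's 1-hour aggregate table.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    import com.datastax.driver.core.Session;

    /**
     * Compares per-schedule row counts between the SQL backend and Cassandra.
     * Table names are illustrative; a fuller check would also compare the
     * value, min and max columns row by row.
     */
    public class MigrationVerifier {

        public static boolean countsMatch(Connection sql, Session cassandra,
                                          int scheduleId) throws Exception {
            long sqlCount;
            try (PreparedStatement ps = sql.prepareStatement(
                    "SELECT COUNT(*) FROM RHQ_MEASUREMENT_DATA_NUM_1H WHERE SCHEDULE_ID = ?")) {
                ps.setInt(1, scheduleId);
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();
                    sqlCount = rs.getLong(1);
                }
            }

            long cassandraCount = cassandra.execute(
                    "SELECT COUNT(*) FROM one_hour_metrics WHERE schedule_id = " + scheduleId)
                    .one().getLong(0);

            return sqlCount == cassandraCount;
        }
    }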

Migration estimates

Estimation Technique:
  1. Pick 4 plugins to be active on a system and add up their numbers of metrics
    1. The plugins: Platform, JBoss AS5, JBoss Cache 3, JMX
  2. All metrics enabled by the user
  3. System running for 1 year
  4. Multiply by 2.5 to account for multiple subresources using the same descriptor
  5. Deployment sizes: small = 10 agents, medium = 40 agents, large = 125 agents
  6. Aggregate Data Retention (see the sketch after this list)
    1. 24 entries/day in the 1h table for 14 days = 336 entries per collected metric
    2. 4 entries/day in the 6h table for 31 days = 124 entries per collected metric
    3. 1 entry/day in the 1d table for 365 days = 365 entries per collected metric
    4. total = 825 data points per metric
    5. Sizing:
      1. time_stamp = long = 8 bytes
      2. schedule_id = integer = 4 bytes
      3. value, min, max = double = 8 bytes each
      4. use 8 bytes for each data type to make things simpler
      5. 8 bytes * 5 = 40 bytes per data point
      6. total space = 825 * 40 = 33KB per metric
  7. Raw Retention
    1. Collections are retained for 2 weeks
    2. Sizing:
      1. time_stamp = long = 8 bytes
      2. schedule_id = integer = 4 bytes
      3. value = 8 bytes
      4. use 8 bytes for each data type
      5. 8 bytes * 3 = 24 bytes per single collection
    3. Summary metrics
      1. 1 collection every 10 minutes = 6/hour = 144/day
      2. total = 2016 data points/metric
      3. total size = 2016 * 24 bytes = 47KB / metric
    4. Detail metrics
      1. 1 collection every 20 minutes = 3/hour = 72/day
      2. total = 1008 data points/metric
      3. total size = 1008 * 24 bytes = 24KB/metric
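
For reference, the per-metric arithmetic in the two retention items above can be reproduced with a throwaway snippet (not part of the tool itself):

    /** Reproduces the per-metric sizing arithmetic from the estimation technique. */
    public class SizingEstimate {

        public static void main(String[] args) {
            // Aggregate retention: 24/day for 14 days + 4/day for 31 days + 1/day for 365 days
            int aggregatePoints = 24 * 14 + 4 * 31 + 365;   // 336 + 124 + 365 = 825
            int aggregateBytes  = aggregatePoints * 40;     // 40 bytes/point ~ 33KB per metric

            // Raw summary metrics: 1 collection every 10 minutes, kept for 14 days
            int summaryPoints = 6 * 24 * 14;                // 2016 data points
            int summaryBytes  = summaryPoints * 24;         // 24 bytes/point ~ 47KB per metric

            // Raw detail metrics: 1 collection every 20 minutes, kept for 14 days
            int detailPoints = 3 * 24 * 14;                 // 1008 data points
            int detailBytes  = detailPoints * 24;           // 24 bytes/point ~ 24KB per metric

            System.out.printf("aggregate: %d points, %d bytes%n", aggregatePoints, aggregateBytes);
            System.out.printf("raw summary: %d points, %d bytes%n", summaryPoints, summaryBytes);
            System.out.printf("raw detail: %d points, %d bytes%n", detailPoints, detailBytes);
        }
    }
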
Estimate
  1. Each agent:
    1. 138 summary metrics
    2. 197 detail metrics
    3. 335 total metrics
  2. Extrapolate per agent using the 2.5 factor
    1. 345 summary metrics
    2. 492 detail metrics
    3. 837 total metrics
  3. Sizing Details
     Size               | Raw Summary            | Raw Detail             | Total Aggregate Retention
     -------------------|------------------------|------------------------|--------------------------
     small              | 345 * 10 = 3450        | 197 * 10 = 1970        | 837 * 10 = 8370
     (10 agents)        | 3450 * 2016 ~ 7,000K   | 1970 * 1008 ~ 2,000K   | 8370 * 825 ~ 7,000K
                        | 3450 * 47KB ~ 158MB    | 1970 * 24KB ~ 16MB     | 8370 * 33KB ~ 270MB
     medium             | 345 * 40 = 13800       | 197 * 40 = 7880        | 837 * 40 ~ 34K
     (40 agents)        | 13800 * 2016 ~ 28,000K | 7880 * 1008 ~ 8,000K   | 34K * 825 ~ 28,000K
                        | 13800 * 47KB ~ 640MB   | 7880 * 24KB ~ 186MB    | 34K * 33KB ~ 1.1 GB
     large              | 345 * 125 ~ 43.5K      | 197 * 125 ~ 25K        | 837 * 125 ~ 110K
     (125 agents)       | 43.5K * 2016 ~ 86,000K | 25K * 1008 ~ 25,000K   | 110K * 825 ~ 86,000K
                        | 43.5K * 47KB ~ 2GB     | 25K * 24KB ~ 0.5GB     | 110K * 33KB ~ 3.5GB
  4. Totals

     Size               | Number of Rows   | Disk Size
     small (10 agents)  | 16 million rows  | 474 MB
     medium (40 agents) | 64 million rows  | 1.9 GB
     large (125 agents) | 197 million rows | 6 GB
Disclaimers:
  1. These estimates do not take into account RHQ installations that monitor servers with a large number of sub-resources (e.g. 100 applications on a single AS instance).
  2. For larger deployments, or deployments with large numbers of sub-resources, simply multiply the numbers presented here by a factor. For example, if there are 10 times more applications to be monitored than in a typical AS install, one could pick a factor of 10 to get an upper estimate on the amount of data to be transferred.
  3. These estimates could be refined and/or changed at any time based on community feedback.
  4. These estimates are for guidance purposes only; actual deployments could be much smaller or much larger based on the actual amount of metric data collected. Examples of factors that influence size are: system configuration, post-discovery configuration of resources, length of time since installation, and db storage engine.
Plugin Details
Plugin | Summary Numeric Metrics | Detail Numeric Metrics | Total Numeric Metrics
jboss-as-7 62 239 301
perftest 62 37 99
platform 59 62 121
cassandra 58 16 74
jboss-as-5 54 106 160
jboss-as 38 80 118
oracle 22 512 534
mysql 19 250 269
hadoop 16 21 37
postgres 15 20 35
rhq-server 13 45 58
jmx 13 11 24
jboss-cache-v3 12 18 30
virt 11 8 19
tomcat 9 12 21
hardware 9 6 15
rhq-agent 8 17 25
sshd 8 15 23
iis 6 66 72
apache 4 24 28
samba 4 13 17
byteman 4 0 4
hibernate 3 26 29
jboss-cache 3 19 22
netservices 3 2 5
twitter 3 0 3
pattern-generator 2 0 2
irc 2 0 2
database 2 0 2
hudson 2 0 2
jira 1 1 2
onewire 1 0 1
snmptrapd 1 0 1
mod-cluster 1 0 1
script2 0 4 4
aliases 0 0 0
grub 0 0 0
nss 0 0 0
kickstart 0 0 0
filetemplate-bundle 0 0 0
script 0 0 0
ant-bundle 0 0 0
database 0 0 0
cron 0 0 0
services 0 0 0
cobbler 0 0 0
noop 0 0 0
lsof 0 0 0
augeas 0 0 0
postfix 0 0 0
raw-config-test 0 0 0
iptables 0 0 0
jdbctrace 0 0 0
sudoers 0 0 0
hosts 0 0 0