JBoss Community Archive (Read Only)

RHQ 4.9

Metrics Data Migration

About

This page documents the design, goals, and trade-offs of the metrics data migration tool. The tool transfers metrics-related data from SQL storage to the new Cassandra backend. Only numeric metrics are to be transferred at this time.

Goals

  1. For RHQ servers with large inventories, the tool should be able to gracefully migrate large amounts of data.

  2. The tool needs to be able to restart the process from where it was stopped. This helps in cases such as a migration failure due to environment instability, or a user cancelling the migration because of its duration.

  3. The tool will integrate with the RHQ upgrade procedures to make the product upgrade as simple as possible for the user.

Non Goals - Non Features

  1. No schema updates are to be done by the tool.  Dropping unnecessary tables is done by dbupgrade.

Design

  1. Two migration modes:

    1. Batched mode, permanent migration

      1. Data is migrated in batches of 10,000 rows

      2. First move the data to the new server

      3. Then delete the rows of data from the old database

    2. Non-batched mode

      1. Migrate everything at once

      2. Data could be deleted at the end

      3. Possible to leave all old data in the db at the end of migration

  2. In batched mode, if the tool fails, the only possible overlap is the rows that were migrated but not yet deleted from the SQL database.

    1. the batch limit of 10,000 rows is low enough to keep the amount of double migration small

    2. re-migrating those rows does not duplicate data already in Cassandra, because Cassandra writes are idempotent upserts

  3. The tool can be configured to migrate either a single table per run or everything at once

    1. this gives the user the opportunity to migrate data gradually if a large amount of data has accumulated in SQL

  4. In non-batched mode, the deletion process is configurable

    1. users can turn off the deletion so the migration runs faster; the old data can then be removed by the user at a later time by truncating or dropping the tables.

  5. The tool computes the TTL for every single row migrated

    1. the new TTL is based on the timestamp of the row and the expected TTL for the table

  6. The tool will be independent of the existing installer and application.

    1. A standalone approach will allow a simple migration/transition for

      1. larger amounts of data

      2. cases where existing data might not be essential and thus not migrated by users
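The batched flow and the per-row TTL computation above can be sketched as follows. This is a minimal illustration, not the actual tool: `fetch_batch`, `write_to_cassandra`, and `delete_rows` are hypothetical stand-ins for the real JDBC and CQL calls, and the 10,000-row batch size comes from the design above.

```python
import time

BATCH_SIZE = 10_000  # rows per batch, per the batched mode above


def remaining_ttl(row_timestamp_ms, retention_seconds, now_ms=None):
    """New TTL for a migrated row: the table's retention window minus
    the row's current age. A non-positive result means the row has
    already outlived its retention and can be skipped entirely."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    age_seconds = (now_ms - row_timestamp_ms) // 1000
    return retention_seconds - age_seconds


def migrate_table(fetch_batch, write_to_cassandra, delete_rows):
    """Migrate one metrics table batch by batch.

    Data is copied to Cassandra first and deleted from SQL second, so a
    crash between the two steps re-migrates at most one batch; since
    Cassandra writes are upserts, replaying a batch does not duplicate
    data. Restarting the tool simply resumes with whatever rows are
    still left in the SQL table.
    """
    migrated = 0
    while True:
        rows = fetch_batch(BATCH_SIZE)  # oldest rows still in SQL
        if not rows:
            break
        write_to_cassandra(rows)        # step 1: copy to the new backend
        delete_rows(rows)               # step 2: drop from the old backend
        migrated += len(rows)
    return migrated
```

Because deletion always trails the Cassandra write, the SQL table itself acts as the restart checkpoint; no separate progress file is needed.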

Testing

  1. Use-cases

    1. Oracle --> Cassandra

    2. Postgres --> Cassandra

    3. Edge cases to consider

      1. high-volume of metrics

      2. low-volume of metrics

      3. no metrics    

    4. Negative tests cases to consider

      1. connection failures

      2. if the migration fails ... can the tool be used again? or is a fresh Cassandra database needed?

  2. Verification ... or, "How do we determine that migration was successful?"

    1. Output generated by the migration tool

      1. the migration tool generates a human-readable log file

      2. the log file contains a message saying the migration was successful ... or that an error occurred

      3. if an error occurs during the migration ... a meaningful, human-readable error message is displayed

      4. number of rows read from Oracle/Postgres and number of rows created in Cassandra ... on a per-resource basis

    2. Visual inspection in RHQ UI

      1. exactly what in RHQ should remain the same after the backend switches from Oracle/Postgres to Cassandra

  3. Unit tests on the data migration tool

  4. Performance and baselining

    1. How long should this take?  What is an acceptable migration time?  

    2. What is the SLA for a large deployment?

    3. If large deployments may take a long time to migrate ... 

      1. is it wise to consider a tool that could migrate things incrementally?  1 resource at a time?  

      2. will the UI or command line allow migrating only one resource, or a selected set of resources, at a time?

    4. If the tool takes a long time to run and migrate the data ... there should be some visual progress indicator: % complete, time remaining, etc.

  5. Risk areas ... If you could predict where a problem may manifest, where would it be? For each risk area, list a possible mitigation approach.

  6. Testing tool? Is there a need for an automated or semi-automated testing or verification tool?

    1. A tool that accesses both backends ... and verifies metrics on a particular resource are identical.
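A minimal sketch of such a verification tool, assuming hypothetical `fetch_sql` and `fetch_cassandra` helpers that each return a `{timestamp: value}` map for a given schedule (in practice these would wrap JDBC and CQL queries):

```python
def verify_schedule(fetch_sql, fetch_cassandra, schedule_id):
    """Compare one schedule's metrics across both backends.

    Returns the (timestamp, sql_value, cassandra_value) triples that
    disagree; an empty list means the schedule verified cleanly.
    """
    sql_data = fetch_sql(schedule_id)
    cass_data = fetch_cassandra(schedule_id)
    mismatches = []
    # Walk the union of timestamps so missing rows on either side show up.
    for ts in sorted(set(sql_data) | set(cass_data)):
        a, b = sql_data.get(ts), cass_data.get(ts)
        if a != b:
            mismatches.append((ts, a, b))
    return mismatches
```

Running this per resource would also yield the per-resource row counts mentioned under verification output.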

Migration estimates

Estimation Technique:

  1. Pick 4 plugins to be active on a system, add number of metrics

    1. The plugins: Platform, JBoss AS5, JBoss Cache 3, JMX

  2. All metrics enabled by user

  3. System running for 1 year

  4. Multiply by 2.5 to account for multiple sub-resources using the same descriptor

  5. small = 10 agents, medium = 40 agents, large = 125 agents

  6. Aggregate Data Retention

    1. 24 entries/day in 1h table for 14 days = 336 entries per collected metric

    2. 4 entries/day in 6h table for 31 days = 124 entries per collected metric

    3. 1 entry/day in 1d table for 365 days = 365 entries per collected metric

    4. total = 825 data points per metric

    5. Sizing:

      1. time_stamp = long = 8 bytes

      2. schedule_id = integer = 4 bytes

      3. value, min, max = double = 8 bytes each

      4. Use 8 bytes for each data type to make things simpler

      5. 8 bytes * 5 = 40 bytes per data point

      6. Total space = 825 * 40 = 33,000 bytes ~ 33KB per metric

  7. Raw Retention

    1. Collection - 2 weeks retention per data point

    2. sizing

      1. time_stamp = long = 8 bytes

      2. schedule_id = integer = 4 bytes

      3. value = 8 bytes

      4. use 8 bytes for each data type

      5. 8 bytes * 3 = 24 bytes per single collection

    3. Summary metrics

      1. 1 collection every 10 minutes = 6/hour = 144/day

      2. total = 2016 data points/metric

      3. total size = 2016 * 24 bytes = 47KB / metric

    4. Detail metrics

      1. 1 collection every 20 minutes = 3/hour = 72/day

      2. total = 1008 data points/metric

      3. total size = 1008 * 24 bytes = 24KB/metric
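The retention figures above can be double-checked with a few lines of arithmetic:

```python
# Aggregate retention: entries in the 1h, 6h, and 1d tables per collected metric.
aggregate_points = 24 * 14 + 4 * 31 + 1 * 365
assert aggregate_points == 825
# 5 columns (time_stamp, schedule_id, value, min, max) at 8 bytes each.
assert aggregate_points * 8 * 5 == 33_000            # ~33KB per metric

# Raw retention: 2-week window, 3 columns at 8 bytes = 24 bytes per point.
summary_points = 6 * 24 * 14   # one collection every 10 minutes
detail_points = 3 * 24 * 14    # one collection every 20 minutes
assert summary_points == 2016 and summary_points * 24 == 48_384   # ~47KB per metric
assert detail_points == 1008 and detail_points * 24 == 24_192     # ~24KB per metric
```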

Estimate

  1. Each agent:

    1. 138 summary metrics

    2. 197 detail metrics

    3. 335 total metrics

  2. Extrapolate per agent using a factor of 2.5

    1. 345 summary metrics

    2. 492 detail metrics

    3. 837 total metrics

  3. Sizing Details

    | Deployment Size | Raw Summary | Raw Detail | Total Aggregate Retention |
    | --- | --- | --- | --- |
    | small (10 agents) | 345 * 10 = 3450 metrics; 3450 * 2016 ~ 7,000K rows; 3450 * 47KB ~ 158MB | 197 * 10 = 1970 metrics; 1970 * 1008 ~ 2,000K rows; 1970 * 24KB ~ 16MB | 837 * 10 = 8370 metrics; 8370 * 825 ~ 7,000K rows; 8370 * 33KB ~ 270MB |
    | medium (40 agents) | 345 * 40 = 13800 metrics; 13800 * 2016 ~ 28,000K rows; 13800 * 47KB ~ 640MB | 197 * 40 = 7880 metrics; 7880 * 1008 ~ 8,000K rows; 7880 * 24KB ~ 186MB | 837 * 40 ~ 34K metrics; 34K * 825 ~ 28,000K rows; 34K * 33KB ~ 1.1GB |
    | large (125 agents) | 345 * 125 ~ 43.5K metrics; 43.5K * 2016 ~ 86,000K rows; 43.5K * 47KB ~ 2GB | 197 * 125 ~ 25K metrics; 25K * 1008 ~ 25,000K rows; 25K * 24KB ~ 0.5GB | 837 * 125 ~ 110K metrics; 110K * 825 ~ 86,000K rows; 110K * 33KB ~ 3.5GB |

  4. Totals

    | Deployment Size | Number of Rows | Disk Size |
    | --- | --- | --- |
    | small (10 agents) | 16 million | 474 MB |
    | medium (40 agents) | 64 million | 1.9 GB |
    | large (125 agents) | 197 million | 6 GB |
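The per-agent extrapolation behind these estimates works out as:

```python
# Per-agent metric counts before the sub-resource factor (see "Each agent").
summary, detail = 138, 197
ext_summary = int(summary * 2.5)    # 345
ext_detail = int(detail * 2.5)      # 492 (492.5, truncated)
assert (ext_summary, ext_detail, ext_summary + ext_detail) == (345, 492, 837)
```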

Disclaimers:

  1. These estimates do not take into account RHQ installations that monitor servers with a large number of sub-resources (e.g., 100 applications on a single AS instance).

  2. For larger deployments, or deployments with large numbers of sub-resources, multiply the numbers presented here by a factor. For example, if there are 10 times more applications to be monitored than in a typical AS install, one could pick a factor of 10 to get an upper estimate on the amount of data to be transferred.

  3. These estimates could be refined and/or changed at any time based on community feedback.

  4. These estimates are for guidance purposes only; actual deployments could be much smaller or much larger based on the actual amount of metric data collected. Examples of factors that influence size: system configuration, post-discovery configuration of resources, length of time since installation, and the DB storage engine.

Cassandra estimates

Estimation Technique:

  1. Pick only the AS7 plugin and essential plugins, add number of metrics

    1. essential plugins = platform + storage node

  2. All metrics enabled by user

  3. System running for 1 year

  4. Multiply by 4 to account for multiple sub-resources using the same descriptor

  5. small = 10 agents, medium = 40 agents, large = 125 agents

  6. Aggregate 

    1. 24 entries/day in 1h table for 14 days = 336 entries per collected metric

    2. 4 entries/day in 6h table for 31 days = 124 entries per collected metric

    3. 1 entry/day in 1d table for 365 days = 365 entries per collected metric

    4. total = 825 data points per metric

    5. Sizing:

      1. single col size =  35 bytes

      2. min + max + avg = 3* single col size = 105 bytes

      3. total per aggregate ~ 105 bytes

      4. row key = 27 bytes

      5. Total space = 825 * 105 bytes + 3 * 27 bytes ~ 85KB / per metric

  7. Raw

    1. Collection - 1 week retention per data point

    2. sizing

      1. col name = 8 bytes

      2. col val = 8 bytes

      3. meta data = 23 bytes

      4. total per raw ~ 39 bytes

      5. row key = 27 bytes

    3. Summary metrics

      1. 1 collection every 10 minutes = 6/hour = 144/day

      2. total = 1008 data points/metric

      3. total size = 1008 * 39 bytes + 27 bytes ~ 39KB / metric

    4. Detail metrics

      1. 1 collection every 20 minutes = 3/hour = 72/day

      2. total = 504 data points/metric

      3. total size = 504 * 39 bytes + 27 bytes ~ 20KB/metric

  8. Metrics Index

    1. data = 21 bytes

    2. index = 27 bytes

    3. 3 total entries per metric = 1 for each aggregate table

    4. total size = 3 * 21 bytes + 27 bytes = 90 bytes
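The Cassandra sizing figures above can be double-checked with a few lines of arithmetic:

```python
# Aggregate: 825 entries, min+max+avg at 35 bytes per column, 3 row keys.
aggregate_points = 24 * 14 + 4 * 31 + 1 * 365
assert aggregate_points == 825
assert aggregate_points * 3 * 35 + 3 * 27 == 86_706   # ~85KB per metric

# Raw: 39 bytes per point (8 name + 8 value + 23 metadata), 1-week window.
per_raw = 8 + 8 + 23
assert per_raw == 39
assert 6 * 24 * 7 * per_raw + 27 == 39_339   # summary, ~39KB per metric
assert 3 * 24 * 7 * per_raw + 27 == 19_683   # detail, ~20KB per metric

# Metrics index: one 21-byte entry per aggregate table plus a 27-byte key.
assert 3 * 21 + 27 == 90                     # bytes per metric
```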

Estimate

  1. Each agent:

    1. (62 + 59 + 58) = 179  ~ 180 summary metrics

    2. (239 + 62 + 16) = 317 ~ 320 detail metrics

    3. 496 ~ 500 total metrics

  2. Extrapolate per agent using a factor of 4

    1. 720 summary metrics

    2. 1280 detail metrics

    3. 2000 total metrics

  3. Sizing Details

    | Deployment Size | Raw Summary | Raw Detail | Total Aggregate Retention |
    | --- | --- | --- | --- |
    | small (10 agents) | 720 * 10 = 7,200 metrics; 7,200 * 1,008 ~ 7,300K rows; 7,200 * 39KB ~ 270MB | 1,280 * 10 = 12,800 metrics; 12,800 * 504 ~ 6,500K rows; 12,800 * 20KB ~ 250MB | 2,000 * 10 = 20,000 metrics; 20,000 * 825 ~ 16,500K rows; 20,000 * 85KB ~ 1.6GB |
    | medium (40 agents) | 720 * 40 = 28,800 metrics; 28,800 * 1,008 ~ 29,000K rows; 28,800 * 39KB ~ 1.1GB | 1,280 * 40 = 51,200 metrics; 51,200 * 504 ~ 26,000K rows; 51,200 * 20KB ~ 1GB | 2,000 * 40 = 80,000 metrics; 80,000 * 825 = 66,000K rows; 80,000 * 85KB ~ 6.5GB |
    | large (125 agents) | 720 * 125 = 90,000 metrics; 90,000 * 1,008 ~ 91,000K rows; 90,000 * 39KB ~ 3.3GB | 1,280 * 125 = 160,000 metrics; 160,000 * 504 ~ 81,000K rows; 160,000 * 20KB ~ 3.1GB | 2,000 * 125 = 250,000 metrics; 250,000 * 825 ~ 206,000K rows; 250,000 * 85KB ~ 20GB |

  4. Totals

    | Deployment Size | Number of Rows | Disk Size |
    | --- | --- | --- |
    | small (10 agents) | ~30 million | ~2.1 GB |
    | medium (40 agents) | ~121 million | ~8.5 GB |
    | large (125 agents) | ~378 million | ~27 GB |
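The per-agent extrapolation used in this estimate (a factor of 4 applied to the rounded counts) works out as:

```python
# Rounded per-agent counts from "Each agent" above.
summary, detail = 180, 320
assert summary * 4 == 720
assert detail * 4 == 1280
assert (summary + detail) * 4 == 2000
```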

Plugin Details

| Plugin | Summary Numeric Metrics | Detail Numeric Metrics | Total Numeric Metrics |
| --- | --- | --- | --- |
| jboss-as-7 | 62 | 239 | 301 |
| perftest | 62 | 37 | 99 |
| platform | 59 | 62 | 121 |
| cassandra | 58 | 16 | 74 |
| jboss-as-5 | 54 | 106 | 160 |
| jboss-as | 38 | 80 | 118 |
| oracle | 22 | 512 | 534 |
| mysql | 19 | 250 | 269 |
| hadoop | 16 | 21 | 37 |
| postgres | 15 | 20 | 35 |
| rhq-server | 13 | 45 | 58 |
| jmx | 13 | 11 | 24 |
| jboss-cache-v3 | 12 | 18 | 30 |
| virt | 11 | 8 | 19 |
| tomcat | 9 | 12 | 21 |
| hardware | 9 | 6 | 15 |
| rhq-agent | 8 | 17 | 25 |
| sshd | 8 | 15 | 23 |
| iis | 6 | 66 | 72 |
| apache | 4 | 24 | 28 |
| samba | 4 | 13 | 17 |
| byteman | 4 | 0 | 4 |
| hibernate | 3 | 26 | 29 |
| jboss-cache | 3 | 19 | 22 |
| netservices | 3 | 2 | 5 |
| twitter | 3 | 0 | 3 |
| pattern-generator | 2 | 0 | 2 |
| irc | 2 | 0 | 2 |
| database | 2 | 0 | 2 |
| hudson | 2 | 0 | 2 |
| jira | 1 | 1 | 2 |
| onewire | 1 | 0 | 1 |
| snmptrapd | 1 | 0 | 1 |
| mod-cluster | 1 | 0 | 1 |
| script2 | 0 | 4 | 4 |
| aliases | 0 | 0 | 0 |
| grub | 0 | 0 | 0 |
| nss | 0 | 0 | 0 |
| kickstart | 0 | 0 | 0 |
| filetemplate-bundle | 0 | 0 | 0 |
| script | 0 | 0 | 0 |
| ant-bundle | 0 | 0 | 0 |
| database | 0 | 0 | 0 |
| cron | 0 | 0 | 0 |
| services | 0 | 0 | 0 |
| cobbler | 0 | 0 | 0 |
| noop | 0 | 0 | 0 |
| lsof | 0 | 0 | 0 |
| augeas | 0 | 0 | 0 |
| postfix | 0 | 0 | 0 |
| raw-config-test | 0 | 0 | 0 |
| iptables | 0 | 0 | 0 |
| jdbctrace | 0 | 0 | 0 |
| sudoers | 0 | 0 | 0 |
| hosts | 0 | 0 | 0 |

JBoss.org Content Archive (Read Only), exported from JBoss Community Documentation Editor at 2020-03-13 08:09:46 UTC, last content change 2013-12-05 23:31:13 UTC.