Storage Sizing Analysis

Overview

The adoption of Cassandra as a data store is a major change in architecture. One of the questions that is frequently asked by both new and existing users is, "how much disk space will I need?" While it is unlikely that we can answer that question with great precision, we do want to provide some baselines to give users something with which to work for planning purposes. This document will provide an overview of the files on disk with which we are concerned, how we can calculate the sizes of those files, and what kind of tooling we can provide for users to crunch the numbers. Cassandra stores data in two places - the commit log and the data directories.

Note that much of the discussion that follows talks about the data format as stored on disk as opposed to the CQL format. CQL transposes data such that a table contains many rows with each row having a fixed set of columns. On disk however, several CQL rows will be stored as multiple columns in a single, variable length row (assuming those CQL rows all share the same primary key).

CommitLog

See the architecture wiki doc[5] for background. The document is a bit dated, but all of the information is still relevant. The CommitLog segment file size, which is specified by the commitlog_segment_size_in_mb property in cassandra.yaml, is set to the default value of 32 MB. And the commitlog_total_space_in_mb property defaults to 4 GB. If the commit_log directory larger than this, then Cassandra will flush every table in the oldest segment and remove it.

SSTables

The in-memory representation of a table (or column family to be precise) is a Memtable. When Cassandra flushes a Memtable, it is written to disk as an immutable SSTable. Once written to disk, an SSTable is never updated. An SSTable is comprised of six components.

Data
Index
Bloom filter
Partition index
Partition summary
CompressionInfo
Statistics

Each of these components has its own corresponding file as shown in the following example,

$ ls data/rhq/raw_metrics/
rhq-raw_metrics-ic-914-CompressionInfo.db 
rhq-raw_metrics-ic-914-Index.db           
rhq-raw_metrics-ic-914-TOC.txt
rhq-raw_metrics-ic-914-Data.db            
rhq-raw_metrics-ic-914-Statistics.db
rhq-raw_metrics-ic-914-Filter.db          
rhq-raw_metrics-ic-914-Summary.db

The following sections provide a brief summary of each of the components.

Data

Stores the actual rows and columns.

Bloom Filter

An index of data structures to determine if a partition key is not present in the SSTable. Used to reduce disk seeks during reads.

Partition Index

Maps partition keys to their offsets in the data file. Each index entry contains information including row size, column count, and column indexes (if any).

Partition Summary

A sampling of the partition index that includes the index boundaries.

CompressionInfo

Compression is currently disabled so this component file does not get generated. See bug 1015628 for details about re-enabling compression.

Statistics

Stores meta data such as estimated row size and estimated column count. These estimates are provided as histograms.

Approach

One of the biggest challenges with this type of analysis is that there are several things that are too difficult to determine. We cannot calculate the number of commit log segments on disk at any given point. The number of segments is influenced by a number of factors including memtable flushes, garbage collection, and node restarts.

We cannot determine the number of SSTables on disk. A new SSTable is generated when a memtable is flushed, but there are a lot of variables that can influence when a flush occurs including overall heap size, read/write load, garbage collection, and data size. Remember that SSTables contain overlapping data for a table. For instance, suppose we have 4 SSTables on disk - S1, S2, S3, and S4. S1 and S2 might have overlapping fragments for a given set of rows. S3 might contain an entirely new set of columns for those row, and S4 could contain a subset of the rows/columns in S3. This data fragmentation and duplication results in higher disk utilization. Only with the combination of all four SSTables can we be guaranteed to have the full set of data.

Metric data is not stored indefinitely. Data retention rules are enforced in order to avoid unbounded data growth. Data retention is implemented using Cassandra's expiring columns feature. The TTL (time to live) meta data field of a column is specified in seconds. Once the duration specified by the TTL has surpassed, Cassandra marks the column expired, and any space it consumes on disk is eventually reclaimed. Figuring out how much space is consumed by expired data would be very complicated at best.

Now let's talk some things we can ascertain with a higher degree of certainty. Like other databases, Cassandra supports a number of data types, some of which have a fixed size while others have a variable size. Integers for example are always four bytes, and strings can have a variable length. With the help of good analysis[2] and a thorough source code review, we can with 100% precision calculate the sizes of low level data structures on disk including rows, columns, bloom filters, and partition index entries. We need to establish some invariants in order to produce meaningful, accurate calculations.

Assume only a single SSTable
Only consider non-expired data
Use a fixed size (in terms of numbers of rows and columns) data set
- For metric data, this can be determined from the number of measurement schedules and their collection intervals