Cassandra Configuration and Tuning

Overview

This page provides information about configuring and tuning Cassandra. It is intended to serve as a guide on what sort of management operations we may want to perform in different situations. This document will hopefully be able to provide some answers to questions like,

What are indicators that a node has high memory utilization?
What things can be done to alleviate high memory utilization?
What can be done if a node has high disk space utilization?
When should more nodes be added to the cluster?

Memtable Memory Usage

A memtable is the in-memory representation of a column family or table. When a memtable gets too large, Cassandra flushes it to disk. There are several propertie in cassandra.yaml that affect how much overall memory that memtables use as well as how frequently they are flushed to disk.

Property	Description
flush_largest_memtables_at	A fraction of the total heap memory that if exceeded will force a flush of the largest memtable to disk. After the GC does a full concurrent mark sweep (CMS) collection, Cassandra checks this configuration value and triggers the flush when it is exceeded. Defaults to 1/3 of the JVM's max heap size if no value is specified in cassandra.yaml.
memtable_total_space_in_mb	The total memory for memtables to use. This value is used to calculate a threshold that when a memtable exceeds it, a flush to disk will be triggered.
memtable_flush_queue_size	The number of full memtables to allow pending flush.
memtable_flush_writers	The number of flush writer threads. These threads will block on disk I/O and each one holds a memtable in memory while it is blocked.

GCInspector is a period task that Cassandra runs, and as its name implies, it inspects various statistics produced by the GC. After a full collection, GCInspector looks at Cassandra's memory usage. If the overall heap usage exceeds flush_largest_memtables_at, an immediate flush of the largest memtable is triggered. A message like the following will be written to the log file,

WARN [ScheduledTasks:1] 2013-02-20 11:49:12,484 GCInspector.java (line 142) Heap is 0.8275743567425272 full. You may need to reduce memtable and/or cache sizes. Cassandra will now flush up to the two largest memtables to free up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically

If you see this warning message in the logs multiple times while the write load on a node is constant or increasing, then the node will likely hit an OutOfMemoryError soon thereafter.

What should be done when the GCInspector logs a warning that the heap is full?

The safest course of action is to,

Bring the node down.
Increase the heap size.
Restart the node.

JVM parameters like heap size are set in cassandra-env.sh. We should be able to support this workflow through the Cassandra plugin so that it can be automated without manual intervention. If more memory cannot be allocated to the node, then the next best thing to do would be to add another node to the cluster to further distribute and divide the existing load on the cluster.

During normal operation when will memtables get flushed to disk?

Aside from excessive heap usage memtables will regularly get flushed to disk. MeteredFlusher is a scheduled task that determines what if any memtables to flush. When it decides to flush a memtable, it writes a meesage to the server log that looks like,

INFO [OptionalTasks:1] 2013-02-20 11:48:03,162 MeteredFlusher.java (line 58) flushing high-traffic column family CFS(Keyspace='rhq', ColumnFamily='raw_metrics') (estimated 1594967 bytes)

It uses the memtable_total_space_in_mb, memtable_flush_queue_size, and memtable_flush_writers properties to determine whether or not a flush is necessary. The formula it uses looks like,

long size = cfs.getTotalMemtableLiveSize();
int maxInFlight = (int) Math.ceil((double) (1 // live memtable
                                            + 1 // potentially a flushed memtable being counted by jamm
                                            + DatabaseDescriptor.getFlushWriters()
                                            + DatabaseDescriptor.getFlushQueueSize())
                                        / (1 + cfs.indexManager.getIndexesBackedByCfs().size()));
if (size > (DatabaseDescriptor.getTotalMemtableSpaceInMB() * 1048576L - flushingBytes) / maxInFlight)

Suppose memtable_total_space_in_mb is set to 100 MB, memtable_flush_queue_size has the default of 4, memtable_flush_writers also has a default of 1, and the size variable in the code snippet works out to 14 MB. Assuming there are no secondary indexes on the column family, maxInFlight will be 7 and the big expression in the if block works out to roughly 12 MB. Since size is larger at 14 MB, the memtable will be flushed.

Roughly speaking, once the size of a memtable reaches memtable_total_space_in_mb / maxInFlight, then it will get flushed. Using the memtable_total_space_in_mb, memtable_flush_queue_size, and memtable_flush_writers properties allows us to control how large memtables get.

Compaction

Compaction is the process by which Cassandra periodically merges SSTables (i.e., data files on disk) into larger SSTables. When a memtable gets full, Cassandra flushes it out to disk as an SSTable. An SSTable is immutable. Once written, an SSTable is never update again, only read from. Over time the number of SSTables for a column family will grow. In addition to consuming more disk space, read performance can start to deteriorate as the number of SSTables increases. Since a row can be spread across several SSTables, Cassandra may have to read from multiple SSTables for a read. There are disk seeks involved when reading from each SSTable which will lead to higher latency. Compaction effectively reduces the number of SSTables which in turns decreases the disk space used as well as decreases read latency since there are fewer SSTables to access. Tombstones are also purged during compaction.

During compaction there will be a spike in disk usage as well as in I/O. Read performance can also degrade during compaction due to the increased I/O activity. Compaction itself should be fast as it is all sequential I/O.

Here is an example of what data files on disk might look like:

$ ls target/cassandra/node0/data/rhq/one_hour_metrics/
rhq-one_hour_metrics-ib-141-CompressionInfo.db rhq-one_hour_metrics-ib-142-Statistics.db
rhq-one_hour_metrics-ib-141-Data.db            rhq-one_hour_metrics-ib-142-Summary.db
rhq-one_hour_metrics-ib-141-Filter.db          rhq-one_hour_metrics-ib-142-TOC.txt
rhq-one_hour_metrics-ib-141-Index.db           rhq-one_hour_metrics-ib-143-CompressionInfo.db
rhq-one_hour_metrics-ib-141-Statistics.db      rhq-one_hour_metrics-ib-143-Data.db
rhq-one_hour_metrics-ib-141-Summary.db         rhq-one_hour_metrics-ib-143-Filter.db
rhq-one_hour_metrics-ib-141-TOC.txt            rhq-one_hour_metrics-ib-143-Index.db
rhq-one_hour_metrics-ib-142-CompressionInfo.db rhq-one_hour_metrics-ib-143-Statistics.db
rhq-one_hour_metrics-ib-142-Data.db            rhq-one_hour_metrics-ib-143-Summary.db
rhq-one_hour_metrics-ib-142-Filter.db          rhq-one_hour_metrics-ib-143-TOC.txt
rhq-one_hour_metrics-ib-142-Index.db

The SSTables are the files that end in Data.db. There are three such files.

rhq-one_hour_metrics-ib-141-Data.db
rhq-one_hour_metrics-ib-142-Data.db
rhq-one_hour_metrics-ib-143-Data.db

After compaction finishes, Cassandra will log statements like the following,

INFO [CompactionExecutor:16] 2013-03-11 14:38:15,911 CompactionTask.java (line 273) Compacted 4 sstables to [/Users/jsanda/Development/redhat/rhq/modules/enterprise/server/server-metrics/target/cassandra/node0/data/rhq/twenty_four_hour_metrics/rhq-twenty_four_hour_metrics-ib-145,]. 48,839,028 bytes to 45,533,132 (~93% of original) in 9,765ms = 4.446880MB/s. 100 total rows, 87 unique. Row merge counts were {1:76, 2:10, 3:0, 4:1, }

DEBUG [CompactionExecutor:16] 2013-03-11 14:38:15,911 CompactionTask.java (line 275) CF Total Bytes Compacted: 2,084,541,324

These statements provide a lot of details. The INFO statement tells us that it compacted the twenty_four_hour_metrics table. Prior to compaction the SSTables were roughly 46.58 MB on disk, and after compaction they were about 43.42 MB on disk, resulting in a 7% decrease in disk usage. It took 9.765 seconds for compaction to run for an overall throughput of 4.446880MB/s. The next line tells us that a total of 100 rows were compacted that included 87 keys. The last line about row merge counts may not be obvious at first glance. It gives us a summary of how many SSTables each row is spread across. 76 rows were only in 1 SSTable, 10 rows were spread across 2 SSTables, and 1 row was spread across 4 SSTables.

The DEBUG statement tells us the total number of bytes that have been compacted since Cassandra has started. In this case, 1.94 GB have been compacted so far.