
RHQ 4.9

Design - Data Replication and Consistency

Current Design

Replication is configured by the replication_factor (RF), which is set at the keyspace level. RHQ uses two keyspaces - system_auth and rhq. Remember that Cassandra uses consistent hashing and forms a token ring. In earlier versions of Cassandra (prior to 1.2), each node was assigned a single token and therefore owned a single partition on the ring. Virtual nodes were introduced in 1.2 and allow a node to be assigned multiple tokens and therefore own multiple partitions. RHQ uses virtual nodes in large part because they greatly simplify the work of reassigning tokens and rebalancing the cluster after nodes are added or removed. As the cluster size changes, the RF of each keyspace is changed according to the following table:

Number of nodes | system_auth RF | rhq RF
----------------|----------------|-------
1               | 1              | 1
2               | 2              | 2
3               | 2              | 2
>= 4            | 3              | 3
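
To make the table concrete, the sketch below shows how the RF adjustments could be issued through the DataStax Java driver as ALTER KEYSPACE statements. It is only an illustration of the keyspace-level setting, not RHQ's actual cluster-maintenance code; the contact point, the class name, and the use of SimpleStrategy are assumptions.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class ReplicationFactorUpdater {

    // RF for both keyspaces as a function of cluster size, per the table above.
    static int rfForClusterSize(int numNodes) {
        if (numNodes == 1) {
            return 1;
        }
        if (numNodes <= 3) {
            return 2;
        }
        return 3;
    }

    public static void main(String[] args) {
        int numNodes = Integer.parseInt(args[0]);
        int rf = rfForClusterSize(numNodes);

        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();
        try {
            // SimpleStrategy is assumed here purely for illustration; the point is
            // that the RF is a keyspace-level setting changed with ALTER KEYSPACE.
            for (String keyspace : new String[] {"system_auth", "rhq"}) {
                session.execute("ALTER KEYSPACE " + keyspace + " WITH replication = " +
                    "{'class': 'SimpleStrategy', 'replication_factor': " + rf + "}");
            }
        } finally {
            cluster.shutdown();
        }
    }
}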

The first thing to note is that the RF of neither system_auth nor rhq is user-configurable. This is intentional for a couple of reasons. First, understanding the ramifications of changing the RF requires a certain level of understanding of Cassandra, and the management of Cassandra in RHQ is predicated on users not having an in-depth (or, for that matter, any) understanding of Cassandra. Secondly, the RF impacts application code. Not only does it impact the performance of both reads and writes, but it also impacts the consistency guarantees of client read/write operations.

We currently use consistency level (CL) ONE for all read/write operations, because it is the default consistency level provided by the CQL driver and changes in this area have not been a priority to date. This means that for multiple nodes, we cannot guarantee strong consistency for reads, which is in stark contrast to the relational database. It is not as bad as it sounds, however. In the absence of failure, replicas (nodes that share the same token(s)) should be consistent. Furthermore, when a client query is executed, Cassandra does read repair in the background to ensure that all replicas have the latest known values. RHQ also runs anti-entropy repair as part of regularly scheduled storage cluster maintenance, which ensures that replicas are consistent.
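
To illustrate, the sketch below shows the driver-level behavior: a statement executed without any explicit setting runs at the driver default of CL ONE, and the CL can be raised per statement. The table name raw_metrics and the schedule_id column are hypothetical placeholders, not a description of RHQ's schema.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ConsistencyLevelExample {

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("rhq");

        // A plain string query is executed at the driver's default CL, which is ONE.
        session.execute("SELECT * FROM raw_metrics WHERE schedule_id = 14837");

        // The CL can be raised per statement. QUORUM requires a majority of the
        // replicas for the partition to respond; combined with QUORUM writes it
        // yields strong consistency.
        SimpleStatement query = new SimpleStatement(
            "SELECT * FROM raw_metrics WHERE schedule_id = 14837");
        query.setConsistencyLevel(ConsistencyLevel.QUORUM);
        session.execute(query);

        cluster.shutdown();
    }
}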

It should be noted that the CL with which writes are done has no impact on alerting.

The following sections discuss the implications of each RF.

One Node

With a single node there is no data replication, and all reads guarantee strong consistency. In other words, a query will return the latest value(s) for the columns/rows being queried.

Two Nodes

All writes go to both nodes. Inconsistent reads are possible, though, since we use CL ONE. All reads and writes can still be served with one node down. Another thing to consider is the footprint on disk: going from one to two nodes will not decrease the size of data on disk, since each node holds a full replica.

Three Nodes

The RF remains 2, which means metric data for a given schedule ID will be stored on two of the three nodes (since the schedule ID is the partition key). Inconsistent reads are possible since both reads and writes are done at CL ONE. The footprint on disk of each node will be reduced since each node no longer owns all partition keys.

If one of the nodes goes down, we can continue performing both reads and writes without any problem since one of the replicas is still up.

If two nodes are down, then both reads and writes will fail, but not necessarily all of them. Even though there is still a live, healthy node, if both of the replicas for a partition key that we are reading or writing are down, then we will get an UnavailableException on the client side. Authentication and authorization checks are performed on each request at CL ONE. Remember that the RF of the system_auth keyspace is 2; consequently, a write request, for example, could fail during the authorization check without the write ever being attempted. Each node stores permissions in a local cache, and a query for the user is performed on a cache miss. That query can fail if both replicas are down. Unfortunately, the UnavailableException gets wrapped. In the Cassandra log we will see,

ERROR [Native-Transport-Requests:551] 2013-09-09 16:30:01,338 ErrorMessage.java (line 210) Unexpected exception during request
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level ONE
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2258)
	at com.google.common.cache.LocalCache.get(LocalCache.java:3990)
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3994)
	at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4878)
	at org.apache.cassandra.service.ClientState.authorize(ClientState.java:290)
	at org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:170)
	at org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:163)
	at org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:147)
	at org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess(ModificationStatement.java:67)
	at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:100)
	at org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:223)
	at org.apache.cassandra.transport.messages.ExecuteMessage.execute(ExecuteMessage.java:121)
	at org.apache.cassandra.transport.Message$Dispatcher.messageReceived(Message.java:287)
	at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
	at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
	at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
	at org.jboss.netty.handler.execution.ChannelUpstreamEventRunnable.doRun(ChannelUpstreamEventRunnable.java:43)
	at org.jboss.netty.handler.execution.ChannelEventRunnable.run(ChannelEventRunnable.java:67)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:724)
Caused by: java.lang.RuntimeException: org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level ONE
	at org.apache.cassandra.auth.Auth.selectUser(Auth.java:247)
	at org.apache.cassandra.auth.Auth.isSuperuser(Auth.java:84)
	at org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:50)
	at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:68)
	at org.apache.cassandra.service.ClientState$1.load(ClientState.java:276)
	at org.apache.cassandra.service.ClientState$1.load(ClientState.java:273)
	at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3589)
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2374)
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2337)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2252)
	... 20 more
Caused by: org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level ONE
	at org.apache.cassandra.db.ConsistencyLevel.assureSufficientLiveNodes(ConsistencyLevel.java:250)
	at org.apache.cassandra.service.ReadCallback.assureSufficientLiveNodes(ReadCallback.java:161)
	at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:870)
	at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:805)
	at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:133)
	at org.apache.cassandra.auth.Auth.selectUser(Auth.java:236)
	... 29 more

but on the client side (in the RHQ server log) we will see something like,

16:35:01,821 ERROR [org.rhq.server.metrics.MetricsServer] (New I/O worker #24) An error occurred while
inserting raw data MeasurementDataNumeric[name=metric004, value=95067.0, scheduleId=14837, timestamp=1378758886941]:
com.datastax.driver.core.exceptions.DriverInternalError: An unexpected error occured server side:
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException:
org.apache.cassandra.exceptions.UnavailableException: Cannot achieve consistency level ONE
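
For completeness, the sketch below shows how client code could distinguish these failure modes. The exception classes are part of the DataStax driver, but the table, columns, and handling shown here are hypothetical and only illustrate the behavior described above.

import com.datastax.driver.core.Session;
import com.datastax.driver.core.exceptions.DriverInternalError;
import com.datastax.driver.core.exceptions.NoHostAvailableException;
import com.datastax.driver.core.exceptions.UnavailableException;

public class WriteFailureHandling {

    // Attempt an insert and report why it failed when replicas are down.
    static void insertRawMetric(Session session, int scheduleId, long time, double value) {
        try {
            session.execute("INSERT INTO raw_metrics (schedule_id, time, value) VALUES (" +
                scheduleId + ", " + time + ", " + value + ")");
        } catch (UnavailableException e) {
            // The coordinator knows that too few replicas of the rhq keyspace are
            // alive to satisfy CL ONE for this partition key.
            System.err.println("Not enough live replicas: " + e.getMessage());
        } catch (NoHostAvailableException e) {
            // No storage node could be contacted at all.
            System.err.println("No storage nodes reachable: " + e.getMessage());
        } catch (DriverInternalError e) {
            // As shown in the logs above, an UnavailableException raised during the
            // server-side authorization check surfaces as this wrapped error.
            System.err.println("Unexpected server-side error (possibly system_auth replicas down): "
                + e.getMessage());
        }
    }
}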

Four Nodes

The RF increases to 3 for both the system_auth and rhq keyspaces. Metric data for a given schedule ID will be stored on three nodes. The footprint on disk for each node will not decrease as we are increasing the number of replicas. Inconsistent reads are possible since both reads and writes are done at CL ONE.

We can perform all reads and writes with two nodes down. With three nodes down though, reads and writes will start to fail as previously described.

Five or More Nodes

The RF stays at 3 for both the system_auth and rhq keyspaces. The footprint on disk for each node will decrease. Inconsistent reads are possible since both reads and writes are done at CL ONE.

We can perform all reads and writes with two nodes down. With three nodes down, client requests may start failing.

Proposed Design

For storing metric data, availability takes priority over consistency. It is also important to sustain a high level of write throughput as the volume of metric data grows. It therefore makes sense to continue writing metrics at CL ONE.

As users are accustomed to strong consistency with the legacy implementation, it may be desirable to provide strong consistency for user-facing queries. The CQL driver allows clients to specify a retry policy that determines what to do when a request results in a TimeoutException or an UnavailableException. The former is thrown when the coordinator or one of the replicas takes too long to respond. The latter is thrown when not enough replicas are available to satisfy the specified CL. We can use a retry policy that favors both consistency and availability: if a request cannot be completed at the specified CL, the driver will automatically retry at a lower CL in an effort to still return results to the client. This offers the best of both worlds.
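
The DataStax driver ships with DowngradingConsistencyRetryPolicy, which implements this behavior. The sketch below shows one way it could be wired up; the contact point, table name, and schedule ID are illustrative assumptions, and wrapping the policy in LoggingRetryPolicy is optional.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.policies.DowngradingConsistencyRetryPolicy;
import com.datastax.driver.core.policies.LoggingRetryPolicy;

public class ConsistentReadWithFallback {

    public static void main(String[] args) {
        // Register the downgrading retry policy when building the cluster. If a request
        // cannot be completed at the requested CL, the driver retries it at a lower CL
        // instead of failing, trading consistency for availability.
        Cluster cluster = Cluster.builder()
            .addContactPoint("127.0.0.1")
            .withRetryPolicy(new LoggingRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE))
            .build();
        Session session = cluster.connect("rhq");

        // A user-facing query issued at QUORUM for stronger consistency; metric writes
        // can continue to use the default CL ONE to preserve write throughput.
        SimpleStatement query = new SimpleStatement(
            "SELECT * FROM one_hour_metrics WHERE schedule_id = 14837");
        query.setConsistencyLevel(ConsistencyLevel.QUORUM);
        session.execute(query);

        cluster.shutdown();
    }
}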

An analysis is needed to ensure that these changes do not have adverse effects on overall performance. Lastly, it might be nice to make the consistency level configurable for users who are not concerned about strong consistency; in that case we would simply perform all reads and writes at CL ONE as we do today.

JBoss.org Content Archive (Read Only), exported from JBoss Community Documentation Editor at 2020-03-13 08:17:17 UTC, last content change 2013-09-18 19:41:54 UTC.