
RHQ 4.9

Scaling - On the Definitions of View Consistency

Background

After some conversation and further thought, it seems that this area of the product reveals a fundamental design approach that, while workable for small deployments with tens of nodes, cannot scale to thousands of nodes. Product management has stated a requirement to support deployments in the low thousands of nodes. Much of the code likewise rests on an assumption that does not hold for large deployments.

The fundamental assumption behind the approaches to date is that consistency, with respect to an observer whose frame of reference is what is rendered in the JON console, is achieved only when configuration state has been consistently applied to all managed nodes (agents). Views of state reflect this assumption, and as such updates to configuration state are strictly sequential: a subsequent update cannot proceed until the prior update completes (strictly serialized and blocking). To reduce user-perceived latency, a commonly employed programming pattern schedules update activities as asynchronous jobs (e.g., via Quartz [6]) to offload all communication with managed nodes. However, since that communication still occurs sequentially within the job, the time until the job completes, and hence the earliest time at which the user may schedule another update, is directly proportional to the number of managed nodes.
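
Below is a minimal sketch of this pattern, assuming Quartz [6] as the job scheduler and a hypothetical AgentEndpoint interface standing in for RHQ's actual agent remoting API (all names are illustrative, not taken from the codebase):

    import java.util.List;

    import org.quartz.Job;
    import org.quartz.JobExecutionContext;
    import org.quartz.JobExecutionException;

    // Hypothetical agent abstraction; RHQ's real remoting API differs.
    interface AgentEndpoint {
        void applyConfiguration(String configXml); // blocking remote call
    }

    // One background job, but agents are still contacted one at a time:
    // total run time, and hence the earliest point at which the user may
    // schedule another update, grows linearly with the number of agents.
    public class SequentialConfigUpdateJob implements Job {
        @Override
        public void execute(JobExecutionContext ctx) throws JobExecutionException {
            @SuppressWarnings("unchecked")
            List<AgentEndpoint> agents =
                (List<AgentEndpoint>) ctx.getMergedJobDataMap().get("agents");
            String configXml = ctx.getMergedJobDataMap().getString("config");
            for (AgentEndpoint agent : agents) {
                agent.applyConfiguration(configXml); // strictly serialized
            }
        }
    }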

Alternatives

Short-term alternatives that could better leverage compute resources have been considered, such as creating one asynchronous job for every managed node (agent), thereby transforming the sequential job into a parallel one. Upon further consideration, however, in the case associated with this defect the same issue arises: the user is effectively blocked from making any further changes to the auto-group until all agents have successfully applied the updated configuration. The time interval certainly decreases, but for large deployments this will likely remain an issue. Another alternative is needed.
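
For comparison, here is a sketch of the per-agent fan-out, reusing the hypothetical AgentEndpoint interface from the sketch above. The final join is precisely where the caller, and any interface elements gated on it, remain blocked until every agent has acknowledged:

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ParallelConfigUpdate {
        // Fan out one task per agent instead of a single sequential loop.
        public static void updateAll(List<AgentEndpoint> agents, String configXml) {
            ExecutorService pool = Executors.newFixedThreadPool(32);
            try {
                CompletableFuture<?>[] updates = agents.stream()
                    .map(a -> CompletableFuture.runAsync(
                            () -> a.applyConfiguration(configXml), pool))
                    .toArray(CompletableFuture[]::new);
                // Wall-clock time shrinks, but the caller still blocks
                // until *all* agents have acknowledged; this join is the
                // residual scaling problem described above.
                CompletableFuture.allOf(updates).join();
            } finally {
                pool.shutdown();
            }
        }
    }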

Problem Statement

There are two fundamental design flaws here:

  • sequential code rather than inherently parallelized code

    • sequential code does not leverage the compute power of the cluster; refer to the literature on list processing in functional and distributed programming languages, especially Erlang

  • interface elements locked out awaiting completion events

Defining View Consistency

The academic literature on distributed systems and consistency indicates that perfect consistency in distributed architectures is infeasible (cf. the CAP theorem [5]); while perfect consistency may be achieved in smaller deployments of ten or fewer nodes, and then only optimistically, it cannot be achieved in larger deployments where network partitions [2] and other transient errors are commonplace. Therefore, it is imperative to define view consistency in terms applicable to distributed systems rather than in terms of the traditional two-node client-server model.

An approach that would scale well is one that defines view consistency [3] from the user's intent as the frame of reference, rather than from enacted/actualized state. This section investigates this idea (a code sketch follows the list), where:

  • views reflect deletions prior to any activity spanning multiple systems (write to the view first, then act)

    • perhaps through the use of offload tables, or queues

    • possibly background deletion, read-repair, or other...

  • views reflect insertions after all associated activities have completed (act first, then write to the view)

    • perhaps through the use of upload tables, or queues

    • generally this approach works for short-lived activities, but trade-offs need to be made when dealing with potentially long-running activities
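
A minimal sketch of both orderings, assuming an in-memory view plus work queues standing in for the offload/upload tables mentioned above (all names are hypothetical, not RHQ's):

    import java.util.Map;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.LinkedBlockingQueue;

    public class ViewConsistencySketch {
        // Stand-in for the database-backed view.
        private final Map<String, String> view = new ConcurrentHashMap<>();
        // Stand-ins for the offload/upload tables or queues.
        private final BlockingQueue<String> offloadQueue = new LinkedBlockingQueue<>();
        private final BlockingQueue<String> uploadQueue = new LinkedBlockingQueue<>();

        // Deletion: write to the view first, then act in the background.
        public void delete(String key) {
            view.remove(key);      // user intent is visible immediately
            offloadQueue.add(key); // a background worker acts on it later
        }

        // Insertion: act first; the view reflects the insertion only once
        // the associated activity has completed.
        public void requestInsert(String key, String value) {
            uploadQueue.add(key + "=" + value); // worker performs the activity
        }

        // Invoked by the background worker after the activity succeeds.
        void insertCompleted(String key, String value) {
            view.put(key, value);
        }
    }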

Defining the semantics of view consistency for loosely coupled distributed systems, interconnected using synchronous protocols, in the presence of Byzantine faults [1] is a first priority.

Ideally, insertions should occur synchronously for the client with respect to a predetermined level of redundancy (see Amazon Dynamo and its N/R/W quorum model, an eventually consistent design [4]). In our case, we should write to the database and queue a series of update jobs synchronously, then return control flow. By contrast, in systems that coordinate updates and insertions across multiple resource managers, insertions are reflected in the view, the database, last; for deletions, the view (database) is updated first, and only afterwards is the deletion reflected in the remaining resource managers.
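
For reference, in a Dynamo-style quorum a write returns once W of N replicas acknowledge, and a read consults R replicas; choosing R + W > N guarantees that every read quorum overlaps every write quorum. A toy illustration of the constraint (not RHQ code):

    // Dynamo-style quorum parameters: N replicas, W write acks, R read acks.
    // When R + W > N, any read quorum shares at least one replica with any
    // write quorum, so a synchronous insert is visible to subsequent reads.
    public final class QuorumConfig {
        final int n, r, w;

        QuorumConfig(int n, int r, int w) {
            if (r + w <= n) {
                throw new IllegalArgumentException(
                    "R + W must exceed N for overlapping read/write quorums");
            }
            this.n = n;
            this.r = r;
            this.w = w;
        }

        public static void main(String[] args) {
            QuorumConfig cfg = new QuorumConfig(3, 2, 2); // classic N=3, R=W=2
            System.out.printf("N=%d R=%d W=%d overlap>=%d%n",
                cfg.n, cfg.r, cfg.w, cfg.r + cfg.w - cfg.n);
        }
    }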

This suggests that deletions are asynchronous in nature, while insertions are synchronous, each with respect to state transfer. For the insertion case, once state has been transferred to the server, control flow may return status codes or state synchronously. With regard to the user-client, all calls are synchronous; internally, activities such as those related to deletions may occur in the background, but the view MUST be consistent with user intent at the time control flow is returned.

Some examples will make this more concrete. They are provided only as examples, may not accurately reflect JON today, and are intended simply to make the point. Consider group resource configuration updates: updating the group resource configuration causes an update to each of the agents. Rather than updating all the agents synchronously, the "insertion" occurs in the database synchronously, while all updates to the agents happen asynchronously and in parallel. The user's intent is immediately reflected in the view, as represented in the database; the user may push additional updates, each of which synchronously updates the view by the time control flow is returned, without having to wait for previous updates to be enacted. Another example, deployments, could proceed in the same manner. Deployments are interesting in that deletions are possible: the deletion request would occur synchronously with respect to the caller, and the (database) view would reflect it synchronously, but the undeployment activity would occur asynchronously with respect to the caller.
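
A sketch of that group-update flow, once more reusing the hypothetical AgentEndpoint interface. Unlike the parallel alternative sketched earlier, control returns as soon as the view is written; there is no join on the agent updates:

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class GroupConfigUpdateSketch {
        // Stand-in for the database-backed view of group configurations.
        private final Map<String, String> view = new ConcurrentHashMap<>();
        private final ExecutorService pool = Executors.newFixedThreadPool(32);

        // Returns as soon as user intent is recorded in the view; agent
        // updates proceed in parallel in the background, so the user may
        // push another update without waiting for enactment.
        public void updateGroupConfig(String groupId, String configXml,
                                      List<AgentEndpoint> agents) {
            view.put(groupId, configXml); // synchronous "insertion" in the view
            for (AgentEndpoint agent : agents) {
                CompletableFuture.runAsync(
                    () -> agent.applyConfiguration(configXml), pool); // fire and forget
            }
        }
    }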

The topic raises many questions with regard to monitored resources, which may be in different states at any point in time.

References

[1] Byzantine Fault Tolerance
[2] Network Partitions
[3] View Consistency
[4] Eventual Consistency
[5] CAP Theorem
[6] Quartz
