This page summarizes results from various online, offline and face-to-face discussions regarding clustering of the microkernel. It tries to fix some terminology, sketches approaches which came up during the discussions, and lists some of the open questions.
Introduction
Clustering is the common term for partitioning and replicating data. Partitioning means splitting data up among different storage locations such that each data item exists at a single location only. Partitioning enhances throughput through load balancing. Replication means a data item is kept in multiple storage locations at the same time. Replication enhances reliability through redundancy.
The CAP theorem states that it is impossible for a distributed computer system to provide consistency, availability and partition tolerance at the same time. (Note there is a slight clash of terminology: in the context of the CAP theorem partition tolerance means tolerance to the network becoming partitioned.)
The clustering mechanism for the microkernel will trade consistency for availability in the face of network partition and for reduced latency under normal operation. The goal is to implement some form of eventual consistency.
Cluster topology
On a first level, a cluster splits the JCR tree into various partitions. On a second level, each partition may then be replicated to various cluster nodes.
Partitioning
A challenge for all partitioning schemes is enforcing atomicity of commits across different partitions. When atomicity is desired, some form of atomic commitment protocol (ACP) has to be in place. The two-phase commit protocol seems to be the standard here. Note however that there is no ACP which is completely tolerant of network failures (see [1], propositions 7.1 and 7.2). That is, every ACP may cause cluster nodes to become blocked in the face of network failures. Furthermore, no ACP can guarantee independent recovery of failed cluster nodes.
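For illustration, a minimal two-phase commit coordinator could be sketched as follows. The Participant interface and its methods are assumptions made for the sake of the example and do not correspond to any existing microkernel API:

    import java.util.List;

    // Hypothetical participant interface; prepare() returns the participant's vote.
    interface Participant {
        boolean prepare();   // vote: true = ready to commit
        void commit();
        void abort();
    }

    final class TwoPhaseCommitCoordinator {

        /** Returns true if the transaction was committed on all participants. */
        boolean execute(List<Participant> participants) {
            // Phase 1: collect votes. A single 'no' vote or failure aborts.
            boolean allPrepared = true;
            for (Participant p : participants) {
                try {
                    if (!p.prepare()) {
                        allPrepared = false;
                        break;
                    }
                } catch (RuntimeException e) {
                    allPrepared = false;
                    break;
                }
            }
            // Phase 2: propagate the decision. If the coordinator fails here,
            // participants that already voted 'yes' remain blocked; this is the
            // inherent limitation mentioned above.
            for (Participant p : participants) {
                if (allPrepared) {
                    p.commit();
                } else {
                    p.abort();
                }
            }
            return allPrepared;
        }
    }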
Move operations across partitions pose another challenge: with a naive implementation, a whole subtree would have to be physically moved from one partition to another.
One approach for partitioning the JCR tree is splitting it up by path: each cluster node contains the (JCR) nodes pertaining to a direct child node of the root node. In this scenario the root node resembles a set of mount points onto which the various cluster nodes are mounted. Ideally the root node itself is stateless. An open question with this approach is how to aggregate the individual revision histories of the partitions into a consistent revision history of the root.
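A rough sketch of how such path-based routing might look; PartitionMap and ClusterNode are illustrative names only, not part of any existing API:

    import java.util.Map;

    // Hypothetical routing table mapping top-level names to cluster nodes.
    final class PartitionMap {

        interface ClusterNode { /* remote access to one partition */ }

        private final Map<String, ClusterNode> mounts; // e.g. "content" -> cluster node A

        PartitionMap(Map<String, ClusterNode> mounts) {
            this.mounts = mounts;
        }

        /** Routes an absolute JCR path to the cluster node mounted at its first segment. */
        ClusterNode resolve(String absPath) {
            if (!absPath.startsWith("/")) {
                throw new IllegalArgumentException("not an absolute path: " + absPath);
            }
            String[] segments = absPath.substring(1).split("/", 2);
            ClusterNode node = mounts.get(segments[0]);
            if (node == null) {
                throw new IllegalArgumentException("no partition mounted at /" + segments[0]);
            }
            return node;
        }
    }

Because the routing table carries no content itself, the root stays stateless as desired.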
Another approach for partitioning the JCR tree is splitting by subtree: in this scheme each cluster node contains some subtree of the entire JCR tree. The leaf (JCR) nodes in each cluster node contain pointers to the cluster nodes which contain their actual child (JCR) nodes. One specific form of this scheme would dedicate a cluster node to the root (JCR) node and a couple of cluster nodes for the subtrees below. This scheme very much resembles the splitting-by-path scheme from above, except that now the root is not stateless. A disadvantage is that the root might become a bottleneck. However, since there is a dedicated cluster node for the root, it could be optimally tuned. On the other hand, this scheme does not have the problem of aggregating the revision histories since these are effectively stored in the root node. Furthermore, this scheme lends itself nicely to map-reduce approaches.
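The following sketch illustrates the data shape this scheme implies: a stored (JCR) node either holds its children locally or carries a pointer to the cluster node which holds the subtree below it. All names are illustrative assumptions:

    import java.util.Map;

    // Hypothetical record of a persisted JCR node within one partition.
    final class StoredNode {
        final Map<String, String> properties;

        // Children stored in the same partition, keyed by name...
        final Map<String, StoredNode> localChildren;

        // ...or, for leaf nodes of this partition, the id of the cluster
        // node that holds the subtree below this node (null otherwise).
        final String childPartitionId;

        StoredNode(Map<String, String> properties,
                   Map<String, StoredNode> localChildren,
                   String childPartitionId) {
            this.properties = properties;
            this.localChildren = localChildren;
            this.childPartitionId = childPartitionId;
        }

        boolean crossesPartition() {
            return childPartitionId != null;
        }
    }

Resolving a path would then walk localChildren segment by segment and, whenever crossesPartition() returns true, continue the walk on the cluster node identified by childPartitionId.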
Replication
Replicating the same data items to various cluster nodes poses consistency concerns. An eventually consistent system guarantees that, after a finite period of quiescence, all cluster nodes in a replication set are in the same state. A possible approach for transparently combining JCR with eventually consistent replication is to make cluster synchronization appear as normal session operations to clients. That is, when some items on a cluster node change due to a synchronization activity, the cluster node will generate JCR observation events for all sessions currently open on this cluster node.
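A possible shape of such an event dispatch, with simplified stand-ins for the JCR observation types (all class names here are assumptions for the sketch):

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    // Simplified stand-in for a JCR observation event.
    final class ChangeEvent {
        enum Type { NODE_ADDED, NODE_REMOVED, PROPERTY_CHANGED }
        final Type type;
        final String path;
        ChangeEvent(Type type, String path) { this.type = type; this.path = path; }
    }

    interface SessionListener {
        void onEvents(List<ChangeEvent> events);
    }

    // After applying a synchronization batch, the cluster node notifies all
    // open sessions exactly as it would for changes saved by a local session.
    final class SyncDispatcher {
        private final List<SessionListener> openSessions = new CopyOnWriteArrayList<>();

        void register(SessionListener listener) {
            openSessions.add(listener);
        }

        void applySync(List<ChangeEvent> changesFromOtherClusterNode) {
            // 1. persist the incoming changes locally (omitted)
            // 2. dispatch them as regular observation events
            for (SessionListener session : openSessions) {
                session.onEvents(changesFromOtherClusterNode);
            }
        }
    }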
An open question is how to best implement cluster node synchronization: how (topologically) should changes on one cluster node propagate to the other cluster nodes in order to ensure convergence, that is, in order not to arrive at an unstable system which propagates changes back and forth between cluster nodes. One approach, which resembles branching and merging, is proposed in [2]. However, the branch operation doesn't seem viable. Another approach, which might be more flexible, is to use vector clocks for establishing a happens-before relation on transactions from different cluster nodes.
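A minimal vector clock sketch, assuming every cluster node has a stable identifier; it only establishes the happens-before relation between transactions and does not by itself resolve conflicts:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Minimal vector clock: one counter per cluster node id.
    final class VectorClock {
        private final Map<String, Long> counters = new HashMap<>();

        /** Called when the local cluster node commits a transaction. */
        void tick(String localNodeId) {
            counters.merge(localNodeId, 1L, Long::sum);
        }

        /** Called when receiving a remote clock during synchronization. */
        void merge(VectorClock remote) {
            remote.counters.forEach((id, c) -> counters.merge(id, c, Math::max));
        }

        /** True if this clock strictly happened before the other. */
        boolean happenedBefore(VectorClock other) {
            boolean strictlySmaller = false;
            Set<String> ids = new HashSet<>(counters.keySet());
            ids.addAll(other.counters.keySet());
            for (String id : ids) {
                long mine = counters.getOrDefault(id, 0L);
                long theirs = other.counters.getOrDefault(id, 0L);
                if (mine > theirs) {
                    return false;
                }
                if (mine < theirs) {
                    strictlySmaller = true;
                }
            }
            return strictlySmaller;
        }

        /** Neither happened before the other: the transactions are concurrent. */
        boolean concurrentWith(VectorClock other) {
            return !happenedBefore(other) && !other.happenedBefore(this);
        }
    }

Transactions whose clocks are concurrent have to be merged or, failing that, rolled back on some cluster nodes, as discussed next.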
Another open question is how to synchronize conflicting transactions from different cluster nodes. Merging can be done to a certain degree. Other cases might require rolling back transactions on some cluster nodes. Consider for example the case where a node x is moved such that it becomes a child of node y on one cluster node and at the same time node y is moved such that it becomes a child of node x on another cluster node. The situation is inherently symmetric and neither move can follow the other. On cluster synchronization there seems to be no better way than to undo one (or both?) of the transactions which contained these move operations.
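The symmetric move conflict can be detected mechanically by checking whether, after applying both moves, each node would end up below the other. A small illustrative sketch (the Move record and method names are assumptions):

    // Illustrative check for the symmetric move conflict described above:
    // move(x under y) on one cluster node and move(y under x) on another
    // cannot both be applied because each node would become the other's ancestor.
    final class MoveConflict {

        record Move(String node, String newParent) {}

        /** True if the two moves cannot both be applied. */
        static boolean conflicting(Move a, Move b) {
            return isOrIsUnder(a.newParent(), b.node())
                && isOrIsUnder(b.newParent(), a.node());
        }

        /** True if path equals the given ancestor or lies below it. */
        static boolean isOrIsUnder(String path, String ancestor) {
            return path.equals(ancestor) || path.startsWith(ancestor + "/");
        }

        public static void main(String[] args) {
            Move onNodeA = new Move("/x", "/y");   // x becomes a child of y
            Move onNodeB = new Move("/y", "/x");   // y becomes a child of x
            System.out.println(conflicting(onNodeA, onNodeB)); // prints: true
        }
    }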
Distributed B+Tree
This approach was first mentioned on the dev list. The basic idea is to store the MK nodes in a B+Tree and use the paths of the nodes as keys. To support efficient enumeration of child nodes, the linearized MK tree is ordered according to a breadth-first search.
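A sketch of a key comparator producing such a breadth-first linearization: keys are ordered by depth first and lexicographically within each depth, so that all children of a node occupy a contiguous key range and can be enumerated with a single range scan. This is only an illustration of the ordering, not of the B+Tree itself:

    import java.util.Comparator;
    import java.util.TreeMap;

    // Orders absolute paths by depth first, then lexicographically. With this
    // order, the children of any node occupy a contiguous range of keys.
    final class BreadthFirstPathOrder implements Comparator<String> {

        @Override
        public int compare(String p1, String p2) {
            int byDepth = Integer.compare(depth(p1), depth(p2));
            return byDepth != 0 ? byDepth : p1.compareTo(p2);
        }

        private static int depth(String path) {
            return "/".equals(path) ? 0 : (int) path.chars().filter(c -> c == '/').count();
        }

        public static void main(String[] args) {
            TreeMap<String, String> tree = new TreeMap<>(new BreadthFirstPathOrder());
            tree.put("/", "root");
            tree.put("/a", "a");
            tree.put("/b", "b");
            tree.put("/a/x", "x");
            tree.put("/a/y", "y");
            tree.put("/b/z", "z");
            System.out.println(tree.keySet());
            // -> [/, /a, /b, /a/x, /a/y, /b/z]  (breadth-first order)
            // All children of /a lie in the contiguous key range ["/a/", "/a0/"):
            System.out.println(tree.subMap("/a/", "/a0/").keySet()); // [/a/x, /a/y]
        }
    }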
The B+Tree could be implemented as described in the paper 'A practical scalable distributed B-tree' [3]. That B-tree implementation is built on top of Sinfonia [4]. Sinfonia is a distributed, block-oriented storage mechanism which supports simple transactions. Sinfonia works with optimistic concurrency control but still uses locks for a short amount of time to commit changes. Transactions that fail because of a conflict are simply retried until they succeed.
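A sketch of the resulting retry pattern; the MiniTransaction and TransactionFactory interfaces are assumptions standing in for Sinfonia-style minitransactions, not Sinfonia's actual API:

    import java.util.function.Consumer;

    // Hypothetical stand-in for a Sinfonia-style minitransaction.
    interface MiniTransaction {
        byte[] read(long blockId);
        void compare(long blockId, byte[] expected); // commit fails if the block changed
        void write(long blockId, byte[] data);
        boolean tryCommit(); // false on conflict
    }

    interface TransactionFactory {
        MiniTransaction begin();
    }

    final class OptimisticRetry {

        /** Runs the given transaction body until it commits without a conflict. */
        static void runWithRetry(TransactionFactory store, Consumer<MiniTransaction> body) {
            while (true) {
                MiniTransaction tx = store.begin();
                body.accept(tx);       // read, add compare items, stage writes
                if (tx.tryCommit()) {  // locks are held only briefly inside commit
                    return;
                }
                // conflict detected at commit time: simply retry with fresh reads
            }
        }
    }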
Presentations
References
[1] Concurrency Control and Recovery in Database Systems. Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman. 1987
[2] Eventually Consistent Transactions. Sebastian Burckhardt, Manuel Fahndrich, Daan Leijen, Mooly Sagiv. 2011
[3] A practical scalable distributed B-tree. Marcos K. Aguilera, Wojciech Golab, Mehul A. Shah. 2008
[4] Sinfonia: a new paradigm for building scalable distributed systems. Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch, Christos Karamanolis. 2007