Large numbers of child nodes

In general, storing a very large number of child nodes directly under one parent is not advisable for performance reasons. Hierarchies should be designed to be deeper rather than wider. See http://modeshape.wordpress.com/2014/08/14/improving-performance-with-large-numbers-of-child-nodes for tips on improving the design of a node hierarchy.

ModeShape has been designed to handle a single node having a large number of child nodes as efficiently as possible when it comes to reading and writing. However, as the number of children stored under a single parent keeps growing, latency and memory consumption degrade in proportion to the number of stored children. To help in cases where the node hierarchy cannot be changed, ModeShape offers a couple of features:

Segmenting child nodes into blocks

ModeShape has an optional feature that allows segmenting the parent's list of child references into multiple blocks, where each block is small enough to deal with. This optimization is performed in the background rather than during the Session's save() operation. As a consequence, the actual number of child references stored in any block might vary significantly from the "optimal" value. And while ModeShape is capable of transparently handling blocks of any size, performance when dealing with very large numbers of child nodes will improve when the block sizes are optimized.
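
As a rough illustration of the idea, the following Java sketch splits a list of child references into blocks of a target size. It is not ModeShape's actual optimizer: the class BlockSketch and the method segment are hypothetical names, and the sketch ignores any tolerance around the target size as well as the background scheduling.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the ModeShape API.
final class BlockSketch {

    // Splits childRefs into consecutive blocks holding at most targetSize entries each.
    static <T> List<List<T>> segment(List<T> childRefs, int targetSize) {
        if (targetSize < 1) {
            throw new IllegalArgumentException("targetSize must be at least 1");
        }
        List<List<T>> blocks = new ArrayList<>();
        for (int start = 0; start < childRefs.size(); start += targetSize) {
            int end = Math.min(start + targetSize, childRefs.size());
            // Copy the slice so that each block is independent of the source list.
            blocks.add(new ArrayList<>(childRefs.subList(start, end)));
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<Integer> refs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            refs.add(i);
        }
        List<List<Integer>> blocks = segment(refs, 1000);
        System.out.println(blocks.size());        // 3
        System.out.println(blocks.get(2).size()); // 500
    }
}
```

With a target of 1000, a parent with 2500 child references ends up with two full blocks and one partial block.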

Configuration

The segmenting function is not enabled by default for a repository, meaning that ModeShape keeps all of a parent's child references together in a single block. To enable it, add the following to the repository configuration:

JSON

"storage" : {
   "documentOptimization" : {
      "childCountTarget" : 1000,
      "childCountTolerance" : 10,
      "threadPool" : "modeshape-opt",
      "initialTime" : "00:00",
      "intervalInHours" : 24
   }
}

or in JBoss AS:

Wildfly Configuration

            
   <repository name="sample" 
      document-optimization-child-count-target="1000" 
      document-optimization-child-count-tolerance="10"
      document-optimization-initial-time="02:00"
      document-optimization-interval="24"
      document-optimization-thread-pool="modeshape-opt"
/>

where the first 2 attributes control the desired number of children per segment and the variance tolerance, while the last 3 control the details of the thread-pool that spawns the actual threads performing the optimization process.

ModeShape actually performs really well while using a single block for storing child references, even for moderate numbers of children (~10K).

Accessing by path

Navigating to a node by using its path is perhaps one of the most common access patterns in JCR. This uses the 'Node.getNode(String)' method that takes a relative path, and essentially boils down to finding a particular child node with the supplied name and same-name-sibling (SNS) index. ModeShape internally indexes the children in each block by name, so finding nodes by name (and SNS index) is as fast as possible, even if multiple blocks need to be accessed.

Iterating

Another common access pattern is to iterate over some or all of a parent node's children, using the 'Node.getNodes()' and 'Node.getNodes(String)' methods. The resulting NodeIterator will transparently access the children in one block at a time, and will continue with all blocks until the last child reference is found or until the caller halts the iteration.
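
The block-at-a-time behavior can be pictured with the following Java sketch. It is a simplified stand-in, not ModeShape's NodeIterator implementation: LazyBlockIterator is a hypothetical class, and each block is modeled as a Supplier that is only invoked when the iteration actually reaches that block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

// Hypothetical illustration only; not part of the ModeShape API.
final class LazyBlockIterator<T> implements Iterator<T> {
    private final Iterator<Supplier<List<T>>> blockLoaders;
    private Iterator<T> current = Collections.emptyIterator();
    private int blocksLoaded = 0;

    LazyBlockIterator(List<Supplier<List<T>>> loaders) {
        this.blockLoaders = loaders.iterator();
    }

    @Override
    public boolean hasNext() {
        // Load the next block only once the current one is exhausted.
        while (!current.hasNext() && blockLoaders.hasNext()) {
            current = blockLoaders.next().get().iterator();
            blocksLoaded++;
        }
        return current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }

    // Number of blocks that have been loaded so far.
    int blocksLoaded() {
        return blocksLoaded;
    }

    public static void main(String[] args) {
        List<Supplier<List<String>>> loaders = new ArrayList<>();
        loaders.add(() -> Arrays.asList("a", "b"));
        loaders.add(() -> Arrays.asList("c"));
        LazyBlockIterator<String> it = new LazyBlockIterator<>(loaders);
        it.next();
        it.next();
        System.out.println(it.blocksLoaded()); // 1: the second block has not been touched yet
    }
}
```

A caller that stops after the first few children never pays for the remaining blocks.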

Accessing by identifier

Another common access pattern is to find a node by identifier, using the 'Session.getNodeByIdentifier(String)' method. ModeShape handles this request by directly finding the node by its identifier, and needs to access the parent's (or ancestors') child references only when the node's name or path is requested by the caller (via the 'Node.getName()' or 'Node.getPath()' methods).

Unordered large collections

Unordered collections are 4 special mixin types that allow users to "annotate" their nodes, marking them as large, unordered collections. These special node types tell ModeShape that it can optimize how it internally stores and finds the children of those nodes. When a parent node is marked with one of these mixins, ModeShape stores references to the child nodes in special groups called buckets. Each bucket has a unique identifier under a given parent, made up of the first N characters of the SHA1 of the name of each child node it holds. This makes it very easy and fast to find children by name.

For example, if N is 2, then the bucket names range from 00 to FF, resulting in 16^2 = 256 buckets. In this case, when adding a new child to a parent node, ModeShape computes the SHA1(child_name) and uses the first 2 characters from the SHA1 result to determine in which bucket to place that child. Similarly, if N is 3, the buckets range from 000 to FFF, resulting in 16^3 = 4096 buckets.
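
The bucket computation described above can be sketched in plain Java. This is only an approximation of the idea: BucketSketch, bucketFor and bucketCount are hypothetical names, and the UTF-8 encoding of the child name and the uppercase hex form (chosen to match the 00 to FF notation used here) are assumptions, not documented ModeShape behavior.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only; not part of the ModeShape API.
final class BucketSketch {

    // Returns the first n hex characters of SHA-1(childName).
    // Assumptions: the name is encoded as UTF-8 and the hex digits are uppercase.
    static String bucketFor(String childName, int n) {
        if (n < 1 || n > 40) {
            throw new IllegalArgumentException("n must be between 1 and 40");
        }
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(childName.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02X", b));
            }
            return hex.substring(0, n);
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Number of distinct buckets for a prefix of n hex characters: 16^n.
    static long bucketCount(int n) {
        if (n < 1 || n > 15) {
            throw new IllegalArgumentException("n must be between 1 and 15");
        }
        return 1L << (4 * n);
    }

    public static void main(String[] args) {
        System.out.println(bucketFor("abc", 2)); // A9 (SHA-1 of "abc" starts with a9993e36)
        System.out.println(bucketCount(2));      // 256
        System.out.println(bucketCount(3));      // 4096
    }
}
```

Because the bucket depends only on the child's name, the same computation serves both for placing a new child and for finding an existing one.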

ModeShape does all this under the covers and exposes none of it to client applications, which simply use the normal JCR API. The main advantage of this approach (compared to the others) is lookup performance for any JCR operation where the path of a node is known (for example session.getNode(path)). Internally, ModeShape can immediately determine the bucket in which the child is stored and access it quickly. In contrast, when performing the same operation on a regular node (one that is not an unordered collection), all the child references of the parent node have to be walked in order to find a reference with a matching name.

This approach does offer great performance benefits, but also has some limitations (see below) in terms of JCR operations not being supported.

Usage

The 4 predefined mixins are:

Mixin name	Number of possible buckets
`mode:unorderedTinyCollection`	16
`mode:unorderedSmallCollection`	256
`mode:unorderedLargeCollection`	4096
`mode:unorderedHugeCollection`	65536

These mixins don't define any additional JCR properties or child nodes. You can choose only one, and once there are children you cannot add or remove these mixins.

How do you know which mixin to use? The biggest factor will be the upper limit on the average number of child nodes stored in each bucket, which you can estimate by dividing the total number of expected child nodes by the number of buckets. Don't make this average too small, though, since that will increase the overhead for storing all the child references. Think carefully, because the only way to change the number of buckets is to manually copy or move the nodes to a new parent. We recommend prototyping your choice so you can measure performance with your own data and client application.
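
The estimate is simple arithmetic, and a small Java sketch can make it concrete. The bucket counts come from the table of predefined mixins; everything else is an assumption for illustration: MixinChooser is a hypothetical helper, and the maxAverage threshold is a value you would have to pick from your own measurements, not a ModeShape recommendation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the ModeShape API.
final class MixinChooser {

    // Bucket counts of the 4 predefined mixins, from smallest to largest.
    static final Map<String, Integer> BUCKETS = new LinkedHashMap<>();
    static {
        BUCKETS.put("mode:unorderedTinyCollection", 16);
        BUCKETS.put("mode:unorderedSmallCollection", 256);
        BUCKETS.put("mode:unorderedLargeCollection", 4096);
        BUCKETS.put("mode:unorderedHugeCollection", 65536);
    }

    // Estimated average number of children per bucket.
    static double averagePerBucket(long expectedChildren, int buckets) {
        return (double) expectedChildren / buckets;
    }

    // Returns the mixin with the fewest buckets whose estimated average stays at or
    // below maxAverage (an assumed, application-specific threshold). Falls back to
    // the largest mixin when none qualifies.
    static String smallestMixinFor(long expectedChildren, double maxAverage) {
        for (Map.Entry<String, Integer> entry : BUCKETS.entrySet()) {
            if (averagePerBucket(expectedChildren, entry.getValue()) <= maxAverage) {
                return entry.getKey();
            }
        }
        return "mode:unorderedHugeCollection";
    }

    public static void main(String[] args) {
        // 1,000,000 expected children spread over 4096 buckets.
        System.out.println(averagePerBucket(1_000_000L, 4096));     // about 244.14
        System.out.println(smallestMixinFor(1_000_000L, 1000.0));   // mode:unorderedLargeCollection
    }
}
```

Picking the smallest mixin that meets the threshold follows the advice above not to make the per-bucket average too small.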

The mixins can be used either via the node type definitions:

[test:smallCollection] > mode:unorderedSmallCollection
  - * (undefined) multiple 
  - * (undefined) 
  + * (nt:base) = nt:unstructured

or via regular JCR API like:

node.addMixin("mode:unorderedLargeCollection");

In this last case, such mixins can only be added & removed at runtime if the node on which they are added or removed does not have any children. Also, multiple unordered mixins cannot be added on the same node. Only the first one counts.

Limitations

These special mixins, although they provide really fast access by name or path, come at a cost in the form of various JCR operations & features that are not supported:

JCR feature/operation	Description
same name siblings	not supported
ordering	not supported
renaming	not supported
versioning	not supported. However, children of unordered collections can be versioned normally, like regular nodes
federating external sources	not supported
copy	partially supported. An unordered collection cannot be copied, but other nodes can be copied into such a collection
clone	partially supported. An unordered collection cannot be cloned, but other nodes can be cloned into such a collection

All other JCR operations & features are supported just like for regular nodes.

Accessing by path

Navigating to a node using its path for these types of collections offers the best performance improvement (in terms of latency and memory consumption) out of all the JCR operations and is far superior to the same operations performed on a regular node.

Iterating

Iterating has better memory consumption than for regular nodes, but equal or possibly worse latency, because buckets are lazily loaded into memory. That means that the more children are iterated (i.e. the longer the iteration takes), the higher the chance that extra buckets need to be loaded into memory.

Accessing by identifier

Using the 'Session.getNodeByIdentifier(String)' method offers the same performance as for regular nodes, since ModeShape will do the lookup directly, without using the child references in any way. However, operations like 'Node.getName()' or 'Node.getPath()' might perform worse, because in those cases ModeShape has to find the name of a node starting from its ID, which means looking in, and possibly loading, each bucket until a reference is found.

Adding nodes

Adding nodes also performs better, especially when compared to regular nodes that maintain the order of their children or that allow children with the same name.
