Federation - ModeShape 5

Usually a ModeShape repository owns all of its own data. It stores all of the information about all nodes within a persistent store and the repository has a single binary store used to persist all BINARY values.

Sometimes it would be nice if a ModeShape repository could include some data that is actually owned and managed by an external system. Clients could access internal data (owned by ModeShape) and external data (owned by an external system) in exactly the same way, using the JCR API. ModeShape might cache this external data (for performance reasons), but it would never store any of this external data.

The ability to access external and internal data in exactly the same way as if it were stored in one place is what we call federation. This page describes how ModeShape federation works, the important concepts used in federation, and how your applications can use federation.

Concepts and terminology

The following diagram shows a conventional repository that owns all of its data, including all nodes, properties, and binary values. This is usually what people think of when they think of a database.

[Figure: Conventional repository]

In repositories like this, all of the nodes are treated the same way, since they're all owned by ModeShape. However, if federation is enabled on the repository, then other kinds of nodes appear and other concepts become important:

[Figure: Federated repository]

The next sections talk about these concepts.

External system and sources

An external system is a system outside of ModeShape that owns its own data and that ModeShape interacts with to access (and optionally update) that data. The external system might be a data store, or it might be a service that dynamically produces data. Examples of external systems are Oracle 11i, Cassandra, MongoDB, Git, SVN, SAP, file systems, CMIS, RPM repositories, and JCR repositories.

Whereas an external system is a kind of software system, we use the term external source to describe an addressable instance or installation of the external system. For example, external sources might include a particular database instance, a particular Git repository, a particular file system on a specific machine, or a particular instance of a CMIS repository.

In the diagram above, two external sources are shown and labeled "External Source A" and "External Source B". (The diagram does not define what kind of system they are.)

Connectors

A ModeShape connector is the software used to interact with a specific kind of external system. A connector consists of compiled Java classes and resources, and is usually packaged as a JAR with dependencies on third-party libraries. ModeShape defines a connector SPI (Service Provider Interface) which the connector must implement. Generally connectors can read and update data in the external system, although a connector implementation may support only read operations.

To be useful, however, a connector must be instantiated and that instance configured to talk to a specific external source. Then that connector instance's job is to create a single, virtual tree of nodes that represents the data in the external source. Note that the connector doesn't create the entire tree up front; instead, the connector creates the nodes in that virtual tree only when ModeShape asks for them. Thus, the potential tree of nodes for a given source might be massive, but only the nodes being used will be materialized.

The diagram of the federated repository shown above includes two connector instances, each of which is configured to talk to one of the external sources.

Internal nodes

An internal node is any node within a ModeShape repository that is owned by ModeShape and stored within a persistent store. In a regular repository (without federation), all nodes are internal nodes.

Internal nodes are shown in the federated repository diagram above as the rust-colored and purple nodes.

External nodes

An external node is any node within a federated ModeShape repository that is not owned by ModeShape but instead is dynamically generated to represent some portion of data in an external source. ModeShape clients view internal and external nodes in exactly the same way, but internally ModeShape handles internal and external nodes in very different ways.

External nodes are shown in the federated repository diagram above as the green, yellow, and blue colored nodes.

Federated nodes

A federated node is simply an internal node that contains some children that are external nodes. In other words, only federated nodes can have internal nodes and external nodes as children, whereas internal nodes can only have other internal nodes as children and external nodes can only have other external nodes as children.

Federated nodes are shown in the federated repository diagram above as the purple nodes.

Projection

A projection is a portion of the repository (really a subgraph) whose nodes are all external nodes that are representations of some of the data in an external source. The nodes are dynamically generated (by the connector's logic) as needed, and can optionally be cached for a configurable amount of time.

The federated repository diagram above shows three projections, labeled "Projection 1", "Projection 2", and "Projection 3". Strictly speaking, projections do not have a name, so the labels are merely for discussion purposes. Note how projections 1 and 2 both project external nodes from "External Source A", whereas projection 3 only projects the external nodes from "External Source B". We often will talk about an external source as having one or more projections; thus "External Source A" has two projections ("Projection 1" and "Projection 2"), while "External Source B" has only one projection ("Projection 3").

Each projection maps a specific subtree of the virtual tree (created by a connector talking to an external source) underneath a specific federated node. A simple path is used to identify the subtree of external nodes, and a simple path is used to identify the federated node. ModeShape uses a projection expression that is a string with these two paths:

  <workspace-name>':' <path-to-federated-node> '=>' <path-in-external-source-of-node>

where

<workspace-name> is the name of the ModeShape workspace where the projection is to be placed
<path-to-federated-node> is a regular absolute path to the existing internal node under which the external nodes are to appear
<path-in-external-source-of-node> is a regular absolute path in the virtual tree created by the connector; it identifies the node whose children are to appear as children of the federated node.
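
The three parts of a projection expression can be pulled apart with plain string handling. The following is a minimal sketch, not ModeShape's own parser: the class ProjectionExpression and its fields are hypothetical names introduced for illustration, and the sketch assumes that the workspace name contains no colon, so the first colon in the expression is the separator.

```java
/** Hypothetical helper (not part of ModeShape) that splits a projection expression. */
final class ProjectionExpression {
    final String workspaceName;
    final String federatedNodePath;
    final String externalPath;

    private ProjectionExpression(String workspaceName, String federatedNodePath, String externalPath) {
        this.workspaceName = workspaceName;
        this.federatedNodePath = federatedNodePath;
        this.externalPath = externalPath;
    }

    /** Parses "<workspace-name>:<path-to-federated-node> => <path-in-external-source-of-node>". */
    static ProjectionExpression parse(String expression) {
        // Assumption: the first ':' separates the workspace name from the paths,
        // so namespaced path segments such as jcr:content are left untouched.
        int colon = expression.indexOf(':');
        int arrow = expression.indexOf("=>", colon + 1);
        if (colon < 0 || arrow < 0) {
            throw new IllegalArgumentException("Not a projection expression: " + expression);
        }
        String workspace = expression.substring(0, colon).trim();
        String federated = expression.substring(colon + 1, arrow).trim();
        String external = expression.substring(arrow + 2).trim();
        // Both paths are described as absolute, so each must start with '/'.
        if (workspace.isEmpty() || !federated.startsWith("/") || !external.startsWith("/")) {
            throw new IllegalArgumentException("Not a projection expression: " + expression);
        }
        return new ProjectionExpression(workspace, federated, external);
    }

    public static void main(String[] args) {
        ProjectionExpression p = parse("default:/files => /docs");
        // prints: default | /files | /docs
        System.out.println(p.workspaceName + " | " + p.federatedNodePath + " | " + p.externalPath);
    }
}
```

In this example the children of /docs in the connector's virtual tree would appear under the internal node /files in the "default" workspace.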

Projections can be defined within a repository's configuration (making them available immediately upon startup of the repository) or programmatically added or removed by client applications using the FederationManager interface.

FederationManager

The ModeShape public API includes the org.modeshape.jcr.api.federation.FederationManager interface that defines several methods for programmatically creating and removing projections. Note that at this time it is not possible to programmatically create, modify, or remove external sources, so these must be defined within the repository configuration.

How it works

Federation is intended to be completely transparent to clients. There is no apparent difference between internal nodes, federated nodes, and external nodes. Some operations might not be permitted on external nodes (e.g., if the connector is read-only), but that's also true of internal nodes (though the reason why the operation is not permitted may be different).

Navigation

As clients navigate the nodes in the repository, they typically ask for one (or multiple) children of a particular node. Clients repeat this process until they access the node(s) they're looking for.

ModeShape performs these operations differently depending upon the kind of node:

If the parent is an internal node, then all children will also be internal. Therefore, to find a particular child by name, ModeShape uses the parent's child reference to obtain the child's node key, and then looks up the node with that key in the persistent store. This is the "conventional" behavior, and it incurs no overhead even when the repository is configured to use federation.
If the parent is a federated node, then the process is very similar to that for internal nodes, except that the internal and external child references are managed separately. ModeShape looks at the child's node key to determine (from the key itself) whether the child exists in the local persistent store or in an external source. If it is in an external source, ModeShape calls the connector to ask for the representation of the requested node.
If the parent is an external node, then ModeShape obtains the parent's child reference and looks up the node with that key in the same connector. The connector then generates a representation of the requested node.
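
The three cases above all come down to inspecting the node key and dispatching either to the persistent store or to a connector. The following toy model sketches that dispatch. Every name in it (NodeResolver, the "local" marker, the "<source>#<id>" key layout) is an assumption made for illustration; real ModeShape node keys use a different, opaque format, and real connectors implement the connector SPI rather than a simple function.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Toy model of key-based dispatch; every name here is hypothetical, not ModeShape API. */
final class NodeResolver {
    // Assumed key layout for this sketch only: "<source>#<id>",
    // where the source "local" stands for the persistent store.
    static final String LOCAL = "local";

    private final Map<String, String> persistentStore = new HashMap<>();
    private final Map<String, Function<String, String>> connectors = new HashMap<>();

    void storeInternal(String id, String document) {
        persistentStore.put(id, document);
    }

    void registerConnector(String sourceName, Function<String, String> connector) {
        connectors.put(sourceName, connector);
    }

    /** Decides from the key itself where the node lives, then fetches or generates it. */
    String resolve(String nodeKey) {
        int sep = nodeKey.indexOf('#');
        if (sep < 0) {
            throw new IllegalArgumentException("Malformed key: " + nodeKey);
        }
        String source = nodeKey.substring(0, sep);
        String id = nodeKey.substring(sep + 1);
        if (LOCAL.equals(source)) {
            return persistentStore.get(id);   // internal node: plain lookup
        }
        Function<String, String> connector = connectors.get(source);
        if (connector == null) {
            throw new IllegalStateException("No connector for source: " + source);
        }
        return connector.apply(id);           // external node: generated on demand
    }

    public static void main(String[] args) {
        NodeResolver resolver = new NodeResolver();
        resolver.storeInternal("42", "internal node 42");
        resolver.registerConnector("sourceA", id -> "external node " + id);
        System.out.println(resolver.resolve("local#42"));
        System.out.println(resolver.resolve("sourceA#/docs/readme.txt"));
    }
}
```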

Lookup by identifier

All nodes (both internal and external) can be accessed by Session.getNodeByIdentifier(String), where the identifier is the same string returned by calling the getIdentifier() method on the node. ModeShape can tell from the identifier whether it is for an external node, and if so it will look up the node in the connector.

Per the JCR specification, clients should treat these identifiers as opaque. In fact, ModeShape identifiers follow a fairly complex pattern that will likely be difficult to reverse engineer, and which may change at any time.

Caching

ModeShape actually uses an in-memory LRU cache of the nodes. So although the navigation and lookup steps mentioned above don't discuss using the LRU cache, ModeShape always consults this cache when it needs to find a node with a particular node key. If found in the cache, the node will simply be used. If the cache does not contain the node, then it will consult the persistent store or the connector to obtain (and cache) the node.

Normally, nodes in the LRU cache are evicted after a certain time. However, external nodes can carry an additional internal property that specifies whether they should be cached in the LRU cache at all.

Of course, a node is also evicted from the cache if the node has been changed and persisted (e.g., via Session.save() or user transaction commit), even if that change was made on a different process in the cluster.
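
The read-through behavior described in this section can be sketched with a small LRU map. This is only an illustration under stated assumptions: NodeCache and its loader function are hypothetical names, the sketch evicts by capacity rather than by time, and it says nothing about how ModeShape's own cache is implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical read-through LRU cache keyed by node key (not ModeShape's implementation). */
final class NodeCache {
    private final Map<String, String> lru;
    private final Function<String, String> loader; // stands in for the persistent store or a connector
    int loads = 0;                                 // counts cache misses, for demonstration only

    NodeCache(final int capacity, Function<String, String> loader) {
        this.loader = loader;
        // accessOrder=true makes iteration order least-recently-used first.
        this.lru = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Consults the cache first and falls back to the loader on a miss. */
    String get(String nodeKey) {
        String node = lru.get(nodeKey);
        if (node == null) {
            node = loader.apply(nodeKey);
            loads++;
            lru.put(nodeKey, node);
        }
        return node;
    }

    /** Drops a node, for example after a change to it has been saved. */
    void evict(String nodeKey) {
        lru.remove(nodeKey);
    }

    public static void main(String[] args) {
        NodeCache cache = new NodeCache(2, key -> "node:" + key);
        cache.get("a");
        cache.get("b");
        cache.get("a");   // hit: "a" becomes the most recently used entry
        cache.get("c");   // miss: "b" is now the eldest entry and is evicted
        System.out.println("loads so far: " + cache.loads);
    }
}
```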

Querying

A connector decides which external nodes are to be indexed.

The connector instance can be configured with a "queryable" boolean parameter that states whether any of the content is to be queryable. This defaults to true.
The connector can mark any or all nodes as not queryable.

Thus, even though a connector implementation may be written such that some or all of the external nodes can be queried, a repository configuration can configure an instance of that connector and override the behavior so that no nodes are queryable.

If a connector is implemented by marking all nodes as not queryable, then configuring an instance of that connector with queryable=true has no effect.
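
Put together, the two rules above behave like a logical AND: a node can be indexed only if the connector instance is configured as queryable and the connector has not marked that node as not queryable. The sketch below prints the resulting combinations. The class and method names are hypothetical and exist only to illustrate the rule.

```java
/** Hypothetical illustration of how the two queryable settings combine. */
final class Queryability {
    /**
     * @param instanceQueryable   the "queryable" parameter of the connector instance
     * @param nodeMarkedQueryable false if the connector marked this node as not queryable
     */
    static boolean isIndexable(boolean instanceQueryable, boolean nodeMarkedQueryable) {
        return instanceQueryable && nodeMarkedQueryable;
    }

    public static void main(String[] args) {
        boolean[] flags = { true, false };
        for (boolean instance : flags) {
            for (boolean node : flags) {
                System.out.println("instance queryable=" + instance
                        + ", node queryable=" + node
                        + " -> indexable=" + isIndexable(instance, node));
            }
        }
    }
}
```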

Any nodes that are queryable will be included in the index, as long as ModeShape is notified of new nodes. By default, external nodes are not automatically indexed, so to index them simply use ModeShape's API for reindexing.

Once indexed, the nodes can be queried just like any other nodes.

Paging connectors

A connector works by creating a node representation of external data, and that node contains the references to the node's children. These references are relatively small (just the ID and name of the child), and for many connectors this is sufficient and fast enough. However, when the number of children under a node starts to increase, building the list of child references for a parent node can become noticeable and even burdensome, especially when few (if any) of the child references may ultimately be resolved into nodes because no client actually uses those references.

A pageable connector is one that wants to expose the children of nodes in a "page by page" fashion, where the parent node only contains the first page of child references and subsequent pages are loaded only if needed. This turns out to be quite effective, since when clients navigate a specific path (or ask for a specific child of a parent by its name) ModeShape doesn't need to use the child references in a node's document and can instead simply have the connector resolve such (relative or absolute external) paths into an identifier and then ask for the document with that ID.

Therefore, the only time the child references are needed is when clients iterate over the children of a node. A pageable connector will only be asked for as many pages as needed to handle the client's iteration, making it very efficient for exposing a node structure that can contain nodes with numerous children.
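
The on-demand loading of pages can be sketched as a lazy iterator. This is a simplified model rather than the connector SPI: PagedChildren, PageSource, and fixedPages are hypothetical names, and an empty page is assumed to signal that there are no more children.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical model of page-by-page child references (not the ModeShape connector SPI). */
final class PagedChildren implements Iterable<String> {
    /** Supplies one page of child names; an empty list means there are no more pages. */
    interface PageSource {
        List<String> page(int index);
    }

    private final PageSource source;
    int pagesRequested = 0; // counts page loads, for demonstration only

    PagedChildren(PageSource source) {
        this.source = source;
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int nextPage = 0;
            private boolean exhausted = false;
            // The first page is loaded up front, like the page held by the parent node.
            private Iterator<String> current = fetch();

            private Iterator<String> fetch() {
                pagesRequested++;
                List<String> page = source.page(nextPage++);
                if (page.isEmpty()) {
                    exhausted = true;
                }
                return page.iterator();
            }

            @Override
            public boolean hasNext() {
                if (current.hasNext()) return true;
                if (exhausted) return false;
                current = fetch(); // load the next page only when iteration reaches it
                return current.hasNext();
            }

            @Override
            public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                return current.next();
            }
        };
    }

    /** Builds a source exposing children "child0".."child(total-1)" in pages of pageSize. */
    static PageSource fixedPages(final int total, final int pageSize) {
        return index -> {
            List<String> page = new ArrayList<>();
            int from = index * pageSize;
            for (int i = from; i < Math.min(total, from + pageSize); i++) {
                page.add("child" + i);
            }
            return page;
        };
    }

    public static void main(String[] args) {
        PagedChildren children = new PagedChildren(fixedPages(10, 3));
        Iterator<String> it = children.iterator();
        for (int i = 0; i < 4; i++) {
            System.out.println(it.next());
        }
        // Only the first two pages were needed to read four children.
        System.out.println("pages requested: " + children.pagesRequested);
    }
}
```

A client that stops iterating early never causes the later pages to be requested, which is the saving the text describes.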

Current connectors

See this section for a list of the currently supported connectors.