
Many repositories are used (at least in part) to manage files and other artifacts, including service definitions, policy files, images, media, documents, presentations, application components, reusable libraries, configuration files, application installations, database schemas, management scripts, and so on. Unlocking the information buried within all of those files is what ModeShape sequencing is all about. As files are loaded into the repository, your ModeShape instance can automatically sequence these files to extract meaningful information from their content. That information is stored in the repository, where it can then be searched, accessed, and analyzed using the JCR API.

Sequencers

Sequencers are just POJOs that implement a specific interface, and their job is to process a stream of data (supplied by ModeShape) to extract meaningful content that usually takes the form of a structured graph. Exactly what content is extracted is up to each sequencer implementation. For example, ModeShape comes with an image sequencer that extracts simple metadata from different kinds of image files (e.g., JPEG, GIF, and PNG). Another example is the Compact Node Definition (CND) sequencer, which processes CND files to extract and produce a structured representation of the node type definitions, property definitions, and child node definitions contained within the file.

Sequencers are configured to identify the kinds of nodes that the sequencers can work against. When content in the repository changes, ModeShape looks to see which (if any) sequencers might be able to run on the changed content. If any sequencer configurations do match, those sequencers are run against the content, and the structured graph output of the sequencers is then written back into the repository (at a location dictated by the sequencer configuration). And once that information is in the repository, it can be easily found and accessed via the standard JCR API.

In other words, ModeShape uses sequencers to help you extract more meaning from the artifacts you already are managing, and makes it much easier for applications to find and use all that valuable information. All without your applications doing anything extra.

Stream Sequencers

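The StreamSequencer interface defines the single method that must be implemented by a sequencer. That method is handed the InputStream containing the content to be sequenced, a SequencerOutput for writing the results, and a StreamSequencerContext.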

A new instance is created for each sequencing operation, so there is no need for the class to be synchronized or thread-safe. Additionally, when a sequencer configuration includes properties (see configuring a sequencer), ModeShape will set those properties on the StreamSequencer implementation using JavaBean-style setter methods. This makes it easy to define sequencer-specific properties in a sequencer configuration, and just as easy to support them in the implementation: simply add the corresponding setter methods.
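The general technique of applying configuration properties through setters can be sketched with plain Java reflection. This is only an illustration of the idea, not ModeShape's actual code: the ThumbnailSequencer class, its maxWidth property, and the SetterInjector helper are all hypothetical names invented for this example.

```java
import java.lang.reflect.Method;
import java.util.Map;

// Hypothetical sequencer class with one configurable property (not part of ModeShape).
class ThumbnailSequencer {
    private int maxWidth = 100;

    public int getMaxWidth() { return maxWidth; }

    // JavaBean-style setter that a configuration property named "maxWidth" would map to.
    public void setMaxWidth(int maxWidth) { this.maxWidth = maxWidth; }
}

// Illustrative helper (an assumption, not a ModeShape class) that maps property
// names to public single-argument setters and invokes them reflectively.
class SetterInjector {
    static void apply(Object bean, Map<String, ?> properties) {
        for (Map.Entry<String, ?> entry : properties.entrySet()) {
            String name = entry.getKey();
            // "maxWidth" -> "setMaxWidth"
            String setterName = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
            for (Method method : bean.getClass().getMethods()) {
                if (method.getName().equals(setterName) && method.getParameterCount() == 1) {
                    try {
                        method.invoke(bean, entry.getValue());
                    } catch (ReflectiveOperationException e) {
                        throw new IllegalStateException("Cannot set property " + name, e);
                    }
                }
            }
        }
    }
}
```

With this sketch, calling SetterInjector.apply(sequencer, properties) with a map containing "maxWidth" would invoke setMaxWidth on the instance before it is used.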

Implementations are responsible for processing the content in the supplied InputStream and generating structured content using the supplied SequencerOutput interface. The StreamSequencerContext provides additional details about the information that is being sequenced, including the location and properties of the node being sequenced, the MIME type of the node being sequenced, and a Problems object where the sequencer can record problems that aren't severe enough to warrant throwing an exception. The StreamSequencerContext also provides access to the ValueFactories that can be used to create Path, Name, and any other value objects.

The SequencerOutput interface is fairly easy to use, and its job is to hide from the sequencer all the specifics about where the output is being written. Therefore, the interface has only a few methods for implementations to call. Two methods set the property values on a node, while the other sets references to other nodes in the repository. Use these methods to describe the properties of the nodes you want to create, using relative paths for the nodes and valid JCR property names for properties and references. ModeShape will ensure that nodes are created or updated whenever they're needed.

ModeShape will create nodes of type nt:unstructured unless you specify the value for the jcr:primaryType property. You can also specify the values for the jcr:mixinTypes property if you want to add mixins to any node.
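To make the flow concrete, here is a minimal, self-contained sketch of a sequencer writing properties to an output using relative node paths. The SimpleOutput interface and RecordingOutput class are hypothetical stand-ins that only loosely mirror the role of SequencerOutput; they are not the ModeShape API, and the text:summary node and its property names are invented for illustration. The stand-in also mimics the nt:unstructured default described above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the output interface: sets property values on a node by relative path.
interface SimpleOutput {
    void setProperty(String nodePath, String propertyName, Object... values);
}

// In-memory output that records what a sequencer wrote, keyed by relative node path.
class RecordingOutput implements SimpleOutput {
    final Map<String, Map<String, Object[]>> nodes = new LinkedHashMap<>();

    @Override
    public void setProperty(String nodePath, String propertyName, Object... values) {
        // Nodes are created on demand the first time a property is set on them.
        nodes.computeIfAbsent(nodePath, k -> new LinkedHashMap<>()).put(propertyName, values);
    }

    // Returns the explicit jcr:primaryType, or nt:unstructured when none was set.
    String primaryTypeOf(String nodePath) {
        Map<String, Object[]> props = nodes.get(nodePath);
        if (props == null) return null; // no such node was written
        Object[] type = props.get("jcr:primaryType");
        return (type == null || type.length == 0) ? "nt:unstructured" : String.valueOf(type[0]);
    }
}

// Hypothetical sequencer: counts newline characters and writes a small output graph.
class LineCountSequencer {
    public void sequence(InputStream stream, SimpleOutput output) {
        long lines = 0;
        try {
            int b;
            while ((b = stream.read()) != -1) {
                if (b == '\n') lines++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Only relative paths are used; where the graph ends up is decided elsewhere.
        output.setProperty("text:summary", "jcr:primaryType", "text:summary");
        output.setProperty("text:summary", "text:lineCount", lines);
        // No primary type is given for this child, so it falls back to the default.
        output.setProperty("text:summary/details", "text:note", "counted newline characters");
    }
}
```

Because the sequencer only talks to the output interface, the same code can be exercised against an in-memory recorder like this one without a repository.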

Path Expressions

Each sequencer must be configured to describe the areas or types of content that the sequencer is capable of handling. This is done by specifying these patterns using path expressions that identify the nodes (or node patterns) that should be sequenced and where to store the output generated by the sequencer. We'll see how to fully configure a sequencer in the next chapter, but before then let's dive into path expressions in more detail.

A path expression consists of two parts: the selection criteria (an input path) and an output path.

The inputPath part defines an expression for the path of a node that is to be sequenced. Input paths consist of '/' separated segments, where each segment represents a pattern for a single node's name (including the same-name-sibling indexes) and '@' signifies a property name.
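As a rough sketch of the two-part structure, the following code splits a path expression into its input and output paths. It assumes the two parts are written as "inputPath => outputPath" with "=>" as the separator; the PathExpression class is a hypothetical helper for illustration and is not the class ModeShape uses internally.

```java
// Hypothetical holder for the two parts of a path expression.
class PathExpression {
    final String inputPath;
    final String outputPath;

    PathExpression(String inputPath, String outputPath) {
        this.inputPath = inputPath;
        this.outputPath = outputPath;
    }

    // Splits on the first "=>" (assumed separator) and trims surrounding whitespace.
    static PathExpression parse(String expression) {
        int idx = expression.indexOf("=>");
        if (idx < 0) {
            throw new IllegalArgumentException("Missing '=>' in: " + expression);
        }
        String input = expression.substring(0, idx).trim();
        String output = expression.substring(idx + 2).trim();
        if (input.isEmpty() || output.isEmpty()) {
            throw new IllegalArgumentException("Both an input and an output path are required: " + expression);
        }
        return new PathExpression(input, output);
    }
}
```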

Let's first look at some simple examples:

Input Path     Description
/a/b           Match node "b" that is a child of the top level node "a". Neither node may have any same-name-siblings.
/a/*           Match any child node of the top level node "a".
/a/*.txt       Match any child node of the top level node "a" that also has a name ending in ".txt".
/a/b@c         Match the property "c" of node "/a/b".
/a/b[2]        The second child named "b" below the top level node "a".
/a/b[2,3,4]    The second, third or fourth child named "b" below the top level node "a".
/a/b[*]        Any (and every) child named "b" below the top level node "a".
//a/b          Any node named "b" that exists below a node named "a", regardless of where node "a" occurs. Again, neither node may have any same-name-siblings.

Simple Input Path Examples

With these simple examples, you can probably discern the most important rules. First, the '*' is a wildcard character that matches any character or sequence of characters in a node's name (or index if appearing in between square brackets), and can be used in conjunction with other characters (e.g., "*.txt").

Second, square brackets (i.e., '[]') are used to match a node's same-name-sibling index. You can put a single non-negative number or a comma-separated list of non-negative numbers between the brackets. Use '0' to match a node that has no same-name-siblings, or any positive number to match the specific same-name-sibling.

Third, combining two delimiters (e.g., "//") matches any sequence of nodes, regardless of what their names are or how many nodes there are. This is often used with other patterns to identify nodes at any level that match those patterns. Three or more sequential slash characters are treated as two.
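The first and third rules can be approximated by translating an input path into a java.util.regex pattern. The sketch below is a deliberately simplified illustration and not ModeShape's matcher: the InputPathGlob class is a hypothetical name, and the translation ignores same-name-sibling indexes, properties ('@'), and parentheses.

```java
import java.util.regex.Pattern;

// Hypothetical, simplified translation of '*' and '//' in an input path into a regular expression.
class InputPathGlob {
    static Pattern compile(String inputPath) {
        // Three or more sequential slashes are treated as two.
        String path = inputPath.replaceAll("/{3,}", "//");
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c == '/' && i + 1 < path.length() && path.charAt(i + 1) == '/') {
                // "//" matches any sequence of intermediate nodes (possibly none).
                regex.append("(?:/[^/]+)*/");
                i++; // skip the second slash
            } else if (c == '*') {
                // '*' matches any characters within a single node name.
                regex.append("[^/]*");
            } else {
                // Everything else is matched literally.
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }
}
```

For example, InputPathGlob.compile("//a/b").matcher("/x/y/a/b").matches() is true under this sketch, while the pattern compiled from "/a/*.txt" rejects "/a/readme.md".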

Many input paths can be created using just these simple rules. However, input paths can be more complicated. Here are some more examples:

Input Path                          Description
/a/(b|c|d)                          Match children of the top level node "a" that are named "b", "c" or "d". None of the nodes may have same-name-sibling indexes.
/a/b[c/d]                           Match node "b" child of the top level node "a", when node "b" has a child named "c", and "c" has a child named "d". Node "b" is the selected node, while nodes "c" and "d" are used as criteria but are not selected.
/a(/(b|c|d|)/e)[f/g/@something]     Match node "/a/b/e", "/a/c/e", "/a/d/e", or "/a/e" when they also have a child "f" that itself has a child "g" with property "something". None of the nodes may have same-name-sibling indexes.

More Complex Input Path Examples

These examples show a few more advanced rules. Parentheses (i.e., '(' and ')') can be used to define a set of options for names, as shown in the first and third examples. Whatever part of the selected node's path appears between the parentheses is captured for use within the output path. Thus, the first input path in the previous table would match node "/a/b", and "b" would be captured and could be used within the output path using "$1", where the number used in the output path identifies which set of parentheses supplied the captured text.
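The capture-and-substitute behavior is analogous to capture groups in regular expressions. The sketch below uses java.util.regex as a stand-in to show the idea; it is an analogy rather than ModeShape's implementation, it relies on the fact that "/a/(b|c|d)" happens to be valid regex syntax as well, and the CaptureDemo class and the "/x/$1" output path are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo: expand "$n" references in an output path from groups captured in the input path.
class CaptureDemo {
    // Returns the expanded output path, or null when the node path does not match.
    static String outputFor(String inputRegex, String outputTemplate, String nodePath) {
        Matcher m = Pattern.compile(inputRegex).matcher(nodePath);
        if (!m.matches()) {
            return null;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < outputTemplate.length(); i++) {
            char c = outputTemplate.charAt(i);
            boolean isRef = c == '$'
                    && i + 1 < outputTemplate.length()
                    && Character.isDigit(outputTemplate.charAt(i + 1));
            if (isRef) {
                // Single-digit group reference such as $1.
                int group = outputTemplate.charAt(i + 1) - '0';
                String captured = group <= m.groupCount() ? m.group(group) : null;
                out.append(captured == null ? "" : captured);
                i++; // skip the digit
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Under these assumptions, a node at "/a/c" matched by "/a/(b|c|d)" with output path "/x/$1" would yield "/x/c".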

Square brackets can also be used to specify criteria on a node's properties or children. Whatever appears in between the square brackets does not appear in the selected node.

So far, we've talked about how input paths and output paths are independent of the repository and workspace. However, there are times when it's desirable to configure sequencers to only work against content in a specific source and/or specific workspace. In these cases, it is possible to specify the repository name and workspace names before the path. For example:

Input Path                   Description
source:default:/a/(b|c|d)    Match nodes in the "default" workspace within the "source" source that are children of the top level node "a" and named "b", "c" or "d". None of the nodes may have same-name-sibling indexes.
:default:/a/(b|c|d)          Match nodes in the "default" workspace within any source that are children of the top level node "a" and named "b", "c" or "d". None of the nodes may have same-name-sibling indexes.
source::/a/(b|c|d)           Match nodes in any workspace in the "source" source that are children of the top level node "a" and named "b", "c" or "d". None of the nodes may have same-name-sibling indexes.
::/a/(b|c|d)                 Match nodes in any workspace within any source that are children of the top level node "a" and named "b", "c" or "d". None of the nodes may have same-name-sibling indexes. (This is equivalent to the path "/a/(b|c|d)".)

Input Paths with Source and Workspace Names

Again, the rules are pretty straightforward. You can leave off the repository name and workspace name, or you can prepend the path with "{sourceNamePattern}:{workspaceNamePattern}:", where "{sourceNamePattern}" is a regular-expression pattern used to match the applicable source names, and "{workspaceNamePattern}" is a regular-expression pattern used to match the applicable workspace names. A blank pattern implies any match, and is a shorthand notation for ".*". Note that the repository names may not include forward slashes (i.e., '/') or colons (i.e., ':').
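One way to picture these rules is a small parser that separates the optional prefix from the path and substitutes ".*" for blank patterns. The ScopedPath class below is a hypothetical sketch written for this explanation, not ModeShape code; it assumes that neither pattern contains '/' or ':', so everything before the first '/' is the prefix.

```java
// Hypothetical parser for "{sourceNamePattern}:{workspaceNamePattern}:{path}".
class ScopedPath {
    final String sourcePattern;
    final String workspacePattern;
    final String path;

    ScopedPath(String sourcePattern, String workspacePattern, String path) {
        this.sourcePattern = sourcePattern;
        this.workspacePattern = workspacePattern;
        this.path = path;
    }

    static ScopedPath parse(String expression) {
        // The path itself starts at the first '/'; anything before it is the prefix.
        int slash = expression.indexOf('/');
        if (slash < 0) {
            throw new IllegalArgumentException("No path found in: " + expression);
        }
        String prefix = expression.substring(0, slash);
        String path = expression.substring(slash);
        String source = ".*";     // a missing or blank pattern matches anything
        String workspace = ".*";
        if (!prefix.isEmpty()) {
            // "source:default:" splits into ["source", "default", ""].
            String[] parts = prefix.split(":", -1);
            if (parts.length != 3 || !parts[2].isEmpty()) {
                throw new IllegalArgumentException("Expected 'source:workspace:' before the path: " + expression);
            }
            if (!parts[0].isEmpty()) source = parts[0];
            if (!parts[1].isEmpty()) workspace = parts[1];
        }
        return new ScopedPath(source, workspace, path);
    }
}
```

In this sketch, "::/a/(b|c|d)" and "/a/(b|c|d)" parse to the same three values, which mirrors the equivalence noted in the table above.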

Let's go back to the previous code fragment and look at the first path expression:

This matches a node named "jcr:content" with property "jcr:data" but no siblings with the same name, and that is a child of a node whose name ends with ".jpg", ".jpeg", ".gif", ".bmp", ".pcx", or ".png" that may have any same-name-sibling index. These nodes can appear at any level in the repository. Note how the input path captures the filename (the segment containing the file extension), including any same-name-sibling index. This filename is then used in the output path, which is where the sequenced content is placed.

Out-of-the-box Sequencers

A number of sequencers are already available in ModeShape, and are outlined in detail later in the document. Note that we do want to build more sequencers in the upcoming releases.

Creating Custom Sequencers

The current release of ModeShape comes with eleven sequencers. However, it's very easy to create your own sequencers and to then configure ModeShape to use them in your own application.

Creating a custom sequencer involves the following steps:

  1. Create a Maven 3 project for your sequencer;
  2. Implement the StreamSequencer interface with your own implementation, and create unit tests to verify the functionality and expected behavior;
  3. Add the sequencer configuration to the ModeShape SequencingService in your application as described in the previous chapter; and
  4. Deploy the JAR file with your implementation (as well as any dependencies), and make them available to ModeShape in your application.

It's that simple.

Creating the Maven 3 project

The first step is to create the Maven 3 project that you can use to compile your code and build the JARs. Maven 3 automates a lot of the work, and since you're already set up to use Maven, using Maven for your project will save you a lot of time and effort. Of course, you don't have to use Maven 3, but then you'll have to get the required libraries and manage the compiling and building process yourself.

ModeShape may in the future provide a Maven archetype for creating sequencer projects. If you'd find this useful and would like to help create it, please join the community.

In lieu of a Maven archetype, you may find it easier to start with a small existing sequencer project. The modeshape-sequencer-images project is a small, self-contained sequencer implementation that has only the minimal dependencies. See the Git repository: http://github.com/ModeShape/modeshape//tree/modeshape-2.8.0.Final/extensions/modeshape-sequencer-images/

You can create your Maven project any way you'd like. For examples, see the Maven 3 documentation. Once you've done that, just add the dependencies in your project's pom.xml dependencies section:

These are the minimum dependencies required for compiling a sequencer. Of course, you'll have to add any other dependencies that your sequencer needs.

As for testing, you probably will want to add more dependencies, such as those listed here:

Testing ModeShape sequencers does not require a JCR repository or the ModeShape services. (For more detail, see the testing section.) However, if you want to do integration testing with a JCR repository and the ModeShape services, you'll need additional dependencies for these libraries.

At this point, your project should be set up correctly, and you're ready to move on to write your custom implementation of the StreamSequencer interface. As stated earlier, this should be fairly straightforward: process the stream and generate the output that's appropriate for the kind of file being sequenced.

Let's look at an example. Here is the complete code for the ImageMetadataSequencer implementation:

where the ImageMetadataLexicon class contains the Name constants and is defined as:

Notice how the image metadata is extracted and the output graph is generated. A single node is created with the name image:metadata and with the image:metadata node type. No mixins are defined for the node, but several properties are set on the node using the values obtained from the image metadata. After this method returns, the constructed graph will be saved to the repository in all of the places defined by its configuration. (This is why only relative paths are used in the sequencer.)

Testing custom sequencers

The sequencing framework was designed to make testing sequencers much easier. In particular, the StreamSequencer interface does not make use of the JCR API. So instead of requiring a fully-configured JCR repository and ModeShape system, unit tests for a sequencer can focus on testing that the content is processed correctly and the desired output graph is generated.

For a complete example of a sequencer unit test, see the ImageMetadataSequencerTest unit test in the org.modeshape.sequencer.images package of the modeshape-sequencer-images project.

The following code fragment shows one way of testing a sequencer, using JUnit 4.4 assertions and some of the classes made available by ModeShape. Of course, this example code does not do any error handling and does not make all the assertions a real test would.

It's also useful to test that a sequencer produces no output for something it should not understand:

These are just two simple tests that show ways of testing a sequencer. Some tests may get quite involved, especially if a lot of output data is produced.

It may also be useful to create some integration tests that configure ModeShape to use a custom sequencer, and to then upload content using the JCR API, verifying that the custom sequencer did run. However, remember that ModeShape runs sequencers asynchronously in the background, and you must synchronize your tests to ensure that the sequencers have a chance to run before checking the results.
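One common way to synchronize such a test is to poll for the expected result with a timeout instead of sleeping for a fixed time. The helper below is a generic sketch in plain Java; the Await class and its until method are hypothetical names, and the condition you pass in (for example, a check that the sequenced output node exists) is up to your test.

```java
import java.util.function.BooleanSupplier;

// Hypothetical test helper: repeatedly checks a condition until it holds or a timeout expires.
class Await {
    // Returns true as soon as the condition holds, or false if the timeout elapses first.
    static boolean until(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt status and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

An integration test could then assert that Await.until(...) returns true before inspecting the sequenced content, which keeps the test fast when sequencing finishes quickly and bounded when it does not.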

Summary

In this chapter, we described how ModeShape sequences files as they're uploaded into a repository. We've also learned in previous chapters about the ModeShape execution contexts, graph model, and connectors. In the next part we'll put all these pieces together to learn how to set up a ModeShape repository and access it using the JCR API.
