Chapter 5. Sequencing framework

Many repositories are used (at least in part) to manage files and other artifacts, including service definitions, policy files, images, media, documents, presentations, application components, reusable libraries, configuration files, application installations, database schemas, management scripts, and so on. Unlocking the information buried within all of those files is what ModeShape sequencing is all about. As files are loaded into the repository, your ModeShape instance can automatically sequence these files to extract from their content meaningful information that can be stored in the repository, where it can then be searched, accessed, and analyzed using the JCR API.

5.1. Sequencers

Sequencers are just POJOs that implement a specific interface, and their job is to process a stream of data (supplied by ModeShape) to extract meaningful content that usually takes the form of a structured graph. Exactly what content is extracted is up to each sequencer implementation. For example, ModeShape comes with an image sequencer that extracts the simple metadata from different kinds of image files (e.g., JPEG, GIF, PNG). Another example is the Compact Node Definition (CND) sequencer that processes CND files to extract and produce a structured representation of the node type definitions, property definitions, and child node definitions contained within the file.

Sequencers are configured to identify the kinds of nodes that the sequencers can work against. When content in the repository changes, ModeShape looks to see which (if any) sequencers might be able to run on the changed content. If any sequencer configurations do match, those sequencers are run against the content, and the structured graph output of the sequencers is then written back into the repository (at a location dictated by the sequencer configuration). And once that information is in the repository, it can be easily found and accessed via the standard JCR API.

In other words, ModeShape uses sequencers to help you extract more meaning from the artifacts you already are managing, and makes it much easier for applications to find and use all that valuable information. All without your applications doing anything extra.

5.2. Stream Sequencers

The StreamSequencer interface defines the single method that must be implemented by a sequencer:

public interface StreamSequencer {

    /**
     * Sequence the data found in the supplied stream, placing the output 
     * information into the supplied map.
     *
     * @param stream the stream with the data to be sequenced; never null
     * @param output the output from the sequencing operation; never null
     * @param context the context for the sequencing operation; never null
     */
    void sequence( InputStream stream, SequencerOutput output, StreamSequencerContext context );
}

A new instance is created for each sequencing operation, so there is no need for the class to be synchronized or thread-safe. Additionally, when a sequencer configuration includes properties (see configuring a sequencer), ModeShape will set those properties on the StreamSequencer implementation using JavaBean-style setter methods. This makes it easy to define sequencer-specific properties in the sequencer configuration, and equally easy to support them in the implementation: simply add a setter method for each property.
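The general idea of applying configuration properties through JavaBean-style setters can be sketched with plain Java reflection. This is only an illustration of the convention, not ModeShape's actual code: the ThumbnailSequencer class, its maxWidth property, and the BeanPropertySetter helper are all hypothetical names made up for this example.

```java
import java.lang.reflect.Method;

// Hypothetical sequencer-like bean with one configurable property (not a ModeShape class).
class ThumbnailSequencer {
    private int maxWidth = 100;

    public int getMaxWidth() { return maxWidth; }

    public void setMaxWidth(int maxWidth) { this.maxWidth = maxWidth; }
}

// Hypothetical helper showing how a property name maps onto a setter method.
final class BeanPropertySetter {

    // Calls the public single-argument method "setXxx" for the property "xxx".
    static void applyProperty(Object bean, String name, Object value) {
        String setterName = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
        for (Method method : bean.getClass().getMethods()) {
            if (method.getName().equals(setterName) && method.getParameterCount() == 1) {
                try {
                    method.invoke(bean, value);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("Could not call " + setterName, e);
                }
                return;
            }
        }
        throw new IllegalArgumentException("No setter named " + setterName);
    }
}
```

With this sketch, a configured property named "maxWidth" ends up as a call to setMaxWidth(...) on the sequencer instance, which is why a sequencer only needs ordinary setters to become configurable.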

Implementations are responsible for processing the content in the supplied InputStream and generating structured content using the supplied SequencerOutput interface. The StreamSequencerContext provides additional details about the information that is being sequenced, including the location and properties of the node being sequenced, the MIME type of the node being sequenced, and a Problems object where the sequencer can record problems that aren't severe enough to warrant throwing an exception. The StreamSequencerContext also provides access to the ValueFactories that can be used to create Path, Name, and any other value objects.

The SequencerOutput interface is fairly easy to use, and its job is to hide from the sequencer all the specifics about where the output is being written. Therefore, the interface has only a few methods for implementations to call. Two methods set the property values on a node, while the other sets references to other nodes in the repository. Use these methods to describe the properties of the nodes you want to create, using relative paths for the nodes and valid JCR property names for properties and references. ModeShape will ensure that nodes are created or updated whenever they're needed.

public interface SequencerOutput {

  /**
   * Set the supplied property on the supplied node.  The allowable
   * values are any of the following:
   *   - primitives (which will be autoboxed)
   *   - String instances
   *   - String arrays
   *   - byte arrays
   *   - InputStream instances
   *   - Calendar instances
   *
   * @param nodePath the path to the node containing the property; 
   * may not be null
   * @param property the name of the property to be set
   * @param values the value(s) for the property; may be empty if 
   * any existing property is to be removed
   */
  void setProperty( String nodePath, String property, Object... values );
  void setProperty( Path nodePath, Name property, Object... values );

  /**
   * Set the supplied reference on the supplied node.
   *
   * @param nodePath the path to the node containing the property; 
   * may not be null
   * @param property the name of the property to be set
   * @param paths the paths to the referenced property, which may be
   * absolute paths or relative to the sequencer output node;
   * may be empty if any existing property is to be removed
   */
  void setReference( String nodePath, String property, String... paths );
}

Note

ModeShape will create nodes of type nt:unstructured unless you specify the value for the jcr:primaryType property. You can also specify the values for the jcr:mixinTypes property if you want to add mixins to any node.
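To make the calling pattern concrete, here is a minimal, self-contained sketch of a sequencer writing properties with relative node paths. It deliberately does not use the ModeShape classes: SimpleOutput is a local stand-in that mirrors only the String-based setProperty method shown above, and RecordingOutput, LineCountSequencer, and the "text:summary" and "text:lineCount" names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Local stand-in for the String-based setProperty method (not the ModeShape interface).
interface SimpleOutput {
    void setProperty(String nodePath, String property, Object... values);
}

// Records every property in memory, keyed by "nodePath/@property".
class RecordingOutput implements SimpleOutput {
    final Map<String, Object[]> properties = new LinkedHashMap<>();

    @Override
    public void setProperty(String nodePath, String property, Object... values) {
        properties.put(nodePath + "/@" + property, values);
    }
}

// Hypothetical sequencer that counts the lines in a text stream.
class LineCountSequencer {
    void sequence(InputStream stream, SimpleOutput output) {
        long lines = 0;
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
            while (reader.readLine() != null) {
                lines++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The node path is relative; the primary type is set explicitly here
        // (to the same value the note above describes as the default).
        output.setProperty("text:summary", "jcr:primaryType", "nt:unstructured");
        output.setProperty("text:summary", "text:lineCount", lines);
    }
}
```

The sequencer never learns where the "text:summary" node ends up; it only describes the node relative to the output location, which is the point of the SequencerOutput abstraction.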

5.3. Path Expressions

Each sequencer must be configured to describe the areas or types of content that the sequencer is capable of handling. This is done by specifying these patterns using path expressions that identify the nodes (or node patterns) that should be sequenced and where to store the output generated by the sequencer. We'll see how to fully configure a sequencer in the next chapter, but before then let's dive into path expressions in more detail.

A path expression consists of two parts: a selection criterion (or input path) and an output path:

  inputPath => outputPath

The inputPath part defines an expression for the path of a node that is to be sequenced. Input paths consist of '/' separated segments, where each segment represents a pattern for a single node's name (including the same-name-sibling indexes) and '@' signifies a property name.

Let's first look at some simple examples:

Table 5.1. Simple Input Path Examples

Input Path	Description
/a/b	Match node "`b`" that is a child of the top level node "`a`". Neither node may have any same-name-siblings.
/a/*	Match any child node of the top level node "`a`".
/a/*.txt	Match any child node of the top level node "`a`" that also has a name ending in "`.txt`".
/a/b@c	Match the property "`c`" of node "`/a/b`".
/a/b[2]	The second child named "`b`" below the top level node "`a`".
/a/b[2,3,4]	The second, third or fourth child named "`b`" below the top level node "`a`".
/a/b[*]	Any (and every) child named "`b`" below the top level node "`a`".
//a/b	Any node named "`b`" that exists below a node named "`a`", regardless of where node "`a`" occurs. Again, neither node may have any same-name-siblings.

With these simple examples, you can probably discern the most important rules. First, the '*' is a wildcard character that matches any character or sequence of characters in a node's name (or index if appearing in between square brackets), and can be used in conjunction with other characters (e.g., "*.txt").

Second, square brackets (i.e., '[' and ']') are used to match a node's same-name-sibling index. You can put a single non-negative number or a comma-separated list of non-negative numbers between them. Use '0' to match a node that has no same-name-siblings, or any positive number to match the specific same-name-sibling.

Third, combining two delimiters (e.g., "//") matches any sequence of nodes, regardless of what their names are or how many nodes there are. This is often used with other patterns to identify nodes at any level that match those patterns. Three or more sequential slash characters are treated as two.
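One way to build intuition for the first and third rules is to translate a simplified input path into a regular expression. The sketch below is not ModeShape's matcher: SimpleInputPath is a hypothetical helper that handles only literal names, the '*' wildcard, and "//", and it assumes that "//" may also match zero intermediate nodes. Same-name-sibling indexes and '@' properties are left out entirely.

```java
import java.util.regex.Pattern;

// Hypothetical helper: handles names, '*', and '//' only (no indexes, no '@').
final class SimpleInputPath {

    // Translates the simplified input path into an equivalent regular expression.
    static Pattern toRegex(String inputPath) {
        StringBuilder regex = new StringBuilder();
        int i = 0;
        while (i < inputPath.length()) {
            char c = inputPath.charAt(i);
            if (c == '/') {
                int start = i;
                while (i < inputPath.length() && inputPath.charAt(i) == '/') {
                    i++;
                }
                // Two or more slashes: any number of intermediate nodes (assumed to include zero).
                regex.append(i - start >= 2 ? "(?:/[^/]+)*/" : "/");
            } else if (c == '*') {
                // Wildcard that stays within a single node name.
                regex.append("[^/]*");
                i++;
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
                i++;
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Returns true when the whole node path matches the simplified input path.
    static boolean matches(String inputPath, String nodePath) {
        return toRegex(inputPath).matcher(nodePath).matches();
    }
}
```

For example, under these assumptions "/a/*.txt" matches "/a/readme.txt" but not "/a/b/readme.txt", because the wildcard never crosses a '/' delimiter, while "//a/b" matches both "/a/b" and "/x/y/a/b".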

Many input paths can be created using just these simple rules. However, input paths can be more complicated. Here are some more examples:

Table 5.2. More Complex Input Path Examples

Input Path	Description
/a/(b|c|d)	Match children of the top level node "`a`" that are named "`b`", "`c`" or "`d`". None of the nodes may have same-name-sibling indexes.
/a/b[c/d]	Match node "`b`" child of the top level node "`a`", when node "`b`" has a child named "`c`", and "`c`" has a child named "`d`". Node "`b`" is the selected node, while nodes "`c`" and "`d`" are used as criteria but are not selected.
/a(/(b|c|d|)/e)[f/g/@something]	Match node "`/a/b/e`", "`/a/c/e`", "`/a/d/e`", or "`/a/e`" when they also have a child "`f`" that itself has a child "`g`" with property "`something`". None of the nodes may have same-name-sibling indexes.

These examples show a few more advanced rules. Parentheses (i.e., '(' and ')') can be used to define a set of options for names, as shown in the first and third rules. Whatever part of the selected node's path appears between the parentheses is captured for use within the output path. Thus, the first input path in the previous table would match node "/a/b", and "b" would be captured and could be used within the output path using "$1", where the number used in the output path identifies the parentheses.

Square brackets can also be used to specify criteria on a node's properties or children. Whatever appears in between the square brackets does not appear in the selected node.
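The way captured segments flow into the output path can be sketched with java.util.regex, whose capture groups behave much like the parentheses described above. This is an analogy under stated assumptions rather than ModeShape's implementation: OutputPathResolver is a hypothetical helper, and the regular expression passed to it is a hand-written equivalent of the input path, not something ModeShape produces.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper showing how "$n" in an output path is filled from captures.
final class OutputPathResolver {

    // Returns the resolved output path, or null when the node path does not match.
    static String resolve(Pattern inputRegex, String outputTemplate, String nodePath) {
        Matcher matcher = inputRegex.matcher(nodePath);
        if (!matcher.matches()) {
            return null;
        }
        String result = outputTemplate;
        // Substitute from the highest group down so "$10" is never read as "$1" plus "0".
        for (int group = matcher.groupCount(); group >= 1; group--) {
            String captured = matcher.group(group);
            result = result.replace("$" + group, captured == null ? "" : captured);
        }
        return result;
    }
}
```

Using the regular expression "/a/(b|c|d)" as a stand-in for the first input path in the table, the node path "/a/c" with the output template "/output/$1" resolves to "/output/c", while "/a/e" does not match at all.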

So far, the input paths and output paths we've looked at are independent of the source and workspace. However, there are times when it's desirable to configure sequencers to only work against content in a specific source and/or specific workspace. In these cases, it is possible to specify the source name and workspace name before the path. For example:

Table 5.3. Input Paths with Source and Workspace Names

Input Path	Description
source:default:/a/(b|c|d)	Match nodes in the "`default`" workspace within the "`source`" source that are children of the top level node "`a`" and named "`b`", "`c`" or "`d`". None of the nodes may have same-name-sibling indexes.
:default:/a/(b|c|d)	Match nodes in the "`default`" workspace within any source that are children of the top level node "`a`" and named "`b`", "`c`" or "`d`". None of the nodes may have same-name-sibling indexes.
source::/a/(b|c|d)	Match nodes in any workspace in the "`source`" source that are children of the top level node "`a`" and named "`b`", "`c`" or "`d`". None of the nodes may have same-name-sibling indexes.
::/a/(b|c|d)	Match nodes in any workspace within any source that are children of the top level node "`a`" and named "`b`", "`c`" or "`d`". None of the nodes may have same-name-sibling indexes. (This is equivalent to the path "`/a/(b|c|d)`".)

Again, the rules are pretty straightforward. You can leave off the source name and workspace name, or you can prepend the path with "{sourceNamePattern}:{workspaceNamePattern}:", where "{sourceNamePattern}" is a regular-expression pattern used to match the applicable source names, and "{workspaceNamePattern}" is a regular-expression pattern used to match the applicable workspace names. A blank pattern implies any match, and is a shorthand notation for ".*". Note that the source names may not include forward slashes ('/') or colons (':').
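The prefix rule can be illustrated with a small parser that splits an expression into its source pattern, workspace pattern, and path. This is a sketch, not the parser ModeShape uses: QualifiedInputPath is a hypothetical class, and it assumes that neither pattern contains a '/' or ':' character, so that everything before the first '/' is the optional prefix.

```java
// Hypothetical value class holding the three parts of a qualified input path.
final class QualifiedInputPath {
    final String sourcePattern;
    final String workspacePattern;
    final String path;

    private QualifiedInputPath(String sourcePattern, String workspacePattern, String path) {
        this.sourcePattern = sourcePattern;
        this.workspacePattern = workspacePattern;
        this.path = path;
    }

    // Splits "{source}:{workspace}:/path"; the prefix is optional and blank parts become ".*".
    static QualifiedInputPath parse(String expression) {
        int slash = expression.indexOf('/');
        if (slash < 0) {
            throw new IllegalArgumentException("No path in: " + expression);
        }
        String prefix = expression.substring(0, slash);
        String path = expression.substring(slash);
        if (prefix.isEmpty()) {
            return new QualifiedInputPath(".*", ".*", path);
        }
        // A valid prefix looks like "source:workspace:", giving three parts with an empty last one.
        String[] parts = prefix.split(":", -1);
        if (parts.length != 3 || !parts[2].isEmpty()) {
            throw new IllegalArgumentException("Malformed prefix in: " + expression);
        }
        return new QualifiedInputPath(blankToAny(parts[0]), blankToAny(parts[1]), path);
    }

    private static String blankToAny(String pattern) {
        return pattern.isEmpty() ? ".*" : pattern;
    }
}
```

Because the split happens at the first '/', colons that appear later in the path (as in "jcr:content") are left alone and stay part of the path.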

Let's put these rules together and look at a complete path expression:

  //(*.(jpg|jpeg|gif|bmp|pcx|png)[*])/jcr:content[@jcr:data] => /images/$1

This matches a node named "jcr:content" with property "jcr:data" but no siblings with the same name, and that is a child of a node whose name ends with ".jpg", ".jpeg", ".gif", ".bmp", ".pcx", or ".png" that may have any same-name-sibling index. These nodes can appear at any level in the repository. Note how the input path captures the filename (the segment containing the file extension), including any same-name-sibling index. This filename is then used in the output path, which is where the sequenced content is placed.

5.4. Out-of-the-box Sequencers

A number of sequencers are already available in ModeShape, and are outlined in detail later in the document. Note that we do want to build more sequencers in the upcoming releases.

5.5. Creating Custom Sequencers

The current release of ModeShape comes with eleven sequencers. However, it's very easy to create your own sequencers and to then configure ModeShape to use them in your own application.

Creating a custom sequencer involves the following steps:

Create a Maven 3 project for your sequencer;
Implement the StreamSequencer interface with your own implementation, and create unit tests to verify the functionality and expected behavior;
Add the sequencer configuration to the ModeShape SequencingService in your application as described in the previous chapter; and
Deploy the JAR file with your implementation (as well as any dependencies), and make them available to ModeShape in your application.

It's that simple.

5.5.1. Creating the Maven 3 project

The first step is to create the Maven 3 project that you can use to compile your code and build the JARs. Maven 3 automates a lot of the work, and since you're already set up to use Maven, using Maven for your project will save you a lot of time and effort. Of course, you don't have to use Maven 3, but then you'll have to get the required libraries and manage the compiling and building process yourself.

Note

ModeShape may provide in the future a Maven archetype for creating sequencer projects. If you'd find this useful and would like to help create it, please join the community.

In lieu of a Maven archetype, you may find it easier to start with a small existing sequencer project. The modeshape-sequencer-images project is a small, self-contained sequencer implementation that has only the minimal dependencies. See the Git repository: http://github.com/ModeShape/modeshape//tree/modeshape-2.8.1.Final/extensions/modeshape-sequencer-images/

You can create your Maven project any way you'd like. For examples, see the Maven 3 documentation. Once you've done that, just add the dependencies in your project's pom.xml dependencies section:

<dependency>
  <groupId>org.modeshape</groupId>
  <artifactId>modeshape-graph</artifactId>
  <version>2.9-SNAPSHOT</version>
</dependency>

This is the minimum dependency required for compiling a sequencer. Of course, you'll have to add any other dependencies that your sequencer needs.

As for testing, you probably will want to add more dependencies, such as those listed here:

<!-- ModeShape-related unit testing utilities and classes -->
<dependency>
  <groupId>org.modeshape</groupId>
  <artifactId>modeshape-graph</artifactId>
  <version>2.9-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.modeshape</groupId>
  <artifactId>modeshape-common</artifactId>
  <version>2.9-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
<!-- Unit testing -->
<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>4.4</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.mockito</groupId>
  <artifactId>mockito-all</artifactId>
  <version>1.8.4</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.hamcrest</groupId>
  <artifactId>hamcrest-library</artifactId>
  <version>1.1</version>
  <scope>test</scope>
</dependency>
<!-- Logging with Log4J -->
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-log4j12</artifactId>
  <version>1.6.1</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>log4j</groupId>
  <artifactId>log4j</artifactId>
  <version>1.2.16</version>
  <scope>test</scope>
</dependency>

Testing ModeShape sequencers does not require a JCR repository or the ModeShape services. (For more detail, see the testing section.) However, if you want to do integration testing with a JCR repository and the ModeShape services, you'll need additional dependencies for these libraries.

<!-- ModeShape JCR Repository -->
<dependency>
  <groupId>org.modeshape</groupId>
  <artifactId>modeshape-jcr</artifactId>
  <version>2.9-SNAPSHOT</version>
  <scope>test</scope>
</dependency>
<!-- Java Content Repository API -->
<dependency>
  <groupId>javax.jcr</groupId>
  <artifactId>jcr</artifactId>
  <version>2.0</version>
  <scope>test</scope>
</dependency>

At this point, your project should be set up correctly, and you're ready to move on to write your custom implementation of the StreamSequencer interface. As stated earlier, this should be fairly straightforward: process the stream and generate the output that's appropriate for the kind of file being sequenced.

Let's look at an example. Here is the complete code for the ImageMetadataSequencer implementation:

public class ImageMetadataSequencer implements StreamSequencer {

    /**
     * {@inheritDoc}
     *
     * @see StreamSequencer#sequence(InputStream, SequencerOutput, StreamSequencerContext)
     */
    public void sequence( InputStream stream,
                          SequencerOutput output,
                          StreamSequencerContext context ) {

        ImageMetadata metadata = new ImageMetadata();
        metadata.setInput(stream);
        metadata.setDetermineImageNumber(true);
        metadata.setCollectComments(true);

        // Process the image stream and extract the metadata ...
        if (!metadata.check()) {
            metadata = null;
        }

        // Generate the output graph if we found useful metadata ...
        if (metadata != null) {
            PathFactory pathFactory = context.getValueFactories().getPathFactory();
            Path metadataNode = pathFactory.createRelativePath(ImageMetadataLexicon.METADATA_NODE);

            // Place the image metadata into the output map ...
            output.setProperty(metadataNode, JcrLexicon.PRIMARY_TYPE, "image:metadata");
            // output.setProperty(metadataNode, nameFactory.create(IMAGE_MIXINS), "");
            output.setProperty(metadataNode, JcrLexicon.MIMETYPE, metadata.getMimeType());
            // output.setProperty(metadataNode, nameFactory.create(IMAGE_ENCODING), "");
            output.setProperty(metadataNode, ImageMetadataLexicon.FORMAT_NAME, metadata.getFormatName());
            output.setProperty(metadataNode, ImageMetadataLexicon.WIDTH, metadata.getWidth());
            output.setProperty(metadataNode, ImageMetadataLexicon.HEIGHT, metadata.getHeight());
            output.setProperty(metadataNode, ImageMetadataLexicon.BITS_PER_PIXEL, metadata.getBitsPerPixel());
            output.setProperty(metadataNode, ImageMetadataLexicon.PROGRESSIVE, metadata.isProgressive());
            output.setProperty(metadataNode, ImageMetadataLexicon.NUMBER_OF_IMAGES, metadata.getNumberOfImages());
            output.setProperty(metadataNode, ImageMetadataLexicon.PHYSICAL_WIDTH_DPI, metadata.getPhysicalWidthDpi());
            output.setProperty(metadataNode, ImageMetadataLexicon.PHYSICAL_HEIGHT_DPI, metadata.getPhysicalHeightDpi());
            output.setProperty(metadataNode, ImageMetadataLexicon.PHYSICAL_WIDTH_INCHES, metadata.getPhysicalWidthInch());
            output.setProperty(metadataNode, ImageMetadataLexicon.PHYSICAL_HEIGHT_INCHES, metadata.getPhysicalHeightInch());
        }
    }
}

where the ImageMetadataLexicon class contains the Name constants and is defined as:

	/**
	 * A lexicon of names used within the image sequencer.
	 */
	@Immutable
	public class ImageMetadataLexicon {

	    public static class Namespace {
	        public static final String URI = "http://www.modeshape.org/images/1.0";
	        public static final String PREFIX = "image";
	    }

	    public static final Name METADATA_NODE = new BasicName(Namespace.URI, "metadata");
	    public static final Name FORMAT_NAME = new BasicName(Namespace.URI, "formatName");
	    public static final Name WIDTH = new BasicName(Namespace.URI, "width");
	    public static final Name HEIGHT = new BasicName(Namespace.URI, "height");
	    public static final Name BITS_PER_PIXEL = new BasicName(Namespace.URI, "bitsPerPixel");
	    public static final Name PROGRESSIVE = new BasicName(Namespace.URI, "progressive");
	    public static final Name NUMBER_OF_IMAGES = new BasicName(Namespace.URI, "numberOfImages");
	    public static final Name PHYSICAL_WIDTH_DPI = new BasicName(Namespace.URI, "physicalWidthDpi");
	    public static final Name PHYSICAL_HEIGHT_DPI = new BasicName(Namespace.URI, "physicalHeightDpi");
	    public static final Name PHYSICAL_WIDTH_INCHES = new BasicName(Namespace.URI, "physicalWidthInches");
	    public static final Name PHYSICAL_HEIGHT_INCHES = new BasicName(Namespace.URI, "physicalHeightInches");

	}

Notice how the image metadata is extracted and the output graph is generated. A single node is created with the name image:metadata and with the image:metadata node type. No mixins are defined for the node, but several properties are set on the node using the values obtained from the image metadata. After this method returns, the constructed graph will be saved to the repository in all of the places defined by its configuration. (This is why only relative paths are used in the sequencer.)

5.5.2. Testing custom sequencers

The sequencing framework was designed to make testing sequencers much easier. In particular, the StreamSequencer interface does not make use of the JCR API. So instead of requiring a fully-configured JCR repository and ModeShape system, unit tests for a sequencer can focus on testing that the content is processed correctly and the desired output graph is generated.

Note

For a complete example of a sequencer unit test, see the ImageMetadataSequencerTest unit test in the org.modeshape.sequencer.images package of the modeshape-sequencer-images project.

The following code fragment shows one way of testing a sequencer, using JUnit 4.4 assertions and some of the classes made available by ModeShape. Of course, this example code does not do any error handling and does not make all the assertions a real test would.

StreamSequencer sequencer = new ImageMetadataSequencer();
MockSequencerOutput output = new MockSequencerOutput();
MockSequencerContext context = new MockSequencerContext();
InputStream stream = null;
try {
    stream = this.getClass().getClassLoader().getResource("caution.gif").openStream();
    sequencer.sequence(stream,output,context);   // writes to 'output'
    assertThat(output.getPropertyValues("image:metadata", "jcr:primaryType"), 
               is(new Object[] {"image:metadata"}));
    assertThat(output.getPropertyValues("image:metadata", "jcr:mimeType"), 
               is(new Object[] {"image/gif"}));
    // ... make more assertions here
    assertThat(output.hasReferences(), is(false));
} finally {
    // Guard against a failure before the stream was opened
    if (stream != null) stream.close();
}

It's also useful to test that a sequencer produces no output for something it should not understand:

StreamSequencer sequencer = new ImageMetadataSequencer();
MockSequencerOutput output = new MockSequencerOutput();
MockSequencerContext context = new MockSequencerContext();
InputStream stream = null;
try {
    stream = this.getClass().getClassLoader().getResource("caution.pict").openStream();
    sequencer.sequence(stream,output,context);   // writes to 'output'
    assertThat(output.hasProperties(), is(false));
    assertThat(output.hasReferences(), is(false));
} finally {
    // Guard against a failure before the stream was opened
    if (stream != null) stream.close();
}

These are just two simple tests that show ways of testing a sequencer. Some tests may get quite involved, especially if a lot of output data is produced.

It may also be useful to create some integration tests that configure ModeShape to use a custom sequencer, and to then upload content using the JCR API, verifying that the custom sequencer did run. However, remember that ModeShape runs sequencers asynchronously in the background, and you must synchronize your tests to ensure that the sequencers have a chance to run before checking the results.
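One simple way to synchronize such an integration test is to poll for the expected result with a timeout instead of sleeping for a fixed period. The helper below is a generic sketch and not part of ModeShape: the Eventually class and its waitFor method are hypothetical, and the condition you pass in would typically check whether the node the sequencer should have produced exists yet.

```java
import java.util.function.BooleanSupplier;

// Hypothetical test helper for waiting on work that completes in the background.
final class Eventually {

    // Polls the condition until it holds or the timeout elapses; returns whether it held.
    static boolean waitFor(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt status and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A test would upload its content, call waitFor with a generous timeout, and assert that the method returned true before inspecting the sequenced output. Polling keeps the test fast when sequencing finishes quickly while still tolerating slower runs.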

5.6. Summary

In this chapter, we described how ModeShape sequences files as they're uploaded into a repository. We've also learned in previous chapters about the ModeShape execution contexts, graph model, and connectors. In the next part we'll put all these pieces together to learn how to set up a ModeShape repository and access it using the JCR API.