Chapter 7. Sequencing framework

Many repositories are used (at least in part) to manage files and other artifacts, including service definitions, policy files, images, media, documents, presentations, application components, reusable libraries, configuration files, application installations, database schemas, management scripts, and so on. Unlocking the information buried within all of those files is what JBoss DNA sequencing is all about. As files are loaded into the repository, your JBoss DNA instance can automatically sequence these files to extract meaningful information from their content. That information is stored in the repository, where it can then be searched, accessed, and analyzed using the JCR API.

7.1. Sequencers

Sequencers are just POJOs that implement a specific interface, and their job is to process a stream of data (supplied by JBoss DNA) to extract meaningful content that usually takes the form of a structured graph. Exactly what content is up to each sequencer implementation. For example, JBoss DNA comes with an image sequencer that extracts the simple metadata from different kinds of image files (e.g., JPEG, GIF, PNG, etc.). Another example is the Compact Node Definition (CND) sequencer that processes the CND files to extract and produce a structured representation of the node type definitions, property definitions, and child node definitions contained within the file.

Sequencers are configured to identify the kinds of nodes that the sequencers can work against. When content in the repository changes, JBoss DNA looks to see which (if any) sequencers might be able to run on the changed content. If any sequencer configurations do match, those sequencers are run against the content, and the structured graph output of the sequencers is then written back into the repository (at a location dictated by the sequencer configuration). And once that information is in the repository, it can be easily found and accessed via the standard JCR API.

In other words, JBoss DNA uses sequencers to help you extract more meaning from the artifacts you already are managing, and makes it much easier for applications to find and use all that valuable information. All without your applications doing anything extra.

7.2. Stream Sequencers

The StreamSequencer interface defines the single method that must be implemented by a sequencer:

public interface StreamSequencer {

    /**
     * Sequence the data found in the supplied stream, placing the output 
     * information into the supplied map.
     *
     * @param stream the stream with the data to be sequenced; never null
     * @param output the output from the sequencing operation; never null
     * @param context the context for the sequencing operation; never null
     */
    void sequence( InputStream stream, SequencerOutput output, StreamSequencerContext context );
}

Implementations are responsible for processing the content in the supplied InputStream and generating structured content using the supplied SequencerOutput interface. The StreamSequencerContext provides additional details about the information that is being sequenced, including the location and properties of the node being sequenced, the MIME type of the node being sequenced, and a Problems object where the sequencer can record problems that aren't severe enough to warrant throwing an exception. The StreamSequencerContext also provides access to the ValueFactories that can be used to create Path, Name, and any other value objects.

The SequencerOutput interface is fairly easy to use, and its job is to hide from the sequencer all the specifics about where the output is being written. Therefore, the interface has only a few methods for implementations to call. Two methods set the property values on a node, while the other sets references to other nodes in the repository. Use these methods to describe the properties of the nodes you want to create, using relative paths for the nodes and valid JCR property names for properties and references. JBoss DNA will ensure that nodes are created or updated whenever they're needed.

public interface SequencerOutput {

  /**
   * Set the supplied property on the supplied node.  The allowable
   * values are any of the following:
   *   - primitives (which will be autoboxed)
   *   - String instances
   *   - String arrays
   *   - byte arrays
   *   - InputStream instances
   *   - Calendar instances
   *
   * @param nodePath the path to the node containing the property; 
   * may not be null
   * @param property the name of the property to be set
   * @param values the value(s) for the property; may be empty if 
   * any existing property is to be removed
   */
  void setProperty( String nodePath, String property, Object... values );
  void setProperty( Path nodePath, Name property, Object... values );

  /**
   * Set the supplied reference on the supplied node.
   *
   * @param nodePath the path to the node containing the property; 
   * may not be null
   * @param property the name of the property to be set
   * @param paths the paths to the referenced property, which may be
   * absolute paths or relative to the sequencer output node;
   * may be empty if any existing property is to be removed
   */
  void setReference( String nodePath, String property, String... paths );
}

Note

JBoss DNA will create nodes of type nt:unstructured unless you specify the value for the jcr:primaryType property. You can also specify the values for the jcr:mixinTypes property if you want to add mixins to any node.
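To make the relative-path convention concrete, here is a minimal, self-contained sketch of a sequencer-like method that counts the lines in a text stream and records the result. It does not use the JBoss DNA library: Output and RecordingOutput are hypothetical stand-ins that mirror only the String-based setProperty method shown above, and the "text:metadata" node and "text:lineCount" property names are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

final class LineCountSketch {

    /** Hypothetical stand-in mirroring setProperty(String, String, Object...). */
    interface Output {
        void setProperty( String nodePath, String property, Object... values );
    }

    /** Records every property in memory, keyed by "nodePath/@property". */
    static final class RecordingOutput implements Output {
        final Map<String, Object[]> properties = new LinkedHashMap<String, Object[]>();

        public void setProperty( String nodePath, String property, Object... values ) {
            properties.put(nodePath + "/@" + property, values);
        }
    }

    /** Counts the lines in the stream and describes one output node by relative path. */
    static void sequence( InputStream stream, Output output ) {
        Scanner scanner = new Scanner(stream, "UTF-8");
        long lines = 0;
        while (scanner.hasNextLine()) {
            scanner.nextLine();
            lines++;
        }
        // The node path is relative; where it ends up is decided by configuration.
        // The primary type is set explicitly here rather than relying on the default.
        output.setProperty("text:metadata", "jcr:primaryType", "nt:unstructured");
        output.setProperty("text:metadata", "text:lineCount", lines);
    }

    public static void main( String[] args ) {
        RecordingOutput output = new RecordingOutput();
        byte[] content = "a\nb\nc\n".getBytes(StandardCharsets.UTF_8);
        sequence(new ByteArrayInputStream(content), output);
        System.out.println(output.properties.get("text:metadata/@text:lineCount")[0]); // prints 3
    }
}
```

A real sequencer would implement StreamSequencer and receive the output and context objects from JBoss DNA instead of creating them itself.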

7.3. Path Expressions

Each sequencer must be configured to describe the areas or types of content that the sequencer is capable of handling. This is done by specifying these patterns using path expressions that identify the nodes (or node patterns) that should be sequenced and where to store the output generated by the sequencer. We'll see how to fully configure a sequencer in the next chapter, but before then let's dive into path expressions in more detail.

A path expression consists of two parts: a selection criterion (or an input path) and an output path:

  inputPath => outputPath

The inputPath part defines an expression for the path of a node that is to be sequenced. Input paths consist of '/' separated segments, where each segment represents a pattern for a single node's name (including the same-name-sibling indexes) and '@' signifies a property name.

Let's first look at some simple examples:

Table 7.1. Simple Input Path Examples

Input Path	Description
/a/b	Match node "`b`" that is a child of the top level node "`a`". Neither node may have any same-name-siblings.
/a/*	Match any child node of the top level node "`a`".
/a/*.txt	Match any child node of the top level node "`a`" that also has a name ending in "`.txt`".
/a/b@c	Match the property "`c`" of node "`/a/b`".
/a/b[2]	The second child named "`b`" below the top level node "`a`".
/a/b[2,3,4]	The second, third or fourth child named "`b`" below the top level node "`a`".
/a/b[*]	Any (and every) child named "`b`" below the top level node "`a`".
//a/b	Any node named "`b`" that exists below a node named "`a`", regardless of where node "`a`" occurs. Again, neither node may have any same-name-siblings.

With these simple examples, you can probably discern the most important rules. First, the '*' is a wildcard character that matches any character or sequence of characters in a node's name (or index if appearing in between square brackets), and can be used in conjunction with other characters (e.g., "*.txt").

Second, square brackets (i.e., '[' and ']') are used to match a node's same-name-sibling index. You can put a single non-negative number or a comma-separated list of non-negative numbers. Use '0' to match a node that has no same-name-siblings, or any positive number to match the specific same-name-sibling.

Third, combining two delimiters (e.g., "//") matches any sequence of nodes, regardless of what their names are or how many nodes there are. This is often used with other patterns to identify nodes at any level that match those patterns. Three or more sequential slash characters are treated as two.
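The wildcard and double-slash rules can be approximated with ordinary regular expressions. The sketch below is a simplified illustration and not JBoss DNA's actual matching algorithm. SimpleInputPathSketch and its methods are hypothetical names, and it handles only literal names, '*' and '//'. Same-name-sibling indexes, '@' properties, and parentheses are deliberately ignored.

```java
import java.util.regex.Pattern;

final class SimpleInputPathSketch {

    /** Converts a simple input path (literal names, '*' and '//' only) into a regex. */
    static Pattern toPattern( String inputPath ) {
        // Three or more sequential slashes are treated as two.
        String normalized = inputPath.replaceAll("/{3,}", "//");
        StringBuilder regex = new StringBuilder();
        int i = 0;
        while (i < normalized.length()) {
            char c = normalized.charAt(i);
            boolean doubleSlash = c == '/' && i + 1 < normalized.length()
                                  && normalized.charAt(i + 1) == '/';
            if (doubleSlash) {
                // '//' matches any sequence of intermediate nodes (possibly none).
                regex.append("(?:/[^/]+)*/");
                i += 2;
            } else if (c == '/') {
                regex.append('/');
                i++;
            } else if (c == '*') {
                // '*' matches any run of characters within a single node name.
                regex.append("[^/]*");
                i++;
            } else {
                // Everything else is matched literally.
                regex.append(Pattern.quote(String.valueOf(c)));
                i++;
            }
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns true if the whole node path matches the simple input path. */
    static boolean matches( String inputPath, String nodePath ) {
        return toPattern(inputPath).matcher(nodePath).matches();
    }

    public static void main( String[] args ) {
        System.out.println(matches("/a/*.txt", "/a/readme.txt")); // prints true
        System.out.println(matches("//a/b", "/x/y/a/b"));         // prints true
        System.out.println(matches("/a/*.txt", "/a/b/readme.txt")); // prints false
    }
}
```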

Many input paths can be created using just these simple rules. However, input paths can be more complicated. Here are some more examples:

Table 7.2. More Complex Input Path Examples

Input Path	Description
/a/(b|c|d)	Match children of the top level node "`a`" that are named "`b`", "`c`" or "`d`". None of the nodes may have same-name-sibling indexes.
/a/b[c/d]	Match node "`b`" child of the top level node "`a`", when node "`b`" has a child named "`c`", and "`c`" has a child named "`d`". Node "`b`" is the selected node, while nodes "`c`" and "`d`" are used as criteria but are not selected.
/a(/(b|c|d|)/e)[f/g/@something]	Match node "`/a/b/e`", "`/a/c/e`", "`/a/d/e`", or "`/a/e`" when they also have a child "`f`" that itself has a child "`g`" with property "`something`". None of the nodes may have same-name-sibling indexes.

These examples show a few more advanced rules. Parentheses (i.e., '(' and ')') can be used to define a set of options for names, as shown in the first and third examples. Whatever part of the selected node's path appears between the parentheses is captured for use within the output path. Thus, the first input path in the previous table would match node "/a/b", and "b" would be captured and could be used within the output path using "$1", where the number used in the output path identifies which set of parentheses is meant.

Square brackets can also be used to specify criteria on a node's properties or children. Whatever appears in between the square brackets does not appear in the selected node.

Let's put these rules together and look at a complete path expression:

  //(*.(jpg|jpeg|gif|bmp|pcx|png)[*])/jcr:content[@jcr:data] => /images/$1

This matches a node named "jcr:content" with property "jcr:data" but no siblings with the same name, and that is a child of a node whose name ends with ".jpg", ".jpeg", ".gif", ".bmp", ".pcx", or ".png" that may have any same-name-sibling index. These nodes can appear at any level in the repository. Note how the input path captures the filename (the segment containing the file extension), including any same-name-sibling index. This filename is then used in the output path, which is where the sequenced content is placed.
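The capture-and-substitute behavior can be illustrated with Java's own regular expressions. The sketch below is only an approximation under stated assumptions: the regex is hand-written to mimic the image path expression above, OutputPathSketch and outputPathFor are hypothetical names, and the "[@jcr:data]" property criterion is not checked because the sketch works on path strings alone.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OutputPathSketch {

    // Hand-written approximation of the image input path. Group 1 is the file
    // node's name, including an optional same-name-sibling index such as "[2]".
    private static final Pattern IMAGE_CONTENT = Pattern.compile(
        "(?:/[^/]+)*/([^/\\[\\]]*\\.(?:jpg|jpeg|gif|bmp|pcx|png)(?:\\[\\d+\\])?)/jcr:content");

    /** Returns the output path for a matching node path, or null if there is no match. */
    static String outputPathFor( String nodePath ) {
        Matcher matcher = IMAGE_CONTENT.matcher(nodePath);
        if (!matcher.matches()) {
            return null;
        }
        // Equivalent of the "/images/$1" output path: substitute the first capture.
        return "/images/" + matcher.group(1);
    }

    public static void main( String[] args ) {
        // prints /images/sunset.jpg
        System.out.println(outputPathFor("/files/photos/sunset.jpg/jcr:content"));
        // prints /images/logo.png[2]
        System.out.println(outputPathFor("/a/logo.png[2]/jcr:content"));
        // prints null
        System.out.println(outputPathFor("/a/notes.txt/jcr:content"));
    }
}
```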

7.4. Out-of-the-box Sequencers

A number of sequencers are already available in JBoss DNA, and are outlined in detail later in the document. Note that we do want to build more sequencers in the upcoming releases.

7.5. Creating Custom Sequencers

The current release of JBoss DNA comes with eleven sequencers. However, it's very easy to create your own sequencers and to then configure JBoss DNA to use them in your own application.

Creating a custom sequencer involves the following steps:

Create a Maven 2 project for your sequencer;
Implement the StreamSequencer interface with your own implementation, and create unit tests to verify the functionality and expected behavior;
Add the sequencer configuration to the JBoss DNA SequencingService in your application as described in the previous chapter; and
Deploy the JAR file with your implementation (as well as any dependencies), and make them available to JBoss DNA in your application.

It's that simple.

7.5.1. Creating the Maven 2 project

The first step is to create the Maven 2 project that you can use to compile your code and build the JARs. Maven 2 automates a lot of the work, and since you're already set up to use Maven, using Maven for your project will save you a lot of time and effort. Of course, you don't have to use Maven 2, but then you'll have to get the required libraries and manage the compiling and building process yourself.

Note

JBoss DNA may provide a Maven archetype for creating sequencer projects in the future. If you'd find this useful and would like to help create it, please join the community.

In lieu of a Maven archetype, you may find it easier to start with a small existing sequencer project. The dna-sequencer-images project is a small, self-contained sequencer implementation that has only the minimal dependencies. See the subversion repository: http://anonsvn.jboss.org/repos/dna/trunk/extensions/dna-sequencer-images/

You can create your Maven project any way you'd like. For examples, see the Maven 2 documentation. Once you've done that, just add the following dependency to the dependencies section of your project's pom.xml:

<dependency>
  <groupId>org.jboss.dna</groupId>
  <artifactId>dna-graph</artifactId>
  <version>0.7</version>
</dependency>

This is the minimum dependency required for compiling a sequencer. Of course, you'll have to add any other dependencies that your sequencer needs.

As for testing, you probably will want to add more dependencies, such as those listed here:

<!-- DNA-related unit testing utilities and classes -->
<dependency>
  <groupId>org.jboss.dna</groupId>
  <artifactId>dna-graph</artifactId>
  <version>0.7</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.jboss.dna</groupId>
  <artifactId>dna-common</artifactId>
  <version>0.7</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
<!-- Unit testing -->
<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>4.4</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.hamcrest</groupId>
  <artifactId>hamcrest-library</artifactId>
  <version>1.1</version>
  <scope>test</scope>
</dependency>
<!-- Logging with Log4J -->
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-log4j12</artifactId>
  <version>1.5.8</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>log4j</groupId>
  <artifactId>log4j</artifactId>
  <version>1.2.14</version>
  <scope>test</scope>
</dependency>

Testing JBoss DNA sequencers does not require a JCR repository or the JBoss DNA services. (For more detail, see the testing section.) However, if you want to do integration testing with a JCR repository and the JBoss DNA services, you'll need additional dependencies for these libraries.

<!-- JBoss DNA JCR Repository -->
<dependency>
  <groupId>org.jboss.dna</groupId>
  <artifactId>dna-jcr</artifactId>
  <version>0.7</version>
  <scope>test</scope>
</dependency>
<!-- Java Content Repository API -->
<dependency>
  <groupId>javax.jcr</groupId>
  <artifactId>jcr</artifactId>
  <version>1.0.1</version>
  <scope>test</scope>
</dependency>

At this point, your project should be set up correctly, and you're ready to move on to write your custom implementation of the StreamSequencer interface. As stated earlier, this should be fairly straightforward: process the stream and generate the output that's appropriate for the kind of file being sequenced.

Let's look at an example. Here is the complete code for the ImageMetadataSequencer implementation:

public class ImageMetadataSequencer implements StreamSequencer {

    public static final String METADATA_NODE = "image:metadata";
    public static final String IMAGE_PRIMARY_TYPE = "jcr:primaryType";
    public static final String IMAGE_MIXINS = "jcr:mixinTypes";
    public static final String IMAGE_MIME_TYPE = "jcr:mimeType";
    public static final String IMAGE_ENCODING = "jcr:encoding";
    public static final String IMAGE_FORMAT_NAME = "image:formatName";
    public static final String IMAGE_WIDTH = "image:width";
    public static final String IMAGE_HEIGHT = "image:height";
    public static final String IMAGE_BITS_PER_PIXEL = "image:bitsPerPixel";
    public static final String IMAGE_PROGRESSIVE = "image:progressive";
    public static final String IMAGE_NUMBER_OF_IMAGES = "image:numberOfImages";
    public static final String IMAGE_PHYSICAL_WIDTH_DPI = "image:physicalWidthDpi";
    public static final String IMAGE_PHYSICAL_HEIGHT_DPI = "image:physicalHeightDpi";
    public static final String IMAGE_PHYSICAL_WIDTH_INCHES = "image:physicalWidthInches";
    public static final String IMAGE_PHYSICAL_HEIGHT_INCHES = "image:physicalHeightInches";

    /**
     * {@inheritDoc}
     */
    public void sequence( InputStream stream, SequencerOutput output, 
                          StreamSequencerContext context ) {
        ImageMetadata metadata = new ImageMetadata();
        metadata.setInput(stream);
        metadata.setDetermineImageNumber(true);
        metadata.setCollectComments(true);

        // Process the image stream and extract the metadata ...
        if (!metadata.check()) {
            metadata = null;
        }
        // Generate the output graph if we found useful metadata ...
        if (metadata != null) {
            // Place the image metadata into the output map ...
            output.setProperty(METADATA_NODE, IMAGE_PRIMARY_TYPE, "image:metadata");
            // output.setProperty(METADATA_NODE, IMAGE_MIXINS, "");
            output.setProperty(METADATA_NODE, IMAGE_MIME_TYPE, metadata.getMimeType());
            // output.setProperty(METADATA_NODE, IMAGE_ENCODING, "");
            output.setProperty(METADATA_NODE, IMAGE_FORMAT_NAME, metadata.getFormatName());
            output.setProperty(METADATA_NODE, IMAGE_WIDTH, metadata.getWidth());
            output.setProperty(METADATA_NODE, IMAGE_HEIGHT, metadata.getHeight());
            output.setProperty(METADATA_NODE, IMAGE_BITS_PER_PIXEL, metadata.getBitsPerPixel());
            output.setProperty(METADATA_NODE, IMAGE_PROGRESSIVE, metadata.isProgressive());
            output.setProperty(METADATA_NODE, IMAGE_NUMBER_OF_IMAGES, 
                               metadata.getNumberOfImages());
            output.setProperty(METADATA_NODE, IMAGE_PHYSICAL_WIDTH_DPI,
                               metadata.getPhysicalWidthDpi());
            output.setProperty(METADATA_NODE, IMAGE_PHYSICAL_HEIGHT_DPI,
                               metadata.getPhysicalHeightDpi());
            output.setProperty(METADATA_NODE, IMAGE_PHYSICAL_WIDTH_INCHES,
                               metadata.getPhysicalWidthInch());
            output.setProperty(METADATA_NODE, IMAGE_PHYSICAL_HEIGHT_INCHES,
                               metadata.getPhysicalHeightInch());
        }
    }
}

Notice how the image metadata is extracted and the output graph is generated. A single node is created with the name image:metadata and with the image:metadata node type. No mixins are defined for the node, but several properties are set on the node using the values obtained from the image metadata. After this method returns, the constructed graph will be saved to the repository in all of the places defined by its configuration. (This is why only relative paths are used in the sequencer.)
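The reason for using relative paths can be shown with a small sketch. It assumes, purely for illustration, that a sequencer configuration yields a list of absolute output locations and that each relative node path is appended to every one of them. OutputLocationSketch and resolve are hypothetical names, and how JBoss DNA actually writes the graph is not described here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OutputLocationSketch {

    /** Resolves one relative node path against every configured output location. */
    static List<String> resolve( List<String> outputLocations, String relativeNodePath ) {
        List<String> result = new ArrayList<String>();
        for (String location : outputLocations) {
            // Drop a trailing slash so that the root location "/" also works.
            String base = location.endsWith("/")
                          ? location.substring(0, location.length() - 1)
                          : location;
            result.add(base + "/" + relativeNodePath);
        }
        return result;
    }

    public static void main( String[] args ) {
        List<String> locations = Arrays.asList("/images/caution.gif", "/archive/caution.gif");
        // prints [/images/caution.gif/image:metadata, /archive/caution.gif/image:metadata]
        System.out.println(resolve(locations, "image:metadata"));
    }
}
```

Because the sequencer never sees the absolute locations, the same implementation can serve any number of output paths without modification.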

7.5.2. Testing custom sequencers

The sequencing framework was designed to make testing sequencers much easier. In particular, the StreamSequencer interface does not make use of the JCR API. So instead of requiring a fully-configured JCR repository and JBoss DNA system, unit tests for a sequencer can focus on testing that the content is processed correctly and the desired output graph is generated.

Note

For a complete example of a sequencer unit test, see the ImageMetadataSequencerTest unit test in the org.jboss.dna.sequencer.images package of the dna-sequencer-images project.

The following code fragment shows one way of testing a sequencer, using JUnit 4.4 assertions and some of the classes made available by JBoss DNA. Of course, this example code does not do any error handling and does not make all the assertions a real test would.

StreamSequencer sequencer = new ImageMetadataSequencer();
MockSequencerOutput output = new MockSequencerOutput();
MockSequencerContext context = new MockSequencerContext();
InputStream stream = null;
try {
    stream = this.getClass().getClassLoader().getResource("caution.gif").openStream();
    sequencer.sequence(stream,output,context);   // writes to 'output'
    assertThat(output.getPropertyValues("image:metadata", "jcr:primaryType"), 
               is(new Object[] {"image:metadata"}));
    assertThat(output.getPropertyValues("image:metadata", "jcr:mimeType"), 
               is(new Object[] {"image/gif"}));
    // ... make more assertions here
    assertThat(output.hasReferences(), is(false));
} finally {
    if (stream != null) stream.close();   // the resource may not have been found
}

It's also useful to test that a sequencer produces no output for something it should not understand:

StreamSequencer sequencer = new ImageMetadataSequencer();
MockSequencerOutput output = new MockSequencerOutput();
MockSequencerContext context = new MockSequencerContext();
InputStream stream = null;
try {
    stream = this.getClass().getClassLoader().getResource("caution.pict").openStream();
    sequencer.sequence(stream,output,context);   // writes to 'output'
    assertThat(output.hasProperties(), is(false));
    assertThat(output.hasReferences(), is(false));
} finally {
    if (stream != null) stream.close();   // the resource may not have been found
}

These are just two simple tests that show ways of testing a sequencer. Some tests may get quite involved, especially if a lot of output data is produced.

It may also be useful to create some integration tests that configure JBoss DNA to use a custom sequencer, and to then upload content using the JCR API, verifying that the custom sequencer did run. However, remember that JBoss DNA runs sequencers asynchronously in the background, and you must synchronize your tests to ensure that the sequencers have a chance to run before checking the results.
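One common way to synchronize such a test is to poll for the expected result with a timeout rather than sleeping for a fixed period. The helper below is a generic sketch and not part of JBoss DNA: AwaitSketch and await are hypothetical names, and it assumes Java 8 or later for BooleanSupplier.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

final class AwaitSketch {

    /**
     * Polls the condition until it holds or the timeout elapses.
     *
     * @return true if the condition became true in time, false otherwise
     */
    static boolean await( BooleanSupplier condition, long timeoutMillis, long pollMillis ) {
        long deadline = System.nanoTime() + timeoutMillis * 1000000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt status and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main( String[] args ) {
        // Simulate background work that finishes after a short delay.
        final AtomicBoolean done = new AtomicBoolean(false);
        new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            done.set(true);
        }).start();
        System.out.println(await(done::get, 2000, 10));
    }
}
```

In an integration test, the condition would typically check whether the node that the sequencer is expected to produce exists yet, and the test would fail if await returns false.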

7.6. Summary

In this chapter, we described how JBoss DNA sequences files as they're uploaded into a repository. We've also learned in previous chapters about the JBoss DNA execution contexts, graph model, and connectors. In the next part we'll put all these pieces together to learn how to set up a JBoss DNA repository and access it using the JCR API.