JBoss Community Archive (Read Only)

ModeShape 3

Custom text extractors

Earlier in the Introduction to ModeShape, we briefly described text extractors and how they work. In this section we go into more detail about the framework and describe all the steps for developing your own custom text extractors.

The text extraction framework

A text extractor is just a plain old Java object (POJO). To create one, write a Java class that extends a single abstract class, TextExtractor:

package org.modeshape.jcr.api.text;

import java.io.IOException;
import java.io.InputStream;
import javax.jcr.RepositoryException;
import org.modeshape.jcr.api.Binary;

public abstract class TextExtractor {

    /**
     * Determine if this extractor is capable of processing content with the supplied MIME type.
     * @param mimeType the MIME type; never null
     * @return true if this extractor can process content with the supplied MIME type, or false otherwise.
     */
    public abstract boolean supportsMimeType( String mimeType );

    /**
     * Extract text from the given {@link Binary}, using the given output to record the results.
     * @param binary the binary value that can be used in the extraction process; never <code>null</code>
     * @param output the output from the extraction operation; never <code>null</code>
     * @param context the context for the extraction operation; never <code>null</code>
     * @throws Exception if there is a problem during the extraction process
     */
    public abstract void extractFrom( Binary binary,
                                      TextExtractor.Output output,
                                      Context context ) throws Exception;

    /**
     * Allows subclasses to process the stream of binary value property in "safe" fashion, making sure the stream is closed at the
     * end of the operation.
     * @param binary a {@link org.modeshape.jcr.api.Binary} that is expected to contain a non-null binary value.
     * @param operation a {@link org.modeshape.jcr.api.text.TextExtractor.BinaryOperation} which should work with the stream
     * @param <T> the return type of the binary operation
     * @return whatever type of result the stream operation returns
     * @throws Exception if there is an error processing the stream
     */
    protected final <T> T processStream( Binary binary,
                                         BinaryOperation<T> operation ) throws Exception {
        // ... (method body not shown in this listing)
    }

    /**
     * Interface which can be used by subclasses to process the input stream of a binary property.
     * @param <T> the return type of the binary operation
     */
    protected interface BinaryOperation<T> {
        T execute( InputStream stream ) throws Exception;
    }

    /**
     * Interface which provides additional information to the text extractors, during the extraction operation.
     */
    public interface Context {
        String mimeTypeOf( String name,
                           Binary binaryValue ) throws RepositoryException, IOException;
    }

    /**
     * The interface passed to a TextExtractor to which the extractor should record all text content.
     */
    public interface Output {
        /**
         * Record the text as being extracted. This method can be called multiple times during a single extract.
         * @param text the text extracted from the content.
         */
        void recordText( String text );
    }
}

The abstract class also contains fields and getters (not shown above) for the name and logger that are automatically set by ModeShape during repository initialization.

There are two abstract methods that must be implemented: supportsMimeType(...) and extractFrom(...). The first is straightforward: return true for every MIME type that the extractor can process. The extractFrom method is the core of the implementation: it should process the BINARY value's contents and write the searchable text to the supplied Output object.
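As a rough illustration of these two methods, the sketch below implements an extractor for plain text. So that it can run without ModeShape on the classpath, it declares small stand-in types (Binary, Output, Context, TextExtractorSketch) that only mirror the signatures listed above; a real extractor would extend org.modeshape.jcr.api.text.TextExtractor instead. The names PlainTextExtractor and PlainTextDemo, the supported MIME types, and the UTF-8 encoding are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Stand-ins mirroring the signatures in the listing above (assumptions, not the ModeShape types).
interface Binary {
    InputStream getStream() throws Exception;
}

interface Output {
    void recordText( String text );
}

interface Context {
    String mimeTypeOf( String name, Binary binaryValue ) throws Exception;
}

abstract class TextExtractorSketch {
    public abstract boolean supportsMimeType( String mimeType );

    public abstract void extractFrom( Binary binary, Output output, Context context ) throws Exception;
}

// Hypothetical extractor that treats the binary content as UTF-8 text.
class PlainTextExtractor extends TextExtractorSketch {
    private static final Set<String> SUPPORTED = new HashSet<String>(Arrays.asList("text/plain", "text/csv"));

    @Override
    public boolean supportsMimeType( String mimeType ) {
        return SUPPORTED.contains(mimeType);
    }

    @Override
    public void extractFrom( Binary binary, Output output, Context context ) throws Exception {
        InputStream stream = binary.getStream();
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                output.recordText(line); // recordText may be called many times in one extraction
            }
        } finally {
            stream.close();
        }
    }
}

// Small driver: runs the extractor over in-memory bytes and joins the recorded fragments with spaces.
class PlainTextDemo {
    static String extract( final byte[] content ) {
        final StringBuilder collected = new StringBuilder();
        Binary binary = new Binary() {
            public InputStream getStream() {
                return new ByteArrayInputStream(content);
            }
        };
        Output output = new Output() {
            public void recordText( String text ) {
                if (collected.length() > 0) collected.append(' ');
                collected.append(text);
            }
        };
        Context context = new Context() {
            public String mimeTypeOf( String name, Binary binaryValue ) {
                return "text/plain"; // fixed answer, enough for this sketch
            }
        };
        try {
            new PlainTextExtractor().extractFrom(binary, output, context);
        } catch (Exception e) {
            throw new IllegalStateException("extraction failed", e);
        }
        return collected.toString();
    }
}
```

Recording one line at a time is only one option; an extractor could equally record a single large string, since Output accepts any number of recordText calls per extraction.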

Note that processStream(...) is a utility method that extractFrom can call: it opens the BINARY value's stream, processes the content, and ensures that the stream is always closed. Your implementation can therefore implement the extractFrom method as follows:

public void extractFrom( final Binary binary,
                         final TextExtractor.Output output,
                         final Context context ) throws Exception {
    processStream(binary, new BinaryOperation<Object>() {
        public Object execute( InputStream stream ) throws Exception {
            // Custom logic to read the stream and write to 'output'
            return null;
        }
    });
}

This can make your implementation a little easier, but feel free to implement the extractFrom method so that it processes the stream directly.
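To make the two options concrete, the following sketch contrasts a helper in the style of processStream(...) with handling the stream directly. It is not ModeShape's source: the helper body is an assumption based only on the behaviour described above (run the operation, then always close the stream), and StreamOperation, TrackingStream and StreamHandlingSketch are hypothetical names used so the example compiles on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Stand-in for the nested BinaryOperation interface in the listing above.
interface StreamOperation<T> {
    T execute( InputStream stream ) throws Exception;
}

// Helper stream that remembers whether close() was called, so the behaviour can be observed.
class TrackingStream extends ByteArrayInputStream {
    boolean closed = false;

    TrackingStream( byte[] data ) {
        super(data);
    }

    @Override
    public void close() throws IOException {
        closed = true;
        super.close();
    }
}

class StreamHandlingSketch {

    // Assumed shape of a processStream-style helper: run the operation, then always close the stream.
    static <T> T processStream( InputStream stream, StreamOperation<T> operation ) throws Exception {
        try {
            return operation.execute(stream);
        } finally {
            stream.close();
        }
    }

    // The direct alternative: the caller manages the stream itself with try-with-resources (Java 7+).
    static int countBytesDirectly( InputStream source ) throws IOException {
        try (InputStream stream = source) {
            int count = 0;
            while (stream.read() != -1) {
                count++;
            }
            return count;
        }
    }
}
```

Either way, the stream is closed even when the processing code throws; the helper simply keeps that bookkeeping out of the extractor's own logic.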

Creating a new text extractor

Create the Maven module

Create a TextExtractor subclass

Create unit tests

Package your library

Deploy and configure

JBoss.org Content Archive (Read Only), exported from JBoss Community Documentation Editor at 2020-03-11 12:06:57 UTC, last content change 2012-11-16 17:09:44 UTC.