
Earlier in the Introduction to ModeShape, we briefly described text extractors and how they work. In this section we go into more detail about the framework and describe all the steps for developing your own custom text extractors.

The text extraction framework

A text extractor is a plain old Java object (POJO). To create one, write a Java class that extends a single abstract class, TextExtractor.

The abstract class also contains fields and getters for the name and logger, which ModeShape sets automatically during repository initialization.

There are two abstract methods that must be implemented: supportsMimeType(...) and extractFrom(...). The first is fairly obvious: return true for every MIME type that the extractor is capable of processing. The extractFrom method is the meat of the implementation: it should process the BINARY value's contents and write the searchable text to the supplied Output object.
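To make the two methods concrete, here is a minimal, self-contained sketch of an extractor for plain-text content. It does not use the real ModeShape classes: BinaryValue, Output (with its recordText method), and the two-argument extractFrom signature are simplified stand-ins assumed for illustration, so check the actual TextExtractor class for the exact types and parameters.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical stand-in for the BINARY value; not the real ModeShape/JCR type.
interface BinaryValue {
    InputStream getStream() throws IOException;
}

// Hypothetical stand-in for the Output object that receives the searchable text.
interface Output {
    void recordText(String text);
}

// Simplified stand-in for the abstract base class (assumed shape, for illustration only).
abstract class TextExtractor {
    public abstract boolean supportsMimeType(String mimeType);

    public abstract void extractFrom(BinaryValue binary, Output output) throws Exception;
}

// Example extractor: plain text is already searchable, so it is passed through as-is.
class PlainTextExtractor extends TextExtractor {
    @Override
    public boolean supportsMimeType(String mimeType) {
        // Return true only for the MIME types this extractor can process.
        return mimeType != null && mimeType.toLowerCase(Locale.ROOT).startsWith("text/plain");
    }

    @Override
    public void extractFrom(BinaryValue binary, Output output) throws Exception {
        // Open the stream directly and make sure it is closed when done.
        try (InputStream stream = binary.getStream()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int read;
            while ((read = stream.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            // Assumes UTF-8 content; a real extractor may need charset detection.
            output.recordText(new String(buffer.toByteArray(), StandardCharsets.UTF_8));
        }
    }
}
```

In this sketch extractFrom manages the stream itself with try-with-resources, which is the kind of bookkeeping the processStream(...) utility is meant to take off your hands.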

Note that the processStream(...) method is a utility that extractFrom can call. It properly opens the BINARY value's stream, processes the content, and ensures that the stream is always closed. Your implementation of extractFrom can therefore delegate all stream handling to processStream(...) and supply only the logic that reads the content.
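The delegation described above can be sketched as follows. As before, this is a self-contained illustration rather than the real ModeShape API: BinaryValue, Output, StreamOperation, and the processStream body shown here are stand-ins assumed for the example, and LineTextExtractor is a hypothetical extractor that records each non-empty line of text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-ins; the real ModeShape types differ.
interface BinaryValue {
    InputStream getStream() throws IOException;
}

interface Output {
    void recordText(String text);
}

// Hypothetical callback that receives the already-opened stream.
interface StreamOperation<T> {
    T execute(InputStream stream) throws Exception;
}

// Simplified stand-in for the abstract base class, including a processStream-style utility.
abstract class TextExtractor {
    public abstract boolean supportsMimeType(String mimeType);

    public abstract void extractFrom(BinaryValue binary, Output output) throws Exception;

    // Opens the stream, runs the operation, and always closes the stream afterwards.
    protected final <T> T processStream(BinaryValue binary, StreamOperation<T> operation) throws Exception {
        try (InputStream stream = binary.getStream()) {
            return operation.execute(stream);
        }
    }
}

// Example extractor that leaves all stream handling to processStream.
class LineTextExtractor extends TextExtractor {
    @Override
    public boolean supportsMimeType(String mimeType) {
        return "text/plain".equals(mimeType);
    }

    @Override
    public void extractFrom(BinaryValue binary, Output output) throws Exception {
        processStream(binary, stream -> {
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    output.recordText(line); // record each non-empty line as searchable text
                }
            }
            return null; // nothing to return; the text has already been written to the output
        });
    }
}
```

The extractor body contains only content-reading logic; opening and closing the stream stays in one place, so the stream is closed even if the operation throws.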

This can make your implementation a little easier, but feel free to implement the extractFrom method so that it processes the stream directly.

Creating a new text extractor

Create the Maven module

Create a TextExtractor subclass

Create unit tests

Package your library

Deploy and configure
