Chapter 38. Writing custom text extractors

Creating a custom text extractor involves the following steps:

Create a Maven 3 project for your text extractor;
Implement the TextExtractor interface, and create unit tests to verify the functionality and expected behavior; and
Deploy the JAR file with your implementation (as well as any dependencies), and make it available to ModeShape in your application via ModeShape's configuration as described earlier.

It's that simple.

The first step is to create the Maven 3 project that you can use to compile your code and build the JARs. Maven 3 automates a lot of the work, and since you're already set up to use Maven, using Maven for your project will save you a lot of time and effort. Of course, you don't have to use Maven 3, but then you'll have to get the required libraries and manage the compiling and building process yourself.

Note

ModeShape may in the future provide a Maven archetype for creating text extractor projects. If you'd find this useful and would like to help create it, please join the community.

Note

The modeshape-extractor-tika project is a small, self-contained text extractor implementation that you can use to help you get going. Starting with this project's source and modifying it to suit your needs may be the easiest way to get started. See the source on GitHub: http://github.com/ModeShape/modeshape//tree/modeshape-2.8.1.Final/extensions/modeshape-extractor-tika/

You can create your Maven project any way you'd like. For examples, see the Maven 3 documentation. Once you've done that, add the following dependencies to the dependencies section of your project's pom.xml:


<dependency>
  <groupId>org.modeshape</groupId>
  <artifactId>modeshape-common</artifactId>
  <version>2.9-SNAPSHOT</version>
</dependency>
<dependency>
  <groupId>org.modeshape</groupId>
  <artifactId>modeshape-graph</artifactId>
  <version>2.9-SNAPSHOT</version>
</dependency>
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-api</artifactId>
  <version>1.8.4</version>
</dependency>

These are the minimum dependencies required for compiling a text extractor. Of course, you'll have to add any other dependencies that your text extractor needs.

As for testing, you probably will want to add more dependencies, such as those listed here:


<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>4.8</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.hamcrest</groupId>
  <artifactId>hamcrest-library</artifactId>
  <version>1.1</version>
  <scope>test</scope>
</dependency>
<!-- Logging with Log4J -->
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-log4j12</artifactId>
  <version>1.8.4</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>log4j</groupId>
  <artifactId>log4j</artifactId>
  <version>1.2.16</version>
  <scope>test</scope>
</dependency>

After you've created the project, simply implement the TextExtractor interface. As mentioned in the JavaDoc, the "supportsMimeType" method will be called by ModeShape first, and only if your implementation returns true for a given MIME type will the "extractFrom" method be called. The supplied TextExtractorContext object provides information about the text being processed, while the TextExtractorOutput is a simple interface that your extractor uses to record one or more strings containing the extracted text.
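
The call order described above can be illustrated with a minimal, self-contained sketch. It does not compile against ModeShape itself: the nested TextExtractor, TextExtractorOutput and TextExtractorContext types are hypothetical stand-ins whose method signatures (including recordText and getMimeType) are assumptions made for illustration, so check the JavaDoc in modeshape-graph for the real ones. PlainTextExtractor is likewise an invented example class.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class PlainTextExtractorSketch {

    // Hypothetical stand-ins for the ModeShape types named in the text.
    // The real types ship in modeshape-graph and their signatures may differ.
    interface TextExtractorOutput {
        void recordText(String text);
    }

    static final class TextExtractorContext {
        private final String mimeType;

        TextExtractorContext(String mimeType) {
            this.mimeType = mimeType;
        }

        String getMimeType() {
            return mimeType;
        }
    }

    interface TextExtractor {
        boolean supportsMimeType(String mimeType);

        void extractFrom(InputStream stream, TextExtractorOutput output, TextExtractorContext context)
                throws IOException;
    }

    // Example extractor: accepts text/plain content and records it unchanged.
    static final class PlainTextExtractor implements TextExtractor {
        @Override
        public boolean supportsMimeType(String mimeType) {
            return mimeType != null && mimeType.toLowerCase(Locale.ROOT).startsWith("text/plain");
        }

        @Override
        public void extractFrom(InputStream stream, TextExtractorOutput output, TextExtractorContext context)
                throws IOException {
            // Assumes UTF-8 content; a real extractor would detect the encoding.
            Reader reader = new InputStreamReader(stream, StandardCharsets.UTF_8);
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[1024];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
            if (text.length() > 0) {
                output.recordText(text.toString());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        TextExtractor extractor = new PlainTextExtractor();
        TextExtractorContext context = new TextExtractorContext("text/plain");
        // Mirror the described call order: check the MIME type, then extract.
        if (extractor.supportsMimeType(context.getMimeType())) {
            InputStream stream = new ByteArrayInputStream("hello extractor".getBytes(StandardCharsets.UTF_8));
            extractor.extractFrom(stream, text -> System.out.println("extracted: " + text), context);
        }
    }
}
```

Running the sketch prints "extracted: hello extractor". In a real project you would delete the stand-in types and implement the interfaces supplied by ModeShape instead.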

Testing should be quite straightforward, because your test methods can instantiate a text extractor and call it directly. Each test can create a TextExtractorContext instance (with the correct information) and either mock or implement the TextExtractorOutput interface. Again, see the test cases in the Tika text extractor module for ideas.
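
One way to implement TextExtractorOutput for tests, rather than mocking it, is a small recording class that collects every string the extractor emits. The sketch below is an illustration under assumptions: the nested ModeShape-like types are hypothetical stand-ins with assumed signatures, RecordingOutput and LineExtractor are invented for the example, and plain checks are used in place of JUnit so that the file runs on its own.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ExtractorTestSketch {

    // Hypothetical stand-ins; the real types live in modeshape-graph.
    interface TextExtractorOutput {
        void recordText(String text);
    }

    static final class TextExtractorContext {
        final String mimeType;

        TextExtractorContext(String mimeType) {
            this.mimeType = mimeType;
        }
    }

    interface TextExtractor {
        boolean supportsMimeType(String mimeType);

        void extractFrom(InputStream stream, TextExtractorOutput output, TextExtractorContext context)
                throws IOException;
    }

    // Test double: remembers each recorded string, in order.
    static final class RecordingOutput implements TextExtractorOutput {
        final List<String> recorded = new ArrayList<>();

        @Override
        public void recordText(String text) {
            recorded.add(text);
        }
    }

    // Invented extractor under test: records each non-blank line separately.
    static final class LineExtractor implements TextExtractor {
        @Override
        public boolean supportsMimeType(String mimeType) {
            return "text/plain".equals(mimeType);
        }

        @Override
        public void extractFrom(InputStream stream, TextExtractorOutput output, TextExtractorContext context)
                throws IOException {
            BufferedReader reader = new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    output.recordText(trimmed);
                }
            }
        }
    }

    // Minimal replacement for a JUnit assertion.
    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) throws IOException {
        TextExtractor extractor = new LineExtractor();
        check(extractor.supportsMimeType("text/plain"), "should support text/plain");
        check(!extractor.supportsMimeType("image/png"), "should reject image/png");

        RecordingOutput output = new RecordingOutput();
        InputStream stream = new ByteArrayInputStream("first\n\n  second \n".getBytes(StandardCharsets.UTF_8));
        extractor.extractFrom(stream, output, new TextExtractorContext("text/plain"));
        check(output.recorded.equals(Arrays.asList("first", "second")), "should record two lines");
        System.out.println("all checks passed");
    }
}
```

In a JUnit test the body of main would be split into separate test methods, with the check calls replaced by JUnit or Hamcrest assertions.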