JBoss.org Community Documentation
Creating a custom text extractor involves the following steps:
Create a Maven 3 project for your text extractor;
Implement the TextExtractor interface with your own implementation, and create unit tests to verify the functionality and expected behavior; and
Deploy the JAR file with your implementation (as well as any dependencies), and make them available to ModeShape in your application via ModeShape's configuration as described earlier.
It's that simple.
The first step is to create the Maven 3 project that you can use to compile your code and build the JARs. Maven 3 automates a lot of the work, and since you're already set up to use Maven, using Maven for your project will save you a lot of time and effort. Of course, you don't have to use Maven 3, but then you'll have to get the required libraries and manage the compiling and building process yourself.
ModeShape may provide in the future a Maven archetype for creating text extractor projects. If you'd find this useful and would like to help create it, please join the community.
The modeshape-extractor-tika project is a small, self-contained text extractor implementation that you can use to help you get going. Starting with this project's source and modifying it to suit your needs may be the easiest way to get started. See the Git repository: http://github.com/ModeShape/modeshape//tree/modeshape-2.8.1.Final/extensions/modeshape-extractor-tika/
You can create your Maven project any way you'd like. For examples, see the Maven 3 documentation.
Once you've done that, just add the following dependencies to the dependencies section of your project's pom.xml:
<dependency>
  <groupId>org.modeshape</groupId>
  <artifactId>modeshape-common</artifactId>
  <version>2.9-SNAPSHOT</version>
</dependency>
<dependency>
  <groupId>org.modeshape</groupId>
  <artifactId>modeshape-graph</artifactId>
  <version>2.9-SNAPSHOT</version>
</dependency>
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-api</artifactId>
  <version>1.8.4</version>
</dependency>
These are the minimum dependencies required for compiling a text extractor. Of course, you'll have to add any other dependencies that your extractor needs.
As for testing, you probably will want to add more dependencies, such as those listed here:
<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>4.8</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.hamcrest</groupId>
  <artifactId>hamcrest-library</artifactId>
  <version>1.1</version>
  <scope>test</scope>
</dependency>
<!-- Logging with Log4J -->
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>slf4j-log4j12</artifactId>
  <version>1.8.4</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>log4j</groupId>
  <artifactId>log4j</artifactId>
  <version>1.2.16</version>
  <scope>test</scope>
</dependency>
After you've created the project, simply implement the TextExtractor interface. As mentioned in the JavaDoc,
the "supportsMimeType" method will be called by ModeShape first, and only if your implementation
returns true for a given MIME type will the "extractFrom" method be called. The supplied TextExtractorContext
object provides information about the text being processed, while the TextExtractorOutput is a simple
interface that your extractor uses to record one or more strings containing the extracted text.
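To make the calling sequence concrete, here is a minimal sketch of a plain-text extractor. It declares small stand-ins for TextExtractor, TextExtractorContext and TextExtractorOutput so that it compiles without ModeShape on the classpath. The parameter lists, the recordText method name and the PlainTextExtractor class are assumptions made for illustration, so check the JavaDoc for the real signatures before copying anything.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// ASSUMPTION: local stand-ins so the sketch compiles on its own. The real types
// come from ModeShape; the method name "recordText" and the parameter lists
// shown here are illustrative, not taken from the JavaDoc.
interface TextExtractorOutput {
    void recordText(String text);
}

class TextExtractorContext {
    // The real context carries information about the content; omitted here.
}

interface TextExtractor {
    boolean supportsMimeType(String mimeType);

    void extractFrom(InputStream stream,
                     TextExtractorOutput output,
                     TextExtractorContext context) throws IOException;
}

// Hypothetical extractor: records each non-blank line of a text/plain stream.
class PlainTextExtractor implements TextExtractor {

    @Override
    public boolean supportsMimeType(String mimeType) {
        // Called first; extractFrom is only invoked when this returns true.
        return mimeType != null
            && mimeType.toLowerCase(Locale.ROOT).startsWith("text/plain");
    }

    @Override
    public void extractFrom(InputStream stream,
                            TextExtractorOutput output,
                            TextExtractorContext context) throws IOException {
        // The stream belongs to the caller, so it is read but not closed here.
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                output.recordText(trimmed); // one recorded string per non-blank line
            }
        }
    }
}
```

In a real project you would delete the three stand-in types and import ModeShape's own instead. The sketch also assumes UTF-8 content, which a production extractor would not take for granted.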
Testing should be quite straightforward, as text extractors can simply be instantiated and called by your test methods. Each test can instantiate the TextExtractorContext class (with the correct information) and either mock or implement the TextExtractorOutput interface. Again, see the test cases in the Tika text extractor module for ideas.
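As a rough illustration of the "implement the TextExtractorOutput interface" option, the sketch below uses a hand-written recording output instead of a mocking library. The stand-in types, the recordText method name, and the WholeStreamExtractor and WholeStreamExtractorCheck classes are all hypothetical, declared locally so that the example runs without ModeShape or JUnit. In a real project the extract helper would be the body of a JUnit test method.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// ASSUMPTION: local stand-ins for the ModeShape types; signatures are illustrative.
interface TextExtractorOutput {
    void recordText(String text);
}

class TextExtractorContext {
}

interface TextExtractor {
    boolean supportsMimeType(String mimeType);

    void extractFrom(InputStream stream,
                     TextExtractorOutput output,
                     TextExtractorContext context) throws IOException;
}

// Test double: remembers every string the extractor records, in order.
class RecordingOutput implements TextExtractorOutput {
    final List<String> texts = new ArrayList<>();

    @Override
    public void recordText(String text) {
        texts.add(text);
    }
}

// Hypothetical extractor under test: records the whole stream as one string.
class WholeStreamExtractor implements TextExtractor {
    @Override
    public boolean supportsMimeType(String mimeType) {
        return "text/plain".equals(mimeType);
    }

    @Override
    public void extractFrom(InputStream stream,
                            TextExtractorOutput output,
                            TextExtractorContext context) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int read;
        while ((read = stream.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        output.recordText(new String(buffer.toByteArray(), StandardCharsets.UTF_8));
    }
}

class WholeStreamExtractorCheck {
    // Instantiates the extractor, feeds it in-memory content, and returns
    // what was recorded so the caller can assert on it.
    static List<String> extract(String content) throws IOException {
        TextExtractor extractor = new WholeStreamExtractor();
        RecordingOutput output = new RecordingOutput();
        InputStream stream =
            new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
        extractor.extractFrom(stream, output, new TextExtractorContext());
        return output.texts;
    }
}
```

A recording double like this keeps the assertions on plain lists of strings. A mocking library could verify the same calls, if you prefer that style.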