JBoss Community Archive (Read Only)

ModeShape 5

Tika text extractor

This text extractor uses the Tika library to extract text from a variety of file formats. It automatically discovers every Tika Parser implementation listed in the META-INF/services/org.apache.tika.parser.Parser text files that are accessible via the current classloader; each of these files contains the class names of Parser implementations, one per line. In other words, simply ensure that the Tika libraries for the appropriate file formats are on the classpath, and the text extractor will be able to use them all.
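
To check which parsers are visible on a given classpath, the service files described above can be read directly. The following is a minimal diagnostic sketch that uses only the JDK. The class name TikaParserDiscovery and its methods are hypothetical and are not part of ModeShape or Tika. It only lists the class names that the service files declare, and does not reproduce how the extractor itself loads them.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper for illustration only; not part of ModeShape or Tika.
final class TikaParserDiscovery {

    // Provider-configuration path for Tika's Parser interface, as named in the text above.
    static final String SERVICE_FILE = "META-INF/services/org.apache.tika.parser.Parser";

    // Extracts class names from the text of one service file: one name per line,
    // skipping blank lines and anything after a '#' comment marker.
    static List<String> parseServiceFile(String content) {
        List<String> names = new ArrayList<>();
        for (String line : content.split("\\R")) {
            int hash = line.indexOf('#');
            String name = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    // Collects the declared parser class names from every matching service file
    // that the given classloader can see.
    static List<String> listParserClassNames(ClassLoader loader) {
        List<String> names = new ArrayList<>();
        try {
            Enumeration<URL> resources = loader.getResources(SERVICE_FILE);
            for (URL url : Collections.list(resources)) {
                try (InputStream in = url.openStream()) {
                    String content = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                    names.addAll(parseServiceFile(content));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints one declared parser class name per line; prints nothing if no Tika JARs are present.
        for (String name : listParserClassNames(ClassLoader.getSystemClassLoader())) {
            System.out.println(name);
        }
    }
}
```

Running this with the same classpath as the repository gives a quick indication of whether the Tika JARs for a particular file format have been picked up.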

This text extractor can be configured in a ModeShape configuration by specifying several optional properties:

To use this extractor, ensure that the modeshape-extractor-tika JAR and the required Tika JARs are on the classpath (for example, by declaring them as Maven dependencies), then configure the repository in a similar fashion to:

{
  "name" : "Sample Config",
  "textExtraction" : {
    "extractors" : {
      "tikaExtractor" : {
        "name" : "Tika content-based extractor",
        "classname" : "tika"
      }
    }
  }
}
In WildFly/EAP, the extractor can also be configured through the ModeShape subsystem:

<subsystem xmlns="urn:jboss:domain:modeshape:2.0">
    <repository name="sample">
        ...
        <text-extractors>
            <text-extractor classname="tika" name="tika-extractor" module="org.modeshape.extractor.tika"/>
            <!-- OR -->
            <text-extractor classname="org.modeshape.extractor.tika.TikaTextExtractor" name="tika-extractor" module="org.modeshape.extractor.tika"/>
        </text-extractors>
    </repository>
</subsystem>
JBoss.org Content Archive (Read Only), exported from JBoss Community Documentation Editor at 2020-03-11 12:13:01 UTC, last content change 2016-04-08 06:42:35 UTC.