Tika text extractor - ModeShape 2.8

The TikaTextExtractor class is an implementation of TextExtractor that uses the Tika toolkit from Apache to parse and extract text from a variety of file types, including Microsoft Office, PDF, HTML, plain text, XML, and others.

This sequencer is not enabled by default, but it's very easy to add this text extractor to the ModeShape configuration. To do so in a configuration file, simply add the following fragment under the "<mode:textExtractors>" element (which should be immediately under the "<configuration>" root element):

<mode:textExtractor jcr:name="Tika Text Extractors">
  <mode:description>Text extractors using Tika parsers</mode:description>
  <mode:classname>org.modeshape.extractor.tika.TikaTextExtractor</mode:classname>
  <!--
  A comma- or whitespace-delimited list of MIME types that are to be excluded.
  The following are excluded by default, but the default is completely overridden
  when this property is set. In other words, if you explicitly exclude any MIME types,
  be sure to list all of the MIME types you want to exclude. Exclusions always
  have a higher precedence than inclusions.
  -->
  <mode:excludedMimeTypes>
     application/x-archive,application/x-bzip,application/x-bzip2,
     application/x-cpio,application/x-gtar,application/x-gzip,
     application/x-ta,application/zip,application/vnd.teiid.vdb
  </mode:excludedMimeTypes>
  <!--
  A comma- or whitespace-delimited list of MIME types that are to be included.
  If this is used, then the extractor will include only those MIME types found
  in this list for which there is an available parser (unless the MIME type
  is also excluded). Including explicit MIME types is often easier if text is
  to be extracted for are only a few MIME types.
  -->
  <mode:includedMimeTypes>
     application/msword,application/vnd.oasis.opendocument.text
  </mode:includedMimeTypes>
</mode:textExtractor>

Note that because Tika can process many different MIME types, you can easily specify which MIME types should be included or excluded. It is considered a best practice to specifically include all of the MIME types from which text should be extracted. One reason is that text extraction can be an expensive operation, so you may want to limit it to a specific set of file types. Second, explicitly listing out all of the MIME types is much easier to see and understand. And third, Tika supports a few MIME types without extra libraries, but generally it requires additional dependencies for each type of file, and you probably want to depend on only those libraries that you actually need.

After changing the configuration, be sure to include the necessary libraries. If your application is using Maven, you will need the following dependency:

<dependency>
	<groupId>org.modeshape</groupId>
	<artifactId>modeshape-extractor-tika</artifactId>
	<version>2.8.0.Final</version>
</dependency>

plus the following dependencies based upon the file types:

Dependency	Description of files
<dependency> <groupId>org.apache.commons</groupId> <artifactId>commons-compress</artifactId> <version>1.1</version> </dependency>
Compressed archive formats, such as 'ar', 'cpio', 'tar', 'zip', 'gzip' and 'bzip2'.
<dependency> <groupId>asm</groupId> <artifactId>asm</artifactId> <version>3.1</version> </dependency>
Used for parsing Java files.
<dependency> <groupId>com.drewnoakes</groupId> <artifactId>metadata-extractor</artifactId> <version>2.4.0-beta-1</version> </dependency>
Exif and other image metadata.
<dependency> <groupId>de.l3s.boilerpipe</groupId> <artifactId>boilerpipe</artifactId> <version>1.1.0</version> </dependency>
Boilerpipe HTML templates
<dependency> <groupId>rome</groupId> <artifactId>rome</artifactId> <version>0.9</version> </dependency>
RSS and Atom feeds using the Rome library.
<dependency> <groupId>edu.ucar</groupId> <artifactId>netcdf</artifactId> <version>4.2-min</version> </dependency> <dependency> <groupId>commons-httpclient</groupId> <artifactId>commons-httpclient</artifactId> <version>3.1</version> </dependency>
NetCDF and HDF file formats, which are used within the scientific data community but generally not elsewhere.
<dependency> <groupId>org.apache.james</groupId> <artifactId>apache-mime4j</artifactId> <version>0.6</version> </dependency>
Raw email messages and mbox files typically used within a file-based email system.

Tika third-party dependencies that must be manually included

The following dependencies are automatically included by the Tika text extractor module, but if any are not needed in your application or project may be explicitly excluded without problems.

Dependency	Description of files
<dependency> <groupId>org.apache.poi</groupId> <artifactId>poi</artifactId> <version>${poi.version}</version> </dependency> <dependency> <groupId>org.apache.poi</groupId> <artifactId>poi-scratchpad</artifactId> <version>${poi.version}</version> </dependency> <dependency> <groupId>org.apache.poi</groupId> <artifactId>poi-ooxml</artifactId> <version>${poi.version}</version> <exclusions> <exclusion> <groupId>stax</groupId> <artifactId>stax-api</artifactId> </exclusion> <exclusion> <groupId>xml-apis</groupId> <artifactId>xml-apis</artifactId> </exclusion> </exclusions> </dependency>
Microsoft Office and Open Office file formats
<dependency> <groupId>org.apache.geronimo.specs</groupId> <artifactId>geronimo-stax-api_1.0_spec</artifactId> <version>1.0.1</version> </dependency>
XML files
<dependency> <groupId>org.ccil.cowan.tagsoup</groupId> <artifactId>tagsoup</artifactId> <version>1.2</version> </dependency>
HTML files
<dependency> <groupId>org.apache.pdfbox</groupId> <artifactId>pdfbox</artifactId> <version>1.4.0</version> </dependency> <!-- TIKA-370: PDFBox declares the Bouncy Castle dependencies as optional, but we prefer to have them always to avoid problems with encrypted PDFs. --> <dependency> <groupId>org.bouncycastle</groupId> <artifactId>bcmail-jdk15</artifactId> <version>1.45</version> </dependency> <dependency> <groupId>org.bouncycastle</groupId> <artifactId>bcprov-jdk15</artifactId> <version>1.45</version> </dependency>
PDF files

Tika third-party dependencies (included by default)

If you're not using Maven, the be sure to put onto your classpath all of the JAR files from the Maven modules listed above.