
The TikaTextExtractor class is an implementation of TextExtractor that uses the Tika toolkit from Apache to parse and extract text from a variety of file types, including Microsoft Office, PDF, HTML, plain text, XML, and others.

This text extractor is not enabled by default, but it is easy to add to the ModeShape configuration. To do so in a configuration file, add a fragment for the extractor under the "<mode:textExtractors>" element (which should be immediately under the "<configuration>" root element).
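
As a rough sketch, such a fragment might look like the one below. It assumes the ModeShape 2.x configuration format and that the extractor class lives in the org.modeshape.extractor.tika package. The "jcr:name" value, the description, the "mode:includedMimeTypes" property name, and the MIME types listed are illustrative assumptions, so check them against your ModeShape version before relying on them.

```xml
<mode:textExtractors>
    <!-- The jcr:name and description are arbitrary labels chosen for this sketch. -->
    <mode:textExtractor jcr:name="Tika Text Extractors">
        <mode:description>Text extractors using Tika parsers</mode:description>
        <!-- Assumed fully-qualified class name of the Tika-based extractor. -->
        <mode:classname>org.modeshape.extractor.tika.TikaTextExtractor</mode:classname>
        <!-- Assumed property: comma-separated MIME types to extract text from. -->
        <mode:includedMimeTypes>application/pdf,text/html,text/plain</mode:includedMimeTypes>
    </mode:textExtractor>
</mode:textExtractors>
```

Listing only the included MIME types keeps the set of file types that trigger extraction visible in one place.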

Note that because Tika can process many different MIME types, you can specify which MIME types should be included or excluded. It is considered a best practice to explicitly include every MIME type from which text should be extracted, for three reasons. First, text extraction can be an expensive operation, so you may want to limit it to a specific set of file types. Second, an explicit list of MIME types makes it much easier to see and understand what will be processed. Third, Tika supports a few MIME types without extra libraries, but it generally requires additional dependencies for each type of file, and you probably want to depend on only those libraries that you actually need.

After changing the configuration, be sure to include the necessary libraries. If your application is using Maven, you will need the following dependency:
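
A sketch of that dependency is shown below. It assumes the extractor is published under the "org.modeshape" group as "modeshape-extractor-tika"; the "${modeshape.version}" property is a placeholder to be defined in your own POM, not a value taken from this page.

```xml
<dependency>
    <!-- Assumed coordinates of the Tika text extractor module. -->
    <groupId>org.modeshape</groupId>
    <artifactId>modeshape-extractor-tika</artifactId>
    <!-- Placeholder: set this property to the ModeShape version you use. -->
    <version>${modeshape.version}</version>
</dependency>
```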

plus the following dependencies based upon the file types:

Dependency Description of files
Compressed archive formats, such as 'ar', 'cpio', 'tar', 'zip', 'gzip' and 'bzip2'.
Used for parsing Java class files.
Exif and other image metadata.
Boilerpipe HTML templates
RSS and Atom feeds using the Rome library.
NetCDF and HDF file formats, which are used within the scientific data community but generally not elsewhere.
Raw email messages and mbox files typically used within a file-based email system.

Tika third-party dependencies that must be manually included

The following dependencies are automatically included by the Tika text extractor module, but any that are not needed in your application or project may be explicitly excluded without problems.

Dependency Description of files
Microsoft Office and Open Office file formats
XML files
HTML files
PDF files

Tika third-party dependencies (included by default)
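
With Maven, a default dependency can be dropped by using an exclusion on the extractor module. The sketch below excludes the PDF support. It assumes the extractor's coordinates are "org.modeshape:modeshape-extractor-tika" and that PDF parsing is provided by the "org.apache.pdfbox:pdfbox" artifact; verify both against your dependency tree (for example with "mvn dependency:tree") before using it.

```xml
<dependency>
    <!-- Assumed coordinates of the Tika text extractor module. -->
    <groupId>org.modeshape</groupId>
    <artifactId>modeshape-extractor-tika</artifactId>
    <!-- Placeholder: set this property to the ModeShape version you use. -->
    <version>${modeshape.version}</version>
    <exclusions>
        <!-- Assumed artifact that provides PDF parsing; excluded because
             this hypothetical application never extracts text from PDFs. -->
        <exclusion>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```

If you exclude a library this way, also remove the corresponding MIME types from the extractor's configuration so that it is never asked to process files it can no longer parse.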

If you're not using Maven, then be sure to put all of the JAR files from the Maven modules listed above onto your classpath.
