org.modeshape.extractor.tika
Class TikaTextExtractor

java.lang.Object
  extended by org.modeshape.extractor.tika.TikaTextExtractor
All Implemented Interfaces:
TextExtractor

public class TikaTextExtractor
extends Object
implements TextExtractor

A TextExtractor that uses the Apache Tika library.

This extractor will automatically discover all of the Tika Parser implementations that are defined in META-INF/services/org.apache.tika.parser.Parser text files accessible via the current classloader and that contain the class names of the Parser implementations (one class name per line in each file).
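
The discovery step described above can be illustrated with plain JDK calls. This is only a sketch of how such service files might be located and read. It is not the extractor's own code, and the class name ParserServiceFiles and its methods are hypothetical. Skipping blank lines and '#' comments follows the java.util.ServiceLoader provider-configuration file convention, and is an assumption about what the files may contain.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of ModeShape or Tika.
class ParserServiceFiles {

    static final String RESOURCE = "META-INF/services/org.apache.tika.parser.Parser";

    /** Returns the class names in one service file: one per line, comments and blanks skipped. */
    static List<String> parseClassNames(String content) {
        List<String> names = new ArrayList<>();
        for (String line : content.split("\\R")) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop a trailing comment
            }
            line = line.trim();
            if (!line.isEmpty()) {
                names.add(line);
            }
        }
        return names;
    }

    /** Collects the class names from every matching service file visible to the classloader. */
    static List<String> discover(ClassLoader loader) {
        List<String> names = new ArrayList<>();
        try {
            Enumeration<URL> files = loader.getResources(RESOURCE);
            while (files.hasMoreElements()) {
                URL file = files.nextElement();
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(file.openStream(), StandardCharsets.UTF_8))) {
                    names.addAll(parseClassNames(reader.lines().collect(Collectors.joining("\n"))));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints nothing when no Tika parser JARs are on the classpath.
        for (String name : discover(ParserServiceFiles.class.getClassLoader())) {
            System.out.println(name);
        }
    }
}
```

Running main with Tika's parser JARs on the classpath should list the Parser class names those JARs declare.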

This text extractor can be configured in a ModeShape configuration by specifying several optional properties:


Field Summary
static Set<String> DEFAULT_EXCLUDED_MIME_TYPES
          The MIME types that are excluded by default.
 
Constructor Summary
TikaTextExtractor()
           
 
Method Summary
 void addExcludedMimeType(String excludedMimeType)
          Add another MIME type that should be excluded.
 void addIncludedMimeType(String includedMimeType)
          Add another MIME type that should be included.
 void excludeMimeType(String mimeType)
          Exclude the MIME type from extraction.
 void extractFrom(InputStream stream, TextExtractorOutput output, TextExtractorContext context)
          Extract the text found in the supplied stream, placing it into the supplied output.
 Set<String> getExcludedMimeTypes()
          Get the MIME types that are to be excluded.
 Set<String> getIncludedMimeTypes()
          Get the MIME types that are explicitly requested to be included.
 void includeMimeType(String mimeType)
          Include the MIME type in extraction.
protected  org.apache.tika.parser.DefaultParser initialize()
          This class lazily initializes the DefaultParser instance.
 void setExcludedMimeTypes(String excludedMimeTypes)
          Set the MIME types that should be excluded.
 void setIncludedMimeTypes(String includedMimeTypes)
          Set the MIME types that should be included.
 boolean supportsMimeType(String mimeType)
          Determine if this extractor is capable of processing content with the supplied MIME type.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_EXCLUDED_MIME_TYPES

public static final Set<String> DEFAULT_EXCLUDED_MIME_TYPES
The MIME types that are excluded by default. Currently, this list consists of:

Constructor Detail

TikaTextExtractor

public TikaTextExtractor()
Method Detail

supportsMimeType

public boolean supportsMimeType(String mimeType)
Determine if this extractor is capable of processing content with the supplied MIME type.

Specified by:
supportsMimeType in interface TextExtractor
Parameters:
mimeType - the MIME type; never null
Returns:
true if this extractor can process content with the supplied MIME type, or false otherwise.
See Also:
TextExtractor.supportsMimeType(java.lang.String)
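
This page does not say how the included and excluded sets combine with the MIME types that the discovered parsers can handle. The sketch below shows one plausible rule and nothing more. The class MimeTypeFilterSketch is hypothetical, and both the "exclusion wins" precedence and the "empty included set means no restriction" rule are assumptions rather than documented behavior.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the decision; not the extractor's actual logic.
class MimeTypeFilterSketch {
    private final Set<String> included = new HashSet<>();
    private final Set<String> excluded = new HashSet<>();
    private final Set<String> parserSupported; // stands in for what the Tika parsers can handle

    MimeTypeFilterSketch(Set<String> parserSupported) {
        this.parserSupported = new HashSet<>(parserSupported);
    }

    void include(String mimeType) {
        included.add(mimeType);
    }

    void exclude(String mimeType) {
        excluded.add(mimeType);
    }

    boolean supports(String mimeType) {
        if (excluded.contains(mimeType)) {
            return false; // assumption: an explicit exclusion always wins
        }
        if (!included.isEmpty() && !included.contains(mimeType)) {
            return false; // assumption: a non-empty included set acts as a whitelist
        }
        return parserSupported.contains(mimeType); // a parser must still exist for the type
    }
}
```

Under these assumptions, a type that no parser handles is rejected even when it has been explicitly included, which matches the note on getIncludedMimeTypes that the included list may not correspond to the types the available parsers can handle.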

extractFrom

public void extractFrom(InputStream stream,
                        TextExtractorOutput output,
                        TextExtractorContext context)
                 throws IOException
Extract the text found in the supplied stream, placing it into the supplied output.

ModeShape's SequencingService determines the sequencers that should be executed by monitoring changes to one or more workspaces. Those changes are aggregated and used to determine which sequencers should be called. If the sequencer implements this interface, then this method is called with the property that is to be sequenced along with the interface used to register the output. The framework takes care of all the rest.

Specified by:
extractFrom in interface TextExtractor
Parameters:
stream - the stream with the data from which text is to be extracted; never null
output - the output that receives the extracted text; never null
context - the context for the extraction operation; never null
Throws:
IOException - if there is a problem reading the stream
See Also:
TextExtractor.extractFrom(java.io.InputStream, org.modeshape.graph.text.TextExtractorOutput, org.modeshape.graph.text.TextExtractorContext)

initialize

protected org.apache.tika.parser.DefaultParser initialize()
This class lazily initializes the DefaultParser instance.

Returns:
the default parser; same as parser
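
Lazy initialization of this kind is commonly written as a guarded "create on first use" accessor. The following is a generic sketch of that pattern and does not claim to show how this class guards its parser field. LazyHolder and the factory passed to it are hypothetical names.

```java
import java.util.function.Supplier;

// Generic lazy holder; a stand-in for a field that is created on first access.
class LazyHolder<T> {
    private final Supplier<T> factory;
    private T instance; // guarded by the monitor of this holder

    LazyHolder(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Creates the instance on the first call and returns the same instance afterwards. */
    synchronized T initialize() {
        if (instance == null) {
            instance = factory.get();
        }
        return instance;
    }
}
```

The synchronized accessor trades a little contention for simplicity. It ensures the factory runs at most once even when several threads call initialize() concurrently, provided the factory never returns null.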

getIncludedMimeTypes

public Set<String> getIncludedMimeTypes()
Get the MIME types that are explicitly requested to be included. This list may not correspond to the MIME types that can be handled via the available Parser implementations.

Returns:
the set of MIME types that are to be included; never null

setIncludedMimeTypes

public void setIncludedMimeTypes(String includedMimeTypes)
Set the MIME types that should be included. This method clears all previously-set included MIME types.

Parameters:
includedMimeTypes - the whitespace-delimited or comma-separated list of MIME types that are to be included

addIncludedMimeType

public void addIncludedMimeType(String includedMimeType)
Add another MIME type that should be included. This method does not clear any included MIME types that were previously set.

Parameters:
includedMimeType - the MIME type that is to be included
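
The contrast between the "set" and "add" methods can be sketched with a small stand-in class. It assumes the setter replaces the previously set included types and splits its argument on commas and whitespace, as the parameter description suggests. IncludedMimeTypesSketch is hypothetical, and treating a null list as empty is an assumption.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in that mirrors the method names on this page.
class IncludedMimeTypesSketch {
    private final Set<String> included = new LinkedHashSet<>();

    /** Replaces the current set with the types in a comma- or whitespace-delimited list. */
    void setIncludedMimeTypes(String includedMimeTypes) {
        included.clear();
        if (includedMimeTypes == null) {
            return; // assumption: null is treated as an empty list
        }
        for (String token : includedMimeTypes.split("[,\\s]+")) {
            if (!token.isEmpty()) { // split yields an empty first token for leading delimiters
                included.add(token);
            }
        }
    }

    /** Adds one type and keeps everything that was set before. */
    void addIncludedMimeType(String includedMimeType) {
        included.add(includedMimeType.trim());
    }

    /** Returns a read-only view; never null. */
    Set<String> getIncludedMimeTypes() {
        return Collections.unmodifiableSet(included);
    }
}
```

With this model, calling setIncludedMimeTypes("text/plain, application/pdf") followed by addIncludedMimeType("text/html") leaves three types in the set, while a second call to setIncludedMimeTypes discards all of them before applying the new list.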

includeMimeType

public void includeMimeType(String mimeType)
Include the MIME type in extraction.

Parameters:
mimeType - MIME type that should be included

getExcludedMimeTypes

public Set<String> getExcludedMimeTypes()
Get the MIME types that are to be excluded.

Returns:
the set of MIME types that are to be excluded; never null

setExcludedMimeTypes

public void setExcludedMimeTypes(String excludedMimeTypes)
Set the MIME types that should be excluded. This method clears all previously-set excluded MIME types.

Parameters:
excludedMimeTypes - the whitespace-delimited or comma-separated list of MIME types that are to be excluded

addExcludedMimeType

public void addExcludedMimeType(String excludedMimeType)
Add another MIME type that should be excluded. This method does not clear any excluded MIME types that were previously set.

Parameters:
excludedMimeType - the MIME type that is to be excluded

excludeMimeType

public void excludeMimeType(String mimeType)
Exclude the MIME type from extraction.

Parameters:
mimeType - MIME type that should be excluded


Copyright © 2008-2012 JBoss, a division of Red Hat. All Rights Reserved.