ModeShape Distribution 3.0.0.Beta4

org.modeshape.extractor.tika
Class TikaTextExtractor

java.lang.Object
  extended by org.modeshape.jcr.api.text.TextExtractor
      extended by org.modeshape.extractor.tika.TikaTextExtractor

public class TikaTextExtractor
extends TextExtractor

A TextExtractor that uses the Apache Tika library.

This extractor automatically discovers all Tika Parser implementations listed in META-INF/services/org.apache.tika.parser.Parser text files that are accessible via the current classloader. Each such file contains the class names of Parser implementations, one class name per line.
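
The discovery convention described above can be illustrated with plain JDK calls. The following is only a minimal sketch, not the extractor's or Tika's actual lookup code: the class name ParserServiceFiles and its methods are hypothetical, and treating '#' as a comment marker follows the general java.util.ServiceLoader file format rather than anything stated on this page.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
public class ParserServiceFiles {
    // Resource path named in the class description above.
    public static final String RESOURCE = "META-INF/services/org.apache.tika.parser.Parser";

    /** Extracts class names from the lines of one service file (one name per line). */
    public static List<String> parseClassNames(Iterable<String> lines) {
        List<String> names = new ArrayList<>();
        for (String line : lines) {
            int hash = line.indexOf('#');          // assumption: '#' starts a comment
            if (hash >= 0) line = line.substring(0, hash);
            line = line.trim();
            if (!line.isEmpty()) names.add(line);  // skip blank lines
        }
        return names;
    }

    /** Collects the class names from every matching service file visible to the loader. */
    public static List<String> discover(ClassLoader loader) throws IOException {
        List<String> names = new ArrayList<>();
        Enumeration<URL> urls = loader.getResources(RESOURCE);
        while (urls.hasMoreElements()) {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(urls.nextElement().openStream(), StandardCharsets.UTF_8))) {
                names.addAll(parseClassNames(reader.lines().collect(Collectors.toList())));
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Prints an empty list when no Tika parser JARs are on the classpath.
        System.out.println(discover(ParserServiceFiles.class.getClassLoader()));
    }
}
```

Running the sketch with Tika parser JARs on the classpath should list the class names those JARs register; without them the list is simply empty.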

This text extractor can be configured in a ModeShape configuration by specifying optional properties, such as the included and excluded MIME types exposed through the setter methods documented on this page.


Nested Class Summary
 
Nested classes/interfaces inherited from class org.modeshape.jcr.api.text.TextExtractor
TextExtractor.BinaryOperation<T>, TextExtractor.Context, TextExtractor.Output
 
Field Summary
static Set<String> DEFAULT_EXCLUDED_MIME_TYPES
          The MIME types that are excluded by default.
protected static Logger LOGGER
           
 
Constructor Summary
TikaTextExtractor()
           
 
Method Summary
 void extractFrom(Binary binary, TextExtractor.Output output, TextExtractor.Context context)
          Extract text from the given Binary, using the given output to record the results.
 Set<String> getExcludedMimeTypes()
          Get the MIME types that should be excluded.
 Set<String> getIncludedMimeTypes()
          Get the MIME types that are explicitly requested to be included.
protected  org.apache.tika.parser.DefaultParser initialize()
          Lazily initializes the DefaultParser instance.
protected  org.apache.tika.metadata.Metadata prepareMetadata(Binary binary)
          Creates a new Tika Metadata object used by the parser.
 void setExcludedMimeTypes(Collection<String> excludedMimeTypes)
           
 void setExcludedMimeTypes(String excludedMimeTypes)
          Set the MIME types that should be excluded.
 void setIncludedMimeTypes(Collection<String> includedMimeTypes)
           
 void setIncludedMimeTypes(String includedMimeTypes)
          Set the MIME types that should be included.
 boolean supportsMimeType(String mimeType)
          Determine if this extractor is capable of processing content with the supplied MIME type.
 
Methods inherited from class org.modeshape.jcr.api.text.TextExtractor
getLogger, getName, processStream, setLogger, setName
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOGGER

protected static final Logger LOGGER

DEFAULT_EXCLUDED_MIME_TYPES

public static final Set<String> DEFAULT_EXCLUDED_MIME_TYPES
The MIME types that are excluded by default. Currently, this list consists of:

Constructor Detail

TikaTextExtractor

public TikaTextExtractor()
Method Detail

supportsMimeType

public boolean supportsMimeType(String mimeType)
Determine if this extractor is capable of processing content with the supplied MIME type.

Specified by:
supportsMimeType in class TextExtractor
Parameters:
mimeType - the MIME type; never null
Returns:
true if this extractor can process content with the supplied MIME type, or false otherwise.
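
How the included and excluded sets might combine into such a decision can be sketched as follows. This is an illustration under stated assumptions, not the extractor's actual logic: it assumes that exclusions take precedence and that an empty include set means "no restriction", and it does not model whether a Tika Parser is actually available for the type. The class MimeTypeFilter is hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical filter, for illustration only.
public class MimeTypeFilter {
    private final Set<String> included;
    private final Set<String> excluded;

    public MimeTypeFilter(Set<String> included, Set<String> excluded) {
        this.included = new HashSet<>(included);
        this.excluded = new HashSet<>(excluded);
    }

    /** Returns true if the MIME type passes the filter (mimeType is assumed non-null). */
    public boolean accepts(String mimeType) {
        if (excluded.contains(mimeType)) {
            return false; // assumption: an exclusion always wins
        }
        // assumption: an empty include set places no restriction on the type
        return included.isEmpty() || included.contains(mimeType);
    }

    public static void main(String[] args) {
        MimeTypeFilter filter = new MimeTypeFilter(
                Collections.<String>emptySet(),
                new HashSet<>(Arrays.asList("application/zip")));
        System.out.println(filter.accepts("application/pdf"));
        System.out.println(filter.accepts("application/zip"));
    }
}
```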

extractFrom

public void extractFrom(Binary binary,
                        TextExtractor.Output output,
                        TextExtractor.Context context)
                 throws Exception
Extract text from the given Binary, using the given output to record the results.

Specified by:
extractFrom in class TextExtractor
Parameters:
binary - the binary value that can be used in the extraction process; never null
output - the output that records the results of the extraction operation; never null
context - the context for the extraction operation; never null
Throws:
Exception - if there is a problem during the extraction process

prepareMetadata

protected final org.apache.tika.metadata.Metadata prepareMetadata(Binary binary)
                                                           throws IOException,
                                                                  RepositoryException
Creates a new Tika Metadata object used by the parser. The metadata contains the MIME type of the content being parsed, if it is available to the underlying context. If it is not, Tika's autodetection mechanism is used to try to determine the MIME type.

Parameters:
binary - an org.modeshape.jcr.api.Binary instance of the content being parsed
Returns:
a Metadata instance.
Throws:
IOException - if auto-detecting the MIME type via Tika fails
RepositoryException - if an error occurs while obtaining the MIME type of the binary parameter
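
The "use the known MIME type, otherwise detect it" order described for this method can be sketched without any Tika or ModeShape dependency. Everything here is hypothetical: the class MimeTypeResolver, its methods, and the toy sniff detector (which only recognizes a PDF header) stand in for the real Binary and Tika detection calls, which are not reproduced.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical helper, for illustration only.
public class MimeTypeResolver {
    /** Prefers the declared MIME type; falls back to the detector when none is declared. */
    public static String resolve(String declared, byte[] content, Function<byte[], String> detector) {
        if (declared != null && !declared.trim().isEmpty()) {
            return declared.trim();       // a known type is used as-is
        }
        return detector.apply(content);   // otherwise detect from the content
    }

    /** Toy stand-in for a real detector: recognizes only the "%PDF" magic bytes. */
    public static String sniff(byte[] content) {
        if (content != null && content.length >= 4
                && content[0] == '%' && content[1] == 'P'
                && content[2] == 'D' && content[3] == 'F') {
            return "application/pdf";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(resolve("text/plain", pdf, MimeTypeResolver::sniff));
        System.out.println(resolve(null, pdf, MimeTypeResolver::sniff));
    }
}
```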

initialize

protected org.apache.tika.parser.DefaultParser initialize()
Lazily initializes the DefaultParser instance.

Returns:
the default parser; same as parser
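
Lazy initialization in general can be sketched as below. This page does not say how initialize() guards against concurrent callers, so the double-checked locking shown here is one common, thread-safe way to build an expensive object on first use, not a description of this class. LazyHolder is a hypothetical name.

```java
import java.util.function.Supplier;

// Hypothetical holder, for illustration only.
public class LazyHolder<T> {
    private final Supplier<T> factory;
    private volatile T instance; // volatile so a fully built instance is visible to all threads

    public LazyHolder(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Creates the instance on first call and returns the same instance afterwards. */
    public T get() {
        T result = instance;
        if (result == null) {
            synchronized (this) {
                result = instance;          // re-check under the lock
                if (result == null) {
                    result = factory.get(); // the factory is assumed to return non-null
                    instance = result;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        LazyHolder<StringBuilder> holder = new LazyHolder<>(StringBuilder::new);
        System.out.println(holder.get() == holder.get());
    }
}
```

The trade-off is that construction cost is paid by the first caller rather than at startup, which suits objects that may never be needed.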

getIncludedMimeTypes

public Set<String> getIncludedMimeTypes()
Get the MIME types that are explicitly requested to be included. This list may not correspond to the MIME types that can be handled via the available Parser implementations.

Returns:
the set of MIME types that are to be included; never null

setIncludedMimeTypes

public void setIncludedMimeTypes(String includedMimeTypes)
Set the MIME types that should be included. This method clears all previously-set included MIME types.

Parameters:
includedMimeTypes - the whitespace-delimited or comma-separated list of MIME types that are to be included
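
A whitespace-delimited or comma-separated list like the one this parameter accepts can be split with a single regular expression. The sketch below is an assumption about how such a string could be tokenized, not the extractor's own parsing code; MimeTypeListParser is a hypothetical name, and returning an empty set for null input is a choice made for the example.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, for illustration only.
public class MimeTypeListParser {
    /** Splits on runs of commas and/or whitespace; null or blank input gives an empty set. */
    public static Set<String> parse(String mimeTypes) {
        Set<String> result = new LinkedHashSet<>(); // keeps the order in which types were listed
        if (mimeTypes == null) {
            return result;
        }
        for (String token : mimeTypes.split("[,\\s]+")) {
            if (!token.isEmpty()) { // a leading delimiter yields one empty token
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("application/pdf, text/plain\tapplication/msword"));
    }
}
```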

setIncludedMimeTypes

public void setIncludedMimeTypes(Collection<String> includedMimeTypes)

getExcludedMimeTypes

public Set<String> getExcludedMimeTypes()
Get the MIME types that should be excluded.

Returns:
the set of MIME types that are to be excluded; never null

setExcludedMimeTypes

public void setExcludedMimeTypes(String excludedMimeTypes)
Set the MIME types that should be excluded. This method clears all previously-set excluded MIME types.

Parameters:
excludedMimeTypes - the whitespace-delimited or comma-separated list of MIME types that are to be excluded

setExcludedMimeTypes

public void setExcludedMimeTypes(Collection<String> excludedMimeTypes)

Copyright © 2008-2012 JBoss, a division of Red Hat. All Rights Reserved.