TikaTextExtractor (ModeShape Distribution Library Reference (5.0.0.Final))

java.lang.Object
- org.modeshape.jcr.api.text.TextExtractor
- - org.modeshape.extractor.tika.TikaTextExtractor

```
public class TikaTextExtractor
extends TextExtractor
```
A TextExtractor that uses the Apache Tika library.
This extractor automatically discovers every Tika Parser implementation listed in a META-INF/services/org.apache.tika.parser.Parser text file that is accessible via the current classloader. Each such file contains the class names of Parser implementations, one class name per line.
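
The file format described above follows the standard Java provider-configuration convention: one class name per line, with blank lines and `#` comments ignored. The following is only a rough sketch of reading such content. `ServiceFileSketch`, `parseProviderNames`, and the `com.example` class names are hypothetical and are not part of ModeShape or Tika, whose actual loading code may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of ModeShape or Tika. */
final class ServiceFileSketch {

    /** Returns the class names listed in the content of a provider-configuration file. */
    static List<String> parseProviderNames(String fileContent) {
        List<String> names = new ArrayList<>();
        for (String line : fileContent.split("\\R")) {
            // Everything after '#' is a comment.
            int comment = line.indexOf('#');
            if (comment >= 0) {
                line = line.substring(0, comment);
            }
            line = line.trim();
            // Skip blank lines and names that were already listed.
            if (!line.isEmpty() && !names.contains(line)) {
                names.add(line);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String content = "# example parsers (made-up names)\n"
            + "com.example.FirstParser\n"
            + "\n"
            + "com.example.SecondParser # trailing comment\n";
        System.out.println(parseProviderNames(content));
    }
}
```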

This text extractor can be configured in a ModeShape configuration by specifying several optional properties:
- excludedMimeTypes - The comma- or whitespace-separated list of MIME types that should be excluded from text extraction, even if a Tika Parser is available for that MIME type. By default, the MIME types for package files are excluded; explicitly setting any excluded MIME types overrides these defaults.
- includedMimeTypes - The comma- or whitespace-separated list of MIME types that should be included in text extraction. This extractor will ignore any MIME types in this list that are not covered by Tika Parser implementations.
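
Both properties take a comma- or whitespace-separated list. As an illustration of that value format only, here is a minimal sketch that splits such a string into individual MIME types. `MimeTypeListSketch` and `parse` are hypothetical names, and lower-casing the tokens is an assumption of this sketch; the extractor's own parsing may behave differently.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration; not part of ModeShape. */
final class MimeTypeListSketch {

    /** Splits a comma- or whitespace-separated property value into MIME type strings. */
    static Set<String> parse(String value) {
        Set<String> result = new LinkedHashSet<>();
        if (value == null) {
            return result;
        }
        // Any run of commas and/or whitespace acts as one separator.
        for (String token : value.trim().split("[,\\s]+")) {
            if (!token.isEmpty()) {
                // Assumption: MIME types are compared case-insensitively, so normalize here.
                result.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("application/pdf, text/plain\ttext/html"));
    }
}
```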

Nested Class Summary
- Nested classes/interfaces inherited from class org.modeshape.jcr.api.text.TextExtractor
  TextExtractor.BinaryOperation<T>, TextExtractor.Context, TextExtractor.Output

Field Summary

Fields
Modifier and Type	Field and Description
`protected static Set<org.apache.tika.mime.MediaType>`	`DEFAULT_EXCLUDED_MIME_TYPES` The MIME types that are excluded by default.
`protected static Logger`	`LOGGER`

Constructor Summary

Constructors
Constructor and Description
`TikaTextExtractor()` No-arg constructor is required because this is instantiated by reflection.

Method Summary

Methods
Modifier and Type	Method and Description
`void`	`extractFrom(Binary binary, TextExtractor.Output output, TextExtractor.Context context)` Extract text from the given `Binary`, using the given output to record the results.
`protected Set<org.apache.tika.mime.MediaType>`	`getExcludedMediaTypes()`
`protected Set<org.apache.tika.mime.MediaType>`	`getIncludedMediaTypes()`
`protected Set<org.apache.tika.mime.MediaType>`	`getParserSupportedMediaTypes()`
`protected org.apache.tika.parser.DefaultParser`	`initialize()` This class lazily initializes the `DefaultParser` instance.
`protected org.apache.tika.metadata.Metadata`	`prepareMetadata(Binary binary, TextExtractor.Context context)` Creates a new Tika metadata object used by the parser.
`protected void`	`setWriteLimit(Integer writeLimit)` Sets the write limit for the Tika parser: the maximum number of characters that the parser should extract.
`boolean`	`supportsMimeType(String mimeType)` Determine if this extractor is capable of processing content with the supplied MIME type.
`String`	`toString()`

Methods inherited from class org.modeshape.jcr.api.text.TextExtractor
getExcludedMimeTypes, getIncludedMimeTypes, getName, logger, processStream, setLogger, setName

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

- Field Detail
  - LOGGER
```
protected static final Logger LOGGER
```
  - DEFAULT_EXCLUDED_MIME_TYPES
```
protected static final Set<org.apache.tika.mime.MediaType> DEFAULT_EXCLUDED_MIME_TYPES
```
    The MIME types that are excluded by default. Currently, this list consists of:
    - application/x-archive
    - application/x-bzip
    - application/x-bzip2
    - application/x-cpio
    - application/x-gtar
    - application/x-gzip
    - application/x-tar
    - application/zip
    - application/vnd.teiid.vdb
    - image/*
    - audio/*
    - video/*
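    The list above mixes exact types with wildcard entries such as image/*. One plausible way to check a MIME type against such a list is sketched here. `ExcludedTypeSketch` and `isExcluded` are hypothetical, the wildcard rule (match on the primary type) is an assumption, and MIME type parameters such as a charset are not handled.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of ModeShape. */
final class ExcludedTypeSketch {

    /** The default exclusions as listed in this documentation. */
    static final List<String> DEFAULT_EXCLUDED = Arrays.asList(
        "application/x-archive", "application/x-bzip", "application/x-bzip2",
        "application/x-cpio", "application/x-gtar", "application/x-gzip",
        "application/x-tar", "application/zip", "application/vnd.teiid.vdb",
        "image/*", "audio/*", "video/*");

    /** Returns true if the MIME type equals an entry or falls under a "type/*" entry. */
    static boolean isExcluded(String mimeType, List<String> excluded) {
        String candidate = mimeType.trim().toLowerCase(Locale.ROOT);
        int slash = candidate.indexOf('/');
        if (slash < 0) {
            return false; // not of the form type/subtype; nothing to match in this sketch
        }
        String primaryType = candidate.substring(0, slash);
        for (String entry : excluded) {
            String pattern = entry.toLowerCase(Locale.ROOT);
            if (pattern.equals(candidate)) {
                return true; // exact match
            }
            // Assumed wildcard rule: "image/*" covers every "image/..." subtype.
            if (pattern.endsWith("/*")
                    && pattern.substring(0, pattern.length() - 2).equals(primaryType)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isExcluded("image/png", DEFAULT_EXCLUDED));
        System.out.println(isExcluded("application/pdf", DEFAULT_EXCLUDED));
    }
}
```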
- Constructor Detail
  - TikaTextExtractor
```
public TikaTextExtractor()
```
    No-arg constructor is required because this is instantiated by reflection.
- Method Detail
  - supportsMimeType
```
public boolean supportsMimeType(String mimeType)
```
    Description copied from class: TextExtractor
    
    Determine if this extractor is capable of processing content with the supplied MIME type.
    
    Specified by:
    
    supportsMimeType in class TextExtractor
    
    Parameters:
    
    mimeType - the MIME type; never null
    
    Returns:
    
    true if this extractor can process content with the supplied MIME type, or false otherwise.
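    Based on the property descriptions at the top of this page, the decision could combine the excluded, included, and parser-supported types roughly as follows. This is a sketch under stated assumptions, not the actual implementation: `SupportDecisionSketch` and `supports` are hypothetical, an empty included set is assumed to mean "no restriction", and wildcard handling is left out for brevity.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for illustration; not part of ModeShape. */
final class SupportDecisionSketch {

    static boolean supports(String mimeType,
                            Set<String> parserSupported,
                            Set<String> included,
                            Set<String> excluded) {
        // Exclusion wins even if a parser is available for the type.
        if (excluded.contains(mimeType)) {
            return false;
        }
        // Types without a parser are never supported, even if listed as included.
        if (!parserSupported.contains(mimeType)) {
            return false;
        }
        // Assumption: an empty included set places no further restriction.
        return included.isEmpty() || included.contains(mimeType);
    }

    public static void main(String[] args) {
        Set<String> parsers = new HashSet<>(Arrays.asList("application/pdf", "text/plain"));
        Set<String> excluded = Collections.singleton("text/plain");
        Set<String> included = Collections.emptySet();
        System.out.println(supports("application/pdf", parsers, included, excluded));
        System.out.println(supports("text/plain", parsers, included, excluded));
    }
}
```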
  - extractFrom
```
public void extractFrom(Binary binary,
                        TextExtractor.Output output,
                        TextExtractor.Context context)
                 throws Exception
```
    Description copied from class: TextExtractor
    
    Extract text from the given Binary, using the given output to record the results.
    
    Specified by:
    
    extractFrom in class TextExtractor
    
    Parameters:
    
    binary - the binary value that can be used in the extraction process; never null
    
    output - the output from the sequencing operation; never null
    
    context - the context for the sequencing operation; never null
    
    Throws:
    
    Exception - if there is a problem during the extraction process
  - prepareMetadata
```
protected final org.apache.tika.metadata.Metadata prepareMetadata(Binary binary,
                                                                  TextExtractor.Context context)
                                                           throws IOException,
                                                                  RepositoryException
```
    Creates a new Tika metadata object used by the parser. It contains the MIME type of the content being parsed, if the underlying context can provide it. Otherwise, Tika's autodetection mechanism is used to try to determine the MIME type.
    
    Parameters:
    
    binary - an org.modeshape.jcr.api.Binary instance of the content being parsed
    
    context - the extraction context; may not be null
    
    Returns:
    
    a Metadata instance.
    
    Throws:
    
    IOException - if auto-detecting the MIME type via Tika fails
    
    RepositoryException - if an error occurs while obtaining the MIME type of the binary parameter
  - initialize
```
protected org.apache.tika.parser.DefaultParser initialize()
```
    This class lazily initializes the DefaultParser instance.
    
    Returns:
    
    the default parser; same as parser
  - setWriteLimit
```
protected void setWriteLimit(Integer writeLimit)
```
    Sets the write limit for the Tika parser: the maximum number of characters that the parser should extract.
    
    Parameters:
    
    writeLimit - an Integer which represents the write limit; may be null
    
    See Also:
    
    BodyContentHandler.BodyContentHandler(int)
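    To make the idea of a character write limit concrete, here is a small JDK-only sketch of a buffer that stops accepting text once a limit is reached. It does not use Tika. `WriteLimitSketch` is hypothetical, and both the fallback used for a null limit and the "negative means unlimited" rule are assumptions of this sketch rather than documented behavior of this class.

```java
/** Hypothetical illustration of a character write limit; not part of ModeShape or Tika. */
final class WriteLimitSketch {

    // Assumption for this sketch only: the limit used when null is supplied.
    private static final int ASSUMED_DEFAULT_LIMIT = 100_000;

    private final StringBuilder text = new StringBuilder();
    private final int writeLimit; // a negative value means "unlimited" in this sketch

    WriteLimitSketch(Integer writeLimit) {
        this.writeLimit = writeLimit != null ? writeLimit : ASSUMED_DEFAULT_LIMIT;
    }

    /** Appends as much of the chunk as the limit allows; returns false if it was truncated. */
    boolean append(String chunk) {
        if (writeLimit < 0) {
            text.append(chunk);
            return true;
        }
        int remaining = writeLimit - text.length();
        if (chunk.length() <= remaining) {
            text.append(chunk);
            return true;
        }
        // Keep only the characters that still fit under the limit.
        text.append(chunk, 0, Math.max(remaining, 0));
        return false;
    }

    String text() {
        return text.toString();
    }

    public static void main(String[] args) {
        WriteLimitSketch sketch = new WriteLimitSketch(5);
        sketch.append("abc");
        sketch.append("defg");
        System.out.println(sketch.text());
    }
}
```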
  - getExcludedMediaTypes
```
protected Set<org.apache.tika.mime.MediaType> getExcludedMediaTypes()
```
  - getIncludedMediaTypes
```
protected Set<org.apache.tika.mime.MediaType> getIncludedMediaTypes()
```
  - getParserSupportedMediaTypes
```
protected Set<org.apache.tika.mime.MediaType> getParserSupportedMediaTypes()
```
  - toString
```
public String toString()
```
    Overrides:
    
    toString in class Object
