public class TikaTextExtractor extends TextExtractor
A TextExtractor that uses the Apache Tika library.
This extractor automatically discovers all Tika Parser implementations listed in the
META-INF/services/org.apache.tika.parser.Parser
text files that are accessible via the current classloader. Each file contains
the class names of Parser implementations, one class name per line.
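The discovery step described above can be sketched with the JDK alone. This is a minimal illustration, not the extractor's or Tika's actual code: the class name `ParserServiceFiles` and its methods are hypothetical, and skipping `#` comments is an assumption borrowed from the `java.util.ServiceLoader` file format. It lists every class name found in the services files visible to a classloader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
final class ParserServiceFiles {

    // Resource name taken from the class description above.
    static final String RESOURCE = "META-INF/services/org.apache.tika.parser.Parser";

    // One class name per line; blank lines and '#' comments are skipped
    // (the comment rule is an assumption based on the ServiceLoader format).
    static List<String> parseClassNames(Iterable<String> lines) {
        List<String> names = new ArrayList<>();
        for (String line : lines) {
            int hash = line.indexOf('#');
            String name = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    // Reads every matching services file that the given classloader can see.
    static List<String> discover(ClassLoader loader) throws IOException {
        List<String> names = new ArrayList<>();
        for (URL url : Collections.list(loader.getResources(RESOURCE))) {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                names.addAll(parseClassNames(reader.lines().collect(Collectors.toList())));
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        ClassLoader loader = ParserServiceFiles.class.getClassLoader();
        for (String name : discover(loader)) {
            System.out.println(name);
        }
    }
}
```

If no Tika JAR is on the classpath, `discover` simply returns an empty list.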
This text extractor can be configured in a ModeShape configuration by specifying several optional properties.
By default, the MIME types for package files are excluded, though explicitly
setting any excluded MIME types will override these defaults.
Nested classes inherited from class TextExtractor: TextExtractor.BinaryOperation<T>, TextExtractor.Context, TextExtractor.Output
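The override rule for excluded MIME types can be illustrated as follows. This is a sketch under stated assumptions: the class `ExcludedTypesSketch` is hypothetical, the two default values are placeholders (this page does not list the real contents of DEFAULT_EXCLUDED_MIME_TYPES), and treating `null` as "not configured" is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration of "explicit exclusions replace the defaults".
final class ExcludedTypesSketch {

    // Placeholder defaults; the real default set is not shown on this page.
    static final Set<String> DEFAULTS = Collections.unmodifiableSet(
            new LinkedHashSet<>(Arrays.asList("application/zip", "application/x-tar")));

    // Assumption: null means the property was not configured.
    static Set<String> effectiveExcluded(Set<String> explicit) {
        if (explicit == null) {
            return DEFAULTS;
        }
        // An explicit setting replaces the defaults rather than adding to them.
        return Collections.unmodifiableSet(new LinkedHashSet<>(explicit));
    }

    public static void main(String[] args) {
        System.out.println(effectiveExcluded(null));
        System.out.println(effectiveExcluded(Collections.singleton("image/png")));
    }
}
```

One consequence of replacing rather than merging is that a configuration which sets any exclusion must repeat the package-file types if they should stay excluded.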
Modifier and Type | Field and Description |
---|---|
protected static Set<org.apache.tika.mime.MediaType> | DEFAULT_EXCLUDED_MIME_TYPES: The MIME types that are excluded by default. |
protected static Logger | LOGGER |
Constructor and Description |
---|
TikaTextExtractor(): No-arg constructor is required because this is instantiated by reflection. |
Modifier and Type | Method and Description |
---|---|
void | extractFrom(Binary binary, TextExtractor.Output output, TextExtractor.Context context): Extract text from the given Binary, using the given output to record the results. |
protected Set<org.apache.tika.mime.MediaType> | getExcludedMediaTypes() |
protected Set<org.apache.tika.mime.MediaType> | getIncludedMediaTypes() |
protected Set<org.apache.tika.mime.MediaType> | getParserSupportedMediaTypes() |
protected org.apache.tika.parser.DefaultParser | initialize(): This class lazily initializes the DefaultParser instance. |
protected org.apache.tika.metadata.Metadata | prepareMetadata(Binary binary, TextExtractor.Context context): Creates a new Tika metadata object used by the parser. |
protected void | setWriteLimit(Integer writeLimit): Sets the write limit for the Tika parser, representing the maximum number of characters that should be extracted by the Tika parser. |
boolean | supportsMimeType(String mimeType): Determine if this extractor is capable of processing content with the supplied MIME type. |
String | toString() |
Methods inherited from class TextExtractor: getExcludedMimeTypes, getIncludedMimeTypes, getName, logger, processStream, setLogger, setName
protected static final Logger LOGGER
protected static final Set<org.apache.tika.mime.MediaType> DEFAULT_EXCLUDED_MIME_TYPES
public TikaTextExtractor()
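public boolean supportsMimeType(String mimeType)
Description copied from class: TextExtractor
Determine if this extractor is capable of processing content with the supplied MIME type.
supportsMimeType in class TextExtractor
Parameters:
mimeType - the MIME type; never null
public void extractFrom(Binary binary, TextExtractor.Output output, TextExtractor.Context context) throws Exception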
Description copied from class: TextExtractor
Extract text from the given Binary, using the given output to record the results.
extractFrom in class TextExtractor
Parameters:
binary - the binary value that can be used in the extraction process; never null
output - the output from the sequencing operation; never null
context - the context for the sequencing operation; never null
Throws:
Exception - if there is a problem during the extraction process
protected final org.apache.tika.metadata.Metadata prepareMetadata(Binary binary, TextExtractor.Context context) throws IOException, RepositoryException
Parameters:
binary - an org.modeshape.jcr.api.Binary instance of the content being parsed
context - the extraction context; may not be null
Returns:
a Metadata instance.
Throws:
IOException - if auto-detecting the MIME type via Tika fails
RepositoryException - if there is an error obtaining the MIME type of the binary parameter
protected org.apache.tika.parser.DefaultParser initialize()
This class lazily initializes the DefaultParser instance.
parser
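Lazy initialization of an expensive object such as a parser can be sketched as below. This is a generic pattern, not the class's actual code: `LazyParserHolder` is a hypothetical name, and this page does not say whether the real initialize() uses double-checked locking or any synchronization at all.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical holder that builds its value on first use only.
final class LazyParserHolder<T> {

    private final Supplier<T> factory;
    private volatile T instance; // volatile makes the double-checked read safe

    LazyParserHolder(Supplier<T> factory) {
        this.factory = factory;
    }

    // Returns the cached instance, creating it on the first call.
    T initialize() {
        T result = instance;
        if (result == null) {
            synchronized (this) {
                result = instance;
                if (result == null) {
                    result = factory.get();
                    instance = result;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        AtomicInteger builds = new AtomicInteger();
        LazyParserHolder<Object> holder = new LazyParserHolder<>(() -> {
            builds.incrementAndGet();
            return new Object();
        });
        Object first = holder.initialize();
        Object second = holder.initialize();
        System.out.println(first == second); // prints true
        System.out.println(builds.get());    // prints 1
    }
}
```

Deferring construction this way keeps the no-arg constructor cheap, which matters for a class that is instantiated by reflection before it is configured.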
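protected void setWriteLimit(Integer writeLimit)
Sets the write limit for the Tika parser, representing the maximum number of characters that should be extracted by the Tika parser.
Parameters:
writeLimit - an Integer which represents the write limit; may be null
See Also:
BodyContentHandler.BodyContentHandler(int)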
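The idea of a write limit, a cap on the number of extracted characters, can be illustrated with a small JDK-only buffer. This is a sketch, not Tika's BodyContentHandler: the class `LimitedTextBuffer` is hypothetical, it silently drops overflow rather than reproducing how Tika signals that the limit was reached, and it does not model what a null writeLimit means.

```java
// Hypothetical buffer that keeps at most `limit` characters.
final class LimitedTextBuffer {

    private final StringBuilder out = new StringBuilder();
    private final int limit;
    private boolean truncated;

    LimitedTextBuffer(int limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        this.limit = limit;
    }

    // Appends as much of the text as still fits and records any overflow.
    void append(CharSequence text) {
        int room = limit - out.length();
        int n = Math.min(room, text.length());
        if (n > 0) {
            out.append(text, 0, n);
        }
        if (n < text.length()) {
            truncated = true;
        }
    }

    boolean isTruncated() {
        return truncated;
    }

    @Override
    public String toString() {
        return out.toString();
    }

    public static void main(String[] args) {
        LimitedTextBuffer buffer = new LimitedTextBuffer(5);
        buffer.append("hel");
        buffer.append("lo world");
        System.out.println(buffer);               // prints hello
        System.out.println(buffer.isTruncated()); // prints true
    }
}
```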
protected Set<org.apache.tika.mime.MediaType> getExcludedMediaTypes()
protected Set<org.apache.tika.mime.MediaType> getIncludedMediaTypes()
protected Set<org.apache.tika.mime.MediaType> getParserSupportedMediaTypes()
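One plausible way the three media-type sets above could feed a supportsMimeType decision is sketched here. The precedence shown is an assumption, not the documented behavior: exclusions win, the type must be one that a discovered parser supports, and an empty included set is treated as "no restriction". Plain strings stand in for Tika's MediaType, and `MimeTypeSupportSketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical decision logic; the real method may differ.
final class MimeTypeSupportSketch {

    static boolean supports(String mimeType,
                            Set<String> excluded,
                            Set<String> included,
                            Set<String> parserSupported) {
        String type = mimeType.trim().toLowerCase(Locale.ROOT);
        if (excluded.contains(type)) {
            return false; // assumption: exclusions take precedence
        }
        if (!parserSupported.contains(type)) {
            return false; // no discovered parser handles this type
        }
        // Assumption: an empty included set means every supported type is allowed.
        return included.isEmpty() || included.contains(type);
    }

    public static void main(String[] args) {
        Set<String> supported = new HashSet<>(
                Arrays.asList("application/pdf", "text/plain", "application/zip"));
        Set<String> excluded = Collections.singleton("application/zip");
        Set<String> included = Collections.emptySet();

        System.out.println(supports("application/pdf", excluded, included, supported)); // prints true
        System.out.println(supports("application/zip", excluded, included, supported)); // prints false
        System.out.println(supports("image/png", excluded, included, supported));       // prints false
    }
}
```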
Copyright © 2008–2016 JBoss, a division of Red Hat. All rights reserved.