ModeShape Distribution 3.0.0.Beta4
java.lang.Object
  org.modeshape.jcr.api.text.TextExtractor
    org.modeshape.extractor.tika.TikaTextExtractor

public class TikaTextExtractor
extends TextExtractor
A TextExtractor that uses the Apache Tika library.

This extractor will automatically discover all of the Tika Parser implementations that are defined in META-INF/services/org.apache.tika.parser.Parser text files accessible via the current classloader. Each of these files contains the class names of the Parser implementations, one class name per line.
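
The discovery convention described above can be illustrated with plain JDK calls. The following is a minimal sketch, not the mechanism Tika or ModeShape actually uses internally: the class name ParserServiceFiles and the method listParserClassNames are hypothetical, and the handling of '#' comments assumes the standard java.util.ServiceLoader file format. It only lists the class names found in the service files and does not instantiate any parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, for illustration only.
final class ParserServiceFiles {

    // Resource name taken from the class description above.
    static final String RESOURCE = "META-INF/services/org.apache.tika.parser.Parser";

    // Collects the class names listed in every matching service file
    // visible to the given classloader (one class name per line).
    static List<String> listParserClassNames(ClassLoader loader) throws IOException {
        List<String> names = new ArrayList<>();
        Enumeration<URL> files = loader.getResources(RESOURCE);
        while (files.hasMoreElements()) {
            URL file = files.nextElement();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(file.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Assumption: '#' starts a comment, as in ServiceLoader files.
                    int hash = line.indexOf('#');
                    if (hash >= 0) {
                        line = line.substring(0, hash);
                    }
                    line = line.trim();
                    if (!line.isEmpty()) {
                        names.add(line);
                    }
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        ClassLoader loader = ParserServiceFiles.class.getClassLoader();
        for (String name : listParserClassNames(loader)) {
            System.out.println(name);
        }
    }
}
```

When no Tika JAR is on the classpath the list is simply empty, which makes the sketch a quick way to check which Parser implementations a given classloader can see.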
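This text extractor can be configured in a ModeShape configuration by specifying several optional properties, including the MIME types to exclude and the MIME types to include (see setExcludedMimeTypes and setIncludedMimeTypes below). By default, the MIME types for package files are excluded, though explicitly setting any excluded MIME types will override these defaults.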
Nested Class Summary |
---|
Nested classes/interfaces inherited from class org.modeshape.jcr.api.text.TextExtractor |
---|
TextExtractor.BinaryOperation<T>, TextExtractor.Context, TextExtractor.Output |
Field Summary | |
---|---|
static Set<String> | DEFAULT_EXCLUDED_MIME_TYPES: The MIME types that are excluded by default. |
protected static Logger | LOGGER |
Constructor Summary | |
---|---|
TikaTextExtractor() | |
Method Summary | |
---|---|
void | extractFrom(Binary binary, TextExtractor.Output output, TextExtractor.Context context): Extract text from the given Binary, using the given output to record the results. |
Set<String> | getExcludedMimeTypes(): Get the MIME types that should be excluded. |
Set<String> | getIncludedMimeTypes(): Get the MIME types that are explicitly requested to be included. |
protected org.apache.tika.parser.DefaultParser | initialize(): This class lazily initializes the DefaultParser instance. |
protected org.apache.tika.metadata.Metadata | prepareMetadata(Binary binary): Creates a new Tika metadata object used by the parser. |
void | setExcludedMimeTypes(Collection<String> excludedMimeTypes) |
void | setExcludedMimeTypes(String excludedMimeTypes): Set the MIME types that should be excluded. |
void | setIncludedMimeTypes(Collection<String> includedMimeTypes) |
void | setIncludedMimeTypes(String includedMimeTypes): Set the MIME types that should be included. |
boolean | supportsMimeType(String mimeType): Determine if this extractor is capable of processing content with the supplied MIME type. |
Methods inherited from class org.modeshape.jcr.api.text.TextExtractor |
---|
getLogger, getName, processStream, setLogger, setName |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected static final Logger LOGGER
public static final Set<String> DEFAULT_EXCLUDED_MIME_TYPES
Constructor Detail |
---|
public TikaTextExtractor()
Method Detail |
---|
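public boolean supportsMimeType(String mimeType)
Determine if this extractor is capable of processing content with the supplied MIME type.
supportsMimeType in class TextExtractor
Parameters:
mimeType - the MIME type; never null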
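
This page does not spell out how the included and excluded MIME types combine when deciding whether a MIME type is supported. The sketch below shows one plausible reading, offered as an assumption rather than as the actual implementation: exclusions win, a parser must be available, and a non-empty include list acts as a whitelist. The class MimeTypeFilter, its fields, and the method accepts are all hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical filter, for illustration only; not part of ModeShape.
final class MimeTypeFilter {
    private final Set<String> included;
    private final Set<String> excluded;
    private final Set<String> parserSupported;

    MimeTypeFilter(Set<String> included, Set<String> excluded, Set<String> parserSupported) {
        // Defensive copies so later changes by the caller have no effect.
        this.included = new HashSet<>(included);
        this.excluded = new HashSet<>(excluded);
        this.parserSupported = new HashSet<>(parserSupported);
    }

    // Assumed semantics: excluded types are always rejected, types without
    // a parser are rejected, and a non-empty include list restricts the rest.
    boolean accepts(String mimeType) {
        if (excluded.contains(mimeType)) {
            return false;
        }
        if (!parserSupported.contains(mimeType)) {
            return false;
        }
        return included.isEmpty() || included.contains(mimeType);
    }
}
```

Under these assumed rules, leaving the include list empty means every parser-supported type that is not excluded is accepted.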
public void extractFrom(Binary binary, TextExtractor.Output output, TextExtractor.Context context) throws Exception
Extract text from the given Binary, using the given output to record the results.
extractFrom in class TextExtractor
Parameters:
binary - the binary value that can be used in the extraction process; never null
output - the output from the sequencing operation; never null
context - the context for the sequencing operation; never null
Throws:
Exception - if there is a problem during the extraction process
protected final org.apache.tika.metadata.Metadata prepareMetadata(Binary binary) throws IOException, RepositoryException
Creates a new Tika metadata object used by the parser.
Parameters:
binary - an org.modeshape.jcr.api.Binary instance of the content being parsed
Returns:
a Metadata instance.
Throws:
IOException - if auto-detecting the MIME type via Tika fails
RepositoryException - if there is an error obtaining the MIME type of the binary parameter
protected org.apache.tika.parser.DefaultParser initialize()
This class lazily initializes the DefaultParser instance.
Returns:
the parser
public Set<String> getIncludedMimeTypes()
public void setIncludedMimeTypes(String includedMimeTypes)
Set the MIME types that should be included.
Parameters:
includedMimeTypes - the whitespace-delimited or comma-separated list of MIME types that are to be included
public void setIncludedMimeTypes(Collection<String> includedMimeTypes)
public Set<String> getExcludedMimeTypes()
public void setExcludedMimeTypes(String excludedMimeTypes)
Set the MIME types that should be excluded.
Parameters:
excludedMimeTypes - the whitespace-delimited or comma-separated list of MIME types that are to be excluded
public void setExcludedMimeTypes(Collection<String> excludedMimeTypes)
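
The String setters above accept a whitespace-delimited or comma-separated list. As a rough illustration of that input format, the following sketch splits such a string into a set of MIME types. The class MimeTypeLists and the method parse are hypothetical names, and the exact tokenizing rules ModeShape applies (for example around case or duplicates) are not stated on this page, so treat the regular expression as an assumption.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, for illustration only; not part of ModeShape.
final class MimeTypeLists {

    // Splits on any run of commas and/or whitespace and drops empty tokens.
    // A LinkedHashSet keeps the original order and removes duplicates.
    static Set<String> parse(String value) {
        Set<String> result = new LinkedHashSet<>();
        if (value == null) {
            return result;
        }
        for (String token : value.split("[,\\s]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Mixed delimiters: comma, comma plus space, and a tab.
        System.out.println(parse("application/pdf, text/plain\ttext/html"));
    }
}
```

A set produced this way could then be handed to the Collection<String> overloads, assuming the caller prefers to do the splitting itself.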