This sequencer is included in JBoss DNA and processes Microsoft Office documents, including Word documents, Excel spreadsheets, and PowerPoint presentations. For Word documents, the sequencer attempts to infer the internal structure from the heading styles. For presentations, it extracts the slides, titles, text, and slide thumbnails. For spreadsheets, it extracts the names of the sheets. For all of these file types, the sequencer also extracts the general file information, including the name of the author, title, keywords, subject, comments, and various dates.
To use this sequencer, simply include the dna-sequencer-msoffice JAR and all of the POI JARs in your application and configure the JcrConfiguration to use this sequencer using something similar to:
JcrConfiguration config = ...
config.sequencer("Microsoft Office Document Sequencer")
      .usingClass("org.jboss.dna.sequencer.msoffice.MSOfficeMetadataSequencer")
      .loadedFromClasspath()
      .setDescription("Sequences MS Office documents, including spreadsheets and presentations")
      .sequencingFrom("//(*.(doc|docx|ppt|pps|xls)[*])/jcr:content[@jcr:data]")
      .andOutputtingTo("/msoffice/$1");
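To make the effect of the path expression and the "/msoffice/$1" output path more concrete, the following is a rough sketch that approximates the selection with a plain java.util.regex pattern. It is not how JBoss DNA evaluates path expressions: the class name PathExpressionSketch, the method outputPathFor, and the regular expression itself are hypothetical and exist only for illustration. The sketch assumes that the first capture group is the file node's name (with an optional same-name-sibling index) and that this name replaces $1 in the output path.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the JBoss DNA API.
public class PathExpressionSketch {

    // Approximates "any node, at any depth, whose name ends in one of the
    // listed extensions and that has a jcr:content child". The check for the
    // jcr:data property cannot be modelled on a path string and is omitted.
    private static final Pattern INPUT = Pattern.compile(
        "^(?:/.*)?/([^/]*\\.(doc|docx|ppt|pps|xls)(?:\\[\\d+\\])?)/jcr:content$");

    // Returns the assumed output location for a matching input path,
    // or an empty Optional when the path would not be selected.
    static Optional<String> outputPathFor(String inputPath) {
        Matcher m = INPUT.matcher(inputPath);
        if (!m.matches()) {
            return Optional.empty();
        }
        // Group 1 plays the role of $1 in the output path.
        return Optional.of("/msoffice/" + m.group(1));
    }

    public static void main(String[] args) {
        String[] samples = {
            "/files/report.doc/jcr:content",
            "/budget.xls/jcr:content",
            "/files/notes.txt/jcr:content"
        };
        for (String path : samples) {
            System.out.println(path + " -> " + outputPathFor(path).orElse("(not sequenced)"));
        }
    }
}
```

Under these assumptions, a Word document stored at /files/report.doc would have its sequenced output placed under /msoffice/report.doc, while a file with an extension outside the list would not be selected at all.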