Chapter 21. Microsoft Office® Document Sequencer

This sequencer is included in JBoss DNA and processes Microsoft Office documents, including Word documents, Excel spreadsheets, and PowerPoint presentations. For Word documents, the sequencer attempts to infer the internal structure from the heading styles. For presentations, it extracts the slides, titles, text, and slide thumbnails. For spreadsheets, it extracts the names of the sheets. For all files, the sequencer also extracts the general file information, including the name of the author, title, keywords, subject, comments, and various dates.

To use this sequencer, include the dna-sequencer-msoffice JAR and all of the Apache POI JARs in your application, then configure the JcrConfiguration to use the sequencer with something similar to:

JcrConfiguration config = ...

config.sequencer("Microsoft Office Document Sequencer")
      .usingClass("org.jboss.dna.sequencer.msoffice.MSOfficeMetadataSequencer")
      .loadedFromClasspath()
      .setDescription("Sequences MS Office documents, including spreadsheets and presentations")
      .sequencingFrom("//(*.(doc|docx|ppt|pps|xls)[*])/jcr:content[@jcr:data]")
      .andOutputtingTo("/msoffice/$1");
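
The effect of the path expression and the "$1" reference in the output path can be approximated with a plain regular expression. The following is only a rough sketch for illustration, not how JBoss DNA evaluates path expressions: the class name PathExpressionSketch, the method outputPathFor, the regular expression, and the sample node paths are all hypothetical, and the check for the jcr:data property is left out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PathExpressionSketch {

    // Hypothetical regex standing in for the selection part of the path expression:
    // a node at any depth whose name ends in one of the listed extensions, optionally
    // followed by a same-name-sibling index, with a jcr:content child.
    // Group 1 plays the role of "$1" in the output path.
    private static final Pattern INPUT = Pattern.compile(
            "^.*/([^/]+\\.(?:doc|docx|ppt|pps|xls))(?:\\[\\d+\\])?/jcr:content$");

    /** Returns the output path for a selected node path, or null if the path is not selected. */
    public static String outputPathFor(String nodePath) {
        Matcher matcher = INPUT.matcher(nodePath);
        return matcher.matches() ? "/msoffice/" + matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // prints "/msoffice/budget.xls"
        System.out.println(outputPathFor("/files/reports/budget.xls/jcr:content"));
        // prints "null" because the extension is not in the list
        System.out.println(outputPathFor("/files/notes.txt/jcr:content"));
    }
}
```

Under these assumptions, a file uploaded at /files/reports/budget.xls would have its extracted metadata stored under /msoffice/budget.xls, while files with other extensions are ignored by this sequencer.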