JBoss Community Archive (Read Only)

ModeShape 5

Microsoft Office files

The Microsoft Office sequencer is included in ModeShape and processes Microsoft Office documents, including Word documents, Excel spreadsheets, and PowerPoint presentations. For Word documents, the sequencer attempts to infer the internal structure from the heading styles. For presentations, it extracts the slides, titles, text, and slide thumbnails. For spreadsheets, it extracts the name and text of each sheet. For all file types, it also extracts general file information, including the author's name, title, keywords, subject, comments, and various dates.


This sequencer generates a simple graph structure containing a variety of metadata from the Office document. The following shows sample output (in the JCR document view) for a Word document sequenced into /document.

<document jcr:primaryType="msoffice:metadata"
          msoffice:title="My Word Document"
          msoffice:subject="My Subject"
          msoffice:author="James Joyce"
          msoffice:keywords="essay english term paper"
          msoffice:comment="This is my English 101 term paper"
          msoffice:thumbnail="..." />
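If the sequenced output is exported as document view XML like the sample above, the metadata can be read back with the JDK's DOM parser. The following is only a minimal sketch: the class name DocumentViewReader and the method readMetadata are hypothetical, and it assumes the XML carries no xmlns declarations (as in the sample), so the parser is left in its default non-namespace-aware mode. A real JCR export would normally declare the jcr and msoffice namespaces.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of ModeShape.
class DocumentViewReader {

    // Returns every msoffice:* attribute of the root element, sorted by attribute name.
    static Map<String, String> readMetadata(String xml) {
        try {
            // Assumption: no xmlns bindings are present, so "msoffice:title" is
            // treated as a plain attribute name (the factory default is not namespace-aware).
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            Element root = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();

            Map<String, String> metadata = new TreeMap<>();
            NamedNodeMap attributes = root.getAttributes();
            for (int i = 0; i < attributes.getLength(); i++) {
                String name = attributes.item(i).getNodeName();
                if (name.startsWith("msoffice:")) {
                    metadata.put(name, attributes.item(i).getNodeValue());
                }
            }
            return metadata;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse document view XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<document jcr:primaryType=\"msoffice:metadata\""
                + " msoffice:title=\"My Word Document\""
                + " msoffice:author=\"James Joyce\" />";
        readMetadata(xml).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Attributes outside the msoffice prefix, such as jcr:primaryType, are skipped on purpose so that the result contains only the sequencer's metadata.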

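As indicated in the CND below, sequencing a Word document adds an msoffice:heading child node for each heading, containing the heading's name (msoffice:heading_name) and level (msoffice:heading_level). Sequencing an Excel spreadsheet adds an msoffice:sheet child node (of type msoffice:xlssheet) for each sheet, containing the sheet's name (msoffice:sheet_name) and text (msoffice:text). Sequencing a PowerPoint presentation adds an msoffice:slide child node (of type msoffice:pptslide) for each slide, containing the slide's title (msoffice:title), text (msoffice:text), and thumbnail image (msoffice:thumbnail).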

[msoffice:metadata] > nt:unstructured, mix:mimeType
  - msoffice:title (string)
  - msoffice:subject (string)
  - msoffice:author (string)
  - msoffice:keywords (string)
  - msoffice:comment (string)
  - msoffice:template (string)
  - msoffice:last_saved_by (string)
  - msoffice:revision (string)
  - msoffice:total_editing_time (long)
  - msoffice:last_printed (date)
  - msoffice:created (date)
  - msoffice:saved (date)
  - msoffice:pages (long)
  - msoffice:words (long)
  - msoffice:characters (long)
  - msoffice:creating_application (string)
  - msoffice:thumbnail (binary)

  // Word specific data
  + msoffice:heading (msoffice:heading) sns

  // PowerPoint specific data
  + msoffice:slide (msoffice:pptslide) sns

  // Excel specific data
  - msoffice:full_content (string)
  + msoffice:sheet (msoffice:xlssheet) sns

[msoffice:pptslide]
  - msoffice:title (string)
  - msoffice:text (string)
  - msoffice:notes (string)
  - msoffice:thumbnail (binary)

[msoffice:xlssheet]
  - msoffice:sheet_name (string)
  - msoffice:text (string)

[msoffice:heading]
  - msoffice:heading_name (string)
  - msoffice:heading_level (long)

To use this sequencer, include the modeshape-sequencer-msoffice JAR and all of the POI JARs in your application, and configure the repository with something similar to:

{
    "name" : "MS Office Test Repository",
    "sequencing" : {
        "removeDerivedContentWithOriginal" : true,
        "sequencers" : {
            "MS Office Sequencer" : {
                "description" : "Office sequencer",
                "classname" : "msoffice",
                "pathExpressions" : [ "default://(*.(xls|doc|ppt))/jcr:content[@jcr:data] => /output/$1" ]
            }
        }
    }
}
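The path expression in this configuration selects files by extension and writes the derived content under /output, using the matched file name ($1). To make that mapping concrete, here is a rough Java sketch that approximates the expression with a hand-written regular expression. It is not ModeShape's matching code: the class PathExpressionSketch, the method outputPathFor, and the regex are illustrative assumptions, and the sketch ignores the workspace name (default) and the [@jcr:data] property predicate.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; ModeShape evaluates path expressions itself.
class PathExpressionSketch {

    // Approximates "//(*.(xls|doc|ppt))/jcr:content":
    //   (?:/[^/]+)*                any number of parent path segments
    //   ([^/]*\.(xls|doc|ppt))     group 1: a file node name ending in .xls, .doc, or .ppt
    //   /jcr:content               the content child node of that file
    private static final Pattern INPUT =
            Pattern.compile("^(?:/[^/]+)*/([^/]*\\.(xls|doc|ppt))/jcr:content$");

    // Returns the output path ("/output/" + group 1), or empty if the path is not selected.
    static Optional<String> outputPathFor(String contentNodePath) {
        Matcher matcher = INPUT.matcher(contentNodePath);
        return matcher.matches()
                ? Optional.of("/output/" + matcher.group(1))
                : Optional.empty();
    }

    public static void main(String[] args) {
        String[] paths = {
            "/files/2016/report.doc/jcr:content",
            "/budget.xls/jcr:content",
            "/files/notes.txt/jcr:content"
        };
        for (String path : paths) {
            System.out.println(path + " -> " + outputPathFor(path).orElse("(not sequenced)"));
        }
    }
}
```

In this sketch the whole node name must end in one of the three listed extensions, so a name such as report.docx is not selected. To cover other extensions, the alternation in the path expression would need to list them.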
JBoss.org Content Archive (Read Only), exported from JBoss Community Documentation Editor at 2020-03-11 12:12:59 UTC, last content change 2016-04-07 07:31:45 UTC.