Chapter 30. Text Sequencers

30.1. Delimited Text Sequencer
30.2. Fixed Width Text Sequencer

The text sequencers extract data from text streams. There are separate sequencers for character-delimited text and fixed-width text, but both treat the incoming text stream as a series of rows (separated by line terminators, as defined by BufferedReader.readLine()), with each row consisting of one or more columns. Each text sequencer provides its own mechanism for splitting a row into columns.

The AbstractTextSequencer class provides a number of JavaBean properties that are common to both of the concrete text sequencer classes:

Table 30.1. AbstractTextSequencer properties

Property	Description
commentMarker	Optional property that, if set, indicates that any line beginning with exactly this string should be treated as a comment and should not be processed further. If this value is null, then all lines will be sequenced. The default value for this property is null.
maximumLinesToRead	Optional property that, if set, limits the number of lines that will be read during sequencing. Additional lines will be ignored. If this value is non-positive, all lines will be read and sequenced. Comment lines are not counted towards this total. The default value of this property is -1 (indicating that all lines should be read and sequenced).
rowFactoryClassName	Optional property that, if set, provides the name of a class that provides a custom implementation of the RowFactory interface. This class must have a no-argument, public constructor. If set, an instance of this class will be created each time that the sequencer sequences an input stream and will be used to provide the output structure of the graph. If this property is set to null, a default implementation will be used. The default value of this property is null.
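
The interaction between commentMarker and maximumLinesToRead can be illustrated with a small standalone sketch. This is not the AbstractTextSequencer implementation; the class name LineFilterSketch and the method selectLines are hypothetical, and the code only mirrors the behavior described in the table (rows delimited as in BufferedReader.readLine(), comment lines skipped and not counted, a non-positive limit meaning "read everything").

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the ModeShape API.
class LineFilterSketch {

    /** Returns the rows that would be handed on for column splitting. */
    static List<String> selectLines(String text, String commentMarker, int maximumLinesToRead) {
        List<String> rows = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new StringReader(text));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // A non-positive limit means "no limit"; otherwise stop once it is reached.
                if (maximumLinesToRead > 0 && rows.size() >= maximumLinesToRead) break;
                // Comment lines are skipped and do not count toward the limit.
                if (commentMarker != null && line.startsWith(commentMarker)) continue;
                rows.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        String text = "# header comment\nalpha,1\r\nbeta,2\n# another comment\ngamma,3\n";
        System.out.println(selectLines(text, "#", 2));   // prints [alpha,1, beta,2]
        System.out.println(selectLines(text, null, -1)); // all five lines, comments included
    }
}
```

With the marker set to "#" and a limit of 2, the two comment lines are ignored and only the first two data rows are returned.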

The default row factory creates one node in the output location for each row sequenced from the source and adds each column within the row as a child node of the row node. The output graph takes the following form (all nodes have primary type nt:unstructured):

 <graph root>
     + text:row[1]
     |   + text:column[1] (jcr:mixinTypes = text:column, text:data = <column1 data>)
     |   + ...
     |   + text:column[n] (jcr:mixinTypes = text:column, text:data = <columnN data>)
     + ...
     + text:row[m]
         + text:column[1] (jcr:mixinTypes = text:column, text:data = <column1 data>)
         + ...
         + text:column[n] (jcr:mixinTypes = text:column, text:data = <columnN data>)

30.1. Delimited Text Sequencer

The DelimitedTextSequencer splits rows into columns based on a regular expression pattern. Although the default pattern is a comma, any regular expression can be provided, allowing for more sophisticated splitting patterns.

The DelimitedTextSequencer class provides an additional JavaBean property to override the default regular expression pattern:

Table 30.2. DelimitedTextSequencer properties

Property	Description
splitPattern	Optional property that, if set, sets the regular expression pattern that is used to split each row into columns. This property may not be set to null and defaults to ",".
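
Because splitPattern is a regular expression rather than a literal delimiter, characters such as "|" or "." need to be escaped to match literally. The following sketch shows the effect using java.util.regex.Pattern directly; it assumes the sequencer's splitting behaves like Pattern.split, which this chapter does not confirm, and the class name SplitPatternSketch and method splitRow are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the ModeShape API.
class SplitPatternSketch {

    /** Splits one row into columns using the given regular expression. */
    static String[] splitRow(String row, String splitPattern) {
        return Pattern.compile(splitPattern).split(row);
    }

    public static void main(String[] args) {
        // The default pattern: a plain comma.
        System.out.println(Arrays.toString(splitRow("a,b,c", ",")));              // prints [a, b, c]

        // A literal pipe must be escaped, because "|" means alternation in a regex.
        System.out.println(Arrays.toString(splitRow("a|b|c", "\\|")));            // prints [a, b, c]

        // Unescaped, "|" matches the empty string, so the row is broken apart
        // between individual characters instead of at the pipes.
        System.out.println(splitRow("a|b|c", "|").length > 3);                    // prints true

        // A richer pattern: semicolons with optional surrounding whitespace.
        System.out.println(Arrays.toString(splitRow("a ; b;c", "\\s*;\\s*")));    // prints [a, b, c]
    }
}
```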

To use this sequencer, include the modeshape-sequencer-text JAR in your application and configure the JcrConfiguration to use this sequencer using something similar to:

JcrConfiguration config = ...

config.sequencer("Delimited Text Sequencer")
      .usingClass("org.modeshape.sequencer.text.DelimitedTextSequencer")
      .loadedFromClasspath()
      .setDescription("Sequences delimited files to extract values")
      .setProperty("splitPattern", "\\|")  // "|" is a regex metacharacter, so it is escaped to split on a literal pipe
      .sequencingFrom("//(*.(txt)[*])/jcr:content[@jcr:data]")
      .andOutputtingTo("/txt/$1");

30.2. Fixed Width Text Sequencer

The FixedWidthTextSequencer splits rows into columns based on predefined positions. The default setting is to have a single column per row. It also provides an additional JavaBean property to override the default start positions for each column.

Table 30.3. FixedWidthTextSequencer properties

Property	Description
columnStartPositions	Optional property that, if set, provides the start position of each column after the first. The start positions are concatenated into a single, comma-delimited string. The default value is the empty string (implying that each row should be treated as a single column). This property may not be set to null. There is an implicit column start position of 0 that never needs to be specified.
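
To make the meaning of columnStartPositions concrete, the sketch below splits a row at the given start positions, with the implicit first position of 0. It is an illustration only: the class name FixedWidthSketch and the method splitRow are hypothetical, the positions are assumed to be in ascending order, and the handling of rows shorter than a start position (fewer columns are returned) is an assumption rather than documented FixedWidthTextSequencer behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the ModeShape API.
class FixedWidthSketch {

    /** Splits a row into columns at the comma-delimited start positions. */
    static List<String> splitRow(String row, String columnStartPositions) {
        List<Integer> starts = new ArrayList<Integer>();
        starts.add(0); // implicit start position of the first column
        for (String token : columnStartPositions.split(",")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) starts.add(Integer.parseInt(trimmed));
        }

        List<String> columns = new ArrayList<String>();
        for (int i = 0; i < starts.size(); i++) {
            int start = starts.get(i);
            // Assumption: a row that ends before this column starts yields fewer columns.
            if (start >= row.length()) break;
            // Each column runs up to the next start position, or to the end of the row.
            int end = (i + 1 < starts.size()) ? Math.min(starts.get(i + 1), row.length()) : row.length();
            columns.add(row.substring(start, end));
        }
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(splitRow("abcdefghijklmnopqrst", "3,6,15")); // prints [abc, def, ghijklmno, pqrst]
        System.out.println(splitRow("abcdefghijklmnopqrst", ""));       // prints [abcdefghijklmnopqrst]
    }
}
```

With the value "3,6,15", each row is divided into four columns covering character offsets 0-2, 3-5, 6-14, and 15 to the end of the row; with the default empty string, the whole row is a single column.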

To use this sequencer, include the modeshape-sequencer-text JAR in your application and configure the JcrConfiguration to use this sequencer using something similar to:

JcrConfiguration config = ...

config.sequencer("Fixed Width Text Sequencer")
      .usingClass("org.modeshape.sequencer.text.FixedWidthTextSequencer")
      .loadedFromClasspath()
      .setDescription("Sequences fixed width files to extract values")
      .setProperty("columnStartPositions", "3,6,15")
      .sequencingFrom("//(*.(txt)[*])/jcr:content[@jcr:data]")
      .andOutputtingTo("/txt/$1");