Class StreamingPathFilter


  • public class StreamingPathFilter
    extends Object
    Streaming path filter node factory for continuous queries and/or transformations over very large or infinitely long XML input.

    Background
    The W3C XQuery and XPath languages often require the entire input document to be buffered in memory for a query to be executed in its full generality [Background Paper, More Papers]. In other words, XQuery and XPath are hard to stream over very large or infinitely long XML inputs without violating some aspects of the W3C specifications. However, subsets of these languages (or simplified cousins) can easily support streaming.

    In fact, most use cases dealing with very large XML input documents do not require the full forward and backward navigational capabilities of XQuery and XPath across independent element subtrees. Rather those use cases are record oriented, treating element subtrees (i.e. records) independently, individually selecting/projecting/transforming record after record, one record at a time. For example, consider an XML document with one million records, each describing a published book, music album or web server log entry. A query to find the titles of books that have more than three authors looks at each record individually, hence can easily be streamed. Another use case is splitting a document into several sub-documents based on the content of each record.
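    To illustrate why such a record-oriented query is easy to stream, here is a minimal sketch that hand-codes the "more than three authors" query with the JDK's StAX pull parser. It does not use this class or XQuery; the class name RecordStreamSketch, the method name and the element names books, book, title and author are assumptions made for the example. Only the state of the current record (one title and one counter) is held in memory at any time.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Illustration only: a record-at-a-time scan written directly against StAX. */
final class RecordStreamSketch {

    /** Returns the titles of all /books/book records with more than minAuthors authors. */
    static List<String> titlesWithManyAuthors(String xml, int minAuthors) {
        List<String> titles = new ArrayList<>();
        Deque<String> path = new ArrayDeque<>(); // element names from root to current element
        String title = null;                     // state of the current record only
        int authors = 0;
        try {
            XMLStreamReader in = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            while (in.hasNext()) {
                int event = in.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    path.addLast(in.getLocalName());
                    String current = "/" + String.join("/", path);
                    if (current.equals("/books/book")) {
                        title = null;            // a new record starts: reset its state
                        authors = 0;
                    } else if (current.equals("/books/book/author")) {
                        authors++;
                    } else if (current.equals("/books/book/title")) {
                        title = in.getElementText(); // leaves the reader on </title>
                        path.removeLast();           // so pop the title element here
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String current = "/" + String.join("/", path);
                    if (current.equals("/books/book") && title != null && authors > minAuthors) {
                        titles.add(title);       // record complete: evaluate the condition
                    }
                    path.removeLast();
                }
            }
            in.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML input", e);
        }
        return titles;
    }

    public static void main(String[] args) {
        String xml = "<books>"
            + "<book><title>A</title><author>w</author><author>x</author>"
            + "<author>y</author><author>z</author></book>"
            + "<book><title>B</title><author>w</author></book>"
            + "</books>";
        System.out.println(titlesWithManyAuthors(xml, 3)); // prints [A]
    }
}
```

    The sketch hard-codes both the path and the condition. This class separates the two: the location path selects the records, and a StreamingTransform (custom Java code or an XQuery) decides what to do with each one.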

    More interestingly, consider a P2P XML content messaging router, network transducer, transcoder, proxy or message queue that continuously filters, transforms, routes and dispatches messages from infinitely long streams, with the behaviour defined by deeply inspecting rules (i.e. queries) based on content, network parameters or other metadata. This class provides a convenient solution for such common use cases operating on very large or infinitely long XML input. The solution uses a strongly simplified location path language (which is modelled after XPath but not XPath compliant), in combination with a NodeFactory and an optional XQuery. The solution is not necessarily faster than building the full document tree, but it consumes much less main memory.

    Here is how it works: You specify a simple "location path" such as /books/book or /weblogs/_2004/_05/entry. The path may contain wildcards and indicates which elements should be retained. All elements not matching the path are thrown away during parsing. Each retained element is fully built (including its ancestors and descendants) and then made available to the application via a callback to an application-provided StreamingTransform object.

    The StreamingTransform can operate on the fully built element (subtree) in arbitrary ways. For example, it can simply print the element to screen or disk and then forget about it. Or it can add the element (subtree) to the document currently being built by the Builder. In addition, a transform can check conditions such as "does the book have more than three authors?" A transform can also replace the element with a different element or a list of arbitrary generated nodes. For example, if a book has more than three authors, just the book title with an authorCount attribute can be added to the document, instead of the entire book element subtree.

    Typically, simple StreamingTransforms are formulated in custom Java code, whereas complex ones are formulated as an XQuery.

    Streaming Location Path Syntax

     locationPath := {'/'step}...
     step := [prefix':']localName
     prefix := '*' | '' | XMLNamespacePrefix
     localName := '*' | XMLLocalName
     
    A location path consists of zero or more location steps separated by "/". A step consists of an optional XML namespace prefix followed by a local name. The wildcard symbol '*' means: Match anything. An empty prefix ('') means: Match if in no namespace (i.e. null namespace).

    Example legal location steps are:

     book       (Match elements named "book" in no namespace)
     :book      (Match elements named "book" in no namespace)
     bib:book   (Match elements named "book" in "bib" namespace)
     bib:*      (Match elements with any name in "bib" namespace)
     *:book     (Match elements named "book" in any namespace, including no namespace)
     *:*        (Match elements with any name in any namespace, including no namespace)
     :*         (Match elements with any name in no namespace)
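
    The matching rules for a single step can be sketched in a few lines of plain Java. This is an illustration of the grammar above, not the implementation used by this class: the class name LocationStepSketch, the method signature, the use of the empty string for "no namespace" and the exception thrown for an undeclared prefix are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: matches one location step against an element name. */
final class LocationStepSketch {

    /**
     * Returns true if an element with the given namespace URI and local name
     * matches the step. A null or empty namespaceURI means "no namespace".
     * The prefixes map associates each prefix with its namespace URI.
     */
    static boolean matches(String step, Map<String, String> prefixes,
                           String namespaceURI, String localName) {
        String ns = (namespaceURI == null) ? "" : namespaceURI;

        // step := [prefix':']localName ; a missing prefix is treated like an empty one
        int colon = step.indexOf(':');
        String prefix = (colon < 0) ? "" : step.substring(0, colon);
        String name = (colon < 0) ? step : step.substring(colon + 1);

        if (!name.equals("*") && !name.equals(localName)) {
            return false;                 // local name differs and is not a wildcard
        }
        if (prefix.equals("*")) {
            return true;                  // any namespace, including no namespace
        }
        if (prefix.isEmpty()) {
            return ns.isEmpty();          // must be in no namespace
        }
        String uri = prefixes.get(prefix);
        if (uri == null) {
            throw new IllegalArgumentException("Undeclared prefix: " + prefix);
        }
        return uri.equals(ns);            // must be in the namespace bound to the prefix
    }

    public static void main(String[] args) {
        String bib = "http://www.example.org/bookshelve/records";
        Map<String, String> prefixes = new HashMap<>();
        prefixes.put("bib", bib);

        System.out.println(matches("book", prefixes, "", "book"));      // true
        System.out.println(matches(":book", prefixes, bib, "book"));    // false
        System.out.println(matches("bib:*", prefixes, bib, "article")); // true
        System.out.println(matches("*:book", prefixes, bib, "book"));   // true
        System.out.println(matches(":*", prefixes, bib, "book"));       // false
    }
}
```

    Because only the "child" axis exists, matching a whole location path amounts to applying this test to each step in turn, against the element's chain of ancestors starting at the root.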
     
    Obviously, the location path language is quite simplistic, supporting the "child" axis only. For example, axes such as descendant ("//"), ancestors, following, preceding, as well as predicates and other XPath features are not supported. Typically, this does not matter though, because a full XQuery can still be used on each element (subtree) matching the location path, as follows:

    Example Usage
    The following is complete and efficient code for parsing and iterating through millions of "person" records in a database-like XML document, printing all residents of "San Francisco", while never allocating more memory than needed to hold one person element:
     StreamingTransform myTransform = new StreamingTransform() {
         public Nodes transform(Element person) {
             Nodes results = XQueryUtil.xquery(person, "name[../address/city = 'San Francisco']");
             if (results.size() > 0) {
                 System.out.println("name = " + results.get(0).getValue());
             }
             return new Nodes(); // mark current element as subject to garbage collection
         }
     };
    
     // parse document with a filtering Builder
     Builder builder = new Builder(new StreamingPathFilter("/persons/person", null)
         .createNodeFactory(null, myTransform));
     builder.build(new File("/tmp/persons.xml"));
     
    To find the titles of all books that have more than three authors and have 'Monterey' and 'Aquarium' somewhere in the title:
     String path = "/books/book";
     Map prefixes = new HashMap();
     prefixes.put("bib", "http://www.example.org/bookshelve/records");
     prefixes.put("xsd", "http://www.w3.org/2001/XMLSchema");
    
     StreamingTransform myTransform = new StreamingTransform() {
         private Nodes NONE = new Nodes();
    
         // execute XQuery against each element matching location path
         public Nodes transform(Element subtree) {
             Nodes results = XQueryUtil.xquery(subtree,
                "title[matches(., 'Monterey') and matches(., 'Aquarium') and count(../author) > 3]");
    
             for (int i=0; i < results.size(); i++) {
                 // do something useful with query results; here we just print them
                 System.out.println(XOMUtil.toPrettyXML(results.get(i)));
             }
             return NONE; // current subtree becomes subject to garbage collection
             // returning an empty node list removes the current subtree from the document being built.
             // returning new Nodes(subtree) retains the current subtree.
             // returning new Nodes(some other nodes) replaces the current subtree with
             // some other nodes.
             // if you want (SAX) parsing to terminate at this point, simply throw an exception
         }
     };
    
     // parse document with a filtering Builder
     StreamingPathFilter filter = new StreamingPathFilter(path, prefixes);
     Builder builder = new Builder(filter.createNodeFactory(null, myTransform));
     Document doc = builder.build(new File("/tmp/books.xml"));
     System.out.println("doc.size()=" + doc.getRootElement().getChildElements().size());
     System.out.println(XOMUtil.toPrettyXML(doc));
     

    Here is a similar snippet that takes a filtering Builder from a thread-safe pool with an optimized parser configuration:

     ...
     ... same as above
     ...
     final StreamingPathFilter filter = new StreamingPathFilter(path, prefixes);
     BuilderPool pool = new BuilderPool(100, new BuilderFactory() {
         protected Builder newBuilder(XMLReader parser, boolean validate) {
             return new Builder(parser, validate, filter.createNodeFactory(null, myTransform));
         }
       }
     );
    
     Builder builder = pool.getBuilder(false);
     Document doc = builder.build(new File("/tmp/books.xml"));
     System.out.println("doc.size()=" + doc.getRootElement().getChildElements().size());
     
    Applicability
    This class is well suited for a P2P XML content messaging router, network transducer, transcoder, proxy or message queue that continuously filters, transforms, routes and dispatches messages from infinitely long streams.

    However, this class is less suited for classic database-oriented use cases. Here, scalability is limited because the input stream is scanned sequentially, without exploiting the indexing and random access properties typical of (relational) database environments. For such database-oriented use cases, consider using the Saxon SQL extension functions to XQuery, building your own mixed relational/XQuery integration layer, or using a database technology with native XQuery support.

    Version:
    $Revision: 1.63 $, $Date: 2005/08/12 21:26:30 $
    Author:
    whoschek.AT.lbl.DOT.gov, $Author: hoschek3 $
    • Constructor Detail

      • StreamingPathFilter

        public StreamingPathFilter(String locationPath,
                                   Map prefixes)
                            throws StreamingPathFilterException
        Constructs a compiled filter from the given location path and prefix --> namespaceURI map.
        Parameters:
        locationPath - the path expression to compile
        prefixes - a map of prefix --> namespaceURI associations, each of type String --> String.
        Throws:
        StreamingPathFilterException - if the location path has a syntax error
    • Method Detail

      • createNodeFactory

        public nu.xom.NodeFactory createNodeFactory(nu.xom.NodeFactory childFactory,
                                                    StreamingTransform transform)
        Creates and returns a new node factory for this path filter, to be passed to a Builder.

        Like a Builder, the node factory can be reused serially, but is not thread-safe because it is stateful. If you need thread-safety, call this method each time you need a new node factory for a new thread.

        Parameters:
        childFactory - an optional factory to delegate calls to. All calls except makeRootElement(), startMakingElement() and finishMakingElement() are delegated to the child factory. If this parameter is null it defaults to the factory returned by XOMUtil.getIgnoreWhitespaceOnlyTextNodeFactory().
        transform - an application-specific callback called by the returned node factory whenever an element matches the filter's entire location path. May be null, in which case the identity transformation is used, adding the matching element unchanged to the document being built by a Builder.
        Returns:
        a node factory for this path filter