Class StreamingPathFilter
- java.lang.Object
- nux.xom.xquery.StreamingPathFilter
public class StreamingPathFilter extends Object
Streaming path filter node factory for continuous queries and/or transformations over very large or infinitely long XML input.
Background
The W3C XQuery and XPath languages often require the entire input document to be buffered in memory for a query to be executed in its full generality [Background Paper, More Papers]. In other words, XQuery and XPath are hard to stream over very large or infinitely long XML inputs without violating some aspects of the W3C specifications. However, subsets of these languages (or simplified cousins) can easily support streaming.
In fact, most use cases dealing with very large XML input documents do not require the full forward and backward navigational capabilities of XQuery and XPath across independent element subtrees. Rather, those use cases are record oriented, treating element subtrees (i.e. records) independently, individually selecting/projecting/transforming record after record, one record at a time. For example, consider an XML document with one million records, each describing a published book, music album or web server log entry. A query to find the titles of books that have more than three authors looks at each record individually, hence can easily be streamed. Another use case is splitting a document into several sub-documents based on the content of each record.
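To make the record-at-a-time idea concrete, the following is a minimal sketch that uses only the JDK's StAX API (javax.xml.stream), not this class. It assumes a hypothetical /books/book layout in which each book has one text-only title child and zero or more author children; the names RecordStreamSketch and titlesWithManyAuthors are invented for this illustration.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Illustrative only: record-at-a-time scan with the JDK's StAX parser. */
final class RecordStreamSketch {

    /** Returns the titles of /books/book records that have more than minAuthors authors. */
    static List<String> titlesWithManyAuthors(Reader xml, int minAuthors) {
        List<String> titles = new ArrayList<>();
        StringBuilder path = new StringBuilder(); // path of the current element, e.g. "/books/book"
        int authors = 0;                          // per-record state, reset at each <book>
        String title = null;
        try {
            XMLStreamReader in = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            while (in.hasNext()) {
                int event = in.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    path.append('/').append(in.getLocalName());
                    String here = path.toString();
                    if (here.equals("/books/book")) {
                        authors = 0;
                        title = null;
                    } else if (here.equals("/books/book/author")) {
                        authors++;
                    } else if (here.equals("/books/book/title")) {
                        // Assumes text-only content; reads through the matching end tag.
                        title = in.getElementText();
                        pop(path);
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (path.toString().equals("/books/book")
                            && authors > minAuthors && title != null) {
                        titles.add(title);
                    }
                    pop(path); // nothing from the finished element is kept
                }
            }
            in.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("malformed XML input", e);
        }
        return titles;
    }

    /** Removes the last "/name" segment from the path. */
    private static void pop(StringBuilder path) {
        path.setLength(path.lastIndexOf("/"));
    }

    public static void main(String[] args) {
        String xml = "<books>"
            + "<book><title>Solo</title><author>a</author></book>"
            + "<book><title>Crowd</title><author>a</author><author>b</author>"
            + "<author>c</author><author>d</author></book>"
            + "</books>";
        System.out.println(titlesWithManyAuthors(new StringReader(xml), 3));
    }
}
```

Only a counter and one title string are held per record, so memory use does not grow with the number of records. The hand-written state tracking is the price of this approach, and it becomes unwieldy for richer per-record queries.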
More interestingly, consider a P2P XML content messaging router, network transducer, transcoder, proxy or message queue that continuously filters, transforms, routes and dispatches messages from infinitely long streams, with the behaviour defined by deeply inspecting rules (i.e. queries) based on content, network parameters or other metadata.
This class provides a convenient solution for such common use cases operating on very large or infinitely long XML input. The solution uses a strongly simplified location path language (which is modelled after XPath but is not XPath compliant), in combination with a NodeFactory and an optional XQuery. The solution is not necessarily faster than building the full document tree, but it consumes much less main memory.
Here is how it works
You specify a simple "location path" such as /books/book or /weblogs/_2004/_05/entry. The path may contain wildcards and indicates which elements should be retained. All elements not matching the path will be thrown away during parsing. Each retained element is fully built (including its ancestors and descendants) and then made available to the application via a callback to an application-provided StreamingTransform object.
The StreamingTransform can operate on the fully built element (subtree) in arbitrary ways. For example, it can simply print the element to screen or disk and then forget about it. Or it can add the element (subtree) to the document currently being built by the Builder. In addition, a transform can check conditions such as "does the book have more than three authors?" A transform can also replace the element with a different element or a list of arbitrary generated nodes. For example, if a book has more than three authors, just the book title with an authorCount attribute can be added to the document, instead of the entire book element subtree.
Typically, simple StreamingTransforms are formulated in custom Java code, whereas complex ones are formulated as an XQuery.
Streaming Location Path Syntax
    locationPath := {'/'step}...
    step         := [prefix':']localName
    prefix       := '*' | '' | XMLNamespacePrefix
    localName    := '*' | XMLLocalName
A location path consists of zero or more location steps separated by "/". A step consists of an optional XML namespace prefix followed by a local name. The wildcard symbol '*' means: Match anything. An empty prefix ('') means: Match if in no namespace (i.e. null namespace).
Example legal location steps are:
    book      (Match elements named "book" in no namespace)
    :book     (Match elements named "book" in no namespace)
    bib:book  (Match elements named "book" in "bib" namespace)
    bib:*     (Match elements with any name in "bib" namespace)
    *:book    (Match elements named "book" in any namespace, including no namespace)
    *:*       (Match elements with any name in any namespace, including no namespace)
    :*        (Match elements with any name in no namespace)
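The step semantics listed above can be sketched in a few lines of plain Java. This is an illustrative stand-in, not the library's implementation: the class LocationPathSketch, its matches method, and the choice to reject undeclared prefixes are all assumptions made for this example. Elements are represented as {namespaceURI, localName} pairs from the root down, with "" standing for no namespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical matcher for the location path grammar; for illustration only. */
final class LocationPathSketch {
    // One entry per step. null means wildcard; "" in uris means "no namespace".
    private final List<String> uris = new ArrayList<>();
    private final List<String> locals = new ArrayList<>();

    LocationPathSketch(String path, Map<String, String> prefixes) {
        if (path.isEmpty()) return; // zero steps is legal per the grammar
        if (!path.startsWith("/")) {
            throw new IllegalArgumentException("path must start with '/': " + path);
        }
        // limit -1 keeps trailing empty steps so that "/books/" is rejected below
        for (String step : path.substring(1).split("/", -1)) {
            int colon = step.indexOf(':');
            String prefix = colon < 0 ? "" : step.substring(0, colon);
            String local = step.substring(colon + 1);
            if (local.isEmpty()) {
                throw new IllegalArgumentException("missing local name in: " + path);
            }
            String uri;
            if (prefix.equals("*")) {
                uri = null;                 // any namespace, including none
            } else if (prefix.isEmpty()) {
                uri = "";                   // no namespace
            } else {
                uri = prefixes == null ? null : prefixes.get(prefix);
                if (uri == null) {          // assumption: undeclared prefixes are errors
                    throw new IllegalArgumentException("undeclared prefix: " + prefix);
                }
            }
            uris.add(uri);
            locals.add(local.equals("*") ? null : local);
        }
    }

    /** True if the element path matches every step; only the child axis is modelled. */
    boolean matches(List<String[]> elementPath) {
        if (elementPath.size() != uris.size()) return false;
        for (int i = 0; i < uris.size(); i++) {
            String uri = elementPath.get(i)[0];
            String local = elementPath.get(i)[1];
            if (uris.get(i) != null && !uris.get(i).equals(uri)) return false;
            if (locals.get(i) != null && !locals.get(i).equals(local)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> prefixes = new HashMap<>();
        prefixes.put("bib", "urn:example:bib"); // hypothetical namespace URI
        LocationPathSketch filter = new LocationPathSketch("/books/bib:*", prefixes);
        List<String[]> element = Arrays.asList(
            new String[] {"", "books"},
            new String[] {"urn:example:bib", "book"});
        System.out.println(filter.matches(element));
    }
}
```

Because only the child axis exists, matching reduces to comparing the current element's ancestor chain with the step list position by position, which is what makes the filter cheap to evaluate during parsing.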
Obviously, the location path language is quite simplistic, supporting the "child" axis only. For example, axes such as descendant ("//"), ancestors, following, preceding, as well as predicates and other XPath features are not supported. Typically, this does not matter though, because a full XQuery can still be used on each element (subtree) matching the location path, as follows:
Example Usage
The following is complete and efficient code for parsing and iterating through millions of "person" records in a database-like XML document, printing all residents of "San Francisco", while never allocating more memory than needed to hold one person element:

    StreamingTransform myTransform = new StreamingTransform() {
        public Nodes transform(Element person) {
            Nodes results = XQueryUtil.xquery(person, "name[../address/city = 'San Francisco']");
            if (results.size() > 0) {
                System.out.println("name = " + results.get(0).getValue());
            }
            return new Nodes(); // mark current element as subject to garbage collection
        }
    };

    // parse document with a filtering Builder
    Builder builder = new Builder(new StreamingPathFilter("/persons/person", null)
        .createNodeFactory(null, myTransform));
    builder.build(new File("/tmp/persons.xml"));
To find the titles of all books that have more than three authors and have 'Monterey' and 'Aquarium' somewhere in the title:

    String path = "/books/book";
    Map prefixes = new HashMap();
    prefixes.put("bib", "http://www.example.org/bookshelve/records");
    prefixes.put("xsd", "http://www.w3.org/2001/XMLSchema");

    StreamingTransform myTransform = new StreamingTransform() {
        private Nodes NONE = new Nodes();

        // execute XQuery against each element matching location path
        public Nodes transform(Element subtree) {
            Nodes results = XQueryUtil.xquery(subtree,
                "title[matches(., 'Monterey') and matches(., 'Aquarium') and count(../author) > 3]");

            for (int i = 0; i < results.size(); i++) {
                // do something useful with query results; here we just print them
                System.out.println(XOMUtil.toPrettyXML(results.get(i)));
            }

            // returning empty node list removes current subtree from document being built.
            // returning new Nodes(subtree) retains the current subtree.
            // returning new Nodes(some other nodes) replaces the current subtree with
            // some other nodes.
            // if you want (SAX) parsing to terminate at this point, simply throw an exception
            return NONE; // current subtree becomes subject to garbage collection
        }
    };

    // parse document with a filtering Builder
    StreamingPathFilter filter = new StreamingPathFilter(path, prefixes);
    Builder builder = new Builder(filter.createNodeFactory(null, myTransform));
    Document doc = builder.build(new File("/tmp/books.xml"));
    System.out.println("doc.size()=" + doc.getRootElement().getChildElements().size());
    System.out.println(XOMUtil.toPrettyXML(doc));
Here is a similar version of the snippet that takes a filtering Builder from a thread-safe pool with optimized parser configuration:

    ... same as above ...
    // note: myTransform must be final (or effectively final) to be used in the inner class
    final StreamingPathFilter filter = new StreamingPathFilter(path, prefixes);

    BuilderPool pool = new BuilderPool(100, new BuilderFactory() {
        protected Builder newBuilder(XMLReader parser, boolean validate) {
            return new Builder(parser, validate, filter.createNodeFactory(null, myTransform));
        }
    });

    Builder builder = pool.getBuilder(false);
    Document doc = builder.build(new File("/tmp/books.xml"));
    System.out.println("doc.size()=" + doc.getRootElement().getChildElements().size());
Applicability
This class is well suited for a P2P XML content messaging router, network transducer, transcoder, proxy or message queue that continuously filters, transforms, routes and dispatches messages from infinitely long streams.
However, this class is less suited for classic database oriented use cases. Here, scalability is limited because the input stream is scanned sequentially, without exploiting the indexing and random access properties typical of (relational) database environments. For such database oriented use cases, consider using the Saxon SQL extension functions to XQuery, building your own mixed relational/XQuery integration layer, or using a database technology with native XQuery support.
- Version:
- $Revision: 1.63 $, $Date: 2005/08/12 21:26:30 $
- Author:
- whoschek.AT.lbl.DOT.gov, $Author: hoschek3 $
Constructor Summary
Constructor: StreamingPathFilter(String locationPath, Map prefixes)
Description: Constructs a compiled filter from the given location path and prefix --> namespaceURI map.
Method Summary
Modifier and Type: nu.xom.NodeFactory
Method: createNodeFactory(nu.xom.NodeFactory childFactory, StreamingTransform transform)
Description: Creates and returns a new node factory for this path filter, to be passed to a Builder.
Constructor Detail
StreamingPathFilter
public StreamingPathFilter(String locationPath, Map prefixes) throws StreamingPathFilterException
Constructs a compiled filter from the given location path and prefix --> namespaceURI map.
- Parameters:
- locationPath - the path expression to compile
- prefixes - a map of prefix --> namespaceURI associations, each of type String --> String.
- Throws:
- StreamingPathFilterException - if the location path has a syntax error
Method Detail
createNodeFactory
public nu.xom.NodeFactory createNodeFactory(nu.xom.NodeFactory childFactory, StreamingTransform transform)
Creates and returns a new node factory for this path filter, to be passed to a Builder.
Like a Builder, the node factory can be reused serially, but is not thread-safe because it is stateful. If you need thread-safety, call this method each time you need a new node factory for a new thread.
- Parameters:
- childFactory - an optional factory to delegate calls to. All calls except makeRootElement(), startMakingElement() and finishMakingElement() are delegated to the child factory. If this parameter is null it defaults to the factory returned by XOMUtil.getIgnoreWhitespaceOnlyTextNodeFactory().
- transform - an application-specific callback called by the returned node factory whenever an element matches the filter's entire location path. May be null, in which case the identity transformation is used, adding the matching element unchanged and "as is" to the document being built by a Builder.
- Returns:
- a node factory for this path filter