ModeShape 5

Query and search

The JCR API defines a way to query a repository for content that meets user-defined criteria. The JCR 2.0 API actually makes it possible for implementations to support multiple query languages, and the specification requires support for two languages: JCR-SQL2 and JCR-QOM. JCR 1.0 defined two other languages (XPath and JCR-SQL), though these languages were deprecated in JCR 2.0.

Choosing a query language

At this time, ModeShape supports five query languages:

  • JCR-SQL2

  • JCR-SQL

  • XPath

  • JCR-JQOM (programmatic API)

  • full-text search (a language that reuses the full-text search expression grammar used in the second parameter of the CONTAINS(...) function of the JCR-SQL2 language)

So which language should you choose?

You might think you should pick one based upon how quickly it can be executed, but that's not the case with ModeShape. As we'll see below, ModeShape plans, optimizes, and executes queries in all languages in exactly the same way. The only part of querying that is language specific is the parsing of your query expression.

With ModeShape, the best reason to pick one language over another is based upon how easy it is to express your query. Sure, the JCR-SQL2 language is by far the most expressive and most powerful. But the path oriented criteria of some queries might be more easily and simply expressed using XPath. Or if all you're doing is finding nodes that satisfy some full-text search criteria, then ModeShape's full-text search language might be the best match.

Not all JCR implementations execute their queries in the same way. Some (including Jackrabbit) have completely different execution paths for different languages, meaning queries in some languages are simply faster than equivalent queries expressed in other languages.

Speed and performance

ModeShape will always be able to execute a valid and properly-constructed query. Without indexes to help, however, the repository must scan the whole workspace and apply the query's criteria to every node. As you might expect, scanning the entire workspace can be very inefficient and slow, though the query will still produce the correct results.

Index only what you need

If you want to improve the performance of your queries, you need to define indexes that the engine can use to more quickly find nodes that satisfy your criteria. We'll see later on how to define indexes and how you can tell which indexes (if any) ModeShape is using for your queries.

This is similar to how one designs a relational database. After you define your table structure, you have to specify the indexes that the database will maintain. If a query can't use any of the indexes, then the database will perform a table scan. Adding a useful index, on the other hand, can dramatically improve query response times.
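The difference between a scan and an index lookup can be sketched in a few lines of plain Java. This is only an illustration of the idea, not ModeShape's index implementation: the class name, the path-to-title map standing in for a workspace, and the helper methods are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not how ModeShape stores or reads indexes. */
final class ScanVersusIndex {

    /** Full scan: every node (path -> title) is visited and tested against the criterion. */
    static List<String> scan(Map<String, String> workspace, String title) {
        List<String> paths = new ArrayList<>();
        for (Map.Entry<String, String> node : workspace.entrySet()) {
            if (node.getValue().equals(title)) paths.add(node.getKey());
        }
        return paths;
    }

    /** Builds a value index once: title -> paths of the nodes with that title. */
    static Map<String, List<String>> buildTitleIndex(Map<String, String> workspace) {
        Map<String, List<String>> index = new HashMap<>();
        for (Map.Entry<String, String> node : workspace.entrySet()) {
            index.computeIfAbsent(node.getValue(), k -> new ArrayList<>()).add(node.getKey());
        }
        return index;
    }

    /** Index lookup: one map access instead of a pass over all nodes. */
    static List<String> lookup(Map<String, List<String>> index, String title) {
        return index.getOrDefault(title, List.of());
    }

    public static void main(String[] args) {
        Map<String, String> workspace = new LinkedHashMap<>();
        workspace.put("/a", "The Title");
        workspace.put("/b", "Other");
        workspace.put("/c", "The Title");
        System.out.println(scan(workspace, "The Title"));
        System.out.println(lookup(buildTitleIndex(workspace), "The Title"));
    }
}
```

Both approaches return the same paths; the index simply pays its cost when content is saved rather than when it is queried, which is why it only makes sense to maintain the indexes your queries actually use.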

By the way, ModeShape 2.x and 3.x had automatic indexing, which meant that all of the content in the repository was indexed. This takes a considerable amount of time, and most of that time is wasted since most of those indexes would never be used in your queries. With ModeShape 4.0 we've changed to having you define only those indexes that make sense for your queries. This eliminates a lot of work that the repository used to do upon every save, and makes the whole repository faster.

Update indexes synchronously or asynchronously

With ModeShape 4.0 (Beta2) you can also choose for each index whether it should be updated synchronously as part of the Session's save operation, or asynchronously in a separate thread. Essentially, the synchronous indexes are updated sequentially in the same thread as your session's save operation. Synchronous is the default; it's the most consistent and it's similar to how earlier versions worked by default. Use synchronous indexes anytime your application saves content and expects to immediately see those changes reflected in the query results.

If your application can handle (some) queries being "eventually consistent", then you should consider making those indexes asynchronous. For example, if you use a query to get the comments on a blog post, showing the latest comment is not always a requirement. Using asynchronous indexes can help you increase performance of your application by decreasing the amount of work that has to be done during the save operations. Each asynchronous index is updated in its own thread, so on powerful machines it's possible that multiple indexes can be updated concurrently, reducing the overall wall clock time it takes for all those index updates to complete.

When you cluster your repository, the synchronous vs asynchronous behavior really only applies to the local process where the changes are actually made. Updates to all indexes in other processes are always asynchronous. In other words, your save operations do not wait for remote indexes to be updated.

Your repository can use any combination of synchronous and asynchronous indexes, and your choice should be driven largely by your application's needs and expectations.
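As a rough mental model of the two update modes, consider the following sketch. It is not ModeShape code: the class, its two list-backed "indexes", and the save method are hypothetical stand-ins, and the only point is to show one index being updated in the caller's thread while another is updated on its own thread.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical model of synchronous and asynchronous index updates. */
final class IndexUpdateModes {
    final List<String> syncIndex = new CopyOnWriteArrayList<>();
    final List<String> asyncIndex = new CopyOnWriteArrayList<>();

    // One dedicated thread for the asynchronous index, as described above.
    private final ExecutorService asyncIndexThread = Executors.newSingleThreadExecutor();

    /** Returns a future that completes once the asynchronous index has caught up. */
    CompletableFuture<Void> save(String value) {
        syncIndex.add(value); // done in the caller's thread, before save returns
        return CompletableFuture.runAsync(() -> asyncIndex.add(value), asyncIndexThread);
    }

    void shutdown() {
        asyncIndexThread.shutdown();
    }

    public static void main(String[] args) {
        IndexUpdateModes repo = new IndexUpdateModes();
        CompletableFuture<Void> pending = repo.save("comment-1");
        System.out.println("synchronous index updated: " + repo.syncIndex.contains("comment-1"));
        pending.join(); // an eventually consistent reader would not normally wait like this
        System.out.println("asynchronous index updated: " + repo.asyncIndex.contains("comment-1"));
        repo.shutdown();
    }
}
```

A query against the synchronous index sees the new value as soon as save returns. A query against the asynchronous index may briefly miss it, which is the "eventually consistent" trade-off described above.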

Providing room for growth

ModeShape stores indexes via index providers, and 4.0 comes with a simple, fast local index provider that stores indexes on the file-system. This works even in a cluster, where each process running ModeShape will have its own copy of the indexes. This pattern may seem surprising, but it performs quite well, since all indexing and all query execution can be done locally and in-process. We do anticipate adding more providers, and our roadmap includes tasks for adding providers that use Solr and ElasticSearch. Just be aware that some index providers may not support all variations of index definitions. For example, the local index provider does not support text indexes for full-text search.

Get only the results you use

Even when you issue a query that has lots of rows in the result, ModeShape fetches the results in batches. This means that you only have to load those result rows that you use. Of course, any query where the first row depends on the other rows (like ORDER BY) may force ModeShape to fully evaluate the query before returning the first batch. But rest assured that ModeShape never stores more than a few batches in memory or on the heap; the remainder (if needed) are cached off heap to minimize the memory overhead and impact of each query.
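The batching behavior can be pictured with a small lazy iterator. This is a simplified sketch under stated assumptions, not ModeShape's result implementation: rows are plain integers, the batch size is an arbitrary parameter, and the batchesLoaded counter exists only so the example can show how much work was done.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical lazy row source that materializes results one batch at a time. */
final class BatchedRows implements Iterator<Integer> {
    private final int totalRows;
    private final int batchSize;
    private List<Integer> batch = new ArrayList<>();
    private int positionInBatch = 0;
    private int nextRowToLoad = 0;
    int batchesLoaded = 0; // exposed only for demonstration

    BatchedRows(int totalRows, int batchSize) {
        this.totalRows = totalRows;
        this.batchSize = batchSize;
    }

    @Override
    public boolean hasNext() {
        return positionInBatch < batch.size() || nextRowToLoad < totalRows;
    }

    @Override
    public Integer next() {
        if (!hasNext()) throw new NoSuchElementException();
        if (positionInBatch == batch.size()) loadNextBatch();
        return batch.get(positionInBatch++);
    }

    /** Loads the next batch only when the caller has used up the current one. */
    private void loadNextBatch() {
        batch = new ArrayList<>();
        int end = Math.min(nextRowToLoad + batchSize, totalRows);
        while (nextRowToLoad < end) batch.add(nextRowToLoad++);
        positionInBatch = 0;
        batchesLoaded++;
    }

    public static void main(String[] args) {
        BatchedRows rows = new BatchedRows(10, 4);
        for (int i = 0; i < 3; i++) rows.next();
        System.out.println("batches loaded after 3 rows: " + rows.batchesLoaded);
    }
}
```

Reading three rows out of ten with a batch size of four touches a single batch. An ORDER BY would break this laziness, because the first row cannot be known until all rows have been seen.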

How to execute queries

This section outlines the basic steps of using the JCR API to issue queries and process results.

Creating queries

There are two ways to create a JCR Query object. The first is by supplying a query expression and the name of the query language, and this can be done with the standard JCR API:

// Obtain the query manager for the session via the workspace ...
javax.jcr.query.QueryManager queryManager = session.getWorkspace().getQueryManager();

// Create a query object ...
String language = ...  // e.g. javax.jcr.query.Query.JCR_SQL2
String expression = ...
javax.jcr.query.Query query = queryManager.createQuery(expression,language);

Before returning the Query, ModeShape finds a parser for the language given by the language parameter, and uses this parser to create a language-independent object representation of the query. (Note that any grammatical errors in the expression result in an immediate exception.) This object representation is what JCR 2.0 calls the "Query Object Model", or QOM. After parsing, ModeShape embeds the QOM into the Query object.

The second approach for creating a Query object is to programmatically build up the query using the QueryObjectModelFactory. Again, this uses the standard JCR API. Here's a simple example:

// Obtain the query manager for the session via the workspace ...
javax.jcr.query.QueryManager queryManager = session.getWorkspace().getQueryManager();
javax.jcr.query.qom.QueryObjectModelFactory factory = queryManager.getQOMFactory();

// Create the parts of a query object ...
javax.jcr.query.qom.Source selector = factory.selector(...);
javax.jcr.query.qom.Constraint constraints = ...
javax.jcr.query.qom.Column[] columns = ...
javax.jcr.query.qom.Ordering[] orderings = ...
javax.jcr.query.qom.QueryObjectModel model =
    factory.createQuery(selector,constraints,orderings,columns);

// The model is a query ...
javax.jcr.query.Query query = model;

Of course, the QueryObjectModelFactory can create lots of variations of selectors, joins, constraints, and orderings. ModeShape fully supports this style of creating queries, and it even offers some very useful extensions (described below).

Executing queries

As we mentioned above, all ModeShape Query objects contain the object representation of the query, called the query object model. No matter which query language is used or whether the query was created programmatically, ModeShape uses the same kind of model objects to represent every single query.

So when the JCR client executes the query:

javax.jcr.query.Query query = ...

// Execute the query and get the results ...
javax.jcr.query.QueryResult result = query.execute();

ModeShape then takes the query's object model and runs it through a series of steps to plan, validate, optimize, and finally execute the query:

  1. Planning - in this step, ModeShape converts the language-independent query object model into a canonical relational query plan that outlines the various relational operations that need to be performed for this query. The query plan forms a tree, with each leaf node representing an access query against the indexes. However, this plan isn't quite ready to be used.

  2. Validation - not all queries that are well-formed can be executed, so ModeShape then validates the canonical query plan to make sure that all named selectors exist, all named properties exist on the selectors, that all aliases are properly used, and that all identifiers are resolvable. If the query fails validation, an exception is thrown immediately.

  3. Optimization - the canonical plan should mirror the actual query model, but it may not be the most simple or efficient plan. ModeShape runs the canonical plan through a rule-based optimizer to produce an optimum and executable plan. For example, one rule rewrites right outer joins as left outer joins. Another rule looks for identity joins (e.g., ISSAMENODE join criteria or equi-join criteria involving node identifiers), and if possible removes the join altogether (replacing it with additional criteria) or copies criteria on one side of the join to the other. Another rule removes parts of the plan that (based upon criteria) will never return any rows. Yet another rule determines the best algorithm for joining tuples. Overall, there are about a dozen such rules, and all are intended to make the query plans more easily and efficiently executed.

  4. Index planning - in this step, ModeShape figures out for each access query in the plan which of the available indexes apply and which of those indexes should be used. The latter involves looking at the estimated cost, estimated number of rows returned by an index (i.e., the cardinality), and the estimated selectivity of the index (the ratio of rows returned to the total number of rows in the index).

  5. Execution - the optimized plan is then executed: each access query in the plan is issued and the resulting tuples processed and combined to form the result set's tuples. Tuples are processed in batches, often processing only the batches of tuples needed to give your application the rows it asks for. As your application asks for more rows in the next batch, ModeShape automatically and transparently processes the next batch of results. This continues as long as your application asks for rows in the results, or until there are no more results.

Now that you know more about how ModeShape actually works, you can understand why ModeShape can achieve such good query performance regardless of the language you choose to use.
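To make the optimization step a little more concrete, here is a minimal sketch of one of the rules mentioned above: rewriting a right outer join as a left outer join. The Join class and the rewrite method are hypothetical and far simpler than ModeShape's plan nodes; the sketch only shows the shape of a rule that takes a plan fragment and returns an equivalent one.

```java
/** Hypothetical, much-simplified version of a single optimizer rule. */
final class JoinRewriteRule {
    enum JoinType { INNER, LEFT_OUTER, RIGHT_OUTER }

    /** An immutable join between two named selectors. */
    static final class Join {
        final String left;
        final String right;
        final JoinType type;

        Join(String left, String right, JoinType type) {
            this.left = left;
            this.right = right;
            this.type = type;
        }

        @Override
        public String toString() {
            return left + " " + type + " JOIN " + right;
        }
    }

    /** A RIGHT OUTER join is equivalent to a LEFT OUTER join with its inputs swapped. */
    static Join rewrite(Join join) {
        if (join.type != JoinType.RIGHT_OUTER) return join; // rule does not apply
        return new Join(join.right, join.left, JoinType.LEFT_OUTER);
    }

    public static void main(String[] args) {
        Join original = new Join("parent", "child", JoinType.RIGHT_OUTER);
        System.out.println(original + "  =>  " + rewrite(original));
    }
}
```

After such a rule has run, later steps only need to handle left outer joins, which is one reason a rule-based optimizer normalizes plans before choosing join algorithms.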

Query plans

Your applications have access to the optimized query plan that ModeShape computes for every query. To get it you must use the public ModeShape API: cast the javax.jcr.query.Query object to an org.modeshape.jcr.api.query.Query object (which extends javax.jcr.query.Query):

javax.jcr.query.QueryManager queryManager = session.getWorkspace().getQueryManager();
javax.jcr.query.Query query = queryManager.createQuery(...);
org.modeshape.jcr.api.query.Query msQuery = (org.modeshape.jcr.api.query.Query)query;

// Get the query plan without executing it ...
org.modeshape.jcr.api.query.QueryResult result = msQuery.explain();
String plan = result.getQueryPlan();

You can use this technique if you don't necessarily want to execute the query. But, if you've already executed a query, simply cast the javax.jcr.query.QueryResult to the ModeShape-specific org.modeshape.jcr.api.query.QueryResult object (that extends javax.jcr.query.QueryResult):

javax.jcr.query.Query query = queryManager.createQuery(...);
org.modeshape.jcr.api.query.QueryResult result = (org.modeshape.jcr.api.query.QueryResult)query.execute();
String plan = result.getQueryPlan();

Here's a query plan for the query "SELECT * FROM [nt:base]":

Access [nt:base]
  Project [nt:base] <PROJECT_COLUMNS=[[nt:base].[jcr:primaryType], [nt:base].[jcr:mixinTypes], [nt:base].[jcr:path], [nt:base].[jcr:name], [nt:base].[jcr:score], [nt:base].[mode:localName], [nt:base].[mode:depth], [nt:base].[mode:id]], PROJECT_COLUMN_TYPES=[STRING, STRING, STRING, STRING, DOUBLE, STRING, LONG, STRING]>
    Source [nt:base] <SOURCE_NAME=__ALLNODES__, SOURCE_ALIAS=nt:base, SOURCE_COLUMNS=[...]>

Query plans represent a tree of relational operations. This query plan includes the "Access", "Project", and "Source" operations. Note that for simplicity, the SOURCE_COLUMNS=[] in the last line has been truncated. Normally it would include columns for all available property definitions.

ModeShape supports a number of relational operations:

| Operation | Number of child operations | Description |
|---|---|---|
| Project | 1 | Specifies the specific subset of columns to be retrieved. |
| Select | 1 | Filters the rows by applying criteria. |
| Sort | 1 | Sorts rows based upon a subset of columns. Also includes the sort direction for each column and whether to remove duplicates. |
| Source | 0..n | Defines the 'table' from which the rows are obtained. |
| Limit | 1 | Limits the number of rows returned. |
| SetOperation | 2 | Performs a UNION, INTERSECT, or EXCEPT on two child operations. |
| DependentQuery | 2 | Performs two operations, where the left side must be done before the right. |
| Join | 2 | An operation that defines the join type, join criteria, and join strategy. |
| Access | 1 | A portion of the query plan that operates upon a single table (or selector). All descendant operations will apply to that single table, while all ancestors will apply to combinations of tables. |
| Index | 1 | Identifies an index that may be used. |
| DupRemoval | 1 | Removes duplicate tuples based upon combination of primary keys. |

Let's look again at the example query plan:

Access [nt:base]
  Project [nt:base] <PROJECT_COLUMNS=[[nt:base].[jcr:primaryType], [nt:base].[jcr:mixinTypes], [nt:base].[jcr:path], [nt:base].[jcr:name], [nt:base].[jcr:score], [nt:base].[mode:localName], [nt:base].[mode:depth], [nt:base].[mode:id]], PROJECT_COLUMN_TYPES=[STRING, STRING, STRING, STRING, DOUBLE, STRING, LONG, STRING]>
    Source [nt:base] <SOURCE_NAME=__ALLNODES__, SOURCE_ALIAS=nt:base, SOURCE_COLUMNS=[...]>

We can see that the first operation is Access, which means the entire query plan deals with a single "table" (i.e., selector). Below that is the Project operation, which specifies the complete set of columns that will be returned by the access query (or complete query in this case). Because there are no criteria, the operation below that is the Source operation for the "nt:base" table.
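The indented layout of the plan text follows directly from the tree structure: each operation is printed one level deeper than its parent. The following sketch models that with a hypothetical PlanNode class (not a ModeShape type) and reproduces the outline of the plan above, leaving out the attributes in angle brackets.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical plan node used only to show how a plan tree maps to indented text. */
final class PlanNode {
    final String operation;
    final String selector;
    final List<PlanNode> children = new ArrayList<>();

    PlanNode(String operation, String selector) {
        this.operation = operation;
        this.selector = selector;
    }

    /** Adds a child operation and returns this node so calls can be chained. */
    PlanNode add(PlanNode child) {
        children.add(child);
        return this;
    }

    /** Renders this node and its descendants, two spaces of indent per level. */
    String render() {
        StringBuilder sb = new StringBuilder();
        render(sb, 0);
        return sb.toString();
    }

    private void render(StringBuilder sb, int depth) {
        sb.append("  ".repeat(depth)).append(operation)
          .append(" [").append(selector).append("]\n");
        for (PlanNode child : children) child.render(sb, depth + 1);
    }

    public static void main(String[] args) {
        PlanNode plan = new PlanNode("Access", "nt:base")
            .add(new PlanNode("Project", "nt:base")
                .add(new PlanNode("Source", "nt:base")));
        System.out.print(plan.render());
    }
}
```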

Let's look at a slightly more complicated query:

SELECT * FROM [mix:title] WHERE [jcr:title] = 'The Title'

Here's the resulting query plan (again with a simplified SOURCE_COLUMNS=[] section):

SELECT * FROM [mix:title] WHERE [mix:title].[jcr:title] = 'The Title'
 plan -> Access [mix:title]
  Project [mix:title] <PROJECT_COLUMNS=[[mix:title].[jcr:title], [mix:title].[jcr:description], [mix:title].[jcr:path], [mix:title].[jcr:name], [mix:title].[jcr:score], [mix:title].[mode:localName], [mix:title].[mode:depth], [mix:title].[mode:id]], PROJECT_COLUMN_TYPES=[STRING, STRING, STRING, STRING, DOUBLE, STRING, LONG, STRING]>
    Select [mix:title] <SELECT_CRITERIA=[mix:title].[jcr:title] = 'The Title'>
      Select [mix:title] <SELECT_CRITERIA=[mix:title].[jcr:mixinTypes] IN ('mix:title','mode:publishArea')>
        Source [mix:title] <SOURCE_NAME=__ALLNODES__, SOURCE_ALIAS=mix:title, SOURCE_COLUMNS=[...]>
          Index [mix:title] <INDEX_SPECIFICATION=descriptionIndex, provider=local, cost~=100, cardinality~=2, selectivity~=0.6666667, constraints=[[mix:title].[jcr:title] = 'The Title'], INDEX_USED=true>

This is very similar to the previous query plan, except we now have 2 Select operations, each with a different criterion. One of the criteria is

[mix:title].[jcr:title] = 'The Title'

which we can easily see comes directly from our query's WHERE clause. The other Select operation may be a bit surprising:

[mix:title].[jcr:mixinTypes] IN ('mix:title','mode:publishArea')

Where did this come from? It actually originates in the FROM clause of our query. Because we're querying the "mix:title" selector, we know that the results should include all nodes that have this mixin or any subtype of this mixin. ModeShape translates this into an additional criterion. And when this query was executed, our repository only had one other node type that subtypes mix:title, and that's the internal mode:publishArea.
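The expansion of a selector name into a type criterion can be sketched as follows. This is an illustration rather than ModeShape's implementation: the map of direct subtypes stands in for the repository's node type registry, and both method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;

/** Hypothetical sketch of turning a FROM selector into an IN criterion on type names. */
final class TypeCriteria {

    /** Collects the given type plus all of its direct and indirect subtypes. */
    static Set<String> typeAndSubtypes(String type, Map<String, List<String>> directSubtypes) {
        Set<String> result = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(type);
        while (!pending.isEmpty()) {
            String current = pending.remove();
            if (result.add(current)) {
                pending.addAll(directSubtypes.getOrDefault(current, List.of()));
            }
        }
        return result;
    }

    /** Formats the criterion in the same style as the plan output above. */
    static String toCriteria(String selector, String column, Set<String> types) {
        StringJoiner values = new StringJoiner(",", "(", ")");
        for (String type : types) values.add("'" + type + "'");
        return "[" + selector + "].[" + column + "] IN " + values;
    }

    public static void main(String[] args) {
        Map<String, List<String>> subtypes = Map.of("mix:title", List.of("mode:publishArea"));
        Set<String> types = typeAndSubtypes("mix:title", subtypes);
        System.out.println(toCriteria("mix:title", "jcr:mixinTypes", types));
    }
}
```

With a registry in which mode:publishArea is the only subtype of mix:title, the sketch produces the same criterion text that appears in the second Select operation.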

What if we had explicitly defined a number of indexes, and several apply to either of these criteria? In that case, we'd then see one or more Index operations below the Source operation:

SELECT * FROM [mix:title] WHERE [mix:title].[jcr:title] = 'The Title'
 plan -> Access [mix:title]
  Project [mix:title] <PROJECT_COLUMNS=[[mix:title].[jcr:title], [mix:title].[jcr:description], [mix:title].[jcr:path], [mix:title].[jcr:name], [mix:title].[jcr:score], [mix:title].[mode:localName], [mix:title].[mode:depth], [mix:title].[mode:id]], PROJECT_COLUMN_TYPES=[STRING, STRING, STRING, STRING, DOUBLE, STRING, LONG, STRING]>
    Select [mix:title] <SELECT_CRITERIA=[mix:title].[jcr:title] = 'The Title'>
      Select [mix:title] <SELECT_CRITERIA=[mix:title].[jcr:mixinTypes] IN ('mix:title','mode:publishArea')>
        Source [mix:title] <SOURCE_NAME=__ALLNODES__, SOURCE_ALIAS=mix:title, SOURCE_COLUMNS=[...]>
          Index [mix:title] <INDEX_SPECIFICATION=descriptionIndex, provider=local, cost~=100, cardinality~=1, selectivity~=0.00002, constraints=[[mix:title].[jcr:title] = 'The Title'], INDEX_USED=true>

In this query plan, the Index operation tells us that there is a single index called "descriptionIndex" owned by the "local" index provider, and that using this index for this query has an estimated cost of 100 and should return 1 row out of 50000 ("cardinality~=1, selectivity~=0.00002"). The "INDEX_USED=true" attribute signals that the query engine has decided to use it.
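The relationship between cardinality and selectivity is simple arithmetic, shown here as a small helper. The class and method names are hypothetical; the formula is the one given in the index planning step (rows returned divided by the total number of rows in the index).

```java
/** Hypothetical helper showing how the plan's selectivity estimate relates to cardinality. */
final class IndexEstimates {

    /** Selectivity = estimated rows returned / total rows in the index. */
    static double selectivity(long cardinality, long totalRows) {
        if (totalRows <= 0) throw new IllegalArgumentException("totalRows must be positive");
        return (double) cardinality / totalRows;
    }

    public static void main(String[] args) {
        // 1 row out of 50000, as in the plan discussed above
        System.out.println(selectivity(1, 50000));
        // 2 rows out of 3 would give roughly the 0.6666667 seen in the earlier plan
        System.out.println(selectivity(2, 3));
    }
}
```

A lower selectivity means the index narrows the candidate rows more sharply, which generally makes it a better choice when several indexes apply.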

Processing results

The QueryResult object you get back from executing a Query has two ways of getting at the results of the query.

Getting tuples and values

You can get the values for each of the properties selected by the query using the query result's RowIterator. This is similar to Iterator<Row> (with a few more methods), and you use it to iterate over the Row objects in the results.

RowIterator rows = results.getRows();
while ( rows.hasNext() ) {
    Row row = rows.nextRow();
    ...
}

The Row interface makes it easy to get values using the selected property names:

RowIterator rows = results.getRows();
while ( rows.hasNext() ) {
    Row row = rows.nextRow();
    Value v1 = row.getValue("prop1");
    Value v2 = row.getValue("prop2");
    // etc.
}

or more generically by index, allowing you to even loop over the values in each row:

RowIterator rows = results.getRows();
while ( rows.hasNext() ) {
    Row row = rows.nextRow();
    for ( Value value : row.getValues() ) {
        // do something with each value
    }
}

It's even possible to get the Node objects for the row. If your query has just a single selector (e.g., "table" in the FROM clause), then you can simply call getNode():

RowIterator rows = results.getRows();
while ( rows.hasNext() ) {
    Row row = rows.nextRow();
    Node node = row.getNode();
    // do something with the node
}

But if your query includes result columns from multiple selectors, then you must specify the selector name:

String selectorName1 = ... // from the query
String selectorName2 = ... // from the query
RowIterator rows = results.getRows();
while ( rows.hasNext() ) {
    Row row = rows.nextRow();
    Node node1 = row.getNode(selectorName1);
    Node node2 = row.getNode(selectorName2);
    // do something with the nodes
}

If you just need the path instead of the whole Node, you can get that from the Row, too. Here's how you do it with results from a single selector:

RowIterator rows = results.getRows();
while ( rows.hasNext() ) {
    Row row = rows.nextRow();
    String path = row.getPath();
    // do something with the path
}

If your query includes result columns from multiple selectors, then you must specify the selector name:

String selectorName1 = ... // from the query
String selectorName2 = ... // from the query
RowIterator rows = results.getRows();
while ( rows.hasNext() ) {
    Row row = rows.nextRow();
    String path1 = row.getPath(selectorName1);
    String path2 = row.getPath(selectorName2);
    // do something with the paths
}

Using NodeIterator

If your query selects properties from a single selector (e.g., table or node type), and you just want to go through the Node objects in the results, then there's an even simpler way to get them:

NodeIterator nodes = results.getNodes();
while ( nodes.hasNext() ) {
    Node node = nodes.nextNode();
    // do something with the node
}

This is slightly easier than using the RowIterator approach described earlier. Just remember that you can only do this if your query involves a single selector.

ModeShape's public API

ModeShape's public API extends the standard JCR API interfaces and overrides their return types so that the ModeShape-specific extensions are available without casting. For example, if you start with the ModeShape-specific org.modeshape.jcr.api.Repository interface, then using the ModeShape-specific query features is very easy:

// All interfaces are in ModeShape's public API except those explicitly shown as from JCR API ...
javax.jcr.Repository jcrRepository = ...
Repository repo = (Repository)jcrRepository;

// Get a session ...
Session session = repo.login(...);

// Create a query ...
Query query = session.getWorkspace().getQueryManager().createQuery(...);

// Get the query plan without execution ...
QueryResult result = query.explain();
String plan = result.getQueryPlan();

// Or execute the query and get the plan
result = query.execute();
plan = result.getQueryPlan();

Use as much or as little of the ModeShape-specific public API as you want.

SQL and JCR-SQL2 extensions

ModeShape adds several features to its support of the standard JCR-SQL and JCR-SQL2 grammars. These extensions include support for:

  1. Additional join types with FULL OUTER JOIN and CROSS JOIN

  2. UNION, INTERSECT, and EXCEPT set operations. Duplicate rows are removed unless the operator is followed by the ALL keyword.

  3. Removing duplicate rows with SELECT DISTINCT ...

  4. Non-correlated subqueries in the WHERE clause; multiple subqueries can be used in a single query, and they can even be nested

  5. Limit the number of rows returned with LIMIT count

  6. Skip initial rows with OFFSET number

  7. Constrain the depth of a node with DEPTH(selectorName)

  8. Constrain the path of a node with PATH(selectorName)

  9. Constrain the references from a node with REFERENCE(selectorName.property) and REFERENCE(selectorName)

  10. Set criteria to specify multiple criteria values using IN and NOT IN

  11. Ranges of criteria values using BETWEEN lower AND upper and optionally specifying whether to exclude the lower and/or upper values

  12. Use simple arithmetic in criteria and ORDER BY clauses, such as SCORE(type1)*3 + SCORE(type2)

  13. Use pseudo-columns to include the path, score, node name, node local name, node depth, and node identifier in result columns or in criteria

  14. Use NOT LIKE as an operator in comparison criteria. This is equivalent to wrapping a LIKE comparison criteria in a NOT(...) clause.

  15. Use RELIKE as an operator to test LIKE patterns stored in property values against a supplied string parameter.

  16. Setting a custom Locale

More detail of the particular extensions can be found in the JCR-SQL2 grammar.

Simply use these extensions within your JCR-SQL or JCR-SQL2 query expressions strings, and use the standard JCR API to obtain a Query:

// Obtain the query manager for the session via the workspace ...
javax.jcr.query.QueryManager queryManager = session.getWorkspace().getQueryManager();

// Create a query object ...
String language = ...
String expression = ...  // USE THE EXTENSIONS HERE
javax.jcr.query.Query query = queryManager.createQuery(expression,language);

// And use the query ...
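The set operations in the list above follow the usual SQL convention for duplicates. The sketch below models that convention on plain lists of paths; it is not ModeShape code and does not touch a repository. The class and method names are hypothetical, and only the default (duplicate-removing) forms of INTERSECT and EXCEPT are shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of UNION [ALL], INTERSECT, and EXCEPT on rows represented as strings. */
final class SetOperations {

    /** UNION keeps one copy of each row; UNION ALL keeps every row from both sides. */
    static List<String> union(List<String> left, List<String> right, boolean all) {
        List<String> combined = new ArrayList<>(left);
        combined.addAll(right);
        return all ? combined : new ArrayList<>(new LinkedHashSet<>(combined));
    }

    /** INTERSECT: rows present on both sides, without duplicates. */
    static List<String> intersect(List<String> left, List<String> right) {
        Set<String> result = new LinkedHashSet<>(left);
        result.retainAll(right);
        return new ArrayList<>(result);
    }

    /** EXCEPT: rows on the left that do not appear on the right, without duplicates. */
    static List<String> except(List<String> left, List<String> right) {
        Set<String> result = new LinkedHashSet<>(left);
        result.removeAll(right);
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        List<String> left = List.of("/a", "/b", "/b");
        List<String> right = List.of("/b", "/c");
        System.out.println("UNION:     " + union(left, right, false));
        System.out.println("UNION ALL: " + union(left, right, true));
        System.out.println("INTERSECT: " + intersect(left, right));
        System.out.println("EXCEPT:    " + except(left, right));
    }
}
```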

Query Object Model extensions

The extensions in the JCR-SQL and JCR-SQL2 languages can also be used when building queries programmatically using the JCR Query Object Model API. ModeShape defines as part of its public API the org.modeshape.jcr.api.query.qom.QueryObjectModelFactory interface that extends the standard javax.jcr.query.qom.QueryObjectModelFactory interface, and which contains methods providing ways to construct a QOM with the extended features.

Pseudocolumns

ModeShape adds support for several columns that don't really exist on any node types but that are implicit in JCR. You can use these columns anywhere in a query that columns can normally be used.

| Pseudocolumn Name | Description |
|---|---|
| jcr:uuid | The UUID of the node. The value is equivalent to the result of javax.jcr.Node.getUUID(). |
| jcr:path | The path of the node. The value is equivalent to the result of javax.jcr.Node.getPath(). |
| jcr:name | The name of the node, excluding any same-name-sibling index. The value is equivalent to the result of javax.jcr.Node.getName(). |
| mode:localName | The local name of the node. The value is equivalent to jcr:name except without the namespace portion of the name. |
| mode:depth | The depth of the node, which is equal to the number of segments in the path. |
| mode:id | The identifier of the node. The value is equivalent to the result of javax.jcr.Node.getIdentifier(). |
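Two of these pseudocolumns are derived from the name and path, and the derivation can be sketched in plain Java. The helpers below are hypothetical and make simplifying assumptions: names use the prefixed form (such as jcr:content) rather than the expanded namespace form, and paths are absolute and normalized. ModeShape computes the real values internally.

```java
/** Hypothetical helpers that mimic how mode:localName and mode:depth relate to name and path. */
final class PseudoColumns {

    /** mode:localName: the node name without its namespace prefix. */
    static String localName(String name) {
        int colon = name.indexOf(':');
        return colon < 0 ? name : name.substring(colon + 1);
    }

    /** mode:depth: the number of segments in an absolute path; the root path has none. */
    static int depth(String path) {
        if (path.equals("/")) return 0;
        return path.substring(1).split("/").length;
    }

    public static void main(String[] args) {
        System.out.println(localName("jcr:content"));
        System.out.println(depth("/files/report.pdf/jcr:content"));
    }
}
```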

Additional join types

The standard javax.jcr.query.qom.QueryObjectModelFactory interface uses a String to specify the join type:

package javax.jcr.query.qom;

public interface QueryObjectModelFactory {
    ...
    /**
     * Performs a join between two node-tuple sources.
     *
     * The query is invalid if 'left' is the same source as 'right'.
     *
     * @param left the left node-tuple source; non-null
     * @param right the right node-tuple source; non-null
     * @param joinType either QueryObjectModelConstants.JCR_JOIN_TYPE_INNER,
     *        QueryObjectModelConstants.JCR_JOIN_TYPE_LEFT_OUTER, or
     *        QueryObjectModelConstants.JCR_JOIN_TYPE_RIGHT_OUTER.
     * @param joinCondition the join condition; non-null
     * @return the join; non-null
     * @throws InvalidQueryException if a particular validity test is possible on this method,
     *         the implementation chooses to perform that test (and not leave it until later,
     *         on {@link #createQuery}), and the parameters given fail that test
     * @throws RepositoryException if the operation otherwise fails
     */
    public Join join( Source left,
                      Source right,
                      String joinType,
                      JoinCondition joinCondition ) throws InvalidQueryException, RepositoryException;
    ...
}

In addition to the three standard constants, ModeShape defines two more. The full set of supported values is:

  • javax.jcr.query.qom.QueryObjectModelConstants.JCR_JOIN_TYPE_INNER

  • javax.jcr.query.qom.QueryObjectModelConstants.JCR_JOIN_TYPE_LEFT_OUTER

  • javax.jcr.query.qom.QueryObjectModelConstants.JCR_JOIN_TYPE_RIGHT_OUTER

  • org.modeshape.jcr.api.query.qom.QueryObjectModelConstants.JCR_JOIN_TYPE_CROSS

  • org.modeshape.jcr.api.query.qom.QueryObjectModelConstants.JCR_JOIN_TYPE_FULL_OUTER
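For readers less familiar with the two additional join types, the following sketch shows the rows each one produces, using plain lists in place of node-tuple sources. It is a conceptual model only: the class and methods are hypothetical, the join condition is simple equality, and each value is assumed to appear at most once per side.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of CROSS JOIN and FULL OUTER JOIN semantics on lists of keys. */
final class ExtraJoins {

    /** CROSS JOIN: every left row is paired with every right row; there is no join condition. */
    static List<String[]> cross(List<String> left, List<String> right) {
        List<String[]> rows = new ArrayList<>();
        for (String l : left) {
            for (String r : right) rows.add(new String[] {l, r});
        }
        return rows;
    }

    /** FULL OUTER JOIN on equality: matched pairs plus unmatched rows from both sides, padded with null. */
    static List<String[]> fullOuter(List<String> left, List<String> right) {
        List<String[]> rows = new ArrayList<>();
        for (String l : left) {
            rows.add(new String[] {l, right.contains(l) ? l : null});
        }
        for (String r : right) {
            if (!left.contains(r)) rows.add(new String[] {null, r});
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> left = List.of("a", "b");
        List<String> right = List.of("b", "c");
        for (String[] row : cross(left, right)) System.out.println("cross: " + Arrays.toString(row));
        for (String[] row : fullOuter(left, right)) System.out.println("full outer: " + Arrays.toString(row));
    }
}
```

A left or right outer join keeps the unmatched rows of only one side; the full outer join keeps both, which is why it cannot be expressed with the three standard constants alone.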

Set operations with UNION, INTERSECT, and EXCEPT

Creating a set query is very similar to creating a normal SELECT type query, but instead the following methods on org.modeshape.jcr.api.query.qom.QueryObjectModelFactory are used:

package org.modeshape.jcr.api.query.qom;

public interface QueryObjectModelFactory {
    ...
    /**
     * Creates a query with one or more selectors.
     *
     * @param source the node-tuple source; non-null
     * @param constraint the constraint, or null if none
     * @param orderings zero or more orderings; null is equivalent to a zero-length array
     * @param columns the columns; null is equivalent to a zero-length array
     * @param limit the limit; null is equivalent to having no limit
     * @param isDistinct true if the query should return distinct values; or false if no
     *        duplicate removal should be performed
     * @return the select query; non-null
     * @throws InvalidQueryException if a particular validity test is possible on this method,
     *         the implementation chooses to perform that test and the parameters given fail that
     *         test. See the individual QOM factory methods for the validity criteria
     *         of each query element.
     * @throws RepositoryException if another error occurs.
     */
    public SelectQuery select( Source source,
                               Constraint constraint,
                               Ordering[] orderings,
                               Column[] columns,
                               Limit limit,
                               boolean isDistinct ) throws InvalidQueryException, RepositoryException;

    /**
     * Creates a query command that effectively appends the results of the right-hand query
     * to those of the left-hand query.
     *
     * @param left the query command that represents left-side of the set operation;
     *        non-null and must have columns that are equivalent and union-able to those
     *        of the right-side query
     * @param right the query command that represents right-side of the set operation;
     *        non-null and must have columns that are equivalent and union-able to those
     *        of the left-side query
     * @param orderings zero or more orderings; null is equivalent to a zero-length array
     * @param limit the limit; null is equivalent to having no limit
     * @param all true if duplicate rows in the left- and right-hand side results should
     *        be included, or false if duplicate rows should be eliminated
     * @return the select query; non-null
     * @throws InvalidQueryException if a particular validity test is possible on this method,
     *         the implementation chooses to perform that test and the parameters given fail
     *         that test. See the individual QOM factory methods for the validity criteria
     *         of each query element.
     * @throws RepositoryException if another error occurs.
     */
    public SetQuery union( QueryCommand left,
                           QueryCommand right,
                           Ordering[] orderings,
                           Limit limit,
                           boolean all ) throws InvalidQueryException, RepositoryException;

    /**
     * Creates a query command that returns all rows that are both in the result of the
     * left-hand query and in the result of the right-hand query.
     *
     * @param left the query command that represents left-side of the set operation;
     *        non-null and must have columns that are equivalent and union-able to those
     *        of the right-side query
     * @param right the query command that represents right-side of the set operation;
     *        non-null and must have columns that are equivalent and union-able to those
     *        of the left-side query
     * @param orderings zero or more orderings; null is equivalent to a zero-length array
     * @param limit the limit; null is equivalent to having no limit
     * @param all true if duplicate rows in the left- and right-hand side results should
     *        be included, or false if duplicate rows should be eliminated
     * @return the set query; non-null
     * @throws InvalidQueryException if a particular validity test is possible on this method,
     *         the implementation chooses to perform that test and the parameters given fail
     *         that test. See the individual QOM factory methods for the validity criteria
     *         of each query element.
     * @throws RepositoryException if another error occurs.
     */
    public SetQuery intersect( QueryCommand left,
                               QueryCommand right,
                               Ordering[] orderings,
                               Limit limit,
                               boolean all ) throws InvalidQueryException, RepositoryException;

    /**
     * Creates a query command that returns all rows that are in the result of the left-hand
     * query but not in the result of the right-hand query.
     *
     * @param left the query command that represents left-side of the set operation;
     *        non-null and must have columns that are equivalent and union-able to those
     *        of the right-side query
     * @param right the query command that represents right-side of the set operation;
     *        non-null and must have columns that are equivalent and union-able to those
     *        of the left-side query
     * @param orderings zero or more orderings; null is equivalent to a zero-length array
     * @param limit the limit; null is equivalent to having no limit
     * @param all true if duplicate rows in the left- and right-hand side results should
     *        be included, or false if duplicate rows should be eliminated
     * @return the set query; non-null
     * @throws InvalidQueryException if a particular validity test is possible on this method,
     *         the implementation chooses to perform that test and the parameters given fail
     *         that test. See the individual QOM factory methods for the validity criteria
     *         of each query element.
     * @throws RepositoryException if another error occurs.
     */
    public SetQuery except( QueryCommand left,
                            QueryCommand right,
                            Ordering[] orderings,
                            Limit limit,
                            boolean all ) throws InvalidQueryException, RepositoryException;
    ...
}

Note that the select(...) method returns a SelectQuery while the union(...), intersect(...) and except(...) methods return a SetQuery. The SelectQuery and SetQuery interfaces are defined by ModeShape and both extend ModeShape's QueryCommand interface. Because the set operation methods accept QueryCommand arguments, either kind of query can be used as the left-hand or right-hand side of a set operation.

The SetQuery object is not executable. To create the corresponding javax.jcr.Query object, pass the SetQuery to the following method on org.modeshape.jcr.api.query.qom.QueryObjectModelFactory:

package org.modeshape.jcr.api.query.qom;

public interface QueryObjectModelFactory {
    ...
    /**
     * Creates a set query.
     *
     * @param command set query; non-null
     * @return the executable query; non-null
     * @throws InvalidQueryException if a particular validity test is possible on this method,
     *         the implementation chooses to perform that test and the parameters given fail
     *         that test. See the individual QOM factory methods for the validity criteria
     *         of each query element.
     * @throws RepositoryException if another error occurs.
     */
    public SetQueryObjectModel createQuery( SetQuery command ) throws InvalidQueryException, RepositoryException;
    ...
}

The resulting SetQueryObjectModel extends javax.jcr.query.Query and SetQuery and can be executed and treated similarly to the standard javax.jcr.query.qom.QueryObjectModel (which also extends javax.jcr.query.Query).
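To make the effect of the all flag concrete, here is a minimal sketch of the three set operations over plain Java lists. It is an illustration only and not ModeShape code: the class SetOperationSketch and its methods are hypothetical, rows are modeled as arbitrary objects compared with equals(), and intersect and except are shown only for the duplicate-eliminating case (all set to false).

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical model of set-operation semantics; this is not part of the ModeShape API.
final class SetOperationSketch {

    // UNION: append the right-hand rows to the left-hand rows.
    // When 'all' is false, duplicate rows are eliminated (first occurrence wins).
    static <R> List<R> union(List<R> left, List<R> right, boolean all) {
        List<R> combined = new ArrayList<>(left);
        combined.addAll(right);
        if (all) {
            return combined;
        }
        return new ArrayList<>(new LinkedHashSet<>(combined));
    }

    // INTERSECT without duplicates: distinct rows that appear on both sides.
    static <R> List<R> intersect(List<R> left, List<R> right) {
        LinkedHashSet<R> result = new LinkedHashSet<>(left);
        result.retainAll(right);
        return new ArrayList<>(result);
    }

    // EXCEPT without duplicates: distinct left-hand rows that do not appear on the right.
    static <R> List<R> except(List<R> left, List<R> right) {
        LinkedHashSet<R> result = new LinkedHashSet<>(left);
        result.removeAll(right);
        return new ArrayList<>(result);
    }
}
```

With left rows a, b, b, c and right rows b, c, d, this sketch yields a, b, c, d for a duplicate-eliminating union, b, c for the intersection, and a for the difference.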

Non-correlated subqueries

ModeShape defines a Subquery interface that extends the standard javax.jcr.query.qom.StaticOperand interface, and thus can be used on the right-hand side of any Criteria:

public interface Subquery extends StaticOperand {
    /**
     * Gets the {@link QueryCommand} that makes up the subquery.
     *
     * @return the query command; non-null
     */
    public QueryCommand getQuery();
}

Subqueries can be created by passing a QueryCommand into this org.modeshape.jcr.api.query.qom.QueryObjectModelFactory method:

package org.modeshape.jcr.api.query.qom;

public interface QueryObjectModelFactory {
    ...
    /**
     * Creates a subquery that can be used as a {@link StaticOperand} in another query.
     *
     * @param subqueryCommand the query command that is to be used as the subquery
     * @return the constraint; non-null
     * @throws InvalidQueryException if a particular validity test is possible on this method,
     *         the implementation chooses to perform that test (and not leave it until later,
     *         on {@link #createQuery}), and the parameters given fail that test
     * @throws RepositoryException if the operation otherwise fails
     */
    public Subquery subquery( QueryCommand subqueryCommand ) throws InvalidQueryException, RepositoryException;
    ...
}

The resulting Subquery is a StaticOperand that can then be used to create a Criteria.

Removing duplicate rows

The org.modeshape.jcr.api.query.qom.QueryObjectModelFactory interface includes a select(...) method, a variation of the standard QueryObjectModelFactory.createQuery(...) method, with an additional isDistinct flag that controls whether duplicate rows should be removed:

package org.modeshape.jcr.api.query.qom;

public interface QueryObjectModelFactory {
    ...
    /**
     * Creates a query with one or more selectors.
     *
     * @param source the node-tuple source; non-null
     * @param constraint the constraint, or null if none
     * @param orderings zero or more orderings; null is equivalent to a zero-length array
     * @param columns the columns; null is equivalent to a zero-length array
     * @param limit the limit; null is equivalent to having no limit
     * @param isDistinct true if the query should return distinct values; or false if no
     *        duplicate removal should be performed
     * @return the select query; non-null
     * @throws InvalidQueryException if a particular validity test is possible on this method,
     *         the implementation chooses to perform that test and the parameters given fail
     *         that test. See the individual QOM factory methods for the validity criteria
     *         of each query element.
     * @throws RepositoryException if another error occurs.
     */
    public SelectQuery select( Source source,
                               Constraint constraint,
                               Ordering[] orderings,
                               Column[] columns,
                               Limit limit,
                               boolean isDistinct ) throws InvalidQueryException, RepositoryException;
    ...
}

Limit and offset results

ModeShape defines a Limit interface as a top-level object that can be used to create queries that limit the number of rows and/or skip a number of initial rows:

public interface Limit {

    /**
     * Get the number of rows skipped before the results begin.
     *
     * @return the offset; always 0 or a positive number
     */
    public int getOffset();

    /**
     * Get the maximum number of rows that are to be returned.
     *
     * @return the maximum number of rows; always positive, or equal to Integer.MAX_VALUE if there is no limit
     */
    public int getRowLimit();

    /**
     * Determine whether this limit clause is necessary.
     *
     * @return true if the number of rows is not limited and there is no offset, or false otherwise
     */
    public boolean isUnlimited();

    /**
     * Determine whether this limit clause defines an offset.
     *
     * @return true if there is an offset, or false if there is no offset
     */
    public boolean isOffset();
}

Limit objects can be constructed using this org.modeshape.jcr.api.query.qom.QueryObjectModelFactory method:

package org.modeshape.jcr.api.query.qom;

public interface QueryObjectModelFactory {
    ...
    /**
     * Evaluates to a limit on the maximum number of tuples in the results and the
     * number of rows that are skipped before the first tuple in the results.
     *
     * @param rowLimit the maximum number of rows; must be a positive number, or Integer.MAX_VALUE if there is to be a
     *        non-zero offset but no limit
     * @param offset the number of rows to skip before beginning the results; must be 0 or a positive number
     * @return the operand; non-null
     * @throws InvalidQueryException if a particular validity test is possible on this method,
     *         the implementation chooses to perform that test (and not leave it until later, on createQuery),
     *         and the parameters given fail that test
     * @throws RepositoryException if the operation otherwise fails
     */
    public Limit limit( int rowLimit,
                        int offset ) throws InvalidQueryException, RepositoryException;
    ...
}

The Limit objects can then be used when creating queries using the select(...) method, a variation of the standard QueryObjectModelFactory.createQuery(...) method, defined in the org.modeshape.jcr.api.query.qom.QueryObjectModelFactory interface:

package org.modeshape.jcr.api.query.qom;

public interface QueryObjectModelFactory {
    ...
    /**
     * Creates a query with one or more selectors.
     *
     * @param source the node-tuple source; non-null
     * @param constraint the constraint, or null if none
     * @param orderings zero or more orderings; null is equivalent to a zero-length array
     * @param columns the columns; null is equivalent to a zero-length array
     * @param limit the limit; null is equivalent to having no limit
     * @param isDistinct true if the query should return distinct values; or false if no
     *        duplicate removal should be performed
     * @return the select query; non-null
     * @throws InvalidQueryException if a particular validity test is possible on this method,
     *         the implementation chooses to perform that test and the parameters given fail
     *         that test. See the individual QOM factory methods for the validity criteria
     *         of each query element.
     * @throws RepositoryException if another error occurs.
     */
    public SelectQuery select( Source source,
                               Constraint constraint,
                               Ordering[] orderings,
                               Column[] columns,
                               Limit limit,
                               boolean isDistinct ) throws InvalidQueryException, RepositoryException;
    ...
}

Similarly, the Limit objects can be passed to the ModeShape-specific except(...), union(...), intersect(...) methods, too.
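The way a row limit and an offset combine can be sketched in plain Java as follows. This is only a model of the behavior described above, not ModeShape's implementation: LimitSketch and apply(...) are hypothetical names, and the result rows are represented as an ordinary list. As in the limit(...) contract, Integer.MAX_VALUE stands for "no limit".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of applying a Limit to already-computed rows; not ModeShape code.
final class LimitSketch {

    static <R> List<R> apply(List<R> rows, int rowLimit, int offset) {
        if (rowLimit <= 0) {
            throw new IllegalArgumentException("rowLimit must be a positive number");
        }
        if (offset < 0) {
            throw new IllegalArgumentException("offset must be 0 or a positive number");
        }
        // Skip 'offset' rows, but never run past the end of the results.
        int from = Math.min(offset, rows.size());
        // Use long arithmetic so that offset + Integer.MAX_VALUE cannot overflow.
        int to = (int) Math.min((long) from + rowLimit, (long) rows.size());
        return new ArrayList<>(rows.subList(from, to));
    }
}
```

For example, apply(rows, 2, 1) on five rows keeps the second and third rows, while apply(rows, Integer.MAX_VALUE, 3) skips three rows and returns everything that remains.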

Depth constraints

ModeShape defines a NodeDepth interface that extends the standard javax.jcr.query.qom.DynamicOperand interface, and thus can be used as part of a WHERE clause to constrain the depth of the nodes accessed by a selector:

public interface NodeDepth extends javax.jcr.query.qom.DynamicOperand {

    /**
     * Get the selector symbol upon which this operand applies.
     *
     * @return the name of the one selector used by this operand; never null
     */
    public String getSelectorName();
}

These depth operands can be constructed using this org.modeshape.jcr.api.query.qom.QueryObjectModelFactory method:

package org.modeshape.jcr.api.query.qom;

public interface QueryObjectModelFactory {
    ...
    /**
     * Evaluates to a LONG value equal to the depth of a node in the specified selector.
     *
     * The query is invalid if selector is not the name of a selector in the query.
     *
     * @param selectorName the selector name; non-null
     * @return the operand; non-null
     * @throws InvalidQueryException if a particular validity test is possible on this method,
     *         the implementation chooses to perform that test (and not leave it until later, on createQuery),
     *         and the parameters given fail that test
     * @throws RepositoryException if the operation otherwise fails
     */
    public NodeDepth nodeDepth( String selectorName ) throws InvalidQueryException, RepositoryException;
    ...
}
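As a rough illustration of the value this operand evaluates to, the following sketch computes a depth from a normalized absolute path. It assumes the root node has depth 0 (as javax.jcr.Item.getDepth() reports) and that every additional path segment adds 1. NodeDepthSketch and depthOf(...) are hypothetical helpers, not part of ModeShape.

```java
// Hypothetical helper that models node depth from a normalized absolute path.
final class NodeDepthSketch {

    static long depthOf(String absolutePath) {
        if (absolutePath == null || !absolutePath.startsWith("/")) {
            throw new IllegalArgumentException("expected an absolute path: " + absolutePath);
        }
        if (absolutePath.equals("/")) {
            return 0L; // the root node
        }
        // In a normalized path (no trailing slash) each '/' introduces one segment.
        long depth = 0L;
        for (int i = 0; i < absolutePath.length(); i++) {
            if (absolutePath.charAt(i) == '/') {
                depth++;
            }
        }
        return depth;
    }
}
```

Under these assumptions, a constraint such as "depth is at most 2" would accept /a and /a/b but reject /a/b/c.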

Path constraints

ModeShape defines a NodePath interface that extends the standard javax.jcr.query.qom.DynamicOperand interface, and thus can be used as part of a WHERE clause to constrain the path of nodes accessed by a selector:

public interface NodePath extends javax.jcr.query.qom.DynamicOperand {

    /**
     * Get the selector symbol upon which this operand applies.
     *
     * @return the name of the one selector used by this operand; never null
     */
    public String getSelectorName();
}

These path operands can be constructed using this org.modeshape.jcr.api.query.qom.QueryObjectModelFactory method:

package org.modeshape.jcr.api.query.qom;

public interface QueryObjectModelFactory {
    ...
    /**
     * Evaluates to a PATH value equal to the prefix-qualified path of a node in the specified selector.
     *
     * The query is invalid if selector is not the name of a selector in the query.
     *
     * @param selectorName the selector name; non-null
     * @return the operand; non-null
     * @throws InvalidQueryException if a particular validity test is possible on this method,
     *         the implementation chooses to perform that test (and not leave it until later, on createQuery),
     *         and the parameters given fail that test
     * @throws RepositoryException if the operation otherwise fails
     */
    public NodePath nodePath( String selectorName ) throws InvalidQueryException, RepositoryException;
    ...
}

Criteria on references from a node

ModeShape defines a ReferenceValue interface that extends the standard javax.jcr.query.qom.DynamicOperand interface, and thus can be used as part of a WHERE or ORDER BY clause:

public interface ReferenceValue extends DynamicOperand {
    ...
    /**
     * Get the selector symbol upon which this operand applies.
     *
     * @return the name of the one selector used by this operand; never null
     */
    public String getSelectorName();

    /**
     * Get the name of the one reference property.
     *
     * @return the property name; or null if this operand applies to any reference property
     */
    public String getPropertyName();
}

These reference value operands allow a query to easily place constraints on a particular REFERENCE property or (more importantly) any REFERENCE properties on the nodes. The former is a simpler alternative to using a regular comparison constraint with the REFERENCE property on one side and the "jcr:uuid" property on the other. The latter effectively means "where the node references (with any property) some other nodes", and this is something that standard JCR-SQL2 cannot represent.

They are created using these org.modeshape.jcr.api.query.qom.QueryObjectModelFactory methods:

package org.modeshape.jcr.api.query.qom;

public interface QueryObjectModelFactory {
    ...
    /**
     * Creates a dynamic operand that evaluates to the REFERENCE value of any property
     * on the specified selector.
     *
     * The query is invalid if:
     * - selector is not the name of a selector in the query, or
     * - property is not a syntactically valid JCR name.
     *
     * @param selectorName the selector name; non-null
     * @return the operand; non-null
     * @throws InvalidQueryException if a particular validity test is possible on this method,
     *         the implementation chooses to perform that test (and not leave it until later, on createQuery),
     *         and the parameters given fail that test
     * @throws RepositoryException if the operation otherwise fails
     */
    public ReferenceValue referenceValue( String selectorName ) throws InvalidQueryException, RepositoryException;

    /**
     * Creates a dynamic operand that evaluates to the REFERENCE value of the specified
     * property on the specified selector.
     *
     * The query is invalid if:
     * - selector is not the name of a selector in the query, or
     * - property is not a syntactically valid JCR name.
     *
     * @param selectorName the selector name; non-null
     * @param propertyName the reference property name; non-null
     * @return the operand; non-null
     * @throws InvalidQueryException if a particular validity test is possible on this method,
     *         the implementation chooses to perform that test (and not leave it until later, on createQuery),
     *         and the parameters given fail that test
     * @throws RepositoryException if the operation otherwise fails
     */
    public ReferenceValue referenceValue( String selectorName,
                                          String propertyName ) throws InvalidQueryException, RepositoryException;
    ...
}

Range criteria with BETWEEN

ModeShape defines a Between interface that extends the standard javax.jcr.query.qom.Constraint interface, and thus can be used as part of a WHERE clause:

public interface Between extends Constraint {

    /**
     * Get the dynamic operand specification.
     *
     * @return the dynamic operand; never null
     */
    public DynamicOperand getOperand();

    /**
     * Get the lower bound operand.
     *
     * @return the lower bound; never null
     */
    public StaticOperand getLowerBound();

    /**
     * Get the upper bound operand.
     *
     * @return the upper bound; never null
     */
    public StaticOperand getUpperBound();

    /**
     * Return whether the lower bound is to be included in the results.
     *
     * @return true if the {@link #getLowerBound() lower bound} is to be included, or false otherwise
     */
    public boolean isLowerBoundIncluded();

    /**
     * Return whether the upper bound is to be included in the results.
     *
     * @return true if the {@link #getUpperBound() upper bound} is to be included, or false otherwise
     */
    public boolean isUpperBoundIncluded();
}

These range constraints can be constructed using this org.modeshape.jcr.api.query.qom.QueryObjectModelFactory method:

package org.modeshape.jcr.api.query.qom;

public interface QueryObjectModelFactory {
    ...
    /**
     * Tests that the value (or values) defined by the supplied dynamic operand are
     * within a specified range. The range is specified by a lower and upper bound,
     * and whether each of the boundary values is included in the range.
     *
     * @param operand the dynamic operand describing the values that are to be constrained
     * @param lowerBound the lower bound of the range
     * @param upperBound the upper bound of the range
     * @param includeLowerBound true if the lower boundary value is to be included
     * @param includeUpperBound true if the upper boundary value is to be included
     * @return the constraint; non-null
     * @throws InvalidQueryException if a particular validity test is possible on this method,
     *         the implementation chooses to perform that test (and not leave it until later, on createQuery),
     *         and the parameters given fail that test
     * @throws RepositoryException if the operation otherwise fails
     */
    public Between between( DynamicOperand operand,
                            StaticOperand lowerBound,
                            StaticOperand upperBound,
                            boolean includeLowerBound,
                            boolean includeUpperBound ) throws InvalidQueryException, RepositoryException;
    ...
}

To create a NOT BETWEEN ... criteria, simply create the Between criteria object, and then pass that into the standard QueryObjectModelFactory.not(Constraint) method.
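The role of the two inclusion flags can be illustrated with a small, self-contained sketch. It models the range test on ordinary Comparable values and treats a true flag as "the boundary value itself is part of the range", matching isLowerBoundIncluded() and isUpperBoundIncluded() on the Between interface. BetweenSketch is a hypothetical class, not ModeShape code.

```java
// Hypothetical model of BETWEEN and NOT BETWEEN on Comparable values; not ModeShape code.
final class BetweenSketch {

    static <T extends Comparable<T>> boolean between(T value,
                                                     T lowerBound,
                                                     T upperBound,
                                                     boolean includeLowerBound,
                                                     boolean includeUpperBound) {
        int vsLower = value.compareTo(lowerBound);
        int vsUpper = value.compareTo(upperBound);
        // An included bound admits equality; an excluded bound requires a strict comparison.
        boolean aboveLower = includeLowerBound ? vsLower >= 0 : vsLower > 0;
        boolean belowUpper = includeUpperBound ? vsUpper <= 0 : vsUpper < 0;
        return aboveLower && belowUpper;
    }

    // NOT BETWEEN is simply the negation of the range test.
    static <T extends Comparable<T>> boolean notBetween(T value,
                                                        T lowerBound,
                                                        T upperBound,
                                                        boolean includeLowerBound,
                                                        boolean includeUpperBound) {
        return !between(value, lowerBound, upperBound, includeLowerBound, includeUpperBound);
    }
}
```

In this model the value 5 lies in the range from 5 to 10 only when the lower bound is included.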

Set criteria with IN and NOT IN

ModeShape defines a SetCriteria interface that extends the standard javax.jcr.query.qom.Constraint interface, and thus can be used as part of a WHERE clause:

public interface SetCriteria extends Constraint {

    /**
     * Get the dynamic operand specification for the left-hand side of the set criteria.
     *
     * @return the dynamic operand; never null
     */
    public DynamicOperand getOperand();

    /**
     * Get the static operands for this set criteria.
     *
     * @return the static operand; never null and never empty
     */
    public Collection<? extends StaticOperand> getValues();
}

These set constraints can be constructed using this org.modeshape.jcr.api.query.qom.QueryObjectModelFactory method:

package org.modeshape.jcr.api.query.qom;

public interface QueryObjectModelFactory {
    ...
    /**
     * Tests that the value (or values) defined by the supplied dynamic operand are
     * found within the specified set of values.
     *
     * @param operand the dynamic operand describing the values that are to be constrained
     * @param values the static operand values; may not be null or empty
     * @return the constraint; non-null
     * @throws InvalidQueryException if a particular validity test is possible on this method,
     *         the implementation chooses to perform that test (and not leave it until later, on createQuery),
     *         and the parameters given fail that test
     * @throws RepositoryException if the operation otherwise fails
     */
    public SetCriteria in( DynamicOperand operand,
                           StaticOperand... values ) throws InvalidQueryException, RepositoryException;
    ...
}

To create a NOT IN criteria, simply create the IN criteria to get a SetCriteria object, and then pass that into the standard QueryObjectModelFactory.not(Constraint) method.
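A set criteria can be pictured as a membership test. The sketch below is a simplified model, not ModeShape's implementation: SetCriteriaSketch is a hypothetical class, and it assumes that an operand with several values (for example a multi-valued property) satisfies the criteria as soon as any one of its values is in the set.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of IN; the "any value matches" rule is an assumption of this sketch.
final class SetCriteriaSketch {

    @SafeVarargs
    static <T> boolean in(Collection<T> operandValues, T... values) {
        if (values == null || values.length == 0) {
            throw new IllegalArgumentException("values may not be null or empty");
        }
        Set<T> allowed = new HashSet<>(Arrays.asList(values));
        for (T value : operandValues) {
            if (allowed.contains(value)) {
                return true;
            }
        }
        return false;
    }
}
```

NOT IN is then the negation of this test, for example !SetCriteriaSketch.in(colors, "blue", "green").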

Arithmetic operands

ModeShape defines an ArithmeticOperand interface that extends the javax.jcr.query.qom.DynamicOperand, and thus can be used anywhere a DynamicOperand can be used.

public interface ArithmeticOperand extends DynamicOperand {

    /**
     * Get the operator for this binary operand.
     *
     * @return the operator; never null
     */
    public String getOperator();

    /**
     * Get the left-hand operand.
     *
     * @return the left-hand operand; never null
     */
    public DynamicOperand getLeft();

    /**
     * Get the right-hand operand.
     *
     * @return the right-hand operand; never null
     */
    public DynamicOperand getRight();
}

These can be constructed using additional org.modeshape.jcr.api.query.qom.QueryObjectModelFactory methods:

package org.modeshape.jcr.api.query.qom;

public interface QueryObjectModelFactory {
    ...
    /**
     * Create an arithmetic dynamic operand that adds the numeric values of the two supplied operands.
     *
     * @param left the left-hand-side operand; not null
     * @param right the right-hand-side operand; not null
     * @return the dynamic operand; non-null
     * @throws InvalidQueryException if a particular validity test is possible on this method,
     *         the implementation chooses to perform that test (and not leave it until later, on createQuery),
     *         and the parameters given fail that test
     * @throws RepositoryException if the operation otherwise fails
     */
    public ArithmeticOperand add( DynamicOperand left,
                                  DynamicOperand right ) throws InvalidQueryException, RepositoryException;

    /**
     * Create an arithmetic dynamic operand that subtracts the numeric value of the second operand from the numeric value of the
     * first.
     *
     * @param left the left-hand-side operand; not null
     * @param right the right-hand-side operand; not null
     * @return the dynamic operand; non-null
     * @throws InvalidQueryException if a particular validity test is possible on this method,
     *         the implementation chooses to perform that test (and not leave it until later, on createQuery),
     *         and the parameters given fail that test
     * @throws RepositoryException if the operation otherwise fails
     */
    public ArithmeticOperand subtract( DynamicOperand left,
                                       DynamicOperand right ) throws InvalidQueryException, RepositoryException;

    /**
     * Create an arithmetic dynamic operand that multiplies the numeric value of the first operand by the numeric value of the
     * second.
     *
     * @param left the left-hand-side operand; not null
     * @param right the right-hand-side operand; not null
     * @return the dynamic operand; non-null
     * @throws InvalidQueryException if a particular validity test is possible on this method,
     *         the implementation chooses to perform that test (and not leave it until later, on createQuery),
     *         and the parameters given fail that test
     * @throws RepositoryException if the operation otherwise fails
     */
    public ArithmeticOperand multiply( DynamicOperand left,
                                       DynamicOperand right ) throws InvalidQueryException, RepositoryException;

    /**
     * Create an arithmetic dynamic operand that divides the numeric value of the first operand by the numeric value of the
     * second.
     *
     * @param left the left-hand-side operand; not null
     * @param right the right-hand-side operand; not null
     * @return the dynamic operand; non-null
     * @throws InvalidQueryException if a particular validity test is possible on this method,
     *         the implementation chooses to perform that test (and not leave it until later, on createQuery),
     *         and the parameters given fail that test
     * @throws RepositoryException if the operation otherwise fails
     */
    public ArithmeticOperand divide( DynamicOperand left,
                                     DynamicOperand right ) throws InvalidQueryException, RepositoryException;
    ...
}
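Because each arithmetic operand is itself a DynamicOperand, arithmetic operands can be nested to build larger expressions. The following sketch models that composition with a tiny expression interface evaluated against a row of numeric property values. All names here (ArithmeticSketch, Operand, property) are hypothetical, and the row is simply a map from property name to value.

```java
import java.util.Map;

// Hypothetical model of composable arithmetic operands; not the ModeShape API.
final class ArithmeticSketch {

    // An operand computes a number from a row of numeric property values.
    interface Operand {
        double evaluate(Map<String, Double> row);
    }

    // Reads a numeric property; throws NullPointerException if the property is missing.
    static Operand property(String name) {
        return row -> row.get(name);
    }

    static Operand add(Operand left, Operand right) {
        return row -> left.evaluate(row) + right.evaluate(row);
    }

    static Operand subtract(Operand left, Operand right) {
        return row -> left.evaluate(row) - right.evaluate(row);
    }

    static Operand multiply(Operand left, Operand right) {
        return row -> left.evaluate(row) * right.evaluate(row);
    }

    static Operand divide(Operand left, Operand right) {
        return row -> left.evaluate(row) / right.evaluate(row);
    }
}
```

For instance, multiply(property("width"), property("height")) yields an operand that could be compared against a literal, much as an arithmetic operand can appear on the left-hand side of a comparison in a query.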

Search and text extraction

The full-text search language and JCR-SQL2's full-text search constraint both have the ability to find nodes using a simpler search-engine-like expression with wildcards and phrases.

One can pretty easily imagine how ModeShape performs these matches against a node's name and properties containing STRING, LONG, DATE, DOUBLE, DECIMAL, NAME, and PATH values. But what about BINARY values? In order to determine whether the search-engine-like search expressions match, doesn't ModeShape have to determine what text is contained within each BINARY value?

The short answer is that yes, ModeShape can only match against the BINARY value if it can extract the text from that value. And this is where text extraction comes into play.

A text extractor is a component that knows how to extract searchable text from a BINARY value. Each text extractor describes whether it can process files of a particular MIME type, and if it can, ModeShape will (when necessary) call the extractor to obtain the searchable text for a supplied BINARY value.
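The contract just described (ask each extractor whether it supports a MIME type, then call the first one that does) can be sketched as follows. This is a deliberately simplified, hypothetical model: the Extractor interface, PlainTextExtractor, and searchableText(...) are invented for illustration and are not ModeShape's actual extractor API, and a BINARY value is represented as a plain byte array.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified model of text extraction; not ModeShape's real interfaces.
final class TextExtractionSketch {

    interface Extractor {
        boolean supportsMimeType(String mimeType);

        String extract(byte[] binaryValue);
    }

    // Treats any "text/*" content as UTF-8 encoded text.
    static final class PlainTextExtractor implements Extractor {
        @Override
        public boolean supportsMimeType(String mimeType) {
            return mimeType != null && mimeType.startsWith("text/");
        }

        @Override
        public String extract(byte[] binaryValue) {
            return new String(binaryValue, StandardCharsets.UTF_8);
        }
    }

    // Returns the searchable text from the first extractor that supports the MIME type,
    // or an empty Optional when no extractor applies (the value is then not searchable).
    static Optional<String> searchableText(List<Extractor> extractors, String mimeType, byte[] binaryValue) {
        for (Extractor extractor : extractors) {
            if (extractor.supportsMimeType(mimeType)) {
                return Optional.of(extractor.extract(binaryValue));
            }
        }
        return Optional.empty();
    }
}
```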

Using indexes

As we mentioned above, out of the box a repository will have no indexes, and executing every query will require scanning the repository for all nodes that satisfy the criteria. In this section we'll look at the different concepts involved with indexes, talk about the different kinds of indexes, and show you how to define custom indexes.

Concepts

This section describes some of the concepts that are important in understanding how indexes are used and how to define them.

Index

A ModeShape index is a mechanism that can very efficiently return a set of node keys that satisfies some criteria. Very often that criteria might be just one of several appearing in a query. But if the index is defined properly for the set of data, the index can very quickly produce a set of node keys, and the remaining criteria can be evaluated against the other properties of those nodes.

Index definition

An index definition is the specification of the structure and attributes of what should be indexed. Index definitions can be included in the repository configuration and/or dynamically registered, modified, or unregistered via the IndexManager (accessible via the org.modeshape.jcr.api.Workspace public interface).

Each index definition specifies in which workspaces it should be used via a workspace match rule. This rule is a regular expression pattern that will be used to match against existing or newly-created workspace names. Here are several examples:

Workspace match rule      Description
------------------------  ----------------------------------------------------------------------
myWorkspace               Match against a single workspace named "myWorkspace".
first | second | third    Matches only workspaces named "first", "second", or "third"
*workspace                Matches all workspaces with names ending in "workspace"
*                         Matches all workspaces. This is the default rule if none is specified.

When a well-formed and complete index definition is registered, ModeShape will create the associated index in each applicable workspace, make sure it is kept up-to-date with changes in workspace content, and potentially use the index when executing queries. When new workspaces are created, all index definitions are evaluated to see if they apply to the new workspace name and, if so, new indexes are created for the new workspace. Likewise, when a workspace is removed, ModeShape will remove any of that workspace's indexes.
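One way to picture how a workspace match rule is applied to workspace names is the sketch below. It makes an assumption worth stating plainly: although the rule is described as a regular expression pattern, the examples above read like wildcard patterns, so this sketch treats "|" as a separator between alternatives and "*" as "any sequence of characters". WorkspaceMatchSketch is a hypothetical helper, and ModeShape's actual matching logic may differ.

```java
import java.util.regex.Pattern;

// Hypothetical interpretation of workspace match rules as wildcard patterns.
final class WorkspaceMatchSketch {

    static boolean matches(String rule, String workspaceName) {
        // An unspecified rule defaults to "*", which matches every workspace.
        String effectiveRule = (rule == null || rule.trim().isEmpty()) ? "*" : rule;
        for (String alternative : effectiveRule.split("\\|")) {
            if (toPattern(alternative.trim()).matcher(workspaceName).matches()) {
                return true;
            }
        }
        return false;
    }

    // Converts a wildcard expression to a regex: '*' becomes ".*", everything else is literal.
    private static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(".*");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString());
    }
}
```

Evaluating every index definition's rule this way against a newly created workspace name mirrors the step in which ModeShape decides which indexes to create for that workspace.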

Node type and columns

Each index definition includes the name of a single node type used to limit the scope of the associated indexes. Only nodes that are of that type (or subtype) will be added to the indexes described by that index definition. For example, if an index definition has a node type of "mix:title", then the corresponding indexes will include all nodes whose primary type or mixin types include "mix:title" or one of its subtypes.

Each index definition also includes one or more columns that define which property values should be included in the index. When a node is to be added to an index, the property value(s) for each column are extracted from the node and inserted into the index. If the node does not have a column's property, then a null value is used. If the property contains multiple values, then each value is inserted into the index.
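The column rules above (one entry per value, a null entry when the property is missing) can be sketched with a small in-memory, single-column index. This is an illustrative model only: ValueIndexSketch is hypothetical, node keys are plain strings, and a node's properties are given as a map from property name to a list of values.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical single-column index mapping each value to the keys of the nodes that carry it.
final class ValueIndexSketch {

    private final String columnProperty;
    private final Map<Object, Set<String>> nodeKeysByValue = new HashMap<>();

    ValueIndexSketch(String columnProperty) {
        this.columnProperty = columnProperty;
    }

    void add(String nodeKey, Map<String, List<Object>> properties) {
        List<Object> values = properties.get(columnProperty);
        if (values == null || values.isEmpty()) {
            // The node lacks the column's property, so a null value is recorded.
            put(null, nodeKey);
            return;
        }
        // A multi-valued property contributes one index entry per value.
        for (Object value : values) {
            put(value, nodeKey);
        }
    }

    Set<String> nodeKeysFor(Object value) {
        Set<String> keys = nodeKeysByValue.get(value);
        if (keys == null) {
            return Collections.emptySet();
        }
        return Collections.unmodifiableSet(keys);
    }

    private void put(Object value, String nodeKey) {
        nodeKeysByValue.computeIfAbsent(value, ignored -> new LinkedHashSet<>()).add(nodeKey);
    }
}
```

A node whose jcr:title property holds two values would therefore be found through either value, while a node without jcr:title is reachable only through the null entry.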

Dynamic Operands and column names

Some of the JCR-SQL2 and ModeShape specific operands can be used together with index definitions only if those index definitions contain a column with a specific name, as follows:

Operand     Column name      Description
----------  ---------------  ------------------------------------------------------------------------------
NAME        jcr:name         The prefixed name of the node.
LOCALNAME   mode:localName   The local name without a namespace.
DEPTH       mode:depth       The integer depth of the node.
PATH        jcr:path         The full path of the node.
CHILDCOUNT  mode:childCount  The integer number of children of the node.
NODEID      mode:id          The string identifier of a node as returned by javax.jcr.Node.getIdentifier().

Index kind

There are several different kinds of indexes:

Index Kind  Description
----------  ------------------------------------------------------------------------------------------
Value       The basic index that can track many different values and the keys of the nodes on which those values appear. Use this kind of index unless there is a better alternative.
Enumerated  An index specialized to track a limited number of distinct values. Use this kind of index when the associated property has constraints that limit the values.
Unique      An index specialized to track a single node key for each distinct value. Use this kind of index only when the associated property always has unique values. (There is no way to specify a property definition has globally unique values.)
Text        An index specialized to handle LIKE and/or CONTAINS criteria.
Nodetype    An index specialized to track the node types for nodes.

Synchronous vs asynchronous indexes

A *synchronous* index will be updated with changes made by a Session.save() call before that Session.save() call returns. In fact, all synchronous indexes will be updated before the save completes. This is strongly consistent behavior, and as such it is the default behavior.

Alternatively, an index can be defined as *asynchronous*, which means that it will be updated with the changes made by a Session.save() in a separate thread, so the index will very likely not contain the changes immediately when the client's save call completes. Using asynchronous indexes can offer tremendous performance benefits, since far less work is done during the Session.save() call. But asynchronous indexes should only be used in applications that can tolerate some "eventually-consistent" behavior from queries that use those indexes.
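The difference in timing can be illustrated with a plain-Java sketch. The class IndexUpdateSketch is hypothetical and does not use the ModeShape API; in particular, its save method returns a future only so that the example can observe when the background update has finished, which a real Session.save() does not do.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical model of a save that feeds one synchronous and one asynchronous index.
final class IndexUpdateSketch {
    final Set<String> synchronousIndex = ConcurrentHashMap.newKeySet();
    final Set<String> asynchronousIndex = ConcurrentHashMap.newKeySet();
    private final ExecutorService indexingThread = Executors.newSingleThreadExecutor();

    // Assumption for illustration: the returned future completes once the
    // asynchronous index has been updated.
    CompletableFuture<Void> save(String changedNodeKey) {
        // Synchronous index: updated before save() returns.
        synchronousIndex.add(changedNodeKey);
        // Asynchronous index: updated later on a separate thread.
        return CompletableFuture.runAsync(() -> asynchronousIndex.add(changedNodeKey), indexingThread);
    }

    // Stop the background indexing thread.
    void shutdown() {
        indexingThread.shutdown();
    }
}
```

Immediately after save returns, only the synchronous index is guaranteed to contain the change; a query served by the asynchronous index may miss it until the background task has run.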

Index providers

Each index definition also specifies the name of an index provider, which is a software component that is responsible for owning and using the indexes described by that definition. Multiple index definitions can refer to the same index provider. Index providers are always specified in the repository configuration.

See the index providers documentation for a list of the index providers that come bundled with ModeShape.

Index selection

A repository is likely to contain multiple index definitions, which means each workspace is likely to have multiple indexes. How does ModeShape know which indexes to use?

When ModeShape plans and optimizes a query, it consults each of the index providers to see which of their indexes can be used for each access operation in the given query plan. Each index provider evaluates the access operation's criteria, and for each index it deems applicable it estimates the cost of using the index, the number of rows that will be returned by the index for this query, and the total number of nodes that the index contains. ModeShape records this information within the query plan, and when executing each access operation will pick the index with the lowest overall cost and highest selectivity (the ratio of returned to total nodes).
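A minimal sketch of this selection step follows. The classes IndexEstimate and IndexSelectionSketch are hypothetical, and the way the two criteria are combined is an assumption: the text does not say how cost and selectivity are weighed, so this sketch picks the lowest cost and breaks ties in favor of the index expected to return the smaller fraction of its nodes.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical estimate reported by a provider for one applicable index.
final class IndexEstimate {
    final String indexName;
    final double cost;        // estimated cost of using the index
    final long returnedRows;  // estimated rows returned for this query
    final long totalNodes;    // total number of nodes the index contains

    IndexEstimate(String indexName, double cost, long returnedRows, long totalNodes) {
        this.indexName = indexName;
        this.cost = cost;
        this.returnedRows = returnedRows;
        this.totalNodes = totalNodes;
    }

    // Fraction of the indexed nodes that the index is expected to return.
    double returnedFraction() {
        return totalNodes == 0 ? 0.0 : (double) returnedRows / totalNodes;
    }
}

final class IndexSelectionSketch {
    // Assumption: lowest cost wins, and ties go to the smaller returned fraction.
    static Optional<IndexEstimate> choose(List<IndexEstimate> candidates) {
        return candidates.stream()
                .min(Comparator.comparingDouble((IndexEstimate e) -> e.cost)
                        .thenComparingDouble(IndexEstimate::returnedFraction));
    }
}
```

An empty result corresponds to the case where no provider offers an applicable index, in which case the access operation has to scan the workspace.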

Simple example

Let's imagine a repository for a catalog of published books and professional journals. It starts with a single node type, defined in CND format as:

<ex='http://www.example.com'>

[ex:book] > mix:title
- ex:authors (STRING) multiple
- ex:isbn (STRING) mandatory
- ex:publishDate (DATE)

Our application needs to be able to find books by title and/or within a range of publish dates, so it uses several different queries:

SELECT * FROM [ex:book] WHERE [ex:publishDate] BETWEEN $1 AND $2

SELECT * FROM [ex:book] WHERE [jcr:title] = $1

SELECT * FROM [ex:book] WHERE [jcr:title] LIKE $1 AND [ex:publishDate] BETWEEN $2 AND $3

We can use two indexes to cover these queries. The first covers the publish date, and will index all nodes of type ex:book (in all workspaces) based upon the date the books are published:

  • Index name: "bookPublishDates"

  • Type: "value" (the default)

  • Provider: "local"

  • Workspaces: "*" (the default)

  • Node type: "ex:book"

  • Column(s): ex:publishDate(DATE)

The second index will cover the title of the books. Again, it will index all nodes of type ex:book (in all workspaces) and will quickly find all node keys based upon the title:

  • Index name: "bookTitles"

  • Type: "value" (the default)

  • Provider: "local"

  • Workspaces: "*" (the default)

  • Node type: "ex:book"

  • Column(s): jcr:title(STRING)

When our queries are planned and optimized, ModeShape will figure out which index can be used for each query.

Slightly more complex example

Now, let's assume that we want our application to also store professional journals in addition to books. We can do this by modifying our node types:

<ex='http://www.example.com'>

[ex:book] > mix:title
- ex:authors (STRING) mandatory multiple
- ex:publishDate (DATE) mandatory
- ex:isbn (STRING) mandatory

[ex:journalVolume] > mix:title
- ex:publishDate (DATE) mandatory
- ex:volume (LONG) mandatory
+ * (ex:journalArticle) = ex:journalArticle

[ex:journalArticle] > mix:title
- ex:authors (STRING) mandatory multiple
- ex:firstPage (LONG) mandatory
- ex:lastPage (LONG) mandatory

Our application now uses a series of queries:

SELECT * FROM [ex:book] WHERE [ex:publishDate] BETWEEN $1 AND $2

SELECT * FROM [ex:book] WHERE [jcr:title] = $1

SELECT * FROM [ex:book] WHERE [jcr:title] LIKE $1 AND [ex:publishDate] BETWEEN $2 AND $3

SELECT * FROM [ex:journalVolume] WHERE [ex:publishDate] BETWEEN $1 AND $2

SELECT * FROM [ex:journalArticle] WHERE [jcr:title] = $1

SELECT * FROM [ex:journalVolume] AS vol JOIN [ex:journalArticle] AS article ON ISCHILDNODE(article,vol)
WHERE article.[jcr:title] LIKE $1 AND vol.[ex:publishDate] BETWEEN $2 AND $3

The first 3 queries are identical to our previous example, so we can just adopt the same 2 indexes and add two more for the journals. The first new index covers the journal volumes' publish dates, and will index all nodes of type ex:journalVolume (in all workspaces) based upon the date the volumes are published:

  • Index name: "journalPublishDates"

  • Type: "value" (the default)

  • Provider: "local"

  • Synchronous: "true"

  • Workspaces: "*" (the default)

  • Node type: "ex:journalVolume"

  • Column(s): ex:publishDate(DATE)

The next new index will cover the title of the journal articles. Again, it will index all nodes of type ex:journalArticle (in all workspaces) and will quickly find all node keys based upon the title:

  • Index name: "journalTitles"

  • Type: "value" (the default)

  • Provider: "local"

  • Synchronous: "true"

  • Workspaces: "*" (the default)

  • Node type: "ex:journalArticle"

  • Column(s): jcr:title(STRING)

Again, when our queries are planned and optimized, ModeShape will figure out which index can be used for each query. If we're querying journals, we'll use one set of indexes, and if we're querying books we'll use a different set of indexes.

Let's look at the most complicated query:

SELECT * FROM [ex:journalVolume] AS vol JOIN [ex:journalArticle] AS article ON ISCHILDNODE(article,vol)
WHERE article.[jcr:title] LIKE $1 AND vol.[ex:publishDate] BETWEEN $2 AND $3

This query involves a join, and so our query plan would then have different parts that look abstractly like:

  • JOIN: ISCHILDNODE(B,A)

    • ACCESS "A": SELECT * FROM [ex:journalVolume] WHERE [ex:publishDate] BETWEEN $2 AND $3

    • ACCESS "B": SELECT * FROM [ex:journalArticle] WHERE [jcr:title] LIKE $1

In this case, the "journalPublishDates" index will be used for the first access operation, while the "journalTitles" index will be used for the second one. These two result sets will be joined together based upon the parent-child relationship.

What happens if we want to issue a single query to find books and articles? For that, we need to refactor our node types.

Refactored example

We still want to store the same content, but now we want to use a query that finds both books and articles given some criteria. In the previous section, the books and journals are not related, so we'd have to use a union. But if we slightly refactor our node types to extract some common concepts, we can then achieve what we want. Here are our updated node types:

<ex='http://www.example.com'>

[ex:authored] mixin
- ex:authors (STRING) mandatory multiple

[ex:published] mixin
- ex:publishDate (DATE) mandatory

[ex:book] > mix:title, ex:published, ex:authored
- ex:isbn (STRING) mandatory

[ex:journalVolume] > mix:title, ex:published
- ex:volume (LONG) mandatory
+ * (ex:journalArticle) = ex:journalArticle

[ex:journalArticle] > mix:title, ex:authored
- ex:firstPage (LONG) mandatory
- ex:lastPage (LONG) mandatory

Notice that we've extracted the ex:authors property into a new mixin called ex:authored and used it in both ex:book and ex:journalArticle. Similarly, we've extracted the ex:publishDate property into another new mixin called ex:published and used it in both ex:book and ex:journalVolume.

Let's look again at the set of queries our application will use:

SELECT * FROM [ex:book] WHERE [ex:publishDate] BETWEEN $1 AND $2

SELECT * FROM [ex:book] WHERE [jcr:title] = $1

SELECT * FROM [ex:book] WHERE [jcr:title] LIKE $1 AND [ex:publishDate] BETWEEN $2 AND $3

SELECT * FROM [ex:journalVolume] WHERE [ex:publishDate] BETWEEN $1 AND $2

SELECT * FROM [ex:journalArticle] WHERE [jcr:title] = $1

SELECT * FROM [ex:journalVolume] AS vol JOIN [ex:journalArticle] AS article ON ISCHILDNODE(article,vol)
WHERE article.[jcr:title] LIKE $1 AND vol.[ex:publishDate] BETWEEN $2 AND $3

SELECT * FROM [ex:published] WHERE [ex:publishDate] BETWEEN $1 AND $2

SELECT * FROM [ex:authored] WHERE [ex:authors] = $1

The first 6 queries are identical to our previous example, but in the previous example we had 4 indexes. Because we've extracted some common elements, we can actually get by with just 3 indexes, even though we've added a few additional queries.

Our first index covers the publish dates of books and journal volumes by indexing all nodes of type ex:published (in all workspaces) based upon the date they are published:

  • Index name: "publishDates"

  • Type: "value" (the default)

  • Provider: "local"

  • Synchronous: "true"

  • Workspaces: "*" (the default)

  • Node type: "ex:published"

  • Column(s): ex:publishDate(DATE)

Our second index covers the titles of books, journal volumes, and journal articles by indexing all nodes of type mix:title (in all workspaces), and will quickly find all node keys based upon the title:

  • Index name: "titles"

  • Type: "value" (the default)

  • Provider: "local"

  • Synchronous: "true"

  • Workspaces: "*" (the default)

  • Node type: "mix:title"

  • Column(s): jcr:title(STRING)

Our last query finds all books and journal articles by author names, and we need a third index to cover this case:

  • Index name: "authors"

  • Type: "value" (the default)

  • Provider: "local"

  • Synchronous: "true"

  • Workspaces: "*" (the default)

  • Node type: "ex:authored"

  • Column(s): ex:authors(STRING)

When our queries are planned and optimized, ModeShape will figure out which index can be used for each query. Let's see how the index definition's node types play into a few different queries.

Let's start with these two queries:

SELECT * FROM [ex:book] WHERE [ex:publishDate] BETWEEN $1 AND $2

SELECT * FROM [ex:journalVolume] WHERE [ex:publishDate] BETWEEN $1 AND $2

Because both ex:book and ex:journalVolume are subtypes of "ex:published" (the node type of our "publishDates" index), the "publishDates" index will end up being used by both queries. No worries, though: the first query will still only return books and the second query will only return journal volumes.

Why does this work? Even though the "publishDates" index contains both books and journal volumes, the date range will likely reduce the number of potential matches to a small subset of the nodes in our workspace. So even if the index returns both books and journal volumes, the number is small enough that we can quickly choose only those that are books or only those that are journal volumes.

This also makes it possible to use the index on our second-to-last query that finds all books and journal volumes published during some date range:

SELECT * FROM [ex:published] WHERE [ex:publishDate] BETWEEN $1 AND $2
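The two-step behavior described above (narrow by date range with the shared index, then keep only nodes of the queried type) can be sketched in plain Java. The class PublishDateIndexSketch is hypothetical and is not ModeShape code; for brevity it compares primary type names exactly and ignores subtypes and mixins.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical model of the shared "publishDates" index; not ModeShape code.
final class PublishDateIndexSketch {
    // One indexed node: its key and its primary type.
    static final class Entry {
        final String nodeKey;
        final String nodeType;

        Entry(String nodeKey, String nodeType) {
            this.nodeKey = nodeKey;
            this.nodeType = nodeType;
        }
    }

    // Sorted by date so that a range can be read without scanning every entry.
    private final TreeMap<LocalDate, List<Entry>> entriesByDate = new TreeMap<>();

    void add(LocalDate publishDate, String nodeKey, String nodeType) {
        entriesByDate.computeIfAbsent(publishDate, d -> new ArrayList<>())
                     .add(new Entry(nodeKey, nodeType));
    }

    // Step 1: the index narrows the candidates by date range (books and volumes alike).
    // Step 2: the remaining candidates are filtered by the queried node type.
    List<String> query(String nodeType, LocalDate from, LocalDate to) {
        List<String> result = new ArrayList<>();
        for (List<Entry> entries : entriesByDate.subMap(from, true, to, true).values()) {
            for (Entry entry : entries) {
                if (entry.nodeType.equals(nodeType)) {
                    result.add(entry.nodeKey);
                }
            }
        }
        return result;
    }
}
```

Querying this sketch for "ex:book" and for "ex:journalVolume" over the same date range reads the same index entries but returns disjoint results, which mirrors how one index definition on ex:published can serve both queries.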

Configuring

To configure indexes in a repository JSON configuration file, add an "indexes" section to the configuration:

"indexes" : {
        "nodesByName" : {
            "kind" : "value",
            "provider" : "local",
            "synchronous" : "true",
            "nodeType" : "nt:base",
            "columns" : "jcr:name(NAME)"
        },
        "nodesByLocalName" : {
            "kind" : "value",
            "provider" : "local",
            "synchronous" : "true",
            "nodeType" : "nt:base",
            "columns" : "mode:localName(STRING)"
        },
        "nodesByDepth" : {
            "kind" : "value",
            "provider" : "local",
            "synchronous" : "true",
            "nodeType" : "nt:base",
            "columns" : "mode:depth(LONG)"
        },
        "nodesByPath" : {
            "kind" : "value",
            "provider" : "local",
            "synchronous" : "true",
            "nodeType" : "nt:base",
            "columns" : "jcr:path(PATH)"
        },
        "indexWithDefaultNodeType" : {
            "kind" : "value",
            "provider" : "local",
            "columns" : "jcr:title(STRING)"
        },
        "indexWithSingleColumn" : {
            "kind" : "value",
            "provider" : "local",
            "nodeType" : "mix:title",
            "columns" : "jcr:title(STRING)"
        },
        "indexWithMultipleColumns" : {
            "kind" : "value",
            "provider" : "local",
            "nodeType" : "mix:title",
            "columns" : "jcr:title(STRING), jcr:description( STRING )"
        },
        "enumeratedIndexWithNonBuiltInNames" : {
            "kind" : "enumerated",
            "provider" : "local",
            "nodeType" : "foo:bar",
            "columns" : "foo:value( STRING )"
        },
        "uniqueIndexWithNonBuiltInNames" : {
            "kind" : "unique",
            "provider" : "local",
            "nodeType" : "foo:bar",
            "columns" : "foo:value( STRING )"
        },
        "textIndexWithNonBuiltInNames" : {
            "kind" : "text",
            "provider" : "local",
            "nodeType" : "foo:bar",
            "columns" : "jcr:title(STRING)"
        },
        "nodeTypes" : {
            "kind" : "nodeType",
            "provider" : "local",
            "nodeType" : "nt:base",
            "columns" : "jcr:primaryType(STRING)"
        }
    }

or in the WildFly ModeShape configuration section:

<indexes>
  <index name="names" provider-name="local" kind="value" synchronous="true" node-type="nt:base" columns="jcr:name(NAME)" />
</indexes>

These configuration fragments use all of the attributes, including those that have default values. Feel free to leave out those attributes that have defaults.

Managing index definitions

While defining indexes is straightforward as shown above, in certain cases more fine-grained control over index definitions may be desired, such as being able to manage index definitions at runtime (as opposed to at configuration time). This can be accomplished using the ModeShape index-specific API, starting from a given org.modeshape.jcr.api.Workspace instance:

public interface Workspace extends javax.jcr.Workspace {
    /**
     * Returns the <code>IndexManager</code> object that can be used to programmatically register and unregister index
     * definitions.
     *
     * @return the <code>IndexManager</code> object.
     * @throws RepositoryException if an error occurs.
     */
    public IndexManager getIndexManager() throws RepositoryException;
}

and then using the IndexManager instance:

public interface IndexManager {

    /**
     * Get the names of the available index providers.
     *
     * @return the immutable set of provider names; never null but possibly empty
     */
    Set<String> getProviderNames();

    /**
     * Get a map of the registered index definitions keyed by their names. The resulting map is immutable, but it is updated
     * whenever an index definition is added, updated, or removed. To add an index, use
     * {@link #registerIndex(IndexDefinition, boolean)} or {@link #registerIndexes(IndexDefinition[], boolean)}; to remove an
     * index, use {@link #unregisterIndexes}.
     *
     * @return the index definitions; never null but possibly empty
     */
    Map<String, IndexDefinition> getIndexDefinitions();

    /**
     * Register a new definition for an index.
     *
     * @param indexDefinition the definition; may not be null
     * @param allowUpdate true if the definition can update or overwrite an existing definition with the same name, or false if
     *        calling this method should result in an exception when the repository already contains a definition with the same
     *        name
     * @throws InvalidIndexDefinitionException if the new definition is invalid
     * @throws IndexExistsException if <code>allowUpdate</code> is <code>false</code> and the <code>IndexDefinition</code>
     *         specifies a node type name that is already registered.
     * @throws RepositoryException if there is a problem registering the new definition, or if an existing index
     */
    void registerIndex( IndexDefinition indexDefinition,
                        boolean allowUpdate ) throws InvalidIndexDefinitionException, IndexExistsException, RepositoryException;

    /**
     * Register new definitions for several indexes.
     *
     * @param indexDefinitions the definitions; may not be null
     * @param allowUpdate true if each of the definitions can update or overwrite an existing definition with the same name, or
     *        false if calling this method should result in an exception when the repository already contains any definition with
     *        names that match the supplied definitions
     * @throws InvalidIndexDefinitionException if the new definition is invalid
     * @throws IndexExistsException if <code>allowUpdate</code> is <code>false</code> and the <code>IndexDefinition</code>
     *         specifies a node type name that is already registered.
     * @throws RepositoryException if there is a problem registering the new definition, or if an existing index
     */
    void registerIndexes( IndexDefinition[] indexDefinitions,
                          boolean allowUpdate ) throws InvalidIndexDefinitionException, IndexExistsException, RepositoryException;

    /**
     * Removes an existing index definition.
     *
     * @param indexNames the names of the index definition to be removed; may not be null
     * @throws NoSuchIndexException if there is no index with the supplied name
     * @throws RepositoryException if there is a problem registering the new definition, or if an existing index
     */
    void unregisterIndexes( String... indexNames ) throws NoSuchIndexException, RepositoryException;

    /**
     * Create a new template that can be used to programmatically define an index.
     *
     * @return the new template; may not be null
     */
    IndexDefinitionTemplate createIndexDefinitionTemplate();

    /**
     * Create a new template that can be used to programmatically define a column on an index.
     *
     * @return the new template; may not be null
     */
    IndexColumnDefinitionTemplate createIndexColumnDefinitionTemplate();
}

Additional considerations

Queries using AND operators

When analyzing which index to use for a query that contains the AND operator, ModeShape will look at each constraint separately and will use an index even if it only applies to one part of the composite constraint. This is because all parts of the AND constraint have to be satisfied and using an index to filter out even a part is much faster than looking at all the nodes in the repository.
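As a rough illustration, the following plain-Java sketch answers a query of the form "title = ? AND (some other constraint)" when only the title is indexed. The class AndConstraintSketch and its nested Book type are hypothetical and not part of ModeShape; the point is only that the index supplies a small candidate set and the remaining constraint is checked against those candidates alone.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical model: an index exists for the title but not for the publish year.
final class AndConstraintSketch {
    static final class Book {
        final String title;
        final int publishYear;

        Book(String title, int publishYear) {
            this.title = title;
            this.publishYear = publishYear;
        }
    }

    private final Map<String, List<Book>> booksByTitle = new HashMap<>();

    void add(Book book) {
        booksByTitle.computeIfAbsent(book.title, t -> new ArrayList<>()).add(book);
    }

    // The index answers the title part of the AND; the remaining constraint is
    // evaluated only on the candidates the index returned, not on every node.
    List<Book> find(String title, Predicate<Book> remainingConstraint) {
        List<Book> matches = new ArrayList<>();
        for (Book candidate : booksByTitle.getOrDefault(title, new ArrayList<>())) {
            if (remainingConstraint.test(candidate)) {
                matches.add(candidate);
            }
        }
        return matches;
    }
}
```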

Queries using OR operators

Unlike the previous case, if a query contains constraints composed with the OR operator, ModeShape will only select an index if that index applies to all the parts of the constraint. If that is not the case, ModeShape will default to processing the query looking at all the nodes in the repository. If different parts of the OR constraint are covered by different indexes and you still want ModeShape to use those indexes, you can re-write the query using the UNION operator, essentially splitting the different parts of the constraint.
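The rewrite can be sketched with a small helper that turns each part of an OR constraint into its own SELECT and joins them with UNION. The class OrToUnionSketch is hypothetical: it only concatenates strings, does not parse JCR-SQL2, and assumes the parts are simple top-level alternatives of a single-selector query.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper; it only concatenates strings and does not parse JCR-SQL2.
final class OrToUnionSketch {
    // Builds one SELECT per constraint and combines them with UNION, so that
    // each part can be planned against its own index.
    static String rewrite(String selectFrom, List<String> orConstraints) {
        return orConstraints.stream()
                .map(constraint -> selectFrom + " WHERE " + constraint)
                .collect(Collectors.joining(" UNION "));
    }
}
```

For example, given the book indexes from the earlier examples, a query such as SELECT * FROM [ex:book] WHERE [jcr:title] = $1 OR [ex:publishDate] BETWEEN $2 AND $3 would be rewritten as SELECT * FROM [ex:book] WHERE [jcr:title] = $1 UNION SELECT * FROM [ex:book] WHERE [ex:publishDate] BETWEEN $2 AND $3, letting one part use the title index and the other the publish date index.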

Reindexing

If index definitions are configured, when the repository starts up, depending on the reindexing configuration, it will reindex some or all of the nodes in the repository.

The default behavior (called if_missing) is to reindex all the nodes in the repository if one of the defined indexes is missing. An index is considered missing if there is no data associated with it, but the index is present in the configuration and enabled.

Starting with ModeShape 4.5, a new reindexing option called incremental is available. It only works if a couple of conditions are met:

  • the repository journal is configured. See https://docs.jboss.org/author/display/MODE50/Journaling.

  • the configured index providers support incremental reindexing. An index provider supports incremental reindexing if it's able to store and return a timestamp which indicates the latest time at which indexes were successfully updated by that provider.

If both of the above are true, incremental reindexing means each of the indexing providers will ask the journal for the changed nodes since the providers were last active (i.e. since the last successful index update). Once they get the changed nodes, only those nodes will be reindexed.

If incremental reindexing is configured and the repository journal is not present, the repository will log a warning at startup and will not perform any kind of reindexing.
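The idea behind incremental reindexing can be pictured with the following plain-Java sketch. The class IncrementalReindexSketch is hypothetical and unrelated to ModeShape's journal implementation; it assumes the journal is a simple list of timestamped changes and that the provider stores a single "last successful update" timestamp. Whether a change made exactly at that timestamp is included is also an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a journal and a provider that remembers its last update.
final class IncrementalReindexSketch {
    static final class JournalEntry {
        final long timestamp;
        final String nodeKey;

        JournalEntry(long timestamp, String nodeKey) {
            this.timestamp = timestamp;
            this.nodeKey = nodeKey;
        }
    }

    private final List<JournalEntry> journal = new ArrayList<>();
    private long lastSuccessfulIndexUpdate; // timestamp stored by the provider

    void recordChange(long timestamp, String nodeKey) {
        journal.add(new JournalEntry(timestamp, nodeKey));
    }

    // Returns the keys of the nodes changed since the last successful update
    // (strictly later, by assumption) and then advances the stored timestamp.
    Set<String> reindexIncrementally(long now) {
        Set<String> changed = new LinkedHashSet<>();
        for (JournalEntry entry : journal) {
            if (entry.timestamp > lastSuccessfulIndexUpdate) {
                changed.add(entry.nodeKey);
            }
        }
        lastSuccessfulIndexUpdate = now;
        return changed;
    }
}
```

Only the nodes recorded in the journal after the stored timestamp are revisited, which is why both the journal and the provider's timestamp support are required.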

Reindexing is performed asynchronously by default, but it can be changed to synchronous via the configuration (see below).

Configuration

Reindexing can be configured via the JSON or WildFly (AS) configuration files like so:

"reindexing" : {
    "async" : false,        // defaults to true
    "mode" : "incremental"  // defaults to "if_missing"
}

and

 <repository name="sample2">
      <reindexing async="true" mode="if_missing"/>
  </repository>

Workspace API

Reindexing can also be triggered programmatically via the ModeShape org.modeshape.jcr.api.Workspace API:

    /**
     * Crawl and re-index the content in this workspace. This method blocks until the indexing is completed.
     *
     * @throws AccessDeniedException if the session does not have the privileges to reindex the workspace
     * @throws RepositoryException if there is a problem with this session or workspace
     * @see #reindexAsync()
     * @see #reindexAsync(String)
     * @see #reindex(String)
     */
    void reindex() throws RepositoryException;

    /**
     * Crawl and index the content starting at the supplied path in this workspace, to the designated depth.
     *
     * @param path the path of the content to be indexed
     * @throws IllegalArgumentException if the workspace or path are null, or if the depth is less than 1
     * @throws AccessDeniedException if the session does not have the privileges to reindex this part of the workspace
     * @throws RepositoryException if there is a problem with this session or workspace
     * @see #reindex()
     * @see #reindexAsync()
     * @see #reindexAsync(String)
     */
    void reindex( String path ) throws RepositoryException;

    /**
     * Reindex only the nodes from this workspace that have changed since the given timestamp. This only works if the repository
     * has journaling enabled. This method blocks until the indexing is completed.
     *
     * @param timestamp a {@code long} timestamp starting with which all changed nodes will be reindexed.
     * @throws RepositoryException if anything fails.
     * @since 4.5
     */
    void reindexSince( long timestamp ) throws RepositoryException;

    /**
     * Asynchronously crawl and re-index the content in this workspace.
     *
     * @return a future representing the asynchronous operation; never null
     * @throws AccessDeniedException if the session does not have the privileges to reindex the workspace
     * @throws RepositoryException if there is a problem with this session or workspace
     * @see #reindex()
     * @see #reindex(String)
     * @see #reindexAsync(String)
     */
    Future<Boolean> reindexAsync() throws RepositoryException;

    /**
     * Asynchronously crawl and index the content starting at the supplied path in this workspace, to the designated depth.
     *
     * @param path the path of the content to be indexed
     * @return a future representing the asynchronous operation; never null
     * @throws IllegalArgumentException if the workspace or path are null, or if the depth is less than 1
     * @throws AccessDeniedException if the session does not have the privileges to reindex this part of the workspace
     * @throws RepositoryException if there is a problem with this session or workspace
     * @see #reindex()
     * @see #reindex(String)
     * @see #reindexAsync()
     */
    Future<Boolean> reindexAsync( String path ) throws RepositoryException;

    /**
     * Asynchronously reindex only the nodes from this workspace that have changed since the given timestamp.
     * This only works if the repository has journaling enabled.
     *
     * @param timestamp a {@code long} timestamp starting with which all changed nodes will be reindexed.
     * @return a future representing the asynchronous operation; never null
     * @throws RepositoryException if anything fails.
     * @see #reindexSince(long)
     * @since 4.5
     */
    Future<Boolean> reindexSinceAsync( long timestamp ) throws RepositoryException;
JBoss.org Content Archive (Read Only), exported from JBoss Community Documentation Editor at 2020-03-11 12:12:36 UTC, last content change 2017-02-03 19:43:08 UTC.