Full text search - ModeShape 5

There are times when a formal structured query language is overkill, and the easiest way to find the right content is to perform a search, like you would with a search engine such as Google or Yahoo! This is where ModeShape's full-text search language comes in, because it allows you to use the JCR query API but with a far simpler, Google-style search grammar.

This query language is actually defined by the JCR 2.0 specification as the full-text search expression grammar used in the second parameter of the CONTAINS(...) function of the JCR-SQL2 language. We just pulled it out and made it available as a first-class query language, such that a full-text search query supplied by the user, full-text-query, is equivalent to executing this JCR-SQL2:

SELECT * FROM [nt:base] WHERE CONTAINS([nt:base],'full-text-query')
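
As a minimal sketch of that equivalence, the helper below wraps a user-supplied expression in the JCR-SQL2 statement shown above. The class and method names are hypothetical and not part of ModeShape. The quote doubling assumes the usual JCR-SQL2 rule that a single quote inside a string literal is written as two single quotes.

```java
final class FullTextToSql2 {

    /** Wraps a full-text search expression in the equivalent JCR-SQL2 statement. */
    static String toSql2(String fullTextQuery) {
        // Assumption: the expression is embedded in a single-quoted SQL2 literal,
        // so any single quote it contains is doubled.
        String escaped = fullTextQuery.replace("'", "''");
        return "SELECT * FROM [nt:base] WHERE CONTAINS([nt:base],'" + escaped + "')";
    }

    public static void main(String[] args) {
        System.out.println(toSql2("table \"customer invoice\""));
    }
}
```

The resulting string could then be handed to the JCR query API as an ordinary JCR-SQL2 query.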

This language allows a JCR client to construct a query to find nodes with property values that match the supplied terms. Nodes that "best" match the terms are returned before nodes that have a lesser match.

When full-text search indexes are used, some index providers (for example, Lucene) may apply a number of optimizations, such as (but not limited to) eliminating stop words (e.g., "the", "a", "and"), treating terms independent of case, and converting words to base forms using a process called stemming (e.g., "running" into "run", "customers" into "customer").
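
To make the first two of these optimizations concrete, here is a toy normalizer that lower-cases terms and drops stop words. It is only an illustration: the class name and the three-word stop list are assumptions, stemming is left out entirely, and a real provider such as Lucene uses its own analyzers rather than anything like this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class TermNormalizerSketch {

    // Illustrative stop words only (an assumption); a real analyzer ships its own list.
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "and");

    /** Lower-cases the input, splits it on whitespace, and drops stop words. */
    static List<String> normalize(String text) {
        List<String> terms = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            String lower = word.toLowerCase(Locale.ROOT);
            if (!lower.isEmpty() && !STOP_WORDS.contains(lower)) {
                terms.add(lower);
            }
        }
        return terms;
    }
}
```

With this sketch, the inputs "The Customer and a Table" and "customer table" normalize to the same two terms, which is why such searches can return the same nodes.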

Search terms can also include phrases by simply wrapping the phrase with double-quotes. For example, the search term 'table "customer invoice"' would rank higher those nodes with properties containing the phrase "customer invoice" than nodes with properties containing just "customer" or "invoice".

Terms in the query are implicitly AND-ed together, meaning that a node matches only when its property values match all of the terms. However, it is also possible to put an "OR" between two terms, in which case either of those terms may occur.

By default, all terms are assumed to be positive terms, in the sense that the occurrence of the term will increase the rank of any nodes containing the value. However, it is possible to specify that terms should not appear in the results. This is called a negative term, and it reduces the rank of any node whose property values contain the value. To specify a negative term, simply prefix the term with a hyphen ('-').
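
The way AND, OR, and negative terms combine can be sketched with a small boolean matcher. This is a deliberately simplified model, not ModeShape's implementation: it treats a negative term as a hard exclusion instead of a rank penalty, ignores ranking, phrases, and wildcards, and compares whole whitespace-separated words only. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class TermMatchSketch {

    /** Returns true if the text satisfies at least one OR-separated group of terms. */
    static boolean matches(String text, String expression) {
        Set<String> words = new HashSet<>(
                Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+")));
        for (String disjunct : expression.trim().split("\\s+OR\\s+")) {
            if (matchesAll(words, disjunct)) {
                return true;
            }
        }
        return false;
    }

    // Every positive term must be present and every negative term absent
    // (assumption: negation is modeled as exclusion, not as a lower score).
    private static boolean matchesAll(Set<String> words, String disjunct) {
        for (String term : disjunct.trim().split("\\s+")) {
            boolean negative = term.startsWith("-") && term.length() > 1;
            String word = (negative ? term.substring(1) : term).toLowerCase(Locale.ROOT);
            if (words.contains(word) == negative) {
                return false;
            }
        }
        return true;
    }
}
```

Under this model, "table invoice" requires both words, "chair OR table" requires either one, and "table -red" requires "table" without "red".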

Each term may also contain wildcards to specify the pattern to be matched (or negated). ModeShape supports two different sets of wildcards:

'*' matches zero or more characters, and '?' matches any single character; and
'%' matches zero or more characters, and '_' matches any single character.

The former are wildcards that are more commonly used in various systems (including older JCR repository implementations), while the latter are the wildcards used in LIKE expressions in both JCR-SQL and JCR-SQL2. Both families are supported for convenience, and you can also mix and match the various wildcards, such as 'ta*bl_' and 'ta?_ble*'. (Of course, placing multiple '*' or '%' characters next to each other offers no real benefit, as it is equivalent to a single '*' or '%'.)
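
One way to picture how the two wildcard families behave is to translate a term into a regular expression. The sketch below does this with the standard java.util.regex package. It is an illustration of the matching rules only, with a hypothetical class name, and it does not reflect how ModeShape or its index providers evaluate wildcards internally. Backslash handling is left out to keep it short.

```java
import java.util.regex.Pattern;

final class WildcardSketch {

    /** Translates a search term using either wildcard family into a regular expression. */
    static Pattern toPattern(String term) {
        StringBuilder regex = new StringBuilder();
        for (char c : term.toCharArray()) {
            switch (c) {
                case '*':
                case '%':
                    regex.append(".*"); // zero or more characters
                    break;
                case '?':
                case '_':
                    regex.append('.'); // exactly one character
                    break;
                default:
                    // Any other character is matched literally.
                    regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }
}
```

For example, the pattern built from 'ta*bl_' matches the whole word "table", while 'ta?_ble*' needs exactly two characters between "ta" and "ble" and therefore does not.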

If you want to use these characters literally in a term and do not want them to be treated as wildcards, they must be escaped by prefixing them with a backslash ('\') character. For example, this full text search expression:

table\* 'customer invoice\?'

would rank higher those nodes with properties containing 'table*' (with the escaped asterisk treated as a literal character, not a wildcard) and those containing the phrase "customer invoice?" (with the escaped question mark likewise treated as a literal character). To use a literal backslash character, simply escape it as well.
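
When a search term comes from user input that should be matched literally, the escaping can be applied programmatically. The helper below is a sketch under the rules described above: it prefixes each of the four wildcard characters and the backslash itself with a backslash. The class and method names are hypothetical.

```java
final class WildcardEscaper {

    /** Prefixes every wildcard character and backslash with a backslash. */
    static String escapeLiteral(String text) {
        StringBuilder escaped = new StringBuilder();
        for (char c : text.toCharArray()) {
            // Characters with special meaning in a term: * ? % _ and the escape character.
            if ("*?%_\\".indexOf(c) >= 0) {
                escaped.append('\\');
            }
            escaped.append(c);
        }
        return escaped.toString();
    }
}
```

Text without any of these characters passes through unchanged, so the helper is safe to apply to every literal term.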

When using this query language, the QueryResult always contains the "jcr:path" and "jcr:score" columns.

ModeShape handles leading and trailing wildcards in very different ways. When trailing wildcards are used, even a few characters preceding the wildcard can be used to quickly narrow down the potential results using the internal reverse indexes. However, when a term starts with a wildcard, ModeShape cannot use the internal reverse indexes to help narrow the results. A search with a leading wildcard must therefore be performed in a rather inefficient manner, in a process analogous to a relational database's table scan. Where possible, avoid using leading wildcards in your search terms.
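
An application that accepts search expressions from users may want to detect leading wildcards before running the query, for instance to warn the user or reject the input. The check below is one possible sketch with a hypothetical class name. It inspects each whitespace-separated word, skips a negation hyphen and an opening double quote, and treats a backslash-escaped character as a literal.

```java
final class LeadingWildcardCheck {

    /** Returns true if any whitespace-separated word begins with an unescaped wildcard. */
    static boolean hasLeadingWildcard(String expression) {
        for (String term : expression.trim().split("\\s+")) {
            int start = 0;
            // Skip the negation prefix and an opening phrase quote, if present.
            if (start < term.length() && term.charAt(start) == '-') start++;
            if (start < term.length() && term.charAt(start) == '"') start++;
            // An escaped wildcard starts with a backslash and is therefore not flagged.
            if (start < term.length() && "*?%_".indexOf(term.charAt(start)) >= 0) {
                return true;
            }
        }
        return false;
    }
}
```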

Grammar

The grammar for this full-text search language is specified in Section 6.7.19 of the JCR 2.0 specification, but it is also included here as a convenience.

The grammar is presented using the same EBNF nomenclature as used in the JCR 2.0 specification. Terms surrounded by matching square brackets (e.g., '[' and ']') denote optional terms that appear zero or one times. Terms surrounded by matching braces (e.g., '{' and '}') denote terms that appear zero or more times. Parentheses are used to identify groups, and are often used to surround possible values.

FulltextSearch ::= Disjunct {Space 'OR' Space Disjunct}

Disjunct ::= Term {Space Term}

Term ::= ['-'] SimpleTerm

SimpleTerm ::= Word | '"' Word {Space Word} '"'

Word ::= NonSpaceChar {NonSpaceChar}

Space ::= SpaceChar {SpaceChar}

NonSpaceChar ::= Char - SpaceChar /* Any Char except SpaceChar */

SpaceChar ::= ' '

Char ::= /* Any character */
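
To show how an expression decomposes under this grammar, here is a small parser sketch. It is not ModeShape's parser: the class, the nested Term type, and the regular expression are all assumptions made for illustration, and it makes no attempt to handle escaped quotes inside phrases. It returns a list of disjuncts (the OR-ed groups), each holding its AND-ed terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FullTextParserSketch {

    /** One Term from the grammar: an optional '-' plus a word or quoted phrase. */
    static final class Term {
        final boolean negated;
        final String text;

        Term(boolean negated, String text) {
            this.negated = negated;
            this.text = text;
        }

        @Override
        public String toString() {
            return (negated ? "-" : "") + text;
        }
    }

    // Optional '-', then either a double-quoted phrase or a run of non-space characters.
    private static final Pattern TERM = Pattern.compile("(-?)(?:\"([^\"]+)\"|(\\S+))");

    /** Parses an expression into OR-ed disjuncts, each a list of AND-ed terms. */
    static List<List<Term>> parse(String expression) {
        List<List<Term>> disjuncts = new ArrayList<>();
        List<Term> current = new ArrayList<>();
        Matcher m = TERM.matcher(expression);
        while (m.find()) {
            boolean quoted = m.group(2) != null;
            String text = quoted ? m.group(2) : m.group(3);
            if (!quoted && m.group(1).isEmpty() && text.equals("OR")) {
                // A bare OR closes the current disjunct and starts the next one.
                disjuncts.add(current);
                current = new ArrayList<>();
            } else {
                current.add(new Term(!m.group(1).isEmpty(), text));
            }
        }
        disjuncts.add(current);
        return disjuncts;
    }
}
```

Parsing 'table "customer invoice" OR -chair' with this sketch yields two disjuncts: the first with the word "table" and the phrase "customer invoice", the second with the negated word "chair".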

As you can see, this is a simple and straightforward query language, yet it makes it extremely easy to find all the nodes in the repository that match a set of terms.