Chapter 15. Search Configuration

<repository-service default-repository="db1">
  <repositories>
    <repository name="db1" system-workspace="ws" default-workspace="ws">
       ....
      <workspaces>
        <workspace name="ws">
       ....
          <query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
            <properties>
              <property name="index-dir" value="${java.io.tmpdir}/temp/index/db1/ws" />
              <property name="synonymprovider-class" value="org.exoplatform.services.jcr.impl.core.query.lucene.PropertiesSynonymProvider" />
              <property name="synonymprovider-config-path" value="/synonyms.properties" />
              <property name="indexing-configuration-path" value="/indexing-configuration.xml" />
              <property name="query-class" value="org.exoplatform.services.jcr.impl.core.query.QueryImpl" />
            </properties>
          </query-handler>
        ... 
        </workspace>
     </workspaces>
    </repository>        
  </repositories>
</repository-service>

15.2. Configuration parameters

Table 15.1.

Parameter	Default	Description	Since
index-dir	none	The location of the index directory. This parameter is mandatory. Up to version 1.9, this parameter was called "indexDir".	1.0
use-compoundfile	true	Advises Lucene to use compound files for the index files.	1.9
min-merge-docs	100	Minimum number of nodes in an index until segments are merged.	1.9
volatile-idle-time	3	Idle time in seconds until the volatile index part is moved to a persistent index, even if min-merge-docs has not been reached.	1.9
max-merge-docs	Integer.MAX_VALUE	Maximum number of nodes in segments that will be merged. The default value changed in JCR 1.9 to Integer.MAX_VALUE.	1.9
merge-factor	10	Determines how often segment indices are merged.	1.9
max-field-length	10000	The number of words that are fulltext indexed at most per property.	1.9
cache-size	1000	Size of the document number cache. This cache maps UUIDs to Lucene document numbers.	1.9
force-consistencycheck	false	Runs a consistency check on every startup. If false, a consistency check is only performed when the search index detects a prior forced shutdown.	1.9
auto-repair	true	Errors detected by a consistency check are automatically repaired. If false, errors are only written to the log.	1.9
query-class	QueryImpl	Name of the class that implements the javax.jcr.query.Query interface. This class must also extend org.exoplatform.services.jcr.impl.core.query.AbstractQueryImpl.	1.9
document-order	true	If true and the query does not contain an 'order by' clause, result nodes are returned in document order. For better performance when queries return many nodes, set this to 'false'.	1.9
result-fetch-size	Integer.MAX_VALUE	The number of results fetched when a query is executed. Default value: Integer.MAX_VALUE (-> all).	1.9
excerptprovider-class	DefaultXMLExcerpt	The name of the class that implements org.exoplatform.services.jcr.impl.core.query.lucene.ExcerptProvider and is used for the rep:excerpt() function in a query.	1.9
support-highlighting	false	If set to true, additional information is stored in the index to support highlighting using the rep:excerpt() function.	1.9
synonymprovider-class	none	The name of a class that implements org.exoplatform.services.jcr.impl.core.query.lucene.SynonymProvider. The default value is null (-> not set).	1.9
synonymprovider-config-path	none	The path to the synonym provider configuration file. This path is interpreted relative to the path parameter. If there is a path element inside the SearchIndex element, then this path is interpreted relative to the root path of that path. Whether this parameter is mandatory depends on the synonym provider implementation. The default value is null (-> not set).	1.9
indexing-configuration-path	none	The path to the indexing configuration file.	1.9
indexing-configuration-class	IndexingConfigurationImpl	The name of the class that implements org.exoplatform.services.jcr.impl.core.query.lucene.IndexingConfiguration.	1.9
force-consistencycheck	false	If set to true, a consistency check is performed, depending on the parameter forceConsistencyCheck. If set to false, no consistency check is performed on startup, even if a redo log has been applied.	1.9
spellchecker-class	none	The name of a class that implements org.exoplatform.services.jcr.impl.core.query.lucene.SpellChecker.	1.9
spellchecker-more-popular	true	If true, the spellchecker returns only suggestions that are as frequent as or more frequent than the checked word. If false, the spellchecker returns null if the checked word exists in the dictionary; otherwise, it returns the closest suggestion.	1.10
spellchecker-min-distance	0.55f	Minimum distance between the checked word and a suggested word.	1.10
errorlog-size	50 (KB)	The default size of the error log file, in KB.	1.9
upgrade-index	false	Allows JCR to convert an existing index into the new format. This property can also be set as a system property, for example: -Dupgrade-index=true. Indexes created before JCR 1.12 will not run with JCR 1.12, so you have to run an automatic migration: start JCR with -Dupgrade-index=true. The old index format is then converted into the new index format, and the new format is used from then on. On the next start, this option is no longer needed. The old index is replaced and cannot be converted back, so back up the index before migrating. (Only for migrations from JCR 1.9 and later.)	1.12
analyzer	org.apache.lucene.analysis.standard.StandardAnalyzer	Class name of a Lucene analyzer to use for fulltext indexing of text.	1.12
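
As a minimal sketch of how the parameters in Table 15.1 might be combined, assuming they are set as property elements in the same properties block shown at the start of this chapter. The values below are illustrative assumptions, not recommendations.

```xml
<query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
  <properties>
    <!-- Mandatory: location of the index directory -->
    <property name="index-dir" value="${java.io.tmpdir}/temp/index/db1/ws" />
    <!-- Do not return results in document order when the query has no 'order by' clause -->
    <property name="document-order" value="false" />
    <!-- Fulltext index at most 20000 words per property (illustrative value) -->
    <property name="max-field-length" value="20000" />
    <!-- Larger document number cache (illustrative value) -->
    <property name="cache-size" value="5000" />
    <!-- Run a consistency check on every startup -->
    <property name="force-consistencycheck" value="true" />
  </properties>
</query-handler>
```

Parameters that are left out keep the defaults listed in the table.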

15.3. Global Search Index

15.3.1. Global Search Index Configuration

The global search index is configured in the above-mentioned configuration file (portal/WEB-INF/conf/jcr/repository-configuration.xml) in the "query-handler" element.

<query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">

When using Lucene, you should always use the same analyzer for indexing and for querying; otherwise, the results are unpredictable. You don't have to worry about this, because eXo JCR does it for you automatically. If the StandardAnalyzer configured by default does not suit you, replace it with your own.

If you don't have a suitable QueryHandler at hand, you can learn how to create a customized one in 5 minutes.

15.3.2. Customized Search Indexes and Analyzers

By default, eXo JCR uses the Lucene StandardAnalyzer to index contents. This analyzer applies some standard filters in the method that analyzes the content:

public TokenStream tokenStream(String fieldName, Reader reader) {
    StandardTokenizer tokenStream = new StandardTokenizer(reader, replaceInvalidAcronym);
    tokenStream.setMaxTokenLength(maxTokenLength);
    TokenStream result = new StandardFilter(tokenStream);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopSet);
    return result;
}

The first one (StandardFilter) removes 's (as in "Peter's") from the end of words and removes dots from acronyms.
The second one (LowerCaseFilter) normalizes token text to lower case.
The last one (StopFilter) removes stop words from a token stream. The stop set is defined in the analyzer.

For specific cases, you may wish to use additional filters like ISOLatin1AccentFilter, which replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) by their unaccented equivalents.

To use a different filter, you have to create a new analyzer and a new search index that uses this analyzer. You then package them in a jar, which is deployed with your application.

15.3.2.1. Creating the filter

The ISOLatin1AccentFilter is not present in the current Lucene version used by eXo. You can use the attached file. You can also create your own filter; the relevant method is

public final Token next(final Token reusableToken) throws java.io.IOException

which defines how chars are read and used by the filter.

15.3.2.2. Creating the analyzer

The analyzer has to extend org.apache.lucene.analysis.standard.StandardAnalyzer and override the method

public TokenStream tokenStream(String fieldName, Reader reader)

to add your own filters. You can have a glance at the example analyzer attached to this article.

15.3.2.3. Creating the search index

Now that we have the analyzer, we have to write the SearchIndex that will use it. You have to extend org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex, write the constructor to set the right analyzer, and write the method

public Analyzer getAnalyzer() {
    // myAnalyzer is assumed to be a field holding the MyAnalyzer instance set in the constructor
    return myAnalyzer;
}

to return your analyzer. You can see the attached SearchIndex.

Note

Since version 1.12, the Analyzer can be set directly in the configuration, so creating a new SearchIndex only to use a new Analyzer is redundant.

15.3.2.4. Configuring your application to use your SearchIndex

In portal/WEB-INF/conf/jcr/repository-configuration.xml, you have to replace each

<query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">

by your own class

<query-handler class="mypackage.indexation.MySearchIndex">

15.3.2.5. Configuring your application to use your Analyzer

In portal/WEB-INF/conf/jcr/repository-configuration.xml, you have to add the "analyzer" parameter to each query-handler configuration:

<query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
   <properties>
      ...
      <property name="analyzer" value="org.exoplatform.services.jcr.impl.core.MyAnalyzer"/>
      ...
   </properties>
</query-handler>

When you start eXo, your SearchIndex will start to index contents with the specified filters.

15.4. Indexing Adjustments

15.4.1. IndexingConfiguration

Starting with version 1.9, the default search index implementation in JCR allows you to control which properties of a node are indexed. You can also define different analyzers for different nodes.

The configuration parameter is called indexing-configuration-path and is not set by default. This means all properties of a node are indexed.

If you wish to configure the indexing behavior, you need to add a property to the properties of the query-handler element in your configuration file.

<property name="indexing-configuration-path" value="/indexing_configuration.xml"/>

15.4.2. Indexing rules

15.4.2.1. Node Scope Limit

To optimize the index size, you can limit the node scope so that only certain properties of a node type are indexed.

With the configuration below, only properties named Text are indexed for nodes of type nt:unstructured. This configuration also applies to all nodes whose type extends nt:unstructured.

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured">
    <property>Text</property>
  </index-rule>
</configuration>

Please note that every namespace prefix used in the XML file has to be declared in the configuration element.

15.4.2.2. Indexing Boost Value

It is also possible to configure a boost value for the nodes that match the index rule. The default boost value is 1.0. Higher boost values (a reasonable range is 1.0 - 5.0) will yield a higher score value and appear as more relevant.

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured"
              boost="2.0">
    <property>Text</property>
  </index-rule>
</configuration>

If you do not wish to boost the complete node but only certain properties, you can also provide a boost value for the listed properties:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured">
    <property boost="3.0">Title</property>
    <property boost="1.5">Text</property>
  </index-rule>
</configuration>

15.4.2.3. Conditional Index Rules

You may also add a condition to the index rule and have multiple rules with the same nodeType. The first index rule that matches applies, and all remaining ones are ignored:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured"
              boost="2.0"
              condition="@priority = 'high'">
    <property>Text</property>
  </index-rule>
  <index-rule nodeType="nt:unstructured">
    <property>Text</property>
  </index-rule>
</configuration>

In the above example, the first rule only applies if the nt:unstructured node has a priority property with a value 'high'. The condition syntax supports only the equals operator and a string literal.

You may also refer to properties in the condition that are not on the current node:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured"
              boost="2.0"
              condition="ancestor::*/@priority = 'high'">
    <property>Text</property>
  </index-rule>
  <index-rule nodeType="nt:unstructured"
              boost="0.5"
              condition="parent::foo/@priority = 'low'">
    <property>Text</property>
  </index-rule>
  <index-rule nodeType="nt:unstructured"
              boost="1.5"
              condition="bar/@priority = 'medium'">
    <property>Text</property>
  </index-rule>
  <index-rule nodeType="nt:unstructured">
    <property>Text</property>
  </index-rule>
</configuration>

The indexing configuration also allows you to specify the type of a node in the condition. Please note however that the type match must be exact. It does not consider sub types of the specified node type.

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured"
              boost="2.0"
              condition="element(*, nt:unstructured)/@priority = 'high'">
    <property>Text</property>
  </index-rule>
</configuration>

15.4.2.4. Exclusion from the Node Scope Index

By default, all configured properties are fulltext indexed if they are of type STRING, and they are included in the node scope index. A node scope search normally finds all nodes of an index. That is, jcr:contains(., 'foo') returns all nodes that have a string property containing the word 'foo'. You can explicitly exclude a property from the node scope index:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured">
    <property nodeScopeIndex="false">Text</property>
  </index-rule>
</configuration>
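
As a sketch of how this exclusion might be combined with the property boost shown earlier, assuming both attributes can be used within the same index rule: the configuration below keeps Title in the node scope index with a higher boost, while Text is indexed but left out of the node scope index. Title and Text are the illustrative property names used above.

```xml
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured">
    <!-- Indexed, boosted, and included in the node scope index -->
    <property boost="3.0">Title</property>
    <!-- Indexed, but excluded from the node scope index -->
    <property nodeScopeIndex="false">Text</property>
  </index-rule>
</configuration>
```

With such a rule, jcr:contains(., 'foo') would be expected to match only on Title, whereas jcr:contains(Text, 'foo') would still search the Text property.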

15.4.3. Indexing Aggregates

Sometimes it is useful to include the contents of descendant nodes in a single node, to make it easier to search content that is scattered across multiple nodes.

JCR allows you to define indexed aggregates, based on relative path patterns and primary node types.

The following example creates an indexed aggregate on nt:file that includes the content of the jcr:content node:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">
    <include>jcr:content</include>
  </aggregate>
</configuration>

You can also restrict the included nodes to a certain type:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">
    <include primaryType="nt:resource">jcr:content</include>
  </aggregate>
</configuration>

You may also use the * to match all child nodes:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">
    <include primaryType="nt:resource">*</include>
  </aggregate>
</configuration>

If you wish to include nodes up to a certain depth below the current node, you can add multiple include elements. E.g. the nt:file node may contain a complete XML document under jcr:content:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">
    <include>*</include>
    <include>*/*</include>
    <include>*/*/*</include>
  </aggregate>
</configuration>

15.4.4. Property-Level Analyzers

15.4.4.1. Example

In this configuration section, you define how a property has to be analyzed. If there is an analyzer configuration for a property, this analyzer is used for indexing and searching of this property. For example:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <analyzers> 
        <analyzer class="org.apache.lucene.analysis.KeywordAnalyzer">
            <property>mytext</property>
        </analyzer>
        <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer">
            <property>mytext2</property>
        </analyzer>
  </analyzers> 
</configuration>

The configuration above means that the property "mytext" for the entire workspace is indexed (and searched) with the Lucene KeywordAnalyzer, and property "mytext2" with the WhitespaceAnalyzer. Using different analyzers for different languages is particularly useful.

The WhitespaceAnalyzer tokenizes a property, whereas the KeywordAnalyzer takes the property as a whole.
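
The examples in this chapter show aggregates, index rules and analyzers in separate files. As a hedged sketch, assuming the indexing configuration DTD accepts all three kinds of elements in one file and in this order, a combined configuration might look as follows. The property name Text is the illustrative one used in the index rule examples.

```xml
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <!-- Include the content of jcr:content in the index of its nt:file parent -->
  <aggregate primaryType="nt:file">
    <include>jcr:content</include>
  </aggregate>
  <!-- Index only the Text property of nt:unstructured nodes -->
  <index-rule nodeType="nt:unstructured">
    <property>Text</property>
  </index-rule>
  <!-- Split the Text property on whitespace only -->
  <analyzers>
    <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer">
      <property>Text</property>
    </analyzer>
  </analyzers>
</configuration>
```

If the file is rejected at startup, check the element order against the DTD referenced in the DOCTYPE declaration.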

15.4.4.2. Characteristics of Node Scope Searches

When using analyzers, you may encounter unexpected behavior when searching within a property compared to searching within the node scope. The reason is that the node scope always uses the global analyzer.

Let's suppose that the property "mytext" contains the text "testing my analyzers" and that you haven't configured any analyzer for the property "mytext" (and haven't changed the default analyzer in SearchIndex).

If your query is for example:

xpath = "//*[jcr:contains(mytext,'analyzer')]"

With the default analyzers, this XPath query does not return a hit for the node with the property above.

A search on the node scope

xpath = "//*[jcr:contains(.,'analyzer')]"

won't give a hit either. Keep in mind that you can only set specific analyzers on a node property, and that the node scope indexing/analyzing is always done with the globally defined analyzer in the SearchIndex element.

Now, if you change the analyzer used to index the "mytext" property above to

<analyzer class="org.apache.lucene.analysis.de.GermanAnalyzer">
     <property>mytext</property>
</analyzer>

and you do the same search again, then for

xpath = "//*[jcr:contains(mytext,'analyzer')]"

you would get a hit because of the word stemming (analyzers - analyzer).

The other search,

xpath = "//*[jcr:contains(.,'analyzer')]"

still would not give a result, since the node scope is indexed with the global analyzer, which in this case does not take into account any word stemming.

In conclusion, be aware that when using analyzers for specific properties, you might find a hit in a property for some search text, and you do not find a hit with the same search text in the node scope of the property!

Note

Both index rules and index aggregates influence how content is indexed in JCR. If you change the configuration, the existing content is not automatically re-indexed according to the new rules. You therefore have to re-index the content manually when you change the configuration!

15.4.5. Advanced features

eXo JCR supports some advanced features, which are not specified in JSR 170:

Get a text excerpt with highlighted words that match the query: ExcerptProvider.
Search a term and its synonyms: SynonymSearch
Search similar nodes: SimilaritySearch
Check spelling of a full text query statement: SpellChecker
Define index aggregates and rules: IndexingConfiguration.
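
Several of these features are switched on through the query-handler parameters listed in Table 15.1. The following is a minimal sketch, assuming the properties block shown at the start of this chapter. The class mypackage.indexation.MySpellChecker is hypothetical and stands for any class that implements the SpellChecker interface.

```xml
<query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
  <properties>
    <property name="index-dir" value="${java.io.tmpdir}/temp/index/db1/ws" />
    <!-- ExcerptProvider: store the extra index data needed by rep:excerpt() -->
    <property name="support-highlighting" value="true" />
    <!-- SynonymSearch: provider class and configuration file taken from the example above -->
    <property name="synonymprovider-class" value="org.exoplatform.services.jcr.impl.core.query.lucene.PropertiesSynonymProvider" />
    <property name="synonymprovider-config-path" value="/synonyms.properties" />
    <!-- SpellChecker: hypothetical implementation class -->
    <property name="spellchecker-class" value="mypackage.indexation.MySpellChecker" />
    <!-- IndexingConfiguration: aggregates, index rules and analyzers -->
    <property name="indexing-configuration-path" value="/indexing-configuration.xml" />
  </properties>
</query-handler>
```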