JBoss.orgCommunity Documentation

Chapter 15. Search Configuration

15.1. XML Configuration
15.2. Configuration parameters
15.3. Global Search Index
15.3.1. Global Search Index Configuration
15.3.2. Customized Search Indexes and Analyzers
15.4. Indexing Adjustments
15.4.1. IndexingConfiguration
15.4.2. Indexing rules
15.4.3. Indexing Aggregates
15.4.4. Property-Level Analyzers
15.4.5. Advanced features

JCR index configuration. You can find this file here: .../portal/WEB-INF/conf/jcr/repository-configuration.xml

<repository-service default-repository="db1">
  <repositories>
    <repository name="db1" system-workspace="ws" default-workspace="ws">
       ....
      <workspaces>
        <workspace name="ws">
       ....
          <query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
            <properties>
              <property name="index-dir" value="${java.io.tmpdir}/temp/index/db1/ws" />
              <property name="synonymprovider-class" value="org.exoplatform.services.jcr.impl.core.query.lucene.PropertiesSynonymProvider" />
              <property name="synonymprovider-config-path" value="/synonyms.properties" />
              <property name="indexing-config-path" value="/indexing-configuration.xml" />
              <property name="query-class" value="org.exoplatform.services.jcr.impl.core.query.QueryImpl" />
            </properties>
          </query-handler>
        ... 
        </workspace>
     </workspaces>
    </repository>        
  </repositories>
</repository-service>

Table 15.1. 

ParameterDefaultDescriptionSince
index-dirnoneThe location of the index directory. This parameter is mandatory. Up to 1.9, this parameter called "indexDir"1.0
use-compoundfiletrueAdvises lucene to use compound files for the index files.1.9
min-merge-docs100Minimum number of nodes in an index until segments are merged.1.9
volatile-idle-time3Idle time in seconds until the volatile index part is moved to a persistent index even though minMergeDocs is not reached.1.9
max-merge-docsInteger.MAX_VALUEMaximum number of nodes in segments that will be merged. The default value changed in JCR 1.9 to Integer.MAX_VALUE.1.9
merge-factor10Determines how often segment indices are merged.1.9
max-field-length10000The number of words that are fulltext indexed at most per property.1.9
cache-size1000Size of the document number cache. This cache maps uuids to lucene document numbers1.9
force-consistencycheckfalseRuns a consistency check on every startup. If false, a consistency check is only performed when the search index detects a prior forced shutdown.1.9
auto-repairtrueErrors detected by a consistency check are automatically repaired. If false, errors are only written to the log.1.9
query-classQueryImplClass name that implements the javax.jcr.query.Query interface.This class must also extend from the class: org.exoplatform.services.jcr.impl.core.query.AbstractQueryImpl.1.9
document-ordertrueIf true and the query does not contain an 'order by' clause, result nodes will be in document order. For better performance when queries return a lot of nodes set to 'false'.1.9
result-fetch-sizeInteger.MAX_VALUEThe number of results when a query is executed. Default value: Integer.MAX_VALUE (-> all).1.9
excerptprovider-classDefaultXMLExcerptThe name of the class that implements org.exoplatform.services.jcr.impl.core.query.lucene.ExcerptProvider and should be used for the rep:excerpt() function in a query.1.9
support-highlightingfalseIf set to true additional information is stored in the index to support highlighting using the rep:excerpt() function.1.9
synonymprovider-classnoneThe name of a class that implements org.exoplatform.services.jcr.impl.core.query.lucene.SynonymProvider. The default value is null (-> not set).1.9
synonymprovider-config-pathnoneThe path to the synonym provider configuration file. This path interpreted is relative to the path parameter. If there is a path element inside the SearchIndex element, then this path is interpreted and relative to the root path of the path. Whether this parameter is mandatory or not, it depends on the synonym provider implementation. The default value is null (-> not set).1.9
indexing-configuration-pathnoneThe path to the indexing configuration file.1.9
indexing-configuration-classIndexingConfigurationImplThe name of the class that implements org.exoplatform.services.jcr.impl.core.query.lucene.IndexingConfiguration.1.9
force-consistencycheckfalseIf setting to true, a consistency check is performed, depending on the parameter forceConsistencyCheck. If setting to false, no consistency check is performed on startup, even if a redo log had been applied.1.9
spellchecker-classnoneThe name of a class that implements org.exoplatform.services.jcr.impl.core.query.lucene.SpellChecker.1.9
spellchecker-more-populartrueIf setting true, spellchecker returns only the suggest words that are as frequent or more frequent than the checked word. If setting false, spellchecker returns null (if checked word exit in dictionary), or spellchecker will return most close suggest word.1.10
spellchecker-min-distance0.55fMinimal distance between checked word and proposed suggest word.1.10
errorlog-size50(Kb)The default size of error log file in Kb.1.9
upgrade-indexfalseAllows JCR to convert an existing index into the new format. Also, it is possible to set this property via system property, for example: -Dupgrade-index=true Indexes before JCR 1.12 will not run with JCR 1.12. Hence you have to run an automatic migration: Start JCR with -Dupgrade-index=true. The old index format is then converted in the new index format. After the conversion the new format is used. On the next start, you don't need this option anymore. The old index is replaced and a back conversion is not possible - therefore better take a backup of the index before. (Only for migrations from JCR 1.9 and later.)1.12
analyzerorg.apache.lucene.analysis.standard.StandardAnalyzerClass name of a lucene analyzer to use for fulltext indexing of text.1.12

By default Exo JCR uses the Lucene standard Analyzer to index contents. This analyzer uses some standard filters in the method that analyzes the content:

public TokenStream tokenStream(String fieldName, Reader reader) {
    StandardTokenizer tokenStream = new StandardTokenizer(reader, replaceInvalidAcronym);
    tokenStream.setMaxTokenLength(maxTokenLength);
    TokenStream result = new StandardFilter(tokenStream);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopSet);
    return result;
  }

For specific cases, you may wish to use additional filters like ISOLatin1AccentFilter, which replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) by their unaccented equivalents.

In order to use a different filter, you have to create a new analyzer, and a new search index to use the analyzer. You put it in a jar, which is deployed with your application.

You may also add a condition to the index rule and have multiple rules with the same nodeType. The first index rule that matches will apply and all remain ones are ignored:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured"
              boost="2.0"
              condition="@priority = 'high'">
    <property>Text</property>
  </index-rule>
  <index-rule nodeType="nt:unstructured">
    <property>Text</property>
  </index-rule>
</configuration>

In the above example, the first rule only applies if the nt:unstructured node has a priority property with a value 'high'. The condition syntax supports only the equals operator and a string literal.

You may also refer properties in the condition that are not on the current node:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured"
              boost="2.0"
              condition="ancestor::*/@priority = 'high'">
    <property>Text</property>
  </index-rule>
  <index-rule nodeType="nt:unstructured"
              boost="0.5"
              condition="parent::foo/@priority = 'low'">
    <property>Text</property>
  </index-rule>
  <index-rule nodeType="nt:unstructured"
              boost="1.5"
              condition="bar/@priority = 'medium'">
    <property>Text</property>
  </index-rule>
  <index-rule nodeType="nt:unstructured">
    <property>Text</property>
  </index-rule>
</configuration>

The indexing configuration also allows you to specify the type of a node in the condition. Please note however that the type match must be exact. It does not consider sub types of the specified node type.

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured"
              boost="2.0"
              condition="element(*, nt:unstructured)/@priority = 'high'">
    <property>Text</property>
  </index-rule>
</configuration>

Sometimes it is useful to include the contents of descendant nodes into a single node to easier search on content that is scattered across multiple nodes.

JCR allows you to define indexed aggregates, basing on relative path patterns and primary node types.

The following example creates an indexed aggregate on nt:file that includes the content of the jcr:content node:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">
    <include>jcr:content</include>
  </aggregate>
</configuration>

You can also restrict the included nodes to a certain type:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">
    <include primaryType="nt:resource">jcr:content</include>
  </aggregate>
</configuration>

You may also use the * to match all child nodes:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">http://wiki.exoplatform.com/xwiki/bin/edit/JCR/Search+Configuration
    <include primaryType="nt:resource">*</include>
  </aggregate>
</configuration>

If you wish to include nodes up to a certain depth below the current node, you can add multiple include elements. E.g. the nt:file node may contain a complete XML document under jcr:content:

<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">
    <include>*</include>
    <include>*/*</include>
    <include>*/*/*</include>
  </aggregate>
</configuration>

When using analyzers, you may encounter an unexpected behavior when searching within a property compared to searching within a node scope. The reason is that the node scope always uses the global analyzer.

Let's suppose that the property "mytext" contains the text : "testing my analyzers" and that you haven't configured any analyzers for the property "mytext" (and not changed the default analyzer in SearchIndex).

If your query is for example:

xpath = "//*[jcr:contains(mytext,'analyzer')]"

This xpath does not return a hit in the node with the property above and default analyzers.

Also a search on the node scope

xpath = "//*[jcr:contains(.,'analyzer')]"

won't give a hit. Realize that you can only set specific analyzers on a node property, and that the node scope indexing/analyzing is always done with the globally defined analyzer in the SearchIndex element.

Now, if you change the analyzer used to index the "mytext" property above to

<analyzer class="org.apache.lucene.analysis.Analyzer.GermanAnalyzer">
     <property>mytext</property>
</analyzer>

and you do the same search again, then for

xpath = "//*[jcr:contains(mytext,'analyzer')]"

you would get a hit because of the word stemming (analyzers - analyzer).

The other search,

xpath = "//*[jcr:contains(.,'analyzer')]"

still would not give a result, since the node scope is indexed with the global analyzer, which in this case does not take into account any word stemming.

In conclusion, be aware that when using analyzers for specific properties, you might find a hit in a property for some search text, and you do not find a hit with the same search text in the node scope of the property!

eXo JCR supports some advanced features, which are not specified in JSR 170: