JBoss.org Community Documentation
The JCR index is configured in the following file:
.../portal/WEB-INF/conf/jcr/repository-configuration.xml
<repository-service default-repository="db1">
  <repositories>
    <repository name="db1" system-workspace="ws" default-workspace="ws">
      ....
      <workspaces>
        <workspace name="ws">
          ....
          <query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
            <properties>
              <property name="index-dir" value="${java.io.tmpdir}/temp/index/db1/ws" />
              <property name="synonymprovider-class" value="org.exoplatform.services.jcr.impl.core.query.lucene.PropertiesSynonymProvider" />
              <property name="synonymprovider-config-path" value="/synonyms.properties" />
              <property name="indexing-configuration-path" value="/indexing-configuration.xml" />
              <property name="query-class" value="org.exoplatform.services.jcr.impl.core.query.QueryImpl" />
            </properties>
          </query-handler>
          ...
        </workspace>
      </workspaces>
    </repository>
  </repositories>
</repository-service>
Table 15.1.
Parameter | Default | Description | Since |
---|---|---|---|
index-dir | none | The location of the index directory. This parameter is mandatory. Up to 1.9, this parameter was called "indexDir". | 1.0 |
use-compoundfile | true | Advises lucene to use compound files for the index files. | 1.9 |
min-merge-docs | 100 | Minimum number of nodes in an index until segments are merged. | 1.9 |
volatile-idle-time | 3 | Idle time in seconds until the volatile index part is moved to a persistent index even though minMergeDocs is not reached. | 1.9 |
max-merge-docs | Integer.MAX_VALUE | Maximum number of nodes in segments that will be merged. The default value changed in JCR 1.9 to Integer.MAX_VALUE. | 1.9 |
merge-factor | 10 | Determines how often segment indices are merged. | 1.9 |
max-field-length | 10000 | The number of words that are fulltext indexed at most per property. | 1.9 |
cache-size | 1000 | Size of the document number cache. This cache maps UUIDs to Lucene document numbers. | 1.9 |
force-consistencycheck | false | Runs a consistency check on every startup. If false, a consistency check is only performed when the search index detects a prior forced shutdown. | 1.9 |
auto-repair | true | Errors detected by a consistency check are automatically repaired. If false, errors are only written to the log. | 1.9 |
query-class | QueryImpl | Class name that implements the javax.jcr.query.Query interface. This class must also extend org.exoplatform.services.jcr.impl.core.query.AbstractQueryImpl. | 1.9 |
document-order | true | If true and the query does not contain an 'order by' clause, result nodes will be in document order. For better performance when queries return a lot of nodes set to 'false'. | 1.9 |
result-fetch-size | Integer.MAX_VALUE | The number of results when a query is executed. Default value: Integer.MAX_VALUE (-> all). | 1.9 |
excerptprovider-class | DefaultXMLExcerpt | The name of the class that implements org.exoplatform.services.jcr.impl.core.query.lucene.ExcerptProvider and should be used for the rep:excerpt() function in a query. | 1.9 |
support-highlighting | false | If set to true additional information is stored in the index to support highlighting using the rep:excerpt() function. | 1.9 |
synonymprovider-class | none | The name of a class that implements org.exoplatform.services.jcr.impl.core.query.lucene.SynonymProvider. The default value is null (-> not set). | 1.9 |
synonymprovider-config-path | none | The path to the synonym provider configuration file. This path is interpreted relative to the path parameter. If there is a path element inside the SearchIndex element, this path is interpreted relative to the root path of that path. Whether this parameter is mandatory depends on the synonym provider implementation. The default value is null (-> not set). | 1.9 |
indexing-configuration-path | none | The path to the indexing configuration file. | 1.9 |
indexing-configuration-class | IndexingConfigurationImpl | The name of the class that implements org.exoplatform.services.jcr.impl.core.query.lucene.IndexingConfiguration. | 1.9 |
force-consistencycheck | false | If set to true, a consistency check is performed, depending on the parameter forceConsistencyCheck. If set to false, no consistency check is performed on startup, even if a redo log has been applied. | 1.9 |
spellchecker-class | none | The name of a class that implements org.exoplatform.services.jcr.impl.core.query.lucene.SpellChecker. | 1.9 |
spellchecker-more-popular | true | If set to true, the spellchecker returns only suggestions that are as frequent as or more frequent than the checked word. If set to false, the spellchecker returns null if the checked word exists in the dictionary; otherwise it returns the closest suggestion. | 1.10 |
spellchecker-min-distance | 0.55f | Minimum distance between the checked word and the suggested word. | 1.10 |
errorlog-size | 50 (Kb) | The default size of the error log file, in Kb. | 1.9 |
upgrade-index | false | Allows JCR to convert an existing index into the new format. This property can also be set as a system property, for example: -Dupgrade-index=true. Indexes created before JCR 1.12 do not work with JCR 1.12, so you have to run an automatic migration: start JCR with -Dupgrade-index=true. The old index format is then converted into the new format, which is used from then on, so you do not need this option on subsequent starts. The old index is replaced and cannot be converted back, so back up the index before migrating. (Only for migrations from JCR 1.9 and later.) | 1.12 |
analyzer | org.apache.lucene.analysis.standard.StandardAnalyzer | Class name of the Lucene analyzer to use for full-text indexing of text. | 1.12 |
The global search index is configured in the above-mentioned configuration file (portal/WEB-INF/conf/jcr/repository-configuration.xml) in the tag "query-handler".
<query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
When using Lucene, you should always use the same analyzer for indexing and for querying; otherwise the results are unpredictable. You do not have to worry about this, because eXo JCR does it for you automatically. If you do not want to use the StandardAnalyzer configured by default, replace it with your own.
If no existing QueryHandler suits your needs, you can create a customized one in a few minutes.
By default, eXo JCR uses the Lucene StandardAnalyzer to index content. This analyzer applies some standard filters in the method that analyzes the content:
public TokenStream tokenStream(String fieldName, Reader reader) {
  StandardTokenizer tokenStream = new StandardTokenizer(reader, replaceInvalidAcronym);
  tokenStream.setMaxTokenLength(maxTokenLength);
  TokenStream result = new StandardFilter(tokenStream);
  result = new LowerCaseFilter(result);
  result = new StopFilter(result, stopSet);
  return result;
}
The first one (StandardFilter) removes 's (as 's in "Peter's") from the end of words and removes dots from acronyms.
The second one (LowerCaseFilter) normalizes token text to lower case.
The last one (StopFilter) removes stop words from a token stream. The stop set is defined in the analyzer.
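The effect of the three filters can be imitated with a few lines of plain Java. The following is only a sketch of the idea and does not use Lucene: the class name FilterChainSketch, the whitespace tokenization, and the stop set are assumptions made for illustration, and unlike the real StandardFilter it removes every dot, not only those in acronyms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Plain-Java imitation of the three filter steps; it does not use Lucene. */
final class FilterChainSketch {

    // Hypothetical stop set; the real one is defined inside the analyzer.
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Simplification: split on whitespace instead of using StandardTokenizer.
        for (String raw : text.trim().split("\\s+")) {
            String token = raw;
            // Step 1 (StandardFilter-like): strip a trailing 's and remove dots.
            if (token.endsWith("'s") || token.endsWith("'S")) {
                token = token.substring(0, token.length() - 2);
            }
            token = token.replace(".", "");
            // Step 2 (LowerCaseFilter-like): normalize to lower case.
            token = token.toLowerCase(Locale.ROOT);
            // Step 3 (StopFilter-like): skip stop words and empty tokens.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [peter, report, ibm, office]
        System.out.println(analyze("Peter's report to the I.B.M. office"));
    }
}
```

The order of the steps matters: lower-casing happens before the stop word lookup, so "The" and "the" are both removed.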
For specific cases, you may wish to use additional filters like ISOLatin1AccentFilter, which replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) by their unaccented equivalents.
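To illustrate what replacing accented characters by their unaccented equivalents means, the sketch below uses java.text.Normalizer from the Java standard library. It is not the ISOLatin1AccentFilter implementation; the class name AccentFoldingSketch is hypothetical, and characters without a canonical decomposition (for example ligatures) are left unchanged by this approach.

```java
import java.text.Normalizer;

/** Accent folding with the JDK only; an illustration, not the Lucene filter. */
final class AccentFoldingSketch {

    static String stripAccents(String input) {
        // Decompose each accented character into base letter + combining mark (NFD).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Remove the combining marks, keeping only the base letters.
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file independent of its encoding.
        // prints Evenement a Zurich
        System.out.println(stripAccents("\u00C9v\u00E9nement \u00E0 Z\u00FCrich"));
    }
}
```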
To use a different filter, you have to create a new analyzer and a new search index that uses the analyzer. Package both in a JAR that is deployed with your application.
The ISOLatin1AccentFilter is not present in the Lucene version currently used by eXo. You can use the attached file. You can also create your own filter; the relevant method is
public final Token next(final Token reusableToken) throws java.io.IOException
which defines how chars are read and used by the filter.
The analyzer has to extend org.apache.lucene.analysis.standard.StandardAnalyzer and override the method
public TokenStream tokenStream(String fieldName, Reader reader)
to add your own filters. You can have a look at the example analyzer attached to this article.
Now that we have the analyzer, we have to write the SearchIndex that uses it. You have to extend org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex. You have to write the constructor, to set the right analyzer, and the method
public Analyzer getAnalyzer() {
  // MyAnalyzer is the custom analyzer class described above
  return new MyAnalyzer();
}
to return your analyzer. You can see the attached SearchIndex.
Since version 1.12, the analyzer can be set directly in the configuration, so creating a new SearchIndex just to use a new analyzer is no longer necessary.
In portal/WEB-INF/conf/jcr/repository-configuration.xml, you have to replace each
<query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
by your own class
<query-handler class="mypackage.indexation.MySearchIndex">
Since version 1.12, you can instead add the parameter "analyzer" to each query-handler configuration in portal/WEB-INF/conf/jcr/repository-configuration.xml:
<query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
  <properties>
    ...
    <property name="analyzer" value="org.exoplatform.services.jcr.impl.core.MyAnalyzer"/>
    ...
  </properties>
</query-handler>
When you start eXo, the SearchIndex indexes content with the specified filters.
Starting with version 1.9, the default search index implementation in JCR allows you to control which properties of a node are indexed. You also can define different analyzers for different nodes.
The configuration parameter is called indexing-configuration-path and is not set by default. This means all properties of a node are indexed.
If you wish to configure the indexing behavior, you need to add a parameter to the query-handler element in your configuration file.
<property name="indexing-configuration-path" value="/indexing_configuration.xml"/>
To optimize the index size, you can limit the node scope so that only certain properties of a node type are indexed.
With the configuration below, only properties named Text are indexed for nodes of type nt:unstructured. This configuration also applies to all nodes whose type extends nt:unstructured.
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured">
    <property>Text</property>
  </index-rule>
</configuration>
Please note that you have to declare the namespace prefixes in the configuration element that you are using throughout the XML file!
It is also possible to configure a boost value for the nodes that match the index rule. The default boost value is 1.0. Higher boost values (a reasonable range is 1.0 - 5.0) will yield a higher score value and appear as more relevant.
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured" boost="2.0">
    <property>Text</property>
  </index-rule>
</configuration>
If you do not wish to boost the complete node but only certain properties, you can also provide a boost value for the listed properties:
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured">
    <property boost="3.0">Title</property>
    <property boost="1.5">Text</property>
  </index-rule>
</configuration>
You may also add a condition to the index rule and have multiple rules with the same nodeType. The first index rule that matches applies, and all remaining ones are ignored:
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured" boost="2.0" condition="@priority = 'high'">
    <property>Text</property>
  </index-rule>
  <index-rule nodeType="nt:unstructured">
    <property>Text</property>
  </index-rule>
</configuration>
In the above example, the first rule only applies if the nt:unstructured node has a priority property with a value 'high'. The condition syntax supports only the equals operator and a string literal.
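The "first matching rule wins" behavior can be modelled in a few lines of Java. This is a toy model under stated assumptions, not eXo JCR code: the class FirstMatchRuleSketch, its Rule type, and the helper methods are hypothetical, node type inheritance is ignored, and only the equals operator with a string literal is modelled.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Toy model of "the first matching index rule wins"; not eXo JCR code. */
final class FirstMatchRuleSketch {

    /** Hypothetical stand-in for one index-rule element. */
    static final class Rule {
        final String nodeType;
        final String conditionProperty; // null when the rule has no condition
        final String conditionValue;
        final float boost;

        Rule(String nodeType, String conditionProperty, String conditionValue, float boost) {
            this.nodeType = nodeType;
            this.conditionProperty = conditionProperty;
            this.conditionValue = conditionValue;
            this.boost = boost;
        }

        boolean matches(String type, Map<String, String> properties) {
            // Simplification: exact type comparison, no node type inheritance.
            if (!nodeType.equals(type)) {
                return false;
            }
            // Only the equals operator with a string literal is modelled.
            return conditionProperty == null
                || conditionValue.equals(properties.get(conditionProperty));
        }
    }

    /** Two rules mirroring the conditional and the unconditional rule, in document order. */
    static List<Rule> exampleRules() {
        return Arrays.asList(
            new Rule("nt:unstructured", "priority", "high", 2.0f),
            new Rule("nt:unstructured", null, null, 1.0f));
    }

    /** Returns the first rule that matches; later rules are never consulted. */
    static Optional<Rule> firstMatch(List<Rule> rules, String type, Map<String, String> properties) {
        for (Rule rule : rules) {
            if (rule.matches(type, properties)) {
                return Optional.of(rule);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> high = Collections.singletonMap("priority", "high");
        Map<String, String> none = Collections.emptyMap();
        System.out.println(firstMatch(exampleRules(), "nt:unstructured", high).get().boost); // prints 2.0
        System.out.println(firstMatch(exampleRules(), "nt:unstructured", none).get().boost); // prints 1.0
    }
}
```

Because evaluation stops at the first match, the more specific rule has to come before the general one; with the order reversed, the conditional rule would never be reached.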
You may also refer to properties in the condition that are not on the current node:
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured" boost="2.0" condition="ancestor::*/@priority = 'high'">
    <property>Text</property>
  </index-rule>
  <index-rule nodeType="nt:unstructured" boost="0.5" condition="parent::foo/@priority = 'low'">
    <property>Text</property>
  </index-rule>
  <index-rule nodeType="nt:unstructured" boost="1.5" condition="bar/@priority = 'medium'">
    <property>Text</property>
  </index-rule>
  <index-rule nodeType="nt:unstructured">
    <property>Text</property>
  </index-rule>
</configuration>
The indexing configuration also allows you to specify the type of a node in the condition. Please note however that the type match must be exact. It does not consider sub types of the specified node type.
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured" boost="2.0" condition="element(*, nt:unstructured)/@priority = 'high'">
    <property>Text</property>
  </index-rule>
</configuration>
By default, all configured properties are full-text indexed if they are of type STRING and are included in the node scope index. A node scope search normally finds all nodes of an index. That is, the constraint jcr:contains(., 'foo') returns all nodes that have a string property containing the word 'foo'. You can explicitly exclude a property from the node scope index:
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <index-rule nodeType="nt:unstructured">
    <property nodeScopeIndex="false">Text</property>
  </index-rule>
</configuration>
Sometimes it is useful to include the contents of descendant nodes in a single node, to make it easier to search content that is scattered across multiple nodes.
JCR allows you to define indexed aggregates, based on relative path patterns and primary node types.
The following example creates an indexed aggregate on nt:file that includes the content of the jcr:content node:
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">
    <include>jcr:content</include>
  </aggregate>
</configuration>
You can also restrict the included nodes to a certain type:
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">
    <include primaryType="nt:resource">jcr:content</include>
  </aggregate>
</configuration>
You may also use the * to match all child nodes:
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">
    <include primaryType="nt:resource">*</include>
  </aggregate>
</configuration>
If you wish to include nodes up to a certain depth below the current node, you can add multiple include elements. E.g. the nt:file node may contain a complete XML document under jcr:content:
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
               xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <aggregate primaryType="nt:file">
    <include>*</include>
    <include>*/*</include>
    <include>*/*/*</include>
  </aggregate>
</configuration>
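The reason one include element is needed per depth level can be seen in a small Java sketch. It assumes, as the multi-include example suggests, that * matches exactly one path segment; the class IncludePatternSketch and its methods are hypothetical and are not part of eXo JCR.

```java
import java.util.Arrays;
import java.util.List;

/** Toy matcher for relative include patterns; an assumption-based illustration. */
final class IncludePatternSketch {

    /** True if every segment of the path matches the pattern segment at the same depth. */
    static boolean matches(String pattern, String relativePath) {
        String[] patternParts = pattern.split("/");
        String[] pathParts = relativePath.split("/");
        // Assumption: "*" stands for exactly one segment, so depths must be equal.
        if (patternParts.length != pathParts.length) {
            return false;
        }
        for (int i = 0; i < patternParts.length; i++) {
            if (!patternParts[i].equals("*") && !patternParts[i].equals(pathParts[i])) {
                return false;
            }
        }
        return true;
    }

    /** True if at least one of the include patterns matches the path. */
    static boolean isIncluded(List<String> patterns, String relativePath) {
        for (String pattern : patterns) {
            if (matches(pattern, relativePath)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> patterns = Arrays.asList("*", "*/*", "*/*/*");
        System.out.println(isIncluded(patterns, "jcr:content"));                  // prints true
        System.out.println(isIncluded(patterns, "jcr:content/doc/chapter"));      // prints true
        System.out.println(isIncluded(patterns, "jcr:content/doc/chapter/para")); // prints false
    }
}
```

Under this assumption, three include elements cover descendants down to depth three, and a node four levels below the aggregate root is not included.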
In this configuration section, you define how a property has to be analyzed. If there is an analyzer configuration for a property, this analyzer is used for indexing and searching of this property. For example:
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://www.exoplatform.org/dtd/indexing-configuration-1.0.dtd">
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <analyzers>
    <analyzer class="org.apache.lucene.analysis.KeywordAnalyzer">
      <property>mytext</property>
    </analyzer>
    <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer">
      <property>mytext2</property>
    </analyzer>
  </analyzers>
</configuration>
The configuration above means that the property "mytext" for the entire workspace is indexed (and searched) with the Lucene KeywordAnalyzer, and property "mytext2" with the WhitespaceAnalyzer. Using different analyzers for different languages is particularly useful.
The WhitespaceAnalyzer tokenizes a property, the KeywordAnalyzer takes the property as a whole.
When using analyzers, you may encounter an unexpected behavior when searching within a property compared to searching within a node scope. The reason is that the node scope always uses the global analyzer.
Let's suppose that the property "mytext" contains the text "testing my analyzers" and that you have not configured any analyzer for the property "mytext" (and have not changed the default analyzer in SearchIndex).
If your query is for example:
xpath = "//*[jcr:contains(mytext,'analyzer')]"
This XPath query does not return a hit for the node with the property above when the default analyzers are used, because the indexed word "analyzers" is not reduced to "analyzer".
Also a search on the node scope
xpath = "//*[jcr:contains(.,'analyzer')]"
will not give a hit either. Note that you can only set specific analyzers on a node property, and that node scope indexing and analyzing is always done with the globally defined analyzer in the SearchIndex element.
Now, if you change the analyzer used to index the "mytext" property above to
<analyzer class="org.apache.lucene.analysis.de.GermanAnalyzer">
  <property>mytext</property>
</analyzer>
and you do the same search again, then for
xpath = "//*[jcr:contains(mytext,'analyzer')]"
you would get a hit because of the word stemming (analyzers - analyzer).
The other search,
xpath = "//*[jcr:contains(.,'analyzer')]"
still would not give a result, since the node scope is indexed with the global analyzer, which in this case does not take into account any word stemming.
In conclusion, be aware that when you use analyzers for specific properties, a search text may produce a hit in a property but no hit in the node scope of that property.
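The difference between the two scopes can be reproduced with two toy analyzers in plain Java. This sketch does not use Lucene or eXo JCR: the class AnalyzerScopeSketch is hypothetical, and the "stemmer" merely strips one trailing "s", which is far cruder than a real stemming analyzer.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Toy comparison of a stemming property analyzer and a non-stemming global analyzer. */
final class AnalyzerScopeSketch {

    // Stand-in for the global analyzer: lower-cases terms, no stemming.
    static final UnaryOperator<String> PLAIN = term -> term.toLowerCase(Locale.ROOT);

    // Stand-in for a stemming analyzer: lower-cases and strips one trailing "s".
    static final UnaryOperator<String> STEMMING = term -> {
        String lower = term.toLowerCase(Locale.ROOT);
        return lower.endsWith("s") ? lower.substring(0, lower.length() - 1) : lower;
    };

    /** Builds the set of index terms for a text with the given analyzer. */
    static Set<String> index(String text, UnaryOperator<String> analyzer) {
        Set<String> terms = new HashSet<>();
        for (String word : text.trim().split("\\s+")) {
            terms.add(analyzer.apply(word));
        }
        return terms;
    }

    /** The query term is analyzed with the same analyzer as the index it is matched against. */
    static boolean hit(String text, String queryTerm, UnaryOperator<String> analyzer) {
        return index(text, analyzer).contains(analyzer.apply(queryTerm));
    }

    public static void main(String[] args) {
        String text = "testing my analyzers";
        // Property scope with the stemming analyzer: "analyzers" is indexed as "analyzer".
        System.out.println(hit(text, "analyzer", STEMMING)); // prints true
        // Node scope with the global analyzer: only "analyzers" is in the index.
        System.out.println(hit(text, "analyzer", PLAIN));    // prints false
    }
}
```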
Both index rules and index aggregates influence how content is indexed in JCR. If you change the configuration, the existing content is not automatically re-indexed according to the new rules. You, therefore, have to manually re-index the content when you change the configuration!
eXo JCR supports some advanced features, which are not specified in JSR 170:
Get a text excerpt with highlighted words that match the query: ExcerptProvider.
Search a term and its synonyms: SynonymSearch
Search similar nodes: SimilaritySearch
Check spelling of a full text query statement: SpellChecker
Define index aggregates and rules: IndexingConfiguration.