JBoss.orgCommunity Documentation

Chapter 25. QueryHandler configuration

25.1. Indexing in clustered environment
25.2. Configuration
25.2.1. Query-handler configuration overview
25.2.2. Standalone strategy
25.2.3. Cluster-ready indexing strategies
25.2.4. JBoss-Cache template configuration
25.3. Asynchronous reindexing
25.3.1. On startup indexing
25.3.2. Hot Asynchronous Workspace Reindexing via JMX
25.3.3. Notices
25.4. Advanced tuning
25.4.1. Lucene tuning

JCR offers multiple indexing strategies. They include both for standalone and clustered environments using the advantages of running in a single JVM or doing the best to use all resources available in cluster. JCR uses Lucene library as underlying search and indexing engine, but it has several limitations that greatly reduce possibilities and limits the usage of cluster advantages. That's why eXo JCR offers three strategies that are suitable for it's own usecases. They are standalone, clustered with shared index and clustered with local indexes. Each one has it's pros and cons.

Stanadlone strategy provides a stack of indexes to achieve greater performance within single JVM.

It combines in-memory buffer index directory with delayed file-system flushing. This index is called "Volatile" and it is invoked in searches also. Within some conditions volatile index is flushed to the persistent storage (file system) as new index directory. This allows to achieve great results for write operations.

Clustered implementation with local indexes is built upon same strategy with volatile in-memory index buffer along with delayed flushing on persistent storage.

As this implementation designed for clustered environment it has additional mechanisms for data delivery within cluster. Actual text extraction jobs done on the same node that does content operations (i.e. write operation). Prepared "documents" (Lucene term that means block of data ready for indexing) are replicated withing cluster nodes and processed by local indexes. So each cluster instance has the same index content. When new node joins the cluster it has no initial index, so it must be created. There are some supported ways of doing this operation. The simplest is to simply copy the index manually but this is not intended for use. If no initial index found JCR uses automated sceneries. They are controlled via configuration (see "index-recovery-mode" parameter) offering full re-indexing from database or copying from another cluster node.

For some reasons having a multiple index copies on each instance can be costly. So shared index can be used instead (see diagram below).

This indexing strategy combines advantages of in-memory index along with shared persistent index offering "near" real time search capabilities. This means that newly added content is accessible via search practically immediately. This strategy allows nodes to index data in their own volatile (in-memory) indexes, but persistent indexes are managed by single "coordinator" node only. Each cluster instance has a read access for shared index to perform queries combining search results found in own in-memory index also. Take in account that shared folder must be configured in your system environment (i.e. mounted NFS folder). But this strategy in some extremely rare cases can have a bit different volatile indexes within cluster instances for a while. In a few seconds they will be up2date.

See more about Search Configuration.

Configuration example:

<workspace name="ws">
   <query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
      <properties>
         <property name="index-dir" value="shareddir/index/db1/ws" />
         <property name="changesfilter-class"
            value="org.exoplatform.services.jcr.impl.core.query.jbosscache.JBossCacheIndexChangesFilter" />
         <property name="jbosscache-configuration" value="jbosscache-indexer.xml" />
         <property name="jgroups-configuration" value="udp-mux.xml" />
         <property name="jgroups-multiplexer-stack" value="true" />
         <property name="jbosscache-cluster-name" value="JCR-cluster-indexer-ws" />
         <property name="max-volatile-time" value="60" />
         <property name="rdbms-reindexing" value="true" />
         <property name="reindexing-page-size" value="1000" />
         <property name="index-recovery-mode" value="from-coordinator" />
         <property name="index-recovery-filter" value="org.exoplatform.services.jcr.impl.core.query.lucene.DocNumberRecoveryFilter" />
      </properties>
   </query-handler>
</workspace>

Note

If you use postgreSQL and the parameter rdbms-reindexing is set to true, the performances of the queries used while indexing can be improved by setting the parameter "enable_seqscan" to "off" or "default_statistics_target" to at least "50" in the configuration of your database. Then you need to restart DB server and make analyze of the JCR_SVALUE (or JCR_MVALUE) table.

Note

If you use DB2 and the parameter rdbms-reindexing is set to true, the performance of the queiries used while indexing can be improved by making statisticks on tables by running "RUNSTATS ON TABLE <scheme>.<table> WITH DISTRIBUTION AND INDEXES ALL" for JCR_SITEM (or JCR_MITEM) and JCR_SVALUE (or JCR_MVALUE) tables.

For both cluster-ready implementations JBoss Cache, JGroups and Changes Filter values must be defined. Shared index requires some kind of remote or shared file system to be attached in a system (i.e. NFS, SMB or etc). Indexing directory ("indexDir" value) must point to it. Setting "changesfilter-class" to "org.exoplatform.services.jcr.impl.core.query.jbosscache.JBossCacheIndexChangesFilter" will enable shared index implementation.

<workspace name="ws">
   <query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
      <properties>
         <property name="index-dir" value="/mnt/nfs_drive/index/db1/ws" />
         <property name="changesfilter-class"
            value="org.exoplatform.services.jcr.impl.core.query.jbosscache.JBossCacheIndexChangesFilter" />
         <property name="jbosscache-configuration" value="jbosscache-indexer.xml" />
         <property name="jgroups-configuration" value="udp-mux.xml" />
         <property name="jgroups-multiplexer-stack" value="true" />
         <property name="jbosscache-cluster-name" value="JCR-cluster-indexer-ws" />
         <property name="max-volatile-time" value="60" />
         <property name="rdbms-reindexing" value="true" />
         <property name="reindexing-page-size" value="1000" />
         <property name="index-recovery-mode" value="from-coordinator" />
      </properties>
   </query-handler>
</workspace>

In order to use cluster-ready strategy based on local indexes, when each node has own copy of index on local file system, the following configuration must be applied. Indexing directory must point to any folder on local file system and "changesfilter-class" must be set to "org.exoplatform.services.jcr.impl.core.query.jbosscache.LocalIndexChangesFilter".

<workspace name="ws">
   <query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
      <properties>
         <property name="index-dir" value="/mnt/nfs_drive/index/db1/ws" />
         <property name="changesfilter-class"
            value="org.exoplatform.services.jcr.impl.core.query.jbosscache.LocalIndexChangesFilter" />
         <property name="jbosscache-configuration" value="jbosscache-indexer.xml" />
         <property name="jgroups-configuration" value="udp-mux.xml" />
         <property name="jgroups-multiplexer-stack" value="true" />
         <property name="jbosscache-cluster-name" value="JCR-cluster-indexer-ws" />
         <property name="max-volatile-time" value="60" />
         <property name="rdbms-reindexing" value="true" />
         <property name="reindexing-page-size" value="1000" />
         <property name="index-recovery-mode" value="from-coordinator" />
      </properties>
   </query-handler>
</workspace>

Common usecase for all cluster-ready applications is a hot joining and leaving of processing units. Node that is joining cluster for the first time or node joining after some downtime, they all must be in a synchronized state. When having a deal with shared value storages, databases and indexes, cluster nodes are synchronized anytime. But it's an issue when local index strategy used. If new node joins cluster, having no index it is retrieved or recreated. Node can be restarted also and thus index not empty. By default existing index is thought to be actual, but can be outdated. JCR offers a mechanism called RecoveryFilters that will automatically retrieve index for the joining node on startup. This feature is a set of filters that can be defined via QueryHandler configuration:

<property name="index-recovery-filter" value="org.exoplatform.services.jcr.impl.core.query.lucene.DocNumberRecoveryFilter" />

Filter number is not limited so they can be combined:

<property name="index-recovery-filter" value="org.exoplatform.services.jcr.impl.core.query.lucene.DocNumberRecoveryFilter" />
<property name="index-recovery-filter" value="org.exoplatform.services.jcr.impl.core.query.lucene.SystemPropertyRecoveryFilter" />

If any one returns fires, the index is re-synchronized. This feature uses standard index recovery mode defined by previously described parameter (can be "from-indexing" (default) or "from-coordinator")

<property name="index-recovery-mode" value="from-coordinator" />

There are couple implementations of filters:

Managing a big set of data using JCR in production environment sometimes requires special operations with Indexes, stored on File System. One of those maintenance operations is a recreation of it. Also called "re-indexing". There are various usecases when it's important to do. They include hardware faults, hard restarts, data-corruption, migrations and JCR updates that brings new features related to index. Usually index re-creation requested on server's startup or in runtime.

Common usecase for updating and re-creating the index is to stop the server and manually remove indexes for workspaces requiring it. When server will be started, missing indexes are automatically recovered by re-indexing. JCR Supports direct RDBMS re-indexing, that usually is faster than ordinary and can be configured via QueryHandler parameter "rdbms-reindexing" set to "true" (for more information please refer to "Query-handler configuration overview"). New feature to introduce is asynchronous indexing on startup. Usually startup is blocked until process is finished. Block can take any period of time, depending on amount of data persisted in repositories. But this can be resolved by using an asynchronous approaches of startup indexation. Saying briefly, it performs all operations with index in background, without blocking the repository. This is controlled by the value of "async-reindexing" parameter in QueryHandler configuration. With asynchronous indexation active, JCR starts with no active indexes present. Queries on JCR still can be executed without exceptions, but no results will be returned until index creation completed. Checking index state is possible via QueryManagerImpl:

boolean online = ((QueryManagerImpl)Workspace.getQueryManager()).getQueryHandeler().isOnline();

"OFFLINE" state means that index is currently re-creating. When state changed, corresponding log event is printed. From the start of background task index is switched to "OFFLINE", with following log event :

[INFO] Setting index OFFLINE (repository/production[system]).

When process finished, two events are logged :

[INFO] Created initial index for 143018 nodes (repository/production[system]).
[INFO] Setting index ONLINE (repository/production[system]).

Those two log lines indicates the end of process for workspace given in brackets. Calling isOnline() as mentioned above, will also return true.

Some hard system faults, error during upgrades, migration issues and some other factors may corrupt the index. Most likely end customers would like the production systems to fix index issues in run-time, without delays and restarts. Current versions of JCR supports "Hot Asynchronous Workspace Reindexing" feature. It allows end-user (Service Administrator) to launch the process in background without stopping or blocking whole application by using any JMX-compatible console (see screenshot below, "JConsole in action").

Server can continue working as expected while index is recreated. This depends on the flag "allow queries", passed via JMX interface to reindex operation invocation. If the flag set, then application continues working. But there is one critical limitation the end-users must be aware. Index is frozen while background task is running. It meant that queries are performed on index present on the moment of task startup and data written into repository after startup won't be available through the search until process finished. Data added during re-indexation is also indexed, but will be available only when task is done. Briefly, JCR makes the snapshot of indexes on asynch task startup and uses it for searches. When operation finished, stale indexes replaced by newly created including newly added data. If flag "allow queries" is set to false, then all queries will throw an exception while task is running. Current state can be acquired using the following JMX operation: