JBoss.orgCommunity Documentation
JCR offers multiple indexing strategies. They include both for standalone and clustered environments using the advantages of running in a single JVM or doing the best to use all resources available in cluster. JCR uses Lucene library as underlying search and indexing engine, but it has several limitations that greatly reduce possibilities and limits the usage of cluster advantages. That's why eXo JCR offers three strategies that are suitable for it's own usecases. They are standalone, clustered with shared index and clustered with local indexes. Each one has it's pros and cons.
Stanadlone strategy provides a stack of indexes to achieve greater performance within single JVM.
It combines in-memory buffer index directory with delayed file-system flushing. This index is called "Volatile" and it is invoked in searches also. Within some conditions volatile index is flushed to the persistent storage (file system) as new index directory. This allows to achieve great results for write operations.
Clustered implementation with local indexes is built upon same strategy with volatile in-memory index buffer along with delayed flushing on persistent storage.
As this implementation designed for clustered environment it has additional mechanisms for data delivery within cluster. Actual text extraction jobs done on the same node that does content operations (i.e. write operation). Prepared "documents" (Lucene term that means block of data ready for indexing) are replicated withing cluster nodes and processed by local indexes. So each cluster instance has the same index content. When new node joins the cluster it has no initial index, so it must be created. There are some supported ways of doing this operation. The simplest is to simply copy the index manually but this is not intended for use. If no initial index found JCR uses automated sceneries. They are controlled via configuration (see "index-recovery-mode" parameter) offering full re-indexing from database or copying from another cluster node.
For some reasons having a multiple index copies on each instance can be costly. So shared index can be used instead (see diagram below).
This indexing strategy combines advantages of in-memory index along with shared persistent index offering "near" real time search capabilities. This means that newly added content is accessible via search practically immediately. This strategy allows nodes to index data in their own volatile (in-memory) indexes, but persistent indexes are managed by single "coordinator" node only. Each cluster instance has a read access for shared index to perform queries combining search results found in own in-memory index also. Take in account that shared folder must be configured in your system environment (i.e. mounted NFS folder). But this strategy in some extremely rare cases can have a bit different volatile indexes within cluster instances for a while. In a few seconds they will be up2date.
See more about Search Configuration.
Configuration example:
<workspace name="ws"> <query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex"> <properties> <property name="index-dir" value="shareddir/index/db1/ws" /> <property name="changesfilter-class" value="org.exoplatform.services.jcr.impl.core.query.jbosscache.JBossCacheIndexChangesFilter" /> <property name="jbosscache-configuration" value="jbosscache-indexer.xml" /> <property name="jgroups-configuration" value="udp-mux.xml" /> <property name="jgroups-multiplexer-stack" value="true" /> <property name="jbosscache-cluster-name" value="JCR-cluster-indexer-ws" /> <property name="max-volatile-time" value="60" /> <property name="rdbms-reindexing" value="true" /> <property name="reindexing-page-size" value="1000" /> <property name="index-recovery-mode" value="from-coordinator" /> <property name="index-recovery-filter" value="org.exoplatform.services.jcr.impl.core.query.lucene.DocNumberRecoveryFilter" /> </properties> </query-handler> </workspace>
Table 25.1. Config properties description
Property name | Description |
---|---|
index-dir | path to index |
changesfilter-class | template of JBoss-cache configuration for all query-handlers in repository |
jbosscache-configuration | template of JBoss-cache configuration for all query-handlers in repository |
jgroups-configuration | jgroups-configuration is template configuration for all components (search, cache, locks) [Add link to document describing template configurations] |
jgroups-multiplexer-stack | [TODO about jgroups-multiplexer-stack - add link to JBoss doc] |
jbosscache-cluster-name | cluster name (must be unique) |
max-volatile-time | max time to live for Volatile Index |
rdbms-reindexing | indicate that need to use rdbms reindexing mechanism if possible, the default value is true |
reindexing-page-size | maximum amount of nodes which can be retrieved from storage for re-indexing purpose, the default value is 100 |
index-recovery-mode | If the parameter has been set to
from-indexing , so a full indexing will be
automatically launched (default behavior), if the parameter has
been set to from-coordinator , the index will
be retrieved from coordinator |
index-recovery-filter | Defines implementation class or classes of RecoveryFilters, the mechanism of index synchronization for Local Index strategy. |
async-reindexing | Controls the process of re-indexing on JCR's startup. If flag set, indexing will be launched asynchronously, without blocking the JCR. Default is "false". |
If you use postgreSQL and the parameter rdbms-reindexing is set to true, the performances of the queries used while indexing can be improved by setting the parameter "enable_seqscan" to "off" or "default_statistics_target" to at least "50" in the configuration of your database. Then you need to restart DB server and make analyze of the JCR_SVALUE (or JCR_MVALUE) table.
If you use DB2 and the parameter rdbms-reindexing is set to true, the performance of the queiries used while indexing can be improved by making statisticks on tables by running "RUNSTATS ON TABLE <scheme>.<table> WITH DISTRIBUTION AND INDEXES ALL" for JCR_SITEM (or JCR_MITEM) and JCR_SVALUE (or JCR_MVALUE) tables.
When running JCR in standalone usually standalone indexing is used also. Such parameters as "changesfilter-class", "jgroups-configuration" and all the "jbosscache-*" must be skipped and not defined. Like the configuration below.
<workspace name="ws"> <query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex"> <properties> <property name="index-dir" value="shareddir/index/db1/ws" /> <property name="max-volatile-time" value="60" /> <property name="rdbms-reindexing" value="true" /> <property name="reindexing-page-size" value="1000" /> <property name="index-recovery-mode" value="from-coordinator" /> </properties> </query-handler> </workspace>
For both cluster-ready implementations JBoss Cache, JGroups and Changes Filter values must be defined. Shared index requires some kind of remote or shared file system to be attached in a system (i.e. NFS, SMB or etc). Indexing directory ("indexDir" value) must point to it. Setting "changesfilter-class" to "org.exoplatform.services.jcr.impl.core.query.jbosscache.JBossCacheIndexChangesFilter" will enable shared index implementation.
<workspace name="ws"> <query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex"> <properties> <property name="index-dir" value="/mnt/nfs_drive/index/db1/ws" /> <property name="changesfilter-class" value="org.exoplatform.services.jcr.impl.core.query.jbosscache.JBossCacheIndexChangesFilter" /> <property name="jbosscache-configuration" value="jbosscache-indexer.xml" /> <property name="jgroups-configuration" value="udp-mux.xml" /> <property name="jgroups-multiplexer-stack" value="true" /> <property name="jbosscache-cluster-name" value="JCR-cluster-indexer-ws" /> <property name="max-volatile-time" value="60" /> <property name="rdbms-reindexing" value="true" /> <property name="reindexing-page-size" value="1000" /> <property name="index-recovery-mode" value="from-coordinator" /> </properties> </query-handler> </workspace>
In order to use cluster-ready strategy based on local indexes, when each node has own copy of index on local file system, the following configuration must be applied. Indexing directory must point to any folder on local file system and "changesfilter-class" must be set to "org.exoplatform.services.jcr.impl.core.query.jbosscache.LocalIndexChangesFilter".
<workspace name="ws"> <query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex"> <properties> <property name="index-dir" value="/mnt/nfs_drive/index/db1/ws" /> <property name="changesfilter-class" value="org.exoplatform.services.jcr.impl.core.query.jbosscache.LocalIndexChangesFilter" /> <property name="jbosscache-configuration" value="jbosscache-indexer.xml" /> <property name="jgroups-configuration" value="udp-mux.xml" /> <property name="jgroups-multiplexer-stack" value="true" /> <property name="jbosscache-cluster-name" value="JCR-cluster-indexer-ws" /> <property name="max-volatile-time" value="60" /> <property name="rdbms-reindexing" value="true" /> <property name="reindexing-page-size" value="1000" /> <property name="index-recovery-mode" value="from-coordinator" /> </properties> </query-handler> </workspace>
Common usecase for all cluster-ready applications is a hot joining and leaving of processing units. Node that is joining cluster for the first time or node joining after some downtime, they all must be in a synchronized state. When having a deal with shared value storages, databases and indexes, cluster nodes are synchronized anytime. But it's an issue when local index strategy used. If new node joins cluster, having no index it is retrieved or recreated. Node can be restarted also and thus index not empty. By default existing index is thought to be actual, but can be outdated. JCR offers a mechanism called RecoveryFilters that will automatically retrieve index for the joining node on startup. This feature is a set of filters that can be defined via QueryHandler configuration:
<property name="index-recovery-filter" value="org.exoplatform.services.jcr.impl.core.query.lucene.DocNumberRecoveryFilter" />
Filter number is not limited so they can be combined:
<property name="index-recovery-filter" value="org.exoplatform.services.jcr.impl.core.query.lucene.DocNumberRecoveryFilter" /> <property name="index-recovery-filter" value="org.exoplatform.services.jcr.impl.core.query.lucene.SystemPropertyRecoveryFilter" />
If any one returns fires, the index is re-synchronized. This feature uses standard index recovery mode defined by previously described parameter (can be "from-indexing" (default) or "from-coordinator")
<property name="index-recovery-mode" value="from-coordinator" />
There are couple implementations of filters:
org.exoplatform.services.jcr.impl.core.query.lucene.DummyRecoveryFilter: always returns true, for cases when index must be force resynchronized (recovered) each time;
org.exoplatform.services.jcr.impl.core.query.lucene.SystemPropertyRecoveryFilter : return value of system property "org.exoplatform.jcr.recoveryfilter.forcereindexing". So index recovery can be controlled from the top without changing documentation using system properties;
org.exoplatform.services.jcr.impl.core.query.lucene.ConfigurationPropertyRecoveryFilter : return value of QueryHandler configuration property "index-recovery-filter-forcereindexing". So index recovery can be controlled from configuration separately for each workspace. I.e:
<property name="index-recovery-filter" value="org.exoplatform.services.jcr.impl.core.query.lucene.ConfigurationPropertyRecoveryFilter" /> <property name="index-recovery-filter-forcereindexing" value="true" />
org.exoplatform.services.jcr.impl.core.query.lucene.DocNumberRecoveryFilter : checks number of documents in index on coordinator side and self-side. Return true if differs. Advantage of this filter comparing to other, it will skip reindexing for workspaces where index wasn't modified. I.e. there is 10 repositories with 3 workspaces in each one. Only one is really heavily used in cluster : frontend/production. So using this filter will only reindex those workspaces that are really changed, without affecting other indexes thus greatly reducing startup time.
JBoss-Cache template configuration for query handler is about the same for both clustered strategies.
jbosscache-indexer.xml
<?xml version="1.0" encoding="UTF-8"?> <jbosscache xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:jboss:jbosscache-core:config:3.1"> <locking useLockStriping="false" concurrencyLevel="50000" lockParentForChildInsertRemove="false" lockAcquisitionTimeout="20000" /> <!-- Configure the TransactionManager --> <transaction transactionManagerLookupClass="org.jboss.cache.transaction.JBossStandaloneJTAManagerLookup" /> <clustering mode="replication" clusterName="${jbosscache-cluster-name}"> <stateRetrieval timeout="20000" fetchInMemoryState="false" /> <jgroupsConfig multiplexerStack="jcr.stack" /> <sync /> </clustering> <!-- Eviction configuration --> <eviction wakeUpInterval="5000"> <default algorithmClass="org.jboss.cache.eviction.FIFOAlgorithm" eventQueueSize="1000000"> <property name="maxNodes" value="10000" /> <property name="minTimeToLive" value="60000" /> </default> </eviction> </jbosscache>
See more about template configurations here.
Managing a big set of data using JCR in production environment sometimes requires special operations with Indexes, stored on File System. One of those maintenance operations is a recreation of it. Also called "re-indexing". There are various usecases when it's important to do. They include hardware faults, hard restarts, data-corruption, migrations and JCR updates that brings new features related to index. Usually index re-creation requested on server's startup or in runtime.
Common usecase for updating and re-creating the index is to stop the server and manually remove indexes for workspaces requiring it. When server will be started, missing indexes are automatically recovered by re-indexing. JCR Supports direct RDBMS re-indexing, that usually is faster than ordinary and can be configured via QueryHandler parameter "rdbms-reindexing" set to "true" (for more information please refer to "Query-handler configuration overview"). New feature to introduce is asynchronous indexing on startup. Usually startup is blocked until process is finished. Block can take any period of time, depending on amount of data persisted in repositories. But this can be resolved by using an asynchronous approaches of startup indexation. Saying briefly, it performs all operations with index in background, without blocking the repository. This is controlled by the value of "async-reindexing" parameter in QueryHandler configuration. With asynchronous indexation active, JCR starts with no active indexes present. Queries on JCR still can be executed without exceptions, but no results will be returned until index creation completed. Checking index state is possible via QueryManagerImpl:
boolean online = ((QueryManagerImpl)Workspace.getQueryManager()).getQueryHandeler().isOnline();
"OFFLINE" state means that index is currently re-creating. When state changed, corresponding log event is printed. From the start of background task index is switched to "OFFLINE", with following log event :
[INFO] Setting index OFFLINE (repository/production[system]).
When process finished, two events are logged :
[INFO] Created initial index for 143018 nodes (repository/production[system]). [INFO] Setting index ONLINE (repository/production[system]).
Those two log lines indicates the end of process for workspace given in brackets. Calling isOnline() as mentioned above, will also return true.
Some hard system faults, error during upgrades, migration issues and some other factors may corrupt the index. Most likely end customers would like the production systems to fix index issues in run-time, without delays and restarts. Current versions of JCR supports "Hot Asynchronous Workspace Reindexing" feature. It allows end-user (Service Administrator) to launch the process in background without stopping or blocking whole application by using any JMX-compatible console (see screenshot below, "JConsole in action").
Server can continue working as expected while index is recreated. This depends on the flag "allow queries", passed via JMX interface to reindex operation invocation. If the flag set, then application continues working. But there is one critical limitation the end-users must be aware. Index is frozen while background task is running. It meant that queries are performed on index present on the moment of task startup and data written into repository after startup won't be available through the search until process finished. Data added during re-indexation is also indexed, but will be available only when task is done. Briefly, JCR makes the snapshot of indexes on asynch task startup and uses it for searches. When operation finished, stale indexes replaced by newly created including newly added data. If flag "allow queries" is set to false, then all queries will throw an exception while task is running. Current state can be acquired using the following JMX operation:
getHotReindexingState() - returns information about latest invocation: start time, if in progress or finish time if done.
First of all, can't launch Hot re-indexing via JMX if index is already in offline mode. It means that index is currently is invoked in some operations, like re-indexing at startup, copying in cluster to another node or whatever. Another important this is Hot Asynchronous Reindexing via JMX and "on startup" reindexing are completely different features. So you can't get the state of startup reindexing using command getHotReindexingState in JMX interface, but there are some common JMX operations:
getIOMode - returns current index IO mode (READ_ONLY / READ_WRITE), belongs to clustered configuration states;
getState - returns current state: ONLINE / OFFLINE.
As mentioned above, JCR Indexing is based on Lucene indexing library as underlying search engine. It uses Directories to store index and manages access to index by Lock Factories. By default JCR implementation uses optimal combination of Directory implementation and Lock Factory implementation. When running on OS different from Windows, NIOFSDirectory implementation used. And SimpleFSDirectory for Windows stations. NativeFSLockFactory is an optimal solution for wide variety of cases including clustered environment with NFS shared resources. But those default can be overridden with the help of system properties. There are two properties: "org.exoplatform.jcr.lucene.store.FSDirectoryLockFactoryClass" and "org.exoplatform.jcr.lucene.FSDirectory.class" that are responsible for changing default behavior. First one defines implementation of abstract Lucene LockFactory class and the second one sets implementation class for FSDirectory instances. For more information please refer to Lucene documentation. But be sure You know what You are changing. JCR allows end users to change implementation classes of Lucene internals, but doesn't guarantee it's stability and functionality.