Chapter 3. Configuration

3.1. Enabling Hibernate Search and automatic indexing

Let's start with the most basic configuration question - how do I enable Hibernate Search?

3.1.1. Enabling Hibernate Search

The good news is that Hibernate Search is enabled out of the box when detected on the classpath by Hibernate Core. If, for some reason you need to disable it, set hibernate.search.autoregister_listeners to false. Note that there is no performance penalty when the listeners are enabled but no entities are annotated as indexed.

3.1.2. Automatic indexing

By default, every time an object is inserted, updated or deleted through Hibernate, Hibernate Search updates the according Lucene index. It is sometimes desirable to disable that features if either your index is read-only or if index updates are done in a batch way (see Section 6.3, “Rebuilding the whole index”).

To disable event based indexing, set

hibernate.search.indexing_strategy = manual

Note

In most case, the JMS backend provides the best of both world, a lightweight event based system keeps track of all changes in the system, and the heavyweight indexing process is done by a separate process or machine.

3.2. Configuring the `IndexManager`

The role of the index manager component is described in Chapter 2, Architecture. Hibernate Search provides two possible implementations for this interface to choose from.

directory-based: the default implementation which uses the Lucene Directory abstraction to manage index files.
near-real-time: avoid flushing writes to disk at each commit. This index manager is also Directory based, but also makes uses of Lucene's NRT functionallity.

To select an alternative you specify the property:

hibernate.search.[default|<indexname>].indexmanager = near-real-time

hibernate.search.[default|<indexname>].indexmanager = my.corp.myapp.CustomIndexManager

Tip

Your custom index manager implementation doesn't need to use the same components as the default implementations. For example, you can delegate to a remote indexing service which doesn't expose a Directory interface.

3.3. Directory configuration

As we have seen in Section 3.2, “Configuring the IndexManager” the default index manager uses Lucene's notion of a Directory to store the index files. The Directory implementation can be customized and Lucene comes bundled with a file system and an in-memory implementation. DirectoryProvider is the Hibernate Search abstraction around a Lucene Directory and handles the configuration and the initialization of the underlying Lucene resources. Table 3.1, “List of built-in DirectoryProvider” shows the list of the directory providers available in Hibernate Search together with their corresponding options.

To configure your DirectoryProvider you have to understand that each indexed entity is associated to a Lucene index (except of the case where multiple entities share the same index - Section 9.5, “Sharing indexes”). The name of the index is given by the index property of the @Indexed annotation. If the index property is not specified the fully qualified name of the indexed class will be used as name (recommended).

Knowing the index name, you can configure the directory provider and any additional options by using the prefix hibernate.search.<indexname>. The name default (hibernate.search.default) is reserved and can be used to define properties which apply to all indexes. Example 3.2, “Configuring directory providers” shows how hibernate.search.default.directory_provider is used to set the default directory provider to be the filesystem one. hibernate.search.default.indexBase sets then the default base directory for the indexes. As a result the index for the entity Status is created in /usr/lucene/indexes/org.hibernate.example.Status.

The index for the Rule entity, however, is using an in-memory directory, because the default directory provider for this entity is overriden by the property hibernate.search.Rules.directory_provider.

Finally the Action entity uses a custom directory provider CustomDirectoryProvider specified via hibernate.search.Actions.directory_provider.

Example 3.1. Specifying the index name

package org.hibernate.example;


@Indexed

public class Status { ... }


@Indexed(index="Rules")

public class Rule { ... }


@Indexed(index="Actions")

public class Action { ... }

Example 3.2. Configuring directory providers

hibernate.search.default.directory_provider filesystem
hibernate.search.default.indexBase=/usr/lucene/indexes
hibernate.search.Rules.directory_provider ram
hibernate.search.Actions.directory_provider com.acme.hibernate.CustomDirectoryProvider

Tip

Using the described configuration scheme you can easily define common rules like the directory provider and base directory, and override those defaults later on on a per index basis.

Table 3.1. List of built-in DirectoryProvider

Name and description	Properties
ram: Memory based directory, the directory will be uniquely identified (in the same deployment unit) by the `@Indexed.index` element	none
filesystem: File system based directory. The directory used will be <indexBase>/< indexName >	`indexBase` : base directory `indexName`: override @Indexed.index (useful for sharded indexes) `locking_strategy` : optional, see Section 3.7, “LockFactory configuration” `filesystem_access_type`: allows to determine the exact type of `FSDirectory` implementation used by this `DirectoryProvider`. Allowed values are `auto` (the default value, selects `NIOFSDirectory` on non Windows systems, `SimpleFSDirectory` on Windows), `simple` (`SimpleFSDirectory`), `nio` (`NIOFSDirectory`), `mmap` (`MMapDirectory`). Make sure to refer to Javadocs of these `Directory` implementations before changing this setting. Even though `NIOFSDirectory` or `MMapDirectory` can bring substantial performace boosts they also have their issues.
filesystem-master: File system based directory. Like `filesystem`. It also copies the index to a source directory (aka copy directory) on a regular basis. The recommended value for the refresh period is (at least) 50% higher that the time to copy the information (default 3600 seconds - 60 minutes). Note that the copy is based on an incremental copy mechanism reducing the average copy time. DirectoryProvider typically used on the master node in a JMS back end cluster. The `buffer_size_on_copy` optimum depends on your operating system and available RAM; most people reported good results using values between 16 and 64MB.	`indexBase`: base directory `indexName`: override @Indexed.index (useful for sharded indexes) `sourceBase`: source (copy) base directory. `source`: source directory suffix (default to `@Indexed.index`). The actual source directory name being `<sourceBase>/<source>` `refresh`: refresh period in seconds (the copy will take place every `refresh` seconds). If a copy is still in progress when the following `refresh` period elapses, the second copy operation will be skipped. `buffer_size_on_copy`: The amount of MegaBytes to move in a single low level copy instruction; defaults to 16MB. `locking_strategy` : optional, see Section 3.7, “LockFactory configuration” `filesystem_access_type`: allows to determine the exact type of `FSDirectory` implementation used by this `DirectoryProvider`. Allowed values are `auto` (the default value, selects `NIOFSDirectory` on non Windows systems, `SimpleFSDirectory` on Windows), `simple` (`SimpleFSDirectory`), `nio` (`NIOFSDirectory`), `mmap` (`MMapDirectory`). Make sure to refer to Javadocs of these `Directory` implementations before changing this setting. Even though `NIOFSDirectory` or `MMapDirectory` can bring substantial performace boosts they also have their issues.
filesystem-slave: File system based directory. Like `filesystem`, but retrieves a master version (source) on a regular basis. To avoid locking and inconsistent search results, 2 local copies are kept. The recommended value for the refresh period is (at least) 50% higher that the time to copy the information (default 3600 seconds - 60 minutes). Note that the copy is based on an incremental copy mechanism reducing the average copy time. If a copy is still in progress when `refresh` period elapses, the second copy operation will be skipped. DirectoryProvider typically used on slave nodes using a JMS back end. The `buffer_size_on_copy` optimum depends on your operating system and available RAM; most people reported good results using values between 16 and 64MB.	`indexBase`: Base directory `indexName`: override @Indexed.index (useful for sharded indexes) `sourceBase`: Source (copy) base directory. `source`: Source directory suffix (default to `@Indexed.index`). The actual source directory name being `<sourceBase>/<source>` `refresh`: refresh period in second (the copy will take place every refresh seconds). `buffer_size_on_copy`: The amount of MegaBytes to move in a single low level copy instruction; defaults to 16MB. `locking_strategy` : optional, see Section 3.7, “LockFactory configuration” `retry_marker_lookup` : optional, default to 0. Defines how many times we look for the marker files in the source directory before failing. Waiting 5 seconds between each try. `retry_initialize_period` : optional, set an integer value in seconds to enable the retry initialize feature: if the slave can't find the master index it will try again until it's found in background, without preventing the application to start: fullText queries performed before the index is initialized are not blocked but will return empty results. When not enabling the option or explicitly setting it to zero it will fail with an exception instead of scheduling a retry timer. To prevent the application from starting without an invalid index but still control an initialization timeout, see `retry_marker_lookup` instead. `filesystem_access_type`: allows to determine the exact type of `FSDirectory` implementation used by this `DirectoryProvider`. Allowed values are `auto` (the default value, selects `NIOFSDirectory` on non Windows systems, `SimpleFSDirectory` on Windows), `simple` (`SimpleFSDirectory`), `nio` (`NIOFSDirectory`), `mmap` (`MMapDirectory`). Make sure to refer to Javadocs of these `Directory` implementations before changing this setting. Even though `NIOFSDirectory` or `MMapDirectory` can bring substantial performace boosts they also have their issues.
infinispan: Infinispan based directory. Use it to store the index in a distributed grid, making index changes visible to all elements of the cluster very quickly. Also see Section 3.3.1, “Infinispan Directory configuration” for additional requirements and configuration settings. Infinispan needs a global configuration and additional dependencies; the settings defined here apply to each different index.	`locking_cachename`: name of the Infinispan cache to use to store locks. `data_cachename` : name of the Infinispan cache to use to store the largest data chunks; this area will contain the largest objects, use replication if you have enough memory or switch to distribution. `metadata_cachename`: name of the Infinispan cache to use to store the metadata relating to the index; this data is rather small and read very often, it's recommended to have this cache setup using replication. `chunk_size`: large files of the index are split in smaller chunks, you might want to set the highest value efficiently handled by your network. Networking tuning might be useful.

Tip

If the built-in directory providers do not fit your needs, you can write your own directory provider by implementing the org.hibernate.store.DirectoryProvider interface. In this case, pass the fully qualified class name of your provider into the directory_provider property. You can pass any additional properties using the prefix hibernate.search.<indexname>.

3.3.1. Infinispan Directory configuration

Infinispan is a distributed, scalable, highly available data grid platform which supports autodiscovery of peer nodes. Using Infinispan and Hibernate Search in combination, it is possible to store the Lucene index in a distributed environment where index updates are quickly available on all nodes.

This section describes in greater detail how to configure Hibernate Search to use an Infinispan Lucene Directory.

When using an Infinispan Directory the index is stored in memory and shared across multiple nodes. It is considered a single directory distributed across all participating nodes. If a node updates the index, all other nodes are updated as well. Updates on one node can be immediately searched for in the whole cluster.

The default configuration replicates all data defining the index across all nodes, thus consuming a significant amount of memory. For large indexes it's suggested to enable data distribution, so that each piece of information is replicated to a subset of all cluster members.

It is also possible to offload part or most information to a CacheStore, such as plain filesystem, Amazon S3, Cassandra, Berkley DB or standard relational databases. You can configure it to have a CacheStore on each node or have a single centralized one shared by each node.

See the Infinispan documentation for all Infinispan configuration options.

3.3.1.1. Requirements

To use the Infinispan directory via Maven, add the following dependencies:

Example 3.3. Maven dependencies for Hibernate Search


<dependency>

   <groupId>org.hibernate</groupId>

   <artifactId>hibernate-search</artifactId>

   <version>4.1.1.Final</version>

</dependency>

<dependency>

   <groupId>org.hibernate</groupId>

   <artifactId>hibernate-search-infinispan</artifactId>

   <version>4.1.1.Final</version>

</dependency>

For the non-maven users, add hibernate-search-infinispan.jar, infinispan-lucene-directory.jar and infinispan-core.jar to your application classpath. These last two jars are distributed by Infinispan.

3.3.1.2. Architecture

Even when using an Infinispan directory it's still recommended to use the JMS Master/Slave or JGroups backend, because in Infinispan all nodes will share the same index and it is likely that IndexWriters being active on different nodes will try to acquire the lock on the same index. So instead of sending updates directly to the index, send it to a JMS queue or JGroups channel and have a single node apply all changes on behalf of all other nodes.

Configuring a non-default backend is not a requirement but a performance optimization as locks are enabled to have a single node writing.

To configure a JMS slave only the backend must be replaced, the directory provider must be set to infinispan; set the same directory provider on the master, they will connect without the need to setup the copy job across nodes. Using the JGroups backend is very similar - just combine the backend configuration with the infinispan directory provider.

3.3.1.3. Infinispan Configuration

The most simple configuration only requires to enable the backend:

hibernate.search.[default|<indexname>].directory_provider infinispan

That's all what is needed to get a cluster-replicated index, but the default configuration does not enable any form of permanent persistence for the index; to enable such a feature an Infinispan configuration file should be provided.

To use Infinispan, Hibernate Search requirest a CacheManager; it can lookup and reuse an existing CacheManager, via JNDI, or start and manage a new one. In the latter case Hibernate Search will start and stop it ( closing occurs when the Hibernate SessionFactory is closed).

To use and existing CacheManager via JNDI (optional parameter):

hibernate.search.[default|<indexname>].infinispan.cachemanager_jndiname = [jndiname]

To start a new CacheManager from a configuration file (optional parameter):

hibernate.search.[default|<indexname>].infinispan.configuration_resourcename = [infinispan configuration filename]

If both parameters are defined, JNDI will have priority. If none of these is defined, Hibernate Search will use the default Infinispan configuration included in hibernate-search-infinispan.jar. This configuration should work fine in most cases but does not store the index in a persistent cache store.

As mentioned in Table 3.1, “List of built-in DirectoryProvider”, each index makes use of three caches, so three different caches should be configured as shown in the default-hibernatesearch-infinispan.xml provided in the hibernate-search-infinispan.jar. Several indexes can share the same caches.

3.4. Worker configuration

It is possible to refine how Hibernate Search interacts with Lucene through the worker configuration. There exist several architectural components and possible extension points. Let's have a closer look.

First there is a Worker. An implementation of the Worker interface is reponsible for receiving all entity changes, queuing them by context and applying them once a context ends. The most intuative context, especially in connection with ORM, is the transaction. For this reason Hibernate Search will per default use the TransactionalWorker to scope all changes per transaction. One can, however, imagine a scenario where the context depends for example on the number of entity changes or some other application (lifecycle) events. For this reason the Worker implementation is configurable as shown in Table 3.2, “Scope configuration”.

Table 3.2. Scope configuration

Property	Description
hibernate.search.worker.scope	The fully qualifed class name of the `Worker` implementation to use. If this property is not set, empty or `transaction` the default `TransactionalWorker` is used.
hibernate.search.worker.*	All configuration properties prefixed with `hibernate.search.worker` are passed to the Worker during initialization. This allows adding custom, worker specific parameters.

Once a context ends it is time to prepare and apply the index changes. This can be done synchronously or asynchronously from within a new thread. Synchronous updates have the advantage that the index is at all times in sync with the databases. Asynchronous updates, on the other hand, can help to minimize the user response time. The drawback is potential discrepancies between database and index states. Lets look at the configuration options shown in Table 3.3, “Execution configuration”.

Note

The following options can be different on each index; in fact they need the indexName prefix or use default to set the default value for all indexes.

Table 3.3. Execution configuration

Property	Description
hibernate.search.<indexName>.worker.execution	`sync`: synchronous execution (default) `async`: asynchronous execution
hibernate.search.<indexName>.worker.thread_pool.size	The backend can apply updates from the same transaction context (or batch) in parallel, using a threadpool. The default value is 1. You can experiment with larger values if you have many operations per transaction.
hibernate.search.<indexName>.worker.buffer_queue.max	Defines the maximal number of work queue if the thread poll is starved. Useful only for asynchronous execution. Default to infinite. If the limit is reached, the work is done by the main thread.

So far all work is done within the same Virtual Machine (VM), no matter which execution mode. The total amount of work has not changed for the single VM. Luckily there is a better approach, namely delegation. It is possible to send the indexing work to a different server by configuring hibernate.search.worker.backend - see Table 3.4, “Backend configuration”. Again this option can be configured differently for each index.

Table 3.4. Backend configuration

Property Description

hibernate.search.<indexName>.worker.backend

lucene: The default backend which runs index updates in the same VM. Also used when the property is undefined or empty.

jms: JMS backend. Index updates are send to a JMS queue to be processed by an indexing master. See Table 3.5, “JMS backend configuration” for additional configuration options and Section 3.4.1, “JMS Master/Slave back end” for a more detailed descripton of this setup.

jgroupsMaster or jgroupsSlave: Backend using JGroups as communication layer. See Section 3.4.2, “JGroups Master/Slave back end” for a more detailed description of this setup.

blackhole: Mainly a test/developer setting which ignores all indexing work

You can also specify the fully qualified name of a class implementing BackendQueueProcessor. This way you can implement your own communication layer. The implementation is responsilbe for returning a Runnable instance which on execution will process the index work.

Table 3.5. JMS backend configuration

Property	Description
hibernate.search.<indexName>.worker.jndi.*	Defines the JNDI properties to initiate the InitialContext (if needed). JNDI is only used by the JMS back end.
hibernate.search.<indexName>.worker.jms.connection_factory	Mandatory for the JMS back end. Defines the JNDI name to lookup the JMS connection factory from (`/ConnectionFactory` by default in JBoss AS)
hibernate.search.<indexName>.worker.jms.queue	Mandatory for the JMS back end. Defines the JNDI name to lookup the JMS queue from. The queue will be used to post work messages.

Warning

As you probably noticed, some of the shown properties are correlated which means that not all combinations of property values make sense. In fact you can end up with a non-functional configuration. This is especially true for the case that you provide your own implementations of some of the shown interfaces. Make sure to study the existing code before you write your own Worker or BackendQueueProcessor implementation.

3.4.1. JMS Master/Slave back end

This section describes in greater detail how to configure the Master/Slave Hibernate Search architecture.

JMS back end configuration.

3.4.1.1. Slave nodes

Every index update operation is sent to a JMS queue. Index querying operations are executed on a local index copy.

Example 3.4. JMS Slave configuration

### slave configuration

## DirectoryProvider
# (remote) master location
hibernate.search.default.sourceBase = /mnt/mastervolume/lucenedirs/mastercopy

# local copy location
hibernate.search.default.indexBase = /Users/prod/lucenedirs

# refresh every half hour
hibernate.search.default.refresh = 1800

# appropriate directory provider
hibernate.search.default.directory_provider = filesystem-slave

## Backend configuration
hibernate.search.default.worker.backend = jms
hibernate.search.default.worker.jms.connection_factory = /ConnectionFactory
hibernate.search.default.worker.jms.queue = queue/hibernatesearch
#optional jndi configuration (check your JMS provider for more information)

## Optional asynchronous execution strategy
# hibernate.search.default.worker.execution = async
# hibernate.search.default.worker.thread_pool.size = 2
# hibernate.search.default.worker.buffer_queue.max = 50

Tip

A file system local copy is recommended for faster search results.

3.4.1.2. Master node

Every index update operation is taken from a JMS queue and executed. The master index is copied on a regular basis.

Example 3.5. JMS Master configuration

### master configuration

## DirectoryProvider
# (remote) master location where information is copied to
hibernate.search.default.sourceBase = /mnt/mastervolume/lucenedirs/mastercopy

# local master location
hibernate.search.default.indexBase = /Users/prod/lucenedirs

# refresh every half hour
hibernate.search.default.refresh = 1800

# appropriate directory provider
hibernate.search.default.directory_provider = filesystem-master

## Backend configuration
#Backend is the default lucene one

In addition to the Hibernate Search framework configuration, a Message Driven Bean has to be written and set up to process the index works queue through JMS.

Example 3.6. Message Driven Bean processing the indexing queue

@MessageDriven(activationConfig = {

      @ActivationConfigProperty(propertyName="destinationType", 

                                propertyValue="javax.jms.Queue"),

      @ActivationConfigProperty(propertyName="destination", 

                                propertyValue="queue/hibernatesearch"),

      @ActivationConfigProperty(propertyName="DLQMaxResent", propertyValue="1")

   } )

public class MDBSearchController extends AbstractJMSHibernateSearchController 

                                 implements MessageListener {

    @PersistenceContext EntityManager em;

    

    //method retrieving the appropriate session

    protected Session getSession() {

        return (Session) em.getDelegate();

    }


    //potentially close the session opened in #getSession(), not needed here

    protected void cleanSessionIfNeeded(Session session) 

    }

}

This example inherits from the abstract JMS controller class available in the Hibernate Search source code and implements a JavaEE MDB. This implementation is given as an example and can be adjusted to make use of non Java EE Message Driven Beans. For more information about the getSession() and cleanSessionIfNeeded(), please check AbstractJMSHibernateSearchController's javadoc.

3.4.2. JGroups Master/Slave back end

This section describes how to configure the JGroups Master/Slave back end. The configuration examples illustrated in Section 3.4.1, “JMS Master/Slave back end” also apply here, only a different backend (hibernate.search.worker.backend) needs to be set.

All backends configured to use JGroups share the same Channel. The JGroups JChannel is the main communication link across all nodes participating in the same cluster group; since it is convenient and more efficient to have just one channel shared across all backends, the Channel configuration properties are not defined on a per-worker section but globally. See Section 3.4.2.3, “JGroups channel configuration”.

3.4.2.1. Slave nodes

Every index update operation is sent through a JGroups channel to the master node. Index querying operations are executed on a local index copy. Enabling the JGroups worker only makes sure the index operations are sent to the master, you still have to synchronize configuring an appropriate directory (See filesystem-master, filesystem-slave or infinispan options in Section 3.3, “Directory configuration”).

Example 3.7. JGroups Slave configuration

### slave configuration
hibernate.search.default.worker.backend = jgroupsSlave

3.4.2.2. Master node

Every index update operation is taken from a JGroups channel and executed. The master index is copied on a regular basis.

Example 3.8. JGroups Master configuration

### master configuration
hibernate.search.default.worker.backend = jgroupsMaster

3.4.2.3. JGroups channel configuration

Configuring the JGroups channel essentially entails specifying the transport in terms of a network protocol stack. To configure the JGroups transport, point the configuration property hibernate.search.services.jgroups.configurationFile to a JGroups configuration file; this can be either a file path or a Java resource name.

Tip

If no property is explicitly specified it is assumed that the JGroups default configuration file flush-udp.xml is used. This example configuration is known to work in most scenarios, with the notable exception of Amazon AWS; refer to the JGroups manual for more examples and protocol configuration details.

The default channel name is Hibernate Search Cluster which can be configured as seen in Example 3.9, “JGroups channel name configuration”.

Example 3.9. JGroups channel name configuration

hibernate.search.services.jgroups.clusterName = My-Custom-Cluster-Id

3.4.2.3.1. JGroups channel instance injection

For programmatic configurations, one additional option is available to configure the JGroups channel: to pass an existing channel instance to Hibernate Search directly using the property hibernate.search.services.jgroups.providedChannel, as shown in the following example.

import org.hibernate.search.backend.impl.jgroups.JGroupsChannelProvider;


org.jgroups.JChannel channel = ...

Map<String,String> properties = new HashMap<String,String)(1);

properties.put( JGroupsChannelProvider.CHANNEL_INJECT, channel );

EntityManagerFactory emf = Persistence.createEntityManagerFactory( "userPU", properties );

3.5. Reader strategy configuration

The different reader strategies are described in Reader strategy. Out of the box strategies are:

shared: share index readers across several queries. This strategy is the most efficient.
not-shared: create an index reader for each individual query

The default reader strategy is shared. This can be adjusted:

hibernate.search.[default|<indexname>].reader.strategy = not-shared

Adding this property switches to the not-shared strategy.

Or if you have a custom reader strategy:

hibernate.search.[default|<indexname>].reader.strategy = my.corp.myapp.CustomReaderProvider

where my.corp.myapp.CustomReaderProvider is the custom strategy implementation.

3.6. Tuning Lucene indexing performance

Hibernate Search allows you to tune the Lucene indexing performance by specifying a set of parameters which are passed through to underlying Lucene IndexWriter such as mergeFactor, maxMergeDocs and maxBufferedDocs. You can specify these parameters either as default values applying for all indexes, on a per index basis, or even per shard.

There are several low level IndexWriter settings which can be tuned for different use cases. These parameters are grouped by the indexwriter keyword:

hibernate.search.[default|<indexname>].indexwriter.<parameter_name>

If no value is set for an indexwriter value in a specific shard configuration, Hibernate Search will look at the index section, then at the default section.

Example 3.10. Example performance option configuration

hibernate.search.Animals.2.indexwriter.max_merge_docs 10
hibernate.search.Animals.2.indexwriter.merge_factor 20
hibernate.search.Animals.2.indexwriter.term_index_interval default
hibernate.search.default.indexwriter.max_merge_docs 100
hibernate.search.default.indexwriter.ram_buffer_size 64

The configuration in Example 3.10, “Example performance option configuration” will result in these settings applied on the second shard of the Animal index:

max_merge_docs = 10
merge_factor = 20
ram_buffer_size = 64MB
term_index_interval = Lucene default

All other values will use the defaults defined in Lucene.

The default for all values is to leave them at Lucene's own default. The values listed in Table 3.6, “List of indexing performance and behavior properties” depend for this reason on the version of Lucene you are using. The values shown are relative to version 2.4. For more information about Lucene indexing performance, please refer to the Lucene documentation.

Table 3.6. List of indexing performance and behavior properties

Property	Description	Default Value
hibernate.search.[default\|<indexname>].exclusive_index_use	Set to `true` when no other process will need to write to the same index. This will enable Hibernate Search to work in exlusive mode on the index and improve performance when writing changes to the index.	`true` (improved performance, releases locks only at shutdown)
hibernate.search.[default\|<indexname>].max_queue_length	Each index has a separate "pipeline" which contains the updates to be applied to the index. When this queue is full adding more operations to the queue becomes a blocking operation. Configuring this setting doesn't make much sense unless the `worker.execution` is configured as `async`.	`1000`
hibernate.search.[default\|<indexname>].indexwriter.max_buffered_delete_terms	Determines the minimal number of delete terms required before the buffered in-memory delete terms are applied and flushed. If there are documents buffered in memory at the time, they are merged and a new segment is created.	Disabled (flushes by RAM usage)
hibernate.search.[default\|<indexname>].indexwriter.max_buffered_docs	Controls the amount of documents buffered in memory during indexing. The bigger the more RAM is consumed.	Disabled (flushes by RAM usage)
hibernate.search.[default\|<indexname>].indexwriter.max_merge_docs	Defines the largest number of documents allowed in a segment. Smaller values perform better on frequently changing indexes, larger values provide better search performance if the index does not change often.	Unlimited (Integer.MAX_VALUE)
hibernate.search.[default\|<indexname>].indexwriter.merge_factor	Controls segment merge frequency and size. Determines how often segment indexes are merged when insertion occurs. With smaller values, less RAM is used while indexing, and searches on unoptimized indexes are faster, but indexing speed is slower. With larger values, more RAM is used during indexing, and while searches on unoptimized indexes are slower, indexing is faster. Thus larger values (> 10) are best for batch index creation, and smaller values (< 10) for indexes that are interactively maintained. The value must not be lower than 2.	10
hibernate.search.[default\|<indexname>].indexwriter.merge_min_size	Controls segment merge frequency and size. Segments smaller than this size (in MB) are always considered for the next segment merge operation. Setting this too large might result in expensive merge operations, even tough they are less frequent. See also `org.apache.lucene.index.LogDocMergePolicy`. `minMergeSize`.	0 MB (actually ~1K)
hibernate.search.[default\|<indexname>].indexwriter.merge_max_size	Controls segment merge frequency and size. Segments larger than this size (in MB) are never merged in bigger segments. This helps reduce memory requirements and avoids some merging operations at the cost of optimal search speed. When optimizing an index this value is ignored. See also `org.apache.lucene.index.LogDocMergePolicy`. `maxMergeSize`.	Unlimited
hibernate.search.[default\|<indexname>].indexwriter.merge_max_optimize_size	Controls segment merge frequency and size. Segments larger than this size (in MB) are not merged in bigger segments even when optimizing the index (see `merge_max_size` setting as well). Applied to `org.apache.lucene.index.LogDocMergePolicy`. `maxMergeSizeForOptimize`.	Unlimited
hibernate.search.[default\|<indexname>].indexwriter.merge_calibrate_by_deletes	Controls segment merge frequency and size. Set to `false` to not consider deleted documents when estimating the merge policy. Applied to `org.apache.lucene.index.LogMergePolicy`. `calibrateSizeByDeletes`.	`true`
hibernate.search.[default\|<indexname>].indexwriter.ram_buffer_size	Controls the amount of RAM in MB dedicated to document buffers. When used together max_buffered_docs a flush occurs for whichever event happens first. Generally for faster indexing performance it's best to flush by RAM usage instead of document count and use as large a RAM buffer as you can.	16 MB
hibernate.search.[default\|<indexname>].indexwriter.term_index_interval	Expert: Set the interval between indexed terms. Large values cause less memory to be used by IndexReader, but slow random-access to terms. Small values cause more memory to be used by an IndexReader, and speed random-access to terms. See Lucene documentation for more details.	128
hibernate.search.[default\|<indexname>].indexwriter.use_compound_file	The advantage of using the compound file format is that less file descriptors are used. The disadvantage is that indexing takes more time and temporary disk space. You can set this parameter to `false` in an attempt to improve the indexing time, but you could run out of file descriptors if `mergeFactor` is also large. Boolean parameter, use "`true`" or "`false`". The default value for this option is `true`.	true
hibernate.search.enable_dirty_check	Not all entity changes require an update of the Lucene index. If all of the updated entity properties (dirty properties) are not indexed Hibernate Search will skip the re-indexing work. Disable this option if you use custom `FieldBridge`s which need to be invoked at each update event (even though the property for which the field bridge is configured has not changed). This optimization will not be applied on classes using a `@ClassBridge` or a `@DynamicBoost`. Boolean parameter, use "`true`" or "`false`". The default value for this option is `true`.	true

Tip

When your architecture permits it, always keep hibernate.search.default.exclusive_index_use=true as it greatly improves efficiency in index writing. This is the default since Hibernate Search version 4.

Tip

To tune the indexing speed it might be useful to time the object loading from database in isolation from the writes to the index. To achieve this set the blackhole as worker backend and start your indexing routines. This backend does not disable Hibernate Search: it will still generate the needed changesets to the index, but will discard them instead of flushing them to the index. In contrast to setting the hibernate.search.indexing_strategy to manual, using blackhole will possibly load more data from the database. because associated entities are re-indexed as well.

hibernate.search.[default|<indexname>].worker.backend blackhole

The recommended approach is to focus first on optimizing the object loading, and then use the timings you achieve as a baseline to tune the indexing process.

Warning

The blackhole backend is not meant to be used in production, only as a tool to identify indexing bottlenecks.

3.6.1. Control segment size

The options merge_max_size, merge_max_optimize_size, merge_calibrate_by_deletes give you control on the maximum size of the segments being created, but you need to understand how they affect file sizes. If you need to hard limit the size, consider that merging a segment is about adding it together with another existing segment to form a larger one, so you might want to set the max_size for merge operations to less than half of your hard limit. Also segments might initially be generated larger than your expected size at first creation time: before they are ever merged. A segment is never created much larger than ram_buffer_size, but the threshold is checked as an estimate.

Example:

//to be fairly confident no files grow above 15MB, use:
hibernate.search.default.indexwriter.ram_buffer_size 10
hibernate.search.default.indexwriter.merge_max_optimize_size 7
hibernate.search.default.indexwriter.merge_max_size 7

Tip

When using the Infinispan Directory to cluster indexes make sure that your segments are smaller than the chunk_size so that you avoid fragmenting segments in the grid. Note that the chunk_size of the Infinispan Directory is expressed in bytes, while the index tuning options are in MB.

3.7. LockFactory configuration

Lucene Directorys have default locking strategies which work generally good enough for most cases, but it's possible to specify for each index managed by Hibernate Search a specific LockingFactory you want to use. This is generally not needed but could be useful.

Some of these locking strategies require a filesystem level lock and may be used even on RAM based indexes, this combination is valid but in this case the indexBase configuration option usually needed only for filesystem based Directory instances must be specified to point to a filesystem location where to store the lock marker files.

To select a locking factory, set the hibernate.search.<index>.locking_strategy option to one of simple, native, single or none. Alternatively set it to the fully qualified name of an implementation of org.hibernate.search.store.LockFactoryProvider.

Table 3.7. List of available LockFactory implementations

name	Class	Description
simple	org.apache.lucene.store.SimpleFSLockFactory	Safe implementation based on Java's File API, it marks the usage of the index by creating a marker file. If for some reason you had to kill your application, you will need to remove this file before restarting it.
native	org.apache.lucene.store.NativeFSLockFactory	As does `simple` this also marks the usage of the index by creating a marker file, but this one is using native OS file locks so that even if the JVM is terminated the locks will be cleaned up. This implementation has known problems on NFS, avoid it on network shares. `native` is the default implementation for the `filesystem`, `filesystem-master` and `filesystem-slave` directory providers.
single	org.apache.lucene.store.SingleInstanceLockFactory	This LockFactory doesn't use a file marker but is a Java object lock held in memory; therefore it's possible to use it only when you are sure the index is not going to be shared by any other process. This is the default implementation for the `ram` directory provider.
none	org.apache.lucene.store.NoLockFactory	All changes to this index are not coordinated by any lock; test your application carefully and make sure you know what it means.

Configuration example:

hibernate.search.default.locking_strategy simple
hibernate.search.Animals.locking_strategy native
hibernate.search.Books.locking_strategy org.custom.components.MyLockingFactory

The Infinispan Directory uses a custom implementation; it's still possible to override it but make sure you understand how that will work, especially with clustered indexes.

3.8. Exception Handling Configuration

Hibernate Search allows you to configure how exceptions are handled during the indexing process. If no configuration is provided then exceptions are logged to the log output by default. It is possible to explicitly declare the exception logging mechanism as seen below:

hibernate.search.error_handler log

The default exception handling occurs for both synchronous and asynchronous indexing. Hibernate Search provides an easy mechanism to override the default error handling implementation.

In order to provide your own implementation you must implement the ErrorHandler interface, which provides the handle(ErrorContext context) method. ErrorContext provides a reference to the primary LuceneWork instance, the underlying exception and any subsequent LuceneWork instances that could not be processed due to the primary exception.

public interface ErrorContext {

   List<LuceneWork> getFailingOperations();

   LuceneWork getOperationAtFault();

   Throwable getThrowable();

   boolean hasErrors();

}

To register this error handler with Hibernate Search you must declare the fully qualified classname of your ErrorHandler implementation in the configuration properties:

hibernate.search.error_handler CustomerErrorHandler

3.9. Index format compatibility

While Hibernate Search strives to offer a backwards compatible API to make it easy to port your application to newer versions, it delegates to Apache Lucene to handle the index writing and searching. The Lucene developers too attempt to keep a stable index format, but sometimes an update in the index format can not be avoided; in those rare cases you either have to reindex all your data, or use an index upgrade tool, or sometimes Lucene is able to read the old format so you don't need to take specific actions (besides making backup of your index).

While an index format incompatibility is an exceptional event, more often when upgrading Lucene the Analyzer implementations might slightly change behaviour, and this could lead to a poor recall score, possibly missing many hits from the results.

Hibernate Search exposes a configuration property hibernate.search.lucene_version which instructs the Analyzers and other Lucene classes to conform to their behaviour as defined in an (older) specific version of Lucene. See also org.apache.lucene.util.Version contained in the lucene-core.jar, depending on the specific version of Lucene you're using you might have different options available. When this option is not specified, Hibernate Search will instruct Lucene to use the default of it's current version, which is usually the best option for new projects. Still it's recommended to define the version you're using explicitly in the configuration so that when you happen to upgrade Lucene the Analyzers will not change behaviour; you can then choose to update this value in a second time, maybe when you have the chance to rebuild the index from scratch.

Example 3.11. Force Analyzers to be compatible with a Lucene 3.0 created index

hibernate.search.lucene_version LUCENE_30

This option is global for the configured SearchFactory and affects all Lucene APIs having such a parameter, as this should be applied consistently. So if you are also making use of Lucene bypassing Hibernate Search, make sure to apply the same value too.