Hibernate.org Community Documentation

Chapter 10. Advanced features

10.1. Accessing the SearchFactory
10.2. Using an IndexReader
10.3. Accessing a Lucene Directory
10.4. Sharding indexes
10.5. Sharing indexes
10.6. Using external services
10.6.1. Exposing a service
10.6.2. Using a service
10.7. Customizing Lucene's scoring formula

In this final chapter we are offering a smorgasbord of tips and tricks which might become useful as you dive deeper and deeper into Hibernate Search.

10.1. Accessing the SearchFactory

The SearchFactory object keeps track of the underlying Lucene resources for Hibernate Search. It is a convenient way to access Lucene natively. The SearchFactory can be accessed from a FullTextSession by calling its getSearchFactory() method.

10.2. Using an IndexReader

Queries in Lucene are executed on an IndexReader. Hibernate Search might cache index readers to maximize performance, or provide other efficient strategies to retrieve an updated IndexReader minimizing IO operations. Your code can access these cached resources, but you have to follow some "good citizen" rules.

When you open an IndexReader for an entity type, the SearchFactory determines which indexes are needed to query that entity (taking any sharding strategy into account). Using the configured ReaderProvider (described in Reader strategy) on each index, it returns a compound IndexReader on top of all involved indexes. Because this IndexReader is shared amongst several clients, you must adhere to the following rules:

  • Never call indexReader.close(), but always call readerProvider.closeReader(reader), preferably in a finally block.

  • Don't use this IndexReader for modification operations (it is a read-only IndexReader, so you would get an exception).

Aside from those rules, you can use the IndexReader freely, especially to run native Lucene queries. Using the shared IndexReaders will make most queries more efficient than opening a reader directly from, for example, the filesystem.

As an alternative to the method open(Class... types), you can use open(String... indexNames) and pass in one or more index names. If sharding is used, this also lets you select a subset of the indexes for any indexed type.

10.3. Accessing a Lucene Directory

A Directory is the most common abstraction used by Lucene to represent the index storage. Hibernate Search doesn't interact directly with a Lucene Directory but abstracts these interactions via an IndexManager, because an index does not necessarily need to be implemented by a Directory.

If you know your index is represented as a Directory and need to access it, you can get a reference to the Directory via the IndexManager. Cast the IndexManager to a DirectoryBasedIndexManager and then use getDirectoryProvider().getDirectory() to get a reference to the underlying Directory. This is not recommended; we encourage you to use the IndexReader instead.

10.4. Sharding indexes

In some cases it can be useful to split (shard) the indexed data of a given entity into several Lucene indexes.

Possible use cases for sharding are:

Sharding is not enabled unless the number of shards is configured. To do this, use the hibernate.search.<indexName>.sharding_strategy.nbr_of_shards property, as seen in Example 10.4, “Enabling index sharding”. In this example 5 shards are enabled.
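As an illustration of the property described above, the fragment below would enable five shards for one index. The index name Animal is only a placeholder borrowed from the later examples in this section, and the key = value layout follows the similarity property shown elsewhere in this chapter.

```properties
# Assumption: "Animal" stands in for <indexName>
hibernate.search.Animal.sharding_strategy.nbr_of_shards = 5
```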

The IndexShardingStrategy is responsible for splitting the data into sub-indexes. The default sharding strategy splits the data according to the hash value of the id string representation (generated by the FieldBridge). This ensures a fairly balanced sharding. You can replace the default strategy by implementing a custom IndexShardingStrategy. To use your custom strategy, set the hibernate.search.<indexName>.sharding_strategy property to it.
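The idea behind hashing the id string can be pictured with a small, self-contained sketch. This is not the Hibernate Search implementation: the class ShardSelector and the use of String.hashCode() are assumptions made purely for illustration. Only the principle (hash the string form of the id, then map it onto the configured number of shards) and the "indexName.shardNumber" naming come from this section.

```java
// Hypothetical stand-in, not a Hibernate Search class.
final class ShardSelector {

    private final int numberOfShards;

    ShardSelector(int numberOfShards) {
        if (numberOfShards <= 0) {
            throw new IllegalArgumentException("numberOfShards must be positive");
        }
        this.numberOfShards = numberOfShards;
    }

    /** Maps the string form of an id onto a shard number in [0, numberOfShards). */
    int shardFor(String idInString) {
        // floorMod keeps the result non-negative even when hashCode() is negative
        return Math.floorMod(idInString.hashCode(), numberOfShards);
    }

    /** Builds the shard's index name: owning index name, a dot, then the shard number. */
    String indexNameFor(String baseIndexName, String idInString) {
        return baseIndexName + "." + shardFor(idInString);
    }
}
```

Because the shard number depends only on the id string, the same entity is always routed to the same sub-index, and ids with well-distributed hash values spread fairly evenly across the shards.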

The IndexShardingStrategy also allows for optimizing searches by selecting which shard to run the query against. By activating a filter (see Section 5.3.1, “Using filters in a sharded environment”), a sharding strategy can select a subset of the shards used to answer a query (IndexShardingStrategy.getIndexManagersForQuery) and thus speed up the query execution.

Each shard has an independent IndexManager and so can be configured to use a different directory provider and backend configuration. The IndexManager index names for the Animal entity in Example 10.6, “Sharding configuration for entity Animal” are Animal.0 to Animal.4. In other words, each shard has the name of its owning index followed by . (dot) and its index number (see also Section 3.3, “Directory configuration”).

In Example 10.6, “Sharding configuration for entity Animal”, the configuration uses the default id string hashing strategy and shards the Animal index into 5 sub-indexes. All sub-indexes are filesystem instances, and the directory where each sub-index is stored is as follows:

  • for sub-index 0: /usr/lucene/indexes/Animal00 (shared indexBase but overridden indexName)

  • for sub-index 1: /usr/lucene/indexes/Animal.1 (shared indexBase, default indexName)

  • for sub-index 2: /usr/lucene/indexes/Animal.2 (shared indexBase, default indexName)

  • for sub-index 3: /usr/lucene/shared/Animal03 (overridden indexBase, overridden indexName)

  • for sub-index 4: /usr/lucene/indexes/Animal.4 (shared indexBase, default indexName)
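A configuration consistent with the paths listed above might look like the following fragment. It is a reconstruction from those paths and from the property names mentioned in this section (indexBase, indexName, nbr_of_shards), not necessarily the exact text of Example 10.6.

```properties
# Shared settings for all shards of the Animal index
hibernate.search.default.indexBase = /usr/lucene/indexes
hibernate.search.Animal.sharding_strategy.nbr_of_shards = 5
hibernate.search.Animal.directory_provider = filesystem

# Shard 0: shared indexBase, overridden indexName
hibernate.search.Animal.0.indexName = Animal00

# Shard 3: overridden indexBase and indexName
hibernate.search.Animal.3.indexBase = /usr/lucene/shared
hibernate.search.Animal.3.indexName = Animal03
```

Shards 1, 2 and 4 have no entries of their own, so they fall back to the shared indexBase and the default names Animal.1, Animal.2 and Animal.4.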

When implementing an IndexShardingStrategy, any field can be used to determine the sharding selection. Keep in mind that, to handle deletions and the purge and purgeAll operations, the implementation might need to return one or more indexes without being able to read all the field values or the primary identifier. If the available information is not enough to pick a single index, all indexes should be returned, so that the delete operation is propagated to every index that could contain the documents to be deleted.
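The deletion rule can be sketched in isolation. The class below is a hypothetical stand-in that works with plain shard numbers; it does not implement the real IndexShardingStrategy interface, and the hash-based routing is assumed only for the sake of the example. It shows the fallback described above: when the id is unknown, every shard is returned.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in, not a Hibernate Search class.
final class DeletionShardChooser {

    private final int numberOfShards;

    DeletionShardChooser(int numberOfShards) {
        if (numberOfShards <= 0) {
            throw new IllegalArgumentException("numberOfShards must be positive");
        }
        this.numberOfShards = numberOfShards;
    }

    /**
     * Returns the shard numbers a delete has to be sent to.
     * A null id models the case where the identifier is not available.
     */
    List<Integer> shardsForDeletion(String idInString) {
        if (idInString == null) {
            // Not enough information to pick one shard: target all of them
            List<Integer> all = new ArrayList<>();
            for (int i = 0; i < numberOfShards; i++) {
                all.add(i);
            }
            return all;
        }
        // Id known: route the delete to the single shard that holds the document
        int shard = Math.floorMod(idInString.hashCode(), numberOfShards);
        return Collections.singletonList(shard);
    }
}
```

Returning every shard is slower than targeting one, but it guarantees that no stale document survives in a sub-index that was skipped.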

10.5. Sharing indexes

It is technically possible to store the information of more than one entity into a single Lucene index. There are two ways to accomplish this:

10.6. Using external services

Any of the pluggable contracts we have seen so far allows for the injection of a service. The most notable example is the DirectoryProvider. The full list is:

Some of these components need to access a service which is either available in the environment or whose lifecycle is bound to the SearchFactory. Sometimes you even want the same service to be shared amongst several instances of these contracts. One example is the ability to share an Infinispan cache instance between several directory providers running in different JVMs, so that the various indexes are stored using the same underlying infrastructure; this provides real-time replication of indexes across nodes.

10.6.1. Exposing a service

To expose a service, you need to implement org.hibernate.search.spi.ServiceProvider<T>. T is the type of the service you want to use. Services are retrieved by components via their ServiceProvider class implementation.

10.7. Customizing Lucene's scoring formula

Lucene allows the user to customize its scoring formula by extending org.apache.lucene.search.Similarity. The abstract methods defined in this class match the factors of the following formula calculating the score of query q for document d:

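$$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \bigl( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \bigr)$$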

tf(t in d): Term frequency factor for the term (t) in the document (d).
idf(t): Inverse document frequency of the term.
coord(q,d): Score factor based on how many of the query terms are found in the specified document.
queryNorm(q): Normalizing factor used to make scores between queries comparable.
t.getBoost(): Field boost.
norm(t,d): Encapsulates a few (indexing time) boost and length factors.

It is beyond the scope of this manual to explain this formula in more detail. Please refer to Similarity's Javadocs for more information.

Hibernate Search provides three ways to modify Lucene's similarity calculation.

First, you can set the default similarity by specifying the fully qualified class name of your Similarity implementation using the property hibernate.search.similarity. The default value is org.apache.lucene.search.DefaultSimilarity.

You can also override the similarity used for a specific index by setting the similarity property for that index:

hibernate.search.default.similarity = my.custom.Similarity

Finally you can override the default similarity on class level using the @Similarity annotation.

@Similarity(impl = DummySimilarity.class)
public class Book {
    // fields and mappings omitted
}

As an example, let's assume it is not important how often a term appears in a document. Documents with a single occurrence of the term should be scored the same as documents with multiple occurrences. In this case your custom implementation of the method tf(float freq) should return 1.0.
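The effect of such an override can be sketched without Lucene on the classpath. The two classes below are hypothetical stand-ins, not Lucene types: SqrtTf mimics the square-root term frequency used by Lucene's DefaultSimilarity, and FlatTf overrides tf(float freq) to return a constant. In a real project the override would go into your own Similarity subclass instead.

```java
// Hypothetical stand-in for a similarity whose tf grows with the term frequency.
class SqrtTf {
    float tf(float freq) {
        // Square root of the frequency: more occurrences give a higher factor
        return (float) Math.sqrt(freq);
    }
}

// Hypothetical stand-in for the custom implementation described above.
final class FlatTf extends SqrtTf {
    @Override
    float tf(float freq) {
        // Same factor no matter how often the term occurs
        return 1.0f;
    }
}
```

With FlatTf, the tf factor in the scoring formula is identical for a document containing the term once and one containing it many times, so any remaining score difference comes from the other factors.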