Hibernate.org Community Documentation

Chapter 10. Advanced features

10.1. Accessing the SearchFactory
10.2. Using an IndexReader
10.3. Accessing a Lucene Directory
10.4. Sharding indexes
10.4.1. Static sharding
10.4.2. Dynamic sharding
10.5. Sharing indexes
10.6. Using external services
10.6.1. Using a Service
10.6.2. Implementing a Service
10.7. Customizing Lucene’s scoring formula

In this final chapter we are offering a smörgåsbord of tips and tricks which might become useful as you dive deeper and deeper into Hibernate Search.

10.1. Accessing the SearchFactory

The SearchFactory object keeps track of the underlying Lucene resources for Hibernate Search. It is a convenient way to access Lucene natively. The SearchFactory can be accessed from a FullTextSession by calling getSearchFactory().


10.2. Using an IndexReader

Queries in Lucene are executed on an IndexReader. Hibernate Search caches index readers to maximize performance and implements other strategies to retrieve updated IndexReaders in order to minimize IO operations. Your code can access these cached resources, but you have to follow some "good citizen" rules.


When you open an IndexReader for an entity type with open(Class… types), the SearchFactory figures out which indexes are needed to query that entity. Using the configured ReaderProvider (described in Section 2.3, “Reader strategy”) on each index, it returns a compound IndexReader on top of all involved indexes. Because this IndexReader is shared amongst several clients, you must adhere to the following rules:

  • Never call indexReader.close(), but always call readerProvider.closeReader(reader), using a finally block.
  • Don’t use this IndexReader for modification operations: it is a read-only instance, and attempting to modify the index through it results in an exception.

Aside from those rules, you can use the IndexReader freely, especially to do native Lucene queries. Using these shared IndexReaders is more efficient than opening one directly from, for example, the filesystem.
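The two rules above boil down to a try/finally discipline. The following is a minimal, library-free sketch of that discipline, not Hibernate Search code: SharedReader and ReaderPool are hypothetical stand-ins invented for this illustration, with ReaderPool.close(reader) playing the role that readerProvider.closeReader(reader) plays in the rules above.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a shared, read-only index reader.
final class SharedReader {
    int numDocs() {
        return 3; // fixed value, for illustration only
    }
}

// Hypothetical stand-in for the component that hands out shared readers.
// It counts outstanding leases so that a forgotten release becomes visible.
final class ReaderPool {
    private final SharedReader shared = new SharedReader();
    private final AtomicInteger leases = new AtomicInteger();

    SharedReader open() {
        leases.incrementAndGet();
        return shared;
    }

    // Hands the reader back to the pool; callers never close the reader itself.
    void close(SharedReader reader) {
        leases.decrementAndGet();
    }

    int outstandingLeases() {
        return leases.get();
    }
}

final class ReaderUsage {
    // Read-only work wrapped in try/finally so the reader is always handed back.
    static int countDocuments(ReaderPool pool) {
        SharedReader reader = pool.open();
        try {
            return reader.numDocs();
        } finally {
            pool.close(reader);
        }
    }
}
```

The finally block guarantees that the release happens even if the read-only work throws, which is the point of the first rule.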

As an alternative to the method open(Class… types) you can use open(String… indexNames), passing in one or more index names. With this strategy you can also select a subset of the indexes for any indexed type if sharding is used.


10.3. Accessing a Lucene Directory

A Directory is the most common abstraction used by Lucene to represent the index storage; Hibernate Search doesn’t interact directly with a Lucene Directory but abstracts these interactions via an IndexManager: an index does not necessarily need to be implemented by a Directory.

If you are certain that your index is represented as a Directory and need to access it, you can get a reference to the Directory via the IndexManager. You will have to cast the IndexManager instance to a DirectoryBasedIndexManager and then use getDirectoryProvider().getDirectory() to get a reference to the underlying Directory. This is not recommended. If you need low-level access to the index using Lucene APIs, we suggest using the approach described in Section 10.2, “Using an IndexReader” instead.

10.4. Sharding indexes

In some cases it can be useful to split (shard) the data into several Lucene indexes. There are two main use cases:

  • A single index is so big that index update times are slowing the application down. In this case static sharding can be used to split the data into a pre-defined number of shards.
  • Data is naturally segmented by customer, region, language or other application parameter and the index should be split according to these segments. This is a use case for dynamic sharding.

Tip

By default sharding is not enabled.
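To give an idea of what enabling static sharding looks like, the following fragment is a minimal sketch. It assumes the sharding_strategy.nbr_of_shards property of the default sharding strategy; the index name Animal and the shard count of 5 are arbitrary values chosen for illustration.

```properties
# Assumed example: split the index named "Animal" into five shards
hibernate.search.Animal.sharding_strategy.nbr_of_shards = 5
```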

10.4.2. Dynamic sharding

Dynamic sharding allows you to manage the shards yourself and even create new shards on the fly. To do so you need to implement the interface ShardIdentifierProvider and set the hibernate.search.[default|<indexName>].sharding_strategy property to the fully qualified name of this class. Note that instead of implementing the interface directly, it is better to derive your implementation from org.hibernate.search.store.ShardIdentifierProviderTemplate, which provides a basic implementation. See Example 10.6, “Custom ShardIdentifierProvider”.
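As a sketch of the configuration side, assuming an index named Animal and an implementation class in a hypothetical package my.custom, the property could be set like this:

```properties
# "Animal" and "my.custom" are illustrative names, not prescribed by Hibernate Search
hibernate.search.Animal.sharding_strategy = my.custom.AnimalShardIdentifierProvider
```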


There are several things happening in AnimalShardIdentifierProvider. First off, its purpose is to create one shard per animal type (e.g. mammal, insect, etc.). It does so by inspecting the class type and the Lucene document passed to the getShardIdentifier() method. It extracts the type field from the document and uses it as the shard name. getShardIdentifier() is called for every addition to the index, and a new shard is created for every new animal type encountered. The base class ShardIdentifierProviderTemplate maintains a set of all known shards, to which any identifier must be added by calling addShard().

It is important to understand that Hibernate Search cannot know which shards already exist when the application starts. When using ShardIdentifierProviderTemplate as the base class of a ShardIdentifierProvider implementation, the initial set of shard identifiers must be returned by the loadInitialShardNames() method. How this is done depends on the use case. However, a common case in combination with Hibernate ORM is that the initial shard set is defined by the distinct values of a given database column. Example 10.6, “Custom ShardIdentifierProvider” shows how to handle such a case. In its loadInitialShardNames() implementation, AnimalShardIdentifierProvider uses a service called HibernateSessionFactoryService (see also Section 10.6, “Using external services”), which is available within an ORM environment. This service lets you request a Hibernate SessionFactory instance, which can be used to run a Criteria query to determine the initial set of shard identifiers.
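The bookkeeping described in the last two paragraphs can be sketched without any Hibernate Search types. The class below is an illustration only: it does not extend ShardIdentifierProviderTemplate, the name AnimalShardSketch and its methods are hypothetical, a plain Map stands in for the Lucene document, and the constructor argument stands in for whatever loadInitialShardNames() would return.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Library-free sketch of one-shard-per-type bookkeeping (illustrative names).
final class AnimalShardSketch {
    // All shard identifiers known so far (the role played by addShard()).
    private final Set<String> knownShards = ConcurrentHashMap.newKeySet();

    // initialShards stands in for the result of loadInitialShardNames(),
    // for example the distinct values of a "type" column.
    AnimalShardSketch(Set<String> initialShards) {
        knownShards.addAll(initialShards);
    }

    // Stand-in for getShardIdentifier(): the "type" entry of the document
    // becomes the shard name.
    String shardFor(Map<String, String> document) {
        String type = document.get("type");
        if (type == null) {
            throw new IllegalArgumentException("document has no 'type' field");
        }
        knownShards.add(type); // registers a new shard on first encounter
        return type;
    }

    // Read-only view of every shard seen so far.
    Set<String> allShards() {
        return Collections.unmodifiableSet(knownShards);
    }
}
```

Starting with the initial set { "mammal" } and then indexing a document whose type is "insect" leaves the sketch with two known shards, which mirrors how a new shard appears with every new animal type.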

Last but not least, the ShardIdentifierProvider also allows for optimizing searches by selecting which shard to run a query against. By activating a filter (see Section 5.3.1, “Using filters in a sharded environment”), a sharding strategy can select a subset of the shards used to answer a query (getShardIdentifiersForQuery(), not shown in the example) and thus speed up the query execution.

Important

This ShardIdentifierProvider is considered experimental. We might need to change the defined method signatures to accommodate unforeseen use cases. Please provide feedback if you have ideas, or just to let us know how you’re using this API.

10.5. Sharing indexes

It is technically possible to store the information of more than one entity into a single Lucene index. There are two ways to accomplish this:

  • Configuring the underlying directory providers to point to the same physical index directory. In practice, you set the property hibernate.search.[fully qualified entity name].indexName to the same value. As an example, let’s use the same index (directory) for the Furniture and Animal entities. We just set indexName for both entities to "Animal". Both entities will then be stored in the Animal directory:
hibernate.search.org.hibernate.search.test.shards.Furniture.indexName = Animal
hibernate.search.org.hibernate.search.test.shards.Animal.indexName = Animal
  • Setting the @Indexed annotation’s index attribute of the entities you want to merge to the same value. If we again wanted all Furniture instances to be indexed in the Animal index along with all instances of Animal we would specify @Indexed(index="Animal") on both Animal and Furniture classes.

Note

This is only presented here so that you know the option is available. There is really not much benefit in sharing indexes.

10.6. Using external services

A Service in Hibernate Search is a class implementing the interface org.hibernate.search.engine.service.spi.Service and providing a default no-arg constructor. Theoretically that’s all that is needed to request a given service type from the Hibernate Search ServiceManager. In practice you will probably want to add some service life cycle methods (implement Startable and Stoppable) as well as actual methods providing some functionality.

Hibernate Search uses the service approach to decouple different components of the system. Let’s have a closer look at services and how they are used.

10.6.2. Implementing a Service

To implement a service, you need to create an interface which identifies it and extends org.hibernate.search.engine.service.spi.Service. You can then add additional methods to your service interface as needed.

Naturally you will also need to provide an implementation of your service interface. This implementation must have a public no-arg constructor. Optionally your service can also implement the life cycle interfaces org.hibernate.search.engine.service.spi.Startable and/or org.hibernate.search.engine.service.spi.Stoppable. Their methods are called by the ServiceManager when the service is created and when the last reference to a requested service is released, respectively.

Services are retrieved by calling ServiceManager.requestService, using the Class object of the interface you define as a key.
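To make the life cycle described above more tangible, here is a minimal, library-free sketch. Every type in it is a hypothetical stand-in (SketchService, SketchStartable, SketchStoppable and SketchServiceManager are not the Hibernate Search interfaces, and register() is invented for the example). It only models the behavior stated in this section: lookup by interface class, start on creation, stop when the last reference is released.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-ins; these are NOT the Hibernate Search interfaces.
interface SketchService {}
interface SketchStartable { void start(); }
interface SketchStoppable { void stop(); }

// The interface that identifies the service and serves as the lookup key.
interface GreetingService extends SketchService {
    String greet(String name);
}

// Implementation with a no-arg constructor and both life cycle hooks.
final class DefaultGreetingService implements GreetingService, SketchStartable, SketchStoppable {
    boolean running;
    public void start() { running = true; }
    public void stop() { running = false; }
    public String greet(String name) { return "Hello, " + name; }
}

// Reference-counting registry keyed by the service interface.
final class SketchServiceManager {
    private final Map<Class<?>, Supplier<? extends SketchService>> factories = new HashMap<>();
    private final Map<Class<?>, SketchService> instances = new HashMap<>();
    private final Map<Class<?>, Integer> references = new HashMap<>();

    synchronized <S extends SketchService> void register(Class<S> role, Supplier<? extends S> factory) {
        factories.put(role, factory);
    }

    // Creates and starts the service on first request, then shares the instance.
    synchronized <S extends SketchService> S requestService(Class<S> role) {
        SketchService service = instances.get(role);
        if (service == null) {
            Supplier<? extends SketchService> factory = factories.get(role);
            if (factory == null) {
                throw new IllegalStateException("no service registered for " + role.getName());
            }
            service = factory.get();
            if (service instanceof SketchStartable) ((SketchStartable) service).start();
            instances.put(role, service);
        }
        references.merge(role, 1, Integer::sum);
        return role.cast(service);
    }

    // Stops the service once the last reference has been released.
    synchronized void releaseService(Class<? extends SketchService> role) {
        Integer count = references.get(role);
        if (count == null) return;
        if (count > 1) { references.put(role, count - 1); return; }
        references.remove(role);
        SketchService service = instances.remove(role);
        if (service instanceof SketchStoppable) ((SketchStoppable) service).stop();
    }
}
```

Using the interface class as the key keeps callers independent of the implementation class, which is the decoupling this section describes.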

10.7. Customizing Lucene’s scoring formula

Lucene allows the user to customize its scoring formula by extending org.apache.lucene.search.similarities.Similarity. The abstract methods defined in this class match the factors of the following formula calculating the score of query q for document d:

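$$\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \left( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \right)$$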

| Factor       | Description |
|--------------|-------------|
| tf(t in d)   | Term frequency factor for the term (t) in the document (d). |
| idf(t)       | Inverse document frequency of the term. |
| coord(q,d)   | Score factor based on how many of the query terms are found in the specified document. |
| queryNorm(q) | Normalizing factor used to make scores between queries comparable. |
| t.getBoost() | Field boost. |
| norm(t,d)    | Encapsulates a few (indexing time) boost and length factors. |

It is beyond the scope of this manual to explain this formula in more detail. Please refer to Similarity’s Javadocs for more information.

Hibernate Search provides two ways to modify Lucene’s similarity calculation.

First, you can set the default similarity by specifying the fully qualified class name of your Similarity implementation using the property hibernate.search.similarity. The default value is org.apache.lucene.search.similarities.DefaultSimilarity.

Secondly, you can override the similarity used for a specific index by setting the similarity property for this index (see Section 3.3, “Directory configuration” for more information about index configuration):

hibernate.search.[default|<indexname>].similarity = my.custom.Similarity

As an example, let’s assume it is not important how often a term appears in a document. Documents with a single occurrence of the term should be scored the same as documents with multiple occurrences. In this case your custom implementation of the method tf(float freq) should return 1.0.
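The effect of such a constant tf can be checked with a little arithmetic. The sketch below is illustrative only and is not Lucene's Similarity class: TfSketch and its members are hypothetical names, it computes just the per-term part of the formula above, and the square-root tf is an assumed example of a frequency-sensitive variant used for comparison.

```java
import java.util.function.DoubleUnaryOperator;

// Illustrative arithmetic only; this is not Lucene's Similarity class.
final class TfSketch {
    // A frequency-sensitive tf: the square root of the frequency (assumed example).
    static final DoubleUnaryOperator SQRT_TF = Math::sqrt;

    // The customization described above: every matching document gets tf = 1.0.
    static final DoubleUnaryOperator CONSTANT_TF = freq -> 1.0;

    // Contribution of a single term: tf(freq) * idf^2 * boost * norm,
    // following the per-term part of the scoring formula.
    static double termScore(DoubleUnaryOperator tf, double freq,
                            double idf, double boost, double norm) {
        return tf.applyAsDouble(freq) * idf * idf * boost * norm;
    }
}
```

With idf = 2.0, boost = 1.0 and norm = 0.5, the constant tf yields the same term score for a document containing the term once and for one containing it nine times, whereas the square-root variant scores the second document three times higher.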

Note

When two entities share the same index they must declare the same Similarity implementation.