Chapter 10. Advanced features

In this final chapter we offer a smorgasbord of tips and tricks which might become useful as you dive deeper into Hibernate Search.

10.1. Accessing the SearchFactory

The SearchFactory object keeps track of the underlying Lucene resources for Hibernate Search. It is a convenient way to access Lucene natively. The SearchFactory can be accessed from a FullTextSession:

Example 10.1. Accessing the SearchFactory

FullTextSession fullTextSession = Search.getFullTextSession(regularSession);

SearchFactory searchFactory = fullTextSession.getSearchFactory();

10.2. Using an IndexReader

Queries in Lucene are executed on an IndexReader. Hibernate Search might cache index readers to maximize performance, or provide other efficient strategies to retrieve an updated IndexReader minimizing IO operations. Your code can access these cached resources, but you have to follow some "good citizen" rules.

Example 10.2. Accessing an IndexReader

IndexReader reader = searchFactory.getIndexReaderAccessor().open(Order.class);
try {
   //perform read-only operations on the reader
}
finally {
   searchFactory.getIndexReaderAccessor().close(reader);
}

In this example the SearchFactory figures out which indexes are needed to query this entity (considering a Sharding strategy). Using the configured ReaderProvider (described in Reader strategy) on each index, it returns a compound IndexReader on top of all involved indexes. Because this IndexReader is shared amongst several clients, you must adhere to the following rules:

Never call indexReader.close(); always call searchFactory.getIndexReaderAccessor().close(reader) instead, preferably in a finally block.
Don't use this IndexReader for modification operations: it is a read-only IndexReader, and you would get an exception.

Aside from those rules, you can use the IndexReader freely, especially to do native Lucene queries. Using the shared IndexReaders will make most queries more efficient than opening one directly from, for example, the filesystem.

As an alternative to the method open(Class... types), you can use open(String... indexNames) and pass in one or more index names. With this strategy you can also select a subset of the indexes for any indexed type if sharding is used.

Example 10.3. Accessing an IndexReader by index names

IndexReader reader = searchFactory
      .getIndexReaderAccessor()
      .open("Products.1", "Products.3");

10.3. Accessing a Lucene Directory

A Directory is the most common abstraction used by Lucene to represent the index storage; Hibernate Search doesn't interact directly with a Lucene Directory but abstracts these interactions via an IndexManager: an index does not necessarily need to be implemented by a Directory.

If you know your index is represented as a Directory and need to access it, you can get a reference to the Directory via the IndexManager. Cast the IndexManager to a DirectoryBasedIndexManager and then use getDirectoryProvider().getDirectory() to get a reference to the underlying Directory. This is not recommended; we encourage you to use the IndexReader instead.

10.4. Sharding indexes

In some cases it can be useful to split (shard) the indexed data of a given entity into several Lucene indexes.

Note

This solution is not recommended unless there is a pressing need. Searches will be slower as all shards have to be opened for a single search. Don't do it until you have a real use case!

Possible use cases for sharding are:

A single index is so huge that index update times are slowing the application down.
A typical search will only hit a sub-set of the index, such as when data is naturally segmented by customer, region or application.

By default sharding is not enabled unless the number of shards is configured. To do this use the hibernate.search.<indexName>.sharding_strategy.nbr_of_shards property as seen in Example 10.4, “Enabling index sharding”. In this example 5 shards are enabled.

Example 10.4. Enabling index sharding

hibernate.search.<indexName>.sharding_strategy.nbr_of_shards = 5

The IndexShardingStrategy is responsible for splitting the data into sub-indexes. The default sharding strategy splits the data according to the hash value of the id's string representation (generated by the FieldBridge), which yields a fairly balanced distribution. You can replace the default strategy by implementing a custom IndexShardingStrategy. To use your custom strategy, set the hibernate.search.<indexName>.sharding_strategy property.
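The idea behind hash-based shard selection can be sketched in plain Java. This is an illustration only, not the library's implementation: the class ShardSketch and the method shardFor are hypothetical names, and String.hashCode() merely stands in for whatever hash function the default strategy applies to the id string.

```java
final class ShardSketch {

    // Maps the string form of an id to a shard number in [0, nbrOfShards).
    static int shardFor(String idInString, int nbrOfShards) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(idInString.hashCode(), nbrOfShards);
    }
}
```

With five shards, the id string "42" lands in shard 2 under this sketch. Every id maps to exactly one shard, so an entity is always written to, and looked up in, the same sub-index.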

Example 10.5. Specifying a custom sharding strategy

hibernate.search.<indexName>.sharding_strategy = my.shardingstrategy.Implementation

The IndexShardingStrategy also allows for optimizing searches by selecting which shard to run the query against. By activating a filter (see Section 5.3.1, “Using filters in a sharded environment”), a sharding strategy can select a subset of the shards used to answer a query (IndexShardingStrategy.getIndexManagersForQuery) and thus speed up the query execution.

Each shard has an independent IndexManager and so can be configured to use a different directory provider and backend configuration. The IndexManager index names for the Animal entity in Example 10.6, “Sharding configuration for entity Animal” are Animal.0 to Animal.4. In other words, each shard has the name of its owning index followed by . (dot) and its index number (see also Section 3.3, “Directory configuration”).

Example 10.6. Sharding configuration for entity Animal

hibernate.search.default.indexBase = /usr/lucene/indexes

hibernate.search.Animal.sharding_strategy.nbr_of_shards = 5
hibernate.search.Animal.directory_provider = filesystem
hibernate.search.Animal.0.indexName = Animal00
hibernate.search.Animal.3.indexBase = /usr/lucene/sharded
hibernate.search.Animal.3.indexName = Animal03

In Example 10.6, “Sharding configuration for entity Animal”, the configuration uses the default id string hashing strategy and shards the Animal index into 5 sub-indexes. All sub-indexes are filesystem instances and the directory where each sub-index is stored is as follows:

for sub-index 0: /usr/lucene/indexes/Animal00 (shared indexBase but overridden indexName)
for sub-index 1: /usr/lucene/indexes/Animal.1 (shared indexBase, default indexName)
for sub-index 2: /usr/lucene/indexes/Animal.2 (shared indexBase, default indexName)
for sub-index 3: /usr/lucene/sharded/Animal03 (overridden indexBase, overridden indexName)
for sub-index 4: /usr/lucene/indexes/Animal.4 (shared indexBase, default indexName)
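
The lookup order behind this list can be modelled in a few lines of plain Java. This is a simplified sketch, not Hibernate Search code: ShardDirectorySketch and resolve are hypothetical names, and only the indexBase and indexName properties are considered.

```java
import java.util.Properties;

final class ShardDirectorySketch {

    // Returns the directory of one shard: shard-specific settings win,
    // then index-level settings, then the "default" settings.
    static String resolve(Properties config, String index, int shard) {
        String shardKey = "hibernate.search." + index + "." + shard + ".";
        String indexKey = "hibernate.search." + index + ".";
        String defaultKey = "hibernate.search.default.";

        String base = config.getProperty(shardKey + "indexBase",
                config.getProperty(indexKey + "indexBase",
                        config.getProperty(defaultKey + "indexBase")));
        // Without an override, a shard is named "<index>.<shard number>".
        String name = config.getProperty(shardKey + "indexName", index + "." + shard);
        return base + "/" + name;
    }
}
```

Fed with the properties of Example 10.6, resolve(config, "Animal", 0) yields /usr/lucene/indexes/Animal00 and resolve(config, "Animal", 3) yields /usr/lucene/sharded/Animal03.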

When implementing an IndexShardingStrategy, any field can be used to determine the sharding selection. Keep in mind that to handle deletions, purge and purgeAll operations, the implementation might need to return one or more indexes without being able to read all the field values or the primary identifier. If the available information is not enough to pick a single index, all indexes should be returned, so that the delete operation is propagated to all indexes potentially containing the documents to be deleted.
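
The deletion rule can be illustrated with a small stand-alone model. The sketch below does not implement the IndexShardingStrategy interface; CustomerShardingSketch, its methods, and the customerId routing field are assumptions made for the example. It routes by a field value and falls back to all shards when that value is unknown at deletion time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class CustomerShardingSketch {
    private final String indexName;
    private final int nbrOfShards;

    CustomerShardingSketch(String indexName, int nbrOfShards) {
        this.indexName = indexName;
        this.nbrOfShards = nbrOfShards;
    }

    // On addition the whole document is available, so the routing field can be read.
    String shardForAddition(String customerId) {
        return indexName + "." + Math.floorMod(customerId.hashCode(), nbrOfShards);
    }

    // On deletion the routing field may be unknown (modelled as null here).
    List<String> shardsForDeletion(String customerIdOrNull) {
        if (customerIdOrNull != null) {
            return Collections.singletonList(shardForAddition(customerIdOrNull));
        }
        // Not enough information: target every shard that might hold the document.
        List<String> all = new ArrayList<>();
        for (int i = 0; i < nbrOfShards; i++) {
            all.add(indexName + "." + i);
        }
        return all;
    }
}
```

Returning every shard is slower than targeting one, but it guarantees that the shard chosen at addition time is always among the shards visited at deletion time.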

10.5. Sharing indexes

It is technically possible to store the information of more than one entity into a single Lucene index. There are two ways to accomplish this:

Configuring the underlying directory providers to point to the same physical index directory. In practice, you set the property hibernate.search.[fully qualified entity name].indexName to the same value. As an example let’s use the same index (directory) for the Furniture and Animal entity. We just set indexName for both entities to for example “Animal”. Both entities will then be stored in the Animal directory.
```
hibernate.search.org.hibernate.search.test.shards.Furniture.indexName = Animal
hibernate.search.org.hibernate.search.test.shards.Animal.indexName = Animal
```
Setting the @Indexed annotation’s index attribute of the entities you want to merge to the same value. If we again wanted all Furniture instances to be indexed in the Animal index along with all instances of Animal we would specify @Indexed(index="Animal") on both Animal and Furniture classes.
Note
This is only presented here so that you know the option is available. There is really not much benefit in sharing indexes.

10.6. Using external services

Any of the pluggable contracts we have seen so far allows for the injection of a service, the most notable example being the DirectoryProvider. The full list is:

DirectoryProvider
ReaderProvider
OptimizerStrategy
BackendQueueProcessor
Worker
ErrorHandler
MassIndexerProgressMonitor

Some of these components need to access a service which is either available in the environment or whose lifecycle is bound to the SearchFactory. Sometimes you even want the same service to be shared amongst several instances of these contracts. One example is the ability to share an Infinispan cache instance between several directory providers running in different JVMs to store the various indexes using the same underlying infrastructure; this provides real-time replication of indexes across nodes.

10.6.1. Exposing a service

To expose a service, you need to implement org.hibernate.search.spi.ServiceProvider<T>. T is the type of the service you want to use. Services are retrieved by components via their ServiceProvider class implementation.

10.6.1.1. Managed services

If your service ought to be started when Hibernate Search starts and stopped when Hibernate Search stops, you can use a managed service. Make sure to properly implement the start and stop methods of ServiceProvider. When the service is requested, the getService method is called.

Example 10.7. Example of ServiceProvider implementation

public class CacheServiceProvider implements ServiceProvider<Cache> {
    private CacheManager manager;

    public void start(Properties properties) {
        //read configuration
        manager = new CacheManager(properties);
    }

    public Cache getService() {
        return manager.getCache(DEFAULT);
    }

    //must be public, like the other methods implementing the interface
    public void stop() {
        manager.close();
    }
}

Note

The ServiceProvider implementation must have a no-arg constructor.

To be transparently discoverable, such a service should have an accompanying META-INF/services/org.hibernate.search.spi.ServiceProvider file whose content lists the service provider implementation(s).

Example 10.8. Content of META-INF/services/org.hibernate.search.spi.ServiceProvider

com.acme.infra.hibernate.CacheServiceProvider

10.6.1.2. Provided services

Alternatively, the service can be provided by the environment bootstrapping Hibernate Search. For example, Infinispan, which uses Hibernate Search as its internal search engine, can pass the CacheContainer to Hibernate Search. In this case, the CacheContainer instance is not managed by Hibernate Search and the start/stop methods of its corresponding service provider will not be used.

Note

Provided services have priority over managed services. If a provided service is registered with the same ServiceProvider class as a managed service, the provided service will be used.

The provided services are passed to Hibernate Search via the SearchConfiguration interface (getProvidedServices).

Important

Provided services are used by frameworks controlling the lifecycle of Hibernate Search and not by traditional users.

If, as a user, you want to retrieve a service instance from the environment, use registry services like JNDI and look the service up in the provider.

10.6.2. Using a service

Many of the pluggable contracts of Hibernate Search can use services. Services are accessible via the BuildContext interface.

Example 10.9. Example of a directory provider using a cache service

public class CustomDirectoryProvider implements DirectoryProvider<RAMDirectory> {
    private BuildContext context;
    //placeholder directory so that the skeleton has something to return
    private final RAMDirectory directory = new RAMDirectory();

    public void initialize(
        String directoryProviderName,
        Properties properties,
        BuildContext context) {
        //initialize
        this.context = context;
    }

    public void start() {
        Cache cache = context.requestService( CacheServiceProvider.class );
        //use cache
    }

    public RAMDirectory getDirectory() {
        // use cache
        return directory;
    }

    public void stop() {
        //stop services
        context.releaseService( CacheServiceProvider.class );
    }
}

When you request a service, an instance of the service is served to you. Make sure to release the service once you are done with it; this is fundamental. The service can be released in the DirectoryProvider.stop method if the DirectoryProvider uses the service during its lifetime, or it can be released right away if the service is only used at initialization time.

10.7. Customizing Lucene's scoring formula

Lucene allows the user to customize its scoring formula by extending org.apache.lucene.search.Similarity. The abstract methods defined in this class match the factors of the following formula calculating the score of query q for document d:

score(q,d) = coord(q,d) · queryNorm(q) · ∑_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

Factor	Description
tf(t in d)	Term frequency factor for the term (t) in the document (d).
idf(t)	Inverse document frequency of the term.
coord(q,d)	Score factor based on how many of the query terms are found in the specified document.
queryNorm(q)	Normalizing factor used to make scores between queries comparable.
t.getBoost()	Field boost.
norm(t,d)	Encapsulates a few (indexing time) boost and length factors.

It is beyond the scope of this manual to explain this formula in more detail. Please refer to Similarity's Javadocs for more information.

Hibernate Search provides three ways to modify Lucene's similarity calculation.

First, you can set the default similarity by specifying the fully qualified class name of your Similarity implementation using the property hibernate.search.similarity. The default value is org.apache.lucene.search.DefaultSimilarity.

You can also override the similarity used for a specific index by setting the similarity property for that index (or, as shown here, for the default index settings):

hibernate.search.default.similarity = my.custom.Similarity

Finally, you can override the default similarity at the class level using the @Similarity annotation:

@Entity
@Indexed
@Similarity(impl = DummySimilarity.class)
public class Book {
...
}

As an example, let's assume it is not important how often a term appears in a document. Documents with a single occurrence of the term should be scored the same as documents with multiple occurrences. In this case your custom implementation of the method tf(float freq) should return 1.0.
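
The effect of such a change on the per-term part of the formula can be checked with a stand-alone model. The sketch below is not Lucene code: TfSketch and its members are hypothetical names, and the only library fact it relies on is that DefaultSimilarity computes the term frequency factor as the square root of the frequency.

```java
import java.util.function.DoubleUnaryOperator;

final class TfSketch {

    // Stand-in for the default behaviour: tf = sqrt(freq).
    static final DoubleUnaryOperator DEFAULT_TF = Math::sqrt;

    // Stand-in for the custom behaviour: every matching document gets tf = 1.0.
    static final DoubleUnaryOperator CONSTANT_TF = freq -> 1.0;

    // Per-term contribution from the formula: tf · idf² · boost · norm.
    static double termScore(DoubleUnaryOperator tf, double freq,
                            double idf, double boost, double norm) {
        return tf.applyAsDouble(freq) * idf * idf * boost * norm;
    }
}
```

With idf = 2 and boost = norm = 1, the default factor scores a document with nine occurrences at 12.0 and one with a single occurrence at 4.0, while the constant factor scores both at 4.0. In a real Similarity subclass the same change amounts to overriding tf(float freq).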

Warning

When two entities share the same index they must declare the same Similarity implementation. Classes in the same class hierarchy always share the index, so it's not allowed to override the Similarity implementation in a subtype.

Likewise, it does not make sense to define the similarity via the index setting and the class-level setting as they would conflict. Such a configuration will be rejected.