As Hibernate Core applies changes to the database, Hibernate Search detects these changes and updates the index automatically (unless the event listeners are disabled). Sometimes changes are made to the database without using Hibernate, for example when a backup is restored or the data is otherwise modified; for these cases Hibernate Search exposes the manual index APIs to explicitly update or remove a single entity from the index, rebuild the index for the whole database, or remove all references to a specific type.
All these methods affect the Lucene index only; no changes are applied to the database.
Using FullTextSession.index(T entity) you can directly add or update a specific object instance in the index. If this entity was already indexed, the index will be updated. Changes to the index are only applied at transaction commit.
Example 6.1. Indexing an entity via FullTextSession.index(T entity)

FullTextSession fullTextSession = Search.getFullTextSession(session);
Transaction tx = fullTextSession.beginTransaction();
Object customer = fullTextSession.load( Customer.class, 8 );
fullTextSession.index(customer);
tx.commit(); //index only updated at commit time
In case you want to add all instances for a type, or for all indexed types, the recommended approach is to use a MassIndexer: see Section 6.3.2, “Using a MassIndexer” for more details.
The method FullTextSession.index(T entity) is considered an explicit indexing operation, so any registered EntityIndexingInterceptor won't be applied in this case. For more information on EntityIndexingInterceptor see Section 4.5, “Conditional indexing: to index or not based on entity state”.
It is equally possible to remove an entity or all entities of a given type from a Lucene index without the need to physically remove them from the database. This operation is named purging and is also done through the FullTextSession.
Example 6.2. Purging a specific instance of an entity from the index
FullTextSession fullTextSession = Search.getFullTextSession(session);
Transaction tx = fullTextSession.beginTransaction();
for (Customer customer : customers) {
fullTextSession.purge( Customer.class, customer.getId() );
}
tx.commit(); //index is updated at commit time
Purging will remove the entity with the given id from the Lucene index but will not touch the database.
If you need to remove all entities of a given type, you can use the purgeAll method. This operation removes all entities of the type passed as a parameter as well as all its subtypes.
Example 6.3. Purging all instances of an entity from the index
FullTextSession fullTextSession = Search.getFullTextSession(session);
Transaction tx = fullTextSession.beginTransaction();
fullTextSession.purgeAll( Customer.class );
//optionally optimize the index
//fullTextSession.getSearchFactory().optimize( Customer.class );
tx.commit(); //index changes are applied at commit time
As in the previous example, it is advisable to optimize the index after many purge operations to actually free the used space.
As is the case with the method FullTextSession.index(T entity), purge and purgeAll are also considered explicit indexing operations: any registered EntityIndexingInterceptor won't be applied. For more information on EntityIndexingInterceptor see Section 4.5, “Conditional indexing: to index or not based on entity state”.
Methods index, purge and purgeAll are available on FullTextEntityManager as well.
All manual indexing methods (index, purge and purgeAll) only affect the index, not the database. Nevertheless they are transactional, and as such they won't be applied until the transaction is successfully committed, or until you make use of flushToIndexes.
If you change the entity mapping to the index, chances are that the whole index needs to be updated; for example, if you decide to index an existing field using a different analyzer you'll need to rebuild the index for the affected types. Also, if the database is replaced (for example restored from a backup or imported from a legacy system) you'll want to be able to rebuild the index from existing data. Hibernate Search provides two main strategies to choose from:
Using FullTextSession.flushToIndexes() periodically, while using FullTextSession.index() on all entities.

Using a MassIndexer.
This strategy consists of removing the existing index and then adding all entities back to the index using FullTextSession.purgeAll() and FullTextSession.index(); however, there are some memory and efficiency constraints. For maximum efficiency Hibernate Search batches index operations and executes them at commit time. If you expect to index a lot of data you need to be careful about memory consumption, since all documents are kept in a queue until the transaction commit. You can potentially face an OutOfMemoryException if you don't empty the queue periodically: to do this you can use fullTextSession.flushToIndexes(). Every time fullTextSession.flushToIndexes() is called (or the transaction is committed), the batch queue is processed and all index changes are applied. Be aware that, once flushed, the changes cannot be rolled back.
Example 6.4. Index rebuilding using index() and flushToIndexes()

fullTextSession.setFlushMode(FlushMode.MANUAL);
fullTextSession.setCacheMode(CacheMode.IGNORE);
transaction = fullTextSession.beginTransaction();
//Scrollable results will avoid loading too many objects in memory
ScrollableResults results = fullTextSession.createCriteria( Email.class )
    .setFetchSize(BATCH_SIZE)
    .scroll( ScrollMode.FORWARD_ONLY );
int index = 0;
while( results.next() ) {
    index++;
    fullTextSession.index( results.get(0) ); //index each element
    if (index % BATCH_SIZE == 0) {
        fullTextSession.flushToIndexes(); //apply changes to indexes
        fullTextSession.clear(); //free memory since the queue is processed
    }
}
transaction.commit();
Try to use a batch size that guarantees that your application will not run out of memory: with a bigger batch size objects are fetched faster from the database, but more memory is needed.
Hibernate Search's MassIndexer uses several parallel threads to rebuild the index; you can optionally select which entities need to be reloaded or have it reindex all entities. This approach is optimized for best performance but requires setting the application in maintenance mode: making queries to the index is not recommended while a MassIndexer is busy.
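In its simplest form, a MassIndexer is created from the FullTextSession and started synchronously, reindexing every indexed entity type with default settings. This sketch assumes an open fullTextSession, as in the previous examples:

```java
fullTextSession
    .createIndexer()
    .startAndWait();
```

startAndWait() blocks until reindexing completes; the alternative start() method returns immediately and runs the indexing in the background.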
This will rebuild the index, deleting it and then reloading all entities from the database. Although it's simple to use, some tweaking is recommended to speed up the process: several parameters are configurable.
While a MassIndexer is running the content of the index is undefined! If a query is performed while the MassIndexer is working, it is very likely that some results will be missing.
Example 6.6. Using a tuned MassIndexer
fullTextSession
.createIndexer( User.class )
.batchSizeToLoadObjects( 25 )
.cacheMode( CacheMode.NORMAL )
.threadsToLoadObjects( 12 )
.idFetchSize( 150 )
.progressMonitor( monitor ) //a MassIndexerProgressMonitor implementation
.startAndWait();
This will rebuild the index of all User instances (and subtypes), and will create 12 parallel threads to load the User instances using batches of 25 objects per query; these same 12 threads will also need to process indexed embedded relations and custom FieldBridges or ClassBridges, to finally output a Lucene document. In this conversion process these threads are likely going to need to trigger lazy loading of additional attributes, so you will probably need a high number of threads working in parallel.
The number of threads working on actual index writing is defined by the backend configuration of each index. See the option worker.thread_pool.size in Table 3.3, “Execution configuration”.
As of Hibernate Search 4.4.0, instead of indexing all the types in parallel, the MassIndexer is configured by default to index only one type at a time. This prevents resource exhaustion, especially of database connections, and usually does not slow down the indexing. You can however configure this behavior using MassIndexer.typesToIndexInParallel(int threadsToIndexObjects):
Example 6.7. Configuring the MassIndexer to index several types in parallel
fullTextSession
.createIndexer( User.class, Customer.class )
.typesToIndexInParallel( 2 )
.batchSizeToLoadObjects( 25 )
.cacheMode( CacheMode.NORMAL )
.threadsToLoadObjects( 5 )
.idFetchSize( 150 )
.progressMonitor( monitor ) //a MassIndexerProgressMonitor implementation
.startAndWait();
Generally we suggest leaving cacheMode set to CacheMode.IGNORE (the default), as in most reindexing situations the cache will be a useless additional overhead; it might however be useful to enable some other CacheMode depending on your data: it could increase performance if the main entity relates to enum-like data included in the index.
The MassIndexer was designed for speed and is unaware of transactions, so there is no need to begin or commit one. Also, because it is not transactional, it is not recommended to let users use the system during its processing: it is unlikely people will be able to find results, and the system load might be too high anyway.
The MassIndexer was designed to finish the reindexing task as quickly as possible, but this requires a bit of care in its configuration to behave fairly with your server resources.
There is a simple formula to understand how the different options applied to the MassIndexer affect the number of worker threads and connections used: each thread requires one JDBC connection.

threads = typesToIndexInParallel * (threadsToLoadObjects + 1);
required JDBC connections = threads;
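As a worked instance of this formula (the helper method below is purely illustrative, not part of the Hibernate Search API), the settings from Example 6.7, typesToIndexInParallel( 2 ) and threadsToLoadObjects( 5 ), work out to 2 * (5 + 1) = 12 threads and therefore 12 JDBC connections:

```java
public class MassIndexerSizing {

    // threads = typesToIndexInParallel * (threadsToLoadObjects + 1):
    // each type indexed in parallel gets its object-loading threads
    // plus one extra thread scrolling over the primary keys
    static int requiredThreads(int typesToIndexInParallel, int threadsToLoadObjects) {
        return typesToIndexInParallel * (threadsToLoadObjects + 1);
    }

    public static void main(String[] args) {
        int threads = requiredThreads(2, 5);
        // each worker thread holds one JDBC connection
        // prints: threads = 12, JDBC connections = 12
        System.out.println("threads = " + threads + ", JDBC connections = " + threads);
    }
}
```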
Let's see some suggestions for a reasonably sane tuning starting point:

The option typesToIndexInParallel should probably be a low value, like 1 or 2, depending on how many of your CPUs have spare cycles and how slow a database round trip will be.
Before tuning a parallel run, experiment with options to tune your primary indexed entities in isolation.
Making threadsToLoadObjects higher increases the preloading rate for the picked entities from the database, but also increases memory usage and the pressure on the threads working on subsequent indexing.
Increasing parallelism usually helps, as the bottleneck is usually the latency of the database connection: it's probably worth it to experiment with values significantly higher than the number of actual cores available, but make sure your database can handle all the concurrent requests.
This advice might not apply to you: always measure the effects! We're providing this as a means to help you understand how these options are related.
Running the MassIndexer with many threads will require many connections to the database. If you don't have a sufficiently large connection pool, the MassIndexer itself and/or your other applications could starve, being unable to serve other requests: make sure you size your connection pool according to the options explained above.
The "sweet spot" for the number of threads to achieve best performance is highly dependent on your overall architecture, database design and even data values. All internal thread groups have meaningful names, so they should be easily identified with most diagnostic tools, including simple thread dumps.
The provided MassIndexer is quite general purpose, and while we believe it's a robust approach, you might be able to squeeze some better performance by writing a custom implementation. To run your own MassIndexer instead of the one shipped with Hibernate Search you have to:

create an implementation of the org.hibernate.search.spi.MassIndexerFactory interface;

set the property hibernate.search.massindexer.factoryclass to the fully qualified class name of the factory implementation.
Example 6.8. Custom MassIndexerFactory example

package org.myproject;

import org.hibernate.search.spi.MassIndexerFactory;

[...]

public class CustomIndexerFactory implements MassIndexerFactory {

    public void initialize(Properties properties) {
    }

    public MassIndexer createMassIndexer(...) {
        return new CustomIndexer();
    }
}

hibernate.search.massindexer.factoryclass = org.myproject.CustomIndexerFactory
Other parameters which affect indexing time and memory consumption are:
hibernate.search.[default|<indexname>].exclusive_index_use
hibernate.search.[default|<indexname>].indexwriter.max_buffered_docs
hibernate.search.[default|<indexname>].indexwriter.max_merge_docs
hibernate.search.[default|<indexname>].indexwriter.merge_factor
hibernate.search.[default|<indexname>].indexwriter.merge_min_size
hibernate.search.[default|<indexname>].indexwriter.merge_max_size
hibernate.search.[default|<indexname>].indexwriter.merge_max_optimize_size
hibernate.search.[default|<indexname>].indexwriter.merge_calibrate_by_deletes
hibernate.search.[default|<indexname>].indexwriter.ram_buffer_size
hibernate.search.[default|<indexname>].indexwriter.term_index_interval
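As an illustration, a configuration fragment combining a few of these options might look as follows; the values are arbitrary starting points for experimentation, not recommendations, and the right settings depend on your data and hardware:

```properties
# hold the index lock for this application only, avoiding lock handover cost
hibernate.search.default.exclusive_index_use = true
# give the IndexWriter a larger RAM buffer before flushing segments (in MB)
hibernate.search.default.indexwriter.ram_buffer_size = 64
# merge segments less eagerly while bulk indexing
hibernate.search.default.indexwriter.merge_factor = 20
```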
Previous versions also had a max_field_length option, but this was removed from Lucene; it's possible to obtain a similar effect by using a LimitTokenCountAnalyzer.
All .indexwriter parameters are Lucene specific and Hibernate Search just passes these parameters through - see Section 3.7.1, “Tuning indexing performance” for more details.
The MassIndexer uses a forward-only scrollable result to iterate on the primary keys to be loaded, but MySQL's JDBC driver will load all values in memory; to avoid this "optimisation" set idFetchSize to Integer.MIN_VALUE.