Let's make things a little more interesting now. Assume that one of your indexed book entities has the title "Refactoring: Improving the Design of Existing Code" and you want to get hits for all of the following queries: "refactor", "refactors", "refactored" and "refactoring". In Lucene this can be achieved by choosing an analyzer class which applies word stemming both at indexing time and at search time. Hibernate Search offers several ways to configure the analyzer to be used (see Section 4.1.5, “Analyzer”):
Setting the hibernate.search.analyzer property in the configuration file. The specified class will then be the default analyzer.
Setting the @Analyzer annotation at the entity level.
Setting the @Analyzer annotation at the field level.
When using the @Analyzer annotation one can either specify the fully qualified classname of the analyzer to use or one can refer to an analyzer definition defined by the @AnalyzerDef annotation. In the latter case the Solr analyzer framework with its factories approach is utilized. To find out more about the factory classes available you can either browse the Solr JavaDoc or read the corresponding section on the Solr Wiki. Note that depending on the chosen factory class additional libraries on top of the Solr dependencies might be required. For example, the PhoneticFilterFactory depends on commons-codec.
In the example below a StandardTokenizerFactory is used, followed by two filter factories, LowerCaseFilterFactory and SnowballPorterFilterFactory. The standard tokenizer splits words at punctuation characters and hyphens while keeping email addresses and internet hostnames intact. It is a good general-purpose tokenizer. The lowercase filter lowercases the letters in each token, whereas the snowball filter finally applies language-specific stemming.
Generally, when using the Solr framework you have to start with a tokenizer followed by an arbitrary number of filters.
Example 1.10. Using @AnalyzerDef and the Solr framework to define and use an analyzer
package example;
...
@Entity
@Indexed
@AnalyzerDef(name = "customanalyzer",
  tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
  filters = {
    @TokenFilterDef(factory = LowerCaseFilterFactory.class),
    @TokenFilterDef(factory = SnowballPorterFilterFactory.class, params = {
      @Parameter(name = "language", value = "English")
    })
  })
public class Book {

  @Id
  @GeneratedValue
  @DocumentId
  private Integer id;

  @Field(index=Index.TOKENIZED, store=Store.NO)
  @Analyzer(definition = "customanalyzer")
  private String title;

  @Field(index=Index.TOKENIZED, store=Store.NO)
  @Analyzer(definition = "customanalyzer")
  private String subtitle;

  @IndexedEmbedded
  @ManyToMany
  private Set<Author> authors = new HashSet<Author>();

  @Field(index = Index.UN_TOKENIZED, store = Store.YES)
  @DateBridge(resolution = Resolution.DAY)
  private Date publicationDate;

  public Book() {
  }

  // standard getters/setters follow here
  ...
}
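To get a feel for why a tokenizer followed by a lowercase filter and a stemming filter makes all four queries match the title, the following sketch imitates the chain using only the JDK. It is purely illustrative: ToyAnalyzerChain, tokenize, lowerCase and toyStem are hypothetical names that are not part of Lucene, Solr or Hibernate Search, and the suffix stripping is a deliberately crude stand-in for the Snowball stemmer, not an implementation of it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical toy model of a "tokenizer + filters" analyzer chain.
// It does NOT use Lucene or Solr; it only mimics the order of the stages.
class ToyAnalyzerChain {

    // Tokenizer stage: split on every run of non-letter, non-digit characters.
    // (Far cruder than the standard tokenizer described above.)
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Filter stage 1: lowercase each token.
    public static String lowerCase(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    // Filter stage 2: toy stemmer that strips one of a few English suffixes,
    // provided at least three characters remain. An assumption made for
    // illustration only; the real Snowball algorithm is far more elaborate.
    public static String toyStem(String token) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 3) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // The chain: exactly one tokenizer, then the filters in declaration order.
    public static List<String> analyze(String text) {
        List<UnaryOperator<String>> filters =
                List.of(ToyAnalyzerChain::lowerCase, ToyAnalyzerChain::toyStem);
        List<String> result = new ArrayList<>();
        for (String token : tokenize(text)) {
            for (UnaryOperator<String> filter : filters) {
                token = filter.apply(token);
            }
            result.add(token);
        }
        return result;
    }

    public static void main(String[] args) {
        // Terms that would be indexed for the title.
        System.out.println(analyze("Refactoring: Improving the Design of Existing Code"));
        // Terms produced for each query; the same chain runs at search time.
        for (String query : new String[] {"refactor", "refactors", "refactored", "refactoring"}) {
            System.out.println(query + " -> " + analyze(query));
        }
    }
}
```

In this sketch each of the four queries is reduced to the single term refactor, which is also among the terms produced for the title. That is the reason the same analyzer has to be applied both when indexing and when searching.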