Analyzers at Lucene.Net
Posted by Dario in Lucene, NHibernate on December 11th, 2007
In the last post we looked at the integration of Lucene.Net with NHibernate through NHibernate Search. Now let's talk a little more about Lucene itself.
Lucene can index text that originates from many sources (PDF, HTML, Word documents, and so on), which makes it attractive for applications that need to solve text-search problems. Once the text has been extracted from a rich document, it has to be converted into a stream of plain-text tokens that Lucene can digest and index. This step before indexing is called analysis, and it is the job of the Analyzers. Lucene provides several classes you can use. For example, WhitespaceAnalyzer splits the text into tokens at whitespace and does nothing else, while StopAnalyzer splits the text at non-letter characters, lowercases the tokens, and removes common English stop words (the, an, a, that, this, etc.) before the text is indexed.
Using NHibernate Search, we can run queries against the index that Lucene maintains, whether it lives in memory or on the file system. This is a query using NHibernate Search:
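To make the difference between the two analyzers concrete, here is a minimal sketch in plain C#. It does not call Lucene.Net at all: the class AnalysisSketch, its two methods, and the five-word stop list are assumptions made for illustration, and they only imitate the behaviour described above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Toy stand-ins for two analyzers; hypothetical names, no Lucene.Net dependency.
public static class AnalysisSketch
{
    // Illustrative subset of English stop words (an assumption, not Lucene's full list).
    private static readonly HashSet<string> StopWords =
        new HashSet<string> { "the", "an", "a", "that", "this" };

    // Whitespace-style analysis: split at whitespace, keep case and punctuation.
    public static List<string> WhitespaceTokens(string text)
    {
        return text
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .ToList();
    }

    // Stop-style analysis: keep runs of letters, lowercase them, drop stop words.
    public static List<string> StopTokens(string text)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text + " ") // the trailing space flushes the last token
        {
            if (char.IsLetter(c))
            {
                current.Append(char.ToLowerInvariant(c));
                continue;
            }
            if (current.Length > 0)
            {
                string token = current.ToString();
                if (!StopWords.Contains(token)) tokens.Add(token);
                current.Clear();
            }
        }
        return tokens;
    }

    public static void Main()
    {
        string text = "The Quick brown fox jumped over a lazy dog.";
        Console.WriteLine("[" + string.Join("] [", WhitespaceTokens(text)) + "]");
        Console.WriteLine("[" + string.Join("] [", StopTokens(text)) + "]");
    }
}
```

With this sample sentence the first line keeps "The", "a" and "dog." exactly as written, while the second line drops "the" and "a" and emits "quick" and "dog" in lowercase without punctuation.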
QueryParser qp = new QueryParser("Summary", new StopAnalyzer());
IQuery NHQuery = s.CreateFullTextQuery(qp.Parse("series"), typeof(Book));
IList result = NHQuery.List();
QueryParser receives an Analyzer as a parameter, in this case StopAnalyzer. The parser uses this Analyzer to extract the search terms from the query string. It is independent of the Analyzer you configure at Lucene startup, which determines how the tokens are persisted in the index. This one only filters the query string in order to find the search keywords.
To understand Analyzers a little better, I made this console application based on the Lucene in Action code examples. The idea was to see which tokens the different Analyzers produce. Sorry that the example is in Spanish; the custom Analyzer I made uses Spanish stop words. You can check out the example here.
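The following sketch illustrates why the Analyzer given to the parser matters even though it is separate from the one used at indexing time. It is again plain C# rather than Lucene.Net: QueryAnalysisSketch, Analyze and Matches are hypothetical helpers, and "matching" is reduced to a simple term lookup, which is far simpler than what Lucene really does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model of index-time versus query-time analysis; hypothetical names.
public static class QueryAnalysisSketch
{
    // Illustrative subset of English stop words (an assumption).
    private static readonly HashSet<string> StopWords =
        new HashSet<string> { "the", "an", "a", "that", "this" };

    // Toy analysis: split at whitespace, optionally lowercase and drop stop words.
    public static List<string> Analyze(string text, bool lowercaseAndStop)
    {
        var tokens = text.Split(
            new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        if (!lowercaseAndStop) return tokens.ToList();
        return tokens
            .Select(t => t.ToLowerInvariant())
            .Where(t => !StopWords.Contains(t))
            .ToList();
    }

    // A document "matches" when at least one query term is among the indexed terms.
    public static bool Matches(IEnumerable<string> indexedTerms, IEnumerable<string> queryTerms)
    {
        var index = new HashSet<string>(indexedTerms);
        return queryTerms.Any(t => index.Contains(t));
    }

    public static void Main()
    {
        // Index-time analysis: lowercased, stop words removed.
        var indexed = Analyze("The first book of the Series", true);

        // Query-time analysis is a separate step applied to the query string.
        var sameStyle = Analyze("the Series", true);   // terms: series
        var otherStyle = Analyze("the Series", false); // terms: the, Series

        Console.WriteLine(Matches(indexed, sameStyle));  // True
        Console.WriteLine(Matches(indexed, otherStyle)); // False
    }
}
```

In this toy model the second query finds nothing because "Series" was stored as "series" and "the" was never stored, so the terms produced from the query string have to line up with the terms that were written to the index.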
