Lucene Lecture at Pisa

Doug Cutting, [email protected], November 24, 2004, University of Pisa

Transcript of Lucene Lecture at Pisa

Page 1: Lucene Lecture at Pisa

Lucene

Doug Cutting
[email protected]

November 24, 2004
University of Pisa

Prelude

my background...
please interrupt with questions
blog this talk now so that we can search for it later
  (using a Lucene-based blog search engine, of course)
In this course, Paolo and Antonio have presented many techniques.
I present real software that uses many of these techniques.

Lucene is

a software library for search
open source
not a complete application
a set of Java classes
active user and developer communities
widely used, e.g., by IBM and Microsoft

Lucene Architecture

Lucene lecture at Pisa http://lucene.sourceforge.net/talks/pisa/

1 of 11 9/28/2010 5:55 AM


[draw on whiteboard for reference throughout talk]

Lucene API

org.apache.lucene.document
org.apache.lucene.analysis
org.apache.lucene.index
org.apache.lucene.search

Package: org.apache.lucene.document

A Document is a sequence of Fields.
A Field is a <name, value> pair.
  name is the name of the field, e.g., title, body, subject, date, etc.
  value is text.
Field values may be stored, indexed or analyzed (and, now, vectored).


Example

    public Document makeDocument(File f) throws FileNotFoundException {
        Document doc = new Document();
        doc.add(new Field("path", f.getPath(), Store.YES, Index.TOKENIZED));

        doc.add(new Field("modified",
                          DateTools.timeToString(f.lastModified(), Resolution.MINUTE),
                          Store.YES, Index.UN_TOKENIZED));

        // Reader implies Store.NO and Index.TOKENIZED
        doc.add(new Field("contents", new FileReader(f)));

        return doc;
    }

Example (continued)

field      stored   indexed   analyzed
path       yes      yes       yes
modified   yes      yes       no
contents   no       yes       yes

Package: org.apache.lucene.analysis

An Analyzer is a TokenStream factory.
A TokenStream is an iterator over Tokens.
  input is a character iterator (Reader)
A Token is a tuple <text, type, start, length, positionIncrement>
  text (e.g., "pisa")
  type (e.g., "word", "sent", "para")
  start & length: offsets, in characters (e.g., <5,4>)
  positionIncrement (normally 1)
Standard TokenStream implementations are
  Tokenizers, which divide characters into tokens, and
  TokenFilters, e.g., stop lists, stemmers, etc.

Example

    public class ItalianAnalyzer extends Analyzer {

        private Set stopWords =
            StopFilter.makeStopSet(new String[] {"il", "la", "in"});

        public TokenStream tokenStream(String fieldName, Reader reader) {
            // Tokenizer first, then a chain of filters, each wrapping the previous stream.
            TokenStream result = new WhitespaceTokenizer(reader);
            result = new LowerCaseFilter(result);
            result = new StopFilter(result, stopWords);
            result = new SnowballFilter(result, "Italian");
            return result;
        }
    }
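The tokenizer-plus-filters idea can also be tried without Lucene on the classpath. The following is a minimal sketch in plain Java, not Lucene's implementation: the class `TokenPipelineSketch` and all of its methods are hypothetical names invented for illustration, the tokenizer collects tokens eagerly rather than streaming from a Reader, and stemming is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.NoSuchElementException;
import java.util.Set;

final class TokenPipelineSketch {
    /** A token as <text, start, length>; type and positionIncrement are omitted. */
    static final class Token {
        final String text;
        final int start;
        final int length;

        Token(String text, int start, int length) {
            this.text = text;
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return "<" + text + "," + start + "," + length + ">";
        }
    }

    /** Tokenizer: divides characters into tokens at whitespace, keeping offsets. */
    static Iterator<Token> whitespaceTokenizer(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            while (i < input.length() && Character.isWhitespace(input.charAt(i))) i++;
            int start = i;
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) i++;
            if (i > start) tokens.add(new Token(input.substring(start, i), start, i - start));
        }
        return tokens.iterator();
    }

    /** Filter: lower-cases each token's text, leaving offsets untouched. */
    static Iterator<Token> lowerCaseFilter(Iterator<Token> in) {
        return new Iterator<Token>() {
            public boolean hasNext() { return in.hasNext(); }
            public Token next() {
                Token t = in.next();
                return new Token(t.text.toLowerCase(Locale.ROOT), t.start, t.length);
            }
        };
    }

    /** Filter: drops tokens whose text is in the stop list. */
    static Iterator<Token> stopFilter(Iterator<Token> in, Set<String> stopWords) {
        return new Iterator<Token>() {
            private Token next = advance();

            private Token advance() {
                while (in.hasNext()) {
                    Token t = in.next();
                    if (!stopWords.contains(t.text)) return t;
                }
                return null;
            }

            public boolean hasNext() { return next != null; }

            public Token next() {
                if (next == null) throw new NoSuchElementException();
                Token t = next;
                next = advance();
                return t;
            }
        };
    }

    /** Plays the role of the Analyzer: builds the chain and drains it. */
    static List<String> analyze(String input, Set<String> stopWords) {
        Iterator<Token> stream =
            stopFilter(lowerCaseFilter(whitespaceTokenizer(input)), stopWords);
        List<String> out = new ArrayList<>();
        while (stream.hasNext()) out.add(stream.next().toString());
        return out;
    }

    public static void main(String[] args) {
        Set<String> stops = new HashSet<>(Arrays.asList("il", "la", "in"));
        System.out.println(analyze("Il Duomo in Pisa", stops)); // [<duomo,3,5>, <pisa,12,4>]
    }
}
```

Each filter wraps the stream before it, so the order of wrapping matters: here the stop list is applied after lower-casing, which is why "Il" is removed.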

Package: org.apache.lucene.index

Term is <fieldName, text>
index maps Term → <df, <docNum, <position>*>*>
  e.g., "content:pisa" → <2, <2, <14>>, <4, <2, 9>>>
new: term vectors!
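To make the shape of the Term → <df, <docNum, <position>*>*> map concrete, here is a small in-memory sketch in plain Java. It is only an illustration of the data structure under simplifying assumptions (whitespace tokenization, everything held in sorted maps in RAM); `InvertedIndexSketch` and its methods are hypothetical names, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class InvertedIndexSketch {
    // "field:text" -> (docNum -> positions of the term in that document)
    private final Map<String, TreeMap<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Adds one field of one document, tokenizing on whitespace. */
    void add(int docNum, String field, String text) {
        String[] tokens = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (tokens[pos].isEmpty()) continue; // empty input yields one empty token
            postings.computeIfAbsent(field + ":" + tokens[pos], k -> new TreeMap<>())
                    .computeIfAbsent(docNum, k -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** df: the number of documents containing the term. */
    int docFreq(String term) {
        TreeMap<Integer, List<Integer>> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    /** Renders a term's entry in the <df, <docNum, <position>*>*> notation. */
    String describe(String term) {
        StringBuilder sb = new StringBuilder("<").append(docFreq(term));
        TreeMap<Integer, List<Integer>> docs = postings.get(term);
        if (docs != null) {
            for (Map.Entry<Integer, List<Integer>> e : docs.entrySet()) {
                sb.append(", <").append(e.getKey()).append(", <");
                List<Integer> positions = e.getValue();
                for (int i = 0; i < positions.size(); i++) {
                    if (i > 0) sb.append(", ");
                    sb.append(positions.get(i));
                }
                sb.append(">>");
            }
        }
        return sb.append(">").toString();
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "content", "Pisa tower");
        index.add(1, "content", "leaning tower of pisa pisa");
        System.out.println(index.describe("content:pisa")); // <2, <0, <0>>, <1, <3, 4>>>
    }
}
```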

Example

    IndexWriter writer = new IndexWriter("index", new ItalianAnalyzer());
    File[] files = directory.listFiles();
    for (int i = 0; i < files.length; i++) {
        writer.addDocument(makeDocument(files[i]));
    }
    writer.close();

Some Inverted Index Strategies

1. batch-based: use file-sorting algorithms (textbook)
   + fastest to build
   + fastest to search
   - slow to update

2. b-tree based: update in place (http://lucene.sf.net/papers/sigir90.ps)
   + fast to search
   - update/build does not scale
   - complex implementation

3. segment based: lots of small indexes (Verity)
   + fast to build
   + fast to update
   - slower to search

Lucene's Index Algorithm

two basic algorithms:
  1. make an index for a single document
  2. merge a set of indices

incremental algorithm:
  maintain a stack of segment indices
  create index for each incoming document
  push new indexes onto the stack
  let b=10 be the merge factor; M=∞

    for (size = 1; size < M; size *= b) {
        if (there are b indexes with size docs on top of the stack) {
            pop them off the stack;
            merge them into a single index;
            push the merged index onto the stack;
        } else {
            break;
        }
    }

optimization: single-doc indexes kept in RAM, saves system calls

notes:
  average b*log_b(N)/2 indexes
  N=1M, b=2 gives just 20 indexes
  fast to update and not too slow to search

batch indexing w/ M=∞, merge all at end
  equivalent to external merge sort, optimal

segment indexing w/ M<∞
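The incremental algorithm above can be simulated by tracking only the sizes of the segments on the stack. This is a sketch under that simplification, not Lucene's IndexWriter: `MergeStackSketch` and its members are hypothetical names, a "merge" just sums document counts, and the in-RAM optimization is ignored.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

final class MergeStackSketch {
    private final Deque<Integer> stack = new ArrayDeque<>(); // segment sizes, top first
    private final int b;   // merge factor
    private final long m;  // M: segments of this size or more are never merged
    private int merges = 0;

    MergeStackSketch(int mergeFactor, long maxMergeDocs) {
        this.b = mergeFactor;
        this.m = maxMergeDocs;
    }

    /** Indexes one document: push a one-doc segment, then merge while possible. */
    void addDocument() {
        stack.push(1);
        for (long size = 1; size < m; size *= b) {
            if (!topHas(b, size)) break;
            int merged = 0;
            for (int i = 0; i < b; i++) merged += stack.pop();
            stack.push(merged);
            merges++;
        }
    }

    /** True if the top `count` segments on the stack all hold exactly `size` docs. */
    private boolean topHas(int count, long size) {
        if (stack.size() < count) return false;
        Iterator<Integer> it = stack.iterator(); // iterates from the top of the stack
        for (int i = 0; i < count; i++) {
            if (it.next() != size) return false;
        }
        return true;
    }

    List<Integer> segmentSizes() { return new ArrayList<>(stack); }

    int mergeCount() { return merges; }

    public static void main(String[] args) {
        MergeStackSketch writer = new MergeStackSketch(10, Long.MAX_VALUE); // b=10, M=∞
        for (int i = 0; i < 25; i++) writer.addDocument();
        System.out.println(writer.segmentSizes()); // [1, 1, 1, 1, 1, 10, 10]
        System.out.println(writer.mergeCount());   // 2
    }
}
```

With a finite M, segments stop growing once they reach M documents, which corresponds to the segment-indexing case in the notes.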

Indexing Diagram

b = 3
11 documents indexed
stack has four indexes
grayed indexes have been deleted
5 merges have occurred

Index Compression


For keys in Term -> ... map, use technique from Paolo's slides:

For values in Term -> ... map, use technique from Paolo's slides:


VInt Encoding Example

Value     First byte   Second byte   Third byte
0         00000000
1         00000001
2         00000010
...
127       01111111
128       10000000     00000001
129       10000001     00000001
130       10000010     00000001
...
16,383    11111111     01111111
16,384    10000000     10000000      00000001
16,385    10000001     10000000      00000001
...

...

This provides compression while still being efficient to decode.
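The byte patterns in the table follow from a simple rule: each byte carries seven bits of the value, low-order bits first, and the high bit says whether another byte follows. A minimal sketch of that rule in plain Java, with hypothetical names (`VIntSketch`, `encode`, `decode`, `toBits`) and assuming non-negative values:

```java
import java.io.ByteArrayOutputStream;

final class VIntSketch {
    /** Encodes a non-negative int, seven bits per byte, low-order bits first. */
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.write(value); // last byte has the high bit clear
        return out.toByteArray();
    }

    /** Decodes one value from the start of the array. */
    static int decode(byte[] bytes) {
        int value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break; // high bit clear: this was the last byte
            shift += 7;
        }
        return value;
    }

    /** Formats bytes as space-separated 8-bit binary strings, as in the table. */
    static String toBits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            String bits = Integer.toBinaryString(bytes[i] & 0xFF);
            for (int pad = bits.length(); pad < 8; pad++) sb.append('0');
            sb.append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int v : new int[] {0, 127, 128, 16383, 16384}) {
            System.out.println(v + " -> " + toBits(encode(v)));
        }
    }
}
```

Decoding needs only a mask, a shift and one branch per byte, which is why the format stays cheap to read while small values (the common case for gaps between document numbers) take a single byte.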

Package: org.apache.lucene.search

primitive queries:
  TermQuery: match docs containing a Term
  PhraseQuery: match docs w/ sequence of Terms
  BooleanQuery: match docs matching other queries,
    e.g., +path:pisa +content:"Doug Cutting" -path:nutch
  new: SpanQuery
derived queries:
  PrefixQuery, WildcardQuery, etc.

Example

    Query pisa = new TermQuery(new Term("content", "pisa"));
    Query babel = new TermQuery(new Term("content", "babel"));

    PhraseQuery leaningTower = new PhraseQuery();
    leaningTower.add(new Term("content", "leaning"));
    leaningTower.add(new Term("content", "tower"));

    BooleanQuery query = new BooleanQuery();
    query.add(leaningTower, Occur.MUST);
    query.add(pisa, Occur.SHOULD);
    query.add(babel, Occur.MUST_NOT);

Search Algorithms

From Paolo's slides:


Lucene's Disjunctive Search Algorithm

described in http://lucene.sf.net/papers/riao97.ps
since all postings must be processed,
  goal is to minimize per-posting computation
merges postings through a fixed-size array of accumulator buckets
performs boolean logic with bit masks
scales well with large queries

[ draw a diagram to illustrate? ]
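One way to picture the bucket-and-bit-mask idea is the sketch below. It is a loose illustration written from the bullet points above, not the algorithm from the paper or Lucene's code: `BucketDisjunctionSketch`, the tiny window size, and the convention that clause i owns bit i are all assumptions made for the example, and scores are left out. Postings are assumed to be sorted arrays of non-negative doc numbers, with at most 31 clauses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BucketDisjunctionSketch {
    static final int SIZE = 8; // number of buckets; tiny here so the windows are visible

    /**
     * Returns docs that match at least one clause, all clauses in requiredMask,
     * and no clause in prohibitedMask. Clause i corresponds to bit (1 << i).
     */
    static List<Integer> search(int[][] postings, int requiredMask, int prohibitedMask) {
        int[] cursor = new int[postings.length]; // next unread posting per clause
        int[] bits = new int[SIZE];              // one bit mask per bucket
        List<Integer> hits = new ArrayList<>();

        int maxDoc = -1;
        for (int[] p : postings) {
            if (p.length > 0) maxDoc = Math.max(maxDoc, p[p.length - 1]);
        }

        // Sweep the doc-number space one window of SIZE buckets at a time.
        for (int base = 0; base <= maxDoc; base += SIZE) {
            Arrays.fill(bits, 0);
            // Per posting, the only work is an array index and an OR.
            for (int clause = 0; clause < postings.length; clause++) {
                int[] p = postings[clause];
                while (cursor[clause] < p.length && p[cursor[clause]] < base + SIZE) {
                    int slot = p[cursor[clause]++] - base;
                    bits[slot] |= 1 << clause;
                }
            }
            // Boolean logic is applied once per touched bucket, with two mask tests.
            for (int slot = 0; slot < SIZE; slot++) {
                int mask = bits[slot];
                if (mask != 0
                        && (mask & requiredMask) == requiredMask
                        && (mask & prohibitedMask) == 0) {
                    hits.add(base + slot);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[][] postings = {
            {2, 9, 17},     // clause 0: required
            {2, 5, 17, 20}, // clause 1: optional
            {9, 20}         // clause 2: prohibited
        };
        System.out.println(search(postings, 0b001, 0b100)); // [2, 17]
    }
}
```

The per-posting cost does not depend on the number of clauses, which is the property the slide points to when it says the approach scales well with large queries.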

Lucene's Conjunctive Search Algorithm

From Paolo's slides:


Algorithm

use linked list of pointers to doc list
initially sorted by doc
loop
  if all are at same doc, record hit
  skip first to-or-past last and move to end of list
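The loop above can be written out over plain sorted arrays of doc numbers. This is a sketch of the steps as listed, not Lucene's scorer: `ConjunctionSketch` and `Cursor` are hypothetical names, and `skipTo` is a linear scan where a real index would use skip data.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedList;
import java.util.List;

final class ConjunctionSketch {
    /** A pointer into one sorted doc list. */
    static final class Cursor {
        final int[] docs;
        int pos = 0;

        Cursor(int[] docs) { this.docs = docs; }

        boolean exhausted() { return pos >= docs.length; }

        int doc() { return docs[pos]; }

        /** Advances to the first doc >= target ("to-or-past"). */
        void skipTo(int target) {
            while (pos < docs.length && docs[pos] < target) pos++;
        }
    }

    /** Returns the docs present in every list; each list must be sorted ascending. */
    static List<Integer> intersect(int[]... docLists) {
        List<Integer> hits = new ArrayList<>();
        LinkedList<Cursor> cursors = new LinkedList<>();
        for (int[] docs : docLists) {
            if (docs.length == 0) return hits; // an empty list means no hits at all
            cursors.add(new Cursor(docs));
        }
        if (cursors.isEmpty()) return hits;

        // Initially sorted by doc: first holds the smallest, last the largest.
        cursors.sort(Comparator.comparingInt(Cursor::doc));

        while (true) {
            Cursor first = cursors.getFirst();
            Cursor last = cursors.getLast();
            int target;
            if (first.doc() == last.doc()) {
                hits.add(first.doc());    // smallest == largest, so all agree
                target = first.doc() + 1; // move past the recorded hit
            } else {
                target = last.doc();      // skip first to-or-past last
            }
            first.skipTo(target);
            if (first.exhausted()) return hits;
            // first is now >= every other cursor, so it belongs at the end.
            cursors.addLast(cursors.removeFirst());
        }
    }

    public static void main(String[] args) {
        System.out.println(intersect(
            new int[] {1, 3, 5, 7, 9},
            new int[] {3, 4, 5, 9, 12},
            new int[] {0, 3, 9, 10})); // [3, 9]
    }
}
```

Because the skipped cursor always lands at or past the current maximum, moving it to the end keeps the list sorted without ever re-sorting it.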

Scoring

From Paolo's slides:


Is very much like Lucene's Similarity.

Lucene's Phrase Scoring

approximate phrase IDF with sum of terms
compute actual tf of phrase
slop penalizes slight mismatches by edit-distance

Thanks!

And there's lots more to Lucene.
Check out http://jakarta.apache.org/lucene/.

Finally, search for this talk on Technorati.
