Introduction To Apache Lucene
Uploaded by: mindfire-solutions
![Page 1: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/1.jpg)
Introduction to Apache Lucene
Sumit Luthra
![Page 2: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/2.jpg)
Agenda
What is Apache Lucene?
Focus of Apache Lucene
Lucene Architecture
Core Indexing Classes
Core Searching Classes
Demo
Questions & Answers
![Page 3: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/3.jpg)
What is Apache Lucene?
“Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.”
It is also known as an information retrieval (IR) library.
Lucene is specifically an API, not an application.
It is open source.
![Page 4: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/4.jpg)
Focus
Indexing Documents
Searching Documents
Note: You can use Lucene to provide consistent full-text indexing across both database objects and documents in various formats (Microsoft Office documents, PDF, HTML, text, emails and so on).
![Page 5: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/5.jpg)
Lucene Architecture
Indexing: Raw Content → Acquire content → Build document → Analyze document → Index document → Index
Searching: Users → Search UI → Build query → Run query (against the Index) → Render results
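The two flows on this slide can be sketched with a toy inverted index in plain Java. This is only an illustration under stated assumptions: `ToyInvertedIndex`, `addDocument`, `search` and `analyze` are hypothetical names, the analysis step is a simple lowercase-and-split, and none of this is Lucene's actual implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy illustration of the index/search flow; not Lucene's implementation.
final class ToyInvertedIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private final List<String> rawContent = new ArrayList<>();

    // "Build document", "Analyze document" and "Index document" in one step.
    int addDocument(String text) {
        int docId = rawContent.size();
        rawContent.add(text);
        for (String token : analyze(text)) {
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // "Run query": look up a single term and return matching document ids.
    List<Integer> search(String term) {
        TreeSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    // Assumed analysis step: lowercase, then split on anything that is not a letter or digit.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("Hello World");
        index.addDocument("Hello Lucene");
        System.out.println(index.search("hello"));  // prints [0, 1]
        System.out.println(index.search("lucene")); // prints [1]
    }
}
```

The point of the sketch is that searching never scans raw content: analysis happens once at indexing time, and a query is a lookup into the term-to-documents map.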
![Page 6: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/6.jpg)
Indexing Documents
IndexWriter writer = new IndexWriter(directory, analyzer, true);
Document doc = new Document();
// Field.Index.ANALYZED is the current name of the deprecated TOKENIZED option
doc.add(new Field("content", "Hello World", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("name", "filename.txt", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("path", "http://myfile/", Field.Store.YES, Field.Index.ANALYZED));
// [...]
writer.addDocument(doc);
writer.close();
![Page 7: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/7.jpg)
Core indexing classes
IndexWriter
Directory
Analyzer
Document
Field
![Page 8: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/8.jpg)
IndexWriter construction
// Deprecated
IndexWriter(Directory d, Analyzer a, // default analyzer
IndexWriter.MaxFieldLength mfl);
// Preferred
IndexWriter(Directory d,
IndexWriterConfig c);
![Page 9: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/9.jpg)
Directory
FSDirectory
RAMDirectory
DbDirectory
FileSwitchDirectory
JEDirectory
![Page 10: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/10.jpg)
Analyzers
Tokenize the input text
Common Analyzers
– WhitespaceAnalyzer: Splits tokens on whitespace
– SimpleAnalyzer: Splits tokens on non-letters, and then lowercases
– StopAnalyzer: Same as SimpleAnalyzer, but also removes stop words
– StandardAnalyzer: Most sophisticated analyzer that knows about certain token types, lowercases, removes stop words, ...
![Page 11: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/11.jpg)
Analysis examples
• “The quick brown fox jumped over the lazy dog”
• WhitespaceAnalyzer
– [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
• SimpleAnalyzer
– [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
• StopAnalyzer
– [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
• StandardAnalyzer
– [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
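The first three analyzers can be imitated in a few lines of plain Java to make the differences concrete. This is a rough sketch, not Lucene's code: `ToyAnalyzers` and its methods are hypothetical names, and `STOP_WORDS` is a small assumed list rather than Lucene's real default stop set. StandardAnalyzer is left out because its grammar-based tokenization is not reproducible this simply.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-ins for three analyzers; not Lucene's classes.
final class ToyAnalyzers {
    // Assumed stop list for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of"));

    // Like WhitespaceAnalyzer: split on whitespace, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Like SimpleAnalyzer: split on non-letters, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Like StopAnalyzer: the simple() tokens minus stop words.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumped over the lazy dog";
        System.out.println(whitespace(text)); // [The, quick, brown, fox, jumped, over, the, lazy, dog]
        System.out.println(simple(text));     // [the, quick, brown, fox, jumped, over, the, lazy, dog]
        System.out.println(stop(text));       // [quick, brown, fox, jumped, over, lazy, dog]
    }
}
```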
![Page 12: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/12.jpg)
More analysis examples
• “XY&Z Corporation – xyz@example.com”
• WhitespaceAnalyzer
– [XY&Z] [Corporation] [-] [xyz@example.com]
• SimpleAnalyzer
– [xy] [z] [corporation] [xyz] [example] [com]
• StopAnalyzer
– [xy] [z] [corporation] [xyz] [example] [com]
• StandardAnalyzer
– [xy&z] [corporation] [xyz@example.com]
![Page 13: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/13.jpg)
Document & Fields
A Document is the atomic unit of indexing and searching. It contains Fields.
Fields have a name and a value
– You have to translate raw content into Fields
– Examples: Title, author, date, abstract, body, URL, keywords, ...
– Different documents can have different fields
![Page 14: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/14.jpg)
Field options
Field.Store
– NO : Don’t store the field value in the index
– YES : Store the field value in the index
Field.Index
– ANALYZED : Tokenize with an Analyzer
– NOT_ANALYZED : Do not tokenize
– NO : Do not index this field
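A toy model can show how the two options act independently: storing controls whether a value can be read back from a hit, indexing controls whether it can be searched. Everything below is a hypothetical sketch in plain Java (`ToyFieldOptions`, `addField`, `search`, `storedValue` are made-up names, and the analysis is an assumed lowercase-and-split); it is not how Lucene's `Field` is implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model of the Store/Index options; not Lucene's Field class.
final class ToyFieldOptions {
    enum Store { YES, NO }
    enum Index { ANALYZED, NOT_ANALYZED, NO }

    // "field:term" -> ids of documents containing that term in that field
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // one map of stored field values per document
    private final List<Map<String, String>> stored = new ArrayList<>();

    int newDocument() {
        stored.add(new HashMap<>());
        return stored.size() - 1;
    }

    void addField(int docId, String name, String value, Store store, Index index) {
        if (store == Store.YES) {
            stored.get(docId).put(name, value); // value can be read back later
        }
        if (index == Index.ANALYZED) {
            // Assumed analysis: lowercase, split on non-word characters.
            for (String token : value.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!token.isEmpty()) {
                    post(name, token, docId);
                }
            }
        } else if (index == Index.NOT_ANALYZED) {
            post(name, value, docId); // the whole value is a single term
        }
        // Index.NO: nothing is added to the postings, so the field is not searchable.
    }

    private void post(String field, String term, int docId) {
        postings.computeIfAbsent(field + ":" + term, k -> new HashSet<>()).add(docId);
    }

    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, Collections.emptySet());
    }

    // Returns null when the field was not stored.
    String storedValue(int docId, String field) {
        return stored.get(docId).get(field);
    }

    public static void main(String[] args) {
        ToyFieldOptions index = new ToyFieldOptions();
        int doc = index.newDocument();
        index.addField(doc, "content", "Hello World", Store.NO, Index.ANALYZED);
        index.addField(doc, "path", "http://myfile/", Store.YES, Index.NOT_ANALYZED);
        System.out.println(index.search("content", "hello"));   // searchable: [0]
        System.out.println(index.storedValue(doc, "content"));  // not stored: null
        System.out.println(index.search("path", "myfile"));     // not tokenized: []
        System.out.println(index.storedValue(doc, "path"));     // stored: http://myfile/
    }
}
```

A typical pairing is a large body field that is analyzed but not stored, next to an identifier or path that is stored and indexed as a single term.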
![Page 15: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/15.jpg)
Searching an Index
// Version, field_name, WORD_SEARCHED and noOfHits are placeholders
IndexSearcher searcher = new IndexSearcher(directory);
QueryParser parser = new QueryParser(Version, field_name, analyzer);
Query query = parser.parse(WORD_SEARCHED);
TopDocs hits = searcher.search(query, noOfHits);
ScoreDoc[] scoreDocs = hits.scoreDocs;
Document doc = searcher.doc(scoreDocs[0].doc); // look at first match
System.out.println("name=" + doc.get("name"));
searcher.close();
![Page 16: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/16.jpg)
Core searching classes
IndexSearcher
Query
QueryParser
TopDocs
ScoreDoc
![Page 17: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/17.jpg)
IndexSearcher
Constructor:
– IndexSearcher(Directory d);
• // Deprecated
– IndexSearcher(IndexReader r);
• Construct an IndexReader with static method IndexReader.open(dir)
![Page 18: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/18.jpg)
Query
• TermQuery
– Constructed from a Term
• TermRangeQuery
• NumericRangeQuery
• PrefixQuery
• BooleanQuery
• PhraseQuery
• WildcardQuery
• FuzzyQuery
• MatchAllDocsQuery
![Page 19: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/19.jpg)
QueryParser
• Constructor
– QueryParser(Version matchVersion, String defaultField, Analyzer analyzer);
• Parsing methods
– Query parse(String query) throws ParseException;
– ... and many more
![Page 20: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/20.jpg)
QueryParser syntax examples

| Query expression | Document matches if… |
| --- | --- |
| `java` | Contains the term java in the default field |
| `java junit` or `java OR junit` | Contains the term java or junit or both in the default field (the default operator can be changed to AND) |
| `+java +junit` or `java AND junit` | Contains both java and junit in the default field |
| `title:ant` | Contains the term ant in the title field |
| `title:extreme -subject:sports` | Contains extreme in the title and not sports in subject |
| `(agile OR extreme) AND java` | Boolean expression matches |
| `title:"junit in action"` | Phrase matches in title |
| `title:"junit action"~5` | Proximity matches (within 5) in title |
| `java*` | Wildcard matches |
| `java~` | Fuzzy matches |
| `lastmodified:[1/1/09 TO 12/31/09]` | Range matches |
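The `+`, `-` and unmarked forms above map onto required, prohibited and optional clauses. A simplified sketch of that matching rule in plain Java follows; `ToyBooleanMatch` and `matches` are hypothetical names, terms are plain strings (with an assumed `field:term` spelling for non-default fields), and scoring, phrases, wildcards and ranges are ignored.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified required/optional/prohibited matching; not Lucene's BooleanQuery.
final class ToyBooleanMatch {
    static boolean matches(Set<String> docTerms,
                           List<String> must,
                           List<String> should,
                           List<String> mustNot) {
        for (String term : mustNot) {
            if (docTerms.contains(term)) {
                return false; // a prohibited term rules the document out
            }
        }
        for (String term : must) {
            if (!docTerms.contains(term)) {
                return false; // every required term has to be present
            }
        }
        if (!must.isEmpty()) {
            return true; // optional terms are not needed once required ones match
        }
        for (String term : should) {
            if (docTerms.contains(term)) {
                return true; // with no required terms, one optional term is enough
            }
        }
        return false; // nothing positive matched
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("java", "title:extreme", "subject:sports"));
        List<String> none = Collections.emptyList();

        // java junit  (either term)
        System.out.println(matches(doc, none, Arrays.asList("java", "junit"), none)); // true
        // +java +junit  (both terms)
        System.out.println(matches(doc, Arrays.asList("java", "junit"), none, none)); // false
        // title:extreme -subject:sports
        System.out.println(matches(doc, none, Arrays.asList("title:extreme"),
                Arrays.asList("subject:sports")));                                    // false
    }
}
```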
![Page 21: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/21.jpg)
TopDocs
Class containing the top N ranked documents/results that match a given query.
ScoreDoc
TopDocs exposes its hits as an array of ScoreDoc; each ScoreDoc identifies one document/result that matches the query, along with its score.
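Collecting "the top N" hits is commonly done with a bounded min-heap, so only N candidates are held at any time. The sketch below shows that idea in plain Java under stated assumptions: `ToyTopDocs`, `Hit` and `topN` are hypothetical names, and this is not presented as Lucene's collector code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy top-N selection by score; not Lucene's TopDocs implementation.
final class ToyTopDocs {
    // Hypothetical stand-in for one hit: a document id plus its score.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Returns the n highest-scoring hits, best first.
    static List<Hit> topN(List<Hit> allHits, int n) {
        // Min-heap on score: the weakest retained hit is at the head and is evicted first.
        PriorityQueue<Hit> heap =
                new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        for (Hit hit : allHits) {
            heap.add(hit);
            if (heap.size() > n) {
                heap.poll();
            }
        }
        List<Hit> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return result;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit(0, 0.2f));
        hits.add(new Hit(1, 0.9f));
        hits.add(new Hit(2, 0.5f));
        hits.add(new Hit(3, 0.7f));
        for (Hit hit : topN(hits, 2)) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
        // prints doc=1 score=0.9, then doc=3 score=0.7
    }
}
```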
![Page 22: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/22.jpg)
You will require lucene-core-x.y.jar for this demo.
Demo of simple indexing and searching using Apache Lucene
![Page 23: Introduction To Apache Lucene](https://reader034.fdocuments.in/reader034/viewer/2022050906/556264f1d8b42aab1a8b4bd3/html5/thumbnails/23.jpg)
Any Questions?
Thank You.