Source: cir.dcs.uni-pannon.hu/cikkek/IR_seminar1.pdf
Introduction to Information Retrieval
Seminar 1
IR architecture, document processing, indexing, weighting
University of Pannonia
Tamás Kiezer, Miklós Erdélyi
Review (1)
• IR architecture overview
Review (2)
• Document processing workflow
– Parsing
– Tokenization
– Stopword removal
– Stemming
– Inverted file building (indexing)
Parsing
• Stored information is available in diverse formats (HTML, PDF, DOC, etc.)
• These must be converted to a "canonical" format (i.e. plain text)
• Many open source tools are available for parsing in practice
– NekoHTML, pdftohtml, PDFBox, wvWare, etc.
• Metadata (e.g. Dublin Core, DCMI)
• Examples
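The conversion of an HTML document to plain text can be sketched with Python's standard `html.parser` module. This is only a minimal illustration, not one of the tools named above; the class name `TextExtractor` and the helper `html_to_text` are hypothetical, and real pages need more care (character encodings, entities, broken markup).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text content of an HTML string as one space-separated line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


print(html_to_text("<p>New <b>schizophrenia</b> drug</p><script>var x = 1;</script>"))
```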
Tokenization (segmentation)
• Chopping the document unit up into pieces called tokens
• Language-specific (needs language identification)
• How do we recognize word boundaries?
– -, /, ., ?, !, …
– e.g. by non-alphanumeric characters
• How do we handle numbers? (index size!)
• Non-trivial for East Asian languages such as Japanese and Chinese, which are written without spaces between words
• Examples
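Splitting at non-alphanumeric characters, as suggested above, can be sketched in a few lines of Python. The function name `tokenize` is hypothetical, and lowercasing the input is an added assumption (a simple normalization step), not something the slide prescribes.

```python
import re


def tokenize(text):
    """Split text into lowercase tokens at runs of non-alphanumeric characters."""
    # \W matches any character that is not a letter, digit, or underscore;
    # adding "_" to the class makes underscores act as boundaries as well.
    pieces = re.split(r"[\W_]+", text.lower())
    return [p for p in pieces if p]  # drop empty strings at the edges


print(tokenize("New hopes for schizophrenia patients!"))
print(tokenize("state-of-the-art, 3.14"))
```

The second call shows why numbers and hyphenated words need a policy of their own: this naive rule turns `3.14` into the two tokens `3` and `14`.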
Stoplisting (1)
• Idea: words that are too frequent or too rare do not convey useful information
– Throw away these words during preprocessing using a stoplist
• Example English stoplist:
a ab about above ac according across ads ae af after afterwards
against albeit all almost alone along already also although always
among amongst an and another any anybody anyhow anyone
…
with within without worse worst would wow www x y ye year yet
yippee you your yours yourself yourselves
Stoplisting (2)
• Automatic generation of a stoplist from the word frequency distribution
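One way to derive a stoplist from the word frequency distribution is to cut off both ends of it. The sketch below is an assumption about how this could be done, not the seminar's procedure: the function names, the thresholds `high` and `low_count`, and the toy collection are all made up for illustration.

```python
from collections import Counter


def build_stoplist(docs, high=0.05, low_count=1):
    """Return words that are too frequent or too rare in a tokenized collection.

    high      -- words whose share of all tokens exceeds this are "too frequent"
    low_count -- words occurring at most this many times are "too rare"
    """
    counts = Counter(token for doc in docs for token in doc)
    total = sum(counts.values())
    too_frequent = {w for w, c in counts.items() if c / total > high}
    too_rare = {w for w, c in counts.items() if c <= low_count}
    return too_frequent | too_rare


def remove_stopwords(tokens, stoplist):
    """Keep only the tokens that are not on the stoplist."""
    return [t for t in tokens if t not in stoplist]


# Toy collection (illustrative data only).
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "saw", "the", "dog"],
]
stoplist = build_stoplist(docs, high=0.3, low_count=1)
print(sorted(stoplist))  # ['log', 'mat', 'saw', 'the']
print(remove_stopwords(docs[0], stoplist))  # ['cat', 'sat', 'on']
```

On a realistic collection the thresholds would be chosen by inspecting the frequency distribution rather than fixed in advance.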
Stemming
• Idea: reduce lexicon size, improve retrieval efficiency
• Language-specific methods
– Properly handling agglutinative languages such as Hungarian is difficult
• Stemming methods
– Brute force, lemmatization, suffix stripping, affix stripping
• Over-stemming, under-stemming
• Normalization (equivalence classing of terms)
Stemming – Porter’s method
• Suffix stripping method
• Well established for stemming English texts
• 5-step algorithm
– Step 1 deals with plurals and past participles.
– Steps 2-3 remove adjective/noun formative syllables.
– Step 4 removes noun formative syllables.
– Step 5 tidies up.
• Example
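To give a feel for suffix stripping, the sketch below implements only the plural rules of Porter's step 1a (SSES → SS, IES → I, SS → SS, S → nothing). It is not the full algorithm: the remaining steps and the conditions on the stem are left out, and the function name `porter_step_1a` is hypothetical.

```python
def porter_step_1a(word):
    """Apply the plural-handling rules of Porter's step 1a to a lowercase word."""
    # The rules are tried in this order; the first matching suffix wins.
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word


for w in ["caresses", "ponies", "caress", "cats", "hopes", "patients"]:
    print(w, "->", porter_step_1a(w))
```

Note that the result need not be a dictionary word (`poni`); a stem only has to be the same for all variants of a word.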
Example: Porter’s stemming rules (excerpt)
Example: Hunspell for stemming Hungarian text (too)
• Hunspell: general library for morphological analysis and stemming
• Affix stripper (does prefix and suffix stripping) with a dictionary of base words
• Example rules:
Inverted file structure – review
• Stores the postings list for each term
• Eases answering queries - how?
Inverted index construction
• Example:
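A minimal sketch of inverted index construction and of answering a conjunctive (AND) query from the postings lists follows. The function names are hypothetical, and the input is a small collection assumed to be already tokenized, stoplisted and stemmed.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id]):  # one posting per (term, document) pair
            index[term].append(doc_id)
    return dict(index)


def query_and(index, *terms):
    """Return the ids of documents that contain all of the given terms."""
    if not terms:
        return []
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings))


# Illustrative collection: document id -> preprocessed terms.
docs = {
    1: ["breakthrough", "drug", "schizophrenia"],
    2: ["new", "schizophrenia", "drug"],
    3: ["new", "approach", "treatment", "schizophrenia"],
    4: ["new", "hope", "schizophrenia", "patient"],
}
index = build_inverted_index(docs)
print(index["drug"])                    # [1, 2]
print(query_and(index, "new", "drug"))  # [2]
```

A query is answered by looking up only the postings lists of its terms and intersecting them, so the documents themselves never have to be scanned.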
Weighting methods – review
• Binary weighting:
• Frequency weighting:
• Max-normalized (max-tf):
• Length-normalized (norm-tf):
• Term frequency inverse document frequency (tf-idf):
• Length-normalized term frequency inverse document frequency (norm-tf-idf):
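The schemes listed above are commonly defined as follows. This is a sketch in standard textbook form: the notation ($f_{ij}$ for the number of occurrences of term $i$ in document $j$, $N$ for the number of documents, $n_i$ for the number of documents containing term $i$) and the base of the logarithm are assumptions, not the seminar's own definitions.

```latex
\begin{align*}
\text{binary:}      \quad & w_{ij} = \begin{cases} 1 & \text{if } f_{ij} > 0 \\ 0 & \text{otherwise} \end{cases} \\
\text{frequency:}   \quad & w_{ij} = f_{ij} \\
\text{max-tf:}      \quad & w_{ij} = \frac{f_{ij}}{\max_k f_{kj}} \\
\text{norm-tf:}     \quad & w_{ij} = \frac{f_{ij}}{\sqrt{\sum_k f_{kj}^2}} \\
\text{tf-idf:}      \quad & w_{ij} = f_{ij} \cdot \log\frac{N}{n_i} \\
\text{norm-tf-idf:} \quad & w_{ij} = \frac{f_{ij} \cdot \log\frac{N}{n_i}}{\sqrt{\sum_k \left( f_{kj} \cdot \log\frac{N}{n_k} \right)^2}}
\end{align*}
```

For a document with three distinct terms that each occur once, norm-tf gives every term the weight $1/\sqrt{3} \approx 0.57735$.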
Exercise: building a TD matrix
• Let us consider the following simple document collection:
Doc 1: breakthrough drug for schizophrenia
Doc 2: new schizophrenia drug
Doc 3: new approach for treatment of schizophrenia
Doc 4: new hopes for schizophrenia patients
• Build a frequency weighted term-document (TD) matrix
• Build a norm-tf weighted TD matrix
• Build a norm-tf-idf weighted TD matrix
Solution: tf weighted TD matrix
Terms/Documents  Doc1  Doc2  Doc3  Doc4
approach         0     0     1     0
breakthrough     1     0     0     0
drug             1     1     0     0
hope             0     0     0     1
new              0     1     1     1
patient          0     0     0     1
schizophrenia    1     1     1     1
treatment        0     0     1     0
Solution: norm-tf weighted TD matrix
Terms/Documents  Doc1     Doc2     Doc3  Doc4
approach         0        0        0.5   0
breakthrough     0.57735  0        0     0
drug             0.57735  0.57735  0     0
hope             0        0        0     0.5
new              0        0.57735  0.5   0.5
patient          0        0        0     0.5
schizophrenia    0.57735  0.57735  0.5   0.5
treatment        0        0        0.5   0
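The three matrices of the exercise can be computed with the short sketch below. It assumes the documents have already been stoplisted ("for" and "of" removed) and stemmed ("hopes" → "hope", "patients" → "patient"), as the term list of the solution suggests. The idf formula log(N / n) with the natural logarithm is an assumption, and all variable names are illustrative.

```python
import math

# Preprocessed documents (assumed result of stoplisting and stemming).
docs = {
    "Doc1": ["breakthrough", "drug", "schizophrenia"],
    "Doc2": ["new", "schizophrenia", "drug"],
    "Doc3": ["new", "approach", "treatment", "schizophrenia"],
    "Doc4": ["new", "hope", "schizophrenia", "patient"],
}
terms = sorted({t for tokens in docs.values() for t in tokens})
n_docs = len(docs)


def l2_normalize(vector):
    """Divide every weight by the Euclidean length of the document vector."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    return {t: (w / norm if norm else 0.0) for t, w in vector.items()}


# Frequency (tf) weights: raw occurrence counts.
tf = {d: {t: tokens.count(t) for t in terms} for d, tokens in docs.items()}
# norm-tf: length-normalized term frequencies.
norm_tf = {d: l2_normalize(v) for d, v in tf.items()}
# Document frequency and idf (natural logarithm is an assumption).
df = {t: sum(1 for d in docs if tf[d][t] > 0) for t in terms}
idf = {t: math.log(n_docs / df[t]) for t in terms}
# norm-tf-idf: tf * idf, then length-normalized per document.
tf_idf = {d: {t: tf[d][t] * idf[t] for t in terms} for d in docs}
norm_tf_idf = {d: l2_normalize(v) for d, v in tf_idf.items()}

print(f"{'Terms/Documents':17s}" + "".join(f"{d:>9s}" for d in docs))
for t in terms:
    print(f"{t:17s}" + "".join(f"{norm_tf_idf[d][t]:9.5f}" for d in docs))
```

Because "schizophrenia" occurs in every document, its idf is log(4/4) = 0, so its norm-tf-idf weight is 0 in all four columns: a term that appears everywhere does not help to tell documents apart.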
Example: Terrier IR Platform
Terrier: Indexing
Terrier: Search results
Questions?