Lecture 2: Information Retrieval

Introduction to Information Retrieval: Document ingestion

Transcript of Lecture 2: Information Retrieval

Introduction to Information Retrieval
Document ingestion

Recall the basic indexing pipeline
Tokenizer -> Token stream: Friends Romans Countrymen
Linguistic modules -> Modified tokens: friend roman countryman
Indexer -> Inverted index:
friend: 2, 4
roman: 1, 2
countryman: 13, 16
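The pipeline recalled above (tokenizer, linguistic modules, indexer) can be sketched in a few lines of Python. This is a minimal illustration and not the lecture's implementation: the function names, the IRREGULAR lookup table, the crude plural-stripping rule, and the two sample documents with their IDs are all assumptions made for the example. A real system would use a proper stemmer or lemmatizer in place of normalize.

```python
import re
from collections import defaultdict

# Hypothetical lookup for irregular plurals (assumption for this sketch);
# a real linguistic module would use a stemmer or lemmatizer instead.
IRREGULAR = {"countrymen": "countryman"}


def tokenize(text):
    """Tokenizer: split raw text into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def normalize(token):
    """Toy linguistic module: lowercase, then crudely singularize."""
    t = token.lower()
    if t in IRREGULAR:
        return IRREGULAR[t]
    if t.endswith("s") and len(t) > 3:
        return t[:-1]
    return t


def build_index(docs):
    """Indexer: map each normalized term to a sorted list of document IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Made-up sample collection; the document IDs are illustrative only.
docs = {1: "Friends, Romans, countrymen.", 2: "Roman friends"}
index = build_index(docs)
```

Here tokenize produces the token stream, normalize yields the modified tokens, and index holds one postings list per term, so index["countryman"] lists only the documents that contain some form of that word.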

Documents to be indexed: Friends, Romans, countrymen.

Parsing a document (Sec. 2.1)
What format is it in? pdf/word/excel/html?
What language is it in?
What character set is in use? (CP1252, UTF-8, …)
Each of these is a classification problem, which we will study later in the course.
But these tasks are often done heuristically …

Complications: Format/language
Documents being indexed can include docs from many different languages. A single index may contain terms from many languages.
Sometimes a document or its components can contain multiple languages/formats
French email with a German pdf attachment.
French email