Information Retrieval Techniques Israr Hanif M.Phil QAU Islamabad Ph D (In progress) COMSATS.
Information Retrieval Techniques
Israr Hanif, M.Phil QAU Islamabad
Ph D (In progress) COMSATS
Information Retrieval Techniques
MS(CS) Lecture 1
AIR UNIVERSITY MULTAN CAMPUS
Information Retrieval Systems
• Information
– What is “information”?
• Retrieval
– What do we mean by “retrieval”?
– What are the different types of information needs?
• Systems
– How do computer systems fit into the human information-seeking process?
Dictionary says…
• Oxford English Dictionary
– information: informing, telling; thing told, knowledge, items of knowledge, news
– knowledge: knowing, familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known
• Random House Dictionary
– information: knowledge communicated or received concerning a particular fact or circumstance; news
Intuitive Notions
• Information must
– Be something, although the exact nature (substance, energy, or abstract concept) is not clear
– Be “new”: repetition of previously received messages is not informative
– Be “true”: false or counterfactual information is “mis-information”
– Be “about” something
Robert M. Losee. (1997) A Discipline Independent Definition of Information. Journal of the American Society for Information Science, 48(3), 254-269.
Information Hierarchy
[Figure: Data → Information → Knowledge → Wisdom, each level more refined and abstract than the one before]
Information Hierarchy
• Data
– The raw material of information
• Information
– Data organized and presented in a particular manner
• Knowledge
– “Justified true belief”
– Information that can be acted upon
• Wisdom
– Distilled and integrated knowledge
– Demonstrative of high-level “understanding”
A (Facetious) Example
• Data
– 98.6° F, 99.5° F, 100.3° F, 101° F, …
• Information
– Hourly body temperature: 98.6° F, 99.5° F, 100.3° F, 101° F, …
• Knowledge
– If you have a temperature above 100° F, you most likely have a fever
• Wisdom
– If you don’t feel well, go see a doctor
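The first three levels of this example can be sketched in a few lines of Python. This is only an illustration: the starting hour of 0 and the names `readings_f`, `hourly`, and `likely_fever` are assumptions, while the readings and the 100° F threshold come from the slide.

```python
# Data: raw readings with no context (values from the slide).
readings_f = [98.6, 99.5, 100.3, 101.0]

# Information: the same data organized as hourly body temperature.
# Numbering the hours from 0 is an assumption for illustration.
hourly = {hour: temp for hour, temp in enumerate(readings_f)}


# Knowledge: a rule that can be acted upon (threshold from the slide).
def likely_fever(temp_f, threshold_f=100.0):
    return temp_f > threshold_f


fever_hours = [hour for hour, temp in hourly.items() if likely_fever(temp)]
print(fever_hours)
```

Wisdom is deliberately left out: the slide's point is that it does not reduce to a rule over the data.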
What types of information?
• Text (documents and portions thereof)
• XML and structured documents
• Images
• Audio (sound effects, songs, etc.)
• Video
• Source code
• Applications/Web services
“Retrieval?”
• “Fetch something” that’s been stored
• Recover a stored state of knowledge
• Search through stored messages to find some messages relevant to the task at hand
[Figure: Sender → Encoding (indexing/writing) → message → storage → message → Decoding (retrieval/reading) → Recipient; the figure also carries a “noise” label]
What is IR?
• Information retrieval is a problem-oriented discipline, concerned with the problem of the effective and efficient transfer of desired information between human generator and human user
• Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Anomalous States of Knowledge as a Basis for Information Retrieval. (1980) Nicholas J. Belkin. Canadian Journal of Information Science, 5, 133-143.
What is Information Retrieval?
• The process of actively seeking out information relevant to a topic of interest (van Rijsbergen)
– Typically it refers to the automatic (rather than manual) retrieval of documents
• Information Retrieval System (IRS)
– “Document” is the generic term for an information holder (book, chapter, article, webpage, etc.)
Hopkins IR Workshop 2005 Copyright © Victor Lavrenko
What is Information Retrieval?
• Most people equate IR with web-search
– highly visible, commercially successful endeavors
– leverage 3+ decades of academic research
• IR: finding any kind of relevant information
– web-pages, news events, answers, images, …
– “relevance” is a key notion (details in Part II)
The formalized IR process
[Figure: Real world → Collection of documents → Document representations; Anomalous state of knowledge → Information need → Query; Document representations and Query → Matching → Results]
What do we want from an IRS?
• Systemic approach
– Goal (for a known information need): return as many relevant documents as possible and as few non-relevant documents as possible
• Cognitive approach
– Goal (in an interactive information-seeking environment, with a given IRS): support the user’s exploration of the problem domain and the task completion
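The systemic goal is commonly quantified with precision (the fraction of returned documents that are relevant) and recall (the fraction of relevant documents that are returned). The slide does not name these measures, so treat the sketch below as one standard reading of the goal. The function name `precision_recall` and the document ids are hypothetical, and the relevance judgments are assumed to be known in advance.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document ids the system returned
    relevant:  document ids judged relevant (assumed to be known)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids, for illustration only.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d7"],
)
print(p, r)
```

Returning every document in the collection maximizes recall but usually ruins precision, which is why the goal states both halves together.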
The role of an IR system – a modern view –
• Support the user in
– exploring a problem domain, understanding its terminology, concepts and structure
– clarifying, refining and formulating an information need
– finding documents that match the info need description
• As many relevant docs as possible
• As few non-relevant documents as possible
How does it do this?
• User interfaces and visualization tools for
– exploring a collection of documents
– exploring search results
• Query expansion based on
– Thesauri
– Lexical/statistical analysis of text / context and concept formation
– Relevance feedback
• Indexing and matching model
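As a minimal sketch of thesaurus-based query expansion, the code below adds related terms to each query word. The `THESAURUS` entries and the `expand_query` function are hypothetical and are not taken from any real resource; a real system would also weight or filter the added terms.

```python
# Toy thesaurus: hypothetical entries for illustration only.
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "doctor": ["physician"],
}


def expand_query(query, thesaurus=THESAURUS):
    """Add thesaurus entries for each query term, preserving order
    and dropping duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query("car doctor"))
```

Expansion of this kind tends to trade precision for recall: more relevant documents can match, but so can more non-relevant ones.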
How well does it do this?
• Evaluation
– Of the components
• Indexing / matching algorithms
– Of the exploratory process overall
• Usability issues
• Usefulness to task
• User satisfaction
Role of the user interface in IR
[Figure: Problem definition → Source selection → Problem articulation → Examination of results → Extraction of information → Integration with overall task; other labels in the figure: INPUT, OUTPUT, Engine]
The Big Picture
• The four components of the information retrieval environment:
– User
– Process
– System
– Collection
What computer geeks care about!
What we care about!
The Information Retrieval Cycle
[Figure: Source Selection → (Resource) → Query Formulation → (Query) → Search → (Ranked List) → Selection → (Documents) → Examination → (Documents) → Delivery; feedback loops: query reformulation, vocabulary learning, relevance feedback; source reselection]
Supporting the Search Process
[Figure: Source Selection → (Resource) → Query Formulation → (Query) → Search → (Ranked List) → Selection → (Documents) → Examination → (Documents) → Delivery; supporting components: Acquisition → Collection, Indexing → Index]
Simplification?
[Figure: Source Selection → (Resource) → Query Formulation → (Query) → Search → (Ranked List) → Selection → (Documents) → Examination → (Documents) → Delivery; feedback loops: query reformulation, vocabulary learning, relevance feedback; source reselection]
Is this itself a vast simplification?
The IR Black Box
[Figure: Documents and Query go into the black box; Hits come out]
Inside The IR Black Box
[Figure: Query → Representation Function → Query Representation; Documents → Representation Function → Document Representation → Index; Query Representation and Index → Comparison Function → Hits]
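The pieces inside the box can be sketched with a toy inverted index. The choices here are assumptions made for illustration and are not prescribed by the slides: the representation function is a lowercase bag of words, and the comparison function counts shared terms. The names `represent`, `build_index`, and `search`, and the sample documents, are hypothetical.

```python
from collections import defaultdict


def represent(text):
    """Representation function: a lowercase set of whitespace-separated terms."""
    return set(text.lower().split())


def build_index(documents):
    """Index: map each term to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in represent(text):
            index[term].add(doc_id)
    return index


def search(query, index):
    """Comparison function: rank documents by the number of query terms
    they share, breaking ties by document id."""
    scores = defaultdict(int)
    for term in represent(query):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical collection, for illustration only.
docs = {
    "d1": "information retrieval systems",
    "d2": "retrieval of stored messages",
    "d3": "body temperature and fever",
}
index = build_index(docs)
print(search("information retrieval", index))
```

Queries and documents pass through the same representation function, which is what makes the comparison meaningful. Swapping in a different representation (stemming, term weights) or comparison (cosine similarity) changes the hits without changing this overall shape.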