8/2/2019 Text Retrieval
1/29
Multimedia Information Retrieval (CSC 545)
Textual Retrieval
By Dr. Nursuriati Jamil
The problem of IR
Goal = find documents relevant to an information need from a large document set
[Figure: an information need is formulated as a query; the IR system retrieves an answer list from the document collection]
The retrieval problem
Given N documents (D0, ..., DN-1) and a query Q of the user, the problem is to return a ranked list of k documents Dj ordered by decreasing relevance to Q.
[Figure: retrieval architecture.
OFFLINE: documents are inserted into the database; feature extraction maps the text to index terms (e.g. kena -> word67) and builds an inverted file, e.g. dera -> HBJ3N129, HBM4N111; budak -> HBJ2N19, HBJ3N129; Malaysia -> HBJN129.
ONLINE: the query "Penderaan kanak-kanak di Malaysia" (child abuse in Malaysia) is transformed into Q = {dera, kanak-kanak, Malaysia, seksa, pukul, hukum, budak, bayi, remaja}; relevance ranking computes retrieval status values, e.g. RSV(Q, HBJ3N129) = .2 and RSV(Q, HBM4N111) = .4, and returns the ranked result.]
Feature (Terms) extraction
A text retrieval system represents documents as sets of terms (e.g., words). Thereby, the originally structured document becomes an unstructured set of terms, potentially annotated with attributes to denote frequency and position in the text. The transformation comprises several steps:
1. Elimination of structure (i.e. formats)
2. Elimination of frequent/infrequent terms (i.e. stop words)
3. Mapping text to terms (without punctuation)
4. Reduction of terms to their stems (stemming, syllable division)
5. Mapping to index terms
(The order of the steps above may vary; often, steps are broken into several steps or several steps are combined into a single pass)
Types of terms: words, phrases or n-grams (i.e., sequences of n characters)
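As a sketch, the steps above can be combined into one pass. The stop-word list and the trailing-"s" stemmer below are toy stand-ins for steps 2 and 4, not the methods discussed later in these slides:

```python
import re

STOP_WORDS = {"the", "a", "is", "of"}  # toy stop-word list (step 2)

def extract_terms(html: str) -> list[str]:
    # Step 1: eliminate structure (strip HTML tags)
    text = re.sub(r"<[^>]+>", " ", html)
    # Step 3: map text to terms (lowercase, drop punctuation)
    tokens = re.findall(r"[a-z]+", text.lower())
    # Step 2: eliminate frequent terms (stop words)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Step 4: naive stemming stub, strip a trailing 's' (illustrative only)
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

print(extract_terms("<p>The streets of the city</p>"))  # ['street', 'city']
```

In a real system each step would be a separate, configurable stage; collapsing them into one function only illustrates the data flow.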
Overview of feature extraction
[Figure: pipeline of the steps: structure elimination; removal of frequent/infrequent terms; mapping text to terms; stemming; index]
Step 1. Structure elimination
HTML contains special markups, so-called tags. They describe meta-information about the document and the layout/presentation of content. An HTML document is split into two parts, a header section and a body section:
Header: Contains meta-information about the document; it also describes all embedded elements like images.
Body: Encompasses the document enriched with markups for layout. The structure of the document is not always obvious.
Step 1. Structure elimination (cont.)
Meta data: HTML provides several possibilities to define meta-information (the <meta> tag). The most frequent ones are:
URL of page: http://www-dbs.ethz.ch/~mmir/
Title of document: <title>ETH Zurich - Homepage</title>
Meta information in header section:
Step 1. Structure elimination (cont.)
Embedded links and how to handle them:
Embedded objects (images, plug-ins):
Distribution of term frequencies
[Figure: Zipf-like frequency distribution with an upper cut-off (stop words, most frequent words) and a lower cut-off (seldom-used words)]
Insignificant terms
Stop words are terms with little or no semantic meaning and are thus often not indexed. Examples: English: the, a, is; Bahasa Melayu: ada, iaitu, mana, bersabda, wahai
Often, the rank of these terms is on the left side of the upper cut-off line. Generally, stop words are responsible for 20% to 30% of the term occurrences in a text. With the elimination of stop words, the memory consumption of the index can be reduced.
Similarly, the most frequent terms in a collection of documents carry little information (rank on the left side of the upper cut-off line): The term "computer" is meaningless for indexing articles about computer science. The same term, however, is important for distinguishing general articles, such as ones about careers, from articles about computer science.
Analogously, one can strip off words that are seldom used. This assumes that users will not use them in their queries (their rank is on the right side of the lower cut-off), although the additional memory savings are rather small.
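The two cut-offs can be sketched in a few lines. The thresholds below (an upper cut-off as a fraction of all term occurrences, a lower cut-off as a minimum count) are illustrative assumptions, not values from the slides:

```python
from collections import Counter

def prune_terms(tokens, upper=0.3, lower=1):
    """Keep only terms between the two cut-offs: drop terms whose share
    of all occurrences exceeds `upper` (a frequency-based stand-in for
    stop words) and terms occurring at most `lower` times (rare terms)."""
    freq = Counter(tokens)
    total = sum(freq.values())
    return {t: n for t, n in freq.items()
            if n / total <= upper and n > lower}

tokens = ["the"] * 8 + ["computer"] * 3 + ["retrieval"] * 2 + ["zipf"]
print(prune_terms(tokens))  # {'computer': 3, 'retrieval': 2}
```

Here "the" falls above the upper cut-off and "zipf" below the lower one, matching the two shaded regions of the distribution figure.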
Step 3: Mapping text to terms
To select appropriate features for documents, one typically uses linguistic or statistical approaches to define the features based on words, fragments of words or phrases.
Most search engines use words or phrases as features. Some engines use stemming, some differentiate between upper and lower case, and some support error correction.
An interesting option is the usage of fragments, i.e., so-called n-grams. Although not directly related to the semantics of the text, they are very useful to support fuzzy retrieval.
Example of word fragments (n-grams):
street -> str, tre, ree, eet
streets -> str, tre, ree, eet, ets
strets -> str, tre, ret, ets
Benefits:
Simple misspellings or bad recognition often result in bad retrievals; fragments significantly improve retrieval quality.
Stemming and syllable division are not necessary any more.
No language-specific retrieval is necessary; every language is processed equally.
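The trigram decomposition on the slide, plus an illustrative similarity measure (the Dice coefficient over trigram sets is a common choice, though the slides do not name one), can be written as:

```python
def ngrams(word: str, n: int = 3) -> list[str]:
    """Character n-grams as on the slide: street -> str, tre, ree, eet."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def ngram_overlap(a: str, b: str) -> float:
    """Dice coefficient over trigram sets, tolerant of misspellings."""
    ga, gb = set(ngrams(a)), set(ngrams(b))
    return 2 * len(ga & gb) / (len(ga) + len(gb))

print(ngrams("street"))                     # ['str', 'tre', 'ree', 'eet']
print(ngram_overlap("streets", "strets"))   # still matches despite the typo
```

The misspelled "strets" shares three of its trigrams with "streets", so a fuzzy match survives where exact word matching would fail.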
Locations and frequency of terms
Retrieval algorithms often use the number of term occurrences and the positions of terms within the document to identify and rank results.
Term frequency ("feature frequency"): tf(Ti, Dj) = number of occurrences of a feature Ti in document Dj. Term frequency is important to rank documents.
Term locations (feature locations): loc(Ti, Dj) -> P(N) [set of locations]. Term locations frequently influence the ranking and whether a document appears in the result at all, e.g.:
Condition: Q = shah NEAR alam (explicit phrase matching): looking for documents with the terms shah and alam close to each other
Ranking: Q = shah alam (implicit phrase matching): documents with the term shah next to alam should be at the top of the results.
tf = term frequency: frequency of a term/keyword in a document. The higher the tf, the higher the importance (weight) for the doc.
df = document frequency: no. of documents containing the term; the distribution of the term.
idf = inverse document frequency: the unevenness of term distribution in the corpus; the specificity of the term to a document. The more evenly a term is distributed, the less specific it is to a document.
weight(t,D) = tf(t,D) * idf(t)
tf*idf weighting schema
Example
Term  # of docs  (Dj, tfj) ...
Haji  3          (D7, 4)  (D26, 10)  (D40, 5)
Iman  5          (D21, 2) ...
Term Haji occurs in three documents: 4 times in doc 7, 10 times in doc 26 and 5 times in doc 40.
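Posting lists like the one for Haji can be built with a small inverted-index sketch (document contents are invented here to reproduce the table's counts):

```python
from collections import defaultdict

def build_inverted_index(docs: dict[str, list[str]]):
    """Map each term to a posting list [(doc_id, tf), ...]."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for t in terms:
            index[t][doc_id] = index[t].get(doc_id, 0) + 1
    return {t: sorted(postings.items()) for t, postings in index.items()}

# Toy documents matching the Haji row of the example table
docs = {"D7": ["haji"] * 4, "D26": ["haji"] * 10, "D40": ["haji"] * 5 + ["iman"]}
index = build_inverted_index(docs)
print(index["haji"])   # [('D26', 10), ('D40', 5), ('D7', 4)]
```

Note the posting lists are sorted by document-id string here for reproducibility; real systems sort by numeric id or by weight.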
Some common tf*idf schemes
tf(t, D) = freq(t, D)
tf(t, D) = log[freq(t, D)]
tf(t, D) = log[freq(t, D)] + 1
tf(t, D) = freq(t, D) / Max[freq(t, D)]
idf(t) = log(N/n), where n = # docs containing t and N = # docs in corpus
weight(t, D) = tf(t, D) * idf(t)
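One of the schemes above, tf(t, D) = log[freq(t, D)] + 1 with idf(t) = log(N/n), can be computed directly (the corpus size N = 100 is an assumed value for the example):

```python
import math

def idf(n_containing: int, n_docs: int) -> float:
    """idf(t) = log(N / n)."""
    return math.log(n_docs / n_containing)

def tf_log(freq: int) -> float:
    """tf(t, D) = log(freq(t, D)) + 1, one of the schemes above."""
    return math.log(freq) + 1

def weight(freq: int, n_containing: int, n_docs: int) -> float:
    """weight(t, D) = tf(t, D) * idf(t)."""
    return tf_log(freq) * idf(n_containing, n_docs)

# 'Haji' occurs 4 times in D7 and appears in 3 documents; assume N = 100.
print(round(weight(4, 3, 100), 3))
```

A term that occurs in few documents (small n) gets a large idf, so the same tf yields a higher weight for more discriminative terms.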
Overview of feature extraction (step 3: text to terms)
Term   Pos  #Doc  (Dj, tfj) ...
Abdul  5    2     (10, 1)  (21, 2)
Agong  4    3     (2, 3)  (6, 5)  (31, 2)
:      :    :     :
Step 4: Stemming
How does word stemming work? Stemming broadens our results to include both word roots and word derivations. It is commonly accepted that removal of word endings (sometimes called suffix stripping) is a good idea; removal of prefixes can be useful in some subject domains.
Why do we need word stemming in the context of free-text searching?
Free-text searching searches exactly what we type into the search box, without changing it to a thesaurus term.
Morphological variants of words have similar semantic interpretations.
A smaller dictionary size results in a saving of storage space and processing time.
Word stemming (cont.)
Algorithms for Word Stemming
A stemming algorithm is an algorithm that converts a word to a related form. One of the simplest such transformations is the conversion of plurals to singulars.
Families of algorithms: affix removal, successor variety, table lookup, n-gram.
In most languages, words have various inflected (or sometimes, derived) forms. The different forms should not carry different meanings but should be mapped to a single form.
However, in many languages, it is not simple to derive the linguistic stem without a dictionary. At least for English, there exist algorithms without the need of a dictionary which still produce good results (Porter Algorithm).
Word stemming (cont.)
Pros & Cons
Word stemmers are used to conflate terms to improve retrieval effectiveness and/or to reduce the size of indexing files; they increase recall at the cost of decreased precision.
Over-stemming and under-stemming also create problems for retrieving the documents.
Porter's Algorithm
The Porter Stemmer is a conflation stemmer developed by Martin Porter at the University of Cambridge in 1980.
The Porter stemming algorithm (or 'Porter stemmer') is a process for removing the commoner morphological and inflexional endings from words in English.
It is the most effective and widely used stemmer.
Porter's algorithm works based on the measure: the number of vowel sequences that are followed by a consonant sequence in the stem must be greater than one for the rule to be applied.
A word can have any one of the forms: C..C, C..V, V..V, V..C.
These can be represented as [C](VC){m}[V].
Porter's Algorithm (cont.)
The rules in the Porter algorithm are separated into five distinct steps numbered from 1 to 5. They are applied to the words in the text starting from step 1 and moving on to step 5.
Step 1 deals with plurals and past participles. The subsequent steps are much more straightforward.
Ex. plastered -> plaster, motoring -> motor
Step 2 deals with pattern matching on some common suffixes.
Ex. happy -> happi, relational -> relate, callousness -> callous
Step 3 deals with special word endings.
Ex. triplicate -> triplic, hopeful -> hope
Porter's Algorithm (cont.)
Step 4 checks the stripped word against more suffixes in case the word is compounded.
Ex. revival -> reviv, allowance -> allow, inference -> infer
Step 5 checks if the stripped word ends in a vowel and fixes it appropriately.
Ex. probate -> probat, cease -> ceas, controll -> control
The algorithm is careful not to remove a suffix when the stem is too short, the length of the stem being given by its measure, m. There is no linguistic basis for this approach.
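A minimal sketch of the measure check and of a single suffix-stripping rule follows. This is a one-rule illustration of the [C](VC){m}[V] idea, not the full five-step Porter algorithm; in particular, 'y' is always treated as a consonant here, which the real algorithm handles more carefully:

```python
def measure(stem: str) -> int:
    """Porter's measure m: the number of VC sequences in [C](VC){m}[V].
    Toy version: 'y' is always treated as a consonant."""
    vowels = "aeiou"
    pattern = "".join("V" if c in vowels else "C" for c in stem.lower())
    # Collapse runs of identical classes, then count VC transitions.
    collapsed = "".join(c for i, c in enumerate(pattern)
                        if i == 0 or pattern[i - 1] != c)
    return collapsed.count("VC")

def strip_suffix(word: str, suffix: str, replacement: str = "") -> str:
    """Apply one Porter-style rule only if the remaining stem has m > 1."""
    if word.endswith(suffix):
        stem = word[: len(word) - len(suffix)]
        if measure(stem) > 1:
            return stem + replacement
    return word

print(measure("plaster"))                              # 2
print(strip_suffix("generalization", "ation", "e"))    # generalize
print(strip_suffix("nation", "ation", "e"))            # nation (stem too short)
```

The "nation" case shows the safeguard from the slide: the stem "n" has measure 0, so the suffix is left in place.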
Dictionary-based stemming
A dictionary significantly improves the quality of stemming (note: the Porter algorithm does not derive a linguistically correct stem). It determines the correct linguistic stem for all words but at the price of additional lookup costs and maintenance costs for the dictionary.
The EuroWordNet initiative tries to develop a semantic dictionary for the European languages. Next to words, the dictionary shall contain flexed forms and relations between words (see next section). However, the usage of these dictionaries is not for free (with the exception of WordNet for English). Names remain a problem of their own...
Examples of such dictionaries / ontologies:
EuroWordNet: http://www.illc.uva.nl/EuroWordNet/
GermanNet: http://www.sfs.uni-tuebingen.de/lsd/
WordNet: http://wordnet.princeton.edu/
We look at dictionary-based stemming with the example of Morphy, the stemmer of WordNet. Morphy combines two approaches for stemming:
a rule-based approach for regular flexions, much like the Porter algorithm but much simpler
an exception list with strong or irregular flexions of terms
Stemming process
[Figure: flow of the stemming process. Unstemmed words are first checked against a stop-word list ("Is it a stopword?"). Prefix-suffix, suffix and infix rules are then applied according to morphological rules (e.g. ber..an, me+, +lah). The result is checked against a word dictionary ("Is it in dictionary?") to produce the stemmed words. Example stemming algorithms: Porter's algorithm, Fatimah's algorithm, WordNet dictionary.]
Step 5: Mapping to index terms
Term extraction must further deal with homonyms (equal terms but different semantics) and synonyms (different terms but equal semantics). But there are further relations between terms that may be useful to consider. In the following, a list of the most common relationships:
Homonyms (equal terms but different semantics): bank (shore vs. financial institute)
Synonyms (different terms but equal semantics): walk, go, pace, run, sprint
Hypernym (umbrella term) / Hyponym (species): animal -> dog, cat, bird, ...
Holonym (has parts) / Meronym (is part of): door -> lock
The relationships above define a network (often denoted as an ontology) with terms as nodes and relations as edges. An occurrence of a term may be interpreted as occurrences of near-by terms in this network as well (whereby "near-by" has to be defined appropriately).
Example: A document contains the term dog. We may also interpret this as an occurrence of the term animal (with a smaller weight).
Step 5: (cont.)
Some search engines do not implement steps 4 and 5. Google only recently improved its search capabilities with stemming.
If the collection contains documents in different languages, cross-lingual approaches are needed that (automatically) translate or relate terms to different languages and make them retrievable even for queries in languages different from the document's.
Term extraction for queries: similar to term extraction of documents. If term extraction of the query implements step 5:
Omit step 5 in term extraction of documents in the collection.
Extend the query terms with near-by terms:
Expansion with synonyms: Q = house -> Qnew = house, home, domicile, ...
If a specialized search returns not enough answers, exchange keywords with their hypernyms: e.g., Q = mare (female horse) -> Qnew = horse
If a general search term returns too many results, let the user choose (i.e. relevance feedback) a more specialized term to reduce the result list: e.g., Q = horse -> Qnew = mare, pony, chestnut, pacer
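The three expansion strategies above can be sketched against a tiny hand-written ontology fragment; in practice these tables would come from a resource such as WordNet rather than being hard-coded:

```python
# Hypothetical ontology fragments matching the slide's examples
SYNONYMS = {"house": ["home", "domicile"]}
HYPERNYMS = {"mare": "horse"}
HYPONYMS = {"horse": ["mare", "pony", "chestnut", "pacer"]}

def expand(term: str) -> list[str]:
    """Expansion with synonyms: Q = house -> house, home, domicile, ..."""
    return [term] + SYNONYMS.get(term, [])

def generalize(term: str) -> str:
    """Too few results: replace the term by its hypernym."""
    return HYPERNYMS.get(term, term)

def specialize(term: str) -> list[str]:
    """Too many results: offer hyponyms for the user to choose from."""
    return HYPONYMS.get(term, [term])

print(expand("house"))       # ['house', 'home', 'domicile']
print(generalize("mare"))    # horse
print(specialize("horse"))   # ['mare', 'pony', 'chestnut', 'pacer']
```

Choosing between generalizing and specializing is driven by result-set size, which is why the slide ties the latter to relevance feedback.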
What is WordNet?
A large lexical database, or electronic dictionary, developed and maintained at Princeton University
http://wordnet.princeton.edu
Includes most English nouns, verbs, adjectives, adverbs
Electronic format makes it amenable to automatic manipulation
Used in many Natural Language Processing applications (information retrieval, text mining, question answering, machine translation, AI/reasoning, ...)
Wordnets are built for many languages.
What's special about WordNet?
Traditional paper dictionaries are organized alphabetically: words that are found together (on the same page) are not related by meaning.
WordNet is organized by meaning: words in close proximity are semantically similar.
Human users and computers can browse WordNet and find words that are meaningfully related to their queries (somewhat like in a hyperdimensional thesaurus).
Meaning similarity can be measured and quantified to support Natural Language Understanding.
A simple picture
animal (animate, breathes, has heart, ...)
  |
bird (has feathers, flies, ...)
  |
canary (yellow, sings nicely, ...)
Hypo-/hypernymy relates noun synsets
Creates relationships among more/less general concepts
Creates hierarchies. Hierarchies can have up to 16 levels.
                 {vehicle}
                /         \
   {car, automobile}   {bicycle, bike}
      /        \               \
{convertible}  {SUV}     {mountain bike}
A car is a kind of vehicle. The class of vehicles includes cars, bikes.
Hyponymy
Transitivity:
A car is a kind of vehicle.
An SUV is a kind of car.
=> An SUV is a kind of vehicle.
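The transitivity of hyponymy amounts to following hypernym links up the hierarchy, which can be sketched with the slide's vehicle example encoded as child-to-parent edges:

```python
# Hypernym edges (child -> parent) from the slide's vehicle hierarchy
HYPERNYM = {"SUV": "car", "convertible": "car", "car": "vehicle",
            "mountain bike": "bicycle", "bicycle": "vehicle"}

def is_a(concept: str, ancestor: str) -> bool:
    """Hyponymy is transitive: walk hypernym links toward the root."""
    while concept in HYPERNYM:
        concept = HYPERNYM[concept]
        if concept == ancestor:
            return True
    return False

print(is_a("SUV", "vehicle"))    # True: SUV -> car -> vehicle
print(is_a("SUV", "bicycle"))    # False: the paths never meet
```

Because each node has a single parent here, a simple upward walk suffices; in full WordNet a synset can have several hypernyms, so a real check explores all parents.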
Meronymy/holonymy (part-whole relation)
{car, automobile}
       |
   {engine}
    /      \
{spark plug} {cylinder}
An engine has spark plugs. Spark plugs and cylinders are parts of an engine.
Meronymy/Holonymy
Inheritance:
A finger is part of a hand.
A hand is part of an arm.
An arm is part of a body.
=> A finger is part of a body.
Structure of WordNet (Nouns)
[Figure: a fragment of the WordNet noun hierarchy. Hypernym chain: {vehicle} <- {conveyance; transport} <- {motor vehicle; automotive vehicle} <- {car; auto; automobile; machine; motorcar} <- {cruiser; squad car; patrol car; police car; prowl car} and {cab; taxi; hack; taxicab}. Meronyms of the car synset: {bumper}, {car door}, {car window}, {car mirror}; {car door} in turn has the meronyms {hinge; flexible joint}, {doorlock}, {armrest}.]
Homework
Select the 5 most frequent noun terms; find homonyms, synonyms, hypernyms and holonyms of the terms. You may use WordNet at http://wordnet.princeton.edu/ (select "Use Wordnet Online").
Create the noun ontology.
IR models
Overview
Boolean Retrieval
Fuzzy Retrieval
Vector Space Retrieval
Probabilistic Retrieval (BIR Model)
Latent Semantic Indexing
Boolean search
Boolean model
Historically:
Documents were stored on tapes or punched cards.
Searching: only sequential access.
Today:
Boolean search is still very frequent but is not state-of-the-art. Google uses it for simplicity but further improved it by additionally sorting/ranking result sets.
Model:
Document D is represented by a binary vector d with di = 1 if term ti occurs in document D.
Query q comes from query space Q; let t be an arbitrary term, and q1 and q2 be queries from Q; Q is given by queries of the type:
t, q1 ∧ q2, q1 ∨ q2, ¬q1
Boolean model (cont.)
Term-document matrix
Query: Brutus AND Caesar AND NOT Calpurnia
Take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND:
110100 AND 110111 AND 101111 = 100100
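The bitwise AND above can be reproduced directly with integer bit operations; the incidence vectors are taken from the term-document matrix, and NOT is a complement masked to the six document positions:

```python
# Incidence vectors from the term-document matrix (6 documents)
BRUTUS    = 0b110100
CAESAR    = 0b110111
CALPURNIA = 0b010000   # its complement over 6 bits is 101111
ALL_DOCS  = 0b111111

# Brutus AND Caesar AND NOT Calpurnia
result = BRUTUS & CAESAR & (ALL_DOCS & ~CALPURNIA)
print(format(result, "06b"))   # 100100
```

The two set bits of the result identify exactly the documents that contain Brutus and Caesar but not Calpurnia.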
Boolean retrieval
Query: Brutus AND Caesar AND NOT Calpurnia
Fuzzy retrieval
Fuzzy retrieval (cont.)
Vector-space model
Since the Boolean model's binary weights are too limiting, the vector model supports partial matching.
Non-binary weights are assigned to index terms in queries and documents.
Term weights are used to compute the degree of similarity between documents in the database and the user's query.
[Figure: query vector q and document vector d in a three-dimensional term space with term1 = solat, term2 = ibadah, term3 = malam]
Vector-space model (cont.)
The tf metric is considered an indication of how well a term characterizes the content of a document. The idf, in turn, reflects the number of documents in the collection in which the term occurs, irrespective of the number of times it occurs in those documents.
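The degree of similarity between a document vector d and a query vector q is typically the cosine of the angle between them. The weights below are illustrative tf*idf values over the figure's three terms, not numbers from the slides:

```python
import math

def cosine(d: list[float], q: list[float]) -> float:
    """Cosine of the angle between document and query vectors."""
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

# Toy tf*idf weights over (solat, ibadah, malam)
d = [0.8, 0.3, 0.0]
q = [0.5, 0.5, 0.0]
print(round(cosine(d, q), 3))
```

A document that only partly contains the query terms still gets a nonzero score, which is exactly the partial matching the Boolean model lacks.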
Inverse document frequency
Document-Term-Matrix
Vector-space model (cont.)
Example
N = # of documents
M = # of terms
Example terms: a, arrived, gold, silver, truck
Class exercises
Using the 10 most frequent terms selected from your story, create the term-document matrix for the Boolean model and the vector model.
Remarks
There are many more methods to determine the vector representations and to compute retrieval status values.
Main assumption of vector space retrieval: terms occur independently of each other in documents. This is not true: if one writes about Mercedes, the term "car" is likely to co-occur in the document.
Advantages:
Simple model with efficient evaluation algorithms
Partial match queries possible, i.e., it returns documents that only partly contain the query terms (similar to the or-operator of Boolean retrieval)
Very good retrieval quality, but not state-of-the-art
Relevance feedback may further improve vector space retrieval
Disadvantages:
Many heuristics and simplifications; no proof for "correctness" of the result set
HTML/Web: occurrence of terms is not the most important criterion to rank documents (spamming)