Speech & NLP (Fall 2014): Information Retrieval
Speech & NLP
www.vkedco.blogspot.com
Information Retrieval
Texts as Feature Vectors, Vector Spaces,
Vocabulary Normalization through Stemming & Stoplisting,
Porter’s Algorithm for Suffix Stripping,
Term Weighting, Query Expansion, Precision & Recall
Vladimir Kulyukin
Outline
● Texts as Feature Vectors
● Vector Space Model
● Vocabulary Normalization through Stemming & Stoplisting
● Porter’s Algorithm for Suffix Stripping (aka Porter’s Stemmer)
● Term Weighting
● Query Expansion
● Precision & Recall
Texts as Feature Vectors
Text as Collection of Words
● Any text can be viewed as a collection of words (collections,
unlike sets, allow for duplicates)
● Various techniques can be designed to compute different
properties of texts: most frequent word, least frequent word,
frequency of a word in a text, word n-grams, word co-occurrence
probabilities, part of speech, etc.
● Each such technique is a feature extractor: it extracts from text
specific features (e.g., a single word) and assigns to them
specific weights (e.g., the frequency of that word in the text) or
symbols (part of speech)
● Feature extraction turns a text from a collection of words into a
feature vector
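As a minimal sketch (function and variable names are illustrative, not from the slides), frequency-based feature extraction over a fixed vocabulary might look like:

```python
from collections import Counter

def term_frequency_vector(text, vocabulary):
    """Turn a whitespace-delimited text into a feature vector of
    raw term frequencies over a fixed vocabulary."""
    counts = Counter(text.split())
    return [counts[w] for w in vocabulary]

vocab = ["w1", "w2", "w3"]
print(term_frequency_vector("w1 w1 w2", vocab))  # [2, 1, 0]
```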
Information Retrieval
● Information Retrieval (IR) is an area of NLP that
deals with storage and retrieval of digital media
● The primary focus of IR has been digital texts
● Other media, such as images, videos, and audio files,
have received more attention recently
Basic IR Terminology
● Document is an indexable and retrievable unit of
digital text
● Collection is a set of documents that can be
searched by users
● Term is a wordform that occurs in a collection
● Query is a set of terms
Vector Space Model
Background
● The Vector Space Model of IR was invented by G. Salton
in the early 1970s
● Document collection is a vector space
● Terms found in texts are dimensions of that vector
space
● Documents are vectors in the vector space
● Term weights are coordinates along specific
dimensions
Example: A 3D Feature Vector Space
● Suppose that all texts in our universe consist of three words w1,
w2, and w3
● Suppose that there are three texts T1, T2, and T3 such that
– T1 = “w1 w1 w2”
– T2 = “w3 w2”
– T3 = “w3 w3 w1”
● Suppose that our feature extraction procedure takes each word
in a text and maps it to its frequency in that text
● Since there are three words, each feature vector has 3
dimensions; hence, we have a 3D vector space
Vector Space as Feature Vector Table

     w1  w2  w3
T1    2   1   0
T2    0   1   1
T3    1   0   2

Each Ti is a text document; each row of the table is the feature vector of Ti.
3D Vector Space
[Figure: axes w1, w2, w3 with document vectors T1 = (2, 1, 0), T2 = (0, 1, 1), T3 = (1, 0, 2)]
Another Example: A 3D Feature Vector Space
● Suppose that all texts in our universe consist of three words w1,
w2, and w3
● Suppose that there are three texts T1, T2, and T3 such that
– T1 = “w1 w1 w2”
– T2 = “w3 w2”
– T3 = “w3 w3 w1”
● Suppose that our feature extraction procedure takes each word
in a text and simply records its presence (1) or absence (0) in the
document
Vector Space as Binary Feature Vector Table

     w1  w2  w3
T1    1   1   0
T2    0   1   1
T3    1   0   1
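A hedged sketch of the binary feature extractor behind this table (names are illustrative):

```python
def binary_vector(text, vocabulary):
    """Record presence (1) or absence (0) of each vocabulary word."""
    present = set(text.split())
    return [1 if w in present else 0 for w in vocabulary]

vocab = ["w1", "w2", "w3"]
print(binary_vector("w3 w3 w1", vocab))  # [1, 0, 1]
```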
Matching Queries Against Vector Tables
● Let twf be a term weighting function that assigns a numerical
weight to a specific term in a specific document
● For example, if the query q = “w1 w3”, i.e., the user enters “w1 w3”, then
q = (twf(q, w1), twf(q, w2), twf(q, w3))
● If the feature vector table is binary, then q = (1, 0, 1)
● One similarity coefficient that can be used to rank binary documents is the
dot product:
sim(q, Ti) = Σ_{k=1..n} twf(q, wk) · twf(Ti, wk), where n is the dimension
of the vector space (e.g., n = 3)
Matching Queries Against Vector Tables
● Suppose the query q = “w1 w3” and the feature vector table is binary; then
q = (1, 0, 1)
● Below are the binary (dot product) similarity coefficients for each
document in our 3D document collection (n = 3):
sim(q, T1) = Σ_{k=1..3} twf(q, wk) · twf(T1, wk) = 1·1 + 0·1 + 1·0 = 1
sim(q, T2) = Σ_{k=1..3} twf(q, wk) · twf(T2, wk) = 1·0 + 0·1 + 1·1 = 1
sim(q, T3) = Σ_{k=1..3} twf(q, wk) · twf(T3, wk) = 1·1 + 0·0 + 1·1 = 2
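The dot-product ranking above can be sketched in Python (the vectors are hard-coded from the binary table):

```python
def dot_similarity(q, t):
    """Dot-product similarity between two equal-length vectors."""
    return sum(qk * tk for qk, tk in zip(q, t))

q = [1, 0, 1]  # binary vector for the query "w1 w3"
docs = {"T1": [1, 1, 0], "T2": [0, 1, 1], "T3": [1, 0, 1]}
for name, vec in docs.items():
    print(name, dot_similarity(q, vec))  # T1 1, T2 1, T3 2
```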
Matching Queries Against Vector Tables
● Another common metric is the cosine, which equals 1 for identical vectors
and 0 for orthogonal vectors:
sim(q, Ti) = (Σ_{k=1..n} twf(q, wk) · twf(Ti, wk)) /
(sqrt(Σ_{k=1..n} twf(q, wk)²) · sqrt(Σ_{k=1..n} twf(Ti, wk)²))
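A sketch of the cosine metric (returning 0.0 for a zero-length vector is an assumption the slides do not address):

```python
import math

def cosine_similarity(q, t):
    """Cosine of the angle between two vectors: 1 for identical
    directions, 0 for orthogonal vectors."""
    dot = sum(qk * tk for qk, tk in zip(q, t))
    norm_q = math.sqrt(sum(x * x for x in q))
    norm_t = math.sqrt(sum(x * x for x in t))
    return dot / (norm_q * norm_t) if norm_q and norm_t else 0.0

print(cosine_similarity([1, 0, 1], [2, 0, 2]))  # ≈ 1.0 (same direction)
print(cosine_similarity([1, 0], [0, 1]))        # 0.0 (orthogonal)
```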
Two Principal Tasks for Vector Space Model
● If the vector space model is to be used, we have to
– determine how to compute terms (vocabulary
normalization)
– determine how to assign weights to terms in
individual documents (term weighting)
Vocabulary Normalization
through
Stemming & Stoplisting
Vocabulary Normalization
● Texts contain many words that are morphologically related:
CONNECT, CONNECTED, CONNECTING, CONNECTION,
CONNECTIONS
● There are also many words in most texts that do not distinguish
them from other texts: TO, UP, FROM, UNTIL, THE, A, BY, etc.
● Stemming is the operation of conflating different wordforms into
a single wordform, called stem; CONNECTED, CONNECTING,
CONNECTION, CONNECTIONS are all conflated to CONNECT
● Stoplisting is the operation of removing wordforms that do not
distinguish texts from each other
● Stemming & stoplisting are vocabulary normalization procedures
Vocabulary Normalization
● Stemming & stoplisting are the two most common vocabulary
normalization procedures
● Both procedures are aimed at standardizing the indexing
vocabulary
● Both procedures reduce the size of the indexing
vocabulary, which saves substantial time and space
● After vocabulary normalization is done, the remaining
words are called terms
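A minimal normalization sketch, assuming a caller-supplied stemming function (the stoplist holds only the example words above; names are illustrative):

```python
STOPLIST = {"to", "up", "from", "until", "the", "a", "by"}

def normalize(words, stem=lambda w: w):
    """Stoplist first, then conflate surviving wordforms with a
    caller-supplied stemmer (identity by default)."""
    return [stem(w) for w in words if w.lower() not in STOPLIST]

print(normalize(["The", "connections", "failed"]))  # ['connections', 'failed']
```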
Porter’s Algorithm
for
Suffix Stripping
Martin Porter’s original paper is at http://tartarus.org/martin/PorterStemmer/def.txt
Source code in various languages is at http://tartarus.org/martin/PorterStemmer/
Suffix Stripping Approaches
● Use a stem list
● Use a suffix list
● Use a set of rules that match wordforms & remove
suffixes under specified conditions
Pros & Cons of Suffix Stripping
● Suffix stripping is done not for linguistic reasons
but to improve retrieval performance & storage
efficiency
● It is reasonable when wordform conflation does
not lose information (e.g., CONNECTOR &
CONNECTION)
● It does not seem reasonable when conflation is
lossy (e.g., RELATIVE & RELATIVITY are
conflated)
Pros & Cons of Suffix Stripping
● Suffix stripping is never 100% correct
● The same rule set conflates SAND and SANDER,
which is OK, but it also conflates WAND and
WANDER, which may not be OK
● With any set of rules there comes a point when
adding more rules actually worsens performance
● Exceptions are important but may not be worth
the trouble
Consonants & Vowels
● A consonant is a letter different from A, E, I,
O, U and different from Y when it is
preceded by a consonant
● Y is a consonant when it is preceded by A, E,
I, O, U: in TOY, Y is a consonant; in BY, it is
a vowel
● A vowel is not a consonant
Consonants & Vowels
● A consonant is denoted as c and a vowel as v
● A sequence of at least one consonant (e.g.,
c, cc, ccc, cccc, etc) is denoted as C
● A sequence of at least one vowel (e.g., v, vv,
vvv, etc.) is denoted as V
Porter’s Insight: Wordform Representation
● Any wordform can be represented as one of the four forms:
– CVCV … C
– CVCV … V
– VCVC … C
– VCVC … V
● These forms are condensed into one form: [C]VCVC … [V]
(square brackets denote sequences of zero or more consonants or
vowels)
● This form can be rewritten as [C](VC)^m[V], m >= 0
Porter’s Insight: Wordform Representation
● In the formula [C](VC)^m[V], m >= 0, m is called the measure of a word
● Examples:
● m=0: TR, EE, TREE, Y, BY
● m=1: TROUBLE, OATS, TREES, IVY
– TROUBLE: [C] TR; (VC) OUBL; [V] E
– OATS: [C] NULL; (VC) OATS; [V] NULL
– TREES: [C] TR; (VC) EES; [V] NULL
● m=2: TROUBLES, PRIVATE
– TROUBLES: [C] TR; (VC)^2 (OUBL)(ES); [V] NULL
– PRIVATE: [C] PR; (VC)^2 (IV)(AT); [V] E
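The measure m can be computed by mapping a word to its c/v pattern and counting the VC runs; a sketch consistent with the rules above (a word-initial Y is treated as a consonant):

```python
import re

VOWELS = "aeiou"

def to_cv(word):
    """Map a word to its consonant/vowel (c/v) pattern."""
    pattern = []
    for ch in word.lower():
        if ch in VOWELS:
            pattern.append("v")
        elif ch == "y" and pattern and pattern[-1] == "c":
            pattern.append("v")   # Y after a consonant is a vowel (BY)
        else:
            pattern.append("c")   # includes Y after a vowel (TOY)
    return "".join(pattern)

def measure(word):
    """m in [C](VC)^m[V]: the number of VC runs."""
    collapsed = re.sub(r"c+", "C", re.sub(r"v+", "V", to_cv(word)))
    return collapsed.count("VC")

for w in ["tree", "trouble", "oats", "private"]:
    print(w, measure(w))  # tree 0, trouble 1, oats 1, private 2
```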
Morphological Rules
● Suffix removal rules have the form
– (condition) S1 → S2
● If a wordform ends with suffix S1 and the stem
before S1 satisfies the (optional) condition, then
S1 is replaced with S2
● Example:
– (m > 1) EMENT → (null)
– S1 is EMENT; S2 is NULL
– This rule maps REPLACEMENT to REPLAC
Morphological Rules: Condition Specification
● Conditions can be specified as follows:
– (m > n), where n is a number
– *X – stem ends with the letter X
– *v* – stem contains a vowel
– *d – stem ends with a double consonant (e.g., -TT)
– *o – stem ends in cvc where the second c is not W, X, or Y
(e.g., -WIL, -HOP)
● Logical AND, OR, and NOT operators are also allowed:
– ((m > 1) AND (*S OR *T))
Length-Based Rule Matching
● If several rules match, the one with the
longest S1 wins
● Consider this rule set with null conditions:
– SSES → SS
– IES → I
– SS → SS
– S → (null)
● Given this rule set, CARESSES → CARESS because SSES is
the longest match, and CARES → CARE
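Longest-match rule selection can be sketched as follows (the optional conditions are omitted for brevity; names are illustrative):

```python
def apply_rules(word, rules):
    """Apply the matching rule whose suffix S1 is longest.
    rules: list of (s1, s2) replacement pairs."""
    best = None
    for s1, s2 in rules:
        if word.endswith(s1) and (best is None or len(s1) > len(best[0])):
            best = (s1, s2)
    if best is None:
        return word
    s1, s2 = best
    return word[: len(word) - len(s1)] + s2

step_1a = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
print(apply_rules("caresses", step_1a))  # caress
print(apply_rules("cares", step_1a))     # care
```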
Five Steps, Eight Rule Sets
● In the original paper by M.F. Porter, the five steps of the
algorithm comprise eight sets of rules: 1A, 1B, 1C, 2, 3, 4, 5A, 5B
● A wordform passes through each rule set one by one,
starting from 1A and ending at 5B, in that order
● If no rule in a rule set is applicable, the wordform comes
out unmodified
[Figure: W → 1A → 1B → 1C → 2 → 3 → 4 → 5A → 5B → W’]
Example
GENERALIZATIONS → (1A: S → null) → GENERALIZATION
GENERALIZATION → (2: (m>0) IZATION → IZE) → GENERALIZE
GENERALIZE → (3: (m>0) ALIZE → AL) → GENERAL
GENERAL → (4: (m>1) AL → null) → GENER
Term Weighting
Term Weighting in Documents
● Term weighting has a large influence on the
performance of IR systems
● In general, there are two design factors that bear on
term weighting:
– How important is a term within a given document?
– How important is a term within a given collection?
● A common measure of term importance within a single
document is its frequency in that document (this is
commonly referred to as term frequency – tf)
Term Weighting in Collections
● Terms that occur in every document or many
documents in a given collection are not useful as
document discriminators
● Terms that occur in relatively few documents in a
given collection are useful as document
discriminators
● Generally, collection-wide term weighting
approaches value terms that occur in relatively
few documents
Inverse Document Frequency
● Suppose that we have some document collection C
● Let N be the total number of documents in C
● Let 𝑛𝑖 be the number of documents in C that contain at least one
occurrence of the i-th term 𝑡𝑖
● Then the inverse document frequency of 𝑡𝑖 is:
idf(ti, C) = log(N / ni)
Example: IDF
C = {T1, T2, T3, T4}, where
T1 = “W1 W1 W2 W3”
T2 = “W3 W3 W3 W3”
T3 = “W2 W2 W2 W1 W3”
T4 = “W3 W3 W3 W1 W1”
idf(W1, C) = log(4/3), because N = 4 and n1 = 3
idf(W2, C) = log(4/2), because N = 4 and n2 = 2
idf(W3, C) = log(4/4) = 0, because N = 4 and n3 = 4
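A sketch of idf over this toy collection, representing each document as a list of terms (names are illustrative):

```python
import math

def idf(term, collection):
    """Inverse document frequency: log(N / n_i), where n_i is the
    number of documents containing the term."""
    n_i = sum(1 for doc in collection if term in doc)
    return math.log(len(collection) / n_i)

C = [["W1", "W1", "W2", "W3"],        # T1
     ["W3", "W3", "W3", "W3"],        # T2
     ["W2", "W2", "W2", "W1", "W3"],  # T3
     ["W3", "W3", "W3", "W1", "W1"]]  # T4
print(idf("W3", C))  # log(4/4) = 0.0
```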
TF*IDF: Combining Local and Global Weights
● Suppose that we have some document collection C
● Let N be the total number of documents in C
● Let 𝑛𝑖 be the number of documents in C that contain at least one
occurrence of the i-th term 𝑡𝑖
● Let 𝑡𝑓 𝑡𝑖 , 𝑇𝑗 , 𝐶 be the frequency of the term 𝑡𝑖 in the document 𝑇𝑗 of
collection C
● Let 𝑖𝑑𝑓 𝑡𝑖 , 𝐶 be the inverse document frequency of the term 𝑡𝑖 in
collection C
● Then the tfidf measure of 𝑡𝑖 in 𝑇𝑗 of C is:
tfidf(ti, Tj, C) = tf(ti, Tj, C) · idf(ti, C)
Example: TF*IDF
C = {T1, T2, T3, T4}, where
T1 = “W1 W1 W2 W3”
T2 = “W3 W3 W3 W3”
T3 = “W2 W2 W2 W1 W3”
T4 = “W3 W3 W3 W1 W1”
tfidf(W1, T1, C) = tf(W1, T1, C) · idf(W1, C) = 2 · log(4/3)
tfidf(W2, T1, C) = tf(W2, T1, C) · idf(W2, C) = 1 · log(4/2)
tfidf(W3, T1, C) = tf(W3, T1, C) · idf(W3, C) = 1 · log(4/4) = 0
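Combining the local and global weights, a self-contained tf*idf sketch over the same toy collection (names are illustrative):

```python
import math

def tf(term, doc):
    """Raw term frequency within one document (a list of terms)."""
    return doc.count(term)

def idf(term, collection):
    """Inverse document frequency: log(N / n_i)."""
    n_i = sum(1 for d in collection if term in d)
    return math.log(len(collection) / n_i)

def tfidf(term, doc, collection):
    """Local weight times global weight."""
    return tf(term, doc) * idf(term, collection)

C = [["W1", "W1", "W2", "W3"], ["W3", "W3", "W3", "W3"],
     ["W2", "W2", "W2", "W1", "W3"], ["W3", "W3", "W3", "W1", "W1"]]
print(tfidf("W1", C[0], C))  # 2 * log(4/3)
```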
User Query Expansion
Improving User Queries
● Typically, we cannot change the content of the indexed documents: once a
collection is indexed, we can add documents to it and remove documents from
it, but we cannot change the weights of the terms within the vector space
model
● What we can do is improve the user query
● But how? We can dynamically change the weights of the terms in
the user query to move it closer to the more relevant documents
● The standard method of doing it in the vector space model is
called relevance feedback
How Relevance Feedback Works
● The user types in a query
● The system retrieves a set of documents
● The user specifies whether each document is relevant or not to the query:
this can be done on every document in the retrieved set or a small subset of
documents
● The system dynamically increases the weights of the terms in the relevant
documents and decreases the weights of the terms in the non-relevant
documents
● Over several iterations, the user query vector ends up being pushed closer
to the relevant documents and further from the non-relevant documents
Rocchio Relevance Feedback Formula
● Suppose that we have some document collection C
● Let 𝑞𝑖 be the user query vector at the i-th iteration (i.e., 𝑞0 is the original user query vector)
● Let us assume that R = {r1, …, r|R|} is the set of relevant document
vectors from C and NR = {nr1, …, nr|NR|} is the set of non-relevant
document vectors from C
● The query vector on the next iteration is:
q_{i+1} = q_i + (β/|R|) Σ_{j=1..|R|} r_j − (γ/|NR|) Σ_{k=1..|NR|} nr_k,
where β + γ = 1
Rocchio Relevance Feedback Formula
● The query vector on the next iteration is:
q_{i+1} = q_i + (β/|R|) Σ_{j=1..|R|} r_j − (γ/|NR|) Σ_{k=1..|NR|} nr_k,
where β + γ = 1
● Expanding the sums:
q_{i+1} = q_i + (β/|R|) r_1 + ⋯ + (β/|R|) r_|R| − (γ/|NR|) nr_1 − ⋯ − (γ/|NR|) nr_|NR|
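One Rocchio iteration can be sketched as follows (β = 0.75 and γ = 0.25 are illustrative values satisfying β + γ = 1):

```python
def rocchio_update(q, relevant, nonrelevant, beta=0.75, gamma=0.25):
    """One Rocchio iteration: move q toward the centroid of the
    relevant vectors and away from the centroid of the non-relevant
    ones (beta + gamma = 1)."""
    dims = range(len(q))
    r_cent = [sum(r[k] for r in relevant) / len(relevant) for k in dims]
    nr_cent = [sum(nr[k] for nr in nonrelevant) / len(nonrelevant) for k in dims]
    return [q[k] + beta * r_cent[k] - gamma * nr_cent[k] for k in dims]

q0 = [1.0, 0.0, 1.0]
q1 = rocchio_update(q0, relevant=[[1, 0, 1]], nonrelevant=[[0, 1, 1]])
print(q1)  # [1.75, -0.25, 1.5]
```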
Thesaurus-Based Query Expansion
● Another commonly used strategy is to have a
thesaurus
● The thesaurus is used to expand the user query by
adding terms to it (e.g., synonyms or correlated
terms)
● Thesauri are typically collection dependent and
do not generalize across different collections
Performance Evaluation
● There are two commonly used measures of retrieval
performance in IR
● Recall = (number of relevant documents
retrieved)/(total number of relevant documents in
collection C)
● Precision = (number of relevant documents
retrieved)/(number of documents retrieved)
● Typically, recall and precision are inversely related: as
precision increases, recall drops and vice versa
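Both measures can be computed from sets of document identifiers; a sketch with made-up document IDs:

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; Recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(["T1", "T2", "T3"], ["T1", "T3", "T4", "T5"])
print(p, r)  # precision = 2/3, recall = 1/2
```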
References
1. M.F. Porter. “An Algorithm for Suffix Stripping.” Program, 14(3),
pp. 130–137, July 1980.
2. D. Jurafsky & J. Martin. Speech and Language Processing, Ch. 17.