SI650: Information Retrieval
(C) 2005, The University of Michigan 1
Information Retrieval
Dragomir R. Radev
University of Michigan
September 19, 2005
(C) 2005, The University of Michigan 2
About the instructor
• Dragomir R. Radev
• Associate Professor, University of Michigan
  – School of Information
  – Department of Electrical Engineering and Computer Science
  – Department of Linguistics
• Head of CLAIR (Computational Linguistics And Information Retrieval) at U. Michigan
• Treasurer, North American Chapter of the ACL
• Ph.D., 1998, Computer Science, Columbia University
• Email: [email protected]
• Home page: http://tangra.si.umich.edu/~radev
(C) 2005, The University of Michigan 3
Introduction
(C) 2005, The University of Michigan 4
IR systems
• Vivísimo
• AskJeeves
• NSIR
• Lemur
• MG
• Nutch
(C) 2005, The University of Michigan 5
Examples of IR systems
• Conventional (library catalog). Search by keyword, title, author, etc.
• Text-based (Lexis-Nexis, Google, FAST). Search by keywords. Limited search using queries in natural language.
• Multimedia (QBIC, WebSeek, SaFe). Search by visual appearance (shapes, colors, ...).
• Question answering systems (AskJeeves, NSIR, Answerbus). Search in (restricted) natural language.
(C) 2005, The University of Michigan 6
(C) 2005, The University of Michigan 7
(C) 2005, The University of Michigan 8
Need for IR
• Advent of the WWW - more than 8 billion documents indexed on Google
• How much information? 200 TB according to Lyman and Varian 2003.
  http://www.sims.berkeley.edu/research/projects/how-much-info/
• Search, routing, filtering
• User’s information need
(C) 2005, The University of Michigan 9
Some definitions of Information Retrieval (IR)
Salton (1989): “Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests.”
Kowalski (1997): “An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects.”
(C) 2005, The University of Michigan 10
Sample queries (from Excite)
In what year did baseball become an offical sport?
play station codes . com
birth control and depression
government
"WorkAbility I"+conference
kitchen appliances
where can I find a chines rosewood
tiger electronics
58 Plymouth Fury
How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
emeril Lagasse
Hubble
M.S Subalaksmi
running
(C) 2005, The University of Michigan 11
Mappings and abstractions
Reality → Data
Information need → Query
From Korfhage’s book
(C) 2005, The University of Michigan 12
Typical IR system
• (Crawling)
• Indexing
• Retrieval
• User interface
(C) 2005, The University of Michigan 13
Key Terms Used in IR
• QUERY: a representation of what the user is looking for - can be a list of words or a phrase.
• DOCUMENT: an information entity that the user wants to retrieve
• COLLECTION: a set of documents
• INDEX: a representation of information that makes querying easier
• TERM: word or concept that appears in a document or a query
(C) 2005, The University of Michigan 14
Documents
(C) 2005, The University of Michigan 15
Documents
• Not just printed paper
• collections vs. documents
• data structures: representations
• Bag of words method
• document surrogates: keywords, summaries
• encoding: ASCII, Unicode, etc.
(C) 2005, The University of Michigan 16
Document preprocessing
• Formatting
• Tokenization (Paul’s, Willow Dr., Dr. Willow, 555-1212, New York, ad hoc)
• Casing (cat vs. CAT)
• Stemming (computer, computation)
• Soundex
(C) 2005, The University of Michigan 17
Document representations
• Term-document matrix (m x n)
• term-term matrix (m x m x n)
• document-document matrix (n x n)
• Example: 3,000,000 documents (n) with 50,000 terms (m)
• sparse matrices
• Boolean vs. integer matrices
(C) 2005, The University of Michigan 18
Document representations
• Term-document matrix
  – Evaluating queries (e.g., (A ∧ B) ∨ C)
  – Storage issues
• Inverted files
  – Storage issues
  – Evaluating queries
  – Advantages and disadvantages
(C) 2005, The University of Michigan 19
IR models
(C) 2005, The University of Michigan 20
Major IR models
• Boolean
• Vector
• Probabilistic
• Language modeling
• Fuzzy retrieval
• Latent semantic indexing
(C) 2005, The University of Michigan 21
Major IR tasks
• Ad-hoc
• Filtering and routing
• Question answering
• Spoken document retrieval
• Multimedia retrieval
(C) 2005, The University of Michigan 22
Venn diagrams
[Figure: Venn diagram of documents D1 and D2 with regions labeled x, w, y, z]
(C) 2005, The University of Michigan 23
Boolean model
[Figure: Venn diagram of sets A and B]
(C) 2005, The University of Michigan 24
Boolean queries
• Example: restaurants AND (Mideastern OR vegetarian) AND inexpensive
• What types of documents are returned?
• Stemming
• thesaurus expansion
• inclusive vs. exclusive OR
• confusing uses of AND and OR
dinner AND sports AND symphony
4 OF (Pentium, printer, cache, PC, monitor, computer, personal)
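A minimal sketch of how such queries can be evaluated with sets from an inverted index; the index contents are made up, and the quorum helper covers the "k OF (...)" form above.

# Boolean retrieval over a toy inverted index (term -> set of document IDs).
from collections import Counter

index = {
    "restaurants": {1, 2, 3, 5},
    "mideastern": {2, 5},
    "vegetarian": {3, 5, 7},
    "inexpensive": {1, 3, 5},
}

def docs(term):
    return index.get(term.lower(), set())

# restaurants AND (Mideastern OR vegetarian) AND inexpensive
hits = docs("restaurants") & (docs("Mideastern") | docs("vegetarian")) & docs("inexpensive")
print(sorted(hits))  # [3, 5]

def quorum(k, terms):
    # Documents containing at least k of the given terms, e.g., "4 OF (...)".
    counts = Counter(d for t in terms for d in docs(t))
    return {d for d, c in counts.items() if c >= k}

print(quorum(2, ["restaurants", "mideastern", "vegetarian", "inexpensive"]))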
(C) 2005, The University of Michigan 25
Boolean queries
• Weighting (Beethoven AND sonatas)
• precedence
coffee AND croissant OR muffin
raincoat AND umbrella OR sunglasses
• Use of negation: potential problems
• Conjunctive and Disjunctive normal forms
• Full CNF and DNF
(C) 2005, The University of Michigan 27
Boolean model
• Partition
• Partial relevance?
• Operators: AND, NOT, OR, parentheses
(C) 2005, The University of Michigan 28
Exercise
• D1 = “computer information retrieval”
• D2 = “computer retrieval”
• D3 = “information”
• D4 = “computer information”
• Q1 = “information retrieval”
• Q2 = “information ¬computer”
(C) 2005, The University of Michigan 29
Exercise
0 (none)
1 Swift
2 Shakespeare
3 Shakespeare Swift
4 Milton
5 Milton Swift
6 Milton Shakespeare
7 Milton Shakespeare Swift
8 Chaucer
9 Chaucer Swift
10 Chaucer Shakespeare
11 Chaucer Shakespeare Swift
12 Chaucer Milton
13 Chaucer Milton Swift
14 Chaucer Milton Shakespeare
15 Chaucer Milton Shakespeare Swift
((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
(C) 2005, The University of Michigan 30
Stop lists
• The 250-300 most common words in English account for 50% or more of a given text.
• Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” - another 10%. Next 12 words - another 10%.
• Moby Dick Ch.1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%).
• Token/type ratio: 2256/859 = 2.63
(C) 2005, The University of Michigan 31
Vector models
[Figure: documents Doc 1, Doc 2, Doc 3 as vectors in a space whose axes are Term 1, Term 2, Term 3]
(C) 2005, The University of Michigan 32
Vector queries
• Each document is represented as a vector
• non-efficient representations (bit vectors)
• dimensional compatibility
W1 W2 W3 W4 W5 W6 W7 W8 W9 W10
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
(C) 2005, The University of Michigan 33
The matching process
• Document space
• Matching is done between a document and a query (or between two documents)
• distance vs. similarity
• Euclidean distance, Manhattan distance, Word overlap, Jaccard coefficient, etc.
(C) 2005, The University of Michigan 34
Miscellaneous similarity measures
• The Cosine measure:

  sim(D,Q) = Σi (di × qi) / sqrt( Σi (di)² × Σi (qi)² ) = |X ∩ Y| / ( sqrt(|X|) × sqrt(|Y|) )

• The Jaccard coefficient:

  sim(D,Q) = |X ∩ Y| / |X ∪ Y|
(C) 2005, The University of Michigan 35
Exercise
• Compute the cosine measures (D1,D2) and (D1,D3) for the documents: D1 = <1,3>, D2 = <100,300> and D3 = <3,1>
• Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients.
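A short sketch of these measures for the exercise vectors (the Jaccard version used here is the generalized, weighted one; for sets of terms, use |X ∩ Y| / |X ∪ Y| as above).

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard(a, b):
    # generalized (weighted) Jaccard: sum of mins over sum of maxes
    return sum(min(x, y) for x, y in zip(a, b)) / sum(max(x, y) for x, y in zip(a, b))

D1, D2, D3 = (1, 3), (100, 300), (3, 1)
print(cosine(D1, D2), cosine(D1, D3))       # 1.0 (same direction) and 0.6
print(euclidean(D1, D2), manhattan(D1, D2))
print(jaccard(D1, D3))                      # (1 + 1) / (3 + 3) = 0.333...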
(C) 2005, The University of Michigan 36
Evaluation
(C) 2005, The University of Michigan 37
Relevance
• Difficult to define: fuzzy, inconsistent
• Methods: exhaustive, sampling, pooling, search-based
(C) 2005, The University of Michigan 38
Contingency table
                 retrieved     not retrieved
relevant             w               x           n1 = w + x
not relevant         y               z
                 n2 = w + y                      N
(C) 2005, The University of Michigan 39
Precision and Recall
Recall = w / (w + x)
Precision = w / (w + y)
(C) 2005, The University of Michigan 40
Exercise
Go to Google (www.google.com) and search for documents on Tolkien’s “Lord of the Rings”. Try different ways of phrasing the query: e.g., Tolkien, “JRR Tolkien”, +”JRR Tolkien” +”Lord of the Rings”, etc. For each query, compute the precision (P) based on the first 10 documents returned by Google.
Note! Before starting the exercise, have a clear idea of what a relevant document for your query should look like. Try different information needs.
Later, try different queries.
(C) 2005, The University of Michigan 41
n   Doc. no  Relevant?  Recall  Precision
1   588      x          0.2     1.00
2   589      x          0.4     1.00
3   576                 0.4     0.67
4   590      x          0.6     0.75
5   986                 0.6     0.60
6   592      x          0.8     0.67
7   984                 0.8     0.57
8   988                 0.8     0.50
9   578                 0.8     0.44
10  985                 0.8     0.40
11  103                 0.8     0.36
12  591                 0.8     0.33
13  772      x          1.0     0.38
14  990                 1.0     0.36
[From Salton’s book]
(C) 2005, The University of Michigan 42
P/R graph
[Figure: precision-recall curve; x-axis: recall (0 to 1), y-axis: precision (0 to 1)]

• Interpolated average precision (e.g., 11-point)
• Interpolation - what is the precision at recall = 0.5?
(C) 2005, The University of Michigan 43
Issues
• Why not use accuracy A = (w + z)/N?
• Average precision
• Average P at given “document cutoff values”
• Report when P = R
• F measure: Fβ = (β² + 1)PR / (β²P + R)
• F1 measure: F1 = 2 / (1/R + 1/P): the harmonic mean of P and R
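A small helper computing these measures from the contingency counts w, x, y, z defined earlier; the counts in the example call are made up.

def measures(w, x, y, z, beta=1.0):
    # w: relevant & retrieved, x: relevant & missed,
    # y: non-relevant & retrieved, z: non-relevant & not retrieved
    precision = w / (w + y)
    recall = w / (w + x)
    f = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    accuracy = (w + z) / (w + x + y + z)
    return precision, recall, f, accuracy

# Made-up counts: accuracy looks excellent (0.95) while recall is only 0.2,
# which is why accuracy is a poor IR measure on skewed collections.
print(measures(10, 40, 10, 940))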
(C) 2005, The University of Michigan 47
Relevance collections
• TREC ad hoc collections, 2-6 GB
• TREC Web collections, 2-100GB
(C) 2005, The University of Michigan 48
Sample TREC query

<top>
<num> Number: 305
<title> Most Dangerous Vehicles
<desc> Description:
Which are the most crashworthy, and least crashworthy, passenger vehicles?
<narr> Narrative:
A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example.
</top>
LA031689-0177, FT922-1008, LA090190-0126, LA101190-0218, LA082690-0158, LA112590-0109, FT944-136, LA020590-0119, FT944-5300, LA052190-0048, LA051689-0139, FT944-9371, LA032390-0172, LA042790-0172, LA021790-0136, LA092289-0167, LA111189-0013, LA120189-0179, LA020490-0021, LA122989-0063, LA091389-0119, LA072189-0048, FT944-15615, LA091589-0101, LA021289-0208
(C) 2005, The University of Michigan 49
<DOCNO> LA031689-0177 </DOCNO>
<DOCID> 31701 </DOCID>
<DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE>
<SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION>
<LENGTH><P>586 words </P></LENGTH>
<HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE>
<BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE>
<TEXT>
<P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P>
<P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws. </P>
<P>Several Fatalities </P>
<P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. </P>
<P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai. </P>
...
</TEXT>
<GRAPHIC><P> Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said. </P></GRAPHIC>
<SUBJECT><P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P></SUBJECT>
</DOC>
(C) 2005, The University of Michigan 50
TREC (cont’d)
• http://trec.nist.gov/tracks.html
• http://trec.nist.gov/presentations/presentations.html
(C) 2005, The University of Michigan 51
Word distribution models
(C) 2005, The University of Michigan 52
Shakespeare
• Romeo and Juliet:
And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; You, 289; Thou, 277; Me, 262; Not, 257; With, 234; It, 224; For, 223; This, 215; Be, 207; But, 181; Thy, 167; What, 163; O, 160; As, 156; Her, 150; Will, 147; So, 145; Thee, 139; Love, 135; His, 128; Have, 127; He, 120; Romeo, 115; By, 114; She, 114; Shall, 107; Your, 103; No, 102; Come, 96; Him, 96; All, 92; Do, 89; From, 86; Then, 83; Good, 82; Now, 82; Here, 80; If, 80; An, 78; Go, 76; On, 76; I'll, 71; Death, 69; Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There, 64; Hath, 63; Which, 60;
• …
• A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; Abound'st, 1; Abroach, 1; Absolved, 1; Abuse, 1; Abused, 1; Abuses, 1; Accents, 1; Access, 1; Accident, 1; Accidents, 1; According, 1; Accursed, 1; Accustom'd, 1; Ache, 1; Aches, 1; Aching, 1; Acknowledge, 1; Acquaint, 1; Acquaintance, 1; Acted, 1; Acting, 1; Action, 1; Acts, 1; Adam, 1; Add, 1; Added, 1; Adding, 1; Addle, 1; Adjacent, 1; Admired, 1; Ado, 1; Advance, 1; Adversary, 1; Adversity's, 1; Advise, 1; Afeard, 1; Affecting, 1; Afflicted, 1; Affliction, 1; Affords, 1; Affray, 1; Affright, 1; Afire, 1; Agate-stone, 1; Agile, 1; Agree, 1; Agrees, 1; Aim'd, 1; Alderman, 1; All-cheering, 1; All-seeing, 1; Alla, 1; Alliance, 1; Alligator, 1; Allow, 1; Ally, 1; Although, 1;
http://www.mta75.org/curriculum/english/Shakes/indexx.html
(C) 2005, The University of Michigan 53
The BNC (Adam Kilgarriff)
1   6187267  the   det
2   4239632  be    v
3   3093444  of    prep
4   2687863  and   conj
5   2186369  a     det
6   1924315  in    prep
7   1620850  to    infinitive-marker
8   1375636  have  v
9   1090186  it    pron
10  1039323  to    prep
11  887877   for   prep
12  884599   i     pron
13  760399   that  conj
14  695498   you   pron
15  681255   he    pron
16  680739   on    prep
17  675027   with  prep
18  559596   do    v
19  534162   at    prep
20  517171   by    prep
Kilgarriff, A. Putting Frequencies in the Dictionary. International Journal of Lexicography 10(2), 1997, pp. 135-155.
(C) 2005, The University of Michigan 54
Zipf’s law
Rank × Frequency ≈ Constant

Rank  Term  Freq.    Z        Rank  Term  Freq.   Z
1     the   69,971   0.070    6     in    21,341  0.128
2     of    36,411   0.073    7     that  10,595  0.074
3     and   28,852   0.086    8     is    10,099  0.081
4     to    26,149   0.104    9     was   9,816   0.088
5     a     23,237   0.116    10    he    9,543   0.095
(C) 2005, The University of Michigan 56
Zipf's law is fairly general!
• Frequency of accesses to web pages
  – access counts on Wikipedia pages, with s approximately equal to 0.3
  – page access counts on the Polish Wikipedia (data for late July 2003) approximately obey Zipf's law with s about 0.5
• Words in the English language
  – for instance, in Shakespeare’s play Hamlet, with s approximately 0.5
• Sizes of settlements
• Income distributions among individuals
• Sizes of earthquakes
• Notes in musical performances
http://en.wikipedia.org/wiki/Zipf's_law
(C) 2005, The University of Michigan 57
Zipf’s law (cont’d)
• Limitations:
  – Low and high frequencies
  – Lack of convergence
• Power law with coefficient c = -1:
  y = k·x^c
• Li (1992) – typing words one letter at a time, including spaces
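A quick sketch for checking rank × frequency on any plain-text file; the filename is a placeholder.

from collections import Counter
import re

text = open("sample.txt").read().lower()   # placeholder filename
counts = Counter(re.findall(r"[a-z']+", text))
ranked = sorted(counts.items(), key=lambda kv: -kv[1])
for rank, (word, freq) in enumerate(ranked[:10], start=1):
    print(rank, word, freq, rank * freq)   # rank * freq should stay roughly constant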
(C) 2005, The University of Michigan 60
Indexing
(C) 2005, The University of Michigan 61
Methods
• Manual: e.g., Library of Congress subject headings, MeSH
• Automatic
(C) 2005, The University of Michigan 62
LOC subject headings
http://www.loc.gov/catdir/cpso/lcco/lcco.html
A -- GENERAL WORKS
B -- PHILOSOPHY. PSYCHOLOGY. RELIGION
C -- AUXILIARY SCIENCES OF HISTORY
D -- HISTORY (GENERAL) AND HISTORY OF EUROPE
E -- HISTORY: AMERICA
F -- HISTORY: AMERICA
G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION
H -- SOCIAL SCIENCES
J -- POLITICAL SCIENCE
K -- LAW
L -- EDUCATION
M -- MUSIC AND BOOKS ON MUSIC
N -- FINE ARTS
P -- LANGUAGE AND LITERATURE
Q -- SCIENCE
R -- MEDICINE
S -- AGRICULTURE
T -- TECHNOLOGY
U -- MILITARY SCIENCE
V -- NAVAL SCIENCE
Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)
(C) 2005, The University of Michigan 63
Medicine

CLASS R - MEDICINE
Subclass R
R5-920        Medicine (General)
R5-130.5      General works
R131-687      History of medicine. Medical expeditions
R690-697      Medicine as a profession. Physicians
R702-703      Medicine and the humanities. Medicine and disease in relation to history, literature, etc.
R711-713.97   Directories
R722-722.32   Missionary medicine. Medical missionaries
R723-726      Medical philosophy. Medical ethics
R726.5-726.8  Medicine and disease in relation to psychology. Terminal care. Dying
R727-727.5    Medical personnel and the public. Physician and the public
R728-733      Practice of medicine. Medical practice economics
R735-854      Medical education. Medical schools. Research
R855-855.5    Medical technology
R856-857      Biomedical engineering. Electronics. Instrumentation
R858-859.7    Computer applications to medicine. Medical informatics
R864          Medical records
R895-920      Medical physics. Medical radiology. Nuclear medicine
(C) 2005, The University of Michigan 64
Finding the most frequent terms in a document
• Typically stop words: the, and, in
• Not content-bearing
• Terms vs. words
• Luhn’s method
(C) 2005, The University of Michigan 65
Luhn’s method
[Figure: word frequency curve for Luhn's method; x-axis: words, y-axis: frequency]
(C) 2005, The University of Michigan 66
Computing term salience
• Term frequency (TF)
• Document frequency (DF)
• Inverse document frequency (IDF)
IDF(w) = log( N / DF(w) )
(C) 2005, The University of Michigan 67
Applications of TFIDF
• Cosine similarity
• Indexing
• Clustering
(C) 2005, The University of Michigan 68
Vector-based matching
• The cosine measure
sim (D,C) =
(dk . ck . idf(k))
(dk)2 . (ck)2
k
k
k
(C) 2005, The University of Michigan 69
IDF: Inverse document frequency
N: number of documents
dk: number of documents containing term k
fik: absolute frequency of term k in document i
wik: weight of term k in document i

idfk = log2(N/dk) + 1 = log2N - log2dk + 1

TF * IDF is used for automated indexing and for topic discrimination:

wik = fik · (log2N - log2dk + 1)
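A minimal sketch of this weighting over a toy three-document collection (reusing the documents from the relevance-feedback example later in the deck), following wik = fik · (log2(N/dk) + 1).

import math
from collections import Counter

collection = [
    "car safety minivans tests injury statistics",
    "liability tests safety",
    "car passengers injury reviews",
]
N = len(collection)
tokenized = [doc.split() for doc in collection]
# dk: number of documents containing term k
df = Counter(term for toks in tokenized for term in set(toks))

def tfidf(toks):
    tf = Counter(toks)  # fik: frequency of term k in this document
    return {t: f * (math.log2(N / df[t]) + 1) for t, f in tf.items()}

for toks in tokenized:
    print(tfidf(toks))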
(C) 2005, The University of Michigan 70
Asian and European news
622.941 deng
306.835 china
196.725 beijing
153.608 chinese
152.113 xiaoping
124.591 jiang
108.777 communist
102.894 body
85.173 party
71.898 died
68.820 leader
43.402 state
38.166 people

97.487 nato
92.151 albright
74.652 belgrade
46.657 enlargement
34.778 alliance
34.778 french
33.803 opposition
32.571 russia
14.095 government
9.389 told
9.154 would
8.459 their
6.059 which
(C) 2005, The University of Michigan 71
Other topics
120.385 shuttle
99.487 space
90.128 telescope
70.224 hubble
59.992 rocket
50.160 astronauts
49.722 discovery
47.782 canaveral
47.782 cape
40.889 mission
35.778 florida
27.063 center

74.652 compuserve
65.321 massey
55.989 salizzoni
29.996 bob
27.994 online
27.198 executive
15.890 interim
15.271 chief
11.647 service
11.174 second
6.781 world
6.315 president
(C) 2005, The University of Michigan 72
Compression
(C) 2005, The University of Michigan 73
Compression
• Methods
  – Fixed length codes
  – Huffman coding
  – Ziv-Lempel codes
(C) 2005, The University of Michigan 74
Fixed length codes
• Binary representations
  – ASCII
  – Representational power (2^k symbols, where k is the number of bits)
(C) 2005, The University of Michigan 75
Variable length codes
• Alphabet:
A .-      N -.      0 -----
B -...    O ---     1 .----
C -.-.    P .--.    2 ..---
D -..     Q --.-    3 ...--
E .       R .-.     4 ....-
F ..-.    S ...     5 .....
G --.     T -       6 -....
H ....    U ..-     7 --...
I ..      V ...-    8 ---..
J .---    W .--     9 ----.
K -.-     X -..-
L .-..    Y -.--
M --      Z --..
• Demo:– http://www.babbage.demon.co.uk/morse.html– http://www.scphillips.com/morse/
(C) 2005, The University of Michigan 76
Most frequent letters in English
• Most frequent letters: E T A O I N S H R D L U
  – http://www.math.cornell.edu/~mec/modules/cryptography/subs/frequencies.html
• Demo:
  – http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm
• Also: bigrams: TH HE IN ER AN RE ND AT ON NT
  – http://www.math.cornell.edu/~mec/modules/cryptography/subs/digraphs.html
(C) 2005, The University of Michigan 77
Useful links about cryptography
• http://world.std.com/~franl/crypto.html
• http://www.faqs.org/faqs/cryptography-faq/
• http://en.wikipedia.org/wiki/Cryptography
(C) 2005, The University of Michigan 78
Huffman coding
• Developed by David Huffman (1952)
• Average of 5 bits per character (37.5% compression)
• Based on frequency distributions of symbols
• Algorithm: iteratively build a tree of symbols, starting with the two least frequent symbols
(C) 2005, The University of Michigan 79
Symbol Frequency
A 7
B 4
C 10
D 5
E 2
F 11
G 15
H 3
I 7
J 8
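A compact sketch of the tree-building loop using a heap; the codes it prints are valid Huffman codes for these frequencies but may differ from the table on the next slide, since ties can be broken in different orders.

import heapq

freqs = {"A": 7, "B": 4, "C": 10, "D": 5, "E": 2,
         "F": 11, "G": 15, "H": 3, "I": 7, "J": 8}

# Heap entries: (frequency, tiebreaker, tree), where a tree is a symbol
# or a (left, right) pair of subtrees.
heap = [(f, i, s) for i, (s, f) in enumerate(sorted(freqs.items()))]
heapq.heapify(heap)
counter = len(heap)
while len(heap) > 1:
    f1, _, t1 = heapq.heappop(heap)   # the two least frequent subtrees
    f2, _, t2 = heapq.heappop(heap)
    heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))
    counter += 1

def codes(tree, prefix=""):
    # Walk the tree: 0 for the left branch, 1 for the right branch.
    if isinstance(tree, str):
        return {tree: prefix or "0"}
    left, right = tree
    return {**codes(left, prefix + "0"), **codes(right, prefix + "1")}

print(codes(heap[0][2]))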
(C) 2005, The University of Michigan 80
[Figure: Huffman tree for the frequencies above; edges labeled 0 (left) and 1 (right), leaves a-j]
(C) 2005, The University of Michigan 81
Symbol Code
A 0110
B 0010
C 000
D 0011
E 01110
F 010
G 10
H 01111
I 110
J 111
(C) 2005, The University of Michigan 82
Exercise
• Consider the bit string: 01101101111000100110001110100111000110101101011101
• Use the Huffman code from the example to decode it.
• Try inserting, deleting, and switching some bits at random locations and try decoding.
(C) 2005, The University of Michigan 83
Ziv-Lempel coding
• Two types - one is known as LZ77 (used in GZIP)
• Code: set of triples <a,b,c>
  – a: how far back in the decoded text to look for the upcoming text segment
  – b: how many characters to copy
  – c: new character to add to complete the segment
(C) 2005, The University of Michigan 84
• <0,0,p>  p
• <0,0,e>  pe
• <0,0,t>  pet
• <2,1,r>  peter
• <0,0,_>  peter_
• <6,1,i>  peter_pi
• <8,2,r>  peter_piper
• <6,3,c>  peter_piper_pic
• <0,0,k>  peter_piper_pick
• <7,1,d>  peter_piper_picked
• <7,1,a>  peter_piper_picked_a
• <9,2,e>  peter_piper_picked_a_pe
• <9,2,_>  peter_piper_picked_a_peck_
• <0,0,o>  peter_piper_picked_a_peck_o
• <0,0,f>  peter_piper_picked_a_peck_of
• <17,5,l>  peter_piper_picked_a_peck_of_pickl
• <12,1,d>  peter_piper_picked_a_peck_of_pickled
• <16,3,p>  peter_piper_picked_a_peck_of_pickled_pep
• <3,2,r>  peter_piper_picked_a_peck_of_pickled_pepper
• <0,0,s>  peter_piper_picked_a_peck_of_pickled_peppers
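A small decoder for triples of exactly this <back, copy-length, new-char> form; fed the first seven triples above, it rebuilds peter_piper.

def lz77_decode(triples):
    out = []
    for back, length, char in triples:
        start = len(out) - back
        for i in range(length):
            out.append(out[start + i])   # copy from earlier decoded text
        out.append(char)                 # then add the new character
    return "".join(out)

steps = [(0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"),
         (0, 0, "_"), (6, 1, "i"), (8, 2, "r")]
print(lz77_decode(steps))   # peter_piper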
(C) 2005, The University of Michigan 85
Links on text compression
• Data compression:– http://www.data-compression.info/
• Calgary corpus:– http://links.uwaterloo.ca/calgary.corpus.html
• Huffman coding:– http://www.compressconsult.com/huffman/– http://en.wikipedia.org/wiki/Huffman_coding
• LZ– http://en.wikipedia.org/wiki/LZ77
(C) 2005, The University of Michigan 86
Relevance feedback and
query expansion
(C) 2005, The University of Michigan 87
Relevance feedback
• Problem: initial query may not be the most appropriate to satisfy a given information need.
• Idea: modify the original query so that it gets closer to the right documents in the vector space
(C) 2005, The University of Michigan 88
Relevance feedback
• Automatic
• Manual
• Method: identifying feedback terms

  Q' = a1·Q + a2·R - a3·N

• Often a1 = 1, a2 = 1/|R|, and a3 = 1/|N|
(C) 2005, The University of Michigan 89
Example
• Q = “safety minivans”
• D1 = “car safety minivans tests injury statistics” - relevant
• D2 = “liability tests safety” - relevant
• D3 = “car passengers injury reviews” - non-relevant
• R = ?
• S = ?
• Q’ = ?
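A sketch of the feedback step on this example, with a1 = 1, a2 = 1/|R|, a3 = 1/|N| (treating D3 as the non-relevant set).

from collections import Counter

def vec(text):
    return Counter(text.split())

Q = vec("safety minivans")
R = [vec("car safety minivans tests injury statistics"),
     vec("liability tests safety")]                     # relevant
N = [vec("car passengers injury reviews")]              # non-relevant

Qp = Counter(Q)                 # Q' = Q + (1/|R|) sum(R) - (1/|N|) sum(N)
for d in R:
    for t, f in d.items():
        Qp[t] += f / len(R)
for d in N:
    for t, f in d.items():
        Qp[t] -= f / len(N)

print({t: round(w, 2) for t, w in Qp.items() if w > 0})   # expanded query Q'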
(C) 2005, The University of Michigan 90
Pseudo relevance feedback
• Automatic query expansion
  – Thesaurus-based expansion (e.g., using latent semantic indexing - later...)
  – Distributional similarity
  – Query log mining
(C) 2005, The University of Michigan 106
String matching
(C) 2005, The University of Michigan 107
String matching methods
• Index-based
• Full or approximate
  – E.g., theater = theatre
(C) 2005, The University of Michigan 108
Index-based matching
• Inverted files
• Position-based inverted files
• Block-based inverted files
1 6 9 11 17 19 24 28 33 40 46 50 55 60
This is a text. A text has many words. Words are made from letters.
Text: 11, 19
Words: 33, 40
From: 55
(C) 2005, The University of Michigan 109
Inverted index (trie)
[Figure: trie over the index terms, branching on initial letters l, m (then "ad"/"an"), t, w]

Letters: 60
Text: 11, 19
Words: 33, 40
Made: 50
Many: 28
(C) 2005, The University of Michigan 110
Sequential searching
• No indexing structure given
• Given: database d and search pattern p.
  – Example: find “words” in the earlier example
• Brute force method
  – try all possible starting positions
  – O(n) positions in the database and O(m) characters in the pattern, so the total worst-case runtime is O(mn)
  – typical runtime is actually O(n), given that mismatches are easy to notice
(C) 2005, The University of Michigan 111
Knuth-Morris-Pratt
• Average runtime similar to brute force
• Worst case runtime is linear: O(n)
• Idea: reuse knowledge
• Need preprocessing of the pattern
(C) 2005, The University of Michigan 112
Knuth-Morris-Pratt (cont’d)
• Example (http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm)
database: ABC ABC ABC ABDAB ABCDABCDABDE
pattern: ABCDABD
index   0   1   2   3   4   5   6   7
char    A   B   C   D   A   B   D   -
pos    -1   0   0   0   0   1   2   0
(C) 2005, The University of Michigan 113
Knuth-Morris-Pratt (cont’d)

[Trace: successive alignments of the pattern ABCDABD against the text; at each mismatch (^), the pattern is shifted according to the pos table rather than restarted from scratch]
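A compact sketch of the algorithm (failure table plus scan), run on the text from the Wikipedia example the earlier slide cites.

def failure(p):
    # fail[i] = length of the longest proper prefix of p that is also
    # a suffix of p[:i]
    fail = [0] * (len(p) + 1)
    fail[0] = -1
    k = -1
    for i in range(len(p)):
        while k >= 0 and p[k] != p[i]:
            k = fail[k]
        k += 1
        fail[i + 1] = k
    return fail

def kmp_search(text, p):
    fail, k, hits = failure(p), 0, []
    for i, c in enumerate(text):
        while k >= 0 and p[k] != c:
            k = fail[k]               # reuse knowledge: shift by the table
        k += 1
        if k == len(p):
            hits.append(i - len(p) + 1)
            k = fail[k]
    return hits

print(failure("ABCDABD"))                                  # [-1, 0, 0, 0, 0, 1, 2, 0]
print(kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD"))    # [15]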
(C) 2005, The University of Michigan 115
Word similarity
• Hamming distance - when words are of the same length
• Levenshtein distance - number of edits (insertions, deletions, replacements)
  – color --> colour (1)
  – survey --> surgery (2)
  – com puter --> computer ?
• Longest common subsequence (LCS)
  – lcs(survey, surgery) = surey
(C) 2005, The University of Michigan 116
Levenshtein edit distance
• Examples:
  – Theatre -> theater
  – Ghaddafi -> Qadafi
  – Computer -> counter
• Edit distance (inserts, deletes, substitutions)
  – Edit transcript
• Done through dynamic programming
(C) 2005, The University of Michigan 117
Recurrence relation
• Three dependencies
  – D(i,0) = i
  – D(0,j) = j
  – D(i,j) = min[ D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1)+t(i,j) ]
• Simple edit distance:
  – t(i,j) = 0 iff S1(i) = S2(j), 1 otherwise
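A direct transcription of the recurrence into Python (a sketch; the test strings come from the surrounding slides).

def edit_distance(s1, s2):
    # D[i][j] = edit distance between s1[:i] and s2[:j]
    D = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        D[i][0] = i
    for j in range(len(s2) + 1):
        D[0][j] = j
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + t)    # match / substitution
    return D[-1][-1]

print(edit_distance("vintner", "writers"))   # 5, as in the table on the next slides
print(edit_distance("color", "colour"))      # 1
print(edit_distance("survey", "surgery"))    # 2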
(C) 2005, The University of Michigan 118
Example
Gusfield 1997
W R I T E R S
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6 7
V 1 1
I 2 2
N 3 3
T 4 4
N 5 5
E 6 6
R 7 7
(C) 2005, The University of Michigan 119
Example (cont’d)
Gusfield 1997
W R I T E R S
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6 7
V 1 1 1 2 3 4 5 6 7
I 2 2 2 2 2 3 4 5 6
N 3 3 3 3 3 3 4 5 6
T 4 4 4 4 4 *
N 5 5
E 6 6
R 7 7
(C) 2005, The University of Michigan 120
Tracebacks
Gusfield 1997
W R I T E R S
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6 7
V 1 1 1 2 3 4 5 6 7
I 2 2 2 2 2 3 4 5 6
N 3 3 3 3 3 3 4 5 6
T 4 4 4 4 4 *
N 5 5
E 6 6
R 7 7
(C) 2005, The University of Michigan 121
Weighted edit distance
• Used to emphasize the relative cost of different edit operations
• Useful in bioinformatics
  – Homology information
  – BLAST
  – Blosum
  – http://eta.embl-heidelberg.de:8000/misc/mat/blosum50.html
(C) 2005, The University of Michigan 122
• Web sites:
  – http://www.merriampark.com/ld.htm
  – http://odur.let.rug.nl/~kleiweg/lev/
(C) 2005, The University of Michigan 123
Clustering
(C) 2005, The University of Michigan 124
Clustering
• Exclusive/overlapping clusters
• Hierarchical/flat clusters
• The cluster hypothesis
  – Documents in the same cluster are relevant to the same query
(C) 2005, The University of Michigan 125
Representations for document clustering
• Typically: vector-based
  – Words: “cat”, “dog”, etc.
  – Features: document length, author name, etc.
• Each document is represented as a vector in an n-dimensional space
• Similar documents appear nearby in the vector space (distance measures are needed)
(C) 2005, The University of Michigan 126
Hierarchical clustering
• Dendrograms
• E.g., language similarity:
  http://odur.let.rug.nl/~kleiweg/clustering/clustering.html
(C) 2005, The University of Michigan 127
Another example
• Kingdom = animal
• Phylum = Chordata
• Subphylum = Vertebrata
• Class = Osteichthyes
• Subclass = Actinoptergyii
• Order = Salmoniformes
• Family = Salmonidae
• Genus = Oncorhynchus
• Species = Oncorhynchus kisutch (Coho salmon)
(C) 2005, The University of Michigan 128
Clustering using dendrograms
REPEAT
  Compute pairwise similarities
  Identify closest pair
  Merge pair into single node
UNTIL only one node left

Q: what is the equivalent Venn diagram representation?

Example: cluster the following sentences (see the sketch below):
A B C B A
A D C C A D E
C D E F C D A
E F G F D A
A C D A B A
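A minimal sketch of the REPEAT/UNTIL loop above, using single-linkage and word-overlap (Jaccard) similarity over the example sentences; the similarity choice is illustrative, not the one the slides assume.

# Single-linkage agglomerative clustering of the example "sentences".
sents = ["A B C B A", "A D C C A D E", "C D E F C D A",
         "E F G F D A", "A C D A B A"]

def sim(c1, c2):
    best = 0.0
    for s1 in c1:
        for s2 in c2:
            a, b = set(s1.split()), set(s2.split())
            best = max(best, len(a & b) / len(a | b))   # single linkage: best pair
    return best

clusters = [[s] for s in sents]
while len(clusters) > 1:
    score, i, j = max((sim(c1, c2), i, j)
                      for i, c1 in enumerate(clusters)
                      for j, c2 in enumerate(clusters) if i < j)
    clusters[i] += clusters[j]    # merge the closest pair into a single node
    del clusters[j]
    print(score, clusters)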
(C) 2005, The University of Michigan 129
Methods
• Single-linkage
  – One common pair is sufficient
  – Disadvantages: long chains
• Complete-linkage
  – All pairs have to match
  – Disadvantages: too conservative
• Average-linkage
• Centroid-based (online)
  – Look at distances to centroids
• Demo: /clair4/class/ir-w05/clustering
(C) 2005, The University of Michigan 130
k-means
• Needed: small number k of desired clusters
• hard vs. soft decisions
• Example: Weka
(C) 2005, The University of Michigan 131
k-means
1 initialize cluster centroids to arbitrary vectors
2 while further improvement is possible do
3 for each document d do
4 find the cluster c whose centroid is closest to d
5 assign d to cluster c
6 end for
7 for each cluster c do
8 recompute the centroid of cluster c based on its documents
9 end for
10 end while
(C) 2005, The University of Michigan 132
Example
• Cluster the following vectors into two groups:
  – A = <1,6>
  – B = <2,2>
  – C = <4,0>
  – D = <3,3>
  – E = <2,5>
  – F = <2,1>
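A runnable sketch of the pseudocode for the exercise points (assumes k = 2 and a fixed number of passes; a production version would test for convergence instead).

import random

points = [(1, 6), (2, 2), (4, 0), (3, 3), (2, 5), (2, 1)]
k = 2
centroids = random.sample(points, k)    # initialize centroids arbitrarily
for _ in range(10):                     # a few passes suffice here
    clusters = [[] for _ in range(k)]
    for p in points:                    # assign each point to nearest centroid
        c = min(range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 +
                              (p[1] - centroids[i][1]) ** 2)
        clusters[c].append(p)
    for i, cl in enumerate(clusters):   # recompute centroids from members
        if cl:
            centroids[i] = (sum(x for x, _ in cl) / len(cl),
                            sum(y for _, y in cl) / len(cl))
print(clusters)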
(C) 2005, The University of Michigan 133
Complexity
• Complexity = O(kn) per iteration: at each step, all n documents have to be compared to k centroids.
(C) 2005, The University of Michigan 136
Human clustering
• Significant disagreement in the number of clusters, overlap of clusters, and the composition of clusters (Macskassy et al. 1998).
(C) 2005, The University of Michigan 137
Lexical networks
(C) 2005, The University of Michigan 138
Lexical Networks
• Used to represent relationships between words
• Example: WordNet - created by George Miller’s team at Princeton
• Based on synsets (synonyms, interchangeable words) and lexical matrices
(C) 2005, The University of Michigan 139
Lexical matrix
                       Word Forms
Word Meanings    F1      F2      F3     ...    Fn
M1               E1,1    E1,2
M2                       E2,2
...
Mm                                             Em,n
(C) 2005, The University of Michigan 140
Synsets
• Disambiguation
  – {board, plank}
  – {board, committee}
• Synonyms
  – substitution
  – weak substitution
  – synonyms must be of the same part of speech
(C) 2005, The University of Michigan 141
$ ./wn board -hypen
Synonyms/Hypernyms (Ordered by Frequency) of noun board
9 senses of board
Sense 1
board
 => committee, commission
    => administrative unit
       => unit, social unit
          => organization, organisation
             => social group
                => group, grouping

Sense 2
board
 => sheet, flat solid
    => artifact, artefact
       => object, physical object
          => entity, something

Sense 3
board, plank
 => lumber, timber
    => building material
       => artifact, artefact
          => object, physical object
             => entity, something
(C) 2005, The University of Michigan 142
Sense 4
display panel, display board, board
 => display
    => electronic device
       => device
          => instrumentality, instrumentation
             => artifact, artefact
                => object, physical object
                   => entity, something

Sense 5
board, gameboard
 => surface
    => artifact, artefact
       => object, physical object
          => entity, something

Sense 6
board, table
 => fare
    => food, nutrient
       => substance, matter
          => object, physical object
             => entity, something
(C) 2005, The University of Michigan 143
Sense 7
control panel, instrument panel, control board, board, panel
 => electrical device
    => device
       => instrumentality, instrumentation
          => artifact, artefact
             => object, physical object
                => entity, something

Sense 8
circuit board, circuit card, board, card
 => printed circuit
    => computer circuit
       => circuit, electrical circuit, electric circuit
          => electrical device
             => device
                => instrumentality, instrumentation
                   => artifact, artefact
                      => object, physical object
                         => entity, something

Sense 9
dining table, board
 => table
    => furniture, piece of furniture, article of furniture
       => furnishings
          => instrumentality, instrumentation
             => artifact, artefact
                => object, physical object
                   => entity, something
(C) 2005, The University of Michigan 144
Antonymy
• “x” vs. “not-x”
• “rich” vs. “poor”?
• {rise, ascend} vs. {fall, descend}
(C) 2005, The University of Michigan 145
Other relations
• Meronymy: X is a meronym of Y when native speakers of English accept sentences similar to “X is a part of Y”, “X is a member of Y”.
• Hyponymy: {tree} is a hyponym of {plant}.
• Hierarchical structure based on hyponymy (and hypernymy).
(C) 2005, The University of Michigan 146
Other features of WordNet
• Index of familiarity
• Polysemy
(C) 2005, The University of Michigan 147
Familiarity and polysemy

board used as a noun is familiar (polysemy count = 9)
bird used as a noun is common (polysemy count = 5)
cat used as a noun is common (polysemy count = 7)
house used as a noun is familiar (polysemy count = 11)
information used as a noun is common (polysemy count = 5)
retrieval used as a noun is uncommon (polysemy count = 3)
serendipity used as a noun is very rare (polysemy count = 1)
(C) 2005, The University of Michigan 148
Compound nouns
advisory board
appeals board
backboard
backgammon board
baseboard
basketball backboard
big board
billboard
binder's board
binder board
blackboard
board game
board measure
board meeting
board member
board of appeals
board of directors
board of education
board of regents
board of trustees
(C) 2005, The University of Michigan 149
Overview of senses
1. board -- (a committee having supervisory powers; "the board has seven members")
2. board -- (a flat piece of material designed for a special purpose; "he nailed boards across the windows")
3. board, plank -- (a stout length of sawn timber; made in a wide variety of sizes and used for many purposes)
4. display panel, display board, board -- (a board on which information can be displayed to public view)
5. board, gameboard -- (a flat portable surface (usually rectangular) designed for board games; "he got out the board and set up the pieces")
6. board, table -- (food or meals in general; "she sets a fine table"; "room and board")
7. control panel, instrument panel, control board, board, panel -- (an insulated panel containing switches and dials and meters for controlling electrical devices; "he checked the instrument panel"; "suddenly the board lit up like a Christmas tree")
8. circuit board, circuit card, board, card -- (a printed circuit that can be inserted into expansion slots in a computer to increase the computer's capabilities)
9. dining table, board -- (a table at which meals are served; "he helped her clear the dining table"; "a feast was spread upon the board")
(C) 2005, The University of Michigan 150
Top-level concepts

{act, action, activity}
{animal, fauna}
{artifact}
{attribute, property}
{body, corpus}
{cognition, knowledge}
{communication}
{event, happening}
{feeling, emotion}
{food}
{group, collection}
{location, place}
{motive}
{natural object}
{natural phenomenon}
{person, human being}
{plant, flora}
{possession}
{process}
{quantity, amount}
{relation}
{shape}
{state, condition}
{substance}
{time}
(C) 2005, The University of Michigan 151
WordNet and DistSim
wn reason -hypen - hypernyms
wn reason -synsn - synsets
wn reason -simsn - synonyms
wn reason -over - overview of senses
wn reason -famln - familiarity/polysemy
wn reason -grepn - compound nouns
/data2/tools/relatedwords/relate reason
(C) 2005, The University of Michigan 152
System comparison
(C) 2005, The University of Michigan 153
Comparing two systems
• Comparing A and B
• One query?
• Average performance?
• Need: A to consistently outperform B
[this slide: courtesy James Allan]
(C) 2005, The University of Michigan 154
The sign test
• Example 1:
  – A > B (12 times)
  – A = B (25 times)
  – A < B (3 times)
  – p < 0.035 (significant at the 5% level)
• Example 2:
  – A > B (18 times)
  – A < B (9 times)
  – p < 0.122 (not significant at the 5% level)

[this slide: courtesy James Allan]
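A small sketch that reproduces both examples (two-tailed binomial sign test, ignoring ties, as in the slide).

from math import comb

def sign_test(wins, losses):
    # two-tailed binomial sign test; ties are ignored
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(round(sign_test(12, 3), 3))   # 0.035
print(round(sign_test(18, 9), 3))   # 0.122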
(C) 2005, The University of Michigan 155
Other tests
• The t test:
  – Takes into account the actual performances, not just which system is better
  – http://nimitz.mcs.kent.edu/~blewis/stat/tTest.html
• The sign test:
  – http://www.fon.hum.uva.nl/Service/Statistics/Sign_Test.html
(C) 2005, The University of Michigan 156
Techniques for dimensionality reduction: SVD and LSI
(C) 2005, The University of Michigan 157
Techniques for dimensionality reduction
• Based on matrix decomposition (goal: preserve clusters, explain away variance)
• A quick review of matrices
  – Vectors
  – Matrices
  – Matrix multiplication

[Example: multiplying a 3x3 matrix by a vector]
(C) 2005, The University of Michigan 158
SVD: Singular Value Decomposition
• A = U·S·V^T (S is the diagonal matrix of singular values)
• This decomposition exists for all matrices, dense or sparse
• If A has 3 rows and 5 columns, then U will be 3x3 and V will be 5x5
• In Matlab, use [U,S,V] = svd (A)
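The same call in Python/NumPy on a made-up 2x3 matrix (note that numpy.linalg.svd returns the singular values as a vector rather than as a matrix, unlike Matlab).

import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])          # 2 terms x 3 documents (made up)
U, s, Vt = np.linalg.svd(A)              # A = U @ S @ Vt; s holds singular values
S = np.zeros_like(A)
S[:len(s), :len(s)] = np.diag(s)
print(np.allclose(A, U @ S @ Vt))        # True: the decomposition is exact

A1 = s[0] * np.outer(U[:, 0], Vt[0, :])  # rank-1 approximation (largest value only)
print(np.round(A1, 2))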
(C) 2005, The University of Michigan 159
Term matrix normalization
[Example: a 0/1 term-by-document matrix A (columns D1-D5) and its normalized version A(n), in which each column is scaled to unit Euclidean length, so entries become values such as 0.71 = 1/sqrt(2) and 0.58 = 1/sqrt(3)]
(C) 2005, The University of Michigan 160
Example (Berry and Browne)
• T1: baby
• T2: child
• T3: guide
• T4: health
• T5: home
• T6: infant
• T7: proofing
• T8: safety
• T9: toddler

• D1: infant & toddler first aid
• D2: babies & children’s room (for your home)
• D3: child safety at home
• D4: your baby’s health and safety: from infant to toddler
• D5: baby proofing basics
• D6: your guide to easy rust proofing
• D7: beanie babies collector’s guide
(C) 2005, The University of Michigan 161
Document term matrix
[Matrix A: 9 terms (rows T1-T9) by 7 documents (columns D1-D7), with a 1 wherever the term occurs in the document, and its column-normalized version A(n)]
(C) 2005, The University of Michigan 162
Decomposition

u =
 -0.6976 -0.0945  0.0174 -0.6950  0.0000  0.0153  0.1442 -0.0000  0
 -0.2622  0.2946  0.4693  0.1968 -0.0000 -0.2467 -0.1571 -0.6356  0.3098
 -0.3519 -0.4495 -0.1026  0.4014  0.7071 -0.0065 -0.0493 -0.0000  0.0000
 -0.1127  0.1416 -0.1478 -0.0734  0.0000  0.4842 -0.8400  0.0000 -0.0000
 -0.2622  0.2946  0.4693  0.1968  0.0000 -0.2467 -0.1571  0.6356 -0.3098
 -0.1883  0.3756 -0.5035  0.1273 -0.0000 -0.2293  0.0339 -0.3098 -0.6356
 -0.3519 -0.4495 -0.1026  0.4014 -0.7071 -0.0065 -0.0493  0.0000 -0.0000
 -0.2112  0.3334  0.0962  0.2819 -0.0000  0.7338  0.4659 -0.0000  0.0000
 -0.1883  0.3756 -0.5035  0.1273 -0.0000 -0.2293  0.0339  0.3098  0.6356
v =
 -0.1687  0.4192 -0.5986  0.2261  0      -0.5720  0.2433
 -0.4472  0.2255  0.4641 -0.2187  0.0000 -0.4871 -0.4987
 -0.2692  0.4206  0.5024  0.4900 -0.0000  0.2450  0.4451
 -0.3970  0.4003 -0.3923 -0.1305  0       0.6124 -0.3690
 -0.4702 -0.3037 -0.0507 -0.2607 -0.7071  0.0110  0.3407
 -0.3153 -0.5018 -0.1220  0.7128 -0.0000 -0.0162 -0.3544
 -0.4702 -0.3037 -0.0507 -0.2607  0.7071  0.0110  0.3407
(C) 2005, The University of Michigan 163
Decomposition
s =
 1.5849  0       0       0       0       0       0
 0       1.2721  0       0       0       0       0
 0       0       1.1946  0       0       0       0
 0       0       0       0.7996  0       0       0
 0       0       0       0       0.7100  0       0
 0       0       0       0       0       0.5692  0
 0       0       0       0       0       0       0.1977
 0       0       0       0       0       0       0
 0       0       0       0       0       0       0
Spread on the v1 axis
(C) 2005, The University of Michigan 164
Rank-4 approximation

s4 =
 1.5849  0       0       0       0  0  0
 0       1.2721  0       0       0  0  0
 0       0       1.1946  0       0  0  0
 0       0       0       0.7996  0  0  0
 0       0       0       0       0  0  0
 0       0       0       0       0  0  0
 0       0       0       0       0  0  0
 0       0       0       0       0  0  0
 0       0       0       0       0  0  0
(C) 2005, The University of Michigan 165
Rank-4 approximation

u*s4*v' =
 -0.0019  0.5985 -0.0148  0.4552  0.7002  0.0102  0.7002
 -0.0728  0.4961  0.6282  0.0745  0.0121 -0.0133  0.0121
  0.0003 -0.0067  0.0052 -0.0013  0.3584  0.7065  0.3584
  0.1980  0.0514  0.0064  0.2199  0.0535 -0.0544  0.0535
 -0.0728  0.4961  0.6282  0.0745  0.0121 -0.0133  0.0121
  0.6337 -0.0602  0.0290  0.5324 -0.0008  0.0003 -0.0008
  0.0003 -0.0067  0.0052 -0.0013  0.3584  0.7065  0.3584
  0.2165  0.2494  0.4367  0.2282 -0.0360  0.0394 -0.0360
  0.6337 -0.0602  0.0290  0.5324 -0.0008  0.0003 -0.0008
(C) 2005, The University of Michigan 166
Rank-4 approximation

u*s4 =
 -1.1056 -0.1203  0.0207 -0.5558  0  0  0
 -0.4155  0.3748  0.5606  0.1573  0  0  0
 -0.5576 -0.5719 -0.1226  0.3210  0  0  0
 -0.1786  0.1801 -0.1765 -0.0587  0  0  0
 -0.4155  0.3748  0.5606  0.1573  0  0  0
 -0.2984  0.4778 -0.6015  0.1018  0  0  0
 -0.5576 -0.5719 -0.1226  0.3210  0  0  0
 -0.3348  0.4241  0.1149  0.2255  0  0  0
 -0.2984  0.4778 -0.6015  0.1018  0  0  0
(C) 2005, The University of Michigan 167
Rank-4 approximation

s4*v' =
 -0.2674 -0.7087 -0.4266 -0.6292 -0.7451 -0.4996 -0.7451
  0.5333  0.2869  0.5351  0.5092 -0.3863 -0.6384 -0.3863
 -0.7150  0.5544  0.6001 -0.4686 -0.0605 -0.1457 -0.0605
  0.1808 -0.1749  0.3918 -0.1043 -0.2085  0.5700 -0.2085
  0  0  0  0  0  0  0
  0  0  0  0  0  0  0
  0  0  0  0  0  0  0
  0  0  0  0  0  0  0
  0  0  0  0  0  0  0
(C) 2005, The University of Michigan 168
Rank-2 approximation

s2 =
 1.5849  0       0  0  0  0  0
 0       1.2721  0  0  0  0  0
 0       0       0  0  0  0  0
 0       0       0  0  0  0  0
 0       0       0  0  0  0  0
 0       0       0  0  0  0  0
 0       0       0  0  0  0  0
 0       0       0  0  0  0  0
 0       0       0  0  0  0  0
(C) 2005, The University of Michigan 169
Rank-2 approximation

u*s2*v' =
  0.1361  0.4673  0.2470  0.3908  0.5563  0.4089  0.5563
  0.2272  0.2703  0.2695  0.3150  0.0815 -0.0571  0.0815
 -0.1457  0.1204 -0.0904 -0.0075  0.4358  0.4628  0.4358
  0.1057  0.1205  0.1239  0.1430  0.0293 -0.0341  0.0293
  0.2272  0.2703  0.2695  0.3150  0.0815 -0.0571  0.0815
  0.2507  0.2412  0.2813  0.3097 -0.0048 -0.1457 -0.0048
 -0.1457  0.1204 -0.0904 -0.0075  0.4358  0.4628  0.4358
  0.2343  0.2454  0.2685  0.3027  0.0286 -0.1073  0.0286
  0.2507  0.2412  0.2813  0.3097 -0.0048 -0.1457 -0.0048
(C) 2005, The University of Michigan 170
Rank-2 approximation

u*s2 =
 -1.1056 -0.1203  0  0  0  0  0
 -0.4155  0.3748  0  0  0  0  0
 -0.5576 -0.5719  0  0  0  0  0
 -0.1786  0.1801  0  0  0  0  0
 -0.4155  0.3748  0  0  0  0  0
 -0.2984  0.4778  0  0  0  0  0
 -0.5576 -0.5719  0  0  0  0  0
 -0.3348  0.4241  0  0  0  0  0
 -0.2984  0.4778  0  0  0  0  0
(C) 2005, The University of Michigan 171
Rank-2 approximation

s2*v' =
 -0.2674 -0.7087 -0.4266 -0.6292 -0.7451 -0.4996 -0.7451
  0.5333  0.2869  0.5351  0.5092 -0.3863 -0.6384 -0.3863
  0  0  0  0  0  0  0
  0  0  0  0  0  0  0
  0  0  0  0  0  0  0
  0  0  0  0  0  0  0
  0  0  0  0  0  0  0
  0  0  0  0  0  0  0
  0  0  0  0  0  0  0
(C) 2005, The University of Michigan 172
Documents to concepts and terms to concepts
A(:,1)'*u*s
-0.4238 0.6784 -0.8541 0.1446 -0.0000 -0.1853 0.0095
>> A(:,1)'*u*s4
-0.4238 0.6784 -0.8541 0.1446 0 0 0
>> A(:,1)'*u*s2
-0.4238 0.6784 0 0 0 0 0
>> A(:,2)'*u*s2
-1.1233 0.3650 0 0 0 0 0
>> A(:,3)'*u*s2
-0.6762 0.6807 0 0 0 0 0
(C) 2005, The University of Michigan 173
Documents to concepts and terms to concepts
>> A(:,4)'*u*s2
-0.9972 0.6478 0 0 0 0 0
>> A(:,5)'*u*s2
-1.1809 -0.4914 0 0 0 0 0
>> A(:,6)'*u*s2
-0.7918 -0.8121 0 0 0 0 0
>> A(:,7)'*u*s2
-1.1809 -0.4914 0 0 0 0 0
(C) 2005, The University of Michigan 174
Cont’d
>> (s2*v'*A(1,:)')'
-1.7523 -0.1530 0 0 0 0 0 0 0
>> (s2*v'*A(2,:)')'
-0.6585 0.4768 0 0 0 0 0 0 0
>> (s2*v'*A(3,:)')'
-0.8838 -0.7275 0 0 0 0 0 0 0
>> (s2*v'*A(4,:)')'
-0.2831 0.2291 0 0 0 0 0 0 0
>> (s2*v'*A(5,:)')'
-0.6585 0.4768 0 0 0 0 0 0 0
(C) 2005, The University of Michigan 175
Cont’d
>> (s2*v'*A(6,:)')'
-0.4730 0.6078 0 0 0 0 0 0 0
>> (s2*v'*A(7,:)')'
-0.8838 -0.7275 0 0 0 0 0 0 0
>> (s2*v'*A(8,:)')'
-0.5306 0.5395 0 0 0 0 0 0 0
>> (s2*v'*A(9,:)')'
-0.4730 0.6078 0 0 0 0 0 0 0
(C) 2005, The University of Michigan 176
Properties

A is a document to term matrix. What is A*A', what is A'*A?

A*A' =
 1.5471  0.3364  0.5041  0.2025  0.3364  0.2025  0.5041  0.2025  0.2025
 0.3364  0.6728  0       0       0.6728  0       0       0.3364  0
 0.5041  0       1.0082  0       0       0       0.5041  0       0
 0.2025  0       0       0.2025  0       0.2025  0       0.2025  0.2025
 0.3364  0.6728  0       0       0.6728  0       0       0.3364  0
 0.2025  0       0       0.2025  0       0.7066  0       0.2025  0.7066
 0.5041  0       0.5041  0       0       0       1.0082  0       0
 0.2025  0.3364  0       0.2025  0.3364  0.2025  0       0.5389  0.2025
 0.2025  0       0       0.2025  0       0.7066  0       0.2025  0.7066

A'*A =
 1.0082  0       0       0.6390  0       0       0
 0       1.0092  0.6728  0.2610  0.4118  0       0.4118
 0       0.6728  1.0092  0.2610  0       0       0
 0.6390  0.2610  0.2610  1.0125  0.3195  0       0.3195
 0       0.4118  0       0.3195  1.0082  0.5041  0.5041
 0       0       0       0       0.5041  1.0082  0.5041
 0       0.4118  0       0.3195  0.5041  0.5041  1.0082
(C) 2005, The University of Michigan 177
Latent semantic indexing (LSI)
• Dimensionality reduction = identification of hidden (latent) concepts
• Query matching in latent space
(C) 2005, The University of Michigan 178
Useful pointers
• http://lsa.colorado.edu
• http://lsi.research.telcordia.com/
• http://www.cs.utk.edu/~lsi/
• http://javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm
• http://citeseer.nj.nec.com/deerwester90indexing.html
• http://www.pcug.org.au/~jdowling/
(C) 2005, The University of Michigan 179
Models of the Web
(C) 2005, The University of Michigan 180
Size
• The Web is the largest repository of data and it grows exponentially.
  – 320 Million Web pages [Lawrence & Giles 1998]
– 800 Million Web pages, 15 TB [Lawrence & Giles 1999]
– 8 Billion Web pages indexed [Google 2005]
• Amount of data– roughly 200 TB [Lyman et al. 2003]
(C) 2005, The University of Michigan 181
Bow-tie model of the Web
[Figure: bow-tie structure - SCC: 56 M pages; IN: 44 M; OUT: 44 M; TENDRILS: 44 M; DISCONNECTED: 17 M]

• 24% of pages reachable from a given page

Bröder & al. WWW 2000, Dill & al. VLDB 2001
(C) 2005, The University of Michigan 182
Power laws
• Web site size (Huberman and Adamic 1999)
• Power-law connectivity (Barabási and Albert 1999): exponents 2.45 for out-degree and 2.1 for in-degree
• Others: call graphs among telephone carriers, citation networks (Redner 1998), e.g., Erdős; collaboration graph of actors; metabolic pathways (Jeong et al. 2000); protein networks (Maslov and Sneppen 2002). All values of gamma are around 2-3.
(C) 2005, The University of Michigan 183
Small-world networks
• Diameter = average length of the shortest path between all pairs of nodes. Example...
• Milgram experiment (1967)
  – Kansas/Omaha --> Boston (42/160 letters)
  – diameter = 6
• Albert et al. 1999 - average distance between two vertices is d = 0.35 + 2.06 log10 n. For n = 10^9, d = 18.89.
• Six degrees of separation
(C) 2005, The University of Michigan 184
Clustering coefficient
• Cliquishness (C): the fraction of links that actually exist among the kv(kv - 1)/2 possible pairs of neighbors of a node v.
• Examples:
             n        k      d     drand  C     crand
Actors       225226   61     3.65  2.99   0.79  0.00027
Power grid   4941     2.67   18.7  12.4   0.08  0.005
C. elegans   282      14     2.65  2.25   0.28  0.05
(C) 2005, The University of Michigan 185
Models of the Web
• Erdös/Rényi 59, 60: random graphs; Poisson degree distribution
  P(k) = e^(-λ) λ^k / k!,  λ = pN
• Barabási/Albert 99: preferential attachment; power-law degree distribution
  P(k) ~ k^(-γ)
• Watts/Strogatz 98
• Kleinberg 98
• Menczer 02
• Radev 03
• Evolving networks: fundamental object of statistical physics, social networks, mathematical biology, and epidemiology
(C) 2005, The University of Michigan 188
Social network analysis for IR
(C) 2005, The University of Michigan 189
Social networks
• Induced by a relation
• Symmetric or not
• Examples:
  – Friendship networks
  – Board membership
  – Citations
  – Power grid of the US
  – WWW
(C) 2005, The University of Michigan 190
[Figure: a social network (Krebs 2004)]
(C) 2005, The University of Michigan 191
Prestige and centrality
• Degree centrality: how many neighbors each node has.
• Closeness centrality: how close a node is to all of the other nodes
• Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes
• Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects.
• Prestige = same as centrality but for directed graphs.
(C) 2005, The University of Michigan 192
Graph-based representations
[Figure: a directed graph G(V,E) on nodes 1-8 and its 8x8 square connectivity (incidence) matrix, with entry (i,j) = 1 when there is an edge from node i to node j]
(C) 2005, The University of Michigan 193
Markov chains
• A homogeneous Markov chain is defined by an initial distribution x and a Markov kernel E.
• Path = sequence (x0, x1, ..., xn), where xi = xi-1·E
• The probability of a path can be computed as a product of probabilities for each step i.
• Random walk = find Xj given x0, E, and j.
(C) 2005, The University of Michigan 194
Stationary solutions
• The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions:– E is stochastic
– E is irreducible
– E is aperiodic
• To make these conditions true:– All rows of E add up to 1 (and no value is negative)
– Make sure that E is strongly connected
– Make sure that E is not bipartite
• Example: PageRank [Brin and Page 1998]: use “teleportation”
(C) 2005, The University of Michigan 195
Example

This graph E has a second graph E' (not drawn) superimposed on it: E' is the uniform transition graph.

[Figure: the 8-node graph and bar charts of its PageRank values at t=0 and t=1]
(C) 2005, The University of Michigan 196
Eigenvectors
• An eigenvector is an implicit “direction” for a matrix: Mv = λv, where v is non-zero, though λ can be any complex number in principle.
• The largest eigenvalue of a stochastic matrix E is real: λ1 = 1.
• For λ1, the left (principal) eigenvector is p, the right eigenvector = 1
• In other words, ETp = p.
(C) 2005, The University of Michigan 197
Computing the stationary distribution
E^T·p = p
(I - E^T)·p = 0    (solution for the stationary distribution)

function PowerStatDist (E):
begin
  p(0) = u;   (or p(0) = [1,0,...,0])
  i = 1;
  repeat
    p(i) = E^T·p(i-1)
    L = ||p(i) - p(i-1)||1;
    i = i + 1;
  until L < ε
  return p(i)
end
(C) 2005, The University of Michigan 198
Example

[Figure: the 8-node graph and bar charts of its PageRank values at t=0, t=1, and t=10, converging to the stationary distribution]
(C) 2005, The University of Michigan 199
How Google works
• Crawling
• Anchor text
• Fast query processing
• PageRank
(C) 2005, The University of Michigan 200
More about PageRank
• Named after Larry Page, founder of Google (and UM alum)
• Reading: “The anatomy of a large-scale hypertextual web search engine” by Brin and Page.
• Independent of query (although more recent work by Haveliwala (WWW 2002) has also identified topic-based PageRank).
(C) 2005, The University of Michigan 201
HITS
• Query-dependent model (Kleinberg 1997)
• Hubs and authorities (e.g., cars, Honda)
• Algorithm:
  – obtain root set using input query
  – expand the root set by radius one
  – run iterations on the hub and authority scores together:
    a' = E^T·h,   h' = E·a
  – report top-ranking authorities and hubs
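A minimal sketch of the coupled iteration on a made-up adjacency matrix, normalizing after each step.

import numpy as np

E = np.array([[0, 1, 1, 0],     # made-up adjacency matrix: E[i, j] = 1
              [0, 0, 1, 0],     # if page i links to page j
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)

a = np.ones(4)                  # authority scores
h = np.ones(4)                  # hub scores
for _ in range(50):
    a = E.T @ h                 # a' = E^T h: good authorities are pointed
    a /= np.linalg.norm(a)      # to by good hubs
    h = E @ a                   # h' = E a: good hubs point to good authorities
    h /= np.linalg.norm(h)
print(np.round(a, 3), np.round(h, 3))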
(C) 2005, The University of Michigan 202
The link-content hypothesis
• Topical locality: a page tends to be similar to the page that points to it.
• Davison (TF*IDF, 100K pages)
  – 0.31 same domain
– 0.23 linked pages
– 0.19 sibling
– 0.02 random
• Menczer (373K pages, non-linear least squares fit)
• Chakrabarti (focused crawling) - prob. of losing the topic
Van Rijsbergen 1979, Chakrabarti & al. WWW 1999, Davison SIGIR 2000, Menczer 2001
[Equation: similarity as a decaying function of link distance, with fitted parameters 1.8 and 0.6 and an asymptote of 0.03]
(C) 2005, The University of Michigan 203
Measuring the Web
(C) 2005, The University of Michigan 204
Bharat and Broder 1998
• Based on crawls of HotBot, Altavista, Excite, and InfoSeek
• 10,000 queries in mid and late 1997
• Estimate is 200M pages
• Only 1.4% are indexed by all of them
(C) 2005, The University of Michigan 205
Example (from Bharat&Broder)
A similar approach by Lawrence and Giles yields 320M pages (Lawrence and Giles 1998).
(C) 2005, The University of Michigan 212
Question answering
(C) 2005, The University of Michigan 213
People ask questions
• Excite corpus of 2,477,283 queries (one day’s worth)
• 8.4% of them are questions
  – 43.9% factual (what is the country code for Belgium)
  – 56.1% procedural (how do I set up TCP/IP) or other
• In other words, 100 K questions per day
(C) 2005, The University of Michigan 214
People ask questionsIn what year did baseball become an offical sport?Who is the largest man in the world?Where can i get information on Raphael?where can i find information on puritan religion?Where can I find how much my house is worth?how do i get out of debt?Where can I found out how to pass a drug test?When is the Super Bowl?who is California's District State Senator?where can I buy extra nibs for a foutain pen?how do i set up tcp/ip ?what time is it in west samoa?Where can I buy a little kitty cat?what are the symptoms of attention deficit disorder?Where can I get some information on Michael Jordan?How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?When did the Neanderthal man live?Which Frenchman declined the Nobel Prize for Literature for ideological reasons?What is the largest city in Northern Afghanistan?
(C) 2005, The University of Michigan 215
(C) 2005, The University of Michigan 216
Question answering
What is the largest city in Northern Afghanistan?
(C) 2005, The University of Michigan 217
Possible approaches
• Map?
• Knowledge base?
  Find x: city(x) ∧ located(x, ”Northern Afghanistan”) ∧
          ¬∃y: city(y) ∧ located(y, ”Northern Afghanistan”) ∧ greaterthan(population(y), population(x))
• Database?
• World factbook?
• Search engine?
(C) 2005, The University of Michigan 218
The TREC Q&A evaluation
• Run by NIST [Voorhees and Tice 2000]
• 2 GB of input
• 200 questions
• Essentially fact extraction
  – Who was Lincoln’s secretary of state?
  – What does the Peugeot company manufacture?
• Questions are based on text
• Answers are assumed to be present
• No inference needed
(C) 2005, The University of Michigan 219
Q: When did Nelson Mandela become president of South Africa?
A: 10 May 1994
Q: How tall is the Matterhorn?
A: The institute revised the Matterhorn 's height to 14,776 feet 9 inches
Q: How tall is the replica of the Matterhorn at Disneyland?
A: In fact he has climbed the 147-foot Matterhorn at Disneyland every week end for the last 3 1/2 years
Q: If Iraq attacks a neighboring country, what should the US do?
A: ??
Question answering
(C) 2005, The University of Michigan 220
Q: Why did David Koresh ask the FBI for a word processor?
Q: Name the designer of the shoe that spawned millions of plastic imitations, known as "jellies".
Q: What is the brightest star visible from Earth?
Q: What are the Valdez Principles?
Q: Name a film that has won the Golden Bear in the Berlin Film Festival?
Q: Name a country that is developing a magnetic levitation railway system?
Q: Name the first private citizen to fly in space.
Q: What did Shostakovich write for Rostropovich?
Q: What is the term for the sum of all genetic material in a given organism?
Q: What is considered the costliest disaster the insurance industry has ever faced?
Q: What is Head Start?
Q: What was Agent Orange used for during the Vietnam War?
Q: What did John Hinckley do to impress Jodie Foster?
Q: What was the first Gilbert and Sullivan opera?
Q: What did Richard Feynman say upon hearing he would receive the Nobel Prize in Physics?
Q: How did Socrates die?
Q: Why are electric cars less efficient in the north-east than in California?
(C) 2005, The University of Michigan 221
NSIR
• Current project at U-M:
  – http://tangra.si.umich.edu/clair/NSIR/html/nsir.cgi
• Reading: [Radev et al., 2005a]
  – Dragomir R. Radev, Weiguo Fan, Hong Qi, Harris Wu, and Amardeep Grewal. Probabilistic question answering on the web. Journal of the American Society for Information Science and Technology 56(3), March 2005.
  – http://tangra.si.umich.edu/~radev/bib2html/radev-bib.html
(C) 2005, The University of Michigan 222
(C) 2005, The University of Michigan 223
... Afghanistan, Kabul, 2,450 ... Administrative capital and largest city (1997 est ... Undetermined. Panama, Panama City, 450,668. ... of the Gauteng, Northern Province, Mpumalanga ... www.infoplease.com/cgi-bin/id/A0855603

... died in Kano, northern Nigeria's largest city, during two days of anti-American riots led by Muslims protesting the US-led bombing of Afghanistan, according to ... www.washingtonpost.com/wp-dyn/print/world/

... air strikes on the city. ... the Taliban militia in northern Afghanistan in a significant blow ... defection would be the largest since the United States ... www.afgha.com/index.php - 60k

... Kabul is the capital and largest city of Afghanistan. ... met. area pop. 2,029,889), is the largest city in Uttar Pradesh, a state in northern India. ... school.discovery.com/homeworkhelp/worldbook/atozgeography/k/k1menu.html

... Gudermes, Chechnya's second largest town. The attack ... location in Afghanistan's outlying regions ... in the city of Mazar-i-Sharif, a Northern Alliance-affiliated ... english.pravda.ru/hotspots/2001/09/17/

... Get Worse By RICK BRAGG Pakistan's largest city is getting a jump on the ... Region: Education Offers Women in Northern Afghanistan a Ray of Hope. ... www.nytimes.com/pages/world/asia/

... within three miles of the airport at Mazar-e-Sharif, the largest city in northern Afghanistan, held since 1998 by the Taliban. There was no immediate comment ... uk.fc.yahoo.com/photos/a/afghanistan.html
(C) 2005, The University of Michigan 224
NSIR pipeline, on the running example:

Question: What is the largest city in Northern Afghanistan?

Query modulation: (largest OR biggest) city "Northern Afghanistan"

Document retrieval: www.infoplease.com/cgi-bin/id/A0855603, www.washingtonpost.com/wp-dyn/print/world/

Sentence retrieval: "Gudermes, Chechnya's second largest town ..." / "... within three miles of the airport at Mazar-e-Sharif, the largest city in northern Afghanistan"

Answer extraction: Gudermes, Mazar-e-Sharif

Answer ranking: Mazar-e-Sharif, Gudermes
(C) 2005, The University of Michigan 225
(C) 2005, The University of Michigan 226
(C) 2005, The University of Michigan 227
Research problems

• Source identification:
  – semi-structured vs. text sources
• Query modulation:
  – what is the best paraphrase of a NL question given the syntax of a search engine?
  – compare two approaches: noisy channel model and rule-based
• Sentence ranking:
  – n-gram matching, Okapi, co-reference?
• Answer extraction:
  – question type identification
  – phrase chunking
  – no general-purpose named entity tagger available
• Answer ranking:
  – what are the best predictors of a phrase being the answer to a given question: question type, proximity to query words, frequency
• Evaluation (MRDR):
  – accuracy, reliability, timeliness
(C) 2005, The University of Michigan 228
Document retrieval
• Use existing search engines: Google, AlltheWeb, NorthernLight
• No modifications to question
• Cf. work on QASM (ACM CIKM 2001)
(C) 2005, The University of Michigan 229
Sentence ranking

• Weighted N-gram matching:

  S = w1 · Σ_i tf(u_i) · idf(u_i) + w2 · Σ_j tf(b_j) + w3 · Σ_k tf(t_k)

  where u_i, b_j, t_k range over the unigrams, bigrams, and trigrams shared by the question and the candidate sentence, and tf is the n-gram's frequency in the sentence

• Weights are determined empirically, e.g., w1 = 0.6, w2 = 0.3, w3 = 0.1 (see the sketch below)
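A minimal sketch of this scorer in Python, assuming whitespace tokenization and a precomputed idf dictionary (both are illustrative choices, not details from the slides):

    from collections import Counter

    def ngrams(tokens, n):
        """All contiguous n-grams of a token list, as tuples."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def sentence_score(question, sentence, idf, weights=(0.6, 0.3, 0.1)):
        """Weighted n-gram overlap between a question and a candidate
        sentence: unigram matches are idf-weighted, bigram and trigram
        matches are counted by frequency, as in the formula above."""
        q, s = question.lower().split(), sentence.lower().split()
        score = 0.0
        for n, w in zip((1, 2, 3), weights):
            s_counts = Counter(ngrams(s, n))
            for g in set(ngrams(q, n)):
                tf = s_counts[g]
                if tf and n == 1:
                    score += w * tf * idf.get(g[0], 1.0)
                elif tf:
                    score += w * tf
        return score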
(C) 2005, The University of Michigan 230
Probabilistic phrase reranking

• Answer extraction: probabilistic phrase reranking. What is:
  p(ph is answer to q | q, ph)
• Evaluation: TRDR (total reciprocal document rank, verified in the sketch below):

  TRDR = Σ_{i=1..n} 1/r_i

  where r_1, ..., r_n are the ranks of the retrieved items that contain an answer
  – Example: ranks (2, 8, 10) give 1/2 + 1/8 + 1/10 = .725
  – Document, sentence, or phrase level
• Criterion: presence of answer(s)
• High correlation with manual assessment
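The metric is short enough to check directly; a sketch (the function name is mine):

    def trdr(ranks):
        """Total reciprocal document rank: the sum of 1/rank over every
        retrieved item that contains a correct answer."""
        return sum(1.0 / r for r in ranks)

    assert abs(trdr([2, 8, 10]) - 0.725) < 1e-9  # the example above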
(C) 2005, The University of Michigan 231
Phrase types
PERSON, PLACE, DATE, NUMBER, DEFINITION, ORGANIZATION, DESCRIPTION, ABBREVIATION, KNOWNFOR, RATE, LENGTH, MONEY, REASON, DURATION, PURPOSE, NOMINAL, OTHER
(C) 2005, The University of Michigan 232
Question Type Identification

• Wh-type not sufficient:
  – Who: PERSON 77, DESCRIPTION 19, ORG 6
  – What: NOMINAL 78, PLACE 27, DEF 26, PERSON 18, ORG 16, NUMBER 14, etc.
  – How: NUMBER 33, LENGTH 6, RATE 2, etc.
• Ripper:
  – 13 features: Question-Words, Wh-Word, Word-Beside-Wh-Word, Is-Noun-Length, Is-Noun-Person, etc.
  – Top 2 question types
• Heuristic algorithm (a toy sketch follows below):
  – About 100 regular expressions based on words and parts of speech
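A toy illustration of the heuristic approach; the real system used roughly 100 patterns over words and part-of-speech tags, and the three patterns here are invented for illustration only:

    import re

    # Hypothetical patterns; tried in order, first match wins.
    QTYPE_PATTERNS = [
        (re.compile(r"^who\b", re.I), "PERSON"),
        (re.compile(r"^(where|what .* (city|country|state))\b", re.I), "PLACE"),
        (re.compile(r"^(when|what year)\b", re.I), "DATE"),
    ]

    def question_type(question):
        for pattern, qtype in QTYPE_PATTERNS:
            if pattern.search(question):
                return qtype
        return "OTHER"

    print(question_type("What is the largest city in Northern Afghanistan?"))  # PLACE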
(C) 2005, The University of Michigan 233
Ripper performance

Training      Test     Train Error Rate   Test Error Rate
TREC9         TREC8    22.4%              24%
TREC8,9       TREC10   17.03%             30%
TREC8,9,10    n/a      20.69%             n/a
(C) 2005, The University of Michigan 234
Regex performance

Training      Test on TREC8   Test on TREC9   Test on TREC10
TREC9         15%             7.8%            18%
TREC8,9       6%              7.4%            18.2%
TREC8,9,10    5.5%            4.6%            7.6%
(C) 2005, The University of Michigan 235
Phrase ranking
• Phrases are identified by a shallow parser (LT CHUNK from Edinburgh)
• Four features:
  – Proximity
  – POS (part-of-speech) signature (qtype)
  – Query overlap
  – Frequency
(C) 2005, The University of Michigan 236
Proximity

• Phrasal answers tend to appear near words from the query
• Average distance = 7 words, range = 1 to 50 words
• Use linear rescaling of scores (one possible form is sketched below)
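The slide gives the distance range but not the exact mapping, so the following linear rescaling is an assumption:

    def proximity_score(distance, d_min=1, d_max=50):
        """Linearly rescale a phrase's distance (in words) to the
        nearest query word into a [0, 1] score; closer is better."""
        d = min(max(distance, d_min), d_max)
        return (d_max - d) / (d_max - d_min)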
(C) 2005, The University of Michigan 237
Part of speech signature
Signature     Phrase types
VBD           NO (100%)
DT NN         NO (86.7%), PERSON (3.8%), NUMBER (3.8%), ORG (2.5%)
NNP           PERSON (37.4%), PLACE (29.6%), DATE (21.7%), NO (7.6%)
DT JJ NNP     NO (75.6%), NUMBER (11.1%), PLACE (4.4%), ORG (4.4%)
NNP NNP       PLACE (37.3%), PERSON (35.6%), NO (16.9%), ORG (10.2%)
DT NNP        ORG (55.6%), NO (33.3%), PLACE (5.6%), DATE (5.6%)

Example: "Hugo/NNP Young/NNP" gives P(PERSON | "NNP NNP") = .458
Example: "the/DT Space/NNP Flight/NNP Operations/NNP contractor/NN" gives P(PERSON | "DT NNP NNP NNP NN") = 0
Penn Treebank tagset (DT = determiner, JJ = adjective)
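The conditional probabilities in the table can be estimated by simple counting over labeled phrases; a sketch, with an illustrative input format:

    from collections import Counter, defaultdict

    def signature_model(tagged_phrases):
        """Estimate P(phrase type | POS signature) from labeled data.
        `tagged_phrases` is an iterable of (signature, phrase_type)
        pairs, e.g. ("NNP NNP", "PERSON")."""
        counts = defaultdict(Counter)
        for sig, ptype in tagged_phrases:
            counts[sig][ptype] += 1
        return {sig: {t: c / sum(ctr.values()) for t, c in ctr.items()}
                for sig, ctr in counts.items()}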
(C) 2005, The University of Michigan 238
Query overlap and frequency
• Query overlap:
  – What is the capital of Zimbabwe?
  – Possible choices: Mugabe, Zimbabwe, Luanda, Harare
• Frequency:
  – Not necessarily accurate but rather useful
(C) 2005, The University of Michigan 239
Reranking
Rank Probability and phrase
1  0.599862  the_DT Space_NNP Flight_NNP Operations_NNP contractor_NN ._.
2  0.598564  International_NNP Space_NNP Station_NNP Alpha_NNP
3  0.598398  International_NNP Space_NNP Station_NNP
4  0.598125  to_TO become_VB
5  0.594763  a_DT joint_JJ venture_NN United_NNP Space_NNP Alliance_NNP
6  0.593933  NASA_NNP Johnson_NNP Space_NNP Center_NNP
7  0.587140  will_MD form_VB
8  0.585410  The_DT purpose_NN
9  0.576797  prime_JJ contracts_NNS
10 0.568013  First_NNP American_NNP
11 0.567361  this_DT bulletin_NN board_NN
12 0.565757  Space_NNP :_:
13 0.562627  'Spirit_NN '_'' of_IN
...
41 0.516368  Alan_NNP Shepard_NNP
Proximity = .5164
(C) 2005, The University of Michigan 240
Reranking
Rank Probability and phrase
1  0.465012  Space_NNP Administration_NNP ._.
2  0.446466  SPACE_NNP CALENDAR_NNP _.
3  0.413976  First_NNP American_NNP
4  0.399043  International_NNP Space_NNP Station_NNP Alpha_NNP
5  0.396250  her_PRP$ third_JJ space_NN mission_NN
6  0.395956  NASA_NNP Johnson_NNP Space_NNP Center_NNP
7  0.394122  the_DT American_NNP Commercial_NNP Launch_NNP Industry_NNP
8  0.390163  the_DT Red_NNP Planet_NNP ._.
9  0.379797  First_NNP American_NNP
10 0.376336  Alan_NNP Shepard_NNP
11 0.375669  February_NNP
12 0.374813  Space_NNP
13 0.373999  International_NNP Space_NNP Station_NNP
Qtype = .7288
Proximity * qtype = .3763
(C) 2005, The University of Michigan 241
Reranking
Rank Probability and phrase
1  0.478857  Neptune_NNP Beach_NNP ._.
2  0.449232  February_NNP
3  0.447075  Go_NNP
4  0.437895  Space_NNP
5  0.431835  Go_NNP
6  0.424678  Alan_NNP Shepard_NNP
7  0.423855  First_NNP American_NNP
8  0.421133  Space_NNP May_NNP
9  0.411065  First_NNP American_NNP woman_NN
10 0.401994  Life_NNP Sciences_NNP
11 0.385763  Space_NNP Shuttle_NNP Discovery_NNP STS-60_NN
12 0.381865  the_DT Moon_NNP International_NNP Space_NNP Station_NNP
13 0.370030  Space_NNP Research_NNP A_NNP Session_NNP
All four features
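The examples above combine features by multiplying their scores ("Proximity * qtype"). A minimal sketch of that reranking step, assuming each feature has already been rescaled to [0, 1]; the field names are illustrative:

    def rerank(phrases):
        """Sort candidate phrases by the product of their per-feature
        scores (proximity, question type, query overlap, frequency)."""
        def combined(p):
            return (p["proximity"] * p["qtype"] *
                    p["overlap"] * p["frequency"])
        return sorted(phrases, key=combined, reverse=True)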
(C) 2005, The University of Michigan 242
(C) 2005, The University of Michigan 243
(C) 2005, The University of Michigan 244
(C) 2005, The University of Michigan 245
Document level performance
Engine    Google    NLight    AlltheWeb
Avg       1.3361    1.0495    0.8355
# > 0     164       163       149

TREC 8 corpus (200 questions)
(C) 2005, The University of Michigan 246
Sentence level performance
Engine   GOO    GOL    GOU    NLO    NLL    NLU    AWO    AWL    AWU
Avg      0.49   0.54   2.55   0.44   0.48   2.53   0.26   0.31   2.13
# > 0    135    137    159    119    121    159    99     99     148
(C) 2005, The University of Michigan 247
Phrase level performance
Method             Google S+P   Google D+P   NorthernLight   AlltheWeb
Combined           0.199        0.157        0.117           0.105
Global proximity   0.0646       0.058        0.054           0.038
Appearance order   0.0646       0.068        0.048           0.026
Upperbound         1.941        2.698        2.652           2.176

Experiments performed Oct.-Nov. 2001
(C) 2005, The University of Michigan 248
(C) 2005, The University of Michigan 249
(C) 2005, The University of Michigan 250
Text classification
(C) 2005, The University of Michigan 251
Introduction
• Text classification: assigning documents to predefined categories
• Hierarchical vs. flat
• Many techniques: generative (kNN, Naïve Bayes) vs. discriminative (maxent, SVM, regression)
• Generative: model joint prob. p(x,y) and use Bayesian prediction to compute p(y|x)
• Discriminative: model p(y|x) directly.
(C) 2005, The University of Michigan 252
Generative models: kNN

• K-nearest neighbors
• Very easy to program (sketched below)
• Issues: choosing k, b?

  score(c, d_q) = b_c + Σ_{d ∈ kNN(d_q) ∩ c} s(d_q, d)

  where kNN(d_q) is the set of k training documents most similar to d_q, s is a similarity measure, and b_c is a class-specific bias
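A compact sketch of this scoring rule over sparse term-weight dictionaries, with cosine similarity standing in for s (the similarity measure and data layout are my choices, not the lecture's):

    import math
    from collections import defaultdict

    def cosine(u, v):
        """Cosine similarity of two sparse term-weight dictionaries."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def knn_scores(query_vec, train, k=5, b=0.0):
        """score(c, d_q) = b + sum of similarities to the k nearest
        training documents that belong to class c; `train` is a list
        of (vector, label) pairs."""
        nearest = sorted(train, key=lambda dv: cosine(query_vec, dv[0]),
                         reverse=True)[:k]
        scores = defaultdict(lambda: b)
        for vec, label in nearest:
            scores[label] += cosine(query_vec, vec)
        return dict(scores)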
(C) 2005, The University of Michigan 253
Feature selection: The χ² test

• For a term t, with indicator I_t and class C:

            I_t = 0    I_t = 1
  C = 0     k00        k01
  C = 1     k10        k11

• Testing for independence: P(C=0, I_t=0) should be equal to P(C=0) P(I_t=0)
  – P(C=0) = (k00+k01)/n
  – P(C=1) = 1-P(C=0) = (k10+k11)/n
  – P(I_t=0) = (k00+k10)/n
  – P(I_t=1) = 1-P(I_t=0) = (k01+k11)/n
(C) 2005, The University of Michigan 254
Feature selection: The χ² test

• High values of χ² indicate lower belief in independence.
• In practice, compute χ² for all words and pick the top k among them (sketched below).

  χ² = n (k00 k11 − k01 k10)² / ((k00+k01)(k10+k11)(k00+k10)(k01+k11))
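The statistic and the top-k selection fit in a few lines; a sketch, with an illustrative input format:

    def chi_square(k00, k01, k10, k11):
        """Chi-square statistic for a 2x2 term/class contingency table."""
        n = k00 + k01 + k10 + k11
        num = n * (k00 * k11 - k01 * k10) ** 2
        den = (k00 + k01) * (k10 + k11) * (k00 + k10) * (k01 + k11)
        return num / den if den else 0.0

    def top_k_terms(tables, k):
        """Pick the k terms with the highest chi-square scores;
        `tables` maps term -> (k00, k01, k10, k11)."""
        return sorted(tables, key=lambda t: chi_square(*tables[t]),
                      reverse=True)[:k]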
(C) 2005, The University of Michigan 255
Feature selection: mutual information

• No document length scaling is needed
• Documents are assumed to be generated according to the multinomial model

  MI(X, Y) = Σ_x Σ_y P(x, y) log [ P(x, y) / (P(x) P(y)) ]
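A direct transcription of the formula, assuming the joint distribution is given as a dictionary (an illustrative representation):

    import math

    def mutual_information(joint):
        """MI(X, Y) from a joint distribution {(x, y): P(x, y)};
        marginals are obtained by summing out the other variable."""
        px, py = {}, {}
        for (x, y), p in joint.items():
            px[x] = px.get(x, 0.0) + p
            py[y] = py.get(y, 0.0) + p
        return sum(p * math.log(p / (px[x] * py[y]))
                   for (x, y), p in joint.items() if p > 0)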
(C) 2005, The University of Michigan 256
Naïve Bayesian classifiers

• Naïve Bayesian classifier:

  P(d ∈ C | F1, F2, ..., Fk) = P(F1, F2, ..., Fk | d ∈ C) P(d ∈ C) / P(F1, F2, ..., Fk)

• Assuming statistical independence of the features:

  P(d ∈ C | F1, F2, ..., Fk) = [ Π_{j=1..k} P(Fj | d ∈ C) ] P(d ∈ C) / [ Π_{j=1..k} P(Fj) ]
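Since the denominator Π P(Fj) is the same for every class, the classifier only needs to compare numerators. A textbook multinomial Naive Bayes with add-one smoothing, sketched to make the slide concrete (the class interface is mine, not from the lecture); trained on (message, spam/ham) pairs, it could be applied to messages like the one on the next slide:

    import math
    from collections import Counter, defaultdict

    class NaiveBayes:
        def fit(self, docs, labels):
            """Count terms per class from (text, label) training data."""
            self.class_counts = Counter(labels)
            self.term_counts = defaultdict(Counter)
            for doc, label in zip(docs, labels):
                self.term_counts[label].update(doc.lower().split())
            self.vocab = {t for c in self.term_counts.values() for t in c}
            return self

        def predict(self, doc):
            """Return the class maximizing log P(C) + sum_j log P(Fj | C)."""
            n = sum(self.class_counts.values())
            best, best_lp = None, -math.inf
            for c, cc in self.class_counts.items():
                total = sum(self.term_counts[c].values())
                lp = math.log(cc / n)  # log prior P(d in C)
                for t in doc.lower().split():
                    tf = self.term_counts[c][t]
                    lp += math.log((tf + 1) / (total + len(self.vocab)))
                if lp > best_lp:
                    best, best_lp = c, lp
            return best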
(C) 2005, The University of Michigan 257
Spam recognition

Return-Path: <[email protected]>
X-Sieve: CMU Sieve 2.2
From: "Ibrahim Galadima" <[email protected]>
Reply-To: [email protected]
To: [email protected]
Date: Tue, 14 Jan 2003 21:06:26 -0800
Subject: Gooday
DEAR SIR
FUNDS FOR INVESTMENTS
THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU

I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE ! TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT

THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A
(C) 2005, The University of Michigan 258
Well-known datasets

• 20 newsgroups
  – http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
• Reuters-21578
  – Categories: grain, acquisitions, corn, crude, wheat, trade...
• WebKB
  – http://www-2.cs.cmu.edu/~webkb/
  – Classes: course, student, faculty, staff, project, dept, other
  – NB performance (2000), per class: P = 26, 43, 18, 6, 13, 2, 94; R = 83, 75, 77, 9, 73, 100, 35
(C) 2005, The University of Michigan 259
Support vector machines
• Introduced by Vapnik in the early 90s.
(C) 2005, The University of Michigan 260
Semi-supervised learning
• EM
• Co-training
• Graph-based
(C) 2005, The University of Michigan 261
Additional topics
• Soft margins
• VC dimension
• Kernel methods
(C) 2005, The University of Michigan 262
• SVMs are widely considered to be the best method for text classification (see papers by Sebastiani, Cristianini, Joachims), e.g., 86% accuracy on Reuters.
• NB also good in many circumstances
(C) 2005, The University of Michigan 263
Readings

• Books:
  – Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley/ACM Press, 1999
  – Pierre Baldi, Paolo Frasconi, and Padhraic Smyth, Modeling the Internet and the Web: Probabilistic Methods and Algorithms, Wiley, 2003, ISBN 0-470-84906-1
• Papers:
  – Barabasi and Albert, "Emergence of scaling in random networks", Science (286) 509-512, 1999
  – Bharat and Broder, "A technique for measuring the relative size and overlap of public Web search engines", WWW 1998
  – Brin and Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998
  – Bush, "As We May Think", The Atlantic Monthly, 1945
  – Chakrabarti, van den Berg, and Dom, "Focused Crawling", WWW 1999
  – Cho, Garcia-Molina, and Page, "Efficient Crawling Through URL Ordering", WWW 1998
  – Davison, "Topical locality on the Web", SIGIR 2000
  – Dean and Henzinger, "Finding related pages in the World Wide Web", WWW 1999
  – Deerwester, Dumais, Landauer, Furnas, and Harshman, "Indexing by latent semantic analysis", JASIS 41(6), 1990
(C) 2005, The University of Michigan 264
Readings

• Erkan and Radev, "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization", JAIR 22, 2004
• Jeong and Barabasi, "Diameter of the world wide web", Nature (401) 130-131, 1999
• Hawking, Voorhees, Craswell, and Bailey, "Overview of the TREC-8 Web Track", TREC 2000
• Haveliwala, "Topic-sensitive PageRank", WWW 2002
• Kumar, Raghavan, Rajagopalan, Sivakumar, Tomkins, and Upfal, "The Web as a graph", PODS 2000
• Lawrence and Giles, "Accessibility of information on the Web", Nature (400) 107-109, 1999
• Lawrence and Giles, "Searching the World-Wide Web", Science (280) 98-100, 1998
• Menczer, "Links tell us about lexical and semantic Web content", arXiv 2001
• Page, Brin, Motwani, and Winograd, "The PageRank citation ranking: Bringing order to the Web", Stanford TR, 1998
• Radev, Fan, Qi, Wu, and Grewal, "Probabilistic Question Answering on the Web", JASIST 2005
• Singhal, "Modern Information Retrieval: An Overview", IEEE 2001
(C) 2005, The University of Michigan 265
More readings

• Gerard Salton, Automatic Text Processing, Addison-Wesley (1989)
• Gerald Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer (1997)
• Gerard Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill (1983)
• C. J. van Rijsbergen, Information Retrieval, Butterworths (1979)
• Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Managing Gigabytes, Van Nostrand Reinhold (1994)
• ACM SIGIR Proceedings, SIGIR Forum
• ACM conferences in Digital Libraries
(C) 2005, The University of Michigan 266
Thank you!
Благодаря! (Bulgarian: "Thank you!")