SI650: Information Retrieval
(C) 2005, The University of Michigan 1
Information Retrieval
Dragomir R. Radev
University of Michigan
September 19, 2005
(C) 2005, The University of Michigan 2
About the instructor
• Dragomir R. Radev
• Associate Professor, University of Michigan
  – School of Information
  – Department of Electrical Engineering and Computer Science
  – Department of Linguistics
• Head of CLAIR (Computational Linguistics And Information Retrieval) at U. Michigan
• Treasurer, North American Chapter of the ACL
• Ph.D., 1998, Computer Science, Columbia University
• Email: [email protected]
• Home page: http://tangra.si.umich.edu/~radev
(C) 2005, The University of Michigan 3
Introduction
(C) 2005, The University of Michigan 4
IR systems
• Vivísimo
• AskJeeves
• NSIR
• Lemur
• MG
• Nutch
(C) 2005, The University of Michigan 5
Examples of IR systems
• Conventional (library catalog). Search by keyword, title, author, etc.
• Text-based (Lexis-Nexis, Google, FAST). Search by keywords. Limited search using queries in natural language.
• Multimedia (QBIC, WebSeek, SaFe). Search by visual appearance (shapes, colors, ...).
• Question answering systems (AskJeeves, NSIR, Answerbus). Search in (restricted) natural language.
(C) 2005, The University of Michigan 6
(C) 2005, The University of Michigan 7
(C) 2005, The University of Michigan 8
Need for IR
• Advent of the WWW - more than 8 billion documents indexed on Google
• How much information? 200 TB according to Lyman and Varian 2003.
  http://www.sims.berkeley.edu/research/projects/how-much-info/
• Search, routing, filtering
• User’s information need
(C) 2005, The University of Michigan 9
Some definitions of Information Retrieval (IR)
Salton (1989): “Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests.”
Kowalski (1997): “An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects.”
(C) 2005, The University of Michigan 10
Sample queries (from Excite)
In what year did baseball become an offical sport?
play station codes . com
birth control and depression
government
"WorkAbility I"+conference
kitchen appliances
where can I find a chines rosewood
tiger electronics
58 Plymouth Fury
How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
emeril Lagasse
Hubble
M.S Subalaksmi
running
(C) 2005, The University of Michigan 11
Mappings and abstractions
Reality → Data
Information need → Query
From Korfhage’s book
(C) 2005, The University of Michigan 12
Typical IR system
• (Crawling)
• Indexing
• Retrieval
• User interface
(C) 2005, The University of Michigan 13
Key Terms Used in IR
• QUERY: a representation of what the user is looking for - can be a list of words or a phrase.
• DOCUMENT: an information entity that the user wants to retrieve
• COLLECTION: a set of documents
• INDEX: a representation of information that makes querying easier
• TERM: word or concept that appears in a document or a query
(C) 2005, The University of Michigan 14
Documents
(C) 2005, The University of Michigan 15
Documents
• Not just printed paper
• collections vs. documents
• data structures: representations
• Bag of words method
• document surrogates: keywords, summaries
• encoding: ASCII, Unicode, etc.
(C) 2005, The University of Michigan 16
Document preprocessing
• Formatting
• Tokenization (Paul’s, Willow Dr., Dr. Willow, 555-1212, New York, ad hoc)
• Casing (cat vs. CAT)
• Stemming (computer, computation)
• Soundex
(C) 2005, The University of Michigan 17
Document representations
• Term-document matrix (m x n)
• term-term matrix (m x m x n)
• document-document matrix (n x n)
• Example: 3,000,000 documents (n) with 50,000 terms (m)
• sparse matrices
• Boolean vs. integer matrices
(C) 2005, The University of Michigan 18
Document representations
• Term-document matrix
  – Evaluating queries (e.g., (A ∧ B) ∨ C)
  – Storage issues
• Inverted files
  – Storage issues
  – Evaluating queries
  – Advantages and disadvantages
(C) 2005, The University of Michigan 19
IR models
(C) 2005, The University of Michigan 20
Major IR models
• Boolean
• Vector
• Probabilistic
• Language modeling
• Fuzzy retrieval
• Latent semantic indexing
(C) 2005, The University of Michigan 21
Major IR tasks
• Ad-hoc
• Filtering and routing
• Question answering
• Spoken document retrieval
• Multimedia retrieval
(C) 2005, The University of Michigan 22
Venn diagrams
[Figure: Venn diagram of documents D1 and D2 with regions labeled x, w, y, z]
(C) 2005, The University of Michigan 23
Boolean model
[Figure: Venn diagram of sets A and B]
(C) 2005, The University of Michigan 24
Boolean queries
• Example: restaurants AND (Mideastern OR vegetarian) AND inexpensive
• What types of documents are returned?
• Stemming
• thesaurus expansion
• inclusive vs. exclusive OR
• confusing uses of AND and OR
dinner AND sports AND symphony
4 OF (Pentium, printer, cache, PC, monitor, computer, personal)
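A minimal sketch of how such queries can be evaluated with sets from an inverted index; the index contents are made up, and the quorum helper covers the "k OF (...)" form above.

# Boolean retrieval over a toy inverted index (term -> set of document IDs).
from collections import Counter

index = {
    "restaurants": {1, 2, 3, 5},
    "mideastern": {2, 5},
    "vegetarian": {3, 5, 7},
    "inexpensive": {1, 3, 5},
}

def docs(term):
    return index.get(term.lower(), set())

# restaurants AND (Mideastern OR vegetarian) AND inexpensive
hits = docs("restaurants") & (docs("Mideastern") | docs("vegetarian")) & docs("inexpensive")
print(sorted(hits))  # [3, 5]

def quorum(k, terms):
    # Documents containing at least k of the given terms, e.g., "4 OF (...)".
    counts = Counter(d for t in terms for d in docs(t))
    return {d for d, c in counts.items() if c >= k}

print(quorum(2, ["restaurants", "mideastern", "vegetarian", "inexpensive"]))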
(C) 2005, The University of Michigan 25
Boolean queries
• Weighting (Beethoven AND sonatas)
• precedence
coffee AND croissant OR muffin
raincoat AND umbrella OR sunglasses
• Use of negation: potential problems
• Conjunctive and Disjunctive normal forms
• Full CNF and DNF
(C) 2005, The University of Michigan 27
Boolean model
• Partition
• Partial relevance?
• Operators: AND, NOT, OR, parentheses
(C) 2005, The University of Michigan 28
Exercise
• D1 = “computer information retrieval”
• D2 = “computer retrieval”
• D3 = “information”
• D4 = “computer information”
• Q1 = “information retrieval”
• Q2 = “information ¬computer”
(C) 2005, The University of Michigan 29
Exercise
0 (none)
1 Swift
2 Shakespeare
3 Shakespeare Swift
4 Milton
5 Milton Swift
6 Milton Shakespeare
7 Milton Shakespeare Swift
8 Chaucer
9 Chaucer Swift
10 Chaucer Shakespeare
11 Chaucer Shakespeare Swift
12 Chaucer Milton
13 Chaucer Milton Swift
14 Chaucer Milton Shakespeare
15 Chaucer Milton Shakespeare Swift
((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
(C) 2005, The University of Michigan 30
Stop lists
• The 250-300 most common words in English account for 50% or more of a given text.
• Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” - another 10%. Next 12 words - another 10%.
• Moby Dick Ch.1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%).
• Token/type ratio: 2256/859 = 2.63
(C) 2005, The University of Michigan 31
Vector models
[Figure: documents Doc 1, Doc 2, Doc 3 as vectors in a space whose axes are Term 1, Term 2, Term 3]
(C) 2005, The University of Michigan 32
Vector queries
• Each document is represented as a vector
• non-efficient representations (bit vectors)
• dimensional compatibility
W1 W2 W3 W4 W5 W6 W7 W8 W9 W10
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
(C) 2005, The University of Michigan 33
The matching process
• Document space
• Matching is done between a document and a query (or between two documents)
• distance vs. similarity
• Euclidean distance, Manhattan distance, Word overlap, Jaccard coefficient, etc.
(C) 2005, The University of Michigan 34
Miscellaneous similarity measures
• The Cosine measure:

  sim(D,Q) = Σi (di × qi) / sqrt( Σi (di)² × Σi (qi)² ) = |X ∩ Y| / ( sqrt(|X|) × sqrt(|Y|) )

• The Jaccard coefficient:

  sim(D,Q) = |X ∩ Y| / |X ∪ Y|
(C) 2005, The University of Michigan 35
Exercise
• Compute the cosine measures (D1,D2) and (D1,D3) for the documents: D1 = <1,3>, D2 = <100,300> and D3 = <3,1>
• Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients.
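A short sketch of these measures for the exercise vectors (the Jaccard version used here is the generalized, weighted one; for sets of terms, use |X ∩ Y| / |X ∪ Y| as above).

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard(a, b):
    # generalized (weighted) Jaccard: sum of mins over sum of maxes
    return sum(min(x, y) for x, y in zip(a, b)) / sum(max(x, y) for x, y in zip(a, b))

D1, D2, D3 = (1, 3), (100, 300), (3, 1)
print(cosine(D1, D2), cosine(D1, D3))       # 1.0 (same direction) and 0.6
print(euclidean(D1, D2), manhattan(D1, D2))
print(jaccard(D1, D3))                      # (1 + 1) / (3 + 3) = 0.333...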
(C) 2005, The University of Michigan 36
Evaluation
(C) 2005, The University of Michigan 37
Relevance
• Difficult to define: fuzzy, inconsistent
• Methods: exhaustive, sampling, pooling, search-based
(C) 2005, The University of Michigan 38
Contingency table
                 retrieved     not retrieved
relevant             w               x           n1 = w + x
not relevant         y               z
                 n2 = w + y                      N
(C) 2005, The University of Michigan 39
Precision and Recall
Recall = w / (w + x)
Precision = w / (w + y)
(C) 2005, The University of Michigan 40
Exercise
Go to Google (www.google.com) and search for documents on Tolkien’s “Lord of the Rings”. Try different ways of phrasing the query: e.g., Tolkien, “JRR Tolkien”, +”JRR Tolkien” +”Lord of the Rings”, etc. For each query, compute the precision (P) based on the first 10 documents returned by Google.
Note! Before starting the exercise, have a clear idea of what a relevant document for your query should look like. Try different information needs.
Later, try different queries.
(C) 2005, The University of Michigan 41
n   Doc. no  Relevant?  Recall  Precision
1   588      x          0.2     1.00
2   589      x          0.4     1.00
3   576                 0.4     0.67
4   590      x          0.6     0.75
5   986                 0.6     0.60
6   592      x          0.8     0.67
7   984                 0.8     0.57
8   988                 0.8     0.50
9   578                 0.8     0.44
10  985                 0.8     0.40
11  103                 0.8     0.36
12  591                 0.8     0.33
13  772      x          1.0     0.38
14  990                 1.0     0.36
[From Salton’s book]
(C) 2005, The University of Michigan 42
P/R graph
[Figure: precision-recall curve; x-axis: recall (0 to 1), y-axis: precision (0 to 1)]

• Interpolated average precision (e.g., 11-point)
• Interpolation - what is the precision at recall = 0.5?
(C) 2005, The University of Michigan 43
Issues
• Why not use accuracy A = (w + z)/N?
• Average precision
• Average P at given “document cutoff values”
• Report when P = R
• F measure: Fβ = (β² + 1)PR / (β²P + R)
• F1 measure: F1 = 2 / (1/R + 1/P): the harmonic mean of P and R
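A small helper computing these measures from the contingency counts w, x, y, z defined earlier; the counts in the example call are made up.

def measures(w, x, y, z, beta=1.0):
    # w: relevant & retrieved, x: relevant & missed,
    # y: non-relevant & retrieved, z: non-relevant & not retrieved
    precision = w / (w + y)
    recall = w / (w + x)
    f = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    accuracy = (w + z) / (w + x + y + z)
    return precision, recall, f, accuracy

# Made-up counts: accuracy looks excellent (0.95) while recall is only 0.2,
# which is why accuracy is a poor IR measure on skewed collections.
print(measures(10, 40, 10, 940))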
(C) 2005, The University of Michigan 47
Relevance collections
• TREC ad hoc collections, 2-6 GB
• TREC Web collections, 2-100GB
(C) 2005, The University of Michigan 48
Sample TREC query

<top>
<num> Number: 305
<title> Most Dangerous Vehicles
<desc> Description:
Which are the most crashworthy, and least crashworthy, passenger vehicles?
<narr> Narrative:
A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example.
</top>
LA031689-0177, FT922-1008, LA090190-0126, LA101190-0218, LA082690-0158, LA112590-0109, FT944-136, LA020590-0119, FT944-5300, LA052190-0048, LA051689-0139, FT944-9371, LA032390-0172, LA042790-0172, LA021790-0136, LA092289-0167, LA111189-0013, LA120189-0179, LA020490-0021, LA122989-0063, LA091389-0119, LA072189-0048, FT944-15615, LA091589-0101, LA021289-0208
(C) 2005, The University of Michigan 49
<DOCNO> LA031689-0177 </DOCNO>
<DOCID> 31701 </DOCID>
<DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE>
<SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION>
<LENGTH><P>586 words </P></LENGTH>
<HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE>
<BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE>
<TEXT>
<P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P>
<P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws. </P>
<P>Several Fatalities </P>
<P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. </P>
<P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai. </P>
...
</TEXT>
<GRAPHIC><P> Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said. </P></GRAPHIC>
<SUBJECT><P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P></SUBJECT>
</DOC>
(C) 2005, The University of Michigan 50
TREC (cont’d)
• http://trec.nist.gov/tracks.html
• http://trec.nist.gov/presentations/presentations.html
(C) 2005, The University of Michigan 51
Word distribution models
(C) 2005, The University of Michigan 52
Shakespeare
• Romeo and Juliet:
And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; You, 289; Thou, 277; Me, 262; Not, 257; With, 234; It, 224; For, 223; This, 215; Be, 207; But, 181; Thy, 167; What, 163; O, 160; As, 156; Her, 150; Will, 147; So, 145; Thee, 139; Love, 135; His, 128; Have, 127; He, 120; Romeo, 115; By, 114; She, 114; Shall, 107; Your, 103; No, 102; Come, 96; Him, 96; All, 92; Do, 89; From, 86; Then, 83; Good, 82; Now, 82; Here, 80; If, 80; An, 78; Go, 76; On, 76; I'll, 71; Death, 69; Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There, 64; Hath, 63; Which, 60;
• …
• A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; Abound'st, 1; Abroach, 1; Absolved, 1; Abuse, 1; Abused, 1; Abuses, 1; Accents, 1; Access, 1; Accident, 1; Accidents, 1; According, 1; Accursed, 1; Accustom'd, 1; Ache, 1; Aches, 1; Aching, 1; Acknowledge, 1; Acquaint, 1; Acquaintance, 1; Acted, 1; Acting, 1; Action, 1; Acts, 1; Adam, 1; Add, 1; Added, 1; Adding, 1; Addle, 1; Adjacent, 1; Admired, 1; Ado, 1; Advance, 1; Adversary, 1; Adversity's, 1; Advise, 1; Afeard, 1; Affecting, 1; Afflicted, 1; Affliction, 1; Affords, 1; Affray, 1; Affright, 1; Afire, 1; Agate-stone, 1; Agile, 1; Agree, 1; Agrees, 1; Aim'd, 1; Alderman, 1; All-cheering, 1; All-seeing, 1; Alla, 1; Alliance, 1; Alligator, 1; Allow, 1; Ally, 1; Although, 1;
http://www.mta75.org/curriculum/english/Shakes/indexx.html
(C) 2005, The University of Michigan 53
The BNC (Adam Kilgarriff)
1   6187267  the   det
2   4239632  be    v
3   3093444  of    prep
4   2687863  and   conj
5   2186369  a     det
6   1924315  in    prep
7   1620850  to    infinitive-marker
8   1375636  have  v
9   1090186  it    pron
10  1039323  to    prep
11  887877   for   prep
12  884599   i     pron
13  760399   that  conj
14  695498   you   pron
15  681255   he    pron
16  680739   on    prep
17  675027   with  prep
18  559596   do    v
19  534162   at    prep
20  517171   by    prep
Kilgarriff, A. Putting Frequencies in the Dictionary. International Journal of Lexicography 10(2), 1997, pp. 135-155.
(C) 2005, The University of Michigan 54
Zipf’s law
Rank × Frequency ≈ Constant

Rank  Term  Freq.    Z        Rank  Term  Freq.   Z
1     the   69,971   0.070    6     in    21,341  0.128
2     of    36,411   0.073    7     that  10,595  0.074
3     and   28,852   0.086    8     is    10,099  0.081
4     to    26,149   0.104    9     was   9,816   0.088
5     a     23,237   0.116    10    he    9,543   0.095
(C) 2005, The University of Michigan 56
Zipf's law is fairly general!
• Frequency of accesses to web pages
  – access counts on Wikipedia pages, with s approximately equal to 0.3
  – page access counts on the Polish Wikipedia (data for late July 2003) approximately obey Zipf's law with s about 0.5
• Words in the English language
  – for instance, in Shakespeare’s play Hamlet, with s approximately 0.5
• Sizes of settlements
• Income distributions among individuals
• Sizes of earthquakes
• Notes in musical performances
http://en.wikipedia.org/wiki/Zipf's_law
(C) 2005, The University of Michigan 57
Zipf’s law (cont’d)
• Limitations:
  – Low and high frequencies
  – Lack of convergence
• Power law with coefficient c = -1:
  y = k·x^c
• Li (1992) – typing words one letter at a time, including spaces
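A quick sketch for checking rank × frequency on any plain-text file; the filename is a placeholder.

from collections import Counter
import re

text = open("sample.txt").read().lower()   # placeholder filename
counts = Counter(re.findall(r"[a-z']+", text))
ranked = sorted(counts.items(), key=lambda kv: -kv[1])
for rank, (word, freq) in enumerate(ranked[:10], start=1):
    print(rank, word, freq, rank * freq)   # rank * freq should stay roughly constant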
(C) 2005, The University of Michigan 60
Indexing
(C) 2005, The University of Michigan 61
Methods
• Manual: e.g., Library of Congress subject headings, MeSH
• Automatic
(C) 2005, The University of Michigan 62
LOC subject headings
http://www.loc.gov/catdir/cpso/lcco/lcco.html
A -- GENERAL WORKS
B -- PHILOSOPHY. PSYCHOLOGY. RELIGION
C -- AUXILIARY SCIENCES OF HISTORY
D -- HISTORY (GENERAL) AND HISTORY OF EUROPE
E -- HISTORY: AMERICA
F -- HISTORY: AMERICA
G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION
H -- SOCIAL SCIENCES
J -- POLITICAL SCIENCE
K -- LAW
L -- EDUCATION
M -- MUSIC AND BOOKS ON MUSIC
N -- FINE ARTS
P -- LANGUAGE AND LITERATURE
Q -- SCIENCE
R -- MEDICINE
S -- AGRICULTURE
T -- TECHNOLOGY
U -- MILITARY SCIENCE
V -- NAVAL SCIENCE
Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)
(C) 2005, The University of Michigan 63
Medicine

CLASS R - MEDICINE
Subclass R
R5-920        Medicine (General)
R5-130.5      General works
R131-687      History of medicine. Medical expeditions
R690-697      Medicine as a profession. Physicians
R702-703      Medicine and the humanities. Medicine and disease in relation to history, literature, etc.
R711-713.97   Directories
R722-722.32   Missionary medicine. Medical missionaries
R723-726      Medical philosophy. Medical ethics
R726.5-726.8  Medicine and disease in relation to psychology. Terminal care. Dying
R727-727.5    Medical personnel and the public. Physician and the public
R728-733      Practice of medicine. Medical practice economics
R735-854      Medical education. Medical schools. Research
R855-855.5    Medical technology
R856-857      Biomedical engineering. Electronics. Instrumentation
R858-859.7    Computer applications to medicine. Medical informatics
R864          Medical records
R895-920      Medical physics. Medical radiology. Nuclear medicine
(C) 2005, The University of Michigan 64
Finding the most frequent terms in a document
• Typically stop words: the, and, in
• Not content-bearing
• Terms vs. words
• Luhn’s method
(C) 2005, The University of Michigan 65
Luhn’s method
[Figure: word frequency curve for Luhn's method; x-axis: words, y-axis: frequency]
(C) 2005, The University of Michigan 66
Computing term salience
• Term frequency (TF)
• Document frequency (DF)
• Inverse document frequency (IDF)
IDF(w) = log( N / DF(w) )
(C) 2005, The University of Michigan 67
Applications of TFIDF
• Cosine similarity
• Indexing
• Clustering
(C) 2005, The University of Michigan 68
Vector-based matching
• The cosine measure
sim (D,C) =
(dk . ck . idf(k))
(dk)2 . (ck)2
k
k
k
(C) 2005, The University of Michigan 69
IDF: Inverse document frequency
N: number of documents
dk: number of documents containing term k
fik: absolute frequency of term k in document i
wik: weight of term k in document i

idfk = log2(N/dk) + 1 = log2N - log2dk + 1

TF * IDF is used for automated indexing and for topic discrimination:

wik = fik · (log2N - log2dk + 1)
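A minimal sketch of this weighting over a toy three-document collection (reusing the documents from the relevance-feedback example later in the deck), following wik = fik · (log2(N/dk) + 1).

import math
from collections import Counter

collection = [
    "car safety minivans tests injury statistics",
    "liability tests safety",
    "car passengers injury reviews",
]
N = len(collection)
tokenized = [doc.split() for doc in collection]
# dk: number of documents containing term k
df = Counter(term for toks in tokenized for term in set(toks))

def tfidf(toks):
    tf = Counter(toks)  # fik: frequency of term k in this document
    return {t: f * (math.log2(N / df[t]) + 1) for t, f in tf.items()}

for toks in tokenized:
    print(tfidf(toks))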
(C) 2005, The University of Michigan 70
Asian and European news
622.941 deng
306.835 china
196.725 beijing
153.608 chinese
152.113 xiaoping
124.591 jiang
108.777 communist
102.894 body
85.173 party
71.898 died
68.820 leader
43.402 state
38.166 people

97.487 nato
92.151 albright
74.652 belgrade
46.657 enlargement
34.778 alliance
34.778 french
33.803 opposition
32.571 russia
14.095 government
9.389 told
9.154 would
8.459 their
6.059 which
(C) 2005, The University of Michigan 71
Other topics
120.385 shuttle
99.487 space
90.128 telescope
70.224 hubble
59.992 rocket
50.160 astronauts
49.722 discovery
47.782 canaveral
47.782 cape
40.889 mission
35.778 florida
27.063 center

74.652 compuserve
65.321 massey
55.989 salizzoni
29.996 bob
27.994 online
27.198 executive
15.890 interim
15.271 chief
11.647 service
11.174 second
6.781 world
6.315 president
(C) 2005, The University of Michigan 72
Compression
(C) 2005, The University of Michigan 73
Compression
• Methods
  – Fixed length codes
  – Huffman coding
  – Ziv-Lempel codes
(C) 2005, The University of Michigan 74
Fixed length codes
• Binary representations
  – ASCII
  – Representational power (2^k symbols, where k is the number of bits)
(C) 2005, The University of Michigan 75
Variable length codes
• Alphabet:
A .-      N -.      0 -----
B -...    O ---     1 .----
C -.-.    P .--.    2 ..---
D -..     Q --.-    3 ...--
E .       R .-.     4 ....-
F ..-.    S ...     5 .....
G --.     T -       6 -....
H ....    U ..-     7 --...
I ..      V ...-    8 ---..
J .---    W .--     9 ----.
K -.-     X -..-
L .-..    Y -.--
M --      Z --..
• Demo:– http://www.babbage.demon.co.uk/morse.html– http://www.scphillips.com/morse/
(C) 2005, The University of Michigan 76
Most frequent letters in English
• Most frequent letters: E T A O I N S H R D L U
  – http://www.math.cornell.edu/~mec/modules/cryptography/subs/frequencies.html
• Demo:
  – http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm
• Also: bigrams: TH HE IN ER AN RE ND AT ON NT
  – http://www.math.cornell.edu/~mec/modules/cryptography/subs/digraphs.html
(C) 2005, The University of Michigan 77
Useful links about cryptography
• http://world.std.com/~franl/crypto.html
• http://www.faqs.org/faqs/cryptography-faq/
• http://en.wikipedia.org/wiki/Cryptography
(C) 2005, The University of Michigan 78
Huffman coding
• Developed by David Huffman (1952)
• Average of 5 bits per character (37.5% compression)
• Based on frequency distributions of symbols
• Algorithm: iteratively build a tree of symbols, starting with the two least frequent symbols
(C) 2005, The University of Michigan 79
Symbol Frequency
A 7
B 4
C 10
D 5
E 2
F 11
G 15
H 3
I 7
J 8
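A compact sketch of the tree-building loop using a heap; the codes it prints are valid Huffman codes for these frequencies but may differ from the table on the next slide, since ties can be broken in different orders.

import heapq

freqs = {"A": 7, "B": 4, "C": 10, "D": 5, "E": 2,
         "F": 11, "G": 15, "H": 3, "I": 7, "J": 8}

# Heap entries: (frequency, tiebreaker, tree), where a tree is a symbol
# or a (left, right) pair of subtrees.
heap = [(f, i, s) for i, (s, f) in enumerate(sorted(freqs.items()))]
heapq.heapify(heap)
counter = len(heap)
while len(heap) > 1:
    f1, _, t1 = heapq.heappop(heap)   # the two least frequent subtrees
    f2, _, t2 = heapq.heappop(heap)
    heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))
    counter += 1

def codes(tree, prefix=""):
    # Walk the tree: 0 for the left branch, 1 for the right branch.
    if isinstance(tree, str):
        return {tree: prefix or "0"}
    left, right = tree
    return {**codes(left, prefix + "0"), **codes(right, prefix + "1")}

print(codes(heap[0][2]))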
(C) 2005, The University of Michigan 80
[Figure: Huffman tree for the frequencies above; edges labeled 0 (left) and 1 (right), leaves a-j]
(C) 2005, The University of Michigan 81
Symbol Code
A 0110
B 0010
C 000
D 0011
E 01110
F 010
G 10
H 01111
I 110
J 111
(C) 2005, The University of Michigan 82
Exercise
• Consider the bit string: 01101101111000100110001110100111000110101101011101
• Use the Huffman code from the example to decode it.
• Try inserting, deleting, and switching some bits at random locations and try decoding.
(C) 2005, The University of Michigan 83
Ziv-Lempel coding
• Two types - one is known as LZ77 (used in GZIP)
• Code: set of triples <a,b,c>
  – a: how far back in the decoded text to look for the upcoming text segment
  – b: how many characters to copy
  – c: new character to add to complete the segment
(C) 2005, The University of Michigan 84
• <0,0,p>  p
• <0,0,e>  pe
• <0,0,t>  pet
• <2,1,r>  peter
• <0,0,_>  peter_
• <6,1,i>  peter_pi
• <8,2,r>  peter_piper
• <6,3,c>  peter_piper_pic
• <0,0,k>  peter_piper_pick
• <7,1,d>  peter_piper_picked
• <7,1,a>  peter_piper_picked_a
• <9,2,e>  peter_piper_picked_a_pe
• <9,2,_>  peter_piper_picked_a_peck_
• <0,0,o>  peter_piper_picked_a_peck_o
• <0,0,f>  peter_piper_picked_a_peck_of
• <17,5,l>  peter_piper_picked_a_peck_of_pickl
• <12,1,d>  peter_piper_picked_a_peck_of_pickled
• <16,3,p>  peter_piper_picked_a_peck_of_pickled_pep
• <3,2,r>  peter_piper_picked_a_peck_of_pickled_pepper
• <0,0,s>  peter_piper_picked_a_peck_of_pickled_peppers
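A small decoder for triples of exactly this <back, copy-length, new-char> form; fed the first seven triples above, it rebuilds peter_piper.

def lz77_decode(triples):
    out = []
    for back, length, char in triples:
        start = len(out) - back
        for i in range(length):
            out.append(out[start + i])   # copy from earlier decoded text
        out.append(char)                 # then add the new character
    return "".join(out)

steps = [(0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"),
         (0, 0, "_"), (6, 1, "i"), (8, 2, "r")]
print(lz77_decode(steps))   # peter_piper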
(C) 2005, The University of Michigan 85
Links on text compression
• Data compression:– http://www.data-compression.info/
• Calgary corpus:– http://links.uwaterloo.ca/calgary.corpus.html
• Huffman coding:– http://www.compressconsult.com/huffman/– http://en.wikipedia.org/wiki/Huffman_coding
• LZ– http://en.wikipedia.org/wiki/LZ77
(C) 2005, The University of Michigan 86
Relevance feedback and
query expansion
(C) 2005, The University of Michigan 87
Relevance feedback
• Problem: initial query may not be the most appropriate to satisfy a given information need.
• Idea: modify the original query so that it gets closer to the right documents in the vector space
(C) 2005, The University of Michigan 88
Relevance feedback
• Automatic
• Manual
• Method: identifying feedback terms

  Q' = a1·Q + a2·R - a3·N

• Often a1 = 1, a2 = 1/|R|, and a3 = 1/|N|
(C) 2005, The University of Michigan 89
Example
• Q = “safety minivans”
• D1 = “car safety minivans tests injury statistics” - relevant
• D2 = “liability tests safety” - relevant
• D3 = “car passengers injury reviews” - non-relevant
• R = ?
• S = ?
• Q’ = ?
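A sketch of the feedback step on this example, with a1 = 1, a2 = 1/|R|, a3 = 1/|N| (treating D3 as the non-relevant set).

from collections import Counter

def vec(text):
    return Counter(text.split())

Q = vec("safety minivans")
R = [vec("car safety minivans tests injury statistics"),
     vec("liability tests safety")]                     # relevant
N = [vec("car passengers injury reviews")]              # non-relevant

Qp = Counter(Q)                 # Q' = Q + (1/|R|) sum(R) - (1/|N|) sum(N)
for d in R:
    for t, f in d.items():
        Qp[t] += f / len(R)
for d in N:
    for t, f in d.items():
        Qp[t] -= f / len(N)

print({t: round(w, 2) for t, w in Qp.items() if w > 0})   # expanded query Q'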
(C) 2005, The University of Michigan 90
Pseudo relevance feedback
• Automatic query expansion
  – Thesaurus-based expansion (e.g., using latent semantic indexing - later...)
  – Distributional similarity
  – Query log mining
(C) 2005, The University of Michigan 106
String matching
(C) 2005, The University of Michigan 107
String matching methods
• Index-based
• Full or approximate
  – E.g., theater = theatre
(C) 2005, The University of Michigan 108
Index-based matching
• Inverted files
• Position-based inverted files
• Block-based inverted files
1 6 9 11 17 19 24 28 33 40 46 50 55 60
This is a text. A text has many words. Words are made from letters.
Text: 11, 19
Words: 33, 40
From: 55
(C) 2005, The University of Michigan 109
Inverted index (trie)
[Figure: trie over the index terms, branching on initial letters l, m (then "ad"/"an"), t, w]

Letters: 60
Text: 11, 19
Words: 33, 40
Made: 50
Many: 28
(C) 2005, The University of Michigan 110
Sequential searching
• No indexing structure given
• Given: database d and search pattern p.
  – Example: find “words” in the earlier example
• Brute force method
  – try all possible starting positions
  – O(n) positions in the database and O(m) characters in the pattern, so the total worst-case runtime is O(mn)
  – typical runtime is actually O(n), given that mismatches are easy to notice
(C) 2005, The University of Michigan 111
Knuth-Morris-Pratt
• Average runtime similar to brute force
• Worst case runtime is linear: O(n)
• Idea: reuse knowledge
• Need preprocessing of the pattern
(C) 2005, The University of Michigan 112
Knuth-Morris-Pratt (cont’d)
• Example (http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm)
database: ABC ABC ABC ABDAB ABCDABCDABDE
pattern: ABCDABD
index   0   1   2   3   4   5   6   7
char    A   B   C   D   A   B   D   -
pos    -1   0   0   0   0   1   2   0
(C) 2005, The University of Michigan 113
Knuth-Morris-Pratt (cont’d)

[Trace: successive alignments of the pattern ABCDABD against the text; at each mismatch (^), the pattern is shifted according to the pos table rather than restarted from scratch]
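A compact sketch of the algorithm (failure table plus scan), run on the text from the Wikipedia example the earlier slide cites.

def failure(p):
    # fail[i] = length of the longest proper prefix of p that is also
    # a suffix of p[:i]
    fail = [0] * (len(p) + 1)
    fail[0] = -1
    k = -1
    for i in range(len(p)):
        while k >= 0 and p[k] != p[i]:
            k = fail[k]
        k += 1
        fail[i + 1] = k
    return fail

def kmp_search(text, p):
    fail, k, hits = failure(p), 0, []
    for i, c in enumerate(text):
        while k >= 0 and p[k] != c:
            k = fail[k]               # reuse knowledge: shift by the table
        k += 1
        if k == len(p):
            hits.append(i - len(p) + 1)
            k = fail[k]
    return hits

print(failure("ABCDABD"))                                  # [-1, 0, 0, 0, 0, 1, 2, 0]
print(kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD"))    # [15]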
(C) 2005, The University of Michigan 115
Word similarity
• Hamming distance - when words are of the same length
• Levenshtein distance - number of edits (insertions, deletions, replacements)
  – color --> colour (1)
  – survey --> surgery (2)
  – com puter --> computer ?
• Longest common subsequence (LCS)
  – lcs(survey, surgery) = surey
(C) 2005, The University of Michigan 116
Levenshtein edit distance
• Examples:
  – Theatre -> theater
  – Ghaddafi -> Qadafi
  – Computer -> counter
• Edit distance (inserts, deletes, substitutions)
  – Edit transcript
• Done through dynamic programming
(C) 2005, The University of Michigan 117
Recurrence relation
• Three dependencies
  – D(i,0) = i
  – D(0,j) = j
  – D(i,j) = min[ D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1)+t(i,j) ]
• Simple edit distance:
  – t(i,j) = 0 iff S1(i) = S2(j), 1 otherwise
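A direct transcription of the recurrence into Python (a sketch; the test strings come from the surrounding slides).

def edit_distance(s1, s2):
    # D[i][j] = edit distance between s1[:i] and s2[:j]
    D = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        D[i][0] = i
    for j in range(len(s2) + 1):
        D[0][j] = j
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + t)    # match / substitution
    return D[-1][-1]

print(edit_distance("vintner", "writers"))   # 5, as in the table on the next slides
print(edit_distance("color", "colour"))      # 1
print(edit_distance("survey", "surgery"))    # 2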
(C) 2005, The University of Michigan 118
Example
Gusfield 1997
W R I T E R S
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6 7
V 1 1
I 2 2
N 3 3
T 4 4
N 5 5
E 6 6
R 7 7
(C) 2005, The University of Michigan 119
Example (cont’d)
Gusfield 1997
W R I T E R S
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6 7
V 1 1 1 2 3 4 5 6 7
I 2 2 2 2 2 3 4 5 6
N 3 3 3 3 3 3 4 5 6
T 4 4 4 4 4 *
N 5 5
E 6 6
R 7 7
(C) 2005, The University of Michigan 120
Tracebacks
Gusfield 1997
W R I T E R S
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6 7
V 1 1 1 2 3 4 5 6 7
I 2 2 2 2 2 3 4 5 6
N 3 3 3 3 3 3 4 5 6
T 4 4 4 4 4 *
N 5 5
E 6 6
R 7 7
(C) 2005, The University of Michigan 121
Weighted edit distance
• Used to emphasize the relative cost of different edit operations
• Useful in bioinformatics
  – Homology information
  – BLAST
  – Blosum
  – http://eta.embl-heidelberg.de:8000/misc/mat/blosum50.html
(C) 2005, The University of Michigan 122
• Web sites:
  – http://www.merriampark.com/ld.htm
  – http://odur.let.rug.nl/~kleiweg/lev/
(C) 2005, The University of Michigan 123
Clustering
(C) 2005, The University of Michigan 124
Clustering
• Exclusive/overlapping clusters
• Hierarchical/flat clusters
• The cluster hypothesis
  – Documents in the same cluster are relevant to the same query
(C) 2005, The University of Michigan 125
Representations for document clustering
• Typically: vector-based
  – Words: “cat”, “dog”, etc.
  – Features: document length, author name, etc.
• Each document is represented as a vector in an n-dimensional space
• Similar documents appear nearby in the vector space (distance measures are needed)
(C) 2005, The University of Michigan 126
Hierarchical clustering
• Dendrograms
• E.g., language similarity:
  http://odur.let.rug.nl/~kleiweg/clustering/clustering.html
(C) 2005, The University of Michigan 127
Another example
• Kingdom = animal
• Phylum = Chordata
• Subphylum = Vertebrata
• Class = Osteichthyes
• Subclass = Actinoptergyii
• Order = Salmoniformes
• Family = Salmonidae
• Genus = Oncorhynchus
• Species = Oncorhynchus kisutch (Coho salmon)
(C) 2005, The University of Michigan 128
Clustering using dendrograms
REPEAT
  Compute pairwise similarities
  Identify closest pair
  Merge pair into single node
UNTIL only one node left

Q: what is the equivalent Venn diagram representation?

Example: cluster the following sentences (see the sketch below):
A B C B A
A D C C A D E
C D E F C D A
E F G F D A
A C D A B A
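A minimal sketch of the REPEAT/UNTIL loop above, using single-linkage and word-overlap (Jaccard) similarity over the example sentences; the similarity choice is illustrative, not the one the slides assume.

# Single-linkage agglomerative clustering of the example "sentences".
sents = ["A B C B A", "A D C C A D E", "C D E F C D A",
         "E F G F D A", "A C D A B A"]

def sim(c1, c2):
    best = 0.0
    for s1 in c1:
        for s2 in c2:
            a, b = set(s1.split()), set(s2.split())
            best = max(best, len(a & b) / len(a | b))   # single linkage: best pair
    return best

clusters = [[s] for s in sents]
while len(clusters) > 1:
    score, i, j = max((sim(c1, c2), i, j)
                      for i, c1 in enumerate(clusters)
                      for j, c2 in enumerate(clusters) if i < j)
    clusters[i] += clusters[j]    # merge the closest pair into a single node
    del clusters[j]
    print(score, clusters)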
(C) 2005, The University of Michigan 129
Methods
• Single-linkage
  – One common pair is sufficient
  – Disadvantages: long chains
• Complete-linkage
  – All pairs have to match
  – Disadvantages: too conservative
• Average-linkage
• Centroid-based (online)
  – Look at distances to centroids
• Demo: /clair4/class/ir-w05/clustering
(C) 2005, The University of Michigan 130
k-means
• Needed: small number k of desired clusters
• hard vs. soft decisions
• Example: Weka
(C) 2005, The University of Michigan 131
k-means
1 initialize cluster centroids to arbitrary vectors
2 while further improvement is possible do
3 for each document d do
4 find the cluster c whose centroid is closest to d
5 assign d to cluster c
6 end for
7 for each cluster c do
8 recompute the centroid of cluster c based on its documents
9 end for
10 end while
(C) 2005, The University of Michigan 132
Example
• Cluster the following vectors into two groups:
  – A = <1,6>
  – B = <2,2>
  – C = <4,0>
  – D = <3,3>
  – E = <2,5>
  – F = <2,1>
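A runnable sketch of the pseudocode for the exercise points (assumes k = 2 and a fixed number of passes; a production version would test for convergence instead).

import random

points = [(1, 6), (2, 2), (4, 0), (3, 3), (2, 5), (2, 1)]
k = 2
centroids = random.sample(points, k)    # initialize centroids arbitrarily
for _ in range(10):                     # a few passes suffice here
    clusters = [[] for _ in range(k)]
    for p in points:                    # assign each point to nearest centroid
        c = min(range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 +
                              (p[1] - centroids[i][1]) ** 2)
        clusters[c].append(p)
    for i, cl in enumerate(clusters):   # recompute centroids from members
        if cl:
            centroids[i] = (sum(x for x, _ in cl) / len(cl),
                            sum(y for _, y in cl) / len(cl))
print(clusters)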
(C) 2005, The University of Michigan 133
Complexity
• Complexity = O(kn) per iteration: at each step, all n documents have to be compared to k centroids.
(C) 2005, The University of Michigan 136
Human clustering
• Significant disagreement in the number of clusters, overlap of clusters, and the composition of clusters (Macskassy et al. 1998).
(C) 2005, The University of Michigan 137
Lexical networks
(C) 2005, The University of Michigan 138
Lexical Networks
• Used to represent relationships between words
• Example: WordNet - created by George Miller’s team at Princeton
• Based on synsets (synonyms, interchangeable words) and lexical matrices
(C) 2005, The University of Michigan 139
Lexical matrix
                       Word Forms
Word Meanings    F1      F2      F3     ...    Fn
M1               E1,1    E1,2
M2                       E2,2
...
Mm                                             Em,n
(C) 2005, The University of Michigan 140
Synsets
• Disambiguation
  – {board, plank}
  – {board, committee}
• Synonyms
  – substitution
  – weak substitution
  – synonyms must be of the same part of speech
(C) 2005, The University of Michigan 141
$ ./wn board -hypen
Synonyms/Hypernyms (Ordered by Frequency) of noun board
9 senses of board
Sense 1
board
 => committee, commission
    => administrative unit
       => unit, social unit
          => organization, organisation
             => social group
                => group, grouping

Sense 2
board
 => sheet, flat solid
    => artifact, artefact
       => object, physical object
          => entity, something

Sense 3
board, plank
 => lumber, timber
    => building material
       => artifact, artefact
          => object, physical object
             => entity, something
(C) 2005, The University of Michigan 142
Sense 4
display panel, display board, board
 => display
    => electronic device
       => device
          => instrumentality, instrumentation
             => artifact, artefact
                => object, physical object
                   => entity, something

Sense 5
board, gameboard
 => surface
    => artifact, artefact
       => object, physical object
          => entity, something

Sense 6
board, table
 => fare
    => food, nutrient
       => substance, matter
          => object, physical object
             => entity, something
(C) 2005, The University of Michigan 143
Sense 7
control panel, instrument panel, control board, board, panel
 => electrical device
    => device
       => instrumentality, instrumentation
          => artifact, artefact
             => object, physical object
                => entity, something

Sense 8
circuit board, circuit card, board, card
 => printed circuit
    => computer circuit
       => circuit, electrical circuit, electric circuit
          => electrical device
             => device
                => instrumentality, instrumentation
                   => artifact, artefact
                      => object, physical object
                         => entity, something

Sense 9
dining table, board
 => table
    => furniture, piece of furniture, article of furniture
       => furnishings
          => instrumentality, instrumentation
             => artifact, artefact
                => object, physical object
                   => entity, something
(C) 2005, The University of Michigan 144
Antonymy
• “x” vs. “not-x”
• “rich” vs. “poor”?
• {rise, ascend} vs. {fall, descend}
(C) 2005, The University of Michigan 145
Other relations
• Meronymy: X is a meronym of Y when native speakers of English accept sentences similar to “X is a part of Y”, “X is a member of Y”.
• Hyponymy: {tree} is a hyponym of {plant}.
• Hierarchical structure based on hyponymy (and hypernymy).
(C) 2005, The University of Michigan 146
Other features of WordNet
• Index of familiarity
• Polysemy
(C) 2005, The University of Michigan 147
Familiarity and polysemy

board used as a noun is familiar (polysemy count = 9)
bird used as a noun is common (polysemy count = 5)
cat used as a noun is common (polysemy count = 7)
house used as a noun is familiar (polysemy count = 11)
information used as a noun is common (polysemy count = 5)
retrieval used as a noun is uncommon (polysemy count = 3)
serendipity used as a noun is very rare (polysemy count = 1)
(C) 2005, The University of Michigan 148
Compound nouns
advisory board
appeals board
backboard
backgammon board
baseboard
basketball backboard
big board
billboard
binder's board
binder board
blackboard
board game
board measure
board meeting
board member
board of appeals
board of directors
board of education
board of regents
board of trustees
(C) 2005, The University of Michigan 149
Overview of senses
1. board -- (a committee having supervisory powers; "the board has seven members")
2. board -- (a flat piece of material designed for a special purpose; "he nailed boards across the windows")
3. board, plank -- (a stout length of sawn timber; made in a wide variety of sizes and used for many purposes)
4. display panel, display board, board -- (a board on which information can be displayed to public view)
5. board, gameboard -- (a flat portable surface (usually rectangular) designed for board games; "he got out the board and set up the pieces")
6. board, table -- (food or meals in general; "she sets a fine table"; "room and board")
7. control panel, instrument panel, control board, board, panel -- (an insulated panel containing switches and dials and meters for controlling electrical devices; "he checked the instrument panel"; "suddenly the board lit up like a Christmas tree")
8. circuit board, circuit card, board, card -- (a printed circuit that can be inserted into expansion slots in a computer to increase the computer's capabilities)
9. dining table, board -- (a table at which meals are served; "he helped her clear the dining table"; "a feast was spread upon the board")
(C) 2005, The University of Michigan 150
Top-level concepts

{act, action, activity}
{animal, fauna}
{artifact}
{attribute, property}
{body, corpus}
{cognition, knowledge}
{communication}
{event, happening}
{feeling, emotion}
{food}
{group, collection}
{location, place}
{motive}
{natural object}
{natural phenomenon}
{person, human being}
{plant, flora}
{possession}
{process}
{quantity, amount}
{relation}
{shape}
{state, condition}
{substance}
{time}
(C) 2005, The University of Michigan 151
WordNet and DistSim
wn reason -hypen - hypernyms
wn reason -synsn - synsets
wn reason -simsn - synonyms
wn reason -over - overview of senses
wn reason -famln - familiarity/polysemy
wn reason -grepn - compound nouns
/data2/tools/relatedwords/relate reason
(C) 2005, The University of Michigan 152
System comparison
(C) 2005, The University of Michigan 153
Comparing two systems
• Comparing A and B
• One query?
• Average performance?
• Need: A to consistently outperform B
[this slide: courtesy James Allan]
(C) 2005, The University of Michigan 154
The sign test
• Example 1:
  – A > B (12 times)
  – A = B (25 times)
  – A < B (3 times)
  – p < 0.035 (significant at the 5% level)
• Example 2:
  – A > B (18 times)
  – A < B (9 times)
  – p < 0.122 (not significant at the 5% level)

[this slide: courtesy James Allan]
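A small sketch that reproduces both examples (two-tailed binomial sign test, ignoring ties, as in the slide).

from math import comb

def sign_test(wins, losses):
    # two-tailed binomial sign test; ties are ignored
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(round(sign_test(12, 3), 3))   # 0.035
print(round(sign_test(18, 9), 3))   # 0.122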
(C) 2005, The University of Michigan 155
Other tests
• The t test:
  – Takes into account the actual performances, not just which system is better
  – http://nimitz.mcs.kent.edu/~blewis/stat/tTest.html
• The sign test:
  – http://www.fon.hum.uva.nl/Service/Statistics/Sign_Test.html
(C) 2005, The University of Michigan 156
Techniques for dimensionality reduction: SVD and LSI
(C) 2005, The University of Michigan 157
Techniques for dimensionality reduction
• Based on matrix decomposition (goal: preserve clusters, explain away variance)
• A quick review of matrices
  – Vectors
  – Matrices
  – Matrix multiplication

[Example: multiplying a 3x3 matrix by a vector]
(C) 2005, The University of Michigan 158
SVD: Singular Value Decomposition
• A = U·S·V^T (S is the diagonal matrix of singular values)
• This decomposition exists for all matrices, dense or sparse
• If A has 3 rows and 5 columns, then U will be 3x3 and V will be 5x5
• In Matlab, use [U,S,V] = svd (A)
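The same call in Python/NumPy on a made-up 2x3 matrix (note that numpy.linalg.svd returns the singular values as a vector rather than as a matrix, unlike Matlab).

import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])          # 2 terms x 3 documents (made up)
U, s, Vt = np.linalg.svd(A)              # A = U @ S @ Vt; s holds singular values
S = np.zeros_like(A)
S[:len(s), :len(s)] = np.diag(s)
print(np.allclose(A, U @ S @ Vt))        # True: the decomposition is exact

A1 = s[0] * np.outer(U[:, 0], Vt[0, :])  # rank-1 approximation (largest value only)
print(np.round(A1, 2))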
(C) 2005, The University of Michigan 159
Term matrix normalization
[Example: a 0/1 term-by-document matrix A (columns D1-D5) and its normalized version A(n), in which each column is scaled to unit Euclidean length, so entries become values such as 0.71 = 1/sqrt(2) and 0.58 = 1/sqrt(3)]
(C) 2005, The University of Michigan 160
Example (Berry and Browne)
• T1: baby
• T2: child
• T3: guide
• T4: health
• T5: home
• T6: infant
• T7: proofing
• T8: safety
• T9: toddler

• D1: infant & toddler first aid
• D2: babies & children’s room (for your home)
• D3: child safety at home
• D4: your baby’s health and safety: from infant to toddler
• D5: baby proofing basics
• D6: your guide to easy rust proofing
• D7: beanie babies collector’s guide
(C) 2005, The University of Michigan 161
Document term matrix
[Matrix A: 9 terms (rows T1-T9) by 7 documents (columns D1-D7), with a 1 wherever the term occurs in the document, and its column-normalized version A(n)]
(C) 2005, The University of Michigan 162
Decomposition

u =
 -0.6976 -0.0945  0.0174 -0.6950  0.0000  0.0153  0.1442 -0.0000  0
 -0.2622  0.2946  0.4693  0.1968 -0.0000 -0.2467 -0.1571 -0.6356  0.3098
 -0.3519 -0.4495 -0.1026  0.4014  0.7071 -0.0065 -0.0493 -0.0000  0.0000
 -0.1127  0.1416 -0.1478 -0.0734  0.0000  0.4842 -0.8400  0.0000 -0.0000
 -0.2622  0.2946  0.4693  0.1968  0.0000 -0.2467 -0.1571  0.6356 -0.3098
 -0.1883  0.3756 -0.5035  0.1273 -0.0000 -0.2293  0.0339 -0.3098 -0.6356
 -0.3519 -0.4495 -0.1026  0.4014 -0.7071 -0.0065 -0.0493  0.0000 -0.0000
 -0.2112  0.3334  0.0962  0.2819 -0.0000  0.7338  0.4659 -0.0000  0.0000
 -0.1883  0.3756 -0.5035  0.1273 -0.0000 -0.2293  0.0339  0.3098  0.6356
v =
 -0.1687  0.4192 -0.5986  0.2261  0      -0.5720  0.2433
 -0.4472  0.2255  0.4641 -0.2187  0.0000 -0.4871 -0.4987
 -0.2692  0.4206  0.5024  0.4900 -0.0000  0.2450  0.4451
 -0.3970  0.4003 -0.3923 -0.1305  0       0.6124 -0.3690
 -0.4702 -0.3037 -0.0507 -0.2607 -0.7071  0.0110  0.3407
 -0.3153 -0.5018 -0.1220  0.7128 -0.0000 -0.0162 -0.3544
 -0.4702 -0.3037 -0.0507 -0.2607  0.7071  0.0110  0.3407
(C) 2005, The University of Michigan 163
Decomposition
s =
 1.5849  0       0       0       0       0       0
 0       1.2721  0       0       0       0       0
 0       0       1.1946  0       0       0       0
 0       0       0       0.7996  0       0       0
 0       0       0       0       0.7100  0       0
 0       0       0       0       0       0.5692  0
 0       0       0       0       0       0       0.1977
 0       0       0       0       0       0       0
 0       0       0       0       0       0       0
Spread on the v1 axis
(C) 2005, The University of Michigan 164
Rank-4 approximation

s4 =
 1.5849  0       0       0       0  0  0
 0       1.2721  0       0       0  0  0
 0       0       1.1946  0       0  0  0
 0       0       0       0.7996  0  0  0
 0       0       0       0       0  0  0
 0       0       0       0       0  0  0
 0       0       0       0       0  0  0
 0       0       0       0       0  0  0
 0       0       0       0       0  0  0
(C) 2005, The University of Michigan 165
Rank-4 approximation

u*s4*v' =
 -0.0019  0.5985 -0.0148  0.4552  0.7002  0.0102  0.7002
 -0.0728  0.4961  0.6282  0.0745  0.0121 -0.0133  0.0121
  0.0003 -0.0067  0.0052 -0.0013  0.3584  0.7065  0.3584
  0.1980  0.0514  0.0064  0.2199  0.0535 -0.0544  0.0535
 -0.0728  0.4961  0.6282  0.0745  0.0121 -0.0133  0.0121
  0.6337 -0.0602  0.0290  0.5324 -0.0008  0.0003 -0.0008
  0.0003 -0.0067  0.0052 -0.0013  0.3584  0.7065  0.3584
  0.2165  0.2494  0.4367  0.2282 -0.0360  0.0394 -0.0360
  0.6337 -0.0602  0.0290  0.5324 -0.0008  0.0003 -0.0008
(C) 2005, The University of Michigan 166
Rank-4 approximation

u*s4 =
 -1.1056 -0.1203  0.0207 -0.5558  0  0  0
 -0.4155  0.3748  0.5606  0.1573  0  0  0
 -0.5576 -0.5719 -0.1226  0.3210  0  0  0
 -0.1786  0.1801 -0.1765 -0.0587  0  0  0
 -0.4155  0.3748  0.5606  0.1573  0  0  0
 -0.2984  0.4778 -0.6015  0.1018  0  0  0
 -0.5576 -0.5719 -0.1226  0.3210  0  0  0
 -0.3348  0.4241  0.1149  0.2255  0  0  0
 -0.2984  0.4778 -0.6015  0.1018  0  0  0
(C) 2005, The University of Michigan 167
Rank-4 approximation

s4*v' =
 -0.2674 -0.7087 -0.4266 -0.6292 -0.7451 -0.4996 -0.7451
  0.5333  0.2869  0.5351  0.5092 -0.3863 -0.6384 -0.3863
 -0.7150  0.5544  0.6001 -0.4686 -0.0605 -0.1457 -0.0605
  0.1808 -0.1749  0.3918 -0.1043 -0.2085  0.5700 -0.2085
  0  0  0  0  0  0  0
  0  0  0  0  0  0  0
  0  0  0  0  0  0  0
  0  0  0  0  0  0  0
  0  0  0  0  0  0  0
(C) 2005, The University of Michigan 168
Rank-2 approximation

s2 =
 1.5849  0       0  0  0  0  0
 0       1.2721  0  0  0  0  0
 0       0       0  0  0  0  0
 0       0       0  0  0  0  0
 0       0       0  0  0  0  0
 0       0       0  0  0  0  0
 0       0       0  0  0  0  0
 0       0       0  0  0  0  0
 0       0       0  0  0  0  0
(C) 2005, The University of Michigan 169
Rank-2 approximation

u*s2*v' =
  0.1361  0.4673  0.2470  0.3908  0.5563  0.4089  0.5563
  0.2272  0.2703  0.2695  0.3150  0.0815 -0.0571  0.0815
 -0.1457  0.1204 -0.0904 -0.0075  0.4358  0.4628  0.4358
  0.1057  0.1205  0.1239  0.1430  0.0293 -0.0341  0.0293
  0.2272  0.2703  0.2695  0.3150  0.0815 -0.0571  0.0815
  0.2507  0.2412  0.2813  0.3097 -0.0048 -0.1457 -0.0048
 -0.1457  0.1204 -0.0904 -0.0075  0.4358  0.4628  0.4358
  0.2343  0.2454  0.2685  0.3027  0.0286 -0.1073  0.0286
  0.2507  0.2412  0.2813  0.3097 -0.0048 -0.1457 -0.0048
(C) 2005, The University of Michigan 170
Rank-2 approximation

u*s2 =
 -1.1056 -0.1203  0  0  0  0  0
 -0.4155  0.3748  0  0  0  0  0
 -0.5576 -0.5719  0  0  0  0  0
 -0.1786  0.1801  0  0  0  0  0
 -0.4155  0.3748  0  0  0  0  0
 -0.2984  0.4778  0  0  0  0  0
 -0.5576 -0.5719  0  0  0  0  0
 -0.3348  0.4241  0  0  0  0  0
 -0.2984  0.4778  0  0  0  0  0
(C) 2005, The University of Michigan 171
Rank-2 approximation

s2*v' =
 -0.2674 -0.7087 -0.4266 -0.6292 -0.7451 -0.4996 -0.7451
  0.5333  0.2869  0.5351  0.5092 -0.3863 -0.6384 -0.3863
  0  0  0  0  0  0  0
  0  0  0  0  0  0  0
  0  0  0  0  0  0  0
  0  0  0  0  0  0  0
  0  0  0  0  0  0  0
  0  0  0  0  0  0  0
  0  0  0  0  0  0  0
(C) 2005, The University of Michigan 172
Documents to concepts and terms to concepts
A(:,1)'*u*s
-0.4238 0.6784 -0.8541 0.1446 -0.0000 -0.1853 0.0095
>> A(:,1)'*u*s4
-0.4238 0.6784 -0.8541 0.1446 0 0 0
>> A(:,1)'*u*s2
-0.4238 0.6784 0 0 0 0 0
>> A(:,2)'*u*s2
-1.1233 0.3650 0 0 0 0 0
>> A(:,3)'*u*s2
-0.6762 0.6807 0 0 0 0 0
(C) 2005, The University of Michigan 173
Documents to concepts and terms to concepts
>> A(:,4)'*u*s2
-0.9972 0.6478 0 0 0 0 0
>> A(:,5)'*u*s2
-1.1809 -0.4914 0 0 0 0 0
>> A(:,6)'*u*s2
-0.7918 -0.8121 0 0 0 0 0
>> A(:,7)'*u*s2
-1.1809 -0.4914 0 0 0 0 0
(C) 2005, The University of Michigan 174
Cont’d
>> (s2*v'*A(1,:)')'
-1.7523 -0.1530 0 0 0 0 0 0 0
>> (s2*v'*A(2,:)')'
-0.6585 0.4768 0 0 0 0 0 0 0
>> (s2*v'*A(3,:)')'
-0.8838 -0.7275 0 0 0 0 0 0 0
>> (s2*v'*A(4,:)')'
-0.2831 0.2291 0 0 0 0 0 0 0
>> (s2*v'*A(5,:)')'
-0.6585 0.4768 0 0 0 0 0 0 0
(C) 2005, The University of Michigan 175
Cont’d
>> (s2*v'*A(6,:)')'
-0.4730 0.6078 0 0 0 0 0 0 0
>> (s2*v'*A(7,:)')'
-0.8838 -0.7275 0 0 0 0 0 0 0
>> (s2*v'*A(8,:)')'
-0.5306 0.5395 0 0 0 0 0 0 0
>> (s2*v'*A(9,:)')'
-0.4730 0.6078 0 0 0 0 0 0 0
(C) 2005, The University of Michigan 176
Properties

A is a document to term matrix. What is A*A', what is A'*A?

A*A' =
 1.5471  0.3364  0.5041  0.2025  0.3364  0.2025  0.5041  0.2025  0.2025
 0.3364  0.6728  0       0       0.6728  0       0       0.3364  0
 0.5041  0       1.0082  0       0       0       0.5041  0       0
 0.2025  0       0       0.2025  0       0.2025  0       0.2025  0.2025
 0.3364  0.6728  0       0       0.6728  0       0       0.3364  0
 0.2025  0       0       0.2025  0       0.7066  0       0.2025  0.7066
 0.5041  0       0.5041  0       0       0       1.0082  0       0
 0.2025  0.3364  0       0.2025  0.3364  0.2025  0       0.5389  0.2025
 0.2025  0       0       0.2025  0       0.7066  0       0.2025  0.7066

A'*A =
 1.0082  0       0       0.6390  0       0       0
 0       1.0092  0.6728  0.2610  0.4118  0       0.4118
 0       0.6728  1.0092  0.2610  0       0       0
 0.6390  0.2610  0.2610  1.0125  0.3195  0       0.3195
 0       0.4118  0       0.3195  1.0082  0.5041  0.5041
 0       0       0       0       0.5041  1.0082  0.5041
 0       0.4118  0       0.3195  0.5041  0.5041  1.0082
(C) 2005, The University of Michigan 177
Latent semantic indexing (LSI)
• Dimensionality reduction = identification of hidden (latent) concepts
• Query matching in latent space
(C) 2005, The University of Michigan 178
Useful pointers
• http://lsa.colorado.edu
• http://lsi.research.telcordia.com/
• http://www.cs.utk.edu/~lsi/
• http://javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm
• http://citeseer.nj.nec.com/deerwester90indexing.html
• http://www.pcug.org.au/~jdowling/
(C) 2005, The University of Michigan 179
Models of the Web
(C) 2005, The University of Michigan 180
Size
• The Web is the largest repository of data and it grows exponentially.
  – 320 Million Web pages [Lawrence & Giles 1998]
– 800 Million Web pages, 15 TB [Lawrence & Giles 1999]
– 8 Billion Web pages indexed [Google 2005]
• Amount of data– roughly 200 TB [Lyman et al. 2003]
(C) 2005, The University of Michigan 181
Bow-tie model of the Web
[Figure: bow-tie structure - SCC: 56 M pages; IN: 44 M; OUT: 44 M; TENDRILS: 44 M; DISCONNECTED: 17 M]

• 24% of pages reachable from a given page

Bröder & al. WWW 2000, Dill & al. VLDB 2001
(C) 2005, The University of Michigan 182
Power laws
• Web site size (Huberman and Adamic 1999)
• Power-law connectivity (Barabási and Albert 1999): exponents 2.45 for out-degree and 2.1 for in-degree
• Others: call graphs among telephone carriers, citation networks (Redner 1998), e.g., Erdős; collaboration graph of actors; metabolic pathways (Jeong et al. 2000); protein networks (Maslov and Sneppen 2002). All values of gamma are around 2-3.
(C) 2005, The University of Michigan 183
Small-world networks
• Diameter = average length of the shortest path between all pairs of nodes. Example...
• Milgram experiment (1967)
  – Kansas/Omaha --> Boston (42/160 letters)
  – diameter = 6
• Albert et al. 1999 - average distance between two vertices is d = 0.35 + 2.06 log10 n. For n = 10^9, d = 18.89.
• Six degrees of separation
(C) 2005, The University of Michigan 184
Clustering coefficient
• Cliquishness (C): the fraction of links that actually exist among the kv(kv - 1)/2 possible pairs of neighbors of a node v.
• Examples:
             n        k      d     drand  C     crand
Actors       225226   61     3.65  2.99   0.79  0.00027
Power grid   4941     2.67   18.7  12.4   0.08  0.005
C. elegans   282      14     2.65  2.25   0.28  0.05
(C) 2005, The University of Michigan 185
Models of the Web
• Erdös/Rényi 59, 60: random graphs; Poisson degree distribution
  P(k) = e^(-λ) λ^k / k!,  λ = pN
• Barabási/Albert 99: preferential attachment; power-law degree distribution
  P(k) ~ k^(-γ)
• Watts/Strogatz 98
• Kleinberg 98
• Menczer 02
• Radev 03
• Evolving networks: fundamental object of statistical physics, social networks, mathematical biology, and epidemiology
(C) 2005, The University of Michigan 188
Social network analysis for IR
(C) 2005, The University of Michigan 189
Social networks
• Induced by a relation
• Symmetric or not
• Examples:
  – Friendship networks
  – Board membership
  – Citations
  – Power grid of the US
  – WWW
(C) 2005, The University of Michigan 190
[Figure: a social network (Krebs 2004)]
(C) 2005, The University of Michigan 191
Prestige and centrality
• Degree centrality: how many neighbors each node has.
• Closeness centrality: how close a node is to all of the other nodes
• Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes
• Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects.
• Prestige = same as centrality but for directed graphs.
(C) 2005, The University of Michigan 192
Graph-based representations
[Figure: a directed graph G(V,E) on nodes 1-8 and its 8x8 square connectivity (incidence) matrix, with entry (i,j) = 1 when there is an edge from node i to node j]
(C) 2005, The University of Michigan 193
Markov chains
• A homogeneous Markov chain is defined by an initial distribution x and a Markov kernel E.
• Path = sequence (x0, x1, ..., xn), where xi = xi-1·E
• The probability of a path can be computed as a product of probabilities for each step i.
• Random walk = find Xj given x0, E, and j.
(C) 2005, The University of Michigan 194
Stationary solutions
• The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions:– E is stochastic
– E is irreducible
– E is aperiodic
• To make these conditions true:– All rows of E add up to 1 (and no value is negative)
– Make sure that E is strongly connected
– Make sure that E is not bipartite
• Example: PageRank [Brin and Page 1998]: use “teleportation”
(C) 2005, The University of Michigan 195
Example

This graph E has a second graph E' (not drawn) superimposed on it: E' is the uniform transition graph.

[Figure: the 8-node graph and bar charts of its PageRank values at t=0 and t=1]
(C) 2005, The University of Michigan 196
Eigenvectors
• An eigenvector is an implicit “direction” for a matrix: Mv = λv, where v is non-zero, though λ can be any complex number in principle.
• The largest eigenvalue of a stochastic matrix E is real: λ1 = 1.
• For λ1, the left (principal) eigenvector is p, the right eigenvector = 1
• In other words, ETp = p.
(C) 2005, The University of Michigan 197
Computing the stationary distribution
E^T·p = p
(I - E^T)·p = 0    (solution for the stationary distribution)

function PowerStatDist (E):
begin
  p(0) = u;   (or p(0) = [1,0,...,0])
  i = 1;
  repeat
    p(i) = E^T·p(i-1)
    L = ||p(i) - p(i-1)||1;
    i = i + 1;
  until L < ε
  return p(i)
end
(C) 2005, The University of Michigan 198
Example

[Figure: the 8-node graph and bar charts of its PageRank values at t=0, t=1, and t=10, converging to the stationary distribution]
(C) 2005, The University of Michigan 199
How Google works
• Crawling
• Anchor text
• Fast query processing
• PageRank
(C) 2005, The University of Michigan 200
More about PageRank
• Named after Larry Page, founder of Google (and UM alum)
• Reading: “The anatomy of a large-scale hypertextual web search engine” by Brin and Page.
• Independent of query (although more recent work by Haveliwala (WWW 2002) has also identified topic-based PageRank).
(C) 2005, The University of Michigan 201
HITS
• Query-dependent model (Kleinberg 1997)
• Hubs and authorities (e.g., cars, Honda)
• Algorithm:
  – obtain root set using input query
  – expand the root set by radius one
  – run iterations on the hub and authority scores together:
    a' = E^T·h,   h' = E·a
  – report top-ranking authorities and hubs
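A minimal sketch of the coupled iteration on a made-up adjacency matrix, normalizing after each step.

import numpy as np

E = np.array([[0, 1, 1, 0],     # made-up adjacency matrix: E[i, j] = 1
              [0, 0, 1, 0],     # if page i links to page j
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)

a = np.ones(4)                  # authority scores
h = np.ones(4)                  # hub scores
for _ in range(50):
    a = E.T @ h                 # a' = E^T h: good authorities are pointed
    a /= np.linalg.norm(a)      # to by good hubs
    h = E @ a                   # h' = E a: good hubs point to good authorities
    h /= np.linalg.norm(h)
print(np.round(a, 3), np.round(h, 3))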
(C) 2005, The University of Michigan 202
The link-content hypothesis
• Topical locality: a page tends to be similar to the page that points to it.
• Davison (TF*IDF, 100K pages)
  – 0.31 same domain
– 0.23 linked pages
– 0.19 sibling
– 0.02 random
• Menczer (373K pages, non-linear least squares fit)
• Chakrabarti (focused crawling) - prob. of losing the topic
Van Rijsbergen 1979, Chakrabarti & al. WWW 1999, Davison SIGIR 2000, Menczer 2001
[Equation: similarity as a decaying function of link distance, with fitted parameters 1.8 and 0.6 and an asymptote of 0.03]
(C) 2005, The University of Michigan 203
Measuring the Web
(C) 2005, The University of Michigan 204
Bharat and Broder 1998
• Based on crawls of HotBot, Altavista, Excite, and InfoSeek
• 10,000 queries in mid and late 1997
• Estimate is 200M pages
• Only 1.4% are indexed by all of them
(C) 2005, The University of Michigan 205
Example (from Bharat&Broder)
A similar approach by Lawrence and Giles yields 320M pages (Lawrence and Giles 1998).
(C) 2005, The University of Michigan 212
Question answering
(C) 2005, The University of Michigan 213
People ask questions
• Excite corpus of 2,477,283 queries (one day’s worth)
• 8.4% of them are questions
  – 43.9% factual (what is the country code for Belgium)
  – 56.1% procedural (how do I set up TCP/IP) or other
• In other words, 100 K questions per day
(C) 2005, The University of Michigan 214
People ask questionsIn what year did baseball become an offical sport?Who is the largest man in the world?Where can i get information on Raphael?where can i find information on puritan religion?Where can I find how much my house is worth?how do i get out of debt?Where can I found out how to pass a drug test?When is the Super Bowl?who is California's District State Senator?where can I buy extra nibs for a foutain pen?how do i set up tcp/ip ?what time is it in west samoa?Where can I buy a little kitty cat?what are the symptoms of attention deficit disorder?Where can I get some information on Michael Jordan?How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?When did the Neanderthal man live?Which Frenchman declined the Nobel Prize for Literature for ideological reasons?What is the largest city in Northern Afghanistan?
(C) 2005, The University of Michigan 215
(C) 2005, The University of Michigan 216
Question answering
What is the largest city in Northern Afghanistan?
(C) 2005, The University of Michigan 217
Possible approaches
• Map?
• Knowledge base?
  Find x: city(x) ∧ located(x, ”Northern Afghanistan”) ∧
          ¬∃y: city(y) ∧ located(y, ”Northern Afghanistan”) ∧ greaterthan(population(y), population(x))
• Database?
• World factbook?
• Search engine?
(C) 2005, The University of Michigan 218
The TREC Q&A evaluation
• Run by NIST [Voorhees and Tice 2000]
• 2 GB of input
• 200 questions
• Essentially fact extraction
  – Who was Lincoln’s secretary of state?
  – What does the Peugeot company manufacture?
• Questions are based on text
• Answers are assumed to be present
• No inference needed
(C) 2005, The University of Michigan 219
Q: When did Nelson Mandela become president of South Africa?
A: 10 May 1994
Q: How tall is the Matterhorn?
A: The institute revised the Matterhorn 's height to 14,776 feet 9 inches
Q: How tall is the replica of the Matterhorn at Disneyland?
A: In fact he has climbed the 147-foot Matterhorn at Disneyland every week end for the last 3 1/2 years
Q: If Iraq attacks a neighboring country, what should the US do?
A: ??
Question answering
(C) 2005, The University of Michigan 220
Q: Why did David Koresh ask the FBI for a word processor?
Q: Name the designer of the shoe that spawned millions of plastic imitations, known as "jellies".
Q: What is the brightest star visible from Earth?
Q: What are the Valdez Principles?
Q: Name a film that has won the Golden Bear in the Berlin Film Festival?
Q: Name a country that is developing a magnetic levitation railway system?
Q: Name the first private citizen to fly in space.
Q: What did Shostakovich write for Rostropovich?
Q: What is the term for the sum of all genetic material in a given organism?
Q: What is considered the costliest disaster the insurance industry has ever faced?
Q: What is Head Start?
Q: What was Agent Orange used for during the Vietnam War?
Q: What did John Hinckley do to impress Jodie Foster?
Q: What was the first Gilbert and Sullivan opera?
Q: What did Richard Feynman say upon hearing he would receive the Nobel Prize in Physics?
Q: How did Socrates die?
Q: Why are electric cars less efficient in the north-east than in California?
(C) 2005, The University of Michigan 221
NSIR
• Current project at U-M:
  – http://tangra.si.umich.edu/clair/NSIR/html/nsir.cgi
• Reading: [Radev et al., 2005a]
  – Dragomir R. Radev, Weiguo Fan, Hong Qi, Harris Wu, and Amardeep Grewal. Probabilistic question answering on the web. Journal of the American Society for Information Science and Technology 56(3), March 2005.
  – http://tangra.si.umich.edu/~radev/bib2html/radev-bib.html
(C) 2005, The University of Michigan 222
(C) 2005, The University of Michigan 223
... Afghanistan, Kabul, 2,450 ... Administrative capital and largest city (1997 est ... Undetermined. Panama, Panama City, 450,668. ... of the Gauteng, Northern Province, Mpumalanga ... www.infoplease.com/cgi-bin/id/A0855603

... died in Kano, northern Nigeria's largest city, during two days of anti-American riots led by Muslims protesting the US-led bombing of Afghanistan, according to ... www.washingtonpost.com/wp-dyn/print/world/

... air strikes on the city. ... the Taliban militia in northern Afghanistan in a significant blow ... defection would be the largest since the United States ... www.afgha.com/index.php - 60k

... Kabul is the capital and largest city of Afghanistan. ... met. area pop. 2,029,889), is the largest city in Uttar Pradesh, a state in northern India. ... school.discovery.com/homeworkhelp/worldbook/atozgeography/k/k1menu.html

... Gudermes, Chechnya's second largest town. The attack ... location in Afghanistan's outlying regions ... in the city of Mazar-i-Sharif, a Northern Alliance-affiliated ... english.pravda.ru/hotspots/2001/09/17/

... Get Worse By RICK BRAGG Pakistan's largest city is getting a jump on the ... Region: Education Offers Women in Northern Afghanistan a Ray of Hope. ... www.nytimes.com/pages/world/asia/

... within three miles of the airport at Mazar-e-Sharif, the largest city in northern Afghanistan, held since 1998 by the Taliban. There was no immediate comment ... uk.fc.yahoo.com/photos/a/afghanistan.html
(C) 2005, The University of Michigan 224
NSIR pipeline, on the running example:

Question: What is the largest city in Northern Afghanistan?

Query modulation: (largest OR biggest) city "Northern Afghanistan"

Document retrieval: www.infoplease.com/cgi-bin/id/A0855603, www.washingtonpost.com/wp-dyn/print/world/

Sentence retrieval: "Gudermes, Chechnya's second largest town ..." / "... within three miles of the airport at Mazar-e-Sharif, the largest city in northern Afghanistan"

Answer extraction: Gudermes, Mazar-e-Sharif

Answer ranking: Mazar-e-Sharif, Gudermes
(C) 2005, The University of Michigan 225
(C) 2005, The University of Michigan 226
(C) 2005, The University of Michigan 227
Research problems

• Source identification:
  – semi-structured vs. text sources
• Query modulation:
  – what is the best paraphrase of a NL question given the syntax of a search engine?
  – compare two approaches: noisy channel model and rule-based
• Sentence ranking:
  – n-gram matching, Okapi, co-reference?
• Answer extraction:
  – question type identification
  – phrase chunking
  – no general-purpose named entity tagger available
• Answer ranking:
  – what are the best predictors of a phrase being the answer to a given question: question type, proximity to query words, frequency
• Evaluation (MRDR):
  – accuracy, reliability, timeliness
(C) 2005, The University of Michigan 228
Document retrieval
• Use existing search engines: Google, AlltheWeb, NorthernLight
• No modifications to question
• Cf. work on QASM (ACM CIKM 2001)
(C) 2005, The University of Michigan 229
Sentence ranking

• Weighted N-gram matching:

  S = w1 · Σ_i tf(u_i) · idf(u_i) + w2 · Σ_j tf(b_j) + w3 · Σ_k tf(t_k)

  where u_i, b_j, t_k range over the unigrams, bigrams, and trigrams shared by the question and the candidate sentence, and tf is the n-gram's frequency in the sentence

• Weights are determined empirically, e.g., w1 = 0.6, w2 = 0.3, w3 = 0.1 (see the sketch below)
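A minimal sketch of this scorer in Python, assuming whitespace tokenization and a precomputed idf dictionary (both are illustrative choices, not details from the slides):

    from collections import Counter

    def ngrams(tokens, n):
        """All contiguous n-grams of a token list, as tuples."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def sentence_score(question, sentence, idf, weights=(0.6, 0.3, 0.1)):
        """Weighted n-gram overlap between a question and a candidate
        sentence: unigram matches are idf-weighted, bigram and trigram
        matches are counted by frequency, as in the formula above."""
        q, s = question.lower().split(), sentence.lower().split()
        score = 0.0
        for n, w in zip((1, 2, 3), weights):
            s_counts = Counter(ngrams(s, n))
            for g in set(ngrams(q, n)):
                tf = s_counts[g]
                if tf and n == 1:
                    score += w * tf * idf.get(g[0], 1.0)
                elif tf:
                    score += w * tf
        return score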
(C) 2005, The University of Michigan 230
Probabilistic phrase reranking

• Answer extraction: probabilistic phrase reranking. What is:
  p(ph is answer to q | q, ph)
• Evaluation: TRDR (total reciprocal document rank, verified in the sketch below):

  TRDR = Σ_{i=1..n} 1/r_i

  where r_1, ..., r_n are the ranks of the retrieved items that contain an answer
  – Example: ranks (2, 8, 10) give 1/2 + 1/8 + 1/10 = .725
  – Document, sentence, or phrase level
• Criterion: presence of answer(s)
• High correlation with manual assessment
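The metric is short enough to check directly; a sketch (the function name is mine):

    def trdr(ranks):
        """Total reciprocal document rank: the sum of 1/rank over every
        retrieved item that contains a correct answer."""
        return sum(1.0 / r for r in ranks)

    assert abs(trdr([2, 8, 10]) - 0.725) < 1e-9  # the example above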
(C) 2005, The University of Michigan 231
Phrase types
PERSON, PLACE, DATE, NUMBER, DEFINITION, ORGANIZATION, DESCRIPTION, ABBREVIATION, KNOWNFOR, RATE, LENGTH, MONEY, REASON, DURATION, PURPOSE, NOMINAL, OTHER
(C) 2005, The University of Michigan 232
Question Type Identification

• Wh-type not sufficient:
  – Who: PERSON 77, DESCRIPTION 19, ORG 6
  – What: NOMINAL 78, PLACE 27, DEF 26, PERSON 18, ORG 16, NUMBER 14, etc.
  – How: NUMBER 33, LENGTH 6, RATE 2, etc.
• Ripper:
  – 13 features: Question-Words, Wh-Word, Word-Beside-Wh-Word, Is-Noun-Length, Is-Noun-Person, etc.
  – Top 2 question types
• Heuristic algorithm (a toy sketch follows below):
  – About 100 regular expressions based on words and parts of speech
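A toy illustration of the heuristic approach; the real system used roughly 100 patterns over words and part-of-speech tags, and the three patterns here are invented for illustration only:

    import re

    # Hypothetical patterns; tried in order, first match wins.
    QTYPE_PATTERNS = [
        (re.compile(r"^who\b", re.I), "PERSON"),
        (re.compile(r"^(where|what .* (city|country|state))\b", re.I), "PLACE"),
        (re.compile(r"^(when|what year)\b", re.I), "DATE"),
    ]

    def question_type(question):
        for pattern, qtype in QTYPE_PATTERNS:
            if pattern.search(question):
                return qtype
        return "OTHER"

    print(question_type("What is the largest city in Northern Afghanistan?"))  # PLACE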
(C) 2005, The University of Michigan 233
Ripper performance

Training      Test     Train Error Rate   Test Error Rate
TREC9         TREC8    22.4%              24%
TREC8,9       TREC10   17.03%             30%
TREC8,9,10    n/a      20.69%             n/a
(C) 2005, The University of Michigan 234
Regex performance

Training      Test on TREC8   Test on TREC9   Test on TREC10
TREC9         15%             7.8%            18%
TREC8,9       6%              7.4%            18.2%
TREC8,9,10    5.5%            4.6%            7.6%
(C) 2005, The University of Michigan 235
Phrase ranking
• Phrases are identified by a shallow parser (LT CHUNK from Edinburgh)
• Four features:
  – Proximity
  – POS (part-of-speech) signature (qtype)
  – Query overlap
  – Frequency
(C) 2005, The University of Michigan 236
Proximity

• Phrasal answers tend to appear near words from the query
• Average distance = 7 words, range = 1 to 50 words
• Use linear rescaling of scores (one possible form is sketched below)
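The slide gives the distance range but not the exact mapping, so the following linear rescaling is an assumption:

    def proximity_score(distance, d_min=1, d_max=50):
        """Linearly rescale a phrase's distance (in words) to the
        nearest query word into a [0, 1] score; closer is better."""
        d = min(max(distance, d_min), d_max)
        return (d_max - d) / (d_max - d_min)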
(C) 2005, The University of Michigan 237
Part of speech signature
Signature     Phrase types
VBD           NO (100%)
DT NN         NO (86.7%), PERSON (3.8%), NUMBER (3.8%), ORG (2.5%)
NNP           PERSON (37.4%), PLACE (29.6%), DATE (21.7%), NO (7.6%)
DT JJ NNP     NO (75.6%), NUMBER (11.1%), PLACE (4.4%), ORG (4.4%)
NNP NNP       PLACE (37.3%), PERSON (35.6%), NO (16.9%), ORG (10.2%)
DT NNP        ORG (55.6%), NO (33.3%), PLACE (5.6%), DATE (5.6%)

Example: "Hugo/NNP Young/NNP" gives P(PERSON | "NNP NNP") = .458
Example: "the/DT Space/NNP Flight/NNP Operations/NNP contractor/NN" gives P(PERSON | "DT NNP NNP NNP NN") = 0
Penn Treebank tagset (DT = determiner, JJ = adjective)
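The conditional probabilities in the table can be estimated by simple counting over labeled phrases; a sketch, with an illustrative input format:

    from collections import Counter, defaultdict

    def signature_model(tagged_phrases):
        """Estimate P(phrase type | POS signature) from labeled data.
        `tagged_phrases` is an iterable of (signature, phrase_type)
        pairs, e.g. ("NNP NNP", "PERSON")."""
        counts = defaultdict(Counter)
        for sig, ptype in tagged_phrases:
            counts[sig][ptype] += 1
        return {sig: {t: c / sum(ctr.values()) for t, c in ctr.items()}
                for sig, ctr in counts.items()}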
(C) 2005, The University of Michigan 238
Query overlap and frequency
• Query overlap:
  – What is the capital of Zimbabwe?
  – Possible choices: Mugabe, Zimbabwe, Luanda, Harare
• Frequency:
  – Not necessarily accurate but rather useful
(C) 2005, The University of Michigan 239
Reranking
Rank Probability and phrase
1  0.599862  the_DT Space_NNP Flight_NNP Operations_NNP contractor_NN ._.
2  0.598564  International_NNP Space_NNP Station_NNP Alpha_NNP
3  0.598398  International_NNP Space_NNP Station_NNP
4  0.598125  to_TO become_VB
5  0.594763  a_DT joint_JJ venture_NN United_NNP Space_NNP Alliance_NNP
6  0.593933  NASA_NNP Johnson_NNP Space_NNP Center_NNP
7  0.587140  will_MD form_VB
8  0.585410  The_DT purpose_NN
9  0.576797  prime_JJ contracts_NNS
10 0.568013  First_NNP American_NNP
11 0.567361  this_DT bulletin_NN board_NN
12 0.565757  Space_NNP :_:
13 0.562627  'Spirit_NN '_'' of_IN
...
41 0.516368  Alan_NNP Shepard_NNP
Proximity = .5164
(C) 2005, The University of Michigan 240
Reranking
Rank Probability and phrase
1  0.465012  Space_NNP Administration_NNP ._.
2  0.446466  SPACE_NNP CALENDAR_NNP _.
3  0.413976  First_NNP American_NNP
4  0.399043  International_NNP Space_NNP Station_NNP Alpha_NNP
5  0.396250  her_PRP$ third_JJ space_NN mission_NN
6  0.395956  NASA_NNP Johnson_NNP Space_NNP Center_NNP
7  0.394122  the_DT American_NNP Commercial_NNP Launch_NNP Industry_NNP
8  0.390163  the_DT Red_NNP Planet_NNP ._.
9  0.379797  First_NNP American_NNP
10 0.376336  Alan_NNP Shepard_NNP
11 0.375669  February_NNP
12 0.374813  Space_NNP
13 0.373999  International_NNP Space_NNP Station_NNP
Qtype = .7288
Proximity * qtype = .3763
(C) 2005, The University of Michigan 241
Reranking
Rank Probability and phrase
1  0.478857  Neptune_NNP Beach_NNP ._.
2  0.449232  February_NNP
3  0.447075  Go_NNP
4  0.437895  Space_NNP
5  0.431835  Go_NNP
6  0.424678  Alan_NNP Shepard_NNP
7  0.423855  First_NNP American_NNP
8  0.421133  Space_NNP May_NNP
9  0.411065  First_NNP American_NNP woman_NN
10 0.401994  Life_NNP Sciences_NNP
11 0.385763  Space_NNP Shuttle_NNP Discovery_NNP STS-60_NN
12 0.381865  the_DT Moon_NNP International_NNP Space_NNP Station_NNP
13 0.370030  Space_NNP Research_NNP A_NNP Session_NNP
All four features
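The examples above combine features by multiplying their scores ("Proximity * qtype"). A minimal sketch of that reranking step, assuming each feature has already been rescaled to [0, 1]; the field names are illustrative:

    def rerank(phrases):
        """Sort candidate phrases by the product of their per-feature
        scores (proximity, question type, query overlap, frequency)."""
        def combined(p):
            return (p["proximity"] * p["qtype"] *
                    p["overlap"] * p["frequency"])
        return sorted(phrases, key=combined, reverse=True)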
(C) 2005, The University of Michigan 242
(C) 2005, The University of Michigan 243
(C) 2005, The University of Michigan 244
(C) 2005, The University of Michigan 245
Document level performance
Engine    Google    NLight    AlltheWeb
Avg       1.3361    1.0495    0.8355
# > 0     164       163       149

TREC 8 corpus (200 questions)
(C) 2005, The University of Michigan 246
Sentence level performance
Engine   GOO    GOL    GOU    NLO    NLL    NLU    AWO    AWL    AWU
Avg      0.49   0.54   2.55   0.44   0.48   2.53   0.26   0.31   2.13
# > 0    135    137    159    119    121    159    99     99     148
(C) 2005, The University of Michigan 247
Phrase level performance
Method             Google S+P   Google D+P   NorthernLight   AlltheWeb
Combined           0.199        0.157        0.117           0.105
Global proximity   0.0646       0.058        0.054           0.038
Appearance order   0.0646       0.068        0.048           0.026
Upperbound         1.941        2.698        2.652           2.176

Experiments performed Oct.-Nov. 2001
(C) 2005, The University of Michigan 248
(C) 2005, The University of Michigan 249
(C) 2005, The University of Michigan 250
Text classification
(C) 2005, The University of Michigan 251
Introduction
• Text classification: assigning documents to predefined categories
• Hierarchical vs. flat
• Many techniques: generative (kNN, Naïve Bayes) vs. discriminative (maxent, SVM, regression)
• Generative: model joint prob. p(x,y) and use Bayesian prediction to compute p(y|x)
• Discriminative: model p(y|x) directly.
(C) 2005, The University of Michigan 252
Generative models: kNN

• K-nearest neighbors
• Very easy to program (sketched below)
• Issues: choosing k, b?

  score(c, d_q) = b_c + Σ_{d ∈ kNN(d_q) ∩ c} s(d_q, d)

  where kNN(d_q) is the set of k training documents most similar to d_q, s is a similarity measure, and b_c is a class-specific bias
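A compact sketch of this scoring rule over sparse term-weight dictionaries, with cosine similarity standing in for s (the similarity measure and data layout are my choices, not the lecture's):

    import math
    from collections import defaultdict

    def cosine(u, v):
        """Cosine similarity of two sparse term-weight dictionaries."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def knn_scores(query_vec, train, k=5, b=0.0):
        """score(c, d_q) = b + sum of similarities to the k nearest
        training documents that belong to class c; `train` is a list
        of (vector, label) pairs."""
        nearest = sorted(train, key=lambda dv: cosine(query_vec, dv[0]),
                         reverse=True)[:k]
        scores = defaultdict(lambda: b)
        for vec, label in nearest:
            scores[label] += cosine(query_vec, vec)
        return dict(scores)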
(C) 2005, The University of Michigan 253
Feature selection: The χ² test

• For a term t, with indicator I_t and class C:

            I_t = 0    I_t = 1
  C = 0     k00        k01
  C = 1     k10        k11

• Testing for independence: P(C=0, I_t=0) should be equal to P(C=0) P(I_t=0)
  – P(C=0) = (k00+k01)/n
  – P(C=1) = 1-P(C=0) = (k10+k11)/n
  – P(I_t=0) = (k00+k10)/n
  – P(I_t=1) = 1-P(I_t=0) = (k01+k11)/n
(C) 2005, The University of Michigan 254
Feature selection: The χ² test

• High values of χ² indicate lower belief in independence.
• In practice, compute χ² for all words and pick the top k among them (sketched below).

  χ² = n (k00 k11 − k01 k10)² / ((k00+k01)(k10+k11)(k00+k10)(k01+k11))
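The statistic and the top-k selection fit in a few lines; a sketch, with an illustrative input format:

    def chi_square(k00, k01, k10, k11):
        """Chi-square statistic for a 2x2 term/class contingency table."""
        n = k00 + k01 + k10 + k11
        num = n * (k00 * k11 - k01 * k10) ** 2
        den = (k00 + k01) * (k10 + k11) * (k00 + k10) * (k01 + k11)
        return num / den if den else 0.0

    def top_k_terms(tables, k):
        """Pick the k terms with the highest chi-square scores;
        `tables` maps term -> (k00, k01, k10, k11)."""
        return sorted(tables, key=lambda t: chi_square(*tables[t]),
                      reverse=True)[:k]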
(C) 2005, The University of Michigan 255
Feature selection: mutual information

• No document length scaling is needed
• Documents are assumed to be generated according to the multinomial model

  MI(X, Y) = Σ_x Σ_y P(x, y) log [ P(x, y) / (P(x) P(y)) ]
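A direct transcription of the formula, assuming the joint distribution is given as a dictionary (an illustrative representation):

    import math

    def mutual_information(joint):
        """MI(X, Y) from a joint distribution {(x, y): P(x, y)};
        marginals are obtained by summing out the other variable."""
        px, py = {}, {}
        for (x, y), p in joint.items():
            px[x] = px.get(x, 0.0) + p
            py[y] = py.get(y, 0.0) + p
        return sum(p * math.log(p / (px[x] * py[y]))
                   for (x, y), p in joint.items() if p > 0)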
(C) 2005, The University of Michigan 256
Naïve Bayesian classifiers

• Naïve Bayesian classifier:

  P(d ∈ C | F1, F2, ..., Fk) = P(F1, F2, ..., Fk | d ∈ C) P(d ∈ C) / P(F1, F2, ..., Fk)

• Assuming statistical independence of the features:

  P(d ∈ C | F1, F2, ..., Fk) = [ Π_{j=1..k} P(Fj | d ∈ C) ] P(d ∈ C) / [ Π_{j=1..k} P(Fj) ]
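Since the denominator Π P(Fj) is the same for every class, the classifier only needs to compare numerators. A textbook multinomial Naive Bayes with add-one smoothing, sketched to make the slide concrete (the class interface is mine, not from the lecture); trained on (message, spam/ham) pairs, it could be applied to messages like the one on the next slide:

    import math
    from collections import Counter, defaultdict

    class NaiveBayes:
        def fit(self, docs, labels):
            """Count terms per class from (text, label) training data."""
            self.class_counts = Counter(labels)
            self.term_counts = defaultdict(Counter)
            for doc, label in zip(docs, labels):
                self.term_counts[label].update(doc.lower().split())
            self.vocab = {t for c in self.term_counts.values() for t in c}
            return self

        def predict(self, doc):
            """Return the class maximizing log P(C) + sum_j log P(Fj | C)."""
            n = sum(self.class_counts.values())
            best, best_lp = None, -math.inf
            for c, cc in self.class_counts.items():
                total = sum(self.term_counts[c].values())
                lp = math.log(cc / n)  # log prior P(d in C)
                for t in doc.lower().split():
                    tf = self.term_counts[c][t]
                    lp += math.log((tf + 1) / (total + len(self.vocab)))
                if lp > best_lp:
                    best, best_lp = c, lp
            return best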
(C) 2005, The University of Michigan 257
Spam recognition

Return-Path: <[email protected]>
X-Sieve: CMU Sieve 2.2
From: "Ibrahim Galadima" <[email protected]>
Reply-To: [email protected]
To: [email protected]
Date: Tue, 14 Jan 2003 21:06:26 -0800
Subject: Gooday
DEAR SIR
FUNDS FOR INVESTMENTS
THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU

I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE ! TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT

THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A
(C) 2005, The University of Michigan 258
Well-known datasets

• 20 newsgroups
  – http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
• Reuters-21578
  – Categories: grain, acquisitions, corn, crude, wheat, trade...
• WebKB
  – http://www-2.cs.cmu.edu/~webkb/
  – Classes: course, student, faculty, staff, project, dept, other
  – NB performance (2000), per class: P = 26, 43, 18, 6, 13, 2, 94; R = 83, 75, 77, 9, 73, 100, 35
(C) 2005, The University of Michigan 259
Support vector machines
• Introduced by Vapnik in the early 90s.
(C) 2005, The University of Michigan 260
Semi-supervised learning
• EM
• Co-training
• Graph-based
(C) 2005, The University of Michigan 261
Additional topics
• Soft margins
• VC dimension
• Kernel methods
(C) 2005, The University of Michigan 262
• SVMs are widely considered to be the best method for text classification (see papers by Sebastiani, Cristianini, Joachims), e.g., 86% accuracy on Reuters.
• NB also good in many circumstances
(C) 2005, The University of Michigan 263
Readings

• Books:
  – Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley/ACM Press, 1999
  – Pierre Baldi, Paolo Frasconi, and Padhraic Smyth, Modeling the Internet and the Web: Probabilistic Methods and Algorithms, Wiley, 2003, ISBN 0-470-84906-1
• Papers:
  – Barabasi and Albert, "Emergence of scaling in random networks", Science (286) 509-512, 1999
  – Bharat and Broder, "A technique for measuring the relative size and overlap of public Web search engines", WWW 1998
  – Brin and Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998
  – Bush, "As We May Think", The Atlantic Monthly, 1945
  – Chakrabarti, van den Berg, and Dom, "Focused Crawling", WWW 1999
  – Cho, Garcia-Molina, and Page, "Efficient Crawling Through URL Ordering", WWW 1998
  – Davison, "Topical locality on the Web", SIGIR 2000
  – Dean and Henzinger, "Finding related pages in the World Wide Web", WWW 1999
  – Deerwester, Dumais, Landauer, Furnas, and Harshman, "Indexing by latent semantic analysis", JASIS 41(6), 1990
(C) 2005, The University of Michigan 264
Readings

• Erkan and Radev, "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization", JAIR 22, 2004
• Jeong and Barabasi, "Diameter of the world wide web", Nature (401) 130-131, 1999
• Hawking, Voorhees, Craswell, and Bailey, "Overview of the TREC-8 Web Track", TREC 2000
• Haveliwala, "Topic-sensitive PageRank", WWW 2002
• Kumar, Raghavan, Rajagopalan, Sivakumar, Tomkins, and Upfal, "The Web as a graph", PODS 2000
• Lawrence and Giles, "Accessibility of information on the Web", Nature (400) 107-109, 1999
• Lawrence and Giles, "Searching the World-Wide Web", Science (280) 98-100, 1998
• Menczer, "Links tell us about lexical and semantic Web content", arXiv 2001
• Page, Brin, Motwani, and Winograd, "The PageRank citation ranking: Bringing order to the Web", Stanford TR, 1998
• Radev, Fan, Qi, Wu, and Grewal, "Probabilistic Question Answering on the Web", JASIST 2005
• Singhal, "Modern Information Retrieval: An Overview", IEEE 2001
(C) 2005, The University of Michigan 265
More readings

• Gerard Salton, Automatic Text Processing, Addison-Wesley (1989)
• Gerald Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer (1997)
• Gerard Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill (1983)
• C. J. van Rijsbergen, Information Retrieval, Butterworths (1979)
• Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Managing Gigabytes, Van Nostrand Reinhold (1994)
• ACM SIGIR Proceedings, SIGIR Forum
• ACM conferences in Digital Libraries
(C) 2005, The University of Michigan 266
Thank you!
Благодаря! (Bulgarian: "Thank you!")