Overview - UCSD Cognitive Sciencerik/courses/cogs188_s10/slides/3-wgt-match.pdf · Overview The...

© R. K. Belew 1996-2001 Finding Out About Chapter 3: 25 Sept 01

Overview

The fascination with the subliminal, the camouflaged, and the encrypted is ancient. Getting a computer to munch away at long strings of letters from the Old Testament is not that different from killing animals and interpreting the entrails, or pouring out tea and reading the leaves. It does add the modern impersonal touch – a computer found it, not a person, so it must be “really there.” But computers find what people tell them to find. As the programmers like to say, “prophesy in, prophesy out.”


Level of analysis

Zipf’s Law

[Figure: Zipf's first law. Frequency of words plotted against rank order of words, with regions labeled too common, significant, and too rare.]
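A minimal sketch of the rank/frequency view behind the figure, assuming the usual reading of Zipf's first law: a word's frequency multiplied by its rank stays roughly constant. The function name `rank_frequency` and the toy text are illustrative assumptions, not part of the slides, and a text this small is far too short to show the law itself.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, freq, rank * freq) rows, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]

# Hypothetical toy text, used only to exercise the function.
toy_text = "the cat and the dog and the bird saw the cat".split()
rows = rank_frequency(toy_text)
for row in rows:
    print(row)
```

On a real corpus the last column would be expected to hover around a constant over the middle ranks.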

Zipfian Distribution of AIT Words

Principle of Least Effort

Other very clever people have provided other cognitive/linguistic explanations

Or, not!

WWW surfing behavior

Consequences of lexical decisions

Consequences ... (cont)

Token       Freq    Unstem-f
the         78428
of          50026
and         33834
a           31347
to          28666
in          21512
system      21488   8632
is          18781
model       14772   4796
for         14640
de          11923
network     10306   3965
this        10095
base         9838
that         9820
are          9792
learn        9293
world        8103
la           7678
author       7615
an           7593
knowledg     7410   5496
neural       7220   3912
with         7197
as           6964
on           6920
by           6886
process      6569   2900
design       6362   3308
del          6178
be           6045
develop      5891
integr       5633
domain       5630
based        5326
use          5226
intellig     5197
which        5158
control      5151   3288
expert       4953   2842
comput       4851
mechan       4818
escolar      4728

Consequences ... (cont2)

Token       Freq    Unstem-f
approach     4621   2535
from         4587
classifi     4556
algorithm    4533   2155
final        4436
systems      4387
can          4370
code         4116
robot        4103
intern       4097
applic       4055
perform      4051
percept      4047
method       4037   2003
enabl        4036
data         4013   3326
make         3984
increm       3947
incomplet    3890
secondli     3765
mo           3733
it           3697
used         3594
problem      3520

Token        Freq
we           3276
these        3268
using        3268
learning     3266
was          3205
has          3051
or           2859
been         2715
research     2622
have         2609
two          2601
developed    2550
information  2461
networks     2449
time         2370
s            2350
new          2293
also         2259
performance  2244
results      2239
were         2216
such         2165
problems     2133
analysis     2045
models       2000

Function words follow Poisson distribution

$$\Pr(n \text{ occurrences of } w) = \frac{e^{-\lambda_w}\,\lambda_w^{n}}{n!}$$
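A small sketch of the Poisson probability above, assuming λ_w is the average number of occurrences of word w per document. The function name `poisson_pmf` and the rate 1.5 are illustrative assumptions.

```python
import math

def poisson_pmf(n, lam):
    """Probability of exactly n occurrences when the expected count is lam."""
    return math.exp(-lam) * lam ** n / math.factorial(n)

# Hypothetical function word averaging 1.5 occurrences per document.
lam_w = 1.5
for n in range(5):
    print(n, round(poisson_pmf(n, lam_w), 4))
```

The probabilities over all n sum to 1, which is a quick sanity check on any chosen rate.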

Two-Poisson model

[Figure: two-Poisson model, with two rates λ labeled.]

“Resolving power”

Resolving Power

[Figure: Frequency of words plotted against rank order of words. Labels: Zipf's first law, resolving power, upper cut-off, lower cut-off, too common, significant, too rare.]

Indexing term distribution

Exhaustivity: Number of topics indexed

Specificity: ability to describe FOA information need precisely

Index: A balance between user and corpus

[Figure: the INDEX sits between Query and Corpus. Labels: Exhaustivity, Specificity.]

Not too exhaustive, not too specific...

[Figure: the INDEX sits between Query and Corpus. Labels: Exhaustivity, Specificity, "Representation of", "Discriminability of", few doc/kw, many kw/doc, "leads to", Hi Precision, Hi Recall.]

Factors in index weighting

$$\mathrm{freq}_{kd} \equiv N(\text{occurrences of } \mathrm{word}_k \text{ in } \mathrm{doc}_d)$$

$$w_{kd} \propto \mathrm{freq}_{kd} \cdot \mathrm{discrim}_k$$

Indexing Graph

Information is reduction in uncertainty

Hypothetical Word Distributions

Separate informative words from noise

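$$\mathrm{Noise}_k = \sum_{d=1}^{\mathrm{NDoc}} \frac{\mathrm{freq}_{kd}}{\mathrm{freq}_k} \cdot \log\frac{\mathrm{freq}_k}{\mathrm{freq}_{kd}}$$

$$\mathrm{Signal}_k = \mathrm{freq}_k - \mathrm{Noise}_k$$

$$w_{kd} = \mathrm{freq}_{kd} \cdot \mathrm{Signal}_k$$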
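A minimal sketch of the noise and signal computation, written to follow the slide's formulas literally. It assumes that freq_k is the word's total count over all documents, that the logarithm is natural, and that documents where the word does not occur contribute nothing to the sum. The function names and toy counts are illustrative assumptions.

```python
import math

def noise(doc_freqs):
    """Noise_k for one word, given its count in each document."""
    total = sum(doc_freqs)  # assumed meaning of freq_k
    # Documents with a zero count are skipped (their term tends to 0).
    return sum((f / total) * math.log(total / f) for f in doc_freqs if f > 0)

def signal(doc_freqs):
    """Signal_k = freq_k - Noise_k, as written on the slide."""
    return sum(doc_freqs) - noise(doc_freqs)

def signal_weights(doc_freqs):
    """w_kd = freq_kd * Signal_k for every document d."""
    s = signal(doc_freqs)
    return [f * s for f in doc_freqs]

# A word concentrated in one document versus one spread evenly over four.
print(noise([4, 0, 0, 0]), signal([4, 0, 0, 0]))
print(noise([1, 1, 1, 1]), signal([1, 1, 1, 1]))
```

The concentrated word has zero noise, while the evenly spread word has the largest noise possible for four documents, so it ends up with the lower signal.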

3.3.7 Inverse document frequency

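$$\mathrm{Doc}_k \equiv N(\text{documents containing } \mathrm{word}_k)$$

$$w_{kd} = \mathrm{freq}_{kd} \cdot \left( \log\frac{\mathrm{Norm}}{\mathrm{Doc}_k} + 1 \right)$$

$$\mathrm{Norm} = \begin{cases} \mathrm{NDoc} & \text{[Sparck Jones '72]} \\ \operatorname{argmax}_k \mathrm{Doc}_k & \text{[Sparck Jones '79]} \end{cases}$$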
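A minimal sketch of the inverse document frequency weight with both choices of Norm. It assumes a natural logarithm and reads the '79 variant as the largest document count of any word. The helper names and the toy corpus are illustrative assumptions.

```python
import math

def doc_counts(docs):
    """Doc_k: number of documents that contain each word."""
    counts = {}
    for tokens in docs:
        for word in set(tokens):  # count each word once per document
            counts[word] = counts.get(word, 0) + 1
    return counts

def idf_weight(freq_kd, doc_k, norm):
    """w_kd = freq_kd * (log(Norm / Doc_k) + 1)."""
    return freq_kd * (math.log(norm / doc_k) + 1)

# Hypothetical four-document corpus of already tokenized text.
docs = [["neural", "network"], ["neural", "model"],
        ["expert", "model"], ["expert", "system"]]
counts = doc_counts(docs)
n_doc = len(docs)               # Norm under the '72 variant
max_doc = max(counts.values())  # Norm under the '79 variant (assumed reading)

w72 = idf_weight(1, counts["neural"], n_doc)
w79 = idf_weight(1, counts["neural"], max_doc)
print(w72, w79)
```

Under the second normalization the most widely spread word gets a factor of exactly 1, and rarer words get more.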

Fig 3.7 Vector Space

Inter-document similarity

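$$\mathrm{Sim}(d_i, d_j) \equiv \text{similarity between documents } d_i \text{ and } d_j$$

$$D^{*} \equiv \text{centroid (average document)}$$

$$\overline{\mathrm{Sim}} \equiv \frac{1}{2\,\mathrm{NDoc}} \sum_{i,j} \mathrm{Sim}(d_i, d_j) = \alpha \sum_{i=1}^{\mathrm{NDoc}} \mathrm{Sim}(d_i, D^{*})$$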

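Removing keyword collapses document space

$$\overline{\mathrm{Sim}}_k \equiv \overline{\mathrm{Sim}} \text{ when } \mathrm{term}_k \text{ removed}$$

$$\mathrm{Disc}_k \equiv \overline{\mathrm{Sim}}_k - \overline{\mathrm{Sim}}$$

$$w_{kd} = \mathrm{freq}_{kd} \cdot \mathrm{Disc}_k$$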
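A rough sketch of the discrimination value, using the centroid form of the average similarity. It assumes cosine as the similarity function and takes the constant α to be 1/NDoc; neither choice is fixed by the slides. The function names and the toy term-by-document counts are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def average_similarity(docs):
    """Mean similarity of each document to the centroid D*."""
    n = len(docs)
    centroid = [sum(column) / n for column in zip(*docs)]
    return sum(cosine(d, centroid) for d in docs) / n

def discrimination(docs, k):
    """Disc_k: average similarity with term k removed, minus the original."""
    reduced = [d[:k] + d[k + 1:] for d in docs]
    return average_similarity(reduced) - average_similarity(docs)

# Rows are documents, columns are terms; term 0 occurs equally everywhere.
docs = [[2, 3, 0], [2, 0, 3], [2, 1, 1]]
for k in range(3):
    print(k, round(discrimination(docs, k), 4))
```

Term 0 gets a negative value (removing it spreads the documents apart), while terms 1 and 2 get positive values (removing either pulls the documents together).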

Length Normalization of Vector Space

Sensitivity of IDF to “Document” Size

Pivot-Based Document Length Normalization

Summary: SMART Weighting Specification

$$w_{kd} = \frac{\mathrm{freq}_{kd} \cdot \mathrm{collect}_k}{\mathrm{norm}}$$

Frequency of KW in DOC

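$$\mathrm{freq}_{kd} = \begin{cases} \{0, 1\} & \text{binary} \\ \dfrac{\mathrm{freq}_{kd}}{\max_k(\mathrm{freq}_{kd})} & \text{max norm} \\ \dfrac{1}{2} + \dfrac{1}{2} \cdot \dfrac{\mathrm{freq}_{kd}}{\max_k(\mathrm{freq}_{kd})} & \text{augmented} \\ \ln(\mathrm{freq}_{kd}) + 1 & \text{log} \end{cases}$$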
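The four frequency options can be sketched as small functions. This assumes `max_freq` is the largest raw count of any keyword in the same document, and that the log option returns 0 for a keyword that does not occur (the slide does not say). All names are illustrative assumptions.

```python
import math

def tf_binary(freq, max_freq):
    """1 if the keyword occurs at all, else 0."""
    return 1 if freq > 0 else 0

def tf_max_norm(freq, max_freq):
    """Raw count scaled by the largest count in the document."""
    return freq / max_freq

def tf_augmented(freq, max_freq):
    """Max-normalized count squeezed into the range [0.5, 1]."""
    return 0.5 + 0.5 * freq / max_freq

def tf_log(freq, max_freq):
    """Dampened count; zero counts are mapped to 0 by assumption."""
    return math.log(freq) + 1 if freq > 0 else 0.0

# A keyword seen 2 times in a document whose most frequent keyword occurs 4 times.
for option in (tf_binary, tf_max_norm, tf_augmented, tf_log):
    print(option.__name__, option(2, 4))
```

All four take the same arguments so they can be swapped freely in a weighting pipeline.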

Collection statistics of KW


Normalization

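$$\mathrm{norm} = \begin{cases} \sum_{\text{vector}} w_i & \text{sum} \\ \sqrt{\sum_{\text{vector}} w_i^{2}} & \text{cosine} \\ \sum_{\text{vector}} w_i^{4} & \text{fourth} \\ \max_{\text{vector}}(w_i) & \text{max} \end{cases}$$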
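A short sketch of the four normalization options applied to one document vector. It assumes the cosine option is the Euclidean length (the square root of the sum of squares) and the fourth option is the plain sum of fourth powers. The function names are illustrative assumptions.

```python
import math

def norm_sum(weights):
    return sum(weights)

def norm_cosine(weights):
    # Euclidean length of the vector (assumed meaning of the cosine option).
    return math.sqrt(sum(w * w for w in weights))

def norm_fourth(weights):
    return sum(w ** 4 for w in weights)

def norm_max(weights):
    return max(weights)

def normalize(weights, norm):
    """Divide every weight in the vector by the chosen norm."""
    n = norm(weights)
    return [w / n for w in weights]

vector = [3.0, 4.0]
print(normalize(vector, norm_cosine))
print(normalize(vector, norm_sum))
```

After cosine normalization every document vector has unit length, so inner products between such vectors are cosines.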

3.5.1 Measures of association

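$$Q = \{kw \in \text{query}\}$$

$$D = \{kw \in \text{document}\}$$

Shared features: $$|Q \cap D|$$

Dice coefficient: $$\frac{2\,|Q \cap D|}{|Q| + |D|}$$

Cosine coefficient: $$\frac{|Q \cap D|}{\sqrt{|Q|} \cdot \sqrt{|D|}}$$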
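The set-based measures can be sketched directly with Python sets, assuming the cosine coefficient divides by the square roots of the two set sizes. The function names and example keywords are illustrative assumptions.

```python
import math

def shared_features(q, d):
    """Number of keywords the query and document have in common."""
    return len(q & d)

def dice(q, d):
    """Twice the overlap over the combined number of keywords."""
    return 2 * len(q & d) / (len(q) + len(d))

def cosine_coefficient(q, d):
    """Overlap over the geometric mean of the two set sizes."""
    return len(q & d) / math.sqrt(len(q) * len(d))

# Hypothetical keyword sets for one query and one document.
query = {"neural", "network", "model"}
document = {"neural", "model", "expert", "system"}
print(shared_features(query, document))
print(dice(query, document))
print(cosine_coefficient(query, document))
```

Both coefficients equal 1 when the two sets are identical and 0 when they share no keywords.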

3.5.2 Cosine Similarity

Dissimilarity as “distance”

[Figure: two plots relating D and S, with axes labeled D and S.]

3.7 Computing Partial Match Scores