Latent Semantic Analysis - WordPress.com · Latent Semantic Analysis a tutorial at CogSci'2005...
1
Latent Semantic Analysis
a tutorial at CogSci'2005
Benoît Lemaire
Laboratoire Leibniz-IMAG (CNRS UMR 5522), University of Grenoble
2
Outline
- Intuitive introduction: representing the meaning of words from the analysis of huge corpora
- LSA technique
- Tests
- Cognitive models
- Limits
- Competing models
3
Intuitive introduction
Corpus
[Word cloud derived from the corpus: plane, airport, fly, bee, pilot, truck, car, drive, wings, bird, train, travel, parks, restaurant, eat, apple, banana, vegetables, apples, fruit, tree, garden, flower, flowers, grow, bouquet, petals, flying, flew, nectar, honey]
4
From words to texts
The pilot parks the plane
8
Usages of LSA
[Diagram: Corpus → semantic space (word cloud as above)]
9
LSA as a tool
- Compare texts
- Control experiment material
[Diagram: Corpus → semantic space (word cloud as above)]
10
LSA as a cognitive model of semantic memory
Build lots of cognitive models:
- Text comprehension
- Interface navigation
- ...
[Diagram: Corpus → semantic space (word cloud as above)]
11
LSA as a cognitive model of word acquisition
[Diagram: Corpus → semantic space (word cloud as above)]
Humans also construct the meaning of words from written material. Does LSA account for this process?
12
Outline
- Intuitive introduction
- LSA technique
- Tests
- Cognitive models
- Limits
- Competing models
13
Semantic information is drawn from raw texts
- Take a huge corpus
- Split it into paragraphs
- Determine the statistical context in which each word occurs
14
A first approach based on co-occurrence
Define the meaning of a word from the co-occurring words:
Two words are similar if they occur in the same paragraphs
plane= (1,0,0,1,0,0,0,0,2,1,0,1,1,0,0,0,0,0,1,1)
airport=(1,1,0,0,0,0,0,0,1,1,0,2,1,0,0,0,0,0,0,1)
Does not work well (French, Perfetti)
18
A better approach
Two words are similar if they occur in the same paragraph
Two words are similar if they occur in similar paragraphs
Two paragraphs are similar if they contain common words
Two paragraphs are similar if they contain similar words
19
A mutual recursion
Mathematicians know how to solve such circular systems
- Suppose P paragraphs and W words
- Build M, a P x W matrix
- M[i,j] = number of occurrences of word j in paragraph i
- Each word is thus a P-dimensional vector
- Reduce the matrix to its best 300 dimensions
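As a sketch (not the tutorial's own code), the matrix construction and the rank reduction can be written with numpy's SVD; the whitespace tokenizer and the two toy paragraphs are illustrative assumptions:

```python
import numpy as np

def build_count_matrix(paragraphs):
    # M[i, j] = number of occurrences of word j in paragraph i (a P x W matrix)
    vocab = sorted({w for p in paragraphs for w in p.split()})
    index = {w: j for j, w in enumerate(vocab)}
    M = np.zeros((len(paragraphs), len(vocab)))
    for i, p in enumerate(paragraphs):
        for w in p.split():
            M[i, index[w]] += 1
    return M, vocab

def reduce_rank(M, k):
    # Truncated SVD: keep only the k largest singular values
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

paragraphs = ["the pilot parks the plane", "the plane left the airport"]
M, vocab = build_count_matrix(paragraphs)
M_reduced = reduce_rank(M, 1)
```

On a real corpus the same pipeline runs with hundreds of thousands of paragraphs and k around 300.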
20
Example (Landauer & Dumais 1997)
           §1 §2 §3 §4 §5 §6 §7 §8 §9
Human       1  0  0  1  0  0  0  0  0
interface   1  0  1  0  0  0  0  0  0
computer    1  1  0  0  0  0  0  0  0
user        0  1  1  0  1  0  0  0  0
system      0  1  1  2  0  0  0  0  0
response    0  1  0  0  1  0  0  0  0
time        0  1  0  0  1  0  0  0  0
EPS         0  0  1  1  0  0  0  0  0
survey      0  1  0  0  0  0  0  0  1
trees       0  0  0  0  0  1  1  1  0
graph       0  0  0  0  0  0  1  1  1
minors      0  0  0  0  0  0  0  1  1
§1: Human machine interface for ABC computer applications
§2: A survey of user opinion of computer system response time
§3: The EPS user interface management system
§4: System and human system engineering testing of EPS
§5: Relation of user perceived response time to error measurement
§6: The generation of random, binary, ordered trees
§7: The intersection graph of paths in trees
§8: Graph minors IV: Widths of trees and well-quasi-ordering
§9: Graph minors: A survey
21
Matrix reduction
M = U × Σ × V^T   (singular value decomposition of the paragraphs × words matrix M)
22
Matrix reduction
M = U × Σ × V^T
M' = U_k × Σ_k × V_k^T   (keep only the k largest singular values)
23
Example (Landauer & Dumais 1997)
Original counts (§1–§9) and their rank-2 reconstruction:
Human      1 0 0 1 0 0 0 0 0  |  .16  .40  .38  .47 .18 -.05 -.12 -.16 -.09
interface  1 0 1 0 0 0 0 0 0  |  .14  .37  .33  .40 .16 -.03 -.07 -.10 -.04
computer   1 1 0 0 0 0 0 0 0  |  .15  .51  .36  .41 .24  .02  .06  .09  .12
user       0 1 1 0 1 0 0 0 0  |  .26  .84  .61  .70 .39  .03  .08  .12  .19
system     0 1 1 2 0 0 0 0 0  |  .45 1.23 1.05 1.27 .56 -.07 -.15 -.21 -.05
response   0 1 0 0 1 0 0 0 0  |  .16  .58  .38  .42 .28  .06  .13  .19  .22
time       0 1 0 0 1 0 0 0 0  |  .16  .58  .38  .42 .28  .06  .13  .19  .22
EPS        0 0 1 1 0 0 0 0 0  |  .22  .55  .51  .63 .24 -.07 -.14 -.20 -.11
survey     0 1 0 0 0 0 0 0 1  |  .10  .53  .23  .21 .27  .14  .31  .44  .42
trees      0 0 0 0 0 1 1 1 0  | -.06  .23 -.14 -.27 .14  .24  .55  .77  .66
graph      0 0 0 0 0 0 1 1 1  | -.06  .34 -.15 -.30 .20  .31  .69  .98  .85
minors     0 0 0 0 0 0 0 1 1  | -.04  .25 -.10 -.21 .15  .22  .50  .71  .62
r(human,user) = -.38 before reduction; r(human,user) = .94 after
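The change in r(human, user) can be checked numerically; a sketch with numpy, with the rows in the slide's order:

```python
import numpy as np

# Term-by-paragraph counts for the 9 titles (rows: human, interface, computer,
# user, system, response, time, EPS, survey, trees, graph, minors)
M = np.array([
    [1,0,0,1,0,0,0,0,0],
    [1,0,1,0,0,0,0,0,0],
    [1,1,0,0,0,0,0,0,0],
    [0,1,1,0,1,0,0,0,0],
    [0,1,1,2,0,0,0,0,0],
    [0,1,0,0,1,0,0,0,0],
    [0,1,0,0,1,0,0,0,0],
    [0,0,1,1,0,0,0,0,0],
    [0,1,0,0,0,0,0,0,1],
    [0,0,0,0,0,1,1,1,0],
    [0,0,0,0,0,0,1,1,1],
    [0,0,0,0,0,0,0,1,1],
], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
M2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]   # rank-2 reconstruction

human, user = 0, 3
r_before = np.corrcoef(M[human], M[user])[0, 1]   # negative: they never co-occur
r_after = np.corrcoef(M2[human], M2[user])[0, 1]  # strongly positive after reduction
```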
24
Meaning as vectors
The meaning of each word is represented as a vector
[2-D plot for illustration: words such as computer, disk, work, bike and eat as points]
disk = (2,4,6,-1,0,3,-4,8,-5,8,12, ..., 9,0,1,-9,4,1)
computer = (1,4,5,-1,0,2,-2,5,-6,8,11, ..., 8,0,0,-5,7,1)
Illustrative purpose only: does not work in 2 dimensions!
25
Comparison with symbolic approaches
[Two views of bike: its vector-space neighborhood (ride, wheel, bicycle, fun, handlebars) and a semantic network: bike SYNONYM bicycle, FUNCTION ride, HAS wheels, HAS handlebars, IS-A human-propelled vehicle IS-A vehicle]
26
Comparison with symbolic approaches
Symbolic representation:
- Representation is absolute
- Good for drawing inferences
- Not so good for comparisons
- Hard to compound nodes
[Semantic network diagram as above]
27
Comparison with symbolic approaches
Vector representation:
- Representation is relative
- Good for semantic comparisons
- Easy to compound words
[Vector-space neighborhood diagram as above]
28
Semantic comparison
Semantic distance = cosine between vectors, a value between −1 and 1
Other measures:
- Euclidean distance
- Hellinger distance
- ...
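A minimal pure-Python sketch of the two most common measures:

```python
import math

def cosine(u, v):
    # Cosine similarity: 1 = same direction, 0 = orthogonal, -1 = opposite
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    # Euclidean distance between the two vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Cosine ignores vector length, which is why it is preferred for comparing texts of different sizes.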
29
Outline
- Intuitive introduction
- LSA technique
- Tests
- Cognitive models
- Limits
- Competing models
30
Word comparisons
A semantic space built from a 4.6 million word encyclopedia (lsa.colorado.edu)
Closest words to bicycle: pedals .84, handlebars .79, bicycles .79, pedaling .75, bike .75
fly – insect: .26
fly – plane: .48
insect – plane: .02
31
Existing semantic spaces
English: encyclopedia, literature, psychology textbooks, scientific textbooks (available at lsa.colorado.edu)
French: newspapers, literature, children's tales, children's productions (some of them available at lsa.colorado.edu)
32
What is the right number of dimensions?
- Too few dimensions would not capture the latent semantic structure
- Too many dimensions would describe idiosyncratic structures
- An open issue
- 300 dimensions seems a good value for the whole language (Dumais found a performance maximum at 90 dimensions for a specific domain)
33
From words to texts
Sum up word vectors to get text vectors
disk= (2, 4, 6,-1, 0, 3,-4, 8,.....9, 0,-9, 4, 1)
computer=(1 4, 5,-1, 0, 2,-2, 5,.....8, 0,-5, 7, 1)
kid= (1, 0, 0,-9, 4, 5,-1, 2,.....0, 2, 3, 4, 5)
broken= (5,-3, 8, 6, 5,-2, 2, 0,.....6, 1, 7, 1, 2)
S = The computer disk was broken by the kid
S= (9, 5,19,-5, 9, 8,-5, 15....23, 3,-4,16, 9)
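This bag-of-words composition can be sketched in a few lines; the toy 3-dimensional vectors below reuse the first components of the slide's vectors:

```python
def text_vector(words, space):
    # Sum the vectors of all known words; word order is ignored (bag of words)
    dims = len(next(iter(space.values())))
    total = [0.0] * dims
    for w in words:
        if w in space:
            total = [t + x for t, x in zip(total, space[w])]
    return total

space = {  # first three components of the slide's vectors
    "disk": [2, 4, 6],
    "computer": [1, 4, 5],
    "kid": [1, 0, 0],
    "broken": [5, -3, 8],
}
s = text_vector("the computer disk was broken by the kid".split(), space)
# s = [9.0, 5.0, 19.0], matching the first three components of S above
```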
34
Text comparisons
A semantic space built from a 4.6 million word encyclopedia (lsa.colorado.edu)
The cat was lost in a forest / My little feline disappeared in the trees: .66
The radius of spheres / a circle's diameter: .55 (Landauer)
The radius of spheres / the music of spheres: .01(Landauer)
35
No syntax processing
Paragraphs are bags of words. Semantics is determined from the lexical level.
[Diagram: lexical level → semantic level, bypassing the syntax level]
36
No syntax processing
We learned that this is about birds, nests and a building process.
The analysis unit is the paragraph:
Birds build nests
Nests build birds
37
Outline
- Intuitive introduction
- LSA technique
- Tests
- Cognitive models
- Limits
- Competing models
38
Models based on LSA
- Semantic representation
- Vocabulary acquisition
- Text comprehension
- Free text assessment
39
Semantic representation
- Comparison of words: TOEFL test, vocabulary test for children
- Comparison of sentences: measure of coherence
- Comparison of texts
40
Comparison of words
Synonymy part of the TOEFL test (Landauer & Dumais, 97)
80 items: 1 word / 4 alternative words
Guess which one is the closest
Example: levied
- imposed
- believed
- requested
- correlated
LSA was trained on a 4.6 million word corpus
LSA score: 64.4%
Foreign applicants to US colleges: 64.5%
41
Vocabulary test: LSA vs. children
(Denhière & Lemaire, 2004)
115 items
4 definitions : correct, close, distant, unrelated
Subjects were in grades 2 to 4
Example (translated from French):
slope:
- rising road
- tilted surface which goes up or down
- small piece of ground
- face of a rock or a mountain
42
A children's semantic space
- Stories and tales for children: ~1,600,000 words
- Children's productions: ~800,000 words
- Textbooks: ~400,000 words
- Encyclopedia: ~400,000 words
- Dictionary: ~50,000 words
TOTAL: 3.2 million words
43
Vocabulary test: results
[Bar chart: percentage of each answer type (correct, close, far, unrelated) chosen by children in grades 2, 3 and 4; y-axis 0–70]
44
Vocabulary test: results
[Bar chart as above, with the LSA model's scores added alongside the children's data]
45
99 most frequent words
[Bar chart: children vs. model, comparing the full 115-word test with the 99 most frequent words]
46
Vocabulary test: word lemmatization
[Bar chart: children vs. model; 115 words, 99 words, and word lemmatization]
47
Vocabulary test: verb lemmatization
[Bar chart: children vs. model; 115 words, 99 words, word lemmatization, verb lemmatization]
48
Comparison of sentences
● Use LSA to measure the coherence of a text
● Compute the semantic similarities between adjacent sentences
coherence(T) = (1 / (n−1)) × Σ_{i=1..n−1} cosine(sentence_i, sentence_{i+1})
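The formula is just the mean cosine of adjacent sentence pairs; a minimal Python sketch, with the sentence vectors assumed to be already computed (e.g. by summing word vectors):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def coherence(sentence_vectors):
    # Mean cosine between each pair of adjacent sentence vectors
    pairs = list(zip(sentence_vectors, sentence_vectors[1:]))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```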
49
Foltz et al. (1998) experiment
McNamara et al. (1996) data: 4 versions of a text about the heart
- Low local coherence, low macro coherence (lm)
- Low local coherence, high macro coherence (lM)
- High local coherence, low macro coherence (Lm)
- High local coherence, high macro coherence (LM)
Corpus: 21 texts, 100 dimensions
50
Foltz et al. (1998) experiment
Results:
- Coherence(lm) = 0.177
- Coherence(lM) = 0.205
- Coherence(Lm) = 0.210
- Coherence(LM) = 0.242
[Scatter plot of comprehension vs. coherence: LSA coherence predicts comprehension (r² = 0.853) far better than word overlap (r² = 0.098)]
51
Can LSA assess text recall ?
Measure M1 : cosine between source text and texts recalled
Measure M2 : number of propositions recalled
Text      Recall     Subjects                    Correlation M1/M2
Poule     Immediate  n = 48 (8.3 years old)      .49
Dragon    Immediate  n = 48 (8.3 years old)      .64
Dragon    Delayed    n = 48 (8.3 years old)      .70
Araignée  Immediate  n = 57 (16-20 years old)    .81
Clown     Immediate  n = 57 (16-20 years old)    .65
Géant     Immediate  n = 130 (16-20 years old)   .60
Ourson    Immediate  n = 44 (16-20 years old)    .48
Clown     Summary    n = 44 (16-20 years old)    .86
52
Comparison of textsFoltz et al. (1996)
● Compare texts and their summaries
● Britt et al. (1994) data: 24 participants read 21 texts, then write an essay
● Corpus: 21 texts + 8 encyclopedia texts + 2 book excerpts
● Assessing the essays:
● Score 1 = average similarity between each sentence and the closest text sentence
● Score 2 = same, but comparing with the 10 main text sentences
53
Comparison of textsFoltz et al. (1996)
          Judge2  Judge3  Judge4  Score1  Score2
Judge1    .575**  .768**  .381    .418*   .589*
Judge2            .582**  .412*   .552**  .626**
Judge3                    .367    .317    .384+
Judge4                            .117    .240
** p<.01   * p<.05   + p<.06
54
Models based on LSA
- Semantic representation
- Vocabulary acquisition
- Text comprehension
- Free text assessment
55
A model of learning
LSA learns the meaning of words from texts. Does it mimic the human rate of learning?
Landauer & Dumais (1997) experiment: most of the lexicon is acquired from text
Children data:
- 3,500 words read / day
- 50 new words encountered / day
- 10 words learned / day
- Controlled experiments lead to 2.5 words learned / day
56
Direct and indirect effects
How do children acquire the remaining 7.5 words/day?
The poverty of the stimulus problem: how do children acquire so much knowledge on the basis of so little information?
Hypothesis: a word is learned…
- from texts containing it (direct effect)
- from texts not containing it (indirect effect)
57
LSA simulation
TOEFL test score is a function of:
- T: total number of paragraphs
- S: number of paragraphs containing the stem word
score = 0.128 · log(0.076 T) · log(31.91 S)
Correlation with the observed score = .98
After “reading” 25,000 paragraphs, what is the effect of the 25,001st?
- When the stem word belongs to it?
- When the stem word does not belong to it?
58
Example
Suppose a word that appeared twice in the 25,000 paragraphs:
Score = 0.128 · log(0.076 × 25000) · log(31.91 × 2) = .75809
Direct effect: the word belongs to the 25,001st paragraph
Score = 0.128 · log(0.076 × 25001) · log(31.91 × 3) = .83202
Indirect effect: the word does not belong to the 25,001st paragraph
Score = 0.128 · log(0.076 × 25001) · log(31.91 × 2) = .75810
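These figures fall out of the fitted formula with base-10 logarithms (the base is inferred here from the slide's numbers; the source does not state it):

```python
import math

def toefl_score(T, S):
    # Fitted TOEFL score: T paragraphs read, S of them containing the stem word
    return 0.128 * math.log10(0.076 * T) * math.log10(31.91 * S)

baseline = toefl_score(25000, 2)
direct = toefl_score(25001, 3)    # the new paragraph contains the word
indirect = toefl_score(25001, 2)  # it does not
```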
59
LSA simulation
Effects are cumulated over all word-frequency bands
Direct effect:
- .0007 words gained per word encountered
- .0007 × 3,500 = 2.5 words acquired / day
Indirect effect:
- .15 words gained per paragraph encountered
- .15 × 50 = 7.5 words acquired / day
The meaning of a word would thus be acquired:
- for 25% by reading texts containing it
- for 75% by reading texts not containing it
60
Direct and indirect effects (Lemaire et al., to appear)
[Line plot: cosine between acheter (to buy) and magasin (store) as a function of the number of paragraphs read (from about 2,000 to 13,500), rising from roughly −0.1 to 0.6]
61
Cosine acheter/magasin
Total gain: .58
Gain due to the occurrence of acheter: -.10
Gain due to the occurrence of magasin: -.19
Gain due to the co-occurrence: .73
Gain due to 2nd order co-occurrences: .11
Gain due to 3rd and more co-occurrences: .03
62
Same measures on 28 words
Total gain: .13
Gain due to the occurrence of W1: -.16
Gain due to the occurrence of W2: -.19
Gain due to the co-occurrence: .34
Gain due to 2nd order co-occurrences: .05
Gain due to 3rd and more co-occurrences: .09
63
Models based on LSA
- Semantic representation
- Vocabulary acquisition
- Text comprehension
- Free text assessment
64
Text comprehension
Cognitive models can be simulated on computers, using LSA as a model of semantic memory:
- the construction-integration model (Kintsch, 1998)
- a predication algorithm
- metaphor comprehension
65
A computational model of text comprehension (Lemaire et al., to appear)
[Architecture diagram: each next sentence enters working memory; associates are selected from semantic memory (LSA); previous elements are retrieved from episodic memory; decay and integration complete the cycle]
66
Example (translated from French)
A woodcutter was walking in the forest
Select associates:
- walk: stroll, meet, pick
- woodcutter: ax, forest, cottage
- forest: glade, oak, wood
67
Example (translated from French)
A woodcutter was walking in the forest
[Network: the proposition walk/woodcutter/forest linked to walk, woodcutter and forest, each linked to its associates (stroll, meet, pick; ax, cottage; glade, oak, wood)]
71
The construction-integration model
Before LSA, the construction step was performed by hand and link weights were set approximately.
All nodes being vectors, the weight of a link is the cosine of the corresponding nodes.
A question remains: what is the vector of P(A1,…,An)?
72
The predication algorithm (Kintsch, 2001)
The centroid rule: P(A) → P + A
…does not work well: the meaning of the predicate depends on the meaning of the arguments
  The horse ran / The color ran
  shark(lawyer) ≠ shark + lawyer
A better rule: P(A) → P + A + M1 + … + Mn
where the Mi are vectors that are close to both P and A
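A sketch of that rule; the neighbor list, the toy 2-D vectors and m are illustrative assumptions (Kintsch's actual algorithm selects the Mi by spreading activation over LSA neighbors):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def predication(P, A, neighbors_of_P, space, m=1):
    # P(A) -> P + A + the m neighbors of P that are most similar to A
    ranked = sorted(neighbors_of_P, key=lambda w: cosine(space[w], space[A]), reverse=True)
    chosen = [P, A] + ranked[:m]
    return [sum(space[w][d] for w in chosen) for d in range(len(space[P]))]

space = {  # toy 2-D vectors, for illustration only
    "ran": [1.0, 1.0], "horse": [1.0, 0.0], "color": [0.0, 1.0],
    "stopped": [1.0, 0.1], "down": [0.1, 1.0],
}
horse_ran = predication("ran", "horse", ["stopped", "down"], space)  # picks stopped
color_ran = predication("ran", "color", ["stopped", "down"], space)  # picks down
```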
73
Example 1: construction step (Kintsch, 2001)
The horse ran → ran(horse)
[Construction: the neighbors of ran (stopped, down, hopped) linked to horse, with weights .21, .18, .12]
74
Example 1: integration step
The horse ran → ran(horse)
[Integration: activation settles to .46, .19, .00; stopped remains the active associate]
The horse ran → horse + ran + stopped
75
Example 2: construction step
The color ran → ran(color)
[Construction: the neighbors of ran (stopped, down, hopped) linked to color, with weights .06, .11, .02]
76
Example 2: integration step
The color ran → ran(color)
[Integration: activation settles to .00, .70, .00; down remains the active associate]
The color ran → color + ran + down
77
Predication algorithm: test (Kintsch, 2001)
Landmarks   ran   horse ran   color ran
gallop      .33   .75         .29
dissolve    .01   .01         .11
78
Inferences (Kintsch, 2001)
The hunter shot the elk:
- The elk was dead: centroid .54, predication .70
- The hunter was dead: centroid .66, predication .73
The student dropped the glass:
- The glass was broken: centroid .66, predication .91
- The student was broken: centroid .76, predication .87
The student washed the table:
- The table was clean: centroid .71, predication .83
- The student was clean: centroid .70, predication .62
79
Metaphors
My lawyer is a shark → shark(lawyer)
[Diagram: lawyer and shark as distant vectors, with the sentence My lawyer is a shark between them]
The meaning of my lawyer is a shark is not a simple combination of the meaning of lawyer and the meaning of shark
80
Metaphors
Look for words that are similar to shark but also close to lawyer
[Diagram: neighbors of shark (fin, teeth, sea, sharks, vicious, fish, aggressive); vicious and aggressive are also close to lawyer]
81
Metaphors
[Diagram: the vector of My lawyer is a shark shifts toward vicious and aggressive]
82
Metaphors
Many more neighbors of the predicate need to be considered. According to Kintsch, predication fails if fewer than 100 neighbors are considered!
83
An incremental algorithm (Lemaire & Bianco, 2003)
Does not work on a predefined set of neighbors but looks for them progressively
This kid is intelligent: 1… 2… 3 clever 4… 5 brilliant 6… 7… 8… 9 sage
This kid is a scientist: 1… 2… 3… … 97 invention … 128 knowledge … 138 wonderful
84
Experiment
Does the model account for the processing time of various kinds of metaphors?
12 texts, 4 conditions:
- Synonym (child / kid)
- Conventional metaphor (child / scientist)
- Original metaphor (child / sage)
- Bad metaphor (child / colonel)
85
Subjects (N=49)
[Bar chart: additional reading time (ms) by metaphor type (synonym, conventional, unconventional, bad), ranging from about −200 to 350 ms]
86
Subjects and model
Correlation: .62
[Paired charts: the simulation's number of neighbors (0–300) and the subjects' additional reading time (ms) by metaphor type (synonym, conventional, unconventional, bad)]
87
Models based on LSA
- Semantic representation
- Vocabulary acquisition
- Text comprehension
- Free text assessment
88
Free text assessment
Assessing the content of a student essay:
- Rely on a domain-specific corpus
- Compare the student essay with a reference text:
  - pre-graded essays
  - the text to be summarized
  - portions of a course text
Trusso Haley et al. (2005) published a good state-of-the-art report (Tech. Rep., Open University)
89
Intelligent Essay Assessor
Foltz et al. (1999)
Holistic score:
- Compare the student essay with pre-graded essays
- Give the grade of the closest one
Gold standard score:
- Compare the student essay with a teacher essay
- Give the grade according to the semantic similarity
Correlation with teacher's grades: around .8
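The holistic scheme is a nearest-neighbor rule in the semantic space; a sketch (the cosine measure and the tiny pre-graded set are illustrative):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def holistic_grade(essay, graded):
    # Return the grade of the pre-graded essay closest (by cosine) to the student essay
    best_vector, best_grade = max(graded, key=lambda pair: cosine(essay, pair[0]))
    return best_grade

graded = [([1.0, 0.2], "A"), ([0.2, 1.0], "C")]
grade = holistic_grade([0.9, 0.3], graded)  # closest to the "A" essay
```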
90
Summary Street
Wade-Stein & E. Kintsch (2003)
Helps the student to summarize a text
Indicates the parts of the text that are well or poorly summarized
An experiment with 60 6th-grade students has shown a positive effect
91
Summary Street
92
Apex
Dessus & Lemaire (1999, 2001, 2002)
Compare the student essay with relevant portions of the course
The course is structured:
#T The first chapter
#N In the first part, we will study the way...
#N Now, I will present a method...
...
#T The second chapter
#N ...
93
94
Apex
Correlation with teacher's grades: .62
Not only for grading, but for engaging students in a revising process
Can be used at a distance
95
Outline
- Intuitive introduction
- LSA technique
- Tests
- Cognitive models
- Limits
- Competing models
96
Strengths and limits
LSA strengths:
- fully automatic
- the vectorial representation allows easy representation of sequences of words and easy comparisons between words or texts
- a cognitive model of learning and representation
LSA weaknesses:
- no syntax processing
- representations are not explicit
- not incremental
- paragraphs are bags of words
97
Outline
- Intuitive introduction
- LSA technique
- Tests
- Cognitive models
- Limits
- Competing models
98
Competing models
Cognitive models of semantic memory could mimic:
- the human semantic associations, or
- the human construction of these associations over a long period of time
Semantic networks are good models of semantic associations BUT they do not account for the construction of these associations
99
Features
- Input (corpus, word association norms)
- Knowledge representation (vector-based, network-based)
- Addition of a new text (incremental or not)
- Unit of context (paragraph, sliding window)
- Use of high-order co-occurrences (yes/no)
- Compositionality (representing the meaning of a text from the meaning of its words)
100
LSA
- Input: corpus
- Knowledge representation: vector-based
- Addition of a new text: not incremental
- Unit of context: paragraph
- Use of high-order co-occurrences: yes
- Compositionality: easy
101
HAL (Burgess, 1998)
Word co-occurrence matrix built from a N-word sliding window
vector = concatenation of row and column
barn = <0,2,4,3,6,0,5,0,0,0>
Measure of similarity = Euclidean distance
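A sketch of the window pass; the window size and the weighting scheme (window − distance + 1, so closer words weigh more) follow the usual HAL description, and the details are assumptions:

```python
def hal_matrix(tokens, window=5):
    # M[a][b] accumulates, for each occurrence of word a, weighted counts of
    # word b appearing before it within the window (closer words weigh more)
    vocab = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(vocab)}
    M = [[0] * len(vocab) for _ in vocab]
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                M[idx[tokens[i + d]]][idx[w]] += window - d + 1
    return M, vocab

def hal_vector(word, M, vocab):
    # A word's vector = its row concatenated with its column
    i = vocab.index(word)
    return M[i] + [row[i] for row in M]

M, vocab = hal_matrix("the horse raced past the barn".split(), window=2)
```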
104
HAL (Burgess, 1998)
- Input: corpus
- Knowledge representation: vector-based
- Addition of a new text: incremental
- Unit of context: sliding window
- Use of high-order co-occurrences: no
- Compositionality: easy
105
Word association space (Steyvers et al., in press)
Based on association norms for 5,000 words
Word/word matrix scaled to 200-500 dimensions
106
107
Word association space (Steyvers et al., in press)
- Input: association norms
- Knowledge representation: vector-based
- Addition of a new text: not incremental
- Unit of context: N/A
- Use of high-order co-occurrences: no
- Compositionality: easy
108
PMI-IR (Turney, 2001)
Based on Mutual Information (Church & Hanks, 89): compares the probability of two words occurring together with the probability of the words occurring separately
MI(A,B) = log2( p(A,B) / (p(A) · p(B)) )
TOEFL test: 1 problem word / 4 choices
score1(choice) = hits(problem AND choice) / hits(choice)
score2(choice) = hits(problem NEAR choice) / hits(choice)
...
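Both measures are straightforward to compute from counts; the hit counts in this sketch are made-up illustrations, not real search-engine figures:

```python
import math

def pmi(hits_ab, hits_a, hits_b, total):
    # MI(A,B) = log2( p(A,B) / (p(A) * p(B)) )
    return math.log2((hits_ab / total) / ((hits_a / total) * (hits_b / total)))

def score1(hits_problem_and_choice, hits_choice):
    # Turney's score1: co-occurrence hits normalized by the choice word's hits
    return hits_problem_and_choice / hits_choice
```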
110
PMI-IR (Turney, 2001)
- Input: web
- Knowledge representation: N/A
- Addition of a new text: incremental
- Unit of context: web page
- Use of high-order co-occurrences: no
- Compositionality: hard
111
ICAN (Lemaire & Denhière, 2004)
Network-based representation:
- better for quickly retrieving neighbors
- better for representing asymmetrical relations
[Network: cable linked to electric, wire, connector, network, rope and plug, with weights between .2 and .7]
112
ICAN
... if you have such a [device, connect the cable to the network connector] then switch ...
Create or reinforce co-occurrence links
[Before/after networks: the links among the window words (device, connect, cable, network, connector) are created or reinforced]
113
ICAN
... if you have such a [device, connect the cable to the network connector] then switch ...
Create or reinforce co-occurrence links
Create or slightly reinforce 2nd-order co-occurrence links
[Before/after networks: weaker 2nd-order links are added between window words and neighbors of other window words]
114
ICAN
... if you have such a [device, connect the cable to the network connector] then switch ...
Create or reinforce co-occurrence links
Create or slightly reinforce 2nd-order co-occurrence links
Slightly decrease other links
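The three rules can be sketched as one update function; the gain and decay rates here are illustrative assumptions, not the published parameters:

```python
def ican_update(links, window, cooc_gain=0.1, indirect_gain=0.02, decay=0.01):
    # links: {frozenset({w1, w2}): weight}; window: words of the current context
    in_window = set(window)
    new = {}
    for pair, w in links.items():
        a, b = tuple(pair)
        if a in in_window and b in in_window:
            w += cooc_gain * (1 - w)        # rule 1: co-occurrence, reinforce
        elif a in in_window or b in in_window:
            w += indirect_gain * (1 - w)    # rule 2: 2nd-order, slight reinforcement
        else:
            w -= decay * w                  # rule 3: all other links slowly decay
        new[pair] = w
    for a in in_window:                     # rule 1 also creates missing links
        for b in in_window:
            if a < b:
                new.setdefault(frozenset((a, b)), cooc_gain)
    return new

links = {frozenset(("cable", "wire")): 0.5, frozenset(("rope", "knot")): 0.5}
updated = ican_update(links, ["cable", "connector"])
```

Unlike LSA's batch SVD, each new context updates the network directly, which is what makes the model incremental.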
[Before/after networks: links not involving the window words decay slightly]
115
ICAN
3.2 million word corpus of texts for children
Test on 1,200 pairs of words (from association norms)
Average weights:
- stem / 1st associate: .415
- stem / 2nd associate: .269
- stem / 3rd associate: .236
- stem / last associates: .098
Correlation with human data: .50
116
ICAN
- Input: corpus
- Knowledge representation: network-based
- Addition of a new text: incremental
- Unit of context: sliding window
- Use of high-order co-occurrences: yes
- Compositionality: hard
117
Web pages
Tests: lsa.colorado.edu
References: www-leibniz.imag.fr/~blemaire/lsa.html