
1

Latent Semantic Analysis

a tutorial at CogSci'2005

Benoît Lemaire

Laboratoire Leibniz-IMAG (CNRS UMR 5522), University of Grenoble, France

Benoit.Lemaire@imag.fr

2

Outline

Intuitive introduction: representing the meaning of words from the analysis of huge corpora
LSA technique
Tests
Cognitive models
Limits
Competing models

3

Intuitive introduction

Corpus

[Figure: a cloud of corpus words: plane, airport, fly, bee, pilot, truck, car, drive, wings, bird, train, travel, parks, restaurant, eat, apple, banana, vegetables, apples, fruit, tree, garden, flower, flowers, grow, bouquet, petals, flying, flew, nectar, honey]

4

From words to texts

[Figure: the corpus word cloud from the previous slide]

The pilot parks the plane


8

Usages of LSA

[Figure: the corpus word cloud from the previous slides]

Corpus


9

LSA as a tool

Compare texts
Control experiment material

[Figure: the corpus word cloud from the previous slides]

Corpus


10

LSA as a cognitive model of semantic memory

Build lots of cognitive models:
Text comprehension
Interface navigation
...

Corpus

[Figure: the corpus word cloud from the previous slides]


11

LSA as a cognitive model of word acquisition

[Figure: the corpus word cloud from the previous slides]

Corpus

Humans also construct the meaning of words from written material
Does LSA account for this process?


12

Outline

Intuitive introduction
LSA technique
Tests
Cognitive models
Limits
Competing models

13

Semantic information is drawn from raw texts

Take a huge corpus
Split it into paragraphs
Determine the statistical context in which each word occurs


14

A first approach based on co-occurrence

Define the meaning of a word from the co-occurring words
Two words are similar if they occur in the same paragraphs

plane= (1,0,0,1,0,0,0,0,2,1,0,1,1,0,0,0,0,0,1,1)

airport=(1,1,0,0,0,0,0,0,1,1,0,2,1,0,0,0,0,0,0,1)

Does not work well (French, Perfetti)


15

A better approach

Two words are similar if they occur in the same paragraphs


16

A better approach

Two words are similar if they occur in the same paragraphs

Two words are similar if they occur in similar paragraphs


17

A better approach

Two words are similar if they occur in the same paragraph

Two words are similar if they occur in similar paragraphs

Two paragraphs are similar if they contain common words


18

A better approach

Two words are similar if they occur in the same paragraph
Two words are similar if they occur in similar paragraphs
Two paragraphs are similar if they contain common words
Two paragraphs are similar if they contain similar words


19

A mutual recursion

Mathematicians know how to solve such circular systems
Suppose P paragraphs and W words
Build M, a W x P matrix

M[i,j] = number of occurrences of word i in paragraph j

Each word is a P-dimensional vector
Reduce the matrix to its best 300 dimensions
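A minimal numpy sketch of this pipeline (assuming a naive whitespace tokenizer; real LSA systems usually apply a log-entropy weighting to the raw counts first, which the slides do not detail):

```python
import numpy as np

def build_count_matrix(paragraphs, vocab):
    """M[i, j] = number of occurrences of word i in paragraph j (W x P)."""
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(paragraphs)))
    for j, paragraph in enumerate(paragraphs):
        for word in paragraph.lower().split():
            if word in index:
                M[index[word], j] += 1
    return M

def word_vectors(M, k=300):
    """Truncated SVD: keep the k largest singular values, so each word
    becomes a k-dimensional vector instead of a P-dimensional one."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * s[:k]
```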


20

Example (Landauer & Dumais 1997)

Human 1 0 0 1 0 0 0 0 0

interface 1 0 1 0 0 0 0 0 0

computer 1 1 0 0 0 0 0 0 0

user 0 1 1 0 1 0 0 0 0

system 0 1 1 2 0 0 0 0 0

response 0 1 0 0 1 0 0 0 0

time 0 1 0 0 1 0 0 0 0

EPS 0 0 1 1 0 0 0 0 0

survey 0 1 0 0 0 0 0 0 1

trees 0 0 0 0 0 1 1 1 0

graph 0 0 0 0 0 0 1 1 1

minors 0 0 0 0 0 0 0 1 1

§1: Human machine interface for ABC computer applications
§2: A survey of user opinion of computer system response time
§3: The EPS user interface management system
§4: System and human system engineering testing of EPS
§5: Relation of user perceived response time to error measurement
§6: The generation of random, binary, ordered trees
§7: The intersection graph of paths in trees
§8: Graph minors IV: Widths of trees and well-quasi-ordering
§9: Graph minors: A survey


21

Matrix reduction

[Figure: singular value decomposition of the W x P word-by-paragraph matrix]

M = U Σ V^T, where U is W x r, Σ is the r x r diagonal matrix of singular values, and V^T is r x P


22

Matrix reduction

M = U Σ V^T

M' = U_k Σ_k V_k^T, keeping only the k largest singular values: M' is the best rank-k approximation of M


23

Example (Landauer & Dumais 1997)

word | raw counts (§1 to §9) | rank-2 reconstruction (§1 to §9)
Human | 1 0 0 1 0 0 0 0 0 | .16 .40 .38 .47 .18 -.05 -.12 -.16 -.09
interface | 1 0 1 0 0 0 0 0 0 | .14 .37 .33 .40 .16 -.03 -.07 -.10 -.04
computer | 1 1 0 0 0 0 0 0 0 | .15 .51 .36 .41 .24 .02 .06 .09 .12
user | 0 1 1 0 1 0 0 0 0 | .26 .84 .61 .70 .39 .03 .08 .12 .19
system | 0 1 1 2 0 0 0 0 0 | .45 1.23 1.05 1.27 .56 -.07 -.15 -.21 -.05
response | 0 1 0 0 1 0 0 0 0 | .16 .58 .38 .42 .28 .06 .13 .19 .22
time | 0 1 0 0 1 0 0 0 0 | .16 .58 .38 .42 .28 .06 .13 .19 .22
EPS | 0 0 1 1 0 0 0 0 0 | .22 .55 .51 .63 .24 -.07 -.14 -.20 -.11
survey | 0 1 0 0 0 0 0 0 1 | .10 .53 .23 .21 .27 .14 .31 .44 .42
trees | 0 0 0 0 0 1 1 1 0 | -.06 .23 -.14 -.27 .14 .24 .55 .77 .66
graph | 0 0 0 0 0 0 1 1 1 | -.06 .34 -.15 -.30 .20 .31 .69 .98 .85
minors | 0 0 0 0 0 0 0 1 1 | -.04 .25 -.10 -.21 .15 .22 .50 .71 .62

r(human, user) = -.38 on the raw counts; r(human, user) = .94 after reduction

§1: Human machine interface for ABC computer applications
§2: A survey of user opinion of computer system response time
§3: The EPS user interface management system
§4: System and human system engineering testing of EPS
§5: Relation of user perceived response time to error measurement
§6: The generation of random, binary, ordered trees
§7: The intersection graph of paths in trees
§8: Graph minors IV: Widths of trees and well-quasi-ordering
§9: Graph minors: A survey
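The whole example can be reproduced with a few lines of numpy (a sketch, using the matrix above and a rank-2 truncation as in Landauer & Dumais):

```python
import numpy as np

words = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
M = np.array([
    [1,0,0,1,0,0,0,0,0], [1,0,1,0,0,0,0,0,0], [1,1,0,0,0,0,0,0,0],
    [0,1,1,0,1,0,0,0,0], [0,1,1,2,0,0,0,0,0], [0,1,0,0,1,0,0,0,0],
    [0,1,0,0,1,0,0,0,0], [0,0,1,1,0,0,0,0,0], [0,1,0,0,0,0,0,0,1],
    [0,0,0,0,0,1,1,1,0], [0,0,0,0,0,0,1,1,1], [0,0,0,0,0,0,0,1,1]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)
M2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]     # rank-2 reconstruction

h, u = words.index("human"), words.index("user")
print(np.corrcoef(M[h], M[u])[0, 1])    # -.38 on the raw counts
print(np.corrcoef(M2[h], M2[u])[0, 1])  # about .94 after reduction, as on the slide
```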


24

Meaning as vectors

The meaning of each word is represented as a vector

disk = (2,4,6,-1,0,3,-4,8,-5,8,12.....9,0,1,-9,4,1)
computer = (1,4,5,-1,0,2,-2,5,-6,8,11.....8,0,0,-5,7,1)

[Figure: words such as computer, disk, work, bike and eat plotted as points in a plane]

Illustrative purpose only: it does not work in 2 dimensions!


25

Comparison with symbolic approaches

[Figure, left: an LSA-style vector space containing bike, ride, wheel, bicycle, fun. Right: a symbolic semantic network: bike SYNONYM bicycle; bicycle HAS wheels, HAS handlebars; FUNCTION ride; IS-A human-propelled vehicle, IS-A vehicle]


26

Comparison with symbolic approaches

Representation is absolute
Good for drawing inferences
Not so good for comparisons
Hard to compound nodes

[Figure: the symbolic semantic network from the previous slide]


27

Comparison with symbolic approaches

Representation is relative
Good for semantic comparisons
Easy to compound words

[Figure: the vector-space representation from the previous slide]


28

Semantic comparison

Semantic similarity = cosine between vectors
A value between -1 and 1
Other measures:

Euclidean distance
Hellinger distance
...
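In code, the cosine comparison is one line of numpy:

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors: 1 = same direction,
    0 = unrelated, -1 = opposite."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
```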


29

Outline

Intuitive introduction
LSA technique
Tests
Cognitive models
Limits
Competing models

30

Word comparisons

A semantic space built from a 4.6 million word encyclopedia (lsa.colorado.edu)

Closest words to bicycle:
pedals: .84
handlebars: .79
bicycles: .79
pedaling: .75
bike: .75

fly – insect: .26
fly – plane: .48
insect – plane: .02


31

Existing semantic spaces

English: encyclopedia, literature, psychology textbooks, scientific textbooks (available at lsa.colorado.edu)

French: newspapers, literature, children's tales, children's productions (some of them available at lsa.colorado.edu)


32

What is the right number of dimensions?

Too few dimensions would not capture the latent semantic structure
Too many dimensions would describe idiosyncratic structures
An open issue
300 dimensions seems a good value for the whole language (Dumais found a maximum of performance at 90 dimensions for a specific domain)


33

From words to texts

Sum up word vectors to get text vectors

disk = (2, 4, 6,-1, 0, 3,-4, 8,.....9, 0,-9, 4, 1)
computer = (1, 4, 5,-1, 0, 2,-2, 5,.....8, 0,-5, 7, 1)
kid = (1, 0, 0,-9, 4, 5,-1, 2,.....0, 2, 3, 4, 5)
broken = (5,-3, 8, 6, 5,-2, 2, 0,.....6, 1, 7, 1, 2)

S = The computer disk was broken by the kid

S= (9, 5,19,-5, 9, 8,-5, 15....23, 3,-4,16, 9)
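A sketch of this composition rule (word_vectors is an assumed dict mapping each word to its LSA vector; unknown words are simply skipped):

```python
import numpy as np

def text_vector(text, word_vectors):
    """Bag-of-words composition: the text vector is the plain sum of
    the vectors of its known words, as on the slide."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.sum(vecs, axis=0)
```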


34

Text comparisons

A semantic space built from a 4.6 million word encyclopedia (lsa.colorado.edu)

The cat was lost in a forest / My little feline disappeared in the trees: .66
The radius of spheres / a circle's diameter: .55 (Landauer)
The radius of spheres / the music of spheres: .01 (Landauer)


35

No syntax processing

Paragraphs are bags of words
Semantics is determined from the lexical level

[Diagram: lexical level, syntax level, semantic level; LSA goes directly from the lexical level to the semantic level]


36

No syntax processing

We learned that this is about birds, nests and a building process
The analysis unit is the paragraph

Birds build nests

Nests build birds


37

Outline

Intuitive introduction
LSA technique
Tests
Cognitive models
Limits
Competing models

38

Models based on LSA

Semantic representation
Vocabulary acquisition
Text comprehension
Free text assessment


39

Semantic representation

Comparison of words
TOEFL test
Vocabulary test for children

Comparison of sentences
Measure of coherence

Comparison of texts


40

Comparison of words

Synonymy part of the TOEFL test (Landauer & Dumais, 97)

80 items: 1 word / 4 alternative words
Guess which one is the closest
Example: levied

imposed

believed

requested

correlated

LSA was trained on a 4.6 million word corpus
LSA score: 64.4%
Foreign applicants to US colleges: 64.5%
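A sketch of how such an item can be scored with a semantic space (wv is an assumed word-to-vector dict; the model simply answers with the alternative closest to the stem):

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def answer_item(stem, alternatives, wv):
    """Return the alternative whose vector is closest to the stem's."""
    return max(alternatives, key=lambda w: cosine(wv[stem], wv[w]))

# answer_item("levied", ["imposed", "believed", "requested", "correlated"], wv)
```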


41

Vocabulary test: LSA vs. children

(Denhière & Lemaire, 2004)

115 items

4 definitions: correct, close, distant, unrelated

Subjects were in grades 2 to 4

Example (translated from French):

slope: rising road / tilted surface which goes up or down / small piece of ground / face of a rock or a mountain


42

A children's semantic space

Stories and tales for children: ~1,600,000 words
Children's productions: ~800,000 words
Textbooks: ~400,000 words
Encyclopedia: ~400,000 words
Dictionary: ~50,000 words

TOTAL: 3.2 million words


43

Vocabulary test: results

[Bar chart, children data: percentage of correct, close, far and unrelated definitions chosen by 2nd, 3rd and 4th graders; y-axis 0 to 70]


44

Vocabulary test: results

[Bar chart: the same children data with LSA added as a fourth series]

45

99 most frequent words

[Bar chart: children and model data, comparing the full 115-word test with the 99 most frequent words]

46

Vocabulary test: word lemmatization

[Bar chart: children and model data for 115 words, 99 words, and with word lemmatization]


47

Vocabulary test: verb lemmatization

[Bar chart: children and model data for 115 words, 99 words, word lemmatization, and verb lemmatization]

48

Comparison of sentences

● Use LSA to measure the coherence of a text

● Compute the semantic similarities between adjacent sentences

coherence(T) = (1 / (n-1)) × Σ (i = 1 .. n-1) cosine(sentence_i, sentence_i+1)
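A direct translation of this formula, assuming the sentences have already been mapped to LSA vectors:

```python
import numpy as np

def coherence(sentence_vectors):
    """Average cosine between adjacent sentence vectors, as in the formula above."""
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.mean([cos(sentence_vectors[i], sentence_vectors[i + 1])
                    for i in range(len(sentence_vectors) - 1)])
```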


49

Foltz et al. (1998) experiment

McNamara et al. (1996) data
4 versions of a text about the heart

Low local coherence, low macro coherence (lm)
Low local coherence, high macro coherence (lM)
High local coherence, low macro coherence (Lm)
High local coherence, high macro coherence (LM)

Corpus: 21 texts, 100 dimensions


50

Foltz et al. (1998) experiment

Results:
Coherence(lm) = 0.177
Coherence(lM) = 0.205
Coherence(Lm) = 0.210
Coherence(LM) = 0.242

[Scatter plot: comprehension (0.2 to 0.4) against coherence (.14 to .22); LSA r² = 0.853, word overlap r² = 0.098]


51

Can LSA assess text recall?

Measure M1: cosine between the source text and the recalled texts

Measure M2: number of propositions recalled

Text | Recall | Subjects | Correlation M1/M2
Poule | Immediate | n = 48 (8.3 years old) | 0.49
Dragon | Immediate | n = 48 (8.3 years old) | 0.64
Dragon | Delayed | n = 48 (8.3 years old) | 0.70
Araignée | Immediate | n = 57 (16-20 years old) | 0.81
Clown | Immediate | n = 57 (16-20 years old) | 0.65
Géant | Immediate | n = 130 (16-20 years old) | 0.60
Ourson | Immediate | n = 44 (16-20 years old) | 0.48
Clown | Summary | n = 44 (16-20 years old) | 0.86


52

Comparison of texts (Foltz et al., 1996)

● Compare texts and their summaries
● Britt et al. (1994) data
● 24 participants read 21 texts
● Participants write an essay
● Corpus: 21 texts + 8 encyclopedia texts + 2 book excerpts

● Assessing the essays:
● Score 1 = average similarity between each essay sentence and the closest text sentence
● Score 2 = same, but comparing with the 10 main text sentences
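A sketch of Score 1, assuming the essay and the source text are already split into sentence vectors:

```python
import numpy as np

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def essay_score(essay_sents, text_sents):
    """Score 1: average, over essay sentences, of the similarity to the
    closest source-text sentence.  For Score 2, pass only the 10 main
    text sentences as text_sents."""
    return np.mean([max(cos(e, t) for t in text_sents) for e in essay_sents])
```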


53

Comparison of texts (Foltz et al., 1996)

       | Judge1 | Judge2 | Judge3 | Judge4 | Score1 | Score2
Judge1 |        | .575** | .768** | .381   | .418*  | .589*
Judge2 |        |        | .582** | .412*  | .552** | .626**
Judge3 |        |        |        | .367   | .317   | .384+
Judge4 |        |        |        |        | .117   | .240

** p<.01 * p<.05 + p<.06


54

Models based on LSA

Semantic representation
Vocabulary acquisition
Text comprehension
Free text assessment


55

A model of learning

LSA learns the meaning of words from texts
Does it mimic the human rate of learning?
Landauer & Dumais (1997) experiment:

Most of the lexicon is acquired from text
Children data:

3,500 words read / day
50 new words encountered / day
10 words learned / day
Controlled experiments lead to 2.5 words learned / day


56

Direct and indirect effects

How do children acquire the remaining 7.5 words/day?
The poverty of the stimulus problem:

how do children acquire so much knowledge on the basis of so little information?

Hypothesis: a word is learned…
from texts containing it (direct effect)
from texts not containing it (indirect effect)


57

LSA simulation

The TOEFL test score is a function of:
T: total number of paragraphs
S: number of paragraphs containing the stem word

score = 0.128 × log(0.076 T) × log(31.910 S)

Correlation with the observed score = .98
After “reading” 25,000 paragraphs, what is the effect of the 25,001st?

When the stem word belongs to it?
When the stem word does not belong to it?


58

Example

Suppose a word that appeared twice in the 25,000 paragraphs
Score = 0.128 × log(0.076 × 25000) × log(31.91 × 2) = .75809

Direct effect: the word belongs to the 25,001st paragraph
Score = 0.128 × log(0.076 × 25001) × log(31.91 × 3) = .83202

Indirect effect: the word does not belong to the 25,001st paragraph
Score = 0.128 × log(0.076 × 25001) × log(31.91 × 2) = .75810
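The slides do not state the base of the logarithms; base-10 logs reproduce the printed values to within rounding, so this sketch assumes base 10:

```python
from math import log10

def toefl_score(T, S):
    """Fitted score: T = paragraphs read, S = paragraphs containing the word."""
    return 0.128 * log10(0.076 * T) * log10(31.910 * S)

print(toefl_score(25000, 2))  # ~.758
print(toefl_score(25001, 3))  # ~.831  direct effect: a large gain
print(toefl_score(25001, 2))  # ~.758  indirect effect: a tiny gain
```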


59

LSA simulation

Effects are cumulated over all word-frequency bands
Direct effect:

.0007 words gained per word encountered
.0007 × 3,500 = 2.5 words acquired / day

Indirect effect:
.15 words gained per paragraph encountered
.15 × 50 = 7.5 words acquired / day

The meaning of a word would be acquired:
for 25% by reading texts containing it
for 75% by reading texts not containing it


60

Direct and indirect effects (Lemaire et al., to appear)


[Plot: cosine between acheter (to buy) and magasin (store) as a function of the number of paragraphs read (2,000 to 13,520); y-axis from -0.1 to 0.6]

61

Cosine acheter/magasin

Total gain: .58

Gain due to the occurrence of acheter: -.10

Gain due to the occurrence of magasin: -.19

Gain due to the co-occurrence: .73

Gain due to 2nd order co-occurrences: .11

Gain due to 3rd and more co-occurrences: .03


62

Same measures on 28 words

Total gain: .13

Gain due to the occurrence of W1: -.16

Gain due to the occurrence of W2: -.19

Gain due to the co-occurrence: .34

Gain due to 2nd order co-occurrences: .05

Gain due to 3rd and more co-occurrences: .09


63

Models based on LSA

Semantic representation
Vocabulary acquisition
Text comprehension
Free text assessment


64

Text comprehension

Cognitive models can be simulated on computers, using LSA as a model of semantic memory:

the construction-integration model (Kintsch, 1998)
A predication algorithm
Metaphor comprehension

65

A computational model of text comprehension (Lemaire et al., to appear)


[Diagram: the next sentence enters WORKING MEMORY; associates are selected from SEMANTIC MEMORY (LSA); previous elements are retrieved from EPISODIC MEMORY; integration and decay operate on working memory]

66

Example (translated from French)

A woodcutter was walking in the forest

Select associates:
walk: stroll, meet, pick
woodcutter: ax, forest, cottage
forest: glade, oak, wood

67

Example (translated from French)

A woodcutter was walking in the forest

walk/woodcutter/forest

[Diagram: a network linking walk, woodcutter and forest to their associates stroll, meet, pick, ax, cottage, glade, oak and wood]


71

The construction-integration model

Before LSA, the construction step was performed by hand
Link weights were set approximately
All nodes being vectors, the weight of a link is the cosine of the corresponding nodes
A question remains: what is the vector of P(A1,…,An)?


72

The predication algorithm (Kintsch, 2001)

The centroid rule: P(A) → P + A

…does not work well
The meaning of the predicate depends on the meaning of the arguments
The horse ran / The color ran
shark(lawyer) ≠ shark + lawyer

A better rule: P(A) → P + A + M1 + … + Mn

Mi are vectors that are close to both P and A
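A simplified sketch of this rule (Kintsch's full algorithm runs a spreading-activation network; here the m nearest neighbors of P are simply filtered by their similarity to A, with m and k as illustrative parameters):

```python
import numpy as np

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def predication(P, A, wv, m=20, k=5):
    """P(A) ~ P + A + the k neighbors of P most related to A."""
    neighbors = sorted((w for w in wv if w not in (P, A)),
                       key=lambda w: cos(wv[w], wv[P]), reverse=True)[:m]
    relevant = sorted(neighbors, key=lambda w: cos(wv[w], wv[A]),
                      reverse=True)[:k]
    return wv[P] + wv[A] + sum(wv[w] for w in relevant)
```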


73

Example 1: construction step (Kintsch, 2001)

The horse ran → ran(horse)

[Diagram: a network linking horse and ran to ran's neighbors stopped, down and hopped; link weights .21, .18, .12]


74

Example 1: integration step

The horse ran → ran(horse)

[Diagram: the same network after integration; link weights .46, .19, .00]

The horse ran → horse + ran + stopped


75

Example 2: construction step

The color ran → ran(color)

[Diagram: a network linking color and ran to ran's neighbors stopped, down and hopped; link weights .06, .11, .02]


76

Example 2: integration step

The color ran → ran(color)

[Diagram: the same network after integration; link weights .00, .70, .00]

The color ran → color + ran + down


77

Predication algorithm: test (Kintsch, 2001)

Landmark | ran | horse ran | color ran
gallop | .33 | .75 | .29
dissolve | .01 | .01 | .11


78

Inferences (Kintsch, 2001)

The hunter shot the elk, compared with:
The elk was dead: Predication .70, Centroid .54
The hunter was dead: Predication .73, Centroid .66

The student dropped the glass, compared with:
The glass was broken: Predication .91, Centroid .66
The student was broken: Predication .87, Centroid .76

The student washed the table, compared with:
The table was clean: Predication .83, Centroid .71
The student was clean: Predication .62, Centroid .70


79

Metaphors

My lawyer is a shark → shark(lawyer)

[Figure: lawyer and shark as vectors, with "My lawyer is a shark" placed between them]

The meaning of "my lawyer is a shark" is not a simple combination of the meaning of lawyer and the meaning of shark

80

Metaphors

Look for words that are similar to shark but also close to lawyer

[Figure: neighbors of shark: fin, teeth, sea, sharks, vicious, fish, aggressive; vicious and aggressive are also close to lawyer]


81

Metaphors

[Figure: the vector for "My lawyer is a shark" is pulled toward vicious and aggressive, between lawyer and shark]


82

Metaphors

Many more neighbors of the predicate need to be considered
According to Kintsch, predication fails if the number of neighbors is less than 100!


83

An incremental algorithm (Lemaire & Bianco, 2003)

Does not work on a predefined set of neighbors but looks for them progressively

This kid is intelligent: 1... 2... 3 clever 4... 5 brilliant 6... 7... 8... 9 sage

This kid is a scientist: 1... 2... 3... ... 97 invention ... 128 knowledge ... 138 wonderful


84

Experiment

Does the model account for the processing time of various kinds of metaphors?
12 texts, 4 conditions:

Synonym (child / kid)
Conventional metaphor (child / scientist)
Original metaphor (child / sage)
Bad metaphor (child / colonel)


85

Subjects (N=49)

[Bar chart: additional reading time (ms, -200 to 350) by type of metaphor: synonym, conventional metaphor, unconventional metaphor, bad metaphor]


86

Subjects and model

Correlation: .62

[Two bar charts: the simulation (number of neighbors, 0 to 300, by type of metaphor) next to the subjects' additional reading times (ms, -200 to 350)]


87

Models based on LSA

Semantic representation
Vocabulary acquisition
Text comprehension
Free text assessment


88

Free text assessment

Assessing the content of a student essay
Rely on a domain-specific corpus
Compare the student essay with a reference text:

Pre-graded essays
Text to be summarized
Portions of the course text

Trusso Haley et al. (2005) published a good state-of-the-art report (Tech. Rep. Open University)


89

Intelligent Essay Assessor

Foltz et al. (1999)
Holistic score:

Compare the student essay with pre-graded essays
Give the grade of the closest one

Gold standard score:
Compare the student essay with a teacher essay
Give the grade according to the semantic similarity
Correlation with teachers' grades around .8


90

Summary Street

Wade-Stein & E. Kintsch (2003)
Helps the student summarize a text
Indicates the parts of the text that are well or poorly summarized
An experiment with 60 6th-grade students has shown a positive effect


91

Summary Street

92

Apex

Dessus & Lemaire (1999, 2001, 2002)
Compare the student essay with relevant portions of the course
The course is structured:

#T The first chapter

#N In the first part, we will study the way...

#N Now, I will present a method...

...

#T The second chapter

#N ...

93

94

Apex

Correlation with teacher's grades: .62
Not only for grading but for engaging students in a revising process
Can be used at a distance


95

Outline

Intuitive introduction
LSA technique
Tests
Cognitive models
Limits
Competing models

96

Strengths and limits

LSA strengths:
fully automatic
vectorial representation allows:

easy representation of sequences of words
easy comparisons between words or texts

a cognitive model of learning and representation

LSA weaknesses:
no syntax processing
representations are not explicit
not incremental
paragraphs are bags of words


97

Outline

Intuitive introduction
LSA technique
Tests
Cognitive models
Limits
Competing models

98

Competing models

Cognitive models of semantic memory could mimic:

the human semantic associations, or
the human construction of these associations over a long period of time

Semantic networks are a good model of semantic associations BUT... they do not account for the construction of these associations


99

Features

Input (corpus, word association norms)
Knowledge representation (vector-based, network-based)
Addition of a new text (incrementally or not)
Unit of context (paragraph, sliding window)
Use of high-order co-occurrences (yes/no)
Compositionality (representing the meaning of a text from the meaning of its words)


100

LSA

Input: corpus
Knowledge representation: vector-based
Addition of a new text: not incremental
Unit of context: paragraph
Use of high-order co-occurrences: yes
Compositionality: easy


101

HAL (Burgess, 1998)

Word co-occurrence matrix built from an N-word sliding window


vector = concatenation of row and column
barn = <0,2,4,3,6,0,5,0,0,0>
Measure of similarity = Euclidean distance
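A sketch of the HAL construction (the decreasing weight window - d + 1 for a word d positions back follows Burgess's description, though exact weighting details vary across papers):

```python
import numpy as np

def hal_matrix(tokens, vocab, window=10):
    """C[i, j] = weighted count of word j occurring before word i,
    where a word d positions back contributes (window - d + 1)."""
    index = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for pos, word in enumerate(tokens):
        if word not in index:
            continue
        for d in range(1, min(window, pos) + 1):
            prev = tokens[pos - d]
            if prev in index:
                C[index[word], index[prev]] += window - d + 1
    return C

def hal_vector(C, i):
    """A word's vector is its row concatenated with its column."""
    return np.concatenate([C[i, :], C[:, i]])
```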


104

HAL (Burgess, 1998)

Input: corpus
Knowledge representation: vector-based
Addition of a new text: incremental
Unit of context: sliding window
Use of high-order co-occurrences: no
Compositionality: easy


105

Word association space (Steyvers et al., in press)

Based on association norms for 5,000 words
Word/word matrix scaled to 200-500 dimensions


106

107

Word association space (Steyvers et al., in press)

Input: association norms
Knowledge representation: vector-based
Addition of a new text: not incremental
Unit of context: N/A
Use of high-order co-occurrences: no
Compositionality: easy


108

PMI-IR (Turney, 2001)

Based on Mutual Information (Church & Hanks, 89)

compares the probability of two words occurring together with the probability of the words occurring separately

MI(A,B) = log2( p(A,B) / (p(A) · p(B)) )

TOEFL test: 1 problem word / 4 choices
score1(choice) = hits(problem AND choice) / hits(choice)
score2(choice) = hits(problem NEAR choice) / hits(choice)

...
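In code, the measure and the first scoring rule look like this (the hit counts would come from search-engine queries, which are not shown):

```python
from math import log2

def mutual_information(p_ab, p_a, p_b):
    """MI(A, B) = log2( p(A,B) / (p(A) * p(B)) )."""
    return log2(p_ab / (p_a * p_b))

def score1(hits_problem_and_choice, hits_choice):
    """Turney's first score: co-occurrence hits normalized by the
    choice word's own hit count."""
    return hits_problem_and_choice / hits_choice
```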



110

PMI-IR (Turney, 2001)

Input: web
Knowledge representation: N/A
Addition of a new text: incremental
Unit of context: web page
Use of high-order co-occurrences: no
Compositionality: hard


111

ICAN (Lemaire & Denhière, 2004)

Network-based representation:
better for quickly retrieving neighbors
better for representing asymmetrical relations

[Figure: a weighted network linking cable, electric, connector, network, rope, wire and plug]

112

ICAN

... if you have such a [device, connect the cable to the network connector] then switch ...

Create or reinforce co-occurrence links

[Figure: before/after view of the network: device and connect are added, and links among the window words are created or reinforced]

113

ICAN

... if you have such a [device, connect the cable to the network connector] then switch ...

Create or reinforce co-occurrence links

Create or slightly reinforce 2nd-order co-occurrence links

[Figure: before/after view of the network, with 2nd-order links also created or slightly reinforced]

114

ICAN

... if you have such a [device, connect the cable to the network connector] then switch ...

Create or reinforce co-occurrence links

Create or slightly reinforce 2nd-order co-occurrence links

Slightly decrease other links (a sketch of all three rules follows below)

[Figure: before/after view of the network; links not involved in the window slightly decay]
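A sketch of one update pass under stated assumptions (the rates r1, r2 and decay are illustrative, not the published ICAN parameters; G is assumed to be a dict-of-dicts of link weights):

```python
def ican_update(G, window, r1=0.10, r2=0.02, decay=0.005):
    """One pass over the words of a sliding window, applying the three
    rules above.  G maps word -> {neighbor: weight in [0, 1]}."""
    words = set(window)
    touched = set()
    for a in words:                     # rule 1: create/reinforce direct links
        Ga = G.setdefault(a, {})
        for b in words - {a}:
            Ga[b] = min(1.0, Ga.get(b, 0.0) + r1)
            touched.add((a, b))
    for a in words:                     # rule 2: slightly reinforce 2nd-order links
        Ga = G.setdefault(a, {})
        for b in words - {a}:
            for c in set(G.get(b, {})) - words:
                Ga[c] = min(1.0, Ga.get(c, 0.0) + r2)
                touched.add((a, c))
    for a in G:                         # rule 3: every other link slightly decays
        for b in G[a]:
            if (a, b) not in touched:
                G[a][b] = max(0.0, G[a][b] - decay)
```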

115

ICAN

3.2 million word corpus of texts for children
Test on 1,200 pairs of words (from association norms)
Average weights:

stem / 1st associate: .415
stem / 2nd associate: .269
stem / 3rd associate: .236
stem / last associates: .098

Correlation with human data: .50


116

ICAN

Input: corpus
Knowledge representation: network-based
Addition of a new text: incremental
Unit of context: sliding window
Use of high-order co-occurrences: yes
Compositionality: hard


117

Web pages

Tests: lsa.colorado.edu

References: www-leibniz.imag.fr/~blemaire/lsa.html