Latent Semantic Analysis - WordPress.com · Latent Semantic Analysis a tutorial at CogSci'2005...
1
Latent Semantic Analysis
a tutorial at CogSci'2005
Benoît Lemaire
Laboratoire Leibniz-IMAG (CNRS UMR 5522), University of Grenoble
2
Outline
- Intuitive introduction: representing the meaning of words from the analysis of huge corpora
- LSA technique
- Tests
- Cognitive models
- Limits
- Competing models
3
Intuitive introduction
Corpus
[Word cloud derived from the corpus: plane, airport, fly, bee, pilot, truck, car, drive, wings, bird, train, travel, parks, restaurant, eat, apple, banana, vegetables, apples, fruit, tree, garden, flower, flowers, grow, bouquet, petals, flying, flew, nectar, honey]
4
From words to texts
The pilot parks the plane
8
Usages of LSA
[Diagram: Corpus → semantic space (word cloud as above)]
9
LSA as a tool
- Compare texts
- Control experiment material
[Diagram: Corpus → semantic space (word cloud as above)]
10
LSA as a cognitive model of semantic memory
Build lots of cognitive models:
- Text comprehension
- Interface navigation
- ...
[Diagram: Corpus → semantic space (word cloud as above)]
11
LSA as a cognitive model of word acquisition
[Diagram: Corpus → semantic space (word cloud as above)]
Humans also construct the meaning of words from written material. Does LSA account for this process?
12
Outline
- Intuitive introduction
- LSA technique
- Tests
- Cognitive models
- Limits
- Competing models
13
Semantic information is drawn from raw texts
- Take a huge corpus
- Split it into paragraphs
- Determine the statistical context in which each word occurs
14
A first approach based on co-occurrence
Define the meaning of a word from the co-occurring words:
Two words are similar if they occur in the same paragraphs
plane= (1,0,0,1,0,0,0,0,2,1,0,1,1,0,0,0,0,0,1,1)
airport=(1,1,0,0,0,0,0,0,1,1,0,2,1,0,0,0,0,0,0,1)
Does not work well (French, Perfetti)
18
A better approach
Two words are similar if they occur in the same paragraph
Two words are similar if they occur in similar paragraphs
Two paragraphs are similar if they contain common words
Two paragraphs are similar if they contain similar words
19
A mutual recursion
Mathematicians know how to solve such circular systems
- Suppose P paragraphs and W words
- Build M, a P x W matrix
- M[i,j] = number of occurrences of word j in paragraph i
- Each word is thus a P-dimensional vector
- Reduce the matrix to its best 300 dimensions
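As a sketch (not the tutorial's own code), the matrix construction and the rank reduction can be written with numpy's SVD; the whitespace tokenizer and the two toy paragraphs are illustrative assumptions:

```python
import numpy as np

def build_count_matrix(paragraphs):
    # M[i, j] = number of occurrences of word j in paragraph i (a P x W matrix)
    vocab = sorted({w for p in paragraphs for w in p.split()})
    index = {w: j for j, w in enumerate(vocab)}
    M = np.zeros((len(paragraphs), len(vocab)))
    for i, p in enumerate(paragraphs):
        for w in p.split():
            M[i, index[w]] += 1
    return M, vocab

def reduce_rank(M, k):
    # Truncated SVD: keep only the k largest singular values
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

paragraphs = ["the pilot parks the plane", "the plane left the airport"]
M, vocab = build_count_matrix(paragraphs)
M_reduced = reduce_rank(M, 1)
```

On a real corpus the same pipeline runs with hundreds of thousands of paragraphs and k around 300.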
20
Example (Landauer & Dumais 1997)
           §1 §2 §3 §4 §5 §6 §7 §8 §9
Human       1  0  0  1  0  0  0  0  0
interface   1  0  1  0  0  0  0  0  0
computer    1  1  0  0  0  0  0  0  0
user        0  1  1  0  1  0  0  0  0
system      0  1  1  2  0  0  0  0  0
response    0  1  0  0  1  0  0  0  0
time        0  1  0  0  1  0  0  0  0
EPS         0  0  1  1  0  0  0  0  0
survey      0  1  0  0  0  0  0  0  1
trees       0  0  0  0  0  1  1  1  0
graph       0  0  0  0  0  0  1  1  1
minors      0  0  0  0  0  0  0  1  1
§1: Human machine interface for ABC computer applications
§2: A survey of user opinion of computer system response time
§3: The EPS user interface management system
§4: System and human system engineering testing of EPS
§5: Relation of user perceived response time to error measurement
§6: The generation of random, binary, ordered trees
§7: The intersection graph of paths in trees
§8: Graph minors IV: Widths of trees and well-quasi-ordering
§9: Graph minors: A survey
21
Matrix reduction
M = U × Σ × V^T   (singular value decomposition of the paragraphs × words matrix M)
22
Matrix reduction
M = U × Σ × V^T
M' = U_k × Σ_k × V_k^T   (keep only the k largest singular values)
23
Example (Landauer & Dumais 1997)
Original counts (§1–§9) and their rank-2 reconstruction:
Human      1 0 0 1 0 0 0 0 0  |  .16  .40  .38  .47 .18 -.05 -.12 -.16 -.09
interface  1 0 1 0 0 0 0 0 0  |  .14  .37  .33  .40 .16 -.03 -.07 -.10 -.04
computer   1 1 0 0 0 0 0 0 0  |  .15  .51  .36  .41 .24  .02  .06  .09  .12
user       0 1 1 0 1 0 0 0 0  |  .26  .84  .61  .70 .39  .03  .08  .12  .19
system     0 1 1 2 0 0 0 0 0  |  .45 1.23 1.05 1.27 .56 -.07 -.15 -.21 -.05
response   0 1 0 0 1 0 0 0 0  |  .16  .58  .38  .42 .28  .06  .13  .19  .22
time       0 1 0 0 1 0 0 0 0  |  .16  .58  .38  .42 .28  .06  .13  .19  .22
EPS        0 0 1 1 0 0 0 0 0  |  .22  .55  .51  .63 .24 -.07 -.14 -.20 -.11
survey     0 1 0 0 0 0 0 0 1  |  .10  .53  .23  .21 .27  .14  .31  .44  .42
trees      0 0 0 0 0 1 1 1 0  | -.06  .23 -.14 -.27 .14  .24  .55  .77  .66
graph      0 0 0 0 0 0 1 1 1  | -.06  .34 -.15 -.30 .20  .31  .69  .98  .85
minors     0 0 0 0 0 0 0 1 1  | -.04  .25 -.10 -.21 .15  .22  .50  .71  .62
r(human,user) = -.38 before reduction; r(human,user) = .94 after
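The change in r(human, user) can be checked numerically; a sketch with numpy, with the rows in the slide's order:

```python
import numpy as np

# Term-by-paragraph counts for the 9 titles (rows: human, interface, computer,
# user, system, response, time, EPS, survey, trees, graph, minors)
M = np.array([
    [1,0,0,1,0,0,0,0,0],
    [1,0,1,0,0,0,0,0,0],
    [1,1,0,0,0,0,0,0,0],
    [0,1,1,0,1,0,0,0,0],
    [0,1,1,2,0,0,0,0,0],
    [0,1,0,0,1,0,0,0,0],
    [0,1,0,0,1,0,0,0,0],
    [0,0,1,1,0,0,0,0,0],
    [0,1,0,0,0,0,0,0,1],
    [0,0,0,0,0,1,1,1,0],
    [0,0,0,0,0,0,1,1,1],
    [0,0,0,0,0,0,0,1,1],
], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
M2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]   # rank-2 reconstruction

human, user = 0, 3
r_before = np.corrcoef(M[human], M[user])[0, 1]   # negative: they never co-occur
r_after = np.corrcoef(M2[human], M2[user])[0, 1]  # strongly positive after reduction
```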
24
Meaning as vectors
The meaning of each word is represented as a vector
[2-D plot for illustration: words such as computer, disk, work, bike and eat as points]
disk = (2,4,6,-1,0,3,-4,8,-5,8,12, ..., 9,0,1,-9,4,1)
computer = (1,4,5,-1,0,2,-2,5,-6,8,11, ..., 8,0,0,-5,7,1)
Illustrative purpose only: does not work in 2 dimensions!
25
Comparison with symbolic approaches
[Two views of bike: its vector-space neighborhood (ride, wheel, bicycle, fun, handlebars) and a semantic network: bike SYNONYM bicycle, FUNCTION ride, HAS wheels, HAS handlebars, IS-A human-propelled vehicle IS-A vehicle]
26
Comparison with symbolic approaches
Symbolic representation:
- Representation is absolute
- Good for drawing inferences
- Not so good for comparisons
- Hard to compound nodes
[Semantic network diagram as above]
27
Comparison with symbolic approaches
Vector representation:
- Representation is relative
- Good for semantic comparisons
- Easy to compound words
[Vector-space neighborhood diagram as above]
28
Semantic comparison
Semantic distance = cosine between vectors, a value between −1 and 1
Other measures:
- Euclidean distance
- Hellinger distance
- ...
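A minimal pure-Python sketch of the two most common measures:

```python
import math

def cosine(u, v):
    # Cosine similarity: 1 = same direction, 0 = orthogonal, -1 = opposite
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    # Euclidean distance between the two vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Cosine ignores vector length, which is why it is preferred for comparing texts of different sizes.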
29
Outline
- Intuitive introduction
- LSA technique
- Tests
- Cognitive models
- Limits
- Competing models
30
Word comparisons
A semantic space built from a 4.6 million word encyclopedia (lsa.colorado.edu)
Closest words to bicycle: pedals .84, handlebars .79, bicycles .79, pedaling .75, bike .75
fly – insect: .26
fly – plane: .48
insect – plane: .02
31
Existing semantic spaces
English: encyclopedia, literature, psychology textbooks, scientific textbooks (available at lsa.colorado.edu)
French: newspapers, literature, children's tales, children's productions (some of them available at lsa.colorado.edu)
32
What is the right number of dimensions?
- Too few dimensions would not capture the latent semantic structure
- Too many dimensions would describe idiosyncratic structures
- An open issue
- 300 dimensions seems a good value for the whole language (Dumais found a performance maximum at 90 dimensions for a specific domain)
33
From words to texts
Sum up word vectors to get text vectors
disk= (2, 4, 6,-1, 0, 3,-4, 8,.....9, 0,-9, 4, 1)
computer=(1 4, 5,-1, 0, 2,-2, 5,.....8, 0,-5, 7, 1)
kid= (1, 0, 0,-9, 4, 5,-1, 2,.....0, 2, 3, 4, 5)
broken= (5,-3, 8, 6, 5,-2, 2, 0,.....6, 1, 7, 1, 2)
S = The computer disk was broken by the kid
S= (9, 5,19,-5, 9, 8,-5, 15....23, 3,-4,16, 9)
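This bag-of-words composition can be sketched in a few lines; the toy 3-dimensional vectors below reuse the first components of the slide's vectors:

```python
def text_vector(words, space):
    # Sum the vectors of all known words; word order is ignored (bag of words)
    dims = len(next(iter(space.values())))
    total = [0.0] * dims
    for w in words:
        if w in space:
            total = [t + x for t, x in zip(total, space[w])]
    return total

space = {  # first three components of the slide's vectors
    "disk": [2, 4, 6],
    "computer": [1, 4, 5],
    "kid": [1, 0, 0],
    "broken": [5, -3, 8],
}
s = text_vector("the computer disk was broken by the kid".split(), space)
# s = [9.0, 5.0, 19.0], matching the first three components of S above
```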
34
Text comparisons
A semantic space built from a 4.6 million word encyclopedia (lsa.colorado.edu)
The cat was lost in a forest / My little feline disappeared in the trees: .66
The radius of spheres / a circle's diameter: .55 (Landauer)
The radius of spheres / the music of spheres: .01(Landauer)
35
No syntax processing
Paragraphs are bags of words. Semantics is determined from the lexical level.
[Diagram: lexical level → semantic level, bypassing the syntax level]
36
No syntax processing
We learned that this is about birds, nests and a building process.
The analysis unit is the paragraph:
Birds build nests
Nests build birds
37
Outline
- Intuitive introduction
- LSA technique
- Tests
- Cognitive models
- Limits
- Competing models
38
Models based on LSA
- Semantic representation
- Vocabulary acquisition
- Text comprehension
- Free text assessment
39
Semantic representation
- Comparison of words: TOEFL test, vocabulary test for children
- Comparison of sentences: measure of coherence
- Comparison of texts
40
Comparison of words
Synonymy part of the TOEFL test (Landauer & Dumais, 97)
80 items: 1 word / 4 alternative words
Guess which one is the closest
Example: levied
- imposed
- believed
- requested
- correlated
LSA was trained on a 4.6 million word corpus
LSA score: 64.4%
Foreign applicants to US colleges: 64.5%
41
Vocabulary test: LSA vs. children
(Denhière & Lemaire, 2004)
115 items
4 definitions : correct, close, distant, unrelated
Subjects were in grades 2 to 4
Example (translated from French):
slope:
- rising road
- tilted surface which goes up or down
- small piece of ground
- face of a rock or a mountain
42
A children's semantic space
- Stories and tales for children: ~1,600,000 words
- Children's productions: ~800,000 words
- Textbooks: ~400,000 words
- Encyclopedia: ~400,000 words
- Dictionary: ~50,000 words
TOTAL: 3.2 million words
43
Vocabulary test: results
[Bar chart: percentage of each answer type (correct, close, far, unrelated) chosen by children in grades 2, 3 and 4; y-axis 0–70]
44
Vocabulary test: results
[Bar chart as above, with the LSA model's scores added alongside the children's data]
45
99 most frequent words
[Bar chart: children vs. model, comparing the full 115-word test with the 99 most frequent words]
46
Vocabulary test: word lemmatization
[Bar chart: children vs. model; 115 words, 99 words, and word lemmatization]
47
Vocabulary test: verb lemmatization
[Bar chart: children vs. model; 115 words, 99 words, word lemmatization, verb lemmatization]
48
Comparison of sentences
● Use LSA to measure the coherence of a text
● Compute the semantic similarities between adjacent sentences
coherence(T) = (1 / (n−1)) × Σ_{i=1..n−1} cosine(sentence_i, sentence_{i+1})
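The formula is just the mean cosine of adjacent sentence pairs; a minimal Python sketch, with the sentence vectors assumed to be already computed (e.g. by summing word vectors):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def coherence(sentence_vectors):
    # Mean cosine between each pair of adjacent sentence vectors
    pairs = list(zip(sentence_vectors, sentence_vectors[1:]))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```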
49
Foltz et al. (1998) experiment
McNamara et al. (1996) data: 4 versions of a text about the heart
- Low local coherence, low macro coherence (lm)
- Low local coherence, high macro coherence (lM)
- High local coherence, low macro coherence (Lm)
- High local coherence, high macro coherence (LM)
Corpus: 21 texts, 100 dimensions
50
Foltz et al. (1998) experiment
Results:
- Coherence(lm) = 0.177
- Coherence(lM) = 0.205
- Coherence(Lm) = 0.210
- Coherence(LM) = 0.242
[Scatter plot of comprehension vs. coherence: LSA coherence predicts comprehension (r² = 0.853) far better than word overlap (r² = 0.098)]
51
Can LSA assess text recall ?
Measure M1 : cosine between source text and texts recalled
Measure M2 : number of propositions recalled
Text      Recall     Subjects                    Correlation M1/M2
Poule     Immediate  n = 48 (8.3 years old)      .49
Dragon    Immediate  n = 48 (8.3 years old)      .64
Dragon    Delayed    n = 48 (8.3 years old)      .70
Araignée  Immediate  n = 57 (16-20 years old)    .81
Clown     Immediate  n = 57 (16-20 years old)    .65
Géant     Immediate  n = 130 (16-20 years old)   .60
Ourson    Immediate  n = 44 (16-20 years old)    .48
Clown     Summary    n = 44 (16-20 years old)    .86
52
Comparison of textsFoltz et al. (1996)
● Compare texts and their summaries
● Britt et al. (1994) data: 24 participants read 21 texts, then write an essay
● Corpus: 21 texts + 8 encyclopedia texts + 2 book excerpts
● Assessing the essays:
● Score 1 = average similarity between each sentence and the closest text sentence
● Score 2 = same, but comparing with the 10 main text sentences
53
Comparison of textsFoltz et al. (1996)
          Judge2  Judge3  Judge4  Score1  Score2
Judge1    .575**  .768**  .381    .418*   .589*
Judge2            .582**  .412*   .552**  .626**
Judge3                    .367    .317    .384+
Judge4                            .117    .240
** p<.01   * p<.05   + p<.06
54
Models based on LSA
- Semantic representation
- Vocabulary acquisition
- Text comprehension
- Free text assessment
55
A model of learning
LSA learns the meaning of words from texts. Does it mimic the human rate of learning?
Landauer & Dumais (1997) experiment: most of the lexicon is acquired from text
Children data:
- 3,500 words read / day
- 50 new words encountered / day
- 10 words learned / day
- Controlled experiments lead to 2.5 words learned / day
56
Direct and indirect effects
How do children acquire the remaining 7.5 words/day?
The poverty of the stimulus problem: how do children acquire so much knowledge on the basis of so little information?
Hypothesis: a word is learned…
- from texts containing it (direct effect)
- from texts not containing it (indirect effect)
57
LSA simulation
TOEFL test score is a function of:
- T: total number of paragraphs
- S: number of paragraphs containing the stem word
score = 0.128 · log(0.076 T) · log(31.91 S)
Correlation with the observed score = .98
After “reading” 25,000 paragraphs, what is the effect of the 25,001st?
- When the stem word belongs to it?
- When the stem word does not belong to it?
58
Example
Suppose a word that appeared twice in the 25,000 paragraphs:
Score = 0.128 · log(0.076 × 25000) · log(31.91 × 2) = .75809
Direct effect: the word belongs to the 25,001st paragraph
Score = 0.128 · log(0.076 × 25001) · log(31.91 × 3) = .83202
Indirect effect: the word does not belong to the 25,001st paragraph
Score = 0.128 · log(0.076 × 25001) · log(31.91 × 2) = .75810
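These figures fall out of the fitted formula with base-10 logarithms (the base is inferred here from the slide's numbers; the source does not state it):

```python
import math

def toefl_score(T, S):
    # Fitted TOEFL score: T paragraphs read, S of them containing the stem word
    return 0.128 * math.log10(0.076 * T) * math.log10(31.91 * S)

baseline = toefl_score(25000, 2)
direct = toefl_score(25001, 3)    # the new paragraph contains the word
indirect = toefl_score(25001, 2)  # it does not
```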
59
LSA simulation
Effects are cumulated over all word-frequency bands
Direct effect:
- .0007 words gained per word encountered
- .0007 × 3,500 = 2.5 words acquired / day
Indirect effect:
- .15 words gained per paragraph encountered
- .15 × 50 = 7.5 words acquired / day
The meaning of a word would thus be acquired:
- for 25% by reading texts containing it
- for 75% by reading texts not containing it
60
Direct and indirect effects (Lemaire et al., to appear)
[Line plot: cosine between acheter (to buy) and magasin (store) as a function of the number of paragraphs read (from about 2,000 to 13,500), rising from roughly −0.1 to 0.6]
61
Cosine acheter/magasin
Total gain: .58
Gain due to the occurrence of acheter: -.10
Gain due to the occurrence of magasin: -.19
Gain due to the co-occurrence: .73
Gain due to 2nd order co-occurrences: .11
Gain due to 3rd and more co-occurrences: .03
62
Same measures on 28 words
Total gain: .13
Gain due to the occurrence of W1: -.16
Gain due to the occurrence of W2: -.19
Gain due to the co-occurrence: .34
Gain due to 2nd order co-occurrences: .05
Gain due to 3rd and more co-occurrences: .09
63
Models based on LSA
- Semantic representation
- Vocabulary acquisition
- Text comprehension
- Free text assessment
64
Text comprehension
Cognitive models can be simulated on computers, using LSA as a model of semantic memory:
- the construction-integration model (Kintsch, 1998)
- a predication algorithm
- metaphor comprehension
65
A computational model of text comprehension (Lemaire et al., to appear)
[Architecture diagram: each next sentence enters working memory; associates are selected from semantic memory (LSA); previous elements are retrieved from episodic memory; decay and integration complete the cycle]
66
Example (translated from French)
A woodcutter was walking in the forest
Select associates:
- walk: stroll, meet, pick
- woodcutter: ax, forest, cottage
- forest: glade, oak, wood
67
Example (translated from French)
A woodcutter was walking in the forest
[Network: the proposition walk/woodcutter/forest linked to walk, woodcutter and forest, each linked to its associates (stroll, meet, pick; ax, cottage; glade, oak, wood)]
71
The construction-integration model
Before LSA, the construction step was performed by hand and link weights were set approximately.
All nodes being vectors, the weight of a link is the cosine of the corresponding nodes.
A question remains: what is the vector of P(A1,…,An)?
72
The predication algorithm (Kintsch, 2001)
The centroid rule: P(A) → P + A
…does not work well: the meaning of the predicate depends on the meaning of the arguments
  The horse ran / The color ran
  shark(lawyer) ≠ shark + lawyer
A better rule: P(A) → P + A + M1 + … + Mn
where the Mi are vectors that are close to both P and A
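A sketch of that rule; the neighbor list, the toy 2-D vectors and m are illustrative assumptions (Kintsch's actual algorithm selects the Mi by spreading activation over LSA neighbors):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def predication(P, A, neighbors_of_P, space, m=1):
    # P(A) -> P + A + the m neighbors of P that are most similar to A
    ranked = sorted(neighbors_of_P, key=lambda w: cosine(space[w], space[A]), reverse=True)
    chosen = [P, A] + ranked[:m]
    return [sum(space[w][d] for w in chosen) for d in range(len(space[P]))]

space = {  # toy 2-D vectors, for illustration only
    "ran": [1.0, 1.0], "horse": [1.0, 0.0], "color": [0.0, 1.0],
    "stopped": [1.0, 0.1], "down": [0.1, 1.0],
}
horse_ran = predication("ran", "horse", ["stopped", "down"], space)  # picks stopped
color_ran = predication("ran", "color", ["stopped", "down"], space)  # picks down
```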
73
Example 1: construction step (Kintsch, 2001)
The horse ran → ran(horse)
[Construction: the neighbors of ran (stopped, down, hopped) linked to horse, with weights .21, .18, .12]
74
Example 1: integration step
The horse ran → ran(horse)
[Integration: activation settles to .46, .19, .00; stopped remains the active associate]
The horse ran → horse + ran + stopped
75
Example 2: construction step
The color ran → ran(color)
[Construction: the neighbors of ran (stopped, down, hopped) linked to color, with weights .06, .11, .02]
76
Example 2: integration step
The color ran → ran(color)
[Integration: activation settles to .00, .70, .00; down remains the active associate]
The color ran → color + ran + down
77
Predication algorithm: test (Kintsch, 2001)
Landmarks   ran   horse ran   color ran
gallop      .33   .75         .29
dissolve    .01   .01         .11
78
Inferences (Kintsch, 2001)
The hunter shot the elk:
- The elk was dead: centroid .54, predication .70
- The hunter was dead: centroid .66, predication .73
The student dropped the glass:
- The glass was broken: centroid .66, predication .91
- The student was broken: centroid .76, predication .87
The student washed the table:
- The table was clean: centroid .71, predication .83
- The student was clean: centroid .70, predication .62
79
Metaphors
My lawyer is a shark → shark(lawyer)
[Diagram: lawyer and shark as distant vectors, with the sentence My lawyer is a shark between them]
The meaning of my lawyer is a shark is not a simple combination of the meaning of lawyer and the meaning of shark
80
Metaphors
Look for words that are similar to shark but also close to lawyer
[Diagram: neighbors of shark (fin, teeth, sea, sharks, vicious, fish, aggressive); vicious and aggressive are also close to lawyer]
81
Metaphors
[Diagram: the vector of My lawyer is a shark shifts toward vicious and aggressive]
82
Metaphors
Many more neighbors of the predicate need to be considered. According to Kintsch, predication fails if fewer than 100 neighbors are considered!
83
An incremental algorithm (Lemaire & Bianco, 2003)
Does not work on a predefined set of neighbors but looks for them progressively
This kid is intelligent: 1… 2… 3 clever 4… 5 brilliant 6… 7… 8… 9 sage
This kid is a scientist: 1… 2… 3… … 97 invention … 128 knowledge … 138 wonderful
84
Experiment
Does the model account for the processing time of various kinds of metaphors?
12 texts, 4 conditions:
- Synonym (child / kid)
- Conventional metaphor (child / scientist)
- Original metaphor (child / sage)
- Bad metaphor (child / colonel)
85
Subjects (N=49)
[Bar chart: additional reading time (ms) by metaphor type (synonym, conventional, unconventional, bad), ranging from about −200 to 350 ms]
86
Subjects and model
Correlation: .62
[Paired charts: the simulation's number of neighbors (0–300) and the subjects' additional reading time (ms) by metaphor type (synonym, conventional, unconventional, bad)]
87
Models based on LSA
- Semantic representation
- Vocabulary acquisition
- Text comprehension
- Free text assessment
88
Free text assessment
Assessing the content of a student essay:
- Rely on a domain-specific corpus
- Compare the student essay with a reference text:
  - pre-graded essays
  - the text to be summarized
  - portions of a course text
Trusso Haley et al. (2005) published a good state-of-the-art report (Tech. Rep., Open University)
89
Intelligent Essay Assessor
Foltz et al. (1999)
Holistic score:
- Compare the student essay with pre-graded essays
- Give the grade of the closest one
Gold standard score:
- Compare the student essay with a teacher essay
- Give the grade according to the semantic similarity
Correlation with teacher's grades: around .8
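The holistic scheme is a nearest-neighbor rule in the semantic space; a sketch (the cosine measure and the tiny pre-graded set are illustrative):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def holistic_grade(essay, graded):
    # Return the grade of the pre-graded essay closest (by cosine) to the student essay
    best_vector, best_grade = max(graded, key=lambda pair: cosine(essay, pair[0]))
    return best_grade

graded = [([1.0, 0.2], "A"), ([0.2, 1.0], "C")]
grade = holistic_grade([0.9, 0.3], graded)  # closest to the "A" essay
```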
90
Summary Street
Wade-Stein & E. Kintsch (2003)
Helps the student to summarize a text
Indicates the parts of the text that are well or poorly summarized
An experiment with 60 6th-grade students has shown a positive effect
91
Summary Street
92
Apex
Dessus & Lemaire (1999, 2001, 2002)
Compare the student essay with relevant portions of the course
The course is structured:
#T The first chapter
#N In the first part, we will study the way...
#N Now, I will present a method...
...
#T The second chapter
#N ...
93
94
Apex
Correlation with teacher's grades: .62
Not only for grading, but for engaging students in a revising process
Can be used at a distance
95
Outline
- Intuitive introduction
- LSA technique
- Tests
- Cognitive models
- Limits
- Competing models
96
Strengths and limits
LSA strengths:
- fully automatic
- the vectorial representation allows easy representation of sequences of words and easy comparisons between words or texts
- a cognitive model of learning and representation
LSA weaknesses:
- no syntax processing
- representations are not explicit
- not incremental
- paragraphs are bags of words
97
Outline
- Intuitive introduction
- LSA technique
- Tests
- Cognitive models
- Limits
- Competing models
98
Competing models
Cognitive models of semantic memory could mimic:
- the human semantic associations, or
- the human construction of these associations over a long period of time
Semantic networks are good models of semantic associations BUT they do not account for the construction of these associations
99
Features
- Input (corpus, word association norms)
- Knowledge representation (vector-based, network-based)
- Addition of a new text (incremental or not)
- Unit of context (paragraph, sliding window)
- Use of high-order co-occurrences (yes/no)
- Compositionality (representing the meaning of a text from the meaning of its words)
100
LSA
- Input: corpus
- Knowledge representation: vector-based
- Addition of a new text: not incremental
- Unit of context: paragraph
- Use of high-order co-occurrences: yes
- Compositionality: easy
101
HAL (Burgess, 1998)
Word co-occurrence matrix built from a N-word sliding window
vector = concatenation of row and column
barn = <0,2,4,3,6,0,5,0,0,0>
Measure of similarity = Euclidean distance
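A sketch of the window pass; the window size and the weighting scheme (window − distance + 1, so closer words weigh more) follow the usual HAL description, and the details are assumptions:

```python
def hal_matrix(tokens, window=5):
    # M[a][b] accumulates, for each occurrence of word a, weighted counts of
    # word b appearing before it within the window (closer words weigh more)
    vocab = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(vocab)}
    M = [[0] * len(vocab) for _ in vocab]
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                M[idx[tokens[i + d]]][idx[w]] += window - d + 1
    return M, vocab

def hal_vector(word, M, vocab):
    # A word's vector = its row concatenated with its column
    i = vocab.index(word)
    return M[i] + [row[i] for row in M]

M, vocab = hal_matrix("the horse raced past the barn".split(), window=2)
```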
104
HAL (Burgess, 1998)
- Input: corpus
- Knowledge representation: vector-based
- Addition of a new text: incremental
- Unit of context: sliding window
- Use of high-order co-occurrences: no
- Compositionality: easy
105
Word association space (Steyvers et al., in press)
Based on association norms for 5,000 words
Word/word matrix scaled to 200-500 dimensions
106
107
Word association space (Steyvers et al., in press)
- Input: association norms
- Knowledge representation: vector-based
- Addition of a new text: not incremental
- Unit of context: N/A
- Use of high-order co-occurrences: no
- Compositionality: easy
108
PMI-IR (Turney, 2001)
Based on Mutual Information (Church & Hanks, 89): compares the probability of two words occurring together with the probability of the words occurring separately
MI(A,B) = log2( p(A,B) / (p(A) · p(B)) )
TOEFL test: 1 problem word / 4 choices
score1(choice) = hits(problem AND choice) / hits(choice)
score2(choice) = hits(problem NEAR choice) / hits(choice)
...
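Both measures are straightforward to compute from counts; the hit counts in this sketch are made-up illustrations, not real search-engine figures:

```python
import math

def pmi(hits_ab, hits_a, hits_b, total):
    # MI(A,B) = log2( p(A,B) / (p(A) * p(B)) )
    return math.log2((hits_ab / total) / ((hits_a / total) * (hits_b / total)))

def score1(hits_problem_and_choice, hits_choice):
    # Turney's score1: co-occurrence hits normalized by the choice word's hits
    return hits_problem_and_choice / hits_choice
```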
110
PMI-IR (Turney, 2001)
- Input: web
- Knowledge representation: N/A
- Addition of a new text: incremental
- Unit of context: web page
- Use of high-order co-occurrences: no
- Compositionality: hard
111
ICAN (Lemaire & Denhière, 2004)
Network-based representation:
- better for quickly retrieving neighbors
- better for representing asymmetrical relations
[Network: cable linked to electric, wire, connector, network, rope and plug, with weights between .2 and .7]
112
ICAN
... if you have such a [device, connect the cable to the network connector] then switch ...
Create or reinforce co-occurrence links
[Before/after networks: the links among the window words (device, connect, cable, network, connector) are created or reinforced]
113
ICAN
... if you have such a [device, connect the cable to the network connector] then switch ...
Create or reinforce co-occurrence links
Create or slightly reinforce 2nd-order co-occurrence links
[Before/after networks: weaker 2nd-order links are added between window words and neighbors of other window words]
114
ICAN
... if you have such a [device, connect the cable to the network connector] then switch ...
Create or reinforce co-occurrence links
Create or slightly reinforce 2nd-order co-occurrence links
Slightly decrease other links
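The three rules can be sketched as one update function; the gain and decay rates here are illustrative assumptions, not the published parameters:

```python
def ican_update(links, window, cooc_gain=0.1, indirect_gain=0.02, decay=0.01):
    # links: {frozenset({w1, w2}): weight}; window: words of the current context
    in_window = set(window)
    new = {}
    for pair, w in links.items():
        a, b = tuple(pair)
        if a in in_window and b in in_window:
            w += cooc_gain * (1 - w)        # rule 1: co-occurrence, reinforce
        elif a in in_window or b in in_window:
            w += indirect_gain * (1 - w)    # rule 2: 2nd-order, slight reinforcement
        else:
            w -= decay * w                  # rule 3: all other links slowly decay
        new[pair] = w
    for a in in_window:                     # rule 1 also creates missing links
        for b in in_window:
            if a < b:
                new.setdefault(frozenset((a, b)), cooc_gain)
    return new

links = {frozenset(("cable", "wire")): 0.5, frozenset(("rope", "knot")): 0.5}
updated = ican_update(links, ["cable", "connector"])
```

Unlike LSA's batch SVD, each new context updates the network directly, which is what makes the model incremental.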
[Before/after networks: links not involving the window words decay slightly]
115
ICAN
3.2 million word corpus of texts for children
Test on 1,200 pairs of words (from association norms)
Average weights:
- stem / 1st associate: .415
- stem / 2nd associate: .269
- stem / 3rd associate: .236
- stem / last associates: .098
Correlation with human data: .50
116
ICAN
- Input: corpus
- Knowledge representation: network-based
- Addition of a new text: incremental
- Unit of context: sliding window
- Use of high-order co-occurrences: yes
- Compositionality: hard
117
Web pages
Tests: lsa.colorado.edu
References: www-leibniz.imag.fr/~blemaire/lsa.html