Distributional Semantic Models - Part 1:...
Distributional Semantic Models
CS 114, James Pustejovsky
slides by Stefan Evert
Outline

Introduction
  The distributional hypothesis
  Three famous DSM examples

Taxonomy of DSM parameters
  Definition & overview
  DSM parameters
  Examples

DSM in practice
  Using DSM distances
  Quantitative evaluation
  Software and further information
Introduction: The distributional hypothesis
Meaning & distribution
- "The meaning of a word lies in its use." ("Die Bedeutung eines Wortes liegt in seinem Gebrauch.") — Ludwig Wittgenstein
- "You shall know a word by the company it keeps!" — J. R. Firth (1957)
- Distributional hypothesis: difference of meaning correlates with difference of distribution (Zellig Harris 1954)
- "What people know when they say that they know a word is not how to recite its dictionary definition – they know how to use it [...] in everyday discourse." (Miller 1986)
What is the meaning of "bardiwac"?

- He handed her her glass of bardiwac.
- Beef dishes are made to complement the bardiwacs.
- Nigel staggered to his feet, face flushed from too much bardiwac.
- Malbec, one of the lesser-known bardiwac grapes, responds well to Australia's sunshine.
- I dined off bread and cheese and this excellent bardiwac.
- The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.

→ bardiwac is a heavy red alcoholic beverage made from grapes

The examples above are handpicked, of course. But in a corpus like the BNC, you will find at least as many informative sentences.
A thought experiment: deciphering hieroglyphs

                get   sij   ius   hir   iit   kil
(knife)  naif    51    20    84     0     3     0
(cat)    ket     52    58     4     4     6    26
???      dog    115    83    10    42    33    17
(boat)   beut    59    39    23     4     0     0
(cup)    kap     98    14     6     2     1     0
(pig)    pigij   12    17     3     2     9    27
(banana) nana    11     2     2     0    18     0
sim(dog, naif) = 0.770
sim(dog, pigij) = 0.939
sim(dog, ket) = 0.961
English as seen by the computer . . .

                get   see   use   hear  eat   kill
knife    naif    51    20    84     0     3     0
cat      ket     52    58     4     4     6    26
dog      dog    115    83    10    42    33    17
boat     beut    59    39    23     4     0     0
cup      kap     98    14     6     2     1     0
pig      pigij   12    17     3     2     9    27
banana   nana    11     2     2     0    18     0

verb-object counts from the British National Corpus
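As a concrete illustration (not part of the original slides), the following Python sketch builds the toy verb-object matrix above with numpy and compares row vectors by cosine similarity; the sim() scores quoted on the hieroglyph slide may be based on transformed counts, so the exact values need not coincide.

```python
# A minimal sketch: cosine similarity on the toy verb-object matrix above.
import numpy as np

verbs = ["get", "see", "use", "hear", "eat", "kill"]
nouns = ["knife", "cat", "dog", "boat", "cup", "pig", "banana"]
M = np.array([
    [ 51, 20, 84,  0,  3,  0],   # knife
    [ 52, 58,  4,  4,  6, 26],   # cat
    [115, 83, 10, 42, 33, 17],   # dog
    [ 59, 39, 23,  4,  0,  0],   # boat
    [ 98, 14,  6,  2,  1,  0],   # cup
    [ 12, 17,  3,  2,  9, 27],   # pig
    [ 11,  2,  2,  0, 18,  0],   # banana
], dtype=float)

def cosine(u, v):
    """Cosine similarity <u, v> / (||u|| * ||v||)."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

dog = M[nouns.index("dog")]
for noun in nouns:
    print(f"cos(dog, {noun}) = {cosine(dog, M[nouns.index(noun)]):.3f}")
```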
Geometric interpretation

- row vector x_dog describes the usage of the word dog in the corpus
- can be seen as the coordinates of a point in n-dimensional Euclidean space
- illustrated for two dimensions: get and use
- x_dog = (115, 10)

[Plot: "Two dimensions of English V-Obj DSM": knife, cat, dog and boat plotted by their get (x-axis) and use (y-axis) coordinates from the co-occurrence matrix M above]
Geometric interpretation (continued)

- similarity = spatial proximity (Euclidean distance)
- location depends on the frequency of the noun (f_dog ≈ 2.7 · f_cat)
- direction is more important than location
- normalise the "length" ||x_dog|| of the vector
- or use the angle α as a distance measure

[Plot: the same two-dimensional V-Obj space, showing Euclidean distances d = 63.3 and d = 57.5 between points, the vectors rescaled to unit length, and the angle α = 54.3° between two of them]
Semantic distances

- the main result of distributional analysis is a set of "semantic" distances between words
- typical applications
  - nearest neighbours
  - clustering of related words
  - constructing a semantic map
- other applications require clever use of the distance information
  - semantic relations
  - relational analogies
  - word sense disambiguation
  - detection of multiword expressions
[Figure: Word space clustering of concrete nouns (V-Obj from BNC): hierarchical clustering dendrogram, y-axis: cluster size]
[Figure: Semantic map (V-Obj from BNC): concrete nouns plotted in two dimensions, grouped into the categories bird, groundAnimal, fruitTree, green, tool, vehicle]
Some applications in computational linguistics

- Unsupervised part-of-speech induction (Schütze 1995)
- Word sense disambiguation (Schütze 1998)
- Query expansion in information retrieval (Grefenstette 1994)
- Synonym tasks & other language tests (Landauer and Dumais 1997; Turney et al. 2003)
- Thesaurus compilation (Lin 1998a; Rapp 2004)
- Ontology & wordnet expansion (Pantel et al. 2009)
- Attachment disambiguation (Pantel and Lin 2000)
- Probabilistic language models (Bengio et al. 2003)
- Subsymbolic input representation for neural networks
- Many other tasks in computational semantics: entailment detection, noun compound interpretation, identification of non-compositional expressions, ...
Introduction: Three famous DSM examples
Latent Semantic Analysis (Landauer and Dumais 1997)

- Corpus: 30,473 articles from Grolier's Academic American Encyclopedia (4.6 million words in total)
  → articles were limited to their first 2,000 characters
- Word-article frequency matrix for 60,768 words
  - row vector shows the frequency of a word in each article
- Logarithmic frequencies scaled by word entropy
- Reduced to 300 dimensions by singular value decomposition (SVD)
  - borrowed from LSI (Dumais et al. 1988)
  → central claim: SVD reveals latent semantic features, not just a data reduction technique
- Evaluated on the TOEFL synonym test (80 items)
  - LSA model achieved 64.4% correct answers
  - also a simulation of learning rate based on TOEFL results
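A minimal sketch of the LSA recipe described above, assuming a tiny invented term-document count matrix: log-transform the counts, weight rows by a term-entropy factor, and keep a few latent dimensions of the SVD. It illustrates the pipeline only; it is not the original Landauer & Dumais setup.

```python
# LSA-style sketch: log counts, entropy weighting, truncated SVD (illustrative).
import numpy as np

# toy term-document counts (rows = terms, columns = documents)
counts = np.array([
    [3, 0, 1, 0],
    [2, 1, 0, 0],
    [0, 4, 2, 1],
    [0, 1, 3, 2],
], dtype=float)

log_counts = np.log(counts + 1.0)                     # logarithmic frequencies

p = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(p > 0, p * np.log(p), 0.0)
entropy_weight = 1.0 + plogp.sum(axis=1) / np.log(counts.shape[1])

X = log_counts * entropy_weight[:, np.newaxis]        # weighted matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)      # SVD: X = U diag(s) Vt
k = 2                                                  # keep k latent dimensions
term_vectors = U[:, :k] * s[:k]                        # reduced term representations
print(term_vectors)
```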
Word Space (Schütze 1992, 1993, 1998)

- Corpus: ≈ 60 million words of news messages from the New York Times News Service
- Word-word co-occurrence matrix
  - 20,000 target words & 2,000 context words as features
  - row vector records how often each context word occurs close to the target word (co-occurrence)
  - co-occurrence window: 50 words to the left and right (Schütze 1998) or ≈ 1000 characters (Schütze 1992)
- Rows weighted by inverse document frequency (tf.idf)
- Context vector = centroid of word vectors (bag-of-words)
  → goal: determine the "meaning" of a context
- Reduced to 100 SVD dimensions (mainly for efficiency)
- Evaluated on unsupervised word sense induction by clustering of context vectors (for an ambiguous word)
  - induced word senses improve information retrieval performance
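A small sketch of the "context vector = centroid of word vectors" step, with an invented word_vectors table standing in for the rows of a real DSM matrix.

```python
# Context vector as the centroid of the vectors of the words in the context.
import numpy as np

word_vectors = {
    "bank":  np.array([0.2, 0.9, 0.1]),
    "money": np.array([0.1, 0.8, 0.0]),
    "river": np.array([0.9, 0.1, 0.2]),
    "water": np.array([0.8, 0.2, 0.3]),
}

def context_vector(tokens, vectors):
    """Centroid (bag-of-words average) of the vectors of known tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else None

print(context_vector(["money", "bank"], word_vectors))
print(context_vector(["river", "bank", "water"], word_vectors))
```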
HAL (Lund and Burgess 1996)

- HAL = Hyperspace Analogue to Language
- Corpus: 160 million words from newsgroup postings
- Word-word co-occurrence matrix
  - the same 70,000 words used as targets and features
  - co-occurrence window of 1 – 10 words
- Separate counts for left and right co-occurrence, i.e. the context is structured
- In later work, co-occurrences are weighted by (inverse) distance (Li et al. 2000)
- Applications include the construction of semantic vocabulary maps by multidimensional scaling to 2 dimensions
Many parameters . . .

- Enormous range of DSM parameters and applications
- The examples showed three entirely different models, each tuned to its particular application
→ Need an overview of DSM parameters & an understanding of their effects
Taxonomy of DSM parameters: Definition & overview
General definition of DSMs

A distributional semantic model (DSM) is a scaled and/or transformed co-occurrence matrix M, such that each row x represents the distribution of a target term across contexts.

           get      see      use      hear     eat      kill
knife     0.027   -0.024    0.206   -0.022   -0.044   -0.042
cat       0.031    0.143   -0.243   -0.015   -0.009    0.131
dog      -0.026    0.021   -0.212    0.064    0.013    0.014
boat     -0.022    0.009   -0.044   -0.040   -0.074   -0.042
cup      -0.014   -0.173   -0.249   -0.099   -0.119   -0.042
pig      -0.069    0.094   -0.158    0.000    0.094    0.265
banana    0.047   -0.139   -0.104   -0.022    0.267   -0.042

Term = word, lemma, phrase, morpheme, word pair, ...
General definition of DSMs

Mathematical notation:

- k × n co-occurrence matrix M (example: 7 × 6 matrix)
  - k rows = target terms
  - n columns = features or dimensions

  M = \begin{pmatrix}
        m_{11} & m_{12} & \cdots & m_{1n} \\
        m_{21} & m_{22} & \cdots & m_{2n} \\
        \vdots &        &        & \vdots \\
        m_{k1} & m_{k2} & \cdots & m_{kn}
      \end{pmatrix}

- distribution vector m_i = i-th row of M, e.g. m_3 = m_dog
- components of m_i = (m_{i1}, m_{i2}, ..., m_{in}) = features of the i-th term:

  m_3 = (-0.026, 0.021, -0.212, 0.064, 0.013, 0.014)
      = (m_{31}, m_{32}, m_{33}, m_{34}, m_{35}, m_{36})
Overview of DSM parameters

Term-context vs. term-term matrix
  ⇓
Definition of terms & linguistic pre-processing
  ⇓
Size & type of context
  ⇓
Geometric vs. probabilistic interpretation
  ⇓
Feature scaling
  ⇓
Normalisation of rows and/or columns
  ⇓
Similarity / distance measure
  ⇓
Dimensionality reduction
Taxonomy of DSM parameters: DSM parameters
Term-context matrix

A term-context matrix records the frequency of each term in each individual context (e.g. sentence, document, Web page, encyclopaedia article):

  F = \begin{pmatrix}
        \cdots & f_1 & \cdots \\
        \cdots & f_2 & \cdots \\
               & \vdots &     \\
        \cdots & f_k & \cdots
      \end{pmatrix}

          Felidae   Pet   Feral   Bloat   Philosophy   Kant   Back pain
cat            10    10       7       –            –      –           –
dog             –    10       4      11            –      –           –
animal          2    15      10       2            –      –           –
time            1     –       –       –            2      1           –
reason          –     1       –       –            1      4           1
cause           –     –       –       2            1      2           6
effect          –     –       –       1            –      1           –
Term-context matrix

Some footnotes:
- Features are usually context tokens, i.e. individual instances
- Can also be generalised to context types, e.g.
  - bag of content words
  - specific pattern of POS tags
  - n-gram of words (or POS tags) around the target
  - subcategorisation pattern of the target verb
- The term-context matrix is often very sparse
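To illustrate how such a sparse matrix is built and stored in practice, here is a minimal sketch (with invented mini-documents) that assembles a term-context matrix as a scipy sparse matrix.

```python
# Building a sparse term-context (term-document) matrix (illustrative).
from collections import defaultdict
from scipy.sparse import csr_matrix

documents = [
    "the cat chased the dog",
    "the dog chased the cat",
    "cause and effect in philosophy",
]

vocab = {}
rows, cols, vals = [], [], []
for j, doc in enumerate(documents):
    freq = defaultdict(int)
    for token in doc.split():
        freq[token] += 1
    for token, f in freq.items():
        i = vocab.setdefault(token, len(vocab))
        rows.append(i)
        cols.append(j)
        vals.append(f)

F = csr_matrix((vals, (rows, cols)), shape=(len(vocab), len(documents)))
print(F.toarray())        # rows = terms, columns = contexts (documents)
```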
Term-term matrix

A term-term matrix records co-occurrence frequencies with feature terms for each target term:

  M = \begin{pmatrix}
        \cdots & m_1 & \cdots \\
        \cdots & m_2 & \cdots \\
               & \vdots &     \\
        \cdots & m_k & \cdots
      \end{pmatrix}

          breed   tail   feed   kill   important   explain   likely
cat          83     17      7     37           –         1        –
dog         561     13     30     60           1         2        4
animal       42     10    109    134          13         5        5
time         19      9     29    117          81        34      109
reason        1      –      2     14          68       140       47
cause         –      1      –      4          55        34       55
effect        –      –      1      6          60        35       17

→ we will usually assume a term-term matrix
Term-term matrix

Some footnotes:
- Often target terms ≠ feature terms
  - e.g. nouns described by co-occurrences with verbs as features
  - identical sets of target & feature terms → symmetric matrix
- Different types of contexts (Evert 2008)
  - surface context (word or character window)
  - textual context (non-overlapping segments)
  - syntactic context (specific syntagmatic relation)
- Can be seen as a smoothing of the term-context matrix
  - average over similar contexts (with the same context terms)
  - data sparseness reduced, except for small windows
  - we will take a closer look at the relation between term-context and term-term models later in this tutorial
Corpus pre-processing

- Minimally, the corpus must be tokenised → identify terms
- Linguistic annotation
  - part-of-speech tagging
  - lemmatisation / stemming
  - word sense disambiguation (rare)
  - shallow syntactic patterns
  - dependency parsing
- Generalisation of terms
  - often lemmatised to reduce data sparseness: go, goes, went, gone, going → go
  - POS disambiguation (light/N vs. light/A vs. light/V)
  - word sense disambiguation (bank_river vs. bank_finance)
- Trade-off between deeper linguistic analysis and
  - the need for language-specific resources
  - possible errors introduced at each stage of the analysis
Effects of pre-processing

Nearest neighbours of walk (BNC):

word forms: stroll, walking, walked, go, path, drive, ride, wander, sprinted, sauntered
lemmatised corpus: hurry, stroll, stride, trudge, amble, wander, walk-nn, walking, retrace, scuttle
Effects of pre-processing

Nearest neighbours of arrivare (Repubblica):

word forms: giungere, raggiungere, arrivi, raggiungimento, raggiunto, trovare, raggiunge, arrivasse, arriverà, concludere
lemmatised corpus: giungere, aspettare, attendere, arrivo-nn, ricevere, accontentare, approdare, pervenire, venire, piombare
Surface context

Context term occurs within a window of k words around the target.

"The silhouette of the sun beyond a wide-open bay on the lake; the sun still glitters although evening has arrived in Kuhmo. It's midsummer; the living room has its instruments and other objects in each of its corners."

Parameters:
- window size (in words or characters)
- symmetric vs. one-sided window
- uniform or "triangular" (distance-based) weighting
- window clamped to sentences or other textual units?
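A minimal sketch of surface co-occurrence counting under these parameters, assuming a symmetric window of k words and an optional triangular (1/distance) weighting; the token list is just part of the example sentence above.

```python
# Counting surface co-occurrences within a symmetric window (illustrative).
from collections import defaultdict

def window_cooccurrences(tokens, k=2, triangular=False):
    counts = defaultdict(float)
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - k), min(len(tokens), i + k + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            weight = 1.0 / abs(i - j) if triangular else 1.0
            counts[(target, tokens[j])] += weight
    return counts

tokens = "the sun still glitters although evening has arrived".split()
for pair, n in sorted(window_cooccurrences(tokens, k=2).items()):
    print(pair, n)
```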
Effect of different window sizes

Nearest neighbours of dog (BNC):

2-word window: cat, horse, fox, pet, rabbit, pig, animal, mongrel, sheep, pigeon
30-word window: kennel, puppy, pet, bitch, terrier, rottweiler, canine, cat, to bark, Alsatian
Textual context

Context term is in the same linguistic unit as the target.

"The silhouette of the sun beyond a wide-open bay on the lake; the sun still glitters although evening has arrived in Kuhmo. It's midsummer; the living room has its instruments and other objects in each of its corners."

Parameters:
- type of linguistic unit
  - sentence
  - paragraph
  - turn in a conversation
  - Web page
Syntactic context

Context term is linked to the target by a syntactic dependency (e.g. subject, modifier, ...).

"The silhouette of the sun beyond a wide-open bay on the lake; the sun still glitters although evening has arrived in Kuhmo. It's midsummer; the living room has its instruments and other objects in each of its corners."

Parameters:
- types of syntactic dependency (Padó and Lapata 2007)
- direct vs. indirect dependency paths
  - direct dependencies
  - direct + indirect dependencies
- homogeneous data (e.g. only verb-object) vs. heterogeneous data (e.g. all children and parents of the verb)
- maximal length of the dependency path

© Evert/Baroni/Lenci (CC-by-sa)
"Knowledge pattern" context

Context term is linked to the target by a lexico-syntactic pattern (text mining, cf. Hearst 1992, Pantel & Pennacchiotti 2008, etc.).

"In Provence, Van Gogh painted with bright colors such as red and yellow. These colors produce incredible effects on anybody looking at his paintings."

Parameters:
- inventory of lexical patterns
  - lots of research to identify semantically interesting patterns (cf. Almuhareb & Poesio 2004, Veale & Hao 2008, etc.)
- fixed vs. flexible patterns
  - patterns are mined from large corpora and automatically generalised (optional elements, POS tags or semantic classes)
Structured vs. unstructured context

- In unstructured models, the context specification acts as a filter
  - it determines whether a context token counts as a co-occurrence
  - e.g. linked by a specific syntactic relation such as verb-object
- In structured models, context words are subtyped
  - depending on their position in the context
  - e.g. left vs. right context, type of syntactic relation, etc.
Structured vs. unstructured surface context

A dog bites a man. The man's dog bites a dog. A dog bites a man.

unstructured          bite
         dog             4
         man             3

structured         bite-l   bite-r
         dog            3        1
         man            1        2
Structured vs. unstructured dependency context

A dog bites a man. The man's dog bites a dog. A dog bites a man.

unstructured          bite
         dog             4
         man             2

structured       bite-subj   bite-obj
         dog             3          1
         man             0          2
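The following sketch reproduces the unstructured and structured surface counts of the toy example above, assuming (as the table suggests) that bite-l / bite-r mean the noun occurs to the left / right of the verb within the same sentence.

```python
# Unstructured vs. structured surface counts for the toy "dog bites man" text.
from collections import Counter

sentences = [
    "a dog bites a man",
    "the man 's dog bites a dog",
    "a dog bites a man",
]

unstructured = Counter()
structured = Counter()
for sentence in sentences:
    tokens = sentence.split()
    v = tokens.index("bites")
    for i, tok in enumerate(tokens):
        if tok in ("dog", "man"):
            unstructured[(tok, "bite")] += 1
            side = "bite-l" if i < v else "bite-r"
            structured[(tok, side)] += 1

print(unstructured)   # dog/bite: 4, man/bite: 3
print(structured)     # dog: bite-l 3, bite-r 1; man: bite-l 1, bite-r 2
```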
Comparison

- Unstructured context
  - data less sparse (e.g. man kills and kills man both map to the kill dimension of the vector x_man)
- Structured context
  - more sensitive to semantic distinctions (kill-subj and kill-obj are rather different things!)
  - dependency relations provide a form of syntactic "typing" of the DSM dimensions (the "subject" dimensions, the "recipient" dimensions, etc.)
  - important to account for word order and compositionality
Geometric vs. probabilistic interpretation

- Geometric interpretation
  - row vectors as points or arrows in n-dimensional space
  - very intuitive, good for visualisation
  - use techniques from geometry and linear algebra
- Probabilistic interpretation
  - co-occurrence matrix as observed sample statistic
  - "explained" by a generative probabilistic model
  - recent work focuses on hierarchical Bayesian models
  - probabilistic LSA (Hofmann 1999), Latent Semantic Clustering (Rooth et al. 1999), Latent Dirichlet Allocation (Blei et al. 2003), etc.
  - explicitly accounts for random variation of frequency counts
  - intuitive and plausible as a topic model
→ focus on geometric interpretation in this tutorial
Feature scaling

Feature scaling is used to "discount" less important features:

- Logarithmic scaling: x' = log(x + 1) (cf. Weber-Fechner law for human perception)
- Relevance weighting, e.g. tf.idf (information retrieval)
- Statistical association measures (Evert 2004, 2008) take the frequency of the target word and the context feature into account
  - the less frequent the target word and (more importantly) the context feature are, the higher the weight given to their observed co-occurrence count should be (because their expected chance co-occurrence frequency is low)
  - different measures – e.g., mutual information, log-likelihood ratio – differ in how they balance observed and expected co-occurrence frequencies
Association measures: Mutual Information (MI)

word1   word2           f_obs     f_1       f_2
dog     small             855    33,338   490,580
dog     domesticated       29    33,338       918

Expected co-occurrence frequency:

  f_{exp} = \frac{f_1 \cdot f_2}{N}

Mutual Information compares observed vs. expected frequency:

  MI(w_1, w_2) = \log_2 \frac{f_{obs}}{f_{exp}} = \log_2 \frac{N \cdot f_{obs}}{f_1 \cdot f_2}

Disadvantage: MI overrates combinations of rare terms.
Other association measures

word1   word2           f_obs     f_exp      MI     local-MI   t-score
dog     small             855    134.34      2.67    2282.88     24.64
dog     domesticated       29      0.25      6.85     198.76      5.34
dog     sgjkj               1      0.00027  11.85      11.85      1.00

The log-likelihood ratio (Dunning 1993) has a more complex form, but its "core" is known as local MI (Evert 2004):

  local-MI(w_1, w_2) = f_{obs} \cdot MI(w_1, w_2)

The t-score measure (Church and Hanks 1990) is popular in lexicography:

  t-score(w_1, w_2) = \frac{f_{obs} - f_{exp}}{\sqrt{f_{obs}}}

Details & many more measures: http://www.collocations.de/
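The scores in the table can be recomputed directly from f_obs and f_exp; the sketch below does so for the three example pairs (the corpus size N itself is not needed once f_exp is given).

```python
# Recomputing MI, local-MI and t-score from observed and expected frequencies.
import math

pairs = [
    ("dog", "small",         855, 134.34),
    ("dog", "domesticated",   29,   0.25),
    ("dog", "sgjkj",           1,   0.00027),
]

for w1, w2, f_obs, f_exp in pairs:
    mi = math.log2(f_obs / f_exp)             # MI = log2(f_obs / f_exp)
    local_mi = f_obs * mi                     # "core" of the log-likelihood ratio
    t_score = (f_obs - f_exp) / math.sqrt(f_obs)
    print(f"{w1}/{w2}: MI={mi:.2f}  local-MI={local_mi:.2f}  t={t_score:.2f}")
```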
Normalisation of row vectors

- geometric distances only make sense if the vectors are normalised to unit length
- divide each vector by its length: x / ||x||
- normalisation depends on the distance measure!
- special case: scale to relative frequencies with ||x||_1 = |x_1| + ... + |x_n| → probabilistic interpretation

[Plot: two dimensions of the English V-Obj DSM with the vectors for knife, cat, dog and boat normalised to unit length]
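A small sketch of both normalisation variants, using two rows of the toy V-Obj matrix: dividing by the Euclidean length gives unit vectors, dividing by the L1 norm gives relative frequencies.

```python
# Row normalisation: L2 (unit length) and L1 (relative frequencies).
import numpy as np

M = np.array([[115., 83., 10., 42., 33., 17.],    # dog
              [ 52., 58.,  4.,  4.,  6., 26.]])   # cat

l2_normalised = M / np.linalg.norm(M, axis=1, keepdims=True)
l1_normalised = M / np.abs(M).sum(axis=1, keepdims=True)

print(np.linalg.norm(l2_normalised, axis=1))   # all 1.0
print(l1_normalised.sum(axis=1))               # all 1.0 (relative frequencies)
```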
Scaling of column vectors

- In statistical analysis and machine learning, features are usually centred and scaled so that mean μ = 0 and variance σ² = 1
- In DSM research, this step is less common for the columns of M
  - centring is a prerequisite for certain dimensionality reduction and data analysis techniques (esp. PCA)
  - scaling may give too much weight to rare features
- M cannot be row-normalised and column-scaled at the same time (the result depends on the ordering of the two steps)
Geometric distance

- Distance between vectors u, v ∈ R^n → (dis)similarity
  - u = (u_1, ..., u_n)
  - v = (v_1, ..., v_n)
- Euclidean distance d_2(u, v)
- "City block" Manhattan distance d_1(u, v)
- Both are special cases of the Minkowski p-distance d_p(u, v) (for p ∈ [1, ∞])

  d_2(u, v) := \sqrt{(u_1 - v_1)^2 + \cdots + (u_n - v_n)^2}
  d_1(u, v) := |u_1 - v_1| + \cdots + |u_n - v_n|
  d_p(u, v) := (|u_1 - v_1|^p + \cdots + |u_n - v_n|^p)^{1/p}
  d_∞(u, v)  = \max\{|u_1 - v_1|, ..., |u_n - v_n|\}

[Plot: vectors u and v in two dimensions with d_2(u, v) = 3.6 and d_1(u, v) = 5]
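A sketch of the Minkowski p-distance defined above; the toy vectors are chosen so that the Euclidean and Manhattan distances match the d_2 = 3.6 and d_1 = 5 shown in the plot (the exact coordinates in the plot are not given, so these are illustrative).

```python
# Minkowski p-distance: p=2 Euclidean, p=1 Manhattan, p=inf maximum difference.
import numpy as np

def minkowski(u, v, p=2.0):
    diff = np.abs(np.asarray(u, dtype=float) - np.asarray(v, dtype=float))
    if np.isinf(p):
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)

u, v = [2.0, 5.0], [5.0, 3.0]           # toy vectors
print(minkowski(u, v, 2))                # 3.605...  (d_2 = 3.6)
print(minkowski(u, v, 1))                # 5.0       (d_1 = 5)
print(minkowski(u, v, np.inf))           # 3.0
```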
Other distance measures

- Information theory: Kullback-Leibler (KL) divergence for probability vectors (non-negative, ||x||_1 = 1):

  D(u || v) = \sum_{i=1}^{n} u_i \cdot \log_2 \frac{u_i}{v_i}

- Properties of the KL divergence
  - most appropriate in a probabilistic interpretation of M
  - zeroes in v without corresponding zeroes in u are problematic
  - not symmetric, unlike geometric distance measures
  - alternatives: skew divergence, Jensen-Shannon divergence
- A symmetric distance measure (Endres and Schindelin 2003):

  D_{uv} = D(u || z) + D(v || z)   with   z = \frac{u + v}{2}
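A sketch of the KL divergence and of the symmetric Endres-Schindelin style measure for probability vectors; the example vectors are invented.

```python
# KL divergence and a symmetric variant for probability vectors (illustrative).
import numpy as np

def kl(u, v):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    mask = u > 0                        # terms with u_i = 0 contribute nothing
    return np.sum(u[mask] * np.log2(u[mask] / v[mask]))

def symmetric_kl(u, v):
    z = (np.asarray(u, dtype=float) + np.asarray(v, dtype=float)) / 2.0
    return kl(u, z) + kl(v, z)          # symmetric, finite whenever u or v > 0

u = [0.7, 0.2, 0.1]
v = [0.1, 0.6, 0.3]
print(kl(u, v), kl(v, u))               # not symmetric
print(symmetric_kl(u, v))
```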
Similarity measures

- the angle α between two vectors u, v is given by

  \cos \alpha = \frac{\sum_{i=1}^{n} u_i \cdot v_i}{\sqrt{\sum_i u_i^2} \cdot \sqrt{\sum_i v_i^2}} = \frac{\langle u, v \rangle}{||u||_2 \cdot ||v||_2}

- cosine measure of similarity: cos α
  - cos α = 1 → collinear
  - cos α = 0 → orthogonal

[Plot: two dimensions of the English V-Obj DSM with the angle α = 54.3° between two normalised vectors]
Dimensionality reduction = model compression

- The co-occurrence matrix M is often unmanageably large and can be extremely sparse
  - Google Web1T5: 1M × 1M matrix with one trillion cells, of which less than 0.05% contain nonzero counts (Evert 2010)
→ Compress the matrix by reducing its dimensionality (the number of feature dimensions)

- Feature selection: keep columns with high frequency & variance
  - measured by entropy, chi-squared test, ...
  - may select correlated (→ uninformative) dimensions
  - joint selection of multiple features is useful but expensive
- Projection into a (linear) subspace
  - principal component analysis (PCA)
  - independent component analysis (ICA)
  - random indexing (RI)
  → intuition: preserve distances between data points
Dimensionality reduction & latent dimensions

Landauer and Dumais (1997) claim that LSA dimensionality reduction (and the related PCA technique) uncovers latent dimensions by exploiting correlations between features.

- Example: term-term matrix
  - V-Obj co-occurrences extracted from the BNC
    - targets = noun lemmas
    - features = verb lemmas
  - feature scaling: association scores (modified log Dice coefficient)
  - k = 111 nouns with f ≥ 20 (must have non-zero row vectors)
  - n = 2 dimensions: buy and sell

noun          buy     sell
bond         0.28     0.77
cigarette   -0.52     0.44
dress        0.51    -1.30
freehold    -0.01    -0.08
land         1.13     1.54
number      -1.05    -1.02
per         -0.35    -0.16
pub         -0.08    -1.30
share        1.92     1.99
system      -1.63    -0.70
[Figure: the 111 nouns plotted in the two dimensions buy (x-axis) and sell (y-axis)]
Motivating latent dimensions & subspace projection

- The latent property of being a commodity is "expressed" through associations with several verbs: sell, buy, acquire, ...
- Consequence: these DSM dimensions will be correlated
- Identify a latent dimension by looking for strong correlations (or weaker correlations between large sets of features)
- Projection into a subspace V of k < n latent dimensions as a "noise reduction" technique → LSA
- Assumptions of this approach:
  - "latent" distances in V are semantically meaningful
  - other "residual" dimensions represent chance co-occurrence patterns, often particular to the corpus underlying the DSM
The latent "commodity" dimension

[Figure: the buy/sell scatter plot again, illustrating the latent "commodity" dimension]
Taxonomy of DSM parameters: Examples
Some well-known DSM examples

Latent Semantic Analysis (Landauer and Dumais 1997)
- term-context matrix with document context
- weighting: log term frequency and term entropy
- distance measure: cosine
- dimensionality reduction: SVD

Hyperspace Analogue to Language (Lund and Burgess 1996)
- term-term matrix with surface context
- structured (left/right) and distance-weighted frequency counts
- distance measure: Minkowski metric (1 ≤ p ≤ 2)
- dimensionality reduction: feature selection (high variance)
Some well-known DSM examples

Infomap NLP (Widdows 2004)
- term-term matrix with unstructured surface context
- weighting: none
- distance measure: cosine
- dimensionality reduction: SVD

Random Indexing (Karlgren and Sahlgren 2001)
- term-term matrix with unstructured surface context
- weighting: various methods
- distance measure: various methods
- dimensionality reduction: random indexing (RI)
Some well-known DSM examples

Dependency Vectors (Padó and Lapata 2007)
- term-term matrix with unstructured dependency context
- weighting: log-likelihood ratio
- distance measure: information-theoretic (Lin 1998b)
- dimensionality reduction: none

Distributional Memory (Baroni and Lenci 2010)
- term-term matrix with structured and unstructured dependencies + knowledge patterns
- weighting: local-MI on type frequencies of link patterns
- distance measure: cosine
- dimensionality reduction: none
DSM in practice: Using DSM distances
DSM in practice Using DSM distances
Nearest neighbours
DSM based on verb-object relations from BNC, reduced to 100 dim. with SVD

Neighbours of dog (cosine angle):
+ girl (45.5), boy (46.7), horse (47.0), wife (48.8), baby (51.9), daughter (53.1), side (54.9), mother (55.6), boat (55.7), rest (56.3), night (56.7), cat (56.8), son (57.0), man (58.2), place (58.4), husband (58.5), thing (58.8), friend (59.6), . . .

Neighbours of school:
+ country (49.3), church (52.1), hospital (53.1), house (54.4), hotel (55.1), industry (57.0), company (57.0), home (57.7), family (58.4), university (59.0), party (59.4), group (59.5), building (59.8), market (60.3), bank (60.4), business (60.9), area (61.4), department (61.6), club (62.7), town (63.3), library (63.3), room (63.6), service (64.4), police (64.7), . . .
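In principle, neighbour lists like these boil down to a few lines of linear algebra; a generic sketch (toy inputs, not the actual BNC model) that reports the cosine angle in degrees as above:

    # Hedged sketch: nearest neighbours by cosine angle in a reduced space.
    # `M` would be the SVD-reduced co-occurrence matrix, `words` its row labels.
    import numpy as np

    def nearest_neighbours(M, words, target, n=10):
        X = M / np.linalg.norm(M, axis=1, keepdims=True)   # normalise rows to unit length
        t = X[words.index(target)]
        angles = np.degrees(np.arccos(np.clip(X @ t, -1.0, 1.0)))
        order = np.argsort(angles)                         # smallest angle = nearest neighbour
        return [(words[i], round(float(angles[i]), 1)) for i in order if words[i] != target][:n]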
DSM Tutorial – Part 1 69 / 91
DSM in practice Using DSM distances
Nearest neighbours
[Figure: 2D visualisation of dog and its nearest neighbours, including girl, boy, horse, wife, baby, daughter, side, mother, boat, other, rest, night, cat, son, man, place, husband, thing, friend, bit, fish, woman, child, minute, animal, car, house, bird, people, head.]
DSM Tutorial – Part 1 70 / 91
DSM in practice Using DSM distances
Semantic maps
[Figure: "Semantic map (V-Obj from BNC)" - two-dimensional map of concrete nouns grouped by category (bird, groundAnimal, fruitTree, green, tool, vehicle), e.g. chicken, eagle, duck, swan, owl; dog, elephant, cow, cat, lion; cherry, banana, pear, pineapple; mushroom, corn, lettuce, potato, onion; bottle, pencil, cup, scissors, hammer, spoon; telephone, boat, car, ship, truck, helicopter.]
DSM Tutorial – Part 1 71 / 91
DSM in practice Using DSM distances
Clustering
[Figure: "Word space clustering of concrete nouns (V-Obj from BNC)" - hierarchical clustering dendrogram of the same concrete nouns; vertical axis labelled cluster size (0.0–1.2).]
DSM Tutorial – Part 1 72 / 91
DSM in practice Using DSM distances
Latent dimensions
[Figure: the same buy/sell latent-dimension scatter plot of ~90 nouns shown earlier (slide 63).]
DSM Tutorial – Part 1 73 / 91
DSM in practice Using DSM distances
Semantic similarity graph (topological structure)
[Figure: two neighbourhood graphs; left panel: tea, cup, drink, wine, whisky, soup, champagne, offer, beer, citron, living, toast, breakfast, salad, journey, meal, lunch, dinner, vodka, rum, coffee; right panel: head, arm, foot, finger, leg, eye, hand, place, ball, side, back, door, face, bit, car, light, minute, money, glass, horse, week, way, hour, time, paper, bag, house, line, part, look, chance.]
DSM Tutorial – Part 1 74 / 91
DSM in practice Using DSM distances
Context vectors (Schütze 1998)
I Distributional representation only at type level
+ What is the "average" meaning of mouse? (computer vs. animal)
I Context vector approximates meaning of individual token
I bag-of-words approach: centroid of all context words in the sentence

[Figure: context words cat, furry, chase, keyboard, computer, connect plotted between an "animals" and a "computers" direction; centroid context vectors shown for the token mouse in "connect the mouse and keyboard to the computer" and in "the furry cat chased the mouse".]
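A minimal sketch of the bag-of-words token vector, assuming a hypothetical `vectors` lookup from word types to NumPy arrays:

    # Hedged sketch of a token context vector as the centroid of type vectors.
    import numpy as np

    def token_vector(sentence_words, vectors):
        """Centroid of the type vectors of all context words in the sentence;
        words missing from the model are skipped."""
        vecs = [vectors[w] for w in sentence_words if w in vectors]
        return np.mean(vecs, axis=0) if vecs else None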
DSM Tutorial – Part 1 75 / 91
DSM in practice Quantitative evaluation
Outline
IntroductionThe distributional hypothesisThree famous DSM examples
Taxonomy of DSM parametersDefinition & overviewDSM parametersExamples
DSM in practiceUsing DSM distancesQuantitative evaluationSoftware and further information
DSM Tutorial – Part 1 76 / 91
DSM in practice Quantitative evaluation
The TOEFL synonym task
I The TOEFL dataset
I 80 items
I Target: levied
  Candidates: believed, correlated, imposed, requested
I Target: fashion
  Candidates: craze, fathom, manner, ration
I DSMs and TOEFL
1. take vectors of the target (t) and of the candidates (c1 . . . cn)
2. measure the distance between t and ci, with 1 ≤ i ≤ n
3. select the ci with the shortest distance in space from t
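Steps 1–3 can be sketched as follows, assuming a hypothetical `vectors` dict of word vectors and using cosine similarity (largest cosine = smallest angle, i.e. shortest distance on the unit sphere):

    # Hedged sketch of the TOEFL procedure above.
    import numpy as np

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    def solve_toefl_item(target, candidates, vectors):
        """Pick the candidate closest to the target in vector space."""
        return max(candidates, key=lambda c: cosine(vectors[target], vectors[c]))

    # solve_toefl_item("levied", ["believed", "correlated", "imposed", "requested"], vectors)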
DSM Tutorial – Part 1 77 / 91
DSM in practice Quantitative evaluation
Humans vs. machines on the TOEFL task
I Average foreign test taker: 64.5%
I Macquarie University staff (Rapp 2004):
  I Average of 5 non-natives: 86.75%
  I Average of 5 natives: 97.75%
I Distributional semantics
  I Classic LSA (Landauer and Dumais 1997): 64.4%
  I Padó and Lapata's (2007) dependency-based model: 73.0%
  I Distributional memory (Baroni and Lenci 2010): 76.9%
  I Rapp's (2004) SVD-based model, lemmatized BNC: 92.5%
  I Bullinaria and Levy (2012) carry out aggressive parameter optimization: 100.0%
DSM Tutorial – Part 1 78 / 91
DSM in practice Quantitative evaluation
Semantic similarity judgments
I Rubenstein and Goodenough (1965) collected similarity ratings for 65 noun pairs from 51 subjects on a 0–4 scale

  w1     w2          avg. rating
  car    automobile  3.9
  food   fruit       2.7
  cord   smile       0.0

I DSMs vs. Rubenstein & Goodenough
1. for each test pair (w1, w2), take vectors w1 and w2
2. measure the distance (e.g. cosine) between w1 and w2
3. measure the (Pearson) correlation between vector distances and R&G average judgments (Padó and Lapata 2007)
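A sketch of this evaluation protocol, assuming a hypothetical `vectors` dict; it uses cosine similarity and Pearson's r as in steps 1–3 (the plot on the next slide reports angular distance and |rho| instead):

    # Hedged sketch of correlating model similarities with human ratings.
    import numpy as np
    from scipy.stats import pearsonr

    def evaluate_similarity(pairs, ratings, vectors):
        sims = [vectors[w1] @ vectors[w2]
                / (np.linalg.norm(vectors[w1]) * np.linalg.norm(vectors[w2]))
                for w1, w2 in pairs]
        return pearsonr(sims, ratings)      # (correlation, p-value)

    # evaluate_similarity([("car", "automobile"), ("cord", "smile")], [3.9, 0.0], vectors)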
DSM Tutorial – Part 1 79 / 91
DSM in practice Quantitative evaluation
Semantic similarity judgments: example
[Figure: scatter plot "RG65: British National Corpus" - angular distance (y-axis, 30–100) against human rating (x-axis, 0–4); |rho| = 0.748, p = 0.0000, |r| = 0.623 .. 0.842]
DSM Tutorial – Part 1 80 / 91
DSM in practice Quantitative evaluation
Semantic similarity judgments: results
Results on RG65 task:
I Padó and Lapata's (2007) dependency-based model: 0.62
I Dependency-based on Web corpus (Herdağdelen et al. 2009)
  I without SVD reduction: 0.69
  I with SVD reduction: 0.80
I Distributional memory (Baroni and Lenci 2010): 0.82
I Salient Semantic Analysis (Hassan and Mihalcea 2011): 0.86
DSM Tutorial – Part 1 81 / 91
DSM in practice Software and further information
Outline
IntroductionThe distributional hypothesisThree famous DSM examples
Taxonomy of DSM parametersDefinition & overviewDSM parametersExamples
DSM in practiceUsing DSM distancesQuantitative evaluationSoftware and further information
DSM Tutorial – Part 1 82 / 91
DSM in practice Software and further information
Software packages
HiDEx            C++     re-implementation of the HAL model (Lund and Burgess 1996)
SemanticVectors  Java    scalable architecture based on random indexing representation
S-Space          Java    complex object-oriented framework
JoBimText        Java    UIMA / Hadoop framework
Gensim           Python  complex framework, focus on parallelization and out-of-core algorithms
DISSECT          Python  user-friendly, designed for research on compositional semantics
wordspace        R       interactive research laboratory, but scales to real-life data sets
DSM Tutorial – Part 1 83 / 91
DSM in practice Software and further information
Recent conferences and workshops
I 2007: CoSMo Workshop (at Context '07)
I 2008: ESSLLI Lexical Semantics Workshop & Shared Task, Special Issue of the Italian Journal of Linguistics
I 2009: GeMS Workshop (EACL 2009), DiSCo Workshop (CogSci 2009), ESSLLI Advanced Course on DSM
I 2010: 2nd GeMS (ACL 2010), ESSLLI Workshop on Compositionality and DSM, DSM Tutorial (NAACL 2010), Special Issue of JNLE on Distributional Lexical Semantics
I 2011: 2nd DiSCo (ACL 2011), 3rd GeMS (EMNLP 2011)
I 2012: DiDaS (at ICSC 2012)
I 2013: CVSC (ACL 2013), TFDS (IWCS 2013), Dagstuhl
I 2014: 2nd CVSC (at EACL 2014)
DSM Tutorial – Part 1 84 / 91
DSM in practice Software and further information
Further information
I Handouts & other materials available from the wordspace wiki at http://wordspace.collocations.de/
+ based on joint work with Marco Baroni and Alessandro Lenci
I Tutorial is open source (CC), and can be downloaded from http://r-forge.r-project.org/projects/wordspace/
I Review paper on distributional semantics:
  Turney, Peter D. and Pantel, Patrick (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37, 141–188.
I I should be working on the textbook Distributional Semantics for Synthesis Lectures on HLT (Morgan & Claypool)
DSM Tutorial – Part 1 85 / 91
DSM in practice Software and further information
References I
Baroni, Marco and Lenci, Alessandro (2010). Distributional Memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4), 673–712.
Bengio, Yoshua; Ducharme, Réjean; Vincent, Pascal; Jauvin, Christian (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137–1155.
Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Bullinaria, John A. and Levy, Joseph P. (2012). Extracting semantic representations from word co-occurrence statistics: Stop-lists, stemming and SVD. Behavior Research Methods, 44(3), 890–907.
Church, Kenneth W. and Hanks, Patrick (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29.
Dumais, S. T.; Furnas, G. W.; Landauer, T. K.; Deerwester, S.; Harshman, R. (1988). Using latent semantic analysis to improve access to textual information. In CHI '88: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 281–285.
Dunning, Ted E. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.
DSM Tutorial – Part 1 86 / 91
DSM in practice Software and further information
References II
Endres, Dominik M. and Schindelin, Johannes E. (2003). A new metric for probability distributions. IEEE Transactions on Information Theory, 49(7), 1858–1860.
Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. Dissertation, Institut für maschinelle Sprachverarbeitung, University of Stuttgart. Published in 2005, URN urn:nbn:de:bsz:93-opus-23714. Available from http://www.collocations.de/phd.html.
Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 58. Mouton de Gruyter, Berlin, New York.
Evert, Stefan (2010). Google Web 1T5 n-grams made easy (but not for the computer). In Proceedings of the 6th Web as Corpus Workshop (WAC-6), pages 32–40, Los Angeles, CA.
Firth, J. R. (1957). A synopsis of linguistic theory 1930–55. In Studies in linguistic analysis, pages 1–32. The Philological Society, Oxford. Reprinted in Palmer (1968), pages 168–205.
Grefenstette, Gregory (1994). Explorations in Automatic Thesaurus Discovery, volume 278 of Kluwer International Series in Engineering and Computer Science. Springer, Berlin, New York.
DSM Tutorial – Part 1 87 / 91
DSM in practice Software and further information
References III
Harris, Zellig (1954). Distributional structure. Word, 10(23), 146–162. Reprinted in Harris (1970, 775–794).
Hassan, Samer and Mihalcea, Rada (2011). Semantic relatedness using salient semantic analysis. In Proceedings of the Twenty-fifth AAAI Conference on Artificial Intelligence.
Herdağdelen, Amaç; Erk, Katrin; Baroni, Marco (2009). Measuring semantic relatedness with vector space models and random walks. In Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing (TextGraphs-4), pages 50–53, Suntec, Singapore.
Hofmann, Thomas (1999). Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI '99).
Karlgren, Jussi and Sahlgren, Magnus (2001). From words to understanding. In Y. Uesaka, P. Kanerva, and H. Asoh (eds.), Foundations of Real-World Intelligence, pages 294–308. CSLI Publications, Stanford.
Landauer, Thomas K. and Dumais, Susan T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104(2), 211–240.
DSM Tutorial – Part 1 88 / 91
DSM in practice Software and further information
References IV
Li, Ping; Burgess, Curt; Lund, Kevin (2000). The acquisition of word meaning through global lexical co-occurrences. In E. V. Clark (ed.), The Proceedings of the Thirtieth Annual Child Language Research Forum, pages 167–178. Stanford Linguistics Association.
Lin, Dekang (1998a). Automatic retrieval and clustering of similar words. In Proceedings of the 17th International Conference on Computational Linguistics (COLING-ACL 1998), pages 768–774, Montreal, Canada.
Lin, Dekang (1998b). An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning (ICML-98), pages 296–304, Madison, WI.
Lund, Kevin and Burgess, Curt (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), 203–208.
Miller, George A. (1986). Dictionaries in the mind. Language and Cognitive Processes, 1, 171–185.
Padó, Sebastian and Lapata, Mirella (2007). Dependency-based construction of semantic space models. Computational Linguistics, 33(2), 161–199.
DSM Tutorial – Part 1 89 / 91
DSM in practice Software and further information
References V
Pantel, Patrick and Lin, Dekang (2000). An unsupervised approach to prepositional phrase attachment using contextually similar words. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, China.
Pantel, Patrick; Crestan, Eric; Borkovsky, Arkady; Popescu, Ana-Maria; Vyas, Vishnu (2009). Web-scale distributional similarity and entity set expansion. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 938–947, Singapore.
Rapp, Reinhard (2004). A freely available automatically generated thesaurus of related words. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), pages 395–398.
Rooth, Mats; Riezler, Stefan; Prescher, Detlef; Carroll, Glenn; Beil, Franz (1999). Inducing a semantically annotated lexicon via EM-based clustering. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 104–111.
Rubenstein, Herbert and Goodenough, John B. (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10), 627–633.
Schütze, Hinrich (1992). Dimensions of meaning. In Proceedings of Supercomputing '92, pages 787–796, Minneapolis, MN.
DSM Tutorial – Part 1 90 / 91
DSM in practice Software and further information
References VI
Schütze, Hinrich (1993). Word space. In Proceedings of Advances in Neural Information Processing Systems 5, pages 895–902, San Mateo, CA.
Schütze, Hinrich (1995). Distributional part-of-speech tagging. In Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics (EACL 1995), pages 141–148.
Schütze, Hinrich (1998). Automatic word sense discrimination. Computational Linguistics, 24(1), 97–123.
Turney, Peter D. and Pantel, Patrick (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37, 141–188.
Turney, Peter D.; Littman, Michael L.; Bigham, Jeffrey; Shnayder, Victor (2003). Combining independent modules to solve multiple-choice synonym and analogy problems. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-03), pages 482–489, Borovets, Bulgaria.
Widdows, Dominic (2004). Geometry and Meaning. Number 172 in CSLI Lecture Notes. CSLI Publications, Stanford.
DSM Tutorial – Part 1 91 / 91