Distributional Semantic Models - Part 1:...
Distributional Semantic Models
CS 114, James Pustejovsky
slides by Stefan Evert
Outline

Introduction
  The distributional hypothesis
  Three famous DSM examples

Taxonomy of DSM parameters
  Definition & overview
  DSM parameters
  Examples

DSM in practice
  Using DSM distances
  Quantitative evaluation
  Software and further information
Introduction: The distributional hypothesis
Meaning & distribution
- "The meaning of a word lies in its use." ("Die Bedeutung eines Wortes liegt in seinem Gebrauch.") — Ludwig Wittgenstein
- "You shall know a word by the company it keeps!" — J. R. Firth (1957)
- Distributional hypothesis: difference of meaning correlates with difference of distribution (Zellig Harris 1954)
- "What people know when they say that they know a word is not how to recite its dictionary definition – they know how to use it [...] in everyday discourse." (Miller 1986)
What is the meaning of "bardiwac"?

- He handed her her glass of bardiwac.
- Beef dishes are made to complement the bardiwacs.
- Nigel staggered to his feet, face flushed from too much bardiwac.
- Malbec, one of the lesser-known bardiwac grapes, responds well to Australia's sunshine.
- I dined off bread and cheese and this excellent bardiwac.
- The drinks were delicious: blood-red bardiwac as well as light, sweet Rhenish.

→ bardiwac is a heavy red alcoholic beverage made from grapes

The examples above are handpicked, of course. But in a corpus like the BNC, you will find at least as many informative sentences.
A thought experiment: deciphering hieroglyphs

                get   sij   ius   hir   iit   kil
(knife)  naif    51    20    84     0     3     0
(cat)    ket     52    58     4     4     6    26
???      dog    115    83    10    42    33    17
(boat)   beut    59    39    23     4     0     0
(cup)    kap     98    14     6     2     1     0
(pig)    pigij   12    17     3     2     9    27
(banana) nana    11     2     2     0    18     0
sim(dog, naif) = 0.770
sim(dog, pigij) = 0.939
sim(dog, ket) = 0.961
English as seen by the computer . . .

                get   see   use   hear  eat   kill
knife    naif    51    20    84     0     3     0
cat      ket     52    58     4     4     6    26
dog      dog    115    83    10    42    33    17
boat     beut    59    39    23     4     0     0
cup      kap     98    14     6     2     1     0
pig      pigij   12    17     3     2     9    27
banana   nana    11     2     2     0    18     0

verb-object counts from the British National Corpus
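As a concrete illustration (not part of the original slides), the following Python sketch builds the toy verb-object matrix above with numpy and compares row vectors by cosine similarity; the sim() scores quoted on the hieroglyph slide may be based on transformed counts, so the exact values need not coincide.

```python
# A minimal sketch: cosine similarity on the toy verb-object matrix above.
import numpy as np

verbs = ["get", "see", "use", "hear", "eat", "kill"]
nouns = ["knife", "cat", "dog", "boat", "cup", "pig", "banana"]
M = np.array([
    [ 51, 20, 84,  0,  3,  0],   # knife
    [ 52, 58,  4,  4,  6, 26],   # cat
    [115, 83, 10, 42, 33, 17],   # dog
    [ 59, 39, 23,  4,  0,  0],   # boat
    [ 98, 14,  6,  2,  1,  0],   # cup
    [ 12, 17,  3,  2,  9, 27],   # pig
    [ 11,  2,  2,  0, 18,  0],   # banana
], dtype=float)

def cosine(u, v):
    """Cosine similarity <u, v> / (||u|| * ||v||)."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

dog = M[nouns.index("dog")]
for noun in nouns:
    print(f"cos(dog, {noun}) = {cosine(dog, M[nouns.index(noun)]):.3f}")
```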
Geometric interpretation

- row vector x_dog describes the usage of the word dog in the corpus
- can be seen as the coordinates of a point in n-dimensional Euclidean space
- illustrated for two dimensions: get and use
- x_dog = (115, 10)

[Plot: "Two dimensions of English V-Obj DSM": knife, cat, dog and boat plotted by their get (x-axis) and use (y-axis) coordinates from the co-occurrence matrix M above]
Geometric interpretation (continued)

- similarity = spatial proximity (Euclidean distance)
- location depends on the frequency of the noun (f_dog ≈ 2.7 · f_cat)
- direction is more important than location
- normalise the "length" ||x_dog|| of the vector
- or use the angle α as a distance measure

[Plot: the same two-dimensional V-Obj space, showing Euclidean distances d = 63.3 and d = 57.5 between points, the vectors rescaled to unit length, and the angle α = 54.3° between two of them]
Semantic distances

- the main result of distributional analysis is a set of "semantic" distances between words
- typical applications
  - nearest neighbours
  - clustering of related words
  - constructing a semantic map
- other applications require clever use of the distance information
  - semantic relations
  - relational analogies
  - word sense disambiguation
  - detection of multiword expressions
[Figure: Word space clustering of concrete nouns (V-Obj from BNC): hierarchical clustering dendrogram, y-axis: cluster size]
[Figure: Semantic map (V-Obj from BNC): concrete nouns plotted in two dimensions, grouped into the categories bird, groundAnimal, fruitTree, green, tool, vehicle]
Some applications in computational linguistics

- Unsupervised part-of-speech induction (Schütze 1995)
- Word sense disambiguation (Schütze 1998)
- Query expansion in information retrieval (Grefenstette 1994)
- Synonym tasks & other language tests (Landauer and Dumais 1997; Turney et al. 2003)
- Thesaurus compilation (Lin 1998a; Rapp 2004)
- Ontology & wordnet expansion (Pantel et al. 2009)
- Attachment disambiguation (Pantel and Lin 2000)
- Probabilistic language models (Bengio et al. 2003)
- Subsymbolic input representation for neural networks
- Many other tasks in computational semantics: entailment detection, noun compound interpretation, identification of non-compositional expressions, ...
Introduction: Three famous DSM examples
Latent Semantic Analysis (Landauer and Dumais 1997)

- Corpus: 30,473 articles from Grolier's Academic American Encyclopedia (4.6 million words in total)
  → articles were limited to their first 2,000 characters
- Word-article frequency matrix for 60,768 words
  - row vector shows the frequency of a word in each article
- Logarithmic frequencies scaled by word entropy
- Reduced to 300 dimensions by singular value decomposition (SVD)
  - borrowed from LSI (Dumais et al. 1988)
  → central claim: SVD reveals latent semantic features, not just a data reduction technique
- Evaluated on the TOEFL synonym test (80 items)
  - LSA model achieved 64.4% correct answers
  - also a simulation of learning rate based on TOEFL results
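A minimal sketch of the LSA recipe described above, assuming a tiny invented term-document count matrix: log-transform the counts, weight rows by a term-entropy factor, and keep a few latent dimensions of the SVD. It illustrates the pipeline only; it is not the original Landauer & Dumais setup.

```python
# LSA-style sketch: log counts, entropy weighting, truncated SVD (illustrative).
import numpy as np

# toy term-document counts (rows = terms, columns = documents)
counts = np.array([
    [3, 0, 1, 0],
    [2, 1, 0, 0],
    [0, 4, 2, 1],
    [0, 1, 3, 2],
], dtype=float)

log_counts = np.log(counts + 1.0)                     # logarithmic frequencies

p = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(p > 0, p * np.log(p), 0.0)
entropy_weight = 1.0 + plogp.sum(axis=1) / np.log(counts.shape[1])

X = log_counts * entropy_weight[:, np.newaxis]        # weighted matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)      # SVD: X = U diag(s) Vt
k = 2                                                  # keep k latent dimensions
term_vectors = U[:, :k] * s[:k]                        # reduced term representations
print(term_vectors)
```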
Word Space (Schütze 1992, 1993, 1998)

- Corpus: ≈ 60 million words of news messages from the New York Times News Service
- Word-word co-occurrence matrix
  - 20,000 target words & 2,000 context words as features
  - row vector records how often each context word occurs close to the target word (co-occurrence)
  - co-occurrence window: 50 words to the left and right (Schütze 1998) or ≈ 1000 characters (Schütze 1992)
- Rows weighted by inverse document frequency (tf.idf)
- Context vector = centroid of word vectors (bag-of-words)
  → goal: determine the "meaning" of a context
- Reduced to 100 SVD dimensions (mainly for efficiency)
- Evaluated on unsupervised word sense induction by clustering of context vectors (for an ambiguous word)
  - induced word senses improve information retrieval performance
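A small sketch of the "context vector = centroid of word vectors" step, with an invented word_vectors table standing in for the rows of a real DSM matrix.

```python
# Context vector as the centroid of the vectors of the words in the context.
import numpy as np

word_vectors = {
    "bank":  np.array([0.2, 0.9, 0.1]),
    "money": np.array([0.1, 0.8, 0.0]),
    "river": np.array([0.9, 0.1, 0.2]),
    "water": np.array([0.8, 0.2, 0.3]),
}

def context_vector(tokens, vectors):
    """Centroid (bag-of-words average) of the vectors of known tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else None

print(context_vector(["money", "bank"], word_vectors))
print(context_vector(["river", "bank", "water"], word_vectors))
```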
HAL (Lund and Burgess 1996)

- HAL = Hyperspace Analogue to Language
- Corpus: 160 million words from newsgroup postings
- Word-word co-occurrence matrix
  - the same 70,000 words used as targets and features
  - co-occurrence window of 1 – 10 words
- Separate counts for left and right co-occurrence, i.e. the context is structured
- In later work, co-occurrences are weighted by (inverse) distance (Li et al. 2000)
- Applications include the construction of semantic vocabulary maps by multidimensional scaling to 2 dimensions
Many parameters . . .

- Enormous range of DSM parameters and applications
- The examples showed three entirely different models, each tuned to its particular application
→ Need an overview of DSM parameters & an understanding of their effects
Taxonomy of DSM parameters: Definition & overview
General definition of DSMs

A distributional semantic model (DSM) is a scaled and/or transformed co-occurrence matrix M, such that each row x represents the distribution of a target term across contexts.

           get      see      use      hear     eat      kill
knife     0.027   -0.024    0.206   -0.022   -0.044   -0.042
cat       0.031    0.143   -0.243   -0.015   -0.009    0.131
dog      -0.026    0.021   -0.212    0.064    0.013    0.014
boat     -0.022    0.009   -0.044   -0.040   -0.074   -0.042
cup      -0.014   -0.173   -0.249   -0.099   -0.119   -0.042
pig      -0.069    0.094   -0.158    0.000    0.094    0.265
banana    0.047   -0.139   -0.104   -0.022    0.267   -0.042

Term = word, lemma, phrase, morpheme, word pair, ...
General definition of DSMs

Mathematical notation:

- k × n co-occurrence matrix M (example: 7 × 6 matrix)
  - k rows = target terms
  - n columns = features or dimensions

  M = \begin{pmatrix}
        m_{11} & m_{12} & \cdots & m_{1n} \\
        m_{21} & m_{22} & \cdots & m_{2n} \\
        \vdots &        &        & \vdots \\
        m_{k1} & m_{k2} & \cdots & m_{kn}
      \end{pmatrix}

- distribution vector m_i = i-th row of M, e.g. m_3 = m_dog
- components of m_i = (m_{i1}, m_{i2}, ..., m_{in}) = features of the i-th term:

  m_3 = (-0.026, 0.021, -0.212, 0.064, 0.013, 0.014)
      = (m_{31}, m_{32}, m_{33}, m_{34}, m_{35}, m_{36})
Overview of DSM parameters

Term-context vs. term-term matrix
  ⇓
Definition of terms & linguistic pre-processing
  ⇓
Size & type of context
  ⇓
Geometric vs. probabilistic interpretation
  ⇓
Feature scaling
  ⇓
Normalisation of rows and/or columns
  ⇓
Similarity / distance measure
  ⇓
Dimensionality reduction
Taxonomy of DSM parameters: DSM parameters
Term-context matrix

A term-context matrix records the frequency of each term in each individual context (e.g. sentence, document, Web page, encyclopaedia article):

  F = \begin{pmatrix}
        \cdots & f_1 & \cdots \\
        \cdots & f_2 & \cdots \\
               & \vdots &     \\
        \cdots & f_k & \cdots
      \end{pmatrix}

          Felidae   Pet   Feral   Bloat   Philosophy   Kant   Back pain
cat            10    10       7       –            –      –           –
dog             –    10       4      11            –      –           –
animal          2    15      10       2            –      –           –
time            1     –       –       –            2      1           –
reason          –     1       –       –            1      4           1
cause           –     –       –       2            1      2           6
effect          –     –       –       1            –      1           –
Term-context matrix

Some footnotes:
- Features are usually context tokens, i.e. individual instances
- Can also be generalised to context types, e.g.
  - bag of content words
  - specific pattern of POS tags
  - n-gram of words (or POS tags) around the target
  - subcategorisation pattern of the target verb
- The term-context matrix is often very sparse
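To illustrate how such a sparse matrix is built and stored in practice, here is a minimal sketch (with invented mini-documents) that assembles a term-context matrix as a scipy sparse matrix.

```python
# Building a sparse term-context (term-document) matrix (illustrative).
from collections import defaultdict
from scipy.sparse import csr_matrix

documents = [
    "the cat chased the dog",
    "the dog chased the cat",
    "cause and effect in philosophy",
]

vocab = {}
rows, cols, vals = [], [], []
for j, doc in enumerate(documents):
    freq = defaultdict(int)
    for token in doc.split():
        freq[token] += 1
    for token, f in freq.items():
        i = vocab.setdefault(token, len(vocab))
        rows.append(i)
        cols.append(j)
        vals.append(f)

F = csr_matrix((vals, (rows, cols)), shape=(len(vocab), len(documents)))
print(F.toarray())        # rows = terms, columns = contexts (documents)
```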
Term-term matrix

A term-term matrix records co-occurrence frequencies with feature terms for each target term:

  M = \begin{pmatrix}
        \cdots & m_1 & \cdots \\
        \cdots & m_2 & \cdots \\
               & \vdots &     \\
        \cdots & m_k & \cdots
      \end{pmatrix}

          breed   tail   feed   kill   important   explain   likely
cat          83     17      7     37           –         1        –
dog         561     13     30     60           1         2        4
animal       42     10    109    134          13         5        5
time         19      9     29    117          81        34      109
reason        1      –      2     14          68       140       47
cause         –      1      –      4          55        34       55
effect        –      –      1      6          60        35       17

→ we will usually assume a term-term matrix
Term-term matrix

Some footnotes:
- Often target terms ≠ feature terms
  - e.g. nouns described by co-occurrences with verbs as features
  - identical sets of target & feature terms → symmetric matrix
- Different types of contexts (Evert 2008)
  - surface context (word or character window)
  - textual context (non-overlapping segments)
  - syntactic context (specific syntagmatic relation)
- Can be seen as a smoothing of the term-context matrix
  - average over similar contexts (with the same context terms)
  - data sparseness reduced, except for small windows
  - we will take a closer look at the relation between term-context and term-term models later in this tutorial
Corpus pre-processing

- Minimally, the corpus must be tokenised → identify terms
- Linguistic annotation
  - part-of-speech tagging
  - lemmatisation / stemming
  - word sense disambiguation (rare)
  - shallow syntactic patterns
  - dependency parsing
- Generalisation of terms
  - often lemmatised to reduce data sparseness: go, goes, went, gone, going → go
  - POS disambiguation (light/N vs. light/A vs. light/V)
  - word sense disambiguation (bank_river vs. bank_finance)
- Trade-off between deeper linguistic analysis and
  - the need for language-specific resources
  - possible errors introduced at each stage of the analysis
Effects of pre-processing

Nearest neighbours of walk (BNC):

word forms: stroll, walking, walked, go, path, drive, ride, wander, sprinted, sauntered
lemmatised corpus: hurry, stroll, stride, trudge, amble, wander, walk-nn, walking, retrace, scuttle
Effects of pre-processing

Nearest neighbours of arrivare (Repubblica):

word forms: giungere, raggiungere, arrivi, raggiungimento, raggiunto, trovare, raggiunge, arrivasse, arriverà, concludere
lemmatised corpus: giungere, aspettare, attendere, arrivo-nn, ricevere, accontentare, approdare, pervenire, venire, piombare
Surface context

Context term occurs within a window of k words around the target.

"The silhouette of the sun beyond a wide-open bay on the lake; the sun still glitters although evening has arrived in Kuhmo. It's midsummer; the living room has its instruments and other objects in each of its corners."

Parameters:
- window size (in words or characters)
- symmetric vs. one-sided window
- uniform or "triangular" (distance-based) weighting
- window clamped to sentences or other textual units?
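A minimal sketch of surface co-occurrence counting under these parameters, assuming a symmetric window of k words and an optional triangular (1/distance) weighting; the token list is just part of the example sentence above.

```python
# Counting surface co-occurrences within a symmetric window (illustrative).
from collections import defaultdict

def window_cooccurrences(tokens, k=2, triangular=False):
    counts = defaultdict(float)
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - k), min(len(tokens), i + k + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            weight = 1.0 / abs(i - j) if triangular else 1.0
            counts[(target, tokens[j])] += weight
    return counts

tokens = "the sun still glitters although evening has arrived".split()
for pair, n in sorted(window_cooccurrences(tokens, k=2).items()):
    print(pair, n)
```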
Effect of different window sizes

Nearest neighbours of dog (BNC):

2-word window: cat, horse, fox, pet, rabbit, pig, animal, mongrel, sheep, pigeon
30-word window: kennel, puppy, pet, bitch, terrier, rottweiler, canine, cat, to bark, Alsatian
Textual context

Context term is in the same linguistic unit as the target.

"The silhouette of the sun beyond a wide-open bay on the lake; the sun still glitters although evening has arrived in Kuhmo. It's midsummer; the living room has its instruments and other objects in each of its corners."

Parameters:
- type of linguistic unit
  - sentence
  - paragraph
  - turn in a conversation
  - Web page
Syntactic context

Context term is linked to the target by a syntactic dependency (e.g. subject, modifier, ...).

"The silhouette of the sun beyond a wide-open bay on the lake; the sun still glitters although evening has arrived in Kuhmo. It's midsummer; the living room has its instruments and other objects in each of its corners."

Parameters:
- types of syntactic dependency (Padó and Lapata 2007)
- direct vs. indirect dependency paths
  - direct dependencies
  - direct + indirect dependencies
- homogeneous data (e.g. only verb-object) vs. heterogeneous data (e.g. all children and parents of the verb)
- maximal length of the dependency path

© Evert/Baroni/Lenci (CC-by-sa)
"Knowledge pattern" context

Context term is linked to the target by a lexico-syntactic pattern (text mining, cf. Hearst 1992, Pantel & Pennacchiotti 2008, etc.).

"In Provence, Van Gogh painted with bright colors such as red and yellow. These colors produce incredible effects on anybody looking at his paintings."

Parameters:
- inventory of lexical patterns
  - lots of research to identify semantically interesting patterns (cf. Almuhareb & Poesio 2004, Veale & Hao 2008, etc.)
- fixed vs. flexible patterns
  - patterns are mined from large corpora and automatically generalised (optional elements, POS tags or semantic classes)
Structured vs. unstructured context

- In unstructured models, the context specification acts as a filter
  - it determines whether a context token counts as a co-occurrence
  - e.g. linked by a specific syntactic relation such as verb-object
- In structured models, context words are subtyped
  - depending on their position in the context
  - e.g. left vs. right context, type of syntactic relation, etc.
Structured vs. unstructured surface context

A dog bites a man. The man's dog bites a dog. A dog bites a man.

unstructured          bite
         dog             4
         man             3

structured         bite-l   bite-r
         dog            3        1
         man            1        2
Structured vs. unstructured dependency context

A dog bites a man. The man's dog bites a dog. A dog bites a man.

unstructured          bite
         dog             4
         man             2

structured       bite-subj   bite-obj
         dog             3          1
         man             0          2
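The following sketch reproduces the unstructured and structured surface counts of the toy example above, assuming (as the table suggests) that bite-l / bite-r mean the noun occurs to the left / right of the verb within the same sentence.

```python
# Unstructured vs. structured surface counts for the toy "dog bites man" text.
from collections import Counter

sentences = [
    "a dog bites a man",
    "the man 's dog bites a dog",
    "a dog bites a man",
]

unstructured = Counter()
structured = Counter()
for sentence in sentences:
    tokens = sentence.split()
    v = tokens.index("bites")
    for i, tok in enumerate(tokens):
        if tok in ("dog", "man"):
            unstructured[(tok, "bite")] += 1
            side = "bite-l" if i < v else "bite-r"
            structured[(tok, side)] += 1

print(unstructured)   # dog/bite: 4, man/bite: 3
print(structured)     # dog: bite-l 3, bite-r 1; man: bite-l 1, bite-r 2
```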
Comparison

- Unstructured context
  - data less sparse (e.g. man kills and kills man both map to the kill dimension of the vector x_man)
- Structured context
  - more sensitive to semantic distinctions (kill-subj and kill-obj are rather different things!)
  - dependency relations provide a form of syntactic "typing" of the DSM dimensions (the "subject" dimensions, the "recipient" dimensions, etc.)
  - important to account for word order and compositionality
Geometric vs. probabilistic interpretation

- Geometric interpretation
  - row vectors as points or arrows in n-dimensional space
  - very intuitive, good for visualisation
  - use techniques from geometry and linear algebra
- Probabilistic interpretation
  - co-occurrence matrix as observed sample statistic
  - "explained" by a generative probabilistic model
  - recent work focuses on hierarchical Bayesian models
  - probabilistic LSA (Hofmann 1999), Latent Semantic Clustering (Rooth et al. 1999), Latent Dirichlet Allocation (Blei et al. 2003), etc.
  - explicitly accounts for random variation of frequency counts
  - intuitive and plausible as a topic model
→ focus on geometric interpretation in this tutorial
Feature scaling

Feature scaling is used to "discount" less important features:

- Logarithmic scaling: x' = log(x + 1) (cf. Weber-Fechner law for human perception)
- Relevance weighting, e.g. tf.idf (information retrieval)
- Statistical association measures (Evert 2004, 2008) take the frequency of the target word and the context feature into account
  - the less frequent the target word and (more importantly) the context feature are, the higher the weight given to their observed co-occurrence count should be (because their expected chance co-occurrence frequency is low)
  - different measures – e.g., mutual information, log-likelihood ratio – differ in how they balance observed and expected co-occurrence frequencies
Association measures: Mutual Information (MI)

word1   word2           f_obs     f_1       f_2
dog     small             855    33,338   490,580
dog     domesticated       29    33,338       918

Expected co-occurrence frequency:

  f_{exp} = \frac{f_1 \cdot f_2}{N}

Mutual Information compares observed vs. expected frequency:

  MI(w_1, w_2) = \log_2 \frac{f_{obs}}{f_{exp}} = \log_2 \frac{N \cdot f_{obs}}{f_1 \cdot f_2}

Disadvantage: MI overrates combinations of rare terms.
Other association measures

word1   word2           f_obs     f_exp      MI     local-MI   t-score
dog     small             855    134.34      2.67    2282.88     24.64
dog     domesticated       29      0.25      6.85     198.76      5.34
dog     sgjkj               1      0.00027  11.85      11.85      1.00

The log-likelihood ratio (Dunning 1993) has a more complex form, but its "core" is known as local MI (Evert 2004):

  local-MI(w_1, w_2) = f_{obs} \cdot MI(w_1, w_2)

The t-score measure (Church and Hanks 1990) is popular in lexicography:

  t-score(w_1, w_2) = \frac{f_{obs} - f_{exp}}{\sqrt{f_{obs}}}

Details & many more measures: http://www.collocations.de/
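The scores in the table can be recomputed directly from f_obs and f_exp; the sketch below does so for the three example pairs (the corpus size N itself is not needed once f_exp is given).

```python
# Recomputing MI, local-MI and t-score from observed and expected frequencies.
import math

pairs = [
    ("dog", "small",         855, 134.34),
    ("dog", "domesticated",   29,   0.25),
    ("dog", "sgjkj",           1,   0.00027),
]

for w1, w2, f_obs, f_exp in pairs:
    mi = math.log2(f_obs / f_exp)             # MI = log2(f_obs / f_exp)
    local_mi = f_obs * mi                     # "core" of the log-likelihood ratio
    t_score = (f_obs - f_exp) / math.sqrt(f_obs)
    print(f"{w1}/{w2}: MI={mi:.2f}  local-MI={local_mi:.2f}  t={t_score:.2f}")
```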
Normalisation of row vectors

- geometric distances only make sense if the vectors are normalised to unit length
- divide each vector by its length: x / ||x||
- normalisation depends on the distance measure!
- special case: scale to relative frequencies with ||x||_1 = |x_1| + ... + |x_n| → probabilistic interpretation

[Plot: two dimensions of the English V-Obj DSM with the vectors for knife, cat, dog and boat normalised to unit length]
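A small sketch of both normalisation variants, using two rows of the toy V-Obj matrix: dividing by the Euclidean length gives unit vectors, dividing by the L1 norm gives relative frequencies.

```python
# Row normalisation: L2 (unit length) and L1 (relative frequencies).
import numpy as np

M = np.array([[115., 83., 10., 42., 33., 17.],    # dog
              [ 52., 58.,  4.,  4.,  6., 26.]])   # cat

l2_normalised = M / np.linalg.norm(M, axis=1, keepdims=True)
l1_normalised = M / np.abs(M).sum(axis=1, keepdims=True)

print(np.linalg.norm(l2_normalised, axis=1))   # all 1.0
print(l1_normalised.sum(axis=1))               # all 1.0 (relative frequencies)
```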
Scaling of column vectors

- In statistical analysis and machine learning, features are usually centred and scaled so that mean μ = 0 and variance σ² = 1
- In DSM research, this step is less common for the columns of M
  - centring is a prerequisite for certain dimensionality reduction and data analysis techniques (esp. PCA)
  - scaling may give too much weight to rare features
- M cannot be row-normalised and column-scaled at the same time (the result depends on the ordering of the two steps)
Geometric distance

- Distance between vectors u, v ∈ R^n → (dis)similarity
  - u = (u_1, ..., u_n)
  - v = (v_1, ..., v_n)
- Euclidean distance d_2(u, v)
- "City block" Manhattan distance d_1(u, v)
- Both are special cases of the Minkowski p-distance d_p(u, v) (for p ∈ [1, ∞])

  d_2(u, v) := \sqrt{(u_1 - v_1)^2 + \cdots + (u_n - v_n)^2}
  d_1(u, v) := |u_1 - v_1| + \cdots + |u_n - v_n|
  d_p(u, v) := (|u_1 - v_1|^p + \cdots + |u_n - v_n|^p)^{1/p}
  d_∞(u, v)  = \max\{|u_1 - v_1|, ..., |u_n - v_n|\}

[Plot: vectors u and v in two dimensions with d_2(u, v) = 3.6 and d_1(u, v) = 5]
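A sketch of the Minkowski p-distance defined above; the toy vectors are chosen so that the Euclidean and Manhattan distances match the d_2 = 3.6 and d_1 = 5 shown in the plot (the exact coordinates in the plot are not given, so these are illustrative).

```python
# Minkowski p-distance: p=2 Euclidean, p=1 Manhattan, p=inf maximum difference.
import numpy as np

def minkowski(u, v, p=2.0):
    diff = np.abs(np.asarray(u, dtype=float) - np.asarray(v, dtype=float))
    if np.isinf(p):
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)

u, v = [2.0, 5.0], [5.0, 3.0]           # toy vectors
print(minkowski(u, v, 2))                # 3.605...  (d_2 = 3.6)
print(minkowski(u, v, 1))                # 5.0       (d_1 = 5)
print(minkowski(u, v, np.inf))           # 3.0
```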
Other distance measures

- Information theory: Kullback-Leibler (KL) divergence for probability vectors (non-negative, ||x||_1 = 1):

  D(u || v) = \sum_{i=1}^{n} u_i \cdot \log_2 \frac{u_i}{v_i}

- Properties of the KL divergence
  - most appropriate in a probabilistic interpretation of M
  - zeroes in v without corresponding zeroes in u are problematic
  - not symmetric, unlike geometric distance measures
  - alternatives: skew divergence, Jensen-Shannon divergence
- A symmetric distance measure (Endres and Schindelin 2003):

  D_{uv} = D(u || z) + D(v || z)   with   z = \frac{u + v}{2}
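A sketch of the KL divergence and of the symmetric Endres-Schindelin style measure for probability vectors; the example vectors are invented.

```python
# KL divergence and a symmetric variant for probability vectors (illustrative).
import numpy as np

def kl(u, v):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    mask = u > 0                        # terms with u_i = 0 contribute nothing
    return np.sum(u[mask] * np.log2(u[mask] / v[mask]))

def symmetric_kl(u, v):
    z = (np.asarray(u, dtype=float) + np.asarray(v, dtype=float)) / 2.0
    return kl(u, z) + kl(v, z)          # symmetric, finite whenever u or v > 0

u = [0.7, 0.2, 0.1]
v = [0.1, 0.6, 0.3]
print(kl(u, v), kl(v, u))               # not symmetric
print(symmetric_kl(u, v))
```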
Similarity measures

- the angle α between two vectors u, v is given by

  \cos \alpha = \frac{\sum_{i=1}^{n} u_i \cdot v_i}{\sqrt{\sum_i u_i^2} \cdot \sqrt{\sum_i v_i^2}} = \frac{\langle u, v \rangle}{||u||_2 \cdot ||v||_2}

- cosine measure of similarity: cos α
  - cos α = 1 → collinear
  - cos α = 0 → orthogonal

[Plot: two dimensions of the English V-Obj DSM with the angle α = 54.3° between two normalised vectors]
Dimensionality reduction = model compression

- The co-occurrence matrix M is often unmanageably large and can be extremely sparse
  - Google Web1T5: 1M × 1M matrix with one trillion cells, of which less than 0.05% contain nonzero counts (Evert 2010)
→ Compress the matrix by reducing its dimensionality (the number of feature dimensions)

- Feature selection: keep columns with high frequency & variance
  - measured by entropy, chi-squared test, ...
  - may select correlated (→ uninformative) dimensions
  - joint selection of multiple features is useful but expensive
- Projection into a (linear) subspace
  - principal component analysis (PCA)
  - independent component analysis (ICA)
  - random indexing (RI)
  → intuition: preserve distances between data points
Dimensionality reduction & latent dimensions

Landauer and Dumais (1997) claim that LSA dimensionality reduction (and the related PCA technique) uncovers latent dimensions by exploiting correlations between features.

- Example: term-term matrix
  - V-Obj co-occurrences extracted from the BNC
    - targets = noun lemmas
    - features = verb lemmas
  - feature scaling: association scores (modified log Dice coefficient)
  - k = 111 nouns with f ≥ 20 (must have non-zero row vectors)
  - n = 2 dimensions: buy and sell

noun          buy     sell
bond         0.28     0.77
cigarette   -0.52     0.44
dress        0.51    -1.30
freehold    -0.01    -0.08
land         1.13     1.54
number      -1.05    -1.02
per         -0.35    -0.16
pub         -0.08    -1.30
share        1.92     1.99
system      -1.63    -0.70
[Figure: the 111 nouns plotted in the two dimensions buy (x-axis) and sell (y-axis)]
Motivating latent dimensions & subspace projection

- The latent property of being a commodity is "expressed" through associations with several verbs: sell, buy, acquire, ...
- Consequence: these DSM dimensions will be correlated
- Identify a latent dimension by looking for strong correlations (or weaker correlations between large sets of features)
- Projection into a subspace V of k < n latent dimensions as a "noise reduction" technique → LSA
- Assumptions of this approach:
  - "latent" distances in V are semantically meaningful
  - other "residual" dimensions represent chance co-occurrence patterns, often particular to the corpus underlying the DSM
The latent "commodity" dimension

[Figure: the buy/sell scatter plot again, illustrating the latent "commodity" dimension]
Taxonomy of DSM parameters: Examples
Some well-known DSM examples

Latent Semantic Analysis (Landauer and Dumais 1997)
- term-context matrix with document context
- weighting: log term frequency and term entropy
- distance measure: cosine
- dimensionality reduction: SVD

Hyperspace Analogue to Language (Lund and Burgess 1996)
- term-term matrix with surface context
- structured (left/right) and distance-weighted frequency counts
- distance measure: Minkowski metric (1 ≤ p ≤ 2)
- dimensionality reduction: feature selection (high variance)
Some well-known DSM examples

Infomap NLP (Widdows 2004)
- term-term matrix with unstructured surface context
- weighting: none
- distance measure: cosine
- dimensionality reduction: SVD

Random Indexing (Karlgren and Sahlgren 2001)
- term-term matrix with unstructured surface context
- weighting: various methods
- distance measure: various methods
- dimensionality reduction: random indexing (RI)
Some well-known DSM examples

Dependency Vectors (Padó and Lapata 2007)
- term-term matrix with unstructured dependency context
- weighting: log-likelihood ratio
- distance measure: information-theoretic (Lin 1998b)
- dimensionality reduction: none

Distributional Memory (Baroni and Lenci 2010)
- term-term matrix with structured and unstructured dependencies + knowledge patterns
- weighting: local-MI on type frequencies of link patterns
- distance measure: cosine
- dimensionality reduction: none
DSM in practice: Using DSM distances
DSM in practice Using DSM distances
Nearest neighbours
DSM based on verb-object relations from BNC, reduced to 100 dim. with SVD

Neighbours of dog (cosine angle):
+ girl (45.5), boy (46.7), horse (47.0), wife (48.8), baby (51.9), daughter (53.1), side (54.9), mother (55.6), boat (55.7), rest (56.3), night (56.7), cat (56.8), son (57.0), man (58.2), place (58.4), husband (58.5), thing (58.8), friend (59.6), . . .

Neighbours of school:
+ country (49.3), church (52.1), hospital (53.1), house (54.4), hotel (55.1), industry (57.0), company (57.0), home (57.7), family (58.4), university (59.0), party (59.4), group (59.5), building (59.8), market (60.3), bank (60.4), business (60.9), area (61.4), department (61.6), club (62.7), town (63.3), library (63.3), room (63.6), service (64.4), police (64.7), . . .
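In principle, neighbour lists like these boil down to a few lines of linear algebra; a generic sketch (toy inputs, not the actual BNC model) that reports the cosine angle in degrees as above:

    # Hedged sketch: nearest neighbours by cosine angle in a reduced space.
    # `M` would be the SVD-reduced co-occurrence matrix, `words` its row labels.
    import numpy as np

    def nearest_neighbours(M, words, target, n=10):
        X = M / np.linalg.norm(M, axis=1, keepdims=True)   # normalise rows to unit length
        t = X[words.index(target)]
        angles = np.degrees(np.arccos(np.clip(X @ t, -1.0, 1.0)))
        order = np.argsort(angles)                         # smallest angle = nearest neighbour
        return [(words[i], round(float(angles[i]), 1)) for i in order if words[i] != target][:n]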
DSM Tutorial – Part 1 69 / 91
DSM in practice Using DSM distances
Nearest neighbours
[Figure: 2D visualisation of dog and its nearest neighbours, including girl, boy, horse, wife, baby, daughter, side, mother, boat, other, rest, night, cat, son, man, place, husband, thing, friend, bit, fish, woman, child, minute, animal, car, house, bird, people, head.]
DSM Tutorial – Part 1 70 / 91
DSM in practice Using DSM distances
Semantic maps
[Figure: "Semantic map (V-Obj from BNC)" - two-dimensional map of concrete nouns grouped by category (bird, groundAnimal, fruitTree, green, tool, vehicle), e.g. chicken, eagle, duck, swan, owl; dog, elephant, cow, cat, lion; cherry, banana, pear, pineapple; mushroom, corn, lettuce, potato, onion; bottle, pencil, cup, scissors, hammer, spoon; telephone, boat, car, ship, truck, helicopter.]
DSM Tutorial – Part 1 71 / 91
DSM in practice Using DSM distances
Clustering
[Figure: "Word space clustering of concrete nouns (V-Obj from BNC)" - hierarchical clustering dendrogram of the same concrete nouns; vertical axis labelled cluster size (0.0–1.2).]
DSM Tutorial – Part 1 72 / 91
DSM in practice Using DSM distances
Latent dimensions
[Figure: the same buy/sell latent-dimension scatter plot of ~90 nouns shown earlier (slide 63).]
DSM Tutorial – Part 1 73 / 91
DSM in practice Using DSM distances
Semantic similarity graph (topological structure)
[Figure: two neighbourhood graphs; left panel: tea, cup, drink, wine, whisky, soup, champagne, offer, beer, citron, living, toast, breakfast, salad, journey, meal, lunch, dinner, vodka, rum, coffee; right panel: head, arm, foot, finger, leg, eye, hand, place, ball, side, back, door, face, bit, car, light, minute, money, glass, horse, week, way, hour, time, paper, bag, house, line, part, look, chance.]
DSM Tutorial – Part 1 74 / 91
DSM in practice Using DSM distances
Context vectors (Schütze 1998)
I Distributional representation only at type level
+ What is the "average" meaning of mouse? (computer vs. animal)
I Context vector approximates meaning of individual token
I bag-of-words approach: centroid of all context words in the sentence

[Figure: context words cat, furry, chase, keyboard, computer, connect plotted between an "animals" and a "computers" direction; centroid context vectors shown for the token mouse in "connect the mouse and keyboard to the computer" and in "the furry cat chased the mouse".]
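A minimal sketch of the bag-of-words token vector, assuming a hypothetical `vectors` lookup from word types to NumPy arrays:

    # Hedged sketch of a token context vector as the centroid of type vectors.
    import numpy as np

    def token_vector(sentence_words, vectors):
        """Centroid of the type vectors of all context words in the sentence;
        words missing from the model are skipped."""
        vecs = [vectors[w] for w in sentence_words if w in vectors]
        return np.mean(vecs, axis=0) if vecs else None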
DSM Tutorial – Part 1 75 / 91
DSM in practice Quantitative evaluation
Outline
IntroductionThe distributional hypothesisThree famous DSM examples
Taxonomy of DSM parametersDefinition & overviewDSM parametersExamples
DSM in practiceUsing DSM distancesQuantitative evaluationSoftware and further information
DSM Tutorial – Part 1 76 / 91
DSM in practice Quantitative evaluation
The TOEFL synonym task
I The TOEFL dataset
I 80 items
I Target: levied
  Candidates: believed, correlated, imposed, requested
I Target: fashion
  Candidates: craze, fathom, manner, ration
I DSMs and TOEFL
1. take vectors of the target (t) and of the candidates (c1 . . . cn)
2. measure the distance between t and ci, with 1 ≤ i ≤ n
3. select the ci with the shortest distance in space from t
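Steps 1–3 can be sketched as follows, assuming a hypothetical `vectors` dict of word vectors and using cosine similarity (largest cosine = smallest angle, i.e. shortest distance on the unit sphere):

    # Hedged sketch of the TOEFL procedure above.
    import numpy as np

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    def solve_toefl_item(target, candidates, vectors):
        """Pick the candidate closest to the target in vector space."""
        return max(candidates, key=lambda c: cosine(vectors[target], vectors[c]))

    # solve_toefl_item("levied", ["believed", "correlated", "imposed", "requested"], vectors)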
DSM Tutorial – Part 1 77 / 91
DSM in practice Quantitative evaluation
Humans vs. machines on the TOEFL task
I Average foreign test taker: 64.5%
I Macquarie University staff (Rapp 2004):
  I Average of 5 non-natives: 86.75%
  I Average of 5 natives: 97.75%
I Distributional semantics
  I Classic LSA (Landauer and Dumais 1997): 64.4%
  I Padó and Lapata's (2007) dependency-based model: 73.0%
  I Distributional memory (Baroni and Lenci 2010): 76.9%
  I Rapp's (2004) SVD-based model, lemmatized BNC: 92.5%
  I Bullinaria and Levy (2012) carry out aggressive parameter optimization: 100.0%
DSM Tutorial – Part 1 78 / 91
DSM in practice Quantitative evaluation
Semantic similarity judgments
I Rubenstein and Goodenough (1965) collected similarity ratings for 65 noun pairs from 51 subjects on a 0–4 scale

  w1     w2          avg. rating
  car    automobile  3.9
  food   fruit       2.7
  cord   smile       0.0

I DSMs vs. Rubenstein & Goodenough
1. for each test pair (w1, w2), take vectors w1 and w2
2. measure the distance (e.g. cosine) between w1 and w2
3. measure the (Pearson) correlation between vector distances and R&G average judgments (Padó and Lapata 2007)
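A sketch of this evaluation protocol, assuming a hypothetical `vectors` dict; it uses cosine similarity and Pearson's r as in steps 1–3 (the plot on the next slide reports angular distance and |rho| instead):

    # Hedged sketch of correlating model similarities with human ratings.
    import numpy as np
    from scipy.stats import pearsonr

    def evaluate_similarity(pairs, ratings, vectors):
        sims = [vectors[w1] @ vectors[w2]
                / (np.linalg.norm(vectors[w1]) * np.linalg.norm(vectors[w2]))
                for w1, w2 in pairs]
        return pearsonr(sims, ratings)      # (correlation, p-value)

    # evaluate_similarity([("car", "automobile"), ("cord", "smile")], [3.9, 0.0], vectors)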
DSM Tutorial – Part 1 79 / 91
DSM in practice Quantitative evaluation
Semantic similarity judgments: example
[Figure: scatter plot "RG65: British National Corpus" - angular distance (y-axis, 30–100) against human rating (x-axis, 0–4); |rho| = 0.748, p = 0.0000, |r| = 0.623 .. 0.842]
DSM Tutorial – Part 1 80 / 91
DSM in practice Quantitative evaluation
Semantic similarity judgments: results
Results on RG65 task:
I Padó and Lapata's (2007) dependency-based model: 0.62
I Dependency-based on Web corpus (Herdağdelen et al. 2009)
  I without SVD reduction: 0.69
  I with SVD reduction: 0.80
I Distributional memory (Baroni and Lenci 2010): 0.82
I Salient Semantic Analysis (Hassan and Mihalcea 2011): 0.86
DSM Tutorial – Part 1 81 / 91
DSM in practice Software and further information
Outline
IntroductionThe distributional hypothesisThree famous DSM examples
Taxonomy of DSM parametersDefinition & overviewDSM parametersExamples
DSM in practiceUsing DSM distancesQuantitative evaluationSoftware and further information
DSM Tutorial – Part 1 82 / 91
DSM in practice Software and further information
Software packages
HiDEx            C++     re-implementation of the HAL model (Lund and Burgess 1996)
SemanticVectors  Java    scalable architecture based on random indexing representation
S-Space          Java    complex object-oriented framework
JoBimText        Java    UIMA / Hadoop framework
Gensim           Python  complex framework, focus on parallelization and out-of-core algorithms
DISSECT          Python  user-friendly, designed for research on compositional semantics
wordspace        R       interactive research laboratory, but scales to real-life data sets
DSM Tutorial – Part 1 83 / 91
DSM in practice Software and further information
Recent conferences and workshops
I 2007: CoSMo Workshop (at Context '07)
I 2008: ESSLLI Lexical Semantics Workshop & Shared Task, Special Issue of the Italian Journal of Linguistics
I 2009: GeMS Workshop (EACL 2009), DiSCo Workshop (CogSci 2009), ESSLLI Advanced Course on DSM
I 2010: 2nd GeMS (ACL 2010), ESSLLI Workshop on Compositionality and DSM, DSM Tutorial (NAACL 2010), Special Issue of JNLE on Distributional Lexical Semantics
I 2011: 2nd DiSCo (ACL 2011), 3rd GeMS (EMNLP 2011)
I 2012: DiDaS (at ICSC 2012)
I 2013: CVSC (ACL 2013), TFDS (IWCS 2013), Dagstuhl
I 2014: 2nd CVSC (at EACL 2014)
DSM Tutorial – Part 1 84 / 91
DSM in practice Software and further information
Further information
I Handouts & other materials available from the wordspace wiki at http://wordspace.collocations.de/
+ based on joint work with Marco Baroni and Alessandro Lenci
I Tutorial is open source (CC), and can be downloaded from http://r-forge.r-project.org/projects/wordspace/
I Review paper on distributional semantics:
  Turney, Peter D. and Pantel, Patrick (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37, 141–188.
I I should be working on the textbook Distributional Semantics for Synthesis Lectures on HLT (Morgan & Claypool)
DSM Tutorial – Part 1 85 / 91
DSM in practice Software and further information
References I
Baroni, Marco and Lenci, Alessandro (2010). Distributional Memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4), 673–712.
Bengio, Yoshua; Ducharme, Réjean; Vincent, Pascal; Jauvin, Christian (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137–1155.
Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Bullinaria, John A. and Levy, Joseph P. (2012). Extracting semantic representations from word co-occurrence statistics: Stop-lists, stemming and SVD. Behavior Research Methods, 44(3), 890–907.
Church, Kenneth W. and Hanks, Patrick (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29.
Dumais, S. T.; Furnas, G. W.; Landauer, T. K.; Deerwester, S.; Harshman, R. (1988). Using latent semantic analysis to improve access to textual information. In CHI '88: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 281–285.
Dunning, Ted E. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.
DSM Tutorial – Part 1 86 / 91
DSM in practice Software and further information
References II
Endres, Dominik M. and Schindelin, Johannes E. (2003). A new metric for probability distributions. IEEE Transactions on Information Theory, 49(7), 1858–1860.
Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. Dissertation, Institut für maschinelle Sprachverarbeitung, University of Stuttgart. Published in 2005, URN urn:nbn:de:bsz:93-opus-23714. Available from http://www.collocations.de/phd.html.
Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 58. Mouton de Gruyter, Berlin, New York.
Evert, Stefan (2010). Google Web 1T5 n-grams made easy (but not for the computer). In Proceedings of the 6th Web as Corpus Workshop (WAC-6), pages 32–40, Los Angeles, CA.
Firth, J. R. (1957). A synopsis of linguistic theory 1930–55. In Studies in linguistic analysis, pages 1–32. The Philological Society, Oxford. Reprinted in Palmer (1968), pages 168–205.
Grefenstette, Gregory (1994). Explorations in Automatic Thesaurus Discovery, volume 278 of Kluwer International Series in Engineering and Computer Science. Springer, Berlin, New York.
DSM Tutorial – Part 1 87 / 91
DSM in practice Software and further information
References III
Harris, Zellig (1954). Distributional structure. Word, 10(23), 146–162. Reprinted in Harris (1970, 775–794).
Hassan, Samer and Mihalcea, Rada (2011). Semantic relatedness using salient semantic analysis. In Proceedings of the Twenty-fifth AAAI Conference on Artificial Intelligence.
Herdağdelen, Amaç; Erk, Katrin; Baroni, Marco (2009). Measuring semantic relatedness with vector space models and random walks. In Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing (TextGraphs-4), pages 50–53, Suntec, Singapore.
Hofmann, Thomas (1999). Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI '99).
Karlgren, Jussi and Sahlgren, Magnus (2001). From words to understanding. In Y. Uesaka, P. Kanerva, and H. Asoh (eds.), Foundations of Real-World Intelligence, pages 294–308. CSLI Publications, Stanford.
Landauer, Thomas K. and Dumais, Susan T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104(2), 211–240.
DSM Tutorial – Part 1 88 / 91
DSM in practice Software and further information
References IV
Li, Ping; Burgess, Curt; Lund, Kevin (2000). The acquisition of word meaning through global lexical co-occurrences. In E. V. Clark (ed.), The Proceedings of the Thirtieth Annual Child Language Research Forum, pages 167–178. Stanford Linguistics Association.
Lin, Dekang (1998a). Automatic retrieval and clustering of similar words. In Proceedings of the 17th International Conference on Computational Linguistics (COLING-ACL 1998), pages 768–774, Montreal, Canada.
Lin, Dekang (1998b). An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning (ICML-98), pages 296–304, Madison, WI.
Lund, Kevin and Burgess, Curt (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), 203–208.
Miller, George A. (1986). Dictionaries in the mind. Language and Cognitive Processes, 1, 171–185.
Padó, Sebastian and Lapata, Mirella (2007). Dependency-based construction of semantic space models. Computational Linguistics, 33(2), 161–199.
DSM Tutorial – Part 1 89 / 91
DSM in practice Software and further information
References V
Pantel, Patrick and Lin, Dekang (2000). An unsupervised approach to prepositional phrase attachment using contextually similar words. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, China.
Pantel, Patrick; Crestan, Eric; Borkovsky, Arkady; Popescu, Ana-Maria; Vyas, Vishnu (2009). Web-scale distributional similarity and entity set expansion. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 938–947, Singapore.
Rapp, Reinhard (2004). A freely available automatically generated thesaurus of related words. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), pages 395–398.
Rooth, Mats; Riezler, Stefan; Prescher, Detlef; Carroll, Glenn; Beil, Franz (1999). Inducing a semantically annotated lexicon via EM-based clustering. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 104–111.
Rubenstein, Herbert and Goodenough, John B. (1965). Contextual correlates of synonymy. Communications of the ACM, 8(10), 627–633.
Schütze, Hinrich (1992). Dimensions of meaning. In Proceedings of Supercomputing '92, pages 787–796, Minneapolis, MN.
DSM Tutorial – Part 1 90 / 91
DSM in practice Software and further information
References VI
Schütze, Hinrich (1993). Word space. In Proceedings of Advances in Neural Information Processing Systems 5, pages 895–902, San Mateo, CA.
Schütze, Hinrich (1995). Distributional part-of-speech tagging. In Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics (EACL 1995), pages 141–148.
Schütze, Hinrich (1998). Automatic word sense discrimination. Computational Linguistics, 24(1), 97–123.
Turney, Peter D. and Pantel, Patrick (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37, 141–188.
Turney, Peter D.; Littman, Michael L.; Bigham, Jeffrey; Shnayder, Victor (2003). Combining independent modules to solve multiple-choice synonym and analogy problems. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-03), pages 482–489, Borovets, Bulgaria.
Widdows, Dominic (2004). Geometry and Meaning. Number 172 in CSLI Lecture Notes. CSLI Publications, Stanford.
DSM Tutorial – Part 1 91 / 91