From Word Embeddings To Document...

Post on 21-Jun-2020

18 views 0 download

Transcript of From Word Embeddings To Document...

From Word Embeddings To Document DistancesMatt J. Kusner Yu Sun Nicholas I. Kolkin Kilian Q. Weinberger

Goal: a distance between two documents

?

Applicationsdocument classification multi-lingual document matching

song identification

Word Embeddingword2vec

trained on 100 billion words

3 million different words embedded

[Mikolov et al., 2013]

Rd

different from [Collobert & Weston, 2008]

[Mnih & Hinton, 2009] word2vec is not deep!

words

Word Embeddingword2vec

[Mikolov et al., 2013]

Rd

X 2 Rd⇥n

distance between words i and j:

kxi � xjk2

is roughly their dissimilarity

xixj

words

Word Embeddingword2vec

[Mikolov et al., 2013]

Rd

KingMan

Woman Queen

How can we leverage these high quality word embeddings

to compute document distances?

catdogruntree

carrace

frog

oatpicklefly

patt

ern ocean

steam

rock rocket

upbabybunny win

Word Mover’s Distance

Goal

?

Word Mover’s Distance

word embeddingRd

Obama

Illinois

mediapress

President

greets

speaks

P

Obama

speaksmediaIllinois

d1/4

PPresidentgreetspressChicago

1/4

d0

Chicago

Word Mover’s Distance

word embeddingRd

Obama

Illinois

mediapress

President

greets

speaks

P

Obama

speaksmediaIllinois

d1/4

PPresidentgreetspressChicago

1/4

d0

Chicago

Word Mover’s Distance

word embeddingRd

Obama

Illinois

mediapress

President

greets

speaks

P

Obama

speaksmediaIllinois

d1/4

PPresidentgreetspressChicago

1/4

d0

Chicago

word mover’s distance = minimum word distance to transform mass d into d’

Word Mover’s Distance

word embeddingRd

Obama

Illinois

mediapress

President

greets

speaks

P

Obama

speaksmediaIllinois

d1/4

PPresidentgreetspressChicago

1/4

d0

Chicago

word embeddingRd

Obama

Illinois

mediapress

President

greets

speaks

P

Obama

speaksmediaIllinois

d1/4

PPresidentgreetspressChicago

1/4

d0

Chicago

Word Mover’s Distance

Word Mover’s Distance

word embeddingRd

Obama

Illinois

mediapress

President

greets

speaks

P

Obama

speaksmediaIllinois

d1/4

PPresidentgreetspressChicago

1/4

d0

Chicago

Word Mover’s Distance

word embeddingRd

P

Obama

speaksmediaIllinois

d1/4

PPresidentgreetspressChicago

1/4

d0

i

j

1/4

IllinoisChicago

T

Word Mover’s Distance

word embeddingRd

Obama

Illinois

press

President

greets

speaks

PPresidentgreetspressChicago

1/4

d0

Chicago

P

Obama

speaksIllinois

d1/41/2

Word Mover’s Distance

word embeddingRd

Obama

Illinois

press

President

greets

speaks

PPresidentgreetspressChicago

1/4

d0

Chicago

P

Obama

speaksIllinois

d1/41/2

Word Mover’s Distance

word embeddingRd

Obama

Illinois

press

President

greets

speaks

PPresidentgreetspressChicago

1/4

d0

Chicago

P

Obama

speaksIllinois

d1/41/2

Word Mover’s DistanceP

PresidentgreetspressChicago

1/4

d0WMD(d,d0) ,P

Obama

speaksIllinois

d1/41/2

minT�0

nX

i,j=1

Tijkxi � xjk2

Obama

press

Presidentxi

xj

xk

1/4

1/4

1/2

Word Mover’s DistanceP

PresidentgreetspressChicago

1/4

d0WMD(d,d0) ,WMD(d,d0) ,

8i

8j

WMD(d,d0) ,

s.t.

P

Obama

speaksIllinois

d1/41/2

minT�0

nX

i,j=1

Tijkxi � xjk2

nX

j=1

Tij = di

nX

i=1

Tij = d0j

Remarks

in CV this is the Earth Mover’s Distance (EMD) [Rubner et al., 1998]

an old optimal transport problem [Monge, 1781]

nX

j=1

Fij = di

nX

i=1

Fij = d0j

s.t. 8i

8j

WMD(d,d0) ,

8i

8j

s.t.

minT�0

nX

i,j=1

Tijkxi � xjk2

nX

j=1

Tij = di

nX

i=1

Tij = d0j

How well does WMD perform on document classification via k-

nearest neighbors (k-NN)?

Classic Approaches

3

0

0

1

0

2

0

0

0

0

1

‘campaign’

‘Obama’

‘speech’

‘Washington’

bag-of-words TF-IDF

0.01

0

0

0.2

0

0.04

0

0

0

0

0.13

LSI LDA

‘Oba

ma’

‘speech’

‘a’

‘Oba

ma’

‘speech’

‘a’

topic distributions

sports

politics

music

Civil War

politics topic

‘Obam

a’‘speech’‘W

ashington’

‘football’‘M

adonna’‘Vicksburg’‘guitar’

‘soccer’

[Salton & Buckley, 1988] [Deerwester et al., 1990] [Blei et al., 2003]

Results: k-NN

1 2 3 4 5 6 7 8010203040506070

twitter recipe ohsumed classic reuters amazon

test

err

or % 43

33

44

33 32 3229

6663 61

49 51

44

36

8.0 9.7

62

44 4135

6.95.06.72.8

3329

148.16.96.3

3.5

59

42

28

1417

129.37.4

34

1722 21

8.46.44.3

21

4.6

53 5359

5448

4543

5156 54

58

3640

31 29 27

20newsbbcsport

k-nearest neighbor errorBOW [Frakes & Baeza-Yates, 1992] TF-IDF [Jones, 1972]Okapi BM25 [Robertson & Walker, 1994]

LSI [Deerwester et al., 1990]LDA [Blei et al., 2003]mSDA [Chen et al., 2012]Componential Counting Grid [Perina et al., 2013]

Word Mover's Distance

train inputs:

BOW dim:

517

13243 6344

2176 3059

5708

3999

31789

4965

24277

5485

22425

5600

42063

11293

29671

All hyper-parameters set with bayesopt.m [Gardner et al. 2014]

1 2 3 4 5 6 7 80

0.20.40.60.81

1.21.41.6

aver

age

erro

r w.r.

t. BO

W 1.61.41.21.00.80.60.40.2

0

1.291.15

1.00.72

0.60 0.55 0.49 0.42

BOW TF-IDF Okapi BM25

LSI LDA mSDA CCG

WMD

Results: k-NN

LP with 2n constraintsO(n3 log n)

[Pele & Werman, 2009]

minF�0

nX

i,j=1

Fijkxi � xjk2

nX

j=1

Fij = di

nX

i=1

Fij = d0j

s.t. 8i

8j

WMD(d,d0) ,

Computational Complexity

8i

8j

s.t.

minT�0

nX

i,j=1

Tijkxi � xjk2

nX

j=1

Tij = di

nX

i=1

Tij = d0j

[Rubner et al., 1998]; [Levina & Bickel, 2001]; [Grauman & Darrell, 2004]; [Shirdhonkar & Jacobs, 2008]

approximations:

minF�0

nX

i,j=1

Fijkxi � xjk2

nX

j=1

Fij = di

nX

i=1

Fij = d0j

s.t. 8i

8j

WMD(d,d0) ,

Computational Complexity

8i

8j

s.t.

minT�0

nX

i,j=1

Tijkxi � xjk2

nX

j=1

Tij = di

nX

i=1

Tij = d0j

Approximation 1[Rubner et al., 1998]

word embeddingRd

Obama

Illinois

mediapress

President

greets

speaks

Chicago

P

Obama

speaksmediaIllinois

d1/4

PPresidentgreetspressChicago

1/4

d0

[Rubner et al., 1998]

word embeddingRd

Obama

Illinois

mediapress

President

greets

speaks

Chicago

Approximation 1P

Obama

speaksmediaIllinois

d1/4

PPresidentgreetspressChicago

1/4

d0

Rd

Obama

Illinois

mediapress

President

greets

speaks

Chicago

[Rubner et al., 1998]

word embedding

Approximation 1P

Obama

speaksmediaIllinois

d1/4

PPresidentgreetspressChicago

1/4

d0

Word Centroid DistanceWCD(d,d0) , kXd�Xd0k2

O(nd)

Xd0Xd

Faster Approximations

0 500 1000 1500 2000 25000

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

0 100 200 300 400 500 600 700 800 900 10000

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.2

1.00.8

0.6

0.4

0.2

1.4

0500 1500 25001000 20000

amazon

WMDRWMDWCD

training input index

dist

ance

1.6

1.4

1.2

1.0

0.80.6

0.4

1.6

0.2

twitter

200 400 800 10006000training input index

0

for a random test input...

minF�0

nX

i,j=1

Fijkxi � xjk2

nX

j=1

Fij = di

nX

i=1

Fij = d0j

s.t. 8i

8j

Approximation 2

8i

8j

s.t.

minT�0

nX

i,j=1

Tijkxi � xjk2

nX

j=1

Tij = di

nX

i=1

Tij = d0j

8i

8j

s.t.

minT�0

nX

i,j=1

Tijkxi � xjk2

nX

j=1

Tij = di

nX

i=1

Tij = d0j

Approximation 2

D1

8i

8j

s.t.

minT�0

nX

i,j=1

Tijkxi � xjk2

nX

j=1

Tij = di

nX

i=1

Tij = d0j

just a nearest-neighbor search!

Approximation 2

D1

8i

8j

s.t.

minT�0

nX

i,j=1

Tijkxi � xjk2

nX

j=1

Tij = di

nX

i=1

Tij = d0j

Approximation 2

just a nearest-neighbor search!

D2

Relaxed Word Mover’s Distance

O(n2d)

Approximation 2

RWMD(d,d0) , max(D1, D2)

8i

8j

s.t.

minT�0

nX

i,j=1

Tijkxi � xjk2

nX

j=1

Tij = di

nX

i=1

Tij = d0j

D1

8i

8j

s.t.

minT�0

nX

i,j=1

Tijkxi � xjk2

nX

j=1

Tij = di

nX

i=1

Tij = d0j

D2

Faster Approximations

0 500 1000 1500 2000 25000

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

0 100 200 300 400 500 600 700 800 900 10000

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.2

1.00.8

0.6

0.4

0.2

1.4

0500 1500 25001000 20000

amazon

WMDRWMDWCD

training input index

dist

ance

1.6

1.4

1.2

1.0

0.80.6

0.4

1.6

0.2

twitter

200 400 800 10006000training input index

0

for a random test input...

Faster Approximations

0 500 1000 1500 2000 25000

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

0 100 200 300 400 500 600 700 800 900 10000

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.2

1.00.8

0.6

0.4

0.2

1.4

0500 1500 25001000 20000

amazon

WMDRWMDWCD

training input index

dist

ance

1.6

1.4

1.2

1.0

0.80.6

0.4

1.6

0.2

twitter

200 400 800 10006000training input index

0

for a random test input...

Faster Approximations

10

0.20.40.60.81

1.21.21.0

0.8

0.6

0.4

0.2

RWMD WMDWCDRWMDRWMD

0.93

0.56 0.540.45 0.42

0

average kNN error w.r.t. BOW

c2 c1

1.21.0

0.8

0.6

0.4

0.2

RWMD WMDWCDRWMDRWMD

0.92 0.92

0.30

0.96 1.0

0

average distance w.r.t. WMD

c2 c1 D1 D2

Other Embeddings

Conclusion

• Word Mover’s Distance:

• document distances fromword embeddings

• Very accurate as it leverages high quality word2vec embedding

• Fast through approximationsO(n2d)O(n3 log n) O(nd)

WCD RWMDWMD

1 2 3 4 5 6 7 80

0.20.40.60.81

1.21.41.6

aver

age

erro

r w.r.

t. BO

W 1.61.41.21.00.80.60.40.2

0

1.291.15

1.00.72

0.60 0.55 0.49 0.42

BOW TF-IDF Okapi BM25

LSI LDA mSDA CCG

WMD

Obama

Illinois

mediapress

President

greets

speaks

Chicago

Thank you. Questions?Code: http://matthewkusner.com