
MUSE: MODULARIZING UNSUPERVISED SENSE EMBEDDINGS
Guang-He Lee and Yun-Nung (Vivian) Chen

https://github.com/MiuLab/MUSE

Motivation

Conclusion

o Efficiency: purely sense-level representation learning with linear-time sense decoding

o Modeling: single objective for modular unsupervised sense embedding learning

o Learning: leverage RL to model the sense selection process

o Exploration: introduce various exploration mechanisms for sense selection to improve robustness

o Experiment: state-of-the-art performance

[email protected] [email protected]

guanghelee.github.io vivianchen.idv.tw

Why sense embeddings

o Words are polysemous, but their embeddings are usually not

o A single word embedding is restricted by the triangle inequality

o Sense embeddings easily circumvent this constraint (see the sketch below)
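A toy numeric sketch of the triangle-inequality point above, with hypothetical 2-D vectors: a single "rock" vector cannot be close to both "stone" and "shake" unless those two words are also forced close together, while two sense vectors avoid the constraint. The vectors and distances below are illustrative assumptions, not learned embeddings.

```python
# Toy 2-D vectors (hypothetical) illustrating the triangle-inequality constraint.
import numpy as np

def d(a, b):
    return float(np.linalg.norm(a - b))  # Euclidean distance, a proper metric

stone, shake = np.array([0.0, 0.0]), np.array([6.0, 0.0])

# Single word embedding: the triangle inequality gives
#   d(stone, shake) <= d(stone, rock) + d(rock, shake),
# so "rock" can never be within distance 1 of both while d(stone, shake) = 6.
rock = np.array([3.0, 0.0])                    # best compromise placement
print(d(stone, rock), d(rock, shake))          # 3.0 3.0

# Sense embeddings: rock-1 serves the "stone" sense, rock-2 the "shake" sense,
# and nothing forces stone and shake together.
rock_1, rock_2 = np.array([0.5, 0.0]), np.array([5.5, 0.0])
print(d(stone, rock_1), d(shake, rock_2))      # 0.5 0.5
```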

Example: "Smartphone companies including apple, blackberry, and sony will be invited."

Key mechanisms
o sense selection: apple → apple-1 or apple-2
o sense embedding: a separate vector for each selected sense

Traditional Frameworks

Clustering based on contexts as sense ID
o Efficient sense selection (see the sketch below)
o Embedding learning involves word tokens
[Figure: a word vector in context is assigned to cluster 1, cluster 2, ..., and the cluster id serves as the sense, yielding sense vectors]

Sense selection by a distribution
o No word tokens are involved
o Inefficient EM algorithms
[Figure: each word w_i keeps a probability distribution over its senses S_i1, S_i2, S_i3, ...]
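A rough sketch of the clustering-style sense selection described above (in the spirit of such frameworks, not any specific system): the averaged contextual word vector is matched against per-sense cluster centers, and the nearest cluster id becomes the sense label. All vectors below are hypothetical.

```python
# Hypothetical helper: cluster-based sense selection over contextual vectors.
import numpy as np

def select_sense_by_clustering(context_vecs, cluster_centers):
    c = np.mean(context_vecs, axis=0)                    # word vec in context
    dists = np.linalg.norm(cluster_centers - c, axis=1)  # distance to each cluster
    return int(np.argmin(dists))                         # cluster id used as sense id

rng = np.random.default_rng(0)
ctx = rng.normal(size=(4, 300))      # vectors of the surrounding context words
centers = rng.normal(size=(3, 300))  # one center per sense of the target word
print(select_sense_by_clustering(ctx, centers))
```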

The Proposed Approach: MUSE

Advantages from both worlds
o clustering-based: coarse-grained word+sense embedding, but efficient sense selection
o probabilistic: fine-grained sense embedding, but inefficient sense selection
o MUSE: fine-grained sense embedding with efficient sense selection

Issue 1: impossible to get both advantages in a single model

Solution: modular framework

o Distinct modules for specific mechanisms

Issue 2: how to formulate single objective?

Solution: Markov Decision Process (MDP)

o sense selection → state/action in MDP

o sense representation → reward in MDP

Sense selection module: a linear neural network over the context (see the sketch below)
o Policy-based: a softmax policy over the candidate senses of the target word
o Value-based: a linear scorer that estimates the value (Q-value) of each sense

Sense representation module: skip-gram
o skip-gram with negative sampling over the selected senses; its likelihood serves as the reward
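A minimal sketch of the linear sense selection module above. The matrix names P and Q_i come from the architecture figure; the dimensions, initialization, example word ids, and the softmax/Q-value readout are assumptions for illustration, not the authors' exact parameterization.

```python
# Hypothetical dimensions; matrix names P and Q_i follow the architecture figure.
import numpy as np

rng = np.random.default_rng(0)
dim, vocab_size, n_senses = 300, 10_000, 3

P = rng.normal(scale=0.1, size=(vocab_size, dim))   # context-side word embeddings
Q_i = rng.normal(scale=0.1, size=(n_senses, dim))   # sense scorer for one word w_i

def sense_scores(context_word_ids):
    c_bar = P[context_word_ids].mean(axis=0)        # averaged context representation
    return Q_i @ c_bar                              # one linear score per sense z_ik

scores = sense_scores([17, 42, 256, 891])           # hypothetical context word ids
policy = np.exp(scores - scores.max())
policy /= policy.sum()                              # policy-based: softmax over senses
q_values = scores                                   # value-based: scores as Q-values
print(policy, q_values)
```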

Issue 3: how to optimize modules?

Solution: Reinforcement Learning (RL)

o Policy-based: maximizes the expected rewards

o Value-based: estimates the rewards directly (see the sketch below)
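One plausible reading of the two optimization styles, for the single-step decision described by the MDP above (select a sense, observe the skip-gram likelihood as reward). The update rules below are generic REINFORCE and value-regression illustrations, not the paper's exact equations.

```python
# Generic one-step updates; `theta` is a policy parameter vector, `grad_log_pi`
# the gradient of log-probability of the sampled sense, `reward` the skip-gram
# likelihood returned by the sense representation module.
import numpy as np

def policy_gradient_update(theta, grad_log_pi, reward, lr=0.01):
    # Policy-based: push up the log-probability of the sampled sense in
    # proportion to the reward it earned (maximize expected reward).
    return theta + lr * reward * grad_log_pi

def value_update(q_value, reward, lr=0.01):
    # Value-based: move the predicted value of the chosen sense toward the
    # observed reward (estimate the reward directly).
    return q_value + lr * (reward - q_value)

print(policy_gradient_update(np.zeros(3), np.array([0.2, -0.1, -0.1]), reward=1.5))
print(value_update(q_value=0.0, reward=1.5))
```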

Issue 4: how to conduct sense selection?

Solution: Exploration

o Policy gradient: sampling

o Value-based: greedy, ε-Greedy, Boltzmann sampling (see the sketch below)
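The exploration mechanisms listed above can be sketched directly over the per-sense scores produced by the selection module; the scores, temperature, and ε below are illustrative assumptions.

```python
# Illustrative sense scores and hyperparameters.
import numpy as np

rng = np.random.default_rng(0)

def greedy(q):                        # value-based: always take the best sense
    return int(np.argmax(q))

def epsilon_greedy(q, eps=0.05):      # explore uniformly with probability eps
    if rng.random() < eps:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def boltzmann(q, temperature=1.0):    # sample a sense with probability ∝ exp(q/T)
    p = np.exp((q - q.max()) / temperature)
    p /= p.sum()
    return int(rng.choice(len(q), p=p))

def policy_sampling(pi):              # policy gradient: sample from the softmax policy
    return int(rng.choice(len(pi), p=pi))

q = np.array([0.2, 1.3, 0.7])         # hypothetical scores for the senses of one word
print(greedy(q), epsilon_greedy(q), boltzmann(q), policy_sampling(np.array([0.2, 0.5, 0.3])))
```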

Experiment 1: Contextual Word Similarities

Approach MaxSimC AvgSimC

Huang et al., 2012 26.1 65.7

Neelakantan et al., 2014 60.1 69.3

Tian et al., 2014 63.6 65.4

Li & Jurafsky, 2015 66.6 66.8

Bartunov et al., 2016 53.8 61.2

Qiu et al., 2016 64.9 66.1

MUSE-Policy 66.1 67.4

MUSE-Greedy 66.3 68.3

MUSE-ε-Greedy 67.4+ 68.6

MUSE-Boltzmann 67.9+ 68.7

[Figure: in the word embedding schema, rock ≈ stone and rock ≈ shake force stone ≈ shake; in the sense embedding schema, rock1 ≈ stone and rock2 ≈ shake without tying stone to shake]

Setup

Dataset: April 2010 Wikipedia dump

Context window: 5

Embedding dim: 300

Evaluation: similarity on contextual word pairs

o Compared target: human-judged similarity of contextual word pairs, e.g.
… east bank of the Des Moines River …
… basis of all money laundering …
o Example: for bank in "He borrowed the money from …", the selection module splits probability 0.6 / 0.4 between the senses bank-1 and bank-2
o (1) MaxSimC: similarity between the most probable sense of each word in its context
o (2) AvgSimC: probability-weighted similarity over all sense pairs (0.6 x … + 0.4 x …), sketched below
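The poster only names the two metrics; the sketch below follows their standard definitions (most-probable-sense cosine for MaxSimC, probability-weighted cosine over all sense pairs for AvgSimC), with hypothetical inputs.

```python
# `senses_*` are (n_senses, dim) sense vectors, `probs_*` the per-context sense
# probabilities q(z | context) from the selection module; all inputs hypothetical.
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def max_sim_c(senses_w, probs_w, senses_v, probs_v):
    # similarity between the most probable sense of each word in its context
    return cos(senses_w[np.argmax(probs_w)], senses_v[np.argmax(probs_v)])

def avg_sim_c(senses_w, probs_w, senses_v, probs_v):
    # probability-weighted average similarity over all sense pairs
    return float(sum(p_i * p_j * cos(s_i, s_j)
                     for s_i, p_i in zip(senses_w, probs_w)
                     for s_j, p_j in zip(senses_v, probs_v)))

rng = np.random.default_rng(0)
S_w, S_v = rng.normal(size=(3, 300)), rng.normal(size=(3, 300))
p_w, p_v = np.array([0.6, 0.3, 0.1]), np.array([0.5, 0.4, 0.1])
print(max_sim_c(S_w, p_w, S_v, p_v), avg_sim_c(S_w, p_w, S_v, p_v))
```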

MUSE achieves the state-of-the-art on MaxSimC, the more practical metric

Model architecture
1. Sample a collocated word pair from the corpus: { Smartphone companies including apple blackberry, and sony will be invited. } — e.g., target word apple with collocated word blackberry
2. Sense Selection Module (matrices P and Q_i): for the target word w_i = apple, compute q(z_i1 | C̄_t), q(z_i2 | C̄_t), q(z_i3 | C̄_t) from the context C_t = (companies, including, apple, blackberry, and) and select a sense, e.g., z_i1; likewise, with matrix Q_j, select a sense (e.g., z_j2) for the collocated word w_j = blackberry from its context C_t' = (including, apple, blackberry, and, …)
3. Sense Representation Module (matrices U and V): skip-gram with negative sampling estimates P(z_j2 | z_i1), …, P(z_uv | z_i1); this likelihood is fed back as the reward for both sense selections (a minimal sketch follows below)
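A minimal end-to-end sketch of the three steps above. The matrix names P, Q, U, V follow the figure; the shapes, word/context ids, greedy selection, negative-sample count, and the absence of any parameter updates are illustrative assumptions, not the authors' implementation.

```python
# Toy shapes; matrix names P, Q, U, V follow the figure, everything else is assumed.
import numpy as np

rng = np.random.default_rng(0)
dim, n_words, n_senses = 300, 10_000, 3

P = rng.normal(scale=0.1, size=(n_words, dim))            # selection-side word embeddings
Q = rng.normal(scale=0.1, size=(n_words, n_senses, dim))  # per-word sense scorers (Q_i, Q_j, ...)
U = rng.normal(scale=0.1, size=(n_words * n_senses, dim)) # sense "input" vectors
V = rng.normal(scale=0.1, size=(n_words * n_senses, dim)) # sense "output" vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def select_sense(word_id, context_ids):
    """Sense Selection Module: greedily pick a sense of word_id given its context."""
    c_bar = P[context_ids].mean(axis=0)                   # averaged context, C̄_t
    scores = Q[word_id] @ c_bar                           # q(z_ik | C̄_t) up to normalization
    return word_id * n_senses + int(np.argmax(scores))    # global sense index

def reward(sense_i, sense_j, n_negative=5):
    """Sense Representation Module: skip-gram negative-sampling log-likelihood."""
    pos = np.log(sigmoid(U[sense_i] @ V[sense_j]))
    neg_ids = rng.integers(len(V), size=n_negative)       # negatively sampled senses
    neg = np.log(sigmoid(-(U[sense_i] @ V[neg_ids].T))).sum()
    return pos + neg

# Hypothetical ids for target "apple" and collocated "blackberry" with their contexts.
z_i = select_sense(11, [3, 7, 11, 23, 42])
z_j = select_sense(23, [7, 11, 23, 42, 99])
print(reward(z_i, z_j))  # fed back as the reward for both sense selections
```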

Experiment 2: Synonym Selection

Task: pick the synonym of the query word, e.g., Q: yield — (A) submit (B) challenge (C) boast (D) scorn

Approach ESL-50 RD-300 TOEFL-80

word embeddings:
Global Context 47.73 45.07 60.87
SkipGram 52.08 55.66 66.67

word sense disambiguation:
IMS+SkipGram 41.67 53.77 66.67

unsupervised sense embeddings:
EM 27.08 33.96 40.00
MSSG (Neelakantan et al., ‘14) 57.14 58.93 78.26
CRP (Li & Jurafsky, ‘15) 50.00 55.36 82.61
MUSE-Policy 52.38 51.79 79.71
MUSE-Greedy 57.14 58.93 79.71
MUSE-ε-Greedy 61.90+ 62.50+ 84.06+
MUSE-Boltzmann 64.29+ 66.07+ 88.41+

supervised sense embeddings:
Retro-GlobalContext 63.64 66.20 71.01
Retro-SkipGram 56.25 65.09 73.33

MUSE with exploration outperforms all baselines.

MUSE can effectively separate different senses in an unsupervised way.

MUSE beats some supervised systems without any supervision.


Code Available: http://github.com/MiuLab/MUSE