Transcript of: Do Supervised Distributional Methods Really Learn Lexical Inference Relations?

Page 1:

Do Supervised Distributional Methods Really Learn Lexical Inference Relations?

Omer Levy, Ido Dagan (Bar-Ilan University, Israel)
Steffen Remus, Chris Biemann (Technische Universität Darmstadt, Germany)

Page 2:

Lexical Inference

Page 3:

Lexical Inference: Task Definition

• Given two words x and y
• Does x infer y?

• In this talk, inference refers to hypernymy ("x is a y")

Dataset

• Positive examples:
  • (dolphin, mammal)
  • (Jon Stewart, comedian)

• Negative examples:
  • (shark, mammal)
  • (Jon Stewart, politician)

Page 4:

Distributional Methods of Lexical Inference

Page 5:

Unsupervised Distributional Methods

• Represent x and y as vectors $\vec{x}$ and $\vec{y}$
  • Word Embeddings
  • Traditional (Sparse) Distributional Vectors

• Measure the similarity of $\vec{x}$ and $\vec{y}$
  • Cosine Similarity
  • Distributional Inclusion (Weeds & Weir, 2003; Kotlerman et al., 2010)

• Tune a threshold over the similarity of $\vec{x}$ and $\vec{y}$
  • i.e., train a classifier over a single feature (see the sketch below)
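A minimal sketch of this one-feature pipeline, assuming pre-trained vectors in a word-to-array dict; the function names (`cosine`, `tune_threshold`) are illustrative, not from the talk:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def tune_threshold(pairs, labels, vectors):
    """Pick the similarity threshold that maximizes training accuracy.
    This threshold search is the entire 'training' of an unsupervised DM."""
    sims = np.array([cosine(vectors[x], vectors[y]) for x, y in pairs])
    labels = np.array(labels, dtype=bool)
    best_t, best_acc = 0.0, 0.0
    for t in np.unique(sims):  # candidate thresholds: all observed similarities
        acc = ((sims >= t) == labels).mean()
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t
```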

Page 6:

Supervised Distributional Methods

• Represent x and y as vectors $\vec{x}$ and $\vec{y}$
  • Word Embeddings
  • Traditional (Sparse) Distributional Vectors

• Represent the pair (x, y) as a combination of $\vec{x}$ and $\vec{y}$
  • Concat: $\vec{x} \oplus \vec{y}$ (Baroni et al., 2012)
  • Diff: $\vec{y} - \vec{x}$ (Roller et al., 2014; Weeds et al., 2014; Fu et al., 2014)

• Train a classifier over the representation of (x, y)
  • Multi-feature representation (see the sketch below)
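A minimal sketch of the supervised setup; the random stand-in vectors and logistic regression are assumptions of this sketch (real experiments use PPMI/SVD/SGNS vectors and whichever classifier each cited paper chose):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def featurize(pairs, vectors, method="concat"):
    """Build pair representations: concat = x (+) y, diff = y - x."""
    feats = []
    for x, y in pairs:
        vx, vy = vectors[x], vectors[y]
        feats.append(np.concatenate([vx, vy]) if method == "concat" else vy - vx)
    return np.array(feats)

# Toy usage with random stand-in vectors:
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=50) for w in ["dolphin", "mammal", "shark"]}
train_pairs = [("dolphin", "mammal"), ("shark", "mammal")]
train_labels = [1, 0]
clf = LogisticRegression(max_iter=1000).fit(
    featurize(train_pairs, vectors, method="diff"), train_labels)
```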

Page 7:

Main Questions

• Are current supervised DMs better than unsupervised DMs?

• Are current supervised DMs learning a relation between x and y?
  • (No)

• If not, what are they learning?

Page 8:

Experiment Setup

Page 9:

Experiment Setup

• 9 Word Representations
  • 3 Representation Methods: PPMI, SVD (over PPMI), word2vec (SGNS)
  • 3 Context Types:
    • Bag-of-Words (5 words to each side)
    • Positional (2 words to each side + position)
    • Dependency (all syntactically-connected words + dependency)

• Trained on English Wikipedia

• 5 Lexical-Inference Datasets:
  • Kotlerman et al., 2010
  • Baroni and Lenci, 2011 (BLESS)
  • Baroni et al., 2012
  • Turney and Mohammad, 2014
  • Levy et al., 2014

Page 10:

Supervised Methods

• Concat: $\vec{x} \oplus \vec{y}$
  • Baroni et al., 2012

• Diff: $\vec{y} - \vec{x}$
  • Roller et al., 2014; Weeds et al., 2014; Fu et al., 2014

• Only $\vec{x}$: the pair is represented by $\vec{x}$ alone

• Only $\vec{y}$: the pair is represented by $\vec{y}$ alone (sketched below)
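The two single-word baselines drop one side of the pair entirely; as a hypothetical companion to the earlier `featurize` sketch:

```python
import numpy as np

def featurize_single(pairs, vectors, side="y"):
    """Only-x / Only-y baselines: represent each pair by one word's
    vector, discarding the other word entirely."""
    idx = 0 if side == "x" else 1
    return np.array([vectors[pair[idx]] for pair in pairs])
```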

Page 11:

Are current supervised DMs better than unsupervised DMs?

Page 12:

Previously Reported Success

Prior Art:
• Supervised DMs outperform unsupervised DMs
• Accuracy >95% (on some datasets)

Our Findings:
• The high accuracy of supervised DMs stems from lexical memorization

Page 13:

Lexical Memorization

• Lexical memorization: learning that a specific word is a strong indicator of the label

Example:
• Many positive training examples of the form (*, animal)
• The classifier memorizes that animal is a good indicator of a positive label
• Test examples of the form (*, animal) are then correctly classified "for free"

• In other words: overfitting
• Raises questions about dataset construction

Page 14:

Lexical Memorization

• Avoid lexical memorization with lexical train/test splits

• If "animal" appears in train, it cannot appear in test

• Lexical splits are applied in all our experiments (see the sketch below)
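One way to implement such a split, as a sketch; discarding pairs that mix the two vocabularies is an assumption here, not a detail stated in the talk:

```python
import random

def lexical_split(pairs, test_frac=0.3, seed=0):
    """Split (x, y) pairs so that train and test share no words.
    Pairs mixing the two vocabularies are discarded."""
    words = sorted({w for pair in pairs for w in pair})
    random.Random(seed).shuffle(words)
    test_vocab = set(words[:int(len(words) * test_frac)])
    train = [p for p in pairs if not set(p) & test_vocab]
    test = [p for p in pairs if set(p) <= test_vocab]
    return train, test
```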

Page 15:

Experiments without Lexical Memorization
• 4 supervised methods vs. 1 unsupervised method (cosine similarity)

• Cosine similarity outperforms all supervised DMs on 2/5 datasets

• Conclusion: supervised DMs are not necessarily better

[Figure: Performance (F1) of the best supervised method vs. the unsupervised baseline on Kotlerman 2010, BLESS 2011, Baroni 2012, Turney 2014, and Levy 2014.]

Page 16:

Are current supervised DMs learning a relation between x and y?

Page 17:

Learning a Relation between x and y

• Requires information about the compatibility of x and y

• What happens when we use Only $\vec{y}$ (i.e., ignore x)?

• Intuitively, it should fail: x could be anything!

Page 18:

Learning a Relation between x and y

• In practice, Only $\vec{y}$ performs:
  • almost as well as Concat & Diff
  • best of all methods on 1/5 datasets

[Figure: Performance (F1) of the best supervised method vs. Only $\vec{y}$ on Kotlerman 2010, BLESS 2011, Baroni 2012, Turney 2014, and Levy 2014.]

How can the classifier know that x infers y if it does not observe x?

Page 19:

If these methods are not learning a relation between x and y,

what exactly are they learning?


Page 21:

Prototypical Hypernyms

Hypothesis: the methods learn whether y is a prototypical hypernym

• Prototypical Hypernyms:
  • animal
  • mammal
  • fruit
  • drug
  • country
  • …

• Categories, Supersenses, etc.

Page 22:

Prototypical Hypernyms

Hypothesis: the methods learn whether y is a prototypical hypernym

Experiment:
• Given 2 positive examples $(x_1, y_1)$ ✔ and $(x_2, y_2)$ ✔
• Create artificial negative examples $(x_1, y_2)$ ✘ and $(x_2, y_1)$ ✘
• These artificial examples contain prototypical hypernyms as y
• How easily is the classifier "fooled" by these artificial examples? (see the sketch below)
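A sketch of how such artificial negatives could be generated; pairing each x with the y of the next positive example is an assumption of this sketch, not necessarily the paper's exact sampling:

```python
def swap_negatives(positives):
    """Create artificial negatives by swapping the hypernyms (y) of
    positive pairs, so every artificial y is still a real hypernym."""
    pos_set = set(positives)
    negatives = []
    for i, (x, _) in enumerate(positives):
        _, y_other = positives[(i + 1) % len(positives)]  # y from another pair
        if (x, y_other) not in pos_set:  # keep only genuine mismatches
            negatives.append((x, y_other))
    return negatives
```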

Page 23:

Prototypical Hypernyms

• Recall: the portion of real positive examples (✔) classified as true
• Match Error: the portion of artificial examples (✘) classified as true
  (both measures are sketched below)

• Bottom-right of the plot: classifiers that prefer ✔ over ✘
  • Good classifiers

• Top-left: classifiers that prefer ✘ over ✔
  • Worse than random

• Diagonal: classifiers that cannot distinguish ✔ from ✘
  • Predicted by the hypothesis
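The two measures as a sketch, assuming a trained classifier with a scikit-learn-style `predict` that returns 0/1 labels:

```python
def recall_and_match_error(clf, real_pos_feats, artificial_feats):
    """Recall: fraction of real positives classified true.
    Match error: fraction of artificial negatives classified true."""
    recall = clf.predict(real_pos_feats).mean()
    match_error = clf.predict(artificial_feats).mean()
    return recall, match_error
```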

Page 24:

Prototypical Hypernyms

• Recall: the portion of real positive examples (✔) classified as true
• Match Error: the portion of artificial examples (✘) classified as true

• Regression slope: 0.935

• Result: classifiers cannot distinguish between artificial (✘) and real (✔) examples

• Conclusion: classifiers return true when y is a prototypical hypernym

Page 25:

Prototypical Hypernyms: Analysis

• What are the classifiers’ most indicative features?

• Indicators that y is a category word

• Partial Hearst (1992) patterns, e.g. contexts like "such as" (from "y such as x")

Page 26:

Conclusions

Page 27:

Conclusions

• Are current supervised DMs better than unsupervised DMs?
  • Not necessarily
  • Previously reported success stems from lexical memorization

• Are current supervised DMs learning a relation between x and y?
  • No, they are not
  • Only $\vec{y}$ yields results similar to Concat and Diff

• If not, what are they learning?
  • Whether y is a prototypical hypernym ("mammal", "fruit", "country", …)

Page 28:

What if the necessary relational information does not exist in contextual features?

Page 29:

The Limitations of Contextual Features

• Contextual features cannot capture "x is a y" jointly

• What can they capture? Properties of x and properties of y (separately)

Page 30:

Thank you!