Do Supervised Distributional Methods Really Learn Lexical Inference Relations?
Omer Levy, Ido Dagan
Bar-Ilan University, Israel
Steffen Remus, Chris Biemann
Technische Universität Darmstadt, Germany
Lexical Inference
Lexical Inference: Task Definition
• Given 2 words x and y
• Does x infer y?
• In this talk, x ⟹ y refers to hypernymy ("x is a y")
Dataset
• Positive examples:
  dolphin ⟹ mammal
  Jon Stewart ⟹ comedian
• Negative examples:
  shark ⇏ mammal
  Jon Stewart ⇏ politician
Distributional Methods of Lexical Inference
Unsupervised Distributional Methods
• Represent x and y as vectors x⃗ and y⃗
  • Word Embeddings
  • Traditional (Sparse) Distributional Vectors
• Measure the similarity of x⃗ and y⃗
  • Cosine Similarity
  • Distributional Inclusion (Weeds & Weir, 2003; Kotlerman et al., 2010)
• Tune a threshold over the similarity of x⃗ and y⃗
  • Train a classifier over a single feature (a minimal sketch follows)
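As a rough illustration of the unsupervised recipe above (not the authors' exact implementation), each pair reduces to one cosine score plus a tuned decision threshold; the `vectors` dictionary and the dev-set variables in the comments are placeholders.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between the two word vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def tune_threshold(scores, labels):
    # Sweep candidate thresholds and keep the one with the best accuracy
    # on a labeled development set (the "single feature" classifier).
    best_t, best_acc = 0.0, 0.0
    for t in np.linspace(-1.0, 1.0, 201):
        acc = np.mean((scores >= t) == labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# vectors: word -> embedding (placeholder); dev_pairs, dev_labels: labeled (x, y) pairs.
# scores = np.array([cosine(vectors[x], vectors[y]) for x, y in dev_pairs])
# threshold = tune_threshold(scores, np.array(dev_labels))
```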
Supervised Distributional Methods
• Represent x and y as vectors x⃗ and y⃗
  • Word Embeddings
  • Traditional (Sparse) Distributional Vectors
• Represent the pair (x, y) as a combination of x⃗ and y⃗
  • Concat: x⃗ ⊕ y⃗ (Baroni et al., 2012)
  • Diff: y⃗ − x⃗ (Roller et al., 2014; Weeds et al., 2014; Fu et al., 2014)
• Train a classifier over the representation of (x, y)
  • Multi-feature representation (sketch below)
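A minimal sketch of this supervised setup, assuming pre-trained embeddings in a `vectors` dictionary and labeled word pairs; the specific classifier (logistic regression) is an illustrative choice, not necessarily the one used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def concat(vx, vy):
    # Concat representation of the pair (x, y): x ⊕ y (Baroni et al., 2012).
    return np.concatenate([vx, vy])

def diff(vx, vy):
    # Diff representation of the pair (x, y): y - x (Roller et al., 2014).
    return vy - vx

def featurize(pairs, vectors, combine):
    # Turn a list of (x, y) word pairs into a feature matrix.
    return np.array([combine(vectors[x], vectors[y]) for x, y in pairs])

# vectors: word -> embedding; train_pairs/train_labels/test_pairs: placeholders.
# X_train = featurize(train_pairs, vectors, concat)
# clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
# predictions = clf.predict(featurize(test_pairs, vectors, concat))
```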
Main Questions
• Are current supervised DMs better than unsupervised DMs?
• Are current supervised DMs learning a relation between x and y?
  • (No)
• If not, what are they learning?
Experiment Setup
Experiment Setup
• 9 Word Representations
  • 3 Representation Methods: PPMI, SVD (over PPMI), word2vec (SGNS)
  • 3 Context Types:
    • Bag-of-Words (5 words to each side)
    • Positional (2 words to each side + position)
    • Dependency (all syntactically-connected words + dependency relation)
  • Trained on English Wikipedia (a training sketch for the SGNS variant follows this list)
• 5 Lexical-Inference Datasets:
  • Kotlerman et al., 2010
  • Baroni and Lenci, 2011 (BLESS)
  • Baroni et al., 2012
  • Turney and Mohammad, 2014
  • Levy et al., 2014
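For the bag-of-words SGNS representation, training might look like the gensim-based sketch below; the corpus iterator and the hyperparameters (dimension, negative-sample count) are assumptions rather than values from the slides, and the positional and dependency context types would need a word2vecf-style tool instead of vanilla gensim.

```python
from gensim.models import Word2Vec  # gensim >= 4.0

# sentences: an iterable of tokenized English Wikipedia sentences (tiny placeholder here).
sentences = [["a", "dolphin", "is", "a", "marine", "mammal"],
             ["an", "apple", "is", "a", "fruit"]]

model = Word2Vec(
    sentences,
    vector_size=300,  # embedding dimension (assumed, not stated on the slide)
    window=5,         # bag-of-words contexts: 5 words to each side
    sg=1,             # skip-gram
    negative=15,      # negative sampling (SGNS); count assumed
    min_count=1,      # keep all words so the toy corpus works
)
vectors = {w: model.wv[w] for w in model.wv.index_to_key}
```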
Supervised Methods
• Concat: x⃗ ⊕ y⃗ (Baroni et al., 2012)
• Diff: y⃗ − x⃗ (Roller et al., 2014; Weeds et al., 2014; Fu et al., 2014)
• Only x: x⃗
• Only y: y⃗
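Stated as code, the four variants differ only in how the pair (x, y) is mapped to a feature vector; a small sketch of the combiners (compatible with the `featurize` helper above), with the names here being illustrative:

```python
import numpy as np

# One combiner per supervised variant; each maps the vectors of x and y
# to the pair's feature vector.
PAIR_FEATURES = {
    "concat": lambda vx, vy: np.concatenate([vx, vy]),  # x ⊕ y
    "diff":   lambda vx, vy: vy - vx,                   # y - x
    "only_x": lambda vx, vy: vx,                        # ignores y entirely
    "only_y": lambda vx, vy: vy,                        # ignores x entirely
}
```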
Are current supervised DMs better than unsupervised DMs?
Previously Reported Success
Prior Art:
• Supervised DMs better than unsupervised DMs
• Accuracy > 95% (on some datasets)
Our Findings:
• The high accuracy of supervised DMs stems from lexical memorization
Lexical Memorization
• Learning that a specific word is a strong indicator of the label, regardless of the other word in the pair
Example:
• Many positive training examples like (*, animal)
• The classifier memorizes that animal is a good indicator
• Test examples like (*, animal) are correctly classified "for free"
• In other words: overfitting
• Raises questions about dataset construction
Lexical Memorization
• Avoid lexical memorization with lexical train/test splits
• If “animal” appears in train, it cannot appear in test
• Lexical splits are applied to all our experiments (a sketch of one possible splitting procedure follows)
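One way such a lexical split could be implemented is sketched below; the exact partitioning procedure used in the paper may differ, so treat this as an assumption-laden illustration.

```python
import random

def lexical_split(examples, test_fraction=0.3, seed=0):
    # examples: list of (x, y, label) tuples.
    # Partition the *vocabulary* first, then keep only examples whose two
    # words fall on the same side, so that no word (e.g. "animal") can be
    # seen in training and again at test time.
    vocab = sorted({w for x, y, _ in examples for w in (x, y)})
    random.Random(seed).shuffle(vocab)
    test_vocab = set(vocab[: int(len(vocab) * test_fraction)])
    train = [e for e in examples if e[0] not in test_vocab and e[1] not in test_vocab]
    test = [e for e in examples if e[0] in test_vocab and e[1] in test_vocab]
    return train, test  # examples with one word on each side are discarded
```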
Experiments without Lexical Memorization
• 4 supervised methods vs. 1 unsupervised method (cosine similarity)
• Cosine similarity outperforms all supervised DMs in 2/5 datasets
• Conclusion: supervised DMs are not necessarily better
[Figure: Performance (F1) of the best supervised method vs. the unsupervised baseline (cosine similarity) on the five datasets: Kotlerman 2010, BLESS 2011, Baroni 2012, Turney 2014, Levy 2014]
Are current supervised DMs learning a relation between x and y?
Learning a Relation between x and y
• Requires information about the compatibility of x and y
• What happens when we use Only y (i.e., ignore x)?
• Intuitively, it should fail – x could be anything!
Learning a Relation between x and y
• In practice, Only y performs:
  • Almost as well as Concat & Diff
  • Best of all methods on 1/5 datasets
[Figure: Performance (F1) of the best supervised method vs. Only y on the five datasets: Kotlerman 2010, BLESS 2011, Baroni 2012, Turney 2014, Levy 2014]
How can the classifier know that x ⟹ y if it does not observe x?
If these methods are not learning a relation between x and y, what exactly are they learning?
Prototypical Hypernyms
Hypothesis: the methods learn whether y is a prototypical hypernym
• Prototypical Hypernyms: animal, mammal, fruit, drug, country, …
• Categories, Supersenses, etc.
Prototypical Hypernyms
Hypothesis: the methods learn whether y is a prototypical hypernym
Experiment:
• Given 2 positive examples (x₁ ⟹ y₁) and (x₂ ⟹ y₂) ✔
• Create artificial negative examples (x₁, y₂) and (x₂, y₁) ✘
• These artificial examples contain prototypical hypernyms as y
• How easily is the classifier "fooled" by these artificial examples?
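A sketch of the swap construction described above, assuming positive examples are (x, y) tuples; how positives are paired up (consecutively here) is an assumption. With the dataset slide's examples, dolphin ⟹ mammal and Jon Stewart ⟹ comedian would yield the artificial negatives (dolphin, comedian) and (Jon Stewart, mammal).

```python
def artificial_negatives(positive_pairs):
    # Swap the y's of consecutive positive pairs: (x1, y1), (x2, y2)
    # yield the artificial negatives (x1, y2) and (x2, y1).
    negatives = []
    for (x1, y1), (x2, y2) in zip(positive_pairs[0::2], positive_pairs[1::2]):
        negatives.append((x1, y2))
        negatives.append((x2, y1))
    return negatives

# Example with the pairs from the dataset slide:
# artificial_negatives([("dolphin", "mammal"), ("Jon Stewart", "comedian")])
# -> [("dolphin", "comedian"), ("Jon Stewart", "mammal")]
```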
Prototypical Hypernyms
• Recall: portion of real positive examples (✔) classified true
• Match Error: portion of artificial examples (✘) classified true
• Bottom-right: prefer ✔ over ✘ (good classifiers)
• Top-left: prefer ✘ over ✔ (worse than random)
• Diagonal: cannot distinguish ✔ from ✘ (predicted by the hypothesis)
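The two quantities could be computed from a trained classifier roughly as follows; the `predict`-over-featurized-pairs interface matches the earlier sketches and is an assumption.

```python
import numpy as np

def recall_and_match_error(clf, featurize_fn, real_positive_pairs, artificial_pairs):
    # Recall: fraction of real positive examples (the ✔ pairs) classified true.
    # Match Error: fraction of artificial swapped examples (the ✘ pairs) classified true.
    recall = float(np.mean(clf.predict(featurize_fn(real_positive_pairs)) == 1))
    match_error = float(np.mean(clf.predict(featurize_fn(artificial_pairs)) == 1))
    return recall, match_error
```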
Prototypical Hypernyms
• Recall: portion of real positive examples (✔) classified true
• Match Error: portion of artificial examples (✘) classified true
• Regression slope: 0.935
• Result: classifiers cannot distinguish between artificial (✘) and real (✔) examples
• Conclusion: classifiers return true when y is a prototypical hypernym
Prototypical Hypernyms: Analysis
• What are the classifiers’ most indicative features?
• Indicators that y is a category word
• Partial Hearst (1992) patterns
Conclusions
Conclusions
• Are current supervised DMs better than unsupervised DMs?
  • Not necessarily
  • Previously reported success stems from lexical memorization
• Are current supervised DMs learning a relation between x and y?
  • No, they are not
  • Only y yields similar results to Concat and Diff
• If not, what are they learning?
  • Whether y is a prototypical hypernym ("mammal", "fruit", "country", …)
What if the necessary relational information
does not exist in contextual features?
The Limitations of Contextual Features
• Contextual features cannot capture x and y jointly
• What can they capture?
• Properties of x and properties of y (separately)
Thank you!