
Hypernyms under Siege: Linguistically-motivated Artillery for Hypernymy Detection

Vered Shwartz¹, Enrico Santus² and Dominik Schlechtweg³

¹Bar-Ilan University, ²Singapore University of Technology and Design, ³University of Stuttgart

EACL 2017


Hypernymy Detection - Task Definition

- Given two terms (x, y), decide whether y is a hypernym of x
  - in some senses of x and y
  - e.g., python's hypernyms are animal and programming language
- Motivation: taxonomy creation, RTE, QA, ...
  - e.g., "A boy is hitting a baseball" ⇒ "A child is hitting a baseball"
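The task above can be sketched as a pair-classification interface. This is illustrative only; the function names, threshold, and toy gold set below are hypothetical and not from the paper:

```python
# Minimal sketch of the task interface (hypothetical names, toy data).
# A detector scores a term pair (x, y); y is predicted to be a hypernym
# of x when the score clears a threshold.

def detect_hypernymy(score_fn, x, y, threshold=0.5):
    """Return True if y is predicted to be a hypernym of x."""
    return score_fn(x, y) >= threshold

# Toy gold relation for the slide's example: python -> {animal, programming language}
GOLD = {("python", "animal"), ("python", "programming language")}

def toy_score(x, y):
    # Stand-in for a corpus-based measure (path-based or distributional).
    return 1.0 if (x, y) in GOLD else 0.0

assert detect_hypernymy(toy_score, "python", "animal")
assert not detect_hypernymy(toy_score, "animal", "python")  # hypernymy is asymmetric
```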


Hypernymy Detection

Corpus-based approaches:

- Path-based: Hearst Patterns [Hearst, 1992], Snow [Snow et al., 2005], [Necsulescu et al., 2015], ..., HypeNET [Shwartz et al., 2016]
- Distributional:
  - Supervised: concat [Baroni et al., 2012], difference [Weeds et al., 2014], ASYM [Roller et al., 2014], ...
  - Unsupervised:
    - Similarity: Cosine [Salton and McGill, 1986], Lin [Lin, 1998], APSyn [Santus et al., 2016b]
    - Informativeness: SLQS [Santus et al., 2014], RCTC [Rimell, 2014]
    - Inclusion: WeedsPrec [Weeds and Weir, 2003], ClarkeDE [Clarke, 2009], balAPinc [Kotlerman et al., 2010], ...
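To make the similarity-vs.-inclusion distinction in the taxonomy concrete, here is a minimal sketch of cosine (a symmetric similarity measure) next to WeedsPrec (an asymmetric inclusion measure) on sparse context vectors. The toy vectors and weights are invented for illustration:

```python
import math

# Contrast a similarity measure (cosine) with an inclusion measure
# (WeedsPrec [Weeds and Weir, 2003]). Vectors map context features to
# (weighted) counts; the toy data below is invented.

def cosine(u, v):
    dot = sum(u[f] * v.get(f, 0.0) for f in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv)

def weeds_prec(u, v):
    # Fraction of u's context weight that falls on contexts v also has.
    shared = sum(w for f, w in u.items() if f in v)
    return shared / sum(u.values())

cat    = {"eat": 3.0, "alive": 2.0, "purr": 1.0}
animal = {"eat": 5.0, "alive": 4.0, "wild": 2.0, "move": 3.0}

sim = cosine(cat, animal)      # symmetric: same in both directions
inc = weeds_prec(cat, animal)  # asymmetric: how much of cat's contexts animal covers
```

Cosine cannot tell which term is the more general one, while WeedsPrec scores cat ⊆ animal higher than animal ⊆ cat, which is exactly the asymmetry hypernymy detection needs.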


Key Takeaways

- We performed an extensive set of experiments, with:
  - 14 unsupervised measures
  - 12 DSMs: 6 context types × 2 feature weightings
  - 4 datasets
- Our findings:
  1. There is no specific measure that is always preferred
     - Different measures capture different aspects of hypernymy
     - Specific measures are preferred for distinguishing hypernymy from each particular other relation
  2. Unsupervised measures are not deprecated given supervised methods
     - In real-application scenarios, unsupervised measures may be preferred


Outline

1. Comparing Between Unsupervised Measures
   - Hypernym vs. All Other Relations
   - Hypernym vs. One Other Relation
2. Supervised vs. Unsupervised
   - Performance
   - Robustness


Hypernym vs. All

- Q: Is there a "best" unsupervised measure?
- Evaluation setting: hypernymy vs. all other relations
- Compute the measure score for each (x, y) pair
- Evaluation metric: average precision
- A: There is no single combination of measure, context type and feature weighting that consistently performs best

  dataset        measure   context type  AP@100
  EVALution      invCL     joint         0.661
  BLESS          invCL     win5          0.54
  Lenci/Benotto  APSyn     joint         0.617
  Weeds          clarkeDE  win5d         0.911
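The evaluation above can be sketched as follows: rank all pairs by the measure's score and compute average precision over the ranking, with hypernym pairs as the positives. The scored pairs below are invented:

```python
# Average precision over a ranking produced by an unsupervised measure.

def average_precision(ranked_labels):
    """ranked_labels: 0/1 labels sorted by descending measure score."""
    hits, precisions = 0, []
    for i, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / max(hits, 1)

# (score, is_hypernym) pairs from some measure; data is invented.
scored = [(0.9, 1), (0.8, 0), (0.7, 1), (0.4, 1), (0.2, 0)]
labels = [lab for _, lab in sorted(scored, key=lambda p: -p[0])]
ap = average_precision(labels)  # averages precision at each hypernym hit
```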


Hypernym vs. One Other Relation

- Different measures capture different aspects of hypernymy
  - e.g., that x's contexts are included in y's
- Q: Is there a most suitable measure to distinguish hypernymy from each other relation?
- Consider all relations that occur in two datasets
- For each such relation, for each dataset, rank the measures by AP and select those with AP ≥ 0.8


Hypernym vs. Attribute

- Best measures: similarity measures + syntactic contexts

Possible explanation:

- Similarity measures quantify the ratio of shared contexts between x and y
- Attributes' contexts are easy to distinguish syntactically
  - y is almost always an adjective, x is a noun/verb
  - Fewer shared contexts than in hyponym-hypernym pairs


Hypernym vs. Meronym

- Best measures: inclusion measures + syntactic contexts

Possible explanation:

- Inclusion measures quantify the ratio of x's contexts included in y's [Weeds and Weir, 2003, Geffet and Dagan, 2005]
- Window-based contexts capture topical context
  - Topics are shared between holonym-meronym and hypernym-hyponym pairs
- Syntactic contexts capture functional context
  - Meronyms and holonyms more often have different functions than hyponyms and hypernyms
    - cat and animal both eat and are alive
    - you wag a tail but not a cat
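As a sketch of the inclusion family named above, here is ClarkeDE [Clarke, 2009] on toy syntactic-context vectors. The contexts and weights are invented; the point is that a hyponym's functional contexts are largely included in its hypernym's, while a meronym's are not:

```python
# ClarkeDE: the degree to which u's context weights are included in v's.

def clarke_de(u, v):
    included = sum(min(w, v.get(f, 0.0)) for f, w in u.items())
    return included / sum(u.values())

# Toy syntactic (functional) contexts: cat and animal share predicates,
# while tail occurs with holonym-specific ones like "wag".
cat    = {"subj_of:eat": 4.0, "subj_of:live": 3.0}
animal = {"subj_of:eat": 6.0, "subj_of:live": 5.0, "subj_of:move": 2.0}
tail   = {"obj_of:wag": 5.0, "subj_of:sway": 1.0}

hyper_score = clarke_de(cat, animal)  # high: cat's contexts included in animal's
mero_score  = clarke_de(tail, cat)    # low: tail's syntactic contexts differ
```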


Hypernym vs. Synonym (1/2)

- Best measures: SLQS / invCL

Possible explanation:

- SLQS [Santus et al., 2014]: is x more informative than y?
  - y is less informative than x in hypernymy, but not in synonymy
    - budgie is more informative than bird
    - budgie and parakeet are at the same informativeness level
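A minimal sketch of the SLQS idea, assuming the entropies of each term's most associated contexts are already available. The numbers below are invented; in practice they are computed from co-occurrence counts over a corpus:

```python
import math
import statistics

# SLQS intuition: a hypernym y occurs in less informative (higher-entropy)
# contexts than its hyponym x. A positive score suggests y is more general.

def context_entropy(word_counts):
    """Shannon entropy of a context's co-occurrence distribution."""
    total = sum(word_counts)
    probs = [c / total for c in word_counts]
    return -sum(p * math.log2(p) for p in probs)

def slqs(x_context_entropies, y_context_entropies):
    ex = statistics.median(x_context_entropies)
    ey = statistics.median(y_context_entropies)
    return 1.0 - ex / ey  # > 0: y's contexts are less informative (more general)

# budgie's top contexts are specific (low entropy); bird's are general (high).
budgie_entropies = [2.1, 2.4, 2.8]
bird_entropies   = [4.0, 4.5, 5.2]

assert slqs(budgie_entropies, bird_entropies) > 0            # hypernymy direction
assert abs(slqs(budgie_entropies, budgie_entropies)) < 1e-9  # synonym-like: ~0
```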


Hypernym vs. Synonym (2/2)

- invCL [Lenci and Benotto, 2012] is a strict inclusion measure, quantifying also the non-inclusion of y in x
- Inclusion measures may assign high scores to synonyms, since they share many contexts
- invCL penalizes (x, y) pairs in which x also includes many of y's contexts
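A sketch of invCL under these definitions, combining the inclusion of x in y with the non-inclusion of y in x (here using a ClarkeDE-style inclusion cl; the toy vectors are invented):

```python
import math

def cl(u, v):
    # ClarkeDE-style inclusion of u's contexts in v's.
    return sum(min(w, v.get(f, 0.0)) for f, w in u.items()) / sum(u.values())

def inv_cl(u, v):
    # High when u is included in v AND v is NOT included in u.
    return math.sqrt(cl(u, v) * (1.0 - cl(v, u)))

budgie   = {"cage": 3.0, "seed": 2.0}
bird     = {"cage": 3.0, "seed": 2.0, "fly": 4.0, "nest": 3.0}
parakeet = {"cage": 3.0, "seed": 2.0}

# Hyponym-hypernym: budgie included in bird, but not vice versa -> high score.
# Synonyms: mutual inclusion, so the (1 - cl(v, u)) factor shrinks the score.
assert inv_cl(budgie, bird) > inv_cl(budgie, parakeet)
```

This is exactly how invCL avoids the synonym trap: a plain inclusion score would rate (budgie, parakeet) just as highly as (budgie, bird).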


Supervised Methods

- (Distributional) state of the art: supervised embedding-based methods
- (x, y) term pairs are represented as a feature vector based on the terms' embeddings:
  - Concatenation ~x ⊕ ~y [Baroni et al., 2012]
  - Difference ~y − ~x [Weeds et al., 2014]
  - ASYM [Roller et al., 2014]
- Evaluation setting:
  - Tune hyper-parameters (methods, pre-trained embeddings)
  - Train a logistic regression classifier
  - Use the logistic score for ranking; report AP
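The feature construction above can be sketched as follows. The embeddings are random stand-ins for pre-trained vectors, and training the logistic regression classifier itself is omitted:

```python
import numpy as np

# Represent a term pair by concatenating or subtracting the terms'
# embeddings; the resulting matrices would be fed to a classifier
# (e.g. logistic regression). Embeddings below are random stand-ins.

rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(50) for w in ["boy", "child", "cat", "animal"]}

def concat_features(x, y):
    return np.concatenate([emb[x], emb[y]])  # concat [Baroni et al., 2012]

def diff_features(x, y):
    return emb[y] - emb[x]                   # difference [Weeds et al., 2014]

pairs = [("boy", "child"), ("cat", "animal")]
X_concat = np.stack([concat_features(x, y) for x, y in pairs])  # shape (2, 100)
X_diff = np.stack([diff_features(x, y) for x, y in pairs])      # shape (2, 50)
```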


Supervised vs. Unsupervised Experiment

- Concat achieves almost perfect AP scores, outperforming the unsupervised measures:

  dataset        relation   best supervised  best unsupervised
  EVALution      meronym    0.998            0.886
                 attribute  1.000            0.877
                 antonym    1.000            0.773
                 synonym    0.996            0.813
  BLESS          meronym    1.000            0.939
                 coord      1.000            0.938
                 attribute  1.000            0.938
                 event      1.000            0.847
                 random-n   0.995            0.818
                 random-j   1.000            0.917
                 random-v   1.000            0.895
  Lenci/Benotto  antonym    0.917            0.807
                 synonym    0.946            0.914
  Weeds          coord      0.873            0.824


Measuring Robustness (1/3)

- Is there any advantage to unsupervised measures?
  - apart from not requiring training data...
- Supervised methods do not learn the relation between x and y:
  - [Levy et al., 2015]: they memorize that y is a prototypical hypernym
  - [Roller and Erk, 2016]: they trace y's occurrences in Hearst patterns
  - [Shwartz et al., 2016]: they provide the "prior" of y to fit the relation
- The AP scores may suggest overfitting by the supervised methods...


Measuring Robustness (2/3)

- We repeated the experiment of [Santus et al., 2016a]
  - adding switched hypernym pairs as random
  - e.g. (apple, animal), (cow, fruit)
- Using BLESS (the only dataset with random pairs)
- For each hypernym pair (x1, y1):
  - sample y2 such that (x2, y2) ∈ dataset is a hypernym pair
  - and such that (x1, y2) ∉ dataset
  - add (x1, y2) as random
- We added 139 such new random pairs
- We used the best supervised and unsupervised methods to re-classify the revised dataset
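The switching procedure can be sketched as follows; the toy dataset and helper names are invented:

```python
import random

# For each hypernym pair (x1, y1), pick a hypernym y2 from a DIFFERENT
# pair and add (x1, y2) as a random (negative) pair, provided (x1, y2)
# is not itself a pair in the dataset.

hypernym_pairs = {("apple", "fruit"), ("cow", "animal"), ("oak", "tree")}

def switched_pairs(pairs, seed=0):
    rng = random.Random(seed)
    hypernyms = sorted({y for _, y in pairs})
    switched = set()
    for x1, y1 in sorted(pairs):
        candidates = [y2 for y2 in hypernyms
                      if y2 != y1 and (x1, y2) not in pairs]
        if candidates:
            switched.add((x1, rng.choice(candidates)))
    return switched

new_random = switched_pairs(hypernym_pairs)  # e.g. (apple, animal), (cow, fruit)
assert all(p not in hypernym_pairs for p in new_random)
```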


Measuring Robustness (3/3)

- The supervised method drops in performance, while unsupervised performance is stable:

               method                  AP@100 original  AP@100 switched  ∆
  supervised   concat, word2vec, L1    0.995            0.575            -0.42
  unsupervised cosWeeds, win2d, ppmi   0.818            0.882            +0.064

- 121 out of the 139 switched pairs were falsely classified as hypernyms by the supervised method!
- Unsupervised measures are robust; their performance even slightly improves
- Consistent with the transfer-learning experiment of [Turney and Mohammad, 2015]


Recap

- We experimented with an extensive number of unsupervised measures for hypernymy detection
- There is no specific measure that is always preferred
  - Different measures capture different aspects of hypernymy
  - Specific measures are preferred for distinguishing hypernymy from each particular other relation
- Unsupervised measures are not deprecated given supervised methods
  - Supervised methods are favored by the existing datasets
  - However, they are extremely sensitive to the training data
  - Unsupervised methods are more robust


Future Work

I A combination of unsupervised measures may perform well in distinguishing hypernymy from all other relations

I A simple classifier with unsupervised measures as features performs well

I However, it is not as robust as the unsupervised measures

I Current datasets are biased and encourage overfitting

I In the future, the development of natural datasets may better reflect the performance of methods in real-application scenarios
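The "simple classifier with unsupervised measures as features" idea can be sketched as follows. This is a from-scratch logistic regression over invented measure scores (cosine plus two directional inclusion scores), not the paper's actual feature set, data, or classifier:

```python
import math

# Hypothetical setup: each candidate pair (x, y) is represented by a
# small vector of unsupervised measure scores, and a plain logistic
# regression is trained on labeled pairs (1 = hypernymy, 0 = other).
# All feature values and labels below are invented for illustration.

def train_logreg(X, y, lr=0.5, epochs=2000):
    w = [0.0] * (len(X[0]) + 1)          # feature weights + bias term
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[-1] + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi                   # gradient of the log-loss
            for j, xj in enumerate(xi):
                w[j] -= lr * g * xj
            w[-1] -= lr * g
    return w

def predict(w, xi):
    z = w[-1] + sum(wj * xj for wj, xj in zip(w, xi))
    return 1 if z > 0 else 0

# features per pair: [cosine, inclusion(x -> y), inclusion(y -> x)]
X = [[0.7, 0.9, 0.4],   # hypernym pairs: high forward inclusion
     [0.8, 0.8, 0.3],
     [0.9, 0.5, 0.5],   # co-hyponyms etc.: roughly symmetric scores
     [0.6, 0.4, 0.5]]
y = [1, 1, 0, 0]

w = train_logreg(X, y)
print([predict(w, xi) for xi in X])
```

The classifier effectively learns which combination of measures separates hypernymy from the other relations, which is exactly why it inherits any bias present in the training pairs.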

Thank you!
