Not All Contexts Are Equal: Automatic Identification of Antonyms, Hypernyms, Co-Hyponyms, and Synonyms in DSMs
Enrico Santus, Qin Lu, Alessandro Lenci and Chu-Ren Huang
Modeling Human Language Ability
• In the last decades, NLP has achieved impressive progress in modeling human language ability, developing a large number of applications:
– Information Retrieval (IR)
– Information Extraction (IE)
– Question Answering (QA)
– Machine Translation (MT)
– Others…
The Need for Resources
• These applications were improved not only by bettering the algorithms, but also through the use of better lexical resources and ontologies (Lenci, 2008a)
– WordNet
– SUMO
– DOLCE
– ConceptNet
– Others…
Automatic Creation of Resources
• As the relevance of these resources has grown, systems for their automatic creation have assumed a key role in NLP.
• Handmade resources are in fact:
– Arbitrary
– Expensive to create
– Time-consuming
– Difficult to keep updated
Semantic Relations as Building Blocks
• Entities and relations have been identified as the main building blocks of these resources (Herger, 2014).
• NLP has focused on methods for the automatic extraction and representation of entities and relations, in order to:
– Increase the effectiveness of such resources;
– Reduce the costs of development and updates.
• Yet, we are still far from achieving satisfactory results.
Semantic Relation Identification and Discrimination in DSMs
• The distributional approach was chosen because it is:
– Completely unsupervised (Turney and Pantel, 2010);
– Portable to any language for which large corpora can be collected (ibid.);
– Applicable to a large range of tasks (ibid.);
– Cognitively plausible (Lenci, 2008b);
– Strong in identifying similarity (ibid.).
From Syntagmatic to Paradigmatic Relations
• The main semantic relations (i.e. synonymy, antonymy, hypernymy, meronymy) are also called paradigmatic semantic relations.
• Paradigmatic relations are concerned with the possibility of substitution in the same syntagmatic contexts.
• They should be considered in opposition to the syntagmatic relations, which – instead – are concerned with the position in the sentence (syntagm).
(de Saussure, 1916)
Distributional Semantics
• Distributional Semantics can be used to derive the paradigmatic relations from syntagmatic ones.
• It relies on the Distributional Hypothesis (Harris, 1954), according to which:
1. At least some aspects of the meaning of a linguistic expression depend on its distribution in contexts;
2. The degree of similarity between two linguistic expressions is a function of the similarity of the contexts in which they occur.
Vector Space Models
• Starting from the Distributional Hypothesis, computational models have been developed that represent words as vectors, whose dimensions contain the Strength of Association (SoA) with the contexts.
– SoA is generally the frequency of co-occurrence or the mutual information (PMI, PPMI, LMI, etc.)
– Contexts may be single words within a window or within a syntactic structure, pairs of words, etc.
• Words are therefore spatially represented, and their meaning is given by their proximity to other vectors in such a vector space, often also referred to as a semantic space (Turney and Pantel, 2010).
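As a minimal sketch of the weighting step, the following computes Local Mutual Information from raw co-occurrence counts. The toy counts and word/context names are invented for illustration; LMI multiplies the PMI of a pair by its joint frequency, which is what discounts the low-frequency pairs PMI overrates.

```python
from collections import Counter
from math import log2

def lmi_weights(cooc):
    """Weight raw co-occurrence counts with Local Mutual Information:
    LMI(w, c) = f(w, c) * log2( p(w, c) / (p(w) * p(c)) )."""
    total = sum(cooc.values())
    word_freq, ctx_freq = Counter(), Counter()
    for (w, c), f in cooc.items():
        word_freq[w] += f
        ctx_freq[c] += f
    weights = {}
    for (w, c), f in cooc.items():
        pmi = log2((f * total) / (word_freq[w] * ctx_freq[c]))
        weights[(w, c)] = f * pmi  # the f factor discounts low-frequency pairs
    return weights

# Toy counts: content-word contexts get high LMI; the frequent
# function word "the" stays near or below zero.
cooc = {("dog", "bark"): 8, ("dog", "the"): 50,
        ("cat", "purr"): 6, ("cat", "the"): 40}
weights = lmi_weights(cooc)
```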
Distributional Semantic Models: Similarity as Proximity
• DSMs are known for their ability to identify semantically similar lexemes.
• The vector cosine is generally used: it returns a value between 0 and 1, where 0 means distributionally totally unrelated and 1 distributionally identical.
(Santus et al., 2014a-c)
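A small sketch of the cosine over sparse context vectors, with invented toy vectors; with non-negative weights (frequency, PPMI, or LMI clipped at zero) the value stays in [0, 1] as the slide states.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse word vectors ({context: weight})."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical context vectors: similar words share weighted contexts.
car = {"drive": 3.0, "road": 2.0, "fast": 1.0}
vehicle = {"drive": 2.5, "road": 1.5, "transport": 2.0}
moon = {"orbit": 4.0, "night": 2.0}
```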
Shortcomings of DSMs: Semantic Relations
• Unfortunately, the definition of distributional similarity is so loose that under its umbrella fall not only near-synonyms (e.g. nice-good), but also:
– hypernyms (e.g. car-vehicle)
– co-hyponyms (e.g. car-motorbike)
– antonyms (e.g. good-bad)
– meronyms (e.g. dog-tail)
• Words holding these relations have in fact similar distributions.
(Santus et al., 2014a-c; 2015a-c)
How to Identify and Discriminate Semantic Relations
• Identification of semantic relations (classification) consists in classifying word-pairs according to the semantic relation they hold. The F1 score is generally used to evaluate the accuracy of the algorithm.
• Semantic relation discrimination (relation retrieval) consists in returning a list of word-pairs, sorted according to a score that aims to predict a specific relation. Average Precision is generally used to evaluate the accuracy of the algorithm.
Distributional Semantic Model
• All the experiments described in the next slides are performed on a standard window-based DSM recording co-occurrences with the nearest X content words to the left and right of each target word.
– In most of our experiments, we have used X = 2 or 5, because small windows are most appropriate for paradigmatic relations.
• Co-occurrences were extracted from a combination of the freely available ukWaC and WaCkypedia corpora, and weighted with Local Mutual Information (LMI; Evert, 2005).
(Santus et al., 2014a-c; 2015a-c)
NEAR-SYNONYMY
Near-Synonymy: TOEFL & ESL
• Near-synonym: a word having the same or nearly the same meaning as another in the language, e.g. happy, joyful, elated.
• Similarity is the main organizer of the semantic lexicon (Landauer and Dumais, 1997).
• Two common tests for evaluating methods of near-synonymy identification are the TOEFL and the ESL tests.
• These tests consist of several questions: a word is provided and the algorithm should find the most similar one (its near-synonym) among four possible choices.
– TOEFL (Test of English as a Foreign Language): 80 questions, with four choices each;
– ESL (English as a Second Language): 50 questions, with four choices each.
(Santus et al., 2016b; 2016d)
APSyn: Hypothesis
• We have developed APSyn (Average Precision for Synonyms), a variation of the Average Precision measure that aims to automatically identify near-synonyms in corpora.
• The measure is based on the hypothesis that:
– Not only do similar words occur in similar contexts, but they also tend to share their most relevant contexts.
• E.g. good-nice will share contexts like pretty, very, rather, quite, weather, etc. (SketchEngine: Diffs)
APSyn: Method
• To identify the most related contexts, we decided to rank them according to Local Mutual Information (LMI; Evert, 2005)
– LMI is similar to Pointwise Mutual Information (PMI), but it is not biased towards low-frequency elements.
• In our experiments, after having ranked the contexts, we pick the top N ones, where 100 ≤ N ≤ 1000.
• At this point, the intersection of the top N contexts of the two target words in the word-pair is computed and weighted according to the average rank of the shared contexts.
APSyn: Definition
• For every feature f included in the intersection between the top N features of w1 (i.e. N(F1)) and w2 (i.e. N(F2)), APSyn adds 1 divided by the average rank of the feature among the top LMI-ranked features of w1 (i.e. rank1(f)) and w2 (i.e. rank2(f)).
• Expected scores:
– High scores for synonyms
– Low scores or zero for less similar words
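The definition above can be sketched as follows, assuming the contexts of each word have already been ranked by decreasing LMI (the toy context lists for good and nice are invented):

```python
def apsyn(ranked1, ranked2, n=100):
    """APSyn(w1, w2) = sum over f in N(F1) ∩ N(F2) of 1 / avg(rank1(f), rank2(f)).
    ranked1, ranked2: context lists sorted by decreasing LMI (rank 1 = most related)."""
    top1 = {c: r for r, c in enumerate(ranked1[:n], start=1)}
    top2 = {c: r for r, c in enumerate(ranked2[:n], start=1)}
    # shared contexts contribute more when they are near the top of both rankings
    return sum(1.0 / ((top1[c] + top2[c]) / 2.0) for c in top1.keys() & top2.keys())

# Hypothetical LMI-ranked context lists for "good" and "nice".
good = ["very", "pretty", "rather", "weather", "quite"]
nice = ["pretty", "very", "weather", "person", "quite"]
```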
Experiments
1. Questions were transformed into word-pairs: PROBL_WORD – POSSIB_CHOICE
2. APSyn scores were assigned to all the word-pairs.
3. In every question, the word-pairs were ranked in decreasing order according to APSyn.
4. If the right answer was ranked first, we added 0.25 times the number of wrong answers present in our DSM to the final score.
• BASELINES
– Cosine and co-occurrence baselines are provided for comparison.
– Random baseline is 25%.
– The average non-English US college applicant scores 64.5% on the TOEFL.
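The evaluation loop in steps 1–4 can be sketched in simplified form. This version counts a question as correct when the right choice gets the highest score, and ignores the 0.25-weighting for choices missing from the DSM; the scorer, question tuples and similarity values are invented for illustration.

```python
def multiple_choice_accuracy(questions, score):
    """questions: list of (problem_word, choices, correct_choice) tuples;
    score(w1, w2) -> similarity.  A question counts as correct when the
    right choice receives the highest score."""
    correct = 0
    for problem, choices, answer in questions:
        best = max(choices, key=lambda c: score(problem, c))
        correct += best == answer
    return correct / len(questions)

# Toy TOEFL-style question with hypothetical similarity scores.
sim = {("levied", "imposed"): 0.9, ("levied", "believed"): 0.2,
       ("levied", "requested"): 0.3, ("levied", "correlated"): 0.1}
questions = [("levied", ["imposed", "believed", "requested", "correlated"], "imposed")]
acc = multiple_choice_accuracy(questions, lambda a, b: sim[(a, b)])
```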
Discussion
• APSyn, without any optimization, is still not as good as the state-of-the-art
– 100% on TOEFL and 82% on ESL
• However, it is:
– completely unsupervised (therefore applicable to other languages)
– linguistically grounded (therefore it captures some linguistic properties)
• And it performs:
– better than the random baseline, the vector cosine and the co-occurrence, on our DSM;
– very similarly to foreign students who take the TOEFL test.
• The value of N:
– the smaller N (close to 100), the better the performance of APSyn.
• This is probably due to the fact that when N is too big, not only the most relevant contexts are considered.
• In order to optimize the performance, N can be learnt from a training set.
(Santus et al., 2016b; 2016d)
ANTONYMY
Antonymy: Importance & Definition
• Antonymy:
– is one of the main relations shaping the organization of the semantic memory (together with near-synonymy and hypernymy).
– Although it is essential for many NLP tasks (e.g. MT, SA, etc.), current approaches to antonymy identification are still weak in discriminating antonyms from synonyms.
• Antonymy is in fact:
– similar to synonymy in many aspects (e.g. distributional behavior);
– hard to define:
• there are many subtypes of antonymy;
• even native speakers of a language do not always agree on classifying word-pairs as antonyms.
(Mohammad et al., 2008; 2013)
Antonymy: Definition
• Over the years, scholars from different disciplines have tried to:
– define antonymy;
– classify the different subtypes of antonymy.
• Kempson (1977) defined antonyms as word-pairs with a "binary incompatible relation", such that the presence of one meaning entails the absence of the other.
• giant – dwarf vs. giant – person
• Cruse (1986) identified an important property of antonymy and called it the paradox of simultaneous similarity and difference between antonyms:
– Antonyms are similar in every dimension of meaning except in a specific one.
• giant = dwarf, except for the size (big vs. small)
Antonymy: Co-Occurrence Hypothesis
• Most of the unsupervised work on antonymy identification is based on the co-occurrence hypothesis:
– antonyms co-occur in the same sentence more often than expected by chance (e.g. in coordinate contexts of the form A and/or B)
• Do you prefer meat or vegetables?
• Shortcoming: other semantic relations are also characterized by this property (e.g. co-hyponyms, near-synonyms).
• Do you prefer a dog or a cat?
• Is she only pretty or wonderful?
(Santus et al., 2014b-‐c)
APAnt: Hypothesis
• If we consider the paradox of simultaneous similarity and difference between antonyms, we have the following distributional correlate:
• We can fill the empty field with:
– "Similar distributional behaviors, except for one dimension of meaning".
• Since giant and dwarf are similar in every dimension of meaning except for the one related to size → they occur in similar contexts, except for those related to that dimension.
• We can also assume that the dimension of meaning in which they differ is a salient one, and – by consequence – that they will behave distributionally differently in their most relevant contexts.
• Size is a salient dimension for both giant and dwarf, and they are expected to have a different distributional behavior for this dimension of meaning (i.e. big vs. small).
MEANING → DISTRIBUTIONAL BEHAVIOUR
SYNONYMS: Similar in every dimension → Similar distributional behaviors
ANTONYMS: Similar in every dimension except one → ???
APAnt: Method
• APAnt (Average Precision for Antonyms) is defined as the inverse of APSyn (Santus et al., 2014b-c; 2015b-c).
– Note:
• 1/vector cosine = "no distributional similarity"
• while 1/APSyn = "not sharing the most salient contexts"
• Expected scores:
– High scores for words not sharing many top contexts (antonyms or unrelated words)
– Low scores for words sharing many top contexts (near-synonyms)
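The inversion can be sketched as below, self-contained: APSyn is recomputed inline over LMI-ranked context lists, and APAnt is its reciprocal. The toy context lists are invented to show that an antonym-like pair (small overlap in salient contexts) scores higher than a synonym-like pair.

```python
def apant(ranked1, ranked2, n=100):
    """APAnt = 1 / APSyn over the top-N LMI-ranked contexts of the two words.
    High when the words share few of their most salient contexts."""
    top1 = {c: r for r, c in enumerate(ranked1[:n], start=1)}
    top2 = {c: r for r, c in enumerate(ranked2[:n], start=1)}
    apsyn = sum(1.0 / ((top1[c] + top2[c]) / 2.0) for c in top1.keys() & top2.keys())
    return float("inf") if apsyn == 0 else 1.0 / apsyn

# Hypothetical ranked contexts: the antonym pair overlaps less than the synonyms.
good_ctx = ["very", "pretty", "kind", "weather"]
nice_ctx = ["very", "pretty", "kind", "person"]   # near-synonym: large overlap
bad_ctx  = ["very", "awful", "smell", "luck"]     # antonym: small overlap
```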
APAnt: Evaluation
• We performed several antonym retrieval experiments to evaluate APAnt.
• For the evaluation, we relied on three main datasets, which contain word-pairs labeled with the semantic relations they hold:
– BLESS (Baroni and Lenci, 2011)
• Hypernyms, Co-Hyponyms, Meronyms, etc.
– Lenci/Benotto (Santus et al., 2014b-c)
• Antonyms, Synonyms and Hypernyms
– EVALution 1.0 (Santus et al., 2015a)
• Hypernyms, Meronyms, Synonyms, Antonyms, etc.
• DSMs: 2 content words to the left and the right.
APAnt: Experiment 1 – Information Retrieval
• APAnt scores were assigned to all the word-pairs in the dataset.
• Word-pairs were ranked in decreasing order, first by the first word in the pair and then by the APAnt value.
• Average Precision is used to evaluate the rank (Kotlerman et al., 2010). It returns a value between 0 and 1: 0 if all antonyms are at the bottom, 1 if they are all at the top.
• Results for the Lenci/Benotto dataset (2,232 word-pairs), and by POS, are provided.
• BASELINES
– Vector cosine and co-occurrence baselines.
(Santus et al., 2014b-‐c)
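The Average Precision metric used throughout these retrieval experiments can be sketched as follows (the label lists are invented examples):

```python
def average_precision(ranked_labels):
    """Average Precision over a ranked list (best-scored pair first).
    ranked_labels[i] is True iff the i-th pair holds the target relation."""
    hits, precisions = 0, []
    for i, is_target in enumerate(ranked_labels, start=1):
        if is_target:
            hits += 1
            precisions.append(hits / i)  # precision at each relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0
```

A perfect ranking (all target pairs first) yields 1.0; pushing the targets to the bottom drives the score down.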
APAnt – Discussion
• The evaluation was performed on:
– 2,232 word-pairs
• about 50% antonyms
• about 50% synonyms
• APAnt outperforms the vector cosine and the co-occurrence baseline on the full dataset.
– The co-occurrence and the cosine promote synonyms.
• APAnt outperforms the vector cosine and the co-occurrence baseline also across POS:
– Best results are obtained for NOUNS
– Worst results are obtained for ADJECTIVES
• It is in fact likely that opposite adjectives share their main contexts more than nouns do (e.g. cold/hot can be used to describe the same entity, while giant/dwarf cannot).
• N=100 is the best value in our settings.
HYPERNYMY
Hypernymy: Hypothesis
• Another measure, for the identification of hypernyms, was proposed in Santus et al. (2014a): SLQS.
• Given a word-pair, SLQS evaluates the generality of the N most related contexts of the two words, under the hypothesis that:
– hypernyms tend to occur in more general contexts (e.g. animal → eat) than hyponyms (e.g. dog → bark).
• Generality is evaluated in terms of the median Shannon entropy of the N most related contexts: the higher the median entropy, the more general the word is considered.
SLQS: Method
• The N most LMI-related contexts for both words are selected (100 ≤ N ≤ 250).
• For each context c, we calculate the entropy (Shannon, 1948): H(c) = -Σ_f p(f|c) · log2 p(f|c)
• Then, for each word w, we pick the median entropy among its N most LMI-related contexts: E_w = median(H(c_1), …, H(c_N))
• And we finally calculate SLQS according to the following formula: SLQS(w1, w2) = 1 - E_w1 / E_w2
• Expected results:
– SLQS = 0 if the words in the pair have similar generality
– SLQS > 0 if w2 is more general
– SLQS < 0 if w2 is less general
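The two steps can be sketched as below, assuming the top-N contexts of each word have already been selected; the toy entropy values for dog and animal are invented to illustrate the sign of the score.

```python
from math import log2
from statistics import median

def shannon_entropy(counts):
    """H(c) = -sum_f p(f|c) * log2 p(f|c), over the words f that context c
    co-occurs with; counts is a list of co-occurrence frequencies."""
    total = sum(counts)
    return -sum((f / total) * log2(f / total) for f in counts if f > 0)

def slqs(entropies_w1, entropies_w2):
    """SLQS(w1, w2) = 1 - E_w1 / E_w2, where E_w is the median entropy of the
    N most LMI-related contexts of w.  Positive => w2 is the more general word."""
    return 1.0 - median(entropies_w1) / median(entropies_w2)

# Hypothetical context entropies: the hypernym's contexts (e.g. "eat") are
# higher-entropy, hence more general, than the hyponym's (e.g. "bark").
dog_entropies = [3.1, 2.8, 2.5]
animal_entropies = [5.2, 4.9, 4.4]
```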
SLQS: Experiment 1
• Task: identify the directionality of the pair.
• DSM: 2-window
• Dataset: BLESS (Baroni and Lenci, 2011)
– 1,277 hypernym pairs
• Note: all of them are in the hyponym-hypernym order; therefore we expect SLQS > 0.
• Results:
– SLQS obtains 87% precision, outperforming both WeedsPrec, which is based on the Inclusion Hypothesis, and the frequency baselines.
(Santus et al., 2014a)
SLQS: Experiment 2
• Task: information retrieval task.
– Given hypernyms, coordinates, meronyms and randoms, score them in such a way that the hypernyms are ranked at the top.
• DSM: 2-window
• Dataset: BLESS (Baroni and Lenci, 2011)
– 1,277 hypernyms, together with coordinates, meronyms and randoms.
• We combined SLQS and the cosine, as they respectively capture generality and similarity.
• Results:
– SLQS*Cosine obtains 59% AP (Kotlerman et al., 2010), outperforming WeedsPrec, which is based on the Inclusion Hypothesis, the cosine, and the frequency baselines.
(Santus et al., 2014a)
HYPERNYMY, CO-HYPONYMY AND RANDOMS
ROOT13: A Supervised Method
• Task: classification of hypernyms, co-hyponyms and randoms
• Classifier: Random Forest
• Features:
– Cosine, co-occurrence frequency, frequency of w1-w2, entropy of w1-w2 (Turney and Pantel, 2010; Shannon, 1948)
– Shared: size of the intersection between the top 1k associated contexts of the two terms, according to the LMI score (Evert, 2005)
– APSyn: for every context in the intersection between the top 1k associated contexts of the two terms, this measure adds 1 divided by its average rank in the term-context lists (Santus et al., 2014b)
– Diff Freqs: difference between the terms' frequencies
– Diff Entrs: difference between the terms' entropies
– C-Freq 1, 2: two features storing the average frequency among the top 1k associated contexts of each term
– C-Entr 1, 2: two features storing the average entropy among the top 1k associated contexts of each term (Shannon, 1948)
• Dataset: combination of BLESS, Lenci/Benotto and EVALution 1.0
– 9,600 pairs: 33% hypernyms, 33% co-hyponyms and 33% randoms
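A partial sketch of the feature extraction for one pair, covering a subset of the features above (frequencies, entropies, Shared, APSyn, and the two difference features; cosine, co-occurrence and the C-Freq/C-Entr averages are omitted). All input names and toy values are hypothetical: freqs and entropies map each word to its corpus frequency and context entropy, and ranked maps each word to its LMI-ranked context list.

```python
def pair_features(w1, w2, freqs, entropies, ranked, n=1000):
    """Build a (partial) ROOT13-style feature dict for the pair (w1, w2)."""
    top1 = {c: r for r, c in enumerate(ranked[w1][:n], start=1)}
    top2 = {c: r for r, c in enumerate(ranked[w2][:n], start=1)}
    shared = top1.keys() & top2.keys()
    apsyn = sum(1.0 / ((top1[c] + top2[c]) / 2.0) for c in shared)
    return {
        "freq_w1": freqs[w1], "freq_w2": freqs[w2],
        "entr_w1": entropies[w1], "entr_w2": entropies[w2],
        "shared": len(shared),               # size of top-N intersection
        "apsyn": apsyn,
        "diff_freqs": freqs[w1] - freqs[w2],
        "diff_entrs": entropies[w1] - entropies[w2],
    }

# Toy inputs for a hypernym-like pair.
freqs = {"car": 100, "vehicle": 60}
entropies = {"car": 3.0, "vehicle": 4.5}
ranked = {"car": ["drive", "road", "fast"], "vehicle": ["drive", "road", "transport"]}
feats = pair_features("car", "vehicle", freqs, entropies, ranked)
```

Rows of such feature dicts would then be fed to a Random Forest classifier.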
ROOT13: Experiment 1
• Baseline: cosine
• Accuracy: measured with F1
• Three classes:
– 88.3% vs. 57.6%
• Hyper-Coord:
– 93.4% vs. 60.2%
• Hyper-Random:
– 92.3% vs. 65.5%
• Coord-Random:
– 97.3% vs. 81.5%
(Santus et al., 2016)
CONCLUSIONS
Conclusions
• Extracting properties from the most related contexts seems to provide important information about semantic relations (entropy, intersection, frequency, etc.).
• APSyn, APAnt, SLQS and ROOT13 are all methods that try to investigate and combine such properties.
• The former three methods have obtained good results in the tasks we have performed, without any particular optimization. Moreover, they are:
– Unsupervised (and therefore applicable to several languages)
– Linguistically grounded (they tell us something about word usage)
• To date, we have identified the most related contexts with Local Mutual Information (Evert, 2005), but it is likely that we will start using Positive Pointwise Mutual Information (PPMI), as most of the literature uses it.
• We are currently developing a system to automatically extract as many statistical properties as possible, evaluating their correlation with semantic relations.
• Briefly: Not All Contexts Are Equal
Thank you
Enrico Santus – The Hong Kong Polytechnic University
Alessandro Lenci – University of Pisa
Qin Lu – The Hong Kong Polytechnic University
Sabine Schulte im Walde – Institute for Natural Language Processing, University of Stuttgart
Frances Yung – Nara Institute of Science and Technology