Multiword Expressions
Presented by:
Bhuban Seth (09305005)
Somya Gupta (10305011)
Advait Mohan Raut (09305923)
Victor Chakraborty (09305903)
Under the guidance of: Prof. Pushpak Bhattacharyya
Contents
• Introduction
• Motivation
• Linguistic Levels
• Types of MWEs
• Approaches to identify MWEs
• Limitations
• Conclusion
• References
Introduction
• Put the sweater on
• Put the sweater on the table
• Put the light on
Roughly defined as: idiosyncratic interpretations that cross word boundaries (or spaces)
Examples
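• His grandfather kicked the bucket.
• This job is a piece of cake.
• Put the sweater on.
• He is the dark horse of the match.
Google Translate output for the sentences above (Hindi, with literal English glosses):
• अपने दादा बाल्टी लात मारी (literally: "his grandfather kicked the pail")
• इस काम के केक का एक टुकड़ा है (literally: "is a piece of this job's cake")
• स्वेटर पर रखो (literally: "place it on top of the sweater")
• वह मैच के अंधेरे घोड़ा है (literally: "he is the match's dark, unlit horse")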
Motivation
Multiword expressions are frequent:
• "Of the same order of magnitude as the number of single words" (Jackendoff 1997)
• 41% of the entries in WordNet 1.7 (Fellbaum 1999)
Resolution needed in:
• Machine Translation (see the poor Google Translate output in the examples)
• Information Retrieval
• Tagging, Parsing, Question Answering Systems, WSD
Linguistic Levels
• Lexicology: In short, Ad hoc
• Morphology and Syntax: Put on weight, Put the sweater on
• Semantics: Spill the beans
• Pragmatics: Kick the bucket, Kick the bucket filled with water
How to Handle These?
Variation in Flexibility
Syntactic Idiomaticity
Types (Sag et al 2002)
Types - Examples
Type                           Example
Fixed                          In short, Ad hoc, Palo Alto, Alta Vista
Compound Nominals              Congressman, Car park, Part of Speech
Proper Names                   Deccan Chargers, Delhi Daredevils
Non-Decomposable Idioms        Kick the Bucket
Decomposable Idioms            Spill the Beans, Let the Cat out
Verb Particle Constructions    Take off, Put on
Light Verb Constructions       Give a Demo, Take a Shower
Institutionalized Phrases      Black and White, Traffic Light, Telephone booth
Approaches
Knowledge Based Approach
1) Words with spaces: fixed expressions
• A stemmer may be used to detect MWEs.
• But it fails. Why? A stemmer strips inflection from every word, so it cannot tell which words of the expression are allowed to inflect.
• "Kicks the bucket" → MWE
• "Kick the buckets" → not an MWE
• Princeton WordNet: flaw
2) Circumscribed constructions:
• Consecutive nouns → most probably an MWE
3) Inflection of the head: semi-fixed expressions
• Ex: part of speech → parts of speech
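The knowledge-based idea above can be sketched as a lexicon lookup in which each entry records the single position that is allowed to inflect. This is only a minimal illustration: the lexicon entries, the suffix list, and the helper names `strip_suffix` and `is_mwe` are assumptions made for this sketch, not part of any existing resource.

```python
# Toy lexicon: each MWE is a tuple of lemmas mapped to the index of the one
# token that may inflect (None = fully fixed expression).
# Entries and suffixes are illustrative assumptions.
LEXICON = {
    ("ad", "hoc"): None,             # fixed: no token may vary
    ("kick", "the", "bucket"): 0,    # only the verb inflects
    ("part", "of", "speech"): 0,     # only the head noun inflects
}

SUFFIXES = ("ing", "ed", "es", "s")  # crude stand-in for a real stemmer


def strip_suffix(token):
    """Remove one inflectional suffix, if present (hypothetical toy stemmer)."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def is_mwe(tokens):
    """True if tokens match a lexicon entry, with inflection allowed
    only at the position the entry marks as flexible."""
    tokens = [t.lower() for t in tokens]
    for lemmas, flexible in LEXICON.items():
        if len(lemmas) != len(tokens):
            continue
        if all(
            tok == lem or (i == flexible and strip_suffix(tok) == lem)
            for i, (tok, lem) in enumerate(zip(tokens, lemmas))
        ):
            return True
    return False
```

With this lexicon, `is_mwe("kicks the bucket".split())` and `is_mwe("parts of speech".split())` are accepted, while `is_mwe("kick the buckets".split())` is rejected because the noun position is not marked as flexible.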
Statistical Approaches
Co-occurrence properties
Substitutability
Distributional Similarity
Semantic Similarity
Co-occurrence properties
Example: Black and White
Scan a corpus and find the probabilities of bigrams and trigrams.
P(X|Y) = P(YX) / P(Y), where P(YX) is the probability of the bigram "Y X"
If P(X|Y) is high, then there is a chance that the word sequence "Y X" is an MWE.
Demerit:
• "I am" scores high but is not an MWE.
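A rough sketch of the co-occurrence score, assuming a tokenized corpus and maximum-likelihood estimates, so that P(X|Y) is approximated by count(Y X) / count(Y). The function name `bigram_conditional_probs` is made up for this illustration.

```python
from collections import Counter


def bigram_conditional_probs(tokens):
    """Map each bigram (y, x) to an estimate of P(x | y),
    i.e. count of 'y x' divided by count of 'y'."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(y, x): count / unigrams[y] for (y, x), count in bigrams.items()}


# Tiny invented corpus, only to show the behaviour.
corpus = "i am here and i am there in black and white".split()
probs = bigram_conditional_probs(corpus)
```

On this toy corpus both `("i", "am")` and `("black", "and")` get the maximum score, which illustrates the demerit: a high conditional probability alone does not separate MWEs from frequent function-word sequences.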
Point-wise Mutual Information (PMI)
PMI(X,Y) = log [ P(X,Y) / (P(X) · P(Y)) ]
The PMI(X,Y) of a word pair (X,Y) is a measure of the strength of their collocation.
Other methods such as Student's t-test and Pearson's chi-square test can also be used.
Demerit:
• Need to differentiate between systematic & chance co-occurrence
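PMI can be computed from the same bigram and unigram counts. The sketch below assumes adjacent word pairs, maximum-likelihood probabilities, and a base-2 logarithm (the slide does not fix the base); the function name `pmi` is chosen for this example.

```python
import math
from collections import Counter


def pmi(tokens, x, y):
    """Pointwise mutual information of the adjacent pair 'x y' (log base 2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_xy = bigrams[(x, y)] / (len(tokens) - 1)   # probability of the bigram
    if p_xy == 0:
        return float("-inf")                     # pair never observed
    p_x = unigrams[x] / len(tokens)
    p_y = unigrams[y] / len(tokens)
    return math.log2(p_xy / (p_x * p_y))
```

A positive value means the pair occurs together more often than independence would predict. With very low counts the estimate is unstable, which is one reason systematic and chance co-occurrence are hard to tell apart.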
Pearson’s chi-square test
Based on the assumption that word frequencies are approximately normally distributed, which could be a limitation
Null hypothesis: the words are independent of each other.
The higher the value of the chi-square statistic, the stronger the association between the words.
Demerit:
• For small data collections, the assumptions of normality and of a chi-square distribution do not hold. Hence, a large corpus is required.
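For a word pair, the chi-square statistic can be sketched over a 2x2 contingency table of bigram counts (first word is X or not, second word is Y or not). This uses the standard closed form for 2x2 tables; the function name and the toy corpus in the note are assumptions for illustration.

```python
from collections import Counter


def chi_square(tokens, x, y):
    """Pearson chi-square statistic for the adjacent pair 'x y',
    computed from a 2x2 table of bigram counts."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())
    o11 = bigrams[(x, y)]                                                  # x followed by y
    o12 = sum(c for (a, b), c in bigrams.items() if a == x and b != y)     # x, not y
    o21 = sum(c for (a, b), c in bigrams.items() if a != x and b == y)     # not x, y
    o22 = n - o11 - o12 - o21                                              # neither
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denom == 0:
        return 0.0                                                         # degenerate table
    return n * (o11 * o22 - o12 * o21) ** 2 / denom
```

With one degree of freedom, a value above 3.841 rejects independence at the 5% level. On a corpus as small as `"a b a b c d".split()` the statistic for `("a", "b")` already reaches its maximum (the number of bigrams), which shows why the test is not trustworthy on small collections.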
Substitutability
The ability to replace parts of lexical items with alternatives.
Alternatives can be similar words (synonyms) or opposite words (antonyms), depending on the task and approach.
In most cases, the phrase is no longer an MWE after the substitution.
Can be used to filter out candidates that are probably not MWEs.
Src: Kim, 2008
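One possible way to turn substitutability into a filter is to replace one word at a time with an alternative and check how many of the resulting phrases are attested. The sketch below is not the procedure from Kim (2008); the synonym table and phrase counts are invented toy data, and the function names are hypothetical.

```python
def substitution_variants(phrase, synonyms):
    """Yield every phrase obtained by replacing exactly one word
    with one of its listed alternatives."""
    words = phrase.split()
    for i, word in enumerate(words):
        for alt in synonyms.get(word, []):
            yield " ".join(words[:i] + [alt] + words[i + 1:])


def substitutability(phrase, synonyms, phrase_counts):
    """Fraction of one-word substitutions that occur in the corpus counts."""
    variants = list(substitution_variants(phrase, synonyms))
    if not variants:
        return 0.0
    attested = sum(1 for v in variants if phrase_counts.get(v, 0) > 0)
    return attested / len(variants)


# Invented toy data, for illustration only.
SYNONYMS = {"kick": ["hit", "strike"], "bucket": ["pail"],
            "red": ["blue", "green"], "car": ["truck"]}
PHRASE_COUNTS = {"kick the bucket": 50, "red car": 40,
                 "blue car": 35, "green car": 20, "red truck": 15}
```

Here "red car" scores high because its variants are all attested, so it would be filtered out as a likely non-MWE, while "kick the bucket" scores zero and stays a candidate.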
Distributional Similarity
A method for estimating semantic similarity from context
If two words are similar, their context words are also similar.
Src: Kim, 2008
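A minimal sketch of distributional similarity, assuming bag-of-words context vectors built from a fixed window and compared with cosine similarity. The window size and function names are choices made for this example; Kim (2008) may use different features and measures.

```python
import math
from collections import Counter


def context_vector(tokens, target, window=2):
    """Count the words that appear within `window` positions of `target`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            vec.update(left + right)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0
```

On a toy corpus such as "the cat drinks milk . the dog drinks water . the car needs fuel ." with `window=1`, "cat" and "dog" share both context words, while "cat" and "car" share only "the", so the first pair comes out as more similar.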
Semantic Similarity
Similar noun compounds (NCs) could have the same semantic relations.
Src: Kim, 2008
Method
Src: Kim, 2008
MWE Resources
Corpora
• British National Corpus (BNC)
• Brown Corpus
Lexical Resources
• WordNet
• Moby's Thesaurus: contains 30K root words & 2.5M synonyms and related words
Tools
• WordNet::Similarity: gives a measure of semantic similarity between two given words
Limitations of current Approaches
Many NLP approaches treat MWEs according to the words-with-spaces method
Many approaches get commonly-attested MWE usages right, sometimes using “ad hoc” methods, e.g. preprocessing
However, most approaches handle variation badly, fail to generalize, and result in NLP systems that are difficult to maintain and extend
Conclusion
MWEs have been classified into lexicalized phrases (fixed, semi-fixed, and syntactically flexible) and institutionalized phrases.
MWE analysis is as important in NLP as any other area, such as MT or WSD.
A hybrid approach is most probably the best method so far for extracting MWEs from a corpus.
References
Kim, S. N. (2008). Statistical modeling of multiword expressions.
Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics.
Calzolari, N., et al. (2002). Towards best practice for multiword expressions in computational lexicons. Proceedings of the 3rd International Conference on Language Resources and Evaluation (pp. 1934-1940).
Thank You
Questions???