Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical...
Transcript of Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical...
![Page 1: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/1.jpg)
Empirical Approaches to
Multilingual Lexical Acquisition
Lecturer: Timothy Baldwin
![Page 2: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/2.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Lecture 4
Unsupervised Approaches to Lexical
Acquisition: Word Segmentation and MWE
Extraction
1
![Page 3: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/3.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Word Segmentation
2
![Page 4: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/4.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Basic Task Description
• Take an unsegmented string and dynamically insert the word
boundaries:
spotthebreaksinthisstring : spot the breaks in this string
• Applications in:
⋆ the segmentation of non-segmenting languages (e.g. Japanese,
Chinese, Thai)
⋆ segmenting up a speech signal into words
⋆ modelling child word acquisition
3
![Page 5: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/5.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Difficulties in Word Detection
• Interdepedence between word and boundary detection
• Segmentation of frequently-occurring strings
ad hoc, 東京都 tou·kyou·to “Tokyo prefecture”
• Inherent boundary ambiguities
大学長 dai·gaku·chou “(university) president”
• Data sparseness
4
![Page 6: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/6.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Unsupervised Discovery ofMorphemes
(Creutz and Lagus 2002)
5
![Page 7: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/7.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Types of morphological linkage
• Inflectional (table : tables, bus : buses/busses, phenomenon
: phenomena)
• Derivational (central : centralise)
• Compositional (house + boat : houseboat)
6
![Page 8: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/8.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Semantic linkage
• employ : employer , employee
• Some of these are lost in the mists of time, e.g. ploy :
employ/deploy?
7
![Page 9: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/9.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Basic tasks in morphology
• Discovering the morphemes
• Determining the interaction between the morphemes in a given
word
unfathomable
l
un·fathom·ablel
unNEG·fathomV ·ableV →N
8
![Page 10: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/10.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Unsupervised Discovery of Morphemes
• Segment each word token into morphemes without using prior
knowledge of morphological processes or morphemes:
misinformed : mis·inform·edreconstruction : re·construct·ionkahvinjuojallekin : kahvi·n·juo·ja·lle·kin [FI]
• Assume agglutinative or simple inflectional morphology
• Based on orthographic regularities
Creutz and Lagus (2002) 9
![Page 11: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/11.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Applications
• Language modelling (speech recognition)
• Language-independent lexicographic tool for determining
stems/word boundaries
• N.B. assumes a corpus of raw text up-front
Creutz and Lagus (2002) 10
![Page 12: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/12.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Minimum Description Length
• Minimum Description Length (MDL, a.k.a. Minimum Message
Length or MML) is a method for simultaneously evaluating the
compactness of a particular encoding and the compactness of the
data using that encoding
Description length = model DL + data DL
‘Simplicity vs. Accuracy’
• Too much ‘accuracy’ : overfitting
• How can we compress / generalise the data model?
Creutz and Lagus (2002) 11
![Page 13: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/13.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
MDL in word segmentation
• In morpheme discovery terms, minimise C:
C = Cost(Source text) + Cost(Codebook)
=∑
tokens
− log p(mi) +∑
types
k × l(mj)
Creutz and Lagus (2002) 12
![Page 14: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/14.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Example: Word splitting
howmuchwoodcouldawoodchuckchuckifawoodchuckcouldchuckwood
Creutz and Lagus (2002) 13
![Page 15: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/15.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
mi − log p(mi) freq(mi) l(mi)
howmuchwoodcoulda... 0.00 1 57
— 0.00 57
C = 285.00 (k = 5)
Creutz and Lagus (2002) 14
![Page 16: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/16.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
mi − log p(mi) freq(mi) l(mi)
a 2.70 2 1
chuck 2.70 2 5
could 2.70 2 5
how 3.70 1 3
if 3.70 1 2
much 3.70 1 4
wood 2.70 2 4
woodchuck 2.70 2 9
— 38.11 33 C = 203.11 (k = 5)
Creutz and Lagus (2002) 15
![Page 17: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/17.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
mi − log p(mi) freq(mi) l(mi)
a 2.91 2 1
chuck 1.91 4 5
could 2.91 2 5
how 3.91 1 3
if 3.91 1 2
much 3.91 1 4
wood 1.91 4 4
— 38.60 24 C = 158.60 (k = 5)
Creutz and Lagus (2002) 16
![Page 18: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/18.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
mi − log p(mi) freq(mi) l(mi)
a 4.83 2 1
c 2.37 11 1
d 3.25 6 1
f 5.83 1 1
h 3.25 6 1
i 5.83 1 1
k 3.83 4 1
l 4.83 2 1
m 5.83 1 1
o 2.37 11 1
u 3.03 7 1
w 3.51 5 1
— 182.09 12 C = 242.09 (k = 5)
Creutz and Lagus (2002) 17
![Page 19: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/19.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Greedy search algorithm
• Read in each word at a time, and recursively select the binary
segmentation of a word which minimises C
• Continue until no C-optimal segmentations remain
• Dynamically segment even if the word has been observed previously
in the data (removing any previous segmentations from the model)
Creutz and Lagus (2002) 18
![Page 20: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/20.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Dreaming
• At regular intervals, randomly iterate over words already processed
to remove any global sub-optimal segmentations
• Dream for a fixed time (over a fixed number of words) OR until no
significant change in C is observed
• Demo:
http://www.cis.hut.fi/cgi-bin/morpho/nform.cgi
19
![Page 21: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/21.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Mostly-UnsupervisedStatistical Segmentation ofJapanese Kanji Sequences
(Kubota Ando and Lee 2003)
20
![Page 22: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/22.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Outline
• Unsupervised method for segmenting Japanese kanji sequences
社長兼業務部長l
社長 |兼 |業務 |部長 sha·chou|ken|gyou·mu|bu·chou “president
and general business manager”
• Based on character n-grams (sequences of n characters)
Kubota Ando and Lee (2003) 21
![Page 23: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/23.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Why the Interest in Kanji?
• Japanese consists of three basic orthographies:
⋆ hiragana: 46 elements (functional words, onomatapaea,
verb/adjective suffixes)
⋆ katakana: 46 elements (words of foreign origin, scientific names)
⋆ kanji: 1,945 “official” elements, more in practice (content words)
• Kanji the most productive and sparse of the three
Kubota Ando and Lee (2003) 22
![Page 24: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/24.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Basic Intuition
• N -gram “windows” over a kanji string can be used as predictors of
word boundaries in that:
1. frequent n-grams suggest the lack of a boundary
2. infrequent n-grams suggest the presence of a boundary
spotthebreaks with 4-gram window:s p o t t h e b r e a k s
s p o t
p o t t
o t t h...
Kubota Ando and Lee (2003) 23
![Page 25: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/25.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Example: spotthebreaks
Word N-gram Freq
spot 119
pott 24
otth 120
tthe 6075
theb 2118
hebr 218
ebre 49
brea 328
reak 246
eaks 57
Kubota Ando and Lee (2003) 24
![Page 26: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/26.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Method
• For each potential boundary point:
⋆ For each n-gram order (e.g. 4):
∗ For each of the n-grams to the left and right of the boundary:
◦ What is the proportion of n-grams straddling the boundary
that are less frequent than the n-gram adjoining it?
• vn = 12(n−1)
∑
d∈{L,R}
∑n−1j=1 I>(#(sn
d), #(tnj ))
Kubota Ando and Lee (2003) 25
![Page 27: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/27.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Returning to our Example
• For the boundary spot|theb:
v4 = 12(4−1)(1 + 2) = 0.5
• For the boundary pott|hebr :
v4 = 12(4−1)(0 + 1) ≈ 0.17
Kubota Ando and Lee (2003) 26
![Page 28: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/28.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Combining n-gram Orders
• So as to balance up the effects of data sparseness, the method
advocates “voting” across different n-gram orders:
vN(k) = 1|N |
∑
n∈N vn(k)
• Place boundaries at locations l where EITHER:
1. vN(l) > vN(l − 1) and vN(l) > vN(l + 1) OR
2. vN(l) > t
t = threshold constant
Kubota Ando and Lee (2003) 27
![Page 29: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/29.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Evaluation
• Extract n-gram statistics from 150MB of newswire data (N.B.
unannotated)
• Hand-annotated 5 held-out datasets containing 500 kanji
sequences each
• For each held-out dataset, use 450 sequences for parameter training
(N, t) and the remaining 50 sequences for evaluation (hence
mostly-unsupervised)
Kubota Ando and Lee (2003) 28
![Page 30: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/30.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Words vs. Morphemes
• Two separate evaluations were carried out, based on:
⋆ words: a stem and its affixes are considered as a single unit
[電話器] [deN·wa·ki] “telephone (device)”
⋆ morphemes: stems and affixes are each individual units
[電話][器] [deN·wa][ki] “[telephone][device]”
Kubota Ando and Lee (2003) 29
![Page 31: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/31.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Parameter Optimisation
• Run the algorithm over each set of 450 sequences using the different
values of N and t
• Optimise according to one of:
⋆ word precision = correct proposed bracketsproposed brackets
⋆ word recall = correct proposed bracketsactual brackets
⋆ word F(-measure/score) = precision×recallprecision+recall
• Evaluate according to the optimal parameter setting
Kubota Ando and Lee (2003) 30
![Page 32: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/32.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Results (Briefly)
• Word performance better than established Japanese morphological
analyzers (JUMAN, Chasen)
• Word performance considerably better than morpheme performance
• Single-character morphemes greatest source of errors
Kubota Ando and Lee (2003) 31
![Page 33: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/33.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Summary
• Method proposed for segmenting up Japanese kanji sequences
based on simple n-gram statistics and voting
• Basic intuition that n-grams straddling word boundaries will be
less-frequent than those adjacent to word boundaries
• Impressive results achieved for such a conceptually simple algorithm
32
![Page 34: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/34.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Multiword ExpressionDiscovery
33
![Page 35: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/35.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Basic Task Description
• Identify the multiword expression (MWE) types in tagged (or raw)
text from observation of the token distribution
• Recall: a MWE is made up of multiple words and is lexically,
syntactically, semantically, pragmatically and/or statistically
idiosyncratic
• Considerable body of literature on collocation extraction
34
![Page 36: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/36.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Lexico-syntactic Idiomaticityby and large
???
conjP Adj
by and large
ad hoc
Adj
? ?
ad hoc
wine and dine
V[trans]
conjV[intrans] V[intrans]
wine and dine
35
![Page 37: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/37.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Semantic Idiomaticity
kick the bucket spill the beans
die’ reveal’(secret’)
kindle excitement
kindle’(excitement’)
36
![Page 38: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/38.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Pragmatic Idiomaticity
• The Wheel of Fortune factor — how to represent the jumble of
phrases stored in the mental lexicon?
• The Monty Python factor — mish-mash of language fragments
which evoke particular events/individuals/memories
37
![Page 39: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/39.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Statistical Idiomaticityunblemished spotless flawless immaculate impeccable
eye – – – – +gentleman – – ? – +home ? + – + ?lawn – – ? + –memory – – + – ?quality – – – – +record + + + + +reputation + – – + +taste – – – – +
Adapted from Cruse (1986)
38
![Page 40: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/40.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Complications in MWE Discovery
• Working out the extent of the collocation (phrase boundary
detection)
trip the light X
trip the light fantastic V
trip the light fantastic at X
• Fine line between collocations and simple default lexical
combinations
buy a car/purchase power
39
![Page 41: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/41.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Collocations vs. MWEs
• A collocation (≈ institutionalised phrase) is an arbitrary and
recurrent word combination
• Collocations tend to stand in opposition to anti-collocations =
lexically-marked synonym-substitutes
• Collocations can be semantically-marked (e.g. dark horse) but tend
to be compositional (e.g. strong coffee)
• Predicative collocations vs. rigid NPs vs. phrasal templates
40
![Page 42: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/42.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Retrieving Collocations fromText: Xtract
(Smadja 1993)
41
![Page 43: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/43.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Outline
• Automatic method for extracting collocations from raw text based
on n-gram statistics
• Basic intuition: collocations are more rigid syntactically and more
frequent than other word combinations
• Method: attempt to capture this intuition using the basic statistics
of word combinations
Smadja (1993) 42
![Page 44: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/44.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Stage 1: Extract Significant Bigrams
• w and wi co-occur (wi is a collocate of w) if they are found in a
single sentence separated by fewer than 5 words
• a bigram (w,wi) is significant iff:
⋆ w and wi co-occur more frequently than chance
⋆ w and wi appear in a relatively rigid configuration
• Divide up the set of collocates according to POS
Smadja (1993) 43
![Page 45: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/45.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Example Corpus
... multiword(−1) expressions ...
... multiword(−1) expressions ...
... dialect(−1) expressions ...
... dialect(−2) and expressions ...
... expressions of interest(2) ...
... multiword(−1) expressions ...
... collocation extraction ...
... expressions dialect(1) ...
... multiword(−1) expressions ...
... multiword(−1) expressions ...
Smadja (1993) 44
![Page 46: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/46.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Noun Co-occurrence Table
w = expressions, D = {−2,−1, 1, 2}
multiword dialect collocation interestcollocate
w1 w2 w3 w4
p−2i 0 1 0 0
p−1i 5 1 0 0
p1i 0 1 0 0
p2i 0 0 0 1
freq i 5 3 0 1
wi ∈ C yes yes no yes
Smadja (1993) 45
![Page 47: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/47.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Statistics of Expectation
• f =P
wi∈C freqi
|C| = 5+3+13 = 3 (frequency average)
• σ =
√
P
wi∈C(freqi−f)2
|C| =
√
(5−3)2+(3−3)2+(1−3)2
3 ≈ 1.63
• ki = freqi−f
σ(strength)
• pi =P
j∈D pji
|D| (pair count average)
• Ui =P
j∈D(pji−pi)
2
|D| (pair count variance)
Smadja (1993) 46
![Page 48: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/48.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Collocation Filters
• Strength: ki > kα(= 1)
: select frequent collocates
• Spread: Ui > U0(= 1)
: select spiky distributions
• Peakiness: pji ≥ pi + (kβ ×
√Ui) kβ = 0.51
: identify interesting spikes
1Value of 10 suggested for kβ in Smadja (1993)
Smadja (1993) 47
![Page 49: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/49.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Back to our Example: Strength
• w1 (multiword)
⋆ k1 = 5−31.63 = 1.22 > 1 V
• w2 (dialect)
⋆ k2 = 3−31.63 < 1 X
• w4 (interest)
⋆ k3 = 1−31.63 < 1 X
Smadja (1993) 48
![Page 50: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/50.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Spread and Peakiness (w1 = multiword)
• U1 = (0−1.25)2+(5−1.25)2+(0−1.25)2+(0−1.25)2
4 ≈ 20.31 > 1 V
⋆ p1 = 0+5+0+04 = 1.25
⋆ p1 + (kβ ×√
U1) = 1.25 + (0.5 ×√
20.31) ≈ 3.50
j pi + (kβ ×√
Ui)
p−21 0 < 3.50 X
p−11 5 ≥ 3.50 V
p+11 0 < 3.50 X
p+21 0 < 3.50 X
Smadja (1993) 49
![Page 51: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/51.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Stage 2: Bigrams to n-grams
• Independent filter to detect larger n-grams
• Method: for each fixed-distance collocate (w,wji ), extract out
contiguous word sequences where max(p(word[i])) > T (= 0.75)
Smadja (1993) 50
![Page 52: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/52.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Example
w = resistance, w−3i = path
Concordances from the BNC:
... trod the path(−3) of least resistance , ...
... finding the path(−3) of least resistance will ...
... along the path(−3) of least resistance .
... the safest path(−3) of least resistance through ...
... took the path(−3) of least resistance and ...
: the path of least resistance is a rigid noun phrase
Smadja (1993) 51
![Page 53: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/53.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Stage 3: Add Syntax
• Use cass dependency parser to assign syntactic labels to bigrams
• Same basic thresholding filter as used in Stage 2
Smadja (1993) 52
![Page 54: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/54.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Summary
• Two methods proposed for extracting collocations based on
distributional statistics
⋆ 1st method based on the intuition that collocations occur
frequently in relatively fixed configurations
⋆ 2nd method based on the predictability of contiguous lexical
contexts for given concordances
• Basic method illustrated for determining the syntax of extracted
collocations
Smadja (1993) 53
![Page 55: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/55.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
Reflections
• (At the time) groundbreaking research on collocation extraction
• Not effective at extracting out low-frequency words
• Difficulties in evaluating the results of collocation extraction
(applies to this day)
• Difficulties in extracting non-contiguous (predicative) collocations
such as verb particles
Smadja (1993) 54
![Page 56: Empirical Approaches to Multilingual Lexical Acquisitiontbaldwin/lexacq/lecture04.pdfEmpirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008) Difficulties in](https://reader030.fdocuments.in/reader030/viewer/2022041215/5e0395ef21422b1220141b13/html5/thumbnails/56.jpg)
Empirical Approaches to Multilingual Lexical Acquisition Lecture 4 (17/7/2008)
ReferencesCreutz, Mathias, and Krista Lagus. 2002. Unsupervised discovery of morphemes. In
Proc. of the 6th Meeting of the ACL Special Interest Group in Computational Phonology ,
Philadelphia, USA.
Cruse, D. Alan. 1986. Lexical Semantics. Cambridge, UK: Cambridge University Press.
Kubota Ando, Rie, and Lillian Lee. 2003. Mostly-unsupervised statistical segmentation of
Japanese kanji sequences. Natural Language Engineering 9.127–49.
Smadja, Frank. 1993. Retrieving collocations from text: Xtract. Computational Linguistics
19.143–77.
55