Morpho-Syntactic Analysis and Language Modeling using Machine Learning Techniques
Guy De Pauw Walter Daelemans
CNTS – Language Technology Group
http://www.cnts.ua.ac.be
FLaVoR Workshop (17/11/2006) – Morpho-Syntactic Analysis and Language Modeling using Machine Learning Techniques
Morpho-Syntactic Analysis using Machine Learning Techniques
• Why?
  - As an NLP tool proper (!)
  - Annotate new datasets (e.g. Mediargus)
  - Extra information source for language modeling
• How?
  - Machine Learning techniques (MBL + maxent)
  - Shallow linguistic analysis
Shallow linguistic analysis
• For many NLP applications, full analysis is often not necessary
  - e.g. morphological analysis
• uitzonderingsgevallen:
  FULL: ((((uitzonder)[V],(ing)[N|V.])[N],(s)[N|N.N],(geval)[N])[N]),(en)[N-m]
  vs
  SHALLOW: uitzonder@V + ing@N|V. + s@N|N.N + geval@N + en@N-m
• Shallow Analysis: fast + robust
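The flat shallow format is straightforward to consume programmatically. A minimal Python sketch, assuming morphemes are separated by `+` and each morpheme carries its tag after `@`, as in the SHALLOW example above. The function name `parse_shallow` is illustrative and not part of the authors' tools.

```python
def parse_shallow(analysis):
    """Split a shallow analysis string into (morpheme, tag) pairs."""
    pairs = []
    for part in analysis.split("+"):
        # Everything before the first "@" is the morpheme, the rest is its tag.
        morpheme, _, tag = part.strip().partition("@")
        pairs.append((morpheme, tag))
    return pairs


shallow = "uitzonder@V + ing@N|V. + s@N|N.N + geval@N + en@N-m"
pairs = parse_shallow(shallow)
for morpheme, tag in pairs:
    print(f"{morpheme}\t{tag}")
```

No bracket matching is needed, which is one reason the shallow representation is cheap to produce and process.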
Shallow linguistic analysis
word              morphology           POS-tag  SP-tag
nu                nu                   BW       I-ADVP
treft             tref+t               WW3      S-MAIN
de                de                   LID      I-NP
nietsvermoedende  niets+vermoed+end+e  ADJ1     I-NP
poolreiziger      pool+reiziger        N1       I-NP
vuilnisbelten     vuil+nis+belt+en     N3       B-NP
tussen            tussen               VZ1      I-PP
de                de                   LID      I-NP
ijsbergen         ijs+berg+en          N3       I-NP
aan               aan                  VZ2      I-SVP
.                 .                    LET      0
Shallow linguistic analysis
[ADVP nu][SMAIN tref+t][NP de niets+vermoed+end+e pool+reiziger][NP vuilnis+belt+en][PP tussen ][NP de ijs+berg+en][SVP aan].
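The bracketed form can be derived from the per-token SP-tags in the table shown earlier. The Python sketch below makes several assumptions that the slides do not state: `I-X` continues or opens a chunk of type X, `B-X` opens a new chunk directly after another chunk of the same type, `0` means outside any chunk, and any other tag (here `S-MAIN`) is a chunk label written with a hyphen. The function names are illustrative.

```python
def group_chunks(rows):
    """Group (token, sp_tag) rows into (label, [tokens]) chunks; label None = outside."""
    chunks = []
    prev_label = None
    for token, tag in rows:
        if tag in ("O", "0"):
            label, begins = None, True
        elif tag[:2] in ("I-", "B-"):
            label, begins = tag[2:], tag.startswith("B-")
        else:
            # Assumption: tags such as "S-MAIN" name the chunk type itself ("SMAIN").
            label, begins = tag.replace("-", ""), False
        if begins or label != prev_label or not chunks:
            chunks.append((label, [token]))
        else:
            chunks[-1][1].append(token)
        prev_label = label
    return chunks


def render(chunks):
    """Render chunks in the bracket notation used on the slide."""
    return "".join(
        f"[{label} {' '.join(tokens)}]" if label else " ".join(tokens)
        for label, tokens in chunks
    )


rows = [
    ("nu", "I-ADVP"), ("tref+t", "S-MAIN"), ("de", "I-NP"),
    ("niets+vermoed+end+e", "I-NP"), ("pool+reiziger", "I-NP"),
    ("vuil+nis+belt+en", "B-NP"), ("tussen", "I-PP"), ("de", "I-NP"),
    ("ijs+berg+en", "I-NP"), ("aan", "I-SVP"), (".", "0"),
]
bracketed = render(group_chunks(rows))
print(bracketed)
```

The `B-NP` tag on vuil+nis+belt+en is what keeps the two adjacent noun phrases apart.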
Morphological Analysis
parelvissers
-> Segmentation: parel+viss+er+s
-> Tagging: parel@N+viss@V+er@N|V.+s@INFLm
-> Alternation: parel@N+vis@V+er@N|V.+s@INFLm
Morphological Segmentation
• Trained and evaluated on (adapted) morphological database of CELEX
• Experimental Results (full word score):
  - FS (minimal boundaries + unigram): 86.7%
  - Morpheme Boundary Prediction: 89.2%
  - FS + Morpheme Prediction: 94.8%
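Morpheme boundary prediction can be framed as a per-character classification task, which suits a memory-based learner. The Python sketch below shows one common way to build such training instances from a segmented word. The window width, the padding symbol and the label names are assumptions; the slides do not specify the features actually used.

```python
def boundary_instances(segmented, width=2, pad="_"):
    """Turn 'parel+viss+er+s' into (character window, label) training instances.

    The label is "B" when a morpheme boundary directly precedes the focus
    character (the middle character of the window), and "-" otherwise.
    """
    morphemes = segmented.split("+")
    word = "".join(morphemes)
    starts = set()  # indices of characters that open a non-initial morpheme
    offset = 0
    for morpheme in morphemes[:-1]:
        offset += len(morpheme)
        starts.add(offset)
    padded = pad * width + word + pad * width
    instances = []
    for i in range(len(word)):
        window = padded[i:i + 2 * width + 1]
        instances.append((window, "B" if i in starts else "-"))
    return instances


instances = boundary_instances("parel+viss+er+s")
for window, label in instances:
    print(window, label)
```

A classifier trained on such instances predicts a label for every character of an unseen word; a word counts as correct in a full word score only if all of its boundaries are right.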
Morphological Analysis
parelvissers
-> Segmentation: parel+viss+er+s
-> Tagging: parel@N+viss@V+er@N|V.+s@INFLm
-> Alternation: parel@N+vis@V+er@N|V.+s@INFLm
96%
Alternation
• Map
  parel+viss+er+s to parel+vis+er+s
  aan+lop+en to aan+loop+en
  but also aan+ge+bracht to aan+ge+breng
• Grapheme-based alternation
• 99.4% of morphemes correctly alternated
  - Including complex alternations like bracht->breng
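One possible reading of grapheme-based alternation, assumed here, is that a classifier assigns every grapheme of a morpheme a class: keep it, replace it with a string, or delete it. The Python sketch below only applies such classes. The class inventory and the grapheme alignments (including the one for bracht->breng) are hypothetical illustrations, not the authors' encoding.

```python
KEEP = "="  # hypothetical class meaning "copy this grapheme unchanged"


def apply_alternation(morpheme, classes):
    """Apply one predicted class per grapheme: KEEP, a replacement string, or "" to delete."""
    if len(morpheme) != len(classes):
        raise ValueError("expected exactly one class per grapheme")
    return "".join(g if c == KEEP else c for g, c in zip(morpheme, classes))


vis = apply_alternation("viss", [KEEP, KEEP, KEEP, ""])            # drop the doubled s
loop = apply_alternation("lop", [KEEP, "oo", KEEP])                # restore the long vowel
breng = apply_alternation("bracht", [KEEP, KEEP, "e", "n", "g", ""])
print(vis, loop, breng)
```

Because each grapheme is classified in context, regular spelling changes and irregular stems can be handled by the same learner.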
Morphological Analysis
• Use morphological analysis cascade to analyze all words in CGN and Mediargus (not in CELEX), e.g.
  F1: flowerpower-afstammelingen
  F2: flowerpower-@N+af@P+stamm@V+eling@N|V.+en@INFLm
  F3: flowerpower@N+af@P+stam@V+eling@N|V.+en@INFLm
  F4: m
• Huge morphological database of ±2.7M words
Part-of-Speech Tagging
• Trained and evaluated on CGN + STIL
• Some Experimental Results
  - Contextual + orthographic features: 96.6% (unknown words: 82.5%)
  - + morphological information: 97.2% (unknown words: 86.9%)
    • Tags of morphemes
    • Lemma
    • Inflection tag
Shallow linguistic analysis: shallow parsing (SP-tag)
• 89.5% tagging accuracy
• 87.4 F-score
System for morpho-syntactic analysis
• Morphological analysis: ±5 w/s
• Tagging + Phrase Chunking: ±450 w/s
• Used to annotate entire Mediargus corpus
  - Morphological analysis (±2B morphemes)
  - Part-of-speech tags
  - Phrase chunks
::demo::http://www.cnts.ua.ac.be/flavor
Language Modeling
• Problem 1: input is not a sequence of words, but a sequence of morphemes
• Problem 2: scoring hypotheses using shallow linguistic annotation
Language Modeling
• Problem 1: input is a sequence of morphemes
  Nu tref t de niets vermoed end e pool reiziger vuil nis belt en tussen de ijs berg en aan
• Disambiguate between word and morpheme boundaries
• Use morphologically analyzed Mediargus as training material
• Approach: morpheme sequence tagging
Language Modeling
nu NWB
tref V
t INFLtWB
de EWB
niets B
vermoed V
end A|BV.
e INFLPWB
pool N
reiziger NWB
vuil A
nis N|A.
belt N
en INFLmWB
tussen BWB
de EWB
ijs N
berg N
en INFLmWB
aan PWB
. .
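Given such a tagged morpheme sequence, word forms can be recovered by cutting after every boundary-marked tag. A minimal Python sketch, assuming that a tag ending in `WB` marks the last morpheme of a word (so `B` for niets is not a boundary, while `BWB` for tussen is). The function name `group_words` is illustrative.

```python
def group_words(tagged_morphemes, marker="WB"):
    """Group (morpheme, tag) pairs into words, closing a word after each marker-suffixed tag."""
    words, current = [], []
    for morpheme, tag in tagged_morphemes:
        current.append(morpheme)
        if tag.endswith(marker):
            words.append(current)
            current = []
    if current:  # trailing material without a boundary tag, e.g. the final "."
        words.append(current)
    return words


tagged = [
    ("nu", "NWB"), ("tref", "V"), ("t", "INFLtWB"), ("de", "EWB"),
    ("niets", "B"), ("vermoed", "V"), ("end", "A|BV."), ("e", "INFLPWB"),
    ("pool", "N"), ("reiziger", "NWB"), ("vuil", "A"), ("nis", "N|A."),
    ("belt", "N"), ("en", "INFLmWB"), ("tussen", "BWB"), ("de", "EWB"),
    ("ijs", "N"), ("berg", "N"), ("en", "INFLmWB"), ("aan", "PWB"), (".", "."),
]
words = group_words(tagged)
print(" ".join("[w " + " ".join(word) + " ]" for word in words))
```

Predicting the tags is the hard part; once they are available, the grouping itself is deterministic.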
Language Modeling
• Problem 1: input is a sequence of morphemes
  Nu tref t de niets vermoed end e pool reiziger vuil nis belt en tussen de ijs berg en aan
  [w nu ] [w tref t] [w de ] [w niets vermoed end e ] …
  - Word boundaries: 97.2%
  - Morpheme boundaries: 93.1%
  - F-score of 92.3%
Language Modeling
• (Big) remaining problem:
  - aanlopen -> aan+lop+en or aan+loop+en
  - gebracht -> ge+bracht or ge+breng
  - But not: aan+loop+en and ge+bracht
  - Information not available in CELEX
  - But: Orthography closest guess
    True pronounced morphemes quite workable
    Decent accuracy on harder task
    ?? Regular expression + grapheme-to-phoneme conversion
  - Not yet integrated in recognizer
Language Modeling
• Turn morphemes into word forms (+ reverse alternation)
  - Re-analyze word form
• Tag + shallow parse sequence of words
::demo::www.cnts.ua.ac.be/flavor
Language Modeling
• Problem 2: scoring hypotheses
  - Option 1: n-gram models trained on annotated Mediargus corpus
    • Morpheme n-grams: de niets vermoed end <e>
    • Tagged-morpheme n-grams: EWB B V A|BV. <INFLPWB>
    • Word n-grams
    • Part-of-Speech tag n-grams
    • Shallow Parsing tag n-grams
    • Combination: de@LID@NP <kan@WW@NP> or <kan@N1@NP>
  - Interpolate LM scores
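The slides do not say how the scores are interpolated. A standard choice, assumed in the Python sketch below, is linear interpolation: a weighted sum of the probabilities that each n-gram model assigns to the same event, with weights that sum to 1. The probabilities and weights shown are made-up illustrations, not results.

```python
def interpolate(probabilities, weights):
    """Linearly interpolate the probabilities given by several language models."""
    if len(probabilities) != len(weights):
        raise ValueError("need one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * p for w, p in zip(weights, probabilities))


# Hypothetical scores for one candidate from three models:
# morpheme n-gram, tagged-morpheme n-gram, word n-gram.
p = interpolate([0.20, 0.10, 0.05], [0.5, 0.3, 0.2])
print(round(p, 4))
```

In practice the weights would be tuned on held-out data, for example to minimize perplexity.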
Language Modeling
• Problem 2: scoring hypotheses
  - Option 2: classifier “certainty”
    • Use maximum entropy classifiers, which can output proper probabilities
    • Quite informative for WSJ LM-task
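A maximum entropy classifier is a log-linear model, so its output is a normalized probability distribution over labels, and the probability of the chosen label can serve as a certainty score. The Python sketch below illustrates this with hand-set, hypothetical feature weights; in a real system the weights are learned from annotated data, and the feature names here are assumptions.

```python
import math


def maxent_probs(features, weights, labels):
    """Return P(label | features) under a log-linear (maximum entropy) model."""
    scores = {
        label: sum(weights.get((feature, label), 0.0) for feature in features)
        for label in labels
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Hypothetical weight: after a determiner (LID), "kan" is more likely a noun (N1) than a verb (WW).
weights = {("prev_tag=LID", "N1"): 1.0}
probs = maxent_probs(["word=kan", "prev_tag=LID"], weights, ["N1", "WW"])
certainty = max(probs.values())
print(probs, certainty)
```

A hypothesis whose tags the classifier assigns with low certainty can then be penalized during rescoring.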
Language Modeling
• Problem 2: scoring hypotheses
  - Option 3: Maxent classifier as LM
    • Information Source: surrounding context (words, morphemes, linguistic annotation)
    • To classify: word (or morpheme)
    • VERY slow training time
Language Modeling: circumstantial evidence
• Wall-Street Journal: n-gram rescoring
  - VP set: 8.11% 7.57%
  - NVP set: 8.08% 7.74%
  + maxent classifier probabilities
  + POS 3-grams
• Mediargus: perplexity
  - Word 3-gram: 148.42
  - Morpheme 3-gram: 56.36
  - Tagged Morpheme 3-gram: 53.17
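Perplexity is the exponential of the average negative log-probability per unit. Because a text contains more morphemes than words, a per-morpheme figure is not directly comparable to a per-word one, which is part of why this evidence is circumstantial. The Python sketch below shows the computation and the effect of the normalizing unit. The slides do not state how the reported values were normalized, and the toy probabilities are assumptions.

```python
import math


def perplexity(log_probs, n_units=None):
    """Perplexity from natural-log token probabilities, normalized over n_units."""
    if n_units is None:
        n_units = len(log_probs)  # default: one unit per scored token
    return math.exp(-sum(log_probs) / n_units)


# Toy example: four morphemes, each predicted with probability 0.25,
# that together make up two words.
morpheme_log_probs = [math.log(0.25)] * 4
per_morpheme = perplexity(morpheme_log_probs)
per_word = perplexity(morpheme_log_probs, n_units=2)
print(per_morpheme, per_word)
```

The same model scores look much better per morpheme than per word, so a fair comparison needs a common normalization.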
Limitations
• Morpheme representation problematic for integration in recognizer
• Efficiency as LM not yet properly evaluated for Dutch
Available Tools & Data
Tools:
- All-in-one morpho-syntactic analyzer for Dutch
  • Morphological analyzer
  • Part-of-Speech tagger
  • Phrase Chunker
- Word vs Morpheme Boundary detector for Dutch
- Promising outlook for Dutch N-gram LM using extra annotation layers
Data:
- Adjusted version of CELEX (incl. segmented orthographic forms)
- 2.7M word database of morphologically analyzed words
- Morphologically analyzed, tagged & shallow-parsed Mediargus