Morpho-Syntactic Analysis and Language Modeling using Machine Learning Techniques

Guy De Pauw Walter Daelemans

CNTS – Language Technology Group

http://www.cnts.ua.ac.be

FLaVoR Workshop (17/11/2006) – Morpho-Syntactic Analysis and Language Modeling using Machine Learning Techniques

Morpho-Syntactic Analysis using Machine Learning Techniques

• Why?
- As an NLP tool proper (!)
- Annotate new datasets (e.g. Mediargus)
- Extra information source for language modeling

• How?
- Machine learning techniques (MBL + maxent)
- Shallow linguistic analysis

Shallow linguistic analysis

• For many NLP applications, full analysis is often not necessary
- e.g. morphological analysis

• uitzonderingsgevallen ("exceptional cases"):
FULL: ((((uitzonder)[V],(ing)[N|V.])[N],(s)[N|N.N],(geval)[N])[N]),(en)[N-m]
vs
SHALLOW: uitzonder@V + ing@N|V. + s@N|N.N + geval@N + en@N-m

• Shallow analysis: fast + robust
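The relation between the FULL and SHALLOW notations can be illustrated with a small sketch. It assumes that the shallow form is simply the sequence of leaf `(morpheme)[tag]` pairs of the full bracketing, which holds for the slide's example; the function name `full_to_shallow` is made up for this illustration and is not part of the authors' system.

```python
import re

# A leaf of the bracketed analysis looks like "(morpheme)[tag]".
# Excluding parentheses inside the first group skips the nested, non-leaf nodes.
LEAF = re.compile(r"\(([^()]+)\)\[([^\]]+)\]")

def full_to_shallow(full):
    """Keep only leaf morphemes and their tags, dropping the nesting (hypothetical helper)."""
    return " + ".join(f"{morph}@{tag}" for morph, tag in LEAF.findall(full))

full = "((((uitzonder)[V],(ing)[N|V.])[N],(s)[N|N.N],(geval)[N])[N]),(en)[N-m]"
print(full_to_shallow(full))
# prints: uitzonder@V + ing@N|V. + s@N|N.N + geval@N + en@N-m
```

The sketch only goes from full to shallow; the hierarchical information that the full analysis carries cannot be recovered from the flat form.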

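Shallow linguistic analysis

word               morphology            POS-tag   SP-tag
nu                 nu                    BW        I-ADVP
treft              tref+t                WW3       S-MAIN
de                 de                    LID       I-NP
nietsvermoedende   niets+vermoed+end+e   ADJ1      I-NP
poolreiziger       pool+reiziger         N1        I-NP
vuilnisbelten      vuil+nis+belt+en      N3        B-NP
tussen             tussen                VZ1       I-PP
de                 de                    LID       I-NP
ijsbergen          ijs+berg+en           N3        I-NP
aan                aan                   VZ2       I-SVP
.                  .                     LET       0

Example sentence: "Nu treft de nietsvermoedende poolreiziger vuilnisbelten tussen de ijsbergen aan." ("Now the unsuspecting polar traveller comes across rubbish dumps among the icebergs.")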

Shallow linguistic analysis

[ADVP nu] [SMAIN tref+t] [NP de niets+vermoed+end+e pool+reiziger] [NP vuil+nis+belt+en] [PP tussen] [NP de ijs+berg+en] [SVP aan] .
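How per-word SP-tags turn into the bracketed chunks can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that an `I-X` tag continues a chunk of type `X`, that `B-X` starts a new chunk right after another chunk, and that a tag without a hyphen is outside any chunk. The slide's `S-MAIN` tag is written here as `I-SMAIN`, which is an assumption made only to keep the scheme uniform.

```python
def iob_to_brackets(rows):
    """rows: list of (morphology, sp_tag) pairs; returns the bracketed chunk string."""
    chunks = []          # list of (label, [tokens]); label is None outside chunks
    prev_label = None
    for morph, tag in rows:
        if "-" in tag:
            prefix, label = tag.split("-", 1)
        else:
            prefix, label = "O", None          # e.g. the punctuation row
        # A new chunk starts outside chunks, on B-, or when the chunk type changes.
        if label is None or prefix == "B" or label != prev_label:
            chunks.append((label, [morph]))
        else:
            chunks[-1][1].append(morph)
        prev_label = label
    return " ".join(
        f"[{label} {' '.join(tokens)}]" if label else " ".join(tokens)
        for label, tokens in chunks
    )

rows = [
    ("nu", "I-ADVP"), ("tref+t", "I-SMAIN"),   # I-SMAIN: assumed normalisation of S-MAIN
    ("de", "I-NP"), ("niets+vermoed+end+e", "I-NP"), ("pool+reiziger", "I-NP"),
    ("vuil+nis+belt+en", "B-NP"), ("tussen", "I-PP"),
    ("de", "I-NP"), ("ijs+berg+en", "I-NP"), ("aan", "I-SVP"), (".", "0"),
]
print(iob_to_brackets(rows))
```

The `B-NP` tag on the word for "rubbish dumps" is what keeps it from being merged into the preceding noun phrase.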

Morphological Analysis

parelvissers ("pearl fishers")
- Segmentation: parel+viss+er+s
- Tagging: parel@N+viss@V+er@N|V.+s@INFLm
- Alternation: parel@N+vis@V+er@N|V.+s@INFLm

Morphological Segmentation

• Trained and evaluated on (adapted) morphological database of CELEX

• Experimental results (full-word score):
- FS (minimal boundaries + unigram): 86.7%
- Morpheme boundary prediction: 89.2%
- FS + morpheme prediction: 94.8%
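One common way to cast morpheme boundary prediction as a classification task for a memory-based learner is to create one instance per character, with a window of neighbouring characters as features. The sketch below illustrates that framing only. The window width, the padding symbol and the function name are assumptions, and the slides do not state the exact feature set that was used.

```python
def boundary_instances(segmented, width=3, pad="_"):
    """Turn 'parel+viss+er+s' into per-character instances.

    Each instance is (left_context, focus_char, right_context, label),
    where label is 1 if a morpheme boundary follows the focus character.
    """
    word = segmented.replace("+", "")
    boundaries = set()
    pos = 0
    for morph in segmented.split("+")[:-1]:   # no boundary after the last morpheme
        pos += len(morph)
        boundaries.add(pos)
    padded = pad * width + word + pad * width
    instances = []
    for i, focus in enumerate(word):
        left = padded[i : i + width]                       # characters before the focus
        right = padded[i + width + 1 : i + 2 * width + 1]  # characters after the focus
        label = 1 if (i + 1) in boundaries else 0
        instances.append((left, focus, right, label))
    return instances

for instance in boundary_instances("parel+viss+er+s"):
    print(instance)
```

A word counts as correct under a full-word score only if every one of its instances is classified correctly, which is why word-level figures are lower than per-character accuracy would be.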

Morphological Analysis

parelvissers
- Segmentation: parel+viss+er+s
- Tagging: parel@N+viss@V+er@N|V.+s@INFLm
- Alternation: parel@N+vis@V+er@N|V.+s@INFLm

96%

Alternation

• Map
- parel+viss+er+s to parel+vis+er+s
- aan+lop+en to aan+loop+en
- but also aan+ge+bracht to aan+ge+breng

• Grapheme-based alternation
• 99.4% of morphemes correctly alternated
- Including complex alternations like bracht -> breng
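To make the alternation mapping concrete, the sketch below encodes each surface/underlying morpheme pair as a "strip this ending, add that ending" label, a class encoding often used when string rewriting is treated as classification. This is a simplified stand-in chosen for illustration. The slides describe a grapheme-based approach, and this encoding is not claimed to be the one the authors used. Function names are hypothetical.

```python
import os

def alternation_label(surface, underlying):
    """Encode surface -> underlying as (ending_to_strip, ending_to_add)."""
    k = len(os.path.commonprefix([surface, underlying]))
    return (surface[k:], underlying[k:])

def apply_label(surface, label):
    """Apply a (strip, add) label to a surface morpheme."""
    strip, add = label
    if not surface.endswith(strip):
        raise ValueError(f"{surface!r} does not end in {strip!r}")
    return surface[: len(surface) - len(strip)] + add

for surface, underlying in [("viss", "vis"), ("lop", "loop"), ("bracht", "breng"), ("parel", "parel")]:
    label = alternation_label(surface, underlying)
    print(surface, "->", underlying, label)
```

Unchanged morphemes receive the empty label `("", "")`, so a classifier trained on such labels would mostly predict "no change" and only occasionally a real alternation.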

Morphological Analysis

• Use morphological analysis cascade to analyze all words in CGN and Mediargus (not in CELEX)

e.g.
F1: flowerpower-afstammelingen
F2: flowerpower-@N+af@P+stamm@V+eling@N|V.+en@INFLm
F3: flowerpower@N+af@P+stam@V+eling@N|V.+en@INFLm
F4: m

• Huge morphological database of ±2.7M words
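Entries in the `morpheme@tag+morpheme@tag` notation are easy to read back into structured form. The following sketch assumes that `+` separates morphemes and the first `@` separates a morpheme from its tag, as in the F3 example; the helper name is made up for this illustration.

```python
def parse_analysis(analysis):
    """Split 'stam@V+eling@N|V.' into [('stam', 'V'), ('eling', 'N|V.')]."""
    pairs = []
    for item in analysis.split("+"):
        morph, _, tag = item.partition("@")
        pairs.append((morph, tag))
    return pairs

f3 = "flowerpower@N+af@P+stam@V+eling@N|V.+en@INFLm"
pairs = parse_analysis(f3)
print(pairs)
print("+".join(morph for morph, _ in pairs))   # morphemes only
print(" ".join(tag for _, tag in pairs))       # tag sequence only
```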

Part-of-Speech Tagging

• Trained and evaluated on CGN + STIL
• Some experimental results
- Contextual + orthographic features: 96.6% (unknown words: 82.5%)
- + morphological information: 97.2% (unknown words: 86.9%)
  • Tags of morphemes
  • Lemma
  • Inflection tag
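The three feature groups named above (contextual, orthographic, morphological) could be assembled per token roughly as in the sketch below. All feature names, the one-word context window and the choice of orthographic cues are assumptions made for illustration; the slides list only the feature groups, not their exact definition. The lemma is passed in rather than computed.

```python
def pos_features(words, i, morph_tags=None, lemma=None):
    """Build a feature dict for words[i] (hypothetical feature set)."""
    word = words[i]
    feats = {
        # contextual features
        "word": word.lower(),
        "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        "next_word": words[i + 1].lower() if i + 1 < len(words) else "</s>",
        # orthographic features
        "suffix3": word[-3:].lower(),
        "capitalized": word[:1].isupper(),
        "has_hyphen": "-" in word,
        "has_digit": any(ch.isdigit() for ch in word),
    }
    # morphological features, available once the word has been analyzed
    if morph_tags:
        feats["morph_tags"] = "+".join(morph_tags)
        last = morph_tags[-1]
        feats["flection_tag"] = last if last.startswith("INFL") else "none"
    if lemma:
        feats["lemma"] = lemma
    return feats

feats = pos_features(["de", "parelvissers"], 1,
                     morph_tags=["N", "V", "N|V.", "INFLm"], lemma="parelvisser")
print(feats)
```

For an unknown word the `word` feature matches nothing seen in training, so the suffix and morphological features are what a tagger has left to rely on.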

Shallow linguistic analysis

SP-tag (shallow parsing) results:
- 89.5% tagging accuracy
- 87.4 F-score

System for morpho-syntactic analysis

• Morphological analysis: ±5 words/s
• Tagging + phrase chunking: ±450 words/s
• Used to annotate the entire Mediargus corpus
- Morphological analysis (±2B morphemes)
- Part-of-speech tags
- Phrase chunks

Demo: http://www.cnts.ua.ac.be/flavor

Language Modeling

• Problem 1: the input is not a sequence of words but a sequence of morphemes

• Problem 2: scoring hypotheses using shallow linguistic annotation

Language Modeling

• Problem 1: input is a sequence of morphemes

Nu tref t de niets vermoed end e pool reiziger vuil nis belt en tussen de ijs berg en aan

• Disambiguate between word and morpheme boundaries
• Use morphologically analyzed Mediargus as training material
• Approach: morpheme sequence tagging

Language Modeling

morpheme   tag
nu         NWB
tref       V
t          INFLtWB
de         EWB
niets      B
vermoed    V
end        A|BV.
e          INFLPWB
pool       N
reiziger   NWB
vuil       A
nis        N|A.
belt       N
en         INFLmWB
tussen     BWB
de         EWB
ijs        N
berg       N
en         INFLmWB
aan        PWB
.          .

Language Modeling

• Problem 1: input is a sequence of morphemes

Nu tref t de niets vermoed end e pool reiziger vuil nis belt en tussen de ijs berg en aan

[w nu ] [w tref t] [w de ] [w niets vermoed end e ] …
- Word boundaries: 97.2%
- Morpheme boundaries: 93.1%
- F-score of 92.3%
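Once each morpheme carries a tag, recovering the words is a matter of closing a word at every boundary marker. The sketch below assumes, based on the tagged example in these slides, that a tag ending in `WB` marks the last morpheme of a word; the function names are made up for this illustration.

```python
def group_words(tagged_morphemes):
    """Group (morpheme, tag) pairs into words, closing a word after each ...WB tag."""
    words, current = [], []
    for morph, tag in tagged_morphemes:
        current.append(morph)
        if tag.upper().endswith("WB"):   # assumed word-boundary marker
            words.append(current)
            current = []
    if current:                          # trailing morphemes without a boundary tag
        words.append(current)
    return words

def bracket(words):
    return " ".join(f"[w {' '.join(word)}]" for word in words)

tagged = [("nu", "NWB"), ("tref", "V"), ("t", "INFLtWB"), ("de", "EWB"),
          ("niets", "B"), ("vermoed", "V"), ("end", "A|BV."), ("e", "INFLPWB")]
print(bracket(group_words(tagged)))
# prints: [w nu] [w tref t] [w de] [w niets vermoed end e]
```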

Language Modeling

• (Big) remaining problem:
- aanlopen -> aan+lop+en or aan+loop+en
- gebracht -> ge+bracht or ge+breng
- But not: aan+loop+en and ge+bracht
- Information not available in CELEX
- But: orthography is the closest guess
  • True pronounced morphemes quite workable
  • Decent accuracy on harder task
  • Regular expression + grapheme-to-phoneme conversion
- Not yet integrated in recognizer

Language Modeling

• Turn morphemes into word forms (+ reverse alternation)
- Re-analyze word form

• Tag + shallow parse sequence of words

Demo: www.cnts.ua.ac.be/flavor

Language Modeling

• Problem 2: scoring hypotheses
- Option 1: n-gram models trained on annotated Mediargus corpus
  • Morpheme n-grams: de niets vermoed end <e>
  • Tagged-morpheme n-grams: Ewb B V A|BV. <INFLPWB>
  • Word n-grams
  • Part-of-speech tag n-grams
  • Shallow parsing tag n-grams
  • Combination: de@LID@NP <kan@WW@NP> or <kan@N1@NP>
- Interpolate LM scores
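Interpolating LM scores can be sketched with plain linear interpolation, where each component model (word n-grams, morpheme n-grams, tag n-grams, and so on) contributes its probability for the next unit with a fixed weight. The slides do not say which interpolation scheme or weights were used, so both are assumptions here, and the probabilities in the example are invented.

```python
import math

def interpolate(probs, weights):
    """Linear interpolation: sum_k weight_k * P_k(unit | history)."""
    if len(probs) != len(weights):
        raise ValueError("need one weight per component model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * p for w, p in zip(weights, probs))

def hypothesis_logprob(per_unit_probs, weights):
    """Log-probability of a hypothesis: sum of logs of the interpolated probabilities."""
    return sum(math.log(interpolate(probs, weights)) for probs in per_unit_probs)

# Invented numbers: each row holds (morpheme-LM, tag-LM) probabilities for one unit.
units = [(0.20, 0.40), (0.05, 0.30), (0.10, 0.10)]
print(hypothesis_logprob(units, weights=[0.5, 0.5]))
```

In practice the weights would be tuned on held-out data; equal weights are used here only to keep the example short.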

Language Modeling

• Problem 2: scoring hypotheses
- Option 2: classifier "certainty"
  • Use maximum entropy classifiers, which can output proper probabilities
  • Quite informative for WSJ LM task
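The idea of classifier certainty as a score can be illustrated with a toy maximum entropy classifier: the class probabilities are a softmax over summed feature weights, and the log-probability of the best class at each position is added up over a hypothesis. The weights and feature names below are invented, and summing the log of the top probability is only one possible way to turn certainty into a score; the slides do not specify the exact formulation.

```python
import math

def maxent_probs(weights, active_features):
    """P(label | features) = exp(sum of weights) / Z, computed stably."""
    scores = {label: sum(w.get(f, 0.0) for f in active_features)
              for label, w in weights.items()}
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    z = sum(exps.values())
    return {label: e / z for label, e in exps.items()}

def certainty_score(distributions):
    """Sum of log-probabilities of the most likely label at each position."""
    return sum(math.log(max(dist.values())) for dist in distributions)

# Toy weights (invented) for tagging "kan" after "de": verb (WW) vs. noun (N1).
weights = {
    "WW": {"prev=de": -1.0, "word=kan": 0.5},
    "N1": {"prev=de": 1.5, "word=kan": 0.2},
}
dist = maxent_probs(weights, ["prev=de", "word=kan"])
print(dist, certainty_score([dist]))
```

A hypothesis on which the tagger is confident at every position gets a score close to 0, while an implausible word sequence tends to produce flatter distributions and a lower score.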

Language Modeling

• Problem 2: scoring hypotheses
- Option 3: maxent classifier as LM
  • Information source: surrounding context (words, morphemes, linguistic annotation)
  • To classify: word (or morpheme)
  • VERY slow training time

Language Modeling: circumstantial evidence

• Wall Street Journal: n-gram rescoring
- VP set: 8.11% -> 7.57%
- NVP set: 8.08% -> 7.74%
+ maxent classifier probabilities
+ POS 3-grams

• Mediargus: perplexity
- Word 3-gram: 148.42
- Morpheme 3-gram: 56.36
- Tagged morpheme 3-gram: 53.17
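For reference, perplexity is the exponentiated average negative log-probability that a model assigns to each unit of a test text. The sketch below uses base-2 logarithms and invented probabilities; it is a generic definition, not the evaluation script behind the Mediargus figures.

```python
import math

def perplexity(log2_probs):
    """2 ** (-(1/N) * sum of log2 P(unit_i | history_i))."""
    if not log2_probs:
        raise ValueError("need at least one unit")
    return 2 ** (-sum(log2_probs) / len(log2_probs))

# A model that gives every unit probability 1/8 has perplexity 8.
uniform = [math.log2(1 / 8)] * 4
print(perplexity(uniform))
```

Because the average is taken per unit, a perplexity over morphemes and a perplexity over words are normalized by different unit counts, which is worth keeping in mind when reading the figures side by side.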

Limitations

• Morpheme representation problematic for integration in recognizer

• Efficiency as LM not yet properly evaluated for Dutch

Available Tools & Data

Tools:
- All-in-one morpho-syntactic analyzer for Dutch
  • Morphological analyzer
  • Part-of-speech tagger
  • Phrase chunker
- Word vs. morpheme boundary detector for Dutch
- Promising outlook for Dutch n-gram LM using extra annotation layers

Data:
- Adjusted version of CELEX (incl. segmented orthographic forms)
- 2.7M-word database of morphologically analyzed words
- Morphologically analyzed, tagged & shallow-parsed Mediargus