7 November 2006 University of Tehran Persian POS Tagging Hadi Amiri Database Research Group (DBRG)...

7 November 2006 University of Tehran

Persian POS Tagging

Hadi Amiri

Database Research Group (DBRG)

ECE Department, University of Tehran

Outline

• What is POS tagging
• How is data tagged for POS?
• Tagged Corpora
• POS Tagging Approaches
• Corpus Training
• How to Evaluate a tagger?
• Bijankhan Corpus
• Memory Based POS
• MLE Based POS
• Neural Network POS Tagger

What is POS tagging

Annotating each word for its part of speech (grammatical type) in a given sentence.

e.g. I/PRP would/MD prefer/VB to/TO study/VB at/IN a/DT traditional/JJ school/NN

Properties:
• It helps parsing
• It resolves pronunciation ambiguities
  As the water grew colder, their hands grew number. (number=ADJ, not N)
• It resolves semantic ambiguities
  Patients can bear pain.

POS Application

Part-of-speech (POS) tagging is important for many applications:
• Word sense disambiguation
• Parsing
• Language modeling
• Q&A and Information extraction
• Text-to-speech

Tagging techniques can be used for a variety of tasks:
• Semantic tagging
• Dialogue tagging
• Information Retrieval ...

POS Tags

Tag      Class          Examples
N        noun           baby, toy
V        verb           see, kiss
ADJ      adjective      tall, grateful, alleged
ADV      adverb         quickly, frankly, ...
P        preposition    in, on, near
DET      determiner     the, a, that
WhPron   wh-pronoun     who, what, which, ...
COORD    coordinator    and, or

Open class: N, V, ADJ, ADV

POS Tags

• There is no standard set of POS tags.
• Some use coarse classes: e.g., N, V, A, Aux, ...
• Others prefer finer distinctions (e.g., Penn Treebank):

• PRP: personal pronouns (you, me, she, he, them, him, …)

• PRP$: possessive pronouns (my, our, her, his, …)

• NN: singular common nouns (sky, door, theorem, …)

• NNS: plural common nouns (doors, theorems, women, …)

• NNP: singular proper names (Fifi, IBM, Canada, …)

• NNPS: plural proper names (Americas, Carolinas, …)

How is data tagged for POS?

• We are trying to model human performance.

• So we have humans tag a corpus and try to match their performance.

To create a model:
• Corpora are hand-tagged for POS by more than one annotator.
• The annotations are then checked for reliability.
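
One common way to check reliability between two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The slides do not say which measure was used, so the sketch below is only an illustration; the function name and the two toy annotations are assumptions.

```python
from collections import Counter

def cohen_kappa(tags_a, tags_b):
    """Cohen's kappa for two annotators who tagged the same tokens."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("both annotators must tag the same non-empty token list")
    n = len(tags_a)
    # Observed agreement: fraction of tokens with identical tags.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Chance agreement: product of each annotator's tag frequencies.
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Toy annotations (hypothetical): the annotators disagree on "traditional".
annotator_1 = ["PRP", "MD", "VB", "TO", "VB", "IN", "DT", "JJ", "NN"]
annotator_2 = ["PRP", "MD", "VB", "TO", "VB", "IN", "DT", "NN", "NN"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

A kappa near 1 indicates that the annotators agree far more often than chance would predict.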

History

Timeline figure (1960 to 2000) with the following milestones:

• Brown Corpus created (EN-US), 1 million words
• Brown Corpus tagged
• Greene and Rubin: rule based, 70%
• LOB Corpus created (EN-UK), 1 million words
• LOB Corpus tagged
• HMM tagging (CLAWS): 93%-95%
• DeRose/Church: efficient HMM, sparse data, 95%+
• POS tagging separated from other NLP
• British National Corpus (tagged by CLAWS)
• Penn Treebank Corpus (WSJ, 4.5M)
• Transformation Based Tagging (Eric Brill): rule based, 95%+
• Tree-Based Statistics (Helmut Schmid): rule based, 96%+
• Neural Network: 96%+
• Trigram Tagger (Kempe): 96%+
• Combined Methods: 98%+

Tagged Corpora

Corpus               # Tags   # Tokens
Brown                87       1 million
British Natl         61       100 million
Penn Treebank        45       4.8 million
Original Bijankhan   550      ?
Bijankhan            40       2.6 million

POS Tagging Approaches

POS Tagging
  Supervised
    Rule-Based
    Stochastic
    Neural
  Unsupervised
    Rule-Based
    Stochastic
    Neural

Rule-Based POS Tagger

Lexicon with tags identified for each word, e.g. the entries for "that":
  ADV
  PRON DEM SG
  DET CENTRAL DEM SG
  CS

Constraints to eliminate tags:
  If the next word is adj, adv, or quant
  And the word following it is a sentence boundary
  And the previous word is not a consider-type verb
  Then eliminate non-ADV tags

Example: He was that drunk.
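
The constraint above can be sketched as a small filter over the candidate tags of "that". This is only an illustration of the idea, not the rule engine of any particular tagger; the function name, its parameters, and the short list of consider-type verbs are assumptions.

```python
ADVERB_CONTEXT = {"ADJ", "ADV", "QUANT"}
# Hypothetical sample of consider-type verbs; a real lexicon would mark these.
CONSIDER_TYPE_VERBS = {"consider", "believe", "find"}

def prune_that(candidates, prev_word, next_tag, following_is_boundary):
    """Keep only ADV when all three conditions of the constraint hold."""
    if (next_tag in ADVERB_CONTEXT
            and following_is_boundary
            and prev_word.lower() not in CONSIDER_TYPE_VERBS):
        return [tag for tag in candidates if tag == "ADV"]
    # Constraint does not fire: leave the candidate list unchanged.
    return list(candidates)

lexicon_that = ["ADV", "PRON DEM SG", "DET CENTRAL DEM SG", "CS"]

# "He was that drunk." : next word is an adjective, then a sentence boundary.
print(prune_that(lexicon_that, "was", "ADJ", True))
# "I consider that odd." : previous word is a consider-type verb, so no pruning.
print(prune_that(lexicon_that, "consider", "ADJ", True))
```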

Probabilistic POS Tagging

• Provides the possibility of automatic training rather than painstaking rule revision.

• Automatic training means that a tagger can be easily adapted to new text domains.

E.g.

A moving/VBG house

A moving/JJ ceremony

Probabilistic POS Tagging

• Needs large tagged corpus for training

• Unigram statistics (most common part-of-speech for each word) get us to about 90% accuracy

• For greater accuracy, we need some information on adjacent words

Corpus Training

• The probabilities in a statistical model come from the corpus it is trained on.

• If the corpus is too domain-specific, the model may not be portable to other domains.

• If the corpus is too general, it will not capitalize on the advantages of domain-specific probabilities

Tagger Evaluation

• Once a tagging model has been built, how is it tested?
  Typically, a corpus is split into a training set (usually ~90% of the data) and a test set (10%).
  The test set is held out from the training.
  The tagger learns the tag sequences that maximize the probabilities for that model.
  The tagger is tested on the test set.
• Tagger is not trained on test data.
• But test data is highly similar to training data.
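
The hold-out procedure can be sketched as follows. This is a minimal illustration, assuming the corpus is a list of sentences and each sentence is a list of (word, tag) pairs; the helper names `split_corpus` and `accuracy` and the toy data are not from the original slides.

```python
import random

def split_corpus(sentences, test_fraction=0.1, seed=0):
    """Shuffle the sentences and hold out a fraction of them as the test set."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)

def accuracy(tagger, test_sentences):
    """Fraction of test tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sentence in test_sentences:
        words = [word for word, _ in sentence]
        predicted = tagger(words)
        for (_, gold), guess in zip(sentence, predicted):
            correct += gold == guess
            total += 1
    return correct / total if total else 0.0

# Toy corpus (hypothetical): 20 one-word sentences.
corpus = [[(f"w{i}", "NN")] for i in range(20)]
train, test = split_corpus(corpus)

def noun_tagger(words):
    """Trivial stand-in tagger that labels every word as a noun."""
    return ["NN"] * len(words)

print(len(train), len(test), accuracy(noun_tagger, test))
```

Splitting by sentence rather than by token keeps each test sentence intact, so context-based taggers see realistic input.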

Current Performance

• How many tags are correct?
  About 98% currently
  But baseline is already 90%
  Baseline algorithm:
  • Tag every word with its most frequent tag
  • Tag unknown words as nouns
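
The baseline algorithm is short enough to sketch directly. The code below is an illustration under assumed names (`train_baseline`, `tag_baseline`) with a toy training set; "NN" stands in for the noun tag of whichever tag set is used.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Map every training word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(words, most_frequent, unknown_tag="NN"):
    """Tag known words with their most frequent tag, unknown words as nouns."""
    return [most_frequent.get(word, unknown_tag) for word in words]

# Toy training data (hypothetical).
training = [
    [("the", "DT"), ("moving", "VBG"), ("house", "NN")],
    [("a", "DT"), ("moving", "JJ"), ("ceremony", "NN")],
    [("the", "DT"), ("moving", "JJ"), ("story", "NN")],
]
model = train_baseline(training)
print(tag_baseline(["the", "moving", "truck"], model))
```

Here "moving" receives JJ because that tag is more frequent in the toy data, and the unseen word "truck" falls back to the noun tag.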

• How well do people do?

Memory Based Part Of Speech Tagging Experiments With Persian Text

Corpus Study

• At first the corpus had 550 tags.
• The content is gathered from daily news and common texts.
• Each document is assigned a subject such as political, cultural and so on.
  In total, there are 4300 different subjects.
  This subject categorization provides an ideal experimental environment for clustering, filtering, and categorization research.
• In this research, we simply ignored the subject categories of the documents and concentrated on POS tags.

Selecting Suitable Tags

• At first, the frequency of each tag was gathered.
• Then many of the tags were grouped together and a smaller tag set was produced.
• Each tag in the tag set is placed in a hierarchical structure. As an example, consider the tag “N_PL_LOC”:
  N stands for a noun
  PL marks the noun as plural
  LOC marks the noun as referring to a location
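
Because the levels of a tag are joined with underscores, a tag can be coarsened by keeping only its first levels. The helpers below are a small sketch of that idea; their names are assumptions, not part of the corpus tooling.

```python
def tag_levels(tag):
    """Split a hierarchical tag such as 'N_PL_LOC' into its levels."""
    return tag.split("_")

def truncate_tag(tag, depth):
    """Keep only the first `depth` levels of a hierarchical tag."""
    return "_".join(tag_levels(tag)[:depth])

print(tag_levels("N_PL_LOC"))       # levels from coarse to fine
print(truncate_tag("N_PL_LOC", 1))  # coarsest class only
print(truncate_tag("N_PL_LOC", 2))  # class plus number
```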

The Tags Distribution

Max, Min, AVG, Total # of Tags in The Training Set

Number of Different Tags

For instance, the word “آسمان”, which means “the sky” in English, is always tagged with “N_SING” in the whole corpus, but a word like “بالا”, which means “high” or “above”, has been tagged with several tags (“ADJ_SIM”, “ADV”, “ADV_NI”, “N_SING”, “P”, and “PRO”).

Classifying the Rare Words

Tag       Share
N_SING    38%
P         12%
ETC       12%
DELM      10%
ADJ_SIM   9%
CON       8%
N_PL      6%
V_PA      3%
PRO       2%

Tags that occur fewer than 5000 times in the corpus are gathered into the “ETC” group.
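
The grouping step can be sketched as a single pass over the tag counts. The function name and the example counts below are hypothetical; only the 5000 threshold and the “ETC” label come from the slide.

```python
from collections import Counter

def group_rare_tags(tag_counts, threshold=5000, other="ETC"):
    """Merge every tag that occurs fewer than `threshold` times into `other`."""
    grouped = Counter()
    for tag, count in tag_counts.items():
        grouped[tag if count >= threshold else other] += count
    return grouped

# Hypothetical counts for illustration only.
counts = {"N_SING": 900000, "P": 300000, "RARE_A": 3000, "RARE_B": 120}
grouped = group_rare_tags(counts)
print(dict(grouped))
```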

Bijankhan Corpus

Implemented Methods

• MLE Based POS Tagger

• Neural Network POS Tagger

• Memory Based POS Tagger

Memory-Based POS Tagging

• Memory-based POS tagging is also called Lazy Learning, Example-Based Learning, or Case-Based Learning.

• MBT uses some specifications of each word such as its possible tags, and a fixed width context as features.

• We used MBT, a tool for memory based tagger generation and tagging. (available at: http://ilk.uvt.nl/mbt/)

Memory-Based POS Tagging

The MBT tool generates a tagger by working through the annotated corpus and creating three data structures:
• a lexicon, associating words to tags as evident in the training corpus
• a case base for known words (words occurring in the lexicon)
• a case base for unknown words

Selecting appropriate feature sets for known and unknown words has an important impact on the accuracy of the results.

Memory-Based POS Tagging

After different experiments, we chose “ddfa” as the feature set for known words:
• d d: the disambiguated tags of the two words before the current word
• f: the focus (current) word
• a: the ambiguous (possible) tags of the word after the current word

So “ddfa” chooses the appropriate tag for each known word based on the tags of the two words before it and the possible tags of the word after it.
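
To make the “ddfa” pattern concrete, the sketch below builds one feature tuple for a known word. It is not MBT's own code: the helper names, the "_" padding symbol, and the choice to encode the focus word by its set of possible tags from the lexicon are assumptions made for illustration.

```python
PAD = "_"  # assumed padding symbol for positions outside the sentence

def ambitag(word, lexicon):
    """Join all tags the lexicon lists for a word into one ambiguity class."""
    return "-".join(sorted(lexicon.get(word, {"UNKNOWN"})))

def ddfa_features(words, previous_tags, i, lexicon):
    """Features (d, d, f, a) for the known word at position i.

    previous_tags holds the tags already assigned to words[0..i-1].
    """
    d2 = previous_tags[i - 2] if i >= 2 else PAD
    d1 = previous_tags[i - 1] if i >= 1 else PAD
    f = ambitag(words[i], lexicon)
    a = ambitag(words[i + 1], lexicon) if i + 1 < len(words) else PAD
    return (d2, d1, f, a)

# Toy English lexicon (hypothetical) to keep the example readable.
lexicon = {"a": {"DT"}, "moving": {"VBG", "JJ"}, "house": {"NN", "VB"}}
words = ["a", "moving", "house"]
features = ddfa_features(words, ["DT"], 1, lexicon)
print(features)
```

Each such tuple, paired with the correct tag, is one case in the known-words case base; tagging a new word means finding the most similar stored cases.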

Memory-Based POS Tagging

The feature set chosen for unknown words is “dFass”:
• d is the disambiguated tag of the word before the current word
• F indicates the position of the focus (current) word; it is not included in the actual feature set for tagging
• a stands for the ambiguous tags of the word after the current word
• ss are the two suffix letters of the current word
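
A feature tuple for the “dFass” pattern can be sketched in the same spirit. Again this is an illustration rather than MBT's implementation; the helper names, the "_" padding symbol, and the toy lexicon with the made-up unknown word "blorfing" are assumptions.

```python
PAD = "_"  # assumed padding symbol

def ambitag(word, lexicon):
    """Join all tags the lexicon lists for a word into one ambiguity class."""
    return "-".join(sorted(lexicon.get(word, {"UNKNOWN"})))

def dfass_features(words, previous_tags, i, lexicon):
    """Features (d, a, s, s) for the unknown word at position i.

    F only marks where the focus word sits, so it contributes no feature.
    """
    d = previous_tags[i - 1] if i >= 1 else PAD
    a = ambitag(words[i + 1], lexicon) if i + 1 < len(words) else PAD
    word = words[i]
    s1 = word[-2] if len(word) >= 2 else PAD  # second-to-last letter
    s2 = word[-1] if word else PAD            # last letter
    return (d, a, s1, s2)

lexicon = {"the": {"DT"}, "house": {"NN", "VB"}}
words = ["the", "blorfing", "house"]  # "blorfing" is not in the lexicon
features = dfass_features(words, ["DT"], 1, lexicon)
print(features)
```

Python strings index by character, so the same suffix logic applies unchanged to Persian words.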

MBT Results- Known Words

“ddfa”

MBT Results- Unknown Words

“dFass”

MBT Results- Overall

Implemented Methods

• Neural Network POS Tagger

• MLE Based POS Tagger

• Memory Based POS Tagger

Maximum Likelihood Estimation

As a benchmark of POS tagging accuracy, we chose the Maximum Likelihood Estimation (MLE) approach:

• Calculate the maximum likelihood probability of each tag assigned to any word in the training set.

• For each word, choose the tag with the greatest maximum likelihood probability (the designated tag) and make it the only tag assignable to that word.

• To evaluate this method, we analyze the words in the test set and assign the designated tags to them.

Maximum Likelihood Estimation

Occurrence   Word          Tag       MLE
1            پدرانه        ADV_NI    0.1667
5            پدرانه        ADJ_SIM   0.8333
4            پديدار        ADJ_SIM   0.1538
22           پديدار        N_SING    0.8462
1            پذيرفته       N_SING    0.0096
3            پذيرفته       ADJ_SIM   0.0288
6            پذيرفته       V_PA      0.0577
94           پذيرفته       ADJ_INO   0.9038
2            پراكنده اند   V_PRE     0.5000
2            پراكنده اند   V_PA      0.5000
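
The MLE column is the count of a (word, tag) pair divided by the total count of the word, P(tag | word) = C(word, tag) / C(word). The sketch below recomputes it from the occurrence counts in the table; the function names are assumptions, and only the first three words are included.

```python
from collections import defaultdict

# (word, tag, occurrences) rows taken from the table above.
ROWS = [
    ("پدرانه", "ADV_NI", 1), ("پدرانه", "ADJ_SIM", 5),
    ("پديدار", "ADJ_SIM", 4), ("پديدار", "N_SING", 22),
    ("پذيرفته", "N_SING", 1), ("پذيرفته", "ADJ_SIM", 3),
    ("پذيرفته", "V_PA", 6), ("پذيرفته", "ADJ_INO", 94),
]

def mle_table(rows):
    """P(tag | word) = C(word, tag) / C(word) for every row."""
    totals = defaultdict(int)
    for word, _, n in rows:
        totals[word] += n
    return {(word, tag): n / totals[word] for word, tag, n in rows}

def designated_tags(probabilities):
    """Pick the tag with the highest probability for each word."""
    best = {}
    for (word, tag), p in probabilities.items():
        if word not in best or p > best[word][1]:
            best[word] = (tag, p)
    return {word: tag for word, (tag, _) in best.items()}

probs = mle_table(ROWS)
designated = designated_tags(probs)
for (word, tag), p in probs.items():
    print(word, tag, f"{p:.4f}")
```

A tie such as the 0.5000 / 0.5000 case in the last two table rows needs an explicit tie-breaking rule; this sketch simply keeps the first tag it encounters.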

MLE Results-Known Words

MLE Results- Unknown Words, “DEFAULT”

For each unknown word we assign the “DEFAULT” tag.

MLE Results- Overall, “DEFAULT”

For each unknown word we assign the “DEFAULT” tag.

MLE Results- Unknown Words, “N_SING”

For each unknown word we assign the “N_SING” tag.

MLE Results- Overall, “N_SING”

For each unknown word we assign the “N_SING” tag, the most frequently assigned tag.

Comparison With Other Languages

Implemented Methods

• MLE Based POS Tagger

• Neural Network POS Tagger

• Memory Based POS Tagger

Neural Network

Each unit corresponds to one of the tags in the tag set.

Preceding Words Following Words

Neural Network

• For each POS tag $pos_j$ and each of the $p+1+f$ words in the context, there is an input unit whose activation $in_{ij}$ represents the probability that $word_i$ has POS $pos_j$.

Input representation for the currently tagged word and the following words:

The activation value for the preceding words:
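
The two formulas on this slide are not present in the transcript. As a hedged sketch, the code below assumes the formulation used in Schmid's Net-Tagger, which the description above resembles: for the current and following words the activation is the lexical probability p(pos_j | word_i), and for the preceding words it is the network's own output for those already tagged words. The function name, the zero padding, and the toy probabilities are assumptions.

```python
def build_input(lexical_probs, previous_outputs, tagset, words, t, p=2, f=1):
    """Input vector for tagging words[t] with p preceding and f following words.

    lexical_probs: word -> {tag: p(tag | word)}
    previous_outputs: network output vectors for words[0..t-1]
    """
    zeros = [0.0] * len(tagset)
    vector = []
    for i in range(-p, f + 1):
        pos = t + i
        if pos < 0 or pos >= len(words):
            vector.extend(zeros)  # position outside the sentence
        elif i < 0:
            vector.extend(previous_outputs[pos])  # already tagged word
        else:
            probs = lexical_probs.get(words[pos], {})
            vector.extend(probs.get(tag, 0.0) for tag in tagset)
    return vector

# Toy values (hypothetical).
tagset = ["DT", "JJ", "NN"]
words = ["a", "moving", "house"]
lexical_probs = {
    "a": {"DT": 1.0},
    "moving": {"JJ": 0.6, "NN": 0.4},
    "house": {"NN": 1.0},
}
previous_outputs = [[0.9, 0.05, 0.05]]  # network output for "a"
vector = build_input(lexical_probs, previous_outputs, tagset, words, t=1, p=1, f=1)
print(vector)
```

The vector has (p + 1 + f) blocks of one value per tag, matching the count of input units described above.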

Neural Network Results on Bijankhan Corpus

Training Algorithm         No. of Hidden Layers   No. of Inputs for Training   Training Duration (Hour)   No. of Inputs for Test   Accuracy
MLP                        2                      1mil                         120:00:87                  1000                     Too Low
MLP                        3                      1mil                         ?                          1000                     Too Low
Generalized Feed Forward   1                      1mil                         95:30:57                   1000                     Too Low
Generalized Feed Forward   2                      1mil                         ?                          1000                     Too Low
Generalized Feed Forward   2                      20000                        1:53:35                    1000                     58%

Neural Network on Other Languages

English

Neural Network on Other Languages

Chinese

Future Work

• Using more than one level of POS tags.

• Unsupervised POS tagging using Hamshahri Collection

• Investigation of other methods for Persian POS tagging such as Support Vector Machine (SVM) based tagging

• KASRE YE EZAFE in Persian!

Thank You

Questions?