Natural Language Processing: Parts of Speech Tagging, Its Classes, and How to Process It

Post on 23-Jun-2015


Part of Speech Tagging

Perspectivising NLP: Areas of AI and their inter-dependencies

[Figure: areas of AI and their inter-dependencies. Labels: Search, Logic, Knowledge Representation, Machine Learning, Planning, Expert Systems, NLP, Vision, Robotics]

Two pictures

[Figure: two diagrams. Recoverable labels: NLP, Vision, Speech, Statistics and Probability + Knowledge Based; NLP Trinity with Problem (Morph Analysis, Part of Speech Tagging, Parsing, Semantics), Language (Hindi, Marathi, English, French) and Algorithm (HMM, MEMM, CRF)]

What it is

POS tagging is the process of attaching to each word in a sentence a suitable tag from a given set of tags.
The set of tags is called the tag-set.
Standard tag-set: Penn Treebank (for English).

Definition

Tagging is the assignment of a single part-of-speech tag to each word (and punctuation marker) in a corpus.

“_“ The_DT guys_NNS that_WDT make_VBP traditional_JJ hardware_NN are_VBP really_RB being_VBG obsoleted_VBN by_IN microprocessor-based_JJ machines_NNS ,_, ”_” said_VBD Mr._NNP Benton_NNP ._.
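The word_TAG notation in the example can be read back into (word, tag) pairs with a few lines of code. The sketch below is only an illustration: the helper name `parse_tagged` is hypothetical, and the quotation-mark tokens are left out of the sample string for simplicity.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST underscore so the tag is always the final piece.
        word, tag = token.rsplit("_", 1)
        pairs.append((word, tag))
    return pairs


# Sample sentence from the slide (quotation-mark tokens omitted).
sentence = (
    "The_DT guys_NNS that_WDT make_VBP traditional_JJ hardware_NN "
    "are_VBP really_RB being_VBG obsoleted_VBN by_IN "
    "microprocessor-based_JJ machines_NNS ,_, said_VBD Mr._NNP Benton_NNP ._."
)
pairs = parse_tagged(sentence)
for word, tag in pairs[:3]:
    print(word, tag)
```

Punctuation is tagged like any other token, so `,_,` and `._.` come back as the pairs (",", ",") and (".", ".").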

POS Tags

NN – Noun; e.g. Dog_NN
VM – Main Verb; e.g. Run_VM
VAUX – Auxiliary Verb; e.g. Is_VAUX
JJ – Adjective; e.g. Red_JJ
PRP – Pronoun; e.g. You_PRP
NNP – Proper Noun; e.g. John_NNP
etc.

POS Tag Ambiguity

In English: I bank1 on the bank2 on the river bank3 for my transactions.
Bank1 is a verb; the other two banks are nouns.

In Hindi: ”Khaanaa” can be a noun (food) or a verb (to eat).

For Hindi

Rama achhaa gaata hai. (hai is VAUX: auxiliary verb); Ram sings well.
Rama achhaa ladakaa hai. (hai is VCOP: copula verb); Ram is a good boy.

Process

List all possible tags for each word in the sentence.
Choose the best tag sequence.

Example

”People jump high”.

People : Noun/Verb

jump : Noun/Verb

high : Noun/Verb/Adjective

We can start with probabilities.
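The first step of the process, listing every candidate tag sequence, can be sketched as follows. The dictionary simply restates the candidate tags from the example; the variable names are illustrative.

```python
from itertools import product

# Candidate tags per word, as listed in the example.
candidates = {
    "People": ["Noun", "Verb"],
    "jump": ["Noun", "Verb"],
    "high": ["Noun", "Verb", "Adjective"],
}
words = ["People", "jump", "high"]

# Cartesian product of the per-word candidate lists: 2 * 2 * 3 sequences.
sequences = list(product(*(candidates[w] for w in words)))
print(len(sequences))  # 12
for seq in sequences[:3]:
    print(seq)
```

Even this three-word sentence has 12 candidate sequences, which is why a scoring scheme such as probabilities is needed to pick one.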

Importance of POS tagging

Ack: presentation by Claire Gardent on POS tagging with NLTK

What is Part of Speech (POS)

Words can be divided into classes that behave similarly.
Traditionally eight parts of speech in English: noun, verb, pronoun, preposition, adverb, conjunction, adjective and article.
More recently larger sets have been used: e.g. Penn Treebank (45 tags), Susanne (353 tags).

Why POS

POS tells us a lot about a word (and the words near it). E.g.:
adjectives are often followed by nouns
personal pronouns are often followed by verbs
possessive pronouns are often followed by nouns

Pronunciation depends on POS, e.g. object (stress on the first syllable as NN, on the second syllable as VM), content, discount

POS tagging is the first step in many NLP applications

Categories of POS

Open and closed classes

Closed classes have a fixed membership of words: determiners, pronouns, prepositions

Closed class words are usually function words: frequently occurring, grammatically important, often short (e.g. of, it, the, in)

Open classes: nouns, verbs, adjectives and adverbs (these allow new words to be added)

Open Class (1/2)

Nouns:
Proper nouns (Scotland, BBC)
Common nouns: count nouns (goat, glass), mass nouns (snow, pacifism)

Verbs: actions and processes (run, hope)
also auxiliary verbs (is, are, am, will, can)

Open Class (2/2)

Adjectives: properties and qualities (age, colour, value)

Adverbs: modify verbs, or verb phrases, or other adverbs
- Unfortunately John walked home extremely slowly yesterday
Sentential adverb: unfortunately
Manner adverb: extremely, slowly
Time adverb: yesterday

Closed class

Prepositions: on, under, over, to, with, by
Determiners: the, a, an, some
Pronouns: she, you, I, who
Conjunctions: and, but, or, as, when, if
Auxiliary verbs: can, may, are

Penn tagset (1/2)

Penn tagset (2/2)

Indian Language Tagset: Noun

Indian Language Tagset: Pronoun

Indian Language Tagset: Quantifier

Indian Language Tagset: Demonstrative

3    Demonstrative  DM   DM      Vaha, jo, yaha
3.1  Deictic        DMD  DM DMD  Vaha, yaha
3.2  Relative       DMR  DM DMR  jo, jis
3.3  Wh-word        DMQ  DM DMQ  kis, kaun
     Indefinite     DMI  DM DMI  KoI, kis

Indian Language Tagset: Verb, Adjective, Adverb

Indian Language Tagset: Postposition, Conjunction

Indian Language Tagset: Particle

Indian Language Tagset: Residuals

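Bigram Assumption

Best tag sequence:

$$T^* = \arg\max_T P(T \mid W) = \arg\max_T P(T)\,P(W \mid T) \quad \text{(by Bayes' theorem; } P(W) \text{ does not depend on } T\text{)}$$

$$
\begin{aligned}
P(T) &= P(t_0 = \text{\^{}},\ t_1 t_2 \ldots t_n,\ t_{n+1} = \text{.}) \\
     &= P(t_0)\,P(t_1 \mid t_0)\,P(t_2 \mid t_1 t_0)\,P(t_3 \mid t_2 t_1 t_0) \cdots P(t_n \mid t_{n-1} t_{n-2} \ldots t_0)\,P(t_{n+1} \mid t_n t_{n-1} \ldots t_0) \\
     &= P(t_0)\,P(t_1 \mid t_0)\,P(t_2 \mid t_1) \cdots P(t_n \mid t_{n-1})\,P(t_{n+1} \mid t_n) \\
     &= \prod_{i=0}^{n+1} P(t_i \mid t_{i-1}) \quad \text{(Bigram Assumption)}
\end{aligned}
$$

In the last line the i = 0 factor P(t_0 | t_{-1}) stands for P(t_0).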

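Lexical Probability Assumption

$$
\begin{aligned}
P(W \mid T) = {} & P(w_0 \mid t_0 \ldots t_{n+1})\,P(w_1 \mid w_0,\ t_0 \ldots t_{n+1})\,P(w_2 \mid w_1 w_0,\ t_0 \ldots t_{n+1}) \cdots \\
                 & P(w_n \mid w_0 \ldots w_{n-1},\ t_0 \ldots t_{n+1})\,P(w_{n+1} \mid w_0 \ldots w_n,\ t_0 \ldots t_{n+1})
\end{aligned}
$$

Assumption: a word is determined completely by its tag. This is inspired by speech recognition.

$$
\begin{aligned}
P(W \mid T) &= P(w_0 \mid t_0)\,P(w_1 \mid t_1) \cdots P(w_{n+1} \mid t_{n+1}) \\
            &= \prod_{i=0}^{n+1} P(w_i \mid t_i) \\
            &= \prod_{i=1}^{n+1} P(w_i \mid t_i) \quad \text{(Lexical Probability Assumption)}
\end{aligned}
$$

The i = 0 factor can be dropped because w_0 = ^ is always emitted by t_0 = ^, so P(w_0 | t_0) = 1.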
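As a worked instance of the two assumptions, take the sentence "^ People jump high ." (so n = 3) with the candidate tag sequence ^ N V A . and multiply the two products together. This is only an illustrative expansion; it assumes that P(t_0 = ^) = 1 and that the end marker is emitted with probability 1.

$$
\begin{aligned}
P(T)\,P(W \mid T) = {} & P(N \mid \text{\^{}})\,P(V \mid N)\,P(A \mid V)\,P(\text{.} \mid A) \\
                       & \times P(\text{People} \mid N)\,P(\text{jump} \mid V)\,P(\text{high} \mid A)
\end{aligned}
$$

Each candidate tag sequence gets one such product, and the tagger picks the sequence whose product is largest.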

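Generative Model

^_^ People_N Jump_V High_R ._.

[Figure: lattice of candidate tags (N, V, A) for each word between ^ and ., annotated with lexical probabilities and bigram probabilities]

This model is called a generative model. Here the tags are the states and the words are the observations generated from them. This is similar to an HMM.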

Bigram probabilities

     N     V     A
N    0.2   0.7   0.1
V    0.6   0.2   0.2
A    0.5   0.2   0.3

Lexical Probability

     People   jump     high
N    10^-5    10^-3    0.4x10^-7
V    10^-7    10^-2    10^-7
A    0        0        10^-1

values in cell are P(col-heading | row-heading)
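With both tables in hand, every candidate tag sequence for "People jump high" can be scored by brute force. The sketch below is an illustration, not the slides' own procedure: the cell values are taken from the two tables as best they can be read (the exponents of the lexical table are damaged in this transcript, so treat them as illustrative), and the boundary factors for ^ and . are left out because the tables give no entries for them.

```python
from itertools import product

# Bigram probabilities P(next tag | previous tag); rows are the previous tag.
bigram = {
    "N": {"N": 0.2, "V": 0.7, "A": 0.1},
    "V": {"N": 0.6, "V": 0.2, "A": 0.2},
    "A": {"N": 0.5, "V": 0.2, "A": 0.3},
}
# Lexical probabilities P(word | tag); values are illustrative readings.
lexical = {
    "N": {"People": 1e-5, "jump": 1e-3, "high": 0.4e-7},
    "V": {"People": 1e-7, "jump": 1e-2, "high": 1e-7},
    "A": {"People": 0.0, "jump": 0.0, "high": 1e-1},
}


def score(tags, words):
    """Product of bigram and lexical probabilities for one tag sequence.

    Assumption: the factors P(t1 | ^) and P(. | tn) are omitted.
    """
    p = lexical[tags[0]][words[0]]
    for prev, cur, word in zip(tags, tags[1:], words[1:]):
        p *= bigram[prev][cur] * lexical[cur][word]
    return p


words = ["People", "jump", "high"]
# Try all 3^3 tag sequences and keep the highest-scoring one.
best = max(product("NVA", repeat=len(words)), key=lambda tags: score(tags, words))
print(best, score(best, words))
```

Brute force is fine for three words and three tags, but the number of sequences grows exponentially with sentence length, so longer inputs call for dynamic programming.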

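Calculation from Corpus

Actual data:
^ Ram got many NLP books. He found them all very interesting.

POS tagged:
^ N V A N N . N V N A R A .

Recording numbers (row: previous tag, column: next tag; the second sentence is counted as also starting with ^, which gives the ^ to N count of 2 and the . to ^ count of 1):

     ^    N    V    A    R    .
^    0    2    0    0    0    0
N    0    1    2    1    0    1
V    0    1    0    1    0    0
A    0    1    0    0    1    1
R    0    0    0    1    0    0
.    1    0    0    0    0    0

Probabilities (each count divided by its row total):

     ^    N    V    A    R    .
^    0    1    0    0    0    0
N    0    1/5  2/5  1/5  0    1/5
V    0    1/2  0    1/2  0    0
A    0    1/3  0    0    1/3  1/3
R    0    0    0    1    0    0
.    1    0    0    0    0    0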
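The counting step can be reproduced with a short script. This is a minimal sketch under one assumption that the count table implies but the tagged line does not show: a ^ marker is inserted before the second sentence as well.

```python
from collections import Counter
from fractions import Fraction

# Tag sequence of the two-sentence corpus, with ^ before each sentence
# (the second ^ is an assumption implied by the count table).
tags = "^ N V A N N . ^ N V N A R A .".split()

# Count every (previous tag, next tag) pair.
counts = Counter(zip(tags, tags[1:]))

# Row totals: how often each tag appears as the previous tag.
row_totals = Counter()
for (prev, _), c in counts.items():
    row_totals[prev] += c

# Maximum-likelihood estimate: count divided by row total.
probs = {pair: Fraction(c, row_totals[pair[0]]) for pair, c in counts.items()}

for (prev, nxt), p in sorted(probs.items()):
    print(f"P({nxt} | {prev}) = {p}")
```

Using `Fraction` keeps the estimates as exact ratios such as 2/5, which makes them easy to compare with the table.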

To find

$$T^* = \arg\max_T P(T)\,P(W \mid T), \qquad P(T)\,P(W \mid T) = \prod_{i=1}^{n+1} P(t_i \mid t_{i-1})\,P(w_i \mid t_i)$$

(taking P(t_0) = 1 for t_0 = ^)

P(t_i | t_{i-1}): bigram probability
P(w_i | t_i): lexical probability

Bigram probabilities

     N      V     A      R
N    0.15   0.7   0.05   0.1
V    0.6    0.2   0.1    0.1
A    0.5    0.2   0.3    0
R    0.1    0.3   0.5    0.1

Lexical Probability

     People   jump     high
N    10^-5    10^-3    0.4x10^-7
V    10^-7    10^-2    10^-7
A    0        0        10^-1
R    0        0        0

values in cell are P(col-heading | row-heading)
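One standard way to compute the argmax without enumerating every sequence is the Viterbi algorithm, which the slides do not name; the sketch below swaps it in purely as an illustration. The cell values are read from the two four-tag tables as best they can be (treat the lexical exponents as illustrative), and the ^ and . boundary factors are again omitted because the tables have no entries for them.

```python
TAGS = ["N", "V", "A", "R"]

# Bigram probabilities P(next tag | previous tag); rows are the previous tag.
bigram = {
    "N": {"N": 0.15, "V": 0.7, "A": 0.05, "R": 0.1},
    "V": {"N": 0.6, "V": 0.2, "A": 0.1, "R": 0.1},
    "A": {"N": 0.5, "V": 0.2, "A": 0.3, "R": 0.0},
    "R": {"N": 0.1, "V": 0.3, "A": 0.5, "R": 0.1},
}
# Lexical probabilities P(word | tag); values are illustrative readings.
lexical = {
    "N": {"People": 1e-5, "jump": 1e-3, "high": 0.4e-7},
    "V": {"People": 1e-7, "jump": 1e-2, "high": 1e-7},
    "A": {"People": 0.0, "jump": 0.0, "high": 1e-1},
    "R": {"People": 0.0, "jump": 0.0, "high": 0.0},
}


def viterbi(words):
    """Return (probability, tag path) of the best tag sequence for words."""
    # best[t] holds the best score and path among sequences ending in tag t.
    best = {t: (lexical[t][words[0]], [t]) for t in TAGS}
    for word in words[1:]:
        new_best = {}
        for cur in TAGS:
            # Pick the previous tag that maximises score * transition.
            prob, path = max(
                ((best[prev][0] * bigram[prev][cur], best[prev][1]) for prev in TAGS),
                key=lambda item: item[0],
            )
            new_best[cur] = (prob * lexical[cur][word], path + [cur])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


prob, path = viterbi(["People", "jump", "high"])
print(path, prob)
```

The table `best` keeps only one path per tag at each word, so the work grows linearly with sentence length instead of exponentially.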