LING 388: Language and Computers Sandiway Fong Lecture 23: 11/15.
LING 388: Language and Computers
Sandiway Fong
Lecture 23: 11/15
Part-of-Speech (POS) Tagging
• Basic Idea:
  – assign the right part-of-speech tag, e.g. noun, verb, conjunction, to a word
  – useful for shallow parsing
  – or as the first stage of a deeper/more sophisticated system
• Question:
  – Is it a hard task?
    • i.e. can’t we just look the words up in a dictionary?
• Answer:
  – Yes.
    • Ambiguity.
  – No.
    • POS tagging programs typically claim 95%+ accuracy
POS Tagging
• Task:
  – assign the right part-of-speech tag to a word in context
  – not always easy
• Example: walk
  – the walk : noun (I took …)
  – I walk : verb (2 miles every day)
• Example: still (noun, adjective, adverb, verb)
  – the still of the night, a glass still
  – still waters
  – stand still
  – still struggling
  – Still, I didn’t give way
  – still your fear of the dark (transitive)
  – the bubbling waters stilled (intransitive)
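To see why dictionary lookup alone does not settle the tag, here is a minimal Python sketch. The names `LEXICON`, `possible_tags` and `is_ambiguous` are made up for illustration; the tag sets for walk and still are the ones from the examples above.

```python
# Toy lexicon: each word maps to every part of speech it can take.
# LEXICON, possible_tags and is_ambiguous are illustrative names only.
LEXICON = {
    "the": {"determiner"},
    "walk": {"noun", "verb"},
    "still": {"noun", "adjective", "adverb", "verb"},
}

def possible_tags(word):
    """Return the set of tags listed for a word (empty set if unknown)."""
    return LEXICON.get(word.lower(), set())

def is_ambiguous(word):
    """A word is ambiguous when lookup returns more than one tag."""
    return len(possible_tags(word)) > 1

for w in ["the", "walk", "still"]:
    status = "ambiguous" if is_ambiguous(w) else "unambiguous"
    print(w, sorted(possible_tags(w)), status)
```

Lookup returns a set of candidates; choosing among them needs the context, which is what the rest of the lecture is about.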
POS Tagging
• Issues/Questions:
  – What are the parts of speech and subclasses that we might want to tag?
  – What does a typical tagset look like?
  – What methods can we use to assign tags?
Parts-of-Speech
• Divide words into classes based on grammatical function
  – nouns (open-class: unlimited set)
    • referential items (denoting objects/concepts etc.)
      – proper nouns: John
      – pronouns: he, him, she, her, it
      – anaphors: himself, herself (reflexives)
      – common nouns: dog, dogs, water
        » number: dog (singular), dogs (plural)
        » count-mass distinction: many dogs, *many waters
      – eventive nouns: dismissal, concert, playback, destruction (deverbal)
    • nonreferential items
      – it as in it is important to study
      – there as in there seems to be a problem
      – some languages don’t have these: e.g. Japanese
    • open-class
      – factoid, email, bush-ism
Parts-of-Speech
• Pronouns: it, I, he, you, his, they, this, that, she, her, we, all, which, their, what
Parts-of-Speech
• Divide words into classes based on grammatical function
  – verbs (closed-class: fixed set)
    • auxiliaries
      – be (passive, progressive)
      – have (perfect, e.g. pluperfect: had eaten)
      – do (What did John buy?, Did Mary win?)
      – modals: can, could, would, will, may
    • irregular forms
      – is, was, were, does, did
Parts-of-Speech
• Divide words into classes based on grammatical function
  – verbs (open-class: unlimited set)
    • Intransitive
      – unaccusatives: arrive (achievement)
      – unergatives: run, jog (activities)
    • Transitive
      – actions: hit (semelfactive: hit the ball for an hour)
      – actions: eat, destroy (accomplishment)
      – psych verbs: frighten (x frightens y), fear (y fears x)
    • Ditransitive
      – put (x put y on z, *x put y)
      – give (x gave y z, *x gave y, x gave z to y)
      – load (x loaded y (on z), x loaded z (with y))
  – open-class:
    • reaganize, email, fax
Parts-of-Speech
• Divide words into classes based on grammatical function
  – adjectives (open-class: unlimited set)
    • modify nouns
    • black, white, open, closed, sick, well
    • attributive: black (black car, car is black), main (main street, *street is main), atomic
    • predicative: afraid (*afraid child, the child is afraid)
    • stage-level: drunk (there is a man drunk in the pub)
    • individual-level: clever, short, tall (*there is a man tall in the bar)
    • object-taking: proud (proud of him, *well of him)
    • intersective: red (red car: intersection of the set of red things and the set of cars)
    • non-intersective: former (former architect), atomic (atomic scientist)
    • comparative, superlative: blacker, blackest, *opener, *openest
  – open-class:
    • hackable, spammable
Parts-of-Speech
• Divide words into classes based on grammatical function
  – adverbs (open-class: unlimited set)
    • modify verbs (also adjectives and other adverbs)
    • manner: slowly (moved slowly)
    • degree: slightly, more (more clearly), very (very bad), almost
    • sentential: unfortunately, suddenly
    • question: how
    • temporal: when, soon, yesterday (noun?)
    • location: sideways, here (John is here)
  – open-class:
    • spam-wise
Parts-of-Speech
• Divide words into classes based on grammatical function
  – prepositions (closed-class: fixed set)
  – come before an object and assign it a semantic function (from Mars, *Mars from)
    • head-final languages: postpositions (Japanese: amerika-kara)
  – location: on, in, by
  – temporal: by, until
POS Tagging
• Task:
  – assign the right part-of-speech tag, e.g. noun, verb, conjunction, to a word in context
• POS taggers
  – need to be fast in order to process large corpora
    • should take no more than time linear in the size of the corpus
  – full parsing is slow
    • e.g. context-free parsing: O(n³), where n is the length of the sentence
  – POS taggers try to assign the correct tag without actually parsing the sentence
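A rough back-of-the-envelope comparison shows the gap. The numbers are assumptions for illustration only: a 1-million-word corpus split into 40,000 sentences of 25 words each, with one unit of work per step and all constant factors ignored.

```latex
\[
\underbrace{40{,}000 \times 25^{3}}_{\text{cubic parsing}}
  = 6.25 \times 10^{8} \text{ steps}
\qquad \text{vs.} \qquad
\underbrace{40{,}000 \times 25}_{\text{linear tagging}}
  = 10^{6} \text{ steps}
\]
```

Under these assumptions the cubic method does 625 times as much work, and the ratio grows with the square of the sentence length.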
POS Tagging
• Components:
  – Dictionary of words
    • Exhaustive list of closed-class items
      – Examples:
        » the, a, an: determiner
        » from, to, of, by: preposition
        » and, or: coordinating conjunction
    • Large set of open-class items (e.g. nouns, verbs, adjectives) with frequency information
POS Tagging
• Components:
  – Mechanism to assign tags
    • Context-free: by frequency
    • Context: bigram, trigram, HMM, hand-coded rules
      – Example:
        » Det Noun/*Verb: the walk…
  – Mechanism to handle unknown words (extra-dictionary)
    • Capitalization
    • Morphology: -ed, -tion
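The unknown-word mechanism could look something like the following Python sketch. The function name `guess_unknown_tag`, the coarse tag labels, and the noun fallback are all assumptions made for illustration; only the two cues (capitalization and the suffixes -ed and -tion) come from the slide.

```python
def guess_unknown_tag(word, sentence_initial=False):
    """Guess a coarse tag for a word that is not in the dictionary."""
    # Capitalization cue: a capitalized word that does not start the
    # sentence is probably a proper noun.
    if word[:1].isupper() and not sentence_initial:
        return "proper noun"
    lower = word.lower()
    # Morphology cues: -tion typically forms nouns,
    # -ed typically marks past-tense or participle verb forms.
    if lower.endswith("tion"):
        return "noun"
    if lower.endswith("ed"):
        return "verb"
    # Fallback (an assumption of this sketch): treat the word as a noun.
    return "noun"

print(guess_unknown_tag("reaganized"))      # verb, from the -ed suffix
print(guess_unknown_tag("spammification"))  # noun, from the -tion suffix
print(guess_unknown_tag("Tucson"))          # proper noun, from capitalization
```

The cues are checked in a fixed order, so a capitalized word in mid-sentence is called a proper noun before its suffix is even looked at. A real tagger would weigh such cues rather than apply them in sequence.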
How Hard is Tagging?
• Brown Corpus (Francis & Kucera, 1982):
  – 1 million words
  – 39K distinct words
  – 35K words with only 1 tag
  – 4K with multiple tags (DeRose, 1988)
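By these figures, roughly 10% of distinct words (4K of 39K) carry more than one tag. The same kind of count can be sketched in Python for any tagged corpus. The function name `ambiguity_counts` and the six-token toy corpus below are invented for illustration and are not Brown Corpus data.

```python
from collections import defaultdict

def ambiguity_counts(tagged_corpus):
    """Return (distinct words, words with one tag, words with several tags)."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_corpus:
        tags_by_word[word.lower()].add(tag)
    distinct = len(tags_by_word)
    one_tag = sum(1 for tags in tags_by_word.values() if len(tags) == 1)
    return distinct, one_tag, distinct - one_tag

# Invented toy corpus of (word, tag) pairs.
TOY_CORPUS = [
    ("the", "DT"), ("walk", "NN"),
    ("I", "PRP"), ("walk", "VBP"),
    ("the", "DT"), ("dog", "NN"),
]

print(ambiguity_counts(TOY_CORPUS))  # (4, 3, 1): only "walk" has two tags
```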
How Hard is Tagging?
• Easy task to do well on:
  – naïve algorithm
    • assign tag by frequency
  – 90% accuracy (Charniak et al., 1993)
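The naïve algorithm can be sketched in a few lines of Python. This is a minimal illustration, not the setup used by Charniak et al.: the function names, the toy training data, and the `NN` default for unseen words are all assumptions.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_corpus):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag_by_frequency(words, model, default_tag="NN"):
    """Tag each word with its most frequent tag, ignoring context entirely."""
    return [(w, model.get(w.lower(), default_tag)) for w in words]

# Invented training data: "walk" is a noun twice and a verb once.
TRAIN = [
    ("the", "DT"), ("walk", "NN"),
    ("a", "DT"), ("walk", "NN"),
    ("I", "PRP"), ("walk", "VBP"),
]

model = train_most_frequent_tag(TRAIN)
print(tag_by_frequency(["I", "walk"], model))
# [('I', 'PRP'), ('walk', 'NN')]
```

Here walk comes out as NN even after I, because the baseline never looks at neighbouring words. That is the kind of error the context-sensitive methods are meant to fix.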
Penn TreeBank Tagset
• 48-tag simplification of Brown Corpus tagset
• Examples:
   1. CC   Coordinating conjunction
   3. DT   Determiner
   7. JJ   Adjective
  11. MD   Modal
  12. NN   Noun (singular, mass)
  13. NNS  Noun (plural)
  27. VB   Verb (base form)
  28. VBD  Verb (past)
Penn TreeBank Tagset
www.ldc.upenn.edu/doc/treebank2/cl93.html
   1  CC    Coordinating conjunction
   2  CD    Cardinal number
   3  DT    Determiner
   4  EX    Existential there
   5  FW    Foreign word
   6  IN    Preposition/subord. conjunction
   7  JJ    Adjective
   8  JJR   Adjective, comparative
   9  JJS   Adjective, superlative
  10  LS    List item marker
  11  MD    Modal
  12  NN    Noun, singular or mass
  13  NNS   Noun, plural
  14  NNP   Proper noun, singular
  15  NNPS  Proper noun, plural
  16  PDT   Predeterminer
  17  POS   Possessive ending
  18  PRP   Personal pronoun
  19  PP$   Possessive pronoun
  20  RB    Adverb
  21  RBR   Adverb, comparative
  22  RBS   Adverb, superlative
  23  RP    Particle
  24  SYM   Symbol (mathematical or scientific)
  25  TO    to
  26  UH    Interjection
  27  VB    Verb, base form
  28  VBD   Verb, past tense
  29  VBG   Verb, gerund/present participle
  30  VBN   Verb, past participle
  31  VBP   Verb, non-3rd ps. sing. present
  32  VBZ   Verb, 3rd ps. sing. present
  33  WDT   wh-determiner
  34  WP    wh-pronoun
  35  WP$   Possessive wh-pronoun
  36  WRB   wh-adverb
  37  #     Pound sign
  38  $     Dollar sign
  39  .     Sentence-final punctuation
  40  ,     Comma
  41  :     Colon, semi-colon
  42  (     Left bracket character
  43  )     Right bracket character
  44  "     Straight double quote
  45  `     Left open single quote
  46  ``    Left open double quote
  47  '     Right close single quote
  48  ''    Right close double quote
Penn TreeBank Tagset
• How many tags?
  – Tag criterion
    • Distinctness with respect to grammatical behavior?
  – Make tagging easier?
• Punctuation tags
  – Penn Treebank numbers 37-48
    • Trivial computational task
Penn TreeBank Tagset
• Simplifications:
  – Tag TO:
    • infinitival marker, preposition
    • I want to win
    • I went to the store
  – Tag IN:
    • preposition/subordinating conjunction: that, when, although
    • I know that I should have stopped, although…
    • I stopped when I saw Bill
Penn TreeBank Tagset
• Simplifications:
  – Tag DT:
    • determiner: any, some, these, those
    • any man
    • these *man/men
  – Tag VBP:
    • verb, present: am, are, walk
    • Am I here?
    • *Walked I here?/Did I walk here?
Hard to Tag Items
• Syntactic Function
  – Example:
    • resultative
    • I saw the man tired from running
• Examples (from Brown Corpus Manual)
  – Hyphenation:
    • long-range, high-energy
    • shirt-sleeved
    • signal-to-noise
  – Foreign words:
    • mens sana in corpore sano
Rule-Based POS Tagging
• Example Systems
  – ENGCG (1,100 rules)
    • http://www.lingsoft.fi/cgi-bin/engcg
  – ENGCG-2 (4000 rules)
    • http://www.connexor.com/demos/tagger_en.html
• Core Components
  – English morphological analyzer based on two-level morphology
    • see last lecture
  – 56K word stems
  – processing
    • apply morphological engine
    • get all possible tags for each word
    • apply rules
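The three processing steps can be sketched as a small Python pipeline. This is a toy stand-in, not ENGCG: a hand-written dictionary (`TOY_READINGS`) replaces the morphological engine, the labels DET, N, V and A are simplified tags chosen for this sketch, and the single rule is invented to show elimination at work.

```python
# Stand-in for the morphological engine: every possible reading per word.
TOY_READINGS = {
    "the": {"DET"},
    "walk": {"N", "V"},
    "was": {"V"},
    "long": {"A"},
}

def no_verb_after_determiner(readings, i):
    """Drop the V reading of word i when the previous word can only be DET."""
    if i > 0 and readings[i - 1] == {"DET"} and len(readings[i]) > 1:
        readings[i].discard("V")

# Hypothetical rule list; a real system has on the order of a thousand rules.
RULES = [no_verb_after_determiner]

def tag(words):
    # Steps 1 and 2: collect all possible readings for each word
    # (unknown words default to N, an assumption of this sketch).
    readings = [set(TOY_READINGS.get(w.lower(), {"N"})) for w in words]
    # Step 3: apply every rule at every position, eliminating readings.
    for rule in RULES:
        for i in range(len(words)):
            rule(readings, i)
    return list(zip(words, readings))

print(tag(["the", "walk", "was", "long"]))
```

The design point is that rules only remove readings and never add them. Each word therefore starts with every tag it could have and ends with those that no rule has ruled out.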
Rule-Based POS Tagging
• Example:
  – Pavlov had shown that salivation can be a conditioned reflex
Rule-Based POS Tagging
• Examples of tags:
  – PCP2: past participle
  – SV: subject verb
  – SVOO: subject verb object object
Rule-Based POS Tagging
• Example:
  – it isn’t that:adv odd
• Rule:
  – given input “that”
  – if
    • (+1 A/ADV/QUANT)
    • (+2 SENT-LIM)
    • (NOT -1 SVOC/A)
  – then eliminate non-ADV tags
  – else eliminate ADV tag
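Read as code, the rule above might look like the Python sketch below. The reading of the three conditions is an interpretation: the next word is an adjective, adverb or quantifier; the position after that is a sentence boundary; and the previous word is not a verb of the SVOC/A class (e.g. consider). The token layout, the helper names, and the non-ADV labels PRON, DET and CS are assumptions of this sketch, not ENGCG's actual representation.

```python
def has_tag(tokens, i, wanted):
    """True if position i exists and carries at least one of the wanted tags."""
    return 0 <= i < len(tokens) and bool(tokens[i]["tags"] & wanted)

def is_sentence_limit(tokens, i):
    """Treat the end of the token list or an explicit SENT-LIM tag as a boundary."""
    return i >= len(tokens) or "SENT-LIM" in tokens[i]["tags"]

def apply_that_rule(tokens, i):
    """Disambiguate 'that' at position i; assumes ADV is among its candidates."""
    if tokens[i]["word"].lower() != "that":
        return
    tags = tokens[i]["tags"]
    if (has_tag(tokens, i + 1, {"A", "ADV", "QUANT"})    # (+1 A/ADV/QUANT)
            and is_sentence_limit(tokens, i + 2)         # (+2 SENT-LIM)
            and not has_tag(tokens, i - 1, {"SVOC/A"})): # (NOT -1 SVOC/A)
        tags.intersection_update({"ADV"})  # then: eliminate non-ADV tags
    else:
        tags.discard("ADV")                # else: eliminate ADV tag

def make_tokens(pairs):
    """Build the toy token list from (word, tags) pairs."""
    return [{"word": w, "tags": set(t)} for w, t in pairs]

THAT = {"ADV", "PRON", "DET", "CS"}
odd = make_tokens([("it", {"PRON"}), ("isn't", {"V"}), ("that", THAT), ("odd", {"A"})])
apply_that_rule(odd, 2)
print(odd[2]["tags"])  # {'ADV'}

consider = make_tokens([("I", {"PRON"}), ("consider", {"V", "SVOC/A"}),
                        ("that", THAT), ("odd", {"A"})])
apply_that_rule(consider, 2)
print(sorted(consider[2]["tags"]))  # ['CS', 'DET', 'PRON']
```

In it isn't that odd all three conditions hold, so only the ADV reading survives. In I consider that odd the preceding SVOC/A verb blocks the rule, and the ADV reading is removed instead.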
Rule-Based POS Tagging
• Now ENGCG-2 (4000 rules)
  – http://www.connexor.com/demos/tagger_en.html
Rule-Based POS Tagging
• Best performance of all systems: 99.7%
Next Time
• Look at statistical techniques …