Distributional Part-of-Speech Tagging

Hinrich Schütze
CSLI, Ventura Hall
Stanford, CA 94305-4115, USA
email: schuetze@csli.stanford.edu

NLP Applications
By Masood Ghayoomi
Oct 15, 2007

Outline of the Talk

Introduction
Brief review of the literature
Presenting a hypothesis
Introducing induction experiments
Results
Conclusions
Discussion


Abstract of the Talk

This paper presents an algorithm for tagging words whose part-of-speech properties are unknown.

The algorithm categorizes word tokens in context.


Introduction

Why is it needed?

The increasing amount of online text makes automatic techniques for text analysis necessary.


Related Work

Stochastic Tagging:

- Bigram or trigram models: require a relatively large tagged training text (Church, 1989; Charniak et al., 1993)

- Hidden Markov Models: require no pretagged text (Jelinek, 1985; Cutting et al., 1991; Kupiec, 1992)

Rule-based Tagging:

- Transformation-based tagging, as introduced by Brill (1993): requires a hand-tagged text for training


Other Related Work

Using a connectionist network to predict words, with hidden units that reflect grammatical categories (Elman, 1990)

Inferring grammatical category from bigram statistics (Brill et al., 1990)

Using vector models in which words are clustered according to the similarity of their close neighbors in a corpus (Finch and Chater, 1992; Finch, 1993)

Presenting a probabilistic model for entropy maximization that relies on the immediate neighbors of words in a corpus (Kneser and Ney, 1993)

Applying factor analysis to collocations of two target words with their immediate neighbors (Biber, 1993)


Hypothesis for the New Tagging Algorithm

The syntactic behavior of a word is represented with respect to its left and right context.

Left neighbor    WORD    Right neighbor
(left context vector)    (right context vector)
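As an illustrative sketch of this representation (the feature set size and function names are assumptions, not the paper's exact setup), a word type's left and right context vectors can be built by counting how often each of the corpus's most frequent words appears as its immediate left or right neighbor:

```python
from collections import Counter, defaultdict

def context_vectors(tokens, n_features=250):
    """Left/right context vectors for each word type: counts of how often
    each of the n_features most frequent words occurs as the immediate
    left/right neighbor. (n_features=250 is an illustrative choice.)"""
    freq = Counter(tokens)
    features = [w for w, _ in freq.most_common(n_features)]
    idx = {w: i for i, w in enumerate(features)}
    left = defaultdict(lambda: [0] * len(features))
    right = defaultdict(lambda: [0] * len(features))
    for i, w in enumerate(tokens):
        if i > 0 and tokens[i - 1] in idx:
            left[w][idx[tokens[i - 1]]] += 1
        if i + 1 < len(tokens) and tokens[i + 1] in idx:
            right[w][idx[tokens[i + 1]]] += 1
    return left, right, features
```

Restricting the features to frequent words keeps the vectors low-dimensional and the counts reliable.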


4 POS Tag Induction Experiments

Based on word type only

Based on word type and context

Based on word type and context, restricted to “natural” contexts

Based on word type and context, using generalized left and right context vectors


Word Type Only

A baseline to evaluate the performance of distributional POS taggers

Words from the Brown corpus were clustered into 200 classes by considering the similarities of their left and right context vectors.

All occurrences of a word are assigned to one class.

Drawback: problematic for ambiguous words, e.g. “work”, “book”
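A minimal sketch of similarity-based clustering over such vectors (a greedy threshold scheme used here only as a stand-in for the clustering algorithm actually used; names and the threshold are illustrative assumptions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster(vectors, threshold=0.9):
    """Greedy clustering: assign each word to the first cluster whose
    seed vector is similar enough, otherwise start a new cluster."""
    seeds, clusters = [], []
    for word, vec in vectors.items():
        for k, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                clusters[k].append(word)
                break
        else:
            seeds.append(vec)
            clusters.append([word])
    return clusters
```

Because every occurrence of a word type falls into the same cluster, an ambiguous word such as “work” is forced into a single class, which is exactly the drawback noted above.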


Word Type and Context

A word’s syntactic role depends on:
- the syntactic properties of its neighbors,
- its own potential relationships with the neighbors.

Considering context for distributional tagging, a token of word w is represented by:
- the right context vector of the preceding word,
- the left context vector of w,
- the right context vector of w,
- the left context vector of the following word.

Drawback: fails for words whose neighbors are punctuation marks, since there are no grammatical dependencies between words and punctuation marks, in contrast to the strong dependencies between neighboring words.
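This token representation can be sketched as a concatenation of the four vectors (the function name and the zero-vector handling at sentence boundaries are illustrative assumptions), given mappings from each word type to its left and right context vectors:

```python
def token_context_vector(left, right, tokens, i):
    """Represent the token at position i by concatenating: the right
    context vector of the preceding word, the left and right context
    vectors of the word itself, and the left context vector of the
    following word. A zero vector is assumed at boundaries."""
    dim = len(next(iter(left.values())))
    zero = [0] * dim
    prev_r = right.get(tokens[i - 1], zero) if i > 0 else zero
    next_l = left.get(tokens[i + 1], zero) if i + 1 < len(tokens) else zero
    w = tokens[i]
    return prev_r + left.get(w, zero) + right.get(w, zero) + next_l
```

Clustering these token vectors, rather than one vector per word type, lets different occurrences of an ambiguous word land in different classes.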


Word Type and Context,
Restricted to “Natural” Contexts

To address this drawback, only words with informative contexts were considered.

Words next to punctuation marks and words with rare words as neighbors (fewer than ten occurrences) were excluded.
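These exclusions amount to a token filter; a sketch (the ten-occurrence cutoff matches the slide, while the function name and punctuation set are illustrative assumptions):

```python
def natural_context(tokens, i, freq, min_count=10, punctuation=".,;:!?"):
    """True if the token at position i has an informative ("natural")
    context: no immediate neighbor is a punctuation mark or a rare
    word (fewer than min_count corpus occurrences)."""
    for j in (i - 1, i + 1):
        if 0 <= j < len(tokens):
            w = tokens[j]
            if w in punctuation or freq.get(w, 0) < min_count:
                return False
    return True
```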


Word Type and Context,
Using Generalized Left and Right Context Vectors

Generalization: the generalized right context vector of a word records which classes of left context vectors occur to its right, and vice versa.

In this method the information about the left and right context vectors of a word is kept separate in the computation, whereas in the previous methods the left and right context vectors of a word are always used together.

This method is applied in two steps:
- A generalized right context vector for a word is formed by considering the 200 classes of word-based left context vectors.
- A generalized left context vector is formed by using the word-based right context vectors.
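Assuming a mapping from words to the classes induced from their left context vectors (taken here as given input; the names are illustrative), the first step can be sketched as counting, for each word, how often each class occurs immediately to its right:

```python
from collections import defaultdict

def generalized_right_vectors(tokens, word_class, n_classes):
    """Generalized right context vector: for each word, count how often
    a member of each induced class occurs immediately to its right.
    `word_class` maps word -> class id (assumed from a prior clustering)."""
    vecs = defaultdict(lambda: [0] * n_classes)
    for i in range(len(tokens) - 1):
        nxt = tokens[i + 1]
        if nxt in word_class:
            vecs[tokens[i]][word_class[nxt]] += 1
    return vecs
```

Counting over a few hundred classes instead of thousands of word types yields denser, less sparse context vectors.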


2 Examples

“seemed” and “would” have similar left contexts, and they characterize the right contexts of “he” and “the firefighter”. These verbs can thus be treated as potentially belonging to one syntactic category.

Transitive verbs and prepositions belong to different syntactic categories, but their right contexts are identical in that both require a noun phrase.


Results

The Penn Treebank parses of the Brown corpus were used.

The results of the four experiments are evaluated by forming 16 classes of tags from the Penn Treebank:

t: a tag
frequency: the frequency of t in the corpus
# classes: the number of induced tags i0, i1, ..., il assigned to t
correct: the number of times an occurrence of t was correctly labeled as belonging to one of i0, i1, ..., il
incorrect: the number of times that a token of a different tag t’ was miscategorized as being an instance of i0, i1, ..., il
precision: the number of correct tokens divided by the sum of correct and incorrect tokens
recall: the number of correct tokens divided by the total number of tokens of t
F: an aggregate score computed from precision and recall
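Under these definitions, the per-tag scores can be computed as follows (taking F as the harmonic mean of precision and recall is an assumption about the aggregate score):

```python
def prf(correct, incorrect, total):
    """Precision, recall, and F for one tag: `correct`/`incorrect` are
    token counts as defined above, `total` is the tag's frequency.
    F is assumed to be the harmonic mean of precision and recall."""
    precision = correct / (correct + incorrect)
    recall = correct / total
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f
```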


Result: Word Type Only


Table 1: Precision and recall for induction based on word type.


Result: Word Type and Context


Table 2: Precision and recall for induction based on word type and context.


Result: Word Type and Context;
Generalized Left and Right Context Vectors


Table 3: Precision and recall for induction based on generalized context vectors.


Result: Word Type and Context;
Restricted to “Natural” Contexts


Table 4: Precision and recall for induction for natural contexts.


Conclusions

Taking context into account improves the performance of distributional tagging, as the F score increases: 0.49 < 0.72 < 0.74 < 0.79.

Performance for generalized context vectors is better than for word-based context vectors (0.74 vs. 0.72).


Discussion

“Natural” contexts perform better than the other settings (0.79), since the low quality of the distributional information about punctuation marks and rare words is a difficulty for this tag induction.

Performance is fairly good for typical and frequent contexts: prepositions, determiners, pronouns, conjunctions, the infinitive marker, modals, and the possessive marker.

Tag induction fails for punctuation marks, rare words, and “-ing” forms: present participles and gerunds are difficult because both exhibit verbal and nominal properties.


Thanks for listening!