Albert Gatt, Corpora and Statistical Methods, Lecture 9.

  • Slide 1
  • Albert Gatt, Corpora and Statistical Methods, Lecture 9
  • Slide 2
  • POS Tagging (Part 2): overview; HMM taggers; TBL tagging
  • Slide 3
  • The task: assign each word in continuous text a tag indicating its part of speech. This is essentially a classification problem. Current state of the art: taggers typically achieve 96-97% accuracy, evaluated on a per-word basis. In a corpus whose sentences average 20 words, 96% accuracy can mean one tagging error per sentence.
  • Slide 4
  • Sources of difficulty in POS tagging: mostly ambiguity, since words often have more than one possible tag. We need context to make a good guess about POS, but context alone won't suffice. A simple approach which assigns each word its most common tag already performs with about 90% accuracy (see the sketch below).
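
As a concrete illustration of that simple approach, here is a minimal sketch of a unigram most-frequent-tag baseline in Python. The function names and the NN default for unknown words are illustrative assumptions, not part of the lecture:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Count how often each word receives each tag in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    # Keep only the single most frequent tag per word.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(model, words, default="NN"):
    """Tag every word with its most common training tag; guess NN for unknowns."""
    return [(w, model.get(w, default)) for w in words]
```
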
  • Slide 5
  • The information sources. 1. Syntagmatic information: the tags of the other words in the context of w. Not sufficient on its own: e.g. Greene and Rubin (1971) describe a context-only tagger with only 77% accuracy. 2. Lexical information (dictionary): the most common tag(s) for a given word. E.g. in English, many nouns can be used as verbs (flour the pan, wax the car), but their most likely tag remains NN. The distribution of a word's usages across different POSs is uneven: usually one is highly likely and the others much less so.
  • Slide 6
  • Tagging in languages other than English. In English, heavy reliance on context is a good idea because of fixed word order. Free word order languages make this assumption harder. Compensation: these languages typically have rich morphology, which is a good source of clues for a tagger.
  • Slide 7
  • Evaluation and error analysis. Training a statistical POS tagger requires splitting the corpus into training and test data. Often we need a development set as well, to tune parameters. Using (n-fold) cross-validation is a good idea to make the most of limited data: randomly divide the data into train + test, train and evaluate on test, repeat n times and take the average (see the sketch below). NB: cross-validation requires the whole corpus to remain blind. If we want to examine the training data, it is best to have fixed training and test sets, perform cross-validation on the training data, and run the final evaluation on the test set.
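
A minimal sketch of the n-fold procedure described above; the train_tagger and evaluate functions are passed in as parameters because the lecture does not define them:

```python
import random

def cross_validate(sentences, train_tagger, evaluate, n_folds=10):
    """Average per-word accuracy over n random train/test splits."""
    data = sentences[:]
    random.shuffle(data)                       # randomly divide the data
    fold = len(data) // n_folds
    scores = []
    for k in range(n_folds):
        test = data[k * fold:(k + 1) * fold]
        train = data[:k * fold] + data[(k + 1) * fold:]
        tagger = train_tagger(train)           # train on the other n-1 folds
        scores.append(evaluate(tagger, test))  # evaluate on the held-out fold
    return sum(scores) / len(scores)           # take the average
```
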
  • Slide 8
  • Evaluation is typically carried out against a gold standard, based on accuracy (% correct). It is ideal to compare the accuracy of our tagger with: a baseline (lower bound), where the standard is to choose the unigram most likely tag; and a ceiling (upper bound), e.g. seeing how well humans do at the same task. Humans apparently agree on 96-97% of tags, which means it is highly suspect for a tagger to claim 100% accuracy.
  • Slide 9
  • HMM taggers
  • Slide 10
  • Using Markov models. Basic idea: sequences of tags are a Markov chain. Limited horizon assumption: it is sufficient to look at the previous tag for information about the current tag. Time invariance: the probability of a sequence remains the same over time.
  • Slide 11
  • Implications/limitations. Limited horizon ignores long-distance dependencies: e.g. it can't deal with WH-constructions. Chomsky (1957) cited this as one of the reasons against probabilistic approaches. Time invariance: e.g. P(finite verb|pronoun) is constant, but we may be more likely to find a finite verb following a pronoun at the start of a sentence than in the middle!
  • Slide 12
  • Notation. We let t_i range over tags and w_i range over words; subscripts denote position in a sequence. We use superscripts to denote types: w^j = an instance of word type j in the lexicon, t^j = the tag assigned to word w^j. The limited horizon property becomes: P(t_{i+1} | t_1, ..., t_i) = P(t_{i+1} | t_i).
  • Slide 13
  • Basic strategy. From a training set of manually tagged text, extract the probabilities of tag sequences: P(t^k | t^j) = C(t^j, t^k) / C(t^j). E.g. using the Brown Corpus, P(NN|JJ) = 0.45, but P(VBP|JJ) = 0.0005. Next step: estimate the word/tag probabilities, P(w^l | t^j) = C(w^l, t^j) / C(t^j); these are basically the symbol emission probabilities.
  • Slide 14
  • Training the tagger: basic algorithm. 1. Estimate the probability of all possible sequences of 2 tags in the tagset from the training data. 2. For each tag t^j and for each word w^l, estimate P(w^l | t^j). 3. Apply smoothing. (A minimal sketch follows below.)
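
A minimal sketch of these three steps, using maximum-likelihood counts with add-alpha smoothing. The <s> sentence-start pseudo-tag and the choice of add-alpha smoothing are illustrative assumptions; the lecture does not fix a smoothing method:

```python
from collections import Counter

def train_hmm(tagged_sentences, alpha=0.1):
    """Estimate tag-bigram (transition) and word/tag (emission) probabilities."""
    transitions, emissions, tag_counts = Counter(), Counter(), Counter()
    tags, words = set(), set()
    for sentence in tagged_sentences:
        prev = "<s>"                           # assumed sentence-start pseudo-tag
        tag_counts[prev] += 1
        for word, tag in sentence:
            transitions[(prev, tag)] += 1      # step 1: 2-tag sequence counts
            emissions[(tag, word)] += 1        # step 2: word/tag counts
            tag_counts[tag] += 1
            tags.add(tag)
            words.add(word)
            prev = tag

    def p_trans(prev, tag):                    # step 3: smoothed P(tag | prev)
        return (transitions[(prev, tag)] + alpha) / (tag_counts[prev] + alpha * len(tags))

    def p_emit(tag, word):                     # step 3: smoothed P(word | tag)
        return (emissions[(tag, word)] + alpha) / (tag_counts[tag] + alpha * len(words))

    return p_trans, p_emit, sorted(tags)
```
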
  • Slide 15
  • Finding the best tag sequence. Given a sentence of n words, find t_{1,n} = the best n tags. Applying Bayes' rule: P(t_{1,n} | w_{1,n}) = P(w_{1,n} | t_{1,n}) P(t_{1,n}) / P(w_{1,n}), where the denominator can be eliminated as it is the same for all tag sequences.
  • Slide 16
  • Finding the best tag sequence. This expression needs to be reduced to parameters that can be estimated from the training corpus, so we make two simplifying assumptions: 1. words are independent of each other; 2. a word's identity depends only on its tag.
  • Slide 17
  • The independence assumption: the probability of a sequence of words given a sequence of tags is computed as a function of each word independently: P(w_{1,n} | t_{1,n}) = ∏_{i=1}^{n} P(w_i | t_{1,n}).
  • Slide 18
  • The identity assumption: the probability of a word given a tag sequence equals the probability of the word given its own tag: P(w_i | t_{1,n}) = P(w_i | t_i).
  • Slide 19
  • Applying these assumptions, the best tag sequence reduces to: argmax_{t_{1,n}} ∏_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1}). (A scoring sketch follows below.)
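
Put together, the two assumptions reduce scoring a candidate tag sequence to a product of emission and transition probabilities. A minimal sketch, reusing the p_trans and p_emit estimates from the training sketch above (the <s> start tag is the same illustrative assumption):

```python
def sequence_score(words, tag_seq, p_trans, p_emit):
    """Score a candidate tag sequence as the product over i of
    P(w_i | t_i) * P(t_i | t_{i-1})."""
    score, prev = 1.0, "<s>"               # <s> plays the role of t_0
    for word, tag in zip(words, tag_seq):
        score *= p_emit(tag, word) * p_trans(prev, tag)
        prev = tag
    return score
```
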
  • Slide 20
  • Tagging with the Markov model. We can use the Viterbi algorithm to find the best sequence of tags given a sequence of words (a sentence). Reminder: δ_j(i) is the probability of being in state (tag) j at word i on the best path, and ψ_j(i+1) is the most probable state (tag) at word i given that we're in state j at word i+1.
  • Slide 21
  • The algorithm: initialisation. Assume that P(PERIOD) = 1.0 at the end of the sentence, i.e. δ_PERIOD(0) = 1.0, and set all other tag probabilities to 0: δ_t(0) = 0.0 for t ≠ PERIOD.
  • Slide 22
  • Algorithm: induction step. For i = 1 to n, for all tags t^j do: δ_{t^j}(i+1) = max_{1≤k≤T} [δ_{t^k}(i) P(w_{i+1} | t^j) P(t^j | t^k)], the probability of tag t^j at i+1 on the best path through i; and ψ_{t^j}(i+1) = argmax_{1≤k≤T} [δ_{t^k}(i) P(w_{i+1} | t^j) P(t^j | t^k)], the most probable tag leading to t^j at i+1.
  • Slide 23
  • Algorithm: backtrace. Take the most probable state at n+1: X_{n+1} = argmax_j δ_j(n+1). Then for j = n to 1 do: X_j = ψ_{X_{j+1}}(j+1), retrieving the most probable tag for every point in the sequence. The probability of the sequence of tags selected is P(X_1, ..., X_n) = max_j δ_j(n+1). (A sketch of the complete algorithm follows below.)
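
A minimal sketch of the whole Viterbi procedure from slides 20-23 (initialisation, induction, backtrace), again reusing the p_trans and p_emit sketches. It works in log space to avoid underflow, and starts from the <s> pseudo-tag rather than an explicit PERIOD state, which is an illustrative simplification:

```python
import math

def viterbi(words, tags, p_trans, p_emit):
    """Find the most probable tag sequence for a sentence."""
    n = len(words)
    delta = [dict() for _ in range(n)]  # delta[i][t]: best log prob ending in tag t at word i
    psi = [dict() for _ in range(n)]    # psi[i][t]: best previous tag (backpointer)

    for t in tags:                      # initialisation: transitions out of the start state
        delta[0][t] = math.log(p_trans("<s>", t)) + math.log(p_emit(t, words[0]))

    for i in range(1, n):               # induction
        for t in tags:
            best = max(tags, key=lambda s: delta[i - 1][s] + math.log(p_trans(s, t)))
            delta[i][t] = (delta[i - 1][best] + math.log(p_trans(best, t))
                           + math.log(p_emit(t, words[i])))
            psi[i][t] = best

    path = [max(tags, key=lambda t: delta[n - 1][t])]  # most probable final tag
    for i in range(n - 1, 0, -1):       # backtrace through the backpointers
        path.append(psi[i][path[-1]])
    return list(reversed(path))
```

Note that the smoothed probabilities from the training sketch are strictly positive, so the log calls are safe.
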
  • Slide 24
  • Some observations. The model is a hidden Markov model: we only observe words when we tag. During training, however, we actually have a visible Markov model, because the training corpus provides both words and tags.
  • Slide 25
  • True HMM taggers are applied in cases where we do not have a large training corpus. We maintain the usual Markov model assumptions. Initialisation: use a dictionary, setting the emission probability for a word/tag pair to 0 if it is not in the dictionary (see the sketch below). Training: apply the model to the data and use the forward-backward algorithm. Tagging: exactly as before.
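
A minimal sketch of the dictionary-based initialisation step. The per-tag normalisation scheme below is one common choice, not something the lecture specifies:

```python
def init_emissions(dictionary, tags):
    """Initialise P(word | tag) from a dictionary alone: zero for word/tag
    pairs the dictionary rules out, otherwise uniform over the word's
    allowed tags, renormalised per tag. Forward-backward training then
    re-estimates these values from unlabelled data."""
    weights = {}
    for word, allowed in dictionary.items():   # dictionary: word -> set of possible tags
        for tag in tags:
            weights[(tag, word)] = 1.0 / len(allowed) if tag in allowed else 0.0
    totals = {tag: sum(weights[(tag, word)] for word in dictionary) for tag in tags}
    return {(tag, word): (w / totals[tag] if totals[tag] else 0.0)
            for (tag, word), w in weights.items()}
```
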