1 I256: Applied Natural Language Processing Marti Hearst Sept 20, 2006.

I256: Applied Natural Language Processing

Marti HearstSept 20, 2006

Tagging methods

Hand-codedStatistical taggersBrill (transformation-based) tagger

nltk_lite tag package

Type of taggers: tag.Default()

tag.Regexp()

tag.Affix()

tag.Unigram()

tag.Bigram()

tag.Trigram()

Actions: tag.tag() tag.tagsents() tag.untag() tag.train() tag.accuracy() tag.tag2tuple() tag.string2words() tag.string2tags()

Hand-coded Tagger

Make up some regexp rules that make use of morphology

Compare to Brown tags

Training and Testing ofLearning Algorithms

Algorithms that “learn” from data see a set of examples and try to generalize from them.Training set:

Examples trained on

Test set:Also called held-out data and unseen dataUse this for evaluating your algorithmMust be separate from the training set

– Otherwise, you cheated!

“Gold” standardA test set that a community has agreed on and uses as a common benchmark.

Cross-Validation of Learning Algorithms

Cross-validation setPart of the training set.

Used for tuning parameters of the algorithm without “polluting” (tuning to) the test data.

You can train on x%, and then cross-validate on the remaining 1-x%

– E.g., train on 90% of the training data, cross-validate (test) on the remaining 10%

– Repeat several times with different splits

This allows you to choose the best settings to then use on the real test set.

– You should only evaluate on the test set at the very end, after you’ve gotten your algorithm as good as possible on the cross-validation set.

Strong Baselines

When designing NLP algorithms, you need to evaluate them by comparing to others.Baseline Algorithm:

An algorithm that is relatively simple but can be expected to do wellShould get the best score possible by doing the somewhat obvious thing.

A Tagging Baseline

Find the most likely tag for the most frequent words

Frequent words are ambiguousYou’re likely to see frequent words in any collection

– Will always see “to” but might not see “armadillo”

How to do this?First find the most likely words and their tags in the training dataTrain a tagger that looks up these results in a table

– Note that the tag.Lookup() tagger type is not defined in this version of nltk_lite, so we’ll write our own.

Find the most frequent words and their tags

Subclassing a Python Class

The Lookup module isn’t in our version of nltk_liteLet’s make a subclass of the tag.Unigram class that has this functionality.

Details from the Unigram class

Define our own tagger class

Use our own tagger class

N-GramsThe N stands for how many terms are used

Unigram: 1 term (0th order)Bigram: 2 terms (1st order)Trigrams: 3 terms (2nd order)

– Usually don’t go beyond this

You can use different kinds of terms, e.g.:Character based n-gramsWord-based n-gramsPOS-based n-grams

OrderingOften adjacent, but not required

We use n-grams to help determine the context in which some linguistic phenomenon happens.

E.g., look at the words before and after the period to see if it is the end of a sentence or not.

18Modified from Massio Poesio's lecture

Tagging with lexical frequencies

Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NNPeople/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NNProblem: assign a tag to race given its lexical frequencySolution: we choose the tag that has the greater

P(race|VB)P(race|NN)

Unigram Tagger

Train on a set of sentencesKeep track of how many times each word is seen with each tag.After training, associate with each word its most likely tag.

Problem: many words never seen in the training data.Solution: have a default tag to “backoff” to.

Unigram tagger with Backoff

What’s wrong with unigram?

Most frequent tag isn’t always right!Need to take the context into account

Which sense of “to” is being used?Which sense of “like” is being used?

N-gram tagger

Uses the preceding N-1 predicted tagsAlso uses the unigram estimate for the current word

N-gram taggers in nltk_liteConstructs a frequency distribution describing the frequencies each word is tagged with in different contexts.

The context considered consists of the word to be tagged and the n-1 previous words' tags.

After training, tag words by assigning each word the tag with the maximum frequency given its context.

Assigns “None” tag if it sees a word in a context for which it has no data (which it has not seen).

Tuning parameters“cutoff” is the minimal number of times that the context must have been seen in training in order to be incorporated into the statisticsDefault cutoff is 1

24Modified from Diane Litman's version of Steve Bird's notes

Bigram Tagging

For tagging, in addition to considering the token’s type, the context also considers the tags of the n preceding tokens

– What is the most likely tag for word n, given word n-1 and tag n-1?

The tagger picks the tag which is most likely for that context.

Reading the Bigram tableThe current word

The predicted POS

The previously seen tag

Combining Taggers using Backoff

Use more accurate algorithms when we can, backoff to wider coverage when needed.

Try tagging the token with the 1st order tagger. If the 1st order tagger is unable to find a tag for the token, try finding a tag with the 0th order tagger. If the 0th order tagger is also unable to find a tag, use the default tagger to find a tag.

Important point:Bigram and trigram taggers use the previous tag context to assign new tags. If they see a tag of “None” in the previous context, they will output “None” too.

Demonstrating the n-gram taggers

Trained on brown.tagged(‘a’), tested on brown.tagged(‘b’)Backs off to a default of ‘nn’

Demonstrating the n-gram taggers

Combining Taggers

The bigram backoff tagger did worse than the unigram! Why?Why does it get better again with trigrams?How can we improve these scores?

Rule-Based Tagger

The Linguistic ComplaintWhere is the linguistic knowledge of a tagger?Just a massive table of numbersAren’t there any linguistic insights that could emerge from the data?Could thus use handcrafted sets of rules to tag input sentences, for example, if input follows a determiner tag it as a noun.

31Slide modified from Massimo Poesio's

The Brill tagger

An example of Transformation-Based LearningBasic idea: do a quick job first (using frequency), then revise it using contextual rules.Painting metaphor from the readings

Very popular (freely available, works fairly well)A supervised method: requires a tagged corpus

Brill Tagging: In more detail

Start with simple (less accurate) rules…learn better ones from tagged corpus

Tag each word initially with most likely POSExamine set of transformations to see which improves tagging decisions compared to tagged corpus Re-tag corpus using best transformationRepeat until, e.g., performance doesn’t improveResult: tagging procedure (ordered list of transformations) which can be applied to new, untagged text

33Slide modified from Massimo Poesio's

An exampleExamples:

They are expected to race tomorrow.The race for outer space.

Tagging algorithm:1. Tag all uses of “race” as NN (most likely tag in the

Brown corpus)• They are expected to race/NN tomorrow• the race/NN for outer space

2. Use a transformation rule to replace the tag NN with VB for all uses of “race” preceded by the tag TO:• They are expected to race/VB tomorrow• the race/NN for outer space

Example Rule Transformations

Sample Final Rules

Error Analysis

To improve your algorithm, examine where it fails on the cross-validation set

It’s often useful to characterize in detail which examples it fails on and which succeed.

Make fixes, and then re-train on the training set, again using cross-validation

Error Analysis

Assignment + Next Time

I’ve posted an assignment, due in a weekWork in pairs, but only if you work together

In your writeup, make clear who did what, and what you did together

Next week: shallow parsing

1 I256: Applied Natural Language Processing Marti Hearst Sept 20, 2006.

Documents

Transcript of 1 I256: Applied Natural Language Processing Marti Hearst Sept 20, 2006.

Incorporating Metadata into Search UIs Marti Hearst UC Berkeley.

SIMS 247: Information Visualization and Presentation Marti Hearst

Thoughts on Tagging & Search Marti Hearst UC Berkeley.

1 I256: Applied Natural Language Processing Marti Hearst Nov 15, 2006.

Foundations of Software Design Fall 2002 Marti Hearst

Searching in Hypertext Prof. Marti Hearst SIMS 202, Lecture 28.

DERRICK COETZEE, ARMANDO FOX, MARTI A. HEARST, BJÖRN ...

Midterm Review Packet Stolen Borrowed from Prof. Marti Hearst

Text Visualization Marti Hearst Guest Lecture, i247, Spring 2012.

Information Seeking Behavior Prof. Marti Hearst SIMS 202, Lecture 25.

1 I256: Applied Natural Language Processing Marti Hearst Nov 13, 2006.

1 I256: Applied Natural Language Processing Marti Hearst Sept 27, 2006.

1 i247: Information Visualization and Presentation Marti Hearst Interactive Visualization.

1 Question Answering Marti Hearst November 14, 2005.

1 I256: Applied Natural Language Processing Marti Hearst Nov 8, 2006.

1 i247: Information Visualization and Presentation Marti Hearst Perceptual Principles.

1 I256: Applied Natural Language Processing Marti Hearst Oct 9, 2006.

1 I256: Applied Natural Language Processing Marti Hearst Oct 2, 2006.

1 I256: Applied Natural Language Processing Marti Hearst Sept 13, 2006.

1 Email Viz Future Directions Marti Hearst UC Berkeley.