Ling 570 Day 17: Named Entity Recognition; Chunking


Sequence Labeling

• Goal: Find the most probable labeling of a sequence
• Many tasks can be cast as sequence labeling:
  – POS tagging
  – Word segmentation
  – Named entity tagging
  – Story/spoken sentence segmentation
  – Pitch accent detection
  – Dialog act tagging

NER AS SEQUENCE LABELING

NER as Classification Task

• Instance: token
• Labels:
  – Position: B(eginning), I(nside), O(utside)
  – NER types: PER, ORG, LOC, NUM
  – Label: Type-Position, e.g. PER-B, PER-I, O, …
  – How many tags? (|NER types| × 2) + 1
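
A quick sanity check on the tag count, as a minimal Python sketch (the type list is from the slide; the variable names are mine):

    # Enumerate the Type-Position tag set described above.
    NER_TYPES = ["PER", "ORG", "LOC", "NUM"]

    # One B- and one I- tag per type, plus a single O tag:
    # (|NER types| x 2) + 1 labels in total.
    TAGS = [f"{t}-{p}" for t in NER_TYPES for p in ("B", "I")] + ["O"]

    print(TAGS)       # ['PER-B', 'PER-I', 'ORG-B', ..., 'NUM-I', 'O']
    print(len(TAGS))  # 9 = (4 x 2) + 1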

NER as Classification: Features

• What information can we use for NER?
  – Predictive tokens: e.g. MD, Rev, Inc., …
• How general are these features?
  – Language? Genre? Domain?

NER as Classification: Shape Features

• Shape types:
  – lower: e.g. e. e. cummings (all lower case)
  – capitalized: e.g. Washington (first letter uppercase)
  – all caps: e.g. WHO (all letters capitalized)
  – mixed case: e.g. eBay (mixed upper and lower case)
  – capitalized with period: e.g. H.
  – ends with digit: e.g. A9
  – contains hyphen: e.g. H-P
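
A rough Python sketch of a shape classifier covering the cases above (the class names and the order of the checks are my own choices; real systems use finer-grained shapes):

    import re

    def word_shape(token: str) -> str:
        """Map a non-empty token to one of the shape classes listed above."""
        if "-" in token:
            return "contains-hyphen"          # e.g. H-P
        if token[-1].isdigit():
            return "ends-with-digit"          # e.g. A9
        if re.fullmatch(r"[A-Z]\.", token):
            return "capitalized-with-period"  # e.g. H.
        if token.isupper():
            return "all-caps"                 # e.g. WHO
        if token.islower():
            return "lower"                    # e.g. cummings
        if token[0].isupper() and token[1:].islower():
            return "capitalized"              # e.g. Washington
        return "mixed-case"                   # e.g. eBay

    for w in ["cummings", "Washington", "WHO", "eBay", "H.", "A9", "H-P"]:
        print(w, "->", word_shape(w))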

Example Instance Representation

• Example (feature table shown on slide)

Sequence Labeling

• Example (labeled sequence shown on slide)

Evaluation

• System: output of automatic tagging
• Gold standard: true tags
• Precision: P = # correct chunks / # system chunks
• Recall: R = # correct chunks / # gold chunks
• F-measure: F1 = 2PR / (P + R)
• F1 balances precision & recall
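
These chunk-level metrics are easy to compute from span sets. A minimal sketch, comparing chunks as (label, start, end) triples (an illustration, not the official CoNLL scorer):

    def chunk_prf(system_chunks, gold_chunks):
        """Chunk-level precision, recall, and F1 as defined above."""
        system, gold = set(system_chunks), set(gold_chunks)
        correct = len(system & gold)          # exact span + label matches
        p = correct / len(system) if system else 0.0
        r = correct / len(gold) if gold else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    gold = [("PER", 0, 2), ("ORG", 5, 6), ("LOC", 9, 10)]
    sys_out = [("PER", 0, 2), ("ORG", 5, 7)]  # one wrong boundary
    print(chunk_prf(sys_out, gold))           # (0.5, 0.333..., 0.4)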

Evaluation

• Standard measures:
  – Precision, recall, F-measure
  – Computed on entity types (CoNLL evaluation)
• Classifiers vs. evaluation measures:
  – Classifiers optimize tag accuracy
    • Most common tag? O: most tokens aren't NEs
  – Evaluation measures focus on NEs
• State of the art:
  – Standard tasks: PER, LOC: 0.92; ORG: 0.84

Hybrid Approaches

• Practical systems:
  – Exploit lists, rules, learning, …
  – Multi-pass:
    • Early passes: high precision, low recall
    • Later passes: noisier sequence learning
• Hybrid system:
  – High-precision rules tag unambiguous mentions (a toy first pass is sketched below)
    • Use string matching to capture substring matches
  – Tag items from domain-specific name lists
  – Apply a sequence labeler
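
A toy sketch of the first, high-precision pass: exact matching against a (hypothetical) name list, leaving every other token undecided for the later statistical pass:

    # Toy gazetteer; real systems use large domain-specific name lists.
    GAZETTEER = {("New", "York"): "LOC", ("Apple", "Inc."): "ORG"}

    def first_pass(tokens):
        """Tag unambiguous gazetteer mentions; '?' marks tokens
        left for a later (e.g. sequence-labeling) pass."""
        tags = ["?"] * len(tokens)
        i = 0
        while i < len(tokens):
            for entry, etype in GAZETTEER.items():
                n = len(entry)
                if tuple(tokens[i:i + n]) == entry:
                    tags[i] = f"{etype}-B"
                    for j in range(i + 1, i + n):
                        tags[j] = f"{etype}-I"
                    i += n
                    break
            else:
                i += 1
        return tags

    tokens = "Shares of Apple Inc. rose in New York .".split()
    print(first_pass(tokens))
    # ['?', '?', 'ORG-B', 'ORG-I', '?', '?', 'LOC-B', 'LOC-I', '?']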

CHUNKING

What is Chunking?

• A form of partial (shallow) parsing
  – Extracts major syntactic units, but not full parse trees
• Task: identify and classify
  – Flat, non-overlapping segments of a sentence
  – Basic, non-recursive phrases
  – Corresponding to major POS categories
• May ignore some categories, e.g. base NP chunking
• Creates a simple bracketing:
  – [NP The morning flight] [PP from] [NP Denver] [VP has arrived]
  – Base NPs only: [NP The morning flight] from [NP Denver] has arrived

Example

• Parse tree (shown on slide) for: Breaking Dawn has broken into the box office top ten

  (S (NP (NNP Breaking) (NNP Dawn))
     (VP (VBZ has)
         (VP (VBN broken)
             (PP (IN into)
                 (NP (DT the) (NN box) (NN office) (NN top) (NN ten))))))

• Chunks: [NP Breaking Dawn] [VP has broken] [PP into] [NP the box office top ten]


Why Chunking?

• Used when a full parse is unnecessary
  – Or infeasible or impossible (when?)
• Extraction of subcategorization frames
  – Identify verb arguments, e.g.:
    • VP NP
    • VP NP NP
    • VP NP to NP
• Information extraction: who did what to whom
• Summarization: keep base information, remove modifiers
• Information retrieval: restrict indexing to base NPs

Processing Example

• Tokenization: The morning flight from Denver has arrived
• POS tagging: DT JJ N PREP NNP AUX V
• Chunking: NP PP NP VP
• Extraction: NP NP VP
• etc.

Approaches

• Finite-state approaches
  – Grammatical rules encoded in FSTs
  – Cascade to produce more complex structure
• Machine learning
  – Similar to POS tagging

Finite-State Rule-Based Chunking

• Hand-crafted rules model phrases
  – Typically application-specific
• Left-to-right longest match (Abney 1996)
  – Start at the beginning of the sentence
  – Find the longest matching rule
  – Greedy approach, not guaranteed optimal

Finite-State Rule-Based Chunking

• Chunk rules:
  – Cannot contain recursion
    • NP → Det Nominal: okay
    • Nominal → Nominal PP: not okay
• Examples:
  – NP → (Det) Noun* Noun
  – NP → Proper-Noun
  – VP → Verb
  – VP → Aux Verb
• Consider: Time flies like an arrow
  – Is this what we want? (See the sketch below.)
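
A toy left-to-right longest-match chunker in the spirit of Abney (1996), with the non-recursive rules written as regular expressions over space-separated POS tags (the rule set and tag inventory are illustrative). Note how it happily groups "Time flies" into one NP:

    import re

    # Non-recursive chunk rules over space-separated POS tags.
    RULES = [
        ("NP", r"(DT )?(NN |NNS |NNP )*(NN |NNS |NNP )"),  # NP -> (Det) Noun* Noun
        ("VP", r"(AUX )?(VB |VBZ |VBN )"),                 # VP -> (Aux) Verb
    ]

    def chunk(tags, words):
        """Greedy left-to-right chunking: at each position take the
        longest match over all rules; unmatched tokens stay outside."""
        out, i = [], 0
        while i < len(tags):
            rest = " ".join(tags[i:]) + " "
            best_label, best_len = None, 0
            for label, pattern in RULES:
                m = re.match(pattern, rest)
                if m and len(m.group(0).split()) > best_len:
                    best_label, best_len = label, len(m.group(0).split())
            if best_label:
                out.append((best_label, words[i:i + best_len]))
                i += best_len
            else:
                out.append((None, [words[i]]))   # outside any chunk
                i += 1
        return out

    # "Time flies like an arrow", with noun readings for "Time flies":
    print(chunk("NN NNS IN DT NN".split(), "Time flies like an arrow".split()))
    # [('NP', ['Time', 'flies']), (None, ['like']), ('NP', ['an', 'arrow'])]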

Cascading FSTs

• Richer partial parsing
  – Pass the output of one FST to the next
• Approach:
  – First stage: base phrase chunking
  – Next stage: larger constituents (e.g. PPs, VPs)
  – Highest stage: sentences

Example

• Cascade example (figure on slide)
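
To make the cascade concrete, here is a toy second stage that runs over the labels produced by a base chunker and groups a PP chunk with the NP that follows it (the rule and the input sequence are mine, not from the slide):

    import re

    # Second-stage rule over first-stage chunk labels.
    STAGE2_RULES = [("PP", r"PP (NP )+")]  # attach following base NP(s) to a PP

    def cascade_pass(labels):
        """Group base-chunk labels into larger constituents."""
        out, i = [], 0
        while i < len(labels):
            rest = " ".join(labels[i:]) + " "
            for label, pattern in STAGE2_RULES:
                m = re.match(pattern, rest)
                if m:
                    n = len(m.group(0).split())
                    out.append((label, labels[i:i + n]))
                    i += n
                    break
            else:
                out.append((labels[i], [labels[i]]))  # passed through unchanged
                i += 1
        return out

    # Base-chunk labels for "The morning flight from Denver has arrived":
    print(cascade_pass(["NP", "PP", "NP", "VP"]))
    # [('NP', ['NP']), ('PP', ['PP', 'NP']), ('VP', ['VP'])]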

Chunking by Classification

• Model chunking as a task similar to POS tagging
• Instance: tokens
• Labels:
  – Simultaneously encode segmentation & identification
  – IOB (or BIO) tagging (also BIOE or BIOSE)
    • Segment: B(eginning), I(nside), O(utside)
    • Identity: phrase category: NP, VP, PP, etc.
• Example:
  The morning flight from Denver has arrived
  NP-B NP-I NP-I PP-B NP-B VP-B VP-I
  Base NPs only: NP-B NP-I NP-I O NP-B O O
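
Decoding the BIO labels back into chunks is mechanical. A minimal sketch using the example sentence above:

    def bio_to_chunks(tokens, tags):
        """Turn Type-B/Type-I/O tags into (label, [tokens]) chunks."""
        chunks, current = [], None
        for tok, tag in zip(tokens, tags):
            if tag == "O":
                current = None
                continue
            label, pos = tag.split("-")
            if pos == "B" or current is None or current[0] != label:
                current = (label, [tok])          # start a new chunk
                chunks.append(current)
            else:
                current[1].append(tok)            # continue the current chunk
        return chunks

    tokens = "The morning flight from Denver has arrived".split()
    tags = ["NP-B", "NP-I", "NP-I", "PP-B", "NP-B", "VP-B", "VP-I"]
    print(bio_to_chunks(tokens, tags))
    # [('NP', ['The', 'morning', 'flight']), ('PP', ['from']),
    #  ('NP', ['Denver']), ('VP', ['has', 'arrived'])]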

Features for Chunking

• What are good features?
  – Preceding chunk tags (for the 2 preceding words)
  – Words (2 preceding, current, 2 following)
  – Parts of speech (2 preceding, current, 2 following)
• The feature vector includes those features + the true label
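
A sketch of that feature window for a single token (the padding values and feature names are my own conventions; the POS tags are adapted to Penn Treebank style):

    def chunk_features(words, pos_tags, prev_chunk_tags, i):
        """Features for token i: the 2 preceding chunk tags, plus words
        and POS tags in a window of 2 preceding / current / 2 following."""
        def get(seq, j, pad):
            return seq[j] if 0 <= j < len(seq) else pad
        feats = {
            "chunk-2": get(prev_chunk_tags, i - 2, "BOS"),
            "chunk-1": get(prev_chunk_tags, i - 1, "BOS"),
        }
        for off in (-2, -1, 0, 1, 2):
            pad = "BOS" if off < 0 else "EOS"
            feats[f"word{off:+d}"] = get(words, i + off, pad)
            feats[f"pos{off:+d}"] = get(pos_tags, i + off, pad)
        return feats

    words = "The morning flight from Denver has arrived".split()
    pos = ["DT", "JJ", "NN", "IN", "NNP", "AUX", "VBN"]
    prev = ["NP-B", "NP-I", "NP-I"]   # chunk tags predicted so far
    print(chunk_features(words, pos, prev, 3))  # features for "from"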

Chunking as Classification

• Example (feature table shown on slide)

Evaluation

• System: output of automatic tagging
• Gold standard: true tags
  – Typically extracted from a parsed treebank
• Precision: P = # correct chunks / # system chunks
• Recall: R = # correct chunks / # gold chunks
• F-measure: F1 = 2PR / (P + R)
• F1 balances precision & recall

State-of-the-Art

• Base NP chunking: 0.96
• Complex phrases: learning: 0.92-0.94
  – Most learners achieve similar results
  – Rule-based: 0.85-0.92
• Limiting factors:
  – POS tagging accuracy
  – Inconsistent labeling (parse tree extraction)
  – Conjunctions:
    • Late departures and arrivals are common in winter
    • Late departures and cancellations are common in winter