Page 1

Information Extraction in the Past 20 Years: Traditional vs. Open

Heng Ji

jih@rpi.edu

Acknowledgement: some slides from Radu Florian and Stephen Soderland

Page 2

20+ Years of Information Extraction

Long successful run
– MUC
– CoNLL
– ACE
– TAC-KBP
– DEFT
– BioNLP

Genres
– Newswire
– Broadcast news
– Broadcast conversations
– Weblogs
– Blogs
– Newsgroups
– Speech
– Biomedical data
– Electronic Medical Records

Programs
– MUC
– ACE
– GALE
– MRP
– BOLT
– DEFT

Page 3

Two Siblings under IE Umbrella

– Quality
– Portability

Page 4

Quality Challenges


Page 5

Where have we been?

– We’re thriving: Entity Linking
– We’re making slow but consistent progress: Relation Extraction, Event Extraction, Slot Filling
– We’re running around in circles: Name Tagging
– We’re stuck in a tunnel: Entity Coreference Resolution

Page 6

Name Tagging: “Old” Milestones

Year | Tasks & Resources | Methods | F-Measure | Example References
1966 | - | First person name tagger with punch cards; 30+ decision-tree-type rules | - | (Borkowski et al., 1966)
1998 | MUC-6 | MaxEnt with diverse levels of linguistic features | 97.12% | (Borthwick and Grishman, 1998)
2003 | CoNLL | System combination; sequential labeling with Conditional Random Fields | 89% | (Florian et al., 2003; McCallum et al., 2003; Finkel et al., 2005)
2006 | ACE | Diverse levels of linguistic features, re-ranking, joint inference | ~89% | (Florian et al., 2006; Ji and Grishman, 2006)

Our progress compared to 1966: more data, a few more features, and fancier learning algorithms. There has not been much active work after ACE, because we tend to believe it’s a solved problem…
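Since “sequential labeling” can sound abstract, here is a minimal sketch (in Python, not a reimplementation of any cited system) of Viterbi decoding over BIO name tags; the emission and transition scores are toy stand-ins for learned CRF weights.

```python
# Minimal sketch of BIO sequence decoding, as used by CRF-style name taggers.
# The emission/transition scores here are toy stand-ins for learned weights.
TAGS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG"]

def viterbi(tokens, emission_score, transition_score):
    """Return the highest-scoring BIO tag sequence for `tokens`."""
    best = [{t: (emission_score(tokens[0], t), [t]) for t in TAGS}]
    for i in range(1, len(tokens)):
        column = {}
        for t in TAGS:
            # Pick the best previous tag to extend with tag t.
            prev_tag, (prev_score, prev_path) = max(
                best[i - 1].items(),
                key=lambda kv: kv[1][0] + transition_score(kv[0], t),
            )
            score = prev_score + transition_score(prev_tag, t) + emission_score(tokens[i], t)
            column[t] = (score, prev_path + [t])
        best.append(column)
    return max(best[-1].values(), key=lambda v: v[0])[1]

# Toy scores: capitalized tokens look like name tokens; disallow I-X after a non-X tag.
def emission_score(tok, tag):
    return 1.0 if tok[0].isupper() == tag.startswith(("B-", "I-")) else -1.0

def transition_score(prev, tag):
    return -100.0 if tag.startswith("I-") and prev[2:] != tag[2:] else 0.0

print(viterbi("Donna Karan International said".split(), emission_score, transition_score))
```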

Page 7


State-of-the-art reported in papers

The end of extreme happiness is sadness…

Page 8

The end of extreme happiness is sadness… (experiments on ACE 2005 data)

Page 9

Challenges

– Defining or choosing an IE schema
– Dealing with genres & variations
– Dealing with novelty
– Bootstrapping a new language
– Improving the state of the art with unlabeled data
– Dealing with a new domain
– Robustness

Page 10

99 Schemas of IE on the Wall…

Many IE schemas over the years:
– MUC – 7 types
  • PER, ORG, LOC, DATE, TIME, MONEY, PERCENT
– ACE – 5, then 7 types
  • PER, ORG, GPE, LOC, FAC, WEA, VEH
  • Has substructure (subtypes, mention types, specificity, roles)
– CoNLL – 4 types
  • ORG, PER, LOC, MISC
– OntoNotes – 18 types
  • CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART
– IBM KLUE2 – 50 types, including event anchors
– Freebase categories
– Wikipedia categories

Challenges:
– Selecting an appropriate schema to model
– Combining training data (see the mapping sketch below)
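As a sketch of what “combining training data” across schemas can involve, the following maps CoNLL BIO tags onto a coarser ACE-style tag set; the mapping choices (e.g., dropping MISC, merging GPE/LOC) are illustrative assumptions, not an established standard.

```python
# Hypothetical mapping for combining CoNLL-labeled data with an ACE-style tag set.
# The choices below are illustrative assumptions, not an official standard:
# CoNLL LOC conflates ACE GPE and LOC, and MISC has no counterpart, so it is dropped.
CONLL_TO_ACE = {"PER": "PER", "ORG": "ORG", "LOC": "GPE_OR_LOC", "MISC": None}

def remap_bio(tags):
    """Remap BIO tags like 'B-MISC' from the CoNLL schema into the target schema."""
    out = []
    for tag in tags:
        if tag == "O":
            out.append("O")
            continue
        prefix, conll_type = tag.split("-", 1)
        target = CONLL_TO_ACE.get(conll_type)
        out.append(f"{prefix}-{target}" if target else "O")  # unmapped types become O
    return out

print(remap_bio(["B-PER", "I-PER", "O", "B-MISC", "B-LOC"]))
# ['B-PER', 'I-PER', 'O', 'O', 'B-GPE_OR_LOC']
```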

Page 11

My Favorite Booby-trap Document

LVMH Makes a Two-Part Offer for Donna Karan

By LESLIE KAUFMAN. Published: December 19, 2000

The fashion house of Donna Karan, which has long struggled to achieve financial equilibrium, has finally found a potential buyer. The giant luxury conglomerate LVMH-Moet Hennessy Louis Vuitton, which has been on a sustained acquisition bid, has offered to acquire Donna Karan International for $195 million in a cash deal with the idea that it could expand the company's revenues and beef up accessories and overseas sales.

At $8.50 a share, the LVMH offer represents a premium of nearly 75 percent to the closing stock price on Friday. Still, it is significantly less than the $24 a share at which the company went public in 1996. The final price is also less than one-third of the company's annual revenue of $662 million, a significantly smaller multiple than European luxury fashion houses like Fendi were receiving last year.

The deal is still subject to board approval, but in a related move that will surely help pave the way, LVMH purchased Gabrielle Studio, the company held by the designer and her husband, Stephan Weiss, that holds all of the Donna Karan trademarks, for $450 million. That price would be reduced by as much as $50 million if LVMH enters into an agreement to acquire Donna Karan International within one year. In a press release, LVMH said it aimed to combine Gabrielle and Donna Karan International and that it expected that Ms. Karan and her husband ''will exchange a significant portion of their DKI shares for, and purchase additional stock in, the combined entity.''


http://www.nytimes.com/2000/12/19/business/lvmh-makes-a-two-part-offer-for-donna-karan.html

Page 12

Analysis of an Error


Donna Karan International

Page 13

Analysis of an Error: How can you Tell?

Donna Karan International
Dana International
Saddam Hussein International
Ronald Reagan International

Names containing “International”, with their types and counts:
FAC  Saddam Hussein International Airport  8
FAC  Baghdad International  1
ORG  Amnesty International  3
FAC  International Space Station  1
ORG  International Criminal Court  1
ORG  Habitat for Humanity International  1
ORG  U-Haul International  1
FAC  Saddam International Airport  7
ORG  International Committee of the Red Cross  4
ORG  International Committee for the Red Cross  1
FAC  International Press Club  1
ORG  American International Group Inc.  1
ORG  Boots and Coots International Well Control Inc.  1
ORG  International Committee of Red Cross  1
ORG  International Black Coalition for Peace and Justice  1
FAC  Baghdad International Airport
ORG  Center for Strategic and International Studies  2
ORG  International Monetary Fund  1
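To make the disambiguation intuition concrete, here is an illustrative sketch (not the actual tagger) that votes FAC vs. ORG for a name containing “International”, using context-word overlap with a few of the counted training names above; the scoring rule and the fallback are assumptions.

```python
from collections import Counter

# Illustrative sketch: score candidate types for a name containing "International"
# from counts of training names that share context words (e.g., "Airport" -> FAC).
# TRAINING is a subset of the counts listed above; the scoring rule is an assumption.
TRAINING = [
    ("FAC", "Saddam Hussein International Airport", 8),
    ("FAC", "Saddam International Airport", 7),
    ("ORG", "International Committee of the Red Cross", 4),
    ("ORG", "Amnesty International", 3),
    ("ORG", "International Monetary Fund", 1),
]

def type_scores(name):
    """Vote for an entity type using training names that share a token with `name`."""
    tokens = set(name.lower().split())
    votes = Counter()
    for etype, train_name, count in TRAINING:
        overlap = tokens & set(train_name.lower().split())
        if overlap - {"international"}:      # require evidence beyond "International" itself
            votes[etype] += count * len(overlap)
    return votes or Counter({"ORG": 1})      # fall back to a default type (assumption)

print(type_scores("Ronald Reagan International Airport"))  # "Airport" overlap votes FAC
print(type_scores("Donna Karan International"))            # no extra overlap -> default
```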

Page 14


Page 15

Dealing with Different Genres

Weblogs:
– All-lowercase data
  • obama has stepped up what bush did even to the point of helping our enemy in Libya.
– Non-standard capitalization / title case
  • LiveLeak.com - Hillary Clinton: Saddam Has WMD, Terrorist Ties (Video)

Solution: case restoration (truecasing) – see the sketch below.
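Here is a minimal sketch of dictionary-based truecasing, assuming a well-cased background corpus is available; real truecasers typically use sequence models, so this only illustrates the idea.

```python
from collections import Counter, defaultdict

# Minimal dictionary-based truecasing sketch (real truecasers typically use
# sequence models over a large cased corpus; this just illustrates the idea).
def build_case_lexicon(cased_sentences):
    """Map each lowercased token to its most frequent observed casing."""
    forms = defaultdict(Counter)
    for sent in cased_sentences:
        for tok in sent.split():
            forms[tok.lower()][tok] += 1
    return {low: counts.most_common(1)[0][0] for low, counts in forms.items()}

def truecase(sentence, lexicon):
    """Replace each token with its most frequent casing; keep unknown tokens as-is."""
    return " ".join(lexicon.get(tok.lower(), tok) for tok in sentence.split())

# Toy background corpus standing in for well-edited newswire text.
lexicon = build_case_lexicon([
    "President Obama met President Bush in Libya .",
    "Obama and Bush spoke .",
])
print(truecase("obama has stepped up what bush did even in libya .", lexicon))
# Obama has stepped up what Bush did even in Libya .
```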

Page 16


Page 17

Out-of-domain data


Volunteers have also aided victims of numerous other disasters, including hurricanes Katrina, Rita, Andrew and Isabel, the Oklahoma City bombing, and the September 11 terrorist attacks.

Page 18

Out-of-domain Data

Manchester United manager Sir Alex Ferguson got a boost on Tuesday as a horse he part owns What A Friend landed the prestigious Lexus Chase here at Leopardstown racecourse.


Page 19

Bootstrapping a New Language

English is resource-rich:
– Lexical resources: gazetteers
– Syntactic resources: Penn TreeBank
– Semantic resources: WordNet, entity-labeled data (MUC, ACE, CoNLL), FrameNet, PropBank, NomBank, OntoBank

How can we leverage these resources in other languages?

MT to the rescue!

Page 20

Mention Detection Transfer

ES: El soldado nepalés fue baleado por ex soldados haitianos cuando patrullaba la zona central de Haiti , informó Minustah .

EN: The Nepalese soldier was gunned down by former Haitian soldiers when patrullaba the central area of Haiti , reported minustah .

[Figure: BIO mention labels (B-PER, B-GPE, B-LOC, O) shown on the aligned Spanish–English sentence pair, so that labels can be transferred across the word alignment.]
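A minimal sketch of the projection step, assuming a word alignment and BIO tags on the English side are already available; the alignment pairs and tags below are illustrative, not the output of a real aligner or tagger.

```python
# Minimal sketch of annotation projection across a word alignment: labels assigned
# on the (resource-rich) English side are copied to the aligned Spanish tokens.
# The alignment pairs and tags below are illustrative, not real aligner/tagger output.
def project_labels(src_tokens, tgt_labels, alignment):
    """alignment: list of (src_index, tgt_index) pairs; unaligned source tokens get 'O'."""
    projected = ["O"] * len(src_tokens)
    for src_i, tgt_i in alignment:
        if tgt_labels[tgt_i] != "O":
            projected[src_i] = tgt_labels[tgt_i]
    return projected

src = "El soldado nepalés fue baleado por ex soldados haitianos".split()
tgt = "The Nepalese soldier was gunned down by former Haitian soldiers".split()
tgt_labels = ["O", "B-GPE", "O", "O", "O", "O", "O", "O", "B-GPE", "O"]
alignment = [(0, 0), (1, 2), (2, 1), (3, 3), (4, 4), (5, 6), (6, 7), (7, 9), (8, 8)]
print(list(zip(src, project_labels(src, tgt_labels, alignment))))
# "nepalés" and "haitianos" receive B-GPE from "Nepalese" and "Haitian"
```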

Page 21

Mention Detection Transfer: Results

Language | System | F-measure
Spanish | Direct Transfer | 66.5
Spanish | Source Only (100k words) | 71.0
Spanish | Source Only (160k words) | 76.0
Spanish | Source + Transfer | 78.5
Arabic | Direct Transfer | 51.6
Arabic | Source Only (186k tokens) | 79.6
Arabic | Source + Transfer | 80.5
Chinese | Direct Transfer | 58.5
Chinese | Source Only | 74.5
Chinese | Source + Transfer | 76.0

Page 22

Mention Detection Challenges

– How to deal with out-of-domain data? How to even detect that you’re out of domain?
– How to deal with unseen “words of the day” (e.g., ISIS, ISIL, IS, Ebola)?
– How to significantly improve the state of the art using unlabeled data?

Page 23

What’s Wrong?

– Name taggers are getting old (trained on 2003 news, tested on 2012 news)
– Genre adaptation (informal contexts, posters)
– Revisit the definition of name mention – extraction for linking
– Limited types of entities (we really only cared about PER, ORG, GPE)
– Old unsolved problems
  • Identification: “Asian Pulp and Paper Joint Stock Company , Lt. of Singapore”
  • Classification: “FAW has also utilized the capital market to directly finance,…” (FAW = First Automotive Works)

Potential Solutions for Quality
– Word clustering, lexical knowledge discovery (Brown, 1992; Ratinov and Roth, 2009; Ji and Lin, 2010)
– Feedback from linking, relations, and events (Sil and Yates, 2013; Li and Ji, 2014)

Potential Solutions for Portability
– Extend entity types based on AMR (140+ types)

Page 24

Entity Linking Milestones

– 2006: the first definition of the Wikification task (Bunescu and Pasca, 2006)
– 2009: TAC-KBP Entity Linking launched (McNamee and Dang, 2009)
– 2008-2012: supervised learning-to-rank with diverse levels of features, such as entity profiling and various popularity and similarity measures (Gao et al., 2010; Chen and Ji, 2011; Ratinov et al., 2011; Zheng et al., 2010; Dredze et al., 2010; Anastacio et al., 2011) – see the ranking sketch below
– 2008-2013: collective inference and coherence measures (Milne and Witten, 2008; Kulkarni et al., 2009; Ratinov et al., 2011; Chen and Ji, 2011; Ceccarelli et al., 2013; Cheng and Roth, 2013)
– 2012: various applications, e.g., coreference resolution (Ratinov and Roth, 2012) – Dan’s talk
– 2014: TAC-KBP Entity Discovery and Linking (end-to-end name tagging, cross-document entity clustering, entity linking) (Ji et al., 2014)
– Many international evaluations have been inspired by TAC-KBP; more than 130 papers have been published
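As a sketch of the learning-to-rank idea above (popularity plus context similarity), here is a minimal candidate ranker; the candidate entries, weights, and description texts are made up for illustration and are not from any cited system.

```python
from collections import Counter
from math import sqrt

# Minimal sketch of popularity + context-similarity candidate ranking for entity
# linking. Candidate entries, weights, and description texts are illustrative.
def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / (norm or 1.0)

def rank(mention_context, candidates, w_pop=0.3, w_sim=0.7):
    """Score each KB candidate by a weighted mix of popularity prior and context similarity."""
    ctx = Counter(mention_context.lower().split())
    scored = []
    for name, popularity, description in candidates:
        sim = cosine(ctx, Counter(description.lower().split()))
        scored.append((w_pop * popularity + w_sim * sim, name))
    return sorted(scored, reverse=True)

candidates = [
    ("Donna Karan International (company)", 0.4, "fashion house founded by designer Donna Karan"),
    ("Donna Karan (person)", 0.6, "American fashion designer, creator of the DKNY label"),
]
context = "LVMH has offered to acquire Donna Karan International, the fashion house, for $195 million"
print(rank(context, candidates))  # prints candidates ranked by combined score
```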

Page 25

Current Linking Problems and Possible Solutions

State of the art in Entity Linking: 85% B-cubed+ F-score on formal genres and 70% B-cubed+ F-score on informal genres
State of the art in Entity Discovery and Linking: 66% discovery-and-linking F-score, 73% clustering CEAFm F-score

Remaining challenges
– Popularity bias
– Need for better meaning representation
– Selecting collaborators from rich contexts
– Knowledge gap between source and KB
– Cross-lingual Entity Linking (the name translation problem)

Potential solutions
– Deep knowledge acquisition and representation (e.g., AMR)
– Better graph search and alignment algorithms
– Make more people excited about Chinese and Spanish

Page 26

Slot Filling Milestones

– 2009-2014: top systems achieved 30%-40% F-measure
  • Ground truth is created by manual assessment of pooled system output (relative recall); scores may appear lower when the pool of teams is stronger
  • 2014 queries are more challenging than 2013, including some ambiguous queries shared with entity linking (Stephen’s talk)
– Consistent progress of an individual system (RPI, tested on 2014 data): 2010: 20%; 2011: 22%; 2013: 28%; 2014: 34%
– Successful methods
  • Multi-label, multi-instance learning (Surdeanu et al., 2012)
  • Combination of distant supervision with heuristic rules and patterns (Roth et al., 2013)
  • Cross-source, cross-system inference (Chen et al., 2011; Yu et al., 2014)
  • Linguistic constraints (Yu et al., 2014) – Heng’s one-week pencil-and-paper work to semi-automatically acquire trigger phrases; an awfully simple trigger scoping method beat all 2013 systems (see the sketch below)
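Here is a minimal illustration of what a “trigger scoping” heuristic can look like: accept a candidate slot fill only if a slot-specific trigger phrase occurs near the query entity and the candidate in the same sentence. The trigger lists and window size are assumptions, not the actual curated resources.

```python
# Illustrative trigger-scoping heuristic for slot filling: keep a candidate fill
# only if a slot-specific trigger phrase appears near the query entity and the
# candidate. The trigger lists and the character window are assumptions.
TRIGGERS = {
    "per:spouse": ["wife", "husband", "married", "wedding"],
    "per:employee_or_member_of": ["manager of", "works for", "appointed"],
}

def in_scope(sentence, query, candidate, slot, window=60):
    """Return True if a trigger for `slot` occurs within `window` chars of query/candidate."""
    s = sentence.lower()
    q, c = s.find(query.lower()), s.find(candidate.lower())
    if q < 0 or c < 0:
        return False
    span = s[max(0, min(q, c) - window) : max(q, c) + window]
    return any(trigger in span for trigger in TRIGGERS[slot])

sent = "Roberts, 39, and husband Danny Moder, 38, are already parents to twins."
print(in_scope(sent, "Roberts", "Danny Moder", "per:spouse"))                  # True
print(in_scope(sent, "Roberts", "Danny Moder", "per:employee_or_member_of"))  # False
```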

Page 27

Have the Error Sources Changed over the Years?

[Figure: charts comparing the distribution of error sources in 2010 (Min and Grishman, 2011) and 2014 (Yu and Ji, 2014); y-axis from 0 to 0.3.]

Page 28

Blame Ourselves First…

– Non-verb and multi-word expressions as triggers
  • his men back to their compound
– Knowledge scarcity – the long tail
  • A suicide bomber detonated explosives at the entrance to a crowded…
  • medical teams carting away dozens of wounded victims
  • Today I was let go from my job after working there for 4 1/2 years.
  • Possible solution: increase coverage with FrameNet (Li et al., 2014)
– Global context
  • I didn't want to hurt him. I miss him to death.
  • I threw stone out of the window. vs. I threw him out of the window.
  • Ellison to spend $10.3 billion to get his company.
  • We believe that the likelihood of them using those weapons goes up.
  • Fifteen people were killed and more than 30 wounded Wednesday as a suicide bomber blew himself up on a student bus in the northern town of Haifa
  • Possible solution: joint modeling of triggers and arguments (Li et al., 2013)

Page 29

Then Blame Others… Fundamental language problem – ambiguity and variety

Coreference, coreference, coreference…

25% of the examples involve coreference which is beyond current system capabilities, such as nominal anaphors and non-identity coreference

Almost overnight, he became fabulously rich, with a $3-million book deal, a $100,000 speech making fee, and a lucrative multifaceted consulting business, Giuliani Partners. … His consulting partners included seven of those who were with him on 9/11, and in 2002 Alan Placa, his boyhood pal, went to work at the firm.

After successful karting career in Europe, Perera became part of the Toyota F1 Young Drivers Development Program and was a Formula One test driver for the Japanese company in 2006.

“a woman charged with running a prostitution ring … her business, Pamela Martin and Associates”


Page 30

Then Blame Others… Paraphrase, paraphrase, paraphrase…

“employee/member”:
– Sutil, a trained pianist, tested for Midland in 2006 and raced for Spyker in 2007 where he scored one point in the Japanese Grand Prix.
– Daimler Chrysler reports 2004 profits of $3.3 billion; Chrysler earns $1.9 billion.
– In her second term, she received a seat on the powerful Ways and Means Committee.
– Jennifer Dunn was the face of the Washington state Republican Party for more than two decades.
– Buchwald lied about his age and escaped into the Marine Corps.
– By 1942, Peterson was performing with one of Canada's leading big bands, the Johnny Holmes Orchestra.

“spouse”:
– Buchwald's 1952 wedding -- Lena Horne arranged for it to be held in London's Westminster Cathedral -- was attended by Gene Kelly, John Huston, Jose Ferrer, Perle Mesta and Rosemary Clooney, to name a few.

Page 31

Then Blame Others… Inference, Inference, Inference…

Systems would benefit from specialists which are able to reason about times, locations, family relationships, and employment relationships.

– People Magazine has confirmed that actress Julia Roberts has given birth to her third child a boy named Henry Daniel Moder. Henry was born Monday in Los Angeles and weighed 8 lbs. Roberts, 39, and husband Danny Moder, 38, are already parents to twins Hazel and Phinnaeus who were born in November…

– He [Pascal Yoadimnadji] has been evacuated to France on Wednesday after falling ill and slipping into a coma in Chad, Ambassador Moukhtar Wawa Dahab told The Associated Press. His wife, who accompanied Yoadimnadji to Paris, will repatriate his body to Chad, the amba. (Is he dead? In Paris?)

– Until last week, Palin was relatively unknown outside Alaska… (Does she live in Alaska?)

– The list says that the state is owed $2,665,305 in personal income taxes by singer Dionne Warwick of South Orange, N.J., with the tax lien dating back to 1997. (Does she live in NJ?)

Page 32

Portability/Scalability Challenges


Page 33

Defining the Problem

• Deep understanding of all possible relations?
• Open IE, pre-emptive IE, on-demand IE…

Page 34

Defining the Problem

• Deep understanding of all possible relations?
• Deep Extraction for Focused Tasks (D.E.F.T.)
  – User has a focused information need:
    • A few dozen relations, several entity types:
    • Date_of_birth(per, date), city_of_headquarters(org, city), …
    • Treatment(substance, condition), studies_disease(per/org, condition), …
    • Arrive_in(per, loc), meet_with(per, per), unveil(org, product), …
  – Quickly train an extractor for the task
    • Domain independent: parsing, Open IE, SRL, …
    • Task specific: semantic tagging, extraction patterns, …

Freedman et al. Extreme Extraction -- Machine Reading in a Week. EMNLP 2011
Zhang et al. NewsSpike Event Extractor, in review
TAC-KBP

Page 35

Aim for the Head?

[Figure: a Zipfian distribution of surface forms used to express a textual relation (frequency vs. patterns). The frequent head patterns are “dead simple”, the middle of the distribution is “the real challenge”, and the long tail is “a hopeless case”.]

Page 36

Open IE for KBP

• Advantages of Open IE
  – Robust
  – Massively scalable
  – Works out of the box
  – Finds whatever relations are expressed in the text
  – Not tied to an ontology of relations

• Disadvantages
  – Finds whatever relations are expressed in the text
  – Not tied to an ontology of relations

• Challenge
  – Map Open IE to an ontology of relations
  – Minimum of user effort

github/knowitall/openie

Page 37

OpenIE–KBP Rule Language

Open IE extraction: (Smith, was appointed, Acting Director of Acme Corporation)
                     [Arg1]  [Rel]          [Arg2: contains the slotfill]

Mapped KBP relation: per:employee_or_member_of(Smith, Acme Corporation)

Terms in Rule | Example
Target relation | per:employee_or_member_of
Query entity in | Arg1
Slotfill in | Arg2
Slotfill type | Organization
Arg1 terms | -
Relation terms | appointed
Arg2 terms | <JobTitle> of
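A sketch of how a rule like the one above could be applied to an Open IE triple; the rule representation, job-title list, and function names are my own illustration, not the actual OpenIE–KBP rule-language implementation.

```python
import re

# Illustrative sketch of applying a rule like the one above to an Open IE triple.
# The rule format, job-title list, and matching logic are assumptions for this
# example, not the actual OpenIE-KBP rule-language implementation.
JOB_TITLES = ["Acting Director", "CEO", "manager", "spokesman"]

RULE = {
    "target": "per:employee_or_member_of",
    "relation_terms": ["appointed"],
    "arg2_pattern": r"^(?:{titles}) of (?P<slotfill>.+)$".format(
        titles="|".join(re.escape(t) for t in JOB_TITLES)
    ),
}

def apply_rule(triple, rule):
    """triple = (arg1, rel, arg2); return (target_relation, query_entity, slotfill) or None."""
    arg1, rel, arg2 = triple
    if not any(term in rel for term in rule["relation_terms"]):
        return None
    match = re.match(rule["arg2_pattern"], arg2)
    if not match:
        return None
    return (rule["target"], arg1, match.group("slotfill"))

triple = ("Smith", "was appointed", "Acting Director of Acme Corporation")
print(apply_rule(triple, RULE))
# ('per:employee_or_member_of', 'Smith', 'Acme Corporation')
```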

Page 38

Hits the Head, but …

• High precision, average recall
• Limited recall from Open IE
  – Good with verb-based relations
  – Weak on noun-based relations
• “Implicit relation” patterns
  – “Bashardost, 43, is …” → (Bashardost, [has age], 43)
  – “… the Election Complaints Commission (ECC) …” → (Election Complaints Commission, [has acronym], ECC)
  – “French journalist Jean LeGall reported that …” → (Jean LeGall, [has job title], journalist), (Jean LeGall, [has nationality], French)

Page 39

NewsSpike Event Extractor

• Extracts event relations from news streams
  – Event = event_phrase(arg1_type, arg2_type)
• NewsSpike = (entity1, entity2, date, {sentences})
  – built from parallel news streams
  – Open IE identifies entity1, entity2, and the event phrase
  – a spike in frequency on that date indicates an event between entity1 and entity2 (see the sketch below)
• Automatically discovers relations not covered by Freebase, e.g.:
  – arrive_in (person, location)
  – beat (sports_team, sports_team)
  – meet_with (person, person)
  – nominate (person/politician, person)
  – unveil (organization, product)
  – …
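A minimal sketch of the NewsSpike grouping step: count how often an (entity1, entity2) pair is extracted per date and flag dates where the count spikes; the extractions and threshold below are made-up illustrations of the idea.

```python
from collections import defaultdict

# Minimal sketch of the NewsSpike grouping idea: count (entity1, entity2) pair
# co-occurrences per date across parallel news streams and flag spiking dates.
# The extractions and the spike threshold are made-up illustrations.
def find_spikes(extractions, min_count=3):
    """extractions: (entity1, entity2, event_phrase, date, sentence) tuples."""
    by_pair_date = defaultdict(list)
    for e1, e2, phrase, date, sent in extractions:
        by_pair_date[(e1, e2, date)].append((phrase, sent))
    # A candidate NewsSpike is a pair+date whose mention count reaches the threshold.
    return {key: mentions for key, mentions in by_pair_date.items()
            if len(mentions) >= min_count}

extractions = [
    ("Obama", "Israel", "arrived in", "2013-03-20", "Obama arrived in Israel on Wednesday."),
    ("Obama", "Israel", "landed in", "2013-03-20", "President Obama landed in Israel."),
    ("Obama", "Israel", "visited", "2013-03-20", "Obama visited Israel, his first trip there."),
    ("Obama", "Israel", "discussed", "2013-02-01", "Obama discussed Israel policy."),
]
for (e1, e2, date), mentions in find_spikes(extractions).items():
    print(f"NewsSpike: ({e1}, {e2}, {date}) with {len(mentions)} parallel sentences")
```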

Page 40

NewsSpike Architecture

[Figure: the NewsSpike pipeline. Training phase: parallel news streams are grouped into NewsSpikes NS = (a1, a2, d, S) with parallel sentences S = {s1, s2, s3}; events E = e(t1, t2) are discovered from the spikes and used to generate training sentences, from which the EventExtractor is learned. Testing phase: the learned EventExtractor takes test sentences as input and produces extractions e(a1, a2).]

Page 41

High Quality Training

• Paraphrases within a NewsSpike give positive training
• Negative training from the temporal negation heuristic:
  – If event phrases e1 and e2 are in the same NewsSpike
  – and one of them is negated,
  – then e1 is probably not a paraphrase of e2
  – “Team1 faces Team2” vs. “Team1 did not beat Team2” → face ≠ beat (see the sketch below)
• High precision from negative training
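A minimal sketch of the temporal negation heuristic, assuming the event phrases from one NewsSpike are given; the regular-expression negation test is a deliberate simplification.

```python
import re

# Minimal sketch of the temporal negation heuristic: if two event phrases occur
# in the same NewsSpike and exactly one of them is negated, the pair is emitted
# as negative (non-paraphrase) training data. The negation test is a simplification.
NEGATION = re.compile(r"\b(?:not|never)\b|n't")

def negative_pairs(newsspike_phrases):
    """Return phrase pairs that should NOT be treated as paraphrases of each other."""
    negatives = []
    for i, p1 in enumerate(newsspike_phrases):
        for p2 in newsspike_phrases[i + 1:]:
            if bool(NEGATION.search(p1)) != bool(NEGATION.search(p2)):
                negatives.append((p1, p2))
    return negatives

# "Team1 faces Team2" vs. "Team1 did not beat Team2": face is not a paraphrase of beat.
print(negative_pairs(["faces", "did not beat", "edges past"]))
# [('faces', 'did not beat'), ('did not beat', 'edges past')]
```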

Page 42

High Precision Event Extractor

• Doubles the area under the precision-recall curve vs. Universal Schemas

[Figure: precision-recall curves comparing NewsSpike-E2 on a news stream, Universal Schemas on NYT, and Universal Schemas on a news stream.]