Overview of Entity Discovery and Linking Tasks at KBP2014 Heng Ji (RPI) Joel Nothman, Ben Hachey...

Page 1:

Overview of Entity Discovery and Linking Tasks at KBP2014

Heng Ji (RPI), Joel Nothman, Ben Hachey (Univ. of Sydney)

Thanks to KBP2014 Organizing Committee

Page 2:

Goals and The Task

Page 3:

Overview

• Motivations
o The most popular EL trend: collective inference, which disambiguates a set of relevant mentions simultaneously by leveraging the global topical coherence between entities
o A lot of research has been done in parallel in the Wikification community (Bunescu and Pasca, 2006): extract prominent n-grams as concept mentions, and link each concept mention to the KB
o One important research direction of KBP: “Cold-start”

• What’s New in 2014
o Extend English task to Entity Discovery and Linking (full Entity Extraction + Entity Linking + NIL Clustering)
o Add discussion forums to Cross-lingual tracks
o Share some source collections and queries with regular and cold-start slot filling tracks, to investigate the role of EDL in the entire cold-start KBP pipeline
o Provide automatic annotations, reading list, software tools

Page 4:

Entity Mention Extraction

It’s a version of Chicago – the standard classic Macintosh menu font, with that distinctive thick diagonal in the “N”.

Chicago was used by default for Mac menus through MacOS 7.6, and OS 8 was released mid-1997.

Chicago VIII was one of the early 70s-era Chicago albums to catch my ear, along with Chicago II.

Page 5:

Clustering: Cross-doc Coreference Resolution

It’s a version of Chicago – the standard classic Macintosh menu font, with that distinctive thick diagonal in the “N”.

Chicago was used by default for Mac menus through MacOS 7.6, and OS 8 was released mid-1997.

Chicago VIII was one of the early 70s-era Chicago albums to catch my ear, along with Chicago II.

Page 6:

Linking: Disambiguation to KB

It’s a version of Chicago – the standard classic Macintosh menu font, with that distinctive thick diagonal in the “N”.

Chicago was used by default for Mac menus through MacOS 7.6, and OS 8 was released mid-1997.

Chicago VIII was one of the early 70s-era Chicago albums to catch my ear, along with Chicago II.

Page 7:

Evaluation Measures

• Added type matching variant into each measure

Page 8:
Page 9:
Page 10:

CEAF (Luo, 2005)

• Idea: a mention or entity should not be credited more than once

• Formulated as a bipartite matching problem
o A special ILP problem
o Efficient algorithm: Kuhn-Munkres
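
The matching idea can be illustrated with a minimal sketch of mention-based CEAF. It assumes the overlap similarity |K ∩ R| between a gold cluster K and a system cluster R and, purely to stay short, swaps Kuhn-Munkres for a brute-force search over all one-to-one alignments, so it only suits toy inputs. The function name `ceaf_m` and the example clusters are hypothetical, not taken from the official scorer.

```python
from itertools import permutations


def ceaf_m(gold, system):
    """Mention-based CEAF on toy inputs (assumes both sides are non-empty).

    Each gold cluster is aligned with at most one system cluster, so no
    mention or entity is credited more than once.
    """
    gold = [set(c) for c in gold]
    system = [set(c) for c in system]
    # Pad the shorter side with empty clusters so every permutation
    # corresponds to a complete one-to-one alignment.
    n = max(len(gold), len(system))
    gold_padded = gold + [set()] * (n - len(gold))
    system_padded = system + [set()] * (n - len(system))
    # Brute force stands in for Kuhn-Munkres here (exponential, toy only).
    best = max(
        sum(len(g & s) for g, s in zip(gold_padded, perm))
        for perm in permutations(system_padded)
    )
    precision = best / sum(len(c) for c in system)
    recall = best / sum(len(c) for c in gold)
    f1 = 2 * precision * recall / (precision + recall) if best else 0.0
    return precision, recall, f1


gold_clusters = [{"a", "b", "c"}, {"d", "e"}]
system_clusters = [{"a", "b"}, {"c", "d", "e"}]
precision, recall, f1 = ceaf_m(gold_clusters, system_clusters)
```

In this toy case the best alignment credits four of the five mentions on each side, whereas a measure that allowed a cluster to be matched twice would give more credit.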

Page 11:
Page 12:

Participants

• EDL: 20 teams, 75 runs; EL: 17 teams, 55 runs

Page 13:

The Results

Page 14:

General Architecture

Feedback from linking to improve extraction

New ranking algorithm: Programming with Personalized PageRank algorithm by CohenCMU (Mazaitis et al., 2014)

A nice summary of the state-of-the-art ranking features by Tohoku NL (Zhou et al., 2014)
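
As a rough illustration of the random-walk idea behind Personalized PageRank ranking (this is not the CohenCMU system, whose details are not given here), the sketch below runs a plain power iteration with restart to a seed node. The graph, the node names, the function `personalized_pagerank` and the restart probability `alpha` are all hypothetical.

```python
def personalized_pagerank(graph, seed, alpha=0.15, iterations=50):
    """Power iteration for PageRank with all restart mass on `seed`.

    `graph` maps every node to a list of successor nodes; every successor
    must itself be a key of `graph`.
    """
    rank = {node: 1.0 if node == seed else 0.0 for node in graph}
    for _ in range(iterations):
        # Restart: a fraction alpha of the walk jumps back to the seed.
        nxt = {node: (alpha if node == seed else 0.0) for node in graph}
        for node, successors in graph.items():
            if successors:
                share = (1 - alpha) * rank[node] / len(successors)
                for succ in successors:
                    nxt[succ] += share
            else:
                # Dangling node: send its remaining mass back to the seed.
                nxt[seed] += (1 - alpha) * rank[node]
        rank = nxt
    return rank


# Hypothetical toy graph: a document mentioning "Chicago" near "album".
toy_graph = {
    "doc": ["album", "Chicago_(band)", "Chicago_(typeface)"],
    "album": ["Chicago_(band)"],
    "Chicago_(band)": [],
    "Chicago_(typeface)": [],
}
scores = personalized_pagerank(toy_graph, seed="doc")
ranked = sorted(["Chicago_(band)", "Chicago_(typeface)"], key=scores.get, reverse=True)
```

Candidates that are reachable from the document through more context nodes accumulate more walk probability and are ranked first.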

Page 15:

Overall Performance: Extraction + Linking

Scoring: span, type and KB ID match

Systems with > 60% NERL F1 are significantly better than others (90% confidence interval)

Page 16:

Overall Performance: Extraction + Clustering

Scoring: span, type and clustering

LCC and RPI systems are significantly better than others (90% confidence interval)

Page 17:

Impact of Entity Mention Extraction

NER: span; NERC: span_type; NERL: span_type_KBID; KBIDs: docid_KBID

NER (extraction) correlates well with NERL (extraction + linking)

Bug in IBM system

75%, much lower than state-of-the-art name tagging (89%)
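
The four matching keys can be read as projections of the same mention tuples, each scored with set-based precision, recall and F1. The following is a minimal sketch under that reading; the tuple layout `(doc_id, start, end, type, kb_id)`, the function names and the toy annotations are assumptions, and the official scorer's handling of NIL identifiers is not modeled.

```python
def project(mentions, level):
    """Reduce mention tuples (doc_id, start, end, type, kb_id) to a match key."""
    if level == "NER":      # span only
        return {(d, s, e) for d, s, e, t, k in mentions}
    if level == "NERC":     # span + type
        return {(d, s, e, t) for d, s, e, t, k in mentions}
    if level == "NERL":     # span + type + KB ID
        return {(d, s, e, t, k) for d, s, e, t, k in mentions}
    if level == "KBIDs":    # document-level set of KB IDs
        return {(d, k) for d, s, e, t, k in mentions}
    raise ValueError(f"unknown level: {level}")


def prf(gold, system):
    """Set-based precision, recall and F1."""
    tp = len(gold & system)
    p = tp / len(system) if system else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical toy annotations for one document.
gold = [("d1", 0, 7, "ORG", "E1"), ("d1", 10, 16, "PER", "E2"), ("d1", 20, 27, "GPE", "E3")]
system = [("d1", 0, 7, "ORG", "E1"), ("d1", 10, 16, "ORG", "E2"), ("d1", 20, 27, "GPE", "E9")]

results = {
    level: prf(project(gold, level), project(system, level))
    for level in ("NER", "NERC", "NERL", "KBIDs")
}
```

Here every span is found, so NER F1 is 1.0, but one type error and one linking error pull NERC down to 2/3 and NERL down to 1/3, which shows how each stricter key compounds earlier errors.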

Page 18:

Diagnostic Entity Linking Performance

High performance with perfect entity mentions (70%-90%)

IBM is somewhere here too!

Page 19:

Entity Types and Textual Genres

Scoring: span, type and linking

Easiest: persons and discussion forum

Page 20:

Clustering Measures

B-cubed is very sensitive to mention extraction errors
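
One way to see this sensitivity is a small B-cubed sketch in which a system mention absent from the gold standard contributes zero precision and a missed gold mention contributes zero recall. That treatment of unaligned mentions is an assumption made for illustration (other conventions exist), and the function name and toy data are hypothetical.

```python
def b_cubed(gold, system):
    """B-cubed precision, recall, F1 for dicts mapping mention -> cluster id."""
    def clusters(assignment):
        grouped = {}
        for mention, cluster_id in assignment.items():
            grouped.setdefault(cluster_id, set()).add(mention)
        return grouped

    def averaged(a, b):
        # Average, over mentions of `a`, the share of each mention's cluster
        # that is also in its cluster on the `b` side (0 if `b` lacks it).
        a_clusters, b_clusters = clusters(a), clusters(b)
        total = 0.0
        for mention, cluster_id in a.items():
            if mention in b:
                own = a_clusters[cluster_id]
                other = b_clusters[b[mention]]
                total += len(own & other) / len(own)
        return total / len(a) if a else 0.0

    p = averaged(system, gold)
    r = averaged(gold, system)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = {"m1": "A", "m2": "A", "m3": "B"}
# Same clustering decisions, but one missed mention (m3) and one spurious (m4).
noisy = {"m1": "x", "m2": "x", "m4": "y"}

perfect_scores = b_cubed(gold, dict(gold))
noisy_scores = b_cubed(gold, noisy)
```

Although the noisy system clusters every shared mention correctly, a single missed mention and a single spurious one cut both precision and recall to 2/3.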

Page 21:

Cross-lingual Entity Linking

Both systems followed their English EL approaches

IBM achieved performance similar to the top English EDL system (the difficulty levels of the queries are not comparable)

Many Chinese teams chose to focus on English EDL (a cloned version in NLPCC2014 organized by PKU)

Tri-lingual EDL in KBP2015

Query | Team | B-cubed+ P (%) | R (%) | F (%)
Spanish | HITS1 | 78.9 | 68.4 | 73.2
Spanish | IBM1 | 84.0 | 81.6 | 82.8
English | HITS1 | 68.4 | 60.3 | 64.1
English | IBM1 | 80.6 | 77.7 | 79.1

Page 22:

What’s New and What Works - Or How to Make My Advisor Happy

A roller-coaster-style conversation 12 hours before this presentation…

R: I started to question why we are doing all of these…
H: Please don’t tell me all of these are meaningless…
R: Did EDL produce any new science?
H: Of course! Blabla…blabla…blabla…blabla…and Blabla
R: You make me happy

Page 23:

Entity Linking Milestones

2006: The first definition of the Wikification task (Bunescu and Pasca, 2006)

2009: TAC-KBP Entity Linking launched (McNamee and Dang, 2009)

2008-2012: Supervised learning-to-rank with diverse levels of features, such as entity profiling and various popularity and similarity measures, was developed (Gao et al., 2010; Chen and Ji, 2011; Ratinov et al., 2011; Zheng et al., 2010; Dredze et al., 2010; Anastacio et al., 2011)

2008-2013: Collective inference and coherence measures were developed (Milne and Witten, 2008; Kulkarni et al., 2009; Ratinov et al., 2011; Chen and Ji, 2011; Ceccarelli et al., 2013; Cheng and Roth, 2013)

2012: Various applications, e.g., knowledge acquisition (via grounding), coreference resolution (Ratinov and Roth, 2012) and document classification (Vitale et al., 2012; Song and Roth, 2014; Gao et al., 2014)

2014: TAC-KBP Entity Discovery and Linking (end-to-end name tagging, cross-document entity clustering, entity linking)

2012-2014: Many different versions of international evaluations were inspired by TAC-KBP; more than 130 papers have been published

Page 24:

Joint Extraction and Linking

Some recent work (Sil and Yates, 2013; Meij et al., 2012; Guo et al., 2013; Huang et al., 2014b) showed that extraction and linking can mutually enhance each other

“Bosch will provide the rear axle.” (Bosch = Robert Bosch Tool Corporation, ORG)

“Parker was 15 for 21 from the field, putting up a season high while scoring nine of San Antonio’s final 10 points in regulation” (San Antonio = San Antonio Spurs, ORG)

IBM (Sil and Florian, 2014), MSIIPL THU (Zhao et al., 2014), SemLinker (Meurs et al., 2014), UBC (Barrena et al., 2014) and RPI (Hong et al., 2014) used the properties in external KBs such as DBpedia as feedback to refine the identification and classification of name mentions.

The RPI system successfully corrected 11.26% of wrong mentions

The HITS team (Judea et al., 2014) proposed a joint approach that simultaneously solves extraction, linking and clustering using Markov Logic Networks

Document Linking → Event Extraction (Ji and Grishman, 2008)

Entity Linking → Relation Extraction (Chan and Roth, 2010)

Toward more interactions and joint inferences between tasks

Marry EDL and SF in KBP2015

Page 25:

David Cone, a Kansas City native, was originally signed by the Royals and broke into the majors with the team

Entity Linking to Improve Relation Extraction (Chan and Roth, 2010)

David Brian Cone (born January 2, 1963) is a former Major League Baseball pitcher. He compiled an 8–3 postseason record over 21 postseason starts and was a part of five World Series championship teams (1992 with the Toronto Blue Jays and 1996, 1998, 1999 & 2000 with the New York Yankees). He had a career postseason ERA of 3.80. He is the subject of the book A Pitcher's Story: Innings With David Cone by Roger Angell. Fans of David are known as "Cone-Heads." Cone lives in Stamford, Connecticut, and is formerly a color commentator for the Yankees on the YES Network.

Contents: 1 Early years; 2 Kansas City Royals; 3 New York Mets

Partly because of the resulting lack of leadership, after the 1994 season the Royals decided to reduce payroll by trading pitcher David Cone and outfielder Brian McRae, then continued their salary dump in the 1995 season. In fact, the team payroll, which was always among the league's highest, was sliced in half from $40.5 million in 1994 (fourth-highest in the major leagues) to $18.5 million in 1996 (second-lowest in the major leagues)

Page 26:

Task-specific / Genre-specific Mention Extraction

Extraction for Linking

4% of entity mentions included nested mentions

Posters in discussion forums should be extracted

HITS (Judea et al., 2014), LCC (Monahan et al., 2014), MSIIPL THU (Zhao et al., 2014), NYU (Nguyen et al., 2014) and RPI (Hong et al., 2014) developed heuristic rules to significantly improve name tagging

Page 27:

Toward Deep Understanding of Full Documents

Old Query-driven Entity Linking
• Limited exploration of co-occurring entity mentions
• Bag-of-words style

New EDL Task
• Deep representation and understanding of the relations among entities in the source documents
• Natural Language Understanding style, e.g., use Abstract Meaning Representation (details in RPI’s EDL talk)

Page 28:

It was a pool report typo. Here is exact Rhodes quote: “this is not gonna be a couple of weeks. It will be a period of days.”

At a WH briefing here in Santiago, NSA spox Rhodes came with a litany of pushback on idea WH didn’t consult with Congress.

Rhodes singled out a Senate resolution that passed on March 1st which denounced Khaddafy’s atrocities. WH says UN rez incorporates it

Ben Rhodes (Speech Writer)


Better Meaning Representation

Page 29:

Source: No matter what, he never should have given Michael Jackson that propofol. He seems to think a “proper” court would have let Murray go free.

KB: The trial of Conrad Murray was the American criminal trial of Michael Jackson's personal physician, Conrad Murray.

Social Relation

Select Collaborators from Rich Context

Page 30:

Source: Mubarak, the wife of deposed Egyptian President Hosni Mubarak, …

wife

Select Collaborators from Rich Context

Family

KB: Suzanne Mubarak (born 28 February 1941) is the wife of former Egyptian President Hosni Mubarak…

Page 31:

Source: Hundreds of protesters from various groups converged on the state capitol in Topeka, Kansas today…Second, I have a really hard time believing that there were any ACTUAL “explosives” since the news story they link to talks about one guy getting arrested for THREATENING Governor Brownback.

Sam Brownback / Peter Brownback

Select Collaborators from Rich Context

Employment

KB: Sam Brownback was elected Governor of Kansas in 2010 and took office in January 2011.

Page 32:

Source: AT&T coverage in GA is good along the interstates and in the major cities like Atlanta, Athens, Rome, Roswell and Albany.

Rome, Georgia

Select Collaborators from Rich Context

Part-whole

KB: At the 2010 census, Rome had a total population of 36,303, and is the largest city in Northwest [Georgia] and the 19th largest city in the state.

Rome, Italy
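
The Rome example can be turned into a toy sketch of collective inference: each mention has candidates with a popularity prior, and a joint assignment is scored by priors plus pairwise coherence between the chosen entities. The priors, the `located_in` table, the coherence function and the exhaustive search are all hypothetical simplifications for illustration, not any participant's system.

```python
from itertools import product

# Hypothetical popularity priors for each mention's candidate entities.
candidates = {
    "Rome": {"Rome, Italy": 0.9, "Rome, Georgia": 0.1},
    "Atlanta": {"Atlanta, Georgia": 1.0},
    "Athens": {"Athens, Greece": 0.8, "Athens, Georgia": 0.2},
}

# Hypothetical part-whole facts standing in for KB relations.
located_in = {
    "Rome, Italy": "Italy",
    "Rome, Georgia": "Georgia (U.S. state)",
    "Atlanta, Georgia": "Georgia (U.S. state)",
    "Athens, Greece": "Greece",
    "Athens, Georgia": "Georgia (U.S. state)",
}


def coherence(a, b):
    """1.0 if both entities are part of the same larger region, else 0.0."""
    return 1.0 if located_in[a] == located_in[b] else 0.0


def link_collectively(candidates, weight=1.0):
    """Exhaustively pick the joint assignment with the best total score."""
    mentions = list(candidates)
    best_combo, best_score = None, float("-inf")
    for combo in product(*(candidates[m] for m in mentions)):
        score = sum(candidates[m][e] for m, e in zip(mentions, combo))
        score += weight * sum(
            coherence(a, b) for i, a in enumerate(combo) for b in combo[i + 1:]
        )
        if score > best_score:
            best_combo, best_score = combo, score
    return dict(zip(mentions, best_combo))


with_context = link_collectively(candidates)
popularity_only = link_collectively(candidates, weight=0.0)
```

With the coherence term switched off, the popular reading "Rome, Italy" wins; with it on, the co-occurring Georgia cities pull the assignment to "Rome, Georgia".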

Page 33:

Source: Going into the big Super Tuesday, Romney had won the most votes, states and delegates, Santorum had won some contests and was second, Gingrich had only one contest.

Mitt Romney / George W. Romney

Select Collaborators from Rich Context

Start-position Event

KB: The Super Tuesday primaries took place on March 6. Mitt Romney carried six states, Rick Santorum carried three, and Newt Gingrich won only in his home state of Georgia.

Page 34:

Graph-based NIL Entity Clustering

Bad News in EL2012

CUNY-BLENDER (Tamang et al., 2012) explored more than 40 clustering algorithms and found that advanced graph-based clustering algorithms did not significantly outperform the single “All-in-one” baseline clustering algorithm on the overall queries (except the most difficult ones)

Good News in EDL2014

LCC (Monahan et al., 2014) showed that a graph-partition-based algorithm achieved significant gains.
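
For reference, a baseline of the "All-in-one" kind can be sketched in a few lines, assuming it means putting every NIL mention with the same normalized name string into one cluster. That reading, the normalization, the cluster ID format and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def all_in_one(nil_mentions):
    """Cluster NIL mentions by normalized name string.

    `nil_mentions` is an iterable of (mention_id, name) pairs.
    """
    by_name = defaultdict(list)
    for mention_id, name in nil_mentions:
        # Normalization: lowercase and collapse whitespace.
        key = " ".join(name.lower().split())
        by_name[key].append(mention_id)
    # Assign hypothetical NIL cluster IDs in alphabetical order of the key.
    return {
        f"NIL{index:04d}": ids
        for index, (_, ids) in enumerate(sorted(by_name.items()), start=1)
    }


toy_mentions = [("q1", "Rhodes"), ("q2", "rhodes "), ("q3", "Ben  Rhodes")]
toy_clusters = all_in_one(toy_mentions)
```

The baseline cannot split two different people who share a name, nor merge "Rhodes" with "Ben Rhodes", which is where graph-based methods have room to help.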

Page 35:

Remaining Challenges

Page 36:

Name Tagging: “Old” Milestones

Year | Tasks & Resources | Methods | F-Measure | Example References
1966 | - | First person name tagger with punch card; 30+ decision tree type rules | - | (Borkowski et al., 1966)
1998 | MUC-6 | MaxEnt with diverse levels of linguistic features | 97.12% | (Borthwick and Grishman, 1998)
2003 | CONLL | System combination; Sequential labeling with Conditional Random Fields | 89% | (Florian et al., 2003; McCallum et al., 2003; Finkel et al., 2005)
2006 | ACE | Diverse levels of linguistic features, Re-ranking, joint inference | ~89% | (Florian et al., 2006; Ji and Grishman, 2006)

Our progress compared to 1966: more data, a few more features and fancier learning algorithms

Not much active work after ACE because we tend to believe it’s a solved problem…

Page 37:

Cross-genre Name Tagging Experiments on ACE2005 data

Page 38:

What’s Wrong?

Name taggers are getting old (trained on 2003 news and tested on 2012 news)

Genre adaptation (informal contexts, posters)

Revisit the definition of name mention – extraction for linking

Old unsolved problems
• Identification: “Asian Pulp and Paper Joint Stock Company , Lt. of Singapore”
• Classification: “FAW has also utilized the capital market to directly finance,…” (FAW = First Automotive Works)

Potential Solutions for Quality
• Word clustering, Lexical Knowledge Discovery (Brown, 1992; Ratinov and Roth, 2009; Ji and Lin, 2010)
• Feedback from Linking, Relation, Event (Sil and Yates, 2013; Li and Ji, 2014)

Page 39:

Remaining Challenges for Linking

Remaining Challenges
• Popularity bias
• Knowledge gap between source and KB
• Commonsense Knowledge

Potential Solutions
• Deep knowledge acquisition and representation (e.g., AMR)
• Better graph search alignment algorithms
• Make more people excited about Chinese and Spanish by providing more resources
• Tri-lingual EDL in KBP2015

Page 40:

Popularity Bias

If you are called Michael Jordan…

Page 41:

A Little Better…

Page 42:

Source: breaking news/new information/rumors

KB: bio, summary, snapshot of life

Christies denial of marriage privledges to gays will alienate independents and his “I wanted to have the people vote on it” will ring hollow.

Christie has said that he favoured New Jersey's law allowing same-sex couples to form civil unions, but would veto any bill legalizing same-sex marriage in New Jersey

Translation out of hype-speak: some kook made threatening noises at Brownback and go arrested

Samuel Dale "Sam" Brownback (born September 12, 1956) is an American politician, the 46th and current Governor of Kansas.

Connect/Sort Background Knowledge

Man Accused Of Making Threatening Phone Call To Kansas Gov. Sam Brownback May Face Felony Charge

Knowledge Gap between Source and KB

Page 43:

2008-07-26: During talks in Geneva attended by William J. Burns, Iran refused to respond to Solana’s offers.

Commonsense Knowledge

William_J._Burns (1861-1932) William_Joseph_Burns (1956- )
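
A minimal sketch of this kind of commonsense check: discard candidates who could not have been alive on the document date. The lifespans come from the two KB titles above; representing them as (birth year, death year) pairs and the function name `alive_on` are assumptions for illustration.

```python
from datetime import date

# (birth year, death year or None if no death year is recorded)
lifespans = {
    "William_J._Burns": (1861, 1932),
    "William_Joseph_Burns": (1956, None),
}


def alive_on(lifespans, doc_date):
    """Keep candidates whose lifespan includes the document's year."""
    return [
        name
        for name, (born, died) in lifespans.items()
        if born <= doc_date.year and (died is None or doc_date.year <= died)
    ]


plausible = alive_on(lifespans, date(2008, 7, 26))
```

For the 2008 document only William_Joseph_Burns survives the filter, so the remaining disambiguation work is trivial in this case.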


Page 44:

Conclusions and Looking Forward

The new EDL task has attracted much interest from the KBP community and produced some interesting research problems and new directions

KBP2015
• Improve the annotation guidelines and annotation quality of the training and evaluation data sets
• Develop more open-source software, data and resources for Spanish and Chinese EDL
• Encourage researchers to revisit the entity mention extraction problem in the new cold-start KBP setting
• Propose a new tri-lingual EDL task on a source collection from three languages: English, Chinese and Spanish
• Investigate the impact of EDL on the end-to-end cold-start KBP framework; joint inference between EDL and SF

Page 45:

We can do it!