Post on 27-Dec-2015

People in CALO’s World: Contact Info, Expertise, Groups & Roles

Information Extraction, Coreference, Group/Topic Models

Andrew McCallum, Aron Culotta, Xuerui Wang, Charles Sutton, Wei Li

UMass Amherst

4

DEX Example

To: “Andrew McCallum” mccallum@cs.umass.edu

Subject ...

First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
JobTitle: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governor’s Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413) 545-1323
Links: Fernando Pereira, Sam Roweis,…
Key Words: Information extraction, social network,…

Search for new people

6

Outline

Information Extraction
– Learning in the wild
– Transfer learning

Identity Uncertainty

Modeling Groups, Roles and Topics

9

User feedback “in the wild” as labeling

Labeling for Classification

Easy: Often found in user interfaces

e.g. CALO IRIS, Apple Mail

Seminar: How to Organize your Life

by Jane Smith, Stevenson & Smith
Mezzanine Level, Papadopoulos Sq.

3:30 pm Thursday March 31

In this seminar we will learn how to use CALO to...

Seminar announcement

Todo request

Other

Labeling for Extraction

Painful: Difficult even for paid labelers

Complex tools

Seminar: How to Organize your Life

by Jane Smith, Stevenson & Smith
Mezzanine Level, Papadopoulos Sq.

3:30 pm Thursday March 31

In this seminar we will learn how to use CALO to...

Click, drag, adjust, label, click, drag, adjust, label,...

10

Multiple-choice Annotation for Learning Extractors “in the wild”

[Culotta, McCallum 2005]

Task: Information extraction. Fields: NAME, COMPANY, ADDRESS (and others)

Jane Smith , Stevenson & Smith , Mezzanine Level, Papadopoulos Sq.

Interface presents top hypothesized segmentations

Jane Smith , Stevenson & Smith Mezzanine Level , Papadopoulos Sq.

Jane Smith , Stevenson & Smith Mezzanine Level , Papadopoulos Sq.

Jane Smith , Stevenson & Smith Mezzanine Level , Papadopoulos Sq.

User corrects labels, not segmentations

29% reduction in user actions needed to train

14

Piecewise Training in Factorial CRFs for Transfer Learning

Emailed seminar announcement entities

Email English words

[Sutton, McCallum, 2005]

Too little labeled training data.

60k words training.

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell School of Computer Science Carnegie Mellon University

3:30 pm 7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

15

Piecewise Training in Factorial CRFs for Transfer Learning

Newswire named entities

Newswire English words

[Sutton, McCallum, 2005]

Train on “related” task with more data.

200k words training.

CRICKET - MILLNS SIGNS FOR BOLAND

CAPE TOWN 1996-08-22

South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

16

Piecewise Training in Factorial CRFs for Transfer Learning

Newswire named entities

Email English words

[Sutton, McCallum, 2005]

At test time, label email with newswire NEs...

17

Piecewise Training in Factorial CRFs for Transfer Learning

Newswire named entities

Emailed seminar announcement entities

Email English words

[Sutton, McCallum, 2005]

…then use these labels as features for final task
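
The cascaded setup on these slides (label the email with a newswire-trained tagger, then feed those labels to the final task as features) can be sketched roughly as follows. This is only an illustration: `toy_newswire_tagger` is a hypothetical dictionary lookup standing in for a real newswire-trained model, and the feature names are assumptions, not the authors' feature set.

```python
from typing import Callable, Dict, List

def toy_newswire_tagger(tokens: List[str]) -> List[str]:
    """Hypothetical stand-in for a named-entity tagger trained on newswire."""
    known_people = {"Jaime", "Carbonell"}
    known_orgs = {"Carnegie", "Mellon", "University"}
    tags = []
    for tok in tokens:
        if tok in known_people:
            tags.append("PER")
        elif tok in known_orgs:
            tags.append("ORG")
        else:
            tags.append("O")
    return tags

def cascaded_features(
    tokens: List[str],
    upstream_tagger: Callable[[List[str]], List[str]],
) -> List[Dict[str, object]]:
    """Build per-token features for the final task, including the
    upstream tagger's predicted label as one extra feature."""
    upstream_labels = upstream_tagger(tokens)
    features = []
    for tok, ne_label in zip(tokens, upstream_labels):
        features.append({
            "word": tok.lower(),
            "is_capitalized": tok[:1].isupper(),
            "newswire_ne": ne_label,  # label transferred from the related task
        })
    return features

tokens = "Jaime Carbonell Carnegie Mellon University 3:30 pm".split()
features = cascaded_features(tokens, toy_newswire_tagger)
```

A seminar-announcement extractor would then be trained on `features` rather than on the words alone. Unlike joint inference, this cascade commits to the upstream labels before the final task sees them.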

18

Piecewise Training in Factorial CRFs for Transfer Learning

Newswire named entities

Seminar Announcement entities

English words

[Sutton, McCallum, 2005]

Use joint inference at test time.

An alternative to hierarchical Bayes. Needn’t know anything about parameterization of subtask.

Accuracy: No transfer < Cascaded transfer < Joint inference transfer

21

Joint Co-reference Decisions, Discriminative Model

[Culotta & McCallum 2005]

People: Stuart Russell, Stuart Russell, S. Russel

[Figure: graph over the mentions above, with a Y/N coreference decision on each edge]

22

Co-reference for Multiple Entity Types

[Culotta & McCallum 2005]

People: Stuart Russell, Stuart Russell, S. Russel

Organizations: University of California at Berkeley, Berkeley, Berkeley

[Figure: graph over the mentions above, with a Y/N coreference decision on each edge]

23

Joint Co-reference of Multiple Entity Types

[Culotta & McCallum 2005]

People: Stuart Russell, Stuart Russell, S. Russel

Organizations: University of California at Berkeley, Berkeley, Berkeley

[Figure: graph over the mentions above, with a Y/N coreference decision on each edge]

Reduces error by 22%
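
To make the Y/N edge decisions concrete, here is a deliberately simplified sketch: each pair of mentions gets an independent yes/no decision, and the "yes" edges are merged into entities with union-find. The `compatible` rule (same first initial, same six-letter surname prefix) is a hypothetical toy, and deciding each pair independently is a simplification; the model on the slides makes these decisions jointly with a learned discriminative model.

```python
from itertools import combinations
from typing import Dict, List

def compatible(a: str, b: str) -> bool:
    """Toy Y/N decision: same first initial and same 6-letter surname prefix."""
    a_parts = a.replace(".", "").split()
    b_parts = b.replace(".", "").split()
    same_initial = a_parts[0][0].lower() == b_parts[0][0].lower()
    same_surname = a_parts[-1][:6].lower() == b_parts[-1][:6].lower()
    return same_initial and same_surname

def cluster_mentions(mentions: List[str]) -> List[List[int]]:
    """Group mention indices into entities by merging every 'Y' pair."""
    parent = list(range(len(mentions)))

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if compatible(mentions[i], mentions[j]):
            parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(mentions)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

people = ["Stuart Russell", "Stuart Russell", "S. Russel"]
clusters = cluster_mentions(people)
```

With this toy rule all three people mentions land in one cluster. A joint model over people and organizations could additionally let a confident organization merge influence the people decisions, which independent pairwise rules cannot do.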

26

Social network from my email


30

From LDA to Author-Recipient-Topic (ART)
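
The slide only names the model, so the following is a sketch under an assumption: that ART generates each word by picking one of the message's recipients uniformly, drawing a topic from a distribution specific to the (author, recipient) pair, and then drawing a word from that topic. The parameter tables `theta` and `phi` and all names in them are made-up toy values.

```python
import random
from typing import Dict, List, Tuple

TopicDist = Dict[str, float]
WordDist = Dict[str, float]

def generate_message(
    author: str,
    recipients: List[str],
    theta: Dict[Tuple[str, str], TopicDist],  # topics per (author, recipient)
    phi: Dict[str, WordDist],                 # words per topic
    n_words: int,
    rng: random.Random,
) -> List[str]:
    """Sample the words of one message from toy ART-style parameters."""
    words = []
    for _ in range(n_words):
        recipient = rng.choice(recipients)  # uniform over this message's recipients
        topic_dist = theta[(author, recipient)]
        topic = rng.choices(list(topic_dist), weights=list(topic_dist.values()))[0]
        word_dist = phi[topic]
        word = rng.choices(list(word_dist), weights=list(word_dist.values()))[0]
        words.append(word)
    return words

# Hypothetical toy parameters for illustration only.
theta = {
    ("a", "b"): {"legal": 1.0},
    ("a", "c"): {"social": 1.0},
}
phi = {
    "legal": {"contract": 0.5, "draft": 0.5},
    "social": {"dinner": 0.5, "weekend": 0.5},
}
message = generate_message("a", ["b"], theta, phi, n_words=5, rng=random.Random(0))
```

The difference from plain LDA in this sketch is the conditioning: the topic distribution is indexed by who is writing to whom, not by the document.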

32

Enron Email Corpus

250k email messages
23k people

Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: debra.perlingiere@enron.com
To: steve.hooser@enron.com
Subject: Enron/TransAltaContract dated Jan 1, 2001

Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.

DP

Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas 77002
dperlin@enron.com

33

Topics, and prominent senders/receivers discovered by ART

Titles chosen by me

34

Topics, and prominent senders/receivers discovered by ART

Beck = “Chief Operations Officer”
Dasovich = “Government Relations Executive”
Shapiro = “Vice President of Regulatory Affairs”
Steffes = “Vice President of Government Affairs”

35

Comparing Role Discovery

connection strength (A,B) =

Traditional SNA: distribution over recipients
Author-Topic: distribution over authored topics
ART: distribution over authored topics
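
One way to turn such per-person distributions into a role comparison is to measure how far apart two people's distributions are. The sketch below uses Jensen-Shannon divergence (base 2, so values lie in [0, 1]); the choice of divergence and the toy distributions are assumptions for illustration, since the slide does not say how the distributions are compared.

```python
import math
from typing import Dict

Dist = Dict[str, float]

def js_divergence(p: Dist, q: Dist) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    keys = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the union of outcomes.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_m(a: Dist) -> float:
        # KL(a || m); terms with zero probability contribute nothing.
        return sum(
            a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0.0
        )

    return 0.5 * kl_to_m(p) + 0.5 * kl_to_m(q)

# Hypothetical people who write to the same recipients about different topics.
recipients_a = {"r1": 0.5, "r2": 0.5}
recipients_b = {"r1": 0.5, "r2": 0.5}
topics_a = {"scheduling": 1.0}
topics_b = {"strategy": 1.0}

by_recipients = js_divergence(recipients_a, recipients_b)
by_topics = js_divergence(topics_a, topics_b)
```

In this toy case a recipient-based comparison sees the two people as identical while a topic-based comparison sees them as maximally different, which is the kind of disagreement between methods that the role-discovery comparison is about.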

36

Comparing Role Discovery: Tracy Geaconne, Dan McCarty

Traditional SNA    Author-Topic    ART

Similar roles    Different roles    Different roles

Geaconne = “Secretary”
McCarty = “Vice President”

38

Comparing Role Discovery: Lynn Blair, Kimberly Watson

Traditional SNA    Author-Topic    ART

Different roles    Very different    Very similar

Blair = “Gas pipeline logistics”
Watson = “Pipeline facilities planning”

40

McCallum Email Corpus 2004

January - October 2004
23k email messages
825 people

From: kate@cs.umass.edu
Subject: NIPS and ....
Date: June 14, 2004 2:27:41 PM EDT
To: mccallum@cs.umass.edu

There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for:

NIPS registration receipt.
CALO registration receipt.

Thanks,
Kate

42

Four most prominent topics in discussions with ____?

44

Two most prominent topics in discussions with ____?

Words         Prob
love          0.030514
house         0.015402
[unreadable]  0.013659
time          0.012351
great         0.011334
hope          0.011043
dinner        0.00959
saturday      0.009154
left          0.009154
ll            0.009009
[unreadable]  0.008282
visit         0.008137
evening       0.008137
stay          0.007847
bring         0.007701
weekend       0.007411
road          0.00712
sunday        0.006829
kids          0.006539
flight        0.006539

49

Role-Author-Recipient-Topic Models

50

Year Three Plans: “People”

Extraction, for Expert-finding and Group/Role Analysis
• Make learning-in-the-wild practical for extraction.
• Transfer from noisy/incomplete databases to improve IE.
• Support questions about contact info, organizational affiliation, etc.

Identity Uncertainty
• Central problem for going from text to knowledge base.
• Many interacting entity types, relationships.

Group/Role/Topic Analysis
• Explicit “topic models” of groups, roles, expertise, tasks, and their interaction with extraction...
• Support Qs about topical expertise, forwarding messages, team building.

Etc.
• Continue to support and enhance MALLET toolkit, in collaboration with UPenn and others.