Interactive Information Extraction and Social Network Analysis
Andrew McCallum
Information Extraction and Synthesis Laboratory
UMass Amherst

Motivation

• Capture confidence of records in extracted database

• Alert data mining to possible errors in the database

First Name | Last Name | Confidence
Bill       | Gates     | 0.96
Bill       | banks     | 0.43

Confidence Estimation in Linear-chain CRFs [Culotta, McCallum 2004]

[Figure: lattice of FSM states (OTHER, TITLE, ORG, PERSON); output sequence y_{t-1} … y_{t+3} over observations x_{t-1} … x_{t+3}, for the input sequence "said Arden Bement NSF Director …"]

p(y | x) = (1 / Z(x)) ∏_{t=1}^{T} Φ_y(y_t, y_{t−1}) Φ_{xy}(x_t, y_t)

Confidence Estimation in Linear-chain CRFs [Culotta, McCallum 2004]

Constrained Forward-Backward

[Figure: the same lattice of FSM states over "said Arden Bement NSF Director …"; only paths labeling "Arden Bement" as PERSON are summed]

p(Arden Bement = PERSON | x) = (1 / Z(x)) ∑_{y∈C} ∏_{t=1}^{T} Φ_y(y_t, y_{t−1}) Φ_{xy}(x_t, y_t)

where C is the set of label sequences consistent with the constraint.
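The constrained sum above is an ordinary forward pass in which the lattice is masked at the constrained positions. The following is a minimal NumPy sketch, not the authors' code: the label set, potential values, and the `constraint` argument are illustrative assumptions.

```python
import numpy as np

def forward_sum(phi_y, phi_xy, constraint=None):
    """Sum of products of potentials over label paths.

    phi_y:  (S, S) transition potentials, phi_y[i, j] = Phi_y(y_t = j, y_{t-1} = i)
    phi_xy: (T, S) observation potentials, phi_xy[t, j] = Phi_xy(x_t, y_t = j)
    constraint: optional {position: label} dict; paths must pass through
    these labels. With no constraint this returns Z(x); with a constraint
    it returns the unnormalized constrained sum.
    """
    T, S = phi_xy.shape
    alpha = phi_xy[0].copy()
    if constraint and 0 in constraint:
        alpha *= np.eye(S)[constraint[0]]          # zero out other labels
    for t in range(1, T):
        alpha = (alpha @ phi_y) * phi_xy[t]        # standard forward recursion
        if constraint and t in constraint:
            alpha *= np.eye(S)[constraint[t]]
    return alpha.sum()

# Toy potentials: 4 tokens, 3 labels (say OTHER=0, ORG=1, PERSON=2);
# the numbers are made up for illustration.
rng = np.random.default_rng(0)
phi_y = rng.uniform(0.5, 2.0, (3, 3))
phi_xy = rng.uniform(0.5, 2.0, (4, 3))

Z = forward_sum(phi_y, phi_xy)
# Confidence that tokens 1-2 (e.g. "Arden Bement") are PERSON:
conf = forward_sum(phi_y, phi_xy, constraint={1: 2, 2: 2}) / Z
```

Summing the constrained scores over every possible label at one position recovers Z(x), which makes a convenient correctness check.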

Forward-Backward Confidence Estimation

Improves the accuracy/coverage trade-off.

[Figure: accuracy vs. coverage curves for: optimal, our forward-backward confidence, traditional token-wise confidence, and no use of confidence]

Application of Confidence Estimation

Interactive Information Extraction: to correct predictions, direct the user to the least confident field.

Interactive Information Extraction

• IE algorithm calculates confidence scores
• UI uses confidence scores to alert the user to possible errors
• IE algorithm takes corrections into account and propagates corrections to other fields

User Correction

The user corrects a field, e.g. dragging "Stanley" to the First Name field.

[Figure: linear-chain lattice over "Charles Stanley 100 Charles Street" with states First Name, Last Name, Address Line; labels y1 … y5 over tokens x1 … x5]

Remove Paths

The user's correction (e.g. dragging "Stanley" to the First Name field) removes the lattice paths inconsistent with it.

[Figure: the same lattice over "Charles Stanley 100 Charles Street", with inconsistent paths removed]

Constrained Viterbi

The Viterbi algorithm is constrained to pass through the designated state. An adjacent field changes as a result: correction propagation.

[Figure: the same lattice over "Charles Stanley 100 Charles Street"; the constrained best path relabels a neighboring field]
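Constrained Viterbi can be sketched as ordinary max-product decoding with forbidden labels masked out. This is an illustrative sketch, not the paper's implementation; the label names and potential values are made up. Forcing one position's label can change the decoded labels of adjacent positions, which is the correction-propagation effect described above.

```python
import numpy as np

def constrained_viterbi(log_phi_y, log_phi_xy, constraint=None):
    """Best label path, subject to constraint {position: forced_label}."""
    T, S = log_phi_xy.shape
    NEG = -1e30                                   # effectively -infinity

    def clamp(scores, t):
        if constraint and t in constraint:
            keep = np.full(S, NEG)
            keep[constraint[t]] = 0.0
            return scores + keep                  # forbid all other labels
        return scores

    delta = clamp(log_phi_xy[0].copy(), 0)
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + log_phi_y         # cand[prev, cur]
        back[t] = cand.argmax(axis=0)
        delta = clamp(cand.max(axis=0) + log_phi_xy[t], t)
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                 # backtrace
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy model: labels FIRST=0, LAST=1, ADDR=2 over 5 tokens
# ("Charles Stanley 100 Charles Street"); the scores are made up.
rng = np.random.default_rng(1)
log_phi_y = rng.normal(size=(3, 3))
log_phi_xy = rng.normal(size=(5, 3))

free = constrained_viterbi(log_phi_y, log_phi_xy)
fixed = constrained_viterbi(log_phi_y, log_phi_xy, constraint={1: 0})
```

Comparing `free` with `fixed` shows how fixing token 1 to FIRST can also move the labels of its neighbors.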

Constrained Viterbi

After fixing the least confident field, constrained Viterbi automatically reduces error by another 23%.

Recent work reduces annotation effort further by simplifying annotation to multiple choice:

[Example: two candidate segmentations, A) and B), of "Bill Gates Redmond WA" into First Name / Last Name / City fields]

User feedback "in the wild" as labeling

Labeling for Classification: easy, often found in user interfaces, e.g. CALO IRIS, Apple Mail. The user files a message as Seminar announcement, To-do request, or Other.

Labeling for Extraction: painful, difficult even for paid labelers, with complex tools (click, drag, adjust, label; click, drag, adjust, label; ...).

Example message for both tasks:

Seminar: How to Organize your Life
by Jane Smith, Stevenson & Smith
Mezzanine Level, Papadapoulos Sq
3:30 pm Thursday March 31
In this seminar we will learn how to use CALO to...

Multiple-choice Annotation for Learning Extractors "in the wild" [Culotta, McCallum 2005]

Task: information extraction. Fields: NAME, COMPANY, ADDRESS (and others).

Jane Smith , Stevenson & Smith , Mezzanine Level, Papadopoulos Sq.

The interface presents the top hypothesized segmentations of the input; the user corrects labels, not segmentations.

[Figure: three hypothesized segmentations of the line into NAME / COMPANY / ADDRESS fields]

Result: 29% reduction in user actions needed to train.

Piecewise Training in Factorial CRFs for Transfer Learning [Sutton, McCallum, 2005]

Too little labeled training data: only 60k words of training for the target task (emailed seminar announcement entities over email English words).

Example training document:
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science, Carnegie Mellon University
3:30 pm, 7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

Piecewise Training in Factorial CRFs for Transfer Learning [Sutton, McCallum, 2005]

Train on a "related" task with more data: 200k words of training (newswire named entities over newswire English words).

Example training document:
CRICKET - MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

Piecewise Training in Factorial CRFs for Transfer Learning [Sutton, McCallum, 2005]

At test time, label email with newswire named entities...

Piecewise Training in Factorial CRFs for Transfer Learning [Sutton, McCallum, 2005]

...then use these labels as features for the final task (newswire named entities and emailed seminar announcement entities, both over email English words).

Piecewise Training in Factorial CRFs for Transfer Learning [Sutton, McCallum, 2005]

Use joint inference at test time (newswire named entities and seminar announcement entities over English words).

An alternative to hierarchical Bayes: needn't know anything about the parameterization of the subtask.

Accuracy: No transfer < Cascaded Transfer < Joint Inference Transfer

A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance

Andrew McCallum

Kedar Bellare

Fernando Pereira

Thanks to Charles Sutton, Xuerui Wang and Mikhail Bilenko for helpful discussions.


String Edit Distance

Distance between sequences x and y: the "cost" of the lowest-cost sequence of edit operations that transform string x into y.

Applications:
– Database record deduplication: are "Apex International Hotel, Grassmarket Street" and "Apex Internat'l, Grasmarket Street" duplicate records for the same hotel?
– Biological sequences:
  AGCTCTTACGATAGAGGACTCCAGA
  AGGTCTTACCAAAGAGGACTTCAGA
– Machine translation: "Il a achete une pomme" / "He bought an apple"
– Textual entailment: "He bought a new car last night" / "He purchased a brand new automobile yesterday evening"

Levenshtein Distance [1966]

Edit operations:
copy    Copy a character from x to y (cost 0)
insert  Insert a character into y (cost 1)
delete  Delete a character from y (cost 1)
subst   Substitute one character for another (cost 1)

Align two strings:
x1 = William W. Cohon
x2 = Willleam Cohen

[Figure: lowest-cost alignment of x1 and x2 as a sequence of copy/subst/insert/delete operations, with per-operation costs 0 0 0 0 1 1 0 0 1 1 1 0 0 0 0 1 0]

Total cost = 6 = Levenshtein Distance

Levenshtein Distance

Dynamic program: D(i,j) = score of the best alignment of x1...xi with y1...yj.

D(i,j) = min { D(i−1,j−1) + δ(xi ≠ yj),    [copy / subst]
               D(i−1,j) + 1,               [delete]
               D(i,j−1) + 1 }              [insert]

    W i l l l e a m
  0 1 2 3 4 5 6 7 8
W 1 0 1 2 3 4 5 6 7
i 2 1 0 1 2 3 4 5 6
l 3 2 1 0 1 2 3 4 5
l 4 3 2 1 0 1 2 3 4
i 5 4 3 2 1 1 2 3 4
a 6 5 4 3 2 2 2 2 4
m 7 6 5 4 3 3 3 3 2

The bottom-right entry is the total cost = distance.
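The recurrence above translates directly into a few lines of code. This is a generic sketch of the standard algorithm, not anything specific to the slides:

```python
def levenshtein(x, y):
    """D(i,j) = min(D(i-1,j-1) + [x_i != y_j], D(i-1,j) + 1, D(i,j-1) + 1)."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                        # i deletions
    for j in range(n + 1):
        D[0][j] = j                        # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (x[i - 1] != y[j - 1]),  # copy or subst
                D[i - 1][j] + 1,                           # delete
                D[i][j - 1] + 1,                           # insert
            )
    return D[m][n]

print(levenshtein("William W. Cohon", "Willleam Cohen"))  # 6, as on the slide
```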

Levenshtein Distance with Markov Dependencies

The cost of each edit operation now depends on the previous operation, and these costs are learned from training data; in particular, a repeated delete is cheaper:

Edit operations                              cost after:  copy  insert  delete  subst
copy    Copy a character from x to y                       0     0       0       0
insert  Insert a character into y                          1     ½       1       1
delete  Delete a character from y                          1     1       ½       1
subst   Substitute one character for another               1     1       1       1

The dynamic-programming table becomes 3D: the same alignment grid as before, with one layer per preceding edit operation (copy, insert, delete, subst).
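The 3D dynamic program can be sketched as follows. The cost table is a hypothetical reconstruction (the ½ discounts for repeated insert/delete follow the slide's "repeated delete is cheaper" note); in the actual model these costs are learned, not fixed.

```python
OPS = ("copy", "insert", "delete", "subst")

# Hypothetical cost table: COST[op][prev] = cost of `op` right after `prev`.
COST = {
    "copy":   {"copy": 0.0, "insert": 0.0, "delete": 0.0, "subst": 0.0},
    "insert": {"copy": 1.0, "insert": 0.5, "delete": 1.0, "subst": 1.0},
    "delete": {"copy": 1.0, "insert": 1.0, "delete": 0.5, "subst": 1.0},
    "subst":  {"copy": 1.0, "insert": 1.0, "delete": 1.0, "subst": 1.0},
}

def markov_edit_distance(x, y):
    """3D DP: D[i][j][prev] = cheapest alignment of x[:i] with y[:j]
    whose last operation was `prev`."""
    INF = float("inf")
    m, n = len(x), len(y)
    D = [[dict.fromkeys(OPS, INF) for _ in range(n + 1)] for _ in range(m + 1)]
    D[0][0]["copy"] = 0.0      # start state: baseline ("after copy") costs
    for i in range(m + 1):
        for j in range(n + 1):
            for prev, d in D[i][j].items():
                if d == INF:
                    continue
                if i < m and j < n:          # copy (equal chars) or subst
                    op = "copy" if x[i] == y[j] else "subst"
                    cell = D[i + 1][j + 1]
                    cell[op] = min(cell[op], d + COST[op][prev])
                if i < m:                    # delete a character of x
                    cell = D[i + 1][j]
                    cell["delete"] = min(cell["delete"], d + COST["delete"][prev])
                if j < n:                    # insert a character of y
                    cell = D[i][j + 1]
                    cell["insert"] = min(cell["insert"], d + COST["insert"][prev])
    return min(D[m][n].values())

print(markov_edit_distance("aaa", "a"))  # 1.5: copy, delete (1), delete (0.5)
```

With unit costs this reduces to plain Levenshtein; the extra dimension only matters once the cost of an operation depends on its predecessor.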

Ristad & Yianilos (1997)

Essentially a pair-HMM, generating an edit/state/alignment sequence and two strings.

[Figure: the alignment of string 1 "William W. Cohon" and string 2 "Willleam Cohen" as an operation sequence a (copy, subst, insert, delete), with index sequences
a.i1 = 1 2 3 4 4 5 6 7 8 9 10 11 12 13 14 15 16
a.i2 = 1 2 3 4 5 6 7 8 8 8 8 9 10 11 12 13 14]

Complete data likelihood:
p(a, x1, x2) = ∏_t p(a_t | a_{t−1}) p(x1_{a_t.i1}, x2_{a_t.i2} | a_t)

Incomplete data likelihood (sum over all alignments consistent with x1 and x2):
p(x1, x2) = ∑_{a:x1,x2} ∏_t p(a_t | a_{t−1}) p(x1_{a_t.i1}, x2_{a_t.i2} | a_t)

Match score = p(x1, x2).

Given a training set of matching string pairs, the objective function is
O = ∏_j p(x1^(j), x2^(j))

Learn via EM:
– Expectation step: calculate the likelihood of alignment paths.
– Maximization step: make those paths more likely.

Ristad & Yianilos Regrets

Limited features of input strings:
– examine only a single character pair at a time
– difficult to use upcoming string context, lexicons, ...
– example: "Senator John Green" vs. "John Green"

Limited edit operations:
– difficult to generate arbitrary jumps in both strings
– example: "UMass" vs. "University of Massachusetts"

Trained only on positive match data:
– doesn't include information-rich "near misses"
– example: "ACM SIGIR" ≠ "ACM SIGCHI"

So, consider a model trained by conditional probability.

Conditional Probability (Sequence) Models

We prefer a model trained to maximize a conditional probability rather than a joint probability, P(y|x) instead of P(y,x):
– Can examine features, but is not responsible for generating them.
– Don't have to explicitly model their dependencies.

From HMMs to Conditional Random Fields [Lafferty, McCallum, Pereira 2001]

Joint:
P(y, x) = ∏_{t=1}^{|x|} P(y_t | y_{t−1}) P(x_t | y_t)

Conditional:
P(y | x) = (1 / P(x)) ∏_{t=1}^{|x|} P(y_t | y_{t−1}) P(x_t | y_t)
         = (1 / Z(x)) ∏_{t=1}^{|x|} Φ_s(y_t, y_{t−1}) Φ_o(x_t, y_t)

where Φ_o(x_t, y_t) = exp( ∑_k λ_k f_k(y_t, x_t) )

(A linear chain: a super-special case of Conditional Random Fields.) Set parameters by maximum likelihood, using an optimization method on the likelihood L.

Wide-spread interest, positive experimental results in many applications:
– Noun phrase, named entity [HLT'03], [CoNLL'03]
– Protein structure prediction [ICML'04]
– IE from bioinformatics text [Bioinformatics '04], ...
– Asian word segmentation [COLING'04], [ACL'04]
– IE from research papers [HLT'04]
– Object classification in images [CVPR '04]
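The conditional formula can be checked by brute force on a toy instance. This sketch enumerates all label sequences to compute Z(x) (a real implementation would use the forward algorithm instead); the random potentials are illustrative.

```python
import itertools
import numpy as np

def crf_prob(y, phi_s, phi_o):
    """p(y|x) = (1/Z(x)) * prod_t Phi_s(y_t, y_{t-1}) * Phi_o(x_t, y_t),
    with Z(x) computed here by enumerating every label sequence."""
    T, S = phi_o.shape

    def score(seq):
        v = phi_o[0, seq[0]]
        for t in range(1, T):
            v *= phi_s[seq[t - 1], seq[t]] * phi_o[t, seq[t]]
        return v

    Z = sum(score(seq) for seq in itertools.product(range(S), repeat=T))
    return score(y) / Z

# Toy potentials Phi = exp(sum_k lambda_k f_k); the numbers are arbitrary.
rng = np.random.default_rng(2)
phi_s = rng.uniform(0.1, 1.0, (3, 3))    # Phi_s(y_{t-1}, y_t)
phi_o = rng.uniform(0.1, 1.0, (4, 3))    # Phi_o(x_t, y_t)

p = crf_prob((0, 1, 2, 1), phi_s, phi_o)
```

Because every label sequence is enumerated, the probabilities sum to one, which makes a convenient sanity check on the normalization Z(x).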

CRF String Edit Distance

[Figure: the alignment of string 1 "William W. Cohon" and string 2 "Willleam Cohen" as an operation sequence a (copy, subst, insert, delete), with index sequences
a.i1 = 1 2 3 4 4 5 6 7 8 9 10 11 12 13 14 15 16
a.i2 = 1 2 3 4 5 6 7 8 8 8 8 9 10 11 12 13 14]

Joint complete data likelihood (Ristad & Yianilos):
p(a, x1, x2) = ∏_t p(a_t | a_{t−1}) p(x1_{a_t.i1}, x2_{a_t.i2} | a_t)

Conditional complete data likelihood (CRF):
p(a | x1, x2) = (1 / Z_{x1,x2}) ∏_t Φ(a_t, a_{t−1}, x1, x2)

We want to train from a set of string pairs, each labeled one of {match, non-match}:

match      "William W. Cohon"   "Willlleam Cohen"
non-match  "Bruce D'Ambrosio"   "Bruce Croft"
match      "Tommi Jaakkola"     "Tommi Jakola"
match      "Stuart Russell"     "Stuart Russel"
non-match  "Tom Deitterich"     "Tom Dean"

CRF String Edit Distance FSM

[Figure: a four-state FSM with states copy, insert, delete, subst]

CRF String Edit Distance FSM

[Figure: a Start state leading into two copies of the copy/insert/delete/subst FSM — a "match" submodel (m = 1) and a "non-match" submodel (m = 0)]

Conditional incomplete data likelihood:
p(m | x1, x2) = (1 / Z_{x1,x2}) ∑_{a∈S_m} ∏_t Φ(a_t, a_{t−1}, x1, x2)

CRF String Edit Distance FSM

x1 = "Tommi Jaakkola", x2 = "Tommi Jakola":
– probability summed over all alignments in match states: 0.8
– probability summed over all alignments in non-match states: 0.2

CRF String Edit Distance FSM

x1 = "Tom Dietterich", x2 = "Tom Dean":
– probability summed over all alignments in match states: 0.1
– probability summed over all alignments in non-match states: 0.9
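The sum over alignments in the match vs. non-match submodels is a forward pass over the (i, j) alignment grid, one pass per submodel. This is a simplified sketch: the potentials here are fixed made-up numbers per operation, whereas the real model computes Φ(a_t, a_{t−1}, x1, x2) from learned feature weights.

```python
import numpy as np

def submodel_score(x, y, pot):
    """Forward sum over all alignments of x and y with per-op potentials."""
    m, n = len(x), len(y)
    alpha = np.zeros((m + 1, n + 1))
    alpha[0, 0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            a = alpha[i, j]
            if a == 0.0:
                continue
            if i < m and j < n:              # copy (equal chars) or subst
                op = "copy" if x[i] == y[j] else "subst"
                alpha[i + 1, j + 1] += a * pot[op]
            if i < m:                        # delete a character of x
                alpha[i + 1, j] += a * pot["delete"]
            if j < n:                        # insert a character of y
                alpha[i, j + 1] += a * pot["insert"]
    return alpha[m, n]

def match_probability(x, y, pots):
    s = {m: submodel_score(x, y, pots[m]) for m in pots}
    return s["match"] / (s["match"] + s["non-match"])

# Hypothetical potentials: the match submodel rewards copies strongly.
pots = {
    "match":     {"copy": 2.0, "subst": 0.1, "insert": 0.1, "delete": 0.1},
    "non-match": {"copy": 0.3, "subst": 0.3, "insert": 0.3, "delete": 0.3},
}
p_sim = match_probability("Tommi Jaakkola", "Tommi Jakola", pots)
p_dif = match_probability("Tom Dietterich", "Tom Dean", pots)
```

With these potentials, similar pairs put most alignment mass in the match submodel and dissimilar pairs in the non-match submodel, mirroring the 0.8/0.2 and 0.1/0.9 splits on the slides.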

Parameter Estimation

Given a training set of string pairs and match/non-match labels, the objective function is the incomplete log likelihood
O = ∑_j log p(m^(j) | x1^(j), x2^(j))

The complete log likelihood is
∑_j ∑_a p(a | x1^(j), x2^(j)) log p(m^(j) | a, x1^(j), x2^(j))

Expectation Maximization:
– E-step: estimate the distribution over alignments, p(a | x1^(j), x2^(j)), using current parameters.
– M-step: change parameters to maximize the complete (penalized) log likelihood, with an iterative quasi-Newton method (BFGS).

This is "conditional EM", but avoids the complexities of [Jebara 1998], because there is no need to solve the M-step in closed form.

Efficient Training

The dynamic-programming table is 3D: with |x1| = |x2| = 100 and |S| = 12, that is 120,000 entries. Use beam search during the E-step [Pal, Sutton, McCallum 2005].

Unlike completely observed CRFs, the objective function is not convex. Initialize parameters not at zero, but so as to yield a reasonable initial edit distance.

What Alignments are Learned?

x1 = "Tommi Jaakkola", x2 = "Tommi Jakola"

[Figure: the learned alignment of the two strings in the match/non-match FSM]

What Alignments are Learned?

x1 = "Bruce Croft", x2 = "Tom Dean"

[Figure: the learned alignment of the two strings in the match/non-match FSM]

What Alignments are Learned?

x1 = "Jaime Carbonell", x2 = "Jamie Callan"

[Figure: the learned alignment of the two strings in the match/non-match FSM]

Summary of Advantages

Arbitrary features of the input strings:
– examine past and future context
– use lexicons, WordNet

Extremely flexible edit operations:
– a single operation may make arbitrary jumps in both strings, of size determined by input features

Discriminative training:
– maximize the ability to predict match vs. non-match

Experimental Results: Data Sets

Restaurant name, restaurant address:
– 864 records, 112 matches
– e.g. "Abe's Bar & Grill, E. Main St" vs. "Abe's Grill, East Main Street"

People names, UIS DB generator:
– synthetic noise
– e.g. "John Smith" vs. "Snith, John"

CiteSeer citations:
– in four sections: Reason, Face, Reinforce, Constraint
– e.g. "Rusell & Norvig, 'Artificial Intelligence: A Modern...'" vs. "Russell & Norvig, 'Artificial Intelligence: An Intro...'"

Experimental Results: Features

same, different
same-alphabetic, different-alphabetic
same-numeric, different-numeric
punctuation1, punctuation2
alphabet-mismatch, numeric-mismatch
end-of-1, end-of-2
same-next-character, different-next-character

Experimental Results: Edit Operations

insert, delete, substitute/copy
swap-two-characters
skip-word-if-in-lexicon
skip-parenthesized-words
skip-any-word
substitute-word-pairs-in-translation-lexicon
skip-word-if-present-in-other-string

Experimental Results

F1 (harmonic mean of precision and recall):

Distance metric      CiteSeer                              Restaurant  Restaurant
                     Reason  Face   Reinf  Constraint      name        address
Levenshtein          0.927   0.952  0.893  0.924           0.290       0.686
Learned Leven.       0.938   0.966  0.907  0.941           0.354       0.712
Vector               0.897   0.922  0.903  0.923           0.365       0.380
Learned Vector       0.924   0.875  0.808  0.913           0.433       0.532
CRF Edit Distance    0.964   0.918  0.917  0.976           0.448       0.783

(The first four rows are from [Bilenko & Mooney 2003].)

Experimental Results

Data set: person names, with word-order noise added.

                                          F1
Without skip-if-present-in-other-string   0.856
With skip-if-present-in-other-string      0.981

Joint Co-reference Decisions, Discriminative Model [Culotta & McCallum 2005]

[Figure: pairwise Y/N coreference decisions among the People mentions "Stuart Russell", "Stuart Russell", and "S. Russel"]

Co-reference for Multiple Entity Types [Culotta & McCallum 2005]

[Figure: pairwise Y/N decisions among People mentions ("Stuart Russell", "Stuart Russell", "S. Russel") and Organization mentions ("University of California at Berkeley", "Berkeley", "Berkeley")]

Joint Co-reference of Multiple Entity Types [Culotta & McCallum 2005]

[Figure: the same People and Organization mentions, with Y/N coreference decisions coupled across entity types]

Reduces error by 22%.

Social network from my email

[Figure: social network extracted from email]

Clustering words into topics with Latent Dirichlet Allocation [Blei, Ng, Jordan 2003]

Generative process:
For each document:
  Sample a distribution over topics (example: 70% Iraq war, 30% US election)
  For each word in the doc:
    Sample a topic, z (e.g. Iraq war)
    Sample a word from the topic, w (e.g. "bombing")
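The generative process above can be sketched in a few lines. The vocabulary, the two topics, and their word probabilities are made-up toy values; only the sampling structure is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy topics over a four-word vocabulary (illustrative probabilities).
vocab = ["troops", "bombing", "ballot", "election"]
topics = np.array([
    [0.5, 0.4, 0.05, 0.05],   # a war-like topic
    [0.05, 0.05, 0.4, 0.5],   # an election-like topic
])

def generate_document(n_words, alpha=1.0):
    """One document from the LDA generative process."""
    theta = rng.dirichlet([alpha] * len(topics))   # distribution over topics
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)       # sample a topic
        w = rng.choice(len(vocab), p=topics[z])    # sample a word from it
        words.append(vocab[w])
    return theta, words

theta, doc = generate_document(10)
```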

Example topics induced from a large collection of text [Tenenbaum et al]

[Figure: top words of eight induced topics, e.g. STORY, STORIES, TELL, CHARACTER, AUTHOR, ...; MIND, WORLD, DREAM, IMAGINATION, ...; WATER, FISH, SEA, SWIM, ...; DISEASE, BACTERIA, GERMS, VIRUS, ...; FIELD, MAGNETIC, MAGNET, WIRE, ...; SCIENCE, STUDY, SCIENTISTS, RESEARCH, ...; BALL, GAME, TEAM, FOOTBALL, ...; JOB, WORK, CAREER, EMPLOYMENT, ...]


From LDA to Author-Recipient-Topic (ART)

Inference and Estimation

Gibbs Sampling:
– Easy to implement
– Reasonably fast
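To make "easy to implement" concrete, here is a collapsed Gibbs sampler for plain LDA (a sketch under standard LDA assumptions, not the ART sampler itself; ART additionally conditions topics on author-recipient pairs but has the same count-and-resample structure).

```python
import numpy as np

def lda_gibbs(docs, n_topics, n_vocab, iters=200, alpha=0.1, beta=0.01):
    """Collapsed Gibbs sampling for LDA. docs: lists of word ids."""
    rng = np.random.default_rng(0)
    ndk = np.zeros((len(docs), n_topics))      # doc-topic counts
    nkw = np.zeros((n_topics, n_vocab))        # topic-word counts
    nk = np.zeros(n_topics)                    # topic totals
    z = []
    for d, doc in enumerate(docs):             # random initial assignments
        zs = rng.integers(n_topics, size=len(doc))
        z.append(zs)
        for w, k in zip(doc, zs):
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                    # remove the current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # conditional p(z = k | everything else)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + n_vocab * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k                    # resample and restore counts
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return nkw

# Toy corpus with two obvious "topics": words 0-1 co-occur, words 2-3 co-occur.
docs = [[0, 1, 0, 1, 0]] * 5 + [[2, 3, 2, 3, 2]] * 5
nkw = lda_gibbs(docs, n_topics=2, n_vocab=4)
```

The whole sampler is a loop over tokens that decrements counts, samples from an easily computed conditional, and increments counts again, which is what makes Gibbs sampling attractive for these models.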

Outline

• Email, motivation
• ART Graphical Model
• Experimental Results
  – Enron Email (corpus)
  – Academic Email (one person)
• RART: Roles for ART
• Group-Topic Model
  – Experiments on voting data
  – Voting data from U.S. Senate and the U.N.

Enron Email Corpus

• 250k email messages
• 23k people

Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: [email protected]
To: [email protected]
Subject: Enron/TransAlta Contract dated Jan 1, 2001

Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.

DP

Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas 77002-7361
[email protected]

Topics, and prominent senders / receivers discovered by ART (topic names assigned by hand)

[Figure: discovered topics with top words and prominent senders/receivers]

Topics, and prominent senders / receivers discovered by ART

Beck = "Chief Operations Officer"
Dasovich = "Government Relations Executive"
Shapiro = "Vice President of Regulatory Affairs"
Steffes = "Vice President of Government Affairs"

Comparing Role Discovery

connection strength (A,B) is computed from:
– Traditional SNA: distribution over recipients
– Author-Topic: distribution over authored topics
– ART: distribution over authored topics

Comparing Role Discovery: Tracy Geaconne vs. Dan McCarty
(Geaconne = "Secretary", McCarty = "Vice President")

Traditional SNA: similar roles. Author-Topic: different roles. ART: different roles.

Comparing Role Discovery: Tracy Geaconne vs. Rod Hayslett
(Geaconne = "Secretary", Hayslett = "Vice President & CTO")

Traditional SNA: different roles. Author-Topic: very similar. ART: not very similar.

Comparing Role Discovery: Lynn Blair vs. Kimberly Watson
(Blair = "Gas pipeline logistics", Watson = "Pipeline facilities planning")

Traditional SNA: different roles. Author-Topic: very different. ART: very similar.

McCallum Email Corpus 2004

• January - October 2004
• 23k email messages
• 825 people

From: [email protected]
Subject: NIPS and ....
Date: June 14, 2004 2:27:41 PM EDT
To: [email protected]

There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for:

NIPS registration receipt.
CALO registration receipt.

Thanks,
Kate

McCallum Email Blockstructure

[Figure: block structure of the email network]

Four most prominent topics in discussions with ____?

[Figure: top topics for a particular correspondent]

Two most prominent topics in discussions with ____?

Topic 1 (word, prob):
love 0.030514; house 0.015402; [?] 0.013659; time 0.012351; great 0.011334; hope 0.011043; dinner 0.00959; saturday 0.009154; left 0.009154; ll 0.009009; [?] 0.008282; visit 0.008137; evening 0.008137; stay 0.007847; bring 0.007701; weekend 0.007411; road 0.00712; sunday 0.006829; kids 0.006539; flight 0.006539

Topic 2 (word, prob):
today 0.051152; tomorrow 0.045393; time 0.041289; ll 0.039145; meeting 0.033877; week 0.025484; talk 0.024626; meet 0.023279; morning 0.022789; monday 0.020767; back 0.019358; call 0.016418; free 0.015621; home 0.013967; won 0.013783; day 0.01311; hope 0.012987; leave 0.012987; office 0.012742; tuesday 0.012558

Outline

• Email, motivation
• ART Graphical Model
• Experimental Results
  – Enron Email (corpus)
  – Academic Email (one person)
• RART: Roles for ART
• Group-Topic Model
  – Experiments on voting data
  – Voting data from U.S. Senate and the U.N.

Role-Author-Recipient-Topic Models

Results with RART: People in "Role #3" in Academic Email

olc       lead Linux sysadmin
gauthier  sysadmin for CIIR group
irsystem  mailing list, CIIR sysadmins
system    mailing list for dept. sysadmins
allan     Prof., chair of "computing committee"
valerie   second Linux sysadmin
tech      mailing list for dept. hardware
steve     head of dept. I.T. support

Roles for allan (James Allan):
– Role #3: I.T. support
– Role #2: Natural Language researcher

Roles for pereira (Fernando Pereira):
– Role #2: Natural Language researcher
– Role #4: SRI CALO project participant
– Role #6: Grant proposal writer
– Role #10: Grant proposal coordinator
– Role #8: Guests at McCallum's house

Outline

• Email, motivation
• ART Graphical Model
• Experimental Results
  – Enron Email (corpus)
  – Academic Email (one person)
• RART: Roles for ART
• Group-Topic Model
  – Experiments on voting data
  – Voting data from U.S. Senate and the U.N.

ART & RART: Roles but not Groups
(Enron TransWestern Division)

Traditional SNA: block structured. Author-Topic: not. ART: not.

A Group Model: "Stochastic Blockstructures Model"

Group-Topic Model [Wang, Mohanty, McCallum 2005]

U.S. Senate Data Sets

• 3426 bills from 16 years of voting records from the U.S. Senate
• Yea / Nay / Abstain (absent) votes
• Each bill comes with an abstract (text describing the contents of the bill).

Topics Discovered

[Figure: topics discovered by a traditional "mixture of unigrams" vs. the Group-Topic Model]

Groups Discovered

[Figure: groups from the topic Education + Domestic, with an agreement index for each group]

Senators who Change Coalition Dependent on Topic

e.g. Senator Shelby (D-AL) votes:
– with the Republicans on Economic
– with the Democrats on Education + Domestic
– with a small group of maverick Republicans on Social Security + Medicaid

U.N. Data Set

• 931 U.N. resolutions, voted on by 192 countries, from 1990-2003
• Yes / No / Abstain votes
• A list of keywords summarizes the content of each resolution.
• Also, later experiments use resolutions from 1960-2003.

Topics Discovered

[Figure: topics discovered by a traditional mixture of unigrams vs. the Group-Topic Model]

Groups Discovered

[Figure: groups discovered from the U.N. voting data]

Groups and Topics, Trends over Time