Post on 01-Jan-2016

Statistical Models of (Social) Networks

Andrew McCallum

Computer Science Department

University of Massachusetts Amherst

Joint work with

Xuerui Wang, Natasha Mohanty, Andres Corrada

Workplace effectiveness ~ Ability to leverage network of acquaintances

But filling Contacts DB by hand is tedious, and incomplete.

Email Inbox Contacts DB

WWW

Automatically

Managing and Understanding Connections of People in our Email World

System Overview

Contact Info and Person Name Extraction

Person Name Extraction

Name Coreference

Homepage Retrieval

Social Network Analysis

Keyword Extraction

CRF

WWW

names

Email

An Example

To: “Andrew McCallum” mccallum@cs.umass.edu

Subject ...

First Name:

Andrew

Middle Name:

Kachites

Last Name:

McCallum

JobTitle: Associate Professor

Company: University of Massachusetts

Street Address:

140 Governor’s Dr.

City: Amherst

State: MA

Zip: 01003

Company Phone:

(413) 545-1323

Links: Fernando Pereira, Sam Roweis,…

Key Words:

Information extraction,

social network,…

Search for new people

Summary of Results

Contact info and name extraction performance (25 fields)

       Token Acc   Field Prec   Field Recall   Field F1
CRF    94.50       85.73        76.33          80.76

Example keywords extracted

Person               Keywords
William Cohen        Logic programming, Text categorization, Data integration, Rule learning
Daphne Koller        Bayesian networks, Relational models, Probabilistic models, Hidden variables
Deborah McGuinness   Semantic web, Description logics, Knowledge representation, Ontologies
Tom Mitchell         Machine learning, Cognitive states, Learning apprentice, Artificial intelligence
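The field-level F1 in the extraction results (field precision 85.73, field recall 76.33) is the harmonic mean of precision and recall. A small check, as a sketch; the function name `f1_score` is illustrative and not from the talk:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Field precision and recall reported for the CRF in the results table.
crf_f1 = f1_score(85.73, 76.33)
print(round(crf_f1, 2))  # matches the reported Field F1 of 80.76
```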

1. Expert Finding: When solving some task, find friends-of-friends with relevant expertise. Avoid “stove-piping” in large org’s by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job. (Hiring aid!)

2. Social Network Analysis: Understand the social structure of your organization. Suggest structural changes for improved efficiency.

Outline

• Social Network Analysis with (Language) Attributes

– Roles and Topics (Author-Recipient-Topic Model)

– Groups and Topics (Group-Topic Model)

• Demo: Rexa, a Web portal for researchers

Clustering words into topics with Latent Dirichlet Allocation

[Blei, Ng, Jordan 2003]

Generative Process:

For each document:
    Sample a distribution over topics, θ
    For each word in doc:
        Sample a topic, z
        Sample a word from the topic, w

Example:

    Distribution over topics: 70% Iraq war, 30% US election
    Topic: Iraq war
    Word: “bombing”
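The LDA generative process can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the topics are given rather than learned, and the toy vocabulary, the symmetric prior `alpha`, and all function names are assumptions made for the example.

```python
import random


def sample_categorical(probs, rng):
    """Draw an index i with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, n_words, rng):
    """topics[k] is a dict word -> probability (assumed given, not learned here)."""
    theta = sample_dirichlet(alpha, rng)           # per-document distribution over topics
    doc = []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)         # topic for this word
        words = list(topics[z])
        probs = [topics[z][w] for w in words]
        w = words[sample_categorical(probs, rng)]  # word drawn from topic z
        doc.append((z, w))
    return theta, doc


# Toy topics, invented for illustration only.
toy_topics = [
    {"bombing": 0.5, "troops": 0.3, "baghdad": 0.2},  # "Iraq war"
    {"ballot": 0.4, "campaign": 0.4, "senate": 0.2},  # "US election"
]
rng = random.Random(0)
theta, doc = generate_document(toy_topics, alpha=[1.0, 1.0], n_words=8, rng=rng)
```

Each entry of `doc` is a (topic, word) pair, so the hidden topic assignment that inference must later recover is visible in the sample.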

Example topics induced from a large collection of text

[Tenenbaum et al]

STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL

MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE

WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER

DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN

FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED

SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES

BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY

JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE

From LDA to Author-Recipient-Topic (ART)
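The step from LDA to ART can be sketched as follows, assuming the usual formulation of ART in which each word of a message picks one recipient uniformly from the recipient list and then draws its topic from a distribution specific to the (author, recipient) pair. The people, topics, and probabilities below are invented for illustration, and the function name is hypothetical.

```python
import random


def generate_message(author, recipients, theta, phi, n_words, rng):
    """theta[(author, recipient)] -> list of topic probabilities;
    phi[topic] -> dict word -> probability. Both are assumed given."""
    words = []
    for _ in range(n_words):
        x = rng.choice(recipients)                 # recipient chosen uniformly
        topic_probs = theta[(author, x)]
        z = rng.choices(range(len(topic_probs)), weights=topic_probs, k=1)[0]
        vocab = list(phi[z])
        w = rng.choices(vocab, weights=[phi[z][v] for v in vocab], k=1)[0]
        words.append((x, z, w))
    return words


# Invented toy parameters: two topics, one author, two recipients.
phi = [{"contract": 0.6, "draft": 0.4}, {"meeting": 0.7, "lunch": 0.3}]
theta = {("alice", "bob"): [0.9, 0.1], ("alice", "carol"): [0.2, 0.8]}
recipients = ["bob", "carol"]
message = generate_message("alice", recipients, theta, phi, 6, random.Random(1))
```

The only change from plain LDA is that the topic distribution is indexed by a sender-recipient pair instead of by a document.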

Inference and Estimation

Gibbs Sampling:
- Easy to implement
- Reasonably fast
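To make the "easy to implement" point concrete, here is a minimal collapsed Gibbs sampler for plain LDA. It is a sketch under stated assumptions, not the sampler used in the talk: an ART sampler would also resample a recipient for every word, and the symmetric priors `alpha` and `beta`, the toy corpus, and the function name are all invented for the example.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha, beta, n_iters, rng):
    """docs: list of lists of word ids. Returns assignments and count tables."""
    doc_topic = [[0] * n_topics for _ in docs]                # n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # n_{k,w}
    topic_total = [0] * n_topics                              # n_k
    z = []
    for d, doc in enumerate(docs):                            # random initialization
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        z.append(z_d)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                                   # remove current assignment
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional of z given all other assignments (theta, phi integrated out).
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights, k=1)[0]
                z[d][i] = k                                   # record new assignment
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return z, doc_topic, topic_word


toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
z, doc_topic, topic_word = gibbs_lda(toy_docs, 2, 4, 0.1, 0.01, 50, random.Random(0))
```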

Enron Email Corpus

• 250k email messages
• 23k people

Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: debra.perlingiere@enron.com
To: steve.hooser@enron.com
Subject: Enron/TransAlta Contract dated Jan 1, 2001

Please see below. Katalin Kiss of TransAlta has requested an electronic copy of our final draft? Are you OK with this? If so, the only version I have is the original draft without revisions.

DP

Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas 77002
dperlin@enron.com

Topics, and prominent senders / receivers discovered by ART

(Topic names assigned by hand)

Beck = “Chief Operations Officer”
Dasovich = “Government Relations Executive”
Shapiro = “Vice President of Regulatory Affairs”
Steffes = “Vice President of Government Affairs”

Comparing Role Discovery

connection strength (A,B) =

Traditional SNA: distribution over recipients
Author-Topic: distribution over authored topics
ART: distribution over authored topics
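Each method represents a person by a probability distribution, so comparing two people reduces to comparing two distributions. The slides do not say which measure was used; the sketch below assumes Jensen-Shannon divergence, a common symmetric choice, and the three distributions are invented.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, in [0, 1] with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Invented distributions (over topics or recipients) for three hypothetical people.
person_a = [0.7, 0.2, 0.1]
person_b = [0.6, 0.3, 0.1]
person_c = [0.05, 0.05, 0.9]

# Lower divergence means more similar roles under this measure.
print(js_divergence(person_a, person_b) < js_divergence(person_a, person_c))
```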

Comparing Role Discovery: Tracy Geaconne and Dan McCarty

Traditional SNA    Author-Topic       ART
Similar roles      Different roles    Different roles

Geaconne = “Secretary”
McCarty = “Vice President”

Comparing Role Discovery: Tracy Geaconne and Rod Hayslett

Traditional SNA    Author-Topic       ART
Different roles    Very similar       Not very similar

Geaconne = “Secretary”
Hayslett = “Vice President & CTO”

Comparing Role Discovery: Lynn Blair and Kimberly Watson

Traditional SNA    Author-Topic       ART
Different roles    Very different     Very similar

Blair = “Gas pipeline logistics”
Watson = “Pipeline facilities planning”

McCallum Email Corpus 2004

• January - October 2004
• 23k email messages
• 825 people

From: kate@cs.umass.edu
Subject: NIPS and ....
Date: June 14, 2004 2:27:41 PM EDT
To: mccallum@cs.umass.edu

There is pertinent stuff on the first yellow folder that is completed either travel or other things, so please sign that first folder anyway. Then, here is the reminder of the things I'm still waiting for:

NIPS registration receipt.
CALO registration receipt.

Thanks,
Kate

McCallum Email Blockstructure

Four most prominent topics in discussions with ____?

Two most prominent topics in discussions with ____?

Words      Prob
love       0.030514
house      0.015402
           0.013659
time       0.012351
great      0.011334
hope       0.011043
dinner     0.00959
saturday   0.009154
left       0.009154
ll         0.009009
           0.008282
visit      0.008137
evening    0.008137
stay       0.007847
bring      0.007701
weekend    0.007411
road       0.00712
sunday     0.006829
kids       0.006539
flight     0.006539

Words      Prob
today      0.051152
tomorrow   0.045393
time       0.041289
ll         0.039145
meeting    0.033877
week       0.025484
talk       0.024626
meet       0.023279
morning    0.022789
monday     0.020767
back       0.019358
call       0.016418
free       0.015621
home       0.013967
won        0.013783
day        0.01311
hope       0.012987
leave      0.012987
office     0.012742
tuesday    0.012558

Pairs with highest rank difference between ART & SNA

5 other professors
3 other ML researchers

Role-Author-Recipient-Topic Models

Results with RART: People in “Role #3” in Academic Email

• olc        lead Linux sysadmin
• gauthier   sysadmin for CIIR group
• irsystem   mailing list CIIR sysadmins
• system     mailing list for dept. sysadmins
• allan      Prof., chair of “computing committee”
• valerie    second Linux sysadmin
• tech       mailing list for dept. hardware
• steve      head of dept. I.T. support

Roles for allan (James Allan)

• Role #3    I.T. support
• Role #2    Natural Language researcher

Roles for pereira (Fernando Pereira)

• Role #2    Natural Language researcher
• Role #4    SRI CALO project participant
• Role #6    Grant proposal writer
• Role #10   Grant proposal coordinator
• Role #8    Guests at McCallum’s house

ART: Roles but not Groups

Traditional SNA     Author-Topic    ART
Block structured    Not             Not

Enron TransWestern Division

Outline

• Social Network Analysis with (Language) Attributes

– Roles and Topics (Author-Recipient-Topic Model)

– Groups and Topics (Group-Topic Model)

• Demo: Rexa, a Web portal for researchers

Groups and Topics

• Input:
– Observed relations between people
– Attributes on those relations (text, or categorical)

• Output:
– Attributes clustered into “topics”
– Groups of people, varying depending on topic

Discovering Groups from Observed Set of Relations

Admiration relations among six high school students.

Student Roster

Adams, Bennett, Carter, Davis, Edwards, Frederking

Academic Admiration

Acad(A, B)   Acad(C, B)
Acad(A, D)   Acad(C, D)
Acad(B, E)   Acad(D, E)
Acad(B, F)   Acad(D, F)
Acad(E, A)   Acad(F, A)
Acad(E, C)   Acad(F, C)

Adjacency Matrix Representing Relations

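[Figure: adjacency matrices of the Academic Admiration relation. First in roster order A B C D E F, with group labels G1 G2 G1 G2 G3 G3; then with rows and columns reordered as A C B D E F, with group labels G1 G1 G2 G2 G3 G3.]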
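The reordering shown in the adjacency-matrix figure can be reproduced directly from the Academic Admiration list and the group labels on the slide. A short sketch; the function and variable names are illustrative:

```python
students = ["A", "B", "C", "D", "E", "F"]
acad = [("A", "B"), ("C", "B"), ("A", "D"), ("C", "D"), ("B", "E"), ("D", "E"),
        ("B", "F"), ("D", "F"), ("E", "A"), ("F", "A"), ("E", "C"), ("F", "C")]
groups = {"A": "G1", "C": "G1", "B": "G2", "D": "G2", "E": "G3", "F": "G3"}


def adjacency(order, relations):
    """0/1 matrix with rows as admirers and columns as admired, in the given order."""
    index = {s: i for i, s in enumerate(order)}
    matrix = [[0] * len(order) for _ in order]
    for src, dst in relations:
        matrix[index[src]][index[dst]] = 1
    return matrix


roster_matrix = adjacency(students, acad)

# Sorting by group puts members of the same group next to each other: A C B D E F.
block_order = sorted(students, key=lambda s: (groups[s], s))
block_matrix = adjacency(block_order, acad)

for name, row in zip(block_order, block_matrix):
    print(name, " ".join(str(v) for v in row))
```

In the reordered matrix the ones form solid blocks: G1 admires G2, G2 admires G3, and G3 admires G1.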

Group Model: Partitioning Entities into Groups

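Stochastic Blockstructures for Relations [Nowicki, Snijders 2001]

[Graphical model; recoverable labels: relation v (Binomial), replicated over S² entity pairs; parameter γ replicated over G² group pairs, with Beta prior β; group assignment g (Multinomial), replicated over S entities, with Dirichlet prior α.]

S: number of entities

G: number of groups

Enhanced with arbitrary number of groups in [Kemp, Griffiths, Tenenbaum 2004]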
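A generative sketch of a stochastic blockstructure model, following the distribution labels on the slide (Dirichlet, Multinomial, Beta, Binomial). The symmetric hyperparameters, the use of ordered group pairs, and the names `group_probs` and `link_prob` are assumptions made for the example.

```python
import random


def sample_block_model(n_entities, n_groups, alpha, beta, rng):
    # Group proportions: symmetric Dirichlet(alpha) via normalized Gamma draws.
    raw = [rng.gammavariate(alpha, 1.0) for _ in range(n_groups)]
    total = sum(raw)
    group_probs = [r / total for r in raw]
    # Group assignment of each entity (Multinomial).
    g = rng.choices(range(n_groups), weights=group_probs, k=n_entities)
    # Link probability for each ordered pair of groups: Beta(beta, beta).
    link_prob = [[rng.betavariate(beta, beta) for _ in range(n_groups)]
                 for _ in range(n_groups)]
    # Relation indicator for each ordered pair of entities (one Bernoulli trial).
    v = [[1 if rng.random() < link_prob[g[i]][g[j]] else 0
          for j in range(n_entities)]
         for i in range(n_entities)]
    return g, link_prob, v


g, link_prob, v = sample_block_model(6, 3, 1.0, 1.0, random.Random(0))
```

Inference runs this process in reverse: given only `v`, recover the group assignments `g`.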

Two Relations with Different Attributes

[Figure: two adjacency matrices. Academic Admiration, with entities ordered A C B D E F and group labels G1 G1 G2 G2 G3 G3; Social Admiration, with entities ordered A C E B D F and group labels G1 G1 G1 G2 G2 G2.]

Social Admiration

Soci(A, B)   Soci(A, D)   Soci(A, F)
Soci(B, A)   Soci(B, C)   Soci(B, E)
Soci(C, B)   Soci(C, D)   Soci(C, F)
Soci(D, A)   Soci(D, C)   Soci(D, E)
Soci(E, B)   Soci(E, D)   Soci(E, F)
Soci(F, A)   Soci(F, C)   Soci(F, E)

Goal: Model relations and their (textual) attributes simultaneously to obtain better groups and more meaningful topics.

budget, funding, annual, cash

document, corrections, review, annual

The Group-Topic Model: Discovering Groups and Topics Simultaneously

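[Graphical model; recoverable labels: words w (Multinomial), N_b per event, over B events; topic t per event (Uniform); topic-word distributions φ over T topics, with Dirichlet prior η; relation v (Binomial), replicated over S² entity pairs; parameter γ replicated over G² group pairs, with Beta prior β; group assignment g (Multinomial), replicated over S entities and T topics, with Dirichlet prior α.]

Inference and Estimation

Gibbs Sampling:
- Many r.v.s can be integrated out
- Easy to implement
- Reasonably fast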

We assume the relationship is symmetric.
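One way to read the Group-Topic model generatively, for a single event such as a vote on a bill, is sketched below. It assumes that the topic-specific group assignments are already sampled, that the relation is a symmetric agree/disagree indicator with one draw per unordered pair, and that the Beta prior is symmetric; the names and toy values are invented (the topic words are taken from the goal slide).

```python
import random


def generate_event(group_of, phi, n_words, n_groups, beta, rng):
    """group_of[t][s]: group of entity s under topic t (assumed already sampled).
    phi[t]: dict word -> probability for topic t."""
    t = rng.randrange(len(phi))                       # topic of this event, uniform
    vocab = list(phi[t])
    words = rng.choices(vocab, weights=[phi[t][w] for w in vocab], k=n_words)
    # Agreement probability for each unordered pair of groups, drawn per event.
    gamma = [[0.0] * n_groups for _ in range(n_groups)]
    for a in range(n_groups):
        for b in range(a, n_groups):
            gamma[a][b] = gamma[b][a] = rng.betavariate(beta, beta)
    n_entities = len(group_of[t])
    agree = {}
    for i in range(n_entities):
        for j in range(i + 1, n_entities):            # symmetric: one draw per pair
            p = gamma[group_of[t][i]][group_of[t][j]]
            agree[(i, j)] = 1 if rng.random() < p else 0
    return t, words, agree


phi = [{"budget": 0.5, "funding": 0.5}, {"document": 0.5, "review": 0.5}]
group_of = [[0, 0, 1, 1, 2, 2], [0, 0, 0, 1, 1, 1]]   # groups differ by topic
t, words, agree = generate_event(group_of, phi, 5, 3, 1.0, random.Random(0))
```

The key point is that `group_of` is indexed by topic, so the same entity can belong to different groups depending on what the event is about.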

Dataset #1: U.S. Senate

• 16 years of voting records in the US Senate (1989 – 2005)

• a Senator may respond Yea or Nay to a resolution

• 3423 resolutions with text attributes (index terms)

• 191 Senators in total across 16 years

S.543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance funds, recapitalize the Bank Insurance Fund, improve supervision and regulation of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991) Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No: 102-242.
Index terms: Banks and banking, Accounting, Administrative fees, Cost control, Credit, Deposit insurance, Depressed areas, and other 110 terms

Adams (D-WA), Nay Akaka (D-HI), Yea Bentsen (D-TX), Yea Biden (D-DE), Yea Bond (R-MO), Yea Bradley (D-NJ), Nay Conrad (D-ND), Nay ……

Topics Discovered (U.S. Senate)

Mixture of Unigrams

Education     Energy      Military Misc.   Economic
education     energy      government       federal
school        power       military         labor
aid           water       foreign          insurance
children      nuclear     tax              aid
drug          gas         congress         tax
students      petrol      aid              business
elementary    research    law              employee
prevention    pollution   policy           care

Group-Topic Model

Education + Domestic   Foreign        Economic    Social Security + Medicare
education              foreign        labor       social
school                 trade          insurance   security
federal                chemicals      tax         insurance
aid                    tariff         congress    medical
government             congress       income      care
tax                    drugs          minimum     medicare
energy                 communicable   wage        disability
research               diseases       business    assistance

Groups Discovered (US Senate)

Groups from topic Education + Domestic

Senators Who Change Coalition the most Dependent on Topic

e.g. Senator Shelby (D-AL) votes
with the Republicans on Economic,
with the Democrats on Education + Domestic,
with a small group of maverick Republicans on Social Security + Medicare

Dataset #2: The UN General Assembly

• Voting records of the UN General Assembly (1990 - 2003)

• A country may choose to vote Yes, No or Abstain

• 931 resolutions with text attributes (titles)

• 192 countries in total

• Also experiments later with resolutions from 1960-2003

Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting

The draft resolution on permanent sovereignty of the Palestinian people in the occupied Palestinian territory, including Jerusalem, and of the Arab population in the occupied Syrian Golan over their natural resources (document A/54/591) was adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions:

In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France, Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama, Russian Federation, South Africa, Spain, Turkey, and other 126 countries. Against: Israel, Marshall Islands, United States. Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.

Topics Discovered (UN)

Mixture of Unigrams

Everything Nuclear   Human Rights   Security in Middle East
nuclear              rights         occupied
weapons              human          israel
use                  palestine      syria
implementation       situation      security
countries            israel         calls

Group-Topic Model

Nuclear Non-proliferation   Nuclear Arms Race   Human Rights
nuclear                     nuclear             rights
states                      arms                human
united                      prevention          palestine
weapons                     race                occupied
nations                     space               israel

Groups Discovered (UN)

The country lists for each group are ordered by 2005 GDP (PPP), and only 5 countries are shown for groups that have more than 5 members.

Do We Get Better Groups with the GT Model?

Baseline Model

1. Cluster bills into topics using mixture of unigrams;

2. Apply group model on topic-specific subsets of bills.

GT Model

1. Jointly cluster topics and groups at the same time using the GT model.

Agreement Index (AI) measures group cohesion. Higher is better.

Datasets   Avg. AI for Baseline   Avg. AI for GT   p-value
Senate     0.8198                 0.8294           <.01
UN         0.8548                 0.8664           <.01
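The slides do not define the Agreement Index. One definition used in roll-call analysis (due to Hix, Noury, and Roland) scores a group on a single vote as the size of its largest voting bloc minus half of the remaining votes, divided by the total. Assuming that definition, a sketch:

```python
def agreement_index(yes: int, no: int, abstain: int = 0) -> float:
    """AI for one group on one roll call, under the assumed definition:
    1.0 when the group votes as one bloc, 0.0 when split evenly three ways."""
    total = yes + no + abstain
    if total == 0:
        raise ValueError("no votes cast")
    largest = max(yes, no, abstain)
    return (largest - 0.5 * (total - largest)) / total


# Hypothetical group of ten members: eight vote Yes, two vote No.
print(agreement_index(8, 2))
```

Averaging this score over groups and roll calls gives a single cohesion number per model, which is the kind of quantity compared in the table.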

Groups and Topics, Trends over Time (UN)

Outline

• Social Network Analysis with (Language) Attributes

– Roles and Topics (Author-Recipient-Topic Model)

– Groups and Topics (Group-Topic Model)

• Demo: Rexa, a Web portal for researchers

Previous Systems

Research Paper

Cites

Previous Systems

Research Paper

Cites

Person

University

Venue

Grant

Groups

Expertise

More Entities and Relations

Outline

• Examples of IE and Data Mining.

• Brief introduction of Conditional Random Fields

• Joint inference: Motivation and examples

– Joint Labeling of Cascaded Sequences (Belief Propagation)

– Joint Labeling for Transfer Learning (Piecewise Training & BP)

– Joint Labeling of Distant Entities (BP by Tree Reparameterization)

– Joint Co-reference Resolution (Graph Partitioning)

– Joint Segmentation and Co-ref (Sparse BP)

• Joint Topic Discovery and Social Network Analysis

– Roles and Topics (Author-Recipient-Topic Model)

– Groups and Topics (Group-Topic Model)

• Demo: Rexa, a Web portal for researchers

End of Talk

Summary

• Traditionally, SNA examines links, but not the language content on those links.

• Presented ART, a Bayesian network for messages sent in a social network: captures topics and role-similarity.

• RART explicitly represents roles.

• Additional work
– Group-Topic model discovers groups and clusters attributes of relations. [Wang, Mohanty, McCallum, LinkKDD 2005]