Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Page 1: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine

Department of Computer Science, University of California, Irvine

KDD Program Review, November 18th 2003

Entity-Based Data Mining from Spatio-Temporal Events and Text Sources

Presentation at KD-D Program Review, Nov 18-19 2003

Padhraic Smyth, Sharad Mehrotra

Information and Computer Science, University of California, Irvine

{smyth, sharad}@ics.uci.edu
www.datalab.uci.edu

Page 2: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Project Participants

• Principal Investigators
– Padhraic Smyth: Data mining
– Sharad Mehrotra: Databases

• Collaborators
– Mark Steyvers: Text and Author Modeling

• Postdoctoral Researchers
– Michal Rosen-Zvi, Dmitri Kalashnikov

• Staff Programmer
– Amnon Meyers: Information Extraction

• Students
– PhD: Joshua O'Madadhain, Scott White, Yiming Ma, Dawit Seid
– Undergraduates: Yan-Biao Boey, Momo Alhazzazi

• Acknowledgements
– Steve Lawrence for CiteSeer data

Page 3: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Problem of Interest

• Intelligence Analysis today
– Massive volumes/streams of data
  • Text (newswire, reports, etc.)
  • Web data
  • Transactions/events

• Central problems
– Need flexible tools to support an analyst’s exploration of the data
– Automatically focus an analyst’s attention on interesting parts of the data space
– Need new theories/methods/tools

Page 4: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Entities and Events

• Entities = Individuals, groups, communities, organizations, etc.
• Events = Contacts, collaborations, meetings, products, etc.

• Working hypothesis
– A large component of intelligence work is centered on entities and events
  • Extracting entity information from text streams and transaction data
  • Predicting entity behavior
  • Detecting groups of related entities

• Our broad goal
– Develop next-generation data management, exploration, and analysis tools for entity-event data

Page 5: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


[Figure: collaboration graph]
Nodes = Entities = Biotech-Related Organizations
Edges = Events = Collaborations

Page 6: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Red indicates nodes selected by the data analyst as important

Page 7: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Algorithm determines blue nodes are important relative to red nodes (Oxford and Cambridge)

Page 8: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Research Issues
- Information extraction
- Data management tools
- Visualization techniques
- Interactive ad hoc querying and mining
- Statistical modeling of graph data
- Query languages for graphs
- Scalability to large graphs
- …

Page 9: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Focus of Our Research

[Diagram with components labeled Text Sources, Information Extraction, Entity-Event Databases, Statistical Modeling and Data Mining, Visualization, Query Languages, and User Modeling]

Page 10: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Major Themes in Our Work

• Focus on data in the form of graphs
– Nodes = entities, edges = events
– Nodes and edges have attributes (e.g., temporal)
– Year 1: entities = computer science researchers
– Year 1: limited spatio-temporal aspects

• Integration and coupling of
– Statistical modeling and data mining
– Visualization
– Query languages and data management

• Scalability
– Methods should scale to millions of nodes and edges

• User Interaction
– Conditional “query-driven” analysis and mining
– Contrast with offline global modeling

Page 11: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Accomplishments

• Infrastructure and Data Sets
– Created testbed data sets, e.g., 100k entities, 400k events
– Developed suite of text information extraction tools
– Developed and released a general public-domain JAVA API for graph data analysis and visualization

• Statistical Modeling and Data Mining
– Developed new statistical technique for modeling entities based on authored text
– Developed new class of scalable algorithms for interactive graph-based data mining

Page 12: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Accomplishments

• Graph-based Querying
– Developed framework for general graph-based query language
– New accurate and efficient algorithms for interactive similarity queries and query refinement on graphs

• Software Tools
– Netsight: JAVA-based graph visualization and analysis tool
– Browser tool for exploring author-topic models
– Interactive query refinement system
– Prototype system for graph-based query language for interacting with heterogeneous graph data

Page 13: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Publications in Year 1

• Data Mining on Graphs
– S. White and P. Smyth, Algorithms for Discovering Relative Importance In Graphs, Proceedings of the Ninth International ACM SIGKDD Conference, August 2003. Extended version submitted to JICRD, June 2003.
– J. O'Madadhain, D. Fisher, S. White, and Y. Boey, The JUNG (Java Universal Network/Graph) Framework, UCI-ICS Tech Report 03-17, October 2003; invited presentation, Stanford Workshop on Statistical Inference, Computing and Visualization for Graphs, August 2003.
– P. Baldi, P. Frasconi, and P. Smyth, Modeling the Internet and the Web: Probabilistic Methods and Algorithms, Wiley, June 2003.

• Statistical Author-Topic Models
– T. Griffiths and M. Steyvers (in press), Finding Scientific Topics, Proceedings of the National Academy of Sciences.
– M. Steyvers, M. Rosen-Zvi, T. Griffiths, P. Smyth, Author Attribution with LDA, NIPS workshop on Syntax, Semantics, and Statistics, December 2003.

• Data Management and Graph Querying
– Y. Ma, S. Mehrotra, D. Seid, A Framework for Refining Similarity Queries Using Learning Techniques, UCI-ICS Tech Report 03-19, Nov. 2003. Extended version submitted to EDBT 2004.
– Y. Ma, D. Seid, S. Mehrotra, Interactive Filtering of Data Streams by Refining Similarity Queries, UCI-ICS Tech Report 03-07, June 2003.
– D. Seid, M. Ortega-Binderberger, Z. Chen, and S. Mehrotra, Evaluating Top-k Selection and Preference Queries on Multiple Indexed Attributes. Submitted to EDBT'04.
– D. Seid and S. Mehrotra, Complex Analytical Queries on Graphs and Hierarchies (in preparation).
– L. Jin, C. Li, S. Mehrotra, Efficient Record Linkage in Large Data Sets, 8th International Conference on Database Systems for Advanced Applications (DASFAA 2003), 26-28 March 2003, Kyoto, Japan.

Page 14: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Data Sets

Dataset                Documents      Entities     Extracted Abstracts      Words
CiteSeer               363K           100K         163K                     12M
NSF Abstracts          129K           199K         129K                     10M+
US Comp Science Depts  294 web sites  14K faculty  67K extracted citations  -

Page 15: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Information Extraction

Extractor            Field        Dataset               Upper Bound  Number Extracted  Estimated Accuracy
CSNames              Author       CiteSeer              503K         467K              90-100%
CSWord               Abstract     CiteSeer              363K         163K              65-75%
Publication Crawler  Publication  US CS Dept Web Sites  -            67K               -

Page 16: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Author Database Schema

[Figure: entity-relationship diagram of the author database. Labels recoverable from the diagram: Author, Paper, Source, ISA, Conference, Journal, Newsletters, Field, Sponsor, Fund, Publisher, Institute, WebInfo, write, Publish, Refer (From, To), Parent (From, To), Read, CoRead, Also Read, has, P_S, P_F, S_F, AID, PID, SID, FID, IID, PubID, SpID, FstNm, MidNm, LstNm, Position, Description, Desc., Title, Keyword, Abs, Full Text Index, Text Index, URL, Date, Name, Type, Location, Vol, Iss, ISSN, Pub_name, PageNoFrm, PageNoTo]

Note: “individual-centric” not “document-centric”

Page 17: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Page 18: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


“9/11 Network”

Page 19: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


From graphs to Markov chains

• Importance = recursive function of nodes pointing at you

[Figure: graph on nodes A, B, C, D with edge weights 3, 4, 2, 2]

Page 20: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


From graphs to Markov chains

• Importance = recursive function of nodes pointing at you

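[Figure: graph on nodes A, B, C, D with edge weights 3, 4, 2, 2, shown next to the corresponding Markov chain on the same nodes with transition probabilities 1.0, 0.33, 0.6, 0.33, 0.77, 0.5, 0.4, 0.5]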

Page 21: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


From graphs to Markov chains

• Importance = recursive function of nodes pointing at you

• Markov approach
– Notion of a “token” circulating around in Markov fashion
– Important actors see the token more often
– Importance = stationary probability of each node
– PageRank: surfer randomly following links on the Web

[Figure: weighted graph on nodes A, B, C, D shown next to the corresponding Markov chain, as on the preceding page]
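The step from a weighted graph to a Markov chain can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the four-node edge weights below are hypothetical (the edges of the slide's figure cannot be recovered from the transcript), and the function names `transition_matrix` and `stationary_distribution` are invented for this example. Outgoing weights are normalized into transition probabilities, and importance is read off as the stationary probability, found here by power iteration.

```python
import numpy as np

def transition_matrix(weights):
    """Row-normalize a nonnegative weight matrix into transition probabilities."""
    weights = np.asarray(weights, dtype=float)
    row_sums = weights.sum(axis=1, keepdims=True)
    if np.any(row_sums == 0):
        raise ValueError("every node needs at least one outgoing edge")
    return weights / row_sums

def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Power iteration: repeatedly push a distribution through the chain."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)          # start from the uniform distribution
    for _ in range(max_iter):
        nxt = pi @ P
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
    return pi

# Hypothetical weighted graph; rows/columns are nodes A, B, C, D.
# W[i, j] is the weight of the edge from node i to node j.
W = np.array([
    [0, 3, 2, 0],   # A -> B (3), A -> C (2)
    [0, 0, 4, 0],   # B -> C (4)
    [2, 0, 0, 2],   # C -> A (2), C -> D (2)
    [1, 0, 0, 0],   # D -> A (1)
])

P = transition_matrix(W)
importance = stationary_distribution(P)
for name, score in zip("ABCD", importance):
    print(f"{name}: {score:.4f}")
```

Power iteration converges here because the example chain is irreducible and aperiodic; a graph with dead ends or disconnected parts would need extra handling that this sketch omits.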

Page 22: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


[Figure: example graph with nodes A, B, C, D, E, F, G]

Page 23: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Page 24: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


[Figure: example graph with nodes A, B, C, D, E, F, G]

Relative importance of node V to A: trade off [distance from A, structural importance of V]

Page 25: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


[Figure: example graph with nodes A, B, C, D, E, F, G]

Add backlinks to A with some probability (e.g., 0.3)
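Adding backlinks to A can be read as a random walk that, at every step, jumps back to A with a fixed probability and otherwise follows an outgoing edge. The sketch below illustrates that reading; it is an assumption-laden example rather than the published implementation. The transition matrix `P` is a hypothetical four-node chain, and the names `pagerank_with_priors`, `prior`, and `beta` are chosen for this example (with `beta = 0.3` mirroring the probability mentioned on the slide).

```python
import numpy as np

def pagerank_with_priors(P, prior, beta=0.3, tol=1e-12, max_iter=1000):
    """Iterate pi <- (1 - beta) * pi P + beta * prior until it stops changing.

    P     : row-stochastic transition matrix (each row sums to 1)
    prior : nonnegative weights on the root node(s); normalized internally
    beta  : probability of jumping back to the root set at each step
    """
    P = np.asarray(P, dtype=float)
    prior = np.asarray(prior, dtype=float)
    prior = prior / prior.sum()
    pi = prior.copy()
    for _ in range(max_iter):
        nxt = (1.0 - beta) * (pi @ P) + beta * prior
        if np.abs(nxt - pi).sum() < tol:
            return nxt
        pi = nxt
    return pi

# Hypothetical chain on nodes A, B, C, D (not taken from the slides).
P = np.array([
    [0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [1.0, 0.0, 0.0, 0.0],
])
prior = np.array([1.0, 0.0, 0.0, 0.0])   # all prior mass on the root node A

scores = pagerank_with_priors(P, prior, beta=0.3)
for name, score in zip("ABCD", scores):
    print(f"{name}: {score:.4f}")
```

Larger values of `beta` keep the walk closer to A, so the ranking leans toward distance from A; smaller values let the global structure of the graph dominate.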

Page 26: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Algorithms for Relative Importance
(S. White and P. Smyth, ACM KDD 2003; also JICRD, submitted)

• PageRank with Priors (PRankP)
– Random walks that start from A and return to A periodically
– Relative importance = stationary probability
– Iterative algorithm (e.g., Haveliwala, 2002)

• HITS with priors
– Formulate HITS as Markov chain, same idea

• K-Step Markov
– Use the transient probability distribution starting from A
– Faster than stationary probability methods

• Weighted Paths
– Heuristic approximation to K-step Markov: even faster

• All algorithms scale linearly in number of edges
– Different constant factors
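One plausible reading of the K-Step Markov idea is sketched below: start a walk at A, take K steps, and score each node by how much probability mass it receives along the way. This is a hedged illustration, not the authors' implementation. The chain `P` is hypothetical, the function name `k_step_markov` is invented here, and dividing by K (so the scores sum to one) is a presentational choice of this example that does not change the ranking.

```python
import numpy as np

def k_step_markov(P, start, k=6):
    """Average the distributions reached after 1, 2, ..., k steps from `start`."""
    P = np.asarray(P, dtype=float)
    dist = np.asarray(start, dtype=float)
    dist = dist / dist.sum()
    total = np.zeros_like(dist)
    for _ in range(k):
        dist = dist @ P        # one more step of the walk
        total += dist          # accumulate the transient distribution
    return total / k

# Hypothetical chain on nodes A, B, C, D (not taken from the slides).
P = np.array([
    [0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [1.0, 0.0, 0.0, 0.0],
])
start_at_A = np.array([1.0, 0.0, 0.0, 0.0])

one_step = k_step_markov(P, start_at_A, k=1)
six_step = k_step_markov(P, start_at_A, k=6)
print(dict(zip("ABCD", np.round(six_step, 4))))
```

The loop performs exactly K vector-matrix products with no convergence test. With a sparse matrix each product costs time proportional to the number of edges, which is consistent with the slide's remark that the method is faster than iterating to a stationary distribution; the dense array here is only for brevity.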

Page 27: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Computation Times for Ranking Algorithms (in seconds)

Data       Number of Nodes  Number of Edges  WeightedPaths  KStep Markov (K=6)  PrankWithPriors  HITSWithPriors
Terrorist  63               308              0.01           0.28                1.17             0.57
Biotech    3k               13k              0.02           0.39                3.45             3.64
Author1    30k              88k              0.05           1.11                10.80            11.30
Author2    30k              88k              0.04           1.55                17.06            17.99

PRankP and HITS converged in 20-30 iterations

Page 28: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Page 29: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Page 30: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Page 31: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Weighted versus Unweighted Graphs

Rank  PRankP on Unweighted Graph  PRankP on Weighted Graph
1     Thrun                       Thrun
2     Fisher                      McCallum
3     Kononenko                   Nigam
4     Dzeroski                    Freitag
5     Freitag                     Blum
6     Bratko                      Slattery
7     Cheng                       Joachims
8     MCDermott                   Fox

Page 32: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Visualization and Analysis Software

Page 33: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


JUNG: Java Universal Network/Graph Framework

http://jung.sourceforge.net

16,000 page visits and 800 downloads since August

Page 34: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Demo of Netsight software

Page 35: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Entity Models from Text Data

Page 36: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


[Figure: diagram linking Authors A1, A2, ..., Ak to Words w1, w2, w3, ..., wN]

Can we model authors, given documents?

(more generally, build statistical profiles of entities given sparse observed data)

Page 37: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


[Figure: three-layer diagram with Authors A1, A2, ..., Ak, Hidden Topics T1, T2, and Words w1, w2, w3, ..., wN]

Model = Author-Topic distributions + Topic-Word distributions

Parameters learned via Bayesian learning
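The two sets of distributions suggest a generative story for each word in a document, sketched below in Python. The slide does not spell out the sampling steps, so the details are assumptions of this example: one of the document's authors is picked uniformly at random for each word, a topic is drawn from that author's topic distribution, and a word is drawn from that topic's word distribution. The arrays `theta` (author-topic) and `phi` (topic-word) hold made-up toy values, and `generate_document` is a hypothetical name.

```python
import numpy as np

def generate_document(author_ids, theta, phi, n_words, rng):
    """Sample (author, topic, word) triples for one document.

    theta[a, t] : probability of topic t for author a
    phi[t, w]   : probability of word w under topic t
    """
    tokens = []
    for _ in range(n_words):
        a = int(rng.choice(author_ids))                   # assumed: uniform over co-authors
        t = int(rng.choice(theta.shape[1], p=theta[a]))   # topic from the author's mix
        w = int(rng.choice(phi.shape[1], p=phi[t]))       # word from the topic
        tokens.append((a, t, w))
    return tokens

# Toy parameters: 2 authors, 2 topics, 4 vocabulary words.
theta = np.array([
    [1.0, 0.0],    # author 0 writes only about topic 0
    [0.0, 1.0],    # author 1 writes only about topic 1
])
phi = np.array([
    [0.5, 0.5, 0.0, 0.0],   # topic 0 uses words 0 and 1
    [0.0, 0.0, 0.5, 0.5],   # topic 1 uses words 2 and 3
])

rng = np.random.default_rng(0)
tokens = generate_document([0, 1], theta, phi, n_words=20, rng=rng)
print(tokens[:5])
```

Learning runs this story in reverse: given only the words and the author lists, the hidden author and topic assignments are inferred, and `theta` and `phi` are estimated from them.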

Page 38: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Page 39: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Page 40: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Page 41: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Page 42: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Page 43: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Page 44: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


[Figure: two-layer diagram with Hidden Topics T1, T2 and Words w1, w2, w3, ..., wN]

“Topic Model”:
- document can be generated from multiple topics
- Hofmann (SIGIR ’99), Blei, Jordan, Ng (JMLR, 2003)

Page 45: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


[Figure: three-layer diagram with Authors A1, A2, ..., Ak, Hidden Topics T1, T2, and Words w1, w2, w3, ..., wN]

Model = Author-Topic distributions + Topic-Word distributions

NOTE: documents can be composed of multiple topics

Page 46: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Author Modeling Data Sets

Source    Documents  Unique Authors  Unique Words  Total Word Count
CiteSeer  163,389    85,465          30,799        11.7 million
CORA      13,643     11,427          11,101        1.2 million
NIPS      1,740      2,037           13,649        2.3 million

Page 47: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Topic Models from CiteSeer

WORDS: probabilistic, Bayesian, carlo, monte, distribution, inference, conditional, prior, mixture, Markov, posterior, belief……

AUTHORS: N_Friedman, D_Heckerman, Z_Ghahramani, D_Koller, M_Jordan, R_Neal, A_Raftery, T_Lukasiewicz, J_Halpern….

WORDS: retrieval, text, document, information, content, indexing, relevance, collection, query, IR, feedback….

AUTHORS: D. Oard, W_Croft, K_Jones, P_Schauble, E_Voorhees, A_Singhal, D_Hawking, J_Allan, A_Smeaton, M_Hearst,….

Page 48: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Topic Models from CiteSeer

WORDS: Web, user, world, wide, pages, www, site, internet, hypertext, hypermedia, content, links, page, navigation..

AUTHORS: S. Lawrence, B. Mobasher, M. Levene, D. Florescu, O. Etzioni, R_Studer, W. Hall, R. Fielding, J. Pitkow, M. Crovella,….

WORDS: data, mining, attributes, discovery, association, large, knowledge, databases, dataset, interesting, frequent, discover, sets….

AUTHORS: J. Han, R. Rastogi, M. Zaki, R. Ng, B. Liu, H. Mannila, S. Brin, H Liu, L. Holder, H. Toivonen…

Page 49: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Author-Topic Models from CiteSeer

• Author = A McCallum:
– Topic 1: classification, training, generalization, decision, data, …
– Topic 2: learning, machine, examples, reinforcement, inductive, …
– Topic 3: retrieval, text, document, information, content, …

• Author = H Garcia-Molina:
– Topic 1: query, index, data, join, processing, aggregate, …
– Topic 2: transaction, concurrency, copy, permission, distributed, …
– Topic 3: source, separation, paper, heterogeneous, merging, …

• Author = P Cohen:
– Topic 1: agent, multi, coordination, autonomous, intelligent, …
– Topic 2: planning, action, goal, world, execution, situation, …
– Topic 3: human, interaction, people, cognitive, social, natural, …

Page 50: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Author-Topic Browser

• Interesting scalability issues
– CiteSeer model exceeds 1 Gbyte
– Real-time query answering demands Gibbs sampling (not well suited to SQL!)

• Solution
– Coupling of Gibbs sampling and relational DB (it works!)

[Diagram: system components labeled Original Text + Statistical Model, MySQL DB, SQL Interface, Bayesian Sampling, and JAVA Query GUI]

Page 51: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Demo of Author-Topic Browser

• Note
– Real-time querying on CiteSeer authors/documents
  • 85,000 authors
  • 163,000 documents
  • 30,000 unique words
  • 300 topics
– Can query on
  • Authors, topics, words, documents
– Topic distribution given documents/words requires sampling to estimate:
  • Gibbs sampling is fast enough to answer queries in real-time
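To make the sampling step concrete, the sketch below estimates a topic distribution for a bag of query words when the topic-word distributions are already learned and held fixed. It is a simplified, LDA-style illustration and not the browser's actual code: the author layer is left out, the toy matrix `phi`, the smoothing constant `alpha`, and the function name `infer_topic_mixture` are all assumptions made for this example.

```python
import numpy as np

def infer_topic_mixture(word_ids, phi, alpha=0.1, n_sweeps=200, burn_in=50, seed=0):
    """Gibbs sampling of per-word topic assignments with phi held fixed.

    phi[t, w] : probability of word w under topic t (already learned)
    Returns the estimated topic proportions for the query words.
    """
    rng = np.random.default_rng(seed)
    word_ids = np.asarray(word_ids)
    n_topics = phi.shape[0]
    z = rng.integers(n_topics, size=len(word_ids))        # random initial topics
    counts = np.bincount(z, minlength=n_topics).astype(float)
    running = np.zeros(n_topics)
    for sweep in range(n_sweeps):
        for i, w in enumerate(word_ids):
            counts[z[i]] -= 1                             # remove this word's topic
            p = phi[:, w] * (counts + alpha)              # conditional for its topic
            z[i] = rng.choice(n_topics, p=p / p.sum())
            counts[z[i]] += 1
        if sweep >= burn_in:                              # average after burn-in
            running += (counts + alpha) / (len(word_ids) + n_topics * alpha)
    return running / (n_sweeps - burn_in)

# Toy model: 2 topics over a 4-word vocabulary.
phi = np.array([
    [0.45, 0.45, 0.05, 0.05],
    [0.05, 0.05, 0.45, 0.45],
])
query = [0, 1, 0, 1, 0, 0, 1]      # words that topic 0 strongly prefers
mixture = infer_topic_mixture(query, phi)
print(np.round(mixture, 3))
```

Each sweep touches every query word once and only needs the matching column of `phi`, so the cost per query grows with the query length and the number of topics rather than with the size of the corpus.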

Page 52: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Applications of Author-Topic Models

• “Expert Finder”
– “Find researchers who are knowledgeable in cryptography and machine learning within 100 miles of Washington DC”
– “Find reviewers for this set of NSF proposals who are active in relevant topics and have no conflicts of interest”

• Prediction
– Given a document and some subset of known authors for the paper (k=0,1,2…), predict the other authors
– Predict how many papers in different topics will appear next year

• Change Detection/Monitoring
– Which authors are on the leading edge of new topics?
– Characterize the “topic trajectory” of this author over time
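As a rough illustration of the "Expert Finder" idea, authors can be ranked by how much of their writing falls in every one of the queried topics. The scoring rule below (the product of the author's probabilities for the query topics) is an assumption of this sketch, not a method stated in the slides, and the author names, the matrix `theta`, and the function `rank_experts` are hypothetical. Geographic filtering and conflict-of-interest checks are left out.

```python
import numpy as np

def rank_experts(theta, author_names, query_topics, top_n=3):
    """Rank authors by the product of their probabilities for the query topics.

    theta[a, t] : probability of topic t for author a
    """
    scores = theta[:, query_topics].prod(axis=1)   # high only if all topics are covered
    order = np.argsort(-scores)[:top_n]            # best first
    return [(author_names[i], float(scores[i])) for i in order]

# Toy author-topic matrix: 3 authors, 3 topics (say 0 = cryptography,
# 1 = machine learning, 2 = databases; the labels are made up).
theta = np.array([
    [0.50, 0.40, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.10, 0.80],
])
names = ["author_0", "author_1", "author_2"]

ranked = rank_experts(theta, names, query_topics=[0, 1])
print(ranked)
```

Using a product rather than a sum favors authors who cover both topics over specialists in only one, which matches the "cryptography and machine learning" flavor of the example query.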

Page 53: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


[Figure: Document and Word Distribution by Year in the UCI CiteSeer Data. x-axis: Year (1986 to 2002); y-axes: Number of Documents (0 to 2 x 10^4) and Number of Words (0 to 14 x 10^5)]

Page 54: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Rise in Web, Mobile, JAVA

[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (0 to 0.012). One curve is labeled "Web".]

Topics plotted (topic number: top words):
7: web, user, world, wide, users
80: mobile, wireless, devices, mobility, ad
76: java, remote, interface, platform, implementation
275: multicast, multimedia, media, delivery, applications
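A curve of this kind can be produced by counting, for each year, what fraction of word tokens is assigned to each topic. The transcript does not say how the plotted proportions were actually computed, so the sketch below is only one plausible recipe; the function name `topic_proportions_by_year` and the toy assignments are assumptions of this example.

```python
from collections import Counter, defaultdict

def topic_proportions_by_year(assignments):
    """assignments: iterable of (year, topic_id) pairs, one per word token.

    Returns {year: {topic_id: fraction of that year's tokens}}.
    """
    per_year = defaultdict(Counter)
    for year, topic in assignments:
        per_year[year][topic] += 1
    proportions = {}
    for year in sorted(per_year):
        total = sum(per_year[year].values())
        proportions[year] = {t: c / total for t, c in per_year[year].items()}
    return proportions

# Toy token-level assignments (year of the document, topic of the token).
toy = [(2000, 7), (2000, 7), (2000, 80), (2001, 80)]
trend = topic_proportions_by_year(toy)
print(trend)
```

Normalizing within each year matters because the number of documents per year varies widely, so raw counts would mostly reflect corpus growth rather than changes in topic popularity.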

Page 55: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Rise of Machine Learning

[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (1 x 10^-3 to 8 x 10^-3)]

Topics plotted (topic number: top words):
114: regression, variance, estimator, estimators, bias
153: classification, training, classifier, classifiers, generalization
205: data, mining, attributes, discovery, association

Page 56: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Bayes lives on…

[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (1.5 x 10^-3 to 5.5 x 10^-3)]

Topics plotted (topic number: top words):
189: statistical, prediction, correlation, predict, statistics
209: probabilistic, bayesian, probability, carlo, monte
276: random, distribution, probability, markov, distributions

Page 57: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Decline in Languages, OS, …

[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (2 x 10^-3 to 11 x 10^-3)]

Topics plotted (topic number: top words):
60: programming, language, concurrent, languages, implementation
139: system, operating, file, systems, kernel
283: collection, memory, persistent, garbage, stack
268: memory, cache, shared, access, performance

Page 58: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Decline in CS Theory, …

[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (2 x 10^-3 to 14 x 10^-3)]

Topics plotted (topic number: top words):
111: proof, theorem, proofs, proving, prover
156: polynomial, complexity, np, complete, hard
226: language, languages, semantics, syntax, constructs
235: logic, semantics, reasoning, logical, logics
252: computation, computing, complexity, compute, computations

Page 59: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Trends in Database Research

[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (1 x 10^-3 to 9 x 10^-3)]

Topics plotted (topic number: top words):
205: data, mining, attributes, discovery, association
261: transaction, transactions, concurrency, copy, copies
198: server, client, servers, clients, caching
82: library, access, digital, libraries, core

Page 60: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Trends in NLP and IR

[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (2 x 10^-3 to 8 x 10^-3). Curves are labeled "IR" and "NLP".]

Topics plotted (topic number: top words):
280: language, semantic, natural, linguistic, grammar
289: retrieval, text, documents, information, document

Page 61: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Security Research Reborn…

[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (1 x 10^-3 to 9 x 10^-3)]

Topics plotted (topic number: top words):
120: security, secure, access, key, authentication
240: key, attack, encryption, hash, keys

Page 62: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


(Not so) Hot Topics

[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (1 x 10^-3 to 7 x 10^-3). Curves are labeled "Neural Networks", "GAs", and "Wavelets".]

Topics plotted (topic number: top words):
23: neural, networks, network, training, learning
35: wavelet, operator, operators, basis, coefficients
242: genetic, evolutionary, evolution, population, ga

Page 63: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Decline in use of Greek Letters

[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (1.5 x 10^-3 to 5 x 10^-3)]

Topic plotted (topic number: top words):
157: gamma, delta, ff, omega, oe

Page 64: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Graph-based Query Refinement and Query Languages

Page 65: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Heterogeneous Event-Entity Querying

• Problem
– Most existing graph/link mining approaches assume single node types (e.g., people, documents, etc.) and restricted link types (e.g., collaboration, HTML links, etc.)

• Solution
– Single framework that enables analysts to mine heterogeneous event-entity data

Page 66: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Supporting Exploratory Event-Entity Graph Analysis

Example tasks
• Influence/dependence analysis
• Prediction of links between entity type 1 and entity type 2, given their relation to entity 3
• Compute strength of relationship between a given pair of individuals or groups with varying edge and node types

Our Approach
Given the overall schema and graph data:
• Subschema selection
• Subgraph selection (data filtering)
• Decoration of data graph nodes and edges
• Structural grouping and aggregation
  – May also involve aggregation of decoration values
• Progressive/interactive refinement

Page 67: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


The GrAQ System (built using the JUNG library)

Page 68: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Status of Work

• Achievements
  – Query language for interactive graph analysis
  – Aggregation operators for graph data analysis
  – Similarity predicates and ranking for analysis involving imprecise matching
  – Integration of concept hierarchies in graph data analysis
  – System development over a commercial ORDBMS
• Future Work
  – Model and language extensions to support spatio-temporal graph analysis
  – Efficient support for graph analysis queries
    • Graph indexing strategies
    • Query processing and optimization
  – Integration of feedback-based query refinement in graph analysis queries

Page 69: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Interactive Querying and Refinement

• Relevance-based retrieval
  – Queries approximately capture the user’s information need
  – Ranked retrieval based on relevance of object to query
• Query Refinement
  – Customization based on the user’s subjectivity, information need, and preferences
• Existing Search Technologies
  – Database systems: do not support relevance-based retrieval (only exact search)
  – IR systems: support a (limited) form of similarity retrieval, but only for textual data

Page 70: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Q: Start with 3 universities and my research interests, retrieve important information about authors.

SELECT author FROM db WHERE (Inst='Stanford' OR Inst='UCI' OR Inst='UCLA') AND Sim_area('AI', 'Database')

[Figure: query refinement loop over a graph of Entity (Author), Relation (Write), and Event (Paper) nodes.]
• Setup initial query
  – Institutions: Stanford, UCI, UCLA
  – Research areas: AI, Database
• Retrieve result, represented either as a ranked list or as nodes in a graph
  – Authors shown: Jeff Ullman, Hector, R. Agrawal, Michael Pazzani, Padhraic Smyth, Richard Korf
• Feedback is passed to the Refinement Engine
  – A second set of query terms in the diagram adds IBM, IR, and Data Mining
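The loop on this slide (initial query, ranked result, feedback, refined query) can be sketched as below. This is a minimal illustration only: the slides do not say how `Sim_area` is scored or how the Refinement Engine updates a query, so cosine similarity and a Rocchio-style weight update are assumptions swapped in here, and the author records, `ranked_query`, and `refine` are hypothetical names with toy data.

```python
import math

# Hypothetical toy records: an institution plus a weight per research area.
AUTHORS = {
    "author_a": {"inst": "Stanford", "areas": {"Database": 0.9, "AI": 0.1}},
    "author_b": {"inst": "UCI", "areas": {"AI": 0.7, "DataMining": 0.3}},
    "author_c": {"inst": "UCLA", "areas": {"AI": 1.0}},
    "author_d": {"inst": "IBM", "areas": {"Database": 0.5, "DataMining": 0.5}},
}


def cosine(u, v):
    """Cosine similarity between two sparse area-weight dictionaries."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def ranked_query(institutions, area_weights):
    """Exact filter on institution, then rank by similarity of areas."""
    hits = [
        (name, cosine(area_weights, rec["areas"]))
        for name, rec in AUTHORS.items()
        if rec["inst"] in institutions
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def refine(area_weights, relevant, rate=0.5):
    """Rocchio-style update (an assumption): shift the query's area weights
    toward the area profiles of results the user marked as relevant."""
    new_weights = dict(area_weights)
    for name in relevant:
        for area, w in AUTHORS[name]["areas"].items():
            new_weights[area] = new_weights.get(area, 0.0) + rate * w / len(relevant)
    return new_weights


institutions = {"Stanford", "UCI", "UCLA"}
query = {"AI": 1.0, "Database": 1.0}
first = ranked_query(institutions, query)

# The user marks author_b as relevant; the refined query now favours that profile.
refined = refine(query, relevant=["author_b"])
second = ranked_query(institutions, refined)
```

In this toy run the institution predicate stays exact while the area predicate is soft, which mirrors the split between the `Inst=` conditions and `Sim_area` in the query above.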

Page 71: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Similarity Queries in SQL are Complex!

Page 72: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Evaluation of Query Refinement

• Tested on multiple real data sets
• Average precision on 400 queries over 4 refinements
• The new methods outperform existing methods
  – Substantially fewer iterations required
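For readers unfamiliar with the metric, the sketch below computes average precision for one ranked list using the common textbook definition (mean of precision at each rank where a relevant item appears). The slides do not state which variant was used in the evaluation, so treat this as an assumption; the item names are made up.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k that hold a relevant item,
    divided by the total number of relevant items."""
    hits = 0
    total = 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


relevant = {"a", "c"}
before = average_precision(["a", "b", "c", "d"], relevant)  # relevant items at ranks 1 and 3
after = average_precision(["a", "c", "b", "d"], relevant)   # relevant items at ranks 1 and 2
```

A refinement step that moves relevant items up the list raises the score, which is the quantity averaged over queries and refinement rounds in an evaluation of this kind.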

Page 73: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Other work in progress…

• Edge prediction in graphs
  – Given a graph with attributes on nodes and edges
  – Assume some edges are missing (or remove them)
  – Predict the probability of edge(i,j)
    • E.g., what is the likelihood that A and B have interacted given everything else we know, or that they will interact within the next 6 months
  – Note: “runtime” querying, avoid O(N²) complexity
• Data cleaning
  – Multiple names for a single entity
  – Multiple entities mapped to the same name, e.g., J_Wang
    • How many unique P_Smyths are there?
  – Use heterogeneous data sources and probabilistic models to iteratively produce “consistent” data
    • E.g., combine CiteSeer, Web information, topic models, institution, etc.
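One way to picture the "runtime" edge-prediction query is to score only the pair being asked about, using the two nodes' neighbourhoods, instead of scoring all N² pairs in advance. The sketch below uses the Adamic-Adar index, a standard link-prediction heuristic chosen here purely for illustration; the slides do not name a method, the sketch ignores node and edge attributes, and it returns an unnormalised score, not a calibrated probability.

```python
import math
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def adamic_adar(adj, i, j):
    """Score one queried pair: sum over common neighbours k of 1 / log(degree(k)).
    Work is proportional to the smaller neighbourhood, not to the graph size."""
    small, large = adj.get(i, set()), adj.get(j, set())
    if len(small) > len(large):
        small, large = large, small
    score = 0.0
    for k in small:
        # A common neighbour of two distinct nodes has degree >= 2, so log > 0.
        if k in large and len(adj[k]) > 1:
            score += 1.0 / math.log(len(adj[k]))
    return score


# Hypothetical toy graph.
edges = [("A", "C"), ("B", "C"), ("A", "D"), ("B", "D"), ("D", "E"), ("E", "F")]
adj = build_adjacency(edges)
score_ab = adamic_adar(adj, "A", "B")  # two shared neighbours: C and D
score_af = adamic_adar(adj, "A", "F")  # no shared neighbours
```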

Page 74: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Conclusions

Page 75: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Summary of Accomplishments

• Infrastructure
  – Developed entity-event testbed data sets and IE tools
  – Released JUNG API for graph data analysis and visualization
• Graph Data Analysis/Querying Research
  – Novel author-topic models
  – New class of “relative importance” algorithms
  – Efficient similarity query refinement system
  – New general framework for graph schemas
• Software
  – Netsight
  – Topic-Author Browser
  – Interactive query refinement system
  – Prototype graph-based DB language system

Page 76: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


What’s ready for the KD-D TestBed?

• Netsight
  – Built on JUNG API
  – Can handle any standard network data set
  – Supports both visualization and analysis
    • Relative importance algorithms
    • Relative betweenness algorithms
    • Graph layout and browsing
    • Graph filtering
  – Easily extendible
  – Integrated database support is planned in Year 2
• Other software is also in principle available
  – Author-topic applications:
    • e.g., find experts in South Florida in virus research
  – GrAQ tool for graph DB interface

Page 77: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Proposed Year 2 Work

• Basic research: extend theory and algorithms to
  – Support temporal and spatial semantics
  – Handle missing/noisy network data
  – Handle multi-edge types (multiple edges on same entities)
  – Scale to graphs with millions of edges
  – Support interaction: tools for exploration and querying
• Integration and Coupling of
  – Statistical topic models, querying, graph visualization, and databases
• Software Tools and Applications for the KDD testbed
  – Netsight as an analysis tool…
  – Application of Author-topic type model (e.g., “expert finder”)
  – Entity Monitoring application (monitor data sources over time with focused Web crawling)
• Data Sets/Types (TBD)
  – KDD-provided testbed data sets
  – Digital libraries: more CiteSeer, possibly Patent DB, MEDLINE
  – Less structured text sources such as email streams

Page 78: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


BACKUP SLIDES


Page 88: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Perplexities for true author and any random author

[Figure: x-axis: Number of Topics (5, 10, 20, 50, 100, 200, 400, 800); y-axis ticks from 2000 to 5500. Legend: A = true author; A = any author; percentiles in distribution: 0th, 1st, 2nd, 5th, 10th.]

Page 89: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


• Accuracy of author prediction as a function of # topics

[Figure: x-axis: Number of Topics (5, 10, 20, 50, 100, 200, 400, 800); y-axis: % of documents for which correct author was picked (0 to 40).]

Page 90: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Heterogeneous Event-Entity Graph Analysis and Query Language

Analysis of link/graph data involves:
• Subschema selection
  – Selecting node and edge types of interest from the graph schema
• Subgraph selection
  – Identifying relevant members of a group based on (possibly imprecise) matching of edge/node attributes or involvement in a given pattern of relationship
• Decoration
  – E.g., computation of pair-wise association measures between individual entities (conditioned on a context or third entity type)
• Structural Grouping and Aggregation
  – Node/edge grouping
  – Combination of decorations (or other attribute values) for groups of entities at various levels
• Progressive Refinement
  – Carrying out the above operations in a progressive and interactive manner. In particular, the user should be able to ask queries based on the results of previous queries.
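To make the operations on this slide concrete, here is a small sketch over a typed toy graph. It is plain Python and is not the GrAQ query language or the JUNG API; every name (`select_subschema`, `select_subgraph`, `decorate_coauthorship`, `aggregate_by`) and all of the data are hypothetical, and co-authorship counting stands in for whatever association measure an analyst would actually choose.

```python
from collections import Counter
from itertools import combinations

# Hypothetical typed graph: node id -> (type, attributes); edges are (src, dst, type).
NODES = {
    "a1": ("author", {"inst": "UCI"}),
    "a2": ("author", {"inst": "UCI"}),
    "a3": ("author", {"inst": "UCLA"}),
    "p1": ("paper", {"year": 2001}),
    "p2": ("paper", {"year": 2003}),
    "c1": ("conference", {}),
}
EDGES = [
    ("a1", "p1", "writes"), ("a2", "p1", "writes"),
    ("a1", "p2", "writes"), ("a3", "p2", "writes"),
    ("p1", "c1", "appears_in"),
]


def select_subschema(nodes, edges, node_types, edge_types):
    """Keep only the node and edge types of interest."""
    kept = {n: v for n, v in nodes.items() if v[0] in node_types}
    return kept, [e for e in edges if e[2] in edge_types and e[0] in kept and e[1] in kept]


def select_subgraph(nodes, edges, predicate):
    """Filter nodes by an attribute predicate and drop dangling edges."""
    kept = {n: v for n, v in nodes.items() if predicate(n, v[0], v[1])}
    return kept, [e for e in edges if e[0] in kept and e[1] in kept]


def decorate_coauthorship(edges):
    """Decoration: number of shared papers for each author pair
    (an association between entities, conditioned on the paper event type)."""
    authors_by_paper = {}
    for src, dst, _ in edges:
        authors_by_paper.setdefault(dst, []).append(src)
    counts = Counter()
    for authors in authors_by_paper.values():
        for pair in combinations(sorted(authors), 2):
            counts[pair] += 1
    return counts


def aggregate_by(nodes, decoration, attr):
    """Grouping and aggregation: sum pair decorations per pair of attribute values."""
    totals = Counter()
    for (u, v), weight in decoration.items():
        key = tuple(sorted((nodes[u][1][attr], nodes[v][1][attr])))
        totals[key] += weight
    return totals


# Each step consumes the result of the previous one (progressive refinement).
nodes, edges = select_subschema(NODES, EDGES, {"author", "paper"}, {"writes"})
nodes, edges = select_subgraph(nodes, edges, lambda n, t, a: t != "paper" or a["year"] >= 2000)
pairs = decorate_coauthorship(edges)
groups = aggregate_by(nodes, pairs, "inst")
```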

Page 91: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


P(author and topic given a word)

$$P(A_i, Z_i \mid \{W\}, \{Z\} \setminus Z_i, \{A\} \setminus A_i) \propto \frac{(C_{WZ} + \beta)\,(C_{AZ} + \alpha)}{\sum_{W'} C_{W'Z} + V\beta}$$

$C_{WZ}$ counts the number of times the same word, W, (in the same or other documents) is assigned to topic Z

$C_{AZ}$ counts the number of times the same author, A, (in the same or other documents) is assigned to topic Z

Keeping these counts speeds up the algorithm!
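The role of the cached counts can be sketched as follows. The function evaluates the slide's expression for every (author, topic) pair of one word token and normalises the result; it assumes smoothing constants named `alpha` and `beta`, a vocabulary of size V equal to the number of rows of the word-topic count table, and counts that already exclude the token being resampled. The function name, argument layout, and toy counts are hypothetical.

```python
import random


def author_topic_conditional(word, doc_authors, c_wz, c_az, c_z, alpha, beta):
    """Normalised P(author, topic | word, all other assignments) for one token.

    c_wz[w][z]: times word w is assigned to topic z
    c_az[a][z]: times author a is assigned to topic z
    c_z[z]:     cached column totals of c_wz, so no sum over the vocabulary is needed
    """
    vocab_size = len(c_wz)
    weights = {}
    for a in doc_authors:
        for z in range(len(c_z)):
            weights[(a, z)] = (
                (c_wz[word][z] + beta) * (c_az[a][z] + alpha)
                / (c_z[z] + vocab_size * beta)
            )
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


# Toy counts: 3 words, 2 topics, 2 authors.
c_wz = [[2, 0], [1, 1], [0, 3]]
c_az = [[3, 1], [0, 3]]
c_z = [sum(row[z] for row in c_wz) for z in range(2)]

probs = author_topic_conditional(0, [0, 1], c_wz, c_az, c_z, alpha=0.5, beta=0.1)

# Draw a new (author, topic) assignment for the token from the conditional.
rng = random.Random(0)
pairs = list(probs)
new_author, new_topic = rng.choices(pairs, weights=[probs[p] for p in pairs])[0]
```

Because each update only reads a few table entries and the cached totals, resampling a token costs time proportional to (number of document authors) × (number of topics), independent of corpus size.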

Page 92: Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine


Sampling over a query document

Preprocessing: Assign to each word in the query document an Author and a Topic

K Iterations (typically K=10)
• For each word out of the N query words
  – Derive the probability P(A, Z) conditioned on the current assignments of query words and the database words
  – Assign a new author, A, and topic, Z, according to P(A, Z)

The probability for a topic is the averaged ratio of words assigned to the topic per total words:

$$P(Z) = \frac{1}{KN} \sum_{t=1}^{K} C^{t}_{Z}$$

$C^{t}_{Z}$ is the number of words assigned in the t-th iteration to topic Z
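The procedure on this slide can be sketched as a short loop. Several details are assumptions, since the slide leaves them open: the initial assignment is drawn uniformly at random, the per-iteration topic counts are taken after all N words have been resampled, and the conditional P(A, Z) is supplied by the caller as a function (`conditional`, a hypothetical stand-in for the count-based expression that also uses the database words).

```python
import random
from collections import Counter


def sample_query_document(words, authors, num_topics, conditional, iterations=10, seed=0):
    """Return the estimated P(Z) for a query document:
    (sum over iterations of words assigned to Z) / (iterations * number of words)."""
    rng = random.Random(seed)
    n = len(words)
    # Preprocessing: give every query word an initial (author, topic) pair.
    assign = [(rng.choice(authors), rng.randrange(num_topics)) for _ in words]
    topic_counts = Counter()
    for _ in range(iterations):
        for i, word in enumerate(words):
            others = assign[:i] + assign[i + 1:]
            probs = conditional(word, others)  # dict: (author, topic) -> probability
            pairs = list(probs)
            assign[i] = rng.choices(pairs, weights=[probs[p] for p in pairs])[0]
        # Count of words assigned to each topic in this iteration.
        topic_counts.update(z for _, z in assign)
    return [topic_counts[z] / (iterations * n) for z in range(num_topics)]


def toy_conditional(word, others):
    # Hypothetical stand-in: every word is certain to pick author "a", topic 1.
    return {("a", 0): 0.0, ("a", 1): 1.0}


p_z = sample_query_document(["w1", "w2", "w3"], ["a"], 2, toy_conditional, iterations=10)
```

With the degenerate toy conditional every word lands on topic 1 in every iteration, so the estimate puts all mass on that topic; a real conditional would spread the counts across topics.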
