Padhraic Smyth, Sharad Mehrotra Information and Computer Science University of California, Irvine

Department of Computer Science, University of California, Irvine

KDD Program Review, November 18th 2003

Entity-Based Data Mining from Spatio-Temporal Events and Text Sources

Presentation at KD-D Program Review, Nov 18-19 2003

Padhraic Smyth, Sharad Mehrotra

Information and Computer Science, University of California, Irvine

{smyth, sharad}@ics.uci.edu
www.datalab.uci.edu



Project Participants

• Principal Investigators
  – Padhraic Smyth: Data mining
  – Sharad Mehrotra: Databases
• Collaborators
  – Mark Steyvers: Text and Author Modeling
• Postdoctoral Researchers
  – Michal Rosen-Zvi, Dmitri Kalashnikov
• Staff Programmer
  – Amnon Meyers: Information Extraction
• Students
  – PhD: Joshua O'Madadhain, Scott White, Yiming Ma, Dawit Seid
  – Undergraduates: Yan-Biao Boey, Momo Alhazzazi
• Acknowledgements
  – Steve Lawrence for CiteSeer data



Problem of Interest

• Intelligence Analysis today
  – Massive volumes/streams of data
    • Text (newswire, reports, etc.)
    • Web data
    • Transactions/events
• Central problems
  – Need flexible tools to support an analyst’s exploration of the data
  – Automatically focus an analyst’s attention on interesting parts of the data space
  – Need new theories/methods/tools



Entities and Events

• Entities = Individuals, groups, communities, organizations, etc.
• Events = Contacts, collaborations, meetings, products, etc.
• Working hypothesis
  – A large component of intelligence work is centered on entities and events
    • Extracting entity information from text streams and transaction data
    • Predicting entity behavior
    • Detecting groups of related entities
• Our broad goal
  – Develop next-generation data management, exploration, and analysis tools for entity-event data



Nodes = Entities = Biotech-Related Organizations
Edges = Events = Collaborations



Red indicates nodes selected by the data analyst as important



Algorithm determines blue nodes are important relative to red nodes (Oxford and Cambridge)



Research Issues

– Information extraction
– Data management tools
– Visualization techniques
– Interactive ad hoc querying and mining
– Statistical modeling of graph data
– Query languages for graphs
– Scalability to large graphs
– ...



Focus of Our Research

[Diagram: Text Sources, Information Extraction, Entity-Event Databases, Statistical Modeling and Data Mining, Visualization, Query Languages, User Modeling]



Major Themes in Our Work

• Focus on data in the form of graphs
  – Nodes = entities, edges = events
  – Nodes and edges have attributes (e.g., temporal)
  – Year 1: entities = computer science researchers
  – Year 1: limited spatio-temporal aspects
• Integration and coupling of
  – Statistical modeling and data mining
  – Visualization
  – Query languages and data management
• Scalability
  – Methods should scale to millions of nodes and edges
• User Interaction
  – Conditional “query-driven” analysis and mining
  – Contrast with offline global modeling



Accomplishments

• Infrastructure and Data Sets
  – Created testbed data sets, e.g., 100k entities, 400k events
  – Developed suite of text information extraction tools
  – Developed and released a general public-domain JAVA API for graph data analysis and visualization
• Statistical Modeling and Data Mining
  – Developed new statistical technique for modeling entities based on authored text
  – Developed new class of scalable algorithms for interactive graph-based data mining



Accomplishments

• Graph-based Querying
  – Developed framework for general graph-based query language
  – New accurate and efficient algorithms for interactive similarity queries and query refinement on graphs
• Software Tools
  – Netsight: JAVA-based graph visualization and analysis tool
  – Browser tool for exploring author-topic models
  – Interactive query refinement system
  – Prototype system for graph-based query language for interacting with heterogeneous graph data



Publications in Year 1

• Data Mining on Graphs
  – S. White and P. Smyth, Algorithms for Discovering Relative Importance in Graphs, Proceedings of the Ninth International ACM SIGKDD Conference, August 2003. Extended version submitted to JICRD, June 2003.
  – J. O'Madadhain, D. Fisher, S. White, and Y. Boey, The JUNG (Java Universal Network/Graph) Framework, UCI-ICS Tech Report 03-17, October 2003; invited presentation, Stanford Workshop on Statistical Inference, Computing and Visualization for Graphs, August 2003.
  – P. Baldi, P. Frasconi, and P. Smyth, Modeling the Internet and the Web: Probabilistic Methods and Algorithms, Wiley, June 2003.
• Statistical Author-Topic Models
  – T. Griffiths and M. Steyvers (in press), Finding Scientific Topics, Proceedings of the National Academy of Sciences.
  – M. Steyvers, M. Rosen-Zvi, T. Griffiths, and P. Smyth, Author Attribution with LDA, NIPS Workshop on Syntax, Semantics, and Statistics, December 2003.
• Data Management and Graph Querying
  – Y. Ma, S. Mehrotra, and D. Seid, A Framework for Refining Similarity Queries Using Learning Techniques, UCI-ICS Tech Report 03-19, Nov. 2003. Extended version submitted to EDBT 2004.
  – Y. Ma, D. Seid, and S. Mehrotra, Interactive Filtering of Data Streams by Refining Similarity Queries, UCI-ICS Tech Report 03-07, June 2003.
  – D. Seid, M. Ortega-Binderberger, Z. Chen, and S. Mehrotra, Evaluating Top-k Selection and Preference Queries on Multiple Indexed Attributes. Submitted to EDBT 2004.
  – D. Seid and S. Mehrotra, Complex Analytical Queries on Graphs and Hierarchies (in preparation).
  – L. Jin, C. Li, and S. Mehrotra, Efficient Record Linkage in Large Data Sets, 8th International Conference on Database Systems for Advanced Applications (DASFAA 2003), 26-28 March 2003, Kyoto, Japan.



Data Sets

Dataset                Documents      Entities     Extracted Abstracts      Words
CiteSeer               363K           100K         163K                     12M
NSF Abstracts          129K           199K         129K                     10M+
US Comp Science Depts  294 web sites  14K faculty  67K extracted citations  -




Extractor            Field        Dataset               Upper Bound  Number Extracted  Estimated Accuracy
CSNames              Author       CiteSeer              503K         467K              90-100%
CSWord               Abstract     CiteSeer              363K         163K              65-75%
Publication Crawler  Publication  US CS Dept Web Sites  -            67K               -



Author Database Schema

[Figure: entity-relationship diagram. Entities shown include Author, Paper, Source (ISA: Conference, Journal, Newsletters), Field, Sponsor, Publisher, Institute, and WebInfo. Relationships shown include write, Publish, Refer (From, To), Fund, has, Read, CoRead, Also Read, Parent (From, To), P_S, P_F, and S_F. Attributes shown include AID, PID, SID, FID, IID, SpID, PubID, FstNm, MidNm, LstNm, Position, Title, Keyword, Abs, Full Text Index, Text Index, URL, Date, Type, Location, Name, Vol, Iss, ISSN, Pub_name, PageNoFrm, PageNoTo, and Description.]

Note: “individual-centric” not “document-centric”







“9/11 Network”



From graphs to Markov chains

• Importance = recursive function of nodes pointing at you

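[Figure: a four-node graph (A, B, C, D) with weighted edges (weights 3, 4, 2, 2), shown next to the same graph with each node's outgoing weights normalized into Markov transition probabilities]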

• Markov approach
  – Notion of a “token” circulating around in Markov fashion
  – Important actors see the token more often
  – Importance = stationary probability of each node
  – PageRank: surfer randomly following links on the Web
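
The token-passing idea above can be sketched as power iteration on a small weighted graph. This is a minimal illustration only: the graph, its edge weights, and the function names are assumptions made for the example and do not come from the slides.

```python
def transition_probabilities(weights):
    """Normalize each node's outgoing edge weights so they sum to 1."""
    probs = {}
    for node, out_edges in weights.items():
        total = sum(out_edges.values())
        probs[node] = {nbr: w / total for nbr, w in out_edges.items()}
    return probs


def stationary_distribution(probs, iterations=200):
    """Approximate the stationary distribution by repeatedly moving the token."""
    nodes = list(probs)
    dist = {n: 1.0 / len(nodes) for n in nodes}  # token starts anywhere
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for src, out_edges in probs.items():
            for dst, p in out_edges.items():
                nxt[dst] += dist[src] * p
        dist = nxt
    return dist


# Hypothetical weighted graph (edge weight = number of contacts, say).
weights = {
    "A": {"B": 3, "C": 1},
    "B": {"C": 2, "D": 2},
    "C": {"A": 4},
    "D": {"A": 2, "C": 2},
}
probs = transition_probabilities(weights)
importance = stationary_distribution(probs)
```

Nodes that the token visits more often end up with a larger entry in `importance`. The sketch assumes every node has at least one outgoing edge and that the chain is connected and aperiodic, so that the iteration settles.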




[Figure: example graph with nodes A through G]

Relative importance of node V to A: trade off [distance from A, structural importance of V]



[Figure: the same graph (nodes A through G) with backlinks to A added from the other nodes]

Add backlinks to A with probability (e.g., 0.3)
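
A minimal sketch of the backlink idea, assuming a simple reading in which the walker jumps back to A with a fixed probability at every step and otherwise follows a random outgoing link. The graph and the names `pagerank_with_priors` and `back_prob` are illustrative assumptions, not the authors' implementation.

```python
def pagerank_with_priors(out_links, root, back_prob=0.3, iterations=100):
    """Random walk that returns to `root` with probability `back_prob` per step.

    Assumes every node has at least one outgoing link.
    """
    nodes = list(out_links)
    rank = {n: (1.0 if n == root else 0.0) for n in nodes}  # walk starts at root
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        nxt[root] = back_prob  # total mass is 1, so this is the returning share
        for src, targets in out_links.items():
            share = (1.0 - back_prob) * rank[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        rank = nxt
    return rank


# Hypothetical graph on nodes A..G (each undirected edge listed in both directions).
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D", "E"],
    "D": ["B", "C", "F"],
    "E": ["C", "G"],
    "F": ["D", "G"],
    "G": ["E", "F"],
}
rank = pagerank_with_priors(graph, root="A", back_prob=0.3)
```

With this setup, nodes close to A and well connected to it score higher than distant ones, which is the trade-off between distance and structural importance described above.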



Algorithms for Relative Importance
(S. White and P. Smyth, ACM KDD 2003; also JICRD, submitted)

• PageRank with Priors (PRankP)
  – Random walks that start from A and return to A periodically
  – Relative importance = stationary probability
  – Iterative algorithm (e.g., Haveliwala, 2002)
• HITS with priors
  – Formulate HITS as Markov chain, same idea
• K-Step Markov
  – Use the transient probability distribution starting from A
  – Faster than stationary probability methods
• Weighted Paths
  – Heuristic approximation to K-step Markov: even faster
• All algorithms scale linearly in number of edges
  – Different constant factors
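
One way to read the K-Step Markov bullet is sketched below: start all probability mass at A, take K steps of the walk, and score each node by the total probability of being there over those steps. Summing the per-step distributions is an assumption of this sketch, as are the graph and the function name.

```python
def k_step_markov(out_links, root, k=6):
    """Score nodes by accumulated visit probability over k steps from `root`."""
    nodes = list(out_links)
    current = {n: (1.0 if n == root else 0.0) for n in nodes}
    score = {n: 0.0 for n in nodes}
    for _ in range(k):
        nxt = {n: 0.0 for n in nodes}
        for src, targets in out_links.items():
            if current[src] == 0.0 or not targets:
                continue  # nothing to propagate from this node
            share = current[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        current = nxt
        for n in nodes:
            score[n] += current[n]  # accumulate the transient distribution
    return score


# Hypothetical graph on nodes A..G.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D", "E"],
    "D": ["B", "C", "F"],
    "E": ["C", "G"],
    "F": ["D", "G"],
    "G": ["E", "F"],
}
scores = k_step_markov(graph, root="A", k=6)
one_step = k_step_markov(graph, root="A", k=1)
```

Each step touches every edge at most once, so the cost is K passes over the edge list with no convergence loop, which is consistent with the claim that this is faster than the stationary methods.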



Computation Times for Ranking Algorithms (in seconds)

Data       Nodes  Edges  WeightedPaths  KStepMarkov (K=6)  PRankWithPriors  HITSWithPriors
Terrorist  63     308    0.01           0.28               1.17             0.57
Biotech    3k     13k    0.02           0.39               3.45             3.64
Author1    30k    88k    0.05           1.11               10.80            11.30
Author2    30k    88k    0.04           1.55               17.06            17.99

PRankP and HITS converged in 20-30 iterations



Weighted versus Unweighted Graphs

Rank  PRankP on Unweighted Graph  PRankP on Weighted Graph
1     Thrun                       Thrun
2     Fisher                      McCallum
3     Kononenko                   Nigam
4     Dzeroski                    Freitag
5     Freitag                     Blum
6     Bratko                      Slattery
7     Cheng                       Joachims
8     McDermott                   Fox



Visualization and Analysis Software



JUNG: Java Universal Network/Graph Framework

http://jung.sourceforge.net

16,000 page visits and 800 downloads since August



Demo of Netsight software



Entity Models from Text Data



[Figure: authors A1, A2, ..., Ak linked directly to words w1, w2, w3, ..., wN]

Can we model authors, given documents?

(more generally, build statistical profiles of entities given sparse observed data)



[Figure: authors A1, A2, ..., Ak linked to hidden topics T1, T2, which are in turn linked to words w1, w2, w3, ..., wN]

Model = Author-Topic distributions + Topic-Word distributions

Parameters learned via Bayesian learning
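
The two sets of distributions can be read generatively, as in the sketch below: for each word, pick one of the document's authors, draw a topic from that author's topic distribution, then draw a word from that topic's word distribution. Picking the author uniformly at random is an assumption of this sketch, and the tiny distributions are made up for illustration.

```python
import random


def generate_document(authors, author_topic, topic_word, n_words, rng):
    """Generate words from author-topic and topic-word distributions."""
    words = []
    for _ in range(n_words):
        author = rng.choice(authors)  # assumed: uniform over the document's authors
        topics, topic_probs = zip(*author_topic[author].items())
        topic = rng.choices(topics, weights=topic_probs)[0]
        vocab, word_probs = zip(*topic_word[topic].items())
        words.append(rng.choices(vocab, weights=word_probs)[0])
    return words


# Hypothetical distributions for two authors and two hidden topics.
author_topic = {
    "A1": {"T1": 1.0, "T2": 0.0},
    "A2": {"T1": 0.2, "T2": 0.8},
}
topic_word = {
    "T1": {"query": 0.6, "index": 0.4},
    "T2": {"bayesian": 0.7, "prior": 0.3},
}
doc = generate_document(["A1", "A2"], author_topic, topic_word, 20, random.Random(0))
```

Learning runs this story in reverse: given only the observed words and author lists, estimate the two sets of distributions.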






[Figure: hidden topics T1, T2 linked to words w1, w2, w3, ..., wN, with no author layer]

“Topic Model”:
– document can be generated from multiple topics
– Hofmann (SIGIR ’99), Blei, Jordan, Ng (JMLR, 2003)



[Figure: authors linked to hidden topics linked to words, as before]

NOTE: documents can be composed of multiple topics



Author Modeling Data Sets

Source    Documents  Unique Authors  Unique Words  Total Word Count
CiteSeer  163,389    85,465          30,799        11.7 million
CORA      13,643     11,427          11,101        1.2 million
NIPS      1,740      2,037           13,649        2.3 million



Topic Models from CiteSeer

WORDS: probabilistic, Bayesian, carlo, monte, distribution, inference, conditional, prior, mixture, Markov, posterior, belief……

AUTHORS: N_Friedman, D_Heckerman, Z_Ghahramani, D_Koller, M_Jordan, R_Neal, A_Raftery, T_Lukasiewicz, J_Halpern….

WORDS: retrieval, text, document, information, content, indexing, relevance, collection, query, IR, feedback….

AUTHORS: D. Oard, W_Croft, K_Jones, P_Schauble, E_Voorhees, A_Singhal, D_Hawking, J_Allan, A_Smeaton, M_Hearst,….



Topic Models from CiteSeer

WORDS: Web, user, world, wide, pages, www, site, internet, hypertext, hypermedia, content, links, page, navigation..

AUTHORS: S. Lawrence, B. Mobasher, M. Levene, D. Florescu, O. Etzioni, R_Studer, W. Hall, R. Fielding, J. Pitkow, M. Crovella,….

WORDS: data, mining, attributes, discovery, association, large, knowledge, databases, dataset, interesting, frequent, discover, sets….

AUTHORS: J. Han, R. Rastogi, M. Zaki, R. Ng, B. Liu, H. Mannila, S. Brin, H Liu, L. Holder, H. Toivonen…



Author-Topic Models from CiteSeer

• Author = A McCallum:
  – Topic 1: classification, training, generalization, decision, data, ...
  – Topic 2: learning, machine, examples, reinforcement, inductive, ...
  – Topic 3: retrieval, text, document, information, content, ...
• Author = H Garcia-Molina:
  – Topic 1: query, index, data, join, processing, aggregate, ...
  – Topic 2: transaction, concurrency, copy, permission, distributed, ...
  – Topic 3: source, separation, paper, heterogeneous, merging, ...
• Author = P Cohen:
  – Topic 1: agent, multi, coordination, autonomous, intelligent, ...
  – Topic 2: planning, action, goal, world, execution, situation, ...
  – Topic 3: human, interaction, people, cognitive, social, natural, ...



Author-Topic Browser

• Interesting scalability issues
  – CiteSeer model exceeds 1 Gbyte
  – Real-time query answering demands Gibbs sampling (not well suited to SQL!)
• Solution
  – Coupling of Gibbs sampling and relational DB (it works!)

[Diagram: Original Text + Statistical Model, JAVA Query GUI, SQL Interface, MySQL DB, Bayesian Sampling]



Demo of Author-Topic Browser

• Note
  – Real-time querying on CiteSeer authors/documents
    • 85,000 authors
    • 163,000 documents
    • 30,000 unique words
    • 300 topics
  – Can query on
    • Authors, topics, words, documents
  – Topic distribution given documents/words requires sampling to estimate:
    • Gibbs sampling is fast enough to answer queries in real time



Applications of Author-Topic Models

• “Expert Finder”
  – “Find researchers who are knowledgeable in cryptography and machine learning within 100 miles of Washington DC”
  – “Find reviewers for this set of NSF proposals who are active in relevant topics and have no conflicts of interest”
• Prediction
  – Given a document and some subset of known authors for the paper (k=0,1,2…), predict the other authors
  – Predict how many papers in different topics will appear next year
• Change Detection/Monitoring
  – Which authors are on the leading edge of new topics?
  – Characterize the “topic trajectory” of this author over time



[Figure: Document and Word Distribution by Year in the UCI CiteSeer Data. x-axis: Year (1986 to 2002); left y-axis: Number of Documents (0 to 2 x 10^4); right y-axis: Number of Words (0 to 14 x 10^5)]



Rise in Web, Mobile, JAVA

[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (0 to 0.012)]
– Topic 7 (Web): web, user, world, wide, users
– Topic 80: mobile, wireless, devices, mobility, ad
– Topic 76: java, remote, interface, platform, implementation
– Topic 275: multicast, multimedia, media, delivery, applications



Rise of Machine Learning

[Figure: Topic Proportions by Year in CiteSeer Data. x-axis: Year (1990 to 2002); y-axis: Topic Probability (1 to 8 x 10^-3)]
– Topic 114: regression, variance, estimator, estimators, bias
– Topic 153: classification, training, classifier, classifiers, generalization
– Topic 205: data, mining, attributes, discovery, association



Bayes lives on

[Figure: topic probability by year, 1990 to 2002]
– Topic 189: statistical, prediction, correlation, predict, statistics
– Topic 209: probabilistic, bayesian, probability, carlo, monte
– Topic 276: random, distribution, probability, markov, distributions



Decline in Languages, OS, ...

[Figure: topic probability by year, 1990 to 2002]
– Topic 60: programming, language, concurrent, languages, implementation
– Topic 139: system, operating, file, systems, kernel
– Topic 283: collection, memory, persistent, garbage, stack
– Topic 268: memory, cache, shared, access, performance



Decline in CS Theory, ...

[Figure: topic probability by year, 1990 to 2002]
– Topic 111: proof, theorem, proofs, proving, prover
– Topic 156: polynomial, complexity, np, complete, hard
– Topic 226: language, languages, semantics, syntax, constructs
– Topic 235: logic, semantics, reasoning, logical, logics
– Topic 252: computation, computing, complexity, compute, computations



Trends in Database Research

[Figure: topic probability by year, 1990 to 2002]
– Topic 205: data, mining, attributes, discovery, association
– Topic 261: transaction, transactions, concurrency, copy, copies
– Topic 198: server, client, servers, clients, caching
– Topic 82: library, access, digital, libraries, core



Trends in NLP and IR

[Figure: topic probability by year, 1990 to 2002]
– Topic 280 (NLP): language, semantic, natural, linguistic, grammar
– Topic 289 (IR): retrieval, text, documents, information, document



Security Research Reborn

[Figure: topic probability by year, 1990 to 2002]
– Topic 120: security, secure, access, key, authentication
– Topic 240: key, attack, encryption, hash, keys



(Not so) Hot Topics

[Figure: topic probability by year, 1990 to 2002]
– Topic 23 (Neural Networks): neural, networks, network, training, learning
– Topic 35 (Wavelets): wavelet, operator, operators, basis, coefficients
– Topic 242 (GAs): genetic, evolutionary, evolution, population, ga



Decline in use of Greek Letters

[Figure: topic probability by year, 1990 to 2002]
– Topic 157: gamma, delta, ff, omega, oe



Graph-based Query Refinement and Query Languages



Heterogeneous Event-Entity Querying

• Problem
  – Most existing graph/link mining approaches assume single node types (e.g., people, documents, etc.) and restricted link types (e.g., collaboration, HTML links, etc.)
• Solution
  – Single framework that enables analysts to mine heterogeneous event-entity data



Supporting Exploratory Event-Entity Graph Analysis

Example tasks

• Influence/dependence analysis
• Prediction of links between entity type 1 and entity type 2, given their relation to entity 3
• Compute strength of relationship between a given pair of individuals or groups with varying edge and node types

Our Approach

Given the overall schema and graph data:
• Subschema selection
• Subgraph selection (data filtering)
• Decoration of Data Graph Nodes and Edges
• Structural Grouping and Aggregation
  – May also involve aggregation of decoration values
• Progressive/Interactive Refinement



The GrAQ System (built using JUNG library)



Status of Work

• Achievements
  – Query language for interactive graph analysis
  – Aggregation operators for graph data analysis
  – Similarity predicates and ranking for analysis involving imprecise matching
  – Integration of concept hierarchies in graph data analysis
  – System development over a commercial ORDBMS
• Future Work
  – Model and language extensions to support spatio-temporal graph analysis
  – Efficient support for graph analysis queries
    • Graph indexing strategies
    • Query processing and optimization
  – Integration of feedback-based query refinement in graph analysis queries



Interactive Querying and Refinement

• Relevance-based retrieval
  – Queries approximately capture user’s information need
  – Ranked retrieval based on relevance of object to query
• Query Refinement
  – Customization based on user’s subjectivity, information need, and preferences
• Existing Search Technologies
  – Database systems: do not support relevance-based retrieval (only exact search)
  – IR systems: support a (limited) aspect of similarity retrieval but are limited to textual data



Q: Start with 3 universities and my research interests, retrieve important information about authors.

SELECT author FROM db WHERE (Inst='Stanford' OR Inst='UCI' OR Inst='UCLA') AND Sim_area('AI', 'Database')

[Figure: query refinement loop. Set up the initial query; retrieve the result, represented either as a ranked list or as nodes in a graph; collect user feedback; pass the feedback to the Refinement Engine. The graph view links Entity (Author) nodes (Jeff Ullman, Hector, R. Agrawal, Michael Pazzani, Padhraic Smyth, Richard Korf) through Relation (Write) to Event (Paper) nodes, with institutions (Stanford, UCI, UCLA, IBM) and areas (AI, Database, IR, Data Mining).]



Similarity Queries in SQL are Complex!



Evaluation of Query Refinement

• Tested on multiple real data sets
• Average precision on 400 queries over 4 refinements
• The new methods outperform existing methods
  – Substantially fewer iterations required



Other work in progress..

• Edge prediction in graphs
  – Given a graph with attributes on nodes and edges
  – Assume some edges are missing (or remove them)
  – Predict the probability of edge(i,j)
    • E.g., what is the likelihood that A and B have interacted given everything else we know, or that they will interact within the next 6 months?
  – Note: “runtime” querying, avoid O(N^2) complexity
• Data cleaning
  – Multiple names for a single entity
  – Multiple entities mapped to the same name, e.g., J_Wang
    • How many unique P_Smyths are there?
  – Use heterogeneous data sources and probabilistic models to iteratively produce “consistent” data
    • E.g., combine CiteSeer, Web information, topic models, institution, etc.



Conclusions



Summary of Accomplishments

• Infrastructure
  – Developed entity-event testbed data sets and IE tools
  – Released JUNG API for graph data analysis and visualization
• Graph Data Analysis/Querying Research
  – Novel author-topic models
  – New class of “relative importance” algorithms
  – Efficient similarity query refinement system
  – New general framework for graph schemas
• Software
  – Netsight
  – Topic-Author Browser
  – Interactive query refinement system
  – Prototype graph-based DB language system



What’s ready for the KD-D TestBed?

• Netsight
  – Built on JUNG API
  – Can handle any standard network data set
  – Supports both visualization and analysis
    • Relative importance algorithms
    • Relative betweenness algorithms
    • Graph layout and browsing
    • Graph filtering
  – Easily extensible
  – Integrated database support is planned in Year 2
• Other software is also in principle available
  – Author-topic applications
    • E.g., find experts in South Florida in virus research
  – GrAQ tool for graph DB interface



Proposed Year 2 Work

• Basic research: extend theory and algorithms
  – Extend to temporal and spatial semantics
  – Handle missing/noisy network data
  – Multi-edge types (multiple edges on same entities)
  – Scalability: graphs with millions of edges
  – Interaction: tools that support exploration and querying
• Integration and Coupling of
  – Statistical topic models, querying, graph visualization, and databases
• Software Tools and Applications for the KDD testbed
  – Netsight as an analysis tool
  – Application of author-topic type model (e.g., “expert finder”)
  – Entity Monitoring application (monitor data sources over time with focused Web crawling)
• Data Sets/Types (TBD)
  – KDD-provided testbed data sets
  – Digital libraries: more CiteSeer, possibly Patent DB, MEDLINE
  – Less structured text sources such as email streams



BACKUP SLIDES



Perplexities for true author and any random author

[Figure: x-axis: Number of Topics (5, 10, 20, 50, 100, 200, 400, 800); y-axis: perplexity (2000 to 5500). Curves show percentiles in the distribution (0th, 1st, 2nd, 5th, 10th) for A = true author and A = any author.]



• Accuracy of author prediction as a function of number of topics

[Figure: x-axis: Number of Topics (5, 10, 20, 50, 100, 200, 400, 800); y-axis: % of documents for which correct author was picked (0 to 40)]



Heterogeneous Event-Entity Graph Analysis and Query Language

Analysis of link/graph data involves:
• Subschema selection
  – Selecting node and edge types of interest from the graph schema
• Subgraph selection
  – Identifying relevant members of a group based on (possibly imprecise) matching of edge/node attributes or involvement in a given pattern of relationship
• Decoration
  – E.g., computation of pair-wise association measures between individual entities (conditioned on a context or third entity type)
• Structural Grouping and Aggregation
  – Node/edge grouping
  – Combination of decorations (or other attribute values) for groups of entities at various levels
• Progressive Refinement
  – Carrying out the above operations in a progressive and interactive manner. In particular, the user should be able to ask queries based on results of previous queries.
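
The operations listed above can be illustrated on a toy typed graph. Everything in this sketch is hypothetical: the data, the function names, and the choice of node degree as the decoration are assumptions for illustration and are not the GrAQ language or its API.

```python
from collections import Counter

# Hypothetical typed graph: nodes carry a type and attributes, edges carry a type.
nodes = {
    "smyth": {"type": "author", "inst": "UCI"},
    "mehrotra": {"type": "author", "inst": "UCI"},
    "ullman": {"type": "author", "inst": "Stanford"},
    "p1": {"type": "paper", "year": 2003},
    "p2": {"type": "paper", "year": 2003},
    "kdd": {"type": "conference"},
}
edges = [
    ("smyth", "p1", "write"), ("mehrotra", "p1", "write"),
    ("mehrotra", "p2", "write"), ("ullman", "p2", "write"),
    ("p1", "kdd", "published_in"),
]


def select_subschema(nodes, edges, node_types, edge_types):
    """Keep only the node and edge types of interest."""
    kept = {n: a for n, a in nodes.items() if a["type"] in node_types}
    return kept, [e for e in edges
                  if e[2] in edge_types and e[0] in kept and e[1] in kept]


def select_subgraph(nodes, edges, predicate):
    """Keep nodes whose attributes satisfy the predicate, plus edges among them."""
    kept = {n: a for n, a in nodes.items() if predicate(a)}
    return kept, [e for e in edges if e[0] in kept and e[1] in kept]


def decorate_with_degree(nodes, edges):
    """Decoration: attach each node's degree as a new attribute."""
    degree = Counter()
    for src, dst, _ in edges:
        degree[src] += 1
        degree[dst] += 1
    return {n: dict(a, degree=degree[n]) for n, a in nodes.items()}


def aggregate_by(nodes, key, value):
    """Grouping and aggregation: sum a decoration over groups sharing `key`."""
    totals = Counter()
    for attrs in nodes.values():
        if key in attrs:
            totals[attrs[key]] += attrs[value]
    return dict(totals)


sub_nodes, sub_edges = select_subschema(nodes, edges, {"author", "paper"}, {"write"})
decorated = decorate_with_degree(sub_nodes, sub_edges)
writes_per_inst = aggregate_by(decorated, "inst", "degree")
uci_nodes, uci_edges = select_subgraph(
    sub_nodes, sub_edges,
    lambda a: a["type"] == "paper" or a.get("inst") == "UCI")
```

Each function returns a graph (or a summary of one) that can be fed into the next call, which is the progressive refinement pattern: later queries operate on the results of earlier ones.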



P(author and topic given a word)

$$P(A_i, Z_i \mid \{W\}, \{Z\} \setminus Z_i, \{A\} \setminus A_i) \propto \frac{(C_{WZ} + \beta)(C_{AZ} + \alpha)}{\sum_{W'} C_{W'Z} + V\beta}$$

C_{WZ} counts the number of times the same word, W, (in the same or other documents) is assigned to topic Z

C_{AZ} counts the number of times the same author, A, (in the same or other documents) is assigned to topic Z

Keeping these counts speeds up the algorithm!
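
A minimal sketch of how the counts turn into a sampling distribution for one word token, following the proportionality as written on the slide. The names `alpha` and `beta` for the smoothing constants, the dictionary layout of the counts, and the toy numbers are assumptions of this sketch.

```python
def author_topic_conditional(word, doc_authors, c_wz, c_az, c_z,
                             vocab_size, alpha, beta):
    """Return normalized P(author, topic) for one word token.

    c_wz[(word, topic)] and c_az[(author, topic)] are assignment counts that
    exclude the token being resampled; c_z[topic] is the total number of words
    assigned to the topic (the sum over W' in the denominator).
    """
    weights = {}
    for author in doc_authors:
        for topic in range(len(c_z)):
            word_part = c_wz.get((word, topic), 0) + beta
            author_part = c_az.get((author, topic), 0) + alpha
            weights[(author, topic)] = (
                word_part * author_part / (c_z[topic] + vocab_size * beta))
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


# Hypothetical counts: the word "query" is mostly assigned to topic 0, and
# author "mehrotra" is mostly assigned to topic 0 as well.
c_wz = {("query", 0): 8, ("query", 1): 1}
c_az = {("smyth", 0): 1, ("smyth", 1): 9, ("mehrotra", 0): 9, ("mehrotra", 1): 1}
c_z = [20, 20]
dist = author_topic_conditional("query", ["smyth", "mehrotra"], c_wz, c_az, c_z,
                                vocab_size=5, alpha=0.5, beta=0.1)
```

Because the counts are stored and updated incrementally, evaluating this distribution for a token only needs a handful of lookups per (author, topic) pair rather than a pass over the corpus.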



Sampling over a query document

Preprocessing: assign to each word in the query document an author and a topic

K iterations (typically K=10):
• For each word out of the N query words
  – Derive the probability P(A, Z) conditioned on the current assignments of query words and the database words
  – Assign a new author, A, and topic, Z, according to P(A, Z)

The probability for a topic is the averaged ratio of words assigned to the topic per total words:

$$P(Z) = \frac{\sum_{t=1}^{K} C^{t}_{Z}}{K N}$$

C^{t}_{Z} is the number of words assigned to topic Z in iteration t
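
The procedure above can be sketched as follows. This is an illustration under stated assumptions: the database counts are held fixed, query-side counts are added on top of them, the conditional uses smoothing constants named `alpha` and `beta`, and all names and toy counts are made up for the example.

```python
import random


def estimate_query_topics(query_words, authors, db_wz, db_az, db_z, vocab_size,
                          alpha, beta, k_iters=10, rng=None):
    """Estimate P(Z) for a query document by averaging topic assignments."""
    rng = rng or random.Random(0)
    n_topics = len(db_z)
    q_wz, q_az, q_z = {}, {}, [0] * n_topics  # counts from the query words only

    def bump(word, author, topic, delta):
        q_wz[(word, topic)] = q_wz.get((word, topic), 0) + delta
        q_az[(author, topic)] = q_az.get((author, topic), 0) + delta
        q_z[topic] += delta

    # Preprocessing: random initial author and topic for each query word.
    assign = []
    for word in query_words:
        pair = (rng.choice(authors), rng.randrange(n_topics))
        assign.append(pair)
        bump(word, pair[0], pair[1], +1)

    pairs = [(a, z) for a in authors for z in range(n_topics)]
    topic_hits = [0] * n_topics
    for _ in range(k_iters):
        for i, word in enumerate(query_words):
            old_author, old_topic = assign[i]
            bump(word, old_author, old_topic, -1)  # exclude the current token
            weights = [
                (db_wz.get((word, z), 0) + q_wz.get((word, z), 0) + beta)
                * (db_az.get((a, z), 0) + q_az.get((a, z), 0) + alpha)
                / (db_z[z] + q_z[z] + vocab_size * beta)
                for a, z in pairs
            ]
            assign[i] = rng.choices(pairs, weights=weights)[0]
            bump(word, assign[i][0], assign[i][1], +1)
        for _, topic in assign:  # C^t_Z for this iteration
            topic_hits[topic] += 1
    return [hits / (k_iters * len(query_words)) for hits in topic_hits]


# Hypothetical database counts: "query" and "index" belong to topic 0.
db_wz = {("query", 0): 50, ("index", 0): 40, ("neural", 1): 50}
db_z = [90, 50]
p_z = estimate_query_topics(["query", "index", "query", "index"],
                            ["smyth", "mehrotra"], db_wz, {}, db_z,
                            vocab_size=3, alpha=0.5, beta=0.1)
```

The returned list sums to 1 by construction, since each of the K iterations contributes exactly N topic assignments.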
