Current Research in Data Mining Research Group

Post on 01-Jan-2016


Current Research in Data Mining Research Group. Jiawei Han, Data Mining Research Group, Department of Computer Science, University of Illinois at Urbana-Champaign. Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), DHS, Microsoft, IBM, Yahoo!, HP Lab & Boeing. November 8, 2014.


Current Research in Data Mining Research Group

Jiawei Han

Data Mining Research Group

Department of Computer Science

University of Illinois at Urbana-Champaign

Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), DHS, Microsoft, IBM, Yahoo!, HP Lab & Boeing

April 20, 2023

Outline

An Introduction to Data Mining Research Group

Mining and OLAPing Information Networks

Mining Heterogeneous Information Networks

Mining Text-Rich Information Networks

OLAPing (Multi-dimensional analysis) of information networks: TextCube, OLAP heterogeneous networks

Taming the Web: WINACS (Integrated mining of Web structures and contents)

Mining Cyber-Physical Systems and Networks

Conclusions

Data Mining and Data Warehousing

Jiawei Han’s Group at CS, UIUC

Mining patterns and knowledge discovery from massive data

Data mining in heterogeneous information networks

Exploring broad applications of data mining

Developed many effective data mining algorithms, e.g., FPgrowth, PrefixSpan, gSpan, StarCubing, CrossMine, RankingCube, CrossClus, RankClus, and NetClus

600+ research papers in conferences and journals

Fellow of ACM, Fellow of IEEE, ACM SIGKDD Innovation Award, W. McDowell Award, Daniel Drucker Eminent Faculty Award

Textbook, “Data Mining: Concepts and Techniques,” adopted worldwide

Project lead for NASA EventCube for Aviation Safety [2008-2012]

Director of Information Network Academic Research Center funded by Army Research Lab (ARL) [2009-2014]

Data Mining Research Group at CS, UIUC

New Books on Data Mining & Link Mining

Han, Kamber and Pei, Data Mining, 3rd ed., 2011

Yu, Han and Faloutsos (eds.), Link Mining, 2010

Sun and Han, Mining Heterogeneous Information Networks, 2012


Mining Heterogeneous Information Networks

RankClus/NetClus

[Figure: network linking conferences (SIGMOD, SDM, ICDM, KDD, EDBT, VLDB, ICML, AAAI) and authors (Tom, Jim, Lucy, Mike, Jack, Tracy, Cindy, Bob, Mary, Alice); labels: Objects, Ranking]

RankCompete: A Competing Random Walk Model for Rank-Based Clustering

                          Database   Data Mining      AI          IR

Top-5 ranked conferences  VLDB       KDD              IJCAI       SIGIR
                          SIGMOD     SDM              AAAI        ECIR
                          ICDE       ICDM             ICML        CIKM
                          PODS       PKDD             CVPR        WWW
                          EDBT       PAKDD            ECML        WSDM

Top-5 ranked terms        data       mining           learning    retrieval
                          database   data             knowledge   information
                          query      clustering       reasoning   web
                          system     classification   logic       search
                          xml        frequent         cognition   text

RankClass [KDD11]

Knowledge Propagation in Heterogeneous Network

Similarity Search and Role Discovery in Information Networks

Which images are most similar to me in Flickr?

Path: ITI

Path: ITIGITI

PathSim [VLDB11]: Meta Path-Guided Similarity Search in Networks
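PathSim scores two objects of the same type under a symmetric meta path by comparing the number of path instances between them with the number of path instances from each object back to itself. The following is a minimal Python sketch, assuming a tiny made-up image-tag incidence matrix and reading the ITI path as image-tag-image; the data and the names `commuting_matrix` and `pathsim` are illustrative and not taken from the talk.

```python
def commuting_matrix(incidence):
    """Count path instances: M[x][y] = number of tags shared by images x and y."""
    n = len(incidence)
    return [[sum(a * b for a, b in zip(incidence[x], incidence[y]))
             for y in range(n)] for x in range(n)]


def pathsim(m, x, y):
    """PathSim under a symmetric meta path: 2*M[x][y] / (M[x][x] + M[y][y])."""
    denom = m[x][x] + m[y][y]
    return 2.0 * m[x][y] / denom if denom else 0.0


# Hypothetical toy data: rows are images, columns are tags (1 = image has tag).
image_tag = [
    [1, 1, 0, 0],  # image 0
    [1, 1, 1, 0],  # image 1
    [0, 0, 1, 1],  # image 2
]

M = commuting_matrix(image_tag)
scores = {(x, y): pathsim(M, x, y) for x in range(3) for y in range(3)}
```

Because the score is normalized by each object's own path count, an image with many tags does not automatically look similar to everything; it only scores high against images with a comparable tag profile.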

[Figure: a “dirty” information network (imaginary) is automatically cleaned/inferred into an adversarial network with roles Chief, Cell Lead, and Insurgent]

Role Discovery in Information Networks [KDD’10]

Advisee          Top Ranked Advisor      Time    Note

David M. Blei    1. Michael I. Jordan    01-03   PhD advisor, 2004
                 2. John D. Lafferty     05-06   Postdoc, 2006

Hong Cheng       1. Qiang Yang           02-03   MS advisor, 2003
                 2. Jiawei Han           04-08   PhD advisor, 2008

Sergey Brin      1. Rajeev Motwani       97-98   Unofficial advisor

Meta-Paths & Their Prediction Power

List all the meta-paths in bibliographic network up to length 4

Investigate their respective power for coauthor relationship prediction

Which meta-path has more prediction power?

How to combine them to achieve the best quality of prediction
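Listing meta-paths up to a given length amounts to walking the network schema (the graph of object types) rather than the network itself. The sketch below assumes a simplified, undirected schema with author, paper, venue, and topic types, where a paper-paper link stands for citation; the schema dictionary and the function name are hypothetical, and the direction of citation links is ignored.

```python
# Hypothetical schema: which object types are directly linked to which.
SCHEMA = {
    "author": ["paper"],
    "paper": ["author", "venue", "topic", "paper"],  # paper-paper = citation
    "venue": ["paper"],
    "topic": ["paper"],
}


def enumerate_meta_paths(schema, start, end, max_len):
    """Return every type sequence from start to end using at most max_len links."""
    found = []

    def walk(path):
        # Record the path whenever it currently ends at the target type.
        if len(path) > 1 and path[-1] == end:
            found.append("-".join(t[0].upper() for t in path))
        # Stop extending once the path already uses max_len links.
        if len(path) - 1 == max_len:
            return
        for nxt in schema[path[-1]]:
            walk(path + [nxt])

    walk([start])
    return found


paths = enumerate_meta_paths(SCHEMA, "author", "author", 4)
```

Each returned string (for example A-P-V-P-A, two authors publishing in the same venue) is a candidate topological feature whose predictive power can then be measured separately.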

Relationship Prediction in Heterogeneous Info Networks

Why Prediction of Co-Author Relationship in DBLP?

Prediction of relationships between different types of nodes in heterogeneous networks

E.g., what papers should Faloutsos write?

Traditional link prediction: homogeneous networks

Co-author networks in DBLP, friendship networks in Facebook

Relationship prediction

Study the roles of topological features in heterogeneous networks in predicting the co-author relationship building

Meta-path guided prediction!

Y. Sun, et al., "Co-Author Relationship Prediction in Heterog. Bibliographic Networks", ASONAM'11, July 2011

Guidance: Meta Path in Bibliographic Network

Relationship prediction: meta path-guided prediction

Meta path relationships among similar typed links share similar semantics and are comparable and inferable

[Figure: bibliographic network schema with object types paper, topic, venue, and author, and relations publish/publish-1, mention/mention-1, write/write-1, contain/contain-1, cite/cite-1]

Co-author prediction (A—P—A) using topological features also encoded by meta paths, e.g., citation relations between authors (A—P→P—A)

Case Study in CS Bibliographic Network

The learned significance for each meta path under measure “normalized path count” for HP-3hop dataset

Case Study: Predicting Concrete Co-Authors

High quality predictive power for such a difficult task

Using data in T0 = [1989; 1995] and T1 = [1996; 2002]

Predict new coauthor relationship in T2 = [2003; 2009]


iTopicModel: Model Set-Up & Objective Function

Graphical model:

ϴi = (ϴi1, ϴi2, …, ϴiT): Topic distribution for document xi

Structural Layer: follows the same topology as the document network

Text Layer: follows PLSA, i.e., for each word, pick a topic z ~ multi(ϴi), then pick a word w ~ multi(βz)

Objective function: joint probability

X: observed text information

G: document network

Parameters

ϴ: topic distribution

β: word distribution

ϴ is the most critical; it needs to be consistent with the text as well as the network structure

Structure part and text part: can model them separately!
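The text layer's generative story (pick a topic from the document's topic distribution, then pick a word from that topic's word distribution) can be sketched in a few lines of Python. This is only an illustration of the standard PLSA text part under assumed toy parameters; it does not include the structural layer, and the vocabulary, parameter values, and function names are hypothetical.

```python
import math
import random


def sample_document(theta_i, beta, vocab, n_words, rng):
    """Generate n_words words for one document following the PLSA text layer."""
    topic_ids = list(range(len(theta_i)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta_i)[0]  # z ~ multi(theta_i)
        w = rng.choices(vocab, weights=beta[z])[0]      # w ~ multi(beta_z)
        words.append(w)
    return words


def text_log_likelihood(words, theta_i, beta, vocab):
    """Sum over words of log sum_z theta_iz * beta_zw (the PLSA text part)."""
    index = {w: j for j, w in enumerate(vocab)}
    total = 0.0
    for w in words:
        p = sum(theta_i[z] * beta[z][index[w]] for z in range(len(theta_i)))
        total += math.log(p)
    return total


# Hypothetical toy parameters: two topics over a two-word vocabulary.
vocab = ["data", "web"]
beta = [[1.0, 0.0],   # topic 0 only emits "data"
        [0.0, 1.0]]   # topic 1 only emits "web"
doc = sample_document([1.0, 0.0], beta, vocab, 5, random.Random(0))
ll = text_log_likelihood(["data"], [0.5, 0.5], beta, vocab)
```

In a full model the topic distribution would be fitted so that this text likelihood and the network-structure term agree, which is the consistency requirement stated on the slide.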

Case Study: Topic Hierarchy Building for DBLP

Probabilistic Topic Models with Network-Based Biased Propagation

Text-rich heterogeneous information network

Ubiquitous textual documents (news, papers)

Connect with users and other objects: Topic propagation

Deng, Han et al., “Probabilistic Topic Models with Biased Propagation on Heterogeneous Information Networks”, KDD’11

How to discover latent topics and identify clusters of multi-typed objects simultaneously?

How can text data and heterogeneous information network mutually enhance each other in topic modeling and other text mining tasks?

Biased Topic Propagation

Intuition: InfoNet provides valuable information

Different objects have their own inherent information (e.g., D with rich text and U without explicit text)

To treat documents with rich text and other objects without explicit text in a different way

Topic(D): inherent text + connected U

Topic(U): connected D

Basic Criterion: (Biased Topic Propagation)

The topic of an object without explicit text depends on the topic of the documents it connects

The topic of a document is correlated with its objects to some extent, and should be principally determined by its inherent content of the text

A simple and unbiased topic propagation does not make much sense
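One possible reading of this criterion is sketched below in Python: objects without text take the average topic distribution of their connected documents, while each document keeps most of the distribution estimated from its own text and takes only a small share from its connected objects. The weight `xi`, the averaging rule, and all names are assumptions made for illustration; this is not the update rule from the KDD’11 paper.

```python
def mean_vector(vectors):
    """Element-wise average of equally long probability vectors (non-empty list)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def biased_propagation(text_topics, links, xi=0.8, iterations=10):
    """text_topics: {document: topic distribution estimated from its own text}.
    links: {object without text: [documents it connects to]}.
    xi: assumed bias weight on a document's own text (1.0 ignores the network)."""
    doc_topics = {d: list(p) for d, p in text_topics.items()}
    obj_topics = {}
    for _ in range(iterations):
        # Topic(U): determined only by the documents the object connects to.
        for u, docs in links.items():
            obj_topics[u] = mean_vector([doc_topics[d] for d in docs])
        # Topic(D): mostly the document's own text, partly its connected objects.
        for d, own in text_topics.items():
            neighbors = [obj_topics[u] for u, docs in links.items() if d in docs]
            if neighbors:
                pulled = mean_vector(neighbors)
                doc_topics[d] = [xi * t + (1 - xi) * n for t, n in zip(own, pulled)]
    return doc_topics, obj_topics


# Hypothetical toy network: one user connected to two documents on different topics.
text_topics = {"d1": [1.0, 0.0], "d2": [0.0, 1.0]}
links = {"u1": ["d1", "d2"]}
doc_topics, obj_topics = biased_propagation(text_topics, links)
```

Setting `xi` to 0.5 or lower would approach the unbiased propagation that the slide argues against, since a document's topics would then be driven as much by its neighbors as by its own text.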

Incorporating Heterogeneous Info. Network

L(C): Topic model

R(G): Biased propagation

Experiments: DBLP & NSF Awards

Data Collection: DBLP, NSF-Awards

Metrics: Accuracy (AC), Normalized mutual information (NMI)

Results
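For reference, the two clustering metrics named above can be computed as in the following Python sketch. It assumes NMI is normalized by the larger of the two entropies, which is one of several common conventions (the slide does not say which was used), and it finds the best cluster-to-class mapping for accuracy by brute force, which is only practical for a handful of clusters. Function names are illustrative.

```python
import math
from collections import Counter
from itertools import permutations


def entropy(counts, n):
    """Shannon entropy (natural log) of a label distribution given its counts."""
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def nmi(labels_true, labels_pred):
    """Mutual information divided by max(H(true), H(pred)); an assumed convention."""
    n = len(labels_true)
    ct, cp = Counter(labels_true), Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mi = sum((c / n) * math.log((c * n) / (ct[a] * cp[b]))
             for (a, b), c in joint.items())
    denom = max(entropy(ct, n), entropy(cp, n))
    return mi / denom if denom > 0 else 1.0  # both labelings trivial: treat as equal


def clustering_accuracy(labels_true, labels_pred):
    """Fraction correct under the best one-to-one cluster-to-class mapping."""
    classes, clusters = list(set(labels_true)), list(set(labels_pred))
    joint = Counter(zip(labels_pred, labels_true))
    best = 0
    if len(clusters) <= len(classes):
        for perm in permutations(classes, len(clusters)):
            best = max(best, sum(joint[(k, c)] for k, c in zip(clusters, perm)))
    else:
        for perm in permutations(clusters, len(classes)):
            best = max(best, sum(joint[(k, c)] for k, c in zip(perm, classes)))
    return best / len(labels_true)


truth = [0, 0, 1, 1]
perfect = ["b", "b", "a", "a"]   # same partition, different cluster names
unrelated = [0, 1, 0, 1]         # independent of the true classes
```

Both metrics ignore how clusters are named, so `perfect` scores as an exact match even though its labels differ from `truth`.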


Event Cube: An Overview

[Figure: a multidimensional text database is organized into an Event Cube representation with dimensions Time (fine level: 98.01, 98.02, 99.01, 99.02; coarse level: 1998, 1999), Location (fine level: LAX, SJC, MIA, AUS; coarse level: CA, FL, TX), and Topic (fine level: overshoot, undershoot, birds, turbulence; coarse level: Deviation, Encounter), navigated by drill-down and roll-up]

Analysis support for the analyst: Multidimensional OLAP, Ranking, Cause Analysis, Topic Summarization/Comparison, …

Event Cube: An Organized Approach for Mining and Understanding Anomalous Aviation Events

Funded by NASA (2008-2010)

Text/Topic Cube: General Idea

Heterogeneous: categorical attributes + unstructured text

How to combine? Our solution:

[Figure: each record has categorical attributes (ACN, Time, Location, Place, Environment, …) and an event report (text data)]

Cube: Categorical Attributes

Text/Topic Model: Unstructured Text

Measure:

Term/Topic   Weight
T1           W1
T2           W2
T3           W3
…            …
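The idea of a cube whose cells are defined by categorical attributes and whose measure summarizes the cell's text can be sketched as follows. This Python example assumes made-up event reports and uses raw term counts as the measure; the actual measure in the project (term or topic weights) may be computed differently, and all names here are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical event reports: categorical attributes plus free text.
REPORTS = [
    {"time": "1998", "location": "CA", "text": "overshoot on approach turbulence"},
    {"time": "1998", "location": "CA", "text": "birds near runway"},
    {"time": "1999", "location": "TX", "text": "turbulence turbulence undershoot"},
]


def build_cells(reports, dims):
    """Group reports by the chosen dimensions; the measure is the cell's term counts."""
    cells = defaultdict(Counter)
    for report in reports:
        key = tuple(report[d] for d in dims)
        cells[key].update(report["text"].split())
    return dict(cells)


base = build_cells(REPORTS, ["time", "location"])  # finest cells in this toy cube
by_time = build_cells(REPORTS, ["time"])           # roll-up: location is dropped
overall = build_cells(REPORTS, [])                 # roll-up to the single apex cell
```

Rolling up simply merges the text of the child cells, so the term weights of a coarser cell can be derived from its children without rereading the reports.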

Effective Keyword Search

TopCells (ICDE’10): Ranking aggregated cells (objects) in TextCube.

Effective OLAP Exploration

TEXplorer (submitted): Integrating keyword-based ranking and OLAP exploration

Effective Event Tracking

PET (KDD’10): tracking popularity and textual representation of events in social communities (Twitter)

[Figure: example “Healthcare Reform” with keyword sets: debate, cost, senate, …; pass, success, law, …; benefit, profit, effective, …]


Growing Parallel Paths (WWW 2011)

[Figure: DOM tag paths (HTML, DIV, UL, LI, A; HTML, TABLE, TR, TD, A; HTML, DIV, P, A) in Pages A to F, with links numbered 1 to 6 and target pages X, Y, Z, W, U, V; panel labels: Path, Result]

Mapping Pages to Records (CIKM’10)

[Figure: link paths in the cs.illinois.edu site, starting at / (root) and passing through URL paths such as /people, /people/faculty, and /people/faculty/vikram-adve, with nodes labeled People, Faculty, Research, DataMining, and PersonalSite; Web pages are mapped to structured data records with fields such as Name, URL, and Zipcode, for names including Tarek Abdelzaher, Sarita Adve, Vikram Adve, Gul Agha, Eyal Amir, Dan Roth, and Jiawei Han, and URLs including www.cs.illinois.edu/homes/hanj/, llvm.cs.uiuc.edu/~vadve/Home.html, l2r.cs.uiuc.edu/~danr/, and rsim.cs.illinois.edu/~sadve/]

Database records can be found on link paths!

WinaCS: Web Information Network Analysis for Computer Science

Integration of Web structure mining and information network analysis

Tim Weninger, Marina Danilevsky, et al., “WinaCS: Construction and Analysis of Web-Based Computer Science Information Networks", ACM SIGMOD'11 (system demo), Athens, Greece, June 2011.


Discovery of Swarms and Periodic Patterns in Moving Object Data

A system that mines moving object patterns: Z. Li, et al., “MoveMine: Mining Moving Object Databases", SIGMOD’10 (system demo)

Z. Li, B. Ding, J. Han, and R. Kays, “Mining Hidden Periodic Behaviors for Moving Objects”, KDD’10 (sub)

Z. Li, B. Ding, J. Han, and R. Kays, “Swarm: Mining Relaxed Temporal Moving Object Clusters”, VLDB’10 (sub)

← Bird flying paths shown on Google Earth

Mined periodic patterns by our new method →

← Convoy discovers only restricted patterns

Swarm discovers more patterns →

GeoTopic Discovery: Mining Spatial Text

[Figure: geo-tagged photos w. landscape (coast vs. desert vs. mountain); panels: LDM, TDM, GeoFolk, LGTA]

Z. Yin, et al., GeoTopic Discovery and Comparison, WWW'11


Conclusions: Towards Mining Data Semantics in Integrated Heterog. Networks

Most data objects are linked, forming heterogeneous information networks

Most datasets can be “organized” or “transformed” into “structured” multi-typed heterogeneous info. networks

Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, …

Structures can be progressively mined from less organized data sets by info. network analysis

Surprisingly rich knowledge can be mined from such structured heterogeneous info. networks

Clustering, ranking, classification, data cleaning, trust analysis, role discovery, similarity search, relationship prediction, …

It is promising to mine data semantics from rich info. networks!

References for the Talk

J. Han, Y. Sun, X. Yan, and P. S. Yu, “Mining Heterogeneous Information Networks" (tutorial), KDD'10.

Ming Ji, Jiawei Han, and Marina Danilevsky, "Ranking-Based Classification of Heterogeneous Information Networks", KDD'11.

Y. Sun, J. Han, et al., "RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis", EDBT’09.

Y. Sun, Y. Yu, and J. Han, "Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema", KDD’09.

Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, “PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks”, VLDB'11.

Y. Sun, R. Barber, M. Gupta, C. Aggarwal and J. Han, "Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks", ASONAM'11.

C. Wang, J. Han, et al., “Mining Advisor-Advisee Relationships from Research Publication Networks", KDD'10.

Tim Weninger, Marina Danilevsky, et al., “WinaCS: Construction and Analysis of Web-Based Computer Science Information Networks", ACM SIGMOD'11 (system demo).

X. Yin, J. Han, and P. S. Yu, “Truth Discovery with Multiple Conflicting Information Providers on the Web”, IEEE TKDE, 20(6), 2008.