March, 2011 I3.1 Noise-Aware Data Mining in Information Networks Xifeng Yan University of California...

March, 2011

I3.1 Noise-Aware Data Mining in Information Networks

Xifeng Yan
University of California at Santa Barbara
INARC

INARC PI Report

Involvement
– I2.1: In-Network Storage
– I2.2: Large-Scale Information Network Processing
– I3.1: QoI Mining of Noisy, Volatile, Uncertain, and Incomplete Heterogeneous Information Networks
– I3.2: Modeling and Mining of Text-Rich Information Networks
– E1.2: Composite Network Modeling with Composite Graphs

Objective
– Concepts, Models, Theories, Methods, and Systems for measuring and operating Information Networks and Others
– Concepts and Models in Noise-aware data mining of information networks

Collaborators
– Z. Wen IBM/SCNARC, J. Bao RPI/IRC, J. Han UIUC/INARC, M. Srivatsa IBM/INARC, V. Kawadia BBN/IRC, S. Desai Army

I3.1: Noise-Aware Mining: Graph Iceberg

R1 has a high concentration of black vertices but low connectivity.

R2, in contrast, has few black vertices but is well connected.

R3 is an anomalous region with both a high density of black vertices and high connectivity.

Nan Li, et al., Towards Iceberg Analysis in Graph OLAP, in preparation for VLDB Journal

Find abnormally high densities of intrusions in a network (targeted attacks)

Find online communities where sensitive topics appear abnormally often (extremist groups)

Help us study why this happens


Huge search space
– If we confine the size of the regions to s, the total number of regions in a graph with n vertices is O(n^s)

Our method
– Find promising vertices first
– Cluster these vertices to find the communities

Promising vertices
– Aggregate the personalized PageRank scores of the neighbors where the event takes place
– High value => good vertices

gIceberg: PPV-Based Aggregation

Personalized PageRank vector (PPV) aggregation
– Use PPV to measure the local closeness of two vertices

Local clustering algorithms
– Query-aggregated personalized PageRank score (PPS)

Personalized PageRank approximation
– Random-walk based sampling
– Pair-wise PPV formula
– Active boundary

$S_q(v) = \mathbf{q}_V \cdot \mathbf{p}_v = \sum_{x \,\mid\, x \in V,\ q \in L(x)} p_v(x)$
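The aggregated score S_q(v) adds up the personalized PageRank mass that v assigns to vertices carrying attribute q. A minimal sketch of the random-walk sampling idea follows. It assumes a restart probability `alpha` of 0.15, an adjacency-list graph, and the function name `aggregated_ppr`; none of these come from the slides, and the pair-wise PPV formula and active boundary techniques are not shown.

```python
import random

def aggregated_ppr(adj, has_event, v, alpha=0.15, walks=2000, seed=0):
    """Estimate S_q(v) by sampling end points of random walks with restart.

    adj: dict mapping each vertex to a list of neighbours.
    has_event: set of vertices x whose label set L(x) contains q.
    Each walk stops with probability alpha at every step, so its end
    vertex is distributed according to the personalized PageRank of v.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(walks):
        x = v
        # Continue with probability 1 - alpha; stop early at dead ends.
        while rng.random() > alpha and adj[x]:
            x = rng.choice(adj[x])
        hits += x in has_event
    return hits / walks


# Toy graph (hypothetical): two 4-cliques joined by the edge 3-4.
clique_a, clique_b = [0, 1, 2, 3], [4, 5, 6, 7]
adj = {v: [w for w in clique_a if w != v] for v in clique_a}
adj.update({v: [w for w in clique_b if w != v] for v in clique_b})
adj[3].append(4)
adj[4].append(3)
event_q = {0, 1, 2}  # vertices where the event q takes place

score_inside = aggregated_ppr(adj, event_q, 0)   # vertex inside the event region
score_outside = aggregated_ppr(adj, event_q, 7)  # vertex far from the events
```

A vertex surrounded by event vertices gets a high score, which is the sense in which high values mark promising vertices.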


Our model: aggregated personalized PageRank + sampling
– 10-50 times faster

A novel graph mining framework
– Finds anomalous regions in large heterogeneous information networks
– Noise-aware mining: the score is an aggregate measure, which can easily overcome noise

The first of its kind in network science

I3.1: Noise-Aware Mining: Structural Correlation

A novel metric, Decayed Hitting Time, is proposed to assess and rank structural correlations

SIGMOD reviewer: “Interesting problem that I haven’t seen before”

– The first of its kind defined for networks
– Sampling algorithm: 10-20 times faster
– An aggregate measure: noise-resistant

Question: Is the distribution of events (blue nodes) influenced by the network links or not? If it is, to what degree?

(UCSB) Z. Guan, J. Wu, Z. Yun, A. Singh, X. Yan, Assessing and Ranking Structural Correlations in Graphs, Proc. 2011 Int. Conf. on Management of Data (SIGMOD'11), 2011

What is structural correlation?

Real-world networks contain not only nodes and edges but also events (attributes)
– Information network: events, documents, etc.
– Social network: blog posts, rumors, opinions, online shopping, etc.
– Virus/malware infections

Viruses propagate through computer networks, email networks, or Facebook. Which one is the main channel for a specific virus/malware?

Some events are correlated with network links, while others just occur randomly

Correlation Metric in Information Networks

Question: Is the distribution of events (blue nodes) influenced by the network links or not? If it is, to what degree?

Why measure such correlation?
– Helps understand the distribution of events in networks
– Helps detect viral influence in the underlying network
– Correlation depends on link type, event type, and time

How to measure?


If correlated, black nodes tend to stick together.

A naïve approach: only look at neighborhood

General idea: compute the aggregated proximity among black nodes, which will be noise-resistant

Measure definition

The measure
– Vq: the set of nodes having event q; s(*) can be any graph proximity measure

We choose hitting time since it treats the target node set as a whole (compared to personalized PageRank, shortest distance, etc.)
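A sketch of how such a measure can be written. It assumes the aggregate is a plain average of each event node's proximity to the remaining event nodes. The symbol ρ matches the ρ values used in the later slides; the averaging form itself is an assumption here.

```latex
\rho(V_q) = \frac{1}{|V_q|} \sum_{v \in V_q} s\bigl(v,\; V_q \setminus \{v\}\bigr)
```

Because every event node contributes to the average, a few misplaced or missing events change ρ only slightly, which is the sense in which an aggregated proximity is noise-resistant.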

Hitting time

The expected number of steps to reach a target node via a random walk:

$h(v_i, B) = \sum_{t=1}^{\infty} t \cdot \Pr(T_B = t \mid x_0 = v_i)$

– B: target node set; Pr(TB=t|x0=vi): the probability that we start from vi and first reach B after t steps

Decayed Hitting Time (DHT)

Hitting time can be infinite

To calculate proximity better and faster, we propose using Decayed Hitting Time
– Maps [1,∞) to [0,1]; a high value means high proximity
– Emphasizes the importance of short paths and reduces the impact of long paths
– Facilitates approximation of DHT
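One form consistent with the three properties above, assuming an exponential decay factor e^{-(t-1)}. The decay base is an assumption; any factor that equals 1 at t = 1 and vanishes as t grows has the same properties.

```latex
\tilde{h}(v_i, B) = \sum_{t=1}^{\infty} e^{-(t-1)} \, \Pr(T_B = t \mid x_0 = v_i)
```

With this choice a walk that hits B in one step contributes 1, and a walk that never hits B contributes 0, so the value stays finite even when the plain hitting time is infinite.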

DHT sampling approximation

Perform c simulated random walks from vi

Two strategies:
– In each random walk, stop when we hit a target node. This gives an estimate of the DHT, but
   • the walk may never stop
   • in a large graph, it can be time-consuming
– In each random walk, stop when we hit a target node or when the maximum number of steps (denoted by s) is reached. This gives an approximation to that estimate

Bounds for Sampling Approximation

Suppose some of the c random walks hit a target and the remaining ones reach s steps without hitting one

For each random walk in the latter group, its contribution to the estimate is bounded above and is bounded below by 0

Together these give upper and lower bounds for the estimate
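The truncated-walk strategy and its bounds can be sketched as follows. The sketch assumes the decay e^{-(t-1)} for a walk that first hits a target at step t, so a walk cut off after s steps would have contributed at most e^{-s}. The function name `dht_bounds` and the adjacency-list input are illustrative choices, not part of the slides.

```python
import math
import random

def dht_bounds(adj, start, targets, c=1000, s=20, seed=0):
    """Lower and upper bounds on the sampled decayed hitting time.

    adj: dict mapping each vertex to a list of neighbours.
    targets: set B of target vertices.
    Walks that hit B at step t contribute e^{-(t-1)} (assumed decay).
    Walks truncated after s steps contribute 0 to the lower bound and
    at most e^{-s} each to the upper bound.
    """
    rng = random.Random(seed)
    total, truncated = 0.0, 0
    for _ in range(c):
        x = start
        for t in range(1, s + 1):
            if not adj[x]:          # dead end: B can never be reached
                break
            x = rng.choice(adj[x])
            if x in targets:
                total += math.exp(-(t - 1))
                break
        else:                       # s steps taken without hitting B
            truncated += 1
    lower = total / c
    upper = lower + (truncated / c) * math.exp(-s)
    return lower, upper


# Path graph 0 - 1 - 2 (hypothetical toy input).
path = {0: [1], 1: [0, 2], 2: [1]}
one_step = dht_bounds(path, 0, {1})    # every walk hits the target at t = 1
lower, upper = dht_bounds(path, 0, {2})
```

The gap between the two bounds shrinks exponentially in s, which is why a modest step limit is enough in practice.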

From measure to significance

Consider a randomly selected set of m nodes
– As m increases, m randomly selected nodes tend to be closer to one another (in fact, the measure ρ of a random set can be proved to increase monotonically with m)
– Relying on ρ alone is therefore not enough; we should assess how far the observed ρ deviates from the random case

An approximation method for significance

Estimating the mean of ρ for random node sets

• Sampling: Randomly sample c node sets of size m and estimate their ρ values (also by sampling). Then take the sample mean as an estimate of the expected ρ

An approximation method by geometric distribution
– When generating a random node set, each node has probability m/n of being chosen
– Relaxation: each node is chosen independently
– Starting from a node, the number of steps until the random walk hits a target node then follows a geometric distribution, and the expected value follows from the definition of DHT
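Under the independence relaxation, every step of the walk hits a target with the same probability, written p below. Here p is a placeholder for that per-step hit probability, and the decay e^{-(t-1)} is an assumed form of DHT. The expected DHT then has a closed form obtained by summing a geometric series:

```latex
\Pr(T_B = t) = (1-p)^{t-1}\, p,
\qquad
\mathbb{E}\bigl[\tilde{h}\bigr]
  = \sum_{t=1}^{\infty} e^{-(t-1)} (1-p)^{t-1}\, p
  = \frac{p}{1 - (1-p)/e}
```

No random walks are needed for this estimate, which makes it a cheap alternative to sampling node sets.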

Estimating the variance of ρ for random node sets

Also use sampling
– Sample node sets of size m and estimate their ρ values. Then compute the sample variance
– Since each term in the definition of ρ is assumed to be independent, the variance of ρ follows from the variance of the individual DHT terms
– Thus, we sample pairs, estimate their DHTs, and compute the sample variance
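Once the mean and variance of ρ under random placement are estimated, the deviation of an observed ρ can be expressed as a z-score. The sketch below assumes a normal approximation and uses the plain "sample node sets" strategy. The callable `rho` and the toy ring measure `toy_rho` are hypothetical stand-ins for a DHT-based ρ.

```python
import random
import statistics

def z_score(nodes, m, rho, rho_observed, c=200, seed=0):
    """Normal-approximation significance of an observed rho value.

    nodes: list of all vertices; m: size of the event node set.
    rho: hypothetical callable mapping a list of nodes to a correlation value.
    """
    rng = random.Random(seed)
    # Sample c random node sets of size m and evaluate rho on each.
    samples = [rho(rng.sample(nodes, m)) for _ in range(c)]
    mean = statistics.fmean(samples)
    std = statistics.stdev(samples)   # sample standard deviation
    return (rho_observed - mean) / std


def toy_rho(chosen, n=100):
    """Toy measure on a ring of n nodes: the fraction of chosen nodes
    that have at least one ring neighbour in the chosen set."""
    s = set(chosen)
    return sum(((v - 1) % n in s) or ((v + 1) % n in s) for v in s) / len(s)


ring = list(range(100))
clustered = list(range(10))           # ten consecutive nodes on the ring
z = z_score(ring, len(clustered), toy_rho, toy_rho(clustered))
```

A clustered event set scores far above random sets of the same size, so its z-score is large and positive; an event scattered at random would score near zero.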

Experiments - Datasets

DBLP
– Co-author network
– Events: keywords in paper titles
– 815,940 nodes, 2,857,960 edges, and 171,614 events

TaoBao
– Online shopping data, friend network
– Events: products
– 794,001 nodes, 1,370,284 edges, 100 typical products

Twitter
– 40 million nodes and 1.4 billion edges

Experiments - Efficiency

Experiments - Effectiveness (TaoBao)

Experiments – Correlation Evolution (TaoBao)

Collaborations

Collaborations with researchers in other networks
– (E1.2) Work with Prithwish Basu (BBN) on Network Design
– (I2.1) Work with Arun Iyengar and Mudhakar Srivatsa (IBM), who have done much work on DTN and storage, on connecting information network processing on DTN with processing on clusters. Shengqi Yang will work on it at IBM this summer.
– (T2.3) Work with Vikas Kawadia (BBN) on using graph query processing for distributed trust computing. Ziyu Guan is collaborating with Vikas.
– (S1.1) Work with Zhen Wen (IBM) on the social network application of graph density indexing.
– (E1.1) Work with Jie Bao (RPI) on RDF queries using neighborhood-based graph search.
– (I3.1, I3.2) Work with Jiawei Han (UIUC) on graph mining
– Work with Sachi Desai (Army) on graph query language/system

Research Papers

A. Khan, N. Li, Z. Guan, X. Yan, S. Chakraborty, and S. Tao, Neighborhood Based Fast Graph Search in Large Networks, Proc. 2011 Int. Conf. on Management of Data (SIGMOD'11), 2011.

Z. Guan, J. Wu, Z. Yun, A. Singh, X. Yan, Assessing and Ranking Structural Correlations in Graphs, Proc. 2011 Int. Conf. on Management of Data (SIGMOD'11), 2011.

Qiang Qu, Feida Zhu, Xifeng Yan, Jiawei Han, Philip S. Yu, and Hongyan Li, Efficient Topological OLAP on Information Networks, DASFAA'11.

Yizhou Sun et al., PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks, submitted to VLDB’2011.

Nan Li, et al., Towards Iceberg Analysis in Graph OLAP, to be submitted to VLDB Journal.

Research Papers (continued)

Gengxin Miao, Ziyu Guan, Louise Moser, Xifeng Yan, Shu Tao, Nikos Anerousis, Latent Association Analysis of Document Pairs, submitted to SIGKDD’2011.

Ziyu Guan et al., Diffusion through Co-occurrence Relationships for Expert Search on the Web, submitted to SIGIR’2011.

Shengqi Yang, Bo Zong, Arijit Khan, Ben Zhao, Xifeng Yan, Managing Large-Scale Graphs for Efficient Distributed Processing, submitted to VLDB 2011.

Nan Li, Arijit Khan, Xifeng Yan, and Zhen Wen, Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal.

C. C. Aggarwal, A. Khan, X. Yan, A Probabilistic Index for Massive and Dynamic Graph Streams, Research Report, to be submitted to VLDB Journal.

Next Six Months and Path Ahead to 2012

Continue research
– Composite Network Modeling and Design
– Large-scale Information Network Processing
– Information Network in DTN
– Information Network Mining and Measuring with Noise and Dynamics
   • Structural Correlation in Dynamic Situations
   • Mining Graph Patterns in a Noisy Environment
   • Node Mining and Inference for Multiple Information Networks (QoI)
– Information Network Modeling with Text
– Graph Query Language
– Information Network Query Engine

Brief Summary of My Team’s Work in Other Tasks

I2.2: Graph Partition for Distributed Graph Computing

Shengqi Yang, et al., Managing Large-Scale Graphs for Efficient Distributed Processing, submitted to VLDB 2011

Are typical techniques efficient for graph queries?

Graph partitioning and distribution techniques (e.g., Pregel)

Limitations:
– Unbalanced workload due to skewed (non-uniformly distributed) graph queries.
– Communication overhead due to inter-machine (cross-partition) communication.

Goal
– Model-based graph partitioning techniques
– First-of-its-kind publicly available distributed graph computing platform for information, social, and communication networks

I2.1: Adapt Sedge to DTN

Master
– Vertex->Partition Map
   + Network Contact Graph
   + Route Table (opportunistic path)

Worker
– Message Queue (Cache)

Superstep = Time slot
– e.g., 1 min, 1 hour, 1 day, etc.

[Figure: Contact Graph and Cluster Connection; node labels 1-6 and partition labels P1-P6]

I2.2 gDensity: Model-Based Indexing

Problem definition (labeled proximity search)
– Label-based graph proximity search seeks the top-k vertex subsets with the smallest diameters for a given query of distinct labels. Each subset must cover all the labels specified in the query.
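To make the problem definition concrete, here is a brute-force baseline that enumerates every label-covering subset and ranks the subsets by diameter. It is not the gDensity index; it only illustrates what the query asks for, on a hypothetical toy graph, and its cost grows with the product of the candidate-list sizes.

```python
from collections import deque
from itertools import product

def bfs_dist(adj, src):
    """Unweighted shortest-path distances from src to every reachable vertex."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def top_k_proximity(adj, labels, query, k=1):
    """Return the k label-covering vertex subsets with the smallest diameters."""
    dist = {v: bfs_dist(adj, v) for v in adj}
    # One candidate list per query label: the vertices carrying that label.
    candidates = [[v for v in adj if lab in labels[v]] for lab in query]
    best = {}
    for combo in product(*candidates):
        subset = frozenset(combo)
        if subset in best:
            continue
        # Diameter: largest pairwise distance inside the subset.
        best[subset] = max(dist[a].get(b, float("inf"))
                           for a in subset for b in subset)
    ranked = sorted(best.items(), key=lambda item: (item[1], sorted(item[0])))
    return [(sorted(s), d) for s, d in ranked[:k]]


# Path graph 0 - 1 - 2 - 3 - 4 with labels on some vertices (toy data).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
labels = {0: {"a"}, 1: {"b"}, 2: {"c"}, 3: set(), 4: {"a"}}
result = top_k_proximity(path, labels, ["a", "b", "c"], k=2)
```

On the toy path, the subset {0, 1, 2} has diameter 2 and {1, 2, 4} has diameter 3, so they are returned in that order.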

Nan Li et al., Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal

10 – 300 times faster

Using a probabilistic model to build the index

I2.2: Graph Search: the Model-Based Approach

[Figure: aligning two networks, LinkedIn and Facebook]

Ideas
– Use an information propagation model to propagate labels in information networks
– Convert vertices to vectors
– Align sets of vectors

Query speed: 0.1 sec for WebGraph (10M vertices, 213M edges)

[Figure: Information Propagation Model]

A. Khan et al., Neighborhood Based Fast Graph Search in Large Networks, SIGMOD’11
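The step of turning vertices into vectors can be sketched as below. The sketch assumes that a label at distance d from a vertex adds alpha**d to that vertex's vector, up to a fixed number of hops. The decay factor `alpha`, the hop limit, and the function name are illustrative assumptions, not the propagation model of the cited paper.

```python
from collections import deque

def neighborhood_vector(adj, labels, v, hops=2, alpha=0.5):
    """Label vector of v: each label within `hops` of v adds alpha**distance."""
    vec = {}
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        # Propagate the labels of u to v, decayed by the distance between them.
        for lab in labels.get(u, ()):
            vec[lab] = vec.get(lab, 0.0) + alpha ** dist[u]
        if dist[u] < hops:
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
    return vec


# Path graph 0 - 1 - 2 with one label per vertex (toy data).
path = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: ["a"], 1: ["b"], 2: ["a"]}
vec0 = neighborhood_vector(path, labels, 0)
```

Two vertices from different networks can then be compared through their vectors instead of through exact subgraph matching.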

I3.2: Progressive Network Analysis for Expert Search

Goal: find and rank people who have the expertise described by a user query
– Web pages are noisier and contain more spam than an enterprise corpus
– Both relevance and reputation should be considered
– Use a heterogeneous hypergraph to model the co-occurrence relationships among people and words, and devise a heat diffusion model on the hypergraph
– Applied to 0.5B web pages
– Accuracy: 50%-200% improvement over the leading language model methods
– Significantly overcomes noise in the Web

Ziyu Guan, et al., “Diffusion through Co-occurrence Relationships for Expert Search on the Web”, SIGIR’11 (sub)

I3.2: Latent Association Analysis of Document Pairs

Latent Association Analysis (LAA) mines the topics of two document sets simultaneously, taking the bipartite network between two document sets into consideration

One of the first attempts to analyze the topic structures of two connected document sets, aiming to infer their mapping network model

LAA significantly outperforms existing algorithms with 70% accuracy improvement

[Figure: Topic Simplex for Corpus 1, Topic Simplex for Corpus 2, Correlation Factor, Document Pairs]

Gengxin Miao, et al., “Latent Association Analysis of Document Pairs”, KDD’11 (sub)

E1.2: Collaborative Network Modeling And Inference

Questions:
1. How to model it?
2. How does information flow among different agents?
3. How do agents interact with each other?
4. How to measure the quality of the flow?
5. Is there any mis-interaction among these agents?
6. Can we identify the role of the agents?
7. Can we identify the relationship?
8. Can we identify the weak components?
