September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques —...

91
November 2, 2022 Data Mining: Concepts and Techniq ues 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng Department of Computer Science University of Missouri, Columbia ©2006 Jiawei Han and Micheline Kamber. All rights reserved. Acknowledgements: Based on the slides by Sangkyum Kim and Chen Chen

Transcript of September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques —...

Page 1: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 1

Data Mining: Concepts and Techniques

— Chapter 6 —Network Mining (Social Networks)

Jianlin Cheng

Department of Computer Science

University of Missouri, Columbia

©2006 Jiawei Han and Micheline Kamber. All rights reserved.

Acknowledgements: Based on the slides by Sangkyum Kim and Chen Chen

Page 2: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 2

Network Mining

Social Network Introduction

Statistics and Probability Theory

Models of Social Network Generation

Networks in Biological System

Mining on Social Network

Summary

Page 3: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 3

Society

Nodes: individuals

Links: social relationship (family/work/friendship/etc.)

S. Milgram (1967)

Social networks: Many individuals with diverse social interactions between them.

John Guare

Six Degrees of Separation

Page 4: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 4

Communication networks

The Earth is developing an electronic nervous system, a network with diverse nodes and links are

-computers

-routers

-satellites

-phone lines

-TV cables

-EM waves

Communication networks: Many non-identical components with diverse connections between them.

Page 5: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 5

Complex systemsMade of

many non-identical elements connected by diverse interactions.

NETWORK

Page 6: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 6

“Natural” Networks and Universality

Consider many kinds of networks: social, technological, business, economic, content,…

These networks tend to share certain informal properties: large scale; continual growth distributed, democratic growth: vertices “decide” who to link

to mixture of local and long-distance connections abstract notions of distance: geographical, content, social,…

Do natural networks share more quantitative universals? What would these “universals” be? How can we make them precise and measure them? How can we explain their universality? This is the domain of social network theory Sometimes also referred to as link analysis

Page 7: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 7

Some Interesting Quantities

Connected components: how many, and how large?

Network diameter: maximum (worst-case) or average? exclude infinite distances? (disconnected components) the small-world phenomenon

Clustering: to what extent that links tend to cluster “locally”? what is the balance between local and long-distance

connections? what roles do the two types of links play?

Degree distribution: what is the typical degree in the network? what is the overall distribution?

Page 8: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 8

A “Canonical” Natural Network has…

Few connected components: often only 1 or a small number, indep. of network size

Small diameter: often a constant independent of network size (like 6) or perhaps growing only logarithmically with network

size or even shrink? typically exclude infinite distances

A high degree of clustering: considerably more so than for a random network in tension with small diameter

A heavy-tailed degree distribution: a small but reliable number of high-degree vertices often of power law form

Page 9: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 9

Probabilistic Models of Networks

All of the network generation models we will study are probabilistic or statistical in nature

They can generate networks of any size They often have various parameters that can be set:

size of network generated average degree of a vertex fraction of long-distance connections

The models generate a distribution over networks Statements are always statistical in nature:

with high probability, diameter is small on average, degree distribution has heavy tail

Thus, we’re going to need some basic statistics and probability theory

Page 10: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 10

Social Network Analysis

Social Network Introduction

Statistics and Probability Theory

Models of Social Network Generation

Networks in Biological System

Mining on Social Network

Summary

Page 11: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 11

The Normal Distribution

The normal or Gaussian density: applies to continuous, real-valued random variables characterized by mean (average) m and standard

deviation s density at x is defined as

peaks at x = , then dies off exponentially rapidly the classic “bell-shaped curve”

exam scores, human body temperature, remarks:

can control mean and standard deviation independently can make as “broad” as we like, but always have finite

variance

Page 12: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 12

The Normal Distribution

Page 13: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 13

The Binomial Distribution

coin with Pr[heads] = p, flip n times probability of getting exactly k heads:

choose(n,k) pk(1-p)n-k

for large n and p fixed: approximated well by a normal with

= np, = sqrt(np(1-p)) / 0 as n grows leads to strong large deviation bounds

Page 14: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 14

The Binomial Distribution

www.professionalgambler.com/ binomial.html

How manywins in 21 bets?

Page 15: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 15

The Poisson Distribution

like binomial, applies to variables taken on integer values > 0

often used to model counts of events number of phone calls placed in a given time period number of times a neuron fires in a given time period

single free parameter probability of exactly x events:

mean and variance are both binomial distribution with n large, p = /n ( fixed)

converges to Poisson with mean

Page 16: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 16

The Poisson Distribution

single photoelectron distribution

Page 17: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 17

Heavy-tailed Distributions

Page 18: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 18

Heavy-Tailed Distributions

Page 19: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 19

Distributions vs. Data

All these distributions are idealized models In practice, we do not see distributions, but data Thus, there will be some largest value we observe Also, can be difficult to “eyeball” data and choose model So how do we distinguish between Poisson, power law, etc? Typical procedure:

might restrict our attention to a range of values of interest accumulate counts of observed data into equal-sized bins look at counts on a log-log plot note that

power law: log(Pr[X = x]) = log(1/x) = - log(x) linear, slope –

Normal: log(Pr[X = x]) = log(a exp(-x/b)) = log(a) – x/b non-linear, concave near mean

Poisson: log(Pr[X = x]) = log(exp(-) x/x!) also non-linear

Page 20: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 20

Zipf’s Law Look at the frequency of English words:

“the” is the most common, followed by “of”, “to”, etc.

claim: frequency of the n-th most common ~ 1/n (power law, α = 1)

General theme: rank events by their frequency of occurrence resulting distribution often is a power law!

Other examples: North America city sizes personal income file sizes

Page 21: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 21

Linear scales on both axes

Logarithmic scales on both axes

The same data plotted on linear and logarithmic scales. Both plots show a Zipf distribution with 300 datapoints

Zipf’s Law

Page 22: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 22

Social Network Analysis

Social Network Introduction

Statistics and Probability Theory

Models of Social Network Generation

Networks in Biological System

Mining on Social Network

Summary

Page 23: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 23

Some Models of Network Generation

Random graphs (Erdös-Rényi models): gives few components and small diameter does not give high clustering and heavy-tailed degree

distributions is the mathematically most well-studied and understood model

Watts-Strogatz models: give few components, small diameter and high clustering does not give heavy-tailed degree distributions

Scale-free Networks: gives few components, small diameter and heavy-tailed

distribution does not give high clustering

Hierarchical networks: few components, small diameter, high clustering, heavy-tailed

Affiliation networks: models group-actor formation

Page 24: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 24

Models of Social Network Generation

Random Graphs (Erdös-Rényi models)

Watts-Strogatz models

Scale-free Networks

Page 25: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 25

The Erdös-Rényi (ER) Model(Random Graphs)

All edges are equally probable and appear independently NW size N > 1 and probability p: distribution G(N,p)

each edge (u,v) chosen to appear with probability p N(N-1)/2 trials of a biased coin flip

The usual regime of interest is when p ~ 1/N, N is large e.g. p = 1/2N, p = 1/N, p = 2/N, p=10/N, p = log(N)/N, etc. in expectation, each vertex will have a “small” number of

neighbors will then examine what happens when N infinity can thus study properties of large networks with bounded

degree Degree distribution of a typical G drawn from G(N,p):

draw G according to G(N,p); look at a random vertex u in G what is Pr[deg(u) = k] for any fixed k? Poisson distribution with mean degree: λ = p(N-1) ~ pN Sharply concentrated; not heavy-tailed

Especially easy to generate NWs from G(N,p)

Page 26: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 26

Erdös-Rényi Model (1960)

- Democratic

- Random

Pál ErdösPál Erdös (1913-1996)

Connect with

probability pp=1/6 N=10

k~1.5 Poisson distribution

Page 27: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 27

A Closely Related Model

For any fixed m <= N(N-1)/2, define distribution G(N,m): choose uniformly at random from all

graphs with exactly m edges G(N,m) is “like” G(N,p) with p = m/(N(N-

1)/2) ~ 2m/N2

this intuition can be made precise, and is correct

if m = cN then p = 2c/(N-1) ~ 2c/N mathematically trickier than G(N,p)

Page 28: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 28

Another Closely Related Model

Graph process model: start with N vertices and no edges at each time step, add a new edge choose new edge randomly from among all

missing edges Allows study of the evolution or emergence of

properties: as the number of edges m grows in relation to N equivalently, as p is increased

Page 29: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 29

Evolution of a Random Network

We have a large number n of vertices

We start randomly adding edges one at a time

At what time t will the network:

have at least one “large” connected component?

have a single connected component?

have “small” diameter?

have a “large” clique?

How gradually or suddenly do these properties

appear?

Page 30: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 30

Combining and Formalizing Familiar Ideas

Explaining universal behavior through statistical models our models will always generate many networks almost all of them will share certain properties (universals)

Explaining tipping through incremental growth we gradually add edges, or gradually increase edge probability p many properties will emerge very suddenly during this process

size of police force

crim

e r

ate

number of edges

pro

b. N

W c

onnect

ed

Page 31: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 31

So Which Properties Tip?

Just about all of them! The following properties all have threshold

functions: having a “giant component” being connected having a perfect matching (N even) having “small” diameter

With remarkable consistency (N = 50): giant component (~ 40 edges), connected (~

100 edges), small diameter (~ 180 edges)

Page 32: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 32

Ever More Precise…

Page 33: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 33

Erdos-Renyi Summary

A model in which all connections are equally likely each of the N(N-1)/2 edges chosen randomly &

independently As we add edges, a precise sequence of events

unfolds: graph acquires a giant component graph becomes connected graph acquires small diameter

Many properties appear very suddenly (tipping, thresholds)

All statements are mathematically precise But is this how natural networks form? If not, which aspects are unrealistic?

may all edges are not equally likely!

Page 34: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 34

The Clustering Coefficient of a Network

Let nbr(u) denote the set of neighbors of u in a graph all vertices v such that the edge (u,v) is in the graph

The clustering coefficient of u: let k = |nbr(u)| (i.e., number of neighbors of u) choose(k,2) = k*(k-1) / 2: max possible # of edges between vertices

in nbr(u) c(u) = (actual # of edges between vertices in nbr(u))/choose(k,2) 0 <= c(u) <= 1; measure of cliquishness of u’s neighborhood

Clustering coefficient of a graph: average of c(u) over all vertices u

k = 4choose(k,2) = 6c(u) = 4/6 = 0.666…

u

Page 35: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 35

Clustering: My friends will likely know each other!

Probability to be connected C » p

C =# of links between 1,2,…n neighbors

n(n-1)/2

Networks are clustered [large C(p)]

but have a small characteristic path length

[small L(p)].

Network C Crand L N

WWW 0.1078 0.00023 3.1 153127

Internet 0.18-0.3 0.001 3.7-3.763015-6209

Actor 0.79 0.00027 3.65 225226

Coauthorship 0.43 0.00018 5.9 52909

Metabolic 0.32 0.026 2.9 282

Foodweb 0.22 0.06 2.43 134

C. elegance 0.28 0.05 2.65 282

The Clustering Coefficient of a Network

L: diameter?

Page 36: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 36

Erdos-Renyi: Clustering Coefficient

Generate a network G according to G(N,p) Examine a “typical” vertex u in G

choose u at random among all vertices in G what do we expect c(u) to be?

Answer: exactly p. In G(N,m), expect c(u) to be 2m/N(N-1) Both cases: c(u) entirely determined by overall

density Baseline for comparison with “more clustered”

models Erdos-Renyi has no bias towards clustered or

local edges

Page 37: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 37

Models of Social Network Generation

Random Graphs (Erdös-Rényi models)

Watts-Strogatz models

Scale-free Networks

Page 38: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 38

Caveman and Solaria Erdos-Renyi:

sharing a common neighbor makes two vertices no more likely to be directly connected than two very “distant” vertices

every edge appears entirely independently of existing structure

But in many settings, the opposite is true: you tend to meet new friends through your old friends two web pages pointing to a third might share a topic two companies selling goods to a third are in related

industries Watts’ Caveman world:

overall density of edges is low but two vertices with a common neighbor are likely connected

Page 39: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 39

Making it (Somewhat) Precise: the -model

The -model has the following parameters or “knobs”: N: size of the network to be generated k: the average degree of a vertex in the network to be generated p: the default probability two vertices are connected : adjustable parameter dictating bias towards local connections

For any vertices u and v: define m(u,v) to be the number of common neighbors (so far)

Key quantity: the propensity R(u,v) of u to connect to v if m(u,v) >= k, R(u,v) = 0 (share too many friends not to connect) if m(u,v) = 0, R(u,v) = p (no mutual friends no bias to connect) else, R(u,v) = p + (m(u,v)/k) (1-p)

Generate NW incrementally using R(u,v) as the edge probability; details omitted

Note: = infinity is “like” Erdos-Renyi (but not exactly)

Page 40: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 40

Small Worlds and Occam’s Razor

For small , should generate large clustering coefficients we “programmed” the model to do so Watts claims that proving precise statements is hard…

But we do not want a new model for every little property Erdos-Renyi small diameter -model high clustering coefficient

In the interests of Occam’s Razor, we would like to find a single, simple model of network generation… … that simultaneously captures many properties

Watt’s small world: small diameter and high clustering

Page 41: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 41

Meanwhile, Back in the Real World…

Watts examines three real networks as case studies: the Kevin Bacon graph the Western states power grid the C. elegans nervous system

For each of these networks, he: computes its size, diameter, and clustering coefficient compares diameter and clustering to best Erdos-

Renyi approx. shows that the best -model approximation is better important to be “fair” to each model by finding best

fit Overall moral:

if we care only about diameter and clustering, is better than p

Page 42: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 42

Case 1: Kevin Bacon Graph

Vertices: actors and actresses Edge between u and v if they appeared in a film together

Is Kevin Bacon the most

connected actor?

NO!

Rank NameAveragedistance

# ofmovies

# oflinks

1 Rod Steiger 2.537527 112 25622 Donald Pleasence 2.542376 180 28743 Martin Sheen 2.551210 136 35014 Christopher Lee 2.552497 201 29935 Robert Mitchum 2.557181 136 29056 Charlton Heston 2.566284 104 25527 Eddie Albert 2.567036 112 33338 Robert Vaughn 2.570193 126 27619 Donald Sutherland 2.577880 107 2865

10 John Gielgud 2.578980 122 294211 Anthony Quinn 2.579750 146 297812 James Earl Jones 2.584440 112 3787…

876 Kevin Bacon 2.786981 46 1811…

876 Kevin Bacon 2.786981 46 1811

Kevin Bacon

No. of movies : 46 No. of actors : 1811 Average separation: 2.79

Page 43: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 43

Rod Steiger

Martin Sheen

Donald Pleasence

#1

#2

#3

#876Kevin Bacon

Page 44: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 44

Case 2: New York State Power Grid

Vertices: generators and substations Edges: high-voltage power transmission lines and transformers Line thickness and color indicate the voltage level

Red 765 kV, 500 kV; brown 345 kV; green 230 kV; grey 138 kV

Page 45: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 45

Case 3: C. Elegans Nervous System

Vertices: neurons in the C. elegans worm Edges: axons/synapses between neurons

Page 46: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 46

Two More Examples

M. Newman on scientific collaboration networks coauthorship networks in several distinct

communities differences in degrees (papers per author) empirical verification of

giant components small diameter (mean distance) high clustering coefficient

Alberich et al. on the Marvel Universe purely fictional social network two characters linked if they appeared together in an

issue “empirical” verification of

heavy-tailed distribution of degrees (issues and characters) giant component rather small clustering coefficient

Page 47: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 47

One More (Structural) Property… A properly tuned -model can simultaneously explain

small diameter high clustering coefficient

But what about heavy-tailed degree distributions? -model and simple variants will not explain this intuitively, no “bias” towards large degree all vertices are created equal

As always, we want a “natural” model

Page 48: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 48

Models of Social Network Generation

Random Graphs (Erdös-Rényi models)

Watts-Strogatz models

Scale-free Networks

Page 49: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 49

World Wide Web

800 million documents (S. Lawrence, 1999)

ROBOT: collects all URL’s found in a document and follows them recursively

Nodes: WWW documents Links: URL links

R. Albert, H. Jeong, A-L Barabasi, Nature, 401 130 (1999)

Page 50: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 50

k ~ 6

P(k=500) ~ 10-99

NWWW ~ 109

N(k=500)~10-90

Expected Result Real Result

Pout(k) ~ k-out

P(k=500) ~ 10-6

out= 2.45 in = 2.1

Pin(k) ~ k- in

NWWW ~ 109 N(k=500) ~ 103

J. Kleinberg, et. al, Proceedings of the ICCC (1999)

World Wide Web

Page 51: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 51

< l

>

Finite size scaling: create a network with N nodes with Pin(k) and Pout(k)

< l > = 0.35 + 2.06 log(N)

l15=2 [125]

l17=4 [1346 7]

… < l > = ??

1

2

3

4

5

6

7

nd.edu

19 degrees of separation R. Albert et al Nature (99)

based on 800 million webpages [S. Lawrence et al Nature (99)]

A. Broder et al WWW9 (00)IBM

World Wide Web

L is the length of shortest paths

Page 52: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 52

What does that mean?Poisson distribution

Exponential Network

Power-law distribution

Scale-free Network

Page 53: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 53

Scale-free Networks

The number of nodes (N) is not fixed Networks continuously expand by additional new

nodes WWW: addition of new nodes Citation: publication of new papers

The attachment is not uniform A node is linked with higher probability to a node

that already has a large number of links WWW: new documents link to well known sites

(CNN, Yahoo, Google) Citation: Well cited papers are more likely to be

cited again

Page 54: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 54

Scale-Free Networks Start with (say) two vertices connected by an edge For i = 3 to N:

for each 1 <= j < i, d(j) = degree of vertex j so far let Z = S d(j) (sum of all degrees so far) add new vertex i with k edges back to {1, …, i-1}:

i is connected back to j with probability d(j)/Z Vertices j with high degree are likely to get more links! “Rich get richer” Natural model for many processes:

hyperlinks on the web new business and social contacts transportation networks

Generates a power law distribution of degrees exponent depends on value of k

Page 55: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 55

Preferential attachment explains heavy-tailed degree distributions small diameter (~log(N), via “hubs”)

Will not generate high clustering coefficient no bias towards local connectivity, but towards

hubs

Scale-Free Networks

Page 56: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 56

Case1: Internet Backbone

(Faloutsos, Faloutsos and Faloutsos, 1999)

Nodes: computers, routers Links: physical lines

Page 57: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 57

Page 58: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 58

Case2: Actor Connectivity

Nodes: actors Links: cast jointly

N = 212,250 actors k = 28.78

P(k) ~k-

Days of Thunder (1990) Far and Away

(1992) Eyes Wide Shut (1999)

=2.3

Page 59: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 59

Case 3: Science Citation Index

( = 3)

Nodes: papers Links: citations

(S. Redner, 1998)

P(k) ~k-

2212

25

1736 PRL papers (1988)

Witten-SanderPRL 1981

Page 60: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 60

Nodes: scientist (authors) Links: write paper together

(Newman, 2000, H. Jeong et al 2001)

Case 4: Science Coauthorship

Page 61: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 61

Case 5: Food Web

Nodes: trophic species Links: trophic interactions

R.J. Williams, N.D. Martinez Nature (2000)R. Sole (cond-mat/0011195)

Page 62: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 62

Case 6: Sex-Web

Nodes: people (Females; Males)Links: sexual relationships

Liljeros et al. Nature 2001

4781 Swedes; 18-74; 59% response rate.

Page 63: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 63

Robustness of Random vs. Scale-Free Networks

The accidental failure of a number of nodes in a random network can fracture the system into non-communicating islands.

Scale-free networks are more robust in the face of such failures.

Scale-free networks are highly vulnerable to a coordinated attack against their hubs.

Page 64: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 64

Social Network Analysis

Social Network Introduction

Statistics and Probability Theory

Models of Social Network Generation

Networks in Biological System

Mining on Social Network

Summary

Page 65: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 65

protein-gene interactions

protein-protein interactions

PROTEOME

GENOME

Citrate Cycle

METABOLISM

Bio-chemical reactions

Bio-Map

Page 66: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 66

Citrate Cycle METABOLISM Bio-chemical reactions

Metabolic Network

Page 67: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 67

Page 68: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 68

Nodes: chemicals (substrates)

Links: bio-chemical reactions

Metabolic Network

Page 69: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 69

Organisms from all three domains of life are scale-free networks!

H. Jeong, B. Tombor, R. Albert, Z.N. Oltvai, and A.L. Barabasi, Nature, 407 651 (2000)

Archaea Bacteria Eukaryotes

Metabolic Network

Page 70: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 70

protein-gene interactions

protein-protein interactions

PROTEOME

GENOME

Citrate Cycle

METABOLISM

Bio-chemical reactions

Bio-Map

Page 71: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 71

protein-protein interactions

PROTEOME

Protein Network

Page 72: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 72

Nodes: proteins

Links: physical interactions (binding)

P. Uetz, et al. Nature 403, 623-7 (2000).

Yeast Protein Network

Page 73: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 73

)exp()(~)( 00

k

kkkkkP

H. Jeong, S.P. Mason, A.-L. Barabasi, Z.N. Oltvai, Nature 411, 41-42 (2001)

Topology of the Protein Network

Page 74: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 74

Nature 408 307 (2000)

“One way to understand the p53 network is to compare it to the Internet. The cell, like the Internet, appears to be a ‘scale-free network’.”

p53 Network

Page 75: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 75

p53 Network (mammals)

Page 76: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 76

Social Network Analysis

Social Network Introduction

Statistics and Probability Theory

Models of Social Network Generation

Networks in Biological System

Mining on Social Network

Summary

Page 77: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 77

Information on the Social Network

Heterogeneous, multi-relational data represented as a graph or network Nodes are objects

May have different kinds of objects Objects have attributes Objects may have labels or classes

Edges are links May have different kinds of links Links may have attributes Links may be directed, are not required to be

binary Links represent relationships and interactions

between objects - rich content for mining

Page 78: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 78

What is New for Link Mining Here

Traditional machine learning and data mining approaches assume: A random sample of homogeneous objects from

single relation Real world data sets:

Multi-relational, heterogeneous and semi-structured

Link Mining Newly emerging research area at the intersection

of research in social network and link analysis, hypertext and web mining, graph mining, relational learning and inductive logic programming

Page 79: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 79

A Taxonomy of Common Link Mining Tasks

Object-Related Tasks Link-based object ranking Link-based object classification Object clustering (group detection) Object identification (entity resolution)

Link-Related Tasks Link prediction

Page 80: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 80

What Is a Link in Link Mining?

Link: relationship among data Two kinds of linked networks

homogeneous vs. heterogeneous Homogeneous networks

Single object type and single link type Single model social networks (e.g., friends) WWW: a collection of linked Web pages

Heterogeneous networks Multiple object and link types Medical network: patients, doctors, disease,

contacts, treatments Bibliographic network: publications, authors, venues

Page 81: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 81

Link-Based Object Ranking (LBR)

LBR: Exploit the link structure of a graph to order or prioritize the set of objects within the graph Focused on graphs with single object type and single

link type This is a primary focus of link analysis community Web information analysis

PageRank and Hits are typical LBR approaches In social network analysis (SNA), LBR is a core analysis

task Objective: rank individuals in terms of “centrality” Degree centrality vs. eigen vector/power centrality Rank objects relative to one or more relevant objects in

the graph vs. ranks object over time in dynamic graphs

Page 82: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 82

PageRank: Capturing Page Popularity (Brin & Page’98)

Intuitions Links are like citations in literature A page that is cited often can be expected to be

more useful in general PageRank is essentially “citation counting”, but

improves over simple counting Consider “indirect citations” (being cited by a

highly cited paper counts a lot…) Smoothing of citations (every page is assumed

to have a non-zero citation count) PageRank can also be interpreted as random

surfing (thus capturing popularity)

Page 83: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 83

The PageRank Algorithm (Brin & Page’98)

1( )

0 0 1/ 2 1/ 2

1 0 0 0

0 1 0 0

1/ 2 1/ 2 0 0

1( ) (1 ) ( ) ( )

1( ) [ (1 ) ] ( )

( (1 ) )

j i

t i ji t j t kd IN d k

i ki kk

T

M

p d m p d p dN

p d m p dN

p I M p

d1

d2

d4

“Transition matrix”

d3

Iterate until converge

Same as/N (why?)

Stationary (“stable”) distribution, so we

ignore time

Random surfing model: At any page,

With prob. , randomly jumping to a pageWith prob. (1 – ), randomly picking a link to follow

Iij = 1/N

Initial value p(d)=1/N

Page 84: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 84

Link-Based Object Classification (LBC)

Predicting the category of an object based on its attributes, its links and the attributes of linked objects

Web: Predict the category of a web page, based on words that occur on the page, links between pages, anchor text, html tags, etc.

Citation: Predict the topic of a paper, based on word occurrence, citations, co-citations

Epidemics: Predict disease type based on characteristics of the patients infected by the disease

Communication: Predict whether a communication contact is by email, phone call or mail

Page 85: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 85

Group Detection

Cluster the nodes in the graph into groups that share common characteristics Web: identifying communities Citation: identifying research

communities

Page 86: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 86

Entity Resolution

Predicting when two objects are the same, based on their attributes and their links

Also known as: deduplication, reference reconciliation, co-reference resolution, object consolidation

Applications Web: predict when two sites are mirrors of each

other Citation: predicting when two citations are

referring to the same paper Epidemics: predicting when two disease strains

are the same Biology: learning when two names refer to the

same protein

Page 87: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 87

Entity Resolution Methods

Earlier viewed as pair-wise resolution problem: resolved based on the similarity of their attributes

Importance at considering links Coauthor links in bib data, hierarchical links

between spatial references, co-occurrence links between name references in documents

Use of links in resolution Collective entity resolution: one resolution

decision affects another if they are linked Propagating evidence over links in a depen.

graph Probabilistic models interact with different

entity recognition decisions

Page 88: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 88

Link Prediction

Predict whether a link exists between two entities, based on attributes and other observed links

Applications Web: predict if there will be a link between two

pages Citation: predicting if a paper will cite another

paper Epidemics: predicting who a patient’s contacts are

Methods Often viewed as a binary classification problem Local conditional probability model, based on

structural and attribute features Difficulty: sparseness of existing links Collective prediction, e.g., Markov random field

model

Page 89: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 89

Link Cardinality Estimation

Predicting the number of links to an object Web: predict the authority of a page based on the

number of in-links; identifying hubs based on the number of out-links

Citation: predicting the impact of a paper based on the number of citations

Epidemics: predicting the number of people that will be infected based on the infectiousness of a disease

Predicting the number of objects reached along a path from an object Web: predicting number of pages retrieved by

crawling a site Citation: predicting the number of citations of a

particular author in a specific journal

Page 90: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 90

Social Network Analysis

Social Network Introduction

Statistics and Probability Theory

Models of Social Network Generation

Networks in Biological System

Mining on Social Network

Summary

Page 91: September 6, 2015Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 6 — Network Mining (Social Networks) Jianlin Cheng.

April 19, 2023 Data Mining: Concepts and Techniques 91

Ref: Mining on Social Networks

D. Liben-Nowell and J. Kleinberg. The Link Prediction Problem for Social Networks. CIKM’03

P. Domingos and M. Richardson, Mining the Network Value of Customers. KDD’01

M. Richardson and P. Domingos, Mining Knowledge-Sharing Sites for Viral Marketing. KDD’02

D. Kempe, J. Kleinberg, and E. Tardos, Maximizing the Spread of Influence through a Social Network. KDD’03.

P. Domingos, Mining Social Networks for Viral Marketing. IEEE Intelligent Systems, 20(1), 80-82, 2005.

S. Brin and L. Page, The anatomy of a large scale hypertextual Web search engine. WWW7.

S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, Mining the link structure of the World Wide Web. IEEE Computer’99

D. Cai, X. He, J. Wen, and W. Ma, Block-level Link Analysis. SIGIR'2004.