Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois,...
-
date post
19-Dec-2015 -
On Community Outliers and their Efficient Detection in Information Networks
Jing Gao¹, Feng Liang¹, Wei Fan², Chi Wang¹, Yizhou Sun¹, Jiawei Han¹
University of Illinois, IBM TJ Watson
Debapriya Basu
OBJECTIVES
Determine outliers in information networks
Compare various algorithms that do the same
CHARACTERISTICS OF INFORMATION NETWORK
E.g., the Internet, social networking sites
Nodes: characterized by feature values
Links: represent the relations between nodes
OUTLIERS
Outliers: anomalies, novelties
Different kinds of outliers
  ◦ Global
  ◦ Contextual
OUTLIERS IN INFORMATION NETWORKS
[Figure: example network of nodes V1 to V10 laid out along a salary axis (in $1000), with a global outlier highlighted.]
COMMUNITY OUTLIER
Unified model considering both nodes and links
Community discovery and outlier detection are related processes
SUMMARY OF THE APPROACH
Treat each object as a multivariate data point
Use K components to describe normal community behavior and one component to denote outliers
Introduce a hidden variable $z_i$ at each object indicating its community
Treat network information as a graph
Model the graph as a Hidden Markov Random Field on the $z_i$
Find a local minimum of the posterior energy of the model (equivalently, a local maximum of the posterior probability).
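UNIFIED PROBABILISTIC MODEL
[Figure: diagram relating the community labels Z (including an outlier label), the node features X, the link structure W, and the model parameters.]
Example model parameters
  ◦ high-income community: mean 116k, std 35k
  ◦ low-income community: mean 20k, std 12k
K: number of communities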
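SYMBOLS IN THE MODEL
| Symbol | Definition |
|---|---|
| $I = \{1, 2, \dots, i, \dots, M\}$ | Indices of the objects |
| $V = \{v_1, v_2, \dots, v_M\}$ | Set of objects |
| $S = \{s_1, s_2, \dots, s_M\}$ | Given attributes of the objects |
| $W_{M \times M} = \{w_{ij}\}$ | Adjacency matrix containing the weights of the links |
| $Z = \{z_1, \dots, z_M\}$ | Random variables for the hidden labels of the objects |
| $X = \{x_1, \dots, x_M\}$ | Random variables for the observed data |
| $N_i$ ($i \in I$) | Neighborhood of object $v_i$ |
| $1, \dots, k, \dots, K$ | Indices of the normal communities |
| $\Theta = \{\Theta_1, \Theta_2, \dots, \Theta_K\}$ | Random variables for the model parameters |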
THE MODEL
  ◦ The random variables X are conditionally independent given their labels:
    $P(X = S \mid Z) = \prod_{i \in I} P(x_i = s_i \mid z_i)$
  ◦ The k-th normal community is characterized by a set of parameters:
    $P(x_i = s_i \mid z_i = k) = P(x_i = s_i \mid \Theta_k)$
  ◦ Outliers are characterized by a uniform distribution:
    $P(x_i = s_i \mid z_i = 0) = \rho_0$
  ◦ A Markov random field is defined over the hidden variables Z:
    $P(z_i \mid z_{I - \{i\}}) = P(z_i \mid z_{N_i})$
  ◦ The equivalent Gibbs distribution is $P(Z) = \frac{1}{H_1} \exp(-U(Z))$, where $H_1$ is a normalizing constant and $U(Z)$ is the sum of the clique potentials.
  ◦ The goal is to find the configuration of Z that maximizes $P(X = S \mid Z)\, P(Z)$ for a given $\Theta$
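Because the normalizing constant H1 does not depend on Z, taking the logarithm of the objective turns the maximization into an energy minimization. The short derivation below uses only the conditional independence and Gibbs distribution stated on this slide; the symbol $\hat{Z}$ for the optimal configuration is introduced here for illustration.

```latex
\begin{aligned}
\hat{Z} &= \arg\max_{Z} \; P(X = S \mid Z)\, P(Z) \\
        &= \arg\max_{Z} \; \Big[ \sum_{i \in I} \ln P(x_i = s_i \mid z_i) - U(Z) - \ln H_1 \Big] \\
        &= \arg\min_{Z} \; \Big[ U(Z) - \sum_{i \in I} \ln P(x_i = s_i \mid z_i) \Big]
\end{aligned}
```

The bracketed quantity in the last line plays the role of the posterior energy whose local minimum the approach searches for.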
MODELING DATA
Continuous data
  ◦ Modeled as a Gaussian distribution
  ◦ Model parameters: mean, standard deviation
Text data
  ◦ Modeled as a multinomial distribution
  ◦ Model parameters: probability of a word appearing in a community
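The two data models can be sketched as log-likelihood functions. This is a minimal illustration, assuming one-dimensional continuous features and a bag-of-words text representation; the function names are hypothetical, and the example parameters are the high-income and low-income values shown on the model slide.

```python
import math
from collections import Counter


def gaussian_log_likelihood(x, mean, std):
    """Log density of a 1-D Gaussian with the given mean and standard deviation."""
    if std <= 0:
        raise ValueError("std must be positive")
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def multinomial_log_likelihood(words, word_probs):
    """Log-likelihood of a bag of words under one community's word distribution.

    The multinomial coefficient is dropped because it does not depend on the
    community. Assumes every word in `words` has a positive probability.
    """
    counts = Counter(words)
    return sum(n * math.log(word_probs[w]) for w, n in counts.items())


# A salary of 100k is far more plausible under the high-income community.
salary = 100.0
ll_high = gaussian_log_likelihood(salary, mean=116.0, std=35.0)
ll_low = gaussian_log_likelihood(salary, mean=20.0, std=12.0)
```

Working in log space keeps the products over many objects or words numerically stable.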
COMMUNITY OUTLIER DETECTION ALGORITHM
Initialize Z
PARAMETER ESTIMATION: given Z, find Θ that maximizes P(X|Z)
INFERENCE: given Θ, find Z that maximizes P(Z|X)
The two steps alternate.
Θ: model parameters
Z: community labels
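The alternation between parameter estimation and inference can be sketched as a generic loop. This is only an outline: `alternate`, `toy_estimate`, and `toy_infer` are hypothetical names, the stopping rule (labels unchanged) is an assumption, and the toy steps below ignore links and outliers entirely, so they stand in for the real steps rather than reproduce them.

```python
def alternate(data, labels, infer_labels, estimate_params, max_iters=100):
    """Alternate parameter estimation and inference until the labels stop changing."""
    params = estimate_params(data, labels)
    for _ in range(max_iters):
        new_labels = infer_labels(data, params)
        if new_labels == labels:
            break
        labels = new_labels
        params = estimate_params(data, labels)
    return labels, params


def toy_estimate(data, labels):
    # Stand-in for parameter estimation: the mean of each label's points.
    groups = {}
    for x, z in zip(data, labels):
        groups.setdefault(z, []).append(x)
    return {z: sum(xs) / len(xs) for z, xs in groups.items()}


def toy_infer(data, means):
    # Stand-in for inference: assign each point to the label with the nearest mean.
    return [min(means, key=lambda z: abs(x - means[z])) for x in data]


data = [10.0, 30.0, 40.0, 100.0, 140.0, 160.0]
labels, means = alternate(data, [1, 2, 1, 2, 1, 2], toy_infer, toy_estimate)
# labels is now [1, 1, 1, 2, 2, 2]
```

Passing the two steps as functions keeps the loop independent of the data model, so the Gaussian and multinomial cases can share it.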
PARAMETER ESTIMATION
Calculate model parameters
  ◦ Maximum likelihood estimation
Continuous
  ◦ Mean: sample mean of the community
  ◦ Standard deviation: square root of the sample variance of the community
Text
  ◦ Probability of a word appearing in the community: empirical probability
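A minimal sketch of these estimates, assuming one-dimensional continuous features, documents given as lists of words, and label 0 reserved for outliers (which are left out of the community estimates). All function names are hypothetical.

```python
import math
from collections import Counter


def estimate_gaussian(values):
    """Maximum likelihood mean and standard deviation of one community's values."""
    n = len(values)
    if n == 0:
        raise ValueError("community has no members")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n  # the MLE divides by n, not n - 1
    return mean, math.sqrt(variance)


def estimate_word_probs(documents):
    """Empirical probability of each word over all documents of one community."""
    counts = Counter(word for doc in documents for word in doc)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def estimate_params(values, labels):
    """Gaussian parameters per community; objects labeled 0 (outliers) are skipped."""
    groups = {}
    for v, z in zip(values, labels):
        if z != 0:
            groups.setdefault(z, []).append(v)
    return {z: estimate_gaussian(vs) for z, vs in groups.items()}


mean, std = estimate_gaussian([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
# mean is 5.0 and std is 2.0
```

A community with a single member gets a standard deviation of zero here, so a practical version would likely need a floor on the variance.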
INFERENCES
Calculate the $z_i$ values
  ◦ Given the model parameters,
  ◦ Iteratively update the community labels of the nodes at each timestep
  ◦ Select the label that maximizes $P(Z \mid X, Z_N)$
Calculate the $P(Z \mid X, Z_N)$ values
  ◦ They depend on both the node features and the community labels of the neighbors if Z indicates a normal community
  ◦ If the probability of a node belonging to any community is low enough, label it as an outlier
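One way to sketch the label update is as a sweep that gives each node its lowest-energy label. The slides do not spell out the clique potentials, so the form used here is an assumption: a community's energy is the negative Gaussian log-likelihood of the node's feature minus `lam` times the total weight of links to neighbors already in that community, and the outlier label costs a fixed threshold `a0`. The names `icm_sweep`, `lam`, and `a0` are hypothetical.

```python
import math


def gaussian_neg_log(x, mean, std):
    """Negative log density of a 1-D Gaussian."""
    z = (x - mean) / std
    return 0.5 * z * z + math.log(std) + 0.5 * math.log(2.0 * math.pi)


def icm_sweep(values, weights, labels, params, lam, a0):
    """Update every label in place once; return True if any label changed.

    values:  node feature values
    weights: dict mapping node index i to {neighbor j: link weight}
    labels:  current labels (0 = outlier, 1..K = communities)
    params:  dict mapping community k to (mean, std)
    """
    changed = False
    for i, x in enumerate(values):
        best_label, best_energy = 0, a0  # assumed: the outlier label costs a0
        for k, (mean, std) in params.items():
            # Total weight of links to neighbors currently labeled k.
            support = sum(w for j, w in weights.get(i, {}).items() if labels[j] == k)
            energy = gaussian_neg_log(x, mean, std) - lam * support
            if energy < best_energy:
                best_label, best_energy = k, energy
        if best_label != labels[i]:
            labels[i] = best_label
            changed = True
    return changed


# Node 2 (salary 70) is linked to the low-income nodes 0 and 1 but fits neither community.
values = [10.0, 30.0, 70.0, 110.0, 140.0]
weights = {0: {1: 1.0, 2: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {0: 1.0, 1: 1.0},
           3: {4: 1.0}, 4: {3: 1.0}}
labels = [1, 1, 1, 2, 2]
params = {1: (20.0, 12.0), 2: (116.0, 35.0)}
for _ in range(20):
    if not icm_sweep(values, weights, labels, params, lam=1.0, a0=4.5):
        break
# labels is now [1, 1, 0, 2, 2]
```

Raising `lam` makes the links count for more relative to the features, and lowering `a0` makes the sweep quicker to call a node an outlier.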
DISCUSSIONS
Setting hyperparameters
  ◦ a0 = threshold
  ◦ Λ = confidence in the network
  ◦ K = number of communities
Initialization
  ◦ Outliers are initially grouped into clusters.
  ◦ These assignments eventually get corrected.
SIMULATED EXPERIMENTS
Data generation
  ◦ Generate continuous data based on Gaussian distributions and generate labels according to the model
  ◦ Define r: percentage of outliers, K: number of communities
Baseline models
  ◦ GLODA: global outlier detection (based on node features only)
  ◦ DNODA: local outlier detection (check the feature values of direct neighbors)
  ◦ CNA: partition data into communities based on links and then conduct outlier detection in each community
TRUE POSITIVE RATE (top r% as outliers)
[Figure: bar chart of the true positive rate (y-axis, 0 to 0.8) of GLODA, DNODA, CNA, and CODA for the settings r=1% K=5, r=5% K=5, r=1% K=8, and r=5% K=8.]
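The metric on this slide can be sketched as follows, assuming each method produces an outlier score per node (higher means more outlying) and that the top r fraction of nodes by score is reported as outliers. The function name and the score convention are assumptions, not taken from the slides.

```python
def true_positive_rate(scores, is_outlier, r):
    """Fraction of the true outliers that appear among the top r fraction of nodes."""
    n = len(scores)
    top_n = max(1, round(r * n))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:top_n] if is_outlier[i])
    total = sum(1 for flag in is_outlier if flag)
    if total == 0:
        raise ValueError("no true outliers to detect")
    return hits / total


scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.1, 0.05, 0.4, 0.15, 0.25]
is_outlier = [True, False, False, False, False, False, False, True, False, False]
tpr = true_positive_rate(scores, is_outlier, r=0.2)
# tpr is 0.5: the top 2 nodes are 0 and 2, and only node 0 is a true outlier
```

When r equals the true percentage of outliers, as in the simulated setup, this value also equals the precision of the reported set.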
EXPERIMENTS ON DBLP
Communities
  ◦ data mining, artificial intelligence, database, information analysis
Sub-network of conferences
  ◦ Links: percentage of common authors between two conferences
  ◦ Node features: publication titles in the conference
Sub-network of authors
  ◦ Links: co-authorship relationship
  ◦ Node features: titles of publications by an author
RESULTS
Community outliers: CVPR, CIKM
CONCLUSIONS
Community Outliers
Community Outlier Detection
QUESTIONS
REFERENCES
On Community Outliers and their Efficient Detection in Information Networks – Gao, Liang, Fan, Wang, Sun, Han
Outlier detection – Irad Ben-Gal
Automated detection of outliers in real-world data – Last, Kandel
Outlier Detection for High Dimensional Data – Aggarwal, Yu