Advanced Data Mining - datalab.snu.ac.krukang/courses/20F-ADM/L1-intro.pdf
U Kang 1
Advanced Data Mining
Introduction
U Kang
Seoul National University
In This Lecture
Motivation
Overview of Topics
Outline
Motivation
Overview of Topics
Conclusion
Motivation
There are many kinds of “big data”:
  Graph
  Time series
  Text
  Image
  …
Main Questions
How can we find patterns and models from big data?
How can we do it in a scalable way?
What is this course about?
This course covers advanced theories, algorithms and systems for mining big data.
Topics
  Graph
  Spectral Analysis
  Large scale distributed system (e.g. MapReduce)
  Singular Value Decomposition, Tensor
  Time series, approximation, graph compression, community detection, anomaly detection
Outline
Motivation
Overview of Topics
  Graph
  Spectral Analysis
  MapReduce
  SVD, Tensor
  Recommendation
  Other tools
Conclusion
Graph Mining
What does the Internet look like?
What does Facebook look like?
What is ‘normal’/‘abnormal’?
Which patterns/laws hold?
Large datasets reveal patterns and anomalies that may be invisible otherwise
MRFerocius, Social Network, 2011, https://stackoverflow.com/questions/4594962/social-network-directed-graph-library-for-net
Power Law
Are real graphs random? No!
[Figure: plots labeled “Power Law”]
Node (closeness) centrality [Kang et al. SDM’10]
Q: If you have to pick one person to advertise to, whom would you choose?
[Figure: graph with candidate nodes A, B, and C]
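As a rough illustration of the question above, the sketch below computes closeness centrality, (n - 1) divided by the sum of shortest-path distances, with a breadth-first search. The five-node network and the helper name `closeness` are made up for this example; they are not the graph from the slide figure.

```python
from collections import deque


def closeness(graph, source):
    """Closeness of `source`: (reachable nodes) / (sum of BFS distances)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Hypothetical toy network: A is a hub, B bridges to C, D and E are leaves.
graph = {
    "A": ["B", "D", "E"],
    "B": ["A", "C"],
    "C": ["B"],
    "D": ["A"],
    "E": ["A"],
}
scores = {node: closeness(graph, node) for node in graph}
best = max(scores, key=scores.get)
print(best, scores[best])
```

In this toy network the hub has the smallest total distance to everyone else, so it would be the person to pick.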
Outline
Motivation
Overview of Topics
  Graph
  Spectral Analysis
  MapReduce
  SVD, Tensor
  Recommendation
  Other tools
Conclusion
Spectral Graph Analysis
Solve graph problems using the theory of linear algebra
Adjacency matrix
Eigenvector
Apply the solution
Random walks on the graph (e.g. protein interaction)
Wikipedia, Schizophrenia PPI, 2016, https://en.wikipedia.org/wiki/Protein%E2%80%93protein_interaction
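One way to see the adjacency matrix, eigenvector, and random walk fit together: the long-run visiting probabilities of a random walk are the eigenvector (eigenvalue 1) of the walk's transition matrix. The sketch below finds it by power iteration on a made-up four-node undirected graph; the graph and the function name `stationary` are assumptions for illustration only.

```python
import numpy as np

# Hypothetical undirected graph: a triangle (0, 1, 2) plus node 3 attached to 0.
A = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float)

# Column-stochastic transition matrix: P[i, j] = probability of moving j -> i.
P = A / A.sum(axis=0)


def stationary(P, iters=200):
    """Power iteration: repeatedly apply P to a uniform start vector."""
    x = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        x = P @ x
    return x


pi = stationary(P)
print(pi)
```

For a connected, non-bipartite undirected graph such as this one, the result is proportional to node degree, which the code makes easy to check.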
Triangle Counting [Kang et al. PAKDD’11]
Real social networks have a lot of triangles
Friends of friends are friends
But, triangles are expensive to compute (3-way join; several approx. algos)
Q: Can we do that quickly? A: Yes!
$\#\text{triangles} = \frac{1}{6} \sum_i \lambda_i^3$, where the $\lambda_i$ are the eigenvalues of the adjacency matrix
(and, because of skewness in eigenvalues, we only need the top few eigenvalues!)
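The eigenvalue formula can be checked on a small graph. The sketch below uses a made-up five-node graph (a 4-clique plus one pendant node) and compares the eigenvalue sum with the direct count trace(A^3)/6. The helper `triangles_top_k` is a hypothetical name showing how keeping only the largest-magnitude eigenvalues gives an approximation; on a graph this small the approximation is coarse.

```python
import numpy as np

# Hypothetical graph: nodes 0-3 form a clique (4 triangles), node 4 hangs off node 0.
A = np.zeros((5, 5))
for i in range(4):
    for j in range(i + 1, 4):
        A[i, j] = A[j, i] = 1.0
A[0, 4] = A[4, 0] = 1.0

eigvals = np.linalg.eigvalsh(A)  # A is symmetric, so eigenvalues are real

# Exact count from all eigenvalues.
triangles = int(round(float(np.sum(eigvals ** 3)) / 6))

# Cross-check: each triangle contributes 6 closed walks of length 3.
triangles_direct = int(round(float(np.trace(A @ A @ A)) / 6))


def triangles_top_k(eigvals, k):
    """Approximate count using only the k eigenvalues of largest magnitude."""
    top = sorted(eigvals, key=abs, reverse=True)[:k]
    return sum(lam ** 3 for lam in top) / 6


print(triangles, triangles_direct, triangles_top_k(eigvals, 2))
```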
Outline
Motivation
Overview of Topics
  Graph
  Spectral Analysis
  MapReduce
  SVD, Tensor
  Recommendation
  Other tools
Conclusion
Motivation
Many big data sets
  Crawled documents
  Web request logs
  …
Many ‘large scale computations’
  Inverted index
  Graph operations
  Summaries of the number of pages crawled per host
  Most frequent queries in a given day
  …
Motivation
But, developing the code is very complex:
  How to parallelize the computation?
  How to distribute the data?
  How to handle failures?
Motivation
Failures
  Assume a machine works for 3 years without failure
  What is the expected number of failed machines when operating 1 million machines?
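One way to read the question is as a per-day rate. The back-of-the-envelope sketch below assumes each machine fails once every 3 years on average and that failures are spread evenly over time; both are simplifying assumptions, not statements from the slide.

```python
MACHINES = 1_000_000
MEAN_LIFETIME_DAYS = 3 * 365  # assumption: one failure per machine every 3 years

# With failures spread evenly, the daily failure count is machines / lifetime.
failures_per_day = MACHINES / MEAN_LIFETIME_DAYS
print(round(failures_per_day))  # roughly 900 machines fail every day
```

Under these assumptions failures are routine events at this scale, which is why the framework has to handle them automatically.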
MapReduce Example: histogram of fruit names
[Figure: input is read from HDFS by Map 0, Map 1, and Map 2; the Shuffle step routes pairs to Reduce 0 and Reduce 1, which write back to HDFS]
Map output: (apple, 1), (apple, 1), (strawberry, 1), (orange, 1)
Reduce output: (apple, 2), (orange, 1), (strawberry, 1)
map( fruit ) {
    output(fruit, 1);
}
reduce( fruit, v[1..n] ) {
    sum = 0;
    for (i = 1; i <= n; i++)
        sum = sum + v[i];
    output(fruit, sum);
}
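The map, shuffle, and reduce steps can be imitated on a single machine. The sketch below is a plain-Python simulation, not Hadoop code; the function names and the assignment of fruit names to the three input splits are assumptions made for the example.

```python
from collections import defaultdict


def map_fn(fruit):
    """Emit one (key, value) pair per input record."""
    yield (fruit, 1)


def reduce_fn(fruit, values):
    """Sum all values that share the same key."""
    return (fruit, sum(values))


def run_mapreduce(splits):
    # Map phase: each split would be handled by one mapper.
    mapped = [pair for split in splits for record in split for pair in map_fn(record)]
    # Shuffle phase: group values by key so each key reaches exactly one reducer.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce phase: one call per distinct key.
    return dict(reduce_fn(key, values) for key, values in groups.items())


# Assumed input splits for Map 0, Map 1, and Map 2.
splits = [["apple"], ["apple", "strawberry"], ["orange"]]
histogram = run_mapreduce(splits)
print(histogram)
```

In a real cluster the framework takes care of the three questions raised earlier (parallelization, data distribution, failures); the user writes only the two small functions.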
Outline
Motivation
Overview of Topics
  Graph
  Spectral Analysis
  MapReduce
  SVD, Tensor
  Recommendation
  Other tools
Conclusion
Singular Value Decomposition (SVD)
Essential tool for
  Concept discovery
  Dimensionality reduction
  Finding fixed points
  Solving linear systems
  …
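SVD - Example
$A = U \Lambda V^T$
[Figure: document-term matrix A (terms: data, info, retrieval, brain, lung; document groups: CS, MD) written as the product U x Λ x V^T, with two concepts: CS and Medical]
U: doc-concept similarity
Λ: ‘strength’ of each concept (e.g., the CS-concept)
V^T: term-concept similarity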
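A small numerical version of this example can be run with NumPy. The matrix values below are invented for illustration (the slide's actual numbers did not survive extraction): four CS documents use only data, info, and retrieval, and three medical documents use only brain and lung.

```python
import numpy as np

terms = ["data", "info", "retrieval", "brain", "lung"]

# Hypothetical document-term counts: rows 0-3 are CS documents, rows 4-6 medical.
A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

# U: doc-concept similarity, s: concept strengths, Vt: term-concept similarity.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Only two singular values are non-zero: one per concept (CS and medical).
k = int(np.sum(s > 1e-10))
A_rebuilt = (U[:, :k] * s[:k]) @ Vt[:k]

print(k, s[:k])
```

The stronger concept loads only on the three CS terms and the weaker one only on the two medical terms, which is the "concept discovery" reading of the decomposition.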
What is a Tensor?
N-D generalization of a matrix:
[Table: author x keyword counts; one such matrix per venue year (KDD’20, KDD’21, KDD’22), stacked into a 3-D tensor]
        data  mining  classif.  tree  ...
John    13    11      22        55    ...
Peter   5     4       6         7     ...
Mary    ...   ...     ...       ...   ...
Nick    ...   ...     ...       ...   ...
...     ...   ...     ...       ...   ...
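In code, such a tensor is simply an array with three indices. The sketch below stacks one author x keyword matrix per year with NumPy. Only the two visible rows of the KDD’20 slice come from the slide; the other two slices are placeholder values invented for this example.

```python
import numpy as np

authors = ["John", "Peter"]
keywords = ["data", "mining", "classif.", "tree"]
years = ["KDD'20", "KDD'21", "KDD'22"]

# Slice shown on the slide (first two rows only).
kdd20 = np.array([[13, 11, 22, 55],
                  [5, 4, 6, 7]])
# Placeholder slices: made-up numbers, used only to give the tensor a third mode.
kdd21 = kdd20 + 1
kdd22 = kdd20 + 2

# Mode order: (author, keyword, year).
tensor = np.stack([kdd20, kdd21, kdd22], axis=2)

print(tensor.shape)
print(tensor[authors.index("John"), keywords.index("tree"), years.index("KDD'20")])
```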
Motivating Applications
Why are tensors useful?
  Multi-way semantic indexing
  Sensor data analysis
Multi-way Semantic Indexing
Data: author, keyword, year
[Figure: authors x keywords matrices with DB and DM groups]
Sun, Jimeng, Dacheng Tao, and Christos Faloutsos. "Beyond streams and graphs: dynamic tensor analysis." KDD. 2006.
Sensor Data Analysis
Data: location, type, time
1st factor (Main trend): (a1) daily pattern, (b1) main pattern, (c1) main correlation
2nd factor (Major abnormal trend): (a2) abnormal residual, (b2) three abnormal sensors, (c2) voltage anomaly
[Figure: tensor streams and core tensor]
Sun, Jimeng, Spiros Papadimitriou, and Philip S. Yu. "Window-based tensor analysis on high-dimensional and multi-aspect streams." ICDM. 2006.
Outline
Motivation
Overview of Topics
  Graph
  Spectral Analysis
  MapReduce
  SVD, Tensor
  Recommendation
  Other tools
Conclusion
Recommender System
Search vs. recommender system
  Search: a user actively looks for what the user wants (e.g., by entering a keyword in a search engine)
  Recommender system: the system automatically provides recommended items to users
Real World Applications
Amazon.com
35 percent of what consumers purchase on Amazon comes from recommendations
https://c1.staticflickr.com/5/4067/4551424756_3e176d6939_z.jpg
Real World Applications
Netflix
Personalization and recommendations save ≥ $1B per year
https://www.flickr.com/photos/wfryer/2661730729
Matrix Factorization for CF
Map each user and each item to a low-dimensional space
[Figure: users and movies mapped onto two latent dimensions: geared toward males vs. geared toward females, and serious vs. escapist]
Koren et al., Matrix Factorization Techniques for Recommender Systems, IEEE Computer, 2009
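To make the mapping to a low-dimensional space concrete, the sketch below learns two-dimensional user and item vectors whose dot products approximate the observed ratings. It uses plain batch gradient descent on a regularized squared error, a simplification chosen for brevity rather than the training procedures discussed by Koren et al. The rating matrix, the function names, and all hyperparameters are made up for illustration.

```python
import numpy as np


def rmse(R, mask, P, Q):
    """Root-mean-square error over observed ratings only."""
    err = mask * (R - P @ Q.T)
    return float(np.sqrt((err ** 2).sum() / mask.sum()))


def factorize(R, mask, rank=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Learn user factors P and item factors Q so that P @ Q.T fits R where mask == 1."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((R.shape[0], rank))
    Q = 0.1 * rng.standard_normal((R.shape[1], rank))
    for _ in range(steps):
        E = mask * (R - P @ Q.T)       # errors on observed entries only
        P_step = E @ Q - reg * P       # descent direction for user factors
        Q_step = E.T @ P - reg * Q     # descent direction for item factors
        P += lr * P_step
        Q += lr * Q_step
    return P, Q


# Hypothetical ratings (0 = not rated): two groups of users with opposite tastes.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)
mask = (R > 0).astype(float)

P0, Q0 = factorize(R, mask, steps=0)   # untrained factors, for comparison
P, Q = factorize(R, mask)
rmse_before, rmse_after = rmse(R, mask, P0, Q0), rmse(R, mask, P, Q)
print(rmse_before, rmse_after)
```

Entries of `P @ Q.T` at the unrated positions can then be read as predicted ratings.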
Outline
Motivation
Overview of Topics
  Graph
  Spectral Analysis
  MapReduce
  SVD, Tensor
  Recommendation
  Other tools
Conclusion
Tool 1: Time Series Analysis
Given: one or more sequences $x_1, x_2, \ldots, x_t, \ldots$ ($y_1, y_2, \ldots, y_t, \ldots$)
Task
  Find similar sequences
  Forecast future values
  Classify sequences (e.g., fault or normal)
Matrix Profile
Repeated earthquakes
Yeh, Chin-Chia Michael, et al. "Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets." 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 2016. (https://www.cs.ucr.edu/~eamonn/matrix_profile_i.pptx)
Matrix Profile
Abnormal heartbeat detection from ECG
The maximum value in the matrix profile indicates a discord
Yeh, Chin-Chia Michael, et al. "Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets." 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 2016. (https://www.cs.ucr.edu/~eamonn/matrix_profile_i.pptx)
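The discord idea can be tried with a brute-force matrix profile: for every window, record the z-normalized Euclidean distance to its nearest non-overlapping neighbor. The sketch below is a naive quadratic version for illustration, not the fast algorithm from the paper. The synthetic signal, the window length, and the exclusion zone of half a window are assumptions.

```python
import numpy as np


def znorm(window):
    """Z-normalize a window (assumes the window is not constant)."""
    return (window - window.mean()) / window.std()


def matrix_profile(ts, m):
    """Distance from each length-m window to its nearest non-trivial match."""
    n = len(ts) - m + 1
    windows = np.array([znorm(ts[i:i + m]) for i in range(n)])
    profile = np.full(n, np.inf)
    excl = m // 2  # ignore matches that overlap the window itself
    for i in range(n):
        dists = np.linalg.norm(windows - windows[i], axis=1)
        dists[max(0, i - excl):min(n, i + excl + 1)] = np.inf
        profile[i] = dists.min()
    return profile


# Synthetic "heartbeat": a sine wave with period 20 and a flat-lined stretch.
t = np.arange(200)
ts = np.sin(2 * np.pi * t / 20)
ts[100:110] = 0.0  # injected anomaly

profile = matrix_profile(ts, m=20)
discord = int(np.argmax(profile))
print(discord, profile[discord])
```

Normal windows repeat every period and so have a near-zero profile value; the windows covering the flat stretch have no close match anywhere, which is what makes them stand out.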
Tool 2 : Approximation
Flajolet-Martin sketch:
  Let’s say a multiset S contains n unique numbers
  Task: estimate n, the number of unique numbers in S
Question: how much memory do we need to answer such a question?
Answer: O(n) bytes, usually, to remember every unique number seen. But the Flajolet-Martin sketch uses only O(log(n)) bits to estimate n almost accurately
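A minimal sketch of the idea: hash every element, remember in a small bitmap which "lowest set bit" positions have been seen, and estimate the number of distinct elements from the first position never seen. Averaging over several hash functions reduces the variance. The helper names, the use of SHA-256 as the hash, and the choice of 32 hash functions are assumptions for this example; each bitmap needs only about log2(n) bits.

```python
import hashlib

PHI = 0.77351  # bias-correction constant from Flajolet and Martin's analysis


def hash64(value, seed):
    """64-bit hash of `value`; `seed` selects one of several hash functions."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def lowest_set_bit(h):
    """Index of the least-significant 1 bit of h (63 if h == 0)."""
    return (h & -h).bit_length() - 1 if h else 63


def lowest_zero_bit(bitmap):
    """Index of the least-significant 0 bit of the bitmap."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return r


def fm_sketch(items, num_hashes=32):
    """One small bitmap per hash function; duplicates set the same bits again."""
    bitmaps = [0] * num_hashes
    for x in items:
        for seed in range(num_hashes):
            bitmaps[seed] |= 1 << lowest_set_bit(hash64(x, seed))
    return bitmaps


def fm_estimate(bitmaps):
    """Estimate the number of distinct items from the averaged bit positions."""
    mean_r = sum(lowest_zero_bit(b) for b in bitmaps) / len(bitmaps)
    return 2 ** mean_r / PHI


stream = [i % 2000 for i in range(4000)]  # 2000 distinct values, each seen twice
estimate = fm_estimate(fm_sketch(stream))
print(estimate)
```

Because repeated elements hash to the same bit, the sketch ignores duplicates, and two sketches can be merged with a bitwise OR.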
Tool 2 : Approximation
Application: speed up the graph computation
For 2 billion edges,
- standard closeness takes 30,000 years
- effective closeness takes ~1 day!
1,000,000 times faster!
Tool 3 : Graph Compression
[Figure: the same graph shown as Original vs. after SlashBurn]
Tool 4 : Community Detection
How to find good communities in a graph?
http://en.wikipedia.org/wiki/Community_structure
Tool 5 : Anomaly Detection
How to find outliers, or anomalies?
OddBall at work (Posts)
[Figure: POSTS dataset (223K posts, 217K citations); axes: # citations and # cross-citations]
Example posts: http://instapundit.com/archives/025235.php, http://www.sizemore.co.uk/2005/08/i-feel-some-movies-coming-on.html
Anomaly, Event, and Fraud Detection in Large Graph Datasets, L. Akoglu, C. Faloutsos, in ACM WSDM 2013 tutorial, Rome, Italy
Outline
Motivation
Overview of Topics
Conclusion
Conclusion
Advanced theories, algorithms and systems for mining big data.
Topics
  Graph
  Spectral Analysis
  Large scale distributed system (e.g. MapReduce)
  Singular Value Decomposition, Tensor
  Time series, approximation, graph compression, community detection, anomaly detection
Questions?