DATA MINING - WPIweb.cs.wpi.edu/~cs561/s14/Lectures/W4/DataMining-3.pdf · 2014-02-13 · DATA...
DATA MINING
What To Cover
• Frequent Itemset Mining
• Association Rule Mining
• Clustering
• Classification
• Deviation (Outlier) Detection
Motivation for Outlier Analysis
• Fraud Detection
  • (Credit card, telecommunications, criminal activity in e-Commerce)
• Customized Marketing
  • (high/low income buying habits)
• Medical Treatments
  • (unusual responses to various drugs)
• Financial Applications
  • (stock tracking)
What is an outlier?
• Observations inconsistent with the rest of the dataset – Global Outlier
• Special outliers – Local Outlier
  • Observations inconsistent with their neighborhoods
  • A local instability or discontinuity
[Figure: O1 and O2 seem to be outliers from the rest]
Outlier Detection Approaches
• Objective:
  • Define what data can be considered as inconsistent in a given data set
    • Statistical-Based Outlier Detection
    • Deviation-Based Outlier Detection
    • Distance-Based Outlier Detection
  • Find an efficient method to mine the outliers
Outlier Analysis - Outline
• Introduction / Motivation / Definition
• Statistical-based Detection
  • Distribution-based, depth-based
• Deviation-based Method
  • Sequential exception, OLAP data cube
• Distance-based Detection
  • Index-based, nested-loop, cell-based, local-outliers
Statistical-Based Outlier Detection (Distribution-based)
• Assumptions:
  • Knowledge of data (distribution, mean, variance)
• Working Hypothesis: $H: o_i \in F$, where $i = 1, 2, \ldots, n$
• Discordancy Test: is $o_i$ in $F$ within standard deviation = 15
• Alternative Hypothesis:
  • Inherent Distribution: $\bar{H}: o_i \in G$, where $i = 1, 2, \ldots, n$
  • Mixture Distribution: $\bar{H}: o_i \in (1-\lambda)F + \lambda G$, where $i = 1, 2, \ldots, n$
  • Slippage Distribution: $\bar{H}: o_i \in (1-\lambda)F + \lambda F'$, where $i = 1, 2, \ldots, n$
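A discordancy test against the working hypothesis can be sketched as follows. This is only an illustration under stated assumptions: F is taken to be a normal distribution with a known mean and a standard deviation of 15, and an observation is called discordant when it lies more than three standard deviations from the mean. The function name `discordant`, the 3-sigma cutoff, and the sample data are hypothetical and do not come from the slides.

```python
def discordant(observations, mean, std, z_cutoff=3.0):
    """Return observations lying more than z_cutoff standard deviations from mean.

    Assumes the working hypothesis H: every o_i is drawn from F, with F
    normal and its mean and standard deviation known in advance.
    """
    return [o for o in observations if abs(o - mean) / std > z_cutoff]


# Hypothetical data; F is assumed to have mean 100 and standard deviation 15.
data = [98, 112, 87, 160, 101, 40]
flagged = discordant(data, mean=100, std=15)
print(flagged)  # [160, 40]: both are 4 standard deviations from the mean
```

Observations that fail the test are the ones for which an alternative hypothesis (inherent, mixture, or slippage) would be considered instead.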
Statistical-Based Outlier Detection
• Strengths
  • Most outlier research has been done in this area; many data distributions are known
• Weaknesses
  • Not good for multi-dimensional datasets
  • Assumes the distribution is known, which is not always the case
Distance-Based Outlier Detection
• Given two parameters
  • Radius r
  • Number of neighbors k
• Outlier:
  • Any point that has fewer than k neighbors within radius r
• Inlier:
  • Any point that has k or more neighbors within radius r
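The definition above can be written as a small predicate. This is a minimal sketch, assuming Euclidean distance, that a point at distance exactly r counts as a neighbor, and that a point is never its own neighbor; the slides do not fix any of these choices. The names `count_neighbors` and `is_outlier` and the sample points are hypothetical.

```python
import math


def count_neighbors(i, points, r):
    """Count the points other than points[i] within Euclidean distance r of it."""
    p = points[i]
    return sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= r)


def is_outlier(i, points, r, k):
    """A point is an outlier if it has fewer than k neighbors within radius r."""
    return count_neighbors(i, points, r) < k


# Hypothetical data: four points close together and one far away.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
print(is_outlier(0, points, r=1.5, k=2))  # False: (0, 0) has 3 neighbors
print(is_outlier(4, points, r=1.5, k=2))  # True: (5, 5) has none
```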
Algorithm: Nested Loop
• Steps
  • For each data point p
    • Scan all other points and count how many neighbors are within distance r
• Enhancements:
  • Stop when you find k or more ⇒ p is an inlier
  • Use an index (given p's location → find all points within distance r)
(+) Easy to implement
(-) Not efficient for large datasets
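The nested-loop steps, with the early-stop enhancement, might look like the sketch below. It assumes Euclidean distance and treats a distance of exactly r as "within r"; the function name `nested_loop_outliers` and the sample data are hypothetical. The index-based enhancement is not shown.

```python
import math


def nested_loop_outliers(points, r, k):
    """Return the points with fewer than k neighbors within distance r."""
    outliers = []
    for i, p in enumerate(points):
        neighbors = 0
        for j, q in enumerate(points):
            if i == j:
                continue  # a point is not its own neighbor
            if math.dist(p, q) <= r:
                neighbors += 1
                if neighbors >= k:
                    break  # early stop: p is already known to be an inlier
        if neighbors < k:
            outliers.append(p)
    return outliers


# Hypothetical data: four points close together and one far away.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
print(nested_loop_outliers(points, r=1.5, k=2))  # [(5, 5)]
```

Even with the early stop, the worst case still compares every pair of points, which is why the slide marks the method as inefficient for large datasets.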
Algorithm: Cell-Based
• Divide the space into a grid (cells)
• What if the cell size is smaller than r?
  • If a cell has more than k points ⇒ all of its points are inliers without checking them
• What if the cell size is larger than r?
  • Check each point in cell C against the other points in C
  • For boundary points, check the neighboring cells as well
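The small-cell case can be sketched in two dimensions as follows. The sketch makes an assumption the slide leaves implicit: the cell side is set to r/√2 so that the cell diagonal is r, which guarantees that any two points in the same cell are within r of each other. A cell with more than k points is then skipped entirely; points in sparser cells are checked against cells up to two steps away in each direction, which is enough to cover radius r at this cell size. The function name `cell_based_outliers` and the sample data are hypothetical.

```python
import math
from collections import defaultdict


def cell_based_outliers(points, r, k):
    """Return the 2-D points with fewer than k neighbors within distance r."""
    side = r / math.sqrt(2)  # assumption: cell diagonal equals r
    grid = defaultdict(list)
    for p in points:
        grid[(math.floor(p[0] / side), math.floor(p[1] / side))].append(p)

    outliers = []
    for (cx, cy), members in grid.items():
        if len(members) > k:
            continue  # each point has at least k neighbors inside its own cell
        for p in members:
            neighbors = -1  # cancels the match of p with itself
            for dx in range(-2, 3):
                for dy in range(-2, 3):
                    for q in grid.get((cx + dx, cy + dy), ()):
                        if math.dist(p, q) <= r:
                            neighbors += 1
            if neighbors < k:
                outliers.append(p)
    return outliers


# Hypothetical data: four points close together and one far away.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
print(cell_based_outliers(points, r=1.5, k=2))  # [(5, 5)]
```

In this example the four clustered points fall into one cell holding more than k points, so none of them is compared against anything.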
What To Cover
• Frequent Itemset Mining
• Association Rule Mining
• Clustering
• Classification
• Deviation (Outlier) Detection