Statistical Approaches to Mining Multivariate Data Streams

Eric Vance, Duke University Department of Statistical Science; David Banks, Duke University; Tamraparni Dasu, AT&T Labs - Research. JSM, July 31, 2007, Salt Lake City, Utah.


Transcript of Statistical Approaches to Mining Multivariate Data Streams

Page 1: Statistical Approaches to Mining Multivariate Data Streams

Statistical Approaches to Mining Multivariate Data Streams

Eric Vance, Duke University Department of Statistical Science

David Banks, Duke University

Tamraparni Dasu, AT&T Labs - Research

JSM July 31, 2007, Salt Lake City, Utah

Page 2: Statistical Approaches to Mining Multivariate Data Streams

Data Streams

• Huge amounts of complex data

• Rapid rate of accumulation

• One-time access to raw data

Page 3: Statistical Approaches to Mining Multivariate Data Streams

Change Detection

• Problem: Change detection in complicated data streams

• Three criteria
– Nonparametric
– Fast
– Statistical guarantees

Page 4: Statistical Approaches to Mining Multivariate Data Streams

E-Commerce Server Data

• Data description
– One server in a network of servers
– Data polled every 5 minutes
– 27 variables
– 5 week time period

• Quality issues
– Many variables unimportant or unchanging
– Missing data

Page 5: Statistical Approaches to Mining Multivariate Data Streams

E-Commerce Server Data

• Variable Elimination
– Several variables remain constant (Total Swap)
– Predictable and non-informative (Used SysDisk Space)
– Correlated (CPU Used%, CPU User%)

• 6 Variables Selected
– CPU Used%
– Number of Procs
– Number of Threads
– Ping Latency
– Used Swap
– Log Packets = log(In Packets + Out Packets)
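The screening rules on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names, the 0.95 correlation cutoff, and the toy values are assumptions, and the "predictable and non-informative" rule is not modeled. Only the Log Packets definition is taken from the slide.

```python
import math

def log_packets(in_packets, out_packets):
    """Log Packets = log(In Packets + Out Packets), as defined on the slide."""
    return [math.log(i + o) for i, o in zip(in_packets, out_packets)]

def is_constant(values):
    """True when a variable never changes (like Total Swap in the talk)."""
    return len(set(values)) <= 1

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def select_variables(columns, max_abs_corr=0.95):
    """Drop constant columns, then drop any column highly correlated with a
    column already kept. The 0.95 cutoff is an assumed value."""
    kept = []
    for name, values in columns.items():
        if is_constant(values):
            continue
        if any(abs(pearson(values, columns[k])) > max_abs_corr for k in kept):
            continue
        kept.append(name)
    return kept

# Toy values, made up for illustration (not taken from the server data).
columns = {
    "cpu_used_pct": [10.0, 35.0, 20.0, 50.0],
    "cpu_user_pct": [8.0, 30.0, 17.0, 44.0],          # tracks cpu_used_pct
    "total_swap": [2048.0, 2048.0, 2048.0, 2048.0],   # constant
    "num_procs": [120.0, 118.0, 131.0, 119.0],
}
selected = select_variables(columns)
packets = log_packets([100, 250], [50, 150])
```

With these toy columns, the constant and the near-duplicate variables are dropped and the other two are kept.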

Page 6: Statistical Approaches to Mining Multivariate Data Streams

Daily Boxplots

[Figure: daily boxplots of CPU, Procs, and Threads]

Page 7: Statistical Approaches to Mining Multivariate Data Streams

Daily Boxplots

[Figure: daily boxplots of Ping, Swap, and Packets]

Page 8: Statistical Approaches to Mining Multivariate Data Streams

Our Approach

• Partitioning scheme (Multivariate Histogram)

• Data depth: Rank each point by its Mahalanobis distance from the center of the data

• Data Pyramid: Determine which direction in the data is most extreme

• Profile-based comparison
– Identify changes in profiles over time (week to week)
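One way to read "profile" is as the share of points landing in each (depth layer, pyramid) bin. The sketch below makes that assumption: it builds a profile for a reference week and a new week and ranks bins by how much their share changed. The function names, bin labels, and counts are hypothetical and not taken from the talk.

```python
from collections import Counter

def profile(bin_labels):
    """Proportion of points falling in each (layer, pyramid) bin."""
    counts = Counter(bin_labels)
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()}

def profile_shift(reference_bins, new_bins):
    """Change in proportion per bin, largest absolute change first."""
    p_ref, p_new = profile(reference_bins), profile(new_bins)
    bins = set(p_ref) | set(p_new)
    diffs = {b: p_new.get(b, 0.0) - p_ref.get(b, 0.0) for b in bins}
    return sorted(diffs.items(), key=lambda kv: (-abs(kv[1]), kv[0]))

# Hypothetical bin labels: (depth layer, pyramid). Counts are made up.
week1 = [(1, "+CPU")] * 6 + [(3, "-Swap")] * 3 + [(5, "+Threads")] * 1
week2 = [(1, "+CPU")] * 4 + [(3, "-Swap")] * 2 + [(5, "+Threads")] * 4
shifts = profile_shift(week1, week2)
```

In this toy case the outermost "+Threads" bin gains the most mass, which is the kind of week-to-week change the approach is meant to surface.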

Page 9: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Example in 2D

• 5 “center-outward” depth layers

• 4 pyramids
– A: depth 4, pyramid +y
– B: depth 3, pyramid -x
– C: depth 1, pyramid +x

[Figure: 2D partition with points A, B, and C marked]

Page 10: Statistical Approaches to Mining Multivariate Data Streams

Identify Depth and Direction

1. Compute center $\bar{X}$ of comparison Data Sphere

2. Calculate Mahalanobis distance $M_i$ for each point $x_i$:

$$M_i = (x_i - \bar{X})' S^{-1} (x_i - \bar{X})$$

3. In which of the 5 quantiles of depth is $M_i$?

4. Determine direction of greatest variation for $x_i$
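The four steps can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: S is taken to be the sample covariance of the comparison data, the layer cut points are quantiles of the comparison data's own distances, and "direction of greatest variation" is read as the variable with the largest standardized deviation from the center, together with its sign. The function name and the `x0`, `x1` pyramid labels are hypothetical.

```python
import numpy as np

def depth_and_direction(points, reference, n_layers=5):
    """Return a depth layer (1 = most central) and a pyramid label such as
    '+x1' for each row of `points`, relative to the `reference` data."""
    reference = np.asarray(reference, dtype=float)
    points = np.asarray(points, dtype=float)

    # Step 1: center (and covariance S) of the comparison data.
    center = reference.mean(axis=0)
    cov = np.cov(reference, rowvar=False)
    cov_inv = np.linalg.inv(cov)

    # Step 2: M_i = (x_i - center)' S^{-1} (x_i - center), no square root,
    # as on the slide; the ranking is the same either way.
    def distance(data):
        diff = data - center
        return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

    m_ref = distance(reference)
    m_new = distance(points)

    # Step 3: depth layer from quantiles of the reference distances.
    cuts = np.quantile(m_ref, np.linspace(0, 1, n_layers + 1)[1:-1])
    layers = np.searchsorted(cuts, m_new, side="right") + 1

    # Step 4 (assumed reading): most extreme standardized coordinate and sign.
    z = (points - center) / np.sqrt(np.diag(cov))
    axis = np.abs(z).argmax(axis=1)
    sign = np.where(z[np.arange(len(z)), axis] >= 0, "+", "-")
    pyramids = [f"{s}x{a}" for s, a in zip(sign, axis)]
    return layers, pyramids

# Toy 2D data, generated only for illustration.
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 2))
points = np.vstack([reference.mean(axis=0), [0.0, 10.0], [-8.0, 0.5]])
layers, pyramids = depth_and_direction(points, reference)
```

Here the first point sits at the center (layer 1), while the other two fall in the outermost layer, one in the +x1 pyramid and one in the -x0 pyramid.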

Page 11: Statistical Approaches to Mining Multivariate Data Streams

Data Partition in 6 Dimensions

Page 12: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins

Page 13: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins

CPU

Page 14: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins

[Figure labels: CPU]

Page 15: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins

[Figure labels: CPU, Procs]

Page 16: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins

[Figure labels: CPU, Procs]

Page 17: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins

[Figure labels: CPU, Procs, Threads]

Page 18: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins

[Figure labels: CPU, Procs, Threads]

Page 19: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins

[Figure labels: CPU, Procs, Threads]

Page 20: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins

[Figure labels: CPU, Procs, Threads]

Page 21: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins

[Figure labels: CPU, Procs, Threads, Ping]

Page 22: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins

[Figure labels: CPU, Procs, Threads, Ping]

Page 23: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins

[Figure labels: CPU, Procs, Threads, Ping, Swap]

Page 24: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins

[Figure labels: CPU, Procs, Threads, Ping, Swap]

Page 25: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins

[Figure labels: CPU, Procs, Threads, Ping, Swap, Packets]

Page 26: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins

[Figure labels: CPU, Procs, Threads, Ping, Swap, Packets]

Page 27: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins: Weeks 1 and 2

Page 28: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins: Weeks 1 and 2

Swap

Page 29: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins: Weeks 1 and 2

[Figure label: Swap]

Page 30: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins: Weeks 1 and 2

[Figure label: Swap]

Page 31: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins: Weeks 1 and 2

[Figure label: Swap]

Page 32: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins: Weeks 1 and 2

Threads

Page 33: Statistical Approaches to Mining Multivariate Data Streams

Partitioning Into Bins: Weeks 1 and 2

[Figure label: Threads]

Page 34: Statistical Approaches to Mining Multivariate Data Streams

6 Variables Over 5 Weeks

Page 35: Statistical Approaches to Mining Multivariate Data Streams

Week 1 Results

Page 36: Statistical Approaches to Mining Multivariate Data Streams

Week 2 Results

Page 37: Statistical Approaches to Mining Multivariate Data Streams

6 Variables Over 5 Weeks

Page 38: Statistical Approaches to Mining Multivariate Data Streams

6 Variables Over 5 Weeks

Threads

Page 39: Statistical Approaches to Mining Multivariate Data Streams

Week 3 Results

Page 40: Statistical Approaches to Mining Multivariate Data Streams

Week 3 Results

[Figure label: Threads]

Page 41: Statistical Approaches to Mining Multivariate Data Streams

Week 4 Results

Page 42: Statistical Approaches to Mining Multivariate Data Streams

Week 4 Results

[Figure label: Procs]

Page 43: Statistical Approaches to Mining Multivariate Data Streams

Week 4 Results

[Figure label: Threads]

Page 44: Statistical Approaches to Mining Multivariate Data Streams

6 Variables Over 5 Weeks

Page 45: Statistical Approaches to Mining Multivariate Data Streams

6 Variables Over 5 Weeks

Packets

Page 46: Statistical Approaches to Mining Multivariate Data Streams

Week 5 Results

Page 47: Statistical Approaches to Mining Multivariate Data Streams

Week 5 Results

[Figure label: Packets]

Page 48: Statistical Approaches to Mining Multivariate Data Streams

Trimmed Mean and Covariance

• Applying partitioning method using a trimmed mean and trimmed covariance matrix
– Use 90% most central data points
– Recompute center
– Recompute Mahalanobis distances using new center and new covariance matrix

• Bins are more uniformly filled
– More points appear closer to trimmed center
– More variables become “most extreme”
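A minimal NumPy sketch of the trimming step is given below. It assumes that "most central" is judged by the Mahalanobis distance computed from the untrimmed mean and covariance; the slide does not say how centrality is measured before trimming. Function names and the toy data are hypothetical.

```python
import numpy as np

def trimmed_center_and_cov(data, keep=0.90):
    """Mean and covariance of the `keep` fraction of most central points."""
    data = np.asarray(data, dtype=float)
    center = data.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = data - center
    # Distances from the untrimmed center (assumed centrality measure).
    m = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    n_keep = int(np.ceil(keep * len(data)))
    central = data[np.argsort(m)[:n_keep]]
    return central.mean(axis=0), np.cov(central, rowvar=False)

def mahalanobis_sq(data, center, cov):
    """(x - center)' S^{-1} (x - center) for every row of `data`."""
    diff = np.asarray(data, dtype=float) - center
    return np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)

# Toy data: a standard normal bulk plus a small cluster of outliers.
rng = np.random.default_rng(1)
bulk = rng.normal(size=(200, 2))
outliers = rng.normal(loc=20.0, size=(10, 2))
data = np.vstack([bulk, outliers])

raw_center = data.mean(axis=0)
raw_cov = np.cov(data, rowvar=False)
trim_center, trim_cov = trimmed_center_and_cov(data)
m_trimmed = mahalanobis_sq(data, trim_center, trim_cov)
```

In this toy setting the trimmed center moves back toward the bulk of the data and the trimmed covariance shrinks, so the recomputed distances separate the outlying cluster more sharply.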

Page 49: Statistical Approaches to Mining Multivariate Data Streams

Trimmed Comparison: Week 1

Page 50: Statistical Approaches to Mining Multivariate Data Streams

Trimmed Results: Week 1

Page 51: Statistical Approaches to Mining Multivariate Data Streams

Trimmed Results: Week 2

Page 52: Statistical Approaches to Mining Multivariate Data Streams

Trimmed Results: Week 3

Page 53: Statistical Approaches to Mining Multivariate Data Streams

Trimmed Results: Week 4

Page 54: Statistical Approaches to Mining Multivariate Data Streams

Trimmed Results: Week 5

Page 55: Statistical Approaches to Mining Multivariate Data Streams

Multivariate Tests of Statistical Significance

• Multinomial test
– Chi-squared test, G-test

• Bootstrap and Resampling

• Bayesian methods
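As a rough illustration of the first two bullets, the sketch below computes the Pearson chi-squared and G statistics for two sets of bin counts and obtains a p-value by permutation resampling instead of a chi-squared reference distribution. It is not the authors' procedure: the function names, the bin labels, and the counts are assumptions, and the Bayesian option is not covered.

```python
import math
import random
from collections import Counter

def chi2_and_g(ref_counts, new_counts):
    """Pearson chi-squared and G statistics for a 2 x k table of bin counts."""
    n_ref, n_new = sum(ref_counts), sum(new_counts)
    total = n_ref + n_new
    chi2 = g = 0.0
    for r, c in zip(ref_counts, new_counts):
        col = r + c
        if col == 0:
            continue  # a bin empty in both samples contributes nothing
        for obs, row_total in ((r, n_ref), (c, n_new)):
            exp = row_total * col / total
            chi2 += (obs - exp) ** 2 / exp
            if obs > 0:
                g += 2.0 * obs * math.log(obs / exp)
    return chi2, g

def permutation_p_value(ref_labels, new_labels, n_perm=2000, seed=0):
    """Resampling p-value for the chi-squared statistic: shuffle the pooled
    bin labels and recompute the statistic under 'no change'."""
    bins = sorted(set(ref_labels) | set(new_labels))

    def counts(labels):
        c = Counter(labels)
        return [c[b] for b in bins]

    observed, _ = chi2_and_g(counts(ref_labels), counts(new_labels))
    pooled = list(ref_labels) + list(new_labels)
    n_ref = len(ref_labels)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat, _ = chi2_and_g(counts(pooled[:n_ref]), counts(pooled[n_ref:]))
        hits += stat >= observed
    return (hits + 1) / (n_perm + 1)

# Hypothetical bin labels for two weeks (counts are made up).
week_a = ["b1"] * 60 + ["b2"] * 30 + ["b3"] * 10
week_b = ["b1"] * 40 + ["b2"] * 20 + ["b3"] * 40
chi2, g = chi2_and_g([60, 30, 10], [40, 20, 40])
p_value = permutation_p_value(week_a, week_b)
```

The permutation approach avoids relying on large-sample approximations when many bins are sparsely filled, at the cost of extra computation.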

Page 56: Statistical Approaches to Mining Multivariate Data Streams

Conclusion

• Quickly categorize data points non-parametrically

• Compare bins

• Identify which variables change most

• Questions or comments to ervance@stat.duke.edu