Post on 13-Apr-2018
< > - +
Scan Statistics on Enron Graphs
Carey E. Priebecep@jhu.edu
Johns Hopkins University
Department of Applied Mathematics & StatisticsDepartment of Computer Science
Center for Imaging Science
“The wealth of your practical experience with sane and interesting problemswill give to mathematics a new direction and a new impetus.”
— Leopold Kronecker to Hermann von Helmholtz
Scan Statistics on Enron Graphs – p.1/39
< > - +
Citation
C.E. Priebe, J.M. Conroy, D.J. Marchette, and Y. Park,“Scan Statistics on Enron Graphs,”Computational and Mathematical Organization Theory,to appear.
SIAM International Conference on Data MiningWorkshop on Link Analysis, Counterterrorism and SecurityNewport Beach, CaliforniaApril 23, 2005
http://www.ams.jhu.edu/∼priebe/sseg.html
Scan Statistics on Enron Graphs – p.2/39
< > - +
Outline
Scan Statistics on Graphs, and on Time Series thereofScan Statistics on Enron Graphs
Scan Statistics on Enron Graphs – p.3/39
< > - +
Introduction
Objective: to develop and apply a theory of scan statistics onrandom graphs to perform change point / anomaly detectionin graphs and in time series thereof.Hypotheses:
H0: homogeneityHA: local subregion of excessive activity
Scan Statistics on Enron Graphs – p.4/39
< > - +
Scan Statistics
“moving window analysis”:to scan a small “window” (scan region) over data,calculating some locality statistic for each window;e.g.,• number of events for a point pattern,• average pixel value for an image,• ...scan statistic ≡ maximum of locality statistic:If maximum of observed locality statistics is large,then the inference can be made thatthere exists a subregion of excessive activity.
Scan Statistics on Enron Graphs – p.5/39
< > - +
Scan Statistics on Graphs
directed graph (digraph): D = (V, A)
order: |V (D)|
size: |A(D)|
k-th order neighborhood of v:Nk[v; D] = w ∈ V (D) : d(v, w) ≤ k
scan region (example: induced subdigraph): Ω(Nk[v; D])
locality statistic (example: size): Ψk(v) = |A(Ω(Nk[v; D]))|
“scale-specific” scan statistic: Mk(D) = maxv∈V (D) Ψk(v)
Scan Statistics on Enron Graphs – p.6/39
< > - +
HA1
p 00
pAA> p00
p0A = pA0 = p00
Figure 1: Alternative random graph hypothesis HA1.
Scan Statistics on Enron Graphs – p.7/39
< > - +
Monte Carlo SimulationH0
unrandomized size = 0.05T1
Dens
ity19200 19400 19600 19800 20000 20200
0.00
000.
0015
0.00
30
19795
HA
randomized power = 0.16T1
Dens
ity
19200 19400 19600 19800 20000 20200
0.00
000.
0015
0.00
30
19795
unrandomized size = 0.045T2
Dens
ity
50 55 60 65 70 75 80
0.00
0.05
0.10
0.15
65
randomized power = 0.154T2
Dens
ity
50 55 60 65 70 75 80
0.00
0.05
0.10
0.15
65
unrandomized size = 0.046T3
Dens
ity
150 200 250
0.00
0.02
0.04
153
randomized power = 1T3
Dens
ity
150 200 250
0.00
00.
010
0.02
00.
030
153
n = 1000, n′ = 13, α = 0.05, MC = 1000
T = size
T = scan0
T = scan1
β = 0.160
β = 0.154
β = 1.000
Scan Statistics on Enron Graphs – p.8/39
< > - +
Gumbel Conjecture
T1
Dens
ity19200 19400 19600 19800 20000
0.00
00.
001
0.00
20.
003
T2
Dens
ity
50 55 60 65 70 75 80
0.00
0.05
0.10
0.15
T3
Dens
ity
120 140 160 180
0.00
0.02
0.04
n = 1000, MC = 1000
T = size
T = scan0
T = scan1
Scan Statistics on Enron Graphs – p.9/39
< > - +
Example: H0
size = 26, scan0 = 5, scan1 = 5, scan2 = 11.
Scan Statistics on Enron Graphs – p.10/39
< > - +
Example: HA1
size = 32, scan0 = 5, scan1 = 7, scan2 = 11.
Scan Statistics on Enron Graphs – p.11/39
< > - +
Example: Monte Carlo Simulation H0 vs HA1H0
unrandomized size = 0.044T1
Dens
ity
10 20 30 40 50 60
0.00
0.03
0.06
35
HA
randomized power = 0.273T1
Dens
ity
10 20 30 40 50 60
0.00
0.04
35
unrandomized size = 0.032T2
Dens
ity
0 2 4 6 8 10
0.0
0.2
0.4
5
randomized power = 0.35T2
Dens
ity
0 2 4 6 8 10
0.0
0.2
0.4
5
unrandomized size = 0.018T3
Dens
ity
0 5 10 15 20
0.0
0.2
0.4
6
randomized power = 0.976T3
Dens
ity
0 5 10 15 20
0.00
0.20
6
unrandomized size = 0.037T4
Dens
ity
0 5 10 15 20 25 30
0.00
0.10
13
randomized power = 0.451T4
Dens
ity
0 5 10 15 20 25 30
0.00
0.06
0.12
13
n = 50, n′ = 4, α = 0.05, MC = 1000
T = size
T = scan0
T = scan1
T = scan2
β = 0.273
β = 0.350
β = 0.976
β = 0.451
Scan Statistics on Enron Graphs – p.12/39
< > - +
HA2
consider now a structured alternative . . .
Scan Statistics on Enron Graphs – p.13/39
< > - +
Example: HA2
size = 34, scan0 = 5, scan1 = 5, scan2 = 17.
Scan Statistics on Enron Graphs – p.14/39
< > - +
Example: HA2
size = 34, scan0 = 5, scan1 = 5, scan2 = 17.
Scan Statistics on Enron Graphs – p.15/39
< > - +
Example: Monte Carlo Simulation H0 vs HA2H0
unrandomized size = 0.044T1
Dens
ity
10 20 30 40 50 60
0.00
0.03
0.06
35
HA
randomized power = 0.633T1
Dens
ity
10 20 30 40 50 60
0.00
0.04
35
unrandomized size = 0.032T2
Dens
ity
0 2 4 6 8 10
0.0
0.2
0.4
5
randomized power = 0.545T2
Dens
ity
0 2 4 6 8 10
0.0
0.2
0.4
5
unrandomized size = 0.018T3
Dens
ity
0 5 10 15 20
0.0
0.2
0.4
6
randomized power = 0.501T3
Dens
ity
0 5 10 15 20
0.0
0.2
6
unrandomized size = 0.037T4
Dens
ity
0 10 20 30 40
0.00
0.10
13
randomized power = 0.915T4
Dens
ity
0 10 20 30 40
0.00
0.10
13
n = 50, n′ = 13, α = 0.05, MC = 1000
T = size
T = scan0
T = scan1
T = scan2
β = 0.633
β = 0.545
β = 0.501
β = 0.915
Scan Statistics on Enron Graphs – p.16/39
< > - +
Time Series
0 20 40 60 80 100
160
180
200
220
240
timeT 1
0 20 40 60 80 100
810
1214
16
time
T 2
0 20 40 60 80 100
1012
1416
1820
time
T 3
n = 100, n′ = 4, α = 0.05
T = size
T = scan0
T = scan1
Scan Statistics on Enron Graphs – p.17/39
< > - +
Time Series
0 20 40 60 80 100
160
180
200
220
240
timeT 1
0 20 40 60 80 100
810
1214
1618
time
T 2
0 20 40 60 80 100
1015
2025
30
time
T 3
n = 100, n′ = 6, α = 0.05
T = size
T = scan0
T = scan1
Scan Statistics on Enron Graphs – p.18/39
< > - +
Time Series
0 20 40 60 80 100
0.0
0.2
0.4
0.6
0.8
1.0
n’= 4z
F Z(z
)
0 20 40 60 80 100
0.0
0.2
0.4
0.6
0.8
1.0
n’= 6z
F Z(z
)
n = 100, n′ = 4&6, α = 0.05
n = 4
n = 6
Scan Statistics on Enron Graphs – p.19/39
< > - +
Enron
Scan Statistics on Enron Graphs – p.20/39
< > - +
Enron Data
125,409 distinct messages from 184 unique “From” field,mostly Enron executives.189 weeks, from 1988 through 2002.directed edges (arcs) At = (v, w): vertex v sends at least oneemail to vertex w during the t-th week (“To”, “CC”, or “BCC”)
Scan Statistics on Enron Graphs – p.21/39
< > - +
Statistics and Time Series
scale-k locality statistics: Ψk,t(v) = |A(Ω(Nk[v; Dt]))|
k = 0 : Ψ0,t(v) = outdegree(v; Dt).scan statistic: Mk,t = maxv Ψk,t(v); k = 0, 1, 2
vertex-dependent standardized locality statistic:Ψk,t(v) = (Ψk,t(v) − µk,t,τ (v)) / max(σk,t,τ (v), 1)
µk,t,τ (v) = 1τ
∑t−1t′=t−τ Ψk,t′(v)
σ2k,t,τ (v) = 1
τ−1
∑t−1t′=t−τ (Ψk,t′(v) − µk,t,τ (v))2
standardized scan statistic: Mk,t = maxv Ψk,t(v)
Scan Statistics on Enron Graphs – p.22/39
< > - +
Statistics and Time Series
050
100
150
200
250
300
time (mm/yy)
raw
scan
sta
tistic
s
11/98 05/99 10/99 04/00 10/00 04/01 09/01 03/02
sizescan0scan1scan2
050
100
150
time (mm/yy)
stan
dard
ized
scan
sta
tistic
s
11/98 05/99 10/99 04/00 10/00 04/01 09/01 03/02
scan0scan1scan2
Scan Statistics on Enron Graphs – p.23/39
< > - +
Anomaly Detection
temporally-normalized scanstatistics: Sk,t =
(Mk,t − µk,t,`)/ max(σk,t,`, 1)
detection: time t such thatSk,t > 5
t∗ = 132 (May, 2001)
−20
24
68
time (mm/yy)
scan
0
02/01 03/01 04/01 05/01 06/01
−20
24
68
time (mm/yy)
scan
1
02/01 03/01 04/01 05/01 06/01
−20
24
68
time (mm/yy)
scan
2
02/01 03/01 04/01 05/01 06/01
Scan Statistics on Enron Graphs – p.24/39
< > - +
Detection Graph D132
arg maxv
Ψ0,132(v) = john.lavorato
arg maxv
Ψ1,132(v) = john.lavorato
arg maxv
Ψ2,132(v) = richard.shapiro
arg maxv
eΨ0,132(v) = richard.shapiro
arg maxv
eΨ1,132(v) = joannie.williamson
arg maxv
eΨ2,132(v) = k..allen
a..martinandrea.ring
andrew.lewis
andy.zippera..shankman
barry.tycholiz
benjamin.rogersbill.rapp
bill.williamsbrad.mckay
cara.sempergerc..gironcharles.weldon
chris.dorlandchris.germanyclint.dean
cooper.richeycraig.dean
dana.davis
dan.hyvldanny.mccarty
darrell.schoolcraftdarron.giron
david.delaineydebra.perlingiere
diana.scholtes
d..martin
don.baughman
drew.fossumd..thomas
dutch.quigleye..haedicke
elizabeth.sager
eric.basseric.saibi
errol.mclaughlin
f..brawner
f..campbell
f..keaveyfletcher.sturmfrank.ermis
geir.solberg
geoff.storey
gerald.nemecgreg.whalley
harry.arorah..lewis
holden.salisbury
hunter.shively
james.derrick
james.steffes
jane.tholt
jason.williamsjason.wolfe
jay.reitmeyer
jeff.dasovich
jeff.king
jeffrey.hodgejeffrey.shankman
jeff.skillingj..farmer
j.harris
jim.schwieger
j.kaminskijoannie.williamson
joe.parksjoe.quenet
joe.stepenovitch
john.arnoldjohn.forney
john.griffithjohn.hodge
john.lavorato
john.zufferlijonathan.mckay
j..sturm
juan.hernandezjudy.townsend
k..allen
kam.keiserkate.symeskay.mann
keith.holst
kenneth.laykevin.hyatt
kevin.presto
kevin.ruscitti
kimberly.watsonkim.wardlarry.campbell
larry.mayl..gay
lindy.donoholisa.gangliz.taylor
l..mims
louise.kitchen
lynn.blairmarie.heardmark.e.haedicke
mark.haedickemark.taylor
mark.whitt
martin.cuillamatthew.lenhartmatt.motley matt.smithm..forneymichelle.cash
michelle.lokay
mike.carson
mike.grigsby
mike.maggi
mike.mcconnell
mike.swerzbin
m..lovemonika.causholli
monique.sanchez
m..prestom..scott
m..tholtpatrice.mims
paul.thomas
peter.keaveyphillip.allen
phillip.lovephillip.platter
randall.gay
richard.ringrichard.sandersrichard.shapiro
rick.buy
robert.badeerrobert.benson
rod.hayslettryan.slinger
sally.beck
sandra.brawner
sara.shackleton
scott.hendrickson
scott.neal
shelley.corman
s..shively
stacy.dicksonstanley.horton
stephanie.panus
steven.kean
steven.south
susan.bailey
susan.pereira
susan.scotts..wardtana.jonesteb.lokeytheresa.staab
thomas.martintom.donohoetori.kuykendall
tracy.geacconevince.kaminski
vladi.pimenov
v.weldon
w..pereira
Scan Statistics on Enron Graphs – p.25/39
< > - +
Detection Graph D132
Details for the ‘detection’ graph D132
time t∗ 132 (week of May 17, 2001)size(D132) 267
scale k Mk,132 Mk,132 Sk,132
0 66 8.3 0.32
1 93 7.8 −0.35
2 172 116.0 7.30
3 219 174.0 5.20
number of isolates 50
Scan Statistics on Enron Graphs – p.26/39
< > - +
Anomaly Detection (Aliasing)
v∗ = arg maxv Ψ2,132(v) = k..allen
k..allen == phillip.allen?k..allen had no activity before t∗ = 132.At t∗ = 132, phillip.allen switched to k..allen.
Matched Filter:For each vertex v ∈ V \ v∗,
st∗,κ(v; v∗) =
t∗−1∑
t′=t∗−κ
|N1(v; Dt′) ∩ N1(v∗; Dt∗)|
Is this a detection we want?
Scan Statistics on Enron Graphs – p.27/39
< > - +
New York Times (May 22, 2005)
Scan Statistics on Enron Graphs – p.28/39
< > - +
New York Times (May 22, 2005)
Scan Statistics on Enron Graphs – p.29/39
< > - +
Enron Timeline
Scan Statistics on Enron Graphs – p.30/39
< > - +
Anomaly Detection (another )
Non-zero activity: Ψk,t(v) · Iµ0,t,τ (v) > c
For c = 1, v∗ = roy.hayslett at t∗ = 152.
scale k Ψk,t∗−5:t∗(v∗)
0 [ 1 , 2 , 1 , 3 , 1 , 2]1 [ 1 , 2 , 2 , 9 , 2 , 4]2 [ 1 , 2 , 2 , 19 , 4 , 175]3 [ 1 , 2 , 2 , 58 , 6 , 268]
Scan Statistics on Enron Graphs – p.31/39
< > - +
Anomaly Detection (another )
roy.hayslett communicates with sally.beck , who is a k = 0
detection!
scale k Ψk,t∗−5:t∗(v)
0 [ 3 , 2 , 0 , 2 , 3 , 62]1 [ 3 , 3 , 0 , 3 , 6 , 154]2 [ 4 , 3 , 0 , 37 , 11 , 229]3 [ 4 , 3 , 0 , 98 , 16 , 267]
Scan Statistics on Enron Graphs – p.32/39
< > - +
Anomaly Detection (chatter )
Seek a detection in which the excess activity is due to chatteramongst the 2-neighbors!
Ψ′
t(v) =(Ψ2,t(v) · It,τ (v)
)/ max(γt(v), 1)
It,τ (v) =I1 × I2 × I3
I1 =Iµ0,t,τ > c1,
I2 =IΨ0(v) < σ0,t,τ (v)c2 + µ0,t,τ (v),
I3 =IΨ1(v) < σ1,t,τ (v)c3 + µ1,t,τ (v).
Scan Statistics on Enron Graphs – p.33/39
< > - +
Anomaly Detection (chatter )
0.0
0.5
1.0
1.5
2.0
time (mm/yy)
M~ ′ 2
11/98 05/99 10/99 04/00 10/00 04/01 09/01 03/02
−0.5
0.0
0.5
1.0
1.5
time (mm/yy)
S′2
11/98 05/99 10/99 04/00 10/00 04/01 09/01 03/02
Scan Statistics on Enron Graphs – p.34/39
< > - +
Anomaly Detection (chatter )
(v∗, t∗) = (steven.kean, 109)
scale k Ψk,t∗−5:t∗(v∗)
0 [ 3 , 5 , 4 , 5 , 4 , 5]1 [11 , 13 , 10 , 10 , 11 , 18]2 [14 , 35 , 21 , 38 , 13 , 65]
Scan Statistics on Enron Graphs – p.35/39
< > - +
Anomaly Detection (chatter )
dan.hyvl
danny.mccarty
david.delainey
drew.fossum
james.steffes
jane.tholt
jeff.dasovich
jeffrey.hodge
john.lavorato
kevin.presto
lindy.donoho
mark.haedicke
mark.taylor
phillip.allen
richard.sanders
richard.shapiro
robert.badeerscott.neal
shelley.corman
stanley.horton
steven.kean
susan.scott
Ω109
Scan Statistics on Enron Graphs – p.36/39
< > - +
Anomaly Detection (chatter )
dan.hyvl
danny.mccarty
david.delainey
drew.fossum
james.steffes
jane.tholt
jeff.dasovich
jeffrey.hodge
john.lavorato
kevin.presto
lindy.donoho
mark.haedicke
mark.taylor
phillip.allen
richard.sanders
richard.shapiro
robert.badeerscott.neal
shelley.corman
stanley.horton
steven.kean
susan.scott
kenneth.lay
Ω108
Scan Statistics on Enron Graphs – p.37/39
< > - +
Discussion
scan statistics offers promise fordetecting anomalies in time series of graphs.extensions:• weighted graphs (# of messages),• coloured hypergraphs (“To”, “CC”, or “BCC”),• ...• sliding window (online analysis),• ...• exponential smoothing, detrending, variance stabilization, ...“Content and Scan Statistics for Enron” — John Conroy, Thursday
Scan Statistics on Enron Graphs – p.38/39
< > - +
Kronecker Quote
“The wealth of your practical experiencewith sane and interesting problems
will give to mathematicsa new direction and a new impetus.”
– Leopold Kronecker to Hermann von Helmholtz –
Scan Statistics on Enron Graphs – p.39/39