Outside the Closed World: On Using Machine Learning for Network Intrusion Detection

IEEE Symposium on Security and Privacy, May 2010. Robin Sommer (International Computer Science Institute & Lawrence Berkeley National Laboratory) and Vern Paxson (International Computer Science Institute & University of California, Berkeley).

Transcript of Outside the Closed World: On Using Machine Learning for Network Intrusion Detection

Page 1: Title slide (oakland10.cs.virginia.edu/slides/anomaly-oakland.pdf)

IEEE Symposium on Security and Privacy

May 2010

Robin Sommer (International Computer Science Institute & Lawrence Berkeley National Laboratory)

Vern Paxson (International Computer Science Institute & University of California, Berkeley)

Outside the Closed World: On Using Machine Learning for Network Intrusion Detection

Pages 2-4: Network Intrusion Detection

NIDS

Detection Approaches: Misuse vs. Anomaly

Pages 5-9: Anomaly Detection

(Illustration: sessions plotted by Session Volume vs. Session Duration.)

Training Phase: Building a profile of normal activity.
Detection Phase: Matching observations against the profile.
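The two phases on this slide can be sketched in a few lines. The session features (duration, volume), the training data, and the three-sigma threshold below are illustrative assumptions, not parameters of any actual system:

```python
import statistics

# Hypothetical normal sessions: (duration in seconds, volume in bytes).
normal_sessions = [(12, 3400), (15, 4100), (11, 2900), (14, 3800), (13, 3500)]

# Training phase: build a profile of normal activity
# (mean and standard deviation per feature).
def build_profile(sessions):
    by_feature = list(zip(*sessions))
    return [(statistics.mean(f), statistics.stdev(f)) for f in by_feature]

# Detection phase: match an observation against the profile; flag it if any
# feature lies more than k standard deviations from the profile's mean.
def is_anomalous(profile, session, k=3.0):
    return any(abs(x - mu) > k * sigma
               for x, (mu, sigma) in zip(session, profile))

profile = build_profile(normal_sessions)
print(is_anomalous(profile, (13, 3600)))        # False: near the profile
print(is_anomalous(profile, (600, 9_000_000)))  # True: far outside it
```

Real systems replace the per-feature threshold with richer models, but the train-then-match structure is the same.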

Pages 10-12: Anomaly Detection (2)

• Assumption: Attacks exhibit characteristics that are different than those of normal traffic.
• Originally introduced by Dorothy Denning in 1987.
• IDES: Host-level system building per-user profiles of activity.
• Login frequency, password failures, session duration, resource consumption.

(Embedded excerpt, Table III of Chandola et al. 2009: examples of anomaly detection techniques used for network intrusion detection.)

• Statistical Profiling using Histograms (Section 7.2.1): NIDES [Anderson et al. 1994; 1995; Javitz and Valdes 1991], EMERALD [Porras and Neumann 1997], Yamanishi et al. [2001; 2004], Ho et al. [1999], Kruegel et al. [2002; 2003], Mahoney et al. [2002; 2003; 2007], Sargor [1998]
• Parametric Statistical Modeling (Section 7.1): Gwadera et al. [2004; 2005b], Ye and Chen [2001]
• Non-parametric Statistical Modeling (Section 7.2.2): Chow and Yeung [2002]
• Bayesian Networks (Section 4.2): Siaterlis and Maglaris [2004], Sebyala et al. [2002], Valdes and Skinner [2000], Bronstein et al. [2001]
• Neural Networks (Section 4.1): HIDE [Zhang et al. 2001], NSOM [Labib and Vemuri 2002], Smith et al. [2002], Hawkins et al. [2002], Kruegel et al. [2003], Manikopoulos and Papavassiliou [2002], Ramadas et al. [2003]
• Support Vector Machines (Section 4.3): Eskin et al. [2002]
• Rule-based Systems (Section 4.4): ADAM [Barbara et al. 2001a; 2001b; 2003], Fan et al. [2001], Helmer et al. [1998], Qin and Hwang [2004], Salvador and Chan [2003], Otey et al. [2003]
• Clustering Based (Section 6): ADMIT [Sequeira and Zaki 2002], Eskin et al. [2002], Wu and Zhang [2003], Otey et al. [2003]
• Nearest Neighbor Based (Section 5): MINDS [Ertoz et al. 2004; Chandola et al. 2006], Eskin et al. [2002]
• Spectral (Section 9): Shyu et al. [2003], Lakhina et al. [2005], Thottan and Ji [2003], Sun et al. [2007]
• Information Theoretic (Section 8): Lee and Xiang [2001], Noble and Cook [2003]

Source: Chandola et al. 2009

Features used: packet sizes, IP addresses, ports, header fields, timestamps, inter-arrival times, session size, session duration, session volume, payload frequencies, payload tokens, payload patterns, ...

Pages 13-16: The Holy Grail ...

• Anomaly detection is extremely appealing.
• Promises to find novel attacks without anticipating their specifics.
• It's plausible: machine learning works so well in other domains.

• But guess what's used in operation? Snort.
• We find hardly any machine learning NIDS in real-world deployments.

• Could using machine learning be harder than it appears?

Pages 17-19: Why is Anomaly Detection Hard?

The intrusion detection domain faces challenges that make it fundamentally different from other fields.

• Outlier detection and the high costs of errors: How do we find the opposite of normal?
• Interpretation of results: What does that anomaly mean?
• Evaluation: How do we make sure it actually works?
• Training data: What do we train our system with?
• Evasion risk: Can the attacker mislead our system?

Pages 20-27: Machine Learning for Classification

(Illustration: points in a two-feature space, Feature X vs. Feature Y, separated into labeled classes A, B, and C.)

Classification Problems:
• Optical Character Recognition
• Google's Machine Translation
• Amazon's Recommendations
• Spam Detection
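The classification setting on this slide can be sketched as a nearest-centroid classifier over two features. The coordinates and the three class labels A, B, C are made up for illustration:

```python
# Hypothetical labeled training points in a two-feature (X, Y) space.
training = {
    "A": [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9)],
    "B": [(5.0, 5.2), (4.8, 5.1), (5.3, 4.9)],
    "C": [(9.0, 1.1), (8.7, 0.9), (9.2, 1.3)],
}

def centroid(points):
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# "Training": summarize each class by its centroid.
centroids = {label: centroid(pts) for label, pts in training.items()}

def classify(point):
    # Assign the class whose centroid is closest (squared Euclidean distance).
    def dist2(c):
        return (point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2
    return min(centroids, key=lambda label: dist2(centroids[label]))

print(classify((1.0, 1.0)))  # "A"
print(classify((8.9, 1.0)))  # "C"
```

The key property, in contrast to the next slide: every class, including the "bad" ones, is represented by labeled examples at training time.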

Pages 28-31: Outlier Detection

(Illustration: points in the same two-feature space, Feature X vs. Feature Y, with no class labels.)

Closed World Assumption:
• Specify only positive examples.
• Adopt the standing assumption that the rest is negative.
• Can work well if the model is very precise, or mistakes are cheap.
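The closed-world setup can be sketched as one-class detection: fit a region to positive (normal) examples only, and declare everything outside it negative. The data points and the 1.5x margin below are illustrative assumptions:

```python
# Hypothetical "normal" observations only -- no attack examples available.
normal = [(2.0, 3.0), (2.2, 2.9), (1.9, 3.1), (2.1, 3.2), (2.0, 2.8)]

def centroid(points):
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def dist(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

center = centroid(normal)

# Closed-world threshold: anything farther from the normal cluster than the
# most distant training point (plus a margin) is assumed to be "not normal".
radius = max(dist(p, center) for p in normal) * 1.5

def is_outlier(point):
    return dist(point, center) > radius

print(is_outlier((2.1, 3.0)))  # False: inside the learned region
print(is_outlier((9.0, 9.0)))  # True: outside, flagged
```

Note how brittle the standing assumption is: any benign behavior absent from the training set falls outside the region and becomes a false alarm.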

Page 32: What is Normal?

• Finding a stable notion of normal is hard for networks.
• Network traffic is composed of many individual sessions.
• That leads to enormous variety and unpredictable behavior.
• Observable on all layers of the protocol stack.

Page 33: Outside the Closed World: On Using Machine Learning for ...oakland10.cs.virginia.edu/slides/anomaly-oakland.pdf · • It’s plausible: machine learning works so well in other domains.

IEEE Symposium on Security and Privacy

Self-Similarity of Ethernet Traffic

10

0 100 200 300 400 500 600 700 800 900 1000

0

20000

40000

60000

Time Units, Unit = 100 Seconds (a)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

0

2000

4000

6000

Time Units, Unit = 10 Seconds (b)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

0

200

400

600

800

Time Units, Unit = 1 Second (c)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

020406080

100

Time Units, Unit = 0.1 Second (d)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

0

5

10

15

Time Units, Unit = 0.01 Second (e)

Pac

kets

/Tim

e U

nit

Figure 1 (a)—(e). Pictorial "proof" of self-similarity:Ethernet traffic (packets per time unit for the August’89 trace) on 5 different time scales. (Different graylevels are used to identify the same segments of trafficon the different time scales.)

3.2 THE MATHEMATICS OF SELF-SIMILARITYThe notion of self-similarity is not merely an intuitivedescription, but a precise concept captured by the following

rigorous mathematical definition. Let X = (Xt: t = 0, 1, 2, ...) bea covariance stationary (sometimes called wide-sensestationary) stochastic process; that is, a process with constantmean µ = E [Xt], finite variance !2 = E [(Xt " µ)2], and anautocorrelation function r (k) = E [(Xt " µ)(Xt + k " µ)]/E [(Xt " µ)2] (k = 0, 1, 2, ...) that depends only on k. Inparticular, we assume that X has an autocorrelation function ofthe form

r (k) # a 1k "$ , as k %&, (3.2.1)

where 0 < $ < 1 (here and below, a 1, a 2, . . . denote finitepositive constants). For each m = 1, 2, 3, . . . , letX (m) = (Xk

(m) : k = 1, 2, 3, ...) denote a new time series obtainedby averaging the original series X over non-overlapping blocksof size m. That is, for each m = 1, 2, 3, . . . , X (m) is given byXk(m) = 1/m (Xkm " m + 1 + . . . + Xkm), (k ' 1). Note that for

each m, the aggregated time series X (m) defines a covariancestationary process; let r (m) denote the correspondingautocorrelation function.

The process X is called exactly (second-order) self-similar withself-similarity parameter H = 1 " $/2 if the correspondingaggregated processes X (m) have the same correlation structure asX, i.e., r (m) (k) = r (k), for all m = 1, 2, . . . ( k = 1, 2, 3, . . . ).In other words, X is exactly self-similar if the aggregatedprocesses X (m) are indistinguishable from X—at least withrespect to their second order statistical properties. An exampleof an exactly self-similar process with self-similarity parameterH is fractional Gaussian noise (FGN) with parameter1/2 < H < 1, introduced by Mandelbrot and Van Ness (1968).

A covariance stationary process X is called asymptotically(second-order) self-similar with self-similarity parameterH = 1 " $/2 if r (m) (k) agrees asymptotically (i.e., for large m andlarge k) with the correlation structure r (k) of X given by (3.2.1).The fractional autoregressive integrated moving-averageprocesses (fractional ARIMA(p,d,q) processes) with0 < d < 1/2 are examples of asymptotically second-order self-similar processes with self-similarity parameter d + 1/2. (Formore details, see Granger and Joyeux (1980), and Hosking(1981).)

Intuitively, the most striking feature of (exactly orasymptotically) self-similar processes is that their aggregatedprocesses X (m) possess a nondegenerate correlation structure asm %&. This behavior is precisely the intuition illustrated withthe sequence of plots in Figure 1; if the original time series Xrepresents the number of Ethernet packets per 10 milliseconds(plot (e)), then plots (a) to (d) depict segments of the aggregatedtime series X (10000) , X (1000) , X (100) , and X (10) , respectively. All ofthe plots look "similar", suggesting a nearly identicalautocorrelation function for all of the aggregated processes.

Mathematically, self-similarity manifests itself in a number ofequivalent ways (see Cox (1984)): (i) the variance of the samplemean decreases more slowly than the reciprocal of the samplesize (slowly decaying variances), i.e., var(X (m) ) # a 2m "$ ,as m %&, with 0 < $ < 1; (ii) the autocorrelations decayhyperbolically rather than exponentially fast, implying a non-summable autocorrelation function (k

r (k) = & (long-rangedependence), i.e., r (k) satisfies relation (3.2.1); and (iii) thespectral density f ( . ) obeys a power-law behavior near theorigin (1/f-noise), i.e., f ()) # a 3)"* , as ) % 0 , with 0 < * < 1and * = 1 " $.

The existence of a nondegenerate correlation structure for theaggregated processes X (m) as m %& is in stark contrast totypical packet traffic models currently considered in theliterature, all of which have the property that their aggregated

ACM SIGCOMM –206– Computer Communication Review

Source: LeLand et al. 1995

Page 34: Outside the Closed World: On Using Machine Learning for ...oakland10.cs.virginia.edu/slides/anomaly-oakland.pdf · • It’s plausible: machine learning works so well in other domains.

IEEE Symposium on Security and Privacy

Self-Similarity of Ethernet Traffic

10

0 100 200 300 400 500 600 700 800 900 1000

0

20000

40000

60000

Time Units, Unit = 100 Seconds (a)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

0

2000

4000

6000

Time Units, Unit = 10 Seconds (b)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

0

200

400

600

800

Time Units, Unit = 1 Second (c)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

020406080

100

Time Units, Unit = 0.1 Second (d)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

0

5

10

15

Time Units, Unit = 0.01 Second (e)

Pac

kets

/Tim

e U

nit

Figure 1 (a)—(e). Pictorial "proof" of self-similarity:Ethernet traffic (packets per time unit for the August’89 trace) on 5 different time scales. (Different graylevels are used to identify the same segments of trafficon the different time scales.)

3.2 THE MATHEMATICS OF SELF-SIMILARITYThe notion of self-similarity is not merely an intuitivedescription, but a precise concept captured by the following

rigorous mathematical definition. Let X = (Xt: t = 0, 1, 2, ...) bea covariance stationary (sometimes called wide-sensestationary) stochastic process; that is, a process with constantmean µ = E [Xt], finite variance !2 = E [(Xt " µ)2], and anautocorrelation function r (k) = E [(Xt " µ)(Xt + k " µ)]/E [(Xt " µ)2] (k = 0, 1, 2, ...) that depends only on k. Inparticular, we assume that X has an autocorrelation function ofthe form

r (k) # a 1k "$ , as k %&, (3.2.1)

where 0 < $ < 1 (here and below, a 1, a 2, . . . denote finitepositive constants). For each m = 1, 2, 3, . . . , letX (m) = (Xk

(m) : k = 1, 2, 3, ...) denote a new time series obtainedby averaging the original series X over non-overlapping blocksof size m. That is, for each m = 1, 2, 3, . . . , X (m) is given byXk(m) = 1/m (Xkm " m + 1 + . . . + Xkm), (k ' 1). Note that for

each m, the aggregated time series X (m) defines a covariancestationary process; let r (m) denote the correspondingautocorrelation function.

The process X is called exactly (second-order) self-similar withself-similarity parameter H = 1 " $/2 if the correspondingaggregated processes X (m) have the same correlation structure asX, i.e., r (m) (k) = r (k), for all m = 1, 2, . . . ( k = 1, 2, 3, . . . ).In other words, X is exactly self-similar if the aggregatedprocesses X (m) are indistinguishable from X—at least withrespect to their second order statistical properties. An exampleof an exactly self-similar process with self-similarity parameterH is fractional Gaussian noise (FGN) with parameter1/2 < H < 1, introduced by Mandelbrot and Van Ness (1968).

A covariance stationary process X is called asymptotically(second-order) self-similar with self-similarity parameterH = 1 " $/2 if r (m) (k) agrees asymptotically (i.e., for large m andlarge k) with the correlation structure r (k) of X given by (3.2.1).The fractional autoregressive integrated moving-averageprocesses (fractional ARIMA(p,d,q) processes) with0 < d < 1/2 are examples of asymptotically second-order self-similar processes with self-similarity parameter d + 1/2. (Formore details, see Granger and Joyeux (1980), and Hosking(1981).)

Intuitively, the most striking feature of (exactly orasymptotically) self-similar processes is that their aggregatedprocesses X (m) possess a nondegenerate correlation structure asm %&. This behavior is precisely the intuition illustrated withthe sequence of plots in Figure 1; if the original time series Xrepresents the number of Ethernet packets per 10 milliseconds(plot (e)), then plots (a) to (d) depict segments of the aggregatedtime series X (10000) , X (1000) , X (100) , and X (10) , respectively. All ofthe plots look "similar", suggesting a nearly identicalautocorrelation function for all of the aggregated processes.

Mathematically, self-similarity manifests itself in a number ofequivalent ways (see Cox (1984)): (i) the variance of the samplemean decreases more slowly than the reciprocal of the samplesize (slowly decaying variances), i.e., var(X (m) ) # a 2m "$ ,as m %&, with 0 < $ < 1; (ii) the autocorrelations decayhyperbolically rather than exponentially fast, implying a non-summable autocorrelation function (k

r (k) = & (long-rangedependence), i.e., r (k) satisfies relation (3.2.1); and (iii) thespectral density f ( . ) obeys a power-law behavior near theorigin (1/f-noise), i.e., f ()) # a 3)"* , as ) % 0 , with 0 < * < 1and * = 1 " $.

The existence of a nondegenerate correlation structure for theaggregated processes X (m) as m %& is in stark contrast totypical packet traffic models currently considered in theliterature, all of which have the property that their aggregated

ACM SIGCOMM –206– Computer Communication Review

0 100 200 300 400 500 600 700 800 900 1000

0

20000

40000

60000

Time Units, Unit = 100 Seconds (a)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

0

2000

4000

6000

Time Units, Unit = 10 Seconds (b)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

0

200

400

600

800

Time Units, Unit = 1 Second (c)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

020406080

100

Time Units, Unit = 0.1 Second (d)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

0

5

10

15

Time Units, Unit = 0.01 Second (e)

Pac

kets

/Tim

e U

nit

Figure 1 (a)—(e). Pictorial "proof" of self-similarity:Ethernet traffic (packets per time unit for the August’89 trace) on 5 different time scales. (Different graylevels are used to identify the same segments of trafficon the different time scales.)

3.2 THE MATHEMATICS OF SELF-SIMILARITYThe notion of self-similarity is not merely an intuitivedescription, but a precise concept captured by the following

rigorous mathematical definition. Let X = (Xt: t = 0, 1, 2, ...) bea covariance stationary (sometimes called wide-sensestationary) stochastic process; that is, a process with constantmean µ = E [Xt], finite variance !2 = E [(Xt " µ)2], and anautocorrelation function r (k) = E [(Xt " µ)(Xt + k " µ)]/E [(Xt " µ)2] (k = 0, 1, 2, ...) that depends only on k. Inparticular, we assume that X has an autocorrelation function ofthe form

r (k) # a 1k "$ , as k %&, (3.2.1)

where 0 < $ < 1 (here and below, a 1, a 2, . . . denote finitepositive constants). For each m = 1, 2, 3, . . . , letX (m) = (Xk

(m) : k = 1, 2, 3, ...) denote a new time series obtainedby averaging the original series X over non-overlapping blocksof size m. That is, for each m = 1, 2, 3, . . . , X (m) is given byXk(m) = 1/m (Xkm " m + 1 + . . . + Xkm), (k ' 1). Note that for

each m, the aggregated time series X (m) defines a covariancestationary process; let r (m) denote the correspondingautocorrelation function.

The process X is called exactly (second-order) self-similar withself-similarity parameter H = 1 " $/2 if the correspondingaggregated processes X (m) have the same correlation structure asX, i.e., r (m) (k) = r (k), for all m = 1, 2, . . . ( k = 1, 2, 3, . . . ).In other words, X is exactly self-similar if the aggregatedprocesses X (m) are indistinguishable from X—at least withrespect to their second order statistical properties. An exampleof an exactly self-similar process with self-similarity parameterH is fractional Gaussian noise (FGN) with parameter1/2 < H < 1, introduced by Mandelbrot and Van Ness (1968).

A covariance stationary process X is called asymptotically(second-order) self-similar with self-similarity parameterH = 1 " $/2 if r (m) (k) agrees asymptotically (i.e., for large m andlarge k) with the correlation structure r (k) of X given by (3.2.1).The fractional autoregressive integrated moving-averageprocesses (fractional ARIMA(p,d,q) processes) with0 < d < 1/2 are examples of asymptotically second-order self-similar processes with self-similarity parameter d + 1/2. (Formore details, see Granger and Joyeux (1980), and Hosking(1981).)

Intuitively, the most striking feature of (exactly orasymptotically) self-similar processes is that their aggregatedprocesses X (m) possess a nondegenerate correlation structure asm %&. This behavior is precisely the intuition illustrated withthe sequence of plots in Figure 1; if the original time series Xrepresents the number of Ethernet packets per 10 milliseconds(plot (e)), then plots (a) to (d) depict segments of the aggregatedtime series X (10000) , X (1000) , X (100) , and X (10) , respectively. All ofthe plots look "similar", suggesting a nearly identicalautocorrelation function for all of the aggregated processes.

Mathematically, self-similarity manifests itself in a number of equivalent ways (see Cox (1984)): (i) the variance of the sample mean decreases more slowly than the reciprocal of the sample size (slowly decaying variances), i.e., var(X^(m)) ∼ a2 m^(−β) as m → ∞, with 0 < β < 1; (ii) the autocorrelations decay hyperbolically rather than exponentially fast, implying a non-summable autocorrelation function, Σ_k r(k) = ∞ (long-range dependence), i.e., r(k) satisfies relation (3.2.1); and (iii) the spectral density f(·) obeys a power-law behavior near the origin (1/f-noise), i.e., f(λ) ∼ a3 λ^(−γ) as λ → 0, with 0 < γ < 1 and γ = 1 − β.
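Property (i) underlies the classic variance-time method for estimating β (and hence H = 1 − β/2) from a trace. The sketch below is illustrative only and not from the paper; all function names are our own. It aggregates a series over growing block sizes, fits the slope of log var(X^(m)) against log m, and for iid noise recovers a slope near −1 (β ≈ 1, H ≈ 1/2); a long-range dependent trace would instead yield the shallower slope −β with β < 1.

```python
import math
import random
from statistics import pvariance

def aggregate(x, m):
    """Average non-overlapping blocks of size m: the aggregated series X^(m)."""
    return [sum(x[i * m:(i + 1) * m]) / m for i in range(len(x) // m)]

def variance_time_slope(x, ms):
    """Least-squares slope of log var(X^(m)) versus log m; estimates -beta."""
    pts = [(math.log(m), math.log(pvariance(aggregate(x, m)))) for m in ms]
    mx = sum(p for p, _ in pts) / len(pts)
    my = sum(q for _, q in pts) / len(pts)
    num = sum((p - mx) * (q - my) for p, q in pts)
    den = sum((p - mx) ** 2 for p, _ in pts)
    return num / den

random.seed(1)
iid = [random.gauss(0.0, 1.0) for _ in range(200_000)]  # short-range "traffic"

slope = variance_time_slope(iid, [1, 2, 4, 8, 16, 32, 64])
beta_hat = -slope            # close to 1 for iid (short-range dependent) data
H_hat = 1 - beta_hat / 2     # Hurst-parameter estimate, close to 0.5 here
```

Applied to an actual self-similar trace such as packet counts per 10 ms, the same estimator would give beta_hat < 1 and therefore H_hat > 1/2.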

The existence of a nondegenerate correlation structure for the aggregated processes X^(m) as m → ∞ is in stark contrast to typical packet traffic models currently considered in the literature, all of which have the property that their aggregated

ACM SIGCOMM –206– Computer Communication Review

Source: Leland et al. 1994

Page 35: Outside the Closed World: On Using Machine Learning for Network Intrusion Detection


Self-Similarity of Ethernet Traffic

[Figure 1 (a)–(e). Pictorial "proof" of self-similarity: Ethernet traffic (packets per time unit for the August '89 trace) on 5 different time scales. Each panel spans 0–1000 time units, with the unit set to 100 seconds (a), 10 seconds (b), 1 second (c), 0.1 second (d), and 0.01 second (e); the y-axis of each panel is Packets/Time Unit. Different gray levels are used to identify the same segments of traffic on the different time scales.]

3.2 THE MATHEMATICS OF SELF-SIMILARITY

The notion of self-similarity is not merely an intuitive description, but a precise concept captured by the following rigorous mathematical definition. Let X = (X_t : t = 0, 1, 2, ...) be a covariance stationary (sometimes called wide-sense stationary) stochastic process; that is, a process with constant mean µ = E[X_t], finite variance σ² = E[(X_t − µ)²], and an autocorrelation function r(k) = E[(X_t − µ)(X_{t+k} − µ)]/E[(X_t − µ)²] (k = 0, 1, 2, ...) that depends only on k. In particular, we assume that X has an autocorrelation function of the form

r(k) ∼ a1 k^(−β), as k → ∞, (3.2.1)

where 0 < β < 1 (here and below, a1, a2, ... denote finite positive constants). For each m = 1, 2, 3, ..., let X^(m) = (X_k^(m) : k = 1, 2, 3, ...) denote a new time series obtained by averaging the original series X over non-overlapping blocks of size m. That is, for each m = 1, 2, 3, ..., X^(m) is given by X_k^(m) = (1/m)(X_{km−m+1} + ... + X_{km}), (k ≥ 1). Note that for each m, the aggregated time series X^(m) defines a covariance stationary process; let r^(m) denote the corresponding autocorrelation function.
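As a minimal, self-contained illustration of the definitions above (the function names are our own, not the paper's), the block averaging that produces X^(m) and the sample version of the autocorrelation r(k) can be computed directly:

```python
def aggregate(x, m):
    """X^(m)_k = (X_{km-m+1} + ... + X_{km}) / m over non-overlapping blocks."""
    return [sum(x[i * m:(i + 1) * m]) / m for i in range(len(x) // m)]

def autocorr(x, k):
    """Sample estimate of r(k) = E[(X_t - mu)(X_{t+k} - mu)] / E[(X_t - mu)^2]."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    cov = sum((x[t] - mu) * (x[t + k] - mu) for t in range(n - k)) / n
    return cov / var

x = list(range(1, 13))        # a toy series X_1, ..., X_12
x3 = aggregate(x, 3)          # blocks (1,2,3), (4,5,6), ... -> [2.0, 5.0, 8.0, 11.0]
```

By construction autocorr(x, 0) is 1; for a self-similarity check one would compare autocorr(x, k) against autocorr(aggregate(x, m), k) over a range of k and m.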


Page 36: Outside the Closed World: On Using Machine Learning for Network Intrusion Detection


Self-Similarity of Ethernet Traffic

10

0 100 200 300 400 500 600 700 800 900 1000

0

20000

40000

60000

Time Units, Unit = 100 Seconds (a)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

0

2000

4000

6000

Time Units, Unit = 10 Seconds (b)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

0

200

400

600

800

Time Units, Unit = 1 Second (c)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

020406080

100

Time Units, Unit = 0.1 Second (d)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

0

5

10

15

Time Units, Unit = 0.01 Second (e)

Pac

kets

/Tim

e U

nit

Figure 1 (a)—(e). Pictorial "proof" of self-similarity:Ethernet traffic (packets per time unit for the August’89 trace) on 5 different time scales. (Different graylevels are used to identify the same segments of trafficon the different time scales.)

3.2 THE MATHEMATICS OF SELF-SIMILARITYThe notion of self-similarity is not merely an intuitivedescription, but a precise concept captured by the following

rigorous mathematical definition. Let X = (Xt: t = 0, 1, 2, ...) bea covariance stationary (sometimes called wide-sensestationary) stochastic process; that is, a process with constantmean µ = E [Xt], finite variance !2 = E [(Xt " µ)2], and anautocorrelation function r (k) = E [(Xt " µ)(Xt + k " µ)]/E [(Xt " µ)2] (k = 0, 1, 2, ...) that depends only on k. Inparticular, we assume that X has an autocorrelation function ofthe form

r (k) # a 1k "$ , as k %&, (3.2.1)

where 0 < $ < 1 (here and below, a 1, a 2, . . . denote finitepositive constants). For each m = 1, 2, 3, . . . , letX (m) = (Xk

(m) : k = 1, 2, 3, ...) denote a new time series obtainedby averaging the original series X over non-overlapping blocksof size m. That is, for each m = 1, 2, 3, . . . , X (m) is given byXk(m) = 1/m (Xkm " m + 1 + . . . + Xkm), (k ' 1). Note that for

each m, the aggregated time series X (m) defines a covariancestationary process; let r (m) denote the correspondingautocorrelation function.

The process X is called exactly (second-order) self-similar withself-similarity parameter H = 1 " $/2 if the correspondingaggregated processes X (m) have the same correlation structure asX, i.e., r (m) (k) = r (k), for all m = 1, 2, . . . ( k = 1, 2, 3, . . . ).In other words, X is exactly self-similar if the aggregatedprocesses X (m) are indistinguishable from X—at least withrespect to their second order statistical properties. An exampleof an exactly self-similar process with self-similarity parameterH is fractional Gaussian noise (FGN) with parameter1/2 < H < 1, introduced by Mandelbrot and Van Ness (1968).

A covariance stationary process X is called asymptotically(second-order) self-similar with self-similarity parameterH = 1 " $/2 if r (m) (k) agrees asymptotically (i.e., for large m andlarge k) with the correlation structure r (k) of X given by (3.2.1).The fractional autoregressive integrated moving-averageprocesses (fractional ARIMA(p,d,q) processes) with0 < d < 1/2 are examples of asymptotically second-order self-similar processes with self-similarity parameter d + 1/2. (Formore details, see Granger and Joyeux (1980), and Hosking(1981).)

Intuitively, the most striking feature of (exactly orasymptotically) self-similar processes is that their aggregatedprocesses X (m) possess a nondegenerate correlation structure asm %&. This behavior is precisely the intuition illustrated withthe sequence of plots in Figure 1; if the original time series Xrepresents the number of Ethernet packets per 10 milliseconds(plot (e)), then plots (a) to (d) depict segments of the aggregatedtime series X (10000) , X (1000) , X (100) , and X (10) , respectively. All ofthe plots look "similar", suggesting a nearly identicalautocorrelation function for all of the aggregated processes.

Mathematically, self-similarity manifests itself in a number ofequivalent ways (see Cox (1984)): (i) the variance of the samplemean decreases more slowly than the reciprocal of the samplesize (slowly decaying variances), i.e., var(X (m) ) # a 2m "$ ,as m %&, with 0 < $ < 1; (ii) the autocorrelations decayhyperbolically rather than exponentially fast, implying a non-summable autocorrelation function (k

r (k) = & (long-rangedependence), i.e., r (k) satisfies relation (3.2.1); and (iii) thespectral density f ( . ) obeys a power-law behavior near theorigin (1/f-noise), i.e., f ()) # a 3)"* , as ) % 0 , with 0 < * < 1and * = 1 " $.

The existence of a nondegenerate correlation structure for theaggregated processes X (m) as m %& is in stark contrast totypical packet traffic models currently considered in theliterature, all of which have the property that their aggregated

ACM SIGCOMM –206– Computer Communication Review

0 100 200 300 400 500 600 700 800 900 1000

0

20000

40000

60000

Time Units, Unit = 100 Seconds (a)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

0

2000

4000

6000

Time Units, Unit = 10 Seconds (b)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

0

200

400

600

800

Time Units, Unit = 1 Second (c)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

020406080

100

Time Units, Unit = 0.1 Second (d)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

0

5

10

15

Time Units, Unit = 0.01 Second (e)

Pac

kets

/Tim

e U

nit

Figure 1 (a)—(e). Pictorial "proof" of self-similarity:Ethernet traffic (packets per time unit for the August’89 trace) on 5 different time scales. (Different graylevels are used to identify the same segments of trafficon the different time scales.)

3.2 THE MATHEMATICS OF SELF-SIMILARITYThe notion of self-similarity is not merely an intuitivedescription, but a precise concept captured by the following

rigorous mathematical definition. Let X = (Xt: t = 0, 1, 2, ...) bea covariance stationary (sometimes called wide-sensestationary) stochastic process; that is, a process with constantmean µ = E [Xt], finite variance !2 = E [(Xt " µ)2], and anautocorrelation function r (k) = E [(Xt " µ)(Xt + k " µ)]/E [(Xt " µ)2] (k = 0, 1, 2, ...) that depends only on k. Inparticular, we assume that X has an autocorrelation function ofthe form

r (k) # a 1k "$ , as k %&, (3.2.1)

where 0 < $ < 1 (here and below, a 1, a 2, . . . denote finitepositive constants). For each m = 1, 2, 3, . . . , letX (m) = (Xk

(m) : k = 1, 2, 3, ...) denote a new time series obtainedby averaging the original series X over non-overlapping blocksof size m. That is, for each m = 1, 2, 3, . . . , X (m) is given byXk(m) = 1/m (Xkm " m + 1 + . . . + Xkm), (k ' 1). Note that for

each m, the aggregated time series X (m) defines a covariancestationary process; let r (m) denote the correspondingautocorrelation function.

The process X is called exactly (second-order) self-similar withself-similarity parameter H = 1 " $/2 if the correspondingaggregated processes X (m) have the same correlation structure asX, i.e., r (m) (k) = r (k), for all m = 1, 2, . . . ( k = 1, 2, 3, . . . ).In other words, X is exactly self-similar if the aggregatedprocesses X (m) are indistinguishable from X—at least withrespect to their second order statistical properties. An exampleof an exactly self-similar process with self-similarity parameterH is fractional Gaussian noise (FGN) with parameter1/2 < H < 1, introduced by Mandelbrot and Van Ness (1968).

A covariance stationary process X is called asymptotically(second-order) self-similar with self-similarity parameterH = 1 " $/2 if r (m) (k) agrees asymptotically (i.e., for large m andlarge k) with the correlation structure r (k) of X given by (3.2.1).The fractional autoregressive integrated moving-averageprocesses (fractional ARIMA(p,d,q) processes) with0 < d < 1/2 are examples of asymptotically second-order self-similar processes with self-similarity parameter d + 1/2. (Formore details, see Granger and Joyeux (1980), and Hosking(1981).)

Intuitively, the most striking feature of (exactly orasymptotically) self-similar processes is that their aggregatedprocesses X (m) possess a nondegenerate correlation structure asm %&. This behavior is precisely the intuition illustrated withthe sequence of plots in Figure 1; if the original time series Xrepresents the number of Ethernet packets per 10 milliseconds(plot (e)), then plots (a) to (d) depict segments of the aggregatedtime series X (10000) , X (1000) , X (100) , and X (10) , respectively. All ofthe plots look "similar", suggesting a nearly identicalautocorrelation function for all of the aggregated processes.

Mathematically, self-similarity manifests itself in a number ofequivalent ways (see Cox (1984)): (i) the variance of the samplemean decreases more slowly than the reciprocal of the samplesize (slowly decaying variances), i.e., var(X (m) ) # a 2m "$ ,as m %&, with 0 < $ < 1; (ii) the autocorrelations decayhyperbolically rather than exponentially fast, implying a non-summable autocorrelation function (k

r (k) = & (long-rangedependence), i.e., r (k) satisfies relation (3.2.1); and (iii) thespectral density f ( . ) obeys a power-law behavior near theorigin (1/f-noise), i.e., f ()) # a 3)"* , as ) % 0 , with 0 < * < 1and * = 1 " $.

The existence of a nondegenerate correlation structure for theaggregated processes X (m) as m %& is in stark contrast totypical packet traffic models currently considered in theliterature, all of which have the property that their aggregated

ACM SIGCOMM –206– Computer Communication Review

0 100 200 300 400 500 600 700 800 900 1000

0

20000

40000

60000

Time Units, Unit = 100 Seconds (a)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

0

2000

4000

6000

Time Units, Unit = 10 Seconds (b)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

0

200

400

600

800

Time Units, Unit = 1 Second (c)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

020406080

100

Time Units, Unit = 0.1 Second (d)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

0

5

10

15

Time Units, Unit = 0.01 Second (e)

Pac

kets

/Tim

e U

nit

Figure 1 (a)—(e). Pictorial "proof" of self-similarity:Ethernet traffic (packets per time unit for the August’89 trace) on 5 different time scales. (Different graylevels are used to identify the same segments of trafficon the different time scales.)

3.2 THE MATHEMATICS OF SELF-SIMILARITYThe notion of self-similarity is not merely an intuitivedescription, but a precise concept captured by the following

rigorous mathematical definition. Let X = (Xt: t = 0, 1, 2, ...) bea covariance stationary (sometimes called wide-sensestationary) stochastic process; that is, a process with constantmean µ = E [Xt], finite variance !2 = E [(Xt " µ)2], and anautocorrelation function r (k) = E [(Xt " µ)(Xt + k " µ)]/E [(Xt " µ)2] (k = 0, 1, 2, ...) that depends only on k. Inparticular, we assume that X has an autocorrelation function ofthe form

r (k) # a 1k "$ , as k %&, (3.2.1)

where 0 < $ < 1 (here and below, a 1, a 2, . . . denote finitepositive constants). For each m = 1, 2, 3, . . . , letX (m) = (Xk

(m) : k = 1, 2, 3, ...) denote a new time series obtainedby averaging the original series X over non-overlapping blocksof size m. That is, for each m = 1, 2, 3, . . . , X (m) is given byXk(m) = 1/m (Xkm " m + 1 + . . . + Xkm), (k ' 1). Note that for

each m, the aggregated time series X (m) defines a covariancestationary process; let r (m) denote the correspondingautocorrelation function.

The process X is called exactly (second-order) self-similar withself-similarity parameter H = 1 " $/2 if the correspondingaggregated processes X (m) have the same correlation structure asX, i.e., r (m) (k) = r (k), for all m = 1, 2, . . . ( k = 1, 2, 3, . . . ).In other words, X is exactly self-similar if the aggregatedprocesses X (m) are indistinguishable from X—at least withrespect to their second order statistical properties. An exampleof an exactly self-similar process with self-similarity parameterH is fractional Gaussian noise (FGN) with parameter1/2 < H < 1, introduced by Mandelbrot and Van Ness (1968).

A covariance stationary process X is called asymptotically(second-order) self-similar with self-similarity parameterH = 1 " $/2 if r (m) (k) agrees asymptotically (i.e., for large m andlarge k) with the correlation structure r (k) of X given by (3.2.1).The fractional autoregressive integrated moving-averageprocesses (fractional ARIMA(p,d,q) processes) with0 < d < 1/2 are examples of asymptotically second-order self-similar processes with self-similarity parameter d + 1/2. (Formore details, see Granger and Joyeux (1980), and Hosking(1981).)

Intuitively, the most striking feature of (exactly orasymptotically) self-similar processes is that their aggregatedprocesses X (m) possess a nondegenerate correlation structure asm %&. This behavior is precisely the intuition illustrated withthe sequence of plots in Figure 1; if the original time series Xrepresents the number of Ethernet packets per 10 milliseconds(plot (e)), then plots (a) to (d) depict segments of the aggregatedtime series X (10000) , X (1000) , X (100) , and X (10) , respectively. All ofthe plots look "similar", suggesting a nearly identicalautocorrelation function for all of the aggregated processes.

Mathematically, self-similarity manifests itself in a number ofequivalent ways (see Cox (1984)): (i) the variance of the samplemean decreases more slowly than the reciprocal of the samplesize (slowly decaying variances), i.e., var(X (m) ) # a 2m "$ ,as m %&, with 0 < $ < 1; (ii) the autocorrelations decayhyperbolically rather than exponentially fast, implying a non-summable autocorrelation function (k

r (k) = & (long-rangedependence), i.e., r (k) satisfies relation (3.2.1); and (iii) thespectral density f ( . ) obeys a power-law behavior near theorigin (1/f-noise), i.e., f ()) # a 3)"* , as ) % 0 , with 0 < * < 1and * = 1 " $.

The existence of a nondegenerate correlation structure for theaggregated processes X (m) as m %& is in stark contrast totypical packet traffic models currently considered in theliterature, all of which have the property that their aggregated

ACM SIGCOMM –206– Computer Communication Review

0 100 200 300 400 500 600 700 800 900 1000

0

20000

40000

60000

Time Units, Unit = 100 Seconds (a)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

0

2000

4000

6000

Time Units, Unit = 10 Seconds (b)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

0

200

400

600

800

Time Units, Unit = 1 Second (c)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

020406080

100

Time Units, Unit = 0.1 Second (d)

Pac

kets

/Tim

e U

nit

0 100 200 300 400 500 600 700 800 900 1000

0

5

10

15

Time Units, Unit = 0.01 Second (e)

Pac

kets

/Tim

e U

nit

Figure 1 (a)—(e). Pictorial "proof" of self-similarity:Ethernet traffic (packets per time unit for the August’89 trace) on 5 different time scales. (Different graylevels are used to identify the same segments of trafficon the different time scales.)

3.2 THE MATHEMATICS OF SELF-SIMILARITY

The notion of self-similarity is not merely an intuitive description, but a precise concept captured by the following rigorous mathematical definition. Let X = (X_t : t = 0, 1, 2, ...) be a covariance stationary (sometimes called wide-sense stationary) stochastic process; that is, a process with constant mean µ = E[X_t], finite variance σ² = E[(X_t − µ)²], and an autocorrelation function r(k) = E[(X_t − µ)(X_{t+k} − µ)] / E[(X_t − µ)²] (k = 0, 1, 2, ...) that depends only on k. In particular, we assume that X has an autocorrelation function of the form

r(k) ~ a₁ k^(−β), as k → ∞,   (3.2.1)

where 0 < β < 1 (here and below, a₁, a₂, ... denote finite positive constants). For each m = 1, 2, 3, ..., let X^(m) = (X_k^(m) : k = 1, 2, 3, ...) denote a new time series obtained by averaging the original series X over non-overlapping blocks of size m. That is, for each m = 1, 2, 3, ..., X^(m) is given by X_k^(m) = (1/m)(X_{km−m+1} + ... + X_{km}), (k ≥ 1). Note that for each m, the aggregated time series X^(m) defines a covariance stationary process; let r^(m) denote the corresponding autocorrelation function.

The process X is called exactly (second-order) self-similar with self-similarity parameter H = 1 − β/2 if the corresponding aggregated processes X^(m) have the same correlation structure as X, i.e., r^(m)(k) = r(k), for all m = 1, 2, ... and k = 1, 2, 3, .... In other words, X is exactly self-similar if the aggregated processes X^(m) are indistinguishable from X, at least with respect to their second-order statistical properties. An example of an exactly self-similar process with self-similarity parameter H is fractional Gaussian noise (FGN) with parameter 1/2 < H < 1, introduced by Mandelbrot and Van Ness (1968).

A covariance stationary process X is called asymptotically (second-order) self-similar with self-similarity parameter H = 1 − β/2 if r^(m)(k) agrees asymptotically (i.e., for large m and large k) with the correlation structure r(k) of X given by (3.2.1). The fractional autoregressive integrated moving-average processes (fractional ARIMA(p,d,q) processes) with 0 < d < 1/2 are examples of asymptotically second-order self-similar processes with self-similarity parameter d + 1/2. (For more details, see Granger and Joyeux (1980), and Hosking (1981).)

Intuitively, the most striking feature of (exactly or asymptotically) self-similar processes is that their aggregated processes X^(m) possess a nondegenerate correlation structure as m → ∞. This behavior is precisely the intuition illustrated with the sequence of plots in Figure 1; if the original time series X represents the number of Ethernet packets per 10 milliseconds (plot (e)), then plots (a) to (d) depict segments of the aggregated time series X^(10000), X^(1000), X^(100), and X^(10), respectively. All of the plots look "similar", suggesting a nearly identical autocorrelation function for all of the aggregated processes.

Mathematically, self-similarity manifests itself in a number of equivalent ways (see Cox (1984)): (i) the variance of the sample mean decreases more slowly than the reciprocal of the sample size (slowly decaying variances), i.e., var(X^(m)) ~ a₂ m^(−β), as m → ∞, with 0 < β < 1; (ii) the autocorrelations decay hyperbolically rather than exponentially fast, implying a non-summable autocorrelation function Σ_k r(k) = ∞ (long-range dependence), i.e., r(k) satisfies relation (3.2.1); and (iii) the spectral density f(·) obeys a power-law behavior near the origin (1/f-noise), i.e., f(λ) ~ a₃ λ^(−γ), as λ → 0, with 0 < γ < 1 and γ = 1 − β.

The existence of a nondegenerate correlation structure for the aggregated processes X^(m) as m → ∞ is in stark contrast to typical packet traffic models currently considered in the literature, all of which have the property that their aggregated
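Property (i), var(X^(m)) ~ a₂ m^(−β), suggests the classic variance-time method for estimating H = 1 − β/2. The following numpy sketch is illustrative, not from the paper: it aggregates a series over growing block sizes and fits the slope of log var(X^(m)) against log m. White noise stands in here for short-range-dependent traffic and should yield H near 0.5; self-similar traffic like the August '89 Ethernet trace would yield H closer to 1.

```python
import numpy as np

def aggregate(x, m):
    """X^(m): average x over non-overlapping blocks of size m."""
    n = (len(x) // m) * m
    return x[:n].reshape(-1, m).mean(axis=1)

def hurst_variance_time(x, block_sizes):
    """Estimate H from the slope of log var(X^(m)) vs. log m.
    Self-similarity implies var(X^(m)) ~ m^(2H - 2), so slope = 2H - 2."""
    log_m = np.log(block_sizes)
    log_v = np.log([aggregate(x, m).var() for m in block_sizes])
    slope, _ = np.polyfit(log_m, log_v, 1)
    return 1.0 + slope / 2.0

rng = np.random.default_rng(0)
x = rng.normal(size=200_000)  # i.i.d. stand-in for short-range-dependent traffic
print(hurst_variance_time(x, [1, 10, 100, 1000]))  # ≈ 0.5 for i.i.d. noise
```

For a long-range-dependent input (e.g., fractional Gaussian noise), the variance decays more slowly than 1/m and the estimate moves toward 1, matching the "slowly decaying variances" characterization above.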

ACM SIGCOMM –206– Computer Communication Review

Source: Leland et al. 1995

Page 37

IEEE Symposium on Security and Privacy

Self-Similarity of Ethernet Traffic

10

[Figure 1 (a)-(e): the August '89 Ethernet trace plotted as packets per time unit on five time scales; each panel spans 1000 time units, with units of (a) 100 s, (b) 10 s, (c) 1 s, (d) 0.1 s, and (e) 0.01 s.]


Source: Leland et al. 1995

Page 38

IEEE Symposium on Security and Privacy

One Day of Crud at ICSI

11

Postel’s Law: Be strict in what you send and liberal in what you accept ...

Page 39

IEEE Symposium on Security and Privacy

One Day of Crud at ICSI

11

Postel’s Law: Be strict in what you send and liberal in what you accept ...

active-connection-reuse, DNS-label-len-gt-pkt, HTTP-chunked-multipart, possible-split-routing, bad-Ident-reply, DNS-label-too-long, HTTP-version-mismatch, SYN-after-close, bad-RPC, DNS-RR-length-mismatch, illegal-%-at-end-of-URI, SYN-after-reset, bad-SYN-ack, DNS-RR-unknown-type, inappropriate-FIN, SYN-inside-connection, bad-TCP-header-len, DNS-truncated-answer, IRC-invalid-line, SYN-seq-jump, base64-illegal-encoding, DNS-len-lt-hdr-len, line-terminated-with-single-CR, truncated-NTP, connection-originator-SYN-ack, DNS-truncated-RR-rdlength, malformed-SSH-identification, unescaped-%-in-URI, data-after-reset, double-%-in-URI, no-login-prompt, unescaped-special-URI-char, data-before-established, excess-RPC, NUL-in-line, unmatched-HTTP-reply, too-many-DNS-queries, FIN-advanced-last-seq, POP3-server-sending-client-commands, window-recision, DNS-label-forward-compress-offset, fragment-with-DF

155K in total!
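A per-day tally like the "155K in total" above falls out of a sensor's anomaly ("weird") log with a few lines. This sketch assumes a hypothetical whitespace-separated log format (timestamp, event name, source, destination); real sensors differ.

```python
from collections import Counter

# Hypothetical log lines: "timestamp weird_name src dst"
log_lines = [
    "1273500000.1 DNS-label-too-long 10.0.0.5 8.8.8.8",
    "1273500000.9 SYN-after-reset 10.0.0.7 192.0.2.1",
    "1273500001.2 DNS-label-too-long 10.0.0.9 8.8.8.8",
]

# Count occurrences of each anomaly name (field 2 of each line).
counts = Counter(line.split()[1] for line in log_lines)
print(counts.most_common(2))
# [('DNS-label-too-long', 2), ('SYN-after-reset', 1)]
```

The point of the exercise is the volume: on a real link, benign protocol crud dominates such a tally, which is exactly why "unusual" cannot be equated with "attack".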

Page 40

IEEE Symposium on Security and Privacy

What is Normal?

• Finding a stable notion of normal is hard for networks.

• Network traffic is composed of many individual sessions.
• Leads to enormous variety and unpredictable behavior.
• Observable on all layers of the protocol stack.

• Violates an implicit assumption: Outliers are attacks!

• Ignoring this leads to a semantic gap.
• Disconnect between what the system reports and what the operator wants.
• Root cause for the common complaint of “too many false positives”.

• Each mistake costs scarce analyst time.
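The cost of mistakes is sharpened by the base-rate fallacy: when attacks are rare, even a detector with excellent per-event rates produces mostly false alarms. The numbers below are purely illustrative, not measurements from any deployed system.

```python
def precision(tpr, fpr, base_rate):
    """P(attack | alarm) via Bayes' rule, given true/false positive
    rates and the fraction of events that are actually attacks."""
    true_alarms = tpr * base_rate
    false_alarms = fpr * (1 - base_rate)
    return true_alarms / (true_alarms + false_alarms)

# 99% detection, 1% false-alarm rate, 1 attack per 10,000 events:
print(precision(0.99, 0.01, 1e-4))  # ≈ 0.0098: about 99% of alarms are false
```

At this base rate, an operator triaging alarms wastes roughly 99 investigations for every real attack found, which is what "too many false positives" means in practice.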

12

Page 41

IEEE Symposium on Security and Privacy

Mistakes in Other Domains

13

OCR → Spell Checker
Image Analysis → Human Eye
Translation → Low Expectation
Collaborative Filtering → Not much impact.

Page 42

IEEE Symposium on Security and Privacy

Mistakes in Other Domains

13

OCR → Spell Checker
Image Analysis → Human Eye
Translation → Low Expectation
Collaborative Filtering → Not much impact.

“ [Recommendations are] guess work. Our error rate will always be high.”

- Greg Linden (Amazon)

Page 43

IEEE Symposium on Security and Privacy

Building a Good Anomaly Detector

• Limit the detector’s scope.
• What concrete attack is the system to find?
• Define a problem for which machine learning makes fewer mistakes.

• Gain insight into capabilities and limitations.
• What exactly does it detect and why? What not and why not?

• What are the features conceptually able to capture?

• When exactly does it break?

• Acknowledge shortcomings.

• Examine false and true positives/negatives.
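"Examine false and true positives/negatives" starts with an honest confusion matrix over labeled traffic. A minimal sketch, with the tiny label/alarm sequences below invented for illustration:

```python
from collections import Counter

def confusion(labels, alarms):
    """Tally TP/FP/FN/TN given ground-truth labels and detector alarms
    (both sequences of booleans; True = attack / alarm raised)."""
    c = Counter()
    for y, a in zip(labels, alarms):
        c["TP" if (y and a) else "FP" if a else "FN" if y else "TN"] += 1
    return dict(c)

labels = [True, False, False, True, False]
alarms = [True, True,  False, False, False]
print(confusion(labels, alarms))  # TP=1, FP=1, FN=1, TN=2
```

The FN column deserves as much scrutiny as the FP column: a missed attack is invisible in day-to-day operation, so it only surfaces when the evaluation deliberately looks for it.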

14

Page 44

IEEE Symposium on Security and Privacy

Image Analysis with Neural Networks

15

Tank

Page 45

IEEE Symposium on Security and Privacy

Image Analysis with Neural Networks

15

Tank No Tank

Page 46

IEEE Symposium on Security and Privacy

What Can we Do?

• Limit the detector’s scope.
• What concrete attack is the system to find?
• Define a problem for which machine learning makes fewer mistakes.

• Gain insight into capabilities and limitations.
• What exactly does it detect and why? What not and why not?
• What are the features conceptually able to capture?
• When exactly does it break?
• Acknowledge shortcomings.
• Examine false and true positives/negatives.

• Assume the perspective of a network operator.
• How does the detector help with operations?
• Gold standard: work with operators. If they deem it useful, you got it right.

16

Page 47

IEEE Symposium on Security and Privacy

What Can we Do?

• Limit the detector’s scope.
• What concrete attack is the system to find?
• Define a problem for which machine learning makes fewer mistakes.

• Gain insight into capabilities and limitations.
• What exactly does it detect and why? What not and why not?
• What are the features conceptually able to capture?
• When exactly does it break?
• Acknowledge shortcomings.
• Examine false and true positives/negatives.

• Assume the perspective of a network operator.
• How does the detector help with operations?
• Gold standard: work with operators. If they deem it useful, you got it right.


Once you have done all this... you might notice that you now know enough about the activity you're looking for that you don't need any machine learning.
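The bullet "examine false and true positives/negatives" can be made concrete with a confusion-matrix tally. This is a minimal sketch, not a real NIDS evaluation; the alert and label lists are placeholders:

```python
def confusion(alerts, labels):
    """alerts/labels: parallel lists of booleans (True = attack).
    Returns (TP, FP, FN, TN) counts for the detector's decisions."""
    tp = sum(a and l for a, l in zip(alerts, labels))          # correctly flagged
    fp = sum(a and not l for a, l in zip(alerts, labels))      # false alarm
    fn = sum(not a and l for a, l in zip(alerts, labels))      # missed attack
    tn = sum(not a and not l for a, l in zip(alerts, labels))  # correctly ignored
    return tp, fp, fn, tn

# Toy example: six events, detector alerts vs. ground truth.
alerts = [True, True, False, False, True, False]
labels = [True, False, False, True, True, False]
tp, fp, fn, tn = confusion(alerts, labels)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")  # TP=2 FP=1 FN=1 TN=2
```

As the slides stress, the individual instances behind each cell matter as much as the counts: each false positive and false negative should be inspected to understand why the detector made that call.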

Page 48

Why is Anomaly Detection Hard?

The intrusion detection domain faces challenges that make it fundamentally different from other fields.

• Outlier detection and the high costs of errors

• Interpretation of results

• Evaluation!

• Training data

• Evasion risk
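The first challenge, outlier detection with high error costs, is largely a base-rate problem: when attacks are rare, even a highly accurate detector produces mostly false alarms. A back-of-the-envelope Bayes calculation makes this concrete (the rates below are illustrative assumptions, not figures from the talk):

```python
# Illustrative base-rate calculation: P(attack | alarm) via Bayes' rule.
base_rate = 1e-5   # assumed: 1 in 100,000 events is an attack
tpr = 1.0          # assumed: detector flags every attack
fpr = 1e-3         # assumed: 0.1% false-positive rate

# P(alarm) = P(alarm | attack) P(attack) + P(alarm | benign) P(benign)
p_alarm = tpr * base_rate + fpr * (1 - base_rate)
precision = tpr * base_rate / p_alarm

print(f"P(attack | alarm) = {precision:.4f}")  # ~0.0099: ~99% of alarms are false
```

Even with perfect detection and a seemingly tiny 0.1% false-positive rate, roughly 99 of every 100 alarms are false, which is why operators abandon such detectors.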

Can we still make it work?


Page 50

Conclusion

• Machine learning for intrusion detection is challenging.
  • Reasonable and possible, but needs care.
  • Consider fundamental differences from other domains.

• There is some good anomaly detection work out there.

• If you do anomaly detection, understand and explain.

• If you are given an anomaly detector, ask questions.

“Open questions: [...] Soundness of Approach: Does the approach actually detect intrusions? Is it possible to distinguish anomalies related to intrusions from those related to other factors?” (Denning, 1987)

Page 51

Robin Sommer
International Computer Science Institute &
Lawrence Berkeley National Laboratory

[email protected]
http://www.icir.org

Thanks for your attention.
