
Computational Statistics and Mathematics for Cyber Security

David J. Marchette

Sept 25–27, 2017

Acknowledgment: This work funded in part by the NSWC In-House Laboratory Independent Research (ILIR) program.

NSWCDD-PN-17-00345 Distribution A: Approved for Public Release


Topics

1 Introduction

2 Computational Statistics

3 Machine Learning

4 Manifold Learning

5 Topological Data Analysis


Take-Away Points

Mathematics and statistics provide many tools for cybersecurity.

Simple can be powerful.

Complicated models or algorithms are not always necessary. Sometimes they are.

“Complicated” things become “simple” with familiarity.

High dimensional data is complicated, messy, and can fool you.

Know your data!

If your results appear too good to be true, triple check them!


Two Cultures

“There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.”*

There are many aspects of this dichotomy:

Modeling – algorithms.

Parametric – non-parametric.

Statistics – machine learning.

Inference – prediction.**

“Small” data – “big data”.

“Traditional” statistics – computational statistics.

*Leo Breiman, Statistical Science 2001, Vol. 16, No. 3, 199–231

**Donoho, D. (2015, September). 50 years of Data Science. In Princeton NJ, Tukey Centennial Workshop.

http://www.economicsguy.com/wp-content/uploads/2016/06/50YearsDataScience.pdf accessed 8/8/2017


The Illusion of Progress

“. . . [comparative studies] often fail to take into account important aspects of real problems, so that the apparent superiority of more sophisticated methods may be something of an illusion.”*

Simple models often produce essentially the same accuracy as more complicated models. They can be easier to understand and fit, and may have fewer parameters to choose, possibly resulting in lower variance.

The data you get is rarely (if ever) a true random draw from the distribution you will be running your trained/implemented algorithm on.

This is particularly important in cyber security. By its nature, cyber security data is non-stationary, and today’s data may look very different from tomorrow’s.

*David Hand, Statistical Science 2006, Vol. 21, No. 1, 1–14


The Illusion of Progress

When building a model, one makes assumptions, which are often not testable, and which can impact the ultimate performance.

Simpler models (may) have fewer assumptions.

Non-parametric methods (may) be superior to parametric ones in that they (tend to) make fewer assumptions.

However, if the assumptions are true, parametric methods may be superior.

“Good” non-parametric algorithms would be “nearly as good” as the parametric, while allowing a hedge on the assumptions.

Hand suggests we spend less time developing the “next great classifier” and more time on methods that mitigate the above issues.


Outline

Probability density estimation: kernel estimators, streaming data.

Machine learning: nearest neighbors, random forests.

Manifold learning: graphs, spectral embedding.

Topological Data Analysis.

We’ll see how much of this we can cover today – see the paper.


The Histogram

[Figure: three histogram panels of a small sample; vertical axis "Density" (0.00 to 0.30), horizontal axis from −1 to 3, with the data points marked along the axis.]


The Histogram – The Kernel Estimator

[Figure: three density panels for the histogram and the kernel estimator; vertical axis "Density" (0.00 to 0.30), horizontal axis from −1 to 3, with the data points marked along the axis.]


The Kernel Estimator

$$f(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} \varphi\!\left(\frac{x - x_i}{h}\right).$$

Easily extended to multivariate versions.

Note that this is an “average”.
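
Because the estimator is just an average of kernel bumps, it takes only a few lines to compute. The following is a minimal sketch, assuming a Gaussian kernel for φ; the function names, the toy sample, and the bandwidth h = 0.5 are illustrative choices and do not come from the talk.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, playing the role of phi (an assumed kernel)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_estimate(x, data, h):
    """Kernel estimator at x: the average of (1/h) * phi((x - x_i) / h)."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

# Hypothetical toy sample and bandwidth, chosen only for illustration.
sample = [-0.5, 0.1, 0.4, 1.2, 2.5]
grid = [-1.0 + 0.5 * k for k in range(9)]      # evaluation points from -1 to 3
density = [kernel_estimate(x, sample, h=0.5) for x in grid]
```

A smaller h gives a spikier estimate and a larger h a smoother one; the choice of h is left open here.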


Network Flows

http://csr.lanl.gov/data/cyber1/


Streaming Data

Averages can be computed in a streaming fashion:

$$\bar{X}_n = \frac{n-1}{n}\,\bar{X}_{n-1} + \frac{1}{n}\,X_n.$$

We can implement an exponential window:

$$\bar{X}_n = \frac{N-1}{N}\,\bar{X}_{n-1} + \frac{1}{N}\,X_n = \theta\,\bar{X}_{n-1} + (1-\theta)\,X_n,$$

and apply this idea to the kernel estimator:

$$f_n(x) = \theta f_{n-1}(x) + (1-\theta)\,\varphi\!\left(\frac{x - X_n}{h}\right).$$

θ controls how much of the past we “remember”. Note that we have to set a grid of x points at which we want to compute f.
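
The streaming updates can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the talk: φ is taken to be the Gaussian density, the estimate starts at zero, and a factor 1/h is applied to each kernel so that it integrates to one (the slide's formula does not show that factor). The names `streaming_mean` and `StreamingKernelEstimator` are hypothetical.

```python
import math

def phi(u):
    """Standard normal density (an assumed choice of kernel)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def streaming_mean(xs):
    """Running average: mean_n = ((n - 1) / n) * mean_{n-1} + (1 / n) * x_n."""
    mean = 0.0
    for n, x in enumerate(xs, start=1):
        mean = (n - 1) / n * mean + x / n
    return mean

class StreamingKernelEstimator:
    """Exponentially weighted kernel estimate kept on a fixed grid of x points."""

    def __init__(self, grid, h, theta):
        self.grid = list(grid)
        self.h = h
        self.theta = theta                  # closer to 1 means a longer memory
        self.f = [0.0] * len(self.grid)     # assumed starting value f_0 = 0

    def update(self, x_new):
        # f_n(x) = theta * f_{n-1}(x) + (1 - theta) * phi((x - X_n) / h) / h
        # The 1/h factor is an assumption added for normalization.
        for j, x in enumerate(self.grid):
            k = phi((x - x_new) / self.h) / self.h
            self.f[j] = self.theta * self.f[j] + (1.0 - self.theta) * k
        return self.f
```

Each update costs one kernel evaluation per grid point and needs no stored history, which is what makes the scheme usable on a stream.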


Streaming Network Flows: log(#bytes) in a Flow


Machine Learning: Classification

Given $\{(x_i, y_i)\}_{i=1,\dots,n} \subset X \times Y$ with $x_i$ corresponding to observations (flows, programs, email, system calls, log files: “features”), and $y_i$ corresponding to class labels (e.g. “malware”, “benign”).

A classifier is a mapping $g : X \to Y$.

Machine learning (pattern recognition, classification) is designing a function $g$ from “training data” $\{(x_i, y_i)\}_{i=1,\dots,n}$ for which “truth” is known.

We are given training data $\{(x_i, y_i)\}_{i=1,\dots,n} \subset X \times Y$, and will be presented with a new $x \in X$ for which the label is unknown. We wish to infer the $y$ associated with $x$.


Nearest Neighbors

We are given training data $\{(x_i, y_i)\}_{i=1,\dots,n} \subset X \times Y$, and a new $x \in X$ for which the label is unknown.

1 Find the closest $x_i$ to $x$:

$$y = y_{\arg\min_i d(x, x_i)}.$$

2 We must select an appropriate distance (dissimilarity) $d$.

3 Alternative: We can compute the k closest, and vote: take the majority class.
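
The nearest neighbor rule above can be sketched in a few lines. This is a minimal illustration that assumes Euclidean distance as the default choice of d; the function names and the toy data in the usage note are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance; any other dissimilarity d could be passed instead."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(x, train_x, train_y, k=1, dist=euclidean):
    """Label x by majority vote among its k nearest training points (k=1 is 1NN)."""
    order = sorted(range(len(train_x)), key=lambda i: dist(x, train_x[i]))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

For example, with training points `[(0, 0), (0, 1), (5, 5), (6, 5)]` labeled `["benign", "benign", "malware", "malware"]`, the call `knn_classify((0.2, 0.1), ...)` returns `"benign"`. Sorting all distances is the simplest approach; large training sets would call for a spatial index instead.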


Kaggle Malware

10,868 examples of malware grouped into 9 malware “families”.*

Each file has been byte-dumped and tabulated:

We are using the frequency with which each value 0, . . . , 255 occurs in the file.

This seems really dumb (computer scientists laugh when I tell this story).

We’ll look at the nearest neighbor classifier on these data.

100 observations of each family are used for training (21 observations from the family containing only 42 observations).

Test on the remaining.

Remember: sometimes simple is good.

*https://www.kaggle.com/c/malware-classification
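
The byte-count feature described on this slide is simple to compute. The sketch below assumes the raw bytes of a file are available as a Python `bytes` object and that counts are normalized to relative frequencies; both are assumptions, and `byte_histogram` is a hypothetical name.

```python
from collections import Counter

def byte_histogram(data: bytes):
    """Return the 256 relative frequencies of the byte values 0..255 in data."""
    counts = Counter(data)      # iterating over bytes yields ints in 0..255
    total = len(data)
    if total == 0:
        return [0.0] * 256
    return [counts.get(v, 0) / total for v in range(256)]
```

Each file then becomes a point in 256 dimensions, which is the representation a nearest neighbor classifier would compare.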


Kaggle Malware: 1NN Performance

True Class

1291 116 6 2 53 12 75 1285 1525 1 4 1 6 5

33 2807 3 1 2 21 63 29 352 12 2 7 10

27 89 19 13 8 23 2155 166 2 10 554 2 32 2830 6 4 3 4 244 4528 243 3 15 920 124 137 7 12 18 709

Error: 16.2%. That is, about 84% of the observations are correctly classified.


Kaggle Malware: 1NN Performance – Why???

Text analogy: byte-count histogram is analogous to the word-count histogram used in text analysis.

Maybe this is more like a morpheme-count histogram.

Intuitively, a “family” shares a core of code (they are modifications of the “mother” malware).

The bytes correspond to machine instructions – or at least they would if we were counting words instead of bytes.


Kaggle Malware: Smoothed 1NN Performance

Using the kernel estimator instead of the histogram, one obtains an error of 12.4%.

This is another place for computer scientists to laugh: bytes are not continuous, machine instruction codes are discrete.

. . . and yet it works.

Remember Hand’s paper.

Here is the point at which we need to better understand our data.

Unfortunately, we won’t be doing this today.


Random Forests

We are given training data $\{(x_i, y_i)\}_{i=1,\dots,n} \subset X \times Y$, and a new $x \in X$ for which the label is unknown.

The random forest is an ensemble of decision trees:

1 Sample (with replacement) from the training data. Sample a subset of the variables.

2 Build a decision tree using the two samples – don’t bother with any optimization or pruning.

3 Repeat.

With a new observation, vote the trees.
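
The recipe above can be sketched in plain Python. This is a toy illustration, not a production implementation: it follows the slide in drawing one variable subset per tree (many library implementations instead resample variables at every split), uses Gini impurity to pick splits (an assumption), and assumes class labels are strings or integers. All function names are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, features):
    """Grow an unpruned tree that splits only on the given feature indices."""
    if len(set(labels)) == 1:
        return labels[0]                                    # pure leaf
    best = None
    for f in features:
        for t in sorted(set(r[f] for r in rows))[:-1]:      # candidate thresholds
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:                                        # no usable split: majority leaf
        return Counter(labels).most_common(1)[0][0]
    _, f, t, left, right = best
    return (f, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left], features),
            build_tree([rows[i] for i in right], [labels[i] for i in right], features))

def predict_tree(tree, x):
    while isinstance(tree, tuple):                          # internal nodes are tuples
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

def random_forest(rows, labels, n_trees=25, n_features=None, seed=0):
    rng = random.Random(seed)
    p = len(rows[0])
    n_features = n_features or max(1, int(p ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]      # sample with replacement
        feats = rng.sample(range(p), n_features)            # random subset of variables
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], feats))
    return forest

def predict_forest(forest, x):
    """Vote the trees and return the majority class."""
    return Counter(predict_tree(t, x) for t in forest).most_common(1)[0][0]
```

The exhaustive threshold search keeps the code short but scales poorly; it is meant only to make the three steps concrete.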


Benign vs Malicious

7738 observations of Windows binaries: 2054 benign, 5684 malicious.

Random forest performance: 0.65% error.

1.6% of benign misclassified.

0.3% of malicious misclassified.

Nearest neighbor classifier is a little worse: overall error of 1.1%.
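
As a quick consistency check on the random forest figures, the per-class error rates can be combined, weighted by class size, to recover the overall error. The per-class rates are rounded on the slide, so the result is approximate:

```latex
\frac{0.016 \times 2054 + 0.003 \times 5684}{7738}
  \approx \frac{32.9 + 17.1}{7738}
  \approx 0.0065 = 0.65\%.
```

Because roughly three quarters of the files are malicious, the overall error sits much closer to the malicious error rate than to the benign one.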


Know Your Data

The results demonstrate that there is something going on with this byte-count approach.

Logically, the performance seems too good to be true, and yet it does seem to work.

The data are high dimensional (256), so maybe there is a “curse of dimensionality” thing going on here.

Perhaps we are finding OS-specific things:

The data collected for the benign files may be from a different version of the operating system than the malicious.

We don’t have version information about the data (beyond that these are Windows files).

Worrisome fact: there are several different sets of benign (or malicious) data. A classifier can be built to tell which set – which of the benign collections a file belongs to.


Know Your Data?

[Figure: t-SNE embedding, shown as a scatter plot.]

Maaten & Hinton (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579-2605.


Manifold Learning

Hypothesis: high dimensional data “lives” on a lower dimensional structure.

Manifold learning is a set of techniques to infer this structure, or to embed the data from the high dimensional space into a lower dimensional space that respects the local structure.


Multidimensional Scaling

Problem: Given a distance matrix (or dissimilarity matrix) D, find a set of points $X \in \mathbb{R}^d$ whose distance matrix d(X) best approximates D.

This is the problem solved by multidimensional scaling (MDS).

Different definitions of “best approximates” lead to different algorithms.

Classical MDS utilizes the eigenvector decomposition of (a modified version of) the distance matrix.

Some manifold learning algorithms compute a local distance and use MDS, others compute eigenvectors of related matrices. These are the algorithms I use most often.
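
Classical MDS can be sketched with nothing beyond the standard library. The version below double-centers the squared distances and extracts the leading eigenvectors by power iteration with deflation; that eigen-solver is a simplification chosen here for brevity and assumes the distances are (close to) Euclidean, so that the centered matrix has no dominant negative eigenvalue. The name `classical_mds` and its parameters are hypothetical.

```python
import math
import random

def _matvec(B, v):
    return [sum(b * x for b, x in zip(row, v)) for row in B]

def classical_mds(D, dim=2, iters=500, seed=0):
    """Embed n points in `dim` dimensions from an n x n distance matrix D."""
    n = len(D)
    # Double-center the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[D[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    tot = sum(row) / n
    B = [[-0.5 * (sq[i][j] - row[i] - row[j] + tot) for j in range(n)]
         for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dim for _ in range(n)]
    for d in range(dim):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):                       # power iteration
            w = _matvec(B, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:                         # nothing left to extract
                v = [0.0] * n
                break
            v = [x / norm for x in w]
        lam = sum(x * y for x, y in zip(v, _matvec(B, v)))   # Rayleigh quotient
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][d] = scale * v[i]
        # Deflate: remove the component just found before the next pass.
        B = [[B[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords
```

For distances measured between points that really lie in a `dim`-dimensional Euclidean space, the returned coordinates reproduce those distances up to rotation, reflection, and translation.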


Basic Graph Theory

A graph is a set V of vertices, and a set E of pairs of vertices (edges).

The edges can be directed or undirected, and can have weights.

In this talk they will be undirected.

The (graph) distance between two vertices is the length of the shortest path between them in the graph.

The adjacency matrix of a graph on n vertices is the n × n binary matrix with a 1 in those positions corresponding to the edges of the graph.

The spectrum of a graph is the eigen decomposition of the adjacency matrix A, or more generally, of some function f(A).
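
The adjacency matrix and graph distance defined above can be made concrete with a short sketch for undirected, unweighted graphs. Vertices are assumed to be numbered 0 to n − 1, and the function names are hypothetical.

```python
from collections import deque

def adjacency_matrix(n, edges):
    """n x n binary matrix with a 1 wherever two vertices share an edge."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = 1
        A[v][u] = 1          # undirected: the matrix is symmetric
    return A

def graph_distance(A, source):
    """Shortest-path lengths (in edges) from source, by breadth-first search.

    Vertices that cannot be reached from source get None.
    """
    dist = [None] * len(A)
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v, connected in enumerate(A[u]):
            if connected and dist[v] is None:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```

For the path 0–1–2–3 plus an isolated vertex 4, `graph_distance(A, 0)` gives `[0, 1, 2, 3, None]`.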


Graph Examples

ε-ball graph with ε = 0.25.

3-nearest neighbor graph.
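
Both kinds of graph named in the captions can be built directly from pairwise distances. The sketch below assumes Euclidean distance, treats the ε-ball boundary as inclusive, and symmetrizes the k-nearest neighbor graph (an edge is kept if either endpoint lists the other among its k nearest); these are assumptions, and the function names are hypothetical.

```python
import math

def epsilon_ball_graph(points, eps):
    """Edges (i, j), i < j, between all pairs of points at distance <= eps."""
    n = len(points)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if math.dist(points[i], points[j]) <= eps]

def knn_graph(points, k):
    """Undirected edges joining each point to its k nearest neighbors."""
    n = len(points)
    edges = set()
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)
```

An ε-ball graph can leave isolated vertices when ε is small, whereas a k-nearest neighbor graph gives every vertex at least k edges; which is preferable depends on how evenly the data are spread.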


Basic Steps of Manifold Learning

Given data $\{x_1, \dots, x_n\} \subset \mathbb{R}^p$:

1 Construct a graph whose vertices are the $x_i$ with edges between “near” points: k-nearest neighbor graph, ε-ball graph, or variations.

2 Compute the eigenvectors of one of: the adjacency matrix, the Laplacian of the adjacency matrix, or scaled or modified versions of the above.

3 Set Z to the matrix with columns corresponding to the main eigenvectors. That is, the rows $\{z_1, \dots, z_n\}$ are the “embedded” data.

Perform inference on Z.
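
The three steps can be sketched end to end for small data sets. This illustration makes several assumptions that are not in the talk: it uses an ε-ball graph, the unnormalized Laplacian L = D − A, a textbook Jacobi rotation method for the eigen decomposition, and it takes the "main" eigenvectors to be those with the smallest eigenvalues after skipping the first (constant) one, which presumes a connected graph. The function names are hypothetical.

```python
import math

def jacobi_eigh(S, sweeps=100, tol=1e-18):
    """Eigenvalues and eigenvectors (columns of V) of a symmetric matrix."""
    n = len(S)
    A = [list(map(float, row)) for row in S]
    V = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(A[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off < tol:                                  # off-diagonal part is negligible
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p][q] == 0.0:
                    continue
                diff = A[q][q] - A[p][p]
                theta = (math.pi / 4 if diff == 0.0
                         else 0.5 * math.atan(2.0 * A[p][q] / diff))
                c, s = math.cos(theta), math.sin(theta)
                for M in (A, V):                       # rotate columns p and q
                    for i in range(n):
                        mp, mq = M[i][p], M[i][q]
                        M[i][p], M[i][q] = c * mp - s * mq, s * mp + c * mq
                for j in range(n):                     # rotate rows p and q of A
                    ap, aq = A[p][j], A[q][j]
                    A[p][j], A[q][j] = c * ap - s * aq, s * ap + c * aq
    return [A[i][i] for i in range(n)], V

def spectral_embedding(points, eps, dim=2):
    n = len(points)
    # Step 1: epsilon-ball graph as an adjacency matrix.
    A = [[1.0 if i != j and math.dist(points[i], points[j]) <= eps else 0.0
          for j in range(n)] for i in range(n)]
    # Step 2: eigenvectors of the Laplacian L = D - A.
    L = [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)]
         for i in range(n)]
    vals, V = jacobi_eigh(L)
    # Step 3: keep the eigenvectors with the smallest eigenvalues, skipping the first.
    keep = sorted(range(n), key=lambda k: vals[k])[1:1 + dim]
    return [[V[i][k] for k in keep] for i in range(n)]   # row i is z_i
```

The rows of the returned matrix play the role of Z, so any classifier or clustering method can be run on them. The Jacobi method is only practical for small n; larger problems would need a sparse eigen-solver.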


Manifold Learning

Compute ε-ball graph on the Kaggle training data.

[Figure: four panels for the ε-ball graph, with points labeled by class (1–9): layout of the graph; embedding using the scaled Laplacian; embedding using the adjacency matrix; embedding using MDS on graph distance.]

Distribution A: Approved for Public Release NSWCDD-PN-17-00345

Page 35: Computational Statistics and Mathematics for Cyber Securitystatisticalcyber.com/talks/Marchette, David.pdf · Computational Statistics Machine Learning Manifold Learning Topological

IntroductionComputational Statistics

Machine LearningManifold Learning

Topological Data Analysis

Manifold Learning

Compute ε-ball graph on the Kaggle training data.

Layout thegraph.

Embed usingscaledLaplacian.

Embed usingadjacencymatrix.

Embed usingMDS on graphdistance.

[Figure: two-dimensional plot of the Kaggle training data, each point drawn as a digit.]


[Figure: two-dimensional plot of the Kaggle training data, each point drawn as a digit.]


[Figure: two-dimensional plot of the Kaggle training data, each point drawn as a digit.]


Manifold Learning Discussion

Different embedding methods extract different information about the data.

These two-dimensional plots are misleading in that there is no reason to assume the intrinsic dimensionality is 2.

Some care must be taken to ensure that the embedding method can be applied to new data.
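To make the last point concrete, here is one way a new observation can be placed into an existing embedding without recomputing it. The slide does not name a technique; this sketch assumes classical MDS and uses Gower's out-of-sample formula, with hypothetical function names and synthetic planar data.

```python
import numpy as np

def classical_mds(D2, dim=2):
    """Classical MDS from a matrix of squared dissimilarities."""
    n = len(D2)
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ D2 @ J                      # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dim]         # largest eigenvalues first
    vals, vecs = vals[idx], vecs[:, idx]
    return vecs * np.sqrt(vals), vals

def embed_new_point(D2, X, vals, d2_new):
    """Place a new point from its squared dissimilarities to the training set."""
    b = -0.5 * (d2_new - d2_new.mean() - D2.mean(axis=1) + D2.mean())
    return (X.T @ b) / vals

# Toy check: the points are planar, so a 2-D embedding is exact.
rng = np.random.default_rng(1)
P = rng.normal(size=(30, 2))
train, new = P[:-1], P[-1]
D2 = ((train[:, None] - train[None]) ** 2).sum(-1)
d2_new = ((train - new) ** 2).sum(-1)

X, vals = classical_mds(D2)
y = embed_new_point(D2, X, vals, d2_new)
recovered = ((X - y) ** 2).sum(-1)             # squared distances in the embedding
```

On this toy data the embedded distances in `recovered` match `d2_new` up to rounding. Graph-based embeddings such as the Laplacian one have no comparably simple formula, which is one reason the care mentioned above is needed.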


Joint Embedding

$$D_W = \begin{pmatrix} D_1 & W \\ W & D_2 \end{pmatrix}$$

Jointly embed $D_1$ and $D_2$ using $D_W$, where $W = \lambda D_1 + (1 - \lambda) D_2$.
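A minimal sketch of this construction follows. It assumes that D1 and D2 are dissimilarity matrices over the same n objects (two "views" of the data) and that the joint embedding is done with classical MDS; the slide does not state the embedding method, so that choice and all function names are assumptions.

```python
import numpy as np

def pairwise(X):
    """Euclidean distance matrix of the rows of X."""
    return np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))

def joint_dissimilarity(D1, D2, lam=0.5):
    """Assemble D_W = [[D1, W], [W, D2]] with W = lam*D1 + (1 - lam)*D2."""
    W = lam * D1 + (1.0 - lam) * D2
    return np.block([[D1, W], [W, D2]])

def classical_mds(D, dim=2):
    """Classical MDS from a (non-squared) dissimilarity matrix."""
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dim]
    # D_W need not be Euclidean, so clip any small negative eigenvalues.
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))

# Two hypothetical noisy views of the same 25 objects.
rng = np.random.default_rng(0)
Z = rng.normal(size=(25, 3))
D1 = pairwise(Z + 0.05 * rng.normal(size=Z.shape))
D2 = pairwise(Z + 0.05 * rng.normal(size=Z.shape))

DW = joint_dissimilarity(D1, D2, lam=0.5)
Y = classical_mds(DW, dim=2)
Y1, Y2 = Y[:25], Y[25:]     # embedded positions under each view
```

Because the diagonal of W is zero, the two copies of each object are treated as being at dissimilarity zero, which pulls `Y1[i]` and `Y2[i]` toward each other in the joint embedding.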


Topological Data Analysis (TDA)

The basic idea is to use topological features – measures that are invariant to smooth deformations – to learn about the structure of the data.

We will only be able to touch briefly on this subject.

See:

Carlsson, “Topology and Data”, Bulletin of the American Mathematical Society, 46, 2009, 255–308.

Ghrist, “Elementary Applied Topology”, Createspace Independent Publishing Platform, 2014.


Simplices

[Figure: examples of simplices.]

A (geometric) simplex of dimension d is the convex hull of d + 1 points in general position (that is, affinely independent).

A 0-simplex is a point, a 1-simplex a line segment, a 2-simplex a triangle, and so on.


Simplicial Complexes

[Figure: illustration of simplicial complexes.]

A simplicial complex is a collection S of simplices that satisfies the following conditions:

1 If σ ∈ S then so are the faces of σ.

2 If σ1, σ2 ∈ S are k-simplices, then either they are disjoint or they intersect in a lower dimensional simplex which is a face of both.
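In code it is common to work with an abstract simplicial complex, where each simplex is just its set of vertices. The sketch below (hypothetical helper names) checks condition 1. In this abstract setting condition 2 follows from it, because the intersection of two vertex sets is a subset of both and is therefore a shared face or empty.

```python
from itertools import combinations

def faces(simplex):
    """All nonempty proper faces of a simplex given as a set of vertices."""
    verts = sorted(simplex)
    return {frozenset(c)
            for k in range(1, len(verts))
            for c in combinations(verts, k)}

def closure(simplices):
    """Smallest face-closed collection containing the given simplices."""
    S = set()
    for s in simplices:
        s = frozenset(s)
        S.add(s)
        S |= faces(s)
    return S

def is_simplicial_complex(S):
    """Condition 1: every face of every simplex in S is also in S."""
    S = {frozenset(s) for s in S}
    return all(faces(s) <= S for s in S)

# A filled triangle {0, 1, 2} with an extra edge {2, 3} attached.
K = closure([{0, 1, 2}, {2, 3}])

# Not a complex: the triangle is present but most of its faces are missing.
bad = [{0, 1, 2}, {0, 1}]
```

Here `K` contains nine simplices: four vertices, four edges and one triangle.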


Persistent Homology

We construct an ε-ball graph on the data, and from this we get a simplicial complex.

We compute a measure of the topology (the rank of the homology group, that is, the Betti number) – how many “d-dimensional holes” are there?

Those structures that persist across ranges of ε are “interesting” and more likely to be real structure rather than noise.
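The idea can be illustrated on a tiny example. The sketch below assumes the simplicial complex is the clique (Vietoris-Rips) complex of the ε-ball graph, truncated at triangles, and computes the Betti numbers b0 and b1 over the rationals from the ranks of the boundary matrices. It simply recomputes the Betti numbers at each ε rather than tracking the birth and death of individual features, so it is a simplification of persistent homology, not a full implementation.

```python
from itertools import combinations
import numpy as np

def rips_complex(X, eps):
    """Vertices, edges and triangles of the clique complex of the eps-ball graph."""
    n = len(X)
    dist = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    edges = [(i, j) for i, j in combinations(range(n), 2) if dist[i, j] <= eps]
    eset = set(edges)
    tris = [t for t in combinations(range(n), 3)
            if all(e in eset for e in combinations(t, 2))]
    return list(range(n)), edges, tris

def betti_0_1(X, eps):
    """Betti numbers (b0, b1) from the ranks of the boundary matrices."""
    verts, edges, tris = rips_complex(X, eps)
    d1 = np.zeros((len(verts), len(edges)))      # boundary: edges -> vertices
    for c, (i, j) in enumerate(edges):
        d1[i, c], d1[j, c] = -1.0, 1.0
    eidx = {e: k for k, e in enumerate(edges)}
    d2 = np.zeros((len(edges), len(tris)))       # boundary: triangles -> edges
    for c, (i, j, k) in enumerate(tris):
        d2[eidx[(j, k)], c] = 1.0
        d2[eidx[(i, k)], c] = -1.0
        d2[eidx[(i, j)], c] = 1.0
    r1 = np.linalg.matrix_rank(d1) if edges else 0
    r2 = np.linalg.matrix_rank(d2) if tris else 0
    return int(len(verts) - r1), int(len(edges) - r1 - r2)

# Corners of a unit square: sides have length 1, diagonals length sqrt(2).
square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
results = {eps: betti_0_1(square, eps) for eps in (0.5, 1.2, 1.5)}
```

On this example there are four components and no loop at ε = 0.5, one component with one loop at ε = 1.2, and the loop is filled in at ε = 1.5 once the diagonals enter the graph. A loop that survived over a wide range of ε would be the kind of structure the slide calls "interesting".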


Euler Characteristic

One defines the Euler characteristic as:

$$\chi(X) = \sum_{j=0}^{n} (-1)^j \, \mathrm{Betti}_j(X).$$

This is equivalent to the standard Euler characteristic one learns in grade school, extended to general topological spaces and higher dimensions.

The “persistent” version is to compute this on the persistent homologies from the ε-ball graphs.
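For a finite simplicial complex the alternating sum of Betti numbers equals the alternating sum of simplex counts (vertices minus edges plus triangles, and so on), which is easier to compute. The sketch below uses that identity to trace the Euler characteristic as a function of ε. It assumes the clique complex of the ε-ball graph and is one reading of the "persistent" version, not necessarily the computation used in the talk; it enumerates all cliques, so it is only suitable for tiny examples.

```python
from itertools import combinations
import numpy as np

def euler_characteristic(X, eps):
    """chi of the clique complex: sum over k of (-1)^k * (number of k-simplices)."""
    n = len(X)
    dist = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))
    adj = dist <= eps
    chi = 0
    for size in range(1, n + 1):                 # a k-simplex has k + 1 vertices
        count = sum(1 for c in combinations(range(n), size)
                    if all(adj[i, j] for i, j in combinations(c, 2)))
        if count == 0:
            break                                # no larger cliques can exist
        chi += (-1) ** (size - 1) * count
    return chi

# Six points on the unit circle: a regular hexagon with side length 1.
angles = np.linspace(0.0, 2.0 * np.pi, 6, endpoint=False)
hexagon = np.column_stack([np.cos(angles), np.sin(angles)])
curve = {eps: euler_characteristic(hexagon, eps) for eps in (0.5, 1.1, 2.1)}
```

For the hexagon, χ is 6 when the points are isolated (ε = 0.5), 0 when they form a single loop (ε = 1.1, where Betti_0 = Betti_1 = 1), and 1 when every pair is joined (ε = 2.1).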


Persistent Euler Characteristics of Malware


Discussion

Mathematics has many tools for the data analyst, in particular for the analysis of cyber data.

These tools include:

Computational statistics.

Machine learning.

Graph theory.

Manifold learning.

Topological data analysis.

New applications of “pure” mathematics to data analysis are developed every day, and these areas are all huge growth areas for applied mathematicians.
