BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA...

37
BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park [email protected] , @vssubrah Joint work with Chanhyun Kang, Noseong Park, B. Aditya Prakash, Edoardo Serra

Transcript of BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA...

Page 1: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

BIG DATA ANALYTICSFOR CYBER SECURITY

V.S. SubrahmanianUniversity of Maryland, College Park

[email protected], @vssubrah

Joint work with Chanhyun Kang, Noseong Park, B. Aditya Prakash, Edoardo Serra

Page 2: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

Malware Pandemic

2

Symantec Report 2015

Page 3: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

Malware is hard to detect!

3

1st

DiscoveryInspect

Release signature

Releasepatch

Detection

DetectionDetection

Page 4: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

Key Challenge

• Statistics from Symantec WINE Dataset

– # of Detections <<< # of Infections

4

# of infected hosts

# of detected hosts

Page 5: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

Problem Statement

5

Predict this value

Using these detections

Page 6: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

Our Approaches

• Feature based prediction method

–Proposed a set of novel features

• Epidemic model inspired by SIR model

• Ensemble method that merges the previous two methods with other state-of-the-art techniques.

6

Page 7: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

FEATURE BASED PREDICTION MODEL

1st Method

7

Page 8: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

8

Symantec Telemetry data

Each record= (Host, Malware, File name, Infection time, Detection time)

‘Detection and Patch ability’ of each host

‘Detection and Patch hardness’ of each malware

2. Compute host-level features

3. Compute country-level features

Aggregate host level features, e.g. Average

4. Train a prediction model with the features

Prediction model

Features / Infection Ratio

The expected number of infections in future

day Feature #1 Feature #2 …… Infected Host Ratio

80% Training

d

Ground Truthd+1

20% Testd+n

Ground Truth vs. Predictions⁞

~2 years data

‘Detection and Patch incompetence’ of each host

Feature Based Method

Page 9: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

Detection/Patch Incompetence• Each record= (Host h, Malware m, File name f,

Infection time i, Detection time t)

• Detection time – Infection time– How good/bad is a user h at detecting malware m?

– How easy/hard is it to detect malware m?

• Patch time – Infection time– How good/bad is a user at patching a vulnerability/malware?

– How easy/hard is it to patch a vulnerability/malware?

• Average these values for each host host-level detection/patch incompetence

• Some other similar features, e.g., Detection time – Malware signature release time

91: These two are the most simplest features.

(Detection Incompetence1)

(Patch Incompetence1)

Page 10: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

Detection Ability/Hardness

• Each record= (Host h, Malware m, File name f, Infection time i, Detection time t)

• Detection Ability (ADA) of host h is the weighted sum of Detection Hardness (ADH) of malware detected by h.

• Detection Hardness of malware m is the weighted sum of Detection Ability of hosts that detected m.

10

A subset of WINE records, where Host = h

Page 11: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

BiFixpoint Algorithm

11

Uniform initialization

Recursive calculation

We prove that convergence is always guaranteed!

Page 12: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

Collaborative Features

12

• Given two similar1 hosts h1 and h2

– Suppose h1 was infected by m.

– h2 is likely to be infected soon with prob ~ sim(h1, h2).

• cf(h,m) is the estimated prob. of host h being infected by m (considering similarity).

• cf(C,m) is the sum of cf(h,m), where h is a host in country C.

1: We defined various similarity measures based on calculated features.

Similarity 0.5(0.9+0.5)/3

Page 13: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

Time Lag Features

13

• Today’s infection ratio depends on not only today’s features but also past features.

• Very high dimensional feature space

day Feature #1 Feature #1 (-1 day) Feature #1 (-7 day) …… Infected Host Ratio

80% Training

d

Ground Truthd+1

20% Testd+n Ground Truth vs.

Predictions⁞

Page 14: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

Recap of Features

14

• Features from raw values– Detection time – Infection time (Detection Incompetence)

– Patch time – Infection time (Patch Incompetence)

– Some features calculated from raw data

• Features from BiFixpoint Algorithm– Detection ability, Patch ability for hosts

– Detection hardness, Patch hardness for malware

• Collaborative Features– Infection numbers based on host similarity

• Country Human Development Index, …

• Time lag features

• Country level aggregation Regression Problem

Page 15: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

EPIDEMIC PREDICTION MODEL

2nd Method

15

Page 16: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

16

• SIR Model models the the dynamics of infectious disease.

• Sometimes used for social rumor diffusion.

• Does not fit the spread of malware.– Recovered doesn’t precisely capture the dynamics of malware spread.

– Transition rate is not designed for malware.

– Network data may not always be available.

Epidemic Model

Page 17: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

DIPS Epidemic Model

• “Recovered” “Detected” and “Patched”

• Carefully designed transition rates

• S(t), I(t), D(t) and P(t) are the number of susceptible, infected, detected and patched hosts at time t

• S(t), I(t), D(t) and P(t) are recursively defined.

17

S

IP

β(t): infection rates

γ(t) : # detections in modelδ(t): patching rate

θ(t): patching rate

1- δ(t)

𝜸(𝒕) = 𝜸𝟎 ∙ 𝑫𝑬𝑻(𝒕)

𝜷(𝒕) = 𝜷𝟎 ∙ (𝟏 + 𝑷𝒂 ∙ 𝐜𝐨𝐬𝟐𝝅

𝑷𝒑∙ (𝒕 + 𝑷𝑺) )

𝜽(𝒕) =𝟎 (𝒕 < 𝒕𝒑)

𝜽𝟎 (𝒕 ≥ 𝒕𝒑)

𝜹(𝒕) =𝟎 (𝒕 < 𝒕𝒑, 𝒕𝒅)

𝜹𝟎 (𝒕 ≥ 𝒕𝒑, 𝒕𝒅)

- S(t+1) = S(t) − β(t)·S(t)·I(t) + (1- δ(t))·D(t) − θ(t)·S(t)

- I(t+1) = I(t) + β(t) ·S(t)·I(t) − γ0 ·DET(t)

- D(t+1) = γ0·DET(t)

- P(t+1) = P(t)+ δ(t)·D(t)+ θ(t)·S(t)

D

Page 18: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

How to predict with DIPS

• Find the optimal set of parameters with Least Square Method to minimize the sum of (true-prediction)

2

• Train with the target country-malware pair.

– Initialization local optimal not stable learning

• Learning algorithm (two phases)

– First, train the parameters with all countries and malware

– Second, train again only for the target country-malware

Host state

I

S

D

P

Train with the first 80% infection/detection history

Detections, DET(t), for the last 20%

26Host state

I

S

D

P

Infections I(t)

Page 19: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

DIPS - Susceptible

19

S

I

D

P

β(t): infection rate

δ(t): patching rate

θ(t): patching rate

1- δ(t)

𝜷(𝒕) = 𝜷𝟎 ∙ (𝟏 + 𝑷𝒂 ∙ 𝐜𝐨𝐬𝟐𝝅

𝑷𝒑∙ (𝒕 + 𝑷𝑺) )

- SI in between t and t+1: β(t)·S(t)·I(t)1

- DS: (1- δ(t))·D(t)

- SP: θ(t)·S(t)

- S(t+1) = S(t) − β(t)·S(t)·I(t) − θ(t)·S(t) + (1- δ(t))·D(t)

1: This is from SIR model.

𝜷𝟎: base infection rate𝑷𝒂: amplitude of cycle𝑷𝒑: period of cycle

𝑷𝑺: phase shift of cycle

Page 20: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

20

S

I

D

P

γ(t) : # detections in modelδ(t): patching rate

1- δ(t)

𝜸(𝒕) = 𝜸𝟎 ∙ 𝑫𝑬𝑻(𝒕)𝜹(𝒕) =𝟎 (𝒕 < 𝒕𝒑, 𝒕𝒅)

𝜹𝟎 (𝒕 ≥ 𝒕𝒑, 𝒕𝒅)

- ID: γ0 ·DET(t), where DET(t) is the true detection numbers at time t

- DS: (1- δ(t))·D(t)

- DP: δ(t)·D(t)

- D(t) = γ0·DET(t)

DIPS - Detected

Page 21: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

• Modeling of “Birth” of the SIR model

• 𝝈 𝒕 is added.

21

S

I

D

P

β(t): infection rates

γ(t) : # detections in modelδ(t): patching rate

θ(t): patching rate

1- δ(t)

𝜸(𝒕) = 𝜸𝟎 ∙ 𝑫𝑬𝑻(𝒕)

𝝈 𝒕 : inflow of additional hosts

𝜷(𝒕) = 𝜷𝟎 ∙ (𝟏 + 𝑷𝒂 ∙ 𝐜𝐨𝐬𝟐𝝅

𝑷𝒑∙ (𝒕 + 𝑷𝑺) )

𝜽(𝒕) =𝟎 (𝒕 < 𝒕𝒑)

𝜽𝟎 (𝒕 ≥ 𝒕𝒑)

𝜹(𝒕) =𝟎 (𝒕 < 𝒕𝒑, 𝒕𝒅)

𝜹𝟎 (𝒕 ≥ 𝒕𝒑, 𝒕𝒅)

- S(t+1) = S(t) − β(t)·S(t)·I(t) + (1- δ(t))·D(t) − θ(t)·S(t)+𝝈 𝒕- I(t+1) = I(t) + β(t) ·S(t)·I(t) − γ0 ·DET(t)

- D(t+1) = γ0·DET(t)

- P(t+1) = P(t)+ δ(t)·D(t)+ θ(t)·S(t)

DIPS-exp Epidemic Model

Page 22: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

ENSEMBLE PREDICTION MODEL

3rd Method

22

Page 23: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

Combine Prediction Models

• Combine Feature Method and DIPS.

• Use DIPS prediction results as additional features.

23

ESM

Symantec WINE dataset

Host state

Infected

Susceptible

Detected

PatchedMalware downloaded

Signature released

Page 24: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

Not Enough Training Data

• To predict number of hosts infected by malware m, train jointly with similar malware

• Discover similar malware with Dynamic Time Warping to calculate time-series similarity

• Lots of noise

24

ESM

Symantec WINE dataset

Host state

Infected

Susceptible

Detected

PatchedMalware downloaded

Signature released

Cluster of Similar Malware

m

Page 25: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

Robust Regression

• Need a robust regression

• Gaussian Process Regression

– Very strong Bayesian regression method

– Less parametric (Parameters are calculated from data with maximum likelihood.)

25

Linear Regression

Ridge Regression

Lasso Regression

Linear combination of weighted features + regularization term

Page 26: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

26

ESM

The expected number of infections in future

Symantec WINE dataset

Host state

Infected

Susceptible

Detected

PatchedMalware downloaded

Signature released

Cluster of Similar Malware

m

Feature #1 Feature #2 DIPS output …… Infected Host Ratio

80% Training (m)

80% Training (m1)

20% Test (m)

ESM Model

GPR

Page 27: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

Experiment Environment

• Top 50 Most Infectious Malware, Top 40 Country in GDP per capita 2000 Predictions

• 1.45M unique hosts, 2.99M records

• FBP

• DIPS, DIPS-exp

• FUNNEL: state-of-the-art epidemic model

• ESM0 (FBP + DIPS + DIPS-exp +Similar Malware)

• ESM1 (ESM0 + FUNNEL)

27

Page 28: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

Measurements

28

• MAE*=|true infections - predicted infections|

• MSE = (true infection ratio - predicted infection ratio)2

• RMSE = sqrt(MSE)

• NRMSE

• Pearson Correlation Coefficient

Page 29: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

FUNNEL (prior art)

• State of the art epidemic model for human disease

• C.C. between truths and predictions are very bad.

29

Prediction for country-malware pair

Page 30: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

Feature Based Prediction

30

Page 31: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

DIPS

31

Not enough training data

Page 32: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

ESM0

32

Page 33: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

Error Values

33

Page 34: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

34

Page 35: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

35

Page 36: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

References

36

C. Kang, N. Park, B.A. Prakash, E. Serra, V.S. Subrahmanian. Ensemble Models for Data-Driven Prediction of Malware Infections, Proc. 2016 ACM Web Science & Data Mining Conference (WSDM), Feb 2016.

Page 37: BIG DATA ANALYTICS FOR CYBER SECURITYin.bgu.ac.il/en/engn/ise/dmbi2016/Documents/...BIG DATA ANALYTICS FOR CYBER SECURITY V.S. Subrahmanian University of Maryland, College Park vs@cs.umd.edu,

Contact Information

37

V.S. SubrahmanianDept. of Computer Science & UMIACSUniversity of MarylandCollege Park, MD [email protected]

@vssubrah

www.cs.umd.edu/~vs/