Time Series Analysis for Network Security

Transcript
Page 1: Time Series Analysis for Network Security

Endgame Proprietary

Page 2

Time Series Analysis for Network Security

Phil Roth
Data Scientist @ Endgame

mrphilroth.com

Page 3

First, an introduction. My history of Python scientific computing, in function calls:

Page 4

os.path.walk

Physics Undergraduate @ PSU

AMANDA Neutrino Telescope

Page 5

pylab.plot

Physics Graduate Student @ UMD

IceCube Neutrino Telescope

Page 6

numpy.fft.fft

Radar Scientist @ User Systems, Inc.

Various Radar Simulations

Page 7

pandas.io.parsers.read_csv

Side Projects

Scraping data from the web

Page 8

sklearn.linear_model.LogisticRegression

Side Projects

Machine learning competitions

Page 9

(the rest of this talk…)

Data Scientist @ Endgame

Time Series Anomaly Detection

Page 10

Problem: Highlight when recorded metrics deviate from normal patterns.

for example: a high number of connections might be an indication of a brute force attack

for example: a large volume of outgoing data might be an indication of an exfiltration event

Page 11

Solution: Build a system that can track and store historical records of any metric. Develop an algorithm that will detect irregular behavior with minimal false positives.

Page 12

Gathering Data
kairos
kafka-python
pyspark

Building Models
classification
ewma
arima

Page 13

[Diagram: data flow]

Data Sources
Kafka Topics: distributed message passing (real time, stream)
HDFS: large-scale distributed data store (batch, historical)
Redis: in-memory key-value data store

Page 14

kairos

A Python interface to backend storage databases (Redis in my case, others available) tailored for time series storage.

Takes care of expiring data and different types of time series (series, histogram, count, gauge, set).

Open sourced by Agora Games.

https://github.com/agoragames/kairos

Page 15

kairos

Example code:

from redis import Redis
from kairos import Timeseries

# Each interval: "step" is the bucket width in seconds, "steps" the number of buckets kept.
intervals = {
    "days": {"step": 60, "steps": 2880},
    "months": {"step": 1800, "steps": 4032},
}

rclient = Redis("localhost", 6379)
ktseries = Timeseries(rclient, type="histogram", intervals=intervals)

ktseries.insert(metric_name, metric_value, timestamp)
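As a worked check of the interval configuration, the sketch below multiplies each `step` by its `steps` to get the time span an interval retains. It assumes `step` is the bucket width in seconds and `steps` is the number of buckets kept before expiry; the helper name `retention_seconds` is illustrative and not part of kairos.

```python
# Interval configuration copied from the slide.
intervals = {
    "days": {"step": 60, "steps": 2880},
    "months": {"step": 1800, "steps": 4032},
}


def retention_seconds(interval):
    """Time span covered by one interval: bucket width times bucket count."""
    return interval["step"] * interval["steps"]


for name, interval in intervals.items():
    days = retention_seconds(interval) / 86400.0
    print(f"{name}: {interval['step']} s buckets, {days:g} days retained")
```

Under that assumption, "days" keeps 2 days at one-minute resolution and "months" keeps 84 days at 30-minute resolution.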

Page 16

kafka-python

A Python interface to Apache Kafka, where Kafka is publish-subscribe messaging rethought as a distributed commit log.

Allows me to subscribe to the events as they come in real time.

https://github.com/mumrah/kafka-python

Page 17

kafka-python

Example code:

from kafka.client import KafkaClient
from kafka.consumer import SimpleConsumer

kclient = KafkaClient("localhost:9092")
# Arguments: client, consumer group, topic.
kconsumer = SimpleConsumer(kclient, "timevault", "rawmsgs")

for message in kconsumer:
    insert_to_kairos(message)

Page 18

pyspark

A Python interface to Apache Spark, where Spark is a fast and general engine for large scale data processing.

Allows me to fill in historical data to the time series when I add or modify metrics.

http://spark.apache.org/

Page 19

pyspark

Example code:

from pyspark import SparkContext, SparkConf

spark_conf = (SparkConf()
              .setMaster("localhost")
              .setAppName("timevault-update"))
sc = SparkContext(conf=spark_conf)

# count() is an action, so it forces the lazy map to run over every line.
rdd = (sc.textFile(hdfs_files)
       .map(insert_to_kairos)
       .count())

Page 20

pyspark

Example code:

from json import loads
import timevault as tv
from functools import partial
from pyspark import SparkContext, SparkConf

spark_conf = (SparkConf()
              .setMaster("localhost")
              .setAppName("timevault-update"))
sc = SparkContext(conf=spark_conf)

# Parse each line, expand it into metric tuples, drop tuples at or past the
# time limit, then insert one partition at a time.
rdd = (sc.textFile(tv.conf.hdfs_files)
       .map(loads)
       .flatMap(tv.flatten_message)
       .flatMap(partial(tv.emit_metrics, metrics=tv.metrics_to_emit))
       .filter(lambda tup: tup[2] < float(tv.conf.limit_time))
       .mapPartitions(partial(tv.insert_to_kairos, conf=tv.conf))
       .count())
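The shape of that chain can be tried locally without Spark. The sketch below mimics `map`, `flatMap`, and `filter` with plain iterators. Every helper and the message layout are hypothetical stand-ins, since the talk does not show the `timevault` module; the only thing carried over is that the third tuple element is compared against a time limit.

```python
from functools import partial
from itertools import chain
from json import loads


# Hypothetical stand-ins for the talk's private timevault helpers.
def flatten_message(msg):
    """Yield one flat record per event in a message (assumed layout)."""
    for event in msg["events"]:
        yield {"host": msg["host"], **event}


def emit_metrics(record, metrics):
    """Yield (metric_name, value, timestamp) tuples for the requested metrics."""
    for name in metrics:
        if name in record:
            yield ("%s.%s" % (record["host"], name), record[name], record["ts"])


def flat_map(func, items):
    """Local equivalent of RDD.flatMap: apply func, then flatten one level."""
    return chain.from_iterable(func(item) for item in items)


lines = [
    '{"host": "a", "events": [{"ts": 10, "conns": 3}, {"ts": 70, "conns": 5}]}',
    '{"host": "b", "events": [{"ts": 20, "conns": 1, "bytes": 900}]}',
]
limit_time = 60.0  # illustrative cutoff, in the same units as "ts"

records = flat_map(flatten_message, map(loads, lines))
tuples = flat_map(partial(emit_metrics, metrics=["conns", "bytes"]), records)
kept = [tup for tup in tuples if tup[2] < limit_time]
print(kept)
```

`partial` plays the same role here as in the Spark version: it fixes the extra keyword argument so the function can be passed where a one-argument callable is expected.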

Page 21

the end result

from pandas import DataFrame, to_datetime

series = ktseries.series(metric_name, "months", transform=transform)
ts, fields = zip(*series.items())
df = DataFrame({"data": fields}, index=to_datetime(ts, unit="s"))

Page 22

building models

The first naïve model is simply the mean and standard deviation across all time.

blue: actual number of connections
green: prediction window
red: actual value exceeded standard deviation limit
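A minimal sketch of this first model, assuming the metric is available as a plain array of counts. The function name `naive_outliers`, the `stdlimit` default, and the sample data are illustrative assumptions, not taken from the talk.

```python
import numpy as np


def naive_outliers(values, stdlimit=3.0):
    """Flag points more than stdlimit standard deviations from the global mean."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    if std == 0.0:
        # A constant series has no spread, so nothing can be flagged.
        return np.zeros(values.shape, dtype=bool)
    return np.abs(values - mean) > stdlimit * std


# Repeating baseline with one injected spike at the end.
conns = np.array([10, 12, 11, 9] * 12 + [60])
flags = naive_outliers(conns)
print(np.flatnonzero(flags))
```

One weakness is visible even here: the spike is included in the global mean and standard deviation, so a large anomaly widens the very band it is tested against.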

Page 23

building models

The second, slightly less naïve model fits a sine curve to the whole series.

blue: actual number of connections
green: prediction window
red: actual value exceeded standard deviation limit
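One way to sketch a sine fit is shown below. It assumes the period is fixed at one day, which lets the fit be written as ordinary linear least squares on a constant, a sine, and a cosine term. This is a swapped-in simplification for illustration, not necessarily how the talk's model was fitted; the function name and synthetic data are also assumptions.

```python
import numpy as np

DAY = 24 * 3600  # seconds


def fit_daily_sine(t, y):
    """Least-squares fit of y ~ c + a*sin(wt) + b*cos(wt) with a one-day period."""
    w = 2 * np.pi / DAY
    design = np.column_stack([
        np.ones_like(t, dtype=float),
        np.sin(w * t),
        np.cos(w * t),
    ])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs, design @ coeffs


t = np.arange(0, 3 * DAY, 600)  # three days sampled every 10 minutes
y = 100 + 40 * np.sin(2 * np.pi * t / DAY)
coeffs, fitted = fit_daily_sine(t, y)
print(coeffs)  # constant level, sine weight, cosine weight
```

The residual `y - fitted` can then be compared against a standard deviation limit in the same way as in the first model.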

Page 24

classification

Both naïve models left a lot to be desired. Two simple classifications would help us treat different types of time series appropriately:

Does this metric show a weekly pattern (i.e., different behavior on weekends versus weekdays)?

Does this metric show a daily pattern?

Page 25

classification weekly

Fit a sine curve to the weekday and weekend periods.

Use the ratio of the levels of those fits to determine whether weekdays should be treated separately from weekends.

Page 26

classification weekly

import numpy as np
from scipy.optimize import leastsq

def fitfunc(p, x):
    # p[0]: mean level, p[1]: relative amplitude, p[2]: phase shift in seconds
    return p[0] * (1 - p[1] * np.sin(2 * np.pi / (24 * 3600) * (x + p[2])))

def residuals(p, y, x):
    return y - fitfunc(p, x)

def fit(tsdf):
    # Average all days together by time of day, then fit one daily sine.
    tsgb = tsdf.groupby(tsdf.timeofday).mean()
    p0 = np.array([tsgb["conns"].mean(), 1.0, 0.0])
    plsq, suc = leastsq(residuals, p0,
                        args=(tsgb["conns"], np.array(tsgb.index)))
    return plsq

Page 27

classification weekly

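import pandas as pd

def weekend_ratio(tsdf):
    # True for Monday through Friday, False for Saturday and Sunday.
    tsdf['weekday'] = pd.Series(tsdf.index.weekday < 5, index=tsdf.index)
    tsdf['timeofday'] = (tsdf.index.second + tsdf.index.minute * 60 +
                         tsdf.index.hour * 3600)

    wdayplsq = fit(tsdf[tsdf.weekday == 1])
    wendplsq = fit(tsdf[tsdf.weekday == 0])

    # Ratio of the fitted weekend level to the fitted weekday level.
    return wendplsq[0] / wdayplsq[0]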

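[Figure: ratio axis marked 0, cutoff, 1, and 1 / cutoff, with the band around 1 (between cutoff and 1 / cutoff) labeled "No weekly variation."]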
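The decision on that ratio axis can be sketched as a small predicate. The function name and the default `cutoff` of 0.8 are illustrative assumptions; the talk does not state the cutoff value it used.

```python
def has_weekly_pattern(ratio, cutoff=0.8):
    """True when the weekend/weekday level ratio falls outside [cutoff, 1 / cutoff]."""
    if not 0.0 < cutoff < 1.0:
        raise ValueError("cutoff must lie strictly between 0 and 1")
    return ratio < cutoff or ratio > 1.0 / cutoff


for ratio in (0.3, 0.9, 1.0, 1.1, 2.0):
    print(ratio, has_weekly_pattern(ratio))
```

Using both `cutoff` and `1 / cutoff` keeps the test symmetric: a metric that is much busier on weekends is treated the same as one that is much quieter.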

Page 28

classification weekly

Weekly pattern.

No weekly pattern.

Page 29

classification daily

Take a Fourier transform of the time series, and inspect the bins associated with a frequency of one cycle per day.

Use the ratio of those bins to the first bin (the constant, or DC, component) in order to classify the time series.

Page 30

classification daily

Time series on weekdays shown with a strong daily pattern.

Fourier transform with bins around the day frequency highlighted.

Page 31

classification daily

Time series on weekends shown with no daily pattern.

Fourier transform with bins around the day frequency highlighted.

Page 32

classification daily

import numpy as np

def daily_ratio(tsdf):
    nbins = len(tsdf)
    deltat = (tsdf.index[1] - tsdf.index[0]).seconds  # sample spacing in seconds
    deltaf = 1.0 / (nbins * deltat)                   # frequency resolution in Hz

    # Find the bin associated with the frequency of a day.
    daybin = int((1.0 / (24 * 3600)) / deltaf)

    rfft = np.abs(np.fft.rfft(tsdf["conns"]))
    # Sum the three bins around the daily frequency, normalized by the DC bin.
    daily_ratio = np.sum(rfft[daybin - 1:daybin + 2]) / rfft[0]
    return daily_ratio
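A worked example of the bin arithmetic on synthetic data, using NumPy only. For a 7-day series the frequency resolution is one cycle per 7 days, so one cycle per day lands in bin 7. The sketch rounds rather than truncates when computing the bin, which is an assumption added here to guard against floating-point error; the function names, noise level, and series are likewise illustrative.

```python
import numpy as np

DAY = 24 * 3600  # seconds


def day_bin(nsamples, deltat):
    """Index of the rfft bin closest to one cycle per day."""
    deltaf = 1.0 / (nsamples * deltat)        # frequency resolution in Hz
    return int(round((1.0 / DAY) / deltaf))   # round() avoids float truncation


def daily_ratio(values, deltat):
    """Spectral magnitude around the daily bin, relative to the DC bin."""
    spectrum = np.abs(np.fft.rfft(values))
    k = day_bin(len(values), deltat)
    return spectrum[k - 1:k + 2].sum() / spectrum[0]


deltat = 60                                   # one sample per minute
t = np.arange(0, 7 * DAY, deltat)             # seven days of samples
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 5.0, t.size)

daily = 100 + 50 * np.sin(2 * np.pi * t / DAY) + noise   # strong daily cycle
flat = 100 + noise                                        # no daily cycle

print(day_bin(t.size, deltat))
print(daily_ratio(daily, deltat), daily_ratio(flat, deltat))
```

The series with a daily cycle gives a ratio near the sine amplitude divided by twice the mean level, while the flat series gives a ratio close to zero, so a single threshold separates the two cases.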

Page 33

ewma

Exponentially weighted moving average:

The decay parameter is specified as a span, s, in pandas, related to α by:

α = 2 / (s + 1)

A normal EWMA analysis is done when the metric shows no daily pattern. A stacked EWMA analysis is done when there is a daily pattern.
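The span relation can be checked with a short recursive implementation. This sketch uses the simple recursive form of the EWMA with the first output seeded by the first sample. That form is an assumption for illustration: it corresponds to pandas with `adjust=False`, whereas the pandas default normalizes the weights and differs for the earliest samples.

```python
def span_to_alpha(span):
    """Convert a pandas-style span s into the smoothing factor alpha = 2 / (s + 1)."""
    return 2.0 / (span + 1.0)


def ewma(values, span):
    """Recursive EWMA: y[0] = x[0], y[t] = alpha * x[t] + (1 - alpha) * y[t - 1]."""
    alpha = span_to_alpha(span)
    smoothed = []
    for x in values:
        prev = smoothed[-1] if smoothed else x
        smoothed.append(alpha * x + (1.0 - alpha) * prev)
    return smoothed


print(span_to_alpha(15))          # smoothing factor for a span of 15
print(ewma([0.0, 8.0, 8.0], 3))   # a span of 3 gives alpha = 0.5
```

A larger span means a smaller α, so each new sample moves the average less and older history is forgotten more slowly.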

Page 34

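ewma normal

def ewma_outlier(tsdf, stdlimit=5, span=15):
    # The slide used pd.ewma and pd.ewmstd, which have since been removed
    # from pandas; Series.ewm is the current equivalent.
    # shift(1) makes each prediction depend only on earlier samples.
    tsdf['conns_binpred'] = tsdf['conns'].ewm(span=span).mean().shift(1)
    tsdf['conns_binstd'] = tsdf['conns'].ewm(span=span).std().shift(1)

    # Distance from the prediction, in units of the predicted standard deviation.
    tsdf['conns_stds'] = ((tsdf['conns'] - tsdf['conns_binpred']) /
                          tsdf['conns_binstd'])
    tsdf['conns_outlier'] = (tsdf['conns_stds'].abs() > stdlimit)

    return tsdf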

Page 35

ewma normal

blue: actual number of connections
green: prediction window
red: actual value exceeded standard deviation limit

Page 36

ewma normal

blue: actual response size
green: prediction window
red: actual value exceeded standard deviation limit

Page 37

ewma stacked

Page 38

ewma stacked

Page 39

ewma stacked

Page 40

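ewma stacked

Shift the EWMA results by a day and overlay them on the original DataFrame.

import pandas as pd

def stacked_outlier(tsdf, stdlimit=4, span=10):
    # EWMA across days, computed separately for each time-of-day bin.
    # The slide used pd.ewma and pd.ewmstd (since removed from pandas)
    # and an undefined colname; 'conns' is used here to match the other columns.
    gbdf = tsdf.groupby('timeofday')['conns']
    gbdf = pd.DataFrame({
        'conns_binpred': gbdf.transform(lambda s: s.ewm(span=span).mean()),
        'conns_binstd': gbdf.transform(lambda s: s.ewm(span=span).std())})

    # Number of samples in one day.
    interval = tsdf.timeofday.iloc[1] - tsdf.timeofday.iloc[0]
    nshift = int(86400.0 / interval)

    # Yesterday's EWMA for a time of day becomes today's prediction.
    gbdf = gbdf.shift(nshift)
    tsdf = gbdf.combine_first(tsdf)

    tsdf['conns_stds'] = ((tsdf['conns'] - tsdf['conns_binpred']) /
                          tsdf['conns_binstd'])
    tsdf['conns_outlier'] = (tsdf['conns_stds'].abs() > stdlimit)

    return tsdf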
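The stacking idea can be illustrated with NumPy alone by reshaping the series into one row per day and running the EWMA down each time-of-day column. This is a simplified sketch covering only the prediction, not the standard deviation band. It assumes evenly spaced samples with a whole number of days, and the function name, toy pattern, and spike are illustrative.

```python
import numpy as np


def stacked_ewma_prediction(values, slots_per_day, span=10):
    """Predict each slot from an EWMA of the same slot on previous days."""
    alpha = 2.0 / (span + 1.0)
    # One row per day; requires len(values) to be a multiple of slots_per_day.
    days = np.asarray(values, dtype=float).reshape(-1, slots_per_day)
    pred = np.full(days.shape, np.nan)   # no prediction exists for the first day
    level = days[0].copy()               # per-slot EWMA state
    for d in range(1, days.shape[0]):
        pred[d] = level                  # forecast uses only earlier days
        level = alpha * days[d] + (1.0 - alpha) * level
    return pred.ravel()


pattern = np.array([5.0, 20.0, 50.0, 20.0])   # toy day with four slots
series = np.tile(pattern, 5)                  # five identical days
series[14] = 200.0                            # spike on the fourth day, third slot
pred = stacked_ewma_prediction(series, slots_per_day=4)
residual = np.abs(series - pred)
print(np.nanargmax(residual))
```

Because each slot is compared only with its own history, the regular daily peak is predicted correctly and only the injected spike stands out. The spike also raises that slot's prediction for the following day, which is the usual cost of updating the average with anomalous values.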

Page 41

ewma stacked

blue: actual number of connections
green: prediction window
red: actual value exceeded standard deviation limit

Page 42

arima

I am currently investigating using ARIMA (autoregressive integrated moving average) models to make better predictions.

I’m not convinced that this level of detail is necessary for the analysis I’m doing, but I wanted to highlight another cool scientific computing library that’s available.

Page 43

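arima

# Legacy statsmodels API, as on the slide; newer releases provide
# statsmodels.tsa.arima.model.ARIMA instead.
from statsmodels.tsa.arima_model import ARIMA

def arima_model_forecast(tsdf, p, d, q):
    # Fit on everything except the last sample, then forecast that sample.
    arima_model = ARIMA(tsdf["conns"][:-1], (p, d, q)).fit()
    forecast, stderr, conf_int = arima_model.forecast(1)

    # Write through .loc to avoid chained assignment.
    tsdf.loc[tsdf.index[-1], "conns_binpred"] = forecast[0]
    tsdf.loc[tsdf.index[-1], "conns_binstd"] = stderr[0]

    return tsdf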

Page 44

arima

blue: actual number of connections
green: prediction window
red: actual value exceeded standard deviation limit

p = d = q = 1

Page 45

takeaways

Python provides simple and usable interfaces to most data handling projects.

Combined, these interfaces can create a full data analysis pipeline from collection to analysis.

Page 46

© 2014 Endgame