Endgame Proprietary
Time Series Analysis for Network Security
Phil Roth
Data Scientist @ Endgame
mrphilroth.com
First, an introduction. My history of Python scientific computing, in function calls:
os.path.walk
Physics Undergraduate @ PSU
AMANDA Neutrino Telescope
pylab.plot
Physics Graduate Student @ UMD
IceCube Neutrino Telescope
numpy.fft.fft
Radar Scientist @ User Systems, Inc.
Various Radar Simulations
pandas.io.parsers.read_csv
Side Projects
Scraping data from the web
sklearn.linear_model.LogisticRegression
Side Projects
Machine learning competitions
(the rest of this talk…)
Data Scientist @ Endgame
Time Series Anomaly Detection
Problem: Highlight when recorded metrics deviate from normal patterns.
For example, a high number of connections might be an indication of a brute force attack.
For example, a large volume of outgoing data might be an indication of an exfiltration event.
Solution: Build a system that can track and store historical records of any metric. Develop an algorithm that will detect irregular behavior with minimal false positives.
Gathering Data: kairos, kafka-python, pyspark
Building Models: classification, ewma, arima
data flow
Data Sources
Kafka Topics: distributed message passing (real-time stream)
HDFS: large-scale distributed data store (historical batch)
Redis: in-memory key-value data store
kairos
A Python interface to backend storage databases (Redis in my case; others are available), tailored for time series storage.
Takes care of expiring data and different types of time series (series, histogram, count, gauge, set).
Open sourced by Agora Games.
https://github.com/agoragames/kairos
kairos
Example code:

    from redis import Redis
    from kairos import Timeseries

    # Each interval has a bucket width ("step", in seconds) and a number of
    # buckets to retain ("steps")
    intervals = {
        "days": {"step": 60, "steps": 2880},
        "months": {"step": 1800, "steps": 4032},
    }

    rclient = Redis("localhost", 6379)
    ktseries = Timeseries(rclient, type="histogram", intervals=intervals)

    ktseries.insert(metric_name, metric_value, timestamp)
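To make the interval configuration concrete, here is a small sketch that works out how much history each interval retains. It assumes, as I understand the kairos configuration, that "step" is the bucket width in seconds and "steps" is the number of buckets kept; the `retention` helper is a hypothetical name used only for this illustration.

```python
from datetime import timedelta

intervals = {
    "days": {"step": 60, "steps": 2880},
    "months": {"step": 1800, "steps": 4032},
}

def retention(interval):
    """Total history kept for one interval: bucket width times bucket count."""
    return timedelta(seconds=interval["step"] * interval["steps"])

# Map each interval name to the span of time it covers
windows = {name: retention(cfg) for name, cfg in intervals.items()}

for name, window in windows.items():
    print(name, window)
```

Under that reading, "days" keeps one-minute buckets for 2 days and "months" keeps 30-minute buckets for 84 days.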
kafka-python
A Python interface to Apache Kafka, where Kafka is publish-subscribe messaging rethought as a distributed commit log.
Allows me to subscribe to the events as they come in real time.
https://github.com/mumrah/kafka-python
kafka-python
Example code:

    from kafka.client import KafkaClient
    from kafka.consumer import SimpleConsumer

    kclient = KafkaClient("localhost:9092")
    # Arguments: client, consumer group, topic
    kconsumer = SimpleConsumer(kclient, "timevault", "rawmsgs")

    # Iterating over the consumer yields messages as they arrive
    for message in kconsumer:
        insert_to_kairos(message)
pyspark
A Python interface to Apache Spark, where Spark is a fast and general engine for large scale data processing.
Allows me to fill in historical data to the time series when I add or modify metrics.
http://spark.apache.org/
pyspark
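Example code:

    from pyspark import SparkContext, SparkConf

    # Master is shown as on the slide; Spark normally expects a master URL
    # such as "local[*]" or "spark://host:7077"
    spark_conf = (SparkConf()
                  .setMaster("localhost")
                  .setAppName("timevault-update"))
    sc = SparkContext(conf=spark_conf)

    # count() is an action, so it forces the lazy map() to run
    rdd = (sc.textFile(hdfs_files)
           .map(insert_to_kairos)
           .count())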
pyspark
Example code:

    from json import loads
    from functools import partial

    import timevault as tv
    from pyspark import SparkContext, SparkConf

    spark_conf = (SparkConf()
                  .setMaster("localhost")
                  .setAppName("timevault-update"))
    sc = SparkContext(conf=spark_conf)

    # Parse each line, expand it into metric tuples, keep only tuples older
    # than limit_time, and insert them one partition at a time
    rdd = (sc.textFile(tv.conf.hdfs_files)
           .map(loads)
           .flatMap(tv.flatten_message)
           .flatMap(partial(tv.emit_metrics, metrics=tv.metrics_to_emit))
           .filter(lambda tup: tup[2] < float(tv.conf.limit_time))
           .mapPartitions(partial(tv.insert_to_kairos, conf=tv.conf))
           .count())
the end result

    from pandas import DataFrame, to_datetime

    series = ktseries.series(metric_name, "months", transform=transform)
    ts, fields = zip(*series.items())
    df = DataFrame({"data": fields}, index=to_datetime(ts, unit="s"))
building models
The first naïve model is simply the mean and standard deviation across all time.
blue: actual number of connections
green: prediction window
red: actual value exceeded standard deviation limit
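A minimal sketch of this first model, assuming the metric is available as a plain sequence of counts. The function name `global_outliers`, the `stdlimit` of 3, and the sample data are all hypothetical choices for illustration, not values from the talk.

```python
import numpy as np

def global_outliers(values, stdlimit=3.0):
    """Flag points more than stdlimit standard deviations from the global mean."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()
    if std == 0:
        # A perfectly flat series has no outliers under this model
        return np.zeros(values.shape, dtype=bool)
    return np.abs(values - mean) > stdlimit * std

# Hypothetical connection counts with one obvious spike at the end
conns = [10, 12, 11, 9, 10, 11, 10, 12, 9, 11, 10, 60]
flags = global_outliers(conns, stdlimit=3.0)
```

One weakness is already visible here: the spike itself is included in the mean and standard deviation, so a large anomaly widens the very limit it is tested against.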
building models
The second, slightly less naïve model fits a sine curve to the whole series.
blue: actual number of connections
green: prediction window
red: actual value exceeded standard deviation limit
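One way to sketch the sine model is below. It assumes a fixed period of one day, which turns the fit into a linear least-squares problem in sine and cosine terms; the talk's own fit (shown later for the weekly classification) uses scipy's `leastsq` instead, so treat this as a simplified stand-in. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def fit_daily_sine(t, y, period=86400.0):
    """Least-squares fit of y ~ a + b*sin(w*t) + c*cos(w*t) for a fixed period."""
    w = 2 * np.pi / period
    design = np.column_stack([np.ones(len(t)), np.sin(w * t), np.cos(w * t)])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs, design @ coeffs

# Three days of hourly synthetic data with one injected spike
t = np.arange(3 * 24) * 3600.0
y = 50 + 20 * np.sin(2 * np.pi * t / 86400.0)
y[40] += 100

coeffs, pred = fit_daily_sine(t, y)
resid = y - pred
# Flag points whose residual exceeds three residual standard deviations
outliers = np.abs(resid) > 3 * resid.std()
```

Because the curve follows the daily cycle, the residuals are much tighter than deviations from a global mean, so the same threshold catches smaller anomalies.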
classification
Both naïve models left a lot to be desired. Two simple classifications would help us treat different types of time series appropriately:
Does this metric show a weekly pattern (i.e., different behavior on weekends versus weekdays)?
Does this metric show a daily pattern?
classification weekly
Fit a sine curve to the weekday and weekend periods.
Use the ratio of the levels of those fits to determine whether weekdays should be treated separately from weekends.
classification weekly
    import numpy as np
    from scipy.optimize import leastsq

    def fitfunc(p, x):
        # p[0]: mean level, p[1]: relative amplitude, p[2]: phase offset (seconds)
        return p[0] * (1 - p[1] * np.sin(2 * np.pi / (24 * 3600) * (x + p[2])))

    def residuals(p, y, x):
        return y - fitfunc(p, x)

    def fit(tsdf):
        # Average all days together by time of day, then fit one daily cycle
        tsgb = tsdf.groupby(tsdf.timeofday).mean()
        p0 = np.array([tsgb["conns"].mean(), 1.0, 0.0])
        plsq, suc = leastsq(residuals, p0,
                            args=(tsgb["conns"], np.array(tsgb.index)))
        return plsq
classification weekly
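    import pandas as pd

    def weekend_ratio(tsdf):
        # True for Monday through Friday, False for Saturday and Sunday
        tsdf['weekday'] = pd.Series(tsdf.index.weekday < 5, index=tsdf.index)
        # Seconds since midnight
        tsdf['timeofday'] = (tsdf.index.second + tsdf.index.minute * 60
                             + tsdf.index.hour * 3600)

        wdayplsq = fit(tsdf[tsdf.weekday == 1])
        wendplsq = fit(tsdf[tsdf.weekday == 0])

        # Ratio of the fitted weekend level to the fitted weekday level
        return wendplsq[0] / wdayplsq[0]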
Ratio number line, marked at 0, cutoff, 1, and 1 / cutoff, with a region labeled "No weekly variation."
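The decision rule suggested by that number line can be sketched as follows. It assumes that the "no weekly variation" region is the band between cutoff and 1 / cutoff, so that a ratio is treated the same whichever of the two levels is larger. The function name and the default cutoff of 0.7 are hypothetical; the talk does not give a value.

```python
def has_weekly_pattern(ratio, cutoff=0.7):
    """True when the weekend/weekday level ratio falls outside [cutoff, 1/cutoff]."""
    if not 0 < cutoff < 1:
        raise ValueError("cutoff must be between 0 and 1")
    return ratio < cutoff or ratio > 1.0 / cutoff

# Weekends much quieter than weekdays: treat the two separately
quiet_weekends = has_weekly_pattern(0.3)
# Weekends about as busy as weekdays: one model for all days
similar_levels = has_weekly_pattern(0.95)
# Weekends much busier than weekdays also counts as a weekly pattern
busy_weekends = has_weekly_pattern(2.0)
```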
classification weekly
Weekly pattern.
No weekly pattern.
classification daily
Take a Fourier transform of the time series and inspect the bins associated with a frequency of one cycle per day.
Use the ratio of those bins to the first bin (the constant, or DC, component) to classify the time series.
classification daily
Time series on weekdays, showing a strong daily pattern.
Fourier transform with the bins around the day frequency highlighted.
classification daily
Time series on weekends, showing no daily pattern.
Fourier transform with the bins around the day frequency highlighted.
classification daily
Find the bin associated with the frequency of a day using:

    import numpy as np

    def daily_ratio(tsdf):
        nbins = len(tsdf)
        # Sampling interval in seconds, and the width of one FFT bin in Hz
        deltat = (tsdf.index[1] - tsdf.index[0]).seconds
        deltaf = 1.0 / (nbins * deltat)

        # Index of the bin at one cycle per day
        daybin = int((1.0 / (24 * 3600)) / deltaf)

        rfft = np.abs(np.fft.rfft(tsdf["conns"]))
        # Compare the day bin and its two neighbors to the DC component
        daily_ratio = np.sum(rfft[daybin - 1:daybin + 2]) / rfft[0]
        return daily_ratio
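As a worked example of the bin arithmetic, the sketch below applies the same calculation to synthetic data using plain NumPy arrays. The array-based signature, the seven-day series, and the 30-minute sampling interval are assumptions made for the illustration. It also rounds the bin index rather than truncating it, a small deviation from the slide that avoids an off-by-one from floating-point error.

```python
import numpy as np

def daily_ratio(values, deltat):
    """Ratio of spectral magnitude near one cycle per day to the DC bin."""
    nbins = len(values)
    deltaf = 1.0 / (nbins * deltat)               # width of one FFT bin in Hz
    daybin = int(round((1.0 / 86400) / deltaf))   # bin at one cycle per day
    spectrum = np.abs(np.fft.rfft(values))
    return np.sum(spectrum[daybin - 1:daybin + 2]) / spectrum[0]

# Seven days sampled every 30 minutes: 336 samples, so the day bin is bin 7
deltat = 1800
t = np.arange(7 * 48) * deltat
flat = np.full(t.shape, 100.0)
daily = 100.0 * (1 + 0.5 * np.sin(2 * np.pi * t / 86400))

flat_ratio = daily_ratio(flat, deltat)
daily_cycle_ratio = daily_ratio(daily, deltat)
```

The flat series gives a ratio of essentially zero. The series with a 50% daily swing gives a ratio of about 0.25, because a sine of amplitude A contributes A·N/2 to its bin while the DC bin holds N times the mean.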
ewma
Exponentially weighted moving average:
The decay parameter is specified as a span, s, in pandas, related to α by:
α = 2 / (s + 1)
A normal EWMA analysis is done when the metric shows no daily pattern. A stacked EWMA analysis is done when there is a daily pattern.
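To make the span relation concrete, here is a sketch of one common recursive form of the EWMA, s_t = α·x_t + (1 − α)·s_{t−1}, seeded with the first value. This form is an assumption on my part for illustration. pandas' default (`adjust=True`) weights the earliest samples differently, so its first few values will not match this recursion exactly.

```python
def span_to_alpha(span):
    """Convert a pandas-style span s into the smoothing factor alpha = 2 / (s + 1)."""
    return 2.0 / (span + 1.0)

def ewma(values, span):
    """Recursive EWMA: each output blends the new value with the previous average."""
    alpha = span_to_alpha(span)
    smoothed = []
    for value in values:
        if not smoothed:
            # Seed the average with the first observation
            smoothed.append(float(value))
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

# A span of 3 gives alpha = 0.5, so a jump from 10 to 20 moves the average halfway
result = ewma([10, 10, 10, 20], span=3)
```

A larger span means a smaller α, so the average reacts more slowly to new values and remembers more history.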
ewma normal

    def ewma_outlier(tsdf, stdlimit=5, span=15):
        # shift(1) so that each prediction uses only earlier points
        tsdf['conns_binpred'] = pd.ewma(tsdf['conns'], span=span).shift(1)
        tsdf['conns_binstd'] = pd.ewmstd(tsdf['conns'], span=span).shift(1)

        # Distance from the prediction, in units of the predicted spread
        tsdf['conns_stds'] = ((tsdf['conns'] - tsdf['conns_binpred'])
                              / tsdf['conns_binstd'])
        tsdf['conns_outlier'] = (tsdf['conns_stds'].abs() > stdlimit)

        return tsdf
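`pd.ewma` and `pd.ewmstd` were the pandas API at the time of this talk; later pandas releases removed them in favor of the `.ewm()` accessor. The sketch below is my adaptation of the same analysis to that newer API, run on synthetic data. The noise level, spike position, and date range are hypothetical.

```python
import numpy as np
import pandas as pd

def ewma_outlier(tsdf, stdlimit=5, span=15):
    """Flag points that sit far from an EWMA prediction built from earlier points."""
    tsdf = tsdf.copy()
    ewm = tsdf["conns"].ewm(span=span)
    # shift(1) so that each prediction uses only earlier points
    tsdf["conns_binpred"] = ewm.mean().shift(1)
    tsdf["conns_binstd"] = ewm.std().shift(1)
    tsdf["conns_stds"] = ((tsdf["conns"] - tsdf["conns_binpred"])
                          / tsdf["conns_binstd"])
    tsdf["conns_outlier"] = tsdf["conns_stds"].abs() > stdlimit
    return tsdf

# Noisy but flat synthetic series with one injected spike
rng = np.random.default_rng(0)
index = pd.date_range("2014-01-01", periods=200, freq="30min")
conns = 100 + rng.normal(0, 2, size=200)
conns[150] = 200

result = ewma_outlier(pd.DataFrame({"conns": conns}, index=index))
```

The first few rows have little or no history, so their standard deviation estimates are unreliable; in practice it may be worth ignoring flags until roughly one span of data has accumulated.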
ewma normal
blue: actual number of connections
green: prediction window
red: actual value exceeded standard deviation limit
ewma normal
blue: actual response size
green: prediction window
red: actual value exceeded standard deviation limit
ewma stacked
ewma stacked
Shift the EWMA results by a day and overlay them on the original DataFrame.

    def stacked_outlier(tsdf, stdlimit=4, span=10):
        # EWMA and EW standard deviation within each time-of-day bin
        gbdf = tsdf.groupby('timeofday')['conns']
        gbdf = pd.DataFrame({'conns_binpred': gbdf.apply(pd.ewma, span=span),
                             'conns_binstd': gbdf.apply(pd.ewmstd, span=span)})

        # Number of samples in one day
        interval = tsdf.timeofday[1] - tsdf.timeofday[0]
        nshift = int(86400.0 / interval)

        # Move each estimate forward one day so it predicts the next day
        gbdf = gbdf.shift(nshift)
        tsdf = gbdf.combine_first(tsdf)

        tsdf['conns_stds'] = ((tsdf['conns'] - tsdf['conns_binpred'])
                              / tsdf['conns_binstd'])
        tsdf['conns_outlier'] = (tsdf['conns_stds'].abs() > stdlimit)

        return tsdf
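The stacking idea can also be sketched with the newer pandas `.ewm()` accessor. This is my adaptation, not the code from the talk: instead of shifting the whole frame by one day, it shifts by one observation inside each time-of-day group, which amounts to the same thing when the series is regularly sampled with no missing days. The synthetic data and spike position are hypothetical.

```python
import numpy as np
import pandas as pd

def stacked_outlier(tsdf, stdlimit=4, span=10):
    """Compare each point with an EWMA of earlier days at the same time of day."""
    tsdf = tsdf.copy()
    idx = tsdf.index
    # Seconds since midnight, used to stack the days on top of each other
    tsdf["timeofday"] = np.asarray(idx.hour * 3600 + idx.minute * 60 + idx.second)

    grouped = tsdf.groupby("timeofday")["conns"]
    # Within a group, shift(1) carries yesterday's estimate forward to today
    tsdf["conns_binpred"] = grouped.transform(
        lambda s: s.ewm(span=span).mean().shift(1))
    tsdf["conns_binstd"] = grouped.transform(
        lambda s: s.ewm(span=span).std().shift(1))

    tsdf["conns_stds"] = ((tsdf["conns"] - tsdf["conns_binpred"])
                          / tsdf["conns_binstd"])
    tsdf["conns_outlier"] = tsdf["conns_stds"].abs() > stdlimit
    return tsdf

# Fourteen days of hourly data with a strong daily cycle and one spike
rng = np.random.default_rng(1)
index = pd.date_range("2014-01-01", periods=14 * 24, freq="60min")
hours = np.asarray(index.hour)
conns = 100 + 50 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, size=len(index))
spike_at = 12 * 24 + 6
conns[spike_at] += 100

result = stacked_outlier(pd.DataFrame({"conns": conns}, index=index))
```

Because each point is compared only with the same time on earlier days, the daily cycle itself never looks anomalous, while a spike on top of it does.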
ewma stacked
blue: actual number of connections
green: prediction window
red: actual value exceeded standard deviation limit
arima
I am currently investigating using ARIMA (autoregressive integrated moving average) models to make better predictions.
I’m not convinced that this level of detail is necessary for the analysis I’m doing, but I wanted to highlight another cool scientific computing library that’s available.
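arima

    # Import path as of this talk; newer statsmodels releases provide ARIMA
    # under statsmodels.tsa.arima.model instead
    from statsmodels.tsa.arima_model import ARIMA

    def arima_model_forecast(tsdf, p, d, q):
        # Fit on everything but the last point, then forecast one step ahead
        arima_model = ARIMA(tsdf["conns"][:-1], (p, d, q)).fit()
        forecast, stderr, conf_int = arima_model.forecast(1)

        tsdf["conns_binpred"][-1] = forecast[0]
        tsdf["conns_binstd"][-1] = stderr[0]

        return tsdf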
arima
blue: actual number of connections
green: prediction window
red: actual value exceeded standard deviation limit
p = d = q = 1
takeaways
Python provides simple and usable interfaces to most data handling projects.
Combined, these interfaces can create a full data analysis pipeline from collection to analysis.
© 2014 Endgame