Time Series Analysis for Network Security

Description: How Endgame is using the scientific computing stack in Python to find anomalies in network flow data.

Transcript of Time Series Analysis for Network Security

  • 1 Endgame Proprietary
  • 2 Time Series Analysis for Network Security. Phil Roth, Data Scientist @ Endgame. mrphilroth.com
  • 3 First, an introduction. My history of Python scientific computing, in function calls:
  • 4 os.path.walk: Physics Undergraduate @ PSU, AMANDA Neutrino Telescope
  • 5 pylab.plot: Physics Graduate Student @ UMD, IceCube Neutrino Telescope
  • 6 numpy.fft.fft: Radar Scientist @ User Systems, Inc., Various Radar Simulations
  • 7 pandas.io.parsers.read_csv: Side Projects, Scraping data from the web
  • 8 sklearn.linear_model.LogisticRegression: Side Projects, Machine learning competitions
  • 9 (the rest of this talk): Data Scientist @ Endgame, Time Series Anomaly Detection
  • 10 Problem: highlight when recorded metrics deviate from normal patterns. For example, a high number of connections might be an indication of a brute force attack; a large volume of outgoing data might be an indication of an exfiltration event.
  • 11 Solution: Build a system that can track and store historical records of any metric. Develop an algorithm that will detect irregular behavior with minimal false positives.
  • 12 Gathering Data: kairos, kafka-python, pyspark. Building Models: classification, ewma, arima.
  • 13 (architecture diagram) Data flow: Data Sources feed Kafka Topics (distributed message passing) as a real-time stream and HDFS (large scale distributed data store) as batch historical data; both feed Redis (in-memory key-value data store).
  • 14 kairos: a Python interface to backend storage databases (Redis in my case; others available) tailored for time series storage. Takes care of expiring data and different types of time series (series, histogram, count, gauge, set). Open sourced by Agora Games. https://github.com/agoragames/kairos
  • 15 kairos example code:

        from redis import Redis
        from kairos import Timeseries

        intervals = {"days": {"step": 60, "steps": 2880},
                     "months": {"step": 1800, "steps": 4032}}

        rclient = Redis("localhost", 6379)
        ktseries = Timeseries(rclient, type="histogram", intervals=intervals)
        ktseries.insert(metric_name, metric_value, timestamp)
  • 16 kafka-python: a Python interface to Apache Kafka, where Kafka is publish-subscribe messaging rethought as a distributed commit log. Allows me to subscribe to the events as they come in, in real time. https://github.com/mumrah/kafka-python
  • 17 kafka-python example code:

        from kafka.client import KafkaClient
        from kafka.consumer import SimpleConsumer

        kclient = KafkaClient("localhost:9092")
        kconsumer = SimpleConsumer(kclient, "timevault", "rawmsgs")

        for message in kconsumer:
            insert_to_kairos(message)
  • 18 pyspark: a Python interface to Apache Spark, where Spark is a fast and general engine for large-scale data processing. Allows me to backfill the time series with historical data when I add or modify metrics. http://spark.apache.org/
  • 19 pyspark example code:

        from pyspark import SparkContext, SparkConf

        spark_conf = (SparkConf()
                      .setMaster("local")   # or a cluster master URL
                      .setAppName("timevault-update"))
        sc = SparkContext(conf=spark_conf)

        rdd = (sc.textFile(hdfs_files)
               .map(insert_to_kairos)
               .count())
  • 20 pyspark example code:

        from json import loads
        from functools import partial

        import timevault as tv
        from pyspark import SparkContext, SparkConf

        spark_conf = (SparkConf()
                      .setMaster("local")
                      .setAppName("timevault-update"))
        sc = SparkContext(conf=spark_conf)

        rdd = (sc.textFile(tv.conf.hdfs_files)
               .map(loads)
               .flatMap(tv.flatten_message)
               .flatMap(partial(tv.emit_metrics, metrics=tv.metrics_to_emit))
               .filter(lambda tup: tup[2] < float(tv.conf.limit_time))
               .mapPartitions(partial(tv.insert_to_kairos, conf=tv.conf))
               .count())
  • 21 the end result:

        from pandas import DataFrame, to_datetime

        series = ktseries.series(metric_name, "months", transform=transform)
        ts, fields = zip(*series.items())
        df = DataFrame({"data": fields}, index=to_datetime(ts, unit="s"))
  • 22 building models: the first naive model is simply the mean and standard deviation across all time. (plot) blue: actual number of connections; green: prediction window; red: actual value exceeded standard deviation limit
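
    A minimal sketch of this first model (my own, not from the talk; it assumes a DataFrame tsdf with the "conns" column used in the later slides):

        def mean_std_outlier(tsdf, stdlimit=5):
            # predict every point with the global mean and flag points more
            # than stdlimit standard deviations away from it
            mean = tsdf["conns"].mean()
            std = tsdf["conns"].std()
            tsdf["conns_outlier"] = (tsdf["conns"] - mean).abs() > stdlimit * std
            return tsdf
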
  • 23 building models: the second, slightly less naive model fits a sine curve to the whole series. (plot) blue: actual number of connections; green: prediction window; red: actual value exceeded standard deviation limit
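
    A sketch of this second model (again my own; it reuses the fitfunc and residuals helpers defined on slide 26 below and assumes epoch-second timestamps):

        import numpy as np
        from scipy.optimize import leastsq

        def sine_outlier(tsdf, stdlimit=5):
            # fit the 24-hour sine curve to the whole series, then flag
            # large deviations from the fitted curve
            x = np.array(tsdf.index.astype(np.int64)) // 10**9   # epoch seconds
            y = np.array(tsdf["conns"], dtype=float)
            p0 = np.array([y.mean(), 1.0, 0.0])
            plsq, _ = leastsq(residuals, p0, args=(y, x))
            resid = y - fitfunc(plsq, x)
            tsdf["conns_outlier"] = np.abs(resid) > stdlimit * resid.std()
            return tsdf
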
  • 24 classification: both naive models left a lot to be desired. Two simple classifications would help us treat different types of time series appropriately: Does this metric show a weekly pattern (i.e. different behavior on weekends versus weekdays)? Does this metric show a daily pattern?
  • 25 classification (weekly): fit a sine curve to the weekday and weekend periods separately, and use the ratio of the levels of those fits to determine whether weekdays should be divided from weekends.
  • 26 classification (weekly) example code:

        import numpy as np
        from scipy.optimize import leastsq

        def fitfunc(p, x):
            return (p[0] *
                    (1 - p[1] * np.sin(2 * np.pi / (24 * 3600) * (x + p[2]))))

        def residuals(p, y, x):
            return y - fitfunc(p, x)

        def fit(tsdf):
            tsgb = tsdf.groupby(tsdf.timeofday).mean()
            p0 = np.array([tsgb["conns"].mean(), 1.0, 0.0])
            plsq, suc = leastsq(residuals, p0,
                                args=(tsgb["conns"], np.array(tsgb.index)))
            return plsq
  • 27 classification (weekly) example code:

        import pandas as pd

        def weekend_ratio(tsdf):
            tsdf["weekday"] = pd.Series(tsdf.index.weekday < 5, index=tsdf.index)
            tsdf["timeofday"] = (tsdf.index.second + tsdf.index.minute * 60 +
                                 tsdf.index.hour * 3600)
            wdayplsq = fit(tsdf[tsdf.weekday == 1])
            wendplsq = fit(tsdf[tsdf.weekday == 0])
            return wendplsq[0] / wdayplsq[0]

    (number line: ratios between cutoff and 1 / cutoff, i.e. near 1, mean no weekly variation)
  • 28 classification (weekly): (plots) one series with a weekly pattern, one with no weekly pattern.
  • 29 classification (daily): take a Fourier transform of the time series and inspect the bins associated with the frequency of a day. Use the ratio of those bins to the first bin (the constant, or DC, component) to classify the time series.
  • 30 classification (daily): (plots) time series on weekdays showing a strong daily pattern; Fourier transform with the bins around the day frequency highlighted.
  • 31 classification (daily): (plots) time series on weekends showing no daily pattern; Fourier transform with the bins around the day frequency highlighted.
  • 32 classification (daily) example code. Find the bin associated with the frequency of a day using deltaf = 1 / (nbins * deltat):

        import numpy as np

        def daily_ratio(tsdf):
            nbins = len(tsdf)
            deltat = (tsdf.index[1] - tsdf.index[0]).seconds  # sample spacing
            deltaf = 1.0 / (nbins * deltat)                   # frequency resolution
            daybin = int((1.0 / (24 * 3600)) / deltaf)        # bin of the 1/day frequency
            rfft = np.abs(np.fft.rfft(tsdf["conns"]))
            return np.sum(rfft[daybin - 1:daybin + 2]) / rfft[0]
  • 33 ewma: exponentially weighted moving average. The decay parameter is specified as a span s in pandas, related to the decay factor α by α = 2 / (s + 1). A normal EWMA analysis is done when the metric shows no daily pattern. A stacked EWMA analysis is done when there is a daily pattern.
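
    A quick illustration of the span/α relation (my own check, not from the talk; adjust=False selects the plain recursive form of the pandas 0.x pd.ewma used on the next slide):

        import numpy as np
        import pandas as pd

        span = 15
        alpha = 2.0 / (span + 1)

        x = pd.Series(np.random.randn(100))

        # recursive EWMA: y[t] = alpha * x[t] + (1 - alpha) * y[t - 1]
        y = x.copy()
        for t in range(1, len(x)):
            y[t] = alpha * x[t] + (1 - alpha) * y[t - 1]

        assert np.allclose(pd.ewma(x, span=span, adjust=False), y)
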
  • 34 ewma (normal) example code:

        import pandas as pd

        def ewma_outlier(tsdf, stdlimit=5, span=15):
            tsdf["conns_binpred"] = pd.ewma(tsdf["conns"], span=span).shift(1)
            tsdf["conns_binstd"] = pd.ewmstd(tsdf["conns"], span=span).shift(1)
            tsdf["conns_stds"] = ((tsdf["conns"] - tsdf["conns_binpred"]) /
                                  tsdf["conns_binstd"])
            tsdf["conns_outlier"] = (tsdf["conns_stds"].abs() > stdlimit)
            return tsdf
  • 35 ewma (normal): (plot) blue: actual number of connections; green: prediction window; red: actual value exceeded standard deviation limit
  • 36 ewma (normal): (plot) blue: actual response size; green: prediction window; red: actual value exceeded standard deviation limit
  • 37 ewma (stacked)
  • 38 ewma (stacked)
  • 39 ewma (stacked)
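
    The stacked EWMA code itself isn't in the transcript. A minimal sketch under my own assumptions (same column names as ewma_outlier above; "stacking" taken to mean each time-of-day bin gets its own EWMA across successive days):

        import pandas as pd

        def stacked_ewma_outlier(tsdf, stdlimit=5, span=15):
            # group by time of day so each point is compared against the same
            # bin on previous days, then run the normal EWMA per group
            tsdf["timeofday"] = (tsdf.index.second + tsdf.index.minute * 60 +
                                 tsdf.index.hour * 3600)
            grouped = tsdf.groupby("timeofday")["conns"]
            tsdf["conns_binpred"] = grouped.transform(
                lambda s: pd.ewma(s, span=span).shift(1))
            tsdf["conns_binstd"] = grouped.transform(
                lambda s: pd.ewmstd(s, span=span).shift(1))
            tsdf["conns_stds"] = ((tsdf["conns"] - tsdf["conns_binpred"]) /
                                  tsdf["conns_binstd"])
            tsdf["conns_outlier"] = tsdf["conns_stds"].abs() > stdlimit
            return tsdf
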