Cloud Security Monitoring and Spark Analytics
Boston Spark Meetup
Threat Stack
Andre Mesarovic
10 December 2015
Threat Stack - Who We Are
• Leadership team with deep security, SaaS, and big data experience
• Launched on stage at 2014 AWS re:Invent
• Founded by principal engineers from Mandiant in 2012
• Based in Boston's Innovation District
• 27 employees and hiring
• On track for 100+ customers and 10,000 monitored servers by year-end 2015
• Funded by Accomplice (Atlas) and .406 Ventures
Threat Stack - Use Cases
• Insider Threat Detection
• External Threat Detection
• Data Loss Detection
• Regulatory Compliance Support - HIPAA, PCI
Threat Stack - Key Workload Questions
• What processes are running on all my servers?
• Did a process suddenly start making outbound connections?
• Who is logged into my servers and what are they running?
• Has anyone logged in from non-standard locations?
• Are any critical system and data files being changed?
• What happened on a transient server 7 weeks ago?
• Who is changing our cloud infrastructure?
Threat Stack - Features
• Deep OS Auditing
• Behavior-based Intrusion Detection
• DVR Capabilities
• Customizable Alerts
• File Integrity Monitoring
• DevOps Enabled Deployment
Threat Stack - Tech Stack
• RabbitMQ
• Nginx
• Cassandra
• Elasticsearch
• MongoDB
• Redis - ElastiCache
• Postgres - RDS
• Languages: Node.js, C, Scala, and a bit of Lua
• Chef
• Librato, Grafana, Sensu, Sentry, PagerDuty
• Slack
Spark Cluster
• Spark 1.4.1
• Spark standalone cluster manager - no Mesos or YARN
• One long-running Spark job - up for over 2 months
• Separate driver node
  – Since the driver has a different workload, it can be scaled independently of the workers
• We like our cluster to be a homogeneous set of worker nodes
  – One executor per worker
• Monitored by Grafana
• Custom Codahale metrics consumed by Grafana
  – Only implemented for the driver - for workers it's a TODO
Spark Cluster Hardware
Threat Stack Overall Architecture
Spark Analytics Architecture
Spark Web UI - Master
Spark Web UI - Jobs
Event Pipeline Statistics
Mean event size is 700 bytes.

              Second    10-Min Interval   Day       Month
Mean events   75 K      45 M              6.48 B    194 B
Spike events  125 K     75 M              10.8 B    324 B
Mean bytes    52.5 MB   31.5 GB           4.5 TB    136 TB
Spike bytes   87.5 MB   52.5 GB           7.6 TB    227 TB
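As a sanity check (not from the talk), the larger figures in the table follow directly from the per-second rates; note that 75 K events/sec over a 10-minute interval works out to 45 M events, and the monthly figures assume a 30-day month:

```javascript
// Derive the pipeline-statistics figures from the per-second rates.
const MEAN_EVENTS_PER_SEC = 75000;
const MEAN_EVENT_BYTES = 700;

const perInterval = (ratePerSec, seconds) => ratePerSec * seconds;

const eventsPerTenMin = perInterval(MEAN_EVENTS_PER_SEC, 600);   // 45,000,000
const eventsPerDay = perInterval(MEAN_EVENTS_PER_SEC, 86400);    // 6.48 B
const bytesPerSec = MEAN_EVENTS_PER_SEC * MEAN_EVENT_BYTES;      // 52.5 MB/sec
```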
Problem that Spark Analytics Addresses
• Overview
  – Spark replaced home-grown rollups and Elasticsearch facets
  – The original solutions did not scale well
• Home-grown rollups of streaming data
  – Used eep.js - a subset of CEP that adds aggregate functions and windowed stream operations to Node.js
  – Postgres stored procedures to upsert rolled-up values
  – Problem: way too many Postgres transactions
• Elasticsearch facets
  – Great for the initial moderate volume
  – Running into scaling issues as we grow
Why not Spark Streaming?
• We first tried to use Spark Streaming
• Ran OK in the dev environment but failed in the prod environment (20x the volume)
• Too many endurance and scaling problems
• Ran out of file descriptors on workers very quickly
  – Sure, we could write a cron job, but do we want to?
  – Zillions of 24-byte files that were never cleaned up
• Too many out-of-memory errors on workers
  – Intermittent and random OOMs
  – Workers crashed in 3 days due to a tiny memory leak
• No robust RabbitMQ receiver - everyone is focused on Kafka
• Love the idea, but it just wasn't ready for prime time
Current Spark Solution
• Decouple event consumption and Spark processing
• Two processes: Event Writer and Spark Analytics
• Event Writer consumes events from the RabbitMQ firehose
  – Writes batches to the scratch store every 10-minute interval
• Spark job wakes up every 10 minutes to roll up events by different criteria into Postgres
  – For example, at 10:20 the Spark job processes the data from 10:10 to 10:20
• Spark then deletes the interval data from 10:10 to 10:20
• Spark uptime: 64 days, since Oct. 7, 2015
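The interval bookkeeping behind the 10:10-to-10:20 example can be sketched in a few lines (function names are ours, not Threat Stack's):

```javascript
// Given a wake-up time, compute the previous full 10-minute interval
// that the Spark job should roll up and then delete.
const INTERVAL_MS = 10 * 60 * 1000;

// Floor a millisecond timestamp to its 10-minute boundary.
function floorToInterval(ts) {
  return ts - (ts % INTERVAL_MS);
}

// At wake-up time `now`, process [start, end): the previous full interval.
function intervalToProcess(now) {
  const end = floorToInterval(now);
  return { start: end - INTERVAL_MS, end };
}
```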
Basic Workflow
• Event Writer consumes RMQ messages and writes them to S3
• RMQ messages are in MessagePack format
• A message is one doc per org/agent/type specified in the header, plus an array of events
• Event Writer flattens this into a batch of events
• Output is a gzip JSON sequence file - one JSON object per line
• Event Writer writes fixed-size output batches of events to S3
• Current memory buffer for the batch is 100 MB
• This compresses down to 3.5 MB - 28x compression
Advantages of Current Solution
• Separation of concerns - each process is focused on doing one thing well
• Event Writer is concerned with non-trivial RMQ flow control
• Spark simply reads event sequences from scratch storage
• Thus Spark has more resources to compute rollups
• Each app can scale independently
• Spark Streaming was trying to do too much - both handle RMQ ingestion and analytics processing
• The current solution is more robust
Capacity and Scaling
• Good news - Spark scales linearly for us
• We ran tests with different numbers of workers and the results were linear
• Elasticity: we can independently scale the Event Writers and the Spark cluster
• With Spark Streaming we could not dynamically add more RMQ receivers without restarting the app
Event Writer Stats
• One Event Writer per RabbitMQ exchange
• We have 3 RMQ exchanges
• 10-minute interval for buffering events
• 100 MB in-memory event buffer compresses down to 3.5 MB
• Compression factor of 28x
• 600 S3 objects per interval (compressed)
• 2.1 GB per interval (uncompressed would be 58.8 GB)
• Need 2 intervals present - current and previous - 4.1 GB (118 GB uncompressed)
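A minimal model of the fixed-size buffering described above (class and method names are ours): events accumulate in memory, and a batch is emitted once the byte budget - 100 MB in production - is reached.

```javascript
// Accumulate events until the byte budget is hit, then emit the batch.
class BatchBuffer {
  constructor(maxBytes) {
    this.maxBytes = maxBytes;
    this.events = [];
    this.size = 0;
  }

  // Returns the completed batch when this event fills the buffer, else null.
  add(event) {
    this.events.push(event);
    this.size += Buffer.byteLength(event, 'utf8');
    if (this.size >= this.maxBytes) {
      const batch = this.events;
      this.events = [];
      this.size = 0;
      return batch;
    }
    return null;
  }
}
```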
Event Types
• audit - accept, bind, connect, exit, etc.
• login - login, logout
• host
• file
• network
Event Example
{
  "organization_id" : "3d0c49e818bac99c72b7088665342daf30a3bcd7",
  "agent_id" : "835af48534bfd4bc60f8c5882dd565c5a84e4b94",
  "arguments" : "/usr/sbin/sshd -D -R",
  "_id" : "835af48534bfd4bc60f8c5882dd565c5a84e4b94",
  "_type" : "audit",
  "_insert_time" : 1429902593,
  "args" : [ "/usr/sbin/sshd", "-D", "-R" ],
  "user" : "root",
  "group" : "root",
  "path" : [ "/usr/sbin/sshd", null ],
  "exe" : "/usr/sbin/sshd",
  "timestamp" : 1429902590000,
  "type" : "start",
  "syscall" : "execve",
  "command" : "sshd",
  "uid" : 0,
  "euid" : 0,
  "gid" : 0,
  "egid" : 0,
  "exit" : 0,
  "session" : 4294967295,
  "pid" : 7829,
  "ppid" : 873,
  "success" : true,
  "parent_process" : {
    "pid" : 873,
    "exe" : "/usr/sbin/sshd",
    "command" : "sshd",
    "args" : [ "/usr/sbin/sshd", "-D" ],
    "loginuid" : 4294967295,
    "timestamp" : 1427337850230,
    "uid" : 0,
    "gid" : 0,
    "ppid" : 1
  }
}
Spark Event Count Rollups
• Total counts - org and agent
• User counts - org, agent, user, and exe
• IP counts - use a Maxmind geo DB file on each worker
  – IP source counts - org, exe, ip, country, city, lat, lon
  – IP destination counts - same fields
• Host counts - org, comment
• Port source counts - org, exe, and port
• Port destination counts
• CloudTrail events of four kinds
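Each rollup above is a count keyed by a tuple of event fields; a minimal model of that (the real job runs on Spark, where this shape maps onto a key-by plus reduce):

```javascript
// Count events grouped by an arbitrary key function.
function countBy(events, keyFn) {
  const counts = new Map();
  for (const e of events) {
    const key = keyFn(e);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Total counts by org and agent (field names follow the event example).
const totalCounts = events =>
  countBy(events, e => `${e.organization_id}/${e.agent_id}`);

// User counts by org, agent, user, and exe.
const userCounts = events =>
  countBy(events, e => `${e.organization_id}/${e.agent_id}/${e.user}/${e.exe}`);
```

In Spark the same idea is a map to key/value pairs followed by a `reduceByKey` sum, with the results upserted into the Postgres rollup tables.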
Sample Rollups Table

     insert_time     |     event_time      |          org_id          |         agent_id         | count
---------------------+---------------------+--------------------------+--------------------------+--------
 2015-11-08 15:41:18 | 2015-11-08 15:00:00 | 5522d0276c15919d69000x01 | 563bd15419d2f85c2c9085c1 | 216652
 2015-11-08 20:01:24 | 2015-11-08 19:00:00 | 5522d0276c15919d69000x01 | 563bd15419d2f85c2c9085c1 | 207962
 2015-11-08 15:31:17 | 2015-11-08 15:00:00 | 5522d0276c15919d69000x01 | 5665c53b04d674f048e0892e | 160354
 2015-11-08 15:01:34 | 2015-11-08 14:00:00 | 5522d0276c15919d69000y01 | 563bd15419d2f85c2c9085c1 | 160098
 2015-11-07 21:51:31 | 2015-11-07 20:00:00 | 5522d0276c15919d69000x01 | 5665c53b04d674f048e0892e | 149813
 2015-11-08 03:08:53 | 2015-11-08 00:00:00 | 533af57f41e9885820006771 | 5632c6431612b6096d195d02 | 144999
 2015-11-08 03:08:53 | 2015-11-08 01:00:00 | 533af57f41e988582000a7b1 | 55fc8beb7f8ce68d5052b6c9 | 143072
 2015-11-08 03:08:53 | 2015-11-08 01:00:00 | 533af57f41e9885820006771 | 55f989dacc155d6d5e2627cf | 141468
 2015-11-08 03:08:53 | 2015-11-08 01:00:00 | 533af57f41e9885820006771 | 55f98b41cc155d6d5e262811 | 137778
 2015-11-17 15:21:11 | 2015-11-17 15:00:00 | 5522d0276c15919d69000x01 | 566f217100229a8b2bdce000 | 128375
Scratch Event Data
• S3
  – Easy to get started - Spark has built-in S3 and gzip support
  – Mean write time is 350 ms - the 99.9th percentile is 23.9 sec!
  – This clogs up our processing pipeline
  – S3 is "eventually consistent" - there are no SLAs guaranteeing when a written object is available
• Alternatives
  – A NoSQL store such as Redis - under active exploration now
  – AWS Elastic File System - when will it arrive? (announced in an April blog post)
  – HDFS
S3 Write Percentiles

Percentile   Millis
50.00        349
90.00        560
99.00        1,413
99.50        2,081
99.90        23,898
99.99        50,281
max          139,596
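For reference, percentiles like these can be computed from raw latency samples with the nearest-rank method (a sketch; not necessarily how these numbers were produced):

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// the samples are less than or equal to it.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}
```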
S3 vs Redis Write Latencies
All write latencies are in milliseconds. The "10-min intervals" column is the sample size.

                 Mean   Max       10-min intervals
S3               349    139,596   15,172
Redis            43     168       7,313
Speedup factor   8x     831x
Data Expiration
• The problem of big data is how to delete data efficiently
• Every byte costs - AWS is not cheap
• Big data at scale costs big bucks
• In the real world, companies have to deal with data retention
• Deleting objects
  – Spark: after processing S3 objects, Spark deletes them
    • Backed up with AWS life-cycle expiration (1 day)
  – Redis: use Redis TTLs
RabbitMQ Flow Control - Message Ack-ing
Flow control is fun!
• Fast publisher, slow consumer
Message Ack-ing
• MultipleRmqAckManager - acknowledges all messages up to and including the supplied delivery tag
• SingleRmqAckManager - acknowledges just the supplied delivery tag
• When we have written an S3 object, we ack all the RMQ messages in that batch
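The two strategies differ only in which delivery tags a single ack clears; a model of the bookkeeping (the real managers wrap the channel's ack call, whose "multiple" flag selects between these behaviors):

```javascript
// Each function takes the channel's unacked delivery tags and returns
// what remains unacked after the acknowledgement.

// Acknowledge all messages up to and including the supplied delivery tag.
const multipleAck = (unacked, deliveryTag) =>
  unacked.filter(tag => tag > deliveryTag);

// Acknowledge just the supplied delivery tag.
const singleAck = (unacked, deliveryTag) =>
  unacked.filter(tag => tag !== deliveryTag);
```

After an S3 object is written, one multiple-ack with the batch's highest delivery tag clears the whole batch in a single call.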
RabbitMQ Prefetch Count
• Limits the number of unacknowledged messages on a channel
• Important for the Event Writer to handle so as not to OOM during traffic surges
• Sadly, RMQ doesn't implement the AMQP byte-size prefetch
• It only supports a prefetch count on the number of messages
• This works if the messages are of roughly the same size
• Fortunately this is the case for us
Fault Tolerance
• Created a generic fault-tolerance manager
• Used for retrying RabbitMQ consumes and S3 writes
• Pluggable retry algorithm - linear backoff, exponential backoff, whatever you wish
• Looked at third-party packages (e.g., Spring Retry) but they didn't quite fit our particular needs
• RMQ reads rarely fail
• We do see the occasional S3 write failure
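A retry helper with a pluggable backoff strategy, in the spirit described above, can be quite small (names and signature are ours, not the actual manager's):

```javascript
// Backoff strategies: map an attempt number to a wait in milliseconds.
const linearBackoff = baseMs => attempt => baseMs * attempt;
const exponentialBackoff = baseMs => attempt => baseMs * 2 ** (attempt - 1);

// Retry `op` up to `maxAttempts` times, waiting per `backoff` between tries.
async function retry(op, maxAttempts, backoff) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      await new Promise(resolve => setTimeout(resolve, backoff(attempt)));
    }
  }
}
```

Passing the strategy as a function is what makes the algorithm pluggable: an S3 write might use exponential backoff while an RMQ consume uses linear.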
Spark and Metrics
• Metrics and monitoring are vital to Threat Stack
• Any production app must have a way to expose app-specific metrics
• Spark's custom metrics support is very rudimentary
• Custom metrics capabilities - driver and/or worker?
• Spark Codahale custom metrics - we apparently have to extend a Spark private class!
• You need to extend org.apache.spark.metrics.source.Source and include it in your jar