Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Metron Kevin Mao Senior Data Engineer, Capital One [email protected] @KevinJokaiMao

Transcript of Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Page 1: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Metron

Kevin Mao

Senior Data Engineer, Capital One

[email protected]

@KevinJokaiMao

Page 2: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

About Me

B.S., Computer Science, University of Maryland, Baltimore County
M.S., Computer Science, George Mason University
Enterprise Data Services, Data Intelligence
Purple Rain Project
Huge Zelda fan!

Page 3: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Agenda

Part 1: Motivation and Background
Part 2: Approach and Architecture
Part 3: Challenges
Part 4: Future Work
Part 5: Wrapping Up

Page 4: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Part 1: Motivation and Background

Page 5: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Capital One

45,000 Employees
45 Million Customers
26,000 EC2 Instances

Credit Cards
Traditional Banking
Home/Auto Loans
Brokerage Services

Page 6: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

The Problem

The ways in which adversaries can attack your system are increasing

- DNC hacks involved convincing spear phishing emails posing as Google Password Reset

- Hollywood Presbyterian Medical Center pays $17,000 in Bitcoins to unlock medical records system held hostage by ransomware

Organizations have to keep up by employing a more numerous and more diverse set of tools

Finding a way to effectively use those tools is difficult

Page 7: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

The Data

HTTP Proxy logs

Email Metadata

VPN logs

Firewall events

DNS

Syslogs (*nix, Windows)

Security Endpoints

Threat Intelligence

IDS Events

Wireless Access Points

Mobile Device Management

And more...

~ 40 distinct data feeds

~ 5 Billion events per day

~ 75,000 Peak events per second

~ 5 TB per day
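
As a rough sanity check on these figures, the average rate and average event size can be worked out directly. This is only a back-of-the-envelope sketch: it assumes decimal terabytes and treats the slide's approximate numbers as exact, neither of which the slides state.

```python
# Back-of-the-envelope figures derived from the slide's approximate numbers.
EVENTS_PER_DAY = 5_000_000_000   # ~5 billion events per day (from the slide)
BYTES_PER_DAY = 5 * 10**12       # ~5 TB per day, ASSUMING decimal terabytes
PEAK_EVENTS_PER_SEC = 75_000     # ~75,000 peak events per second (from the slide)
SECONDS_PER_DAY = 24 * 60 * 60

# Average ingest rate if the daily volume were spread evenly over the day.
avg_events_per_sec = EVENTS_PER_DAY / SECONDS_PER_DAY

# Average size of a single event.
avg_event_bytes = BYTES_PER_DAY / EVENTS_PER_DAY

# How far the stated peak sits above that average.
peak_to_avg = PEAK_EVENTS_PER_SEC / avg_events_per_sec

print(f"average rate: {avg_events_per_sec:,.0f} events/s")
print(f"average event size: {avg_event_bytes:,.0f} bytes")
print(f"peak-to-average ratio: {peak_to_avg:.2f}")
```

Under those assumptions the average event is about 1 KB and the stated peak is roughly 1.3 times the daily average rate.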

Page 8: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

What We Started Out With

Enterprise SIEM (Security Information and Event Management) platform
- Primary management tool for many years
- Encountered stability issues while scaling out to 13 months of data retention

Splunk
- Great UI experience
- Scaling out to 13 months becomes prohibitively expensive

Page 9: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Where Does That Leave Us

We need a solution for security event and telemetry data that is diverse, voluminous, and fast-moving.
Horizontally and linearly scalable
Platform and interface built for:
- SOC Analysts to quickly respond to incidents
- Forensic Investigators to analyze historical data and compile reports
- Threat Hunters to efficiently find vulnerabilities and malicious behavior

Affordable!

Page 10: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Purple Rain

Page 11: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

PART 2: Approach and Architecture

Page 12: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm
Page 13: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

NiFi

Data routing, transformation, and distribution platform
Easy to use Web UI
On-Prem Cluster – Collects data from all local devices
- Flows into AWS Cluster
- 3 Nodes, 20 CPU cores, 375GB Memory, 6 x 2TB Disk
AWS Cluster – Collects, preprocesses, and tags incoming data
- 6 Nodes, m4.4xlarge, 3 x 1TB EBS Volume (gp2)

Individual data flows defined for each feed

Page 14: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm
Page 15: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Kafka

Distributed messaging platform
- Publish-Subscribe model
- Producer/Consumer implementations across many languages
- Support for stream processing and ingestion via Kafka Streams/Connect
Serves as communication backbone for infrastructure

20 brokers – m4.xlarge, 6 x 250GB EBS volumes (gp2)
Replication factor of 2
Set partition count to multiple of aggregate disk count
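
The partition-count rule can be sketched as simple arithmetic. With 20 brokers and 6 volumes each, the aggregate disk count is 120. The helper name `partition_count` is hypothetical, and the presumed rationale (letting partitions spread evenly across every volume) is an assumption, since the slide does not explain it.

```python
import math

BROKERS = 20           # from the slide
DISKS_PER_BROKER = 6   # 6 x 250GB EBS volumes per broker (from the slide)


def partition_count(min_partitions: int,
                    brokers: int = BROKERS,
                    disks_per_broker: int = DISKS_PER_BROKER) -> int:
    """Round min_partitions up to the nearest multiple of the aggregate disk count.

    Hypothetical helper: the slide only gives the rule of thumb, not a formula.
    """
    aggregate_disks = brokers * disks_per_broker
    # Always use at least one full multiple, even for very small topics.
    multiples = max(1, math.ceil(min_partitions / aggregate_disks))
    return multiples * aggregate_disks


print(partition_count(100))  # 120
print(partition_count(200))  # 240
```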

Page 16: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Storm

Distributed realtime stream computation system
Scales up by adding more worker nodes
Fault tolerant – When a node dies, jobs that were on that node are moved to another
Support for topology isolation, microbatching, and custom routing

Storm Nimbus/UI – m4.2xlarge
45 Storm Worker Nodes – m4.2xlarge
4 worker slots per node – 2 vCPU 8GB Mem

Page 17: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Metron

Security analytics framework built on top of Storm
Consists of two sets of Storm topologies:
- Parser topologies – Parse raw data into human readable JSON format
- Enrichment topologies – Enrich parsed data with contextual information, then send to storage tier.
Enrichment of incoming data streams with additional information
- Domain Generation Algorithm (DGA) scoring via machine learning model
- Active Directory user lookup
- Geolocation/ASN data for external IP addresses
- WHOIS lookup for unknown domain names
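
The parse-then-enrich flow can be illustrated with a minimal sketch. This is not Metron's API (Metron runs these stages as Storm topologies); the record layout, the `GEO_TABLE` lookup, the `INTERNAL_NETWORKS` list, and the function names are all hypothetical stand-ins for illustration.

```python
import ipaddress
import json

# Hypothetical stand-in for a geolocation/ASN database.
GEO_TABLE = {
    "203.0.113.0/24": {"country": "ZZ", "asn": 64500},
}

# Hypothetical list of networks treated as internal (no geo enrichment).
INTERNAL_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def parse(raw_line: str) -> dict:
    """Parser stage: turn a raw comma-delimited line into a JSON-ready dict."""
    # Split on the first three commas only, so commas inside the URL survive.
    timestamp, src_ip, dst_ip, url = raw_line.strip().split(",", 3)
    return {"timestamp": int(timestamp), "src_ip": src_ip,
            "dst_ip": dst_ip, "url": url}


def enrich(event: dict) -> dict:
    """Enrichment stage: attach geo/ASN context for external destinations."""
    addr = ipaddress.ip_address(event["dst_ip"])
    enriched = dict(event)  # leave the parsed event untouched
    if not any(addr in net for net in INTERNAL_NETWORKS):
        for cidr, info in GEO_TABLE.items():
            if addr in ipaddress.ip_network(cidr):
                enriched["dst_geo"] = info
                break
    return enriched


raw = "1500000000,10.0.0.5,203.0.113.7,http://example.com/a,b"
print(json.dumps(enrich(parse(raw)), sort_keys=True))
```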

Page 18: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

ElasticSearch

Distributed, RESTful search and analytics engine
- Each data feed has its own set of daily indices
- Each index is further subdivided into shards
Linearly scalable
Low latency full-text search

3 Master Nodes – m4.2xlarge
100 Data Nodes – m4.4xlarge, 3 x 1TB EBS volumes (gp2)
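
One way to picture per-feed daily indices is the sketch below, which works out the indices a query over a date range would need to touch. The `<feed>-YYYY.MM.DD` naming pattern and the function names are assumptions; the slides do not give the actual index naming scheme.

```python
from datetime import date, timedelta


def daily_index(feed: str, day: date) -> str:
    """Name of the index holding one feed's events for one day (assumed pattern)."""
    return f"{feed}-{day:%Y.%m.%d}"


def indices_for_range(feed: str, start: date, end: date) -> list:
    """All daily indices a query spanning start..end (inclusive) must search."""
    days = (end - start).days
    return [daily_index(feed, start + timedelta(days=i)) for i in range(days + 1)]


print(indices_for_range("dns", date(2017, 2, 27), date(2017, 3, 1)))
```

Keeping indices daily and per feed means a narrow time-bounded query only touches a few indices, and old days can be dropped as whole indices.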

Page 19: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Kibana

Data visualization frontend for ElasticSearch
Alert management system
Cyber Threat Intelligence (CTI) repository for storing, tagging, searching artifacts
Multiple open source and custom plugins
• Timelion
• fermiumlabs/mathlion
• prelert/kibana-swimlane-vis
• sirensolutions/kibi
• sirensolutions/sentinl
• snuids/heatmap
• chenryn/kbn_sankey_vis
• And more...

Page 20: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm
Page 21: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm
Page 22: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm
Page 23: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm
Page 24: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

S3

Simple Storage Service – Object storage service in the cloud
Compatible with processing engines like Spark, EMR
Data stored in two formats:
- Raw data – Used for replaying data through the pipeline and meeting our obligations as a system of record for some feeds
- Parsed data – Stored in columnar format (ORC) for batch processing
Everything in S3 is encrypted

Page 25: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Monitoring

Zabbix agent to collect system-level telemetry (CPU, Mem, IOPS, Disk %, etc.)
Ingestion rate and message volume metrics collected from NiFi, Kafka, Storm, ElasticSearch
Most data stored in a separate ElasticSearch cluster
Grafana for visualization
ElastAlert for platform alerting

Page 26: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm
Page 27: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

PART 3: Challenges

Page 28: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Format Wars

Ingested raw data comes in a variety of formats
- CSV, JSON, XML, CEF
Sometimes the formats are poorly defined
- Windows Syslogs pretty indented using tabs, but no delimiters
- Various subtypes come in different formats
Upstream changes to raw data format often propagate through our entire pipeline, eventually making the data in ElasticSearch unusable

Takeaway: Format and serialize data as far upstream as possible.

Page 29: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Monitoring and Alerting

Platform-level telemetry should be stored with all the other data
- Instead of a separate Zabbix subsystem
Collect more granular application-level data
- Most components expose metrics via JMX
- Necessary to effectively troubleshoot performance bottlenecks
- Useful for capacity planning
Logging data collection
Common problem among many teams at Capital One
Takeaway: Reduce duplication of work by offering common monitoring infrastructure, or even Monitoring-as-a-Service

Page 30: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Rehydration

EC2 Instances with AMIs older than 60 days must be terminated
- Internal Capital One policy
Spent a lot of time developing automation and orchestration to spin up a full cluster from scratch
How do you rehydrate a newly provisioned platform with data?
How do you avoid service interruption to the user?

Blue/Green cluster deployment
Rolling rehydration every 30 days
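
The two time windows on this slide can be sketched as date arithmetic. The function names are hypothetical, and reading the 30-day cadence as a safety margin under the 60-day limit is an inference rather than something the slides state.

```python
from datetime import date, timedelta

MAX_AMI_AGE = timedelta(days=60)            # policy limit from the slide
REHYDRATION_INTERVAL = timedelta(days=30)   # rolling cadence from the slide


def must_terminate(ami_created: date, today: date) -> bool:
    """True once the AMI is strictly older than the 60-day policy limit."""
    return today - ami_created > MAX_AMI_AGE


def next_rehydration(last_rehydrated: date) -> date:
    """Date on which the next rolling rehydration is due."""
    return last_rehydrated + REHYDRATION_INTERVAL


built = date(2017, 1, 1)
print(must_terminate(built, date(2017, 3, 2)))  # exactly 60 days old
print(must_terminate(built, date(2017, 3, 3)))  # 61 days old
print(next_rehydration(built))
```

Rehydrating every 30 days would leave roughly 30 days of slack before any instance reaches the limit, which gives room for a failed or delayed rollout.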

Page 31: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Auditing

Internal Audit
- 2 Internal Audits of NPI/PCI handling and storage processes
OCC (Office of the Comptroller of the Currency)
- Audit of data sources, networking, and archival of data.
FRB (Federal Reserve Board)
- IT Risk Management – Alerts considered as an authoritative source as part of first line of defense
- Resiliency – Provide evidence of ability to failover within an acceptable window of time

Page 32: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Handling Sensitive Data

Social Security Numbers
Credit card info
Home/Auto Loans
Checking/Savings Account Data
Trading data

Automated process to scan for PII/PCI data and scrub it from the raw data stream
- Secure raw data topics via encryption and access control
- Streaming job to scrub raw feeds and produce into separate ‘clean’ topics

Backwards remediation process for data stored in HDFS/S3
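
A scrubbing step of this kind might look like the sketch below. It is illustrative only, not the job described in the talk: the regular expressions, the redaction tokens, and the function names are assumptions, and only US-style SSNs and Luhn-valid card numbers are handled.

```python
import re

# Assumed patterns: dashed US SSNs, and 13-19 digit runs with optional separators.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scrub(message: str) -> str:
    """Replace SSNs and Luhn-valid card numbers with redaction tokens."""
    message = SSN_RE.sub("[REDACTED-SSN]", message)

    def mask_card(match):
        digits = re.sub(r"[ -]", "", match.group())
        # Only redact candidates that pass Luhn; other digit runs are left alone.
        return "[REDACTED-PAN]" if luhn_valid(digits) else match.group()

    return CARD_RE.sub(mask_card, message)


print(scrub("user=jdoe ssn=123-45-6789 card=4111 1111 1111 1111 ok"))
```

The Luhn check cuts down on false positives from long numeric IDs, though a digit run that happens to pass the checksum would still be redacted.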

Page 33: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

PART 4: Future Work

Page 34: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Schema Management

Authoritative service for clients to retrieve schemas applied to datasets.
Implementation is protocol dependent.
- Avro – Confluent Schema Registry
- Protobuf – Central GH Repository
Streaming job to parse and schema-fy raw data prior to processing it.
- Raw data that fails to fit schema diverted to alternate Kafka topic.
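
The validate-and-divert idea can be sketched without any Kafka client. In this minimal illustration two Python lists stand in for the clean and alternate topics, and `PROXY_SCHEMA` is a hypothetical field-to-type mapping; a real job would presumably validate against Avro or Protobuf schemas as listed above.

```python
import json

# Hypothetical schema for one feed: field name -> required Python type.
PROXY_SCHEMA = {"timestamp": int, "src_ip": str, "url": str}


def conforms(record, schema) -> bool:
    """True if record is a dict with every schema field present and correctly typed."""
    return isinstance(record, dict) and all(
        isinstance(record.get(field), expected)
        for field, expected in schema.items()
    )


def route(raw_messages, schema):
    """Split raw messages into (valid records, rejected raw messages)."""
    valid, rejected = [], []
    for raw in raw_messages:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            rejected.append(raw)      # unparseable: divert the original bytes
            continue
        if conforms(record, schema):
            valid.append(record)      # would go on to normal processing
        else:
            rejected.append(raw)      # fails schema: divert to alternate topic
    return valid, rejected


msgs = [
    '{"timestamp": 1, "src_ip": "10.0.0.1", "url": "http://a"}',
    '{"timestamp": "bad"}',
    "not json",
]
valid, rejected = route(msgs, PROXY_SCHEMA)
print(len(valid), len(rejected))
```

Diverting the untouched raw message, rather than a partially parsed one, keeps the rejected stream replayable once the schema or parser is fixed.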

Page 35: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Monitoring

Consolidate monitoring stack.
- Fully unified Elastic stack: *Beats, Logstash, ElasticSearch, Kibana and friends
- Separate stacks for Time-series numeric and logging:
  - TICK (Telegraf, InfluxDB, Chronograf, Kapacitor) stack
  - ELK stack
- Both have tradeoffs

Page 36: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Generalized Data Processing

Metron is really good for working in the infosec space, but does not generalize well.
Exploring options for building a data platform to address multiple use cases.
- Credit transactions
- Credit fraud
- Anti-Money Laundering
- Legal

Focus on supporting machine learning.

Page 37: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

PART 5: Wrapping Up

Page 38: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Retrospective

Users (SOC analysts, threat hunters, etc.) are generally happy with the platform.
Low query latency
Working to address concerns around data integrity (duplicates, loss, malformed)
They want more data!
- Bro
- Silvertail
- Phantom

Page 39: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

Q&A

Page 40: Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Storm

[email protected]

@KevinJokaiMao

linkedin.com/in/kevinjmao

We’re hiring in SF, Chicago, and DC!

Machine Learning Engineers

Software Engineers

Data Engineers

Data Scientists