Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Metron
Kevin Mao
Senior Data Engineer, Capital One
[email protected]
@KevinJokaiMao
About Me
- B.S., Computer Science, University of Maryland, Baltimore County
- M.S., Computer Science, George Mason University
- Enterprise Data Services, Data Intelligence
- Purple Rain Project
- Huge Zelda fan!
Agenda
Part 1: Motivation and Background
Part 2: Approach and Architecture
Part 3: Challenges
Part 4: Future Work
Part 5: Wrapping Up
Part 1: Motivation and Background
Capital One
- 45,000 Employees
- 45 Million Customers
- 26,000 EC2 Instances
- Credit Cards, Traditional Banking, Home/Auto Loans, Brokerage Services
The Problem
The ways in which adversaries can attack your system are increasing:
- DNC hacks involved convincing spear-phishing emails posing as Google password resets
- Hollywood Presbyterian Medical Center paid $17,000 in Bitcoin to unlock a medical records system held hostage by ransomware
Organizations have to keep up by employing a larger and more diverse set of tools.
Finding a way to use those tools effectively is difficult.
The Data
- HTTP Proxy logs
- Email Metadata
- VPN logs
- Firewall events
- DNS
- Syslogs (*nix, Windows)
- Security Endpoints
- Threat Intelligence
- IDS Events
- Wireless Access Points
- Mobile Device Management
- And more...
~40 distinct data feeds
~5 billion events per day
~75,000 peak events per second
~5 TB per day
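A quick back-of-envelope check ties these figures together (the derived average rate and event size below are our own arithmetic, not from the deck):

```python
# Sanity-check the ingest figures quoted above.
events_per_day = 5_000_000_000          # ~5 billion events/day
seconds_per_day = 86_400

avg_eps = events_per_day / seconds_per_day
print(f"average events/sec: {avg_eps:,.0f}")          # ~57,870

peak_eps = 75_000
print(f"peak-to-average ratio: {peak_eps / avg_eps:.2f}x")  # ~1.30x

bytes_per_day = 5 * 1024**4             # ~5 TB/day
print(f"average event size: {bytes_per_day / events_per_day:,.0f} bytes")
```

So the quoted 75,000 events/second peak is only about 1.3x the sustained average, i.e. the pipeline runs hot around the clock rather than in bursts.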
What We Started Out With
Enterprise SIEM (Security Information and Event Management) platform
- Primary management tool for many years
- Encountered stability issues while scaling out to 13 months of data retention
Splunk
- Great UI experience
- Scaling out to 13 months becomes prohibitively expensive
Where Does That Leave Us?
We need a solution for security event and telemetry data that is diverse, voluminous, and fast-moving.
- Horizontally and linearly scalable
- Platform and interface built for:
  - SOC Analysts to quickly respond to incidents
  - Forensic Investigators to analyze historical data and compile reports
  - Threat Hunters to efficiently find vulnerabilities and malicious behavior
- Affordable!
Purple Rain
PART 2: Approach and Architecture
NiFi
Data routing, transformation, and distribution platform
- Easy-to-use Web UI
On-Prem Cluster – Collects data from all local devices
- Flows into AWS Cluster
- 3 Nodes, 20 CPU cores, 375GB Memory, 6 x 2TB Disk
AWS Cluster – Collects, preprocesses, and tags incoming data
- 6 Nodes, m4.4xlarge, 3 x 1TB EBS Volume (gp2)
Individual data flows defined for each feed
Kafka
Distributed messaging platform
- Publish-subscribe model
- Producer/Consumer implementations across many languages
- Support for stream processing and ingestion via Kafka Streams/Connect
Serves as the communication backbone for the infrastructure
- 20 brokers – m4.xlarge, 6 x 250GB EBS volumes (gp2)
- Replication factor of 2
- Partition count set to a multiple of the aggregate disk count
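The partition-sizing rule above can be sketched as follows, using the broker and disk counts from this slide (the helper names are ours, not Kafka's):

```python
# Sketch of the sizing rule: choose a partition count that is a multiple
# of the cluster's aggregate disk count, so partitions spread evenly
# across every physical volume.
def aggregate_disks(brokers: int, disks_per_broker: int) -> int:
    return brokers * disks_per_broker

def partition_count(brokers: int, disks_per_broker: int, multiple: int = 1) -> int:
    return aggregate_disks(brokers, disks_per_broker) * multiple

disks = aggregate_disks(brokers=20, disks_per_broker=6)
print(disks)                                  # 120 disks in the cluster
print(partition_count(20, 6, multiple=1))     # 120 partitions for a topic
```

With 20 brokers x 6 volumes, any topic sized at 120 (or 240, 360, ...) partitions keeps per-disk load uniform.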
Storm
Distributed realtime stream computation system
- Scales up by adding more worker nodes
- Fault tolerant – when a node dies, jobs that were on that node are moved to another
- Support for topology isolation, microbatching, and custom routing
Storm Nimbus/UI – m4.2xlarge
45 Storm Worker Nodes – m4.2xlarge
- 4 worker slots per node – 2 vCPU, 8GB Mem
Metron
Security analytics framework built on top of Storm
Consists of two sets of Storm topologies:
- Parser topologies – parse raw data into human-readable JSON format
- Enrichment topologies – enrich parsed data with contextual information, then send it to the storage tier
Enrichment of incoming data streams with additional information:
- Domain Generation Algorithm (DGA) scoring via machine learning model
- Active Directory user lookup
- Geolocation/ASN data for external IP addresses
- WHOIS lookup for unknown domain names
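The DGA scoring above is done with a trained machine-learning model; as a rough illustration of the idea, here is a toy character-entropy heuristic (the threshold and scoring logic are invented for this sketch and are not Metron's model):

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character. Algorithmically generated domain
    labels tend to score higher than human-chosen ones."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def dga_score(domain: str, threshold: float = 3.5) -> dict:
    label = domain.split(".")[0]  # score only the leftmost label
    h = shannon_entropy(label)
    return {"domain": domain, "entropy": round(h, 2), "suspected_dga": h > threshold}

print(dga_score("google.com"))                   # low entropy, not flagged
print(dga_score("xj4kq9z2vbl7w1m8prt3.com"))     # high entropy, flagged
```

A real model would combine many such features (n-gram frequencies, length, TLD reputation) rather than a single threshold.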
ElasticSearch
Distributed, RESTful search and analytics engine
- Each data feed comprises its own set of daily indices
- Each index is further subdivided into shards
Linearly scalable, low-latency full-text search
3 Master Nodes – m4.2xlarge
100 Data Nodes – m4.4xlarge, 3 x 1TB EBS volumes (gp2)
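The per-feed daily-index layout above can be sketched like this (the exact naming convention is our assumption; the deck only states one index per feed per day):

```python
from datetime import date

def daily_index(feed: str, day: date) -> str:
    # One index per feed per day, e.g. "httpproxy-2017.06.14".
    # Old days can then be dropped or archived as whole indices.
    return f"{feed}-{day:%Y.%m.%d}"

def wildcard_pattern(feed: str) -> str:
    # Query every retained day of a feed at once.
    return f"{feed}-*"

print(daily_index("httpproxy", date(2017, 6, 14)))  # httpproxy-2017.06.14
print(wildcard_pattern("dns"))                      # dns-*
```

Daily indices make 13-month retention manageable: expiry is an index delete, not a document-by-document purge.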
Kibana
Data visualization frontend for ElasticSearch
- Alert management system
- Cyber Threat Intelligence (CTI) repository for storing, tagging, and searching artifacts
Multiple open source and custom plugins:
- Timelion
- fermiumlabs/mathlion
- prelert/kibana-swimlane-vis
- sirensolutions/kibi
- sirensolutions/sentinl
- snuids/heatmap
- chenryn/kbn_sankey_vis
- And more...
S3
Simple Storage Service – object storage service in the cloud
- Compatible with processing engines like Spark and EMR
Data stored in two formats:
- Raw data – used for replaying data through the pipeline and meeting our obligations as a system of record for some feeds
- Parsed data – stored in columnar format (ORC) for batch processing
Everything in S3 is encrypted
Monitoring
- Zabbix agent to collect system-level telemetry (CPU, Mem, IOPS, Disk %, etc.)
- Ingestion rate and message volume metrics collected from NiFi, Kafka, Storm, and ElasticSearch
- Most data stored in a separate ElasticSearch cluster
- Grafana for visualization
- ElastAlert for platform alerting
PART 3: Challenges
Format Wars
Ingested raw data comes in a variety of formats:
- CSV, JSON, XML, CEF
Sometimes the formats are poorly defined:
- Windows Syslogs pretty-indented using tabs, but with no delimiters
- Various subtypes come in different formats
Upstream changes to the raw data format often propagate through our entire pipeline, eventually making the data in ElasticSearch unusable.
Takeaway: Format and serialize data as far upstream as possible.
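One way to apply that takeaway is to parse each feed into a common JSON envelope at the ingest edge, so every downstream stage sees one serialization format. A minimal sketch (the envelope shape and the per-feed CSV header are hypothetical):

```python
import csv
import io
import json

def normalize(raw: str, fmt: str, feed: str) -> str:
    """Parse a raw record into a common JSON envelope as far upstream
    as possible; downstream stages never see the source format."""
    if fmt == "json":
        fields = json.loads(raw)
    elif fmt == "csv":
        header = ["timestamp", "src_ip", "dst_ip", "action"]  # per-feed schema
        fields = dict(zip(header, next(csv.reader(io.StringIO(raw)))))
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return json.dumps({"feed": feed, "fields": fields}, sort_keys=True)

print(normalize("1497433200,10.0.0.5,8.8.8.8,ALLOW", "csv", "firewall"))
```

When an upstream feed changes shape, only its `normalize` branch breaks, instead of every consumer down to ElasticSearch.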
Monitoring and Alerting
Platform-level telemetry should be stored with all the other data
- Instead of in a separate Zabbix subsystem
Collect more granular application-level data
- Most components expose metrics via JMX
- Necessary to effectively troubleshoot performance bottlenecks
- Useful for capacity planning
Logging data collection is a common problem among many teams at Capital One
Takeaway: Reduce duplication of work by offering common monitoring infrastructure, or even Monitoring-as-a-Service.
Rehydration
EC2 instances with AMIs older than 60 days must be terminated
- Internal Capital One policy
Spent a lot of time developing automation and orchestration to spin up a full cluster from scratch
How do you rehydrate a newly provisioned platform with data? How do you avoid service interruption to the user?
- Blue/Green cluster deployment
- Rolling rehydration every 30 days
Auditing
Internal Audit
- 2 internal audits of NPI/PCI handling and storage processes
OCC (Office of the Comptroller of the Currency)
- Audit of data sources, networking, and archival of data
FRB (Federal Reserve Board)
- IT Risk Management – alerts considered an authoritative source as part of the first line of defense
- Resiliency – provide evidence of ability to fail over within an acceptable window of time
Handling Sensitive Data
- Social Security numbers
- Credit card info
- Home/auto loans
- Checking/savings account data
- Trading data
Automated process to scan for PII/PCI data and scrub it from the raw data stream
- Secure raw data topics via encryption and access control
- Streaming job to scrub raw feeds and produce into separate 'clean' topics
- Backwards remediation process for data stored in HDFS/S3
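The scrubbing job described above might look like this in miniature. The patterns below are illustrative only; a production scanner would use stronger detection (e.g. Luhn checks for card numbers) than bare regexes:

```python
import re

# Hypothetical patterns for two PII/PCI types the slide mentions.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def scrub(record: str) -> str:
    """Replace sensitive substrings before the record reaches the
    'clean' topic; the raw topic stays locked down separately."""
    record = SSN.sub("[REDACTED-SSN]", record)
    record = CARD.sub("[REDACTED-PAN]", record)
    return record

print(scrub("user 123-45-6789 paid with 4111 1111 1111 1111"))
```

The same function can be re-run over archived HDFS/S3 data for the backwards remediation pass.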
PART 4: Future Work
Schema Management
Authoritative service for clients to retrieve the schemas applied to datasets
Implementation is protocol-dependent:
- Avro – Confluent Schema Registry
- Protobuf – central GitHub repository
Streaming job to parse and schema-fy raw data prior to processing it
- Raw data that fails to fit the schema is diverted to an alternate Kafka topic
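The schema gate with a dead-letter path can be sketched like this (the schema shape and helper names are ours; in production the two sinks would be Kafka topics, not Python lists):

```python
# Hypothetical per-feed schema: required field names and their types.
SCHEMA = {"timestamp": int, "src_ip": str, "action": str}

def validate(record: dict) -> bool:
    """True when every required field is present with the right type."""
    return (set(record) >= set(SCHEMA)
            and all(isinstance(record[k], t) for k, t in SCHEMA.items()))

def route(records, good, dead_letter):
    """Conforming records move on; everything else is diverted aside."""
    for r in records:
        (good if validate(r) else dead_letter).append(r)

good, dlq = [], []
route([{"timestamp": 1497433200, "src_ip": "10.0.0.5", "action": "ALLOW"},
       {"timestamp": "not-a-number", "src_ip": "10.0.0.5"}], good, dlq)
print(len(good), len(dlq))  # 1 1
```

Diverting rather than dropping means malformed records can be inspected and replayed after the schema or parser is fixed.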
Monitoring
Consolidate the monitoring stack.
- Fully unified Elastic stack: *Beats, Logstash, ElasticSearch, Kibana, and friends
- Separate stacks for time-series numeric data and logging:
  - TICK (Telegraf, InfluxDB, Chronograf, Kapacitor) stack
  - ELK stack
- Both approaches have tradeoffs
Generalized Data Processing
Metron is really good for working in the infosec space, but does not generalize well.
Exploring options for building a data platform that addresses multiple use cases:
- Credit transactions
- Credit fraud
- Anti-Money Laundering
- Legal
Focus on supporting machine learning.
PART 5: Wrapping Up
Retrospective
Users (SOC analysts, threat hunters, etc.) are generally happy with the platform.
- Low query latency
- Working to address concerns around data integrity (duplicates, loss, malformed records)
They want more data!
- Bro
- Silvertail
- Phantom
Q&A
@KevinJokaiMao
linkedin.com/in/kevinjmao
We’re hiring in SF, Chicago, and DC!
- Machine Learning Engineers
- Software Engineers
- Data Engineers
- Data Scientists