Achieving Real-time Ingestion and Analysis of Security Events through Kafka and Metron
Kevin Mao
Senior Data Engineer, Capital One
[email protected]
@KevinJokaiMao
About Me
- B.S., Computer Science, University of Maryland, Baltimore County
- M.S., Computer Science, George Mason University
- Enterprise Data Services, Data Intelligence
- Purple Rain Project
- Huge Zelda fan!
Agenda
Part 1: Motivation and Background
Part 2: Approach and Architecture
Part 3: Challenges
Part 4: Future Work
Part 5: Wrapping Up
Part 1: Motivation and Background
Capital One
- 45,000 Employees
- 45 Million Customers
- 26,000 EC2 Instances
- Credit Cards, Traditional Banking, Home/Auto Loans, Brokerage Services
The Problem
The ways in which adversaries can attack your system are increasing:
- DNC hacks involved convincing spear-phishing emails posing as Google password resets
- Hollywood Presbyterian Medical Center paid $17,000 in Bitcoin to unlock a medical records system held hostage by ransomware
Organizations have to keep up by employing a larger and more diverse set of tools.
Finding a way to use those tools effectively is difficult.
The Data
- HTTP Proxy logs
- Email Metadata
- VPN logs
- Firewall events
- DNS
- Syslogs (*nix, Windows)
- Security Endpoints
- Threat Intelligence
- IDS Events
- Wireless Access Points
- Mobile Device Management
- And more...
~40 distinct data feeds
~5 billion events per day
~75,000 peak events per second
~5 TB per day
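A quick back-of-envelope check ties these figures together (the derived average rate and event size below are our own arithmetic, not from the deck):

```python
# Sanity-check the ingest figures quoted above.
events_per_day = 5_000_000_000          # ~5 billion events/day
seconds_per_day = 86_400

avg_eps = events_per_day / seconds_per_day
print(f"average events/sec: {avg_eps:,.0f}")          # ~57,870

peak_eps = 75_000
print(f"peak-to-average ratio: {peak_eps / avg_eps:.2f}x")  # ~1.30x

bytes_per_day = 5 * 1024**4             # ~5 TB/day
print(f"average event size: {bytes_per_day / events_per_day:,.0f} bytes")
```

So the quoted 75,000 events/second peak is only about 1.3x the sustained average, i.e. the pipeline runs hot around the clock rather than in bursts.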
What We Started Out With
Enterprise SIEM (Security Information and Event Management) platform
- Primary management tool for many years
- Encountered stability issues while scaling out to 13 months of data retention
Splunk
- Great UI experience
- Scaling out to 13 months becomes prohibitively expensive
Where Does That Leave Us?
We need a solution for security event and telemetry data that is diverse, voluminous, and fast-moving.
- Horizontally and linearly scalable
- Platform and interface built for:
  - SOC Analysts to quickly respond to incidents
  - Forensic Investigators to analyze historical data and compile reports
  - Threat Hunters to efficiently find vulnerabilities and malicious behavior
- Affordable!
Purple Rain
PART 2: Approach and Architecture
NiFi
Data routing, transformation, and distribution platform
- Easy-to-use Web UI
On-Prem Cluster – Collects data from all local devices
- Flows into AWS Cluster
- 3 Nodes, 20 CPU cores, 375GB Memory, 6 x 2TB Disk
AWS Cluster – Collects, preprocesses, and tags incoming data
- 6 Nodes, m4.4xlarge, 3 x 1TB EBS Volume (gp2)
Individual data flows defined for each feed
Kafka
Distributed messaging platform
- Publish-subscribe model
- Producer/Consumer implementations across many languages
- Support for stream processing and ingestion via Kafka Streams/Connect
Serves as the communication backbone for the infrastructure
- 20 brokers – m4.xlarge, 6 x 250GB EBS volumes (gp2)
- Replication factor of 2
- Partition count set to a multiple of the aggregate disk count
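The partition-sizing rule above can be sketched as follows, using the broker and disk counts from this slide (the helper names are ours, not Kafka's):

```python
# Sketch of the sizing rule: choose a partition count that is a multiple
# of the cluster's aggregate disk count, so partitions spread evenly
# across every physical volume.
def aggregate_disks(brokers: int, disks_per_broker: int) -> int:
    return brokers * disks_per_broker

def partition_count(brokers: int, disks_per_broker: int, multiple: int = 1) -> int:
    return aggregate_disks(brokers, disks_per_broker) * multiple

disks = aggregate_disks(brokers=20, disks_per_broker=6)
print(disks)                                  # 120 disks in the cluster
print(partition_count(20, 6, multiple=1))     # 120 partitions for a topic
```

With 20 brokers x 6 volumes, any topic sized at 120 (or 240, 360, ...) partitions keeps per-disk load uniform.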
Storm
Distributed realtime stream computation system
- Scales up by adding more worker nodes
- Fault tolerant – when a node dies, jobs that were on that node are moved to another
- Support for topology isolation, microbatching, and custom routing
Storm Nimbus/UI – m4.2xlarge
45 Storm Worker Nodes – m4.2xlarge
- 4 worker slots per node – 2 vCPU, 8GB Mem
Metron
Security analytics framework built on top of Storm
Consists of two sets of Storm topologies:
- Parser topologies – parse raw data into human-readable JSON format
- Enrichment topologies – enrich parsed data with contextual information, then send it to the storage tier
Enrichment of incoming data streams with additional information:
- Domain Generation Algorithm (DGA) scoring via machine learning model
- Active Directory user lookup
- Geolocation/ASN data for external IP addresses
- WHOIS lookup for unknown domain names
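The DGA scoring above is done with a trained machine-learning model; as a rough illustration of the idea, here is a toy character-entropy heuristic (the threshold and scoring logic are invented for this sketch and are not Metron's model):

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character. Algorithmically generated domain
    labels tend to score higher than human-chosen ones."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def dga_score(domain: str, threshold: float = 3.5) -> dict:
    label = domain.split(".")[0]  # score only the leftmost label
    h = shannon_entropy(label)
    return {"domain": domain, "entropy": round(h, 2), "suspected_dga": h > threshold}

print(dga_score("google.com"))                   # low entropy, not flagged
print(dga_score("xj4kq9z2vbl7w1m8prt3.com"))     # high entropy, flagged
```

A real model would combine many such features (n-gram frequencies, length, TLD reputation) rather than a single threshold.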
ElasticSearch
Distributed, RESTful search and analytics engine
- Each data feed comprises its own set of daily indices
- Each index is further subdivided into shards
Linearly scalable, low-latency full-text search
3 Master Nodes – m4.2xlarge
100 Data Nodes – m4.4xlarge, 3 x 1TB EBS volumes (gp2)
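The per-feed daily-index layout above can be sketched like this (the exact naming convention is our assumption; the deck only states one index per feed per day):

```python
from datetime import date

def daily_index(feed: str, day: date) -> str:
    # One index per feed per day, e.g. "httpproxy-2017.06.14".
    # Old days can then be dropped or archived as whole indices.
    return f"{feed}-{day:%Y.%m.%d}"

def wildcard_pattern(feed: str) -> str:
    # Query every retained day of a feed at once.
    return f"{feed}-*"

print(daily_index("httpproxy", date(2017, 6, 14)))  # httpproxy-2017.06.14
print(wildcard_pattern("dns"))                      # dns-*
```

Daily indices make 13-month retention manageable: expiry is an index delete, not a document-by-document purge.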
Kibana
Data visualization frontend for ElasticSearch
- Alert management system
- Cyber Threat Intelligence (CTI) repository for storing, tagging, and searching artifacts
Multiple open source and custom plugins:
- Timelion
- fermiumlabs/mathlion
- prelert/kibana-swimlane-vis
- sirensolutions/kibi
- sirensolutions/sentinl
- snuids/heatmap
- chenryn/kbn_sankey_vis
- And more...
S3
Simple Storage Service – object storage service in the cloud
- Compatible with processing engines like Spark and EMR
Data stored in two formats:
- Raw data – used for replaying data through the pipeline and meeting our obligations as a system of record for some feeds
- Parsed data – stored in columnar format (ORC) for batch processing
Everything in S3 is encrypted
Monitoring
- Zabbix agent to collect system-level telemetry (CPU, Mem, IOPS, Disk %, etc.)
- Ingestion rate and message volume metrics collected from NiFi, Kafka, Storm, and ElasticSearch
- Most data stored in a separate ElasticSearch cluster
- Grafana for visualization
- ElastAlert for platform alerting
PART 3: Challenges
Format Wars
Ingested raw data comes in a variety of formats:
- CSV, JSON, XML, CEF
Sometimes the formats are poorly defined:
- Windows Syslogs pretty-indented using tabs, but with no delimiters
- Various subtypes come in different formats
Upstream changes to the raw data format often propagate through our entire pipeline, eventually making the data in ElasticSearch unusable.
Takeaway: Format and serialize data as far upstream as possible.
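One way to apply that takeaway is to parse each feed into a common JSON envelope at the ingest edge, so every downstream stage sees one serialization format. A minimal sketch (the envelope shape and the per-feed CSV header are hypothetical):

```python
import csv
import io
import json

def normalize(raw: str, fmt: str, feed: str) -> str:
    """Parse a raw record into a common JSON envelope as far upstream
    as possible; downstream stages never see the source format."""
    if fmt == "json":
        fields = json.loads(raw)
    elif fmt == "csv":
        header = ["timestamp", "src_ip", "dst_ip", "action"]  # per-feed schema
        fields = dict(zip(header, next(csv.reader(io.StringIO(raw)))))
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return json.dumps({"feed": feed, "fields": fields}, sort_keys=True)

print(normalize("1497433200,10.0.0.5,8.8.8.8,ALLOW", "csv", "firewall"))
```

When an upstream feed changes shape, only its `normalize` branch breaks, instead of every consumer down to ElasticSearch.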
Monitoring and Alerting
Platform-level telemetry should be stored with all the other data
- Instead of in a separate Zabbix subsystem
Collect more granular application-level data
- Most components expose metrics via JMX
- Necessary to effectively troubleshoot performance bottlenecks
- Useful for capacity planning
Logging data collection is a common problem among many teams at Capital One
Takeaway: Reduce duplication of work by offering common monitoring infrastructure, or even Monitoring-as-a-Service.
Rehydration
EC2 instances with AMIs older than 60 days must be terminated
- Internal Capital One policy
Spent a lot of time developing automation and orchestration to spin up a full cluster from scratch
How do you rehydrate a newly provisioned platform with data? How do you avoid service interruption to the user?
- Blue/Green cluster deployment
- Rolling rehydration every 30 days
Auditing
Internal Audit
- 2 internal audits of NPI/PCI handling and storage processes
OCC (Office of the Comptroller of the Currency)
- Audit of data sources, networking, and archival of data
FRB (Federal Reserve Board)
- IT Risk Management – alerts considered an authoritative source as part of the first line of defense
- Resiliency – provide evidence of ability to fail over within an acceptable window of time
Handling Sensitive Data
- Social Security numbers
- Credit card info
- Home/auto loans
- Checking/savings account data
- Trading data
Automated process to scan for PII/PCI data and scrub it from the raw data stream
- Secure raw data topics via encryption and access control
- Streaming job to scrub raw feeds and produce into separate 'clean' topics
- Backwards remediation process for data stored in HDFS/S3
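The scrubbing job described above might look like this in miniature. The patterns below are illustrative only; a production scanner would use stronger detection (e.g. Luhn checks for card numbers) than bare regexes:

```python
import re

# Hypothetical patterns for two PII/PCI types the slide mentions.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def scrub(record: str) -> str:
    """Replace sensitive substrings before the record reaches the
    'clean' topic; the raw topic stays locked down separately."""
    record = SSN.sub("[REDACTED-SSN]", record)
    record = CARD.sub("[REDACTED-PAN]", record)
    return record

print(scrub("user 123-45-6789 paid with 4111 1111 1111 1111"))
```

The same function can be re-run over archived HDFS/S3 data for the backwards remediation pass.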
PART 4: Future Work
Schema Management
Authoritative service for clients to retrieve the schemas applied to datasets
Implementation is protocol-dependent:
- Avro – Confluent Schema Registry
- Protobuf – central GitHub repository
Streaming job to parse and schema-fy raw data prior to processing it
- Raw data that fails to fit the schema is diverted to an alternate Kafka topic
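The schema gate with a dead-letter path can be sketched like this (the schema shape and helper names are ours; in production the two sinks would be Kafka topics, not Python lists):

```python
# Hypothetical per-feed schema: required field names and their types.
SCHEMA = {"timestamp": int, "src_ip": str, "action": str}

def validate(record: dict) -> bool:
    """True when every required field is present with the right type."""
    return (set(record) >= set(SCHEMA)
            and all(isinstance(record[k], t) for k, t in SCHEMA.items()))

def route(records, good, dead_letter):
    """Conforming records move on; everything else is diverted aside."""
    for r in records:
        (good if validate(r) else dead_letter).append(r)

good, dlq = [], []
route([{"timestamp": 1497433200, "src_ip": "10.0.0.5", "action": "ALLOW"},
       {"timestamp": "not-a-number", "src_ip": "10.0.0.5"}], good, dlq)
print(len(good), len(dlq))  # 1 1
```

Diverting rather than dropping means malformed records can be inspected and replayed after the schema or parser is fixed.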
Monitoring
Consolidate the monitoring stack.
- Fully unified Elastic stack: *Beats, Logstash, ElasticSearch, Kibana, and friends
- Separate stacks for time-series numeric data and logging:
  - TICK (Telegraf, InfluxDB, Chronograf, Kapacitor) stack
  - ELK stack
- Both approaches have tradeoffs
Generalized Data Processing
Metron is really good for working in the infosec space, but does not generalize well.
Exploring options for building a data platform that addresses multiple use cases:
- Credit transactions
- Credit fraud
- Anti-Money Laundering
- Legal
Focus on supporting machine learning.
PART 5: Wrapping Up
Retrospective
Users (SOC analysts, threat hunters, etc.) are generally happy with the platform.
- Low query latency
- Working to address concerns around data integrity (duplicates, loss, malformed records)
They want more data!
- Bro
- Silvertail
- Phantom
Q&A
@KevinJokaiMao
linkedin.com/in/kevinjmao
We’re hiring in SF, Chicago, and DC!
- Machine Learning Engineers
- Software Engineers
- Data Engineers
- Data Scientists