Using the SDACK Architecture to Build a Big Data Product
Transcript of Using the SDACK Architecture to Build a Big Data Product
Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver
Outline
• A Threat Analytic Big Data product
• The SDACK Architecture
• Akka Streams and data pipelines
• Cassandra data model for time series data
Who am I
• Apache Bigtop PMC member
• Software Engineer @ Trend Micro
• Develop big data apps & infra
A Threat Analytic Big Data Product
Target Scenario
InfoSec Team
Security information and event management (SIEM)
Problems
• Too many logs to investigate
• Lack of actionable, prioritized recommendations
[Diagram: logs from Splunk, ArcSight, Windows events, Web Proxy, DNS, AD, … feed the Threat Analytic System, which applies Anomaly Detection, Expert Rule Filtering, and Prioritization, then sends Notifications]
Deployment
On-Premises
[Diagram: the Threat Analytic System (Anomaly Detection, Expert Rules, Filtering, Prioritization) deployed on-premises next to the InfoSec Team's SIEM]
On-Premises
[Diagram: on-premises deployment ingesting AD, Windows Event Log, DNS, … and serving detections to the InfoSec Team]
On the cloud?
[Diagram: shipping AD, Windows Event Log, DNS, … logs out of the enterprise to a cloud-hosted system for the InfoSec Team]
There is too much PII in the logs
System Requirements
• Identify anomalous activities based on data
• Look up extra info such as LDAP, Active Directory, hostname, etc.
• Correlate with other data
• EX: a C&C connection followed by an anomalous login
• Support both streaming and batch analytic workloads
• Run streaming detections to identify anomalies in a timely manner
• Run batch detections with arbitrary start and end times
• Scalable
• Stable
PM: How exactly are we going to build this product?
RD: No problem. I have a solution.
SMACK Stack
A toolbox for a wide variety of data processing scenarios
SMACK Stack
• Spark: fast and general-purpose engine for large-scale data processing
• Mesos: cluster resource management system
• Akka: toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications
• Cassandra: distributed, highly available database designed to handle large amounts of data across datacenters
• Kafka: high-throughput, low-latency distributed pub-sub messaging system for real-time data feeds
Source: http://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka
Medium-sized Enterprises
[Diagram: a medium-sized enterprise shipping AD, Windows Event Log, DNS, … logs to the system for its InfoSec Team]
PM: Can't we simplify it?
RD: No problem. I have a solution.
SDACK Stack
SDACK Stack
• Spark: fast and general-purpose engine for large-scale data processing
• Docker: deployment and resource management
• Akka: toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications
• Cassandra: distributed, highly available database designed to handle large amounts of data across datacenters
• Kafka: high-throughput, low-latency distributed pub-sub messaging system for real-time data feeds
Source: http://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka
Docker - How it works
• Three key techniques in Docker:
• Use Linux kernel namespaces to create isolated resources
• Use Linux kernel cgroups to constrain resource usage
• A union file system that supports git-like image creation
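Those cgroup constraints surface directly as `docker run` flags. A minimal sketch (the image name and limit values are illustrative, not from the talk):

```
# --memory caps RAM via the memory cgroup, --cpu-shares weights CPU time;
# the container also gets its own PID/network/mount namespaces automatically.
docker run --rm --memory=2g --cpu-shares=512 cassandra:3
```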
The nice things about it
• Lightweight
• Fast creation
• Efficient to ship
• Easy to deploy
• Portability: runs on *any* Linux environment
Medium-sized Enterprises
[Diagram: the containerized stack deployed for a medium-sized enterprise, ingesting AD, Windows Event Log, DNS, … for the InfoSec Team]
Large Enterprises
[Diagram: the same stack scaled out for a large enterprise, ingesting AD, Windows Event Log, DNS, …]
Fortune 500
[Diagram: the same stack scaled further for a Fortune 500 enterprise, ingesting AD, Windows Event Log, DNS, …]
Threat Analytic System Architecture
[Architecture diagram: log sources flow into actor-based pipelines; detection results are stored in the DB and served through the API Server to the Web Portal]
API Server
Web Portal
Docker Compose
kafka:
  build: .
  ports:
    - "9092:9092"
spark:
  image: spark
  ports:
    - "8080:8080"
……
Akka Streams and data pipelines
[Diagram, built up over four slides: a receiver feeds a buffer; events are transformed and enriched with external-info lookups, then fan out to both streaming and batch processing]
Why not Spark Streaming?
• Each streaming job occupies at least 1 core
• There are so many types of logs
• AD, Windows Event, DNS, Proxy, Web Application, etc.
• Kafka offsets get lost when the code changes
• No back-pressure (back then)
• Back-pressure estimated to land in Spark 1.5
• Integration with Akka-based info lookup services
Akka
• High-performance concurrency framework for Java and Scala
• Actor model for message-driven processing
• Asynchronous by design to achieve high throughput
• Each message is handled in a single-threaded context (no locks or synchronization needed)
• Let-it-crash model for fault tolerance and auto-healing
• Clustering mechanism available
Reference: The Road to Akka Cluster, and Beyond
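As a flavor of the actor model described above, a minimal sketch using the classic Akka Actor API (class and message names are illustrative):

```scala
import akka.actor.{Actor, ActorSystem, Props}

// Messages sent to an actor are processed one at a time inside receive,
// so the actor's state needs no locks or synchronization.
class Parser extends Actor {
  def receive = {
    case line: String => sender() ! line.split(",").toList // reply asynchronously
  }
}

object ActorSketch extends App {
  val system = ActorSystem("sketch")
  val parser = system.actorOf(Props[Parser], "parser")
  parser ! "a,b,c" // fire-and-forget: the caller never blocks
}
```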
Akka Streams
• Akka Streams is a DSL library for streaming computation on Akka
• A materializer transforms each step into actors
• Back-pressure enabled by default
Source → Flow → Sink
Reference: The Reactive Manifesto
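A minimal Source → Flow → Sink sketch (the stage contents are illustrative, and this assumes the 2016-era Akka Streams API):

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}

object StreamSketch extends App {
  implicit val system = ActorSystem("pipeline")
  // The materializer turns the blueprint's stages into actors.
  implicit val mat = ActorMaterializer()

  // Back-pressure is on by default: a slow sink slows the source
  // down instead of overflowing intermediate buffers.
  Source(1 to 100)
    .via(Flow[Int].map(n => s"event-$n"))
    .runWith(Sink.foreach(println))
}
```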
Reactive Kafka
• Akka Streams wrapper for Kafka
• Commits processed offsets back into Kafka
• Provides an at-least-once message delivery guarantee
https://github.com/akka/reactive-kafka
Message Delivery Semantics
• Actor model: at-most-once
• Akka Persistence: at-least-once
• Persists logs to external storage (like a WAL)
• Reactive Kafka: at-least-once + back-pressure
• Writes offsets back into Kafka
• At-least-once + idempotent writes = exactly-once
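The idempotent-writes half of that equation comes almost for free with Cassandra, where an INSERT is an upsert: redelivering the same event rewrites the same row with the same values. A sketch (table and column names are illustrative):

```
-- Replaying this statement after an at-least-once redelivery
-- overwrites the identical row, so no duplicates appear.
INSERT INTO ad_log (shard_id, hour, ts, event_id, message)
VALUES (0, '2016051012', '2016-05-10 12:03:00', 'evt-42', 'logon');
```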
Implementing Data Pipelines
[Diagram: Kafka source (manual commit) → Parsing Phase → Storing Phase, built with Akka Streams and Reactive Kafka]
Scale Up
[Diagram: Kafka source (manual commit) → Parsing Phase → Storing Phase; the Parsing Phase scales up via mapAsync and a router fanning out to multiple parsers]
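The mapAsync step can be sketched like this (the parallelism value and `parse` function are illustrative, not from the talk):

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import akka.stream.scaladsl.Flow

// Hypothetical parser for a raw log line.
def parse(raw: String): Map[String, String] = Map("raw" -> raw)

// mapAsync keeps up to 8 parses in flight concurrently but still
// emits results downstream in order, which keeps the eventual
// offset commits safe with respect to message ordering.
val parsingPhase = Flow[String].mapAsync(parallelism = 8)(raw => Future(parse(raw)))
```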
Scale Out
• Scale out by creating more Docker containers in the same Kafka consumer group
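With Docker Compose that is a one-liner (the service name is illustrative); because all the containers join the same consumer group, Kafka rebalances partitions across them automatically:

```
# Run three pipeline containers; each consumer in the group
# is assigned a disjoint subset of the topic's partitions.
docker-compose scale pipeline=3
```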
Be careful when using Reactive Kafka's manual commit
A bug that periodically reset the committed offset
[Diagram: a single kafka source with manual commit; the offset table reads partition 0 → 50, partition 1 → 90, partition 2 → 70]
A bug that periodically reset the committed offset
[Diagram: a second kafka source with manual commit joins; one offset table reads partition 0 → 300, partition 1 → 250, partition 2 → 70, the other reads partition 2 → 350]
Normal: the red line (lag) keeps declining
Abnormal: the red line (lag) keeps getting reset
PR merged into the 0.8 branch. Apply the patch if your version <= 0.8.7.
https://github.com/akka/reactive-kafka/pull/144/commits
Akka Cluster
• Connects Akka runtimes into a cluster
• No SPOF (single point of failure)
• Gossip protocol
• ClusterSingleton library available
Lookup Service Integration
Akka Streams Akka Streams Akka Streams
ClusterSingleton
Ldap Lookup Hostname Lookup
Akka Cluster
Lookup Service Integration
Akka Streams Akka Streams Akka Streams
ClusterSingleton
Ldap Lookup / Hostname Lookup
Akka Cluster
Service Discovery
Async Lookup Process
Akka Streams
ClusterSingleton
Hostname Lookup
Cache
Synchronous / Asynchronous
Akka Streams
Cache
Akka Streams
Cache
Async Lookup Process
Akka Streams
ClusterSingleton
Hostname Lookup
Cache
Synchronous / Asynchronous
Cassandra Data Model for Time Series Data
Event Time vs. Processing Time
Problem
• We need to do analytics based on event time instead of processing time
• Example: modelling user behaviour
• Logs may arrive late
• network slow, machine busy, service restart, etc.
• Logs may arrive out of order
• multiple sources, different network routing, etc.
Problem
• Spark Streaming can't handle event time
[Diagram: events arriving out of order — ts 12:03, ts 11:50, ts 12:04, ts 12:01, ts 12:02]
Solution
• Store time series data in Cassandra
• Query Cassandra based on event time
• Get the most accurate result at the time the query executes
• Late-arrival tolerance is flexible
• Simple
Cassandra Data Model
• Retrieving an entire row is fast
• Retrieving a range of columns is fast
• Reads continuous blocks on disk (no random seeks)
Data Model for AD log
Use a shardId to distribute the logs across different partitions
Data Model for AD log
The partition key supports frequent queries for a specific hour
Data Model for AD log
The clustering key supports range queries
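Putting the three slides together, the AD-log table could look like this in CQL (all names are illustrative, not the product's actual schema):

```
CREATE TABLE ad_log (
  shard_id int,       -- spreads one hour of logs across several partitions
  hour     text,      -- e.g. '2016051012'; a per-hour query hits one partition per shard
  ts       timestamp, -- clustering key: rows are stored sorted by event time
  event_id text,
  message  text,
  PRIMARY KEY ((shard_id, hour), ts, event_id)
);

-- A range query by event time reads a contiguous block per partition:
SELECT * FROM ad_log
WHERE shard_id = 0 AND hour = '2016051012'
  AND ts >= '2016-05-10 12:00:00' AND ts < '2016-05-10 12:30:00';
```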
Wrap up
Recap: SDACK Stack
• Spark: streaming and batch analytics
• Docker: shipping, deployment, and resource management
• Akka: resilient, scalable, flexible data pipelines
• Cassandra: batch queries based on event time
• Kafka: durable buffer, stream processing
Questions ?
Thank you !