Using the SDACK Architecture to Build a Big Data Product


Transcript of Using the SDACK Architecture to Build a Big Data Product

Page 1: Using the SDACK Architecture to Build a Big Data Product

Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver

Using the SDACK Architecture to Build a Big Data Product

Page 2: Using the SDACK Architecture to Build a Big Data Product

Outline

• A Threat Analytic Big Data product

• The SDACK Architecture

• Akka Streams and data pipelines

• Cassandra data model for time series data

Page 3: Using the SDACK Architecture to Build a Big Data Product

Who am I

• Apache Bigtop PMC member

• Software Engineer @ Trend Micro

• Develop big data apps & infra

Page 4: Using the SDACK Architecture to Build a Big Data Product

A Threat Analytic Big Data Product

Page 5: Using the SDACK Architecture to Build a Big Data Product

Target Scenario

InfoSec Team

Security information and event management (SIEM)

Page 6: Using the SDACK Architecture to Build a Big Data Product

Problems

• Too many logs to investigate

• Lack of actionable, prioritized recommendations

Page 7: Using the SDACK Architecture to Build a Big Data Product

[Diagram: Windows Event, Web Proxy, DNS, and AD logs are collected by SIEM tools such as Splunk and ArcSight, then fed into the Threat Analytic System, which applies Anomaly Detection, Expert Rules, Filtering, and Prioritization before sending Notifications]

Page 8: Using the SDACK Architecture to Build a Big Data Product

Deployment

Page 9: Using the SDACK Architecture to Build a Big Data Product

On-Premises

[Diagram: the Threat Analytic System (Anomaly Detection, Expert Rules, Filtering, Prioritization) deployed on-premises next to the SIEM, serving the InfoSec Team]

Page 10: Using the SDACK Architecture to Build a Big Data Product

On-Premises

[Diagram: the on-premises Threat Analytic System (Anomaly Detection, Expert Rules, Filtering, Prioritization) ingesting AD, Windows Event Log, and DNS data for the InfoSec Team]

Page 11: Using the SDACK Architecture to Build a Big Data Product

On the cloud?

[Diagram: the same system considered as a cloud deployment, with AD, Windows Event Log, and DNS data shipped off-premises to serve the InfoSec Team]

Page 12: Using the SDACK Architecture to Build a Big Data Product

There's too much PII data in the logs

Page 13: Using the SDACK Architecture to Build a Big Data Product

System Requirements

• Identify anomalous activities based on the data

• Look up extra info such as LDAP, Active Directory, hostname, etc.

• Correlate with other data

• e.g., a C&C connection followed by an anomalous login

• Support both streaming and batch analytic workloads

• Run streaming detections to identify anomalies in a timely manner

• Run batch detections with arbitrary start and end times

• Scalable

• Stable

Page 14: Using the SDACK Architecture to Build a Big Data Product

PM: How exactly are we going to build this product?

RD: No problem. I have a solution.

Page 15: Using the SDACK Architecture to Build a Big Data Product

SMACK Stack

A toolbox for a wide variety of data processing scenarios

Page 16: Using the SDACK Architecture to Build a Big Data Product

SMACK Stack

• Spark: fast and general-purpose engine for large-scale data processing

• Mesos: cluster resource management system

• Akka: toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications

• Cassandra: distributed, highly available database designed to handle large amounts of data across datacenters

• Kafka: high-throughput, low-latency distributed pub-sub messaging system for real-time data feeds

Source: http://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka

Page 17: Using the SDACK Architecture to Build a Big Data Product

Medium-sized Enterprises

[Diagram: a deployment sized for a medium enterprise, ingesting AD, Windows Event Log, and DNS data for the InfoSec Team]

Page 18: Using the SDACK Architecture to Build a Big Data Product

PM: Can't we simplify it?

RD: No problem. I have a solution.

Page 19: Using the SDACK Architecture to Build a Big Data Product

SDACK Stack

Page 20: Using the SDACK Architecture to Build a Big Data Product

SDACK Stack

• Spark: fast and general-purpose engine for large-scale data processing

• Docker: deployment and resource management

• Akka: toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications

• Cassandra: distributed, highly available database designed to handle large amounts of data across datacenters

• Kafka: high-throughput, low-latency distributed pub-sub messaging system for real-time data feeds

Source: http://www.slideshare.net/akirillov/data-processing-platforms-architectures-with-spark-mesos-akka-cassandra-and-kafka

Page 21: Using the SDACK Architecture to Build a Big Data Product

Docker - How it works

• Three key techniques in Docker

• Uses the Linux kernel's namespaces to create isolated resources

• Uses the Linux kernel's cgroups to constrain resource usage

• A union file system that supports git-like image creation

Page 22: Using the SDACK Architecture to Build a Big Data Product

The nice things about it

• Lightweight

• Fast creation

• Efficient to ship

• Easy to deploy

• Portability

• Runs on *any* Linux environment

Page 23: Using the SDACK Architecture to Build a Big Data Product

Medium-sized Enterprises

[Diagram: the SDACK-based deployment for a medium-sized enterprise, ingesting AD, Windows Event Log, and DNS data for the InfoSec Team]

Page 24: Using the SDACK Architecture to Build a Big Data Product

Large Enterprises

[Diagram: the same deployment scaled out for a large enterprise, ingesting AD, Windows Event Log, and DNS data for the InfoSec Team]

Page 25: Using the SDACK Architecture to Build a Big Data Product

Fortune 500

[Diagram: the same deployment scaled out further for Fortune 500 data volumes, ingesting AD, Windows Event Log, and DNS data for the InfoSec Team]

Page 26: Using the SDACK Architecture to Build a Big Data Product

Threat Analytic System Architecture

Page 27: Using the SDACK Architecture to Build a Big Data Product

[Architecture diagram: logs flow into a pipeline of Actors; detection results are stored in the DB and served by an API Server to the Web Portal]

Page 28: Using the SDACK Architecture to Build a Big Data Product

[Diagram: the API Server and the Web Portal components]

Page 29: Using the SDACK Architecture to Build a Big Data Product

Docker Compose

kafka:
  build: .
  ports:
    - "9092:9092"
spark:
  image: spark
  ports:
    - "8080:8080"
……

[Diagram: the Web Portal and API Server services wired together with the rest of the stack by Docker Compose]
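Assuming a standard Docker Compose setup, the whole stack can then be brought up with a single command:

docker-compose up -d

and torn down again with docker-compose down, which is what makes the one-box, on-premises deployment story practical.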

Page 30: Using the SDACK Architecture to Build a Big Data Product

Akka Streams and data pipelines

Page 31: Using the SDACK Architecture to Build a Big Data Product

[Diagram: a receiver ingests incoming logs]

Page 32: Using the SDACK Architecture to Build a Big Data Product

[Diagram: the receiver writes into a buffer]

Page 33: Using the SDACK Architecture to Build a Big Data Product

[Diagram: receiver → buffer → transformation and external info lookup]

Page 34: Using the SDACK Architecture to Build a Big Data Product

[Diagram: receiver → buffer → transformation and lookup, feeding both streaming and batch processing]

Page 35: Using the SDACK Architecture to Build a Big Data Product

Why not Spark Streaming?

• Each streaming job occupies at least 1 core

• There are so many types of logs

• AD, Windows Event, DNS, Proxy, Web Application, etc.

• Kafka offsets are lost when the code changes

• No back-pressure (back then)

• Estimated back-pressure only became available in Spark 1.5

• Integration with Akka-based info lookup services

Page 36: Using the SDACK Architecture to Build a Big Data Product

Akka

• High-performance concurrency framework for Java and Scala

• Actor model for message-driven processing

• Asynchronous by design to achieve high throughput

• Each message is handled in a single-threaded context (no locks or synchronization needed)

• Let-it-crash model for fault tolerance and auto-healing

• Clustering mechanism available

Source: The Road to Akka Cluster, and Beyond

Page 37: Using the SDACK Architecture to Build a Big Data Product

Akka Streams

• Akka Streams is a DSL library for doing streaming computation on Akka

• A materializer transforms each step into an Actor

• Back-pressure enabled by default

[Diagram: Source → Flow → Sink]

Source: The Reactive Manifesto
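To make the Source/Flow/Sink model concrete, here is a minimal sketch of an Akka Streams pipeline in Scala (using the Akka 2.4-era API current when this talk was given); the names and data are illustrative only:

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}

object StreamSketch extends App {
  implicit val system = ActorSystem("demo")
  // The materializer turns each stage of the blueprint into actors.
  implicit val materializer = ActorMaterializer()

  val source = Source(1 to 100)           // emits elements on demand
  val double = Flow[Int].map(_ * 2)       // a transformation stage
  val sink   = Sink.foreach[Int](println) // consumes elements

  // Back-pressure flows from sink to source automatically: the source
  // only emits as fast as downstream stages signal demand.
  source.via(double).runWith(sink)
}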

Page 38: Using the SDACK Architecture to Build a Big Data Product

Reactive Kafka

• Akka Streams wrapper for Kafka

• Commits processed offsets back into Kafka

• Provides an at-least-once message delivery guarantee

https://github.com/akka/reactive-kafka

Page 39: Using the SDACK Architecture to Build a Big Data Product

Message Delivery Semantics

• Actor model: at-most-once

• Akka Persistence: at-least-once

• Persists messages to external storage (like a WAL)

• Reactive Kafka: at-least-once + back-pressure

• Writes offsets back into Kafka

• At-least-once + idempotent writes = exactly-once
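For example, if each event is written to Cassandra under a deterministic primary key, replaying the same batch after a crash simply overwrites identical rows, so at-least-once delivery looks exactly-once to readers.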

Page 40: Using the SDACK Architecture to Build a Big Data Product

Implementing Data Pipelines

[Diagram: a Reactive Kafka source feeds a Parsing Phase and then a Storing Phase, with offsets manually committed back to Kafka after processing; the whole pipeline runs on Akka Streams]
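A sketch of such a pipeline, written against the later akka-stream-kafka flavor of the API rather than the 0.8-era one used in the talk (the project was renamed after this presentation); parseLog and storeToCassandra are hypothetical stand-ins for the Parsing and Storing phases:

import akka.actor.ActorSystem
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import org.apache.kafka.common.serialization.StringDeserializer
import scala.concurrent.Future

object PipelineSketch extends App {
  implicit val system = ActorSystem("pipeline")
  implicit val materializer = ActorMaterializer()
  import system.dispatcher

  // Hypothetical Parsing and Storing phases.
  def parseLog(raw: String): Future[String] = Future(raw.trim)
  def storeToCassandra(parsed: String): Future[Unit] = Future(())

  val settings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092")
    .withGroupId("threat-analytics")

  Consumer.committableSource(settings, Subscriptions.topics("logs"))
    .mapAsync(parallelism = 4) { msg =>
      for {
        parsed <- parseLog(msg.record.value)  // Parsing Phase
        _      <- storeToCassandra(parsed)    // Storing Phase
      } yield msg.committableOffset
    }
    // Commit only after the message is safely stored: at-least-once.
    .mapAsync(parallelism = 1)(_.commitScaladsl())
    .runWith(Sink.ignore)
}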

Page 41: Using the SDACK Architecture to Build a Big Data Product

Scale Up

[Diagram: between the Kafka source (with manual commit) and the Storing Phase, the Parsing Phase fans out via mapAsync to a router in front of a pool of parser actors, scaling the pipeline up within one node]
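One way to realize that fan-out, sketched with assumed names (ParserActor and ParsedLog are illustrative, and an ActorSystem named system is taken to be in scope):

import akka.actor.Props
import akka.pattern.ask
import akka.routing.RoundRobinPool
import akka.stream.scaladsl.Flow
import akka.util.Timeout
import scala.concurrent.duration._

implicit val timeout: Timeout = 5.seconds

// A pool of parser actors behind a round-robin router.
val parserRouter = system.actorOf(
  RoundRobinPool(8).props(Props[ParserActor]), "parsers")

// mapAsync keeps up to 8 parses in flight while still
// propagating back-pressure toward the Kafka source.
val parsingPhase = Flow[String].mapAsync(parallelism = 8) { raw =>
  (parserRouter ? raw).mapTo[ParsedLog]
}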

Page 42: Using the SDACK Architecture to Build a Big Data Product

Scale Out

• Scale out by creating more Docker containers with the same Kafka consumer group
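With Compose, for example, docker-compose scale <service>=3 runs three containers of a service (the exact service name depends on the compose file); since they join the same consumer group, Kafka rebalances the topic's partitions across them automatically.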

Page 43: Using the SDACK Architecture to Build a Big Data Product

Be careful using Reactive Kafka's manual commit

Page 44: Using the SDACK Architecture to Build a Big Data Product

A bug that periodically reset the committed offset

[Diagram: a single Kafka source with manual commit, consuming partitions 0, 1, and 2]

Committed offsets:

partition 0: 50
partition 1: 90
partition 2: 70

Page 45: Using the SDACK Architecture to Build a Big Data Product

A bug that periodically reset the committed offset

[Diagram: after scaling out, two Kafka sources with manual commit share the topic: one holds partitions 0 and 1, the other holds partition 2]

First consumer's committed offsets:

partition 0: 300
partition 1: 250
partition 2: 70

Second consumer's committed offsets:

partition 2: 350

Page 46: Using the SDACK Architecture to Build a Big Data Product

Normal: the red line (lag) keeps declining

Page 47: Using the SDACK Architecture to Build a Big Data Product

Abnormal: the red line (lag) keeps getting reset

Page 48: Using the SDACK Architecture to Build a Big Data Product

The fixing PR was merged into the 0.8 branch; apply the patch if your version is <= 0.8.7

https://github.com/akka/reactive-kafka/pull/144/commits

Page 49: Using the SDACK Architecture to Build a Big Data Product

Akka Cluster

• Connects Akka runtimes into a cluster

• No SPOF (single point of failure)

• Gossip protocol

• ClusterSingleton library available

Page 50: Using the SDACK Architecture to Build a Big Data Product

Lookup Service Integration

[Diagram: three Akka Streams pipelines call the Ldap Lookup and Hostname Lookup services, each running as a ClusterSingleton in the Akka Cluster]
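A minimal sketch of registering such a lookup service as a singleton, using the Akka 2.4-era cluster-singleton API; HostnameLookupActor is a hypothetical actor standing in for the real service:

import akka.actor.{ActorSystem, PoisonPill, Props}
import akka.cluster.singleton.{
  ClusterSingletonManager, ClusterSingletonManagerSettings,
  ClusterSingletonProxy, ClusterSingletonProxySettings}

val system = ActorSystem("cluster")

// Exactly one instance of the lookup actor runs somewhere in the cluster.
system.actorOf(
  ClusterSingletonManager.props(
    singletonProps     = Props[HostnameLookupActor],
    terminationMessage = PoisonPill,
    settings           = ClusterSingletonManagerSettings(system)),
  name = "hostnameLookup")

// Every node talks to it through a proxy that tracks the singleton's location.
val lookupProxy = system.actorOf(
  ClusterSingletonProxy.props(
    singletonManagerPath = "/user/hostnameLookup",
    settings             = ClusterSingletonProxySettings(system)),
  name = "hostnameLookupProxy")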


Page 52: Using the SDACK Architecture to Build a Big Data Product

Lookup Service Integration

[Diagram: the same setup; the Akka Cluster provides Service Discovery, so the pipelines find the Ldap Lookup and Hostname Lookup singletons wherever they run]

Page 53: Using the SDACK Architecture to Build a Big Data Product

Async Lookup Process

[Diagram: an Akka Streams pipeline checks a local Cache synchronously and falls back asynchronously to the Hostname Lookup ClusterSingleton]


Page 55: Using the SDACK Architecture to Build a Big Data Product

Async Lookup Process

[Diagram: each Akka Streams pipeline keeps its own local Cache; cache hits are served synchronously, while misses trigger asynchronous lookups to the Hostname Lookup ClusterSingleton]
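A sketch of that lookup stage, again with assumed names (LogEvent and Lookup are illustrative, and lookupProxy plus system come from the earlier singleton sketch):

import akka.pattern.ask
import akka.stream.scaladsl.Flow
import akka.util.Timeout
import java.util.concurrent.ConcurrentHashMap
import scala.concurrent.Future
import scala.concurrent.duration._

case class LogEvent(ip: String, hostname: Option[String] = None)
case class Lookup(ip: String)

implicit val timeout: Timeout = 5.seconds
import system.dispatcher

// Local per-pipeline cache: hits never leave the node.
val cache = new ConcurrentHashMap[String, String]()

val enrich = Flow[LogEvent].mapAsync(parallelism = 4) { event =>
  Option(cache.get(event.ip)) match {
    case Some(host) =>
      Future.successful(event.copy(hostname = Some(host)))  // synchronous hit
    case None =>
      // Cache miss: asynchronous lookup against the singleton service.
      (lookupProxy ? Lookup(event.ip)).mapTo[String].map { host =>
        cache.put(event.ip, host)
        event.copy(hostname = Some(host))
      }
  }
}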

Page 56: Using the SDACK Architecture to Build a Big Data Product

Cassandra Data Model for Time Series Data

Page 57: Using the SDACK Architecture to Build a Big Data Product

Event Time vs. Processing Time

Page 58: Using the SDACK Architecture to Build a Big Data Product

Problem

• We need to do analytics based on event time instead of processing time

• Example: modelling user behaviour

• Logs may arrive late

• network slow, machine busy, service restart, etc.

• Logs may arrive out of order

• multiple sources, different network routing, etc.

Page 59: Using the SDACK Architecture to Build a Big Data Product

Problem

• Spark Streaming can't handle event time

[Diagram: events arriving out of order: ts 12:03, ts 11:50, ts 12:04, ts 12:01, ts 12:02]

Page 60: Using the SDACK Architecture to Build a Big Data Product

Solution

• Store time series data in Cassandra

• Query Cassandra based on event time

• Get the most accurate result at the time the query executes

• The late-arrival tolerance window is flexible

• Simple

Page 61: Using the SDACK Architecture to Build a Big Data Product

Cassandra Data Model

• Retrieving an entire row is fast

• Retrieving a range of columns is fast

• Reads are continuous blocks on disk (no random seeks)

Page 62: Using the SDACK Architecture to Build a Big Data Product

Data Model for AD log

Use shardId to distribute the logs across different partitions

Page 63: Using the SDACK Architecture to Build a Big Data Product

Data Model for AD log

A partition key to support frequent queries for a specific hour

Page 64: Using the SDACK Architecture to Build a Big Data Product

Data Model for AD log

A clustering key for range queries
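The schema figure is not reproduced in the transcript, but a plausible reconstruction from the three captions looks like this, issued through the DataStax Java driver from Scala; the keyspace, table, and column names are assumptions, not the talk's actual schema:

import com.datastax.driver.core.Cluster

object AdLogSchema extends App {
  val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
  val session = cluster.connect()

  session.execute(
    """CREATE TABLE IF NOT EXISTS threat.ad_log (
      |  hour     text,       -- hour bucket, e.g. '2016051109' (partition key)
      |  shard_id int,        -- spreads one hour across partitions (partition key)
      |  event_ts timestamp,  -- event time, not processing time (clustering key)
      |  log      text,
      |  PRIMARY KEY ((hour, shard_id), event_ts)
      |)""".stripMargin)

  // Range query by event time within one hour bucket:
  session.execute(
    "SELECT * FROM threat.ad_log " +
    "WHERE hour = '2016051109' AND shard_id = 0 " +
    "AND event_ts >= '2016-05-11 09:00:00' AND event_ts < '2016-05-11 09:30:00'")

  cluster.close()
}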

Page 65: Using the SDACK Architecture to Build a Big Data Product

Wrap up

Page 66: Using the SDACK Architecture to Build a Big Data Product

Recap: SDACK Stack

• Spark: streaming and batch analytics

• Docker: ship, deploy, and resource management

• Akka: resilient, scalable, flexible data pipelines

• Cassandra: batch queries based on event time

• Kafka: durable buffer, stream processing

Page 67: Using the SDACK Architecture to Build a Big Data Product

Questions?

Thank you!