Kentik Detect Engine - Network Field Day 2017

Kentik Data Engine - Dan Ellis, CTO

Transcript of Kentik Detect Engine - Network Field Day 2017


KDE Quick Stats (Kentik Data Engine)

NetFlow in the Cloud

• 125+ billion flows/day stored
• 1,000,000+ flows per second (FPS)
• 50 “large” queries/s, thousands of sub-queries/s
• 75+ TB flow data stored/day (25+ TB compressed)

SNMP, BGP, network performance too!

KDE High-Level

• KDE is a hybrid system:
  ○ Fusing / ingest layer
  ○ Distributed column-store DB / query engine
  ○ Realtime stream processing for anomaly detection

• We evaluated various existing engines: ES, Hadoop, Cassandra, Storm, Spark, SiLK, Druid, Kafka...

• Couldn’t find the performance, multi-tenancy, and network savvy we needed...

...so we wrote our own.

Ingest & Fusion layer

Storage layer (flow specific)

Query layer

Each layer has separate and different scaling characteristics

[Diagram: the query engine and UI sit between clients and the datasources, exposing three query interfaces: WWW, SQL (e.g. SELECT flow FROM router WHERE ...), and REST.]

KDE Architecture

Ingest architecture

[Diagram: clients export NetFlow (UDP) or kFlow (HTTP/HTTPS). NetFlow enters through a BGP VIP to relays, which pass it on to proxies; kFlow goes straight to the proxies. Each proxy runs a client (“C”) process per device, and the clients emit kFlow (HTTPS) into the KDE ingest layer (enKryptor), which feeds the storage layer and the streaming layer.]


VIP + Relay

• One IP bound to multiple servers
• Sharded by source IP
• Validate sender as a Kentik customer
• Pass flow on (raw UDP socket) to the correct proxy
• Relay handles load balancing (Kentik-specific, UDP+TCP)
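The “sharded by source IP” step above can be sketched as a hash of the exporter's source address onto a fixed proxy list, so that every packet from one exporter lands on the same proxy. This is an illustrative sketch, not Kentik's actual relay code; the function and proxy names are hypothetical.

```python
import hashlib

def pick_proxy(source_ip: str, proxies: list) -> str:
    """Deterministically map an exporter's source IP to one proxy.

    All flow from one device must reach the same proxy so that
    per-device state (e.g. NetFlow v9/IPFIX templates) stays together.
    """
    digest = hashlib.sha256(source_ip.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(proxies)
    return proxies[index]

proxies = ["proxy-a", "proxy-b", "proxy-c"]
# The same exporter always hashes to the same proxy:
assert pick_proxy("203.0.113.7", proxies) == pick_proxy("203.0.113.7", proxies)
```

A real relay would also have to keep the mapping stable as proxies join and leave (e.g. consistent hashing), which this sketch omits.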

Proxy


• Inspect flow & determine type: v5, v9, IPFIX, sFlow, kFlow
• Need to resample? Apply the configured sample rate
• Launch a client process for each device
• Poll for device changes
• Monitor health
• Relaunch client on crash
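The “inspect flow & determine type” step can be illustrated by peeking at the version field that leads a NetFlow/IPFIX datagram. This is a simplified sketch only: real detection is more involved (sFlow uses a 4-byte version field, kFlow arrives over HTTP rather than raw UDP), and the function name is an assumption.

```python
import struct

def detect_flow_type(payload: bytes) -> str:
    """Classify a UDP flow payload by its leading 16-bit version field.

    NetFlow v5 starts with version 5, NetFlow v9 with 9, and IPFIX
    (sometimes called "NetFlow v10") with 10.
    """
    if len(payload) < 2:
        return "unknown"
    (version,) = struct.unpack("!H", payload[:2])
    return {5: "NetFlow v5", 9: "NetFlow v9", 10: "IPFIX"}.get(version, "unknown")

# A minimal NetFlow v9 header begins with 0x0009:
assert detect_flow_type(struct.pack("!H", 9) + b"\x00" * 18) == "NetFlow v9"
```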


Client (where the magic happens)

• One per device configured to send flow
• * goes in, kFlow comes out

[Diagram: NetFlow, sFlow, and IPFIX enter the client (“C”); unified kFlow comes out.]

Client Processing is a key enabler to useful data

Step 1: Normalization

• Separate code paths for each type expected
• CGO callouts

Step 2: Enrichment

• BGP - route data for xxx
• GeoIP - where does my traffic start and end
• SNMP - interface names and descriptions
• Tagging - business classification: cost-centers, user-info, peering info
• App-specific data - URL/DNS requests, MySQL queries
• Performance data (NPM) - retransmits, network latency, app latency
• Coming soon:
  • Timestamped event data (syslog)
  • Threat feeds
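The enrichment step above amounts to joining each raw flow record against several lookup tables before it is stored. A toy sketch of that fusion, with stand-in tables for the real BGP RIB, GeoIP database, and SNMP poller state (all field names and values here are hypothetical, not KDE's schema):

```python
# Toy stand-ins for the real enrichment sources:
GEO = {"198.51.100.1": "US"}                       # GeoIP: IP -> country
ASN = {"198.51.100.1": 64500}                      # BGP: IP -> origin ASN
IFACE_NAMES = {12: "xe-0/0/1 (transit to ISP-A)"}  # SNMP: ifIndex -> name

def fuse(flow: dict) -> dict:
    """Join one raw flow record with enrichment data into a fused row."""
    fused = dict(flow)
    fused["src_geo"] = GEO.get(flow["src_ip"], "??")
    fused["src_asn"] = ASN.get(flow["src_ip"], 0)
    fused["in_iface_name"] = IFACE_NAMES.get(flow["in_iface"], "unknown")
    return fused

row = fuse({"src_ip": "198.51.100.1", "in_iface": 12, "bytes": 512})
assert row["src_geo"] == "US" and row["src_asn"] == 64500
```

Doing this join once in the client, at ingest time, is what lets every later query filter and group on geo, ASN, or interface name without further lookups.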

DATA FUSION in CLIENT

[Diagram: decoder modules (NetFlow v5, NetFlow v9, IPFIX, and sFlow from the router; PCAP via a PCAP agent and proxy) feed the data fusion stage, which joins them against in-memory tables (MemTables) populated from the BGP RIB (BGP daemon), custom tags, the SNMP poller, and an enrichment DB of Geo ←→ IP and ASN ←→ IP mappings. A single fused row per flow is sent to the flow-friendly datastore.]

Step 3: Resampling & Unification

• Long term (>1 month): what a process (device) said over an hour
• Two tricks:
  • Flow unification
  • Resampling
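Resampling can be sketched as scaling a flow's counters so that records collected at different device sample rates become comparable at one configured rate. A minimal illustration, assuming hypothetical field names (not KDE's actual schema):

```python
def resample(flow: dict, device_rate: int, target_rate: int) -> dict:
    """Rescale a sampled flow's counters from the device's 1-in-N sample
    rate to a common target rate, so flows from differently configured
    devices can be summed together."""
    scale = device_rate / target_rate
    out = dict(flow)
    out["bytes"] = int(flow["bytes"] * scale)
    out["packets"] = int(flow["packets"] * scale)
    return out

f = {"bytes": 1000, "packets": 10}
# A device sampling 1-in-2048, normalized to a 1-in-1024 target, doubles:
assert resample(f, device_rate=2048, target_rate=1024) == {"bytes": 2000, "packets": 20}
```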

Query + Storage layers: achieving ‘à la carte’ data consumption

Storage Layer

• Fused kFlow as input... Cap'n Proto (like protocol buffers)
• Shard data into small chunks
• HTTP to N distributed storage nodes
• Metadata supervisor DB handles shard locations
• Row oriented to column oriented
• Compressed using ZFS on disk
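The “row oriented to column oriented” step can be sketched as pivoting incoming fused rows into per-column arrays; columnar layout compresses far better and lets a query scan only the columns it touches. An illustrative sketch only, with toy field names:

```python
def rows_to_columns(rows: list) -> dict:
    """Pivot a batch of row dicts into a dict of per-column value lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns

rows = [{"src_ip": "10.0.0.1", "bytes": 100},
        {"src_ip": "10.0.0.2", "bytes": 250}]
cols = rows_to_columns(rows)
# A query summing bytes now reads one contiguous array, nothing else:
assert cols["bytes"] == [100, 250]
```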

Multi-Tenancy DB

Needed multi-tenancy for a large-scale SaaS product; could not find other DBs at scale with it.

We succeeded by building in:
● Fairness: queries are chopped into small chunks; users are rate limited and prioritized
● Security: data is isolated between “users” down to the thread level
● Multiuser caching with fairness: built a cache that cannot be monopolized by any one user
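The “queries are chopped into small chunks” idea can be illustrated by splitting a large time-range query into fixed-width sub-ranges that a scheduler can interleave fairly across tenants. The 60-second chunk width here is an arbitrary assumption, not KDE's actual unit of work:

```python
def chop_query(start: int, end: int, chunk: int = 60) -> list:
    """Split a [start, end) time range (seconds) into sub-ranges of at
    most `chunk` seconds, so no single query monopolizes the workers."""
    return [(lo, min(lo + chunk, end)) for lo in range(start, end, chunk)]

# A 150-second query becomes three independently schedulable pieces:
assert chop_query(0, 150, chunk=60) == [(0, 60), (60, 120), (120, 150)]
```

With queries decomposed this way, rate limiting and prioritization reduce to deciding which tenant's next chunk runs, which is what makes the fairness guarantee enforceable.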


Query engine and UI

● SQL interface: PSQL FDW
● UI/UX featuring advanced data-viz
● REST API based interface: build your own

[Diagram: clients reach the datasources through a viz-rich UI (WWW), SQL (e.g. SELECT flow FROM router WHERE ...), and the REST API.]

Anomaly Detection and Streaming Databases

Anomaly Detection

● Network + NPM specific
● Policy based, customizable
● Granular itemization and metrics
  ○ Look at top-100 country, IP, port, ASN, site, path, ...
  ○ Unique senders, bps, pps, rxmits, latency
● Over/under static thresholds
● Over/under what’s “normal” (baselining)
● Perform actions
  ○ E-mail, Slack, JSON, PagerDuty
  ○ Mitigation (A10, Radware, BGP)
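A minimal sketch of a policy evaluation combining the two comparison modes above, a static threshold and an “over what's normal” baseline check. The numbers, field names, and 2x tolerance are illustrative assumptions, not Kentik's policy engine:

```python
def evaluate(current_bps: float, static_max: float,
             baseline_bps: float, tolerance: float = 2.0) -> list:
    """Return the list of policy conditions the current value violates."""
    alarms = []
    if current_bps > static_max:          # over a static threshold
        alarms.append("static-threshold")
    if current_bps > baseline_bps * tolerance:  # over what's "normal"
        alarms.append("over-baseline")
    return alarms

# 9 Gbps against a 5 Gbps cap and a 2 Gbps baseline trips both checks:
assert evaluate(current_bps=9e9, static_max=5e9, baseline_bps=2e9) == \
    ["static-threshold", "over-baseline"]
```

In a real system each triggered condition would then fan out to the actions listed above (e-mail, Slack, mitigation, etc.).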

• DDoS is a simple use case of anomaly detection

• V1 anomaly detection relied on KDE queries; this proved abusive

• V2 needed stream processing and in-RAM baseline storage

• We typically avoided streaming DBs due to aggregation

• Streaming DBs for anomaly detection plus our long-term flow storage is a powerful combination

• Evaluated Spark, Storm, Samza, PipelineDB; all failed to fit

Detecting Anomalies


Streaming layer

[Diagram: kFlow arrives at multiple kFPS and fans out to policies (Policy #1, Policy #2, ...). Aggregation Layer #1 sums (Σ) into 1-second buckets; Aggregation Layer #2 rolls those into 1-minute sums; Aggregation Layer #3 rolls up to 1 hour. A policy aggregation filter feeds each policy's thresholds and actions: a threshold comparator fires actions and triggers.]
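The aggregation cascade in the diagram can be sketched as successive roll-ups: per-second sums feed one-minute sums, which would in turn feed one-hour sums, so the threshold comparator only ever touches small pre-aggregated series. A toy illustration, not the actual streaming engine:

```python
def roll_up(samples: list, width: int) -> list:
    """Sum consecutive `width`-sized groups of finer-grained buckets."""
    return [sum(samples[i:i + width]) for i in range(0, len(samples), width)]

per_second = [1] * 120                 # 120 one-second buckets of 1 unit each
per_minute = roll_up(per_second, 60)   # Aggregation Layer #1 -> #2
assert per_minute == [60, 60]
per_hour = roll_up(per_minute, 60)     # Aggregation Layer #2 -> #3 (partial hour)
```

Each layer reduces the data volume by its bucket width, which is why baselines and threshold checks stay cheap even at millions of flows per second.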

kentik.com/nfd14