Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics...

37
Scalable NetFlow Analysis with Hadoop Yeonhee Lee and Youngseok Lee {yhlee06, lee}@cnu.ac.kr http://networks.cnu.ac.kr/~yhlee Chungnam National University, Korea January 8, 2013 FloCon 2013

Transcript of Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics...

Page 1: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

Scalable NetFlow Analysis with Hadoop

Yeonhee Lee and Youngseok Lee {yhlee06, lee}@cnu.ac.kr

http://networks.cnu.ac.kr/~yhlee Chungnam National University, Korea

January 8, 2013

FloCon 2013

Page 2: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

Contents

• Introduction • Overview • Hadoop-based traffic processing tool • Evaluation • Summary

Page 3: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

INTRODUCTION

Page 4: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

Internet Measurement

• Challenges • Scalability • Fault-tolerant system • Extensibility

• CAIDA data • Capture, Curation, Storage, Search, Sharing, Analysis,

and Visualization • Ark topology: 1.8 TB • Telescope: 102 TB • Packet headers: 18.8 TB

4

Josh Polterock, “CAIDA: A Data Sharing Case Study,” Security at the Cyber Border: Exploring Cybersecurity for International Research Network Connections workshop, 2012

Page 5: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

• 1 PB sorting by Google • 2008: 6 hours and 2

minutes on 4,000 computers

• 2011: 33 minutes on 8000 computers

• 2011: 10PB, 8000 computers, 6 hours and 27 minutes

Harness Distributed Computing and Storage ? Google MapReduce, 2004 Apache Hadoop project

5

Page 6: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

Our Proposal

6

NetFlow v5

Packet

Administrator

Slave

Master

Traffic Collector

Web Visualizer / Hive

Pcap I/O

NetFlow I/O

Traffic Analysis Mapper & Reducer

Traffic Analyzer

HDFS Hadoop

Bin I/O

Hadoop-based Traffic Measurement and Analysis Platform

1. Yeonhee Lee and Youngseok Lee, "Toward Scalable Internet Traffic Measurement and Analysis with Hadoop," ACM SIGCOMM Computer Communication Review (CCR), Jan. 2013

2. Yeonhee Lee and Youngseok Lee “A Hadoop-based Packet Trace Processing Tool” , TMA, April 2011 3. Yeonhee Lee and Youngseok Lee, "Detecting DDoS Attacks with Hadoop", ACM CoNEXT Student

Workshop, Dec, 2011

Page 7: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

Related Work

• Traffic analysis of DNS root server (RIPE, 2011.11) • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

2011 • Firewall/IDS logs, netflow/packet

• Performing Network and Security Analytics with Hadoop, (Travis Dawson, Narus), Hadoop Summit 2012

• Distributed Bro (IDS)

7

Page 8: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

OVERVIEW

Page 9: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

Hadoop-based NetFlow Analysis

9

Distributer

NetFlow NetFlow

Collect & Anlaysis

Page 10: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

10

Distributer

Packet N

etFlow

monitor

query

Traffic Analyzer

MapReduce for Traffic Analysis

IO formats

Traffic Collector

& Loader

Pcap InputFormat

Binary Input/OutputFormat

Data Source (Jpcap, HDFS)

Data Processing (HDFS, MapReduce, Hive)

User Interface (Hive, Web)

Text Input/OutputFormat

IP analysis MR

TCP analysis MR

HTTP analysis MR

User Interface

Web UI

CLI

DDoS analysis MR

Hive QL Query for Traffic Analysis

NetFlow analysis MR

Hadoop HDFS

Scan IP query

Spoofed IP query

Heavy User query

User-defined query

Page 11: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

HADOOP-BASED TRAFFIC ANALYSIS

Page 12: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

Challenges

1. Data handing issue in HDFS 2. Distributed traffic analysis MapReduce algorithms 3. Performance tuning in a large-scale Hadoop

Distributed computation

Fault tolerance

Scalability (~TB/PB)

12

Page 13: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

1. Data handing issue in Hadoop

2. Distributed traffic analysis MapReduce algorithms

3. Performance tuning in a large-scale Hadoop

testbed

13

Challenges

Page 14: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

file-level processing

Block-level Parallelism

14 14 00 00 68 2B AD 4C 38 A4 04 00 5C 00 00 00 5C 00 00 00 FF FF ‥‥ 00 21 B5 01 68 2B AD 4C 2B 1C 07 00 3C 00 00 00 3C 00 00 00 01 80 ‥

HDFS Block2 (64 MB)

HDFS Block3 (64 MB)

block-level processing

Page 15: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

15

Block-level IO vs. File-level IO

3.5 3.9

4.3

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

0

20

40

60

80

100

120

140

Spee

dUp

Com

plet

ion

Tim

e (m

in)

# of nodes

IP Analysis_blockIO

IP Analysis_fileIO

SpeedUp vs fileIO

Page 16: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

1. Data handing issue in Hadoop

2. Distributed traffic analysis MapReduce algorithms

3. Performance tuning in a large-scale Hadoop

testbed

16

Challenges

Page 17: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

Reduce Phase

Map Phase

Block IO HDFS HDFS

DistributedCache Aggregation

17

Filtering Rule cnu;srcip=168.188.0.0-168.188.255.255

Aggregation Rule as;ip;subnet;port;protocol;srcas;dstas;srcip;dstip;srcsubnet;dstsubnet;srcport;dstport;

IP/UDP packet NetFlow v5 header

v5 record

IP/UDP packet NetFlow v5 header

v5 record

v5 record

deco

ding

filte

ring

gro

up-k

ey

gene

ratio

n

aggr

egat

ion

deco

ding

filte

ring

gro

up-k

ey

gene

ratio

n

aggr

egat

ion

Port # of octets # of packets # of Flows

Protocol # of octets # of packets # of Flows

AS # of octets # of packets # of Flows

Subnet # of octets # of packets # of Flows

pack

et

iden

tific

atio

n pa

cket

id

entif

icat

ion

K: time|AS V: count

K: time|AS V: count

counts per AS

counts per AS

Page 18: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

Map Phase

Reduce Phase

Shuffle &Sort

Block IO HDFS

HDFS

Anomaly Detection

18

DistributedCache Detection Rules port_scan;ip,proto=6;srcip,dstport;srcip;pkts=20- syn_flood;ip,proto=6,syn-fin=1-;srcip,dstip;srcip,dstip;syn-fin=6-

IP/UDP packet NetFlow v5 header

v5 record

IP/UDP packet NetFlow v5 header

v5 record

v5 record

deco

ding

patte

rn

mat

chin

g g

roup

-key

ge

nera

tion

aggr

egat

ion

deco

ding

patte

rn

mat

chin

g g

roup

-key

ge

nera

tion

aggr

egat

ion

syn_flood 3.3.3.33.3.3.4 …

PortScan 1.1.1.11.1.1.2 …

dete

ctio

n de

tect

ion

parti

tioni

ng

&gro

up s

ort

parti

tioni

ng

&gro

up s

ort

pack

et

iden

tific

atio

n pa

cket

id

entif

icat

ion

Page 19: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

1. Data handing issue in Hadoop

2. Distributed traffic analysis MapReduce algorithms

3. Performance tuning in a large-scale Hadoop

19

Challenges

Page 20: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

• Configuration • Hadoop IO Buffer (128K 1 MB) • Java heap space (300 MB 1 024 MB) • # of MapReduce Slots (# of cores)

• MapReduce Algorithm • normal combiner vs inMapper combiner

• Job scheduling

20

Performance Tuning

Page 21: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

Job Scheduling

• Different job types • Periodic jobs (for monitoring)

• guaranteed service within time • e.g Aggregated Statistics for monitoring, Flow Parse job for

analytics

• Small ad-hoc query job (for analytics) • fast response time

21

Collect Collect Collect … Collector

Basic Statistics Flow Parse Basic Statistics Flow Parse … Basic Statistics Flow Parse Fair Scheduler ad-hoc query ad-hoc query ad-hoc query

Basic Statistics Flow Parse Basic Statistics Flow

Parse … Basic Statistics Flow

Parse FIFO Scheduling ad-hoc query

ad-hoc query

ad-hoc query

5 munites 5 munites 5 munites 5 munites

ad-hoc job periodic job

Page 22: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

PERFORMANCE EVALUATION

Page 23: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

Experiments

• Testbed

• Data and MapReduce jobs

23

Type Nodes Cores CPU Memory HardDisk Rack Small 3 24 3.4 GHz 8 core 16 GB 2 TB 1 Rack

Medium 30 240 2.93 GHz 8 core 16 GB 4 TB 1 Rack

Large 200 400 2.66 GHz 2 core 2 GB 500 GB 4 Racks

Type Dataset MapReduce Job Testbed

NetFlow 1 TB from KOREN flowStats, flowDetect, flowPrint Small

Packet 1 ~ 5 TB from CNU campus N/W IP, TCP, Web (webpop, User Behavior, DDoS) Medium, Large

Page 24: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

5.0

2.3

0.01.02.03.04.05.06.0

1 2 3

Spee

dUp

# of nodes

flowStats vs. Flowtools

flowPrint vs. FlowTools

24

NetFlow: SpeedUp (vs. Flowtools)

> FlowPrint flow-cat -p flowfile |flow-print –f14

> FlowStats flow-cat -p flowfile|flow-stat -f12 flow-cat -p flowfile|flow-stat –f5

Page 25: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

25

NetFlow: Scalability

103

17

0

20

40

60

80

100

120

5 10 15 20 25 30

Job

Com

plet

ion

Tim

e (m

in)

# of nodes

flowStats

flowDetect

6.1

01234567

5 10 15 20 25 30Sp

eedU

p (v

s 5

node

s)

# of nodes

flowStats

flowDetect

Page 26: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

1

10

100

1000

10000

100000

8/29/2012 9/8/2012 9/18/2012 9/28/2012 10/8/2012 10/18/2012 10/28/2012

coun

t

date

worm.sasser,w32.sasser

remote_administrator

vnc

w32.witty.worm

worm.opasoft,w32.opaserv.worm

code_red_worm

netfairy

kamun

emule

shockwave_killer

worm.killmsblast,w32.nachi.worm,w32.welchia.worm

26

NetFlow: Pattern Matching Result

0

2

4

6

8

10

# of

reco

rds (

M) NetFlows Record Distribution

Page 27: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

9

13

02468

101214

5 10 15 20 25 30

Thro

ughp

ut (G

bps)

# of nodes

IP Analysis (Gbps)

TCP Analysis (Gbps)

WebPop (Gbps)

UserBehavior (Gbps)

DDos (Gbps)

121

15 13

77

020406080

100120140

5 10 15 20 25 30

Com

plet

ion

Tim

e (m

in)

IP Analysis (min)

TCP Analysis (min)

WebPop (min)

UserBehavior (min)

DDoS (min)

Packet: ScaleOut

27

Page 28: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

28

Packet: SizeUp (30 nodes)

2934

7007

0

2

4

6

8

10

12

14

16

0

1000

2000

3000

4000

5000

6000

7000

8000

1TB 2TB 3TB 4TB 5TB

Thro

ughp

ut (G

bps)

Com

plet

ion

Tim

e (s

ec)

Data Size

IP Analysis

IP Analysis_ripe

TCPStats

Webpop

UserBehavior

DDoS

IP Analysis

TCPStats

Webpop

UserBehavior

DDoS

Page 29: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

2744

8

15

0

2

4

6

8

10

12

14

16

18

0500

100015002000250030003500400045005000

1TB 2TB 3TB 4TB 5TB

Thro

ughp

ut (G

bps)

Com

plet

ion

Tim

e (s

ec)

Data Size

IP AnalysisTCP AnalysisWebpopUserBehaviorDDoSIP AnalysisTCP AnalysisWebpopUserBehaviorDDoS

Packet: SizeUp (200 nodes)

29

Page 30: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

SUMMARY

Page 31: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

• NetFlow analysis with Hadoop • NetFlow v5 processing module • MapReduce algorithms: statistics

• Distributed computing and storage with Hadoop

• Fits Internet measurement application • Scalability

• Source codes are available at

• Packet, NetFlow • https://sites.google.com/a/networks.cnu.ac.kr/dnlab/researc

h/hadoop • https://github.com/ssallys/pcap-on-Hadoop

31

Summary

Page 32: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

Ongoing Work • Distributed real-time

monitoring • Rule matching for

Streamed NetFlow • Developing rule for

MapReduce • Rule classification for

dedicated rule matching

• Integration • Streaming packages • Enhanced analytics

• Data mining: Mahout • Machine learning

32

RHive

Hive Pig

MapReduce

HDFS

Mahout

RHadoop

Rhipe

Performance

Productivity

• Scalable collection • E.g.) 10GE 10 X 1 GE

HDFS

Page 33: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

• Papers 1. Y. Lee and Y. Lee, "Toward Scalable Internet Traffic

Measurement and Analysis with Hadoop," ACM SIGCOMM Computer Communication Review (CCR), Jan. 2013

2. Y. Lee, W. Kang, and Y. Lee, "A Hadoop-based Packet Trace Processing Tool," The Third TMA, April 2011

3. Y. Lee and Y. Lee, "Detecting DDoS Attacks with Hadoop", ACM CoNEXT Student Workshop, Dec, 2011

• Software 1. http://networks.cnu.ac.kr/~yhlee 2. https://sites.google.com/a/networks.cnu.ac.kr/dnlab/research/hadoop 3. https://github.com/ssallys/pcap-on-Hadoop

33

Reference

Page 34: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

THANK YOU !

Page 35: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

35

0%

20%

40%

60%

80%

100%

no 10 20 30 40 50 60 70 80 90 100

110

120

130

140

150

160

170

180

190

200

Usa

ge(

%)

Tics

IP Analysis

mem_usage

disk_read_usage

disk_write_usage

cpu_usage

net_in_usage

net_out_usage

0%

20%

40%

60%

80%

100%

1 19 37 55 73 91 109

127

145

163

181

199

217

235

253

271

289

307

325

343

361

379

397

Usa

ge(

%)

Tics

TCP Analysis

mem_usage

disk_read_usage

disk_write_usage

cpu_usage

net_in_usage

net_out_usage

Page 36: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

36

Collector TaskTracker Datanode

Collector TaskTracker Datanode

Collector TaskTracker Datanode

LoadBalancer JobTracker NameNode

PF_RING

PF_RING ,RAID

DBMS

Page 37: Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World

37

on - the - fly packets packets NetFlow record

each statistics

aggregated statistics

table records

visualized statistics

load balancer

network interface

PF_RING /RSS

Input Format

Mapper Reducer table records

visualized statistics

Internet

rule name; filter pattern; mapout key; patition&groupsort key;detection condition; action ex) port_scan;ip,proto=6;srcip,dstport;srcip;pkts=20- syn_flood;ip,proto=6,syn-fin=1-;srcip,dstip;srcip,dstip;syn-fin=6-