Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics...
Transcript of Scalable NetFlow Analysis with Hadoop€¦ · • PacketPig (2012.03) - Big Data Security Analytics...
Scalable NetFlow Analysis with Hadoop
Yeonhee Lee and Youngseok Lee {yhlee06, lee}@cnu.ac.kr
http://networks.cnu.ac.kr/~yhlee Chungnam National University, Korea
January 8, 2013
FloCon 2013
Contents
• Introduction • Overview • Hadoop-based traffic processing tool • Evaluation • Summary
INTRODUCTION
Internet Measurement
• Challenges • Scalability • Fault-tolerant system • Extensibility
• CAIDA data • Capture, Curation, Storage, Search, Sharing, Analysis,
and Visualization • Ark topology: 1.8 TB • Telescope: 102 TB • Packet headers: 18.8 TB
4
Josh Polterock, “CAIDA: A Data Sharing Case Study,” Security at the Cyber Border: Exploring Cybersecurity for International Research Network Connections workshop, 2012
• 1 PB sorting by Google • 2008: 6 hours and 2
minutes on 4,000 computers
• 2011: 33 minutes on 8000 computers
• 2011: 10PB, 8000 computers, 6 hours and 27 minutes
Harness Distributed Computing and Storage ? Google MapReduce, 2004 Apache Hadoop project
5
Our Proposal
6
NetFlow v5
Packet
Administrator
Slave
Master
Traffic Collector
Web Visualizer / Hive
Pcap I/O
NetFlow I/O
Traffic Analysis Mapper & Reducer
Traffic Analyzer
HDFS Hadoop
Bin I/O
Hadoop-based Traffic Measurement and Analysis Platform
1. Yeonhee Lee and Youngseok Lee, "Toward Scalable Internet Traffic Measurement and Analysis with Hadoop," ACM SIGCOMM Computer Communication Review (CCR), Jan. 2013
2. Yeonhee Lee and Youngseok Lee “A Hadoop-based Packet Trace Processing Tool” , TMA, April 2011 3. Yeonhee Lee and Youngseok Lee, "Detecting DDoS Attacks with Hadoop", ACM CoNEXT Student
Workshop, Dec, 2011
Related Work
• Traffic analysis of DNS root server (RIPE, 2011.11) • PacketPig (2012.03) - Big Data Security Analytics platform • Sherpasurfing – Open Source Cyber Security Solution, Hadoop World
2011 • Firewall/IDS logs, netflow/packet
• Performing Network and Security Analytics with Hadoop, (Travis Dawson, Narus), Hadoop Summit 2012
• Distributed Bro (IDS)
7
OVERVIEW
Hadoop-based NetFlow Analysis
9
Distributer
NetFlow NetFlow
Collect & Anlaysis
10
Distributer
Packet N
etFlow
monitor
query
Traffic Analyzer
MapReduce for Traffic Analysis
IO formats
Traffic Collector
& Loader
Pcap InputFormat
Binary Input/OutputFormat
Data Source (Jpcap, HDFS)
Data Processing (HDFS, MapReduce, Hive)
User Interface (Hive, Web)
Text Input/OutputFormat
IP analysis MR
TCP analysis MR
HTTP analysis MR
User Interface
Web UI
CLI
DDoS analysis MR
Hive QL Query for Traffic Analysis
NetFlow analysis MR
Hadoop HDFS
Scan IP query
Spoofed IP query
Heavy User query
User-defined query
HADOOP-BASED TRAFFIC ANALYSIS
Challenges
1. Data handing issue in HDFS 2. Distributed traffic analysis MapReduce algorithms 3. Performance tuning in a large-scale Hadoop
Distributed computation
Fault tolerance
Scalability (~TB/PB)
12
1. Data handing issue in Hadoop
2. Distributed traffic analysis MapReduce algorithms
3. Performance tuning in a large-scale Hadoop
testbed
13
Challenges
file-level processing
Block-level Parallelism
14 14 00 00 68 2B AD 4C 38 A4 04 00 5C 00 00 00 5C 00 00 00 FF FF ‥‥ 00 21 B5 01 68 2B AD 4C 2B 1C 07 00 3C 00 00 00 3C 00 00 00 01 80 ‥
HDFS Block2 (64 MB)
HDFS Block3 (64 MB)
block-level processing
15
Block-level IO vs. File-level IO
3.5 3.9
4.3
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
0
20
40
60
80
100
120
140
Spee
dUp
Com
plet
ion
Tim
e (m
in)
# of nodes
IP Analysis_blockIO
IP Analysis_fileIO
SpeedUp vs fileIO
1. Data handing issue in Hadoop
2. Distributed traffic analysis MapReduce algorithms
3. Performance tuning in a large-scale Hadoop
testbed
16
Challenges
Reduce Phase
Map Phase
Block IO HDFS HDFS
DistributedCache Aggregation
17
Filtering Rule cnu;srcip=168.188.0.0-168.188.255.255
Aggregation Rule as;ip;subnet;port;protocol;srcas;dstas;srcip;dstip;srcsubnet;dstsubnet;srcport;dstport;
IP/UDP packet NetFlow v5 header
v5 record
IP/UDP packet NetFlow v5 header
v5 record
v5 record
…
…
deco
ding
filte
ring
gro
up-k
ey
gene
ratio
n
aggr
egat
ion
deco
ding
filte
ring
gro
up-k
ey
gene
ratio
n
aggr
egat
ion
Port # of octets # of packets # of Flows
Protocol # of octets # of packets # of Flows
AS # of octets # of packets # of Flows
Subnet # of octets # of packets # of Flows
pack
et
iden
tific
atio
n pa
cket
id
entif
icat
ion
K: time|AS V: count
K: time|AS V: count
counts per AS
counts per AS
Map Phase
Reduce Phase
Shuffle &Sort
Block IO HDFS
HDFS
Anomaly Detection
18
DistributedCache Detection Rules port_scan;ip,proto=6;srcip,dstport;srcip;pkts=20- syn_flood;ip,proto=6,syn-fin=1-;srcip,dstip;srcip,dstip;syn-fin=6-
IP/UDP packet NetFlow v5 header
v5 record
IP/UDP packet NetFlow v5 header
v5 record
v5 record
…
…
deco
ding
patte
rn
mat
chin
g g
roup
-key
ge
nera
tion
aggr
egat
ion
deco
ding
patte
rn
mat
chin
g g
roup
-key
ge
nera
tion
aggr
egat
ion
syn_flood 3.3.3.33.3.3.4 …
PortScan 1.1.1.11.1.1.2 …
dete
ctio
n de
tect
ion
parti
tioni
ng
&gro
up s
ort
parti
tioni
ng
&gro
up s
ort
pack
et
iden
tific
atio
n pa
cket
id
entif
icat
ion
1. Data handing issue in Hadoop
2. Distributed traffic analysis MapReduce algorithms
3. Performance tuning in a large-scale Hadoop
19
Challenges
• Configuration • Hadoop IO Buffer (128K 1 MB) • Java heap space (300 MB 1 024 MB) • # of MapReduce Slots (# of cores)
• MapReduce Algorithm • normal combiner vs inMapper combiner
• Job scheduling
20
Performance Tuning
Job Scheduling
• Different job types • Periodic jobs (for monitoring)
• guaranteed service within time • e.g Aggregated Statistics for monitoring, Flow Parse job for
analytics
• Small ad-hoc query job (for analytics) • fast response time
21
Collect Collect Collect … Collector
Basic Statistics Flow Parse Basic Statistics Flow Parse … Basic Statistics Flow Parse Fair Scheduler ad-hoc query ad-hoc query ad-hoc query
Basic Statistics Flow Parse Basic Statistics Flow
Parse … Basic Statistics Flow
Parse FIFO Scheduling ad-hoc query
ad-hoc query
ad-hoc query
5 munites 5 munites 5 munites 5 munites
ad-hoc job periodic job
PERFORMANCE EVALUATION
Experiments
• Testbed
• Data and MapReduce jobs
23
Type Nodes Cores CPU Memory HardDisk Rack Small 3 24 3.4 GHz 8 core 16 GB 2 TB 1 Rack
Medium 30 240 2.93 GHz 8 core 16 GB 4 TB 1 Rack
Large 200 400 2.66 GHz 2 core 2 GB 500 GB 4 Racks
Type Dataset MapReduce Job Testbed
NetFlow 1 TB from KOREN flowStats, flowDetect, flowPrint Small
Packet 1 ~ 5 TB from CNU campus N/W IP, TCP, Web (webpop, User Behavior, DDoS) Medium, Large
5.0
2.3
0.01.02.03.04.05.06.0
1 2 3
Spee
dUp
# of nodes
flowStats vs. Flowtools
flowPrint vs. FlowTools
24
NetFlow: SpeedUp (vs. Flowtools)
> FlowPrint flow-cat -p flowfile |flow-print –f14
> FlowStats flow-cat -p flowfile|flow-stat -f12 flow-cat -p flowfile|flow-stat –f5
25
NetFlow: Scalability
103
17
0
20
40
60
80
100
120
5 10 15 20 25 30
Job
Com
plet
ion
Tim
e (m
in)
# of nodes
flowStats
flowDetect
6.1
01234567
5 10 15 20 25 30Sp
eedU
p (v
s 5
node
s)
# of nodes
flowStats
flowDetect
1
10
100
1000
10000
100000
8/29/2012 9/8/2012 9/18/2012 9/28/2012 10/8/2012 10/18/2012 10/28/2012
coun
t
date
worm.sasser,w32.sasser
remote_administrator
vnc
w32.witty.worm
worm.opasoft,w32.opaserv.worm
code_red_worm
netfairy
kamun
emule
shockwave_killer
worm.killmsblast,w32.nachi.worm,w32.welchia.worm
26
NetFlow: Pattern Matching Result
0
2
4
6
8
10
# of
reco
rds (
M) NetFlows Record Distribution
9
13
02468
101214
5 10 15 20 25 30
Thro
ughp
ut (G
bps)
# of nodes
IP Analysis (Gbps)
TCP Analysis (Gbps)
WebPop (Gbps)
UserBehavior (Gbps)
DDos (Gbps)
121
15 13
77
020406080
100120140
5 10 15 20 25 30
Com
plet
ion
Tim
e (m
in)
IP Analysis (min)
TCP Analysis (min)
WebPop (min)
UserBehavior (min)
DDoS (min)
Packet: ScaleOut
27
28
Packet: SizeUp (30 nodes)
2934
7007
0
2
4
6
8
10
12
14
16
0
1000
2000
3000
4000
5000
6000
7000
8000
1TB 2TB 3TB 4TB 5TB
Thro
ughp
ut (G
bps)
Com
plet
ion
Tim
e (s
ec)
Data Size
IP Analysis
IP Analysis_ripe
TCPStats
Webpop
UserBehavior
DDoS
IP Analysis
TCPStats
Webpop
UserBehavior
DDoS
2744
8
15
0
2
4
6
8
10
12
14
16
18
0500
100015002000250030003500400045005000
1TB 2TB 3TB 4TB 5TB
Thro
ughp
ut (G
bps)
Com
plet
ion
Tim
e (s
ec)
Data Size
IP AnalysisTCP AnalysisWebpopUserBehaviorDDoSIP AnalysisTCP AnalysisWebpopUserBehaviorDDoS
Packet: SizeUp (200 nodes)
29
SUMMARY
• NetFlow analysis with Hadoop • NetFlow v5 processing module • MapReduce algorithms: statistics
• Distributed computing and storage with Hadoop
• Fits Internet measurement application • Scalability
• Source codes are available at
• Packet, NetFlow • https://sites.google.com/a/networks.cnu.ac.kr/dnlab/researc
h/hadoop • https://github.com/ssallys/pcap-on-Hadoop
31
Summary
Ongoing Work • Distributed real-time
monitoring • Rule matching for
Streamed NetFlow • Developing rule for
MapReduce • Rule classification for
dedicated rule matching
• Integration • Streaming packages • Enhanced analytics
• Data mining: Mahout • Machine learning
32
RHive
Hive Pig
MapReduce
HDFS
Mahout
RHadoop
Rhipe
Performance
Productivity
• Scalable collection • E.g.) 10GE 10 X 1 GE
HDFS
• Papers 1. Y. Lee and Y. Lee, "Toward Scalable Internet Traffic
Measurement and Analysis with Hadoop," ACM SIGCOMM Computer Communication Review (CCR), Jan. 2013
2. Y. Lee, W. Kang, and Y. Lee, "A Hadoop-based Packet Trace Processing Tool," The Third TMA, April 2011
3. Y. Lee and Y. Lee, "Detecting DDoS Attacks with Hadoop", ACM CoNEXT Student Workshop, Dec, 2011
• Software 1. http://networks.cnu.ac.kr/~yhlee 2. https://sites.google.com/a/networks.cnu.ac.kr/dnlab/research/hadoop 3. https://github.com/ssallys/pcap-on-Hadoop
33
Reference
THANK YOU !
35
0%
20%
40%
60%
80%
100%
no 10 20 30 40 50 60 70 80 90 100
110
120
130
140
150
160
170
180
190
200
Usa
ge(
%)
Tics
IP Analysis
mem_usage
disk_read_usage
disk_write_usage
cpu_usage
net_in_usage
net_out_usage
0%
20%
40%
60%
80%
100%
1 19 37 55 73 91 109
127
145
163
181
199
217
235
253
271
289
307
325
343
361
379
397
Usa
ge(
%)
Tics
TCP Analysis
mem_usage
disk_read_usage
disk_write_usage
cpu_usage
net_in_usage
net_out_usage
36
Collector TaskTracker Datanode
Collector TaskTracker Datanode
Collector TaskTracker Datanode
LoadBalancer JobTracker NameNode
PF_RING
PF_RING ,RAID
DBMS
37
…
…
…
…
…
on - the - fly packets packets NetFlow record
each statistics
aggregated statistics
table records
visualized statistics
load balancer
network interface
PF_RING /RSS
Input Format
Mapper Reducer table records
visualized statistics
Internet
rule name; filter pattern; mapout key; patition&groupsort key;detection condition; action ex) port_scan;ip,proto=6;srcip,dstport;srcip;pkts=20- syn_flood;ip,proto=6,syn-fin=1-;srcip,dstip;srcip,dstip;syn-fin=6-