Hadoop Summit 2012 - Validated Network Architecture and Reference Deployment in Enterprise
Nimish Desai – [email protected]
Technical Leader, Data Center Group
Cisco Systems Inc.
Goal 1: Provide Reference Network Architecture for Hadoop in Enterprise
Goal 2: Characterize Hadoop Application on Network
Goal 3: Network Validation Results with Hadoop Workload
Session Objectives & Takeaways
Big Data in Enterprise
Validated 96 Node Hadoop Cluster
Network
Three Racks each with 32 nodes
Distribution Layer – Nexus 7000 or Nexus 5000
ToR – FEX or Nexus 3000
2 FEX per Rack
Each rack with 32 single- or dual-attached hosts
Hadoop Framework
Apache 0.20.2
Linux 6.2
Slots – 10 Maps & 2 Reducers per node
Compute – UCS C200 M2
Cores: 12; Processor: 2 x Intel(R) Xeon(R) CPU X5670 @ 2.93 GHz; Disk: 4 x 2 TB (7.2K RPM); Network: 1G: LOM, 10G: Cisco UCS P81E
Traditional DC Design – Nexus 55xx/2248: two Nexus 5548s with 2248TP-E FEX; Name Node (Cisco UCS C200, single NIC), Data Nodes 1-48 and 49-96 (Cisco UCS C200, single NIC)
Nexus 7K-N3K based Topology: two Nexus 7000s with Nexus 3000 ToR; Name Node (Cisco UCS C200, single NIC), Data Nodes 1-48 and 49-96 (Cisco UCS C200, single NIC)
Data Center Infrastructure (hierarchical design)
‒ WAN edge layer; core layer (LAN & SAN) – Nexus 7000 10 GE core
‒ Aggregation & services layer – Nexus 7000 10 GE aggregation, network services, vPC+ / FabricPath
‒ Access layer – Nexus 5500 10GE with Nexus 2148TP-E (bare metal), Nexus 5500 FCoE with Nexus 2232 ToR, Nexus 3000 ToR (10G, and 1G bare metal), Nexus 7000 end-of-row, CBS 31xx blade switch, B22 FEX for HP C-class blades, UCS FCoE
‒ SAN edge – MDS 9500 SAN director, MDS 9200/9100, FC SAN A / FC SAN B
‒ Server access – 1 GbE server access with 4/8 Gb FC via dual HBA (SAN A // SAN B), 10 Gb DCB/FCoE server access, or 10 GbE server access with 4/8 Gb FC via dual HBA; link types: Layer 3, Layer 2 1GE, Layer 2 10GE, 10 GE DCB, 10 GE FCoE/DCB, 4/8 Gb FC
Big Data Application Realm – Web 2.0 & Social/Community Networks
Data lives and dies in Internet-only entities
Data Domain Partially private
Homogeneous Data Life Cycle
Mostly Unstructured
Web Centric, User Driven
Unified workload – few processes & owners
Typically non-virtualized
Scaling & Integration Dynamics
Purpose Driven Apps
Thousands of nodes
Hundreds of PB and growing exponentially
[Diagram: UI, Service, Data store layers]
Big Data Application Realm – Enterprise: data lives in a confined zone of the enterprise repository
Long Lived, Regulatory and Compliance Driven
Heterogeneous Data Life Cycle
Many Data Models
Diverse data – Structured and Unstructured
Diverse data sources - Subscriber based
Diverse workload from many sources/groups/process/technology
Virtualized and non-virtualized with mostly SAN/NAS base
[Diagram: enterprise applications and repositories – Customer DB (Oracle/SAP), Social Media, ERP Module A, ERP Module B, Data Service, Sales Pipeline, Call Center, Product Catalog, Catalog Data, Video Conf, Collab/Office Apps, Records Mgmt, Doc Mgmt A, Doc Mgmt B, VoIP, Exec Reports]
Scaling & integration dynamics are different: data warehousing (structured) with diverse repositories plus unstructured data
Few hundred to a thousand nodes, a few PB
Integration, policy & security challenges
Each app/group/technology is limited in its data generation, consumption, and servicing of confined domains
Data Sources
Enterprise Application
Sales, Products, Process, Inventory, Finance, Payroll, Shipping, Tracking, Authorization, Customers, Profile
Machine logs, sensor data, call data records, web click-stream data, satellite feeds, GPS data, sales data, blogs, emails, pictures, video
Transactional Data | Business Intelligence | Analytics
[Diagram: data technologies – HBase, Oracle NoSQL DB, Cassandra, MongoDB, CouchDB, Redis, Membase, Neo4j; Hadoop MapReduce, MapR, GP MapReduce; row store and column store]
Big Data Building Blocks into the Enterprise
‒ Traditional database – RDBMS
‒ Storage – SAN and NAS
‒ "Big Data" – store and analyze
‒ "Big Data" – real-time capture, read and update operations – NoSQL
[Diagram: Big Data sources – sensor data, logs, social media, click streams, mobility trends, event data – feeding applications (virtualized, bare metal and cloud) over the Cisco Unified Fabric]
Hadoop Cluster Design & Reference Network Architecture
Characteristics that Affect Hadoop Clusters
Cluster Size
‒ Number of Data Nodes
Data Model & Mapper/Reducer Ratio
‒ MapReduce functions
Input Data Size
‒ Total starting dataset
Data Locality in HDFS
‒ Ability to process data where it is already located
Background Activity
‒ Number of jobs running, type of jobs, importing, exporting
Characteristics of Data Node
‒ I/O, CPU, Memory, etc.
Networking Characteristics
‒ Availability
‒ Buffering
‒ Data Node Speed (1G vs. 10G)
‒ Oversubscription
‒ Latency
http://www.cloudera.com/resource/hadoop-world-2011-presentation-video-hadoop-network-and-compute-architecture-considerations/
Data Ingest & Replication
‒ External connectivity
‒ East-west traffic (replication of data blocks)
Map Phase – raw data is analyzed and converted into name/value pairs
‒ The workload translates into multiple batches of Map tasks
‒ A Reducer can start the reduce phase ONLY after the entire Map set is complete
‒ Mostly an I/O/compute function
Hadoop Components and Operations
Hadoop Distributed File System – Unstructured Data
[Diagram: Map outputs grouped by key in the Shuffle phase and sent to the Reducers, which produce the result/output]
Shuffle Phase – all name/value pairs are sorted and grouped by their keys
‒ Mappers send the data to the Reducers
‒ High network activity
Reduce Phase – all values associated with a key are processed for results, in three phases
‒ Copy – get intermediate results from each data node's local disk
‒ Merge – reduce the number of files
‒ Reduce method
Output Replication Phase – the Reducer replicates the result to multiple nodes
‒ Highest network activity
Network activity is dependent on workload behavior
Hadoop Components and Operations
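To make the Map / Shuffle / Reduce flow above concrete, here is a minimal word-count style sketch for Hadoop Streaming in Python (the validated cluster ran Apache Hadoop 0.20.2, which includes the streaming contrib jar). The file name, the mode switch and the sample invocation are illustrative assumptions, not part of the validated setup.

#!/usr/bin/env python
# wordcount_streaming.py -- run as "mapper" or "reducer" under Hadoop Streaming (illustrative).
import sys

def mapper(stdin=sys.stdin, stdout=sys.stdout):
    # Map phase: convert raw text into name/value pairs (word, 1).
    for line in stdin:
        for word in line.strip().split():
            stdout.write("%s\t1\n" % word.lower())

def reducer(stdin=sys.stdin, stdout=sys.stdout):
    # Reduce phase: keys arrive sorted and grouped after the Shuffle; sum the counts per key.
    current_word, current_count = None, 0
    for line in stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current_word and current_word is not None:
            stdout.write("%s\t%d\n" % (current_word, current_count))
            current_count = 0
        current_word = word
        current_count += int(count)
    if current_word is not None:
        stdout.write("%s\t%d\n" % (current_word, current_count))

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["mapper"] else reducer()

A job like this would be launched with the streaming jar, e.g. hadoop jar hadoop-streaming.jar -input <in> -output <out> -mapper "wordcount_streaming.py mapper" -reducer "wordcount_streaming.py reducer" -file wordcount_streaming.py (paths illustrative); in 0.20.x the reducer count varied in the later slides is controlled by mapred.reduce.tasks.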
MapReduce Data Model – ETL & BI Workload Benchmark
The complexity of the functions used in Map and/or Reduce has a large impact on the job completion time and network traffic.
• Data set size varies across the phases – varying impact on the network, e.g. 1 TB input, 10 MB shuffle, 1 MB output
• Most of the processing is in the Map functions, with smaller intermediate and even smaller final data
Yahoo TeraSort – ETL Workload – Most Network Intensive (timeline: Map Start, Reducers Start, Map Finish, Job Finish)
• Input, Shuffle and Output data sizes are the same – e.g. a 10 TB data set in all phases
• Yahoo TeraSort has more balanced Map vs. Reduce functions – linear compute and I/O
Shakespeare WordCount – BI Workload (timeline: Map Start, Reducers Start, Map Finish, Job Finish)
ETL Workload (1TB Yahoo Terasort)
Network graph of all traffic received on a single node (80-node run); timeline markers: Maps Start, Reducers Start, Maps Finish, Job Complete.
The plotted symbols represent nodes sending traffic to HPC064; the red line is the total amount of traffic received by hpc064.
Shortly after the Reducers start, Map tasks are finishing and data is being shuffled to the Reducers. As the Maps completely finish, the network is no longer used, since the Reducers have all the data they need to finish the job.
Output Data Replication Enabled: with a replication factor of 3 (1 copy stored locally, 2 stored remotely), each reduce output is now replicated instead of just stored locally.
If output replication is enabled, the end of the TeraSort must store additional copies: for a 1 TB sort, 2 TB will need to be replicated across the network.
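A quick sanity check of that 2 TB figure (a sketch; it assumes HDFS writes one replica locally and sends the remaining replication-factor minus one copies over the network):

def replication_network_bytes(output_bytes, replication_factor=3):
    # One replica is written locally; the other (replication_factor - 1)
    # replicas traverse the network during the output replication phase.
    return output_bytes * (replication_factor - 1)

# 1 TB of TeraSort output with replication factor 3 -> ~2 TB across the network.
print(replication_network_bytes(1 * 10**12) / 10**12)  # 2.0 (TB)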
ETL Workload (1TB Yahoo Terasort)
Network activity of all traffic received on a single node (80-node run)
BI Workload
WordCount on 200K copies of the complete works of Shakespeare – network graph of all traffic received on a single node (80-node run); timeline markers: Maps Start, Reducers Start, Maps Finish, Job Complete.
The red line is the total amount of traffic received by hpc064; the plotted symbols represent nodes sending traffic to HPC064.
Due to the combination of the length of the Map phase and the reduced data set being shuffled, the network is utilized throughout the job, but only by a limited amount.
Data Locality in HDFS
Data Locality – the ability to process data where it is locally stored.
Note: During the Map phase, the JobTracker attempts to use data locality to schedule map tasks where the data is locally stored. This is not perfect and depends on which data nodes hold the data. It is a consideration when choosing the replication factor: more replicas tend to create a higher probability of data locality.
Timeline markers: Maps Start, Reducers Start, Maps Finish, Job Complete.
Observations: notice the initial spike in RX traffic before the Reducers kick in. It represents data that each map task needs but that is not local. Looking at the spike, it is mainly data from only a few nodes.
Map tasks: initial spike for non-local data – sometimes a task is scheduled on a node that does not have the data available locally.
1 TB file with 128 MB blocks == 7,813 Map Tasks
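That map-task count is simply the input size divided by the HDFS block size, rounded up; the sketch below reproduces the slide's arithmetic (decimal MB/TB assumed, since 10^12 / (128 x 10^6) rounds up to 7,813):

import math

def map_task_count(input_bytes, block_bytes):
    # One map task is scheduled per HDFS block of the input file.
    return math.ceil(input_bytes / block_bytes)

# Decimal units to match the slide's figure; binary units (TiB/MiB) would give 8192.
print(map_task_count(1 * 10**12, 128 * 10**6))  # 7813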
The job completion time is directly related to the number of reducers.
Average network buffer usage lowers as the number of reducers gets lower (see hidden slides), and vice versa.
Map to Reducer Ratio Impact on Job Completion
[Charts: total job completion time in seconds vs. number of reducers (192, 96, 48, 24, 12, 6)]
Job Completion Time with 96 Reducers
Job Completion Time with 48 Reducers
Job Completion Graph with 24 Reducers
Network Characteristics
The relative impact of various network characteristics on Hadoop clusters*: Availability, Buffering, Oversubscription, Data Node Speed, Latency
* Not scaled or measured data
Validated Network Reference Architecture
Data Center Access Connectivity
‒ Core/Distribution – Nexus 7000 (LAN) and MDS 9000 (SAN)
‒ Unified Access Layer – Nexus 5000
‒ Nexus 2000 – 1GE rack mount servers
‒ Nexus 2000 – 10GE rack mount servers
‒ Nexus 4000 – 10GE blade switch w/ FCoE (IBM/Dell)
‒ Cisco UCS – UCS compute blade & rack
‒ Nexus 2000 – 1 & 10GE blade servers w/ pass-thru
‒ Direct attach 10GE – 10GE rack mount servers
‒ Nexus 1000V
Network Attributes: Architecture, Availability, Capacity, Scale & Oversubscription, Flexibility, Management & Visibility
Network Reference Architecture
[Diagram: Edge/Access Layer – Nexus LAN and SAN Core: Optimized for the Data Centre]
Scaling the Data Centre Fabric – Changing the Device Paradigm
‒ De-coupling of the Layer 1 and Layer 2 topologies
‒ Simplified management model, plug-and-play provisioning, centralized configuration
‒ Line card portability (N2K supported with multiple parent switches – N5K, 6100, N7K)
‒ Unified access for any server (100M / 1GE / 10GE / FCoE): scalable Ethernet, HPC, unified fabric or virtualization deployment
(Virtualized Switch)
Integration with the Enterprise architecture – essential pathway for data flow
‒ Integration
‒ Consistency
‒ Management
‒ Risk assurance
Enterprise-grade features
‒ Consistent operational model – NX-OS, CLI, fault behavior and management
‒ Though Hadoop drives higher east-west bandwidth than traditional transactional networks, over time the cluster will have multi-user, multi-workload behavior
‒ Needs enterprise-centric features – security, SLA, QoS, etc.
Hadoop Network Topologies – Reference: Unified Fabric & ToR DC Design
‒ 1 Gbps attached server: Nexus 7000/5000 with 2248TP-E, or Nexus 7000 and 3048
‒ NIC teaming, 1 Gbps attached: Nexus 7000/5000 with 2248TP-E, or Nexus 7000 and 3048
‒ 10 Gbps attached server: Nexus 7000/5000 with 2232PP, or Nexus 7000 and 3064
‒ NIC teaming, 10 Gbps attached server: Nexus 7000/5000 with 2232PP, or Nexus 7000 and 3064
Validated Reference Network Topology
(Same validated 96-node cluster, Hadoop framework, compute and network topologies as described in the "Validated 96 Node Hadoop Cluster" section above: the traditional DC design with Nexus 55xx/2248TP-E and the Nexus 7K-N3K based topology.)
Network Attributes: Architecture, Availability, Capacity, Scale & Oversubscription, Flexibility, Management & Visibility
Network Reference Architecture Characteristics
The core High Availability design principles are common across all network systems designs
‒ Understand the causes of network outages: component failures, network anomalies
‒ Understand the engineering foundations of systems-level availability: device- and network-level MTBF
‒ Understand hierarchical and modular design
‒ Understand the HW and SW interaction in the system
Enhanced vPC allows such a topology and is ideally suited for Big Data applications: in an Enhanced vPC (EvPC) configuration, any and all server NIC teaming configurations are supported on any port.
High Availability Switching Design – Common High Availability Engineering Principles
System high availability is a function of topology and component-level high availability.
[Diagram: NIC teaming options – Dual NIC Active/Standby, Dual NIC 802.3ad, Single NIC; L3 dual node, L2 dual node, full mesh, ToR dual node]
Availability with Single Attached Server – 1G or 10G
It is important to evaluate the overall availability of the system.
Network failures can span many nodes in the system, causing rebalancing and decreased overall resources.
Typically multiple TB of data transfer occur for a single ToR or FEX failure.
Load sharing, ease of management and consistent SLA are important to enterprise operation.
Failure Domain Impact on Job Completion
A 1 TB TeraSort typically takes ~4:20-4:30 (minutes:seconds).
A failure of a SINGLE NODE (either NIC or server component) results in roughly doubling the job completion time.
Key observation: the failure impact depends on the type of workload being run on the cluster
‒ Short-lived interactive vs. short-lived batch
‒ Long jobs – ETL, normalization, joins
Single NIC, 32 nodes per ToR
‒ No single point of failure from the network viewpoint; no impact on job completion time
NIC bonding configured in Linux – with the LACP mode of bonding
‒ Effective load-sharing of traffic flows across the two NICs
‒ Recommended to change the hashing to src-dst-ip-port (on both the network and the NIC bonding in Linux) for optimal load-sharing; see the sketch below
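To illustrate why including the L4 ports in the hash improves load-sharing, here is a toy sketch of per-flow link selection across a two-link bond. The hash function, field ordering and addresses are illustrative only; they are not the actual NX-OS port-channel or Linux bonding hash.

import hashlib

def pick_link(src_ip, dst_ip, src_port, dst_port, num_links=2):
    # Flow-based hashing: every packet of a flow stays on one link (no reordering),
    # while different flows between the same two hosts can use different links
    # because the TCP ports are part of the hash input.
    key = "%s-%s-%d-%d" % (src_ip, dst_ip, src_port, dst_port)
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_links

# Two shuffle flows between the same pair of data nodes can land on different links:
print(pick_link("10.0.1.11", "10.0.1.42", 50010, 35001))
print(pick_link("10.0.1.11", "10.0.1.42", 50010, 35002))
# A hash on source/destination IP alone would pin both flows to the same link.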
Availability – Single Attached vs. Dual Attached Node
Failure of various components; failures introduced at 33%, 66% and 99% of reducer completion.
A singly attached server (NIC) failure and a rack failure have a bigger impact on job completion time than any other failure.
A FEX failure is a rack failure for the 1G topology.
Job Completion Time with Various Failures
Availability – Network Failure Results – 1 TB TeraSort (ETL)
Failure Point                1G Single Attached     2G Dual Attached
Peer Link (Nexus 5000)       301                    258
FEX *                        1137                   259
Rack *                       1137                   1017
1 Port – Single Attached     see previous slide     see previous slide
1 Port – Dual Attached       see previous slide     see previous slide
(Job completion time in seconds)
* Variance in run time with % of reducers completed
[Diagram: 96 nodes across Racks 1-3, 2 FEX (ToR A/B) per rack]
Network Attributes: Architecture, Availability, Capacity, Scale & Oversubscription, Flexibility, Management & Visibility
Network Reference Architecture Characteristics
Hadoop is a parallel, batch-oriented job framework.
The primary benefit of Hadoop is the reduction in job completion time that would otherwise take longer with traditional techniques, e.g. large ETL, log analysis, join-only Map jobs.
Typically oversubscription occurs more with 10G server access than with 1G server access.
A non-blocking network is NOT needed; however, the degree of oversubscription matters for:
‒ Job completion time
‒ Replication of results
‒ Oversubscription during a rack or FEX failure
Static vs. actual oversubscription: how much data a single node can push is often bound by I/O and the number of disks configured.
Oversubscription Design
Uplinks    Theoretical Oversubscription (16 Servers)    Measured
8          2:1                                          next slides
4          4:1                                          next slides
2          8:1                                          next slides
1          16:1                                         next slides
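The theoretical column above is simply total host-facing bandwidth divided by total uplink bandwidth; a small sketch reproducing it (assuming 16 x 10GE-attached servers per ToR/FEX and 10GE uplinks, which are assumptions consistent with the table rather than stated values):

def oversubscription(servers=16, server_gbps=10, uplinks=8, uplink_gbps=10):
    # Ratio of total host-facing bandwidth to total uplink bandwidth.
    return (servers * server_gbps) / (uplinks * uplink_gbps)

for uplinks in (8, 4, 2, 1):
    print("%d uplinks -> %g:1" % (uplinks, oversubscription(uplinks=uplinks)))
# 8 uplinks -> 2:1, 4 -> 4:1, 2 -> 8:1, 1 -> 16:1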
Measured scenarios: steady state; result replication with 1, 2, 4 & 8 uplinks; rack failure with 1, 2, 4 & 8 uplinks.
Network Oversubscription
Data Node Speed Differences – 1G vs. 10G (TCPDUMP of Reducers TX)
• Generally 1G is used largely due to the cost/performance trade-offs, though 10GE can provide benefits depending on the workload.
• Reduced spikes with 10G and a smoother job completion time.
• Multiple 1G or 10G links can be bonded together to not only increase bandwidth, but also increase resiliency.
[Chart: buffer cell usage over job completion – series: 1G Buffer Used, 10G Buffer Used, 1G Map %, 1G Reduce %, 10G Map %, 10G Reduce %]
1GE vs. 10GE Buffer Usage
Moving from 1GE to 10GE actually lowers the buffer requirement at the switching layer.
By moving to 10GE, the data node has a wider pipe to receive data lessening the need for buffers on the network as the total aggregate transfer rate and amount of data does not increase substantially. This is due, in part, to limits of I/O and Compute capabilities
Network Attributes: Architecture, Availability, Capacity, Scale & Oversubscription, Flexibility, Management & Visibility
Network Reference Architecture Characteristics
Multi-use Cluster Characteristics
Hadoop clusters are generally multi-use; the effect of background use can affect any single job's completion.
Example View of 24-Hour Cluster Use
A large ETL job overlaps with medium and small ETL jobs and many small BI jobs (blue lines are ETL jobs and purple lines are BI jobs).
Importing data into HDFS
Importing Data into HDFS
A given Cluster, running many different types of Jobs, Importing into HDFS, Etc.
100 Jobs, each with a 10 GB Data Set – Stable, Node & Rack Failure
• Almost all jobs are impacted by a single node failure
• With multiple jobs running concurrently, a node failure's impact is as significant as a rack failure
Network Attributes: Architecture, Availability, Capacity, Scale & Oversubscription, Flexibility, Management & Visibility
Network Reference Architecture Characteristics
Burst Handling and Queue Depth
• Several HDFS operations and phases of MapReduce jobs are very bursty in nature
• The extent of bursts largely depend on the type of job (ETL vs. BI)
• Bursty phases can include replication of data (either importing into HDFS or output replication) and the output of the mappers during the shuffle phase.
A network that cannot handle bursts effectively will drop packets, so optimal buffering is needed in network devices to absorb bursts.
Optimal Buffering
• Given a large enough incast, TCP will collapse at some point no matter how large the buffer
• Well studied by multiple universities
• Alternate solutions (changing TCP behavior) have been proposed rather than huge-buffer switches
http://simula.stanford.edu/sedcl/files/dctcp-final.pdf
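As a rough illustration of why shuffle incast stresses access-layer buffers, the sketch below estimates the burst arriving at a single reducer-facing port when many map outputs are fetched at once. The sender count and fetch size are assumptions for illustration, not measured values from this validation.

def incast_burst_bytes(senders, bytes_per_fetch):
    # Worst case: every sender's fetch response arrives in the same window,
    # so the egress port toward the reducer must queue nearly all of it.
    return senders * bytes_per_fetch

# Illustrative numbers: 80 map outputs x 256 KB fetched in parallel ~= 20 MB
# hitting one port, which is why a shared buffer in the tens of MB
# (the 2248TP-E's 32 MB, discussed next) matters for bursty phases.
print(incast_burst_bytes(80, 256 * 1024) / 2**20)  # ~20.0 MB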
Nexus 2248TP-E Buffer Monitoring
‒ The Nexus 2248TP-E utilizes a 32 MB shared buffer to handle larger traffic bursts; Hadoop, NAS and AVID are examples of bursty applications
‒ You can control the queue limit for a specified Fabric Extender for egress (network to host) or ingress (host to network)
Extensive drop counters
‒ Drop counters for both directions (network to host and host to network) on a per-host-interface basis
‒ Drop counters per reason: out-of-buffer drop, no-credit drop, queue-limit drop (tail drop), MAC error drop, truncation drop, multicast drop
Buffer occupancy counter
‒ How much buffer is being used – one key indicator of congestion or bursty traffic
N5548-L3(config-fex)# hardware N2248TPE queue-limit 4000000 rx
N5548-L3(config-fex)# hardware N2248TPE queue-limit 4194304 tx
fex-110# show platform software qosctrl asic 0 0
TeraSort FEX(2248TP-E) Buffer Analysis (10TB)
[Chart: FEX buffer cell usage over job completion time – series: FEX #1, FEX #2, Map %, Reduce %]
The buffer utilization is highest during the shuffle and output replication phases.
Optimized buffer sizes are required to avoid packet loss leading to slower job completion times.
Buffer Usage During Shuffle Phase
Buffer Usage During output Replication
TeraSort(ETL) N3k Buffer Analysis (10TB)
[Chart: buffer cell usage over job completion time – series: N3048 #1, N3048 #2, N3064, Map %, Reduce %]
• The buffer utilization is highest during the shuffle and output replication phases.
• Optimized buffer sizes are required to avoid packet loss leading to slower job completion times.
The Aggregation switch buffer remained flat as the bursts were absorbed at the Top of Rack layer
Buffer Usage During Shuffle Phase
Buffer Usage During output Replication
Network Latency
Generally network latency, while consistent latency is important, does not represent a significant factor for Hadoop clusters.
Note: There is a difference between network latency and application latency. Optimization in the application stack can decrease application latency, which can potentially have a significant benefit.
[Chart: completion time (sec) vs. data set size (1 TB, 5 TB, 10 TB) for the N3K topology and the 5k/2k topology, 80-node cluster]
Summary – Extensive Validation of Hadoop Workload
Reference architecture – make it easy for the enterprise
‒ Demystify the network for Hadoop deployment
‒ Integration with the enterprise through efficient choices of network topology/devices
10G and/or dual-attached servers provide consistent job completion times & better buffer utilization
‒ 10G reduces bursts at the access layer
A single-attached node failure has considerable impact on job completion time
‒ A dual-attached server is the recommended design – 1G or 10G; 10G for future-proofing
Rack failure has the biggest impact on job completion time
Does not require a non-blocking network
‒ The degree of oversubscription does impact job completion time
Latency does not matter much for Hadoop workloads
Big Data @ Cisco
Cisco.com Big Data: www.cisco.com/go/bigdata
Certifications and Solutions with UCS C-Series and Nexus 5500 + 22xx
• EMC Greenplum MR Solution
• Cloudera Hadoop Certified Technology
• Cloudera Hadoop Solution Brief
• Oracle NoSQL Validated Solution
• Oracle NoSQL Solution Brief
Multi-month network and compute analysis testing (in conjunction with Cloudera)
• Network/Compute Considerations Whitepaper
• Presented analysis at Hadoop World
• 128-node / 1 PB test cluster