HPE Reference Configuration for networking best practices ......Hadoop is an ecosystem of several...

HPE Reference Configuration for networking best practices on the HPE Elastic Platform for Big Data Analytics (EPA) Hadoop ecosystem

Reference Architecture

http://www.hpe.com


Contents Executive summary ................................................................................................................................................................................................................................................................................................................................ 3 Introduction ................................................................................................................................................................................................................................................................................................................................................... 3 Solution overview ..................................................................................................................................................................................................................................................................................................................................... 5

Traditional Balanced and Data Optimized (BDO) network ......................................................................................................................................................................................................................... 5 Workload and Density Optimized (WDO) elastic Hadoop network .................................................................................................................................................................................................... 8

Solution components ............................................................................................................................................................................................................................................................................................................................ 9 HPE FlexFabric 1G/10G/25G/40G/100G Switches ..................................................................................................................................................................................................................................... 16 HPE FlexFabric 5950 32QSFP28 Switch (JH321A) ................................................................................................................................................................................................................................... 17 HPE FlexFabric 5940 Switch Series ............................................................................................................................................................................................................................................................................. 17 HPE FlexFabric 5900AF 48G 4XG 2QSFP+ Switch ..................................................................................................................................................................................................................................... 18 Mellanox Spectrum SN2100 16-port 100GbE Switch ............................................................................................................................................................................................................................... 18 Arista 7050X Series 48-port 10GbE /4-port 40GbE Switch ................................................................................................................................................................................................................ 19

Best practices and configuration guidance .................................................................................................................................................................................................................................................................. 20 Dual switch configuration using Intelligent Resilient Fabric (IRF) .................................................................................................................................................................................................... 20 Splitting and combining 10GbE, 25GbE, 40GbE and 100GbE ports ........................................................................................................................................................................................... 22 QSFP to SFP adapter ................................................................................................................................................................................................................................................................................................................. 24 Spanning Tree Protocol (STP) for HPE and Cisco interoperability ................................................................................................................................................................................................ 24 Setting up multi-active detection (MAD) with IRF.......................................................................................................................................................................................................................................... 25 Enable tagged VLANs on HPE Network switches .......................................................................................................................................................................................................................................... 27 Hadoop server network configuration ....................................................................................................................................................................................................................................................................... 29 HPE Elastic Platform for Big Data Analytics (EPA) for Hadoop data center connectivity......................................................................................................................................... 32

Summary ...................................................................................................................................................................................................................................................................................................................................................... 36 Appendix A: Bill of materials ...................................................................................................................................................................................................................................................................................................... 37 Appendix B: HPE 5940 Switch configuration ............................................................................................................................................................................................................................................................ 39 Appendix C: Configuring the Mellanox adapter ....................................................................................................................................................................................................................................................... 41 Appendix D: Glossary of terms ................................................................................................................................................................................................................................................................................................ 42 Resources and additional links ................................................................................................................................................................................................................................................................................................ 43


Executive summary As organizations strive to identify and realize the value in big data, many now seek more agile and capable analytic systems. Some of the business drivers are to improve customer retention, increasing operational efficiencies, influence product development and quality, and to gain a competitive advantage.

Apache Hadoop is a software framework that is being adopted by many enterprises as a cost-effective analytics platform for big data analytics. Hadoop is an ecosystem of several services rather than a single product, and is designed for storing and processing Petabytes of data in a linear scale-out model. Each service in the Hadoop ecosystem may use different technologies in the processing pipeline to ingest, store, process, and visualize data.

Enterprises are looking to accelerate their analytics workloads to drive real-time use cases such as fraud detection, recommendation engines, and advertising analytics. The separation of processing from resource management with Apache Hadoop YARN, has driven new architecture frameworks for big data processing. These frameworks are designed to extend Hadoop beyond the often I/O intensive, high latency MapReduce batch analytics workloads, to also address real-time and interactive analytics. Technologies like Spark, NoSQL, and Kafka are critical components of these new frameworks to unify batch, interactive, and real time big data processing. These technologies typically have different capacity and performance requirements to scale out processing, and need a variety of options for compute, storage, memory, and networking

One of the key design considerations for Hadoop, was the ability to scale-out processing across 100s to 1000s of nodes built using commodity hardware, usually with 2-socket Intel® processors, 10-12 1TB SATA drives and 1Gb networking. Modern computing now offers larger drives, faster processors and faster networking at almost the same cost. With the explosion of data over the last 5-10 years, and the need to ingest and process this data in a timely manner, organizations are looking to accelerate their analytics applications and data processing pipelines by exploiting the advances in modern computing.

This white paper provides guidance on network architecture and design This document also discusses the different types of link aggregation protocols that HPE leverages in its networking products to help meet the network resiliency needs of your network and business applications.

Target audience: The intended audience of this document includes, but is not limited to network specialists, IT managers, solution architects, sales engineers, services consultants, partner engineers and customers that are interested in configuring the network for their Hadoop clusters using servers and switches from Hewlett Packard Enterprise. The document also provides options for integration with partner and third-party switches from Arista, Mellanox, and Cisco.

Document purpose: This document provides recommendations on network design and configuration recommendations for HPE’s Standard, Balanced and Data Optimized (BDO) and Workload and Density Optimized (WDO) system based Hadoop deployments. Based on in-house testing and learnings from customer deployments, our network design and configurations enable big data analytics workloads using HPE servers, HPE and third-party network equipment such as Cisco.

This document describes solution testing performed in December 2016.

Disclaimer: Products sold prior to the separation of Hewlett-Packard Company into Hewlett Packard Enterprise Company and HP Inc. on November 1, 2015 may have a product name and model number that differ from current models.

Introduction Since being founded in 2006 Hadoop has proven to be cost-effective at solving the challenge of storing and processing big data at scale, by distributing the workload and data across a cluster running commodity hardware. Hadoop contributors favored data locality, i.e., bringing the compute to where the data was stored, by collocating compute and storage in the same server. Additional identical servers could be added for more compute and storage, and Hadoop workloads could scale-out linearly across 100s to 1000s of servers. This is what many consider to be a traditional, symmetric architecture where each server is configured with identical compute, storage, and memory. As new more compute and memory intensive frameworks like Spark, or low-latency stores like NoSQL are used along with Hadoop, organizations are faced with the challenge of ensuring that capacity expansion is more efficient in utilizing the available compute and storage.

Hewlett Packard Enterprise has recognized the challenges and limitations of deploying Hadoop in the traditional manner, and has introduced the HPE Elastic Platform for Big Data Analytics (EPA) for Hadoop to address some of these challenges. HPE’s EPA provides a modular foundation for deploying Hadoop and other big data workloads. This solution solves the common challenges many customers face through a robust, yet flexible, offering that enables organizations to maximize the performance, infrastructure and analytics abilities of Hadoop on an enterprise-grade, trusted and proven Hewlett Packard Enterprise solution that can scale with their evolving business needs.


The HPE Elastic Platform for Big Data Analytics (EPA) is a modular infrastructure foundation to accelerate business insights and enable organizations to rapidly deploy, efficiently scale and securely manage the explosive growth in volume, speed and variety of big data workloads. Hewlett Packard Enterprise supports two different deployment models under this platform:

HPE Balanced and Data Optimized (BDO) system supports conventional Hadoop deployments that scale compute and storage together, with some flexibility in choice of memory, processor and storage capacity. This is primarily based on the HPE ProLiant DL380 server platform, with density optimized variants using the HPE Apollo 4200 and Apollo 4530 servers.

HPE Workload and Density Optimized (WDO) system harnesses the power of faster Ethernet networks to independently scale compute and storage using a building block approach, and lets you consolidate your data and workloads growing at different rates. The base HPE WDO system uses the HPE Apollo 4200 as a storage block and the HPE Apollo 2000 as a compute block. Additional building blocks such as HPE Moonshot, can be layered on top of the base configuration to target different workloads and requirements.

Organizations that are already invested in balanced systems have the option of consolidating their existing deployments to a more elastic platform with the HPE Workload and Density Optimized (WDO) system.

For more information on how to configure Hadoop clusters in Balanced and Data Optimized (BDO) traditional cluster and Workload and Density Optimized (WDO) Hadoop cluster with separate storage and compute blocks, refer to http://h20195.www2.hpe.com/V2/GetDocument.aspx?docname=4AA6-8931ENW

The acquisition and analysis of data and its subsequent transformation into actionable insight is a complex workflow which extends beyond data centers, to the edge, and into the cloud in a seamless hybrid environment. The utilization of edge devices, in situ-computation and analysis, centralized storage and analysis, and deep learning methodologies which accelerate data processing at scale requires a new technological approach. Big data is collected from multiple sources and is stored and processed in data centers. Big data analytics usually use distributed frameworks like MapReduce and Spark to achieve scalability.

Data processing in those distributed frameworks like Hadoop and Spark consists of multiple computational stages, e.g., map and reduce in the MapReduce framework. Between the stages, massive amounts of data need to be shuffled and transferred among servers. The servers usually communicate in an all-to-all manner, which requires high bisection bandwidth in data center networks. Most data center networks are oversubscribed and often became a bottleneck for these applications, with data transfers accounting for a significant portion of the running time of these workloads. Other operations that can impact the network are cluster rebalancing, which re-balances the data across multiple nodes, recovery from failure of a single or multiple nodes, data ingestion and replication.

HPE Networking solutions HPE Networking solutions enable the following:

• Breakthrough cost reductions by converging and consolidating server, storage, and network connectivity onto a common fabric with a flatter topology and fewer switches than the competition.

• Predictable performance and low latency for bandwidth-intensive server-to-server communications.

• Improved business agility, faster time to service, and higher resource utilization by dynamically scaling capacity and provisioning connections to meet virtualized application demands by leveraging technologies.

• Removal of costly, time-consuming, and error-prone change management processes by utilizing HPE Intelligent Resilient Fabric (IRF) to allow multiple devices to be managed using a single configuration file from a single, easy-to-manage virtual switch operating across network layers.

• Modular, scalable, industry standards-based platforms and multi-site, multi-vendor management tools to connect and manage thousands of physical resources.

While the standard network used in modern Hadoop deployments is based on 10GbE networking, our WDO system is configured with 40GbE networking for high speed data transfer between compute and storage blocks. With the advent of 25GbE and 100GbE networking, redundant pairs of 10/25GbE network switches provide high speed, low latency connectivity for servers offering full redundancy, performance (active-active), and exceptional scalability for large numbers of nodes, typical in big data clusters, at costs that are comparable to 10GbE networking. As Hadoop is rack aware and the relationship between which servers are in which racks and which switch supports them is important, will be able to better distribute data and ensure that the copies of the data are distributed across different servers in different racks supported by top of rack switches thus preventing data loss in case of failures.

http://h20195.www2.hpe.com/V2/GetDocument.aspx?docname=4AA6-8931ENW


For new deployments consider installing NIC cards that support 10/25GbE so that you will be able to upgrade your environment to support 25/100GbE switches if you are not planning to invest at present, but might consider in the future. Thus minimizing the downtime for the upgrade of the cluster.

For this white paper, the following are the network definitions used throughout:

• Hadoop data network – The Hadoop data network carries the bulk of the traffic within the cluster. Dual bonded connections with active load balancing are used from each node. This provides increased bandwidth, and redundancy when a cable or switch fails. The Hadoop data network in this paper refers to the data nodes network in traditional BDO architecture and also refers to the network for storage and compute nodes in WDO architecture. The data nodes represented in figures 1-9 with traditional BDO architecture, can be interpreted as storage and compute nodes when referring to WDO architecture. The network configuration and recommendations are the same in both the architectures.

• Operations network – The operations network is used to provide cluster management and provisioning capabilities. It is aggregated into operations 1GbE switch in each rack. It is connected to NIC1 Embedded 1GbE network adapter on the servers. This network will be used in shared iLO configuration as iLO will use NIC1 for its traffic in this configuration.

• iLO network – The iLO network provides access to the iLO ports on the servers. It also provides access to the management ports of the cluster switches.

• Edge network – The edge network provides connectivity from the edge nodes to an existing customer network, either directly, or via cluster aggregation switches. Edge nodes are needed when multi-homing and that means different IP subnets in this context, not necessarily different network adapters unless the edge and data networks are completely isolated or separate. You can achieve the same thing with a single adapter configured with two IP addresses on it (on separate VLANs or not).

Note You can share all networks data/ops/ilo/edge in order to save on switching, cabling and space.

Solution overview Traditional Balanced and Data Optimized (BDO) network Figure 1 shows a traditional Hadoop architecture with storage (HDFS) and compute (YARN, Spark and others) on the same server. This design is primarily based on the HPE ProLiant DL380 server platform, with density optimized variants using HPE Apollo 4200 servers. The operations and iLO networks can be on the same network (shared switch and same VLAN) or on separate VLANs still using iLO dedicated ports. The switch has to be set up with appropriate PVIDs on the 1GbE switch connected to operations and iLO network.


Figure 1. Traditional BDO network architecture with separate operations and iLO networks

Table 1 provides the recommended switches for each network.

Table 1. Network configuration for traditional BDO architecture

Logical network Connection Switch

Hadoop data network

Bonded 10GbE

Bonded 25GbE

Bonded 40GbE

Bonded 100GbE

Dual top of rack switches and aggregation switches

(2) HPE FlexFabric 5940-48XGT-6QSF28 Switch or

(2) HPE FlexFabric 5940-48XGT-6QSFP+ Switch

(2) HPE FlexFabric 5950 48SFP28 8QSFP28 Switch

(2) HPE FlexFabric 5940 32QSFP+ Switch

(2) HPE FlexFabric 5950 32QSFP28 Switch

Operations network 1GbE Dedicated switch per rack or share with iLO network

(1) 5900AF 48G 4XG 2QSFP+ (JG510A) 52 port switch

iLO network 1GbE Share with operations network switch

(1) 5900AF 48G 4XG 2QSFP+ (JG510A) 52 port switch

Edge network

Bonded 10GbE

Bonded 25GbE

Bonded 40GbE

Bonded 100GbE

Direct to edge network or aggregation switch

(2) HPE FlexFabric 5940-48XGT-6QSF28 Switch or

(2) HPE FlexFabric 5940 48XGT 6QSFP+ Switch


(2) HPE FlexFabric 5940 32QSFP+ Switch

(2) HPE FlexFabric 5950 32QSFP28 Switch


Figure 2 shows the operations and iLO network sharing NIC1 on the server. iLO is configured in shared iLO mode so that the iLO network traffic will use NIC1 embedded 1GbE network adapter on the server. The iLO network can be on the same VLAN or separate VLAN. This will reduce the number of ports used on the switch. One can use this configuration and connect NIC1 to the same switch as the Hadoop data network switch thus reducing additional 1GbE switch requirement.

Figure 2. Hadoop cluster with shared operations and iLO network on 1GbE network switch


Workload and Density Optimized (WDO) elastic Hadoop network The HPE Workload and Density Optimized (WDO) system harnesses the power of faster Ethernet networks to independently scale compute and storage using a building block approach, and lets you consolidate your data and workloads growing at different rates. The base HPE WDO system uses the HPE Apollo 4200 as a storage block and the HPE Apollo 2000 as a compute block. Figure 3 shows a Hadoop network with workload and density optimized systems, with separate storage (HDFS) and compute (YARN, Spark and others) on a different platform with density optimized Apollo 4200 Gen9 servers for storage given the additional storage density (28 LFF disk drives) as it is important to have dual high bandwidth network connectivity (25GbE/40GbE/100GbE) to eliminate network bottlenecks, and Apollo 2000 for compute.

Figure 3. Hadoop WDO network architecture with separate operations and iLO networks

Table 2 provides the recommended switches for each network.

Table 2. Network configuration for WDO architecture for Hadoop network

Block Type Workload and Density Optimized (WDO) Connection Switch

Control block (includes edge node)

(4) HPE ProLiant DL360 Gen9

Apollo 2000 with (4) XL170r Gen9

Bonded 10GbE

Bonded 25GbE

Bonded 40GbE

Bonded 100GbE

(2) HPE FlexFabric 5940 48SFP+ 6QSFP28 Switch, or



(2) HPE-FlexFabric 5940 32QSFP+ Switch

(2) HPE-FlexFabric 5950 32QSFP28 Switch

Operations network 1GbE network with built-in NIC0 for provisioning and operations.

1GbE Dedicated switch per rack or share with iLO network

(1) 5900AF 48G 4XG 2QSFP+ (JG510A) 52 Port Switch

iLO network For connecting server iLOs 1GbE Share the switch with operations network

(1) 5900AF 48G 4XG 2QSFP+ (JG510A) 52 Port Switch

Worker node block

Apollo 2000 with (4) XL170r Gen9

(Compute nodes)

Apollo 4200 Gen9 (Storage nodes)

Bonded 10GbE

Bonded 25GbE

Bonded 25GbE

Bonded 40GbE

Bonded 100GbE

(2) HPE FLEXFABRIC 5940 48SFP+ 6QSFP28 Switch, or




(2) HPE-FlexFabric 5940 32QSFP+ Switch

(2) HPE-FlexFabric 5950 32QSFP28 Switch


Solution components Hadooop single rack configuration Figure 4 shows a single rack configuration of HPE Elastic Platform for Big Data Analytics (EPA). The HPE EPA servers are configured with two 10GbE network ports with FlexibleLOM support (561FLR-T 2p 10Gb Ethernet adapter) which allows HPE Integrated Lights-Out 4 (iLO 4) management traffic to be shared on NIC1 of the server. Two HPE 5940-48XGT-6QSFP28 switches are configured with IRF. NIC1 from each server is connected to switch 1 (SW1 – HPE 5940-48XGT-6QSFP28 Ethernet switch) and NIC2 from each server is connected to switch 2 (SW2 – HPE 5940-48XGT-6QSFP28 Ethernet switch) as shown in Figure 2. The two 10GbE NICs are configured as a bonded pair and trunked to the HPE switches.

Edge network

TOR 1GbE switch

TOR 10GbE switch

Edge node

Master node

Data node

OPs/ iLO network Hadoop network

Figure 4. HPE Elastic Platform for Big Data Analytics (EPA) for Hadoop single rack configuration


Hadoop dual rack configuration – sharing a single switch Figure 5 shows an expanded Hadoop cluster with two rack units using the switch on the primary rack and connecting the second rack servers to the primary rack switch running network cables from the primary rack to the second rack.

Note In the multi rack configuration, it is recommended to move one of the master nodes to the expansion racks for high availability of the services. Also add additional edge nodes on expansion racks so that you can ingest data into the cluster as shown in Figure 5 and other multi rack designs.

TOR 1GbE switch

TOR 10GbE switch

Edge node

Master node

Data node

OPs/ iLO network Edge networkEdge network

Hadoop network

Rack #1 Rack #2

Figure 5. HPE Elastic Platform for Big Data Analytics (EPA) for Hadoop two rack configuration sharing the primary switch on rack #1


Figure 6 shows the switch network port assignments in a two rack configuration using a single switch with 48 10GbE ports and additional ports are used for uplinks or customer network. Two ports are used for IRF configuration which makes the two physical switches into one logical switch. Thus the two ports connected from the servers to the switch can be used for bonding.

4847343332311817 1/10GBASE-T161521 Green=10Gbps,Yellow=1Gbps

49

50

51

53

52

54

SYS

QSFP+ Green=40Gbps,Yellow=10Gbps HPE FlexFabric 5940 Series


49

50

51

53

52

54

SYS


IRF

Uplinks(49-52)

IRF(53-54)

ManagementNetwork(39-42)

Hadoop Network(17-34)Rack #2

Hadoop Network(1-16)

Rack #1

EdgeNetwork(35-38)

BFD MAD

Figure 6. Two rack network port assignments


Hadoop dual rack configuration – with separate ToR switches Figure 7 shows two rack Hadoop cluster with 10GbE top of the rack switches connected using 40GbE/100GbE network bridge aggregation so that the servers in the two racks can communicate between each other.


Edge network

TOR 1GbE switch

TOR 10GbE switch

Edge node

Master node

Data node

OPs/ iLO network Edge network

Hadoop network

Rack #1 Rack #2

Figure 7. HPE Elastic Platform for Big Data Analytics (EPA) for Hadoop two rack configuration without an aggregation switch

Hadoop multi rack configuration with dual rack single TOR switch Figure 8 shows Hadoop network with the racks connected to 40GbE/100GbE aggregation switch. All top of the rack switches are connected to Hadoop aggregation switch for Hadoop data transfers and all the edge nodes are connected to another edge aggregation switch or the aggregation switches can be shared. Figure 8 shows two racks using a single rack TOR switch for connectivity. As the uplinks on the switches are 100GbE, it is sufficient to have a two uplinks (200GbE) to the aggregation switch from each TOR switch. If the uplinks are 40GbE we recommend having four uplinks (160GbE) to the aggregation switch per TOR switch.


Note For Hadoop, a “rack” is a logical construct for minimizing inter-rack network traffic and providing rack level fault tolerance. As such the dual rack shared switch can be configured as a single Hadoop rack if needed. For example since all nodes are connected to the same switch they are all “local”. The same design can be used when upgrading an existing cluster where you run out of rack space to add more nodes per rack. Specify the rack for each host (think repurposing a symmetric cluster as storage nodes and adding Apollo 2000 as compute). Hadoop racks can be “grouped” together also.


OPs/ iLO network

Edge network

Edge network

Hadoop network

Rack #1 Rack #2

OPs/ iLO network

Edge network

Hadoop network

Rack #3 Rack #4

Edge aggregation switch

Hadoop aggregation switch

Hadoop network

Edge network

OPs/ iLO network

Hadoop network

Rack #5 Rack #6

Hadoop network

Figure 8. Multi rack Hadoop network configuration aggregation switch

Hadoop multi rack configuration with ToR and aggregation switches Figure 9 shows the Hadoop network with the racks connected to 40GbE/100GbE aggregation switch. All top of the rack switches are connected to Hadoop aggregation switch for Hadoop data transfers and all the edge nodes are connected to another edge aggregation switch or the aggregation switches can be shared. In the multi rack configuration, it is recommended to move one of the master nodes to the expansion racks for high availability of the services. Also add additional edge nodes on expansion racks so that you can ingest data into the cluster as shown in Figure 9 and other multi rack designs. As the uplinks on the switches are 100GbE, it is sufficient to have a single uplink (100GbE) to the aggregation switch from each TOR switch. If the uplinks are 40GbE, we recommend having two uplinks (80GbE) to the aggregation switch per switch.


The edge network can be the customer network, no need for dedicated switch for edge nodes as shown in Figure 8 and Figure 9. The operations network is connected to customer existing management and operations network. Assuming that the customer has 48 port switch on this network, one can connect up to 48 racks without using an aggregation switch for operations network.

OPs/ iLO network

Edge network

Edge network

Hadoop network

Rack #1 Rack #2

OPs/ iLO network

Edge network

Hadoop network

Rack #3 Rack #4

Edge aggregation switch

Hadoop aggregation switch

Hadoop network

Edge network

OPs/ iLO network

Hadoop network

Rack #5 Rack #6

Hadoop network Hadoop network

Edge network

Hadoop networkHadoop network

Figure 9. Multi rack Hadoop network configuration

When connecting multiple switches in the network, it is recommended to connect the top switch in the rack to aggregation switch one and the bottom switch in the rack to aggregation switch two so that there is minimum network traffic on the IRF link. If traffic flows through the IRF link, it is like one hop and will increase the network latency as shown in logical diagram Figure 10.

Figure 10. Multi rack Hadoop network configuration


Figure 11 shows the aggregation switch connections to top of rack switches.

4847343332311817 1/10GBASE-T161521 Green=10Gbps,Yellow=1Gbps QSFP28 Green=100Gbps,Yellow=40/10Gbps 5449




IRFBFD MAD

HPE 5940-48XGT-6QSFP+

IRFBFD MAD

RACK1 RACK2

32313029282726252423222120191817161514131211654321

SYS

Green=40Gbps,Yellow=10GbpsQSFP+

5930 SeriesSwitchJG726A

HP FlexFabric

32313029282726252423222120191817161514131211654321

SYS

Green=40Gbps,Yellow=10GbpsQSFP+

5930 SeriesSwitchJG726A

HP FlexFabric

HPE 5940-32QSFP+

HPE 5940-32QSFP+




Figure 11. Multi rack Hadoop aggregation switch port configuration

Hadoop cluster capacity planning Table 2 shows the number of nodes that can be connected in a Hadoop cluster using a 10GbE BDO network architecture. The amount of network capacity between the leaf and spine can be decreased below 3:1 because of rack awareness of Hadoop (depending on the workload requirements). Even in a WDO architecture we recommend having some storage nodes along with compute nodes in the rack thus utilizing rack awareness thus reducing the network capacity requirement in the cluster. The following table shows the maximum capacity that can be provided by a pair of 32-port aggregation switches configured in IRF. The number of nodes can be multiplied by adding additional aggregation switches and network tiers as shown in Figure 12. As the size of the cluster grows the inter rack network traffic also comes down and the oversubscription can be increase.

A rack with 16 servers with 10GbE network has 10x16 = 160GbE total network bandwidth assigned, with one 100GbE uplink the oversubscription is 160/100 = 1.6 and with two 40GbE uplinks the oversubscription is 160/80 = 2.0. Table 3 shows the oversubscription and maximum number of nodes that can be configured using a pair of 32-port aggregation switches in IRF configuration. With 100GbE 32-port aggregation switch, one can connect 27 expansion racks with 16 servers and one primary rack with 16 servers totaling 502 servers in the cluster. The following table assumes that one is using only one aggregation switch. Once you add additional aggregation switches to the cluster, 6-8 ports have to set as bridges or uplinks to other aggregation switches.

Table 3. HPE BDO network architecture and capacity

Hadoop Servers Nodes in

1st rack

Nodes per

extension rack

Racks per cluster Nodes per cluster

(32-port aggregation switch)

Bandwidth oversubscription

HPE ProLiant DL380 Gen9

Apollo 4200 Gen9

10GbE 48-ports switch

16 16

16

14

28

224 (16x13 + 16x1)

448 (16x27 + 16x1)

2:1 (40G)

1.6:1 (100G)

HPE ProLiant DL380 Gen9

Apollo 4200 Gen9

25GbE 48-port switch

16 16 14 224 (16x13 + 16x1)

2:1 (100G)

Apollo 4200 Gen9


16 16

14

224 (16x13 + 16x1)

224 (16x13 + 16x1)

4:1 (40G)

3.2:1 (100G)

Apollo 4200 Gen9


16 16

7

112 (16x6 + 16x1)

4:1 (100G)


Figure 12 shows the tiered network configuration where multiple aggregation switches can be connected to extend the number of cluster network to multiple racks. One can add additional aggregation switches pair for every 13-14 ToR switches with 25GbE/10GbE 48 port switch. In Table 3, 28 ports are used for connecting to ToR switches with 2 ports for IRF. If additional aggregation switches are to be configured to existing aggregation network. 6-8 ports are needed for bridge aggregation or uplinks. The two ports that are available can be used for bridge aggregation and move one rack ToR connection to the new aggregation switch. The freed ports can be added to the bridge so that you have 6-8 uplinks or bridge aggregation ports in the configuration.

Green=100Gbps,Yellow=40/10GbpsQSFP28 SFP+32313029282726252423222120191817161514131211654321 3433
















2-4 uplinks per rack

6-8 uplinks per switch







Figure 12. Tiered network to support multiple racks

HPE FlexFabric 1G/10G/25G/40G/100G Switches Following is the configuration of Hadoop and aggregation switches discussed in the paper with the number of I/O ports, memory, packet buffer sizes and throughput for each switch.

Table 4. HPE Network switches and configuration details

I/O ports and slots Memory and processor Throughput

Routing/Switching capacity

HPE FlexFabric 5940 48XGT 6QSFP+ Switch (JH394A)

48 1/10GBASE-T ports 6 QSFP+ 40GbE ports

1 GB flash; Packet buffer size: 12.2 MB, 4 GB SDRAM up to 1071 Mpps 1440 Gbps

HPE FlexFabric 5940 48SFP+ 6QSFP+ Switch (JH395A)

48 fixed 1000/10000 SFP+ ports 6 QSFP+ 40GbE ports


HPE FlexFabric 5940 48SFP+ 6QSFP28 Switch (JH390A)

48 fixed 1000/10000 SFP+ ports 6 QSFP28 100GbE ports


HPE FlexFabric 5940 32QSFP+ Switch (JH396A) 32 QSFP+ 40GbE ports 1 GB flash; Packet buffer size: 12.2 MB,

4 GB SDRAM up to 1904Mpps 2560 Gbps

HPE FlexFabric 5950 32QSFP28 Switch (JH321A)

32 QSFP28 100GbE ports 2 SFP+ 1/10GbE ports

1 GB flash; Packet buffer size: 16 MB, 4 GB SDRAM up to 2796 Mpps 3200 Gbps

HPE FlexFabric 5950 48SFP28 8QSFP28 Switch (JH402A)

48 SFP28 25GbE ports 8 QSFP28 100GbE ports

1 GB flash; Packet buffer size: 16 MB, 4 GB SDRAM up to 2796 Mpps 3200 Gbps

HPE FlexFabric 5900AF 48G 4XG 2QSFP+ Switch (JG510A)

48 autosensing 10/100/1000 ports 4 fixed 1000/10000 SFP+ ports 2 QSFP+ 40-GbE ports

512 MB flash; Packet buffer size: 9 MB, 2 GB SDRAM

up to 250 Mpps (64-byte packets) 336 Gbps

Mellanox Spectrum SN2100 (2) 16-port 100GbE switch 32 10/25/40/50/56/100GbE 8 GB System; Packet buffer size: 16 MB;

16 GB SSD up to 4760 Mpps 3200 Gbps

Arista 7050X Series (2) 48-port 10GbE /4-port 40GbE switch

48 fixed 1000/10000 T ports 4 QSFP+ 40GbE ports 4 GB System; Packet buffer size: 12 MB; up to 960 Mpps 1280 Gbps


The following section provides additional details of the switches discussed.

HPE FlexFabric 5950 32QSFP28 Switch (JH321A)

Figure 13. HPE FlexFabric 5950 Switch Series

The HPE FlexFabric 5950 Switch Series provides advanced features and high performance in a top-of-rack, data center switch architecture. Consisting of a 1U 32-port 100GbE QSFP28 Switch, the 5950 brings high density to a small footprint. While 10-unit IRF reduces management complexities by up to 88%, it also delivers <50 msec convergence time. You can rely on the FlexFabric 5950 Switch Series to improve switch utilization and lower TCO, while delivering business resilience, high availability, and:

• VXLAN and OVSDB support for network virtualization and overlay solutions

• IRF support of up to ten switches simplifies management by up to 88%

• OpenFlow and SDN automate manual tasks and speed service delivery

The HPE FlexFabric 5950 Switch Series is a high density, advanced, data center switch available as a 1RU 32-port 100GbE QSFP28 form factor. This switch can be used for high-density 100GbE/40GbE/25GbE/10GbE spine/ToR connectivity. 100GbE ports may be split into four 25GbE ports and can also support 40GbE which can be split into four by 10GbE for a total of 128 25/10GbE ports.

HPE FlexFabric 5940 Switch Series

Figure 14. HPE FlexFabric 5940 Switch Series


The HPE FlexFabric 5940 Switch Series is a family of high-performance and low-latency 10GbE and 40GbE top-of-rack (ToR) data center switches. The 5940 Switch includes 100G uplink technology which is part of the HPE FlexFabric data center solution and is a cornerstone of the FlexNetwork architecture. The 5940 Switch is suited for deployment at the aggregation or server access layer of large enterprise data centers or at the core layer of medium-sized enterprises. It is optimized for high-performance server connectivity, convergence of Ethernet and storage traffic, and virtual environments.

The HPE FlexFabric 5940 Switch Series enables customers to scale their server-edge 10/40/100GbE ToR deployments with high-density 48 x 10GbE (SFP or BASE-T) with 6 x 40GbE ports, 48 x 10GbE (SFP or BASE-T) with 6 x 100GbE ports, and 32 x 40GbE ports, delivered in a 1RU design.

Note 40-GE QSFP+ ports FortyGigE 1/0/1 through FortyGigE 1/0/4 and FortyGigE 1/0/29 through FortyGigE 1/0/32 on an HPE FlexFabric 5940 32QSFP+ Switch (JH396A) cannot be split into 10-GE SFP+ ports.

HPE FlexFabric 5900AF 48G 4XG 2QSFP+ Switch

Figure 15. HPE FlexFabric 5900AF 48G 4XG 2QSFP+ Switch

The HPE 5900 Switch Series are low-latency 1/10GbE data center top-of-rack (ToR) switches. 10/100/1000BASE-T and fiber support for data center deployments. The high server port density is backed by 40 GbE QSFP+ uplinks to deliver the availability of needed bandwidth for demanding applications; each 40 GbE QSFP+ port can also be configured as four 10GbE ports by using a 40-GbE-to-10GbE splitter cable. For more information: hpe.com/us/en/product-catalog/networking/networking-switches/pip.overview.networking-switches.5354511.html.

Mellanox Spectrum SN2100 16-port 100GbE Switch

Figure 16. Mellanox Spectrum SN2100 16-port 100GbE Ethernet Switch

https://www.hpe.com/us/en/product-catalog/networking/networking-switches/pip.overview.networking-switches.5354511.html


The SN2100 switch provides a high density, side-by-side 100GbE switching solution which scales up to 128 ports in 1RU for the growing demands of today’s database, storage, data centers environments. The SN2100 switch is an ideal spine and top of rack (ToR) solution, allowing maximum flexibility, with port speeds spanning from 10Gb/s to 100Gb/s per port and port density that enables full rack connectivity to any server at any speed. For more information: mellanox.com/related-docs/prod_eth_switches/PB_SN2100.pdf

Arista 7050X Series 48-port 10GbE /4-port 40GbE Switch

Figure 17. Arista 7050X Series 48-port 10GbE /4-port 40GbE switch

The Arista 7050TX are members of the Arista 7050X Series and key components of the Arista portfolio of data center switches. The Arista 7050X Series are purpose built 10/40GbE data center switches in compact and energy efficient form factors with wire speed layer 2 and layer 3 features combined with low latency and advanced features for software defined cloud networking. Increased adoption of 10 Gigabit Ethernet servers coupled with applications using higher bandwidth is accelerating the need for dense 10 and 40 Gigabit Ethernet switching. The 7050TX Series support from 32 to 96 ports of auto-negotiating 100Mb/1Gb/10GBASE-T and from 4 to 12 ports of 10/40GbE that allow customers to design large leaf and spine networks to accommodate the east-west traffic patterns found in modern data centers. For more information: arista.com/en/products/7050x-series

https://www.mellanox.com/related-docs/prod_eth_switches/PB_SN2100.pdf

https://www.arista.com/en/products/7050x-series


Best practices and configuration guidance Dual switch configuration using Intelligent Resilient Fabric (IRF) The recommended configuration for performance and resiliency is to leverage HPE’s Intelligent Resilient Fabric (IRF) in the data center switches, as seen in Figure 18. When IRF is used, the HPE Elastic Platform for Big Data Analytics (EPA) for Hadoop servers can take advantage of link aggregation to boost performance in addition to providing resiliency even during a complete switch failure. When IRF is used, you can create a link aggregation across two separate switches since IRF enables the two switches to act as one.

Ports 53 and 52, 100GbE QSFP28 ports, are configured as IRF ports. Port 53 on SW1 is connected to Port 54 on SW2 and Port 54 on SW1 is connected to Port 53 on SW2 using HPE X240 100G QSFP28 to QSFP28 1m Direct Attach Copper Cable (JL271A) cables. Figure 18. shows the connection on the HPE 5940 48XGT 6QSFP28 Switch.

The multi-active detection (MAD) feature detects identical active IRF virtual devices and handles multi-active collisions on a network. To detect and handle multi-active collisions, the MAD feature identifies each IRF virtual device with an active ID, which is the member ID of the master switch. BFD MAD is implemented by using the Bidirectional Forwarding Detection (BFD) protocol, which helps fast detect link failures and loss of IP connectivity. The BFD MAD enabled IRF section details how to configure the switch to enable BFD MAD protocol.

2

1

3

PS1PS2

UID 41

iLO

4 1

500W

94%



P2 P1ACT LINK ACTLINK



IRFBFD MAD

bond

Figure 18. HPE Elastic Platform for Big Data Analytics (EPA) for Hadoop Intelligent Resilient Fabric (IRF) configuration

Figure 18 above shows how the server network NICs are connected to the HPE 5940 48XGT 6QSFP28 Switches. Figure 19 shows the ports assigned to different networks on the HPE 5940 Switch. Figure 18 shows connections from ports 1-16 connecting to Hadoop data nodes. Ports 39 and 40 are connected to Hadoop head nodes. Port 41 is connected to the management node, port 42 on SW1 is connected to iLO of the management node. Ports 49 to 52, 100GbE QSFP28 ports, can be used for uplinks. Ports 19 to 36 are available for Share iLO/operations network. How these ports can be configured to connect to a customer network is described in section HPE Elastic Platform for Big Data Analytics (EPA) for Hadoop data center connectivity.


49

50

51

53

52

54

SYS



49

50

51

53

52

54

SYS


IRF

uplinks(49-52)

IRF(53-54)

MasterNetwork(39-42)

Shared iLO / Operations

Network(19-36)

Hadoop Network(1-18)

Rack #1

EdgeNetwork(37-38)

BFD MAD

Master iLONetwork(44-48)

Figure 19. HPE Elastic Platform for Big Data Analytics (EPA) for Hadoop switch configuration


The switch configuration is shown in Appendix B. In this document VLAN 1120 is a private VLAN and VLAN 120 is a public VLAN. Ifcfg-eth0 is configured to be slave of ifcfg-bond0 Ethernet device. The section Server NIC bonding describes the configuration of the private network with 172.24.1.x IP addresses. The section Server with 802.1q VLAN tagging provides information on how the public VLAN network is configured. If the customer VLAN is different than 120, they will have to replace the VLAN number 120 to represent their customer VLAN. Each interface port on the switch has been configured to support this network configuration. In the example below, port 1 of switch SW1 is configured to permit untagged 1120 traffic and allow tagged 120 VLAN traffic. The details on switch configuration are provided in the Enable tagged VLANs on HPE Network switches section.

The ports that are assigned to the Hadoop nodes are configured as hybrid ports as they allow untagged network traffic (VLAN 1120) within the HPE Elastic Platform for Big Data Analytics (EPA) for Hadoop server rack, and allow tagged network traffic (VLAN 120) for CEN traffic.

Note Configuring switch ports

interface Ten-GigabitEthernet1/0/1 port link-mode bridge port link-type trunk port trunk permit vlan all flow-interval 5 flow-control stp edged-port lldp tlv-enable dot1-tlv dcbx dldp enable qos trust dot1p

For the network traffic to flow from the HPE Elastic Platform for Big Data Analytics (EPA) for Hadoop rack to the customer / public network, one can configure multiple ports on the HPE 5940-48XGT-6QSFP28 Switches as shown in the example below to permit VLAN 120. The section Configuring HPE 5940 network switch shows a sample of how the switch is configured.

Note Configuring uplink

interface HundredGigE1/0/52 port link-type trunk port trunk permit vlan all port link-aggregation group 1

If the customer has a Cisco switch that connects to the HPE 5940 Switch the section Configure Cisco switch for HPE 5940 will provide information on how to configure the customer’s Cisco switch so that it communicates with the HPE 5940 Switch and passes appropriate traffic through their networks.

The section Spanning Tree Protocol (STP) for HPE and Cisco interoperability provides information on STP configuration.

Explanation of specific port settings on the switch The following settings are configured on the server port interfaces on the switch. The settings are explained below.

interface FortyGigE1/0/1 port link-mode bridge port link-type trunk port trunk permit vlan all flow-interval 5 flow-control stp edged-port


lldp tlv-enable dot1-tlv dcbx dldp enable qos trust dot1p The following are the descriptions of the settings of the interfaces on the switch: flow-interval: Sets the statistics polling interval, in seconds.

flow-control: For heavy write workloads depending on the compute to storage ratio the storage nodes may need to invoke flow control if HDFS is not able to handle the write requests. The device supports flow control in both the inbound and outbound directions.

• For flow control in the inbound direction, the local device listens to flow control information from the remote device.

• For flow control in the outbound direction, the local device sends flow control information to the remote device.

• The flow control setting takes effect in both directions. To communicate, two devices must be configured with the same flow control mode.

ethtool -a eth0: To verify the network adapters flow control setting

Autonegotiate: on: To pause parameters for eth0:

– RX: on

– TX: on

stp edged-port: A port directly connecting to a user terminal rather than another device or a shared LAN segment can be configured as an edge port. In case the network topology changes, an edge port does not cause a temporary loop. You can enable the port to transit to the forwarding state rapidly by configuring it as an edge port. HPE recommends that you configure ports that directly connect to user terminals as edge ports.

lldp tlv-enable dot1-tlv dcbx: The device can advertise on an Ethernet port all types of LLDP TLVs except DCBX TLVs, location identification TLVs, VLAN Name TLVs, and Protocol VLAN ID TLVs.

dot1-tlv: Advertises IEEE 802.1 organizationally specific LLDP TLVs.

dcbx: Advertises the Data Center Bridging Exchange Protocol (DCBX) TLV.

dldp enable: The Device Link Detection Protocol (DLDP) was developed by HPE to detect the status of fiber links or twisted-pair links. When DLDP detects unidirectional links, it can automatically shut down the faulty port to avoid network problems.

qos trust dot1p: Configure the interface to trust the 802.1p priority carried in packets.

jumbo frames: Sets the maximum length of Ethernet frames that are allowed to pass through. The value range is 1536 to 10000 bytes.

Splitting and combining 10GbE, 25GbE, 40GbE and 100GbE ports Splitting a 40-GE interface into four 10-GE breakout interfaces You can use a 40-GE interface as a single interface. You can also split a 40-GE interface into four 10-GE breakout interfaces.

For example, you can split 40-GE interface FortyGigE 1/0/1 into four 10-GE breakout interfaces Ten-GigabitEthernet 1/0/1:1 through Ten-GigabitEthernet 1/0/1:4.

After you configure this feature on a 40-GE interface, the system deletes the 40-GE interface and creates the four 10-GE breakout interfaces. After the using tengige command is successfully configured, you do not need to reboot the switch. You can view the four 10-GE breakout interfaces by using the display interface brief command. A 40-GE interface split into four 10-GE breakout interfaces must use a dedicated 1-to-4 cable.

Combining four 10-GE breakout interfaces into a 40-GE interface If you need higher bandwidth on a single interface, you can combine the four 10-GE breakout interfaces into a 40-GE interface. After you configure this feature on a 10-GE breakout interface, the system deletes the four 10-GE breakout interfaces and creates the 40-GE interface.

After the using fortygige command is successfully configured, you do not need to reboot the switch. You can view the 40-GE interface by using the display interface brief command. After you combine the four 10-GE breakout interfaces, replace the dedicated 1-to-4 cable with a dedicated 1-to-1 cable or a 40-GE transceiver module.


using fortygige Use using fortygige to combine 10-GE breakout interfaces split from a 40-GE interface into a 40-GE interface.

Use undo using fortygige to cancel the configuration. If you need higher bandwidth on a single interface, you can combine four 10-GE breakout interfaces split from a 40-GE interface into a 40-GE interface. To do so, execute this command on any of these 10-GE breakout interfaces.

Examples:

# Combine Ten-GigabitEthernet 1/0/1:1 through Ten-GigabitEthernet 1/0/1:4 into a 40-GE interface.

<System> system-view [System] interface ten-gigabitethernet1/0/1:1 [System-Ten-GigabitEthernet1/0/1:1] using fortygige The interfaces Ten-GigabitEthernet1/0/1:1 will be deleted. Continue? [Y/N]:y

using hundredgige Use using hundredgige to combine 25-GE breakout interfaces split from a 100-GE interface into a 100-GE interface.

Use undo using hundredgige to cancel the configuration. If you need higher bandwidth on a single interface, you can combine four 25-GE breakout interfaces split from a 100-GE interface into a 100-GE interface. To do so, execute this command on any of these 25-GE breakout interfaces.

Examples:

# Combine Twenty-FiveGigE 1/0/1:1 through Twenty-FiveGigE1/0/1:4 into a 100-GE interface.

<Sysname> system-view [Sysname] interface Twenty-FiveGigE 1/0/1:1 [Sysname-Twenty-FiveGigE1/0/1:1]using hundredgige The interfaces Twenty-FiveGigE1/0/1:1 will be deleted. Continue? [Y/N]:y

using twenty-fivegige Use using twenty-fivegige to split a high bandwidth interface into 4 25-GE breakout interfaces.

Use undo using twenty-fivegige to cancel the configuration.

You can split a high bandwidth interface into multiple 4 25-GE breakout interfaces. For example: Split a 100-GE interface HundredGigE 1/0/1 into four 25-GE breakout interfaces Twenty-FiveGigE 1/0/1:1 through Twenty-FiveGigE 1/0/1:4. The 10-GE breakout interfaces support the same configuration and attributes as common 10-GE interfaces, except that they are numbered in a different way.

Examples:

# Split HundredGigE 1/0/1 into 25-GE breakout interfaces.

<Sysname> system-view

[Sysname] interface hundredgige 1/0/1 [Sysname-HundredGigE1/0/1] using twenty-fivegige The interface HundredGigE1/0/1 will be deleted. Continue? [Y/N]:y

Following is the display for the interface after the change:

WGE1/0/1:1 UP 25G F T 1 WGE1/0/1:2 UP 25G F T 1 WGE1/0/1:3 UP 25G F T 1 WGE1/0/1:4 UP 25G F T 1


using tengige Use using tengige to split a high bandwidth interface into 4 10-GE breakout interfaces.

Use undo using tengige to cancel the configuration.

You can split a high bandwidth interface into multiple 10-GE breakout interfaces. For example:

• Split a 40-GE interface FortyGigE 1/0/1 into four 10-GE breakout interfaces Ten-GigabitEthernet 1/0/1:1 through Ten-GigabitEthernet 1/0/1:4.

• Split a 100-GE interface HundredGigE 1/0/1 into four 10-GE breakout interfaces Ten-GigabitEthernet 1/0/1:1 through Ten-GigabitEthernet 1/0/1:4.

The 10-GE breakout interfaces support the same configuration and attributes as common 10-GE interfaces, except that they are numbered in a different way.

Examples:

# Split HundredGigE 1/0/1 into 10-GE breakout interfaces.

<Sysname> system-view

[Sysname] interface hundredgige 1/0/1 [Sysname-HundredGigE1/0/1] using tengige The interface HundredGigE1/0/1 will be deleted. Continue? [Y/N]:y

# Split FortyGigE 1/0/1 into four 10-GE breakout interfaces.

<System> system-view [System] interface fortygige 1/0/1 [System-FortyGigE1/0/1] using tengige The interface FortyGigE1/0/1 will be deleted. Continue? [Y/N]:y

Following is the display for the interface after the change:

XGE1/0/1:1 UP 10G F(a) A 1 XGE1/0/1:2 UP 10G F(a) A 1 XGE1/0/1:3 UP 10G F(a) A 1 XGE1/0/1:4 UP 10G F(a) A 1

QSFP to SFP adapter If the customer does not have QSFP connectivity, the HPE QSFP/SFP+ adapter kit can be used to convert QSFP port to SFP+ port using the following adapter. You need to convert the QSFP port to SFP port ‘using tengige’ command described in the previous section.

• 655874-B21 HPE QSFP/SFP+ Adapter Kit – QSFP to SFP+ adapter (convert QSFP port to SFP+ port, if fiber connection is needed use part JD092B)

• HPE X130 10G SFP+ LC SR Transceiver (JD092B) – SFP+ Fiber transceiver (can be used on 5900 SFP+ ports)

If one needs an RJ45 connection on SFP+ port, use HPE X120 1G SFP RJ45 T Transceiver JD089B for 1GbE Ethernet connectivity.

Spanning Tree Protocol (STP) for HPE and Cisco interoperability Spanning Tree Protocol (STP) is a Layer 2 protocol that runs on bridges and switches. The specification for STP is IEEE 802.1D. The main purpose of STP is to ensure that you do not create loops when you have redundant paths in your network.

While STP/RSTP (Rapid Spanning Tree Protocol)/MSTP (Multiple Spanning Tree Protocol)/PVST (Per VLAN Spanning Tree) are fairly effective in preventing unwanted network loops, convergence can still take several seconds, affecting applications that cannot handle that length of delay. In addition, the performance of STP is poor because it blocks all parallel paths except the one it has selected as active.


When you enable the spanning tree feature globally on an HPE 5940 Switch, the device operates in STP, RSTP, PVST or MSTP mode, depending on the spanning tree mode setting. When the spanning tree feature is enabled, the device dynamically maintains the spanning tree status of VLANs based on received configuration Bridge Protocol Data Units (BPDUs). When the spanning tree feature is disabled, the device stops maintaining the spanning tree status. To enable the spanning tree feature globally on HPE 5940 Switch, type:

[HP]stp global enable

Set the spanning tree mode on HPE 5940 Switch. A spanning tree device operates in MSTP mode by default. To set the spanning tree mode, use the command: “stp mode { mstp |pvst | rstp | stp }”. Specify the mode for the customer network:

[HP]stp mode ?

mstp Multiple spanning tree protocol mode

pvst Per-Vlan spanning tree mode

rstp Rapid spanning tree protocol mode

stp Spanning tree protocol mode

The spanning tree protocol serves as a key loop prevention and redundancy mechanism in enterprise networks. Over the years it has been refined with updates, such as rapid spanning tree (RSTP) to reduce convergence time and multiple spanning tree (MSTP) to form a separate spanning tree instance for each VLAN. In addition to these standards-based methods, Cisco switches use proprietary variants called per- VLAN spanning tree plus (PSVT+) and Rapid PVST+.

For interoperability between HPE and Cisco switches, use the following three variations of spanning tree:

• PVST+ (HPE) / PVST+ (Cisco)

• MSTP (HPE) / PVST+ (Cisco)

• MSTP (HPE and Cisco, using the IEEE 802.1s specification)

If a port directly connects to a server rather than to another device or a shared LAN segment, this port is regarded as an edge port. You can enable the port to transit to the forwarding state rapidly by configuring it as an edge port. HPE recommends that you configure the ports directly connected to servers as edge ports.

Following is the instruction to change a port to an edge port:

[HP] interface Ten-GigabitEthernet 1/0/1

[HP-Ten-GigabitEthernet1/0/8] stp edge-port

Use “display stp” to display the spanning tree status and statistics information. Based on the information, you can analyze and maintain the network topology or determine whether the spanning tree is working correctly.

[HP]disp stp brief

Additional information regarding STP configuration can be found at http://networktest.com/hpiop/hpiopcookbook.pdf in the HP/Cisco Switching and Routing Interoperability Cookbook.

Setting up multi-active detection (MAD) with IRF BFD MAD enabled IRF The multi-active detection (MAD) feature detects identical active IRF virtual devices and handles multi-active collisions on a network. To detect and handle multi-active collisions, the MAD feature identifies each IRF virtual device with an active ID, which is the member ID of the master switch. If multiple identical active IRF virtual devices are detected, only the one that has the lowest active ID can operate in the active state and forward traffic. MAD sets all other IRF virtual devices in the recovery state, and shuts down all their physical ports but the console and IRF ports.

http://networktest.com/hpiop/hpiopcookbook.pdf


Note IRF provides MAD mechanisms by extending LACP, BFD, ARP, and IPv6 ND. You can configure a minimum of one MAD mechanism on an IRF fabric for prompt IRF split detection.

• Do not configure LACP MAD together with ARP MAD or ND MAD, because they handle collisions differently.

• Do not configure BFD MAD together with ARP MAD or ND MAD. BFD MAD is mutually exclusive with the spanning tree feature, but ARP MAD and ND MAD require the spanning tree feature. At the same time, BFD MAD handles collisions differently than ARP MAD and ND MAD.

BFD MAD is implemented by using the Bidirectional Forwarding Detection (BFD) protocol, which helps fast detect link failures and loss of IP connectivity. Figure 18 shows a link that is used between HPE 5940 Switches that is used for BFD MAD. BFD MAD is implemented by using the Bidirectional Forwarding Detection (BFD) protocol, which helps fast detect link failures and loss of IP connectivity.

Set up dedicated BFD MAD links between each pair of IRF member switches. Do not use the BFD MAD links for data transmission. Assign port 44 of the BFD MAD links to a dedicated VLAN 3; create a VLAN interface for VLAN 3; and, assign a MAD IP address for each member switch. The MAD IP addresses are used for setting up BFD sessions between member switches, and they must be in the same network segment.

Create VLAN 3, and add Ten-GigabitEthernet 1/0/44, Ten-GigabitEthernet 2/0/44 to VLAN 3.

[HP] vlan 3 [HP-vlan3] port ten-gigabitethernet 1/0/44 ten-gigabitethernet 2/0/44 [HP-vlan3] quit Create VLAN-interface 3, and configure a MAD IP address for each member device on the VLAN interface.

1. [HP] interface vlan-interface 3 2. [HP-Vlan-interface3] mad bfd enable 3. [HP-Vlan-interface3] mad ip address 192.168.2.1 24 member 1 4. [HP-Vlan-interface3] mad ip address 192.168.2.2 24 member 2 5. [HP-Vlan-interface3] quit Disable the spanning tree feature on Ten-GigabitEthernet 1/0/44, Ten-GigabitEthernet 2/0/44.

[HP] interface ten-gigabitethernet 1/0/44 [HP-Ten-GigabitEthernet1/0/1] undo stp enable [HP-Ten-GigabitEthernet1/0/1] quit [HP] interface ten-gigabitethernet 2/0/44 [HP-Ten-GigabitEthernet2/0/1] undo stp enable [HP-Ten-GigabitEthernet2/0/1] quit Save the configuration of the switch.

To view the configuration of BFD MAD on the switch:

<HP>disp mad verbose Current MAD status: Detect Excluded ports(configurable): Excluded ports(can not be configured): FortyGigE1/0/51 FortyGigE1/0/52 FortyGigE2/0/51 FortyGigE2/0/52 MAD ARP disabled. MAD ND disabled. MAD LACP disabled. MAD BFD enabled interface: Vlan-interface3 mad ip address 192.168.2.1 255.255.255.0 member 1 mad ip address 192.168.2.2 255.255.255.0 member 2


LACP MAD enabled IRF LACP MAD requires an HPE device that supports extended LACPDUs to act as the intermediate device. You must set up a dynamic link aggregation group that spans all IRF member devices between the IRF fabric and the intermediate device. To enable dynamic link aggregation, configure the link-aggregation mode dynamic command on the aggregate interface.

If one IRF fabric uses another IRF fabric as the intermediate device for LACP MAD, you must assign the two IRF fabrics different domain IDs for correct split detection. False detection causes IRF split.

Note IRF provides MAD mechanisms by extending LACP, BFD, ARP, and IPv6 ND. You can configure a minimum of one MAD mechanism on an IRF fabric for prompt IRF split detection.

• Do not configure LACP MAD together with ARP MAD or ND MAD, because they handle collisions differently.

• Do not configure BFD MAD together with ARP MAD or ND MAD. BFD MAD is mutually exclusive with the spanning tree feature, but ARP MAD and ND MAD require the spanning tree feature. At the same time, BFD MAD handles collisions differently than ARP MAD and ND MAD.

When you use the mad enable command, the system prompts you to enter a domain ID. If you do not want to change the current domain ID, press enter at the prompt.

An IRF fabric has only one IRF domain ID. You can change the IRF domain ID by using the following commands: irf domain, mad enable, mad arp enable, or mad nd enable. The IRF domain IDs configured by using these commands overwrite each other.

Examples

# Enable LACP MAD on Bridge-Aggregation 1, a Layer 2 dynamic aggregate interface.

<Sysname> system-view [Sysname] interface bridge-aggregation 1 [Sysname-Bridge-Aggregation1] link-aggregation mode dynamic [Sysname-Bridge-Aggregation1] mad enable You need to assign a domain ID (range: 0-4294967295) [Current domain is: 0]: 1 The assigned domain ID is: 1 MAD LACP only enable on dynamic aggregation interface. # Enable LACP MAD on Route-Aggregation 1, a Layer 3 dynamic aggregate interface.

<Sysname> system-view [Sysname] interface route-aggregation 1 [Sysname-Route-Aggregation1] link-aggregation mode dynamic [Sysname-Route-Aggregation1] mad enable You need to assign a domain ID (range: 0-4294967295) [Current domain is: 0]: 1 The assigned domain ID is: 1 MAD LACP only enable on dynamic aggregation interface.

Enable tagged VLANs on HPE Network switches The following instructions are for configuring the HPE 5940 Switch to support tagged VLANs configured on the servers described previously.

1. Log into the network device via the console and enter system-view.

<HP>system-view


2. Use vlan vlan-id to create a VLAN.

[HP]vlan 120

[HP-VLAN120]quit

3. The following instructions will configure the range of ports on the switch, ports 1-18 and ports 45-48 on both the HPE 5940 Switches.

Use the following command to enter interface range view to bulk configure multiple interfaces with the same feature instead of configuring them one by one. Use interface range name name interface interface-list to create an interface range, configure a name for the interface range, and enter the interface range view.

Use interface range name name without the interface keyword to enter the view of an interface range with the specified name.

interface range name Hadoop-net interface Ten-GigabitEthernet1/0/1 to Ten-GigabitEthernet1/0/18 Ten-GigabitEthernet2/0/1 to Ten-GigabitEthernet2/0/18 Ten-GigabitEthernet1/0/39 to Ten-GigabitEthernet1/0/42 Ten-GigabitEthernet2/0/39 to Ten-GigabitEthernet2/0/42

Commands in Steps 4, 5 and 6 are configured on the ports specified in the command.

4. Use port link-type to configure the link type of a port.

Parameters:

access: Configures the link type of a port as access.

hybrid: Configures the link type of a port as hybrid.

trunk: Configures the link type of a port as trunk.

Usage guidelines:

To change the link type of a port from trunk to hybrid or vice versa, first set the link type to access.

a. The configuration made in Ethernet interface view applies only to the port.

b. The configuration made in aggregate interface view applies to the aggregate interface and its aggregation member ports.

c. If the system fails to apply the configuration to the aggregate interface, it stops applying the configuration to aggregation member ports.

d. If the system fails to apply the configuration to an aggregation member port, it skips the port and moves to the next member port.

e. The configuration made in S-channel interface view applies only to the interface.

[HP-if-range-hadoop-net] port link-type trunk

5. Use undo port trunk vlan to remove the trunk ports from the specified VLANs.

Use port trunk vlan to assign the trunk ports to the specified VLANs.

Parameters:

vlan-list: Specifies a list of existing VLANs in the format of:

vlan-list = { vlan-id1 [ to vlan-id2 ] }&<1-10>

where vlan-id1 and vlan-id2 each are in the range of 1 to 4094, vlan-id2 cannot be smaller than vlan-id1, and &<1-10> indicates that you can specify up to 10 vlan-id1 [ to vlan-id2 ] parameters.

tagged: Configures the ports to send the packets of the specified VLANs without removing VLAN tags.

untagged: Configures the ports to send the packets of the specified VLANs after removing VLAN tags.


Usage guidelines:

A trunk port can carry multiple VLANs. If you execute the port trunk vlan command multiple times, the VLANs that the trunk port allows are the VLANs that are specified by vlan-list in each execution.

[HP-if-range-hadoop-net] undo port trunk vlan 1

[HP-if-range-hadoop-net] port trunk vlan 120 tagged

[HP-if-range-hadoop-net] port trunk vlan 1120 untagged

6. Use the port trunk pvid command to configure the PVID (Port VLAN ID) of the trunk port.

[HP-if-range-hadoop-net] port trunk pvid vlan 1120

[HP-if-range-hadoop-net] quit

7. Save the configuration. Once completed, exit the switch, and you are done.

Hadoop server network configuration Following sections in the document focus on best practices for single rack configuration, focusing on:

• Server NIC bonding

• Server with 802.1q VLAN tagging

Server NIC bonding The following are the instructions on configuring servers with bonding. On the servers, two 10GbE adapter NICs are configured as bonded. The following is the configuration showing bond0 being the master and eth0 as slave. Red Hat® Enterprise Linux® allows administrators to bind multiple network interfaces together into a single channel using the bonding kernel module and a special network interface called a channel bonding interface. Channel bonding enables two or more network interfaces to act as one, simultaneously increasing the bandwidth and providing redundancy. The server is configured with jumbo frames MTU to 9000 for better performance and needs to be configured on all the servers in the Hadoop data network interfaces. The default MTU is 1500.

The following are the bonding modes:

• Mode 0 (balance-rr) This mode transmits packets in a sequential order from the first available slave through the last. This provides load balancing and fault tolerance.

• Mode 1 (active-backup) This mode places one of the interfaces into a backup state and will only make it active if the link is lost by the active interface. Only one slave in the bond is active at an instance of time. A different slave becomes active only when the active slave fails. This mode provides fault tolerance.

• Mode 2 (balance-xor) Transmits based on XOR formula. (Source MAC address is XOR’d with destination MAC address) modula slave count. This selects the same slave for each destination MAC address and provides load balancing and fault tolerance.

• Mode 3 (broadcast) This mode transmits everything on all slave interfaces. This mode is least used (only for specific purpose) and provides only fault tolerance.

• Mode 4 (802.3ad) This mode is known as Dynamic Link Aggregation mode. It creates aggregation groups that share the same speed and duplex settings. This mode requires a switch that supports IEEE 802.3ad Dynamic link. Slave selection for outgoing traffic is done according to the transmit hash policy, which may be changed from the default simple XOR policy via the xmit_hash_policy option.

• Mode 5 (balance-tlb) This is called as Adaptive transmit load balancing. The outgoing traffic is distributed according to the current load and queue on each slave interface. Incoming traffic is received by the current slave.


• Mode 6 (balance-alb) This is Adaptive load balancing mode. This includes balance-tlb + receive load balancing (rlb) for IPV4 traffic. The receive load balancing is achieved by ARP negotiation.

To create a channel bonding interface, create a file in the /etc/sysconfig/network-scripts/ directory called ifcfg-bondN, replacing N with the number for the interface, such as 0.

The contents of the file can be identical to whatever type of interface is getting bonded, such as an Ethernet interface. The only difference is that the DEVICE directive is bondN, replacing N with the number for the interface. The NM_CONTROLLED directive can be added to prevent NetworkManager from configuring this device.

The following is a sample channel bonding configuration file:

The interface bond0 is configured with the IP address and bonding options; additional information on the values is below.

# cat /etc/sysconfig/network-scripts/ifcfg-bond0 DEVICE=bond0 IPADDR=172.24.1.1 NETMASK=255.255.255.0 BOOTPROTO=none ONBOOT=yes NM_CONTROLLED=no BONDING_OPTS="mode=balance-rr primary=eth0 miimon=500 updelay=1000" TYPE=Ethernet MTU=9000 IPV6INIT=no USERCTL=no After the channel bonding interface is created, the network interfaces to be bound together must be configured by adding the MASTER and SLAVE directives to their configuration files. The configuration files for each of the channel-bonded interfaces can be nearly identical.

For example, if two Ethernet interfaces are being channel bonded, both eth0 and eth1 may look like the following example:

# cat /etc/sysconfig/network-scripts/ifcfg-eth0 DEVICE=eth0 BOOTPROTO=none ONBOOT=yes NM_CONTROLLED=no MASTER=bond0 SLAVE=yes USERCTL=no MTU=9000 IPV6INIT=no Bonding mode 4 required switch configuration to support dynamic link aggregation.

10GbE netperf and iperf results with bonding modes The following table shows the network throughput with netperf and iperf using different bonding modes. The default mode is 0 which requires no switch configuration.

Example command lines used to test the bonding mode on 10GbE NIC:

netperf -H <hostname> -f M -l 30 -T 2,2 -- -s 256k -S 256k iperf3 -f M -A 2,2 -c <hostname> -t 30 -w 256K -P 8

netperf and iperf run with following performance setting on the servers

MTU set to 9000

tuned-adm profile set to throughput-performance


Table 5. netperf and iperf results with different bonding moded

Single Port 0 1 2 3 4 5 6

Netperf MBps 2313 3922 2278 2658 2271 4383 2389 3784

Iperf threads=1 MBps 2350 3471 2312 2422 2172 3042 2447 3753




Server with 802.1q VLAN tagging In the Hadoop configuration each server has two NICs that are configured with two VLANs. The HPE 5940 Switch only allows VLAN 120 traffic on uplinks; all other traffic is blocked within the rack. The private VLAN 1120 that is used within the HPE 5940 Switch is not transmitted outside the HPE Elastic Platform for Big Data Analytics (EPA) for Hadoop rack. The server is dual homed with an internal private IP and an external CEN IP. To permit CEN traffic from the server, the bonded NIC has to be configured with VLAN tagging.

The following are the instructions on how to configure VLAN tagging on the servers.

Ensure that the module 802.1q is loaded by entering the following command:

lsmod | grep 8021q

If the module is not loaded, load it with the following command:

modprobe 8021q

To configure the VLAN interface configuration, create a file in the /etc/sysconfig/network-scripts/ directory. The configuration filename should be the physical interface, plus a . (period) character, plus the VLAN ID number. For example, if the VLAN ID is 120, and the physical interface is eth0, then the configuration filename should be ifcfg-eth0.120:

# cat /etc/sysconfig/network-scripts/ifcfg-bond0.120 DEVICE=bond0.120 IPADDR=10.120.45.200 NETMASK=255.255.252.0 GATEWAY=10.120.45.1 BOOTPROTO=none ONBOOT=yes NM_CONTROLLED=no BONDING_OPTS="mode=balance-rr primary=eth0 miimon=500 updelay=1000" PEERDNS=no USERCTL=no VLAN=yes The following bonding policies are set on the servers.

mode=<value>

Where <value> allows you to specify the bonding policy and can be one of the following:

• balance-rr or 0 Sets a round-robin policy for fault tolerance and load balancing. Transmissions are received and sent out sequentially on each bonded slave interface beginning with the first one available.


• miimon=<time_in_milliseconds> Specifies (in milliseconds) how often MII link monitoring occurs. This is useful if high availability is required because MII is used to verify that the NIC is active. To verify that the driver for a particular NIC supports the MII tool, type the following command as root:

# ethtool <interface_name> | grep "Link detected:"

In this command, replace <interface_name> with the name of the device interface, such as eth0, not the bond interface. If MII is supported, the command returns:

Link detected: yes

If using a bonded interface for high availability, the module for each NIC must support MII. Setting the value to 0 (the default), turns this feature off.

• updelay=<time_in_milliseconds> Specifies (in milliseconds) how long to wait before enabling a link. The value must be a multiple of the value specified in the miimon parameter. The value is set to 0 by default, which disables it.

HPE Elastic Platform for Big Data Analytics (EPA) for Hadoop data center connectivity The following section focuses on configuring the HPE 5940 Switch to connect to the customer data center network.

Configuring HPE 5940 network switch This section provides instructions for configuring a network switch for customer uplinks. Configure customer uplinks using ports 19 through 32 10GbE ports or 49 and 50 40GbE ports on the HPE 5940 Switch that are labeled as customer in Figures 3 and 4. The example below shows how to configure one port. If multiple ports need to be configured, repeat these steps for each port.

If the customer wants to use 40GbE uplinks, “HPE X240 40G QSFP+ to QSFP+ 5m Direct Attach Copper Cable (JG328A)” cables need to be ordered to connect to the customer’s spine switches with 40GbE connectivity.

Additional 40GbE connectivity options are available to customers, for additional details refer to the QuickSpecs for HPE 5940 Switch Series:

• HPE X240 40G QSFP+ to 4x10G SFP+ 1m Direct Attach Copper Splitter Cable (JG329A)

• HPE X140 40G QSFP+ MPO SR4 Transceiver (JG325B)

The 40GbE Ports 49 and 50 on the HPE 5940 can be configured using the instructions below.

Note Port numbers will be formatted differently depending on the model and how the switch is configured. For example, a switch configured to use Intelligent Resilient Fabric (IRF) will also include a chassis number as part of the port number.

In this example, we will be making a two-port link aggregation group in an IRF configuration. You can see that this is a two switch IRF configuration by observing the port number scheme of chassis/slot/port. A scheme of 1/0/43 means chassis 1, or switch 1, slot 0, port 43. A scheme of 2/0/43 means chassis 2, or switch 2, slot 0, port 43. In the following example, port 43 from each switch, r001sw001 and r001sw002, is configured. You can use any other ports to configure the bridge aggregation that provides an uplink from the switch. If you use 40GbE QSFP port 49 or 50, then you will have to replace “interface Ten-GigabitEthernet” to “interface FortyGigE”; other instruction are the same.

1. Log into the network device via the console and enter system-view.

<HP>system-view

2. Create the Bridge Aggregation interface to contain the uplinks from your server. In this example we will be creating the interface of Bridge Aggregation 1. Your numbering may vary depending on the current configuration on the switch you are using.

[HP] interface Bridge-Aggregation 1

https://h20195.www2.hpe.com/v2/GetDocument.aspx?docname=c05158726


3. Give your new interface a description to ease identification:

[HP-Bridge-Aggregation1] description bridge-aggregation-to-customer-network

4. By default, an aggregation group operates in static aggregation mode. So you need to configure the aggregation group to operate in dynamic aggregation mode.

[HP-Bridge-Aggregation1] link-aggregation mode dynamic

5. Type quit to exit the bridge aggregation:

[HP-Bridge-Aggregation1] quit

6. Enter the first interface that you will be aggregating. The example below adds port 43 in the bottom switch to link aggregation group 1:


7. Put the port in the link aggregation group:

[HP-Ten-GigabitEthernet1/0/43] port link-aggregation group 1

8. Enable the interface. If it is already enabled, it will tell you that the interface is not shut down.

[HP-Ten-GigabitEthernet1/0/43] undo shutdown

9. Type “quit” and repeat steps 6-7 for all interfaces that will be in the link aggregation group. The example below adds port 43 in the second switch to link aggregation group 1.


[HP-Ten-GigabitEthernet2/0/43] port link-aggregation group 1

[HP-Ten-GigabitEthernet2/0/43] undo shutdown

10. Return to the Bridge Aggregation for the final configuration:

[HP] interface Bridge-Aggregation 1

Note If you get an error similar to “Error: Failed to configure on interface…” during any of the following steps, you will need to run the following command on the interface that has the error and then re-run Steps 6-7.


[HP-Ten-GigabitEthernet1/0/43] default

This command will restore the default settings. Continue? [Y/N]: Y

If the default command is not available:

[HP-Ten-GigabitEthernet1/0/43] port link-type access

11. Change the port type to a trunk:

[HP-Bridge-Aggregation1] port link-type trunk

12. Enable the interface:

[HP-Bridge-Aggregation1] undo shutdown


13. Set the Port Default VLAN ID (PVID) of the connection. The PVID is the VLAN ID the switch will assign to all untagged frames (packets) received on each port. Another term for this would be your untagged or native VLAN. By default, it is set to 1, but you will want to change it if your network is using another VLAN ID for your untagged traffic.

[HP-Bridge-Aggregation1] port trunk pvid vlan 120

14. If you configured your vSwitch to pass multiple VLAN tags, you can configure your bridge aggregation link at this time by running the following command. Repeat for all the VLANs you need to pass through that connection.

[HP-Bridge-Aggregation1] port trunk permit vlan 120 Please wait... Done. Configuring Ten-GigabitEthernet1/0/43... Done. Configuring Ten-GigabitEthernet2/0/43... Done.

15. If you set your PVID to something other than the default 1, you will want to remove that VLAN 1 and repeat Step 13 for your PVID VLAN.

[HP-Bridge-Aggregation1] undo port trunk permit vlan 1 Please wait... Done. Configuring Ten-GigabitEthernet1/0/43... Done. Configuring Ten-GigabitEthernet2/0/43... Done.

16. Now display your new Bridge Aggregation interface to ensure things are set up correctly. You will want to make sure your PVID is correct, and

that you are both passing and permitting the VLAN you defined. In this example, we are not passing the untagged traffic (PVID 1) and only packets tagged with VLAN ID 120. You will also want to make sure your interfaces are up, and you are running at the correct speed, two 10Gbps links would give you 20Gbps of aggregated performance.

[HP] display interface Bridge-Aggregation 1

Bridge-Aggregation1 current state: UP

IP Packet Frame Type: PKTFMT_ETHNT_2, Hardware Address: 000f-e207-f2e0

Description: bridge-aggregation-to-customer-network

20Gbps-speed mode, full-duplex mode

Link speed type is autonegotiation, link duplex type is autonegotiation

PVID: 120

Port link-type: trunk

VLAN passing : 120

VLAN permitted: 120

Trunk port encapsulation: IEEE 802.1q

… Output truncated…

17. Now check to make sure the trunk was formed correctly. If both connections have something other than “S” for the status, here are a few troubleshooting steps. If none of these work, then delete and recreate the bridge aggregation and reset all the ports back to default. Ensure that:

a. You configured the interfaces correctly.

b. You enabled (undo shutdown) the port on the switch.

c. The VLANs being passed/permitted match that of the group.


d. The port is connected to the switch on the interface you specified and is connected and enabled on the server:

[HP] display link-aggregation verbose Bridge-Aggregation 1

Loadsharing Type: Shar -- Loadsharing, NonS -- Non-Loadsharing

Port Status: S -- Selected, U -- Unselected

Flags: A -- LACP_Activity, B -- LACP_Timeout, C -- Aggregation,

D -- Synchronization, E -- Collecting, F -- Distributing,

G -- Defaulted, H -- Expired

Aggregation Interface: Bridge-Aggregation1

Aggregation Mode: Static

Loadsharing Type: Shar

Port Status Oper-Key

---------------------------------------------------------------------------

XGE1/0/43 S 2

XGE2/0/43 S 2

18. Save the configuration. Once completed, exit the switch, and you are done.

Configure Cisco switch for HPE 5940 The following steps are for configuring Cisco switches to communicate with HPE 5940 networking switches on HPE Elastic Platform for Big Data Analytics (EPA) for Hadoop. In the following example two interface ports are configured as uplinks from the 5900 network switch to the customer Cisco switch. The ports only allow VLAN 120 traffic; all other traffic is blocked from the customer enterprise network. The private VLAN 1120 that is used within the HPE 5940 Switch is not transmitted outside the HPE Elastic Platform for Big Data Analytics (EPA) for Hadoop rack.

1. Specify the port-channel interface to configure, and enter the interface configuration mode. The range is from 1 to 4096.

[Cisco]interface Port-channel 1

2. Configure the interface for Layer 2 switching. Enter the switchport command once without any keywords to configure the interface as a Layer 2 port before you can enter additional switchport commands with keywords.

switchport

3. Specify the encapsulation command with the dot1q keyword to support the IEEE standard.

switchport trunk encapsulation dot1q

4. For 802.1Q trunks, specify the native VLAN. Note: If you do not set the native VLAN, the default is used (VLAN 1).

switchport trunk native vlan 120

5. Configure the list of VLANs allowed on the trunk. All VLANs are allowed by default.

switchport trunk allowed vlan 120

6. Remove VLAN 1 from the trunk.

switchport trunk allowed vlan remove 1

7. Configure the interface as a Layer 2 trunk. (Required only if the interface is a Layer 2 access port or to specify the trunking mode.)

switchport mode trunk


8. Specify the interface to configure. Specify the port mode for the link in a port channel. After LACP is enabled, configure each link or the entire channel as active or passive.

When you run port channels with no associated protocol, the port-channel mode is always on.

The default port-channel mode is on.

interface GigabitEthernet1/1

switchport






channel-group 1 mode active

9. Specify other interfaces that connect to the HPE Networking switches.

interface GigabitEthernet1/2

switchport






channel-group 1 mode active

10. Verify port-channel 1 is correctly formed with the 5900 Switch.

show interfaces port-channel 1 etherchannel

Summary With Hadoop server nodes ranging from 100s to 1000s, network design and configuration is an important consideration for processing the data and transferring data between the nodes efficiently. With the explosion of data over the last 5-10 years, and the need to ingest and process this data in a timely manner, organizations are looking to accelerate their analytics applications and data processing pipelines by exploiting the advances in modern computing and networking infrastructure. This paper provided design considerations and recommendations on how these operations, iLO, edge and Hadoop data networks are to be configured with some of the best practices while configuring the switches and servers. The paper also provides information on how to configure HPE and Cisco networking switches with interoperability considerations.


Appendix A: Bill of materials Table A-1. BOM for HPE 5940-48XGT-6QSFP Switch

Qty Part Number Description

2 JH394A HPE 5940-48XGT-6QSFP+ switch

2 JG326A HPE X240 40G QSFP+ QSFP+ 1m DAC cable

4 JC680A HPE A58x0AF 650W AC power supply

4 JG553A HPE X712 Bck(pwr)-Frt(prt) HV fan tray

Table A-2. BOM for HPE FF 5930-32QSFP Switch


2 JG726A HPE FF 5930-32QSFP+

4 JC680A HPE 58x0AF 650W AC Power Supply

4 JG553A HPE X712 Bck(pwr)-Frt(prt) HV Fan Tray



Table A-3. BOM for HPE 5900AF-48G-4XG-2QSFP


1 JG510A HPE 5900AF-48G-4XG-2QSFP+ switch


2 JC682A HPE A58x0AF Back (power side) to front (port side) airflow fan tray

1 806565-B21 HPE Apollo 4200 Gen9 iLO Mgmt Prt kit (one kit per HPE Apollo 4200 Gen9 server)

Table A-4. BOM for HPE FF 5950 32Q28 Switch


2 JH321A HPE FF 5950 32Q28 Switch


12 JH389A HPE X712 Bck(pwr)-Frt(prt) HV2 Fan Tray

4 JL271A HPE X240 100G QSFP28 1m DAC Cable

Table A-5. BOM for HPE FF 5950 48SFP28 8QSFP28 Switch


2 JH402A HPE FF 5950 48SFP28 8QSFP28 Switch


10 JH389A HPE X712 Bck(pwr)-Frt(prt) HV2 Fan Tray

4 JL294A HPE X240 25G SFP28 to SFP28 1m DAC


Table A-6. BOM for Mellanox SN2100 Ethernet Switch

Qty Part Number Description Notes

2 MSN2100-CB2F Spectrum based 100GbE, 1U Open Ethernet switch with MLNX-OS, 16 QSFP28 ports, 2 Power Supplies (AC), RoHS6

P2C= Power_Supply-to-Connector (Cold Aisle Power_Supply and Power Cords Present)

1 MTEF-KIT-D Rack installation kit for SN2100 series short depth 1U switches Allows installation of one or two switches side-by-side into standard depth racks

Table A-7. BOM for Arista 7050X Switch


1 JH588A Arista 7050X 32XGT 4QSFP+ BF AC Switch

Table A-8. Miscellaneous adapters


1 655874-B21 HPE QSFP/SFP+ Adapter Kit

1 JD089B HPE X120 1G SFP RJ45 T Transceiver

1 JD092B HPE X130 10G SFP+ LC SR Transceiver


Appendix B: HPE 5940 Switch configuration # version 7.1.045, Release 2422P01 # irf mac-address persistent timer irf auto-update enable undo irf link-delay irf member 1 priority 32 irf member 2 priority 1 irf mode normal # lldp global enable # interface range name compute-net-18 interface Ten-GigabitEthernet1/0/18:1 to Ten-GigabitEthernet1/0/18:4 Ten-GigabitEthernet2/0/18:1 to Ten-GigabitEthernet2/0/18:4 interface range name compute-net-19 interface Ten-GigabitEthernet1/0/19:1 to Ten-GigabitEthernet1/0/19:4 Ten-GigabitEthernet2/0/19:1 to Ten-GigabitEthernet2/0/19:4 interface range name compute-net-20 interface Ten-GigabitEthernet1/0/20:1 to Ten-GigabitEthernet1/0/20:4 Ten-GigabitEthernet2/0/20:1 to Ten-GigabitEthernet2/0/20:4 interface range name customer-net interface FortyGigE1/0/22 to FortyGigE1/0/27 FortyGigE2/0/22 to FortyGigE2/0/27 interface range name irf-net interface FortyGigE1/0/29 to FortyGigE1/0/32 FortyGigE2/0/29 to FortyGigE2/0/32 interface range name management-net interface Ten-GigabitEthernet1/0/21:1 to Ten-GigabitEthernet1/0/21:3 Ten-GigabitEthernet2/0/21:1 to Ten-GigabitEthernet2/0/21:3 interface range name storage-net interface FortyGigE1/0/1 to FortyGigE1/0/4 FortyGigE2/0/1 to FortyGigE2/0/4 # system-working-mode standard password-recovery enable # vlan 1 # irf-port 1/1 port group interface FortyGigE1/0/29 port group interface FortyGigE1/0/30 port group interface FortyGigE1/0/31 port group interface FortyGigE1/0/32 # irf-port 2/2 port group interface FortyGigE2/0/29 port group interface FortyGigE2/0/30 port group interface FortyGigE2/0/31 port group interface FortyGigE2/0/32

# stp global enable # interface Bridge-Aggregation1 description Downlink to 5900 port link-type trunk port trunk permit vlan all # interface NULL0 # interface FortyGigE1/0/1 port link-mode bridge port link-type trunk port trunk permit vlan all flow-interval 5 flow-control stp edged-port lldp tlv-enable dot1-tlv dcbx dldp enable qos trust dot1p # interface FortyGigE1/0/2 # SAME AS ABOVE FortyGigE1/0/1 # interface FortyGigE1/0/3 # SAME AS ABOVE FortyGigE1/0/1 # interface FortyGigE1/0/4 # SAME AS ABOVE FortyGigE1/0/1 # interface FortyGigE1/0/5 port link-mode bridge shutdown # #SKIPPING OTHER INTERFACES # # interface FortyGigE1/0/28 port link-mode bridge description m11sw03 1/0/52 port link-type trunk port trunk permit vlan all port link-aggregation group 1 #


interface Ten-GigabitEthernet1/0/18:1 port link-mode bridge port link-type trunk port trunk permit vlan all flow-interval 5 flow-control stp edged-port lldp tlv-enable dot1-tlv dcbx dldp enable qos trust dot1p # interface Ten-GigabitEthernet1/0/21:1 port link-mode bridge port link-type trunk port trunk permit vlan all flow-interval 5 flow-control stp edged-port lldp tlv-enable dot1-tlv dcbx dldp enable qos trust dot1p #

# snmp-agent snmp-agent local-engineid 800063A2804431926E617300000001 snmp-agent community read public snmp-agent sys-info version all # ssh server enable # radius scheme system user-name-format without-domain user-group admin # user-group system # local-user admin class manage password hash xxx service-type ssh terminal group admin authorization-attribute user-profile admin authorization-attribute user-role level-3 authorization-attribute user-role network-admin authorization-attribute user-role network-operator #


Appendix C: Configuring the Mellanox adapter The recommended adapter for the Apollo 4200 storage nodes is the HPE 765285-B21 40Gb adapter. You should be aware the drivers are not included in the OS release and the adapters will not be visible to the operating system until the latest software and firmware has been loaded. The latest drivers and installation guidelines can be found at http://h20566.www2.hpe.com/hpsc/swd/public/readIndex?sp4ts.oid=7152863.

Note Mellanox OFED installation procedures for HPE 765285-B21 adapter will require gcc-gfortran, tk and tcl and their dependences. These are not part of the standard RHEL installation.

The following changes should only be made after the latest Mellanox OFED drivers have been installed as the install process will reset the configuration files.

Optimizing the HPE 764285-B21 40Gb adapter Increase ring buffer sizes. In the example, the network adapter devices are eth0 and eth1.

ethtool -G eth0 rx 8192 tx 8192 ethtool -G eth1 rx 8192 tx 8192

Enable flow control (pause):

ethtool -A eth0 rx on tx on ethtool -A eth1 rx on tx on

Note These changes need to be reapplied every time the Mellanox OFED driver is updated:

In /etc/modprobe.d/mlnx.conf set options mlx4_core enable_sys_tune=1 In /etc/infiniband/openib.conf set RUN_AFFINITY_TUNER=yes /usr/sbin/mlnx_affinity start

http://h20566.www2.hpe.com/hpsc/swd/public/readIndex?sp4ts.oid=7152863


Appendix D: Glossary of terms Table D-1. Glossary

Term Definition

Bridge Aggregation Comware OS terminology for Link Aggregation.

Distributed Trunking (DT) A link aggregation technique, where two or more links across two switches are aggregated together to form a trunk.

IEEE 802.3ad An industry standard protocol that allows multiple links/ports to run in parallel, providing a virtual single link/port. The protocol provides greater bandwidth, load balancing, and redundancy.

Intelligent Resilient Fabric (IRF) Technology in certain HPE Networking switches that enables the ability to connect similar devices together to create a virtualized distributed device. This virtualization technology realizes the cooperation, unified management, and non-stop maintenance of multiple devices.

LACP Link Aggregation Control Protocol (see IEEE 802.3ad or 802.1ax)

Port Aggregation Combining ports to provide one or more of the following benefits: greater bandwidth, load balancing, and redundancy.

Port Bonding A term typically used in the UNIX®/Linux world that is synonymous to NIC teaming in the Windows® world.

Spanning Tree Protocol (STP) Spanning Tree Protocol (STP) is standardized as IEEE 802.1D and ensures a loop-free topology for any bridged Ethernet local area network by preventing bridge loops and the broadcast traffic that results from them.

Trunking Combining ports to provide one or more of the following benefits: greater bandwidth, load balancing, and redundancy.


Sign up for updates

© Copyright 2017-2018 Hewlett Packard Enterprise Development LP. The information contained herein is subject to change without notice. The only warranties for Hewlett Packard Enterprise products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. Hewlett Packard Enterprise shall not be liable for technical or editorial errors or omissions contained herein.

Intel is a trademark of Intel Corporation in the U.S. and other countries. Red Hat is a registered trademark of Red Hat, Inc. in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other countries. Apache, Hadoop, Kafka, Spark, and Metron are either registered trademarks or trademarks of the Apache Software Foundation in the United States or other countries. UNIX is a registered trademark of The Open Group. Windows is either registered trademark or trademark of Microsoft Corporation in the United States and/or other countries.

a00004216enw, April 2018, Rev. 1

Resources and additional links HPE Reference Architectures, hpe.com/info/ra

HPE Servers, hpe.com/servers

HPE Storage, hpe.com/storage

HPE Networking, hpe.com/networking

HPE Technology Consulting Services, hpe.com/us/en/services/consulting.html

To help us improve our documents, please provide feedback at hpe.com/contact/feedback.

http://www.hpe.com/info/getupdated

http://www.facebook.com/sharer.php?u=https://www.hpe.com/h20195/V2/GetDocument.aspx?docname=a00004216ENW

http://twitter.com/home/?status=HPE%20Reference%20Configuration%20for%20networking%20best%20practices%20on%20the%20HPE%20Elastic%20Platform%20for%20Big%20Data%20Analytics%20(EPA)%20Hadoop%20ecosystem+@+https://www.hpe.com/h20195/V2/GetDocument.aspx?docname=a00004216ENW

http://www.linkedin.com/shareArticle?mini=true&ro=true&url=https://www.hpe.com/h20195/V2/GetDocument.aspx?docname=a00004216ENW&title=HPE%20Reference%20Configuration%20for%20networking%20best%20practices%20on%20the%20HPE%20Elastic%20Platform%20for%20Big%20Data%20Analytics%20(EPA)%20Hadoop%20ecosystem+&armin=armin

http://www.hpe.com

http://www.hpe.com/info/ra

http://www.hpe.com/servers

http://www.hpe.com/storage

http://www.hpe.com/networking

http://www.hpe.com/us/en/services/consulting.html

http://www.hpe.com/contact/feedback

HPE Reference Configuration for networking best practices ......Hadoop is an ecosystem of several...

Documents

Transcript of HPE Reference Configuration for networking best practices ......Hadoop is an ecosystem of several...