Lin Gu (lingu@ieee.org)
Computer Networks
Datacenter Networks
Rack-Mounted Servers: Sun Fire x4150 1U server
Scale Up vs. Scale Out
(Figure: the spectrum from scale-up to scale-out systems – SMP super server, departmental server, personal system, cluster of PCs, MPP.)
Data center networks
10’s to 100’s of thousands of hosts, often closely coupled, in close proximity:
• e-business (e.g., Amazon)
• content servers (e.g., YouTube, Akamai, Apple, Microsoft)
• search engines, data mining (e.g., Google)
Challenges:
• multiple applications, each serving massive numbers of clients
• managing/balancing load, avoiding processing, networking, and data bottlenecks
Inside a 40-ft Microsoft container, Chicago data center
(Figure: data center network topology – the Internet connects through a border router and access router to load balancers, Tier-1 switches, Tier-2 switches, and TOR switches above server racks 1-8 and hosts A, B, C.)
Data center networks
Load balancer: application-layer routing
• receives external client requests
• directs workload within data center
• returns results to external client (hiding data center internals from client)
Data center networks
Rich interconnection among switches and racks:
• increased throughput between racks (multiple routing paths possible)
• increased reliability via redundancy
Low Earth Orbit networks
• Some unsuccessful earlier attempts (Iridium, …); such systems may come back in the future
• It does not have to be a satellite – Google Loon, …
• Wireless communication is convenient and can be high-bandwidth; satellites can be an effective solution
Deep space communication
• Extremely long latency – what protocols work? How to build the transceivers? Befriend physicists
• How to communicate in the Solar System, in the Galaxy, or in deeper space?
“The network is the computer” – Sun Microsystems
Appendix
Motivations of using Clusters over Specialized Parallel Computers
• Individual PCs are becoming increasingly powerful
• Communication bandwidth between PCs is increasing and latency is decreasing (Gigabit Ethernet, Myrinet)
• PC clusters are easier to integrate into existing networks
• Typical low user utilization of PCs (<10%)
• Development tools for workstations and PCs are mature
• PC clusters are cheap and readily available
• Clusters can be easily grown
Cluster Architecture
(Figure: sequential and parallel applications run on top of a parallel programming environment and cluster middleware (single system image and availability infrastructure); multiple PCs/workstations, each with communications software and network interface hardware, are joined by a cluster interconnection network/switch.)
Major Components of a Datacenter
• Computing hardware (equipment racks)
• Power supply and distribution hardware
• Cooling hardware and cooling fluid distribution hardware
• Network infrastructure
• IT Personnel and office equipment
Datacenter Networking
Growth Trends in Datacenters
• Load on network & servers continues to grow rapidly
  – A rough estimate of annual growth rate: enterprise datacenters ~35%, Internet datacenters 50%-100%
  – Information access anywhere, anytime, from many devices: desktops, laptops, PDAs & smart phones, sensor networks, proliferation of broadband
• Mainstream servers moving towards higher-speed links
  – 1-GbE to 10-GbE in 2008-2009
  – 10-GbE to 40-GbE in 2010-2012
• High-speed datacenter-MAN/WAN connectivity
  – High-speed datacenter syncing for disaster recovery
Datacenter Networking
• Networking is a large part of the total cost of the DC hardware
  – Large routers and high-bandwidth switches are very expensive
• Relatively unreliable – many components may fail
• Many major operators and companies design their own datacenter networking to save money and improve reliability/scalability/performance
  – The topology is often known
  – The number of nodes is limited
  – The protocols used in the DC are known
• Security is simpler inside the data center, but challenging at the border
• We can distribute applications to servers to distribute load and minimize hot spots
Datacenter Networking
Networking components (examples)
• High-performance & high-density switches & routers
  – Scaling to 512 10GbE ports per chassis
  – No need for proprietary protocols to scale
• Highly scalable DC border routers
  – 3.2 Tbps capacity in a single chassis
  – 10 million routes, 1 million in hardware
  – 2,000 BGP peers
  – 2K L3 VPNs, 16K L2 VPNs
  – High port density for GE and 10GE application connectivity
  – Security
(Figure: 768 1-GE ports downstream, 64 10-GE ports upstream.)
Datacenter Networking
Common data center topology (figure): Internet – Layer-3 router (core) – Layer-2/3 switch (aggregation) – Layer-2 switch (access) – servers, all within the datacenter.
Datacenter Networking
Data center network design goals
• High network bandwidth, low latency
• Reduce the need for large switches in the core
• Simplify the software; push complexity to the edge of the network
• Improve reliability
• Reduce capital and operating cost
Datacenter Networking
Avoid this… and simplify this…
Can we avoid using high-end switches?
• Expensive high-end switches are needed to scale up
• Single point of failure and bandwidth bottleneck
  – Experiences from real systems
• One answer: DCell
Interconnect
DCell Ideas
• #1: Use mini-switches to scale out
• #2: Leverage servers to be part of the routing infrastructure
  – Servers have multiple ports and need to forward packets
• #3: Use recursion to scale and build a complete graph to increase capacity (see the counting sketch below)
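The slide only names the three ideas; the counting below is a rough sketch that follows the DCell paper's recursive construction (an assumption beyond this slide): a level-k DCell is built from t_{k-1} + 1 level-(k-1) DCells, so the server count satisfies t_k = t_{k-1}(t_{k-1} + 1), starting from t_0 = n servers on one mini-switch.

```python
# Rough sketch of DCell's recursive scale-out (construction details assumed
# from the DCell paper, not stated on this slide): t_k = t_{k-1} * (t_{k-1} + 1).
def dcell_servers(n, level):
    """Number of servers in a level-`level` DCell built on n-port mini-switches."""
    t = n                      # DCell_0: n servers on one mini-switch
    for _ in range(level):
        t = t * (t + 1)        # t_{k-1} + 1 copies of DCell_{k-1}, fully meshed
    return t

for k in range(4):
    print(k, dcell_servers(4, k))   # levels 0-3 with n = 4: 4, 20, 420, 176820
```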
Interconnect
One approach: a switched network with a hypercube interconnect
• Leaf switch: 40 1-Gbps ports + 2 10-Gbps ports
  – One switch per rack
  – Not replicated (if a switch fails, lose one rack of capacity)
• Core switch: 10 10-Gbps ports
  – The core switches form a hypercube
• Hypercube: the high-dimensional analogue of a cube
Data Center Networking
Hypercube properties
• Minimum hop count
• Even load distribution for all-to-all communication
• Can route around switch/link failures
• Simple routing, no routing tables (see the sketch below):
  – Outport = f(Dest xor NodeNum)
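A minimal sketch of the table-free routing rule above: the output port is simply a dimension in which the current node number and the destination differ (a set bit of their XOR), so each hop fixes one bit. The function name and port-numbering convention are illustrative, not from the slide.

```python
def hypercube_route(node, dest):
    """Hop-by-hop route in a hypercube: at each node, cross one dimension in
    which `node` and `dest` differ (a set bit of node XOR dest). The path
    length equals the Hamming distance, and no routing table is needed."""
    path = [node]
    while node != dest:
        diff = node ^ dest
        port = (diff & -diff).bit_length() - 1   # lowest differing dimension
        node ^= 1 << port                        # cross that dimension
        path.append(node)
    return path

# Example on the 16-node (dimension-4) hypercube shown below:
print(hypercube_route(0, 11))   # [0, 1, 3, 11]  (3 hops = Hamming distance)
```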
Interconnect
A 16-node (dimension 4) hypercube
Interconnect
64-switch Hypercube
(Figure: one container. Level 0: 32 40-port 1 Gb/s switches; Level 1: 8 10-port 10 Gb/s switches (64 10 Gb/s links); Level 2: 2 10-port 10 Gb/s switches (16 10 Gb/s links). Four 4x4 sub-cubes, 16 links each; 1280 Gb/s of links in total; 63 x 4 links to other containers, 4 links per container.)
Interconnect
How many servers can be connected in this system?
• 81,920 servers with 1 Gbps bandwidth (worked out below)
• Core switch: 10 Gbps port x 10
• Leaf switch: 1 Gbps port x 40 + 10 Gbps port x 2
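A back-of-the-envelope check of the 81,920 figure, assuming 64 containers in total (this one plus the 63 others it links to) and 32 level-0 leaf switches per container, each providing 40 1-Gb/s server ports:

```python
# Assumed breakdown (not stated explicitly on the slide): 64 containers x
# 32 level-0 leaf switches x 40 1-Gb/s server ports per leaf switch.
containers, leaf_switches_per_container, server_ports_per_leaf = 64, 32, 40
print(containers * leaf_switches_per_container * server_ports_per_leaf)  # 81920
```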
The Black Box
Data Center Networking
Typical Layer 2 & Layer 3 in existing systems
• Layer 2
  – One spanning tree for the entire network
    • Prevents looping
    • Ignores alternate paths
• Layer 3
  – Shortest-path routing between source and destination
  – Best-effort delivery
Interconnect
Problems with the common DC topology
• Single point of failure
• Oversubscription of links higher up in the topology
  – Trade-off between cost and provisioning
• Layer 3 will only use one of the existing equal-cost paths
• Packet re-ordering occurs if Layer 3 blindly takes advantage of path diversity
Interconnect
Fat-tree based solution
Connect hosts together using a fat-tree topology (see the counting sketch below)
• Infrastructure consists of cheap devices
  – Each port supports the same speed as the end host
• All devices can transmit at line speed if packets are distributed along existing paths
• A k-ary fat-tree is composed of switches with k ports
  – How many switches? … 5k²/4
  – How many connected hosts? … k³/4
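A small sketch that reproduces the two counts above from the layer structure of a k-ary fat-tree (k even): k²/4 core switches plus k pods with k/2 aggregation and k/2 edge switches each, and k/2 hosts per edge switch.

```python
def fat_tree_counts(k):
    """Return (switches, hosts) for a k-ary fat-tree built from k-port switches."""
    assert k % 2 == 0
    core = (k // 2) ** 2              # k^2/4 core switches
    aggregation = k * (k // 2)        # k pods x k/2 aggregation switches
    edge = k * (k // 2)               # k pods x k/2 edge switches
    hosts = edge * (k // 2)           # k/2 hosts per edge switch = k^3/4
    return core + aggregation + edge, hosts

print(fat_tree_counts(4))    # (20, 16)      -> 5*4^2/4 switches, 4^3/4 hosts
print(fat_tree_counts(48))   # (2880, 27648) -> the commonly cited larger example
```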
Interconnect
k-ary Fat-Tree (k=4)
(Figure: k²/4 core switches; k pods, each with k/2 aggregation switches and k/2 edge switches; k/2 hosts per edge switch. The same type of k-port switch is used in the core, aggregation, and edge layers.)
Fat-tree Modified
• Enforce a special addressing scheme in the DC
  – Allows hosts attached to the same switch to route only through that switch
  – Allows intra-pod traffic to stay within the pod
  – Address format: unused.PodNumber.SwitchNumber.EndHost
• Use two-level look-ups to distribute traffic and maintain packet ordering
Interconnect
2-Level look-ups
• First level is a prefix lookup
  – Used to route down the topology to the end host
• Second level is a suffix lookup
  – Used to route up towards the core
  – Diffuses and spreads out traffic
  – Maintains packet ordering by using the same ports for the same end host
(A small table sketch follows.)
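A minimal sketch of the idea; the table entries and the 10.pod.switch.host-style addresses are hypothetical, loosely following the unused.PodNumber.SwitchNumber.EndHost scheme above. Terminating prefixes route down toward a known pod/switch, and a suffix match on the host byte spreads up-bound traffic over core-facing ports while keeping all packets for one end host on one port.

```python
# Hypothetical two-level table for one aggregation switch in pod 2 (k = 4).
PREFIXES = {
    "10.2.0.": 0,   # down-port toward edge switch 0 in this pod
    "10.2.1.": 1,   # down-port toward edge switch 1 in this pod
}
UP_PORTS = [2, 3]   # core-facing ports chosen by the suffix (host-byte) lookup

def lookup(dst_ip):
    """First level: prefix match routes down. Second level: on a miss, a
    suffix match on the end-host byte picks an up port deterministically."""
    for prefix, port in PREFIXES.items():
        if dst_ip.startswith(prefix):
            return port                              # intra-pod: route down
    host = int(dst_ip.rsplit(".", 1)[1])
    return UP_PORTS[host % len(UP_PORTS)]            # inter-pod: route up

print(lookup("10.2.1.3"), lookup("10.0.1.3"), lookup("10.0.1.2"))   # 1 3 2
```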
Interconnect
Comparison of several schemes
• Hypercube: high-degree interconnect for large networks, difficult to scale incrementally
• Butterfly and fat-tree: cannot scale as fast as DCell
• De Bruijn: cannot incrementally expand
• DCell: low bandwidth between two clusters (sub-DCells)
Interconnect
Distributed Systems
(Figure: a DNS LB system directs users to multiple datacenters of rack-mounted servers such as the Sun Fire x4150 1U server.)
Users are geographically distributed, and computation is globally optimized.
Load Balancing
• The load balancing systems regulate global data center traffic
• They incorporate site health, load, user proximity, and service response for user site selection
• They provide transparent site failover in case of disaster or service outage
• Providing site selection for users
• Harnessing the benefits and intricacies of geo-distribution
• Leveraging both DNS and non-DNS methods for multi-site redundancy (a simple site-selection sketch follows)
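A toy sketch of site selection, not any vendor's implementation: filter out unhealthy sites, then prefer the site closest to the user (approximated here by an assumed RTT measurement), breaking ties by load. Real systems also fold in measured service response times and DNS-based redirection.

```python
# Hypothetical site table; the health/load/RTT fields stand in for the
# "site health, load, user proximity, and service response" inputs above.
SITES = [
    {"name": "San Jose",  "healthy": True,  "load": 0.7, "rtt_ms": 12},
    {"name": "London",    "healthy": True,  "load": 0.3, "rtt_ms": 95},
    {"name": "Hong Kong", "healthy": False, "load": 0.1, "rtt_ms": 140},
]

def select_site(sites):
    """Pick a healthy site, preferring proximity (RTT) and then lower load."""
    healthy = [s for s in sites if s["healthy"]]
    if not healthy:
        return None          # total outage: nothing to fail over to
    return min(healthy, key=lambda s: (s["rtt_ms"], s["load"]))

print(select_site(SITES)["name"])   # San Jose; if it fails, London takes over
```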
Global Data Center Deployment
Cloud and Globalization of Computation
Google’s Search System
Computing in an LSDS: the browser issues a query; DNS lookup; HTTP handling; GWS (Google Web Server); backend; HTTP response.
(Figure: users in San Jose, London, and Hong Kong send HTTP requests to Google.com; inside the data centers, GWS instances pass the queries to the backend.)
Google’s Cluster Architecture
Goals: a high-performance distributed system for search
• Thousands of machines collaborate to handle the workload
• Price-performance ratio
• Scalability
• Energy efficiency and cooling
• High availability
Luiz André Barroso, Jeffrey Dean, Urs Hölzle. Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, vol. 23, no. 2, pp. 22-28, Mar./Apr. 2003.
How to compute in a network?
Multiple computers on a perfect network may work like one larger traditional computer.
However, computing becomes more complex when
• messages can be lost/tampered/duplicated;
• bandwidth is limited;
• operations incur long latencies, non-uniform latencies, or both;
• events are asynchronous.
Hence, computation in an LSDS over imperfect networks may have to be organized in a different way from traditional computer systems.
How to correctly compute on an imperfect network?
Two Generals’ Problem
• Send a messenger, then expect the messenger to come back with one acknowledgment?
• Send 100 messengers?
• How to prove it is possible or impossible to reach an agreement?
Two generals want to agree on a time (“Attack at 5am.”) to attack an enemy positioned between them. If the attack is synchronized, the generals can defeat the enemy. Otherwise, they will be defeated by the enemy one by one. The generals can send messengers to each other, but a messenger may be caught by the enemy. Can the two generals reach an agreement?
Three Generals’ Problem in Paxos
• Who decides the attack time?
• When is the decision made and agreed on?
• What if one general betrayed? (the Byzantine Generals Problem)
Paxos: reach global consensus in a distributed system with packet loss.
Further reading: [Lamport98] Leslie Lamport. The Part-Time Parliament. ACM Trans. Comput. Syst. 16, 2 (May 1998), 133-169.
How to maintain state in a network? The Part-Time Parliament
• Priests can leave the Chamber (server crash, or isolation from the system) and may never come back (server failure).
• Messengers can leave the Chamber (delayed or out-of-order packets) and may never come back (packet loss).
Priests (legislators) in a parliament communicate with each other using messengers. Both the priests and the messengers can leave the parliament Chamber at any time, and may never come back. Can the parliament pass laws (decrees) and ensure consistency (no ambiguity/discrepancy on the content of a decree)?
Paxos
• Each priest keeps records of the passed decrees (and some additional information) in his/her ledger (nonvolatile storage).
• Messengers deliver the candidate decrees and votes.
Protocol 1: Suppose we know there are n priests. A priest constructs a decree, sends it to the other n-1 priests, and collects their votes that support the decree. A vote against the decree is equal to not voting. If there are n-1 votes for the decree, the decree is passed.
Problem?
(Figure: a priest proposes “Tax = 0”; all of the other priests reply OK.)
Paxos
• Resilient to server failures and packet loss.
• The state (passed or not passed) of a decree is defined unambiguously.
• What is a “majority”? The proposing priest may contact a “quorum” consisting of a majority of the priests.
Protocol 2: … A decree is passed when a majority votes for it. (A small quorum-intersection sketch follows.)
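A tiny illustration (not part of the protocol itself) of why a majority quorum is used: any two majorities of the n priests overlap in at least one priest, so two conflicting decrees cannot both gather quorums that never saw each other.

```python
from itertools import combinations

# Check that every pair of majority quorums of n = 5 priests intersects.
n = 5
priests = range(n)
majorities = [set(c) for size in range(n // 2 + 1, n + 1)
              for c in combinations(priests, size)]
print(all(a & b for a in majorities for b in majorities))   # True
```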
(Figure: a priest proposes “Tax = 0”; a majority of the priests reply OK.)
Problem?
Paxos
Clients can query any priest, and the priest may know the decree.
What if the particular priest does not know the decree?
Protocol 3: … Inform all priests about the passing of a decree.
(Figure: the proposing priest sends “Tax = 0” to a majority, collects OKs, and then all priests record “Tax = 0: done”.)
Problem?
Paxos
If all replies agree, the decree is (perhaps) unambiguous.
What if a priest in the majority set does not know?
Protocol 4: … Read from a majority set.
(Figure: a client asks “Tax = ?”; one priest in the majority answers “Tax = 0” while another answers “don’t know”.)
Problem?
Paxos
Will there be a majority? Can the majority be wrong?
Protocol 5: … Read following a majority.
Answers to a query should be consistent (identical or, at least, compatible).
(Figure: a client asks “Tax = ?”; one priest answers “Tax = 100” while another answers “Tax = 0”.)
Paxos
Protocol 6: Consider a single decree (e.g., tax = 0) –
1. One priest serves as the president and proposes a decree with a unique ballot number b. The president sends the ballot with the proposal to a set of priests.
2. A priest responds to the receipt of the proposal message by replying with its latest vote (LastVote) and a promise that it will not vote in any ballot whose ballot number is between the LastVote’s ballot number and b. LastVote can be null.
Paxos
Protocol 6 (continued):
3. After receiving promises from a majority set, the president selects the value for the decree based on the LastVotes of this set (the quorum), and sends the acceptance of the decree to the quorum.
4. The members of the quorum reply with a confirmation (vote) to the president; receiving all the quorum members’ confirmations (votes) means the decree is passed.
Paxos
Protocol 6 (continued):
5. After receiving votes from the whole quorum, the president records the decree in its ledger and sends a success message to all the priests.
6. Upon receiving the success message, a priest records the decree d in its ledger.
(A compact sketch of the whole protocol follows.)
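A compact, in-memory sketch of Protocol 6, assuming a single president, reliable in-process “messengers”, and no disk ledgers; the class and function names are illustrative, not Lamport’s.

```python
class Priest:
    def __init__(self):
        self.promised = 0        # highest ballot number promised so far
        self.last_vote = None    # (ballot, value) of the latest vote, or None

    def next_ballot(self, b):    # step 2: reply with LastVote and a promise
        if b > self.promised:
            self.promised = b
            return self.last_vote
        return "nack"

    def begin_ballot(self, b, value):   # step 4: vote if the promise allows it
        if b >= self.promised:
            self.promised = b
            self.last_vote = (b, value)
            return "voted"
        return "nack"

def propose(priests, ballot, my_value):
    """Steps 1, 3, 5: the president runs one ballot against a majority quorum."""
    quorum = priests[: len(priests) // 2 + 1]
    promises = [p.next_ballot(ballot) for p in quorum]
    if "nack" in promises:
        return None
    prior = [v for v in promises if v is not None]
    value = max(prior)[1] if prior else my_value    # step 3: honor prior votes
    if all(p.begin_ballot(ballot, value) == "voted" for p in quorum):
        return value                                # steps 5-6: decree passed
    return None

priests = [Priest() for _ in range(5)]
print(propose(priests, ballot=1, my_value="tax = 0"))   # tax = 0
```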
How to know the protocol works correctly?
Paxos
This leads to a system where every passed decree is the same as the first passed one.
All passed decrees are identical.
Three Generals’ Problem in Paxos
• Who decides the attack time? How to agree?
• What if one general betrayed? (the Byzantine Generals Problem)
(“Attack at 5am.”)
Can we use Paxos to solve the Three Generals’ Problem?
Beyond Single-Decree Paxos
• Multiple Paxos instances
• A sequence of instances
Further reading: [Lamport98] Leslie Lamport. The Part-Time Parliament. ACM Trans. Comput. Syst. 16, 2 (May 1998), 133-169.
Can we use Paxos to pass more than one decree? (A minimal sketch follows.)
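A minimal sketch of “a sequence of instances”, reusing the Priest and propose sketch above: each log slot gets its own independent single-decree instance. Real multi-decree Paxos adds a long-lived president, gap filling, and learning, which are omitted here; the decree strings are just example data.

```python
def pass_decrees(decrees, n_priests=5):
    """Run one independent single-decree Paxos instance per log slot."""
    log = []
    for slot, decree in enumerate(decrees):
        priests = [Priest() for _ in range(n_priests)]   # fresh per-slot state
        log.append((slot, propose(priests, ballot=1, my_value=decree)))
    return log

print(pass_decrees(["tax = 0", "olive exports are free"]))
# [(0, 'tax = 0'), (1, 'olive exports are free')]
```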