Lin Gu (lingu@ieee.org)
Computer Networks
Datacenter Networks
Rack-Mounted Servers: Sun Fire x4150 1U server
Scale Up vs. Scale Out
(Figure: the spectrum from scale-up to scale-out systems – SMP super server, departmental server, personal system, cluster of PCs, MPP.)
Data center networks
10’s to 100’s of thousands of hosts, often closely coupled, in close proximity:
• e-business (e.g., Amazon)
• content servers (e.g., YouTube, Akamai, Apple, Microsoft)
• search engines, data mining (e.g., Google)
Challenges:
• multiple applications, each serving massive numbers of clients
• managing/balancing load, avoiding processing, networking, and data bottlenecks
Inside a 40-ft Microsoft container, Chicago data center
(Figure: data center network topology – the Internet connects through a border router and access router to load balancers, Tier-1 switches, Tier-2 switches, and TOR switches above server racks 1-8 and hosts A, B, C.)
Data center networks
Load balancer: application-layer routing
• receives external client requests
• directs workload within data center
• returns results to external client (hiding data center internals from client)
Data center networks
Rich interconnection among switches and racks:
• increased throughput between racks (multiple routing paths possible)
• increased reliability via redundancy
Low Earth Orbit networks
• Some unsuccessful earlier attempts (Iridium, …); such systems may come back in the future
• It does not have to be a satellite – Google Loon, …
• Wireless communication is convenient and can be high-bandwidth; satellites can be an effective solution
Deep space communication
• Extremely long latency – what protocols work? How to build the transceivers? Befriend physicists
• How to communicate in the Solar System, in the Galaxy, or in deeper space?
“The network is the computer” – Sun Microsystems
Appendix
Motivations of using Clusters over Specialized Parallel Computers
• Individual PCs are becoming increasingly powerful
• Communication bandwidth between PCs is increasing and latency is decreasing (Gigabit Ethernet, Myrinet)
• PC clusters are easier to integrate into existing networks
• Typical low user utilization of PCs (<10%)
• Development tools for workstations and PCs are mature
• PC clusters are cheap and readily available
• Clusters can be easily grown
Cluster Architecture
(Figure: sequential and parallel applications run on top of a parallel programming environment and cluster middleware (single system image and availability infrastructure); multiple PCs/workstations, each with communications software and network interface hardware, are joined by a cluster interconnection network/switch.)
Major Components of a Datacenter
• Computing hardware (equipment racks)
• Power supply and distribution hardware
• Cooling hardware and cooling fluid distribution hardware
• Network infrastructure
• IT Personnel and office equipment
Datacenter Networking
Growth Trends in Datacenters
• Load on network & servers continues to grow rapidly
  – A rough estimate of annual growth rate: enterprise datacenters ~35%, Internet datacenters 50%-100%
  – Information access anywhere, anytime, from many devices: desktops, laptops, PDAs & smart phones, sensor networks, proliferation of broadband
• Mainstream servers moving towards higher-speed links
  – 1-GbE to 10-GbE in 2008-2009
  – 10-GbE to 40-GbE in 2010-2012
• High-speed datacenter-MAN/WAN connectivity
  – High-speed datacenter syncing for disaster recovery
Datacenter Networking
• Networking is a large part of the total cost of the DC hardware
  – Large routers and high-bandwidth switches are very expensive
• Relatively unreliable – many components may fail
• Many major operators and companies design their own datacenter networking to save money and improve reliability/scalability/performance
  – The topology is often known
  – The number of nodes is limited
  – The protocols used in the DC are known
• Security is simpler inside the data center, but challenging at the border
• We can distribute applications to servers to distribute load and minimize hot spots
Datacenter Networking
Networking components (examples)
• High-performance & high-density switches & routers
  – Scaling to 512 10GbE ports per chassis
  – No need for proprietary protocols to scale
• Highly scalable DC border routers
  – 3.2 Tbps capacity in a single chassis
  – 10 million routes, 1 million in hardware
  – 2,000 BGP peers
  – 2K L3 VPNs, 16K L2 VPNs
  – High port density for GE and 10GE application connectivity
  – Security
(Figure: 768 1-GE ports downstream, 64 10-GE ports upstream.)
Datacenter Networking
Common data center topology (figure): Internet – Layer-3 router (core) – Layer-2/3 switch (aggregation) – Layer-2 switch (access) – servers, all within the datacenter.
Datacenter Networking
Data center network design goals
• High network bandwidth, low latency
• Reduce the need for large switches in the core
• Simplify the software; push complexity to the edge of the network
• Improve reliability
• Reduce capital and operating cost
Datacenter Networking
Avoid this… and simplify this…
Can we avoid using high-end switches?
• Expensive high-end switches are needed to scale up
• Single point of failure and bandwidth bottleneck
  – Experiences from real systems
• One answer: DCell
Interconnect
DCell Ideas
• #1: Use mini-switches to scale out
• #2: Leverage servers to be part of the routing infrastructure
  – Servers have multiple ports and need to forward packets
• #3: Use recursion to scale and build a complete graph to increase capacity (see the counting sketch below)
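The slide only names the three ideas; the counting below is a rough sketch that follows the DCell paper's recursive construction (an assumption beyond this slide): a level-k DCell is built from t_{k-1} + 1 level-(k-1) DCells, so the server count satisfies t_k = t_{k-1}(t_{k-1} + 1), starting from t_0 = n servers on one mini-switch.

```python
# Rough sketch of DCell's recursive scale-out (construction details assumed
# from the DCell paper, not stated on this slide): t_k = t_{k-1} * (t_{k-1} + 1).
def dcell_servers(n, level):
    """Number of servers in a level-`level` DCell built on n-port mini-switches."""
    t = n                      # DCell_0: n servers on one mini-switch
    for _ in range(level):
        t = t * (t + 1)        # t_{k-1} + 1 copies of DCell_{k-1}, fully meshed
    return t

for k in range(4):
    print(k, dcell_servers(4, k))   # levels 0-3 with n = 4: 4, 20, 420, 176820
```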
Interconnect
One approach: a switched network with a hypercube interconnect
• Leaf switch: 40 1-Gbps ports + 2 10-Gbps ports
  – One switch per rack
  – Not replicated (if a switch fails, lose one rack of capacity)
• Core switch: 10 10-Gbps ports
  – The core switches form a hypercube
• Hypercube: the high-dimensional analogue of a cube
Data Center Networking
Hypercube properties
• Minimum hop count
• Even load distribution for all-to-all communication
• Can route around switch/link failures
• Simple routing, no routing tables (see the sketch below):
  – Outport = f(Dest xor NodeNum)
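A minimal sketch of the table-free routing rule above: the output port is simply a dimension in which the current node number and the destination differ (a set bit of their XOR), so each hop fixes one bit. The function name and port-numbering convention are illustrative, not from the slide.

```python
def hypercube_route(node, dest):
    """Hop-by-hop route in a hypercube: at each node, cross one dimension in
    which `node` and `dest` differ (a set bit of node XOR dest). The path
    length equals the Hamming distance, and no routing table is needed."""
    path = [node]
    while node != dest:
        diff = node ^ dest
        port = (diff & -diff).bit_length() - 1   # lowest differing dimension
        node ^= 1 << port                        # cross that dimension
        path.append(node)
    return path

# Example on the 16-node (dimension-4) hypercube shown below:
print(hypercube_route(0, 11))   # [0, 1, 3, 11]  (3 hops = Hamming distance)
```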
Interconnect
A 16-node (dimension 4) hypercube
Interconnect
64-switch Hypercube
(Figure: one container. Level 0: 32 40-port 1 Gb/s switches; Level 1: 8 10-port 10 Gb/s switches (64 10 Gb/s links); Level 2: 2 10-port 10 Gb/s switches (16 10 Gb/s links). Four 4x4 sub-cubes, 16 links each; 1280 Gb/s of links in total; 63 x 4 links to other containers, 4 links per container.)
Interconnect
How many servers can be connected in this system?
• 81,920 servers with 1 Gbps bandwidth (worked out below)
• Core switch: 10 Gbps port x 10
• Leaf switch: 1 Gbps port x 40 + 10 Gbps port x 2
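A back-of-the-envelope check of the 81,920 figure, assuming 64 containers in total (this one plus the 63 others it links to) and 32 level-0 leaf switches per container, each providing 40 1-Gb/s server ports:

```python
# Assumed breakdown (not stated explicitly on the slide): 64 containers x
# 32 level-0 leaf switches x 40 1-Gb/s server ports per leaf switch.
containers, leaf_switches_per_container, server_ports_per_leaf = 64, 32, 40
print(containers * leaf_switches_per_container * server_ports_per_leaf)  # 81920
```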
The Black Box
Data Center Networking
Typical Layer 2 & Layer 3 in existing systems
• Layer 2
  – One spanning tree for the entire network
    • Prevents looping
    • Ignores alternate paths
• Layer 3
  – Shortest-path routing between source and destination
  – Best-effort delivery
Interconnect
Problems with the common DC topology
• Single point of failure
• Oversubscription of links higher up in the topology
  – Trade-off between cost and provisioning
• Layer 3 will only use one of the existing equal-cost paths
• Packet re-ordering occurs if Layer 3 blindly takes advantage of path diversity
Interconnect
Fat-tree based solution
Connect hosts together using a fat-tree topology (see the counting sketch below)
• Infrastructure consists of cheap devices
  – Each port supports the same speed as the end host
• All devices can transmit at line speed if packets are distributed along existing paths
• A k-ary fat-tree is composed of switches with k ports
  – How many switches? … 5k²/4
  – How many connected hosts? … k³/4
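A small sketch that reproduces the two counts above from the layer structure of a k-ary fat-tree (k even): k²/4 core switches plus k pods with k/2 aggregation and k/2 edge switches each, and k/2 hosts per edge switch.

```python
def fat_tree_counts(k):
    """Return (switches, hosts) for a k-ary fat-tree built from k-port switches."""
    assert k % 2 == 0
    core = (k // 2) ** 2              # k^2/4 core switches
    aggregation = k * (k // 2)        # k pods x k/2 aggregation switches
    edge = k * (k // 2)               # k pods x k/2 edge switches
    hosts = edge * (k // 2)           # k/2 hosts per edge switch = k^3/4
    return core + aggregation + edge, hosts

print(fat_tree_counts(4))    # (20, 16)      -> 5*4^2/4 switches, 4^3/4 hosts
print(fat_tree_counts(48))   # (2880, 27648) -> the commonly cited larger example
```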
Interconnect
k-ary Fat-Tree (k=4)
(Figure: k²/4 core switches; k pods, each with k/2 aggregation switches and k/2 edge switches; k/2 hosts per edge switch. The same type of k-port switch is used in the core, aggregation, and edge layers.)
Fat-tree Modified
• Enforce a special addressing scheme in the DC
  – Allows hosts attached to the same switch to route only through that switch
  – Allows intra-pod traffic to stay within the pod
  – Address format: unused.PodNumber.SwitchNumber.EndHost
• Use two-level look-ups to distribute traffic and maintain packet ordering
Interconnect
2-Level look-ups
• First level is a prefix lookup
  – Used to route down the topology to the end host
• Second level is a suffix lookup
  – Used to route up towards the core
  – Diffuses and spreads out traffic
  – Maintains packet ordering by using the same ports for the same end host
(A small table sketch follows.)
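A minimal sketch of the idea; the table entries and the 10.pod.switch.host-style addresses are hypothetical, loosely following the unused.PodNumber.SwitchNumber.EndHost scheme above. Terminating prefixes route down toward a known pod/switch, and a suffix match on the host byte spreads up-bound traffic over core-facing ports while keeping all packets for one end host on one port.

```python
# Hypothetical two-level table for one aggregation switch in pod 2 (k = 4).
PREFIXES = {
    "10.2.0.": 0,   # down-port toward edge switch 0 in this pod
    "10.2.1.": 1,   # down-port toward edge switch 1 in this pod
}
UP_PORTS = [2, 3]   # core-facing ports chosen by the suffix (host-byte) lookup

def lookup(dst_ip):
    """First level: prefix match routes down. Second level: on a miss, a
    suffix match on the end-host byte picks an up port deterministically."""
    for prefix, port in PREFIXES.items():
        if dst_ip.startswith(prefix):
            return port                              # intra-pod: route down
    host = int(dst_ip.rsplit(".", 1)[1])
    return UP_PORTS[host % len(UP_PORTS)]            # inter-pod: route up

print(lookup("10.2.1.3"), lookup("10.0.1.3"), lookup("10.0.1.2"))   # 1 3 2
```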
Interconnect
Comparison of several schemes
• Hypercube: high-degree interconnect for large networks, difficult to scale incrementally
• Butterfly and fat-tree: cannot scale as fast as DCell
• De Bruijn: cannot incrementally expand
• DCell: low bandwidth between two clusters (sub-DCells)
Interconnect
Distributed Systems
(Figure: a DNS LB system directs users to multiple datacenters of rack-mounted servers such as the Sun Fire x4150 1U server.)
Users are geographically distributed, and computation is globally optimized.
Load Balancing
• The load balancing systems regulate global data center traffic
• They incorporate site health, load, user proximity, and service response for user site selection
• They provide transparent site failover in case of disaster or service outage
• Providing site selection for users
• Harnessing the benefits and intricacies of geo-distribution
• Leveraging both DNS and non-DNS methods for multi-site redundancy (a simple site-selection sketch follows)
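A toy sketch of site selection, not any vendor's implementation: filter out unhealthy sites, then prefer the site closest to the user (approximated here by an assumed RTT measurement), breaking ties by load. Real systems also fold in measured service response times and DNS-based redirection.

```python
# Hypothetical site table; the health/load/RTT fields stand in for the
# "site health, load, user proximity, and service response" inputs above.
SITES = [
    {"name": "San Jose",  "healthy": True,  "load": 0.7, "rtt_ms": 12},
    {"name": "London",    "healthy": True,  "load": 0.3, "rtt_ms": 95},
    {"name": "Hong Kong", "healthy": False, "load": 0.1, "rtt_ms": 140},
]

def select_site(sites):
    """Pick a healthy site, preferring proximity (RTT) and then lower load."""
    healthy = [s for s in sites if s["healthy"]]
    if not healthy:
        return None          # total outage: nothing to fail over to
    return min(healthy, key=lambda s: (s["rtt_ms"], s["load"]))

print(select_site(SITES)["name"])   # San Jose; if it fails, London takes over
```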
Global Data Center Deployment
Cloud and Globalization of Computation
Google’s Search System
Computing in an LSDS: the browser issues a query; DNS lookup; HTTP handling; GWS (Google Web Server); backend; HTTP response.
(Figure: users in San Jose, London, and Hong Kong send HTTP requests to Google.com; inside the data centers, GWS instances pass the queries to the backend.)
Google’s Cluster Architecture
Goals: a high-performance distributed system for search
• Thousands of machines collaborate to handle the workload
• Price-performance ratio
• Scalability
• Energy efficiency and cooling
• High availability
Luiz André Barroso, Jeffrey Dean, Urs Hölzle. Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, vol. 23, no. 2, pp. 22-28, Mar./Apr. 2003.
How to compute in a network?
Multiple computers on a perfect network may work like one larger traditional computer.
However, computing becomes more complex when
• messages can be lost/tampered/duplicated;
• bandwidth is limited;
• operations incur long latencies, non-uniform latencies, or both;
• events are asynchronous.
Hence, computation in an LSDS over imperfect networks may have to be organized in a different way from traditional computer systems.
How to correctly compute on an imperfect network?
Two Generals’ Problem
• Send a messenger, then expect the messenger to come back with one acknowledgment?
• Send 100 messengers?
• How to prove it is possible or impossible to reach an agreement?
Two generals want to agree on a time (“Attack at 5am.”) to attack an enemy positioned between them. If the attack is synchronized, the generals can defeat the enemy. Otherwise, they will be defeated by the enemy one by one. The generals can send messengers to each other, but a messenger may be caught by the enemy. Can the two generals reach an agreement?
Three Generals’ Problem in Paxos
• Who decides the attack time?
• When is the decision made and agreed on?
• What if one general betrayed? (the Byzantine Generals Problem)
Paxos: reach global consensus in a distributed system with packet loss.
Further reading: [Lamport98] Leslie Lamport. The Part-Time Parliament. ACM Trans. Comput. Syst. 16, 2 (May 1998), 133-169.
How to maintain state in a network? The Part-Time Parliament
• Priests can leave the Chamber (server crash, or isolation from the system) and may never come back (server failure).
• Messengers can leave the Chamber (delayed or out-of-order packets) and may never come back (packet loss).
Priests (legislators) in a parliament communicate with each other using messengers. Both the priests and the messengers can leave the parliament Chamber at any time, and may never come back. Can the parliament pass laws (decrees) and ensure consistency (no ambiguity/discrepancy on the content of a decree)?
Paxos
• Each priest keeps records of the passed decrees (and some additional information) in his/her ledger (nonvolatile storage).
• Messengers deliver the candidate decrees and votes.
Protocol 1: Suppose we know there are n priests. A priest constructs a decree, sends it to the other n-1 priests, and collects their votes that support the decree. A vote against the decree is equal to not voting. If there are n-1 votes for the decree, the decree is passed.
Problem?
(Figure: a priest proposes “Tax = 0”; all of the other priests reply OK.)
Paxos
• Resilient to server failures and packet loss.
• The state (passed or not passed) of a decree is defined unambiguously.
• What is a “majority”? The proposing priest may contact a “quorum” consisting of a majority of the priests.
Protocol 2: … A decree is passed when a majority votes for it. (A small quorum-intersection sketch follows.)
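A tiny illustration (not part of the protocol itself) of why a majority quorum is used: any two majorities of the n priests overlap in at least one priest, so two conflicting decrees cannot both gather quorums that never saw each other.

```python
from itertools import combinations

# Check that every pair of majority quorums of n = 5 priests intersects.
n = 5
priests = range(n)
majorities = [set(c) for size in range(n // 2 + 1, n + 1)
              for c in combinations(priests, size)]
print(all(a & b for a in majorities for b in majorities))   # True
```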
(Figure: a priest proposes “Tax = 0”; a majority of the priests reply OK.)
Problem?
Paxos
Clients can query any priest, and the priest may know the decree.
What if the particular priest does not know the decree?
Protocol 3: … Inform all priests about the passing of a decree.
(Figure: the proposing priest sends “Tax = 0” to a majority, collects OKs, and then all priests record “Tax = 0: done”.)
Problem?
Paxos
If all replies agree, the decree is (perhaps) unambiguous.
What if a priest in the majority set does not know?
Protocol 4: … Read from a majority set.
(Figure: a client asks “Tax = ?”; one priest in the majority answers “Tax = 0” while another answers “don’t know”.)
Problem?
Paxos
Will there be a majority? Can the majority be wrong?
Protocol 5: … Read following a majority.
Answers to a query should be consistent (identical or, at least, compatible).
(Figure: a client asks “Tax = ?”; one priest answers “Tax = 100” while another answers “Tax = 0”.)
Paxos
Protocol 6: Consider a single decree (e.g., tax = 0) –
1. One priest serves as the president and proposes a decree with a unique ballot number b. The president sends the ballot with the proposal to a set of priests.
2. A priest responds to the receipt of the proposal message by replying with its latest vote (LastVote) and a promise that it will not vote in any ballot whose ballot number is between the LastVote’s ballot number and b. LastVote can be null.
Paxos
Protocol 6 (continued):
3. After receiving promises from a majority set, the president selects the value for the decree based on the LastVotes of this set (the quorum), and sends the acceptance of the decree to the quorum.
4. The members of the quorum reply with a confirmation (vote) to the president; receiving all the quorum members’ confirmations (votes) means the decree is passed.
Paxos
Protocol 6 (continued):
5. After receiving votes from the whole quorum, the president records the decree in its ledger and sends a success message to all the priests.
6. Upon receiving the success message, a priest records the decree d in its ledger.
(A compact sketch of the whole protocol follows.)
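A compact, in-memory sketch of Protocol 6, assuming a single president, reliable in-process “messengers”, and no disk ledgers; the class and function names are illustrative, not Lamport’s.

```python
class Priest:
    def __init__(self):
        self.promised = 0        # highest ballot number promised so far
        self.last_vote = None    # (ballot, value) of the latest vote, or None

    def next_ballot(self, b):    # step 2: reply with LastVote and a promise
        if b > self.promised:
            self.promised = b
            return self.last_vote
        return "nack"

    def begin_ballot(self, b, value):   # step 4: vote if the promise allows it
        if b >= self.promised:
            self.promised = b
            self.last_vote = (b, value)
            return "voted"
        return "nack"

def propose(priests, ballot, my_value):
    """Steps 1, 3, 5: the president runs one ballot against a majority quorum."""
    quorum = priests[: len(priests) // 2 + 1]
    promises = [p.next_ballot(ballot) for p in quorum]
    if "nack" in promises:
        return None
    prior = [v for v in promises if v is not None]
    value = max(prior)[1] if prior else my_value    # step 3: honor prior votes
    if all(p.begin_ballot(ballot, value) == "voted" for p in quorum):
        return value                                # steps 5-6: decree passed
    return None

priests = [Priest() for _ in range(5)]
print(propose(priests, ballot=1, my_value="tax = 0"))   # tax = 0
```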
How to know the protocol works correctly?
Paxos
This leads to a system where every passed decree is the same as the first passed one.
All passed decrees are identical.
Three Generals’ Problem in Paxos
• Who decides the attack time? How to agree?
• What if one general betrayed? (the Byzantine Generals Problem)
(“Attack at 5am.”)
Can we use Paxos to solve the Three Generals’ Problem?
Beyond Single-Decree Paxos
• Multiple Paxos instances
• A sequence of instances
Further reading: [Lamport98] Leslie Lamport. The Part-Time Parliament. ACM Trans. Comput. Syst. 16, 2 (May 1998), 133-169.
Can we use Paxos to pass more than one decree? (A minimal sketch follows.)
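A minimal sketch of “a sequence of instances”, reusing the Priest and propose sketch above: each log slot gets its own independent single-decree instance. Real multi-decree Paxos adds a long-lived president, gap filling, and learning, which are omitted here; the decree strings are just example data.

```python
def pass_decrees(decrees, n_priests=5):
    """Run one independent single-decree Paxos instance per log slot."""
    log = []
    for slot, decree in enumerate(decrees):
        priests = [Priest() for _ in range(n_priests)]   # fresh per-slot state
        log.append((slot, propose(priests, ballot=1, my_value=decree)))
    return log

print(pass_decrees(["tax = 0", "olive exports are free"]))
# [(0, 'tax = 0'), (1, 'olive exports are free')]
```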