B4: Experience with a Globally-Deployed Software Defined WAN [4]
Manjot Singh
Department of Computer Science and Engineering
Indian Institute of Technology Delhi
8285445568
ABSTRACT
B4 is a private WAN that connects Google's data centers across
various geographical locations. It has certain characteristics that
distinguish it from traditional WANs. It efficiently handles the unique
demands of Google data center connectivity: massive data
transfer, a modest number of connected sites, elastic traffic demands,
and the need for full control over edge servers and the network. B4 is a
Software-Defined Network that uses OpenFlow to control simple
custom switches. B4's centralized Traffic Engineering service
drives many links to nearly 100% utilization, compared with the
30-40% utilization typical of traditional WANs.
Categories and Subject Descriptors
C.2.2 [Network Protocols]: Routing Protocols
Keywords
Centralized Traffic Engineering; Wide-Area Networks (WANs);
Software-Defined Networking; Routing; OpenFlow
1 INTRODUCTION
Modern Wide-Area Networks (WANs) are vital for internet
performance. Typically, all applications are treated equally
regardless of their variable sensitivity to the provided capacity. To
provide reliability and tolerate link or router failures, WAN links
are provisioned to 30-40% average utilization. A WAN
connecting Google data centers requires significant bandwidth.
Google's data center WAN has certain characteristics which
distinguish it from a normal WAN. Google controls the applications,
servers, and LANs all the way to the edge of the network. Most
bandwidth-intensive applications perform large-scale data copies and
can adapt to available network capacity. Finally, there is a modest
number of data center deployments, making central control feasible.
To exploit these characteristics, B4 was built around a Software-
Defined Networking architecture and OpenFlow [5], allowing the
traffic engineering and routing protocols to be customized to these
unique requirements. The required level of scale, fault tolerance,
cost efficiency, and control could not be achieved with traditional
WAN architectures.
2 BACKGROUND
Google's WAN is among the largest, delivering a range of search,
video, cloud computing, and enterprise applications.
Architecturally, Google's WAN can be divided into two distinct
WANs. First, a user-facing network provides end services
and exchanges data with other Internet domains. Second, a WAN
provides connectivity among the data centers. End user
requests and responses are delivered between data centers and edge
caches across the first network. Data copies, index pushes, and end
user data replication are performed on the second network.
The various individual applications that run across B4
are categorized into three classes, ordered by increasing
volume, decreasing latency sensitivity, and decreasing priority: 1)
user data copies to remote data centers for availability and durability,
2) remote storage access for computation over distributed data
sources, and 3) large-scale data push to synchronize state across
data centers. A number of B4 characteristics shaped the design
approach:
Elastic Bandwidth Demands: The applications that
synchronize large data sets contribute the major share
of traffic and can tolerate temporary bandwidth reduction.
Modest Number of Sites: B4 connects data centers,
which are few in number.
End Application Control: Control over both the
applications and the site networks makes it possible to rate-limit
bursts at the network edge and enforce application priorities.
Cost Sensitivity: Provisioning WAN links at the traditional
30-40% utilization would make the deployment cost prohibitive.
3 DESIGN
3.1 Overview
The B4 architecture is logically divided into three layers: the
Global layer, the Site Controller layer, and the Switch Hardware
layer. The Switch Hardware layer forwards traffic and does not run
complex control software. The Site Controller layer includes
Network Control Servers (NCS) which host OpenFlow controllers
(OFCs) and Network Control Applications (NCAs). OFCs maintain
network state based on NCA directives and switch events, and
instruct the switches to set forwarding table entries consistent with
the changing network state. The Global layer includes centralized
applications (a central TE server and an SDN gateway) that provide
central control of the network through the site-level NCAs.
Standard routing is implemented with traffic engineering as an
overlay to keep them independent. This gives a "big red button"
to disable TE and fall back to shortest-path forwarding.
3.2 Switch Design
The primary reason for using custom-made hardware
was the non-availability of a platform that could support an SDN
deployment. Moreover, conventional switches come with deep buffers,
very large forwarding tables, and high-availability hardware support
that B4 does not need. By controlling the application transmission
rates, the need for deep buffers can be avoided. As
the number of sites is small, there is no need for large forwarding
tables. Switch failures occur mainly due to software issues rather
than hardware issues, so fault tolerance can be improved by
shifting software functionality off the switch, which also enables more
customization. The efficiency gains of the custom switches exceed
their extra cost, making them economically feasible.
B4 switches are built from multiple merchant silicon
chips assembled in a two-stage Clos topology with a copper
backplane [3]: a spine layer plus ingress/egress stages forming a
128-port 10GE switch out of 24 individual non-blocking switch chips
(see Figure 1). If the destination is not on the same ingress chip,
the packet is bounced to the spine layer, which then forwards it to
the appropriate output chip according to the packet's destination.
An OpenFlow Agent (OFA) was developed that
connects to a remote OFC, executes OpenFlow commands on the switch,
and forwards link/switch events to the OFC.
Figure 1. Topology of the custom-made switch.
3.3 Network Control Functionality
The NCS in the Site Controller layer is responsible for
most of the B4 functionality. Leader election for all control
functionality is done by Paxos [1], which at each site performs
application-level failure detection. Paxos is a family of protocols
for solving consensus in a network. A new leader is elected when
a majority of Paxos servers detect a failure; the new leader identifies
itself to the clients with a monotonically increasing
generation ID provided by Paxos.
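As a rough illustration of the generation-ID mechanism, the Python sketch below shows how a client can accept announcements only from the leader with the highest generation ID and ignore stale ones; the class and method names are hypothetical and not part of B4.

```python
# Minimal sketch (not Google's code): clients track the highest generation ID
# announced by elected leaders and ignore messages from stale leaders.

class LeaderTracker:
    """Accepts a new leader only if its generation ID exceeds any seen so far."""

    def __init__(self):
        self.current_leader = None
        self.current_generation = -1

    def on_leader_announcement(self, leader_id: str, generation: int) -> bool:
        # A stale leader (old generation) is ignored; a newer election wins.
        if generation <= self.current_generation:
            return False
        self.current_leader = leader_id
        self.current_generation = generation
        return True


if __name__ == "__main__":
    tracker = LeaderTracker()
    print(tracker.on_leader_announcement("ofc-a", 1))  # True: first leader
    print(tracker.on_leader_announcement("ofc-b", 2))  # True: newer generation
    print(tracker.on_leader_announcement("ofc-a", 1))  # False: stale announcement
```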
3.4 Routing
To concentrate on core SDN/OpenFlow functionality,
the open-source Quagga stack for BGP/ISIS was chosen, and
OpenFlow-based switch control was integrated with these existing
routing protocols. BGP and ISIS sessions run across the data plane
using the switch hardware ports, while Quagga runs on the NCS with
no data plane connectivity. A Routing Application Proxy (RAP) was
written to connect Quagga and the OpenFlow switches. It provides
BGP/ISIS route updates and interface updates from the switches to
Quagga, and carries routing-protocol packets flowing between the
switches and Quagga.
RAP works as a translator that converts the RIB entries of Quagga
(a network-level view of global connectivity) into the low-level
hardware tables used by the OpenFlow data plane. Each RIB entry is
translated into two OpenFlow tables: a Flow table and an ECMP
Group table. The Flow table maps prefixes to entries in the ECMP
Group table, and each ECMP Group table entry identifies the next-hop
physical interfaces for a set of Flow table entries. RAP also informs
Quagga of port state changes. When such a change is detected,
the switch OFA sends an OpenFlow message to the OFC, which updates
its NIB; the update propagates to RAPd, which changes the netdev
state for the corresponding interface, and the change is in turn
propagated to Quagga for routing-protocol updates, shortening the
path between a switch interface change and the protocol processing.
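The Python sketch below illustrates the two-table translation described above, assuming a toy representation of the Flow table (prefix to group ID) and the ECMP Group table (group ID to next-hop ports); the names and the group-ID scheme are assumptions for illustration, not B4's actual data structures.

```python
# Hypothetical sketch of the two-table structure described above: a Flow table
# maps prefixes to an ECMP group, and the ECMP Group table maps each group to
# its next-hop physical interfaces. Names and shapes are illustrative only.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Tables:
    flow_table: Dict[str, int] = field(default_factory=dict)              # prefix -> group id
    ecmp_group_table: Dict[int, List[str]] = field(default_factory=dict)  # group id -> ports


def install_rib_entry(tables: Tables, prefix: str, next_hop_ports: List[str]) -> None:
    """Translate one RIB entry into Flow + ECMP Group table entries."""
    group_id = hash(tuple(sorted(next_hop_ports))) & 0xFFFF  # toy group identifier
    tables.ecmp_group_table.setdefault(group_id, next_hop_ports)
    tables.flow_table[prefix] = group_id


if __name__ == "__main__":
    t = Tables()
    install_rib_entry(t, "10.0.0.0/8", ["port-1", "port-2"])
    install_rib_entry(t, "172.16.0.0/12", ["port-3"])
    print(t.flow_table)
    print(t.ecmp_group_table)
```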
4 TRAFFIC ENGINEERING
Traffic engineering aims at identifying multiple paths
for each application and sharing bandwidth among applications
across those paths so as to deliver a max-min [2] fair allocation. A
max-min solution maximizes utilization as long as further gain
cannot be achieved without reducing the fair share of some
application.
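For intuition, the sketch below shows the textbook progressive-filling view of max-min fairness on a single shared link; B4's actual allocator operates over bandwidth functions and the whole topology, so this is only a toy illustration of the fairness criterion.

```python
# Illustration only: progressive filling on one shared link. Smaller demands
# are satisfied first; leftover capacity is split among the remaining flows.

from typing import Dict


def max_min_allocate(capacity: float, demands: Dict[str, float]) -> Dict[str, float]:
    """Split capacity so no flow can gain without hurting a smaller allocation."""
    allocation = {flow: 0.0 for flow in demands}
    unsatisfied = dict(demands)
    remaining = capacity
    while unsatisfied and remaining > 1e-9:
        share = remaining / len(unsatisfied)          # equal share for unsatisfied flows
        for flow, demand in list(unsatisfied.items()):
            grant = min(share, demand - allocation[flow])
            allocation[flow] += grant
            remaining -= grant
            if allocation[flow] >= demand - 1e-9:     # flow fully satisfied, freeze it
                del unsatisfied[flow]
    return allocation


if __name__ == "__main__":
    # copy ends up with ~6.0; the smaller demands (3.0 and 1.0) are fully met.
    print(max_min_allocate(10.0, {"copy": 8.0, "storage": 3.0, "push": 1.0}))
```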
4.1 Centralized TE Architecture
The TE server operates over four abstractions: Network Topology,
Flow Group, Tunnel, and Tunnel Group. In the Network Topology
graph, sites are represented as vertices and site-to-site
connectivity as edges. A Flow Group (FG) is defined as a
{source site, destination site, QoS} tuple. A site-level path is
represented by a Tunnel (T). A Tunnel Group (TG) maps a Flow Group
to a set of tunnels and corresponding weights defining the fraction
of traffic to be forwarded along each tunnel.
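A minimal Python data model of these abstractions, with field names assumed for illustration rather than taken from B4's actual schema, might look as follows.

```python
# Illustrative data model for the four TE abstractions named above.

from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class FlowGroup:
    source_site: str
    destination_site: str
    qos: str                      # QoS class, e.g. a priority marking


@dataclass(frozen=True)
class Tunnel:
    sites: Tuple[str, ...]        # ordered site-level path


@dataclass
class TunnelGroup:
    flow_group: FlowGroup
    splits: Dict[Tunnel, float]   # tunnel -> fraction of the FG's traffic


if __name__ == "__main__":
    fg = FlowGroup("site-A", "site-C", "BE1")
    direct = Tunnel(("site-A", "site-C"))
    via_b = Tunnel(("site-A", "site-B", "site-C"))
    tg = TunnelGroup(fg, {direct: 0.75, via_b: 0.25})
    print(tg)
```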
4.2 Bandwidth Functions
A bandwidth function is associated with every
application. It captures the application's relative priority on B4,
specifying the bandwidth allocation the application receives as a
function of the flow's fair share, a relative priority measured on an
arbitrary, dimensionless scale. These functions are derived from
administrator-specified static weights specifying relative
application priority.
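Assuming the simplest case, where allocated bandwidth grows linearly with fair share at a slope given by the application's weight and is capped at its demand, a bandwidth function could be sketched as below; the actual functions in B4 are piecewise linear.

```python
# Minimal sketch, assuming a linear bandwidth function: bandwidth grows with
# fair share at a slope set by the application's weight, capped at its demand.

def bandwidth_function(weight: float, demand: float):
    """Return a function mapping fair share -> allocated bandwidth."""
    def allocation(fair_share: float) -> float:
        return min(weight * fair_share, demand)
    return allocation


if __name__ == "__main__":
    app_hi = bandwidth_function(weight=10.0, demand=15.0)   # high-priority application
    app_lo = bandwidth_function(weight=1.0, demand=5.0)     # low-priority application
    for fs in (0.5, 1.0, 2.0, 10.0):
        print(fs, app_hi(fs), app_lo(fs))
```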
4.3 TE Optimization Algorithm
The TE optimization algorithm allocates an optimal fair share
among all FGs. It has two main components: (i) Tunnel Group
Generation, and (ii) Tunnel Group Quantization. Tunnel Group
Generation allocates bandwidth to FGs, using their bandwidth
functions to prioritize them at bottleneck edges. It iterates by
finding the bottleneck edge and allocating bandwidth based on demand
and priority, such that the FGs crossing it receive an equal fair
share or fully satisfy their demands. Tunnel Group Quantization then
adjusts the split ratios in each TG and quantizes them.
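The toy sketch below captures the bottleneck-iteration idea of Tunnel Group Generation under strong simplifying assumptions (one fixed path per FG, equal priorities, unbounded demand); the real algorithm also handles bandwidth functions, multiple tunnels per FG, and quantization.

```python
# Toy sketch of bottleneck iteration: raise all active FGs' rates together
# until some edge fills, freeze the FGs crossing that edge, and repeat.

from typing import Dict, List


def allocate(paths: Dict[str, List[str]], capacity: Dict[str, float]) -> Dict[str, float]:
    rate = {fg: 0.0 for fg in paths}
    active = set(paths)
    residual = dict(capacity)
    while active:
        # For each edge, how much more each active FG crossing it could get if
        # the remaining edge capacity were shared equally among those FGs.
        headroom = {}
        for edge, cap in residual.items():
            users = [fg for fg in active if edge in paths[fg]]
            if users:
                headroom[edge] = cap / len(users)
        if not headroom:
            break
        bottleneck = min(headroom, key=headroom.get)   # edge that saturates first
        delta = headroom[bottleneck]
        for fg in list(active):
            rate[fg] += delta
            for edge in paths[fg]:
                residual[edge] -= delta
            if bottleneck in paths[fg]:                # FG is limited by the bottleneck
                active.remove(fg)
    return rate


if __name__ == "__main__":
    paths = {"fg1": ["A-B", "B-C"], "fg2": ["A-B"], "fg3": ["B-C"]}
    capacity = {"A-B": 10.0, "B-C": 6.0}
    print(allocate(paths, capacity))   # fg1 and fg3 bottlenecked at B-C; fg2 gets the rest
```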
5 TE PROTOCOL AND OPENFLOW
5.1 TE State and OpenFlow
Switches in B4 play one of three roles: i) encapsulating
switch, ii) transit switch, and iii) decapsulating switch. The
encapsulating switch at the source site initiates tunnels,
encapsulating the packet with an outer IP header, and splits the
traffic among the associated tunnels. Packets are mapped to an FG
by matching the inner IP header against the prefixes
associated with the FG. The outer IP destination address
identifies the tunnel rather than the actual destination
address. TE preconfigures the switches to encapsulate packets
correctly, and each packet is hashed to a tunnel from the TG in the
desired ratio. A transit switch forwards a received packet
based on its tunnel ID. The decapsulating switch terminates the
tunnel identified by the tunnel ID and decapsulates the packet based
on the table preconfigured by TE. After decapsulation, packets are
forwarded based on the inner packet header using conventional
protocols.
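The hashing step can be pictured with the following sketch, which expands a Tunnel Group's split ratios into weighted buckets and hashes a flow key into them so that a flow sticks to one tunnel; the bucket granularity and hash inputs are assumptions, not B4's actual mechanism.

```python
# Illustrative sketch: pick a tunnel for a packet by hashing its flow identity
# into weighted buckets that mirror the Tunnel Group's split ratios.

import hashlib
from typing import Dict, List


def build_buckets(splits: Dict[str, float], granularity: int = 64) -> List[str]:
    """Expand split ratios (e.g. {'T1': 0.75, 'T2': 0.25}) into hash buckets."""
    buckets: List[str] = []
    for tunnel, weight in splits.items():
        buckets.extend([tunnel] * round(weight * granularity))
    return buckets


def pick_tunnel(buckets: List[str], flow_key: str) -> str:
    """Hash the flow key (e.g. a 5-tuple) so a flow consistently maps to one tunnel."""
    digest = hashlib.sha256(flow_key.encode()).digest()
    return buckets[int.from_bytes(digest[:4], "big") % len(buckets)]


if __name__ == "__main__":
    buckets = build_buckets({"T1": 0.75, "T2": 0.25})
    print(pick_tunnel(buckets, "10.1.1.5:443->10.2.2.9:51012/tcp"))
```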
5.2 Composing Routing and TE
In B4, routing and TE are deployed independently.
Thus, even if TE is disabled, the network continues to operate
without failure or packet loss. Routing/BGP populates the
routing table, matched by Longest Prefix Match (LPM), with
appropriate entries, whereas TE uses the Access Control List
(ACL) table to determine the forwarding action. Packets are
matched against both tables, the LPM table and the ACL table, but
the rule defined by the ACL takes precedence over the one defined by
LPM. For example, if the LPM rule says to forward the packet without
tunneling but the ACL rule says to forward it through a port
with tunneling, then the ACL rule dominates and the packet is
forwarded according to it.
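The precedence rule can be illustrated with the toy lookup below, in which the packet is matched against both tables and an ACL hit overrides the LPM result; the table contents and action strings are purely illustrative.

```python
# Toy lookup illustrating the precedence described above: match both tables,
# and let an ACL hit override the LPM result.

import ipaddress
from typing import Dict, Optional


def lpm_lookup(table: Dict[str, str], dst: str) -> Optional[str]:
    """Return the action of the longest matching prefix, if any."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, action in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, action)
    return best[1] if best else None


def forward(lpm_table: Dict[str, str], acl_table: Dict[str, str], dst: str) -> Optional[str]:
    acl_action = lpm_lookup(acl_table, dst)      # ACL sketched here as prefix rules
    lpm_action = lpm_lookup(lpm_table, dst)
    return acl_action or lpm_action              # ACL rule wins when both match


if __name__ == "__main__":
    lpm = {"10.2.0.0/16": "forward(port-7), no tunnel"}
    acl = {"10.2.3.0/24": "encap(tunnel-12), forward(port-9)"}
    print(forward(lpm, acl, "10.2.3.4"))   # ACL entry takes precedence
    print(forward(lpm, acl, "10.2.9.1"))   # falls back to the LPM entry
```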
5.3 Coordinating TE State Across Sites
Tunnel, Tunnel Group, and Flow Group rules are
coordinated across the OFCs of multiple sites by the TE server.
The TE output is translated and stored in a per-site Traffic
Engineering Database (TED). Each OFC refers to the TED to set
the forwarding state across its individual switches. When the TED
must be modified, the difference between the current state and the
desired state is computed, and a single TED op is generated for each
difference. Thus, a single TED op changes only one TED entry at one
OFC, and the TE server can issue multiple ops to a single site's OFC.
5.4 Dependencies and Failure
Dependencies among ops: Not all ops can be issued
simultaneously, as this may lead to packet drops.
For example, a tunnel must be configured before
any changes referencing it are made to the corresponding TG and FG.
Likewise, a tunnel cannot be removed prior to removing
the entries referencing it.
Synchronizing TED between TE and OFC: To compute
the difference between the current and the desired TED
state, there must be a common TED view between
the TE master and the OFC. A TE session is used
to facilitate this synchronization. A unique TE session
identifier is generated for the pair of endpoints. Both
endpoints sync their TED views at the start of the
session, allowing one endpoint to recover in case the other
restarts. The session ID ensures that only ops
belonging to the current session are applied.
Ordering issues: Site-specific sequence IDs are
attached to TE ops to ensure proper operation
of the network. The OFC maintains the highest
session sequence ID seen so far, and
subsequent ops with smaller IDs are rejected.
TE op failure: A TE op can fail for various reasons,
such as OFC rejection or RPC failure. Hence, when an op
is issued, TE marks the corresponding TED entry as
dirty, and the entry is marked clean only after receiving an
acknowledgement from the concerned OFC.
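The sketch below combines three of these mechanisms, session ID checks, monotonically increasing sequence IDs, and dirty/clean marking, in simplified form; the structure and names are illustrative rather than B4's actual protocol.

```python
# Sketch: ops carry a session ID and a sequence ID, the OFC side rejects stale
# ops, and the TE side keeps an entry "dirty" until the op is acknowledged.

from dataclasses import dataclass


@dataclass
class TEOp:
    session_id: str
    seq_id: int
    entry_key: str


class OFCSide:
    def __init__(self, session_id: str):
        self.session_id = session_id
        self.highest_seq = -1

    def accept(self, op: TEOp) -> bool:
        # Reject ops from another session or with a stale sequence number.
        if op.session_id != self.session_id or op.seq_id <= self.highest_seq:
            return False
        self.highest_seq = op.seq_id
        return True


class TESide:
    def __init__(self):
        self.dirty = set()

    def issue(self, ofc: OFCSide, op: TEOp) -> None:
        self.dirty.add(op.entry_key)          # mark dirty when the op is issued
        if ofc.accept(op):                    # stands in for the RPC + acknowledgement
            self.dirty.discard(op.entry_key)  # clean only after acknowledgement


if __name__ == "__main__":
    ofc, te = OFCSide("sess-42"), TESide()
    te.issue(ofc, TEOp("sess-42", 1, "tunnel:T2"))
    te.issue(ofc, TEOp("sess-42", 0, "tg:FG1"))   # stale seq ID: rejected, stays dirty
    print(ofc.highest_seq, te.dirty)
```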
6 EVALUATION
6.1 Deployment and Evaluation
Since the first deployment, network traffic doubled
in 2012. The ability to deploy new functionality and other
progressive changes is a significant advantage. Caching
recently used paths to reduce the tunnel-op load and a mechanism to
handle unresponsive OFCs are some of the TE evolutions
designed and implemented. TE servers elect one master, and the
others remain on hot standby, ready to assume the role of master in
less than 10 seconds in case of failure. The TE server works on an
aggregated topology view, which reduces path churn and system load.
Edge removal happens multiple times a day and is handled efficiently
by TE. Dynamic centralized management also helps in dealing with
frequent port flaps of WAN links.
6.2 Impact of Failures
The traffic between two sites was observed for six types
of failure events: a single link failure, an encapsulation switch
failure and, separately, the failure of its neighboring router, an OFC
failover, a TE server failover, and disabling/enabling TE. On
a single link failure, the affected switches quickly prune
their ECMP groups, so traffic is lost for only a few
milliseconds. An encapsulation switch failure leads to multiple
similar pruning operations, adding up to a few milliseconds more. As
the design tolerates OFC and TE server failure, the traffic loss in
those cases is zero. Likewise, upon disabling TE, the network
operates on the baseline routing protocols without any traffic loss.
6.3 TE Algorithm Evaluation
As the maximum number of paths available to the TE
algorithm increases, global throughput improves. It also gives TE
more flexibility, at the cost of more hardware table resources. With
the path-split quantum fixed at 1/64, the throughput improvement
flattens at around 4 paths. Conversely, with a maximum of 4 paths,
the throughput improvement flattens at a quantum of around 1/16.
Thus, B4 runs TE with a quantum of 1/4 and 4 paths. The main gains
from TE are observed at times of failure or high demand.
6.4 Link Utilization and Hashing
In a typical network, WAN links are provisioned at
approximately 30-40% utilization to avoid packet drops and to
reserve dedicated backup capacity to cope with failures. In
B4, the busiest edges run at nearly 100% utilization. Such high
link utilization is tolerable because B4 differentiates among
traffic classes, unlike a regular WAN. With centralized TE,
priority classes can be mixed across all edges, ensuring that heavily
utilized links carry mostly low-priority traffic so that high-priority
traffic loss is avoided. Low-priority traffic loss can be
minimized by adjusting transmission rates at the application level.
7 ACTIVE AREAS OF WORK
To improve and evolve the implementation, the
scalability and latency of the packet I/O path between the
OFC and the OFA is an important factor.
The TE server must adapt to failed or unresponsive
OFCs while it is modifying TGs that depend on new
tunnel creation.
The OFA should be asynchronous and multi-threaded
for more parallelism.
There is a need for more performance profiling and
reporting in the network.
There is a need for application-level signals of broken
connectivity to distinguish between WAN hardware
failures and software failures.
Manual, sequential steps in management operations
should be reduced.
The system should be observed and tested with sufficient
load to find its limits and breaking points.
8 CONCLUSION
B4 provides a highly efficient and cost-effective
network to support large data transfers, driving links to nearly
100% utilization. It carries more traffic than Google's user-facing
WAN, and with higher fault tolerance and throughput. Centralized
traffic engineering dynamically allocates bandwidth among
competing applications based on their relative priorities and is thus
a new step in classifying traffic. Its hybrid approach
demonstrates an effective technique for improving global
connectivity while remaining backward compatible with standard
routing protocols. With all its advantages, it is still not the
solution to all network-related problems. The latency of
bridging protocol packets between the data plane and the control
plane remains a bottleneck and requires further improvement and
research.
9 REFERENCES
[1] Chandra, T. D., Griesemer, R., and Redstone, J. Paxos Made
Live: An Engineering Perspective. In Proc. of the ACM Symposium
on Principles of Distributed Computing (New York, NY, USA, 2007),
ACM, pp. 398-407.
[2] Danna, E., Hassidim, A., Kaplan, H., Kumar, A., Mansour, Y.,
Raz, D., and Segalov, M. Upward Max Min Fairness. In
INFOCOM (2012), pp. 837-845.
[3] Farrington, N., Rubow, E., and Vahdat, A. Data Center Switch
Architecture in the Age of Merchant Silicon. In Proc. Hot
Interconnects (August 2009), IEEE, pp. 93-102.
[4] Jain, S., et al. B4: Experience with a Globally-Deployed
Software Defined WAN. In Proc. of the ACM SIGCOMM 2013
Conference. ACM, 2013.
[5] OpenFlow Documentation.
http://archive.openflow.org/wp/learnmore/