B4: Experience with a Globally-Deployed Software Defined WAN [4]
Manjot Singh
Department of Computer Science and Engineering
Indian Institute of Technology Delhi
8285445568
ABSTRACT
B4 is a private WAN that connects Google's data centers across
various geographical locations. It has certain characteristics that
distinguish it from traditional WANs. It efficiently handles the unique
demands of Google data center connectivity: massive data
transfer, a modest number of connected sites, elastic traffic demands,
and the need for full control over edge servers and the network. B4 is a
Software-Defined Network that uses OpenFlow to control simple
custom switches. B4's centralized Traffic Engineering service
drives many links to nearly 100% utilization, compared with the
30-40% utilization typical of traditional WANs.
Categories and Subject Descriptors
C.2.2 [Network Protocols]: Routing Protocols
Keywords
Centralized Traffic Engineering; Wide-Area Networks (WANs);
Software-Defined Networking; Routing; OpenFlow
1 INTRODUCTION
Modern Wide-Area Networks (WANs) are vital for internet
performance. Typically, all applications are treated equally
regardless of their variable sensitivity to the provided capacity. To
provide reliability and tolerate link or router failures, WAN links
are provisioned to 30-40% average utilization. A WAN
connecting Google data centers requires significant bandwidth.
Google's data center WAN has certain characteristics which
distinguish it from a normal WAN. Google controls the applications,
servers, and LANs all the way to the edge of the network. Most
bandwidth-intensive applications perform large-scale data copies and
can adapt to available network capacity. Finally, there is a modest
number of data center deployments, making central control feasible.
To exploit these characteristics, B4 was built around a Software-
Defined Networking architecture and OpenFlow [5], allowing the
traffic engineering and routing protocols to be customized to these
unique requirements. The required level of scale, fault tolerance,
cost efficiency, and control could not be achieved with traditional
WAN architectures.
2 BACKGROUND
Google's WAN is among the largest, delivering a range of search,
video, cloud computing, and enterprise applications.
Architecturally, Google's WAN can be divided into two distinct
WANs. First, a user-facing network provides end services
and exchanges data with other Internet domains. Second, a WAN
provides connectivity among the data centers. End user
requests and responses are delivered between data centers and edge
caches across the first network. Data copies, index pushes, and end
user data replication are performed on the second network.
The various individual applications that run across B4
are categorized into three classes, ordered by increasing
volume, decreasing latency sensitivity, and decreasing priority: 1)
user data copies to remote data centers for availability and durability,
2) remote storage access for computation over distributed data
sources, and 3) large-scale data push to synchronize state across
data centers. A number of B4 characteristics shaped the design
approach:
Elastic Bandwidth Demands: The applications that
synchronize large data sets contribute the major share
of traffic and can tolerate temporary bandwidth reduction.
Modest Number of Sites: B4 connects data centers,
which are few in number.
End Application Control: Control over both the
applications and the site networks makes it possible to rate-limit
bursts at the network edge and enforce application priorities.
Cost Sensitivity: Provisioning WAN links at the traditional
30-40% utilization would make the deployment cost prohibitive.
3 DESIGN
3.1 Overview
The B4 architecture is logically divided into three layers: the
Global layer, the Site Controller layer, and the Switch Hardware
layer. The Switch Hardware layer forwards traffic and does not run
complex control software. The Site Controller layer includes
Network Control Servers (NCS) which host OpenFlow controllers
(OFCs) and Network Control Applications (NCAs). OFCs maintain
network state based on NCA directives and switch events, and
instruct the switches to set forwarding table entries consistent with
the changing network state. The Global layer includes centralized
applications (a central TE server and an SDN gateway) that provide
central control of the network through the site-level NCAs.
Standard routing is implemented with traffic engineering as an
overlay to keep them independent. This gives a "big red button"
to disable TE and fall back to shortest-path forwarding.
3.2 Switch Design
The primary reason for using custom-made hardware
was the non-availability of a platform that could support an SDN
deployment. Moreover, conventional switches come with deep buffers,
very large forwarding tables, and high-availability hardware support
that B4 does not need. By controlling the application transmission
rates, the need for deep buffers can be avoided. As
the number of sites is small, there is no need for large forwarding
tables. Switch failures occur mainly due to software issues rather
than hardware issues, so fault tolerance can be improved by
shifting software functionality off the switch, which also enables more
customization. The efficiency gains of the custom switches exceed
their extra cost, making them economically feasible.
B4 switches are built from multiple merchant silicon
chips assembled in a two-stage Clos topology with a copper
backplane [3]: a spine layer plus ingress/egress stages forming a
128-port 10GE switch out of 24 individual non-blocking switch chips
(see Figure 1). If the destination is not on the same ingress chip,
the packet is bounced to the spine layer, which then forwards it to
the appropriate output chip according to the packet's destination.
An OpenFlow Agent (OFA) was developed that
connects to a remote OFC, executes OpenFlow commands on the switch,
and forwards link/switch events to the OFC.
Figure 1. Topology of the custom-made switch.
3.3 Network Control Functionality
The NCS in the Site Controller layer is responsible for
most of the B4 functionality. Leader election for all control
functionality is done by Paxos [1], which at each site performs
application-level failure detection. Paxos is a family of protocols
for solving consensus in a network. A new leader is elected when
a majority of Paxos servers detect a failure; the new leader identifies
itself to the clients with a monotonically increasing
generation ID provided by Paxos.
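As a rough illustration of the generation-ID mechanism, the Python sketch below shows how a client can accept announcements only from the leader with the highest generation ID and ignore stale ones; the class and method names are hypothetical and not part of B4.

```python
# Minimal sketch (not Google's code): clients track the highest generation ID
# announced by elected leaders and ignore messages from stale leaders.

class LeaderTracker:
    """Accepts a new leader only if its generation ID exceeds any seen so far."""

    def __init__(self):
        self.current_leader = None
        self.current_generation = -1

    def on_leader_announcement(self, leader_id: str, generation: int) -> bool:
        # A stale leader (old generation) is ignored; a newer election wins.
        if generation <= self.current_generation:
            return False
        self.current_leader = leader_id
        self.current_generation = generation
        return True


if __name__ == "__main__":
    tracker = LeaderTracker()
    print(tracker.on_leader_announcement("ofc-a", 1))  # True: first leader
    print(tracker.on_leader_announcement("ofc-b", 2))  # True: newer generation
    print(tracker.on_leader_announcement("ofc-a", 1))  # False: stale announcement
```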
3.4 Routing
To concentrate on core SDN/OpenFlow functionality,
the open-source Quagga stack for BGP/ISIS was chosen, and
OpenFlow-based switch control was integrated with these existing
routing protocols. BGP and ISIS sessions run across the data plane
using the switch hardware ports, while Quagga runs on the NCS with
no data plane connectivity. A Routing Application Proxy (RAP) was
written to connect Quagga and the OpenFlow switches. It provides
BGP/ISIS route updates and interface updates from the switches to
Quagga, and carries routing-protocol packets flowing between the
switches and Quagga.
RAP works as a translator that converts the RIB entries of Quagga
(a network-level view of global connectivity) into the low-level
hardware tables used by the OpenFlow data plane. Each RIB entry is
translated into two OpenFlow tables: a Flow table and an ECMP
Group table. The Flow table maps prefixes to entries in the ECMP
Group table, and each ECMP Group table entry identifies the next-hop
physical interfaces for a set of Flow table entries. RAP also informs
Quagga of port state changes. When such a change is detected,
the switch OFA sends an OpenFlow message to the OFC, which updates
its NIB; the update propagates to RAPd, which changes the netdev
state for the corresponding interface, and the change is in turn
propagated to Quagga for routing-protocol updates, shortening the
path between a switch interface change and the protocol processing.
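The Python sketch below illustrates the two-table translation described above, assuming a toy representation of the Flow table (prefix to group ID) and the ECMP Group table (group ID to next-hop ports); the names and the group-ID scheme are assumptions for illustration, not B4's actual data structures.

```python
# Hypothetical sketch of the two-table structure described above: a Flow table
# maps prefixes to an ECMP group, and the ECMP Group table maps each group to
# its next-hop physical interfaces. Names and shapes are illustrative only.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Tables:
    flow_table: Dict[str, int] = field(default_factory=dict)              # prefix -> group id
    ecmp_group_table: Dict[int, List[str]] = field(default_factory=dict)  # group id -> ports


def install_rib_entry(tables: Tables, prefix: str, next_hop_ports: List[str]) -> None:
    """Translate one RIB entry into Flow + ECMP Group table entries."""
    group_id = hash(tuple(sorted(next_hop_ports))) & 0xFFFF  # toy group identifier
    tables.ecmp_group_table.setdefault(group_id, next_hop_ports)
    tables.flow_table[prefix] = group_id


if __name__ == "__main__":
    t = Tables()
    install_rib_entry(t, "10.0.0.0/8", ["port-1", "port-2"])
    install_rib_entry(t, "172.16.0.0/12", ["port-3"])
    print(t.flow_table)
    print(t.ecmp_group_table)
```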
4 TRAFFIC ENGINEERING
Traffic engineering aims at identifying multiple paths
for each application and sharing bandwidth among applications
across those paths so as to deliver a max-min [2] fair allocation. A
max-min solution maximizes utilization as long as further gain
cannot be achieved without reducing the fair share of some
application.
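For intuition, the sketch below shows the textbook progressive-filling view of max-min fairness on a single shared link; B4's actual allocator operates over bandwidth functions and the whole topology, so this is only a toy illustration of the fairness criterion.

```python
# Illustration only: progressive filling on one shared link. Smaller demands
# are satisfied first; leftover capacity is split among the remaining flows.

from typing import Dict


def max_min_allocate(capacity: float, demands: Dict[str, float]) -> Dict[str, float]:
    """Split capacity so no flow can gain without hurting a smaller allocation."""
    allocation = {flow: 0.0 for flow in demands}
    unsatisfied = dict(demands)
    remaining = capacity
    while unsatisfied and remaining > 1e-9:
        share = remaining / len(unsatisfied)          # equal share for unsatisfied flows
        for flow, demand in list(unsatisfied.items()):
            grant = min(share, demand - allocation[flow])
            allocation[flow] += grant
            remaining -= grant
            if allocation[flow] >= demand - 1e-9:     # flow fully satisfied, freeze it
                del unsatisfied[flow]
    return allocation


if __name__ == "__main__":
    # copy ends up with ~6.0; the smaller demands (3.0 and 1.0) are fully met.
    print(max_min_allocate(10.0, {"copy": 8.0, "storage": 3.0, "push": 1.0}))
```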
4.1 Centralized TE Architecture
The TE server operates over four abstractions: Network Topology,
Flow Group, Tunnel, and Tunnel Group. In the Network Topology
graph, sites are represented as vertices and site-to-site
connectivity as edges. A Flow Group (FG) is defined as a
{source site, destination site, QoS} tuple. A site-level path is
represented by a Tunnel (T). A Tunnel Group (TG) maps a Flow Group
to a set of tunnels and corresponding weights defining the fraction
of traffic to be forwarded along each tunnel.
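A minimal Python data model of these abstractions, with field names assumed for illustration rather than taken from B4's actual schema, might look as follows.

```python
# Illustrative data model for the four TE abstractions named above.

from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class FlowGroup:
    source_site: str
    destination_site: str
    qos: str                      # QoS class, e.g. a priority marking


@dataclass(frozen=True)
class Tunnel:
    sites: Tuple[str, ...]        # ordered site-level path


@dataclass
class TunnelGroup:
    flow_group: FlowGroup
    splits: Dict[Tunnel, float]   # tunnel -> fraction of the FG's traffic


if __name__ == "__main__":
    fg = FlowGroup("site-A", "site-C", "BE1")
    direct = Tunnel(("site-A", "site-C"))
    via_b = Tunnel(("site-A", "site-B", "site-C"))
    tg = TunnelGroup(fg, {direct: 0.75, via_b: 0.25})
    print(tg)
```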
4.2 Bandwidth Functions
A bandwidth function is associated with every
application. It captures the application's relative priority on B4,
specifying the bandwidth allocation the application receives as a
function of the flow's fair share, a relative priority measured on an
arbitrary, dimensionless scale. These functions are derived from
administrator-specified static weights specifying relative
application priority.
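Assuming the simplest case, where allocated bandwidth grows linearly with fair share at a slope given by the application's weight and is capped at its demand, a bandwidth function could be sketched as below; the actual functions in B4 are piecewise linear.

```python
# Minimal sketch, assuming a linear bandwidth function: bandwidth grows with
# fair share at a slope set by the application's weight, capped at its demand.

def bandwidth_function(weight: float, demand: float):
    """Return a function mapping fair share -> allocated bandwidth."""
    def allocation(fair_share: float) -> float:
        return min(weight * fair_share, demand)
    return allocation


if __name__ == "__main__":
    app_hi = bandwidth_function(weight=10.0, demand=15.0)   # high-priority application
    app_lo = bandwidth_function(weight=1.0, demand=5.0)     # low-priority application
    for fs in (0.5, 1.0, 2.0, 10.0):
        print(fs, app_hi(fs), app_lo(fs))
```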
4.3 TE Optimization Algorithm
The TE optimization algorithm allocates an optimal fair share
among all FGs. It has two main components: (i) Tunnel Group
Generation, and (ii) Tunnel Group Quantization. Tunnel Group
Generation allocates bandwidth to FGs, using their bandwidth
functions to prioritize them at bottleneck edges. It iterates by
finding the bottleneck edge and allocating bandwidth based on demand
and priority, such that the FGs crossing it receive an equal fair
share or fully satisfy their demands. Tunnel Group Quantization then
adjusts the split ratios in each TG and quantizes them.
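The toy sketch below captures the bottleneck-iteration idea of Tunnel Group Generation under strong simplifying assumptions (one fixed path per FG, equal priorities, unbounded demand); the real algorithm also handles bandwidth functions, multiple tunnels per FG, and quantization.

```python
# Toy sketch of bottleneck iteration: raise all active FGs' rates together
# until some edge fills, freeze the FGs crossing that edge, and repeat.

from typing import Dict, List


def allocate(paths: Dict[str, List[str]], capacity: Dict[str, float]) -> Dict[str, float]:
    rate = {fg: 0.0 for fg in paths}
    active = set(paths)
    residual = dict(capacity)
    while active:
        # For each edge, how much more each active FG crossing it could get if
        # the remaining edge capacity were shared equally among those FGs.
        headroom = {}
        for edge, cap in residual.items():
            users = [fg for fg in active if edge in paths[fg]]
            if users:
                headroom[edge] = cap / len(users)
        if not headroom:
            break
        bottleneck = min(headroom, key=headroom.get)   # edge that saturates first
        delta = headroom[bottleneck]
        for fg in list(active):
            rate[fg] += delta
            for edge in paths[fg]:
                residual[edge] -= delta
            if bottleneck in paths[fg]:                # FG is limited by the bottleneck
                active.remove(fg)
    return rate


if __name__ == "__main__":
    paths = {"fg1": ["A-B", "B-C"], "fg2": ["A-B"], "fg3": ["B-C"]}
    capacity = {"A-B": 10.0, "B-C": 6.0}
    print(allocate(paths, capacity))   # fg1 and fg3 bottlenecked at B-C; fg2 gets the rest
```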
5 TE PROTOCOL AND OPENFLOW
5.1 TE State and OpenFlow
Switches in B4 play one of three roles: i) encapsulating
switch, ii) transit switch, and iii) decapsulating switch. The
encapsulating switch at the source site initiates tunnels,
encapsulating the packet with an outer IP header, and splits the
traffic among the associated tunnels. Packets are mapped to an FG
by matching the inner IP header against the prefixes
associated with the FG. The outer IP destination address
identifies the tunnel rather than the actual destination
address. TE preconfigures the switches to encapsulate packets
correctly, and each packet is hashed to a tunnel from the TG in the
desired ratio. A transit switch forwards a received packet
based on its tunnel ID. The decapsulating switch terminates the
tunnel identified by the tunnel ID and decapsulates the packet based
on the table preconfigured by TE. After decapsulation, packets are
forwarded based on the inner packet header using conventional
protocols.
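The hashing step can be pictured with the following sketch, which expands a Tunnel Group's split ratios into weighted buckets and hashes a flow key into them so that a flow sticks to one tunnel; the bucket granularity and hash inputs are assumptions, not B4's actual mechanism.

```python
# Illustrative sketch: pick a tunnel for a packet by hashing its flow identity
# into weighted buckets that mirror the Tunnel Group's split ratios.

import hashlib
from typing import Dict, List


def build_buckets(splits: Dict[str, float], granularity: int = 64) -> List[str]:
    """Expand split ratios (e.g. {'T1': 0.75, 'T2': 0.25}) into hash buckets."""
    buckets: List[str] = []
    for tunnel, weight in splits.items():
        buckets.extend([tunnel] * round(weight * granularity))
    return buckets


def pick_tunnel(buckets: List[str], flow_key: str) -> str:
    """Hash the flow key (e.g. a 5-tuple) so a flow consistently maps to one tunnel."""
    digest = hashlib.sha256(flow_key.encode()).digest()
    return buckets[int.from_bytes(digest[:4], "big") % len(buckets)]


if __name__ == "__main__":
    buckets = build_buckets({"T1": 0.75, "T2": 0.25})
    print(pick_tunnel(buckets, "10.1.1.5:443->10.2.2.9:51012/tcp"))
```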
5.2 Composing Routing and TE
In B4, routing and TE are deployed independently.
Thus, even if TE is disabled, the network continues to operate
without failure or packet loss. Routing/BGP populates the
routing table, matched by Longest Prefix Match (LPM), with
appropriate entries, whereas TE uses the Access Control List
(ACL) table to determine the forwarding action. Packets are
matched against both tables, the LPM table and the ACL table, but
the rule defined by the ACL takes precedence over the one defined by
LPM. For example, if the LPM rule says to forward the packet without
tunneling but the ACL rule says to forward it through a port
with tunneling, then the ACL rule dominates and the packet is
forwarded according to it.
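The precedence rule can be illustrated with the toy lookup below, in which the packet is matched against both tables and an ACL hit overrides the LPM result; the table contents and action strings are purely illustrative.

```python
# Toy lookup illustrating the precedence described above: match both tables,
# and let an ACL hit override the LPM result.

import ipaddress
from typing import Dict, Optional


def lpm_lookup(table: Dict[str, str], dst: str) -> Optional[str]:
    """Return the action of the longest matching prefix, if any."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, action in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, action)
    return best[1] if best else None


def forward(lpm_table: Dict[str, str], acl_table: Dict[str, str], dst: str) -> Optional[str]:
    acl_action = lpm_lookup(acl_table, dst)      # ACL sketched here as prefix rules
    lpm_action = lpm_lookup(lpm_table, dst)
    return acl_action or lpm_action              # ACL rule wins when both match


if __name__ == "__main__":
    lpm = {"10.2.0.0/16": "forward(port-7), no tunnel"}
    acl = {"10.2.3.0/24": "encap(tunnel-12), forward(port-9)"}
    print(forward(lpm, acl, "10.2.3.4"))   # ACL entry takes precedence
    print(forward(lpm, acl, "10.2.9.1"))   # falls back to the LPM entry
```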
5.3 Coordinating TE State Across Sites
Tunnel, Tunnel Group, and Flow Group rules are
coordinated across the OFCs of multiple sites by the TE server.
The TE output is translated and stored in a per-site Traffic
Engineering Database (TED). Each OFC refers to the TED to set
the forwarding state across its individual switches. When the TED
must be modified, the difference between the current state and the
desired state is computed, and a single TED op is generated for each
difference. Thus, a single TED op changes only one TED entry at one
OFC, and the TE server can issue multiple ops to a single site's OFC.
5.4 Dependencies and Failure
Dependencies among ops: Not all ops can be issued
simultaneously, as this may lead to packet drops.
For example, a tunnel must be configured before
any changes referencing it are made to the corresponding TG and FG.
Likewise, a tunnel cannot be removed prior to removing
the entries referencing it.
Synchronizing TED between TE and OFC: To compute
the difference between the current and the desired TED
state, there must be a common TED view between
the TE master and the OFC. A TE session is used
to facilitate this synchronization. A unique TE session
identifier is generated for the pair of endpoints. Both
endpoints sync their TED views at the start of the
session, allowing one endpoint to recover in case the other
restarts. The session ID ensures that only ops
belonging to the current session are applied.
Ordering issues: Site-specific sequence IDs are
attached to TE ops to ensure proper operation
of the network. The OFC maintains the highest
session sequence ID seen so far, and
subsequent ops with smaller IDs are rejected.
TE op failure: A TE op can fail for various reasons,
such as OFC rejection or RPC failure. Hence, when an op
is issued, TE marks the corresponding TED entry as
dirty, and the entry is marked clean only after receiving an
acknowledgement from the concerned OFC.
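The sketch below combines three of these mechanisms, session ID checks, monotonically increasing sequence IDs, and dirty/clean marking, in simplified form; the structure and names are illustrative rather than B4's actual protocol.

```python
# Sketch: ops carry a session ID and a sequence ID, the OFC side rejects stale
# ops, and the TE side keeps an entry "dirty" until the op is acknowledged.

from dataclasses import dataclass


@dataclass
class TEOp:
    session_id: str
    seq_id: int
    entry_key: str


class OFCSide:
    def __init__(self, session_id: str):
        self.session_id = session_id
        self.highest_seq = -1

    def accept(self, op: TEOp) -> bool:
        # Reject ops from another session or with a stale sequence number.
        if op.session_id != self.session_id or op.seq_id <= self.highest_seq:
            return False
        self.highest_seq = op.seq_id
        return True


class TESide:
    def __init__(self):
        self.dirty = set()

    def issue(self, ofc: OFCSide, op: TEOp) -> None:
        self.dirty.add(op.entry_key)          # mark dirty when the op is issued
        if ofc.accept(op):                    # stands in for the RPC + acknowledgement
            self.dirty.discard(op.entry_key)  # clean only after acknowledgement


if __name__ == "__main__":
    ofc, te = OFCSide("sess-42"), TESide()
    te.issue(ofc, TEOp("sess-42", 1, "tunnel:T2"))
    te.issue(ofc, TEOp("sess-42", 0, "tg:FG1"))   # stale seq ID: rejected, stays dirty
    print(ofc.highest_seq, te.dirty)
```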
6 EVALUATION
6.1 Deployment and Evaluation
Since the first deployment, network traffic doubled
in 2012. The ability to deploy new functionality and other
progressive changes is a significant advantage. Caching
recently used paths to reduce the tunnel-op load and a mechanism to
handle unresponsive OFCs are some of the TE evolutions
designed and implemented. TE servers elect one master, and the
others remain on hot standby, ready to assume the role of master in
less than 10 seconds in case of failure. The TE server works on an
aggregated topology view, which reduces path churn and system load.
Edge removal happens multiple times a day and is handled efficiently
by TE. Dynamic centralized management also helps in dealing with
frequent port flaps of WAN links.
6.2 Impact of Failures
The traffic between two sites was observed for six types
of failure events: a single link failure, an encapsulation switch
failure and, separately, the failure of its neighboring router, an OFC
failover, a TE server failover, and disabling/enabling TE. On
a single link failure, the affected switches quickly prune
their ECMP groups, so traffic is lost for only a few
milliseconds. An encapsulation switch failure leads to multiple
similar pruning operations, adding up to a few milliseconds more. As
the design tolerates OFC and TE server failure, the traffic loss in
those cases is zero. Likewise, upon disabling TE, the network
operates on the baseline routing protocols without any traffic loss.
6.3 TE Algorithm Evaluation
As the maximum number of paths available to the TE
algorithm increases, global throughput improves. It also gives TE
more flexibility, at the cost of more hardware table resources. With
the path-split quantum fixed at 1/64, the throughput improvement
flattens at around 4 paths. Conversely, with a maximum of 4 paths,
the throughput improvement flattens at a quantum of around 1/16.
Thus, B4 runs TE with a quantum of 1/4 and 4 paths. The main gains
from TE are observed at times of failure or high demand.
6.4 Link Utilization and Hashing
In a typical network, WAN links are provisioned at
approximately 30-40% utilization to avoid packet drops and to
reserve dedicated backup capacity to cope with failures. In
B4, the busiest edges run at nearly 100% utilization. Such high
link utilization is tolerable because B4 differentiates among
traffic classes, unlike a regular WAN. With centralized TE,
priority classes can be mixed across all edges, ensuring that heavily
utilized links carry mostly low-priority traffic so that high-priority
traffic loss is avoided. Low-priority traffic loss can be
minimized by adjusting transmission rates at the application level.
7 ACTIVE AREAS OF WORK
To improve and evolve the implementation, the
scalability and latency of the packet I/O path between the
OFC and the OFA is an important factor.
The TE server must adapt to failed or unresponsive
OFCs while it is modifying TGs that depend on new
tunnel creation.
The OFA should be asynchronous and multi-threaded
for more parallelism.
There is a need for more performance profiling and
reporting in the network.
There is a need for application-level signals of broken
connectivity to distinguish between WAN hardware
failures and software failures.
Manual, sequential steps in management operations
should be reduced.
The system should be observed and tested with sufficient
load to find its limits and breaking points.
8 CONCLUSION
B4 provides a highly efficient and cost-effective
network to support large data transfers, driving links to nearly
100% utilization. It carries more traffic than Google's user-facing
WAN, and with higher fault tolerance and throughput. Centralized
traffic engineering dynamically allocates bandwidth among
competing applications based on their relative priorities and is thus
a new step in classifying traffic. Its hybrid approach
demonstrates an effective technique for improving global
connectivity while remaining backward compatible with standard
routing protocols. With all its advantages, it is still not the
solution to all network-related problems. The latency of
bridging protocol packets between the data plane and the control
plane remains a bottleneck and requires further improvement and
research.
9 REFERENCES
[1] Chandra, T. D., Griesemer, R., and Redstone, J. Paxos Made
Live: An Engineering Perspective. In Proc. of the ACM Symposium
on Principles of Distributed Computing (New York, NY, USA, 2007),
ACM, pp. 398-407.
[2] Danna, E., Hassidim, A., Kaplan, H., Kumar, A., Mansour, Y.,
Raz, D., and Segalov, M. Upward Max Min Fairness. In
INFOCOM (2012), pp. 837-845.
[3] Farrington, N., Rubow, E., and Vahdat, A. Data Center Switch
Architecture in the Age of Merchant Silicon. In Proc. Hot
Interconnects (August 2009), IEEE, pp. 93-102.
[4] Jain, S., et al. B4: Experience with a Globally-Deployed
Software Defined WAN. In Proc. of the ACM SIGCOMM 2013
Conference. ACM, 2013.
[5] OpenFlow Documentation.
http://archive.openflow.org/wp/learnmore/