XoS “is like an elastic Fabric”
fn(x, y, z): X = Compute, Y = Memory/Storage, Z = I/O Bandwidth. You can never have enough.
Customers want scale made easy and hypervisor integration.
The next convergence will be collapsing datacenter designs into smaller, elastic form factors for compute, storage and networking.
Open Source Software - The Beginning
“Hello everybody out there using minix —
I’m doing a (free) operating system (just a hobby, won’t be big and professional like gnu) for 386(486) AT clones..”
Free Unix!
“Starting this Thanksgiving I am going to write a complete Unix-compatible software system called GNU (for Gnu's Not Unix), and give it away free(1) to everyone who can use it. Contributions of time, money, programs and equipment are greatly needed”.
Software Timeline
Closed: 1970s-80s | Hybrid: 1980s-90s | Open: 1990s-Present
Evolution of the Data Center
1970s: Mainframe (Proprietary)
1990s: Scale Up (Proprietary)
2000s: Scale Out (Proprietary)
2011++: Open Source
Evolution of the Cloud
1998 / 2003 / 2006 / 2008 / 2009: Virtualization, DevOps, IaaS/Cloud (proprietary, open source, proprietary)
2008 / 2009 / 2010: Open Core, Open Source
Open Compute
5 Main Verticals – All Open Source:
• Virt I/O
• Hardware Management
• Data Center Design
• Open Rack
• Storage
What is it?
Who is it?
Why does it exist?
To democratize hardware and eliminate gratuitous differentiation, allowing for standardization across tier 1s and ODMs.
Is your network faster today than it was 3 years ago?
Folded: merge input and output into one switch.
1.1 Strict-sense non-blocking Clos networks (m ≥ 2n−1): the original 1953 Clos result.
1.2 Rearrangeably non-blocking Clos networks (m ≥ n).
1.3 Blocking probabilities: the Lee and Jacobaeus approximations.
Multi-stage circuit switching network proposed by Charles Clos in 1953 for telephone switching systems. Allows forming a large switch from smaller switches…
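The two port-count conditions and the Lee approximation are easy to sanity-check numerically. A minimal sketch (mine, not from the slides) for a 3-stage Clos(m, n, r) switch, where n is the number of inlets per ingress switch and m the number of middle-stage switches; the example sizing is hypothetical:

def strict_sense_nonblocking(m: int, n: int) -> bool:
    """Clos (1953): no request is ever blocked if m >= 2n - 1."""
    return m >= 2 * n - 1

def rearrangeably_nonblocking(m: int, n: int) -> bool:
    """Any request can be satisfied if existing paths may be rerouted: m >= n."""
    return m >= n

def lee_blocking(m: int, n: int, a: float) -> float:
    """Lee approximation: each of the m alternate paths crosses two inter-stage
    links, each busy with probability p = a*n/m (a = offered load per inlet)."""
    p = a * n / m
    return (1.0 - (1.0 - p) ** 2) ** m

if __name__ == "__main__":
    n, m = 16, 20                                 # hypothetical switch sizing
    print(strict_sense_nonblocking(m, n))         # False: 20 < 2*16 - 1 = 31
    print(rearrangeably_nonblocking(m, n))        # True:  20 >= 16
    print(f"{lee_blocking(m, n, a=0.7):.4f}")     # ~0.013 blocking probability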
Fat-tree (Blocking characteristics) Clos networks
Use cases: Big Data, IP Storage, VM Farms, Cloud, Web 2.0, Legacy, VDI
It should be…
2-3 Generations of Silicon
1G -> 100G Speed Transition
Lowest Latency with 2 tier Leaf/Spine
Data Center challenges & needs
Containing the failure domain
No downtime, planned or unplanned
High bandwidth
Automated provisioning, change control and upgrades
Supports all use cases and applications: client-server, modern distributed apps, Big Data, storage, virtualization
Dis-aggregated Switch
Any hardware, with any switch silicon and any OS
Layers: Driver, Routing, Analytics, Virtual, Policy, Platform, Config
Data Center challenges & needs
Multi-tenancy with Integrated Security
Low and predictable latency
Mix and match multiple generations of technologies
Fabric Modules (Spine)
I/O Modules (Leaf)
Spine
3 µsec Latency
Leaf
Network Disaggregation – Common Goals
Chassis vs. Spline
Chassis: Fabric Modules (Spine) and I/O Modules (Leaf)
≠ Cannot access line cards. ≠ No L2/L3 recovery inside. ≠ No access to the fabric.
Spline: Spine and Leaf
Control of top-of-rack switches
Advanced L2/L3 protocols inside the spline
Full access to spine switches
Fully non-blocking architecture
Three key components to fabric design…
Spine – spine switches (10Gb / 40Gb / 100Gb) with very low oversubscription
Leaf – 1/10 Gb non-blocking (wire-speed) leaf switches; inter-rack latency: 2 to 5 microseconds
Compute and Storage – high-performance hosts (VM farms, IP storage, video streaming, HPC, low-latency trading)
Ethernet Fabric: 3 µsec vs. Chassis: 1.8 µsec
Network Fabric
O/S | Maximum 10GbE Connections | Total Summit X770 Units | 40GbE Downlink Ports at the Edge | 40GbE Uplink Ports at the Edge
1:1 | 256 | 6 | 16 | 16
~2:1 | 512 | 8 | 21 | 11
3:1 | 768 | 10 | 24 | 8
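A short sketch (my own arithmetic, not from the deck) of how rows like these can be derived, assuming each 40GbE edge downlink is broken out into 4 x 10GbE and that two of the listed units are spines; the leaf counts below are assumptions:

def edge_10g_connections(leaf_units: int, downlinks_40g: int) -> int:
    """10GbE edge connections, assuming 4 x 10GbE breakouts per 40GbE downlink."""
    return leaf_units * downlinks_40g * 4

def oversubscription(downlinks_40g: int, uplinks_40g: int) -> float:
    """Edge-facing capacity divided by uplink capacity (all ports 40GbE)."""
    return downlinks_40g / uplinks_40g

if __name__ == "__main__":
    # 1:1 row: 6 units = 4 leaves + 2 spines, each leaf 16 down / 16 up
    print(edge_10g_connections(4, 16), oversubscription(16, 16))   # 256 1.0
    # 3:1 row: 10 units = 8 leaves + 2 spines, each leaf 24 down / 8 up
    print(edge_10g_connections(8, 24), oversubscription(24, 8))    # 768 3.0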
80% Traffic East-West
Facebook Switch Fabrics
The Facebook design advantages:
No dependency on the underlying physical network and protocols.
No addresses of virtual machines are present in Ethernet switches, resulting in smaller MAC tables and less complex STP layouts.
No limitations related to Virtual LAN (VLAN) technology, resulting in more than 16 million possible separate networks, compared to the VLAN limit of about 4,000.
No dependency on IP multicast traffic.
Business Value – Strategic Asset
Ethernet Fabric
Single software train
Purpose-built OS for Broadcom ASICs
Modularity = High Availability
Network Operating System
Design once, leverage everywhere.
Why? How?
Considerations
80% North-South traffic
Oversubscription: up to 200:1
Inter-rack latency: 75 to 150 microseconds
Scale: up to 20 racks
Client request + server response = 20% of traffic
Lookup/storage = 80% of traffic (inter-rack)
Non-blocking 2-tier designs are optimal
Diagram: client request and response flow across compute, cache, database and storage tiers.
Simple Spine (Active-Active): LAG and MLAG
No loop, just more bandwidth!
Summit switches in an active/active pair
Simplifies or eliminates the Spanning Tree topology
Simple to understand and easy to engineer traffic
Scale using standard L2/L3 protocols (LACP, OSPF, BGP)
Total latency: under 5 µs
Simple Fabric: LAG
No loop, just more bandwidth!
Active/active Summit switches at L2/L3
Topology-independent ISSU
Plug-and-play provisioning of spines and leaves
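As a rough illustration of the "under 5 µs" figure: in a two-tier leaf/spine fabric an inter-rack packet crosses leaf, spine and leaf, so the total switching latency is roughly three hops' worth. The per-hop numbers below are hypothetical, not vendor figures:

LEAF_US = 1.0    # hypothetical per-hop leaf latency (microseconds)
SPINE_US = 1.5   # hypothetical spine latency (microseconds)

# Inter-rack path in a 2-tier fabric: ingress leaf -> spine -> egress leaf
total_us = LEAF_US + SPINE_US + LEAF_US
print(f"leaf + spine + leaf = {total_us} us")   # 3.5 us, under the 5 us budget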
Start Small; Scale as You Grow
Simply add Extreme Leaf clusters: each cluster is an independent unit that includes servers, storage, database and interconnects.
Each cluster can be used for a different type of service.
Simple, repeatable design; capacity can be added as a commodity.
Scale ingress and egress.
Intel, Facebook, OCP, and Disaggregation
4-Post Architecture at Facebook
Each rack switch (RSW) has up to 48 10G downlinks and 4-8 10G uplinks (10:1 oversubscription) to a cluster switch (CSW).
Each CSW has 4 40G uplinks, one to each of the 4 FatCat (FC) aggregation switches (4:1 oversubscription).
The 4 CSWs are connected in a 10G×8 protection ring.
The 4 FCs are connected in a 10G×16 protection ring.
No routers at FC. One CSW failure reduces intra-cluster capacity to 75%.
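A back-of-the-envelope check (mine) of the oversubscription ratios quoted above: oversubscription is simply downstream capacity divided by uplink capacity. The CSW downlink count used below is an assumption chosen to match the quoted 4:1:

def oversubscription(down_ports: int, down_gbps: float,
                     up_ports: int, up_gbps: float) -> float:
    """Downstream capacity divided by uplink capacity."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

if __name__ == "__main__":
    # RSW: up to 48 x 10G downlinks with 4-8 x 10G uplinks (deck cites ~10:1)
    print(oversubscription(48, 10, 4, 10))   # 12.0 with 4 uplinks
    print(oversubscription(48, 10, 8, 10))   # 6.0 with 8 uplinks
    # CSW: 4 x 40G uplinks to FC; assuming ~64 x 10G of RSW uplinks below it
    print(oversubscription(64, 10, 4, 40))   # 4.0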
Dense 10GbE Interconnect using breakout cables, Copper or Fiber
“Wedge”
“6-pack”
The Open Source Data Center
The current rack standard has no specification for depth or height. The only “standard” is the requirement for it to be 19 inches wide. This standard evolved out of the railroad switching era.
What is it? 5 projects – all completely open to lower cost and increase data center efficiency
Why does it exist? To democratize hardware and eliminate gratuitous differentiation, allowing for standardization across tier 1s and ODMs.
One Rack Design
Top of Rack Switches
Servers
Storage
Summit
Management Switch
Open Compute matters because it can allow companies to purchase “vanity free” hardware. You also don’t have stranded ports with a spline network. Scale beyond traditional data center design: modular data center construction is the future.
Outdated designs only support low-density IT computing.
Time-consuming maintenance. Long lead time to deploy additional data center capacity.
Loosely coupled
Nearly coupled
Closely coupled
The monolithic datacenter is dead.
Summit Summit
Shared Combo Ports: 4x10GBASE-T & 4xSFP+
100Mb/1Gb/10GBASE-T
Two Rack Design
Reduce OPEX: leverage a repeatable solution covering planning, configuration, installation, commissioning, full turnkey deployment and maintenance.
Leverage best-in-class servers, storage, networking and services to create efficient, high-performance modular data centers with the infrastructure to support IT.
Flexible solution from planning, configuration, installation and commissioning to full turnkey deployments and maintenance.
Diagram: two racks, each with top-of-rack Summit switches, servers, storage and a management switch, interconnected by a Summit spine pair.
Eight Rack PoD Design
Spine and Leaf
Diagram: eight racks, each with top-of-rack Summit switches, servers, storage and management, connected to Summit spine switches in a leaf/spine PoD.
Spine Design
Spine redundancy and capacity
Performance: low device latency and low oversubscription rates
I/O diversity (10/25/40/50/100G)
Ability to grow/scale as capacity is needed
Collapsing of fault/broadcast domains (due to Layer 3 topologies)
Deterministic failover and simpler troubleshooting
Lossless and lossy traffic over a single converged fabric
Readily available operational expertise as well as a variety of traffic engineering capabilities
Collapsed (1-tier) Spine
Diagram: a collapsed single-tier design with a Summit spine pair directly connecting storage and management across racks.
4 x 72 = 288 x 10G ports
25/50GbE NICs allow 25/50GbE from server to switch (Mellanox, QLogic, Broadcom)
Connectivity: connectors (SFP28, QSFP28), optics and cabling solutions are backwards compatible
Commercial switching silicon: 25Gb signaling lanes make 25/50/100GbE devices possible
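The 25/50/100GbE family follows directly from the lane math: port speed is simply the lane count times 25 Gb/s. A tiny sketch (mine):

LANE_GBPS = 25  # signaling rate per serdes lane

for lanes in (1, 2, 4):
    print(f"{lanes} lane(s) x {LANE_GBPS}G = {lanes * LANE_GBPS}GbE")
# 1 lane  -> 25GbE (e.g. SFP28)
# 4 lanes -> 100GbE (e.g. QSFP28)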
VXLAN – Overlay and Underlay
VXLAN tunneling between sites removes network entropy: L2 connectivity over L3.
Diagram: VLANs XX-YY at each site connected through VXLAN gateways across an L3 core.
L2 connections within an IP overlay, unicast & multicast.
Allows a flat DC design without boundaries; a simple and elastic network.
The hypervisor / distributed virtual switch is the other end of the VXLAN tunnels.
IP overlay connections are established between the VXLAN end-points of a tenant.
Fully meshed unicast tunnels carry known L2 unicast traffic; PIM-signaled multicast tunnels carry L2 BUM traffic.
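As a small illustration of why a VXLAN overlay removes the VLAN scaling limit mentioned earlier: 802.1Q carries a 12-bit VLAN ID, while the VXLAN header defined in RFC 7348 carries a 24-bit VNI. A minimal sketch (mine, plain Python, no packet I/O) that packs the 8-byte VXLAN header and prints both segment counts:

import struct

VLAN_ID_BITS = 12    # 802.1Q VLAN ID
VXLAN_VNI_BITS = 24  # VXLAN Network Identifier (RFC 7348)

def vxlan_header(vni: int) -> bytes:
    """8-byte VXLAN header: flags (I bit set) + 24 reserved bits,
    then the 24-bit VNI + 8 reserved bits."""
    if not 0 <= vni < 2 ** VXLAN_VNI_BITS:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", 0x08 << 24, vni << 8)

if __name__ == "__main__":
    print(2 ** VLAN_ID_BITS)          # 4096 VLAN IDs (~4K usable segments)
    print(2 ** VXLAN_VNI_BITS)        # 16777216 VXLAN segments
    print(vxlan_header(5001).hex())   # 0800000000138900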
Fabric vs. Cisco legacy architecture
Legacy: rack and blade servers, multi-tier network design oriented north-south.
Elastic Spine-Leaf: simpler, flatter, any-to-any network; 80% of traffic is east-west.
A fundamental change in data flows: from a client-server architecture to a service-oriented architecture.
Client-server architecture: high latency, complex, high CapEx/OpEx, constrains virtualization, inefficient, FC SAN storage.
Converged Network: UCS – a single system that encompasses Network (unified fabric), Compute (industry-standard x86) and Storage (access options), optimized for virtualization.
Petabyte-Scale Data – Data Flow Architecture at Facebook
Web Servers, Scribe Mid-Tier, Filers
Scribe-Hadoop Cluster
Production Hive-Hadoop Cluster, Oracle RAC, Federated MySQL
Ad hoc Hive-Hadoop Cluster
Hive replication