LinkedIn DC Network Architecture - LACNIC

LinkedIn DC Network Architecture (or how to build a network for 100,000 servers) Ernesto Ovcharenko Staff Network Engineer Infrastructure Engineering


Page 1:

LinkedIn DC Network Architecture (or how to build a network for 100,000 servers)

Ernesto Ovcharenko, Staff Network Engineer, Infrastructure Engineering

Page 2:

LinkedIn Infrastructure

• Bare Metal Servers: >200K

• PoPs: ~20

• Networks Peered: ~4000

• Inter-DC NG BB: ~1.5 Tbps

Page 3:

34% infrastructure growth every year…

High bandwidth & compute demand due to organic growth.

For every single byte, thousands of bytes of east-west traffic:

• Application Call Graph

• Kafka (metrics and analytics)

• Hadoop & Offline Compute

• Machine Learning

• Data Replication

• Search and Indexing

Growth
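
To put the 34% annual growth figure in perspective, here is a small arithmetic sketch of compound growth. The 10x target below is only an illustrative input chosen for this example, and the function name is hypothetical.

```python
import math

def years_to_multiply(annual_growth: float, factor: float) -> float:
    """Years of compound growth needed to grow by `factor`."""
    return math.log(factor) / math.log(1.0 + annual_growth)

# 34% per year is the growth rate quoted on the slide;
# 10x is an illustrative target, not a figure tied to this rate by the speaker.
years = years_to_multiply(0.34, 10.0)
print(f"{years:.1f} years to reach 10x at 34% per year")
```

At this rate, infrastructure would multiply tenfold in between seven and eight years.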

Page 4:

Plan for 10x Scale on Demand

Active-Active Datacenters (Multi-colo)

2013-2015: Capacity Uplift

Capacity Crisis

Page 5:

Unlimited Bandwidth

Compute on Demand

Scale Cost Effectively

Programmable Datacenter

2016+: Innovate for hyperscale

Innovate for Hyperscale

Page 6:

Freedom and Choice

Move Fast

Control / Independence

Quality / Maintenance

Risks / Security

Channel / Procurement

Build Strategy / Ownership

Growth! / Scale

Evolve / Code & Innovate

Flexibility / Customization

Modularity

Own the code

Page 7:

• Edge Network to Eyeballs (EdgeConnect)

• Backbone Network (Falco)

• Data Center Network (Open19 + SONiC + OpenFabric)

• Bare Metal HW (Open19)

• OS / Kernel (Linux)

• Container (LPS)

• Application

Own the code

Enables us to solve puzzles & complexities in different ways

Page 8:

• Load balancers: moved to application, x86 server running a BGP daemon.

• Firewalls: moved to application/server.

• NAS filers: failover complexity moved to servers running BGP daemons, allowed for L2 to L3 network migration.

On solving puzzles in a different way…
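
The idea of moving load balancing onto x86 servers that speak BGP can be sketched as a health check that announces or withdraws a service address. This is a minimal illustration, not LinkedIn's implementation: it assumes an ExaBGP-style daemon that reads text commands from a helper process's standard output, and the VIP address and function names are hypothetical.

```python
from typing import List

def bgp_command(vip: str, healthy: bool) -> str:
    """Build a text command for an ExaBGP-style process API (an assumption here)."""
    action = "announce" if healthy else "withdraw"
    # A /32 host route for the service VIP; the next hop is this server itself.
    return f"{action} route {vip}/32 next-hop self"

def on_health_change(vip: str, was_healthy: bool, is_healthy: bool) -> List[str]:
    """Return commands to emit only when the health state flips."""
    if was_healthy == is_healthy:
        return []
    return [bgp_command(vip, is_healthy)]

# Hypothetical VIP taken from the documentation range 192.0.2.0/24.
for cmd in on_health_change("192.0.2.10", was_healthy=False, is_healthy=True):
    print(cmd, flush=True)
```

When the service on a server fails its check, the route is withdrawn and the fabric stops sending traffic for that address to the server, which is how the failover logic ends up on the host rather than in a middle box.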

Page 9:

5-Stage BGP Clos

Single SKU Data Center

Single Chip Architecture

Project Altair

Page 10:

Simple, Open, Programmable, Independent

Core Design Principles

Page 11:

• Simplicity: “perfection has been reached not when there is nothing left to add, but when there is nothing left to take away.”

• Openness: Use community-based tools where possible.

• Independence: Refuse to develop a dependence on a single vendor or vendor-driven architecture (and hence avoid the inevitable forklift upgrades)

• Programmability: Being able to modify the behavior of the data center fabric in near real time in software…

Core Design Principles

Page 12:

The Building Block: Hardware

• Merchant Silicon Custom Designed Switch (ODM)

• No Big Chassis Switches

• Designed around robustness (NSR, ISSU, etc.)

• Feature-rich but mostly irrelevant to LinkedIn needs

• No (FCoE, VXLAN, EVPN, MCLAG, etc.)

Project Falco

Page 13:

The Building Block: Software

• Unified Architecture: Single SKU (hardware and software) for all switches while procuring hardware from multiple ODM channels (multi-homing)

• Minimum Features: BGP, BFD, IPv4, IPv6, ECMP, LLDP

• No Overlay: For the infrastructure, the application is stateless

• No Middle-box: (Firewall, Load-balancer, etc.), moved to application

• Network is only a set of intermediate boxes running Linux

• https://github.com/Azure/SONiC
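
ECMP, listed among the minimum features, spreads flows across equal-cost next hops while keeping each flow on a single path. The sketch below illustrates the idea with a CRC32 hash over the 5-tuple. Real switch ASICs use their own hardware hash functions, and the names and addresses here are hypothetical.

```python
import zlib
from typing import List, Tuple

# src IP, dst IP, IP protocol number, src port, dst port
FlowKey = Tuple[str, str, int, int, int]

def ecmp_next_hop(flow: FlowKey, next_hops: List[str]) -> str:
    """Pick one of the equal-cost next hops by hashing the flow's 5-tuple."""
    key = "|".join(str(field) for field in flow).encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]

leaves = ["leaf1", "leaf2", "leaf3", "leaf4"]   # hypothetical next hops
flow = ("10.0.0.1", "10.0.1.1", 6, 40000, 443)  # hypothetical TCP flow
print(ecmp_next_hop(flow, leaves))
```

Because the choice depends only on the flow's header fields, packets of one flow always take the same path, while different flows spread across the available uplinks.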

Page 14:

[Diagram: Altair topology. Pod 1 through Pod 64, each with ToR1 through ToR32 and Leaf1 through Leaf4; four groups of Spine1 through Spine32; tiers labeled ToR, Leaf, Spine.]

• True 5 Stage Clos Architecture (Maximum Path Length: 5 Chipsets to Minimize Latency)

• Moved complexity out of big boxes to where we can manage and control it!

• Single SKU - Same Chipset - Uniform IO design (Bandwidth, Latency and Buffering)

• Dedicated control plane, OAM and CPU for each ASIC

DC Architecture: Altair Design
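
The "maximum path length of 5 chipsets" follows from the tier layout: traffic climbs only as far as it needs to and then comes back down. A small sketch of that reasoning, assuming servers are located by hypothetical (pod, ToR) pairs:

```python
from typing import List, Tuple

# (pod number, ToR number within the pod); hypothetical identifiers
Location = Tuple[int, int]

def chip_path(src: Location, dst: Location) -> List[str]:
    """Tiers of switch ASICs a packet crosses between two servers."""
    if src == dst:
        return ["ToR"]                               # same rack
    if src[0] == dst[0]:
        return ["ToR", "Leaf", "ToR"]                # same pod, different rack
    return ["ToR", "Leaf", "Spine", "Leaf", "ToR"]   # different pods

print(len(chip_path((1, 1), (64, 32))))
```

The worst case, two servers in different pods, crosses five ASICs; anything closer crosses fewer.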

Page 15:

[Diagram: Altair topology. Pod 1 through Pod 64, each with ToR1 through ToR32 and Leaf1 through Leaf4; four groups of Spine1 through Spine32; tiers labeled ToR, Leaf, Spine.]

DC Architecture: Altair Design

Page 16:

Page 17:

● Modular and scalable growth

● Efficient server deployment

● Single protocol

● Operations friendly

● Predictable performance/failure

● Automation friendly

● Server-server latency 2.5 µs

Page 18:

[Diagram: two rows of servers, each attached to a ToR, beneath Fabric 1 through Fabric 4.]

Non-blocking Parallel Fabric

Page 19:

ToR - Top of the Rack

Broadcom Tomahawk 32x 100G

10/25/50/100G Attachment

Regular Server Attachment: 10G

Each Cabinet: 96 Dense Compute units

Half Cabinet (Leaf-Zone): 48x 10G ports for servers + 4 uplinks of 50G

Full Cabinet: 2x Single ToR Zones: 48 + 48 = 96 Servers

Project Falco

[Diagram: Server, ToR, Leaf and Spine tiers.]

Tier 1
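
The ToR port counts above imply a ratio between server-facing and fabric-facing bandwidth. The sketch below is plain arithmetic on the slide's numbers, assuming every server port runs at 10G; the resulting ratio is derived here and is not a figure stated in the talk.

```python
def oversubscription(down_ports: int, down_gbps: int,
                     up_ports: int, up_gbps: int) -> float:
    """Ratio of server-facing to fabric-facing bandwidth on one switch."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# Half cabinet from the slide: 48 x 10G server ports, 4 x 50G uplinks.
tor_ratio = oversubscription(48, 10, 4, 50)
print(f"ToR oversubscription: {tor_ratio}:1")
```

That is 480G toward servers against 200G toward the fabric, so any oversubscription in this design sits at the ToR, while the leaf and spine tiers are described as 1:1.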

Page 20:

Leaf

Broadcom Tomahawk 32x 100G

Non-Blocking Topology:

32x downlinks of 50G to serve 32 ToR

32x uplinks of 50G to provide 1:1 Over-subscription

Project Falco

[Diagram: Server, ToR, Leaf and Spine tiers.]

Tier 2
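
The leaf figures can be checked against the ASIC's port budget. This sketch assumes each 100G port is split into two 50G ports, which is an inference from the 50G link speeds on the slide rather than something the talk states.

```python
ASIC_PORTS_100G = 32          # Broadcom Tomahawk, 32 x 100G (from the slide)
BREAKOUT = 2                  # assumption: each 100G port split into 2 x 50G

ports_50g = ASIC_PORTS_100G * BREAKOUT
downlinks, uplinks = 32, 32   # leaf figures from the slide

# The leaf's links must fit within the ASIC's 50G port budget.
assert downlinks + uplinks <= ports_50g
ratio = (downlinks * 50) / (uplinks * 50)
print(f"{ports_50g} x 50G ports, leaf oversubscription {ratio:.0f}:1")
```

With 32 links down and 32 links up, the leaf uses all 64 of the 50G ports and carries as much bandwidth upward as it receives from the ToRs.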

Page 21:

Spine

Broadcom Tomahawk 32x 100G

Non-Blocking Topology:

64 downlinks to provide 1:1 Over-subscription

To serve 64 pods (each pod 32 ToR)

100,000 Servers: Each pod (Approximately 1550 Compute)

Project Falco

[Diagram: Server, ToR, Leaf and Spine tiers.]

Tier 3
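
The headline server count can be reproduced from the per-tier numbers on the slides. A quick sketch, assuming every ToR serves one full half-cabinet zone of 48 servers:

```python
PODS = 64               # from the spine slide
TORS_PER_POD = 32       # from the spine slide
SERVERS_PER_TOR = 48    # one half-cabinet ToR zone, from the ToR slide

servers_per_pod = TORS_PER_POD * SERVERS_PER_TOR
total_servers = PODS * servers_per_pod
print(servers_per_pod, total_servers)
```

This gives 1,536 servers per pod and 98,304 in total, consistent with the slide's "approximately 1550" per pod and the rounded 100,000 figure.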

Page 22:

• Fault isolation.

• Fault correlation and remediation.

• Build and operations automation.

• Physical design.

• Logical design.

Challenges

Page 23:

Looking Ahead

OPS optimization (Prediction & Remediation Engine)

Open Fabric (New WebScale Protocol)

12.8Tbps chip (Ultra-Low Latency)