Controlling Congestion in New Storage Architectures


September 15, 2015

Today’s Presenters

Chad Hintz – SNIA Ethernet Storage Forum Board Member; Solutions Architect, Cisco

David L. Fair – SNIA Ethernet Storage Forum Chair; Intel

SNIA Legal Notice

• The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted.
• Member companies and individual members may use this material in presentations and literature under the following conditions:
  • Any slide or slides used must be reproduced in their entirety without modification.
  • The SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations.
• This presentation is a project of the SNIA Education Committee.
• Neither the author nor the presenter is an attorney, and nothing in this presentation is intended to be, or should be construed as, legal advice or an opinion of counsel. If you need legal advice or a legal opinion please contact your attorney.
• The information presented herein represents the author's personal opinion and current understanding of the relevant issues involved. The author, the presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information. NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.

Terms

• STP (Spanning Tree Protocol): an older network protocol that ensures a loop-free topology for any bridged Ethernet local area network.
• Spine/Leaf, Clos, Fat-Tree, Multi-Rooted Tree: networks based on routing with all links active.
• VXLAN: a standards-based Layer 2 overlay scheme over a Layer 3 network.

Agenda

• Data Center “Fabrics”: Current State
• Current Congestion Control Mechanisms
• CONGA’s Design
• Why Is CONGA Important for IP-Based Storage and Congestion Control?
• Q&A

Storage over Ethernet Needs and Concerns

• Minimize the chance of dropped traffic
• In-order frame delivery
• Minimal oversubscription
• Lossless fabric for FCoE (no drop)

These requirements span both reliability and performance.

Data Center Fabric: Current State

Data Center “Fabric” Journey:

• STP (Spanning Tree) – blocking links
• STP with Multi-Chassis EtherChannel
• Spine-Leaf – Layer 3, all links active
• Spine-Leaf with VXLAN/EVPN – Layer 3 with a Layer 2 overlay, extended across the MAN/WAN

Multi-Rooted Tree (Spine/Leaf) ≈ Ideal DC Network

The ideal DC network would be a single giant switch with 1000s of server ports:
• No internal bottlenecks → predictable behavior
• Simplifies bandwidth management

We can’t build that switch, but a multi-rooted tree (spine/leaf) with 1000s of server ports approximates it. The trade-off: possible bottlenecks inside the fabric, so precise load balancing is needed.

Leaf-Spine DC Fabric

A leaf-spine fabric, with hosts (H1–H9) attached to leaf switches and the leaves interconnected through spine switches, approximates the ideal giant switch.

• How close is leaf-spine to the ideal giant switch?
• What impacts its performance? Link speeds, oversubscription, buffering.

Today: Equal-Cost Multipath (ECMP) Routing

ECMP picks among equal-cost paths with a hash of the packet header:
• Randomized load balancing
• Preserves packet order (good for TCP)

Problems:
• Hash collisions
• No awareness of congestion
• Each flow is pinned to one path (problems when a link is lost)
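To make concrete how ECMP pins a flow to one uplink, here is a minimal sketch in Python; the 5-tuple fields and the use of a generic hash are illustrative assumptions (real switches use vendor-specific hardware hash functions):

```python
import hashlib

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, num_paths):
    """Map a flow's 5-tuple to one of num_paths equal-cost uplinks.

    Every packet of the same flow hashes to the same path, which preserves
    ordering but also means two elephant flows can collide on one uplink
    while other uplinks sit idle.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

# Two different flows may land on the same uplink (a hash collision):
print(ecmp_path("10.0.0.1", "10.0.1.9", 33012, 3260, "tcp", 4))
print(ecmp_path("10.0.0.2", "10.0.1.9", 40518, 3260, "tcp", 4))
```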

Impact of Link Speed

Three non-oversubscribed topologies, each with 20×10Gbps downlinks:
• 20×10Gbps uplinks
• 5×40Gbps uplinks
• 2×100Gbps uplinks

How Does Link Speed Affect ECMP?

Higher-speed uplinks improve ECMP efficiency. With 11×10Gbps flows (55% load):
• 20×10Gbps uplinks: probability of 100% throughput = 3.27%
• 2×100Gbps uplinks: probability of 100% throughput = 99.95%

Source: http://simula.stanford.edu/~alizade/papers/conga-sigcomm14.pdf
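The 3.27% figure for the 20×10Gbps case can be checked with a few lines of arithmetic: full throughput requires the 11 random 10Gbps flows to hash onto 11 distinct 10Gbps uplinks. A quick sketch:

```python
# Probability that 11 random 10Gbps flows hashed onto 20 x 10Gbps uplinks
# all land on distinct uplinks (the only way every flow gets full rate).
uplinks, flows = 20, 11
prob = 1.0
for i in range(flows):
    prob *= (uplinks - i) / uplinks
print(f"P(100% throughput) = {prob:.2%}")  # ~3.27%, matching the slide
```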

Impact of Link Speed

[Chart: average FCT (normalized to optimal) vs. load (%) for large (10MB, ∞) background flows; series: OQ-Switch, 20×10Gbps, 5×40Gbps, 2×100Gbps.]
Source: http://simula.stanford.edu/~alizade/papers/conga-sigcomm14.pdf

• 40/100Gbps fabric: approximately the same flow completion time (FCT) as the giant switch (OQ).
• 10Gbps fabric: FCT up to 40% worse than OQ.

Storage over Spine-Leaf

• New scale-out storage spreads initiators and targets over multiple leaf switches.
• Concerns:
  • Multiple hops
  • Potential for increased latency
  • Oversubscription
  • TCP incast
  • Potential buffering issues

Incast Issue with IP-Based Storage

Many initiators (senders) spread across many paths converge on a single target (receiver). Incast events are most severe at the receiver (iSCSI and other IP-based storage).

Summary

• A 40/100Gbps fabric with ECMP approximates a giant switch; there is some performance loss with a 10Gbps fabric.
• Oversubscription (incast) in IP storage networks is very common and has a cascading effect on performance and throughput.

Current Congestion Control Mechanisms: Hop-by-Hop

IEEE 802.1Qaz Enhanced Transmission Selection (ETS)

• Required when consolidating I/O: it’s a QoS problem.
• Prevents a single traffic class from “hogging” all the bandwidth and starving other classes.
• When a given class doesn’t fully utilize its allocated bandwidth, the remainder is available to other classes.
• Helps accommodate classes of a “bursty” nature.

[Diagram: Ethernet and FCoE traffic sharing one wire with a 50%/50% bandwidth allocation.]
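The ETS behavior described above (guaranteed minimum shares, with unused bandwidth available to other classes) can be illustrated with a small weighted scheduler sketch. The class names, weights, and byte quantum below are illustrative assumptions, not part of the 802.1Qaz standard, which is implemented in switch hardware:

```python
from collections import deque

class EtsScheduler:
    """Minimal deficit-weighted-round-robin sketch of ETS-style sharing."""

    def __init__(self, weights):
        # weights: {class_name: percent of link bandwidth}
        self.weights = weights
        self.queues = {c: deque() for c in weights}
        self.deficit = {c: 0.0 for c in weights}

    def enqueue(self, cls, frame_bytes):
        self.queues[cls].append(frame_bytes)

    def transmit_round(self, link_bytes=3000):
        sent = []
        for cls, weight in self.weights.items():
            # Each class earns credit proportional to its ETS weight.
            self.deficit[cls] += link_bytes * weight / 100
            q = self.queues[cls]
            while q and q[0] <= self.deficit[cls]:
                frame = q.popleft()
                self.deficit[cls] -= frame
                sent.append((cls, frame))
            if not q:
                # Unused allocation is not hoarded; other classes can use
                # the spare link time in later rounds.
                self.deficit[cls] = 0.0
        return sent

sched = EtsScheduler({"FCoE": 50, "LAN": 50})
sched.enqueue("FCoE", 1000)
sched.enqueue("LAN", 1000)
print(sched.transmit_round())  # both classes get their guaranteed share
```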

IEEE 802.1Qbb Priority Flow Control (PFC)

• PFC enables flow control on a per-priority basis.
• This allows lossless and lossy priorities to coexist on the same wire.
• FCoE can operate over a lossless priority, independent of other priorities.
• Traffic assigned to other CoS values continues to transmit and relies on upper-layer protocols for retransmission.
• Not only for FCoE traffic.

[Diagram: Ethernet and FCoE traffic on one wire, with PFC pausing only the FCoE priority.]
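A minimal sketch of the per-priority pause decision: only the lossless priority is paused when its buffer crosses a threshold, while lossy classes keep transmitting. The thresholds and priority number are illustrative assumptions; real PFC is implemented in switch hardware and signaled with 802.1Qbb per-priority pause frames:

```python
# Illustrative per-priority pause logic: only the lossless priority is
# paused when its ingress buffer crosses a threshold; other priorities
# keep transmitting and rely on upper layers (e.g., TCP) for recovery.
XOFF_THRESHOLD = 80_000    # bytes in use before pausing (assumed)
XON_THRESHOLD = 40_000     # resume level (assumed)
LOSSLESS_PRIORITIES = {3}  # e.g., the FCoE priority (assumed)

def pfc_action(priority, buffer_bytes, currently_paused):
    """Return 'pause', 'resume', or 'none' for a given priority queue."""
    if priority not in LOSSLESS_PRIORITIES:
        return "none"                      # lossy classes are never paused
    if not currently_paused and buffer_bytes >= XOFF_THRESHOLD:
        return "pause"                     # send PFC pause for this priority
    if currently_paused and buffer_bytes <= XON_THRESHOLD:
        return "resume"                    # send PFC pause with time 0
    return "none"

print(pfc_action(3, 90_000, False))  # pause
print(pfc_action(0, 90_000, False))  # none (lossy class keeps sending)
```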

Adding in Spine-Leaf

• Use IEEE ETS to guarantee bandwidth for traffic types.
• Use IEEE PFC to create lossless classes for FCoE.
• Use the Ethernet infrastructure for all kinds of storage.
• Improve scalability for all application needs and maintain high, consistent performance for all traffic types, not just storage.

Problems ETS/PFC do not solve

• They do not take into consideration Layer 3 links and ECMP in a spine-leaf topology; they are limited to hop-by-hop links.
• PFC was designed for lossless traffic, not typical IP-based storage.
• ETS guarantees bandwidth; it does not alleviate congestion.

The network paradigm as we know it…

Control and Data Plane

• Two models:
  • Distributed control and data plane (traditional): the control and data plane reside within the physical device.
  • Centralized (controller-based / SDN).

What is SDN? Per the ONF definition: “The physical separation of the network control plane from the forwarding plane, and where a control plane controls several devices.”
https://www.opennetworking.org/sdn-resources/sdn-definition

In other words: in the SDN paradigm, not all processing happens inside the same device.

CONGA’s Design

CONGA in 1 Slide

1. Leaf switches (top-of-rack) track congestion to the other leaves, over the different paths, in near real time.
2. They send traffic on the least congested path(s).

Fast feedback loops run between leaf switches (e.g., L0, L1, L2), directly in the data plane.

Could this work with a centralized control plane?

• If the control plane is separate, feedback can still be carried in the data plane, but it would then have to be computed at a central control point (the controller).
• The latency of that round trip, combined with constant change in the network, makes this impractical.

CONGA in Depth: VXLAN Overlay

CONGA operates over a standard DC overlay (VXLAN), which is already broadly supported for virtualizing the physical network.

[Diagram: hosts H1–H9 attached to leaves L0, L1, L2; H1→H9 traffic is VXLAN-encapsulated as L0→L2 across the fabric.]

[Slide: VXLAN frame format.]
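To make the overlay concrete, here is a sketch of the 8-byte VXLAN header in Python. Carrying the path ID and congestion metric in otherwise-reserved overlay header bits follows the CONGA paper’s approach, but the exact bit layout below is an illustrative assumption:

```python
import struct

def vxlan_header(vni, path_id=0, ce=0):
    """Build a simplified 8-byte VXLAN header.

    Standard layout: 8 flag bits, 24 reserved bits, 24-bit VNI, 8 reserved
    bits. For illustration, a 4-bit path ID (LBTag) and a 3-bit congestion
    metric (CE) are tucked into reserved space, as CONGA-style leaf-to-leaf
    feedback does with overlay header bits (layout here is assumed).
    """
    flags = 0x08                                   # "I" bit: VNI is valid
    reserved1 = ((path_id & 0xF) << 4) | (ce & 0x7)  # illustrative packing
    word1 = (flags << 24) | reserved1
    word2 = (vni & 0xFFFFFF) << 8                  # VNI in the upper 24 bits
    return struct.pack("!II", word1, word2)

hdr = vxlan_header(vni=5000, path_id=2, ce=5)
print(hdr.hex())
```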

CONGA in Depth: Leaf-to-Leaf Congestion

CONGA tracks path-wise congestion metrics (3 bits) between each pair of leaf switches:

• On the forward path, every hop raises the packet’s congestion field: pkt.CE ← max(pkt.CE, link.util).
• The destination leaf (L2 in the example) records the arriving value in a Congestion-From-Leaf table indexed by source leaf and path (e.g., L0→L2, path 2, CE = 5).
• The destination leaf feeds the metric back to the source leaf (e.g., L2→L0: FB-Path = 2, FB-Metric = 5), which stores it in a Congestion-To-Leaf table indexed by destination leaf and path.

[Diagram: leaves L0, L1, L2 with hosts H1–H9 and spine paths 0–3; the Congestion-From-Leaf table at L2 and the Congestion-To-Leaf table at L0.]
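A software sketch of the two per-leaf tables and the in-network metric update follows; the class and method names are assumptions for illustration, since in CONGA these structures live in the switch ASIC:

```python
# Sketch of CONGA-style leaf-to-leaf congestion tracking (illustrative).
NUM_PATHS = 4  # uplink choices per leaf pair, e.g., paths 0-3

class LeafSwitch:
    def __init__(self, name):
        self.name = name
        # congestion_to_leaf[dest_leaf][path] = last fed-back metric (0-7)
        self.congestion_to_leaf = {}
        # congestion_from_leaf[src_leaf][path] = metric measured on arrival
        self.congestion_from_leaf = {}

    def forward(self, pkt, link_utils):
        """Each hop raises the packet's CE field to the highest link
        utilization seen so far: pkt.CE <- max(pkt.CE, link.util)."""
        for util in link_utils:
            pkt["ce"] = max(pkt["ce"], util)
        return pkt

    def receive(self, pkt):
        """Destination leaf records the metric for (source leaf, path) and
        returns the feedback it would piggyback on reverse traffic."""
        table = self.congestion_from_leaf.setdefault(pkt["src_leaf"], [0] * NUM_PATHS)
        table[pkt["path"]] = pkt["ce"]
        return {"fb_path": pkt["path"], "fb_metric": pkt["ce"], "to": pkt["src_leaf"]}

    def apply_feedback(self, dest_leaf, fb):
        """Source leaf updates its Congestion-To-Leaf table from feedback."""
        table = self.congestion_to_leaf.setdefault(dest_leaf, [0] * NUM_PATHS)
        table[fb["fb_path"]] = fb["fb_metric"]

l0, l2 = LeafSwitch("L0"), LeafSwitch("L2")
pkt = {"src_leaf": "L0", "path": 2, "ce": 0}
pkt = l0.forward(pkt, link_utils=[1, 5])   # congested spine link (util 5)
fb = l2.receive(pkt)                       # L2 notes CE=5 for L0, path 2
l0.apply_feedback("L2", fb)                # L0 learns path 2 toward L2 is busy
print(l0.congestion_to_leaf["L2"])         # [0, 0, 5, 0]
```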

CONGA in Depth: Load-Balancing Decisions

The source leaf sends each packet on the least congested path, using its Congestion-To-Leaf table. In the example at L0, the best path p* is path 3 for L0→L1 and path 0 or 1 for L0→L2.

To avoid reordering TCP packets, these decisions are made per flowlet [Kandula et al., 2004]:
http://groups.csail.mit.edu/netmit/wordpress/wp-content/themes/netmit/papers/texcp-hotnets04.pdf

[Diagram: leaves L0, L1, L2, spine paths 0–3, and the Congestion-To-Leaf table at L0.]
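Continuing the sketch above, path selection at the source leaf could look like the following (again an illustrative software model of what CONGA does in hardware):

```python
import random

def pick_path(congestion_to_leaf, dest_leaf):
    """Choose the least congested path toward dest_leaf; break ties randomly
    so traffic spreads across equally good paths (e.g., p* = 0 or 1)."""
    metrics = congestion_to_leaf[dest_leaf]
    best = min(metrics)
    candidates = [p for p, m in enumerate(metrics) if m == best]
    return random.choice(candidates)

# Using the L0 table from the previous sketch ([0, 0, 5, 0] toward L2),
# path 2 is avoided and one of paths 0, 1, or 3 is chosen.
print(pick_path({"L2": [0, 0, 5, 0]}, "L2"))
```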

CONGA in Depth: Flowlet Switching

• State-of-the-art ECMP hashes flows (5-tuples) to a single path to prevent reordering of TCP packets.
• Flowlet switching* routes bursts of packets from the same flow independently.
• There is no packet reordering as long as the idle gap between bursts is at least the difference in path delays: gap ≥ |d1 − d2|.

*Flowlet Switching (Kandula et al., 2004): http://groups.csail.mit.edu/netmit/wordpress/wp-content/themes/netmit/papers/texcp-hotnets04.pdf
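A minimal sketch of flowlet detection at a leaf switch: if a flow has been idle longer than a gap threshold chosen to exceed the path-delay difference |d1 − d2|, the next packet starts a new flowlet and may take a different path. The function names and the 500 µs threshold are illustrative assumptions:

```python
import time

FLOWLET_GAP_S = 0.0005  # 500 microseconds; must exceed |d1 - d2| (assumed)

# flow 5-tuple -> (assigned path, timestamp of last packet)
flowlet_table = {}

def route_packet(five_tuple, now, choose_path):
    """Return the path for this packet. A new flowlet (idle gap exceeded)
    may take a different, currently least congested path; packets within a
    flowlet stick to one path, so TCP never sees reordering."""
    entry = flowlet_table.get(five_tuple)
    if entry is None or now - entry[1] > FLOWLET_GAP_S:
        path = choose_path()              # e.g., pick_path(...) from above
        flowlet_table[five_tuple] = (path, now)
        return path
    path, _ = entry
    flowlet_table[five_tuple] = (path, now)
    return path

flow = ("10.0.0.1", "10.0.1.9", 33012, 3260, "tcp")
print(route_packet(flow, time.monotonic(), lambda: 1))  # first flowlet -> path 1
print(route_packet(flow, time.monotonic(), lambda: 2))  # same burst -> still path 1
```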

Of Elephants and Mice

• Two types of flows exist in the data center:
  • Long-lived flows (“elephants”): data or block storage migrations, VM migrations, MapReduce. These are the flows that fill buffers; there are not many in a data center, but just a few can be impactful.
  • Short-lived flows (“mice”): web requests, emails, small data requests. These can be bursty.
• How they interact is key. With traditional ECMP, multiple long-lived flows can be hashed onto a few links; if mice are mapped to those same links, application performance suffers.

We need a new metric to capture this impact: application flow completion time (FCT).

CONGA Fabric Load Balancing: Dynamic Flow Prioritization

Real traffic is a mix of large (elephant) and small (mice) flows.

• Standard (single priority): large flows severely impact performance (latency and loss) for small flows.
• Dynamic flow prioritization: the fabric automatically gives small flows a higher priority.

Key idea: the fabric detects the initial few flowlets of each flow and assigns them to a high-priority class.
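A minimal sketch of that key idea: count flowlets per flow at the leaf and mark only the first few as high priority, so mice finish quickly while sustained elephants fall back to the standard class. The threshold and class names are illustrative assumptions:

```python
# Illustrative dynamic flow prioritization: the first few flowlets of every
# flow ride in the high-priority class; after that the flow is treated as a
# (likely) elephant and demoted to standard priority.
HIGH_PRIORITY_FLOWLETS = 3   # assumed threshold

flowlet_count = {}           # flow 5-tuple -> flowlets seen so far

def classify(five_tuple, new_flowlet):
    if new_flowlet:
        flowlet_count[five_tuple] = flowlet_count.get(five_tuple, 0) + 1
    if flowlet_count.get(five_tuple, 0) <= HIGH_PRIORITY_FLOWLETS:
        return "high"        # mice (and the start of every flow) go here
    return "standard"        # sustained elephants drop to the default class

flow = ("10.0.0.5", "10.0.1.9", 51000, 2049, "tcp")
print(classify(flow, new_flowlet=True))   # high
for _ in range(4):
    classify(flow, new_flowlet=True)
print(classify(flow, new_flowlet=False))  # standard (now an elephant)
```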

Why Is CONGA Important for IP-Based Storage?

Storage Flows Are Elephant Flows

• With flowlet switching, long-lived (block) storage flows are broken into flowlets (bursts) and routed across multiple paths with no packet reordering, preserving in-order delivery.
• Traffic is sent on the least congested path using the CONGA feedback loop.
• Losing a link in a path means losing only a flowlet, which is a minimal disruption: no TCP reset (iSCSI, NFS, applications), just retransmission of the small burst that was lost.
• Object- and file-based short flows (mice) get higher priority and complete faster.

CONGA for “Elephant” flows (block)

[Chart: FCT normalized to ECMP vs. load (%) for mice flows (<100KB) and elephant flows (>10MB), comparing ECMP and CONGA.]

CONGA is up to 35% better than ECMP for elephant flows.

CONGA for “Mice” flows (file or object)

[Chart: FCT normalized to ECMP vs. load (%) for mice flows (<100KB) and elephant flows (>10MB), comparing ECMP and CONGA.]

CONGA is up to 40% better than ECMP for mice flows.

Single Fabric for Storage (Block, File, or Object) and Data

[Chart: FCT normalized to ECMP vs. load (%) for mice flows (<100KB) and elephant flows (>10MB), comparing ECMP and CONGA.]

One fabric serves both: CONGA is up to 35% better than ECMP for elephants and up to 40% better for mice.

Link Failures with Minimal Loss

[Chart: overall average FCT (normalized to optimal) vs. load (%), comparing ECMP and CONGA.]

Summary

CONGA with DCB meets the needs of storage over Ethernet:

• Minimize the chance of dropped traffic: routing flowlets over the least congested path means a link loss has minimal impact.
• In-order frame delivery: flowlet switching with CONGA guarantees in-order delivery.
• Minimal oversubscription / incast issues: a 40G fabric with enhanced ECMP plus mice/elephant flow detection and separation.
• Lossless fabric for FCoE (no drop): CONGA and DCB (PFC, ETS) can be implemented together.

After This Webcast

• This webcast will be posted to the SNIA Ethernet Storage Forum (ESF) website and available on demand: http://www.snia.org/forums/esf/knowledge/webcasts
• A full Q&A from this webcast, including answers to questions we couldn’t get to today, will be posted to the SNIA-ESF blog: http://sniaesfblog.org/
• Follow us on Twitter @SNIAESF

Q&A

Thank You