Synchronized Progress in Interconnection Networks (SPIN) · Synchronized Progress in...

39
Synchronized Progress in Interconnection Networks (SPIN) : A new theory for deadlock freedom Aniruddh Ramrakhyani Georgia Tech ([email protected]) Tushar Krishna Georgia Tech ([email protected]) Paul V. Gratz Texas A&M University ([email protected]) ISCA 2018 Session 8B: Interconnection Networks

Transcript of Synchronized Progress in Interconnection Networks (SPIN) · Synchronized Progress in...

Synchronized Progress in Interconnection Networks (SPIN) : A new theory for

deadlock freedom

Aniruddh Ramrakhyani Georgia Tech

([email protected])

Tushar Krishna Georgia Tech

([email protected])

Paul V. Gratz Texas A&M University

([email protected])

ISCA 2018 Session 8B: Interconnection Networks

Network Routing2

B A

C

F

E

D

G

H

I

J

KDeadlock

Routing DeadlocksA Routing Deadlock is a cyclic buffer dependency chain that renders forward progress impossible.

3

A

C D

B

Deadlock

E

F

Routing DeadlocksA Routing Deadlock is a cyclic buffer dependency chain that renders forward progress impossible.

Deadlocks are a fundamental problem in both off-chip and on-chip interconnection networks.

Cause system breakdown and kill chips.

Deadlocks are hard to detect during functional verification.

Manifest after a long use time.

Depend on : traffic pattern, injection rate, congestion.

Show up due to system wear out faults and power-gating of network elements which are hard to simulate.

Need a solution for functional correctness !!

4

Solution I: Dally’s TheoryDefines a strict order in acquisition of links and/or buffers which ensures a cyclic dependency is never created.       

5

A

C D

BE

F1

2

3

45

6

Higher to Lower not

allowed

Solution I: Dally’s TheoryDefines a strict order in acquisition of links and/or buffers which ensures a cyclic dependency is never created.  

Implementations: Turn model [5], XY routing, Up-Down routing [20].

Limitations:

Routing Restrictions: Increased Latency, Throughput loss, Energy overhead

Require large no. of VCs for fully adaptive routing.

6

A

C D

BE

F

Solution II: Duato’s TheoryAdds buffers to create a deadlock free escape path that can be used to avoid/recover from deadlocks.

Implementation: turn restrictions in escape-VC.

7

E

FVC0

VC0

Escape-VC

Escape-VC

Solution II: Duato’s Theory

Limitations:

Energy and Area overhead of escape VCs.

Additional routing tables/logic for routing within escape-VC.

8

Adds buffers to create a deadlock free escape path that can be used to avoid/recover from deadlocks.

Implementation: turn restrictions in escape-VC.

Other SolutionsSolution III: Flow Control

Restrict injection when no. of empty buffers fall below a threshold

Implementation: Bubble Flow Control [9]

Limitation: Implementation Complexity, Throughput Loss.

Solution IV: Deflection Routing

Assign every flit to some output port even if they get misrouted.

Implementation: BLESS [10], CHIPPER [35]

Limitation: Livelocks, non-minimal routing

9

Comparison of Deadlock Freedom Theories10

Acyclic CDG not Required

No Packet Injection

RestrictionsLivelock

Free

VC cost for Mesh Routing

Minimal Adaptive

Topology Indepen-

dent

Dally

DuatoFlow

ControlDeflection Routing

SPIN

Theory

Metric

1 61 2

2 2

1

1 1

Can we do better ??

OutlineRouting Deadlocks

State of the Art

Dally’s Theory

Duato’s Theory

Flow Control Routing

Deflection Routing

SPIN : Synchronized Progress in Interconnection Networks

Evaluations

Conclusion

11

SPIN : Key Idea12

A

C D

B

Deadlock

E

F

What if: We coordinate the movement of every packet to the next hop at a given time ??

Simultaneous Synchronized

Movement

Simultaneous Synchronized Movement of all deadlocked packets in the loop is called a spin.

spin complete

SPIN : Key Idea13

Simultaneous Synchronized Movement of all deadlocked packets in the loop is called a spin.

Each spin leads to one hop forward movement of all deadlocked packets.

One spin may not resolve the deadlock. If so, spin can be repeated

Deadlock is guaranteed to be resolved in a finite number of spins [proof in paper, Sec. III]

SPIN : Key Idea14

A

C D

BE

F

First spin complete

Second spin complete

SPIN : Key Idea15

A

C

D

B

E

F

Packets E &B exit the loop

Deadlock Resolved

OutlineRouting Deadlocks

State of the Art

SPIN : Synchronized Progress in Interconnection Networks

Key Idea

Implementation Example

Micro-architecture

FAvORS

Evaluations

Conclusion

16

SPIN: Implementation ExampleSPIN is a generic deadlock freedom theory that can have multiple implementations.

We choose a recovery approach as deadlocks are rare scenarios (See Sec. II-F).

Our Implementation:

Detect the Deadlock.

Coordinate a time for spin.

Execute the spin.

17

Implementation Example : Detect Deadlocks

Use counters.

Placed at every node at design time. Optimize by exploiting topology symmetry (See Static Bubble [6]).

If packet does not leave in threshold time (configurable), it indicates a potential deadlock.

Counter expired ? Send probe to verify deadlock.

18

Implementation Example : Probe Msg.19

A

C D

BE

FCounter

Expires at Node 5

Probe

Send Probe

1. Deadlock Detection

2. Coordinating the spin.

3. Executing the spin.

Probe Returns: Deadlock Confirmed

Implementation Example : Probe Msg.Probe is a special message that tracks the buffer dependency.

Probe returns to sender: Cyclic buffer dependence, hence deadlock.

Next, send a move msg. to convey the spin time Upon receiving move msg., router sets its counter to count to spin cyle.

20

Implementation Example : Move Msg.21

A

C D

BE

F

Move

Send Move

Set counter to count to spin cycle

Move returns

1. Deadlock Detection

2. Coordinating the spin.

3. Executing the spin.

Implementation Example : spin22

A

C D

BE

F

Counters expire together in the spin

cycle

1. Deadlock Detection

2. Coordinating the spin.

3. Executing the spin.

Implementation Example : spin23

A

C

D

B

EF1. Deadlock Detection

2. Coordinating the spin.

3. Executing the spin.

Multiple SPIN OptimizationResolving a deadlock may require multiple spins

After spin, router can resume normal operation.

Counter expires again, process repeated.

Optimization: send probe_move after spin is complete.

probe_move checks if deadlock still exists and if so, sets the time for the next spin.

Details in paper (Sec. IV-B).

24

OutlineRouting Deadlocks

State of the Art

SPIN : Synchronized Progress in Interconnection Networks

Key Idea

Implementation Example

Micro-architecture

FAvORS

Evaluations

Conclusion

25

Implementation Micro-architectureNo additional links: Spl. Msgs. use the same links as regular flits.

Spl. Msgs. have higher priority in link usage over regular flits.

Links are anyways idle during deadlocks.

Bufferless Forwarding: Spl. Msgs. are not buffered anywhere (either forwarded or dropped).

Distributed Design: any router can initiate the recovery.

4% area overhead compared to traditional mesh router in 15nm Nangate [42].

26

OutlineRouting Deadlocks

State of the Art

SPIN : Synchronized Progress in Interconnection Networks

Key Idea

Walkthrough Example

Micro-architecture

FAvORS

Evaluations

Conclusion

27

FAvORS Routing AlgorithmSPIN is the first scheme that enables true one-VC fully adaptive deadlock-free routing for any topology.

FAvORS : Fully Adaptive One-vc Routing with SPIN.

Algorithm has two flavors: Minimal Adaptive Non-minimal Adaptive.

Route Selection Metrics: Credit turn-around time Hop Count

More details in paper (Sec. V).

28

OutlineRouting Deadlocks

State of the Art

SPIN : Synchronized Progress in Interconnection Networks

Evaluations

Conclusion

29

Evaluations 30

Simulator gem5 simulator + Garnet 2.0 Network model

Topologies 8x8 Mesh 1024 node Off-chip Dragon-fly

Link Latency

1-cycle Inter-group: 3-cycleIntra-group: 1-cycle

Traffic Synthetic + Multi-threaded (PARSEC)

Synthetic

Network Configuration:

Evaluations : Baselines8x8 Mesh:

31

Design Routing Adaptivity

Minimal Theory Deadlock Freedom Type

West-first Routing Partial Yes Dally Avoidance

Escape-VC Full Yes Duato Avoidance

Static-Bubble [6] Full Yes Flow-Control Recovery

1024 Node Off-chip Dragon-fly:

Design Routing Adaptivity

Minimal Theory Deadlock Freedom Type

UGAL [37] Full No Dally Avoidance

Saturation Throughput 1024-node Off-chip Dragon-fly:

32

0255075

100

0.01 0.03 0.05 0.07 0.09 0.11

Bit-complement

Late

ncy

(cyc

les)

Inj. Rate (flits/node/cycle)

0255075

100

0.01 0.08 0.15 0.22 0.29 0.36Inj. Rate (flits/node/cycle)

Neighbor

Late

ncy

(cyc

les)

UGAL_3VCSPIN

UGAL_3VCDally

FAvORS_NMin_1VCSPIN

Minimal_1VCSPIN

50% higher throughput compared to UGAL_Dally

62% higher throughput compared to

Minimal Routing 1-VC

25% higher throughput compared to UGAL_Dally

Saturation Throughput 33

8x8 On-chip Mesh:

Transpose (3-VC)

Inj. Rate (flits/node/cycle)

Late

ncy

(cyc

les)

West-First_3VCDally

0

25

50

75

100

0.001 0.031 0.061 0.091 0.121

Static_Bubble_3VCFlow-Control

EscapeVC_3VCDuato

FAvORS_Min_3VCSPIN

0

25

50

75

100

0.001 0.031 0.061 0.091 0.121

Transpose (1-VC)

Inj. Rate (flits/node/cycle)

Late

ncy

(cyc

les)

FAvORS_Min_1VCSPIN

West-First_1VCDally

68% higher throughput compared

to West-first 3-VC

8% higher throughput compared

to Escape-VC 3-VC

80% higher throughput compared

to West-First 1-VC

10% higher throughput

compared to Static-Bubble 3-VC

ConclusionDeadlocks are a fundamental problem in Interconnection Networks.

SPIN is a new deadlock freedom theory

Simultaneous packet movement for deadlock recovery

No routing restrictions or escape-VCs required.

Enables true one-VC fully adaptive routing for any topology

Salient Features of our Implementation:

Scalable: Distributed Deadlock Resolution

Plug-n-Play: topology agnostic

68% higher (Mesh) & 62% higher (dragon-fly) saturation throughput.

34

ConclusionPractical Applications:

35

On-Chip Mesh (Intel Xeon Phi, Cavium Thunder X2)

Super-computers Dragon-fly (Cray XC Networks)

Datacenters JellyFish (HP), Fat Tree (Google)

Irregular Topologies Faults (Static Bubble [6])Power-gating (Router Parking[29])

NoC Generators FlexNoc (ARTERIS),Sonics GN (SONICS)

Domain specific Accelerators

Eyeriss [15]

Thank you !!

Back-up36

SPIN : ApplicationsSPIN is a generic deadlock freedom theory

Scalable: distributed deadlock resolution

Plug-n-Play: doesn’t require knowledge of topology

SPIN can thus be used in :

On-chip networks: Mesh (Intel SCC, Tilera Tile64)

Supercomputers: Dragon-fly (Cray XC Networks)

Datacenters: Jellyfish (HP), Fat Tree (Google)

Static & Dynamically Changing Irregular topologies due to faults (Static Bubble [6]) & power-gating (Router Parking [29])

NoC Generators (Opensmart [13]) & Domain specific accelerator (Eyeriss[15])

37

Implementation Example : Probe Msg.38

A

C D

BE

FCounter

Expires at Node 5

Probe

Send Probe

Probe Returns: Deadlock Confirmed

Implementation Example : Move Msg.39

A

C D

BE

F

Move

Send Move

Set counter to count to spin cycle

Cntr

Move returns

1. Counter Expires2. Send Probe3. Send Move4. Counter expires

in spin cycle5. Spin