High Throughput Architecture and Routing Algorithms
Towards the Design of Reliable Mesh-based Many-Core Network-on-Chip Systems
Graduate School of Computer Science and Engineering
Adaptive Systems Laboratory
Jan. 14, 2015 Doctor Dissertation Final Review
d8141104, Akram Ben Ahmed
Supervised by Prof. Abderazek Ben Abdallah
Outline
• Background
• Research motivation
• Research goal and contributions
• Related work
• Graceful fault-tolerant routing algorithms
• Reliable router architecture and design
• Evaluation
• Conclusion and discussion
• System-on-Chip (SoC)
– Required components are integrated on a single chip.
– Different LSI must be developed for each application.
• System-in-Package (SiP) or 3D IC
– Required components are stacked for each application.
Design cost of LSI is increasing
Fig.1 System-in-Package [NOCs2014]
Fig.2 Computation power scaling [SoCPaR 2014]
Era of Many-core Chips
• To keep up with demands on computational power we need to:
- Increase parallelism (ILP/TLP/CLP).
- Provide an efficient and low-power interconnect infrastructure to achieve better scalability, bandwidth, and reliability.
• Gate delay
– Continuous decrease
• Interconnect delay
– Exponential increase
• Constant increase of the number of cores: from Multi-Core to Many-Core
Fig.3 Cores' number scaling [SoCPaR 2014]
Fig.4 Gate and interconnect delay over time [VLSI2005]
Point-to-Point, Bus-based, Circuit switching
[Figure: point-to-point and bus-based interconnects linking processing elements (P1, P2, P3), memories (M1, M2), and IO, with waiting requesters on the bus; circuit-switching timing diagram over time showing header probe, acknowledgment, and data, with tsetup and tdata]
tr = routing time
ts = setup time
M: memory. P: Processing Element. IO: input/output peripheral.
Limited bandwidth and significant power overhead due to the long path setup latency.
Fig.5 Communication overhead in circuit switching-based interconnect
Network-on-Chip
Fig. 6 Network-on-Chip architecture
R: Router. NI: Network interface. PE: Processing Element.
Packet: Head flit, Body flit, Tail flit (figure labels: flit information, carried payload, ending flit)
Multihop communication: Receive -> Buffer -> Transmit every flit at every switch (RX/TX at each hop).
Fig. 7 Conventional router architecture
Network-on-Chip
Fig. 8 Mesh topology
Fig. 9 Torus topology
Fig. 10 Store & Forward switching
Fig. 11 Wormhole switching
Fig. 12 Routing minimality
Fig. 13 Routing locality
Fig. 14 Routing adaptivity
Fig. 15 Credit based flow control
Fig. 16 ACK/NACK flow control
Network-on-Chip
Fig. 17 3D-Network-on-Chip architecture
[Figure: an 8x8 2D mesh (routers 00 to 77) next to a 3D mesh made of four stacked 4x4 layers (Layer1 to Layer4), drawn on X, Y, Z axes. Each router is addressed NM (in decimal). Vertical links connect the layers; lateral links connect routers within a layer.]
Wire length reduction: 2D dimensions a and b become a/2 and b/2 in 3D.
Footprint reduction: 2D versus 3D.
c = die thickness (0.6 mm) + inter-die distance
Ranges annotated in the figure: (1 mm ~ 4 mm), (10 μm ~ 200 μm)
Network-on-Chip
Fig. 17 3D-Network-on-Chip architecture
[Figure: the same 2D and 3D meshes with the longest minimal path marked hop by hop: 14 hops in the 2D mesh versus 9 hops in the 3D mesh.]
Diameter reduction
Diameter: the number of hops that a flit traverses in the longest possible minimal path between a (source, destination) pair.
Packet energy reduction
n = the number of flits
h = the number of hops
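The hop counts in the figure follow from the diameter definition. A minimal sketch, assuming regular meshes without wrap-around links (the function name `mesh_diameter` is illustrative, not from the slides):

```python
def mesh_diameter(*dims):
    """Longest minimal path, in hops, between two routers of a regular mesh.

    In a mesh without wrap-around links the farthest pair sits at opposite
    corners, so the diameter is the sum of (size - 1) over every dimension.
    """
    return sum(size - 1 for size in dims)


# 64 routers arranged as the 2D 8x8 mesh and as the 3D 4x4x4 mesh of the figure.
print("2D 8x8   :", mesh_diameter(8, 8), "hops")
print("3D 4x4x4 :", mesh_diameter(4, 4, 4), "hops")
```

With the same number of routers, stacking into four layers shortens the worst-case path from 14 to 9 hops, which is the diameter reduction the slide refers to.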
Network-on-Chip
Fig. 17 3D-Network-on-Chip architecture
[Figure: the same 2D and 3D meshes with the longest minimal path marked hop by hop: 14 hops in the 2D mesh versus 9 hops in the 3D mesh.]
Packet latency reduction
Lp = Lsender + Ltransport + Lreceiver
Tightly dependent on the latency overhead of each hop.
Research motivation
• NoCs are exposed to a variety of manufacturing and design factors making them vulnerable to different types of faults (permanent, intermittent, and transient).
• The single-point-of-failure nature of NoCs raises a major concern about their reliability, as they are the sole communication medium.
• The need for fault tolerance in many-core systems has become imperative to ensure their reliability, and it is growing in importance as technology scales, especially in NoCs.
• There is a need for architectures that combine fault tolerance with performance in an adaptive manner, adapting to different run-time environments.
• The lack of reliability can result in corrupted message delivery, unmet timing requirements, or sometimes even the collapse of the entire system.
• Reliability becomes crucial for hard real-time (aeronautics, energy plants) and high-precision calculation (disaster simulation, bio-medical) applications.
Research goal and contributions
• To avoid costly packet retransmission, increase chip yield, and ensure the success of the interconnect, I propose in this research a high-throughput architecture and routing algorithms for reliable Network-on-Chip designs:
- Graceful fault-tolerant routing algorithms
- Reliable router architecture
- Hardware design and evaluation
1. Graceful fault-tolerant routing algorithms:
– Look-Ahead-Fault-Tolerant (LAFT) [Jnl2]: a high-throughput and light-weight fault-tolerant routing algorithm to handle the presence of faults in inter/intra layer links.
– Hybrid-Look-Ahead-Fault-Tolerant (HLAFT) [Jnl1]: combines both local and look-ahead routing to further enhance the router's throughput under worst-case fault scenarios and make the performance degradation as graceful as possible.
2. Reliable router architecture:
– Random-Access-Buffer (RAB) [Jnl1, Conf2]:
Employed to recover from deadlock at low cost without the need for virtual channels.
Efficiently manages correct flit buffering in the presence of failures in input-buffers.
– Traffic-Prediction-Unit (TPU): complements RAB for further performance enhancement by providing an alternative for faulty congested buffers to relieve the traffic overhead.
– Bypass-Link-on-Demand (BLoD) [Conf1]: ensures fault tolerance in the crossbar links by allocating the appropriate and minimal escape channels with no considerable area and power overhead.
3. Hardware design and evaluation of a 3D-NoC system based on a reliable router:
- Hardware synthesis
- Performance evaluation over various benchmarks: latency, throughput, reliability
Related work
• Faults can be classified in terms of the occurrence frequency:
– Transient faults: they occur and remain in the system for a particular period of time before disappearing.
– Intermittent faults: they are transient faults that occur from time to time.
– Permanent faults: they start at a particular time and remain in the system until they are repaired.
• Tackling the presence of faults was investigated in two main approaches:
– Routing approach
– Architectural approach
Related work
• Routing approach:
– Routing table [Feng 2001]: uses a local routing table for 2D routing, and a vector to store the vertical link failure status. Drawback: area and power overhead.
– 4NP-FIRST [Pasricha 2011]: non-minimal fault-tolerant routing for deadlock avoidance. Drawback: additional latency.
– AFRA [DATE 2012]: adaptive routing that considers permanent faults in vertical links only. Drawback: lack of reliability.
– HamFA [DATE 2013]: adaptive fault-tolerant routing based on a Hamiltonian path searching algorithm. Drawback: restriction on the fault placement.
– Adaptive-Z [Rahmani 2012]: targeted at Hybrid-3D-NoC. Drawback: unscalable.
– Planar Adaptive Routing (PAR) [Chien 1992]: based on registers to store fault information and Virtual Channels (VCs).
– 3D Minimum-Connected-Component (MCC) [Jiang 2008]: optimized version of PAR. Drawback: area and latency overhead.
• Architectural approach:
– BulletProof [Constantinides 2006]: based on the N-modular redundancy (NMR) technique to duplicate the target component. Drawback: very high area and power overhead.
– RoCo [Kim 2006]: the router is decomposed by using parallel arbiters and a small crossbar to ensure fault tolerance in the Virtual-channel Allocation (VCA), Switch Allocation (SA), and Routing Computation (RC) stages. It does not consider the occurrence of faults in the input-buffer or the crossbar.
– Minimal correction circuitry [Poluri 2013]: provides fault tolerance in VCA, SA, and CT by sharing the VC and redundant crossbar links. It does not consider the occurrence of faults in the buffer. Drawback: lack of reliability.
– Vicis [DeOrio 2012]: a "flexible-fifo" is proposed to deal with permanent faults in buffer entries, and a "Crossbar-Bypass-Bus" to deal with faults in the crossbar. Drawbacks:
Does not consider transient and intermittent faults.
Latency overhead for sharing the single Crossbar-Bypass-Bus.
The routing algorithm is non-minimal.
Look-Ahead-Fault-Tolerant: fault information exchange
[Figure: a router and its six neighbors (North, East, South, West, Up, and Down routers) exchanging fault signals in and out along the X, Y, Z axes]
• 1-bit fault detection signal: issued when a fault is detected in the incoming flit.
• 6-bit fault information signal: sent to all the neighboring nodes, giving the status of the router's links in each direction.
• 6 bits, not 7: we assume that the link connecting the Local port is always valid and cannot be faulty.
• 36 bits of fault information are required for each router.
Fig. 18 Fault information exchange
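One way to read the 36-bit figure is six neighbors each contributing a 6-bit link-status vector. The sketch below packs and queries such a word under that reading; the bit order in `DIRECTIONS` and all function names are assumptions made for illustration, since the slides do not specify the encoding.

```python
# Assumed bit order inside every 6-bit vector and inside the 36-bit word.
DIRECTIONS = ("north", "east", "south", "west", "up", "down")


def pack_link_status(faulty_links):
    """Build a 6-bit vector: bit i is 1 when the link toward DIRECTIONS[i] is faulty."""
    bits = 0
    for i, direction in enumerate(DIRECTIONS):
        if direction in faulty_links:
            bits |= 1 << i
    return bits


def pack_fault_in(neighbor_faults):
    """Concatenate the 6-bit vectors received from the six neighbors (6 x 6 = 36 bits)."""
    word = 0
    for i, neighbor in enumerate(DIRECTIONS):
        vector = pack_link_status(neighbor_faults.get(neighbor, set()))
        word |= vector << (6 * i)
    return word


def link_is_faulty(fault_in, neighbor, link):
    """Check one link of one neighbor inside the 36-bit word."""
    shift = 6 * DIRECTIONS.index(neighbor) + DIRECTIONS.index(link)
    return bool((fault_in >> shift) & 1)


# The east neighbor reports that its own north and up links are faulty.
fault_in = pack_fault_in({"east": {"north", "up"}})
print(f"{fault_in:036b}")
print(link_is_faulty(fault_in, "east", "up"))
```

The Local port has no bit because, as stated above, its link is assumed to be always valid.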
Look-Ahead-Fault-Tolerant: algorithm
Fig. 19 Look-Ahead-Fault-Tolerant flowchart
Fig. 20 Look-Ahead-Fault-Tolerant Pseudo-code
Look-Ahead-Fault-Tolerant: Example
Legend: faulty link, valid link; S: source node; D: destination node; C: current node; N: next node; current out-port; next out-port. Axes: X, Y, Z.
1- The current out-port is read from the flit and the next-node address is computed.
2- The three possible directions are calculated: North, East, and Up.
3- The link status of the three directions is verified, and the faulty path is eliminated: East.
4- The diversity value is calculated and the highest one is selected: North=3 (North, East, and Up); Up=2 (North and East).
Fig. 21 Look-Ahead-Fault-Tolerant example
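The four steps of the example can be sketched as follows. This is not the pseudo-code of Fig. 20; it is an illustrative reading in which the diversity of a candidate direction is taken to be the number of minimal directions still available after moving one hop that way, an assumption that reproduces the North=3 and Up=2 values of the example. The coordinates, port names, and function names are hypothetical.

```python
# Each port moves along one axis (0 = X, 1 = Y, 2 = Z) by +1 or -1.
PORTS = {
    "east": (0, +1), "west": (0, -1),
    "north": (1, +1), "south": (1, -1),
    "up": (2, +1), "down": (2, -1),
}


def minimal_ports(node, dest):
    """Ports that bring a flit one hop closer to dest (at most three)."""
    return [port for port, (axis, sign) in PORTS.items()
            if (dest[axis] - node[axis]) * sign > 0]


def move(node, port):
    """Address of the router reached from node through port."""
    axis, sign = PORTS[port]
    moved = list(node)
    moved[axis] += sign
    return tuple(moved)


def diversity(node, port, dest):
    """Assumed metric: minimal directions remaining after taking port."""
    return len(minimal_ports(move(node, port), dest))


def laft_next_port(next_node, dest, faulty_ports):
    """Pick the non-faulty minimal port of next_node with the highest diversity."""
    candidates = [p for p in minimal_ports(next_node, dest)
                  if p not in faulty_ports]
    if not candidates:
        return None  # every minimal direction is faulty
    return max(candidates, key=lambda p: diversity(next_node, p, dest))


# Hypothetical coordinates chosen to mirror the example: East is faulty.
next_node, dest = (0, 0, 0), (2, 3, 1)
print(laft_next_port(next_node, dest, faulty_ports={"east"}))
```

Preferring the direction with the most remaining options keeps later hops from being forced onto a single link, which is one plausible motivation for the diversity criterion.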
Related publications to Look-Ahead-Fault-Tolerant
• LAFT reduced the latency by an average of 32%.
• As the number of faults increases, the performance of LAFT is not optimal.
– LAFT receives the fault information in a single-hop range.
• The LAFT algorithm was published in:
– A. Ben Ahmed and A. Ben Abdallah, "Architecture and Design of High-throughput, Low-latency and Fault Tolerant Routing Algorithm for 3D-Network-on-Chip", The Journal of Supercomputing, 66(3): 1507-1532, December 2013.
– A. Ben Ahmed and A. Ben Abdallah, "Deadlock-Recovery Support for Fault-tolerant Routing Algorithms in 3D-NoC Architectures", The IEEE 7th International Symposium on Embedded Multicore SoCs (MCSoC-13), pp. 67-72, Tokyo, Japan, September 26-28, 2013.
Look-Ahead-Fault-Tolerant: limitations
[Figure: worst case with source S, current node C, next node N, and destination D]
- Local routing is sometimes preferred since it provides more information about the fault status.
- Look-ahead routing does not provide enough information for routing.
- The route selection might lead to a blocked path, where a turn back or a non-minimal path selection is required.
Fig. 22 Example of Look-Ahead-Fault-Tolerant limitations
The solution is to combine both look-ahead and local routing for performance enhancement.
Hybrid-Look-Ahead-Fault-Tolerant: algorithm
Flowchart steps: Start; Read faults; Calculate next-node; Compute 3 possible directions; Loc-rout enb? (Yes: Local routing. No: LAFT routing); End.
Fig. 23 Hybrid-Look-Ahead-Fault-Tolerant flowchart
Fig. 24 Hybrid-Look-Ahead-Fault-Tolerant Pseudo-code
Router architecture
Input-ports: Local, North, East, South, West, Up, Down
Pipeline stages: BW, RC/SA, CT
Fig.25 3D-OASIS-router architecture
Hybrid-Look-Ahead-Fault-Tolerant: architecture
Blocks: FIFO, LAFT, Local FT routing, Fault controller, sw_req controller
Signals (width in bits): data_in (77), Fault_in (36), Next (3) + dest (9), Next (3), dest (9), fault (6), Out_port (3), New_next (3), sw_req (3), Loc-rout-enb
Path not blocked: Loc-rout-enb = 0
Path blocked: Loc-rout-enb = 1
Fig.26 Simplified block diagram of the input-port circuit
Hybrid-Look-Ahead-Fault-Tolerant: Example
Steps: Read destination and next port; Compute next node; Compute 3 possible directions; Check faults; Decide routing.
- Destination node: (2,2,2)
- Current node: (1,0,1)
- Next_port: North
• Current node: 1,0,1 & Next_port: North
– Next_node_x = cur_x
– Next_node_y = cur_y+1
– Next_node_z = cur_z
• Next_node = 1,1,1 & dest = 2,2,2
– next_x < dest_x: Possible_x = east
– next_y < dest_y: Possible_y = north
– next_z < dest_z: Possible_z = up
• Fault = 010011
– Possible_x = faulty
– Possible_y = faulty
– Possible_z = faulty
• Loc-rout-enb = 1: Local routing
Fig.27 Hybrid-Look-Ahead-Fault-Tolerant example
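The decision in this example can be traced with a short sketch. It takes the set of faulty ports at the next node directly, because the slides do not state how the bits of `Fault = 010011` map to directions; the function and variable names are illustrative assumptions, not the author's pseudo-code.

```python
# Each port moves along one axis (0 = X, 1 = Y, 2 = Z) by +1 or -1.
PORTS = {
    "east": (0, +1), "west": (0, -1),
    "north": (1, +1), "south": (1, -1),
    "up": (2, +1), "down": (2, -1),
}


def move(node, port):
    """Address of the router reached from node through port."""
    axis, sign = PORTS[port]
    moved = list(node)
    moved[axis] += sign
    return tuple(moved)


def minimal_ports(node, dest):
    """Ports that bring a flit one hop closer to dest (at most three)."""
    return [port for port, (axis, sign) in PORTS.items()
            if (dest[axis] - node[axis]) * sign > 0]


def hlaft_decide(current, next_port, dest, faulty_at_next):
    """Return (next_node, possible directions, loc_rout_enb).

    loc_rout_enb is raised when every minimal direction out of the next
    node is faulty, i.e. the look-ahead path is blocked and the router
    falls back to local routing.
    """
    next_node = move(current, next_port)
    possible = minimal_ports(next_node, dest)
    blocked = bool(possible) and all(p in faulty_at_next for p in possible)
    return next_node, possible, blocked


next_node, possible, loc_rout_enb = hlaft_decide(
    current=(1, 0, 1), next_port="north", dest=(2, 2, 2),
    faulty_at_next={"east", "north", "up"},
)
print(next_node, possible, "local routing" if loc_rout_enb else "LAFT routing")
```

When at least one minimal direction is healthy the flag stays low and the look-ahead (LAFT) path is used, matching the "Path not blocked" case of Fig. 26.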
Hybrid-Look-Ahead-Fault-Tolerant: limitations
• HLAFT made the performance degradation more graceful, by 12% when compared to LAFT.
• The HLAFT algorithm was published in the Journal of Parallel and Distributed Computing:
– A. Ben Ahmed and A. Ben Abdallah, "Graceful deadlock-free fault-tolerant routing algorithm for 3D Network-on-Chip architectures", Journal of Parallel and Distributed Computing, 74(4): 2229-2240, April 2014.
• With a large number of cores and layers, 3D-NoC systems face greater challenges and become more vulnerable to faults.
• Faults can be caused by the increasing area, thermal power, stacking misplacement, etc.
• At this high core density, considering faults only in the inter-router links does not provide optimal reliability.
• Other components such as input-buffers and the crossbar should be given greater attention to ensure fault tolerance and enhance the system reliability.
• These components consume a large portion of the entire router area and power budget.
– Vulnerable to failures
3D-Fault-Tolerant-OASIS-NoC router architecture
Fig.28 3D-Fault-Tolerant-OASIS-NoC router block diagram
Random Access Buffer mechanism
• Timer: if the flit's request is not served after a period of time, a flag is issued.
• RAB manager: when receiving the flags, it updates the status register and avoids writing to or reading from the flagged slots.
• FIFO manager: manages the input buffer when neither deadlock nor fault is detected.
• Fault-detect: issues a flag whenever a permanent or transient fault is detected.
Fig.29 Random-Access-Buffer mechanism block diagram
Random-Access-Buffer mechanism: example
Signals: data_in, data_out, Next_port, Wr_adr, Rd_adr, sw_grnt, Timer, RAB_cntrl
Status-register (SR): used to keep the status of the blocking and faulty slots:
00: neither blocking nor faulty
01: blocking
10: transient fault
11: permanent fault
1- Initially the SR is 00 00 00 00 and the buffer holds P1 (North), P2 (South), P3 (East), and P4 (West).
2- In this example, we assume the presence of a permanent fault in one slot. The status register indicates 11 for that slot (SR = 11 00 00 00).
3- The Timer informs that the flit being processed (North) did not get the grant (sw_grnt = 0) and it is blocked. The request is dropped and the SR is updated to 01 (SR = 11 00 00 01).
4- The other flits (South, then East) are processed, requested, granted (sw_grnt = 1), and read from the buffer.
5- A new incoming flit (P4, Up) is stored in the buffer. A transient fault is detected and the SR is updated to 10.
6- The previously blocking packet is checked again, granted, and read from the buffer, and its status is updated to 00. The transient fault is removed and the SR is updated to 00.
Fig.30 Example showing how Random-Access-Buffer mechanism works
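The status-register bookkeeping of this example can be modeled in a few lines. The sketch below is a behavioral illustration only, not the Verilog design: the class and method names are hypothetical, and the read policy (serve healthy slots first, retry blocked ones when nothing else is ready) is an assumption consistent with the walkthrough.

```python
from enum import IntEnum


class SlotStatus(IntEnum):
    """Two-bit status codes as listed for the status register."""
    OK = 0b00          # neither blocking nor faulty
    BLOCKING = 0b01
    TRANSIENT = 0b10
    PERMANENT = 0b11


class RandomAccessBuffer:
    """Toy input buffer whose slots can be skipped individually."""

    def __init__(self, depth=4):
        self.flits = [None] * depth
        self.status = [SlotStatus.OK] * depth

    def mark_fault(self, slot, permanent):
        self.status[slot] = SlotStatus.PERMANENT if permanent else SlotStatus.TRANSIENT

    def clear_transient(self, slot):
        if self.status[slot] == SlotStatus.TRANSIENT:
            self.status[slot] = SlotStatus.OK

    def mark_blocked(self, slot):
        """Called when the timer expires before the flit's request is granted."""
        if self.status[slot] == SlotStatus.OK:
            self.status[slot] = SlotStatus.BLOCKING

    def write(self, flit):
        """Store a flit in the first empty, healthy slot; return its index or None."""
        for i, held in enumerate(self.flits):
            if held is None and self.status[i] == SlotStatus.OK:
                self.flits[i] = flit
                return i
        return None

    def read(self):
        """Read a healthy flit first; retry blocked flits only when none is ready."""
        for wanted in (SlotStatus.OK, SlotStatus.BLOCKING):
            for i, held in enumerate(self.flits):
                if held is not None and self.status[i] == wanted:
                    self.flits[i] = None
                    self.status[i] = SlotStatus.OK
                    return held
        return None


buf = RandomAccessBuffer()
buf.mark_fault(0, permanent=True)      # slot 0 is never used again
for flit in ("P1-North", "P2-South", "P3-East"):
    buf.write(flit)
buf.mark_blocked(1)                    # P1 did not get the grant in time
print([buf.read() for _ in range(3)])  # P1 is served last
```

Because any slot can be read, a blocked head flit no longer stalls the flits behind it, which is how the mechanism recovers from deadlock without virtual channels.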
Traffic Prediction Unit
Fig.31 Traffic-Prediction-Unit block diagram
Traffic Prediction Unit: monitoring interval selection
Flowchart steps:
1- Start: select the initial interval t = t0.
2- Simulate.
3- Evaluate the buffer occupancy BO(t).
4- If BO(tn) ≈ BO(tn-1): the optimal interval is found; End.
5- Otherwise, increment the interval, t(n) = t(n-1) + s, and simulate again.
Fig.32 Monitoring interval selection flow-chart
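The selection loop of the flowchart can be sketched as below. The simulation is abstracted as a callable returning the buffer occupancy for a given interval; that callable, the `tolerance` used to decide when two occupancies are approximately equal, and the iteration cap are all assumptions introduced for illustration.

```python
def select_monitoring_interval(buffer_occupancy, t0, step,
                               tolerance=0.01, max_iterations=1000):
    """Grow the interval by `step` until the occupancy stops changing.

    buffer_occupancy: callable mapping an interval to the measured BO(t),
    standing in for one simulation run (hypothetical interface).
    """
    t_prev = t0
    bo_prev = buffer_occupancy(t_prev)
    for _ in range(max_iterations):
        t = t_prev + step                      # t(n) = t(n-1) + s
        bo = buffer_occupancy(t)
        if abs(bo - bo_prev) <= tolerance:     # BO(tn) ~ BO(tn-1)
            return t
        t_prev, bo_prev = t, bo
    raise RuntimeError("buffer occupancy did not stabilize")


def toy_occupancy(t):
    """Stand-in for a simulation: occupancy saturates once t reaches 5."""
    return min(t, 5) / 10


print(select_monitoring_interval(toy_occupancy, t0=1, step=1))
```

In a real flow each call to `buffer_occupancy` would be a full simulation, so the step size trades the number of runs against the resolution of the chosen interval.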
Bypass-Link-on-Demand
• Ctrl: detects the presence of faults and manages the selection between the baseline crossbar and the bypass link.
• Fault-control-module (FCM): responsible for deciding the number of necessary bypass links and disabling the faulty baseline ones.
• Bypass-1, Bypass-2, Bypass-3, ..., Bypass-n: additional escape links used when the baseline crossbar links are detected faulty.
Crossbar inputs: L_in, N_in, E_in, S_in, W_in, U_in, D_in
Crossbar outputs: L_out, N_out, E_out, S_out, W_out, U_out, D_out
Control signals: Faulty_Cross, Enable_bypass, disable_crss, Crss flag (to sw_req_ctrl)
Fig.33 Bypass-Link-on-Demand block diagram
Bypass-Link-on-Demand: example
• A permanent fault is detected and the Ctrl sends a signal (Faulty_Cross) to the FCM.
• The FCM sends the Crss-flag signal to sw-req-ctrl in the input port to prevent it from requesting the faulty crossbar link.
• At the same time, the FCM sends a signal (Enable_bypass) to Ctrl to enable the bypass link.
• When other faults are detected, other bypasses are enabled.
• When transient faults are removed, the bypass link is disabled.
Fig.34 Bypass-Link-on-Demand example
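The on-demand allocation described above can be modeled behaviorally. This sketch is not the hardware FCM: the class name, the method names, and the policy of handing out the lowest-numbered free bypass are assumptions made to illustrate the enable and release sequence.

```python
class FaultControlModule:
    """Toy model of bypass allocation for faulty crossbar links."""

    def __init__(self, num_bypasses=3):
        self.free_bypasses = list(range(1, num_bypasses + 1))
        self.faults = {}      # crossbar link -> True if permanent, False if transient
        self.bypass_of = {}   # crossbar link -> bypass number currently assigned

    def report_fault(self, link, permanent):
        """Record a fault reported by Ctrl and enable a bypass if one is free."""
        self.faults[link] = permanent or self.faults.get(link, False)
        if link not in self.bypass_of and self.free_bypasses:
            self.bypass_of[link] = self.free_bypasses.pop(0)
        return self.bypass_of.get(link)

    def clear_transient(self, link):
        """Release the bypass once a transient fault has disappeared."""
        if self.faults.get(link) is False:
            del self.faults[link]
            bypass = self.bypass_of.pop(link, None)
            if bypass is not None:
                self.free_bypasses.append(bypass)
                self.free_bypasses.sort()

    def crss_flags(self):
        """Links that sw-req-ctrl must not request on the baseline crossbar."""
        return set(self.faults)


fcm = FaultControlModule(num_bypasses=3)
print(fcm.report_fault("N_out", permanent=True))    # first bypass enabled
print(fcm.report_fault("E_out", permanent=False))   # second bypass enabled
fcm.clear_transient("E_out")                        # transient fault removed
print(sorted(fcm.crss_flags()))
```

Permanent faults keep their bypass for good, while transient ones return it to the pool, so the number of active escape links tracks the current fault situation.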
Evaluation methodology
• We evaluate:
– Hardware complexity
• Area
• Total power
– System performance
• Latency/flit
• Throughput
• Reliability
• Benchmarks:
– Transpose
– Uniform
– Matrix Multiplication
– JPEG
• We use:
– Verilog HDL
– Synopsys Design Compiler
– Cadence SoC Encounter
– Modelsim
Configuration and assumptions
Configuration
- Fault-rate: 0%, 5%, 10%, and 20%
- Injection rate: 1,000 to 100,000 flits
Assumption
- Faults cannot occur on links connecting to the local port
- There exists at least one valid path between a source and destination
LAFT and HLAFT routing latency/flit evaluation
Fig.35 LAFT and HLAFT latency/flit evaluation with: (a) Transpose (b) Uniform (c) 6x6 Matrix (d) JPEG.
Values annotated on the plots: 57%, 36.5%, 12.1%, 10.5%, 3%, 41.2%, 16%, 58.8%, 11.6%
- At 0% fault-rate (FR): 48% reduction compared with XYZ
- At 20% FR: 12.6% when compared with LAFT
LAFT and HLAFT routing throughput evaluation
Fig.36 LAFT and HLAFT throughput evaluation with: (a) Transpose (b) Uniform (c) 6x6 Matrix (d) JPEG.
Values annotated on the plots: 59.3%, 15.5%
- At 0% fault-rate (FR): 43% increase compared with XYZ
- At 20% FR: 11.8% when compared with LAFT
HLAFT reliability evaluation
• We define reliability as the capability of the system to correctly deliver all the packets to their destinations, even in the presence of failures.
Table II HLAFT reliability evaluation results
Routing   1 faulty link   2 faulty links   3 faulty links
HamFA     95%             44%              20%
AFRA      33%             7%               3%
HLAFT     100%            100%             100%
BLoD latency/flit evaluation
Fig.37 BLoD latency/flit evaluation with: (a) Transpose (b) Uniform (c) 6x6 Matrix (d) JPEG.
Values annotated on the plots: 20.1%, 13.2%, 27.3, 17.3
- 3 bypasses seem to be the best number
- At 20% fault-rate, BLoD performs better in 3 applications
RAB and TPU latency/flit evaluation
Fig.38 RAB latency/flit evaluation with: (a) Transpose (b) Uniform (c) 6x6 Matrix (d) JPEG.
Values annotated on the plots: 19.2%, 12.4%, 9.3%, 15.3%
- At 20% fault-rate, RAB+TPU exhibits negligible variation
- Adding TPU further reduced the latency by 14.05
Complete 3D-FTO router latency/flit evaluation
Fig.39 3D-FTO latency/flit evaluation with: (a) Transpose (b) Uniform (c) 6x6 Matrix (d) JPEG.
Values annotated on the plots: 22.4%, 14.1%
- 37% and 18.5% at 0% fault-rate
- Performs better than XYZ in two applications
- 12.1% latency increase with the remaining two
Complete 3D-FTO router throughput evaluation
Fig.40 3D-FTO throughput evaluation with: (a) Transpose (b) Uniform (c) 6x6 Matrix (d) JPEG.
Values annotated on the plots: 13.3%, 17.4%
- 51% and 38.5% at 0% fault-rate
- Performs better than XYZ in two applications
- 10.1% throughput decrease with the remaining two
Hardware design results
Table III Router hardware complexity evaluation results
Module                 Design               Area (µm²)   Power (µW)
Routing circuit        Proposed (LAFT)      688          111.2
Routing circuit        Proposed (HLAFT)     772          134.4
Routing circuit        Baseline             609          94.2
Crossbar circuit       Proposed (BLoD)      2443         379.3
Crossbar circuit       Baseline             2085         316.7
Input-buffer circuit   Proposed (RAB+TPU)   4529         483.4
Input-buffer circuit   Baseline             3543         373.7
Complete router        Proposed             10587        1175.6
Complete router        Baseline             7654         886.32
Overheads annotated on the slide:
- 12% additional area, 18% power overhead
- 17.1% additional area, 19.9% power overhead
- 27.8% additional area, 29.4% power overhead
- 38.3% additional area, 32.6% power overhead
Complete router summary:
Power: 1175.6 µW
Number of pins: 557
Frequency: 0.9 GHz
Voltage: 1.1 V
Total code: 2386 lines of code
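The overhead percentages can be checked against the area and power figures of Table III. A small sketch, assuming the overhead is defined as the relative increase over the baseline (the helper name `overhead_percent` is illustrative); only two modules are shown:

```python
def overhead_percent(proposed, baseline):
    """Relative increase of the proposed design over the baseline, in percent."""
    return (proposed - baseline) / baseline * 100.0


# (proposed, baseline) pairs copied from Table III.
TABLE_III = {
    "Input-buffer circuit (RAB+TPU)": {"area": (4529, 3543), "power": (483.4, 373.7)},
    "Complete router": {"area": (10587, 7654), "power": (1175.6, 886.32)},
}

for module, metrics in TABLE_III.items():
    for metric, (proposed, baseline) in metrics.items():
        print(f"{module}: {metric} overhead {overhead_percent(proposed, baseline):.1f}%")
```

Under this definition the input-buffer and complete-router rows reproduce the 27.8%/29.4% and 38.3%/32.6% annotations.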
Evaluation summary
BC: Best case (0% fault-rate). WC: Worst case (20% fault-rate).
Table IV Evaluation results summary
Latency/flit is in us/flit; throughput is in flit/cycle.

LAFT (area: proposed 688, baseline 609 µm²; power: proposed 111.2, baseline 94.2 µW)
Benchmark   Latency: BC    WC     XYZ    LAXYZ   Throughput: BC   WC    XYZ    LAXYZ
Transpose            159   370    370    320                 7.81  4.5   4.27   5.4
Uniform              520   950    820    630                 6.1   1.1   2.16   4.2
Matrix               268   560    460    320                 11.25 4.1   5.23   8.54
JPEG                 620   1120   1020   810                 8.45  5.1   6.33   7.5

HLAFT (area: proposed 872, baseline 609 µm²; power: proposed 134.4, baseline 94.2 µW)
Benchmark   Latency: BC    WC     XYZ    LAXYZ   Throughput: BC   WC    XYZ    LAXYZ
Transpose            158   325    370    320                 8.8   5.2   4.27   5.4
Uniform              421   850    820    630                 6.1   1.9   2.16   4.2
Matrix               270   470    460    320                 11.25 5.2   5.23   8.54
JPEG                 620   990    1020   810                 8.45  6.4   6.33   7.5

BLoD (area: proposed 2443, baseline 2085 µm²; power: proposed 379.3, baseline 316.7 µW)
Benchmark   Latency: BC     WC      LAFT    LAXYZ   Throughput: BC   WC    LAFT   LAXYZ
Transpose            940    950     910     1190                9.22  7.12  9.34   6.68
Uniform              10600  14600   10300   12900               2.46  2.12  3.0    2.5
Matrix               930    1050    905     1280                2.3   2.05  2.67   1.8
JPEG                 670    680     630     810                 6.6   6.5   6.9    5.33

RAB+TPU (area: proposed 5529, baseline 3543 µm²; power: proposed 623.4, baseline 373.7 µW)
Benchmark   Latency: BC    WC      RAB(WC)  LAXYZ   Throughput: BC   WC    RAB(WC)  LAXYZ
Transpose            9300  11700   14100    11900               10.6  9.0   7.4      9.0
Uniform              106   134     153      129                 7.15  5.2   4.32     5.45
Matrix               910   1260    1390     1280                9.4   8.1   7.8      8.3
JPEG                 640   770     910      810                 10.7  8.2   6.65     7.9

Complete 3D-FTO (area: proposed 10587, baseline 7654 µm²; power: proposed 1175.6, baseline 886.32 µW)
Benchmark   Latency: BC    WC     XYZ    LAXYZ   Throughput: BC   WC    XYZ    LAXYZ
Transpose            158   356    376    324                 8.92  4.64  4.35   5.41
Uniform              427   917    634    821                 6.15  1.88  2.17   4.22
Matrix               268   453    469    320                 11.7  4.9   4.76   8.43
JPEG                 620   1192   1023   811                 8.45  5.21  6.32   7.45
Conclusion
• In this research, I proposed a high-throughput architecture and routing algorithms targeted for reliable Network-on-Chip designs.
• Two efficient routing algorithms, named Look-Ahead-Fault-Tolerant (LAFT) and Hybrid-Look-Ahead-Fault-Tolerant (HLAFT), were presented to ensure fault tolerance in links.
• They exploit the look-ahead routing properties to provide graceful performance degradation at high fault-rates.
• The Random-Access-Buffer (RAB) mechanism was proposed to ensure deadlock recovery and also to handle the occurrence of faults in the input-buffers.
• RAB was complemented with the Traffic-Prediction-Unit (TPU) to further reduce the latency caused by the presence of faulty buffer-slots.
• To relieve the congestion caused by faults in the crossbar, we developed a technique named Bypass-Link-on-Demand (BLoD).
• For up to 10% fault-rate, 3D-FTO reduces the latency by an average of 37% and 18.5% when compared to XYZ- and LA-XYZ-based systems, respectively.
• It also provides a throughput improvement that can reach 51% and 38% in the absence of faults.
• At 20% fault-rate, the proposed router provides better throughput than that of XYZ in two applications.
• The hardware complexity evaluation results showed that 3D-FTO exhibits 29.3% additional area, 24.6% power overhead, and a negligible speed drop when compared to the baseline system.
Discussion
• Further research is needed on diagnosis mechanisms to capture the detailed performance of the whole system.
• Thermal power is one of the major issues in 3D-NoC designs.
– An in-depth thermal power study is necessary.
• Investigate further the reliability and performance of the whole system for more general run-time scenarios.
• Investigate Quality-of-Service, especially for time-constrained applications.
Publications
• Refereed Journals
– [Jnl1] A. Ben Ahmed and A. Ben Abdallah, “Graceful deadlock-free fault-tolerant routing algorithm for 3D Network-on-Chip architectures”, Journal of Parallel and Distributed Computing, 74(4): 2229-2240, April 2014.
– [Jnl2] A. Ben Ahmed and A. Ben Abdallah, “Architecture and Design of High-throughput, Low-latency and Fault Tolerant Routing Algorithm for 3D-Network-on-Chip”, The Journal of Supercomputing, 66(3): 1507-1532, December 2013.
• Refereed International Conferences
– [Conf1] A. Ben Ahmed, M. Meyer, Y. Okuyama, A. Ben Abdallah, “Adaptive Error- and Traffic-aware Router Architecture for Electrical 3D Network-on-Chip Systems”, The IEEE 8th International Symposium on Embedded Multicore SoCs (MCSoC-14), pp. 197-204, Aizu-Wakamatsu, Japan, September 23-25, 2014.
– [Conf2] A. Ben Ahmed and A. Ben Abdallah, “Deadlock-Recovery Support for Fault-tolerant Routing Algorithms in 3D-NoC Architectures”, The IEEE 7th International Symposium on Embedded Multicore SoCs (MCSoC-13), pp. 67-72, Tokyo, Japan, September 26-28, 2013.
Publications
– [Conf3] A. Ben Ahmed, T. Ochi, Sh. Miura, A. Ben Abdallah, “Run-Time Monitoring Mechanism for Efficient Design of Application-specific NoC Architectures in Multi/Manycore Era”, The 6th International Workshop on Engineering Parallel and Multicore Systems, pp. 440-445, Taichung, Taiwan, July 3-5, 2013.
– [Conf4] A. Ben Ahmed, A. Ben Abdallah, “Low-overhead Routing Algorithm for 3D Network-on-Chip”, The Third International Conference on Networking and Computing (ICNC-12), pp. 23-32, Okinawa, Japan, December 20-22, 2012.
– [Conf5] A. Ben Ahmed, A. Ben Abdallah, “LA-XYZ: Low Latency, High Throughput Look-Ahead Routing Algorithm for 3D Network-on-Chip (3D-NoC) Architecture”, The IEEE 6th International Symposium on Embedded Multicore SoCs (MCSoC-12), pp. 167-174, Aizu-Wakamatsu, Japan, September 20-22, 2012.
– [Conf6] A. Ben Ahmed, K. Mori and A. Ben Abdallah, “ONoC-SPL Customized Network-on-Chip (NoC) Architecture and Prototyping for Data-intensive Computation Applications”, The 4th International Conference on Awareness Science and Technology (iCAST-2012), pp. 257-262, Seoul, South Korea, August 21-24, 2012.
Publications
– [Conf7] A. Ben Ahmed, A. Ben Abdallah, K. Kuroda, “Architecture and Design of Efficient 3D Network-on-Chip (3D NoC) for Custom Multi-Core SoC”, Fifth International Conference on Broadband and Wireless Computing, Communication and Applications (BWCCA-2010), pp. 67-73, Fukuoka, Japan, November 4-6, 2010 (Best Paper Award).
Under Review Publications
• Refereed Journals
– A. Ben Ahmed, A. Ben Abdallah, “Adaptive Fault-Tolerant Architecture and Routing Algorithm for Reliable Many-Core 3D-NoC systems”, submitted to the Journal of Parallel and Distributed Computing in February 2014.
References
- [NOCS2014] Hiroki Matsutani, "3D WiNoC Architectures", The 8th ACM/IEEE International Symposium on Networks-on-Chip (NOCS'14), Special Session, Sep 2014.
- [SoCPaR 2014] A. Ben Abdallah, On-Chip Optical Interconnects: Prospects and Challenges, Invited Talk, 6th International Conference of Soft Computing and Pattern Recognition, August 11-14, 2014.
- [VLSI2005] M. El-Moursy and E. Friedman, "Shielding effect of on-chip interconnect inductance," Proc. Great Lakes Symp. VLSI, Apr. 2003, pp. 165-170.
- [Loi 2008] I. Loi et al. A low-overhead fault tolerance scheme for TSV-based 3D network on chip links. In Proc. of the 2008 IEEE/ACM International Conference on Computer-Aided Design, pages 598-602, 2008.
- [Parisha 2009] S. Pasricha. Exploring serial vertical interconnects for 3D ICs. In Proc. of the 46th ACM/IEEE Design Automation Conference, pages 581-586, July 2009.
- [Rahmani 2012] A.-M. Rahmani et al. Design and Management of High-performance, Reliable and Thermal-aware 3D Networks-on-Chip. IET Circuits, Devices & Systems, 6(5):308-321, September 2012.
- [Chien 1992] A. A. Chien and J. H. Kim. Planar-adaptive Routing: Low-cost Adaptive Networks for Multiprocessors. The 19th Annual International Symposium on Computer Architecture, pages 268-277, 1992.
References
- [Jiang 2008] Z. Jiang, J. Wu and D. Wang. A New Fault Information Model for Fault-Tolerant Adaptive and Minimal Routing in 3-D Meshes. IEEE Transactions on Reliability, 57(1):149-162, March 2008.
- [Feng 2011] Ch. Feng et al. A Low-Overhead Fault-Aware Deflection Routing Algorithm for 3D Network-on-Chip. IEEE Computer Society Annual Symposium on VLSI, pages 19-24, July 2011.
- [Pasricha 2011] S. Pasricha and Y. Zou. A Low Overhead Fault Tolerant Routing Scheme for 3D Networks-on-Chip. The 12th International Symposium on Quality Electronic Design, pages 1-8, March 2011.
- [DATE 2012] S. Akbari, A. Shafieey, M. Fathy and R. Berangi. AFRA: A Low Cost High Performance Reliable Routing for 3D Mesh NoCs. Design, Automation & Test in Europe Conference & Exhibition, pages 332-337, March 2012.
- [DATE 2013] M. Ebrahimi, M. Daneshtalab, and J. Plosila. Fault-Tolerant Routing Algorithm for 3D NoC Using Hamiltonian Path Strategy. In Proc. of Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1601-1604, March 2013.
- [Constantinides 2006] K. Constantinides et al. BulletProof: A defect-tolerant CMP switch architecture. In Proceedings of the 12th International Symposium on High-Performance Computer Architecture (HPCA), pages 5-16, 2006.
References
- [Kim 2006] J. Kim et al. A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks. In Proceedings of the 33rd International Symposium on Computer Architecture (ISCA), 2006.
- [Poluri 2013] P. Poluri and A. Louri. In Proceedings of the 25th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pages 49-56, October 23-26, 2013.
- [DeOrio 2012] A. DeOrio, D. Fick, V. Bertacco, D. Sylvester, D. Blaauw, J. Hu, and G. Chen. A Reliable Routing Architecture and Algorithm for NoCs. IEEE Transactions on CAD of Integrated Circuits and Systems, 31(5):726-739, May 2012.
Thank you
for your attention