Architectural optimization techniques
Camille Diou
Master EAII Sp. RSEE

1. Microprocessor basics

[Figure: a CONTROLLER (state machine) drives a DATAPATH built from an Arithmetic and Logic Unit (ALU), a register file (t1, t2, t3, A, B, C), a BUS, and tristate components (inputs/outputs).]

Computation example: S = Ax² + By + C
The controller steps the datapath through seven register transfers, one per cycle. The ALU multiplies in cycles 3 to 5 and adds in cycles 6 and 7.

Cycle 1: t1 <- x
Cycle 2: t2 <- y
Cycle 3: t3 <- A.t1
Cycle 4: t3 <- t3.t1
Cycle 5: t2 <- B.t2
Cycle 6: t3 <- t2+t3
Cycle 7: out <- t3+C     (out = S)

#CYCLES: 7
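
The seven register transfers above can be replayed in software. The following is a minimal sketch, not the course's own code: the function name `run_polynomial` and the dictionary-based register file are assumptions made for illustration.

```python
def run_polynomial(x, y, A, B, C):
    """Replay the seven register transfers that compute S = A*x**2 + B*y + C."""
    regs = {"t1": 0, "t2": 0, "t3": 0}
    # One entry per controller state: (destination, function giving the new value).
    microprogram = [
        ("t1", lambda r: x),                   # t1 <- x
        ("t2", lambda r: y),                   # t2 <- y
        ("t3", lambda r: A * r["t1"]),         # t3 <- A.t1
        ("t3", lambda r: r["t3"] * r["t1"]),   # t3 <- t3.t1
        ("t2", lambda r: B * r["t2"]),         # t2 <- B.t2
        ("t3", lambda r: r["t2"] + r["t3"]),   # t3 <- t2+t3
        ("out", lambda r: r["t3"] + C),        # out <- t3+C
    ]
    cycles = 0
    for destination, operation in microprogram:
        regs[destination] = operation(regs)
        cycles += 1
    return regs["out"], cycles


print(run_polynomial(3, 4, 2, 5, 7))  # (45, 7)
```

Each list entry stands for one controller state, so the cycle count is simply the length of the microprogram.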

Execution principle
[Figure: START -> Fetch next instruction (fetch cycle) -> Execute instruction (execute cycle) -> back to fetch, until HALT.]
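
A single-accumulator machine
MAR: Memory Address Register
IR: Instruction Register
PC: Program Counter register
[Figure: datapath of the machine, showing data flow and control signals. The memory (16 bits wide, 16M words) is addressed through the MAR. The IR holds the opcode and the address operand. The FSM receives the opcode and drives the ALU function controls and the load (LD) signals. The accumulator (ACC) feeds the ALU (inputs A and B, output S). The PC has an incrementer (incr) and a branch input. The load path, store path and instruction path are marked.]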

Instruction:
Bits 15-14: opcode (00: Load, 01: Store, 10: Add, 11: Branch)
Bits 13-0: address
Single-address instruction: one of the registers is fixed (the accumulator), so AC is an implicit operand.
AC := AC <operation> Memory(Address)
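
The 2-bit opcode / 14-bit address split can be checked with a few lines of code. This is a sketch under the field layout given above; the names `OPCODES` and `decode` are hypothetical. The sample word is the one used in the deck's instruction walkthrough.

```python
OPCODES = {0b00: "Load", 0b01: "Store", 0b10: "Add", 0b11: "Branch"}


def decode(word):
    """Split a 16-bit instruction into (mnemonic, 14-bit address)."""
    if not 0 <= word < (1 << 16):
        raise ValueError("instruction must fit in 16 bits")
    opcode = word >> 14        # bits 15..14
    address = word & 0x3FFF    # bits 13..0
    return OPCODES[opcode], address


mnemonic, address = decode(int("1000110100110011", 2))
print(mnemonic, format(address, "014b"))  # Add 00110100110011
```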

Executing an instruction on the single-accumulator machine
[Figure: the same datapath annotated with bus widths: 2-bit opcode, 14-bit addresses, 16-bit data. MAR: Memory Address Register, IR: Instruction Register, PC: Program Counter register.]

Values shown in the example:
PC = 10110100110011
Instruction word = 1000110100110011 (opcode 10 = Add, operand address 00110100110011)
ALU inputs = 0101010101110001 and 0011001101110110
ALU result S = 1000100011100111
PC after increment = 10110100110100

1. Instruction fetch:
- PC is moved into MAR
- Read from memory
- Load instruction into IR
2. Instruction decode:
- Opcode bits go to the FSM (ADD)
- The rest of the bits is the operand address
3. Operand fetch:
- IR<address> -> MAR
- Read data from memory
4. Instruction execute:
- Memory to ALU B
- AC to ALU A
- ALU add
- S to AC
5. Housekeeping:
- Increment PC
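
The five steps can be sketched as a small simulator. This is an illustrative model only: the class name `AccumulatorMachine` is hypothetical, the Load, Store and Branch behaviours are assumptions (the slides only detail ADD), and which of the two 16-bit operands sits in memory and which in the accumulator is also assumed.

```python
LOAD, STORE, ADD, BRANCH = 0b00, 0b01, 0b10, 0b11
MASK16, MASK14 = 0xFFFF, 0x3FFF


class AccumulatorMachine:
    def __init__(self, memory, pc=0, ac=0):
        self.memory = memory          # dict: address -> 16-bit word
        self.pc, self.ac = pc, ac
        self.mar = self.ir = 0

    def step(self):
        # 1. Instruction fetch: PC -> MAR, read memory, load IR
        self.mar = self.pc
        self.ir = self.memory[self.mar]
        # 2. Instruction decode: opcode to the FSM, remaining bits are the operand address
        opcode, operand = self.ir >> 14, self.ir & MASK14
        # 3. Operand fetch: IR<address> -> MAR
        self.mar = operand
        # 4. Instruction execute
        if opcode == LOAD:                       # assumed semantics
            self.ac = self.memory[self.mar]
        elif opcode == STORE:                    # assumed semantics
            self.memory[self.mar] = self.ac
        elif opcode == ADD:                      # memory to ALU B, AC to ALU A, S to AC
            self.ac = (self.ac + self.memory[self.mar]) & MASK16
        elif opcode == BRANCH:                   # assumed: unconditional jump
            self.pc = operand
            return
        # 5. Housekeeping: increment PC
        self.pc = (self.pc + 1) & MASK14


pc = int("10110100110011", 2)
operand_address = int("00110100110011", 2)
machine = AccumulatorMachine(
    {pc: int("1000110100110011", 2), operand_address: int("0101010101110001", 2)},
    pc=pc,
    ac=int("0011001101110110", 2),
)
machine.step()
print(format(machine.ac, "016b"), format(machine.pc, "014b"))
# 1000100011100111 10110100110100
```

With the bit patterns from the slides, the accumulator ends at 1000100011100111 and the PC at 10110100110100, matching the values shown in the walkthrough.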

A simple microprocessor: architecture
[Figure: datapath with a 16x16 register file, an address output to memory, a data port to/from memory, and two connections to the controller (FSM).]

A simple microprocessor: instruction format
[Figures: instruction format tables (columns: instruction format, instruction, action). The table contents are not recoverable from the transcript; the surviving labels are "shift" and "or".]

A simple microprocessor: test program
What will it do?

0000 7C0A ;
0001 8C00 ; LOAD RC, #A
0002 7B04 ; ...
0003 7A0A ; ...
0004 9C7C ; ...
0005 611A ; ...
0006 614B ; ...
...

Compiler dependency detection for ILP
• Detect data dependencies at compile time. Examples:

c[i]=a[i]+b[i];
d[i]=a[i]+c[j];     potential dependency: c[i] might be c[j]

c[1]=a[i]+b[i];
d[i]=a[i]+c[2];     no dependency: c[1] is never c[2]
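
The two cases can be expressed as a tiny classifier over array indices. This is only a sketch of the idea, not a real compiler pass: the function name `classify_dependency` is hypothetical, constant indices are modelled as integers and symbolic indices as strings.

```python
def classify_dependency(written_index, read_index):
    """Compare the index written by one statement with the index read by the next."""
    both_constant = isinstance(written_index, int) and isinstance(read_index, int)
    if both_constant:
        # Constant indices can be compared at compile time.
        return "dependency" if written_index == read_index else "no dependency"
    if written_index == read_index:
        return "dependency"            # same symbolic index, e.g. c[i] and c[i]
    return "potential dependency"      # e.g. c[i] and c[j]: unknown until run time


print(classify_dependency("i", "j"))  # potential dependency
print(classify_dependency(1, 2))      # no dependency
```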

2. Systolic ring

Reconfigurable computing: instruction-level parallelism (ILP)
• Superscalar processors must find the dataflow graph at run time
• Reconfigurable architectures construct the dataflow graph at compile time
• No FU limitations
• No control logic overhead
• No window size limitations

Reconfigurable computing: instruction-level parallelism (ILP)
• General-purpose computer:
add r1, r2, r4
add r1, r3, r5
sub r3, r2, r6
add r4, r5, r1
add r5, r6, r2
• RC scheme:
[Figure: the same five operations drawn as a dataflow graph. r1, r2 and r3 feed three operators that produce r4, r5 and r6, which feed two operators that produce r1 and r2.]

Question: what is the advantage of RC over superscalar?
Answer: the dataflow graph is constructed at compile time, so there is no run-time overhead.
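
Building the dataflow graph ahead of time amounts to assigning each instruction a level: an instruction sits one level below the latest producer of its operands. The sketch below does this for the five-instruction example. The function name `dataflow_levels` is hypothetical, and the operand order (two sources, then destination) is an assumption inferred from the figure.

```python
def dataflow_levels(program):
    """program: list of (op, src1, src2, dst). Returns the graph level of each instruction."""
    last_writer = {}   # register -> index of the instruction that produced it
    levels = []
    for index, (op, src1, src2, dst) in enumerate(program):
        producers = [last_writer[r] for r in (src1, src2) if r in last_writer]
        # Level 0 when every operand is an external input.
        level = 1 + max((levels[p] for p in producers), default=-1)
        levels.append(level)
        last_writer[dst] = index
    return levels


program = [
    ("add", "r1", "r2", "r4"),
    ("add", "r1", "r3", "r5"),
    ("sub", "r3", "r2", "r6"),
    ("add", "r4", "r5", "r1"),
    ("add", "r5", "r6", "r2"),
]
print(dataflow_levels(program))  # [0, 0, 0, 1, 1]
```

The first three operations share level 0 and can run in parallel; the last two share level 1.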

Reconfigurable computing: why now?
• Increasing number of transistors
• Complexity and cost of chip design increase fast
• Current computing demands are RC-friendly:
desktop and embedded demands are driven NOT by Word or Excel but by multimedia, encryption, filters (dataflow-oriented applications)

RA versus microprocessors
• RA is less flexible (like a VLIW with fixed instructions)
but
• RA provides more (customized) computation elements
• RA can decrease memory traffic
• RA can be tailored for specific algorithms and data types
RA will not replace µPs, but complement them

Systolic computing: definition
• A set of simple processing elements with regular and local connections, which takes external inputs and processes them in a predetermined manner
H.T. Kung

Systolic computing: characteristics of the best RC design
• Simple PEs
• Regular and local interconnect
• Pipeline between PEs
• I/O at the boundary

Coarse-grain RA model
In the abstract: instructions configure both the PEs and the interconnect every cycle
In reality: the instruction bandwidth / memory required is too high, so…
COMPROMISE

Communications…
Relationship of communication among processors
• Shared clock (pipelined)
• Shared registers (VLIW)
• Shared memory (SMM)
• Shared network

Reconfigurable computing
[Figure: a program whose instructions are split between "instructions currently in hardware" (the actual available hardware) and "instructions paged out".]
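
Finite impulse response filter (FIR)
$$y(n) = \sum_{i=0}^{N-1} a(i)\, x(n-i-1)$$
[Figure: direct-form structure. The input x_n enters a chain of Z^{-1} delays; the delayed samples are weighted by the coefficients (labelled a_0, a_1, ... a_{N-2}, a_{N-1}, a_N) and summed to give y_n. A second drawing shows the same structure with three coefficients a_0, a_1, a_2.]
3-coefficient filter:
$$y(n) = a_0\, x(n-1) + a_1\, x(n-2) + a_2\, x(n-3)$$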
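
The filter equation translates directly into a double loop. This is a minimal reference sketch; the function name `fir` is hypothetical and samples before the start of the input are assumed to be zero.

```python
def fir(samples, coeffs):
    """y(n) = sum over i of coeffs[i] * x(n - i - 1), with x(m) = 0 for m < 0."""
    output = []
    for n in range(len(samples)):
        acc = 0
        for i, a in enumerate(coeffs):
            m = n - i - 1          # index of the delayed sample
            if m >= 0:
                acc += a * samples[m]
        output.append(acc)
    return output


# An impulse reads the coefficients back, one sample late.
print(fir([1, 0, 0, 0, 0], [1, 2, 3]))  # [0, 1, 2, 3, 0]
```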

Systolic FIR implementation
[Figures: step-by-step construction of a systolic FIR array built from MAC units. The diagrams are not recoverable from the transcript; the surviving captions are:]
• Optimize outer loop, preload repeated value
• Optimize outer loop, broadcast common value
• Optimize outer loop, retime to eliminate broadcast

The Systolic Ring
• Coarse-grain architecture
• Multi-mode dynamic reconfiguration
• Scalable, two-dimensional array
• VHDL design
• Designed for SoC integration

Dnode: word-level processing unit
Constitution
• Optimized datapath (16 bits)
• Register file (4x16 bits)
• Hardwired ALU and multiplier
Features
• Complex computations in local mode (FIR, IIR, WT…)
• Low silicon area (0.07 mm², 0.18 µm CMOS process)
• Single-cycle operations (e.g. MAC + register load)
[Figure: Dnode block diagram with ALU + MULT and register file, controlled by a microinstruction (µinst.).]

Local controller: dynamic reconfiguration at the Dnode level
Constitution
• 8 configuration registers
• 3 different run modes
• 1 programming mode
[Figure: the Dnode (ALU + MULT, register file; inputs In(1,2), fifo(1,2), bus, Rp(i,j) with i = 1~4 and j = 1~2; output out) receives its microinstruction (µinst.) from the local controller. The controller contains the configuration registers reg0 to reg7, two multiplexers, a decoder and the controller block. The surviving signal labels are ck/clk, en, ex, wait, inhib and mode, with bus widths 8, 3 and 2.]

Programming mode
[Figure sequence: instruction 0, instruction 1, instruction 2 and instruction 3 are presented on successive clock cycles and stored in the configuration registers.]

Run-mode 1: Fixed
[Figure sequence: instruction 0 is applied to the Dnode on every clock cycle.]

Run-mode 2: Dynamic (one-time or loop)
[Figure sequence: instructions 1, 2 and 3 are applied on successive clock cycles.]
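
The programming mode and the two run modes shown above can be modelled in a few lines. This is a behavioural sketch only, not the VHDL design: the class `LocalController` and its method names are hypothetical, microinstructions are plain strings, and the third run mode is left out because the slides do not describe it.

```python
class LocalController:
    """Hypothetical model of a Dnode local controller with 8 configuration registers."""

    SIZE = 8

    def __init__(self):
        self.registers = [None] * self.SIZE
        self.count = 0   # number of microinstructions programmed so far

    def program(self, microinstructions):
        """Programming mode: store the microinstructions in consecutive registers."""
        if len(microinstructions) > self.SIZE:
            raise ValueError("at most 8 microinstructions fit in the controller")
        for slot, micro in enumerate(microinstructions):
            self.registers[slot] = micro
        self.count = len(microinstructions)

    def run_fixed(self, cycles, slot=0):
        """Run-mode 1 (fixed): the same microinstruction is issued on every cycle."""
        return [self.registers[slot]] * cycles

    def run_dynamic(self, cycles, loop=True, start=0):
        """Run-mode 2 (dynamic): step through the stored microinstructions, once or in a loop."""
        issued = []
        position = start
        for _ in range(cycles):
            if position >= self.count:
                if not loop:
                    break
                position = start
            issued.append(self.registers[position])
            position += 1
        return issued


controller = LocalController()
controller.program(["clr", "mac", "add", "sub"])
print(controller.run_fixed(3))              # ['clr', 'clr', 'clr']
print(controller.run_dynamic(5, start=1))   # ['mac', 'add', 'sub', 'mac', 'add']
```

Starting the dynamic sequence at register 1 mirrors the slides, where the fixed mode shows instruction 0 and the dynamic mode shows instructions 1 to 3.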

Array structure
[Figure: a scalable array of configurable blocks (processing units and switches) between the INPUTS and the OUTPUTS. The main dataflow is unidirectional; the BUS is a shared resource.]
• Unidirectional communications between neighbours
• Hard to implement a datapath with greater pipeline depth than the array
• Hard to implement recursive operations

Use of a ring structure
[Figure: the array is closed on itself to form a RING STRUCTURE.]

Use of a bi-dataflow structure
[Figure: the ring structure carries a forward dataflow and a reverse dataflow.]

Systolic Ring architecture
[Figure: layers of Dnodes (layer n-1, layer n, layer n+1) separated by switches and closed into a ring. Each switch has I/O ports. The forward dataflow goes from one layer to the next. A configuration controller sits at the centre.]
• No complex data routing problems (crossbars…)
• Unidirectional data transfers between adjacent layers (pipeline)
• Performance increases linearly with the number of Dnodes
• Peak power: 3200 MIPS @ 200 MHz for a 16-Dnode implementation

Dnode modes:
• Local mode: stand-alone
• Global mode: FPGA-like

Switch components:
• Direct FIFO connection for data injection
• BUS connection for RISC communication
• Full connectivity between 2 Dnode layers

Reverse dataflow: feedback pipelines
[Figure: the ring of Dnode layers and switches, with a feedback pipeline attached to each switch.]
• Each switch writes computed data into its own feedback pipeline
• Each switch has read ports on the other switches' pipelines
• Easy implementation of various recursive algorithms (IIR, WT…)

Two-level dynamically reconfigurable architecture:
• Global mode (first level)
  - The program which manages the configuration runs on the RISC processor
  - The configuration of an entire cluster can be modified at each clock cycle
  - The operating layer computes the data coming from the host processor
• Local mode (second level)
  - Each Dnode runs its own program of up to 8 instructions
[Figure: the host µP supplies the DATA; the configuration controller runs the MANAGEMENT CODE from RAM and writes the CONFIG of the CONFIGURATION layer; the OPERATING layer is made of Dnodes (ALU + MULT, register file, inputs A and B, output S). Markers 1, 2 and 3 number the elements in the drawing.]

8-Dnode version…
• ST* CMOS process, 0.25 µm and 0.18 µm

              Dnode 0.18 µm    Ring-8 0.18 µm    Ring-8 0.25 µm
Area          0.04 mm²         0.7 mm²           0.9 mm²
Frequency     200 MHz          200 MHz           150 MHz

*: ST: STMicroelectronics

Features:
• Parametrizable core (number of Dnodes)
• Low Dnode area: versions with 128 Dnodes are possible
• Suited as an IP core for SoC
• Good performance/cost tradeoff (Ring-8 @ 200 MHz Systolic Ring system):
  - 1600 MIPS (PII @ 450 MHz: 400 MIPS)
  - 3 Gb/s bandwidth

Assembly-level programming
Each line holds a RISC instruction (r:), a layer selection (M1:, M2:) and the Dnode instructions (N1:, N2:).

0000 r:ldl(0,8) M1: N1:clr N2:clr
0001 r:ldl(1,2) M2: N1:clr N2:clr
0002 r:dec(0,0) M1: N1:add(fifo1,fifo1) N2:sub(fifo1,fifo1)
0003 r:jnz(1) M2: N1:mac(in1) N2:mac(in2)
0004 r:halt

[Figure: tool flow. Labels: Assembler, File1.bin, File2.m, RAM, FPGA prototype (Ring-8), testbench, simulator.]

FIR filter: edge detection
Convolution mask: [ -1 1 0 ], that is y_n = x_n - x_{n-1}

Assembly code:
0000 r:ldl(0,1) M1: N1:rst N2:rst
0001 r:jmp(0) M1: N2:sub(fifo,fifo)

[Figures: input image, output image and timing diagrams. Tool flow labels: Assembler, File2.m, testbench, simulator, Ring-8 RAM.]
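
The edge-detection filter is a first difference along each image row. A minimal software equivalent is sketched below; the function name `edge_detect` is hypothetical and the sample before the first pixel is assumed to be 0.

```python
def edge_detect(row):
    """y[n] = x[n] - x[n-1] along one image row, taking x[-1] = 0 (an assumption)."""
    previous = 0
    output = []
    for pixel in row:
        output.append(pixel - previous)
        previous = pixel
    return output


print(edge_detect([10, 10, 50, 50, 20]))  # [10, 0, 40, 0, -30]
```

Flat regions give 0 and intensity steps give a non-zero value, which is what makes the mask an edge detector.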

Polynomial calculus
• P(x) = a.x + b.x² + c.x³
One Dnode (ALU + MULT) evaluates the polynomial with the following instruction sequence:

x -> reg0                  /* load reg0,x */
x.x -> reg1                /* load reg1,x² */
reg0.reg1 -> reg2          /* load reg2,x³ */
a.reg0 -> ACC              /* load ACC,a.x */
b.reg1 + ACC -> ACC        /* load ACC,a.x+b.x² */
c.reg2 + ACC -> ACC        /* load ACC,a.x+b.x²+c.x³ */

[Figure sequence: the Dnode datapath at each step, showing x, then x², then x³ entering the register file, followed by the products a.x, b.x² and c.x³ accumulating into a.x+b.x²+c.x³.]
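
The instruction sequence can be traced in software as follows. This is a sketch only: the function name `dnode_polynomial` is hypothetical and the register file is modelled as a plain list.

```python
def dnode_polynomial(x, a, b, c):
    """Evaluate P(x) = a*x + b*x**2 + c*x**3 with the register transfers of the slides."""
    reg = [0, 0, 0]
    reg[0] = x                   # x -> reg0
    reg[1] = x * x               # x.x -> reg1 (x squared)
    reg[2] = reg[0] * reg[1]     # reg0.reg1 -> reg2 (x cubed)
    acc = a * reg[0]             # a.reg0 -> ACC
    acc = b * reg[1] + acc       # b.reg1 + ACC -> ACC (multiply-accumulate)
    acc = c * reg[2] + acc       # c.reg2 + ACC -> ACC (multiply-accumulate)
    return acc


print(dnode_polynomial(2, 1, 2, 3))  # 34
```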

Finite impulse response filter (FIR)
$$y(n) = \sum_{i=0}^{N-1} a_i\, x(n-i-1)$$
[Figure: direct-form structure. x_n enters a chain of Z^{-1} delays; the delayed samples are weighted by the coefficients (labelled a_0, a_1, a_2, ... a_{N-1}, a_N) and summed to give y_n.]
3-coefficient filter:
$$y(n) = a_0\, x(n-1) + a_1\, x(n-2) + a_2\, x(n-3)$$

FIR implementation
Use of a 3-Dnodes-per-layer architecture
• Pipelined implementation
• Samples are injected through dedicated lines
• Coefficients loaded during the first cycle
[Figure sequence: the three Dnodes hold a2, a1 and a0. The first multiplies, the second and third are MAC units. The same sample is injected into all three Dnodes, and the path between Dnodes is labelled "Feedback".]

Values shown at each cycle:

Cycle  Injected sample  Dnode a2 output  Dnode a1 (MAC) output  Dnode a0 (MAC) output
1      x0               -                -                      -
2      x1               a2.x0            -                      -
3      x2               a2.x1            a2.x0+a1.x1            -
4      x3               a2.x2            a2.x1+a1.x2            a2.x0+a1.x1+a0.x2
5      x4               a2.x3            a2.x2+a1.x3            a2.x1+a1.x2+a0.x3
6      x4               a2.x3            a2.x2+a1.x3            a2.x1+a1.x2+a0.x3
7      x5               a2.x4            a2.x3+a1.x4            a2.x2+a1.x3+a0.x4

OPTIMAL IMPLEMENTATION: 1 SAMPLE / CYCLE
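
The pipelined schedule can be reproduced with one output register per Dnode. The sketch below is a simplified model, not the Ring's actual microcode: the function name `systolic_fir` is hypothetical, the coefficients are assumed to be preloaded, all registers start at zero, and every Dnode is active from the first cycle.

```python
def systolic_fir(samples, coeffs):
    """coeffs = [a0, a1, a2, ...]; returns the last Dnode's visible output at each cycle."""
    taps = list(reversed(coeffs))    # the first Dnode holds the highest coefficient
    regs = [0] * len(taps)           # output register of each Dnode
    outputs = []
    for x in samples:                # one injected sample per cycle
        outputs.append(regs[-1])     # value visible at the chain output during this cycle
        updated = [0] * len(taps)
        updated[0] = taps[0] * x     # first Dnode: plain multiply
        for k in range(1, len(taps)):
            # MAC: partial sum from the previous Dnode plus coefficient times the current sample
            updated[k] = regs[k - 1] + taps[k] * x
        regs = updated
    return outputs


print(systolic_fir([1, 0, 0, 0, 0], [1, 2, 3]))  # [0, 1, 2, 3, 0]
```

After the pipeline fills, one output leaves the last Dnode on every cycle, which is the "1 sample / cycle" property. The value visible at cycle 4 is a2.x0 + a1.x1 + a0.x2, as in the table.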
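
6-coefficient filter
$$y(n) = a_0\, x(n-1) + a_1\, x(n-2) + a_2\, x(n-3) + a_3\, x(n-4) + a_4\, x(n-5) + a_5\, x(n-6)$$
[Figure: two layers of three Dnodes holding a5, a4, a3 and a2, a1, a0. All Dnodes except the first are MAC units. x_n is injected into both layers, the partial sum passes from one layer to the other through the inter-layer feedback, and y_n is the final output.]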

Discrete Cosine Transform
• Usually a two-dimensional 8x8-point DCT
• Very demanding algorithm…
[Figure: coding chain: original image -> DCT -> DCT coefficients -> quantization -> quantized coefficients (compressed image). Decoding chain: compressed image -> inverse quantization -> iDCT -> decompressed image.]

DCT algorithm
Direct transform:
$$z_k = \sqrt{\frac{2}{N}}\,\alpha(k)\sum_{n=0}^{N-1} x_n \cos\!\left(\frac{(2n+1)k\pi}{2N}\right), \qquad k = 0, 1, \dots, N-1$$
Inverse transform:
$$x_n = \sqrt{\frac{2}{N}}\sum_{k=0}^{N-1} \alpha(k)\, z_k \cos\!\left(\frac{(2n+1)k\pi}{2N}\right), \qquad n = 0, 1, \dots, N-1$$
with $\alpha(k) = 1/\sqrt{2}$ for $k = 0$ and $\alpha(k) = 1$ otherwise.
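
A direct implementation of the transform pair is useful as a reference when checking faster schemes. The sketch below assumes the orthonormal scaling factor sqrt(2/N), which is what makes the inverse recover the input exactly; the function names `alpha`, `dct` and `idct` are hypothetical.

```python
import math


def alpha(k):
    """Scaling term: 1/sqrt(2) for k = 0, 1 otherwise."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct(x):
    """Direct transform, assuming the orthonormal factor sqrt(2/N)."""
    N = len(x)
    return [
        math.sqrt(2 / N) * alpha(k)
        * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N))
        for k in range(N)
    ]


def idct(z):
    """Inverse transform with the same scaling."""
    N = len(z)
    return [
        math.sqrt(2 / N)
        * sum(alpha(k) * z[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for k in range(N))
        for n in range(N)
    ]


samples = [1, 2, 3, 4, 5, 6, 7, 8]
restored = idct(dct(samples))
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(samples, restored)))  # True
```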

Image
• 64x64 points
• 8x8-pixel blocks
• 16-bit coded image
[Figure: the initial 64x64 image is divided into 64 blocks of 8x8 pixels. Each block is the matrix]
$$\begin{bmatrix} x_{0,0} & x_{0,1} & \dots & x_{0,7} \\ x_{1,0} & x_{1,1} & \dots & \vdots \\ \vdots & & \ddots & \vdots \\ x_{7,0} & \dots & \dots & x_{7,7} \end{bmatrix}$$

Implementation
• Matrix implementation
• Even/odd frequency decomposition of the DCT algorithm

$$z = \sqrt{\frac{2}{N}}\; T(N)\, x$$

Even frequencies:
$$\begin{bmatrix} z_0 \\ z_2 \\ z_4 \\ z_6 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ \beta & \delta & -\delta & -\beta \\ \alpha & -\alpha & -\alpha & \alpha \\ \delta & -\beta & \beta & -\delta \end{bmatrix} \begin{bmatrix} x_0 + x_7 \\ x_1 + x_6 \\ x_2 + x_5 \\ x_3 + x_4 \end{bmatrix}$$
α = cos(π/4)
β = cos(π/8)
δ = sin(π/8)

Odd frequencies:
$$\begin{bmatrix} z_1 \\ z_3 \\ z_5 \\ z_7 \end{bmatrix} = \begin{bmatrix} \lambda & \gamma & \mu & \nu \\ \gamma & -\nu & -\lambda & -\mu \\ \mu & -\lambda & \nu & \gamma \\ \nu & -\mu & \gamma & -\lambda \end{bmatrix} \begin{bmatrix} x_0 - x_7 \\ x_1 - x_6 \\ x_2 - x_5 \\ x_3 - x_4 \end{bmatrix}$$
λ = cos(π/16)
γ = cos(3π/16)
μ = sin(3π/16)
ν = sin(π/16)
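
The even/odd decomposition can be checked numerically against the cosine sums it replaces. The sketch below compares the two 4x4 matrix products with the unscaled sum of x_n.cos((2n+1)kπ/16) for N = 8. The scaling factor and α(k) are left out on both sides, and the names `EVEN`, `ODD`, `dct8_even_odd` and `dct8_direct` are hypothetical.

```python
import math

A = math.cos(math.pi / 4)        # alpha
B = math.cos(math.pi / 8)        # beta
D = math.sin(math.pi / 8)        # delta
L = math.cos(math.pi / 16)       # lambda
G = math.cos(3 * math.pi / 16)   # gamma
M = math.sin(3 * math.pi / 16)   # mu
V = math.sin(math.pi / 16)       # nu

EVEN = [[1, 1, 1, 1], [B, D, -D, -B], [A, -A, -A, A], [D, -B, B, -D]]   # rows z0, z2, z4, z6
ODD = [[L, G, M, V], [G, -V, -L, -M], [M, -L, V, G], [V, -M, G, -L]]    # rows z1, z3, z5, z7


def dct8_even_odd(x):
    """Unscaled 8-point DCT using sums for even outputs and differences for odd outputs."""
    sums = [x[n] + x[7 - n] for n in range(4)]
    diffs = [x[n] - x[7 - n] for n in range(4)]
    z = [0.0] * 8
    for r in range(4):
        z[2 * r] = sum(c * s for c, s in zip(EVEN[r], sums))
        z[2 * r + 1] = sum(c * d for c, d in zip(ODD[r], diffs))
    return z


def dct8_direct(x):
    """Unscaled reference: z_k = sum over n of x_n * cos((2n+1) k pi / 16)."""
    return [sum(x[n] * math.cos((2 * n + 1) * k * math.pi / 16) for n in range(8)) for k in range(8)]


x = [1, 2, 3, 4, 5, 6, 7, 8]
print(all(math.isclose(p, q, abs_tol=1e-9) for p, q in zip(dct8_even_odd(x), dct8_direct(x))))  # True
```

The decomposition halves the number of multiplications per output: each coefficient needs four products instead of eight.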

Coefficient coding
• Fixed point
Example: n = 6
$$z_k = \sqrt{\frac{2}{N}}\,\alpha(k)\sum_{n=0}^{N-1} x_n \cos\!\left(\frac{(2n+1)k\pi}{2N}\right)$$

α = 0000000000010110     -α = 1111111111101010
β = 0000000000011101     -β = 1111111111100011
δ = 0000000000001100     -δ = 1111111111110100
λ = 0000000000011111     -λ = 1111111111100001
γ = 0000000000011010     -γ = 1111111111100101
μ = 0000000000010001     -μ = 1111111111101111
ν = 0000000000000110     -ν = 1111111111111010
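
The listed bit patterns are consistent with 5 fractional bits, truncation, and 16-bit two's complement for negative values. That scaling is an inference from the numbers, not something the slides state, so the sketch below labels it as an assumption. The function name `encode` is hypothetical. The slide's pattern for -γ is one least-significant bit away from the negation of its γ code, so it is left out of the checks.

```python
import math

FRACTION_BITS = 5   # assumption: inferred from the listed codes, not stated in the slides


def encode(value, width=16):
    """Truncate |value| to FRACTION_BITS fractional bits; return a two's-complement bit string."""
    magnitude = int(abs(value) * (1 << FRACTION_BITS))   # truncation toward zero
    code = -magnitude if value < 0 else magnitude
    return format(code & ((1 << width) - 1), f"0{width}b")


print(encode(math.cos(math.pi / 4)))    # 0000000000010110
print(encode(-math.cos(math.pi / 4)))   # 1111111111101010
```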

Implementation:
• ADD and SUB on the first Dnode layer
• Multiply-accumulate operations (MAC) on the second Dnode layer
[Figure: in the first layer, the inputs x_n and x_{(N-1)-n} feed two Dnodes computing x_n + x_{(N-1)-n} and x_n - x_{(N-1)-n}. In the second layer, two MAC Dnodes receive these values and the coefficients, and produce z0, z2, z4, z6 and z1, z3, z5, z7 respectively. Each layer has its own configuration (Config).]

Computing…
Configurations used:
M0: N1:add(fifo,fifo) N2:sub(fifo,fifo)
M1: N1:MAC(in1,fifo) N2:MAC(in2,fifo)
M1: N1:clear N2:clear

[Figure sequence: the two-layer structure at each time step, shown next to the even and odd matrix products. The MAC coefficients are streamed as pairs (1, x) and (λ, x), (γ, x), (μ, x), (ν, x).]

t  n  Layer 1 inputs  Layer 1 outputs     Layer 2 MAC coefficients  Layer 2 outputs
0  0  x0, x7          -                   -                         -
1  1  x1, x6          x0+x7, x0-x7        1, λ                      -
2  2  x2, x5          x1+x6, x1-x6        1, γ                      -
3  3  x3, x4          x2+x5, x2-x5        1, μ                      -
4  4  -               x3+x4, x3-x4        1, ν                      -
5  0  -               -                   clear                     z0, z1

Results
• 2 transforms issued every 5 machine cycles
• « Clear » performed during the addition
• 20 cycles for 8 samples

Achievable parallelism on an 8-Dnode structure: Ring-8
[Figure: the ring has four layers (M0, M1, M2, M3), each with two Dnodes (Dnode1, Dnode2), a switch and a configuration. One half of the ring computes the 1D DCT of the first 4 lines while the other half computes the 1D DCT of the last 4 lines.]

Overall performances
[Figure sequence: the matrix of intermediate results z'_{0,0} … z'_{7,7} is filled step by step.]
• 2 partial transforms ⇒ 5 cycles
• 1 line, 8 partial transforms ⇒ 20 cycles
• 4 lines, 32 partial transforms ⇒ 80 cycles (layers M0 and M1)
• 8 columns, 64 transforms ⇒ 80 cycles (layers M2 and M3)
• 2D DCT on 8 points: 160 CYCLES
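
The cycle counts follow from the rate of 2 coefficients every 5 cycles. The sketch below reproduces the arithmetic under one stated assumption: during each 1D pass, the two halves of the Ring-8 work in parallel on half of the vectors each. The function names `pass_cycles` and `dct2d_cycles` are hypothetical.

```python
CYCLES_PER_STEP = 5    # from the slides: 2 coefficients are issued every 5 cycles
COEFFS_PER_STEP = 2


def pass_cycles(vectors=8, points=8, parallel_halves=2):
    """Cycles for one 1D pass (all lines, or all columns) over an 8x8 block."""
    transforms = vectors * points                 # 64 coefficients per pass
    per_half = transforms // parallel_halves      # assumption: each half of the ring takes 4 vectors
    return per_half // COEFFS_PER_STEP * CYCLES_PER_STEP


def dct2d_cycles():
    """Line pass followed by column pass."""
    return pass_cycles() + pass_cycles()


print(pass_cycles(vectors=1, parallel_halves=1))  # 20 cycles for one line
print(pass_cycles())                              # 80 cycles for one pass
print(dct2d_cycles())                             # 160 cycles for the 2D DCT
```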

Comparisons: execution time (cycles)
VLIW: CPU64, TM1000, TI 320C60
Superscalar: Pentium I, Pentium II, NEC V830
[Figure: bar chart of the number of cycles (axis from 0 to 400) for CPU64, TM-1000, 320C62, Ring-8, Ring-64, Pentium I, Pentium II and NEC V830. The bar values are not recoverable from the transcript.]