Mar. 2009, Wu Jinyuan, Fermilab
FPGA: From Flashing LED to Reconfigurable Computing
Wu, Jinyuan
Fermilab
IIT
Mar, 2009
Outline
Electronic Aspects of FPGA:
• LED Flashing
• Logic Elements in a Nutshell
• TDC and ADC
FPGA as a Computing Fabric:
• Moore’s Law Forever?
• Space Charge Computing with FPGA Cores
• Doublet Matching & Hash Sorter
• Triplet Matching & Tiny Triplet Finder
• Enclosed Loop Micro-Sequencer (ELMS)
Flashing LED, The First Thing First
[Diagram: 24-bit counter, Q[23..0]]
Design at least one LED into every FPGA board. When a board is first powered up, test the LED flashing function first. Many things have to be right for the LED to flash:
• All power pins must be connected.
• Configuration devices must be in the correct mode.
• The design software must be working correctly.
LED Brightness Variation
[Diagram: counter Q[23..0] and a comparator with inputs A and B and output A<B; a second version adds a LUT in front of input A]
The LED brightness is varied by changing the duty cycle of the output pulse.
Comparator input A is the brightness and input B is the clock cycle count.
A look-up table can be added at input A to obtain a different brightness variation curve.
Duty-Cycle Based Single-Pin DAC (1)
The duty cycle (pulse width) of the comparator output is proportional to the DAC input at port A.
Use an external RC network as a low-pass filter. The output voltage of an ideal low-pass filter is proportional to the DAC input.
[Diagram: counter Q drives comparator input B, the DAC input drives A, and the output is A>B. Plot: output waveform over clock cycles 896 to 1024]
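The counter-plus-comparator scheme can be checked with a short software model. This is only a sketch: the 8-bit counter width and the function names are assumptions made for illustration, not details taken from the slide.

```python
def pwm_wave(dac_input, bits=8):
    """One counter period of the comparator output A > B.

    A is the DAC input, B is the free-running counter value.
    The counter width (bits) is an assumed parameter.
    """
    period = 1 << bits
    return [1 if dac_input > count else 0 for count in range(period)]


def duty_cycle(dac_input, bits=8):
    """Fraction of the period during which the output pin is high."""
    wave = pwm_wave(dac_input, bits)
    return sum(wave) / len(wave)


if __name__ == "__main__":
    for a in (0, 64, 128, 255):
        print(a, duty_cycle(a))
```

In this model the high time equals the DAC input in clock cycles, so the average that an ideal low-pass filter would deliver scales linearly with the input.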
LED Brightness Exponential Drop
[Diagram: counter with carry-out CO, comparator (A<B), and a register (D, Q, SET) holding the brightness value]
if (CO==1) {Q = Q - Q/32;}
Narrow pulses are typically stretched for LED display at a fixed brightness.
The circuit here instead dims the LED gradually for a better visual effect.
Exponential Sequence Generator
[Diagram: accumulator register (D, Q, SET)]
if (CO==1) {Q = Q - Q/32;}
[Plot: generated sequence, 0 to 70000, vs. step number, 0 to 160]
An exponential sequence is generated using the accumulator shown above.
Note that not even one multiplier is used. Other function sequences (sine, cosine, tangent, cotangent, etc.) can be generated similarly.
Possible Student Lab
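The update rule Q = Q - Q/32 can be tried out in software before building it. The sketch below assumes a 16-bit starting value of 65535 and implements the division by 32 as a right shift by 5; both choices are illustrative and not stated on the slide.

```python
def exponential_sequence(q0=65535, steps=160, shift=5):
    """Generate Q[n+1] = Q[n] - Q[n]/32 using only a shift and a subtraction.

    q0, steps and shift are assumed values chosen for illustration.
    """
    q = q0
    seq = [q]
    for _ in range(steps):
        q -= q >> shift  # Q = Q - Q/32, no multiplier needed
        seq.append(q)
    return seq


if __name__ == "__main__":
    seq = exponential_sequence()
    print(seq[:4], seq[32], seq[-1])
```

Each step scales Q by roughly 31/32, so the sequence approximates an exponential decay. With integer truncation the value stops changing once Q falls below 32.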
Duty-Cycle Based Single-Pin DAC (2)
Use the carry-out of the accumulator as the output.
The number of pulses is proportional to the DAC input.
Rounding error is carried to later cycles.
The output is smoother.
[Diagram: accumulator (D, Q) with the DAC input on D and the carry-out CO as the output. Plot: output waveform over clock cycles 896 to 1024]
Possible Student Lab
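A minimal software model of the accumulator-based DAC is given below. The 8-bit accumulator width and the function name are assumptions for illustration.

```python
def accumulator_dac(dac_input, cycles, bits=8):
    """Return the carry-out bit stream of an accumulator that adds
    dac_input on every clock cycle (accumulator width is assumed)."""
    modulus = 1 << bits
    acc = 0
    out = []
    for _ in range(cycles):
        acc += dac_input
        carry = acc >= modulus   # carry-out CO becomes the output pulse
        acc %= modulus           # the remainder stays in Q for later cycles
        out.append(int(carry))
    return out


if __name__ == "__main__":
    print(accumulator_dac(64, 8))          # pulses spread evenly
    print(sum(accumulator_dac(77, 256)))   # pulse count over 256 cycles
```

Over 256 cycles the number of pulses equals the input value, and the pulses are spread across the period rather than grouped into one wide pulse.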
Logic Elements
Normal mode: LUT4 (16 RAM cells) + DFF, with inputs A, B, C, D.
Arithmetic mode: 2 x LUT3 (8 cells each) + DFF, with inputs A, B, carry-in CI and carry-out CO.
The DFF has D, Q, ENA and CLRN ports.
LUT = Look-Up Table
What Can Be Done With a Lookup Table
“Any” 4-input function of A, B, C, D
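To illustrate why a 16-cell table can realize any 4-input function, the sketch below packs a boolean function into a 16-bit truth table and evaluates it by addressing. The bit ordering (A as the least significant address bit) and the function names are assumptions for illustration.

```python
def make_lut4(func):
    """Pack a 4-input boolean function into a 16-bit truth table,
    one bit per RAM cell. Address bit order (A = LSB) is assumed."""
    table = 0
    for addr in range(16):
        a = addr & 1
        b = (addr >> 1) & 1
        c = (addr >> 2) & 1
        d = (addr >> 3) & 1
        if func(a, b, c, d):
            table |= 1 << addr
    return table


def lut4(table, a, b, c, d):
    """Evaluate the LUT: the four inputs form the address of one cell."""
    addr = a | (b << 1) | (c << 2) | (d << 3)
    return (table >> addr) & 1


NAND4 = make_lut4(lambda a, b, c, d: not (a and b and c and d))
XOR4 = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)

if __name__ == "__main__":
    print(hex(NAND4), hex(XOR4))
```

Changing the function only changes the 16 stored bits; the lookup hardware stays the same.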
Xilinx Look-Up Table
A Xilinx look-up table (followed by a DFF with D, Q, ENA, CLRN) can be used as:
LUT4: 4-input look-up table
SRL16: 16-bit shift register
RAM16: 16-bit distributed RAM
Pipeline Structure
[Diagram: LUT4 (16 RAM cells) stages separated by DFFs (D, Q, ENA, CLRN)]
Logic cells are usually arranged in pipeline structures.
Logic Element as a Full Adder Bit
[Diagram: two chained logic elements in arithmetic mode, each with two LUT3 (8 cells) and a DFF, inputs A and B, carry-in CI and carry-out CO]
A logic cell resembles a full adder bit.
Myths on FPGA
We commonly hear about FPGA:
• FPGA is cheap.
• FPGA is fast.
• FPGA is large.
• FPGA can do anything.
Not really; at least, it is not always the case. The reality is:
• FPGA is ultra-flexible.
• As the cost of that flexibility, transistor usage in FPGA is NOT efficient.
• Good design tricks are needed.
4-Input NAND, 4-Input NOR, 4-Input NAOR
[Schematics: three 4-input gates, each with inputs A, B, C, D and output Y]
8 transistors each, in ASIC
Transistor Usage of Logic Element
[Diagram: 16-bit LUT followed by a DFF (D, Q, ENA, CLRN)]
6-transistor RAM bit x 16: at least 96 transistors, in FPGA
The Mirror Adder (Weste93)
[Schematic: mirror adder with inputs A, B, Ci and outputs Cob, Sb]
24-28 transistors, in ASIC
Full Adder
[Diagram: full adder built from two 8-bit LUTs and a DFF; inputs A, B, CI; outputs S, CO]
At least 96 transistors, in FPGA
Other FPGA Resources
Other resources are available in FPGA devices:
• RAM blocks
• Multipliers
• Serial data receivers, PowerPC, etc.
[Chip layout: multipliers, RAM blocks, and groups of 16 logic elements]
TDC Using FPGA Logic Chain Delay
This scheme uses current FPGA technology.
A low-cost chip family can be used (e.g., EP2C8T144C6, $31.68).
Fine TDC precision can be implemented in slow devices (e.g., 20 ps in a 400 MHz chip).
[Diagram: logic chain delay line with inputs IN and CLK]
Two Major Issues In a Free Operating FPGA
[Plot: bin width (ps), 0 to 180, vs. bin number, 0 to 64]
1. Bin widths differ from bin to bin and vary with supply voltage and temperature.
2. Some bins are ultra-wide due to LAB boundary crossing.
Auto Calibration Using Histogram Method
It provides a bin-by-bin calibration at a given temperature.
It is a turn-key solution (bin in, ps out).
It is semi-continuous (the LUT is updated automatically every 16K events).
[Plot: bin time (ps), 0 to 2500, vs. bin number, 0 to 64]
[Plot: bin width (ps), 0 to 180, vs. bin number, 0 to 64]
[Diagram: 16K events fill a DNL histogram, which fills the LUT; In (bin), Out (ps)]
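The histogram method can be sketched in a few lines. The assumptions here are mine, not the slide's: hits are taken to be uniformly distributed over the clock period, so each bin's width is proportional to its hit count, and the LUT is filled with the time of each bin's center. The function name is hypothetical.

```python
def calibration_lut(histogram, clock_period_ps):
    """Build bin widths and a bin-to-time LUT from a hit-count histogram.

    Assumes hits are uniformly distributed in time, so the width of a
    bin is proportional to its share of the total counts. The LUT entry
    is the time at the bin center (an assumed convention).
    """
    total = sum(histogram)
    widths = [clock_period_ps * n / total for n in histogram]
    lut = []
    edge = 0.0
    for w in widths:
        lut.append(edge + w / 2)  # center of this bin
        edge += w                 # running sum gives the next bin edge
    return widths, lut


if __name__ == "__main__":
    widths, lut = calibration_lut([1, 3, 4], 800.0)
    print(widths, lut)
```

In use, a raw bin number indexes the LUT and the stored time in ps comes out, which matches the "bin in, ps out" description.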
The Test Module
Two NIM inputs
FPGA with 8ch TDC
Data Output via Ethernet
BNC adapters add delay in 150 ps steps.
Test Result
[Setup: NIM inputs, LeCroy 429A NIM fan-out, two NIM/LVDS converters, and Wave Union TDC B channels whose outputs are subtracted]
BNC adapters add delays in 140 ps steps.
[Plot: time difference distribution for 0, 1 and 2 adapters; RMS 10 ps]
As good as ASIC TDC
Multi-Sampling TDC FPGA
[Block diagram: multiple sampling on clock phases c0, c90, c180, c270 (Q0 to Q3); clock domain changing (QD, QE, QF); transition detection & encode (T0, T1, DV); coarse time counter (TS). Label: 4Ch]
Ultra low-cost: 48 channels in a $18.27 EP2C5Q208C7.
Sampling rate: 360 MHz x 4 phases = 1.44 GHz.
LSB = 0.69 ns.
Logic elements with non-critical timing are placed freely by the compiler's fitter.
This picture represents a placement in a Cyclone FPGA.
ADC Using FPGA
[Diagram: conventional scheme with AMP & Shaper channels feeding separate ADCs and then the FPGA; proposed scheme with AMP & Shaper channels feeding the FPGA directly, where TDC blocks digitize the comparator transitions. VREF is generated by FPGA outputs through a passive network (R1, R1, R2, C).]
Analog signals from the AMP & Shapers are fed directly to FPGA pins.
FPGA outputs and a passive RC network are used to generate a ramping reference voltage VREF.
The input voltages and VREF are compared using FPGA differential input receivers.
The transition times, which represent the input voltage values, are digitized by TDC blocks in the FPGA.
[Timing diagram: input voltages V1, V2, V3, V4 cross the VREF ramp at times T1, T2, T3, T4]
ADC Test: Waveform Digitization on BD3_19
[Plot: input waveform, overlap trigger & reference voltage; V, 1 to 2.5, vs. t (ns), 2500 to 5500; leading ramp and trailing ramp]
[Plot: raw data, 0 to 64, vs. 0 to 256; leading ramp and trailing ramp]
[Plot: converted waveform]
[Circuit: FPGA with two TDC blocks; VREF network with components labelled 50, 50, 100 and 1000pF]
Possible Student Lab
Moore’s Law
Number of transistors in a package: x2 every 18 months
Taken from www.intel.com
Status of Moore’s Law: an Inconvenient Truth
Number of transistors: yes, via multi-core.
Clock speed: ?
Taken from www.intel.com
The Fever of Moore’s Law vs. Maxwell’s Equations
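$$\nabla\times\mathbf{H} = \mathbf{J} + \frac{\partial\mathbf{D}}{\partial t},\qquad \nabla\times\mathbf{E} = -\frac{\partial\mathbf{B}}{\partial t},\qquad \nabla\cdot\mathbf{B} = 0,\qquad \nabla\cdot\mathbf{D} = \rho$$
[Plot: Op/sec vs. year, 1998 to 2010; annotations: MIT, 2002; WRW]
During the hot days of Moore’s Law, the rules of thumb were:
• BRB: Buy Rather than Build
• URU: Use Rather than Understand
• WRW: Wait Rather than Work
From fundamental principles such as Maxwell’s equations, we know that limits to Moore’s Law exist. Technological advances come from hard work.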
The Execution & Non-Execution Cycles
In current microprocessors:
• Each instruction takes one clock cycle to execute.
• It takes many clock cycles to prepare for executing an instruction.
• Pipelined? Yes. But the non-execution pipeline stages consume silicon area, power, etc.
• To execute an instruction != to do useful calculation.
Can we do something different?
From MIT 6.823 Open Course Site
The Space Charge Computing
$$\mathbf{F}_j = \frac{1}{4\pi\varepsilon_0}\sum_{i,\,i\neq j}^{n}\frac{\mathbf{r}_j-\mathbf{r}_i}{|\mathbf{r}_j-\mathbf{r}_i|^{3}}$$
Each electron sees the sum of the Coulomb forces from the other N-1 electrons.
The total number of calculations is about N^2, and each calculation of the Coulomb force requires a square root, a division and several multiplications.
Regular sequential computers are not fast enough.

Number of     Number of                 Computing Time/1000 Iterations
Electrons     Calculations/Iteration    @ 10^7 Calculations/s
10^3          ~10^6                     100 s
10^4          ~10^8                     2.7 hours
10^5          ~10^10                    11.6 days
10^6          ~10^12                    3.2 years
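The computing times in the table follow from N^2 calculations per iteration, 1000 iterations, and 10^7 calculations per second. The short check below reproduces them; the function name is chosen here for illustration.

```python
def computing_time_s(n_electrons, iterations=1000, rate=1e7):
    """Seconds needed for `iterations` passes of ~N^2 Coulomb force
    calculations at `rate` calculations per second."""
    return n_electrons ** 2 * iterations / rate


if __name__ == "__main__":
    for n in (10**3, 10**4, 10**5, 10**6):
        t = computing_time_s(n)
        print(n, t, "s =", t / 3600, "h =", t / 86400, "d =",
              t / (365 * 86400), "y")
```

The results are 100 s, about 2.8 hours, about 11.6 days and about 3.2 years, in line with the table.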
The FPGA Board
Up to 16 FPGA devices ($32 each) can be installed on each board. Each FPGA hosts one core.
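The 16-bit Demo Core
[Block diagram: 16-bit coordinates (xi, yi, zi) and (xj, yj, zj) are subtracted; the differences are squared (x2) and summed to address a LUT (10 bits in, 16 bits out); the LUT output is multiplied with the differences to give 32-bit forces, which are accumulated into 16-bit velocities (vxj, vyj, vzj)]
$$\mathbf{F}_j = \frac{1}{4\pi\varepsilon_0}\sum_{i,\,i\neq j}^{n}\frac{\mathbf{r}_j-\mathbf{r}_i}{|\mathbf{r}_j-\mathbf{r}_i|^{3}}$$
$$\mathbf{v}_j = \mathbf{v}_{j0} + \mathbf{a}_j\,dt$$
$$\mathbf{x}_k = \mathbf{x}_{k0} + \mathbf{v}_k\,dt$$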
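The Lookup Table
[Diagram: three squared terms (x2) are summed and fed to the LUT (10 bits in, 16 bits out)]
$$\mathbf{F}_j = \frac{1}{4\pi\varepsilon_0}\sum_{i,\,i\neq j}^{n}\frac{\mathbf{r}_j-\mathbf{r}_i}{|\mathbf{r}_j-\mathbf{r}_i|^{3}}$$
$$y = \frac{1}{x^{3/2}}$$
[Plot: LUT output y, 0 to 32768, vs. input x, 0 to 1024]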
Two Electrons with Natural Scales
[Plot: distance (m), 0.0000158 to 0.0000178, vs. steps, 0 to 25; curves: calculated x1, calculated x2, actual x1, actual x2; scale markers: 256 nm, 28 ps]
256 Charged Particles, Iterations 0 to 40
[Sequence of nine scatter plots of particle positions, X''' (5000 to 40000) vs. Y''' (10000 to 35000), at iterations 0, 5, 10, 15, 20, 25, 30, 35 and 40]
Speed Comparison with Regular CPU
The FPGA core is about 10 times faster than a typical 2.2 GHz CPU core.
The FPGA core runs at 200 MHz, i.e., 200 M Coulomb force calculations/s.
The CPU core appears to need 80-100 clock cycles for each Coulomb force calculation.
[Plot: time (s), 0 to 350, vs. number of particles, 0 to 90000. CPU: 2.2 GHz Intel Core 2 Duo; FPGA: EP2C8T144C6, 200 MHz]
One Board: 8 FPGA Cores
One board has the calculation capacity of 40 dual-core CPUs:
One core/FPGA = 5 dual-core CPUs
8 cores/board = 40 dual-core CPUs
The power consumption of one board is < 4.5 W.
Newer FPGAs capable of hosting 4 cores/FPGA are available.
Example of Doublet Match, PET
Positrons and electrons annihilate to produce pairs of photons. The back-to-back photons hit the detector at nearly the same time.
Detector hits are digitized, and hits at nearly the same time are to be matched together.
A direct comparison of every hit pair takes O(n^2) clock cycles.
[Diagram: time T and data D of hits in Group 1 and Group 2; the time difference is tested against a window: T<A? and T>(-A)?]
Hash Sorter
Pass 1: Data in Group 1 are stored in the hash sorter bins based on key number K.
Pass 2: Data in Group 2 are fetched through and paired up with the corresponding Group 1 data with the same key number K.
[Diagram: Group 1 items (K, D) are written into the hash sorter; Group 2 items (K, D) read out the matching bins]
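The two passes can be modelled in software with a dictionary standing in for the hash sorter bins. This is a sketch of the idea only: the tuple layout (key, data) and the function name are assumptions, and the hardware uses RAM blocks rather than a Python dictionary.

```python
from collections import defaultdict


def hash_sorter_match(group1, group2):
    """Pair items of two groups that share the same key number K.

    Each item is an assumed (key, data) tuple.
    """
    bins = defaultdict(list)
    for key, data in group1:           # Pass 1: store Group 1 by key
        bins[key].append(data)
    pairs = []
    for key, data in group2:           # Pass 2: look up each Group 2 key
        for stored in bins.get(key, ()):
            pairs.append((key, stored, data))
    return pairs


if __name__ == "__main__":
    g1 = [(3, "a"), (7, "b"), (3, "c")]
    g2 = [(7, "x"), (3, "y"), (5, "z")]
    print(hash_sorter_match(g1, g2))
```

Each item is touched once per pass, so the work grows with the number of items rather than with the number of item pairs.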
Linked-List Structure of Hash Sorter
[Block diagram: Index RAM, Pointer RAM and DATA RAM, with key K, input DIN and output DOUT]
Hash Sorter
Using the hash sorter, matching pairs can be grouped together in 2n, rather than n^2, clock cycles.
Hits, Hit Data & Triplets
• Hit data come out of the detector planes in random order.
• Hit data from 3 planes generated by same particle tracks are organized together to form triplets.
Triplet Finding
• Three data items must satisfy the condition: xA + xC = 2 xB.
• A total of n^3 combinations must be checked (e.g. 5x5x5=125).
• Three layers of loops are needed if the process is implemented in software.
• Large silicon resources may be needed without careful planning: O(N^2)
[Diagram: hits on Plane A, Plane B, Plane C]
Tiny Triplet Finder Operations
Pass I: Filling Bit Arrays
[Diagram: physical planes and the corresponding bit arrays/shifters. Note: flipped bit order]
For any hit… fill a corresponding logic cell.
• xA + xC = 2 xB
• xA = - xC + constant
Tiny Triplet Finder Operations
Pass II: Making Match
For any center plane hit…
• Logically shift the bit array.
• Perform a bit-wise AND in this range.
• A triplet is found.
[Diagram: physical planes and the corresponding bit arrays/shifters]
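The two passes can be imitated in software with integers used as bit arrays. This is a sketch under assumptions of my own: hit coordinates are integers in the range 0 to width-1, the plane C array is stored in flipped bit order, and the shift amount is derived from 2 xB. The hardware shifter details on the slide may differ.

```python
def tiny_triplet_finder(hits_a, hits_b, hits_c, width):
    """Find (xA, xB, xC) with xA + xC == 2*xB using shift and bit-wise AND.

    Coordinates are assumed to be integers in [0, width).
    """
    top = width - 1
    # Pass I: fill the bit arrays; plane C uses flipped bit order.
    array_a = 0
    for xa in hits_a:
        array_a |= 1 << xa
    array_c = 0
    for xc in hits_c:
        array_c |= 1 << (top - xc)
    # Pass II: for each center-plane hit, shift and AND.
    triplets = []
    for xb in hits_b:
        shift = top - 2 * xb
        if shift >= 0:
            shifted_c = array_c >> shift
        else:
            shifted_c = array_c << -shift
        match = array_a & shifted_c      # bit xA set means a triplet exists
        xa = 0
        while match:
            if match & 1:
                triplets.append((xa, xb, 2 * xb - xa))
            match >>= 1
            xa += 1
    return triplets


if __name__ == "__main__":
    print(tiny_triplet_finder([2, 5, 9], [4, 6], [6, 7, 3, 10], 16))
```

Flipping the plane C array turns the condition xA = -xC + constant into a pure alignment problem, so one shift per center-plane hit replaces the two inner loops of the software approach.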
Tiny? Yes, Tiny! – Logic Cell Usage:
AM, CAM, Hough Transform, etc.: O(N^2)
Tiny Triplet Finder: O(N*logN)
Hit Matching

Software                                        FPGA, Typical                      FPGA, Resource Saving Approaches
O(n^2): for(){ for(){…} }                       O(n)*O(N): Comparator Array        Hash Sorter, O(n)*O(N): in RAM
O(n^3): for(){ for(){ for(){…} } }              O(n)*O(N^2): CAM, Hough Trans.     Tiny Triplet Finder, O(n)*O(N*logN)
O(n^4): for(){ for(){ for(){ for(){…} } } }
The Winning Line of FPGA Computing
We commonly hear:
• FPGA devices contain millions of gates.
• High parallelism can be implemented in FPGA.
• FPGA cost drops by half every 18 months.
We want to emphasize, especially to our young students:
1. Creativity,
2. Creativity,
3. Creativity, on Arithmetic ops, on Algorithms, on Architectures & on All Aspects.
O Freunde, nicht diese Töne!
Micro-computing vs. Reconfigurable Computing
• In a microprocessor, the user specifies a program on fixed logic circuits.
• In an FPGA, the user specifies the logic circuits (as well as the program).
• FPGA computing need not follow microprocessor architectures. (But useful experience can be borrowed.)
• The usefulness of FPGA reconfigurable computing is still to be fully appreciated.
Example: (100+3-4)*5+7 = ?
[Diagram: CPU vs. FPGA evaluation of the expression. Data: 100, 3, 4, 5, 7. Control: LD (-) (+) (*) (+). Labels: CPU: Data, Program; FPGA: Data, Program, Configuration]
FPGA Process Sequencing Options
Sequencer                               Program Type             Program Length (CLK cycles)   Reprogram   Resource Usage
Finite State Machine (FSM)              Fixed, Wired             10                            Hard        Small
Enclosed Loop Micro-Sequencer (ELMS)    Memory Stored Program    10-1000                       Easy        Small
Microprocessor (MP)                     Memory Stored Program    >1000                         Easy        Large
The Between Counter
[Diagram: the Between Counter, a counter with synchronous load (SLOAD, D[]) and clear (SCLR), output Q[], and an equality comparator (==) with inputs A[] and B[]. Example count sequence: 0,1,2,3,4,5,6,7,8,9,A, then 5,6,7,8,9,A repeated, and finally 5,6,7,8,9,A,B,C,D,E,F…]
[Diagram: the Between Counter addresses a ROM (PC0: instr0, PC1: instr1, … PCD: instrD), which outputs the control signals]
ELMS– Enclosed Loop Micro-Sequencer
[Block diagram: Loop & Return Logic + Stack, Conditional Branch Logic, Program Counter, ROM (128 x 36 bits), with inputs Reset and CLK and output Control Signals]

PC   Control Signals    Operation
00   000000000000000
01   001000100011010          LD R1, #n
02   000010001000000          LD R2, #addr_a
03   000000000000100          LD R3, #addr_X
04   000000010001000          LD R7, #0
05   000000000100001   BckA1  LD R4, (R2)
06   000100000010000          INC R2
07   000001000100000          LD R5, (R3)
08   000100010000001          INC R3
09   001001000100000          MUL R6, R4, R5
0a   000000010001000   EndA1  ADD R7, R7, R6
0b   000010000010000          DEC R1
0c   000000100000100          BRNZ BckA1

PC + ROM is a good sequencer in FPGA.
Adding Conditional Branch Logic allows the program to loop back, i.e., to jump back as in microprocessors.
Loop & Return Logic + Stack is a special feature of ELMS that supports FOR loops at machine code level.
ELMS – Detailed Block Diagram
[Block diagram: PC with +1 increment, Reset (0x04) and conditional jump (CondJMP); ROM (128 x 36 bits) producing the user control signals; Loop & Return Registers + Stack (128 words) holding CNT, endA, bckA and desA, with Push, Pop, JMP, JMPIF and RTN; a Compare block generating LoopBack, DEC and LastPass]
LoopBack = DEC = (PC==endA) && (CNT!=0)
LastPass = (PC==endA) && (CNT==1)

        FOR BckA1 EndA1 #n
        LD R2, #addr_a
        LD R3, #addr_X
        LD R7, #0
BckA1   LD R4, (R2)
        INC R2
        LD R5, (R3)
        INC R3
        MUL R6, R4, R5
EndA1   ADD R7, R7, R6
        LD R8, R7

The stack supports nested loops and subroutine calls up to 128 layers deep.
What’s Good About ELMS: FOR Loops at Machine Code Level w/ Zero Overhead
• The looping sequence is known in this example before entering the loop.
• A regular microprocessor treats the sequence as unknown.
• ELMS supports FOR loops with pre-defined iteration counts at machine code level.
• Execution time is saved, and micro-complexities (branch penalty, pipeline bubbles, etc.) associated with conditional branches are avoided.

Task: $Y = \sum_{i}^{n} a_i X_i$

Microprocessor:
        LD R1, #n
        LD R2, #addr_a
        LD R3, #addr_X
        LD R7, #0
BckA1   LD R4, (R2)
        INC R2
        LD R5, (R3)
        INC R3
        MUL R6, R4, R5
EndA1   ADD R7, R7, R6
        DEC R1
        BRNZ BckA1

The ELMS:
        FOR BckA1 EndA1 #n
        LD R2, #addr_a
        LD R3, #addr_X
        LD R7, #0
BckA1   LD R4, (R2)
        INC R2
        LD R5, (R3)
        INC R3
        MUL R6, R4, R5
EndA1   ADD R7, R7, R6

[Chart: Microprocessor vs. the ELMS; the conditional branch portion of the microprocessor loop is marked 25%]
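The 25% figure can be related to the two listings by counting executed instructions. The sketch below assumes one clock cycle per instruction and that the FOR instruction occupies one program slot; both are simplifying assumptions, and the function name is hypothetical.

```python
def executed_instructions(n):
    """Count executed instructions for n loop passes in both listings.

    Assumes one instruction per slot: 4 setup instructions in each
    listing, 6 loop-body instructions, and 2 loop-control instructions
    (DEC R1, BRNZ BckA1) in the microprocessor version only.
    """
    setup = 4
    body = 6
    loop_control = 2
    microprocessor = setup + n * (body + loop_control)
    elms = setup + n * body
    return microprocessor, elms


if __name__ == "__main__":
    mp, elms = executed_instructions(100)
    print(mp, elms, (mp - elms) / mp)
```

For large n the loop-control instructions make up 2 of every 8 executed instructions in the microprocessor version, which is the 25% that the FOR loop removes.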
ELMS as a Hardware Loop Sequencer
[Block diagram: Loop & Return Logic + Stack, Conditional Branch Logic, Program Counter, ROM (128 x 36 bits), with inputs Reset and CLK and output Control Signals]
• There are DSP devices that support hardware loops for zero-overhead loop implementation.
• The emphasis of ELMS is that FOR loops and subroutine calls/returns are treated the same way.
• Any program passage can be used as a subroutine without needing a return instruction.
• The ELMS uses as few resources as possible for FPGA implementation.
[Figure from http://www.analog.com/]
No ALU => Small Resource Usage
[Diagrams:
Princeton (von Neumann) Architecture: Program & DATA Memory, Program Control, ALU.
Harvard Architecture: Program Memory, Program Control, ALU, DATA Memory.
Fermilab (?) Architecture: Program Memory, Sequencer (ELMS), Data Processor, DATA Memory.]
• The Princeton Architecture is more suitable at system level, while the Harvard Architecture is better suited at micro-structure level.
• Regular microprocessors cannot run a looped program without an ALU.
• The ALU takes a large amount of resources, yet may not be efficiently utilized for data processing tasks in FPGA.
• The ELMS can run nested-loop programs without an ALU.
• Further separation of program and data is therefore possible.
• The ELMS is kept small.
The Frequency Spectrum of DAC (2)
[Diagram: accumulator (D, Q) with the DAC input on D and the carry-out CO as the output. Plot: output waveform over clock cycles 896 to 1024]
[Three plots: spectrum amplitude, 0 to 100, vs. frequency, 0 to 512]
The first harmonic may be suppressed.
Works better with regular low-pass filters.
Possible Student Lab
The Frequency Spectrum of DAC (1)
[Diagram: counter Q drives comparator input B, the DAC input drives A, and the output is A>B. Plot: output waveform over clock cycles 896 to 1024]
[Three plots: spectrum amplitude, 0 to 100, vs. frequency, 0 to 512]
The first harmonic dominates the spectrum.
Works better with a notch filter.
Digital Calibration Using Twice-Recording Method
[Diagram: delay line with inputs IN and CLK]
Use a longer delay line. Some signals may then be registered twice, at two consecutive clock edges.
N2 - N1 = (1/f)/Δt
1/f: clock period; Δt: average bin width
The two measurements can be used:
• to calibrate the delay,
• to reduce digitization errors.
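Solving the relation above for Δt gives the average bin width directly from the two recorded bin numbers. The sketch below is illustrative only; the function name and the example numbers are assumptions, not measured values.

```python
def average_bin_width_ps(n1, n2, clock_mhz):
    """Average delay-line bin width from a twice-recorded hit.

    n1 and n2 are the bin numbers registered at two consecutive clock
    edges, so N2 - N1 bins span exactly one clock period (1/f).
    """
    period_ps = 1e6 / clock_mhz        # clock period in picoseconds
    return period_ps / (n2 - n1)


if __name__ == "__main__":
    # Hypothetical example: 400 MHz clock, hit seen at bins 10 and 60.
    print(average_bin_width_ps(10, 60, 400.0))
```

Because the clock period is known precisely, a change in N2 - N1 tracks a change in delay speed, which is what makes the calibration possible without an external reference.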
Digital Calibration Result
[Plot: TDC Output at Different PS Voltage; TDC outputs, 0 to 25, vs. VCCINT (V), 1.5 to 2.5; curves: N1 (1st TDC), N2 (2nd TDC)]
[Plot: the same with the corrected time Tc added]
• The power supply voltage changes from 2.5 V to 1.8 V (about the same effect as 100 °C to 0 °C).
• The delay speed changes by 30%.
• The difference of the two TDC numbers reflects the delay speed.
[Equation: corrected time Tc expressed in terms of T, N0, N1 and N2]
Warning: the calibration is based on the average bin width, not bin-by-bin widths.
Indirect Cost of Complexity
If something like this can do the job…
… why do these?
Tiny Triplet Finder
Reuse Coincident Logic via Shifting Hit Patterns
[Diagram: detector layers C1, C2, C3]
One set of coincident logic is implemented.
For an arbitrary hit on C3, rotate, i.e., shift the hit patterns for C1 and C2 to search for coincidence.
Tiny Triplet Finder for Circular Tracks
[Block diagram: two bit arrays, each followed by a shifter (shift amounts scaled by *R1/R3 and *R2/R3), feed bit-wise coincident logic; the triplet map output goes to a decoder]
[Plot: triplet map, 0 to 128 on both axes]
1. Fill the C1 and C2 bit arrays. (n1 clock cycles)
2. Loop over C3 hits, shift the bit arrays and check for coincidence. (n3 clock cycles)
Also works with more than 3 layers.