May. 2009, Wu Jinyuan, Fermilab [email protected] IEEE RT09 Short Course 1 FPGA Structure,...


FPGA Structure, Programming Principles and Applications: Part II

Wu, Jinyuan

Fermilab

IEEE Real Time Conference Short Course

May, 2009


Outline

Counting: Example: LED brightness and DAC. Simple Sequencing.

Bandwidth and Noise Issues: General Remarks on Sampling Theorem and Dithering. Example: Huffman Coding. Example: Decimation & Dynamic Decimation.

After-fact Calibration: Several Topics on FPGA Based TDC.

Serial Communication with Independent Crystals

Minimum Synchronization



Flashing LED, The First Thing First

[Schematic: a counter with output Q[23..0] driving the LED.]

Design at least one LED for each FPGA. When a board is first powered up, test the LED flashing function first.

Many things have to be right for the LED to flash: All power pins must be connected. Configuration devices must be in the correct mode. Design software must be correct.



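FP LED Brightness Variation

[Schematic: a counter Q[23..0] drives comparator input B, the brightness value drives input A, and the comparator output is A<B. A second version adds a look-up table (LUT) in front of input A.]

The LED brightness is varied by changing the output pulse duty cycle.

Comparator input A is the brightness and B is the clock cycle count.

A look-up table can be added at input A to obtain a different brightness variation curve.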
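The comparator scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 8-bit counter width (the slide shows a 24-bit counter), the function names, and the square-law table are hypothetical, and whether a high output actually lights the LED depends on how the board is wired.

```python
def comparator_output(brightness, count):
    """Comparator output A<B, with A = brightness and B = clock cycle count."""
    return 1 if brightness < count else 0


def comparator_duty_cycle(brightness, counter_bits=8):
    """Fraction of one counter period during which the A<B output is high."""
    period = 1 << counter_bits  # the counter wraps after 2**counter_bits clocks
    high_cycles = sum(comparator_output(brightness, count) for count in range(period))
    return high_cycles / period


# Hypothetical look-up table placed in front of input A: a square-law curve
# (one possible choice, not taken from the slides).
SQUARE_LAW_LUT = [(level * level) // 255 for level in range(256)]

if __name__ == "__main__":
    for level in (0, 63, 127, 255):
        direct = comparator_duty_cycle(level)
        shaped = comparator_duty_cycle(SQUARE_LAW_LUT[level])
        print(level, direct, shaped)
```

Changing only the table contents changes the brightness curve; the counter and comparator stay the same.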



FP LED Brightness Exponential Drop

[Schematic: the counter and A<B comparator as before, plus a register Q (D input, SET) that is updated when the counter carry-out CO is 1.]

if (CO==1) {Q = Q - Q/32;}

Narrow pulses are typically stretched for LED display with fixed brightness.

The circuit here dims the LED gradually for a better visual effect.

(Possible Student Lab)



Exponential Sequence Generator

[Schematic: a register Q (D input, SET) updated when CO is 1.]

if (CO==1) {Q = Q - Q/32;}

[Plot: the generated sequence, vertical axis 0 to 70000, horizontal axis 0 to 160.]

An exponential sequence is generated using the accumulator shown above.

Note that not even one multiplier is used. Other function sequences (sine, cosine, tangent, cotangent, etc.) can also be generated similarly.
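A short Python model of the update rule Q = Q - Q/32 shows how the sequence tracks an exponential decay. The starting value 65535 and the number of steps are assumptions chosen to resemble the plot; the division by 32 is written as a right shift, which is how it would normally be wired in logic.

```python
def exponential_sequence(start=65535, steps=160, shift=5):
    """Return the values of Q, updated as Q = Q - Q/32 (Q/32 is a right shift by 5)."""
    q = start
    values = [q]
    for _ in range(steps):
        q = q - (q >> shift)  # subtract and shift only, no multiplier
        values.append(q)
    return values


if __name__ == "__main__":
    seq = exponential_sequence()
    ideal = 65535 * (31 / 32) ** 32  # exact geometric decay after 32 steps
    print(seq[:4], seq[32], round(ideal))
```

Because the shift truncates, the integer sequence decays slightly more slowly than the ideal geometric sequence, by less than one count per step.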



Duty-Cycle Based Single-Pin DAC (1)

The duty cycle or pulse width of the comparator output is proportional to the DAC input at port A.

Use an external RC as a low-pass filter. The output voltage of an ideal LP filter is proportional to the DAC input.

[Schematic: the DAC input drives comparator input A, the counter output Q drives input B, and the output is A>B. Waveform plot: vertical axis 0 to 4, horizontal axis 896 to 1024.]



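Duty-Cycle Based Single-Pin DAC (2)

[Schematic: an accumulator (register Q, input D) fed by the DAC input, with carry-out CO as the output. Waveform plot: vertical axis 0 to 4, horizontal axis 896 to 1024.]

(Possible Student Lab)

Use the carry-out of the accumulator as the output. The number of pulses is proportional to the DAC input. Rounding error is carried to later cycles. The output is smoother.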
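The two single-pin DAC schemes can be compared with a small Python model. The 4-bit width and the function names are assumptions for illustration; the DAC input is assumed to be smaller than 2**bits so that the carry-out is a single bit.

```python
def comparator_dac(dac_input, bits=4):
    """DAC (1): output is high while the DAC input is greater than the counter."""
    return [1 if dac_input > count else 0 for count in range(1 << bits)]


def accumulator_dac(dac_input, bits=4):
    """DAC (2): add the DAC input every clock and output the carry-out."""
    modulus = 1 << bits
    accumulator = 0
    output = []
    for _ in range(modulus):
        total = accumulator + dac_input
        output.append(total // modulus)  # carry-out bit (0 or 1)
        accumulator = total % modulus    # remainder is carried to later cycles
    return output


if __name__ == "__main__":
    print(comparator_dac(5))   # pulses grouped at the start of the period
    print(accumulator_dac(5))  # same number of pulses, spread over the period
```

Both outputs contain the same number of ones per period, so an ideal low-pass filter gives the same voltage; the accumulator version spreads the pulses, which moves the ripple to higher frequencies.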



The Frequency Spectrum of DAC (2)

[Figure: the accumulator DAC (Q, CO, D, DAC Input), its output waveform (horizontal axis 896 to 1024), and three frequency spectra (vertical axis 0 to 100, Frequency 0 to 512).]

The first harmonic may be suppressed. Works better with regular low-pass filters.



The Frequency Spectrum of DAC (1)

[Figure: the comparator DAC (Counter Q, A, B, A>B, DAC Input), its output waveform (horizontal axis 896 to 1024), and three frequency spectra (vertical axis 0 to 100, Frequency 0 to 512).]

The first harmonic dominates the spectrum.

Works better with a notch filter.









Start, Count: A Single Layer Loop

[Timing diagram: signals ST, CLK, QA[5] and QA[4..0]; QA[4..0] steps through 0, 1, ..., 30, 31, 0.]

The ST signal starts the sequence. Counting is enabled. Counting stops.



A Double-Layer + Single-Layer Sequencer

A double-layer loop is followed by a single-layer loop.

[Schematic: a 2-bit state counter QC[1..0], two 8-bit up counters with outputs QAA[7..0] and QBA[7..0], decoders detecting counts 254 and 255, and glue logic. Inputs: CLK, ST. Labeled sections: Inner Loop, Outer Loop, State Control.]

Sequence of BA and AA values:

Double-layer loop: for each BA = 0, 1, 2, 3, 4, ..., 255, AA steps through 0, 1, 2, 3, 4, ..., 255.

Single-layer loop, listed as (BA, AA): (0, 0), (1, 0), (2, 0), (3, 1), (4, 2), ..., (255, 253), (0, 254), (0, 255), (0, 0).



An Array Adder

[Schematic: two 256-word x 16-bit dual-port RAM blocks (M4K), a 16-bit adder (A+B), address counters, and control flip-flops. Inputs: XD[15..0], XA[7..0], XWE, ST, CLK. Output: OUT[15..0].]









Care Must Be Taken Outside the FPGA (1)

[Block diagram: Shaper/LP Filter, ADC, FPGA, DAC, LP Filter. Spectrum panels: Spectrum of Original Signal, Band Limiting, ADC Input, Sampling in ADC, Aliasing w/o LP Filtering, Spectrum of DAC Output, LP filter, Output of LP filter.]

Signal frequencies must stay below the Nyquist limit: f < (1/2) x Sampling Frequency.



The “Trend” vs. The Sampling Theorem

"There will be no hardware analog processing. Everything is done digitally in software."

It sounds very stylish.

A shaper/low-pass filter is a minimum requirement.



Care Must Be Taken Outside the FPGA (2)

[Block diagram: Shaper/LP Filter with Dither added, ADC, FPGA, DAC, LP Filter.]

[Plots: ADC code (51 to 54) vs. Sampling Index (0 to 150). One panel shows Signal, ADC(signal) and Threshold; the other shows Signal, Signal+Noise, ADC(signal+noise), Weighted Average and Threshold.]

Resolution finer than the ADC LSB can be achieved by adding noise at the ADC input and digital filtering.
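The effect can be reproduced with a toy Python model. Everything here is an assumption for illustration: an ideal ADC with an LSB of 1, uniformly distributed noise of +-0.5 LSB, a constant input of 52.3, and a plain average standing in for the digital filter.

```python
import math
import random


def adc(value):
    """Ideal ADC with an LSB of 1: round to the nearest integer code."""
    return math.floor(value + 0.5)


def averaged_reading(signal, noise_amplitude, samples=10000, seed=1):
    """Average many ADC codes of signal plus uniform noise (a simple digital filter)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        noise = rng.uniform(-noise_amplitude, noise_amplitude)
        total += adc(signal + noise)
    return total / samples


if __name__ == "__main__":
    print(averaged_reading(52.3, 0.0))  # no noise: every sample gives the same code
    print(averaged_reading(52.3, 0.5))  # with noise: the average approaches 52.3
```

Without noise the average can never leave the integer code, no matter how many samples are taken; with noise the fraction of samples landing on the next code carries the sub-LSB information.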



Adding Noise for Finer Resolution

Photo Credit: www.telegraph.co.uk, trinities.org

Mechanical pressure gauges usually do not track small pressure changes well.

The gauge readers may lightly tap the gauges to get more accurate reading.

The idea of dithering at ADC input is similar.



Some Notes on Philosophy

[Diagram labels: Wideband / Low Noise, Narrowband / Noisy, Good, Bad.]

Something good in one condition can be bad in another condition.

And vice versa.



Why Are Band Limiting & Dithering Ignored?

Pre-amplifiers usually have a naturally limited bandwidth and an intrinsic noise larger than the LSB of the ADC.

So much of the time, band limiting and dithering can be “safely” ignored since they are satisfied automatically.

High-bandwidth, low-noise devices are now easily accessible. A design can be too fast and too quiet.

Do not forget to review the band limiting and dithering requirements for each design.









Data Reduction on Liquid Argon TPC Data

Hit waveforms in a TPC carry useful information. Digitizing the waveforms creates a large volume of data. Data reduction without losing useful information is necessary.

[Figure: event display with axes Drift Time and Wire Number (data from BO detector of FNAL), and a waveform plot with vertical axis 0 to 700 and horizontal axis 0 to 2000.]



Slow Variation of Raw Data

[Plot: raw data values between 140 and 160 for sample indices 1100 to 1400.]

More than 99% of the points differ from the previous point by -1, 0 or +1.

Huffman coding can be applied to the differences of the data points.

[Plot: probability P of u(n+1)-u(n) for differences from -8 to 7; series wire0_15, wire16_31, wire32_95.]

[Schematic: a D flip-flop holds the previous sample and a subtractor (A-B) outputs U(n+1)-U(n).]



The Huffman Coding

The U(n+1)-U(n) value with the highest probability is assigned the shortest code, i.e., the single bit 1.

Values with lower probabilities are assigned longer codes, e.g., 01, 001, 0001, etc.

Huffman coded words and regular words are distinguished by bit-15.

U(n+1)-U(n)      Code
-4 and others    Full 16-bit word
-3               000001
-2               0001
-1               01
0                1
+1               001
+2               00001
+3               0000001

Regular ADC data, for the first point or when U(n+1)-U(n) is outside +-3:

1 | 0 0 | ADC value (13-bit)

Huffman coded, differences -1, 0, 0, 0, +1, +2, then padding or continue to next word:

0 | 01 | 1 | 1 | 1 | 001 | 00001 | 0 0

In this example, 6 differences of the data samples are packed in the 16-bit data word.
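A minimal Python sketch of the packing described above, using the code table from the slide. Several details are assumptions read off the word diagrams rather than confirmed by the text: regular words start with 1 0 0 followed by a 13-bit ADC value, Huffman words start with 0 followed by 15 code bits, unused bits are padded with zeros, and codes may continue into the next word. The function names are hypothetical.

```python
# Code table from the slide: difference U(n+1)-U(n) -> Huffman code.
HUFFMAN_CODES = {0: "1", -1: "01", +1: "001", -2: "0001",
                 +2: "00001", -3: "000001", +3: "0000001"}


def flush_payload(payload, words):
    """Pack pending code bits into Huffman words: flag bit 0 plus 15 data bits."""
    while payload:
        chunk, payload = payload[:15], payload[15:]
        words.append("0" + chunk.ljust(15, "0"))  # zero padding is an assumption
    return ""


def encode_words(samples):
    """Encode 13-bit ADC samples into a list of 16-bit words (as bit strings)."""
    words = []
    payload = ""
    previous = None
    for value in samples:
        diff = None if previous is None else value - previous
        if diff in HUFFMAN_CODES:
            payload += HUFFMAN_CODES[diff]
        else:
            # First point, or difference outside +-3: emit a regular word.
            payload = flush_payload(payload, words)
            words.append("1" + "00" + format(value, "013b"))
        previous = value
    flush_payload(payload, words)
    return words


if __name__ == "__main__":
    # Differences after the first point: -1, 0, 0, 0, +1, +2 (the slide's example).
    for word in encode_words([100, 99, 99, 99, 99, 100, 102]):
        print(word)
```

Seven samples fit in two 16-bit words here, one regular word for the first point and one Huffman word for the six differences.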




The Huffman Coding Block

The block is able to operate at up to 250 MHz clock in Altera Cyclone III FPGA devices.

The block uses 245 logic cells, taking 0.6% of an EP3C40F484C6 device ($129) containing 39600 logic cells: (245/39600) x $129 = $0.80.

[Block symbol: HuffmanCoding1, clocked by CK250. Inputs (raw data): D[15..0], DV, D1st, DLast, CK. Outputs (Huffman coded data): Q[15..0], QRDY, DV6Q, D1st6Q, DLast6Q.]



The Schematics of the Huffman Coding Block

[Schematic in four labeled sections: Difference of Data Points, Huffman Code Lookup Table, Huffman Code Composer, Huffman Code or Raw Data Selector. Inputs: D[15..0], DV, CK, D1st, DLast. Outputs: Q[15..0], QRDY, DV6Q, D1st6Q, DLast6Q.]

Notes on the schematic:

Number of bits for Huffman Codes 0: 0(+1), 1: 2(+1), 2: 4(+1), 3: 6(+1)

Number of bits for Huffman Codes -1: 1(+1), -2: 3(+1), -3: 5(+1), -4: 7(+1)

If (NBHC+1+HCSS)>=16, HCSS.d=(0xf&(NBHC+1+HCSS))+1

e.g. NBHC=2, HCSS=14 --> HCSS.d=1

+1 when rollover since 15 bits/word are used for data



The Compression Ratio of Huffman Coding

On typical TPC events a compression ratio of about 10 can be achieved.

The compression ratio is sensitive to high-frequency noise.

[Figure: N raw data words enter the HuffmanCoding1 block (CK250) and N/(10.7) words come out; waveform plot with vertical axis 0 to 700 and horizontal axis 0 to 2000.]









A “Mystery” of Huffman Coding Ratios on Down Sampled Data

The 5 MHz data is down-sampled to 1 MHz. The Huffman coding compression ratio drops from 10.7 to 7.5 when the data is down-sampled.

[Figure: at 5 MHz, N words enter the HuffmanCoding1 block and N/(10.7) come out; after down-sampling, (N/5) words enter and (N/5)/(7.5) come out.]



Averaging in Decimation: A Re-discovery

[Plots: two panels, vertical axis 0 to 1, horizontal axis 0 to 64.]

Simple “down-sampling” is not good. When the decimation factor is D, averaging over D samples is not good either. Averaging over 2*D samples is necessary. There is still aliasing with averaging over 2*D samples, but it is less severe than with averaging over D samples.

Signal frequencies must stay below the Nyquist limit: f < (1/2) x Sampling Frequency.



Weighted Average, The CIC-2 Filter

[Plots: two panels, vertical axis 0 to 1, horizontal axis 0 to 64.]

Filter performance can be further improved with a weighted average over 4*D samples. This filter is called a cascaded integrator-comb filter of order 2 (CIC-2). The CIC-1 filter is the moving average.
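The difference between the averaging choices can be checked numerically. The sketch below is an illustration under assumptions: the CIC-2 weights are built as the convolution of two 2*D-sample moving averages (a triangular window spanning about 4*D samples), D = 5 matches the 5 MHz to 1 MHz example, and the function names are hypothetical.

```python
import cmath


def boxcar(length):
    """Equal weights: plain averaging over `length` samples (CIC-1)."""
    return [1.0 / length] * length


def convolve(a, b):
    """Direct convolution of two weight lists."""
    result = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            result[i + j] += x * y
    return result


def gain(weights, frequency):
    """Magnitude response at `frequency`, in units of the input sampling rate."""
    return abs(sum(w * cmath.exp(-2j * cmath.pi * frequency * k)
                   for k, w in enumerate(weights)))


def decimate(samples, weights, factor):
    """Weighted average of the input, keeping one output per `factor` inputs."""
    n = len(weights)
    return [sum(w * s for w, s in zip(weights, samples[i:i + n]))
            for i in range(0, len(samples) - n + 1, factor)]


if __name__ == "__main__":
    D = 5
    av_d = boxcar(D)
    av_2d = boxcar(2 * D)
    cic2 = convolve(av_2d, av_2d)  # triangular weights spanning about 4*D samples
    for f in (0.5 / D, 0.75 / D):  # at and above the output Nyquist limit
        print(f, gain(av_d, f), gain(av_2d, f), gain(cic2, f))
```

Frequencies above the output Nyquist limit fold back after decimation, so a smaller gain there means less aliasing; the 2*D average has a null exactly at the output Nyquist frequency, and the CIC-2 weights square that response.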



Huffman Coding Ratios for 5MHz to 1MHz

The Huffman coding compression ratio improves as the filter in Dynamic Decimation improves.

[Bar chart: Huffman Coding Compress Ratio (0 to 12) for no deci, no filter, AV5, AV10 and CIC2_20; events R089_E104, R089_E175, R089_E174, R089_E178, R089_E179, R089_E110.]



Dynamic Decimation (DD)

[Plot: waveform with values between 400 and 540, horizontal axis 0 to 2000.]

Only small time intervals, i.e., regions of interest (ROI), must be sampled at a high rate. Most time intervals can be sampled at a lower rate without losing useful information.



A Mystery of Dynamic Decimation & Huffman Coding

Dynamic Decimation reduces the number of samples by a factor of 10. Huffman Coding reduces the number of bits from raw data by a factor of 10. When cascaded, the combination reduces the number of bits by a factor of 60.

[Figure: Dynamic Decimation alone: N to N/10.6. Huffman Coding alone: N to N/10.7. Dynamic Decimation followed by Huffman Coding: N to N/60.]



Huffman Coding Ratios for Dynamic Decimation

The Huffman coding compression ratio improves as the filter in Dynamic Decimation improves.

[Bar chart: Huffman Coding Compress Ratio (0 to 12) for no deci, no filter, AV16, AV32 and CIC2_64; events R089_E104, R089_E175, R089_E174, R089_E178, R089_E179, R089_E110.]



Any Differences?

[Plots: two waveforms, Raw and With Dynamic Decimation, each with vertical axis 0 to 700 and horizontal axis 0 to 2000.]









TDC Using FPGA Logic Chain Delay

This scheme uses current FPGA technology.

Low-cost chip families can be used (e.g., EP2C8T144C6, $31.68).

Fine TDC precision can be implemented in slow devices (e.g., 20 ps in a 400 MHz chip).

[Schematic: delay-chain TDC with inputs IN and CLK.]



Two Major Issues In a Free Operating FPGA

[Plot: bin width (ps), 0 to 180, vs. bin number, 0 to 64.]

1. Widths of bins are different and vary with supply voltage and temperature.

2. Some bins are ultra-wide due to LAB boundary crossing.



Digital Calibration Using Twice-Recording Method

[Schematic: delay-chain TDC with inputs IN and CLK.]

Use a longer delay line. Some signals may be registered twice at two consecutive clock edges.

The two measurements can be used to calibrate the delay and to reduce digitization errors.

N2 - N1 = (1/f) / t, where 1/f is the clock period and t is the average bin width.
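The relation N2 - N1 = (1/f)/t can be turned into a small calibration routine. This Python sketch is illustrative only: the function names, the averaging over several hits, and the example numbers (a 400 MHz clock, i.e., a 2500 ps period) are assumptions, and times are measured from bin 0.

```python
def average_bin_width_ps(pairs, clock_period_ps):
    """Average bin width t from N2 - N1 = (1/f) / t, averaged over many hits."""
    mean_difference = sum(n2 - n1 for n1, n2 in pairs) / len(pairs)
    return clock_period_ps / mean_difference


def hit_time_ps(n1, bin_width_ps):
    """Convert the first recording N1 from bins to picoseconds (from bin 0)."""
    return n1 * bin_width_ps


if __name__ == "__main__":
    # Hypothetical (N1, N2) pairs from hits that were registered twice.
    pairs = [(10, 50), (12, 52), (30, 70)]
    t = average_bin_width_ps(pairs, clock_period_ps=2500.0)
    print(t, hit_time_ps(10, t))
```

If supply voltage or temperature slows the delay chain, N2 - N1 shrinks and the computed bin width grows accordingly, so the conversion tracks the change without any external reference.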



TDC Output at Different PS Voltage

[Plot: TDC outputs (0 to 25) vs. VCCINT (V), 1.5 to 2.5; series N1 and N2.]

Digital Calibration Result

[Plot: TDC outputs (0 to 25) vs. VCCINT (V), 1.5 to 2.5; series N1, N2 and Tc.]

The power supply voltage changes from 2.5 V to 1.8 V (about the same as 100 °C to 0 °C).

The delay speed changes by 30%.

The difference of the two TDC numbers reflects the delay speed.

[Plot labels: N2, N1, Corrected Time.]

Corrected time, with T the clock period: $T_c = T \, \dfrac{N_1 - N_0}{N_2 - N_1}$

Warning: the calibration is based on average bin width, not bin-by-bin widths.



Auto Calibration Using Histogram Method

It provides a bin-by-bin calibration at a certain temperature.

It is a turn-key solution (bin in, ps out).

It is semi-continuous (auto update LUT every 16K events).

[Plots: bin time (ps), 0 to 2500, vs. bin number, 0 to 64; bin width (ps), 0 to 180, vs. bin number, 0 to 64.]

[Block diagram: a DNL Histogram accumulated over 16K events fills the LUT; In (bin) enters the LUT and Out (ps) leaves it.]
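One common way to build such a table, sketched here in Python, assumes the hits are uncorrelated with the clock, so that the number of hits collected in each bin is proportional to that bin's width. Using the bin center as the output time and the function name `build_lut` are assumptions, not details taken from the slides.

```python
def build_lut(histogram, clock_period_ps):
    """Turn per-bin hit counts into a bin -> time (ps) look-up table."""
    total = sum(histogram)
    lut = []
    edge = 0.0  # time at the start of the current bin
    for count in histogram:
        width = clock_period_ps * count / total  # width is proportional to the count
        lut.append(edge + width / 2)             # bin center (a convention assumed here)
        edge += width
    return lut


if __name__ == "__main__":
    # Hypothetical 4-bin histogram covering one 800 ps clock period.
    counts = [1, 3, 2, 2]
    lut = build_lut(counts, clock_period_ps=800.0)
    print(lut)     # bin in, ps out
    print(lut[1])  # time assigned to a hit found in bin 1
```

Rebuilding the table from each new batch of events makes the calibration follow slow temperature and voltage drifts.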



Good, However

Auto calibration solved some problems. However, it won’t eliminate the ultra-wide bins.

[Plot: bin width (ps), 0 to 180, vs. bin number, 0 to 64.]



Cell Delay-Based TDC + Wave Union Launcher

[Schematic: a Wave Union Launcher in front of the delay-chain TDC; inputs In and CLK.]

The wave union launcher creates multiple logic transitions after receiving an input logic step.

Wave union launchers can be classified into two types: Finite Step Response (FSR) and Infinite Step Response (ISR).

This is similar to the classification of filters or other linear systems: Finite Impulse Response (FIR) and Infinite Impulse Response (IIR).



Wave Union Launcher A (FSR Type)

[Schematic: Wave Union Launcher A with inputs In and CLK; control 1: Unleash, 0: Hold.]

Wave Union Launcher A: 2 Measurements/hit

[Schematic: the launcher in the state 1: Unleash.]



Sub-dividing Ultra-wide Bins

[Schematic: the launcher in the state 1: Unleash, with transitions 1 and 2 marked.]

Device: EP2C8T144C6

Plain TDC: Max. bin width: 160 ps. Average bin width: 60 ps.

Wave Union TDC A: Max. bin width: 65 ps. Average bin width: 30 ps.

[Plot: bin width (ps), 0 to 180, vs. bin number, 0 to 128; series Plain TDC and Wave Union TDC A.]



Measurement Result for Wave Union TDC A

[Block diagram labels: 53 MHz Separate Crystal, Wave Union TDC, Raw, LUT, Histogram.]

Plain TDC: delta t RMS width: 40 ps. 25 ps single hit.

Wave Union TDC A: delta t RMS width: 25 ps. 17 ps single hit.

[Histogram: counts (0 to 3500) vs. dt (ps), 1000 to 1500; series Un-calibrated, Plain TDC, Wave Union TDC A.]



More Measurements

Two measurements are better than one. Let’s try 16 measurements?



Wave Union Launcher B (ISR Type)

[Schematic: Wave Union Launcher B with inputs In and CLK; control 1: Oscillate, 0: Hold.]

Wave Union Launcher B: 16 Measurements/hit

[Figure: 1 hit, 16 measurements @ 400 MHz, shown for VCCINT=1.20V and VCCINT=1.18V.]



Delay Correction

[Plots: T0 (ps), 0 to 3000, vs. m, 0 to 16; TDC (bin), 16 to 64, vs. m, 0 to 16.]

Delay Correction Process:

Raw hits TN(m) in bins are first calibrated into TM(m) in picoseconds.

Jumps are compensated for in the FPGA so that TM(m) become T0(m), which have the same value for each hit.

Take the average of T0(m) to get better resolution:

$T_{0,\mathrm{av}} = \frac{1}{16} \sum_{m=0}^{15} T_0(m)$

The raw data contains: U-Type Jumps: [48-63][16-31]. V-Type Jumps: other small jumps. W-Type Jumps: [16-31][48-63].

The processes are all done in the FPGA.



The Test Module

Two NIM inputs

FPGA with 8ch TDC

Data Output via Ethernet

BNC adapter to add delay in 150 ps steps.



Test Result

[Block diagram: NIM inputs 0, 1 and 2 from a LeCroy 429A NIM fan-out pass through NIM/LVDS converters into the Wave Union TDC B channels; BNC adapters add delays in 140 ps steps. Annotations: RMS 10ps, 140ps.]



Multi-Sampling TDC FPGA

[Block diagram: Multiple Sampling with clock phases c0, c90, c180 and c270 (Q0 to Q3), Clock Domain Changing to c0 (QF, QE, QD), Trans. Detection & Encode, and a Coarse Time Counter; outputs DV, T0, T1, TS.]

Ultra low-cost: 48 channels in a $18.27 EP2C5Q208C7.

Sampling rate: 360 MHz x 4 phases = 1.44 GHz.

LSB = 0.69 ns.

[Placement picture: 4 channels (4Ch).]

Logic elements with non-critical timing are freely placed by the fitter of the compiler.

This picture represents a placement in a Cyclone FPGA.



Issues of Coarse Time Counter

There are some common misunderstandings about coarse time counters in a TDC:

Two coarse time counters are needed, driven by clocks with 180 degree phase difference.

The coarse time counter should be a Gray code counter.

Dual counters and/or Gray code counters are only needed in one ASIC TDC architecture.

In the architectures used by FPGA TDCs and some ASIC TDCs, only one plain binary counter is needed as the coarse time counter.

[Figure: coarse time counters and a Gray code counter with the sequence 000, 001, 011, 010, 110, 111, 101, 100.]



Delay Line Based TDC Architectures

[Figure: delay-line TDC architectures with HIT and CLK inputs, labeled Delay Hit, Delay CLK and Delay Both. Annotations: "CLK is used as clock", "HIT is used as clock", and "Only this architecture needs dual coarse time counters."]



Implementation of Coarse Time Counter

[Block diagram labels: Coarse Time Counter, Fine Time Encoder, Hit Detect Logic; signals In, CLK, ENA; outputs Fine Time, Coarse Time, Data Ready.]









Classical Picture of Serial Communications

The parallel data is converted to serial bits driven by crystal oscillator X1 in the transmitter device.

The serial data stream is used to generate a recovered clock at the receiver device with a phase lock loop (PLL).

The recovered clock is used to drive the serial-to-parallel converter and store the data into a first-in-first-out (FIFO) buffer.

The FIFO buffer is used to transfer data from the recovered clock domain to the local clock domain generated by crystal oscillator X2.

[Block diagram: Parallel-to-Serial Converter (crystal X1), PLL producing the Recovered Clock, Serial-to-Parallel Converter, FIFO, and Local Logic (crystal X2).]



Serial Data Receiving Without PLL etc.

Generating a recovered clock with a PLL, VCO, VCXO, etc. is an analog process that is not convenient to implement in an FPGA, especially for applications with multiple receiving channels.

There are purely digital methods to receive the serial data:

Digital Phase Follower: 1bit/CLK

The Two-Cycle Serial IO: 1bit/(2CLK)

FM Encoder and Decoder: 1bit/(2-16CLK)

Clock-Command Combined Carrier Coding (C5): 4bits/(20CLK)

The transmitter and receiver can be driven by two independent free-running crystal oscillators.

[Block diagram: Parallel-to-Serial Converter (crystal X1), Digital Serial-to-Parallel Converter and Local Logic (crystal X2).]



Digital Phase Follower

[Block diagram: the input In goes through Multiple Sampling with clock phases c0, c90, c180 and c270 (Q0 to Q3), Clock Domain Changing to c0 (QF, QE, QD) and Trans. Detection; a SEL stage with was3is0 and was0is3 flags provides b0 and b1 and the Shift0 and Shift2 controls of a Tri-speed Shift Register, followed by Frame Detection and Data Out.]

The input data rate is 1 bit/clock cycle. Four clock phases, c0, c90, c180 and c270, are used to detect the input transition edge. The phase for data sampling follows the variation of the transition edge.



Schematics of Digital Phase Follower

[Schematic: top level with blocks phtrk1 (inputs IN1, CLK0, CLK90, CLK180, CLK270; outputs EN, QQ[11..0], BT, JMP, WTN, EE[3..0]), DS5B (inputs BB, BX, JMP, EN, CLK; outputs Q[4..0], C1, C0) and Word24_13z, followed by the gate-level schematics of phtrk1 and DS5B.]

CLK: 375MHz. Data Rate: 375Mbits/s.



The Two-Cycle Serial IO

This scheme is slower than the digital phase follower but the logic is simpler. CLK1 and CLK2 can be generated with two free-running crystal oscillators.

[Timing diagram: the transmitter (CLK1, Data Out) sends start bit = 1, b15, b14, ...; the receiver (CLK2, Data In) sees start bit = 1, X, b15, X, b14, ...]

One data bit is transmitted every 2 clock cycles.

A logic transition is detected between these two falling edges.

Input data are stable at these clock edges.



Schematics of the Two-Cycle Serial IO

[Schematic: inputs CK200, CK100, DD[15..0], DRDY, DV and SDIN; outputs SDOUT, QQ[15..0], QQOK and POPCMD. Main elements: a 17-bit left shift register (load, shiftout), a 16-bit left shift register (shiftin, q[15..0]), a modulus-36 up counter, and a counter SEQ[5..0] with sset 32. The timing diagram shows SDIN, SEQ (32 to 43), SDINQ, and QQ collecting SD15, SD14, SD13, ...]

CLK: 200MHz. Data Rate: 100Mbits/s.



The FM coding

A bit is transmitted in two unit time intervals, usually two internal clock cycles at frequency f.

For bit=1, the output toggles each cycle, i.e., with frequency (f/2), and for bit=0, the output toggles every two cycles, i.e., with frequency (f/4).

When not transmitting data, the output toggles at frequency (f/4) until the start bit is seen.

The data stream is naturally DC balanced, suitable for AC coupled transmission.

The polarity of the interconnection doesn’t matter.

[Waveform: bit sequence 0, start bit = 1, 0, 0, 1, 1.]
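The coding rule can be sketched in Python as follows. This is a simplified model, not the decoder shown in the schematics: it assumes the receiver already knows where each bit cell starts, and the function names are hypothetical.

```python
def fm_encode(bits, level=0):
    """Two unit intervals per bit: toggle at each bit start, and again mid-bit for a 1."""
    output = []
    for bit in bits:
        level ^= 1        # every bit cell begins with a transition
        output.append(level)
        if bit:
            level ^= 1    # extra mid-bit transition encodes a 1
        output.append(level)
    return output


def fm_decode(levels):
    """A bit is 1 when the two halves of its cell differ, 0 when they match."""
    return [levels[i] ^ levels[i + 1] for i in range(0, len(levels) - 1, 2)]


if __name__ == "__main__":
    bits = [0, 1, 0, 0, 1, 1]  # idle 0, start bit = 1, then data 0 0 1 1
    line = fm_encode(bits)
    inverted = [1 - x for x in line]  # swapped polarity on the interconnection
    print(line)
    print(fm_decode(line), fm_decode(inverted))
```

Only transitions carry information, which is why inverting the line gives the same decoded bits.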



Schematics of FM Decoder

[Schematic: inputs CK212 and INA; outputs DV, DQ[17..0] and PQ. Main signals: INAQ, INATOG, TOGCNT[3..0], INAis0x, SSETFCNT, OKSample, and a 9-bit counter CNTSHFT, BitCNT[4..0], BTK[2..0]. The timing diagram shows INAQ, INATOG, TOGCNT, SSETFCNT, BTK, CNTSHFT, OKSample, BitCNT, DV and the bits DQ[17], DQ[16], ..., DQ[0], PQ.]

Logic 0: INA: 13.25MHz or 8xCK212

BitCNT: 13..31, Init to 13x8+256=360

CLK: 212MHz. Data Rate: 26.5Mbits/s. The ratio of 8 CLK cycles/bit in this design is not an intrinsic limit.



The Clock-Command Combined Carrier Coding (C5)

[Waveform plot: vertical axis 0 to 16, horizontal axis -1 to 6.]

A data train contains 5 pulses and each pulse is transmitted in four unit time intervals, usually four internal clock cycles at frequency f.

Information is carried with wide, normal and narrow pulses, and the first pulse is always wide or narrow.

When not transmitting data, all pulses have normal width.

The data stream is DC balanced over 5 pulses, suitable for AC coupled transmission.

All leading edges are evenly spaced so that the pulse train can be used to directly drive the receiver side logic or a PLL.



Schematics of C5 Decoder

[Schematic: top level with a Cyclone PLL altpll1 (inclk0 period 36.000 ns; c0 ratio 4/1, e0 ratio 1/1), a Delay block (inputs CC, C40; outputs T38, T58) and a Composer block (inputs CC, T38, T58; outputs Y[0..4], CmdValid, CmdBit[3..0]), followed by the gate-level schematics of the Composer and Delay blocks.]

Data Rate: 36ns/bit or 27.7Mbits/s

Internal clock: 111MHz









Fixed Latency Everywhere?

In a classical trigger system, all cables must have fixed propagation delay.

Serial links intrinsically do not have fixed latency.

Do we need fixed latency at all? No.

[Block diagrams: Front End modules connected to the Trigger, in one case through SER and DESER serial links, with a Timing Reference.]



Hit Time Coding and Transmitting

Hits in each channel are coded as bits representing small time intervals.

Bit patterns are merged in a front-end module.

[Figure labels: Detector Processing Board, Hit, 5ns, 40ns, CLK&CMD; rows of hit bits such as 0 1 0 0 0 0 0 1.]



Cable Delay Self Timing

At system initialization, all the Detector Processing Boards send out a special word in the same clock cycle as a start mark.

At the receiving end, the absolute arrival time from each board can be unknown and different. However, the start mark is recognized and stored at address 0 of the corresponding receiving buffer. The words after the start mark are stored in sequence.

[Figure labels: Processing Support Board, Detector Processing Board.]
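The self-timing idea can be modeled in a few lines of Python. The start-mark value `0xFFFF`, the function name, and the example streams are hypothetical; the point is only that buffer addresses, not arrival times, define the alignment.

```python
START_MARK = 0xFFFF  # hypothetical value for the special start word


def align_stream(received_words):
    """Return the receiving buffer: start mark at address 0, later words in sequence."""
    start = received_words.index(START_MARK)  # absolute arrival time is irrelevant
    return received_words[start:]


if __name__ == "__main__":
    board_a = [0, 0, START_MARK, 0x11, 0x12, 0x13]           # shorter cable
    board_b = [0, 0, 0, 0, 0, START_MARK, 0x21, 0x22, 0x23]  # longer cable
    buffer_a = align_stream(board_a)
    buffer_b = align_stream(board_b)
    # The same buffer address now refers to the same clock cycle at the senders.
    print(list(zip(buffer_a, buffer_b)))
```

Words read from the same address of different buffers were sent in the same clock cycle, whatever the cable lengths were.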





An Example

[Figure: two received streams, each consisting of an Initial Marker followed by Data.]



Hit Merging and Coincidence

Hits from different inputs in the Processing Support Board are merged together with an OR function and sent out as a serial data stream.

The Coincidence Module re-aligns the different streams in the receiver buffers.

Inside the Coincidence Module, the coincidence is searched for as AND functions of the hit streams from opposite detector sectors. Very likely, a boundary coverage logic is applied, e.g.: Trigger T[N] = HA[N]&&(HB[N] || HC[N]).

Boundary coverage in the time domain is also necessary. This is satisfied by checking adjacent bits in the buffered words, e.g.: Trigger T[N] = (HA[N+1] || HA[N] || HA[N-1])&&(HB[N] || HC[N]).

[Figure labels: Processing Support Board (two), Coincidence Module.]
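The second trigger expression can be written out directly in Python. The hit streams are modeled as lists of 0/1 values, one per time bin; treating positions outside the buffer as "no hit" is an assumption made here to handle the ends of the list.

```python
def bit(stream, index):
    """Hit bit at `index`, treating positions outside the buffer as no hit."""
    return stream[index] if 0 <= index < len(stream) else 0


def coincidence(ha, hb, hc):
    """T[N] = (HA[N+1] || HA[N] || HA[N-1]) && (HB[N] || HC[N])."""
    triggers = []
    for n in range(len(ha)):
        a_window = bit(ha, n + 1) or bit(ha, n) or bit(ha, n - 1)
        opposite = bit(hb, n) or bit(hc, n)
        triggers.append(1 if (a_window and opposite) else 0)
    return triggers


if __name__ == "__main__":
    ha = [0, 0, 1, 0, 0]
    hb = [0, 1, 0, 0, 0]
    hc = [0, 0, 0, 1, 0]
    print(coincidence(ha, hb, hc))
```

In this example the hit in HA at bin 2 has no exact partner, but the neighboring hits in HB (bin 1) and HC (bin 3) still produce triggers because of the time-domain boundary coverage.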




Post-Scripts

Some Extra Words for the

Young & Old



About FPGA: Myths & Thinking

We commonly hear about FPGA:

FPGA is cheap. (FPGA: $16-$1500, Micro-Processor: $100-$500.)

FPGA is fast. (FPGA: 500MHz, Micro-Processor: 1-3GHz.)

FPGA is large. (FPGA logic consumes more transistors.)

FPGA can do anything. (Only if the information is collected in FPGA.)

Not really. At least it is not always the case.

Good design tricks are needed in order to take full advantage of FPGA devices and to avoid their drawbacks.



Moore’s Law

Number of transistors in a package: x2 every 18 months

Taken from www.intel.com



Status of Moore’s Law: an Inconvenient Truth

# of transistors Yes, via multi-core.

Clock Speed ?

Taken from www.intel.com



Complexity in FPGA Designs

Excessive Complexity in FPGA Designs = Fevers of Moore’s Law + Myths + No Thinking

Complexity causes higher FPGA cost.

Complexity creates indirect costs such as PCB layout, assembly, power consumption, cooling, etc.

Complexity confuses people, including designers.



Indirect Cost of Complexity

If something like this can do the job…

… why do these?



The Winning Line of FPGA Design

We commonly hear:

FPGA devices contain millions of gates.

High parallelism can be implemented in FPGA.

FPGA cost drops by half every 18 months.

We want to emphasize, especially to our young students:

1. Creativity,

2. Creativity,

3. Creativity, on Arithmetic ops, on Algorithms, on Architectures & on All Aspects.

O Freunde, nicht diese Töne!




The End

Thanks
