jsgray@ Jan Gray with FPGAs Teaching Processor and ... · Verilog and VHDL synthesis, ......
Transcript of jsgray@ Jan Gray with FPGAs Teaching Processor and ... · Verilog and VHDL synthesis, ......
Hands-on Computer Architecture –Hands-on Computer Architecture –Teaching Processor and Integrated Systems DesignTeaching Processor and Integrated Systems Design
with FPGAswith FPGAsJan GrayJan Gray
Gray Research LLCGray Research LLCjsgray@[email protected]
lw r12,4(r3)lw r12,4(r3)
Copyright © 1999,2000 Gray Research LLC. All Rights Reserved.
lw r12,4(r3)lw r12,4(r3)
IntroductionIntroduction◆◆ Confession - not an educator; unproven ideasConfession - not an educator; unproven ideas◆◆ Recent large, cheap FPGAs and tools enableRecent large, cheap FPGAs and tools enable
“desktop processor development”“desktop processor development”◆◆ RISC4005 (Freidin 1990)RISC4005 (Freidin 1990)◆◆ j32 (Gray 1995)j32 (Gray 1995)◆◆ XSOC/xr16 Project Kit (Gray 1998)XSOC/xr16 Project Kit (Gray 1998)
◆◆ xr16: simple pipelined 16-bit RISC, 33 MHz, 1.4 CPIxr16: simple pipelined 16-bit RISC, 33 MHz, 1.4 CPI◆◆ XSOC: system-on-a-chip: bus and peripheralsXSOC: system-on-a-chip: bus and peripherals◆◆ Compiler, simulator, specs, documentation, demosCompiler, simulator, specs, documentation, demos
◆◆ Can we (should we) use FPGA CPU and SoCCan we (should we) use FPGA CPU and SoCkits to teach computer architecture?kits to teach computer architecture?
lw r12,4(r3)lw r12,4(r3)
OutlineOutline◆◆ The XSOC/xr16 Project KitThe XSOC/xr16 Project Kit◆◆ Why use a hardware project to teachWhy use a hardware project to teach
architecture?architecture?◆◆ Suggested FPGA-based projectsSuggested FPGA-based projects◆◆ FPGAs for advanced architecture studyFPGAs for advanced architecture study◆◆ Related work and ongoing workRelated work and ongoing work
lw r12,4(r3)lw r12,4(r3)
Enabling DevelopmentsEnabling Developments◆◆ Xilinx Student Edition v1.5 ($100)Xilinx Student Edition v1.5 ($100)
◆◆ Xilinx Foundation Express 1.5Xilinx Foundation Express 1.5◆◆ Verilog and VHDL synthesis, schematic capture, simulationVerilog and VHDL synthesis, schematic capture, simulation◆◆ Xilinx FPGA implementation toolsXilinx FPGA implementation tools
◆◆ Vanden Bout’s Vanden Bout’s Practical Xilinx Designer Lab BookPractical Xilinx Designer Lab Book
◆◆ XESS XS40 ($109-$199)XESS XS40 ($109-$199)◆◆ XC4005 or XC4010, 32-128 KB RAM, 8031, programmableXC4005 or XC4010, 32-128 KB RAM, 8031, programmable
oscillator, parallel, VGA, mouse/keyboard, LED, breadboardoscillator, parallel, VGA, mouse/keyboard, LED, breadboardinterface, interface, tools, documentation, support mailing listtools, documentation, support mailing list
◆◆ LCC, a retargetable C compiler (LCC, a retargetable C compiler (free)free)◆◆ Documented in textbook, small, fairly easy to port or enhanceDocumented in textbook, small, fairly easy to port or enhance
◆◆ Veriwell free Verilog simulatorVeriwell free Verilog simulator
lw r12,4(r3)lw r12,4(r3)
XSOC/xr16 Kit GoalsXSOC/xr16 Kit Goals◆◆ Make integrated system dev more accessibleMake integrated system dev more accessible◆◆ Subject of Subject of Circuit CellarCircuit Cellar article series article series
◆◆ Explain FPGA processor and system designExplain FPGA processor and system design◆◆ Build a fast, simple, thrifty, real pipelined CPUBuild a fast, simple, thrifty, real pipelined CPU◆◆ Build CPU+SoC in a tiny XC4005XLBuild CPU+SoC in a tiny XC4005XL
◆◆ Help launch a free cores development communityHelp launch a free cores development community◆◆ Provide on-chip bus architecture for easy cores reuseProvide on-chip bus architecture for easy cores reuse
◆◆ Show FPGA CPUs can be practical (=cheap)Show FPGA CPUs can be practical (=cheap)◆◆ Show C compiler supportShow C compiler support◆◆ Show efficient FPGA design “best practices”Show efficient FPGA design “best practices”
lw r12,4(r3)lw r12,4(r3)
Software Development ToolsSoftware Development Tools◆◆ Need C compiler for nascent 16-bit RISC:Need C compiler for nascent 16-bit RISC:
port LCC [Fraser and Hanson]port LCC [Fraser and Hanson]◆◆ Choose calling conventionsChoose calling conventions◆◆ mipsmips.md .md →→ xr16.md (500 LOCs changed) xr16.md (500 LOCs changed)◆◆ 32 registers 32 registers →→ 16 registers 16 registers◆◆ sizeof(int)==sizeof(void*)==4 sizeof(int)==sizeof(void*)==4 →→ 2 2◆◆ Misc: branches; long ints; softwareMisc: branches; long ints; software mul mul/div/div
◆◆ Assembler (1300 LOCs)Assembler (1300 LOCs)◆◆ LexLex, parse, fix far branches, apply fixups, emit, parse, fix far branches, apply fixups, emit
◆◆ Instruction set simulator (400 LOCs, 3 MIPS)Instruction set simulator (400 LOCs, 3 MIPS)
lw r12,4(r3)lw r12,4(r3)
LCC Generated CodeLCC Generated Code◆◆ typedef struct TN {typedef struct TN {
int k;int k;struct TN *left,*right;struct TN *left,*right;
} *T;} *T;
TT searchsearch(int k, T p) {(int k, T p) {while (p && p->k != k)while (p && p->k != k)if (p->k < k)if (p->k < k)p = p->right;p = p->right;
elseelsep = p->left;p = p->left;
return p;return p;}}
◆◆ _search_search:: ; r3=k r4=p; r3=k r4=pbr L3br L3
L2:L2: lw r9,(r4)lw r9,(r4)cmp r9,r3cmp r9,r3 ; p->k < k?; p->k < k?bge L5bge L5lw r4,4(r4)lw r4,4(r4) ; p=p->right; p=p->rightbr L6br L6
L5:L5: lw r4,2(r4)lw r4,2(r4) ; p=p->left; p=p->leftL6:L6:L3:L3: mov r9,r4mov r9,r4
cmp r9,r0cmp r9,r0 ; p==0?; p==0?beq L7beq L7lw r9,(r4)lw r9,(r4)cmp r9,r3cmp r9,r3 ; p->k != k?; p->k != k?bne L2bne L2
L7:L7: mov r2,r4mov r2,r4 ; retval=p; retval=pL1:L1: retret
lw r12,4(r3)lw r12,4(r3)
XR16 Instruction Set DesignXR16 Instruction Set Design◆◆ Goals and constraintsGoals and constraints
◆◆ Compact 16-bit instructionsCompact 16-bit instructions◆◆ 16 16-bit registers16 16-bit registers◆◆ Cover integer C operationsCover integer C operations◆◆ KISSKISS◆◆ Fast Fast →→ pipelined pipelined →→ regular regular◆◆ Byte addressable Byte addressable →→
load/store bytes/wordsload/store bytes/words◆◆ dispdisp(reg) addressing(reg) addressing◆◆ Long ints Long ints →→ add with carry, add with carry,
shift extendedshift extended◆◆ Scale to 32-bit registersScale to 32-bit registers
◆◆ Common instructionsCommon instructions◆◆ lw sw lw sw mov mov lea call br cmplea call br cmp◆◆ movmov lea cmp lea cmp →→ add/subadd/sub◆◆ dispdisp(reg) addressing (69%)(reg) addressing (69%)
◆◆ Resolved:Resolved:◆◆ 3 operand: add sub addi3 operand: add sub addi◆◆ 4-bit immediates,4-bit immediates,
12-bit immediate prefix12-bit immediate prefix◆◆ no condition codes;no condition codes;
interlocked add/sub+binterlocked add/sub+bcondcond◆◆ fast call/return fast call/return →→ call/jal call/jal◆◆ mulmul/div/shifts in software/div/shifts in software
lw r12,4(r3)lw r12,4(r3)
XR16 Instruction SetXR16 Instruction SetHex Fmt Assembler Semantics N0dab rrr add rd,ra,rb rd = ra + rb; 11dab rrr sub rd,ra,rb rd = ra – rb; 12dai rri addi rd,ra,imm rd = ra + imm; 13d*b rr {and or xor andn adc
sbc} rd,rbrd = rd op rb; 1
4d*i ri {andi ori xori andniadci sbci slli slxisrai srli srxi} rd,imm
rd = rd op imm; 1
5dai rri lw rd,imm(ra) rd = *(int*)(ra+imm); 26dai rri lb rd,imm(ra) rd = *(byte*)(ra+imm); 28dai rri sw rd,imm(ra) *(int*)(ra+imm) = rd; 29dai rri sb rd,imm(ra) *(byte*)(ra+imm) = rd; 2Adai rri jal rd,imm(ra) rd = pc, pc = ra + imm; 3B*dd br {br brn beq bne bc
bnc bv bnv blt bgeble bgt bltu bgeubleu bgtu} label
if (cond) pc += 2*disp8; *
Ciii i12 call func r15 = pc, pc = imm12<<4; 3Diii i12 imm imm12 imm'next15:4 = imm12; 17xxxExxxFxxx
− reserved reserved
lw r12,4(r3)lw r12,4(r3)
Synthesized InstructionsSynthesized InstructionsInstruction Maps tonop and r0,r0mov rd,ra add rd,ra,r0cmp ra,rb sub r0,ra,rbsubi rd,ra,im addi rd,ra,-imcmpi ra,im addi r0,ra,-imcom rd xori rd,-1lea rd,imm(ra) addi rd,ra,immlbs rd,imm(ra)(load-byte,sign-extending)
lb rd,imm(ra)xori rd,0x80subi rd,0x80
j addr jal r0,addrret jal r0,0(r15)
lw r12,4(r3)lw r12,4(r3)
CompromisesCompromises
◆◆ No multiply/divideNo multiply/divide◆◆ No barrel shifter: 1-bit shifts onlyNo barrel shifter: 1-bit shifts only◆◆ Jumps and taken branches take 3 cyclesJumps and taken branches take 3 cycles
◆◆ Annul the two branch shadow instructionsAnnul the two branch shadow instructions
◆◆ Loads/stores take at least 2 cyclesLoads/stores take at least 2 cycles◆◆ One shared memory portOne shared memory port
◆◆ Simple interrupt/returnSimple interrupt/return◆◆ No floating pointNo floating point
Copyright (C) 2000, Gray Research LLC.This work and its use subject to XSOCLicense Agreement. See LICENSE file.
Project: [None]Macro: XR16Date: 02/23/100
Figure S2: XR16 CPU Schematic
D
DP16RLOC=R5C0
BRDISP[7:0]
LOGICOP[1:0]
RNA[3:0]
RNB[3:0]
PCE
ADD
BCE15_4
BRANCH
CI
IMMOP[5:0]
A15
LOGICT
RETADT
RFWE
SELPC
SLT
SRT
SRI
SUMT
ZXT
ZEROPC
CLK
Z
ADDR[15:0]
RESULT[15:0]
CO
V
N
FWD
IMM[11:0]
LDT
UDT
UDLDT
DMAPC
PCCE
RETCE
C
CTRL16RLOC=R0C0
A15
Z
N
V
RNA[3:0]
RNB[3:0]
RFWE
IMMOP[5:0]
PCE
BCE15_4
ADD
CI
LOGICOP[1:0]
SUMT
LOGICT
ZXT
SRT
SLT
RETADT
CLK
SELPC
ZEROPC
ACE
SRI
FWD
CO
RDY
IMM[11:0]
READN
WORDN
INSN[15:0]
BRDISP[7:0]
BRANCH
DMA
DBUSN
DMAPC
PCCE
RETCE
IREQ
DMAREQ
ZERODMARNA[3:0]
RNB[3:0]
IMM[11:0]
LOGICOP[1:0]
Z,N,CO,V,A_15
AN[15:0]
BRDISP[7:0]
IMMOP[5:0]
D[15:0]INSN[15:0]
CLK
BRANCH
RFWE
FWD
PCE
BCE15_4
SELPC
ZEROPC
ADD
CI
SUMT
LOGICT
DMA
N
CO
ZXT
SRI
SRT
SLT
RETADT
RDY WORDN
UDLDT
V
A_15A_15
Z
N
CO
V
READN
Z
UDT
LDT
ACE
ZERODMA
DMAPC
PCCE
RETCE
IREQ
DMAREQ
DBUSN
lw r12,4(r3)lw r12,4(r3)
Control Unit: Pipeline StagesControl Unit: Pipeline Stages◆◆ IF: Instruction FetchIF: Instruction Fetch
◆◆ Read instruction at addr; prepare addr Read instruction at addr; prepare addr ←← PC += 2; PC += 2;capture instruction in Instruction Register (IR)capture instruction in Instruction Register (IR)
◆◆ DC: Decode and register file accessDC: Decode and register file access◆◆ Decode instruction; read register operands;Decode instruction; read register operands;
prepare immediate field; select operandsprepare immediate field; select operands
◆◆ EX: ExecuteEX: Execute◆◆ Add, logic, shift; select result; write to register fileAdd, logic, shift; select result; write to register file◆◆ call/jal: PC := sumcall/jal: PC := sum◆◆ load/store: addr load/store: addr ←← sum; stop pipe for access sum; stop pipe for access
lw r12,4(r3)lw r12,4(r3)
Control Unit: Pipeline HazardsControl Unit: Pipeline Hazards◆◆ Memory wait statesMemory wait states
◆◆ Stop pipelineStop pipeline
◆◆ Result use data dependenceResult use data dependence◆◆ Result forwarding on one operand onlyResult forwarding on one operand only
◆◆ Load/store memory accessLoad/store memory access◆◆ Stop pipelineStop pipeline
◆◆ Jumps and taken branchesJumps and taken branches◆◆ Annul two following instructionsAnnul two following instructions
◆◆ InterruptsInterrupts◆◆ Force Force jump-to-int-handler instructionjump-to-int-handler instruction when safe when safe
Copyright (C) 2000, Gray Research LLC.This work and its use subject to XSOCLicense Agreement. See LICENSE file.
Project: [None]Macro: CTRL16Date: 02/23/100
"N.C."
"N.C.""N.C."
"N.C."
INSTRUCTION REGISTERS
Figure S4: XR16 CPU Control Unit Schematic
EXECUTE STAGEDC: OPERAND SELECTION
CONTROL STATE MACHINE
DC: CONDITIONAL BRANCHES
INSTRUCTION DECODER
T2
FD4PEINIT=S
D0
D1
D2
D3
Q0
Q1
Q2
Q3
CE
CLK
AND2B1TRUE
TRUTHRLOC=R2C6
COND[3:0]
Z
N
C
V
TRUE
OR2
OR2
EXIRB
BUF16
I[15:0] O[15:0]
IRB
BUF16
I[15:0]O[15:0]
IRMUX
IRMUXUSE_RLOC=FALSE
A[15:0]
B[15:0]
SEL
INT
O[15:0]
NEXTIRFD16CE
C
CE
CLR
D[15:0] Q[15:0]
RNB
RNMUX4RLOC=R2C1
RA[3:0]
RD[3:0]
EXRD[3:0]
SELR15
SELRD
SELSRC
RN[3:0]
FWD
SELR0
RZERO
AND2
CIFDCE
D
C
CE
CLR
Q
AND2
BUF
DECODE
DECODE
OP[3:0]
PCE
CLK
ADDSUB
SUB
NLW
NLD
NLB
IMM12
IMM4
WORDIMM4
NJAL
EXRESULTS
EXLDST
EXLBSB
EXJALI
SEXTIMM4
BR
EXST
FN[3:0]
NSUM
NLOGIC
NSR
NSL
EXFNSRA
ADCSBC
EXNSUB
EXIMM
ST
JALI
RRRI
EXOP[3:0]
EXJAL
NSUB
DCINTINH
AND2B1
BUF
AND2B1
RNA
RNMUX4RLOC=R2C0
RA[3:0]
RD[3:0]
EXRD[3:0]
SELR15
SELRD
SELSRC
RN[3:0]
FWD
SELR0
RZERO
BUF
AND2B1
BRANCHFDCE
D
C
CE
CLR
Q
IRFD16CE
C
CE
CLR
D[15:0] Q[15:0]
AND3B1
AND3B1
BUF
AND4B2
SRI
AND2T1
FD4PEINIT=S
D0
D1
D2
D3
Q0
Q1
Q2
Q3
CE
CLK
FSM
CTRLFSM
RDY
IREQ
DMAREQ
BRANCH
JUMP
EXLBSB
EXST
ACE
PCE
WORDN
READN
DBUSN
PCCE
RETCE
IF
DMAPC
CLK
ZERODMA
EXLDST
ZEROPC
IFINT
SELPC
DCINTINH
EXANNUL
EXAN
DMA
AND2XNOR2
EXIRFD16CE
C
CE
CLR
D[15:0] Q[15:0]
BUFBUF
AND2B1
IMMB
BUF16
I[15:0]O[15:0]
IR[15:0]IRMUX[15:0]
EXOP[3:0]
IMMOP[5:0]
LOGICOP[1:0]
NIR[15:0] EXOP[3:0],EXRD[3:0],BRDISP[7:0]
IMM[11:0]
IR[11:8]
RNB[3:0]
OP[3:0],RD[3:0],RA[3:0],RB[3:0]
EXIR[15:0]
INSN[15:0]
BRDISP[7:0]
IR[7:4]
RA[3:0]
RD[3:0]
EXRD[3:0]
RNA[3:0]
RB[3:0]
RD[3:0]
EXRD[3:0]
OP[3:0]
IR[11:0]
CI
CALL
NSUB
ADCSBC
FWD
ST
RFWEEXANNUL
BRANCH
RETCE
PCE
ADD
NSUB
CLK
NJAL
NLB
PCE
ZXT
SRT
SLT
RETADT
EXJAL
EXLBSB
BCE15_4
ACE
PCE
EXANNUL
CLK
IR3
EXFNSRA
WORDIMM4
ADDSUB
SUB
A15
IMM_12
EXCALL
SEXTIMM4
EXIMM
DCINTINH
EXJAL
NLW
NLD
IMM_12
IMM_4
PCE
EXRESULTS
SRI
NLB
EXIMM
EXANNUL
IMMOP0
WORDIMM4
NJAL
SEXTIMM4
EXRESULTS
CLK
EXCALL
RRRI
CLK
EXCALL
IR0IMMOP5
RZERO
IF
EXNSUB
RZERO
IREQ
NSL
NLW
NLD
NSR
JUMP
EXIR4
EXIR5
CO
LOGICOP1
PCE
LOGICOP0
JUMP
RRRI
LOGICT
EXST
BR
EXLDST TRUE
CLK
EXRESULTS
ADCSBC
NSR
NSL
NLOGIC
NSUM
CLK
NSUM
NLOGIC
CALL
PCE
IMMOP2
IMM_4
N
Z
BRANCHBR
CO
V
CLK
CLK
CLK
CLK
PCE
CLK
WORDN GND
DMAPC
EXANNUL
ST
EXNSUB
PCE
WORDIMM4
SUMT
EXANNUL
IMM_12
EXAN
IF
BRN
PCE
IMMOP1
IMMOP3
EXFNSRA
IR0
IMM_4
IMMOP4
READN
EXLDST
EXLBSB
EXST
EXAN
DCINTINH
IF
IFINT
DBUSN
SELPC
ZEROPC
DMAPC
PCCERDY
DMA
DMAREQ
ZERODMA
PCE
IFINT
GND
Copyright (C) 2000, Gray Research LLC.This work and its use subject to XSOCLicense Agreement. See LICENSE file.
Project: [None]Macro: DP16Date: 02/23/100
EXECUTION UNIT
ADDRESS/PC UNIT
RESULT MUX
Figure S3: XR16 CPU Datapath Schematic
RETBUF
RLOC=R1C9
BUFT16XT
BREGS
REGFILERLOC=R1C1
D[15:0]
A[3:0]
WE
CLK
Q[15:0]
LOGICBUF
RLOC=R1C4
BUFT16XT
SRBUF
RLOC=R1C1
BUFT16XT
IMMED
IMM16RLOC=R1C3
B[15:0]
IR[11:0]
O[15:0]
OP[5:0]
B
FD12E4ERLOC=R1C3
D[15:0] Q[15:0]
CE15_4
CE3_0
CLK
AREGS
REGFILERLOC=R1C0
D[15:0]
A[3:0]
WE
CLK
Q[15:0]
A
FD16ERLOC=R1C2
D[15:0]
CE
CLK
Q[15:0]
ADDSUB
RLOC=R-1C5
ADSU16
A[15:0]
ADD
B[15:0]
CI
COOFL
S[15:0]
LOGIC
LOGIC16RLOC=R1C4
A[15:0]
B[15:0]
OP[1:0]
O[15:0]
SUMBUF
RLOC=R1C5
BUFT16XTBUF
SLBUF
RLOC=R1C0
BUFT16XT
PCINCR
RLOC=R-1C8
ADD16
A[15:0]
B[15:0]
CI
COOFL
S[15:0]
ADDRMUX
M2_16ZRLOC=R1C7
A[15:0]
B[15:0]
SEL
O[15:0]
ZERO
PC
RAM16X16SRLOC=R1C9
A[3:0]
D[15:0]
WE
WCLK
O[15:0]
RET
FD16ERLOC=R1C9
D[15:0]
CE
CLK
Q[15:0]PCDISP
PCDISP16RLOC=R1C6
BRDISP[7:0]
BRANCH
PCDISP[15:0]
ZHBUF
RLOC=R1C2
BUFT8XT
DOUT
FD16ERLOC=R1C4
D[15:0]
CE
CLK
Q[15:0]
FWD
M2_16RLOC=R1C2
A[15:0]
B[15:0]
SEL
O[15:0]
Z
ZERODETRLOC=R1C6
I[15:0]Z
UDLDBUF
RLOC=R5C2
BUFT8XT
UDBUF
RLOC=R1C3
BUFT8XT
LDBUF
RLOC=R5C3
BUFT8XT
RNA[3:0]
RNB[3:0]
AREG[15:0]
BREG[15:0] B[15:0]
AMUX[15:0]
BMUX[15:0]
RETAD[15:0]
A[15:0]
SUM[15:0]
LOGIC[15:0]
PCNEXT[15:0]
IMMOP[5:0]
PCDISP[15:0]
IMM[11:0]
DOUT[7:0]
BRDISP[7:0]
G,G,G,G,G,G,G,G
RESULT[15:8]
RESULT[15:8]
SRI,A[15:1]
RESULT[7:0]
RESULT[15:0]
PC[15:0]
DOUT[15:8] RESULT[7:0]
DOUT[15:8]
LOGICOP[1:0]
DOUT[15:0]
A[14:0],G
ADDR[15:0]
G,G,G,DMAPC
CLK
RFWE
ZXT
CLK
CLK
RFWE
CLK
PCE
SLT
Z
RETADT
PCCE
LDT
GND
SRT
SUMT
LOGICT
CI
ADD
FWD
N
A15
CLK
CO
SELPC
ZEROPC
CLK
CLK
BCE15_4
V
G
SUM15
PCE
BRANCH
PCESRI
UDLDT
UDT
RETCE
DMAPC
lw r12,4(r3)lw r12,4(r3)
Datapath FloorplanDatapath Floorplan
AREG
S, A
REG
, SLB
UF
BREG
S, B
REG
, SR
BUF
FWD
, A, U
DLD
BUF,
ZH
BUF
IMM
ED, B
, LD
BUF,
UD
BUF
LOG
IC, D
OU
T, L
OG
ICBU
F
ADD
SUB,
SU
MBU
F
PCD
ISP,
Z
ADD
RM
UX
PCIN
CR
PC, R
ET, R
ETBU
F
lw r12,4(r3)lw r12,4(r3)
XSOC System-on-a-ChipXSOC System-on-a-Chip◆◆ Make integrated system dev more accessibleMake integrated system dev more accessible◆◆ Integrated system:Integrated system:
◆◆ Processor, memory, memory controller, on-chipProcessor, memory, memory controller, on-chipbus, parallel port, ram, video controllerbus, parallel port, ram, video controller
◆◆ 32 KB SRAM 32 KB SRAM →→ 576x455 monochrome display576x455 monochrome display
lw r12,4(r3)lw r12,4(r3)
“System-on-a-Chip” in Context“System-on-a-Chip” in Context
4701K4701K4701K
R
G
B
VGA PORT
/HSYNC/VSYNC
XC4005XL FPGA
XA[14:1]XA_0
XD[7:0]RAMNCERAMNOERAMNWE
RESET8031
R1R0G1G0B1B0
NHSYNCNVSYNC
CLK
PAR_D[5:0]
PAR_S[6:3]
32 KB SRAM
A[14:1]A0D[7:0]/CS/OE/WE
D[5:0]
S[6:3]
PARALLELPORT
12 MHzOSC.
Copyright (C) 2000, Gray Research LLC.This work and its use subject to XSOCLicense Agreement. See LICENSE file.
Project: XSOCSheet: XSOCDate: 06/07/99
EXTERNAL BUS AND SRAM INTERFACE
P
XR16RLOC_ORIGIN=R1C3
CLK
ACE
AN[15:0]
RDY
READN
WORDN
D[15:0]
UDT
LDT
INSN[15:0]
DMA
UDLDT
DBUSN
IREQ
DMAREQ
ZERODMA
OPAD
BUFGP
STARTUPSTARTUP
GSR
GTS
CLK
Q3
Q2
Q1Q4
DONEIN
IBUF
OPAD
INV
OBUF
MEMCTRL
MEMCTRL
AN[15:0]
ACE
WORDN
READN
RDY
UDT
LDT
RAMNOE
RAMNWE
XA_0
XDOUTT
UXDT
LXDT
SEL[7:0]
CLK
CTRL[15:0]UDLDT
REQ
ACK
DMA
DBUSN
IREQ
DMAREQ
ZERODMA RESET
OPAD
IRAM
XRAM16X16RLOC_ORIGIN=R7C14
CTRL[15:0]
D[15:0]
SEL
CLK
OPAD
OBUFOPAD
OBUF OPAD
OBUF
OPAD
XAOFDX16
C
CED[15:0] Q[15:0]
OBUF
OBUF
OPAD
IPADIBUF
XDINIFD8
C
D[7:0] Q[7:0]
IBUF8
UXDBUFT8
T
LXDBUFT8
T
OPAD16O[15:0]
XDOUT
OBUFT8T
OBUFOBUF
BUF
OPAD
PARIN
XIN8RLOC_ORIGIN=R11C13
I[7:0]
CTRL[15:0]
D[7:0]
SEL
CLK
BUF
IPAD
IPAD
IPAD
IPAD
IPAD
IPAD
OPAD
OPAD
IBUFIBUFIBUFIBUFIBUF
OPAD
VGA
VGARLOC_ORIGIN=R7C1
CLK
NHSYNC
NVSYNC
R1
R0
G1
G0
B1
B0
PIX[15:0]
REQ
ACK
RESET
OBUFOBUF
OPAD
OPAD
OBUFOBUF
OPAD
OPAD
OBUF
IOPAD8IO[7:0]
PAROUT
XOUT4
CTRL[15:0]
D[7:0]
SEL
CLK
Q0
Q1
Q2
Q3
OPADOBUF4
AN[15:0]
PAR[7:0]
D[7:0]
XDIN[15:0]
XDIN[15:8]
XDIN[7:0]
XD[7:0]
D[7:0]
D[15:8]
CTRL[15:0]
D[15:0]
D[7:0]
XA[15:0]
SEL[7:0]
XDIN[15:0]
D[7:0]
NCLK
WORDN
CLK
READN
CLK
VCC
RAMNOE
CLK
CLK
RESET8031
RAMNWE
RAMNCE
ACE
GND
LXDT
CLKIN
VRESET
VREQ
VRESET
XA_0
XDOUTT
UXDT
NCLK
LXDT
UXDT
CLK
B0
XDOUTTG0
VREQ
SEL1
B1
CLK
G1
SEL0
DMA
PAR_D0
PAR_D1
PAR_D2
PAR_D3
PAR_D4
PAR_D5
PAR_S5
PAR_S4
PAR7
PAR6
PAR5
PAR4
PAR3
PAR2
PAR1
PAR0
CLK
NHSYNC
NVSYNC
R1
R0
CLK
UDLDT
LDT
UDT
RDY
ZERODMA
DMAREQ
IREQ
VACK VACK
DBUSN
SEL2PAR_S6
PAR_S3
GND
lw r12,4(r3)lw r12,4(r3)
On-Chip BusOn-Chip Bus◆◆ 16-bit pipelined on-chip data bus16-bit pipelined on-chip data bus◆◆ Bus/memory controllerBus/memory controller
◆◆ Address decoding, bus controls (output enables,Address decoding, bus controls (output enables,clock enables), external SRAM controlclock enables), external SRAM control
◆◆ Make core reuse easy: Make core reuse easy: no glue logic requiredno glue logic required◆◆ Abstract control signal busAbstract control signal bus◆◆ Locally decoded within each coreLocally decoded within each core◆◆ Just add core, attach data, control, and select linesJust add core, attach data, control, and select lines◆◆ Can evolve bus with new features withoutCan evolve bus with new features without
invalidating existing designs and coresinvalidating existing designs and cores
lw r12,4(r3)lw r12,4(r3)
Video ControllerVideo Controller◆◆ 32 KB 32 KB →→ 576x455 monochrome bitmap576x455 monochrome bitmap◆◆ VGA compatible sync timingsVGA compatible sync timings◆◆ Video signal generationVideo signal generation
◆◆ A series of frames, each a series of lines, each aA series of frames, each a series of lines, each aseries of pixelsseries of pixels
◆◆ Fetch 16-pixel words, shifts them out seriallyFetch 16-pixel words, shifts them out serially◆◆ DMA read a new word every 8 clocksDMA read a new word every 8 clocks
◆◆ Drive horizontal and vertical sync to “frame” pixelsDrive horizontal and vertical sync to “frame” pixels◆◆ 2 10-bit counters and 4 10-bit comparators2 10-bit counters and 4 10-bit comparators
lw r12,4(r3)lw r12,4(r3)
System Bring UpSystem Bring Up◆◆ 11/98: Proposal, compiler, instruction set,11/98: Proposal, compiler, instruction set,
datapathdatapath◆◆ 12/98: Control unit, simulate CPU, target12/98: Control unit, simulate CPU, target
XS40 board, 1 Hz, 20 MHzXS40 board, 1 Hz, 20 MHz◆◆ Debugging with LEDs and stop: Debugging with LEDs and stop: goto goto stop;stop;◆◆ Later added on-chip bus, external SRAMLater added on-chip bus, external SRAM
interface, DMA, video, and interruptsinterface, DMA, video, and interrupts◆◆ Testing challengeTesting challenge
lw r12,4(r3)lw r12,4(r3)
Hands-on Computer ArchitectureHands-on Computer Architecture◆◆ Build a working computer in an FPGABuild a working computer in an FPGA◆◆ Value and appeal of building real hardwareValue and appeal of building real hardware◆◆ Teaching performance tradeoffsTeaching performance tradeoffs
◆◆ CPICPI vs vs. area. area vs vs. cycle time . cycle time vsvs. power. power◆◆ FPGA EDA tools close the loop,FPGA EDA tools close the loop,
provide “realistic” feedbackprovide “realistic” feedback◆◆ Cost of everythingCost of everything◆◆ Critical paths and retimingCritical paths and retiming◆◆ Interconnect and floorplanning issuesInterconnect and floorplanning issues
◆◆ Makes possible Makes possible quantitative analysisquantitative analysis
lw r12,4(r3)lw r12,4(r3)
Other BenefitsOther Benefits◆◆ Cores approach helps teach design reuse andCores approach helps teach design reuse and
embedded system designembedded system design◆◆ No hand-waving allowedNo hand-waving allowed◆◆ Learning the testing imperativeLearning the testing imperative◆◆ Living the “system bring up” experienceLiving the “system bring up” experience◆◆ Can simplify instruction set for teachingCan simplify instruction set for teaching
◆◆ Simple enough to understand end-to-end and allSimple enough to understand end-to-end and allthe way downthe way down
◆◆ Living smallLiving small◆◆ FPGA design vocational trainingFPGA design vocational training
lw r12,4(r3)lw r12,4(r3)
Senior Project IdeasSenior Project Ideas◆◆ Implement a processor core given its specImplement a processor core given its spec◆◆ Improve performance through pipeliningImprove performance through pipelining◆◆ Add a cache, MMU, or exceptionsAdd a cache, MMU, or exceptions◆◆ Accelerate some inner-loop C code: add aAccelerate some inner-loop C code: add a
coprocessor or custom instructionscoprocessor or custom instructions◆◆ Add a new SoC peripheral core, plusAdd a new SoC peripheral core, plus
interrupt handler/device driver, and testinginterrupt handler/device driver, and testing◆◆ Develop a test suite to verify the CPU/systemDevelop a test suite to verify the CPU/system◆◆ Competitive CPU design contest - Competitive CPU design contest - perfperf/power/power
lw r12,4(r3)lw r12,4(r3)
Get Real!Get Real!◆◆ FPGA CPUs/SoCs a realistic senior project?FPGA CPUs/SoCs a realistic senior project?
◆◆ Certainly for computer engineering students withCertainly for computer engineering students witha prerequisite HDL/FPGA digital design coursea prerequisite HDL/FPGA digital design course
◆◆ Maybe (maybe not) for CS studentsMaybe (maybe not) for CS students
◆◆ Vary degree of difficulty using kit approachVary degree of difficulty using kit approach◆◆ Amortize material and cost across set ofAmortize material and cost across set of
courses (intro digital design, compilers,courses (intro digital design, compilers,computer architecture, operating systems)computer architecture, operating systems)
◆◆ Danger: teach arch, not EDA tools quirksDanger: teach arch, not EDA tools quirks
lw r12,4(r3)lw r12,4(r3)
FPGA CPUs for AdvancedFPGA CPUs for AdvancedArchitecture StudiesArchitecture Studies
◆◆ Exploit improvements in FPGAs (2Exploit improvements in FPGAs (266 in 7 years) in 7 years)◆◆ New embedded RAM blocks (Virtex and APEX)New embedded RAM blocks (Virtex and APEX)
◆◆ xr16: 270 logic cells; xr32: 430-700 logic cellsxr16: 270 logic cells; xr32: 430-700 logic cells◆◆ V3200E: 73,000 logic cells!V3200E: 73,000 logic cells!◆◆ V600E: 15,000 cells + 72 256x16 SRAMsV600E: 15,000 cells + 72 256x16 SRAMs
◆◆ 8-way chip-multiprocessors 8-way chip-multiprocessors easyeasy◆◆ 16-way 16-way MPsMPs with 8 threads/P, 256x32 I$/P, and with 8 threads/P, 256x32 I$/P, and
shared 1Kx32 L2$ shared 1Kx32 L2$ feasiblefeasible◆◆ Fast I/O: 200 MHz ZBT SSRAM, 266 MHzFast I/O: 200 MHz ZBT SSRAM, 266 MHz
DDR SDRAM, up to 300 Mbps/pin chip-chip!DDR SDRAM, up to 300 Mbps/pin chip-chip!
lw r12,4(r3)lw r12,4(r3)
Applications of Block RAMsApplications of Block RAMs◆◆ ProcessorProcessor
◆◆ Register files: vector, windowed, multi-contextRegister files: vector, windowed, multi-context◆◆ On-chip RAM, ROM, microcode, stacksOn-chip RAM, ROM, microcode, stacks◆◆ Caches: data, tags, write buffersCaches: data, tags, write buffers◆◆ Branch history tables, branch target cachesBranch history tables, branch target caches◆◆ MMUs: segment registers, MMUs: segment registers, TLBs TLBs (not (not assocassoc))◆◆ Debug and tracing supportDebug and tracing support
◆◆ SystemSystem◆◆ Interconnects: packet/cell buffers, queuesInterconnects: packet/cell buffers, queues◆◆ Graphics: video line buffers, texture caches, display lists,Graphics: video line buffers, texture caches, display lists,
span buffers, color LUTsspan buffers, color LUTs◆◆ GC: write barrier support: page attributes, card marking bitsGC: write barrier support: page attributes, card marking bits◆◆ Multimedia: pixel blocks, Multimedia: pixel blocks, coeff coeff and compression tablesand compression tables
lw r12,4(r3)lw r12,4(r3)
Advanced Project IdeasAdvanced Project Ideas◆◆ Starting with a scalar kit, design and build aStarting with a scalar kit, design and build a
◆◆ 2-3 operation LIW2-3 operation LIW◆◆ 2-issue superscalar2-issue superscalar◆◆ Chip multiprocessor (+ memory system)Chip multiprocessor (+ memory system)◆◆ Multithreaded processorMultithreaded processor◆◆ Fault tolerant processorFault tolerant processor
◆◆ Or add architectural support forOr add architectural support for◆◆ network routing, packet inspection, message passingnetwork routing, packet inspection, message passing◆◆ non-pausing multithreaded GCnon-pausing multithreaded GC◆◆ hybrid processor + DSP datapathhybrid processor + DSP datapath
lw r12,4(r3)lw r12,4(r3)
Related WorkRelated Work◆◆ Virginia TechVirginia Tech, Rapid Prototyping, Rapid Prototyping, 16-bit, 16-bit
HOKIE RISC processor in XC4010HOKIE RISC processor in XC4010◆◆ Cornell EE475 Cornell EE475 Microprocessor ArchitecturesMicroprocessor Architectures,,
VHDL design and FPGA verification (simpleVHDL design and FPGA verification (simpleand pipelined)and pipelined)
◆◆ Hiroshima City University, Hiroshima City University, City-1City-1, HDL, HDLdesign of RISC or CISC CPU into XC4010design of RISC or CISC CPU into XC4010
◆◆ Hamblen, Georgia Tech, CPU designHamblen, Georgia Tech, CPU design(including MIPS subset), HDL, into Altera(including MIPS subset), HDL, into Altera10K20s and 10K70s10K20s and 10K70s
lw r12,4(r3)lw r12,4(r3)
Ongoing WorkOngoing Work◆◆ Near termNear term
◆◆ Package XSOC/xr16 distribution for beta test Package XSOC/xr16 distribution for beta test ✓✓◆◆ Retarget to Verilog Retarget to Verilog ✓✓◆◆ Port to Virtex and Altera architecturesPort to Virtex and Altera architectures
◆◆ Longer termLonger term◆◆ Help build a communityHelp build a community◆◆ xr32 xr32 ½ ½ ✓✓◆◆ Port GCC/binutils; build eCOS, Linux?Port GCC/binutils; build eCOS, Linux?◆◆ Chip multiprocessorsChip multiprocessors
lw r12,4(r3)lw r12,4(r3)
ConclusionsConclusions◆◆ FPGA CPUs are not only feasible, they’re aFPGA CPUs are not only feasible, they’re a
good idea for embedded systems developmentgood idea for embedded systems development◆◆ Building FPGA CPUs and SoCs makeBuilding FPGA CPUs and SoCs make
computer architecture real, tradeoffscomputer architecture real, tradeoffsquantitative, and the lessons tangiblequantitative, and the lessons tangible
◆◆ Project labs based upon FPGA CPU kits canProject labs based upon FPGA CPU kits canaccommodate a range of studies and expertiseaccommodate a range of studies and expertise
◆◆ Students in graduate level comp arch coursesStudents in graduate level comp arch coursescan now can now buildbuild multiprocessors, etc. multiprocessors, etc.
◆◆ What do you think?What do you think?
Hands-on Computer Architecture –Hands-on Computer Architecture –Teaching Processor and Integrated Systems DesignTeaching Processor and Integrated Systems Design
with FPGAswith FPGAsJan GrayJan Gray
Gray Research LLCGray Research LLCjsgray@[email protected]
lw r12,4(r3)lw r12,4(r3)
Copyright © 1999,2000 Gray Research LLC. All Rights Reserved.