jsgray@ Jan Gray with FPGAs Teaching Processor and ... · Verilog and VHDL synthesis, ......

Hands-on Computer Architecture –Hands-on Computer Architecture –Teaching Processor and Integrated Systems DesignTeaching Processor and Integrated Systems Design

with FPGAswith FPGAsJan GrayJan Gray

Gray Research LLCGray Research LLCjsgray@[email protected]

lw r12,4(r3)lw r12,4(r3)

Copyright © 1999,2000 Gray Research LLC. All Rights Reserved.

lw r12,4(r3)lw r12,4(r3)

IntroductionIntroduction◆◆ Confession - not an educator; unproven ideasConfession - not an educator; unproven ideas◆◆ Recent large, cheap FPGAs and tools enableRecent large, cheap FPGAs and tools enable

“desktop processor development”“desktop processor development”◆◆ RISC4005 (Freidin 1990)RISC4005 (Freidin 1990)◆◆ j32 (Gray 1995)j32 (Gray 1995)◆◆ XSOC/xr16 Project Kit (Gray 1998)XSOC/xr16 Project Kit (Gray 1998)

◆◆ xr16: simple pipelined 16-bit RISC, 33 MHz, 1.4 CPIxr16: simple pipelined 16-bit RISC, 33 MHz, 1.4 CPI◆◆ XSOC: system-on-a-chip: bus and peripheralsXSOC: system-on-a-chip: bus and peripherals◆◆ Compiler, simulator, specs, documentation, demosCompiler, simulator, specs, documentation, demos

◆◆ Can we (should we) use FPGA CPU and SoCCan we (should we) use FPGA CPU and SoCkits to teach computer architecture?kits to teach computer architecture?

lw r12,4(r3)lw r12,4(r3)

OutlineOutline◆◆ The XSOC/xr16 Project KitThe XSOC/xr16 Project Kit◆◆ Why use a hardware project to teachWhy use a hardware project to teach

architecture?architecture?◆◆ Suggested FPGA-based projectsSuggested FPGA-based projects◆◆ FPGAs for advanced architecture studyFPGAs for advanced architecture study◆◆ Related work and ongoing workRelated work and ongoing work

lw r12,4(r3)lw r12,4(r3)

Enabling DevelopmentsEnabling Developments◆◆ Xilinx Student Edition v1.5 ($100)Xilinx Student Edition v1.5 ($100)

◆◆ Xilinx Foundation Express 1.5Xilinx Foundation Express 1.5◆◆ Verilog and VHDL synthesis, schematic capture, simulationVerilog and VHDL synthesis, schematic capture, simulation◆◆ Xilinx FPGA implementation toolsXilinx FPGA implementation tools

◆◆ Vanden Bout’s Vanden Bout’s Practical Xilinx Designer Lab BookPractical Xilinx Designer Lab Book

◆◆ XESS XS40 ($109-$199)XESS XS40 ($109-$199)◆◆ XC4005 or XC4010, 32-128 KB RAM, 8031, programmableXC4005 or XC4010, 32-128 KB RAM, 8031, programmable

oscillator, parallel, VGA, mouse/keyboard, LED, breadboardoscillator, parallel, VGA, mouse/keyboard, LED, breadboardinterface, interface, tools, documentation, support mailing listtools, documentation, support mailing list

◆◆ LCC, a retargetable C compiler (LCC, a retargetable C compiler (free)free)◆◆ Documented in textbook, small, fairly easy to port or enhanceDocumented in textbook, small, fairly easy to port or enhance

◆◆ Veriwell free Verilog simulatorVeriwell free Verilog simulator

lw r12,4(r3)lw r12,4(r3)

XSOC/xr16 Kit GoalsXSOC/xr16 Kit Goals◆◆ Make integrated system dev more accessibleMake integrated system dev more accessible◆◆ Subject of Subject of Circuit CellarCircuit Cellar article series article series

◆◆ Explain FPGA processor and system designExplain FPGA processor and system design◆◆ Build a fast, simple, thrifty, real pipelined CPUBuild a fast, simple, thrifty, real pipelined CPU◆◆ Build CPU+SoC in a tiny XC4005XLBuild CPU+SoC in a tiny XC4005XL

◆◆ Help launch a free cores development communityHelp launch a free cores development community◆◆ Provide on-chip bus architecture for easy cores reuseProvide on-chip bus architecture for easy cores reuse

◆◆ Show FPGA CPUs can be practical (=cheap)Show FPGA CPUs can be practical (=cheap)◆◆ Show C compiler supportShow C compiler support◆◆ Show efficient FPGA design “best practices”Show efficient FPGA design “best practices”

lw r12,4(r3)lw r12,4(r3)

Software Development ToolsSoftware Development Tools◆◆ Need C compiler for nascent 16-bit RISC:Need C compiler for nascent 16-bit RISC:

port LCC [Fraser and Hanson]port LCC [Fraser and Hanson]◆◆ Choose calling conventionsChoose calling conventions◆◆ mipsmips.md .md →→ xr16.md (500 LOCs changed) xr16.md (500 LOCs changed)◆◆ 32 registers 32 registers →→ 16 registers 16 registers◆◆ sizeof(int)==sizeof(void*)==4 sizeof(int)==sizeof(void*)==4 →→ 2 2◆◆ Misc: branches; long ints; softwareMisc: branches; long ints; software mul mul/div/div

◆◆ Assembler (1300 LOCs)Assembler (1300 LOCs)◆◆ LexLex, parse, fix far branches, apply fixups, emit, parse, fix far branches, apply fixups, emit

◆◆ Instruction set simulator (400 LOCs, 3 MIPS)Instruction set simulator (400 LOCs, 3 MIPS)

lw r12,4(r3)lw r12,4(r3)

LCC Generated CodeLCC Generated Code◆◆ typedef struct TN {typedef struct TN {

int k;int k;struct TN *left,*right;struct TN *left,*right;

} *T;} *T;

TT searchsearch(int k, T p) {(int k, T p) {while (p && p->k != k)while (p && p->k != k)if (p->k < k)if (p->k < k)p = p->right;p = p->right;

elseelsep = p->left;p = p->left;

return p;return p;}}

◆◆ _search_search:: ; r3=k r4=p; r3=k r4=pbr L3br L3

L2:L2: lw r9,(r4)lw r9,(r4)cmp r9,r3cmp r9,r3 ; p->k < k?; p->k < k?bge L5bge L5lw r4,4(r4)lw r4,4(r4) ; p=p->right; p=p->rightbr L6br L6

L5:L5: lw r4,2(r4)lw r4,2(r4) ; p=p->left; p=p->leftL6:L6:L3:L3: mov r9,r4mov r9,r4

cmp r9,r0cmp r9,r0 ; p==0?; p==0?beq L7beq L7lw r9,(r4)lw r9,(r4)cmp r9,r3cmp r9,r3 ; p->k != k?; p->k != k?bne L2bne L2

L7:L7: mov r2,r4mov r2,r4 ; retval=p; retval=pL1:L1: retret

lw r12,4(r3)lw r12,4(r3)

XR16 Instruction Set DesignXR16 Instruction Set Design◆◆ Goals and constraintsGoals and constraints

◆◆ Compact 16-bit instructionsCompact 16-bit instructions◆◆ 16 16-bit registers16 16-bit registers◆◆ Cover integer C operationsCover integer C operations◆◆ KISSKISS◆◆ Fast Fast →→ pipelined pipelined →→ regular regular◆◆ Byte addressable Byte addressable →→

load/store bytes/wordsload/store bytes/words◆◆ dispdisp(reg) addressing(reg) addressing◆◆ Long ints Long ints →→ add with carry, add with carry,

shift extendedshift extended◆◆ Scale to 32-bit registersScale to 32-bit registers

◆◆ Common instructionsCommon instructions◆◆ lw sw lw sw mov mov lea call br cmplea call br cmp◆◆ movmov lea cmp lea cmp →→ add/subadd/sub◆◆ dispdisp(reg) addressing (69%)(reg) addressing (69%)

◆◆ Resolved:Resolved:◆◆ 3 operand: add sub addi3 operand: add sub addi◆◆ 4-bit immediates,4-bit immediates,

12-bit immediate prefix12-bit immediate prefix◆◆ no condition codes;no condition codes;

interlocked add/sub+binterlocked add/sub+bcondcond◆◆ fast call/return fast call/return →→ call/jal call/jal◆◆ mulmul/div/shifts in software/div/shifts in software

lw r12,4(r3)lw r12,4(r3)

XR16 Instruction SetXR16 Instruction SetHex Fmt Assembler Semantics N0dab rrr add rd,ra,rb rd = ra + rb; 11dab rrr sub rd,ra,rb rd = ra – rb; 12dai rri addi rd,ra,imm rd = ra + imm; 13d*b rr {and or xor andn adc

sbc} rd,rbrd = rd op rb; 1

4d*i ri {andi ori xori andniadci sbci slli slxisrai srli srxi} rd,imm

rd = rd op imm; 1

5dai rri lw rd,imm(ra) rd = *(int*)(ra+imm); 26dai rri lb rd,imm(ra) rd = *(byte*)(ra+imm); 28dai rri sw rd,imm(ra) *(int*)(ra+imm) = rd; 29dai rri sb rd,imm(ra) *(byte*)(ra+imm) = rd; 2Adai rri jal rd,imm(ra) rd = pc, pc = ra + imm; 3B*dd br {br brn beq bne bc

bnc bv bnv blt bgeble bgt bltu bgeubleu bgtu} label

if (cond) pc += 2*disp8; *

Ciii i12 call func r15 = pc, pc = imm12<<4; 3Diii i12 imm imm12 imm'next15:4 = imm12; 17xxxExxxFxxx

− reserved reserved

lw r12,4(r3)lw r12,4(r3)

Synthesized InstructionsSynthesized InstructionsInstruction Maps tonop and r0,r0mov rd,ra add rd,ra,r0cmp ra,rb sub r0,ra,rbsubi rd,ra,im addi rd,ra,-imcmpi ra,im addi r0,ra,-imcom rd xori rd,-1lea rd,imm(ra) addi rd,ra,immlbs rd,imm(ra)(load-byte,sign-extending)

lb rd,imm(ra)xori rd,0x80subi rd,0x80

j addr jal r0,addrret jal r0,0(r15)

lw r12,4(r3)lw r12,4(r3)

CompromisesCompromises

◆◆ No multiply/divideNo multiply/divide◆◆ No barrel shifter: 1-bit shifts onlyNo barrel shifter: 1-bit shifts only◆◆ Jumps and taken branches take 3 cyclesJumps and taken branches take 3 cycles

◆◆ Annul the two branch shadow instructionsAnnul the two branch shadow instructions

◆◆ Loads/stores take at least 2 cyclesLoads/stores take at least 2 cycles◆◆ One shared memory portOne shared memory port

◆◆ Simple interrupt/returnSimple interrupt/return◆◆ No floating pointNo floating point

Copyright (C) 2000, Gray Research LLC.This work and its use subject to XSOCLicense Agreement. See LICENSE file.

Project: [None]Macro: XR16Date: 02/23/100

Figure S2: XR16 CPU Schematic

D

DP16RLOC=R5C0

BRDISP[7:0]

LOGICOP[1:0]

RNA[3:0]

RNB[3:0]

PCE

ADD

BCE15_4

BRANCH

CI

IMMOP[5:0]

A15

LOGICT

RETADT

RFWE

SELPC

SLT

SRT

SRI

SUMT

ZXT

ZEROPC

CLK

Z

ADDR[15:0]

RESULT[15:0]

CO

V

N

FWD

IMM[11:0]

LDT

UDT

UDLDT

DMAPC

PCCE

RETCE

C

CTRL16RLOC=R0C0

A15

Z

N

V

RNA[3:0]

RNB[3:0]

RFWE

IMMOP[5:0]

PCE

BCE15_4

ADD

CI

LOGICOP[1:0]

SUMT

LOGICT

ZXT

SRT

SLT

RETADT

CLK

SELPC

ZEROPC

ACE

SRI

FWD

CO

RDY

IMM[11:0]

READN

WORDN

INSN[15:0]

BRDISP[7:0]

BRANCH

DMA

DBUSN

DMAPC

PCCE

RETCE

IREQ

DMAREQ

ZERODMARNA[3:0]

RNB[3:0]

IMM[11:0]

LOGICOP[1:0]

Z,N,CO,V,A_15

AN[15:0]

BRDISP[7:0]

IMMOP[5:0]

D[15:0]INSN[15:0]

CLK

BRANCH

RFWE

FWD

PCE

BCE15_4

SELPC

ZEROPC

ADD

CI

SUMT

LOGICT

DMA

N

CO

ZXT

SRI

SRT

SLT

RETADT

RDY WORDN

UDLDT

V

A_15A_15

Z

N

CO

V

READN

Z

UDT

LDT

ACE

ZERODMA

DMAPC

PCCE

RETCE

IREQ

DMAREQ

DBUSN

lw r12,4(r3)lw r12,4(r3)

Control Unit: Pipeline StagesControl Unit: Pipeline Stages◆◆ IF: Instruction FetchIF: Instruction Fetch

◆◆ Read instruction at addr; prepare addr Read instruction at addr; prepare addr ←← PC += 2; PC += 2;capture instruction in Instruction Register (IR)capture instruction in Instruction Register (IR)

◆◆ DC: Decode and register file accessDC: Decode and register file access◆◆ Decode instruction; read register operands;Decode instruction; read register operands;

prepare immediate field; select operandsprepare immediate field; select operands

◆◆ EX: ExecuteEX: Execute◆◆ Add, logic, shift; select result; write to register fileAdd, logic, shift; select result; write to register file◆◆ call/jal: PC := sumcall/jal: PC := sum◆◆ load/store: addr load/store: addr ←← sum; stop pipe for access sum; stop pipe for access

lw r12,4(r3)lw r12,4(r3)

Control Unit: Pipeline HazardsControl Unit: Pipeline Hazards◆◆ Memory wait statesMemory wait states

◆◆ Stop pipelineStop pipeline

◆◆ Result use data dependenceResult use data dependence◆◆ Result forwarding on one operand onlyResult forwarding on one operand only

◆◆ Load/store memory accessLoad/store memory access◆◆ Stop pipelineStop pipeline

◆◆ Jumps and taken branchesJumps and taken branches◆◆ Annul two following instructionsAnnul two following instructions

◆◆ InterruptsInterrupts◆◆ Force Force jump-to-int-handler instructionjump-to-int-handler instruction when safe when safe


Project: [None]Macro: CTRL16Date: 02/23/100

"N.C."

"N.C.""N.C."

"N.C."

INSTRUCTION REGISTERS

Figure S4: XR16 CPU Control Unit Schematic

EXECUTE STAGEDC: OPERAND SELECTION

CONTROL STATE MACHINE

DC: CONDITIONAL BRANCHES

INSTRUCTION DECODER

T2

FD4PEINIT=S

D0

D1

D2

D3

Q0

Q1

Q2

Q3

CE

CLK

AND2B1TRUE

TRUTHRLOC=R2C6

COND[3:0]

Z

N

C

V

TRUE

OR2

OR2

EXIRB

BUF16

I[15:0] O[15:0]

IRB

BUF16

I[15:0]O[15:0]

IRMUX

IRMUXUSE_RLOC=FALSE

A[15:0]

B[15:0]

SEL

INT

O[15:0]

NEXTIRFD16CE

C

CE

CLR

D[15:0] Q[15:0]

RNB

RNMUX4RLOC=R2C1

RA[3:0]

RD[3:0]

EXRD[3:0]

SELR15

SELRD

SELSRC

RN[3:0]

FWD

SELR0

RZERO

AND2

CIFDCE

D

C

CE

CLR

Q

AND2

BUF

DECODE

DECODE

OP[3:0]

PCE

CLK

ADDSUB

SUB

NLW

NLD

NLB

IMM12

IMM4

WORDIMM4

NJAL

EXRESULTS

EXLDST

EXLBSB

EXJALI

SEXTIMM4

BR

EXST

FN[3:0]

NSUM

NLOGIC

NSR

NSL

EXFNSRA

ADCSBC

EXNSUB

EXIMM

ST

JALI

RRRI

EXOP[3:0]

EXJAL

NSUB

DCINTINH

AND2B1

BUF

AND2B1

RNA

RNMUX4RLOC=R2C0

RA[3:0]

RD[3:0]

EXRD[3:0]

SELR15

SELRD

SELSRC

RN[3:0]

FWD

SELR0

RZERO

BUF

AND2B1

BRANCHFDCE

D

C

CE

CLR

Q

IRFD16CE

C

CE

CLR

D[15:0] Q[15:0]

AND3B1

AND3B1

BUF

AND4B2

SRI

AND2T1

FD4PEINIT=S

D0

D1

D2

D3

Q0

Q1

Q2

Q3

CE

CLK

FSM

CTRLFSM

RDY

IREQ

DMAREQ

BRANCH

JUMP

EXLBSB

EXST

ACE

PCE

WORDN

READN

DBUSN

PCCE

RETCE

IF

DMAPC

CLK

ZERODMA

EXLDST

ZEROPC

IFINT

SELPC

DCINTINH

EXANNUL

EXAN

DMA

AND2XNOR2

EXIRFD16CE

C

CE

CLR

D[15:0] Q[15:0]

BUFBUF

AND2B1

IMMB

BUF16

I[15:0]O[15:0]

IR[15:0]IRMUX[15:0]

EXOP[3:0]

IMMOP[5:0]

LOGICOP[1:0]

NIR[15:0] EXOP[3:0],EXRD[3:0],BRDISP[7:0]

IMM[11:0]

IR[11:8]

RNB[3:0]

OP[3:0],RD[3:0],RA[3:0],RB[3:0]

EXIR[15:0]

INSN[15:0]

BRDISP[7:0]

IR[7:4]

RA[3:0]

RD[3:0]

EXRD[3:0]

RNA[3:0]

RB[3:0]

RD[3:0]

EXRD[3:0]

OP[3:0]

IR[11:0]

CI

CALL

NSUB

ADCSBC

FWD

ST

RFWEEXANNUL

BRANCH

RETCE

PCE

ADD

NSUB

CLK

NJAL

NLB

PCE

ZXT

SRT

SLT

RETADT

EXJAL

EXLBSB

BCE15_4

ACE

PCE

EXANNUL

CLK

IR3

EXFNSRA

WORDIMM4

ADDSUB

SUB

A15

IMM_12

EXCALL

SEXTIMM4

EXIMM

DCINTINH

EXJAL

NLW

NLD

IMM_12

IMM_4

PCE

EXRESULTS

SRI

NLB

EXIMM

EXANNUL

IMMOP0

WORDIMM4

NJAL

SEXTIMM4

EXRESULTS

CLK

EXCALL

RRRI

CLK

EXCALL

IR0IMMOP5

RZERO

IF

EXNSUB

RZERO

IREQ

NSL

NLW

NLD

NSR

JUMP

EXIR4

EXIR5

CO

LOGICOP1

PCE

LOGICOP0

JUMP

RRRI

LOGICT

EXST

BR

EXLDST TRUE

CLK

EXRESULTS

ADCSBC

NSR

NSL

NLOGIC

NSUM

CLK

NSUM

NLOGIC

CALL

PCE

IMMOP2

IMM_4

N

Z

BRANCHBR

CO

V

CLK

CLK

CLK

CLK

PCE

CLK

WORDN GND

DMAPC

EXANNUL

ST

EXNSUB

PCE

WORDIMM4

SUMT

EXANNUL

IMM_12

EXAN

IF

BRN

PCE

IMMOP1

IMMOP3

EXFNSRA

IR0

IMM_4

IMMOP4

READN

EXLDST

EXLBSB

EXST

EXAN

DCINTINH

IF

IFINT

DBUSN

SELPC

ZEROPC

DMAPC

PCCERDY

DMA

DMAREQ

ZERODMA

PCE

IFINT

GND


Project: [None]Macro: DP16Date: 02/23/100

EXECUTION UNIT

ADDRESS/PC UNIT

RESULT MUX

Figure S3: XR16 CPU Datapath Schematic

RETBUF

RLOC=R1C9

BUFT16XT

BREGS

REGFILERLOC=R1C1

D[15:0]

A[3:0]

WE

CLK

Q[15:0]

LOGICBUF

RLOC=R1C4

BUFT16XT

SRBUF

RLOC=R1C1

BUFT16XT

IMMED

IMM16RLOC=R1C3

B[15:0]

IR[11:0]

O[15:0]

OP[5:0]

B

FD12E4ERLOC=R1C3

D[15:0] Q[15:0]

CE15_4

CE3_0

CLK

AREGS

REGFILERLOC=R1C0

D[15:0]

A[3:0]

WE

CLK

Q[15:0]

A

FD16ERLOC=R1C2

D[15:0]

CE

CLK

Q[15:0]

ADDSUB

RLOC=R-1C5

ADSU16

A[15:0]

ADD

B[15:0]

CI

COOFL

S[15:0]

LOGIC

LOGIC16RLOC=R1C4

A[15:0]

B[15:0]

OP[1:0]

O[15:0]

SUMBUF

RLOC=R1C5

BUFT16XTBUF

SLBUF

RLOC=R1C0

BUFT16XT

PCINCR

RLOC=R-1C8

ADD16

A[15:0]

B[15:0]

CI

COOFL

S[15:0]

ADDRMUX

M2_16ZRLOC=R1C7

A[15:0]

B[15:0]

SEL

O[15:0]

ZERO

PC

RAM16X16SRLOC=R1C9

A[3:0]

D[15:0]

WE

WCLK

O[15:0]

RET

FD16ERLOC=R1C9

D[15:0]

CE

CLK

Q[15:0]PCDISP

PCDISP16RLOC=R1C6

BRDISP[7:0]

BRANCH

PCDISP[15:0]

ZHBUF

RLOC=R1C2

BUFT8XT

DOUT

FD16ERLOC=R1C4

D[15:0]

CE

CLK

Q[15:0]

FWD

M2_16RLOC=R1C2

A[15:0]

B[15:0]

SEL

O[15:0]

Z

ZERODETRLOC=R1C6

I[15:0]Z

UDLDBUF

RLOC=R5C2

BUFT8XT

UDBUF

RLOC=R1C3

BUFT8XT

LDBUF

RLOC=R5C3

BUFT8XT

RNA[3:0]

RNB[3:0]

AREG[15:0]

BREG[15:0] B[15:0]

AMUX[15:0]

BMUX[15:0]

RETAD[15:0]

A[15:0]

SUM[15:0]

LOGIC[15:0]

PCNEXT[15:0]

IMMOP[5:0]

PCDISP[15:0]

IMM[11:0]

DOUT[7:0]

BRDISP[7:0]

G,G,G,G,G,G,G,G

RESULT[15:8]

RESULT[15:8]

SRI,A[15:1]

RESULT[7:0]

RESULT[15:0]

PC[15:0]

DOUT[15:8] RESULT[7:0]

DOUT[15:8]

LOGICOP[1:0]

DOUT[15:0]

A[14:0],G

ADDR[15:0]

G,G,G,DMAPC

CLK

RFWE

ZXT

CLK

CLK

RFWE

CLK

PCE

SLT

Z

RETADT

PCCE

LDT

GND

SRT

SUMT

LOGICT

CI

ADD

FWD

N

A15

CLK

CO

SELPC

ZEROPC

CLK

CLK

BCE15_4

V

G

SUM15

PCE

BRANCH

PCESRI

UDLDT

UDT

RETCE

DMAPC

lw r12,4(r3)lw r12,4(r3)

Datapath FloorplanDatapath Floorplan

AREG

S, A

REG

, SLB

UF

BREG

S, B

REG

, SR

BUF

FWD

, A, U

DLD

BUF,

ZH

BUF

IMM

ED, B

, LD

BUF,

UD

BUF

LOG

IC, D

OU

T, L

OG

ICBU

F

ADD

SUB,

SU

MBU

F

PCD

ISP,

Z

ADD

RM

UX

PCIN

CR

PC, R

ET, R

ETBU

F

lw r12,4(r3)lw r12,4(r3)

XSOC System-on-a-ChipXSOC System-on-a-Chip◆◆ Make integrated system dev more accessibleMake integrated system dev more accessible◆◆ Integrated system:Integrated system:

◆◆ Processor, memory, memory controller, on-chipProcessor, memory, memory controller, on-chipbus, parallel port, ram, video controllerbus, parallel port, ram, video controller

◆◆ 32 KB SRAM 32 KB SRAM →→ 576x455 monochrome display576x455 monochrome display

lw r12,4(r3)lw r12,4(r3)

“System-on-a-Chip” in Context“System-on-a-Chip” in Context

4701K4701K4701K

R

G

B

VGA PORT

/HSYNC/VSYNC

XC4005XL FPGA

XA[14:1]XA_0

XD[7:0]RAMNCERAMNOERAMNWE

RESET8031

R1R0G1G0B1B0

NHSYNCNVSYNC

CLK

PAR_D[5:0]

PAR_S[6:3]

32 KB SRAM

A[14:1]A0D[7:0]/CS/OE/WE

D[5:0]

S[6:3]

PARALLELPORT

12 MHzOSC.


Project: XSOCSheet: XSOCDate: 06/07/99

EXTERNAL BUS AND SRAM INTERFACE

P

XR16RLOC_ORIGIN=R1C3

CLK

ACE

AN[15:0]

RDY

READN

WORDN

D[15:0]

UDT

LDT

INSN[15:0]

DMA

UDLDT

DBUSN

IREQ

DMAREQ

ZERODMA

OPAD

BUFGP

STARTUPSTARTUP

GSR

GTS

CLK

Q3

Q2

Q1Q4

DONEIN

IBUF

OPAD

INV

OBUF

MEMCTRL

MEMCTRL

AN[15:0]

ACE

WORDN

READN

RDY

UDT

LDT

RAMNOE

RAMNWE

XA_0

XDOUTT

UXDT

LXDT

SEL[7:0]

CLK

CTRL[15:0]UDLDT

REQ

ACK

DMA

DBUSN

IREQ

DMAREQ

ZERODMA RESET

OPAD

IRAM

XRAM16X16RLOC_ORIGIN=R7C14

CTRL[15:0]

D[15:0]

SEL

CLK

OPAD

OBUFOPAD

OBUF OPAD

OBUF

OPAD

XAOFDX16

C

CED[15:0] Q[15:0]

OBUF

OBUF

OPAD

IPADIBUF

XDINIFD8

C

D[7:0] Q[7:0]

IBUF8

UXDBUFT8

T

LXDBUFT8

T

OPAD16O[15:0]

XDOUT

OBUFT8T

OBUFOBUF

BUF

OPAD

PARIN

XIN8RLOC_ORIGIN=R11C13

I[7:0]

CTRL[15:0]

D[7:0]

SEL

CLK

BUF

IPAD

IPAD

IPAD

IPAD

IPAD

IPAD

OPAD

OPAD

IBUFIBUFIBUFIBUFIBUF

OPAD

VGA

VGARLOC_ORIGIN=R7C1

CLK

NHSYNC

NVSYNC

R1

R0

G1

G0

B1

B0

PIX[15:0]

REQ

ACK

RESET

OBUFOBUF

OPAD

OPAD

OBUFOBUF

OPAD

OPAD

OBUF

IOPAD8IO[7:0]

PAROUT

XOUT4

CTRL[15:0]

D[7:0]

SEL

CLK

Q0

Q1

Q2

Q3

OPADOBUF4

AN[15:0]

PAR[7:0]

D[7:0]

XDIN[15:0]

XDIN[15:8]

XDIN[7:0]

XD[7:0]

D[7:0]

D[15:8]

CTRL[15:0]

D[15:0]

D[7:0]

XA[15:0]

SEL[7:0]

XDIN[15:0]

D[7:0]

NCLK

WORDN

CLK

READN

CLK

VCC

RAMNOE

CLK

CLK

RESET8031

RAMNWE

RAMNCE

ACE

GND

LXDT

CLKIN

VRESET

VREQ

VRESET

XA_0

XDOUTT

UXDT

NCLK

LXDT

UXDT

CLK

B0

XDOUTTG0

VREQ

SEL1

B1

CLK

G1

SEL0

DMA

PAR_D0

PAR_D1

PAR_D2

PAR_D3

PAR_D4

PAR_D5

PAR_S5

PAR_S4

PAR7

PAR6

PAR5

PAR4

PAR3

PAR2

PAR1

PAR0

CLK

NHSYNC

NVSYNC

R1

R0

CLK

UDLDT

LDT

UDT

RDY

ZERODMA

DMAREQ

IREQ

VACK VACK

DBUSN

SEL2PAR_S6

PAR_S3

GND

lw r12,4(r3)lw r12,4(r3)

On-Chip BusOn-Chip Bus◆◆ 16-bit pipelined on-chip data bus16-bit pipelined on-chip data bus◆◆ Bus/memory controllerBus/memory controller

◆◆ Address decoding, bus controls (output enables,Address decoding, bus controls (output enables,clock enables), external SRAM controlclock enables), external SRAM control

◆◆ Make core reuse easy: Make core reuse easy: no glue logic requiredno glue logic required◆◆ Abstract control signal busAbstract control signal bus◆◆ Locally decoded within each coreLocally decoded within each core◆◆ Just add core, attach data, control, and select linesJust add core, attach data, control, and select lines◆◆ Can evolve bus with new features withoutCan evolve bus with new features without

invalidating existing designs and coresinvalidating existing designs and cores

lw r12,4(r3)lw r12,4(r3)

Video ControllerVideo Controller◆◆ 32 KB 32 KB →→ 576x455 monochrome bitmap576x455 monochrome bitmap◆◆ VGA compatible sync timingsVGA compatible sync timings◆◆ Video signal generationVideo signal generation

◆◆ A series of frames, each a series of lines, each aA series of frames, each a series of lines, each aseries of pixelsseries of pixels

◆◆ Fetch 16-pixel words, shifts them out seriallyFetch 16-pixel words, shifts them out serially◆◆ DMA read a new word every 8 clocksDMA read a new word every 8 clocks

◆◆ Drive horizontal and vertical sync to “frame” pixelsDrive horizontal and vertical sync to “frame” pixels◆◆ 2 10-bit counters and 4 10-bit comparators2 10-bit counters and 4 10-bit comparators

lw r12,4(r3)lw r12,4(r3)

System Bring UpSystem Bring Up◆◆ 11/98: Proposal, compiler, instruction set,11/98: Proposal, compiler, instruction set,

datapathdatapath◆◆ 12/98: Control unit, simulate CPU, target12/98: Control unit, simulate CPU, target

XS40 board, 1 Hz, 20 MHzXS40 board, 1 Hz, 20 MHz◆◆ Debugging with LEDs and stop: Debugging with LEDs and stop: goto goto stop;stop;◆◆ Later added on-chip bus, external SRAMLater added on-chip bus, external SRAM

interface, DMA, video, and interruptsinterface, DMA, video, and interrupts◆◆ Testing challengeTesting challenge

lw r12,4(r3)lw r12,4(r3)

Hands-on Computer ArchitectureHands-on Computer Architecture◆◆ Build a working computer in an FPGABuild a working computer in an FPGA◆◆ Value and appeal of building real hardwareValue and appeal of building real hardware◆◆ Teaching performance tradeoffsTeaching performance tradeoffs

◆◆ CPICPI vs vs. area. area vs vs. cycle time . cycle time vsvs. power. power◆◆ FPGA EDA tools close the loop,FPGA EDA tools close the loop,

provide “realistic” feedbackprovide “realistic” feedback◆◆ Cost of everythingCost of everything◆◆ Critical paths and retimingCritical paths and retiming◆◆ Interconnect and floorplanning issuesInterconnect and floorplanning issues

◆◆ Makes possible Makes possible quantitative analysisquantitative analysis

lw r12,4(r3)lw r12,4(r3)

Other BenefitsOther Benefits◆◆ Cores approach helps teach design reuse andCores approach helps teach design reuse and

embedded system designembedded system design◆◆ No hand-waving allowedNo hand-waving allowed◆◆ Learning the testing imperativeLearning the testing imperative◆◆ Living the “system bring up” experienceLiving the “system bring up” experience◆◆ Can simplify instruction set for teachingCan simplify instruction set for teaching

◆◆ Simple enough to understand end-to-end and allSimple enough to understand end-to-end and allthe way downthe way down

◆◆ Living smallLiving small◆◆ FPGA design vocational trainingFPGA design vocational training

lw r12,4(r3)lw r12,4(r3)

Senior Project IdeasSenior Project Ideas◆◆ Implement a processor core given its specImplement a processor core given its spec◆◆ Improve performance through pipeliningImprove performance through pipelining◆◆ Add a cache, MMU, or exceptionsAdd a cache, MMU, or exceptions◆◆ Accelerate some inner-loop C code: add aAccelerate some inner-loop C code: add a

coprocessor or custom instructionscoprocessor or custom instructions◆◆ Add a new SoC peripheral core, plusAdd a new SoC peripheral core, plus

interrupt handler/device driver, and testinginterrupt handler/device driver, and testing◆◆ Develop a test suite to verify the CPU/systemDevelop a test suite to verify the CPU/system◆◆ Competitive CPU design contest - Competitive CPU design contest - perfperf/power/power

lw r12,4(r3)lw r12,4(r3)

Get Real!Get Real!◆◆ FPGA CPUs/SoCs a realistic senior project?FPGA CPUs/SoCs a realistic senior project?

◆◆ Certainly for computer engineering students withCertainly for computer engineering students witha prerequisite HDL/FPGA digital design coursea prerequisite HDL/FPGA digital design course

◆◆ Maybe (maybe not) for CS studentsMaybe (maybe not) for CS students

◆◆ Vary degree of difficulty using kit approachVary degree of difficulty using kit approach◆◆ Amortize material and cost across set ofAmortize material and cost across set of

courses (intro digital design, compilers,courses (intro digital design, compilers,computer architecture, operating systems)computer architecture, operating systems)

◆◆ Danger: teach arch, not EDA tools quirksDanger: teach arch, not EDA tools quirks

lw r12,4(r3)lw r12,4(r3)

FPGA CPUs for AdvancedFPGA CPUs for AdvancedArchitecture StudiesArchitecture Studies

◆◆ Exploit improvements in FPGAs (2Exploit improvements in FPGAs (266 in 7 years) in 7 years)◆◆ New embedded RAM blocks (Virtex and APEX)New embedded RAM blocks (Virtex and APEX)

◆◆ xr16: 270 logic cells; xr32: 430-700 logic cellsxr16: 270 logic cells; xr32: 430-700 logic cells◆◆ V3200E: 73,000 logic cells!V3200E: 73,000 logic cells!◆◆ V600E: 15,000 cells + 72 256x16 SRAMsV600E: 15,000 cells + 72 256x16 SRAMs

◆◆ 8-way chip-multiprocessors 8-way chip-multiprocessors easyeasy◆◆ 16-way 16-way MPsMPs with 8 threads/P, 256x32 I$/P, and with 8 threads/P, 256x32 I$/P, and

shared 1Kx32 L2$ shared 1Kx32 L2$ feasiblefeasible◆◆ Fast I/O: 200 MHz ZBT SSRAM, 266 MHzFast I/O: 200 MHz ZBT SSRAM, 266 MHz

DDR SDRAM, up to 300 Mbps/pin chip-chip!DDR SDRAM, up to 300 Mbps/pin chip-chip!

lw r12,4(r3)lw r12,4(r3)

Applications of Block RAMsApplications of Block RAMs◆◆ ProcessorProcessor

◆◆ Register files: vector, windowed, multi-contextRegister files: vector, windowed, multi-context◆◆ On-chip RAM, ROM, microcode, stacksOn-chip RAM, ROM, microcode, stacks◆◆ Caches: data, tags, write buffersCaches: data, tags, write buffers◆◆ Branch history tables, branch target cachesBranch history tables, branch target caches◆◆ MMUs: segment registers, MMUs: segment registers, TLBs TLBs (not (not assocassoc))◆◆ Debug and tracing supportDebug and tracing support

◆◆ SystemSystem◆◆ Interconnects: packet/cell buffers, queuesInterconnects: packet/cell buffers, queues◆◆ Graphics: video line buffers, texture caches, display lists,Graphics: video line buffers, texture caches, display lists,

span buffers, color LUTsspan buffers, color LUTs◆◆ GC: write barrier support: page attributes, card marking bitsGC: write barrier support: page attributes, card marking bits◆◆ Multimedia: pixel blocks, Multimedia: pixel blocks, coeff coeff and compression tablesand compression tables

lw r12,4(r3)lw r12,4(r3)

Advanced Project IdeasAdvanced Project Ideas◆◆ Starting with a scalar kit, design and build aStarting with a scalar kit, design and build a

◆◆ 2-3 operation LIW2-3 operation LIW◆◆ 2-issue superscalar2-issue superscalar◆◆ Chip multiprocessor (+ memory system)Chip multiprocessor (+ memory system)◆◆ Multithreaded processorMultithreaded processor◆◆ Fault tolerant processorFault tolerant processor

◆◆ Or add architectural support forOr add architectural support for◆◆ network routing, packet inspection, message passingnetwork routing, packet inspection, message passing◆◆ non-pausing multithreaded GCnon-pausing multithreaded GC◆◆ hybrid processor + DSP datapathhybrid processor + DSP datapath

lw r12,4(r3)lw r12,4(r3)

Related WorkRelated Work◆◆ Virginia TechVirginia Tech, Rapid Prototyping, Rapid Prototyping, 16-bit, 16-bit

HOKIE RISC processor in XC4010HOKIE RISC processor in XC4010◆◆ Cornell EE475 Cornell EE475 Microprocessor ArchitecturesMicroprocessor Architectures,,

VHDL design and FPGA verification (simpleVHDL design and FPGA verification (simpleand pipelined)and pipelined)

◆◆ Hiroshima City University, Hiroshima City University, City-1City-1, HDL, HDLdesign of RISC or CISC CPU into XC4010design of RISC or CISC CPU into XC4010

◆◆ Hamblen, Georgia Tech, CPU designHamblen, Georgia Tech, CPU design(including MIPS subset), HDL, into Altera(including MIPS subset), HDL, into Altera10K20s and 10K70s10K20s and 10K70s

lw r12,4(r3)lw r12,4(r3)

Ongoing WorkOngoing Work◆◆ Near termNear term

◆◆ Package XSOC/xr16 distribution for beta test Package XSOC/xr16 distribution for beta test ✓✓◆◆ Retarget to Verilog Retarget to Verilog ✓✓◆◆ Port to Virtex and Altera architecturesPort to Virtex and Altera architectures

◆◆ Longer termLonger term◆◆ Help build a communityHelp build a community◆◆ xr32 xr32 ½ ½ ✓✓◆◆ Port GCC/binutils; build eCOS, Linux?Port GCC/binutils; build eCOS, Linux?◆◆ Chip multiprocessorsChip multiprocessors

lw r12,4(r3)lw r12,4(r3)

ConclusionsConclusions◆◆ FPGA CPUs are not only feasible, they’re aFPGA CPUs are not only feasible, they’re a

good idea for embedded systems developmentgood idea for embedded systems development◆◆ Building FPGA CPUs and SoCs makeBuilding FPGA CPUs and SoCs make

computer architecture real, tradeoffscomputer architecture real, tradeoffsquantitative, and the lessons tangiblequantitative, and the lessons tangible

◆◆ Project labs based upon FPGA CPU kits canProject labs based upon FPGA CPU kits canaccommodate a range of studies and expertiseaccommodate a range of studies and expertise

◆◆ Students in graduate level comp arch coursesStudents in graduate level comp arch coursescan now can now buildbuild multiprocessors, etc. multiprocessors, etc.

◆◆ What do you think?What do you think?

Hands-on Computer Architecture –Hands-on Computer Architecture –Teaching Processor and Integrated Systems DesignTeaching Processor and Integrated Systems Design

with FPGAswith FPGAsJan GrayJan Gray

Gray Research LLCGray Research LLCjsgray@[email protected]

lw r12,4(r3)lw r12,4(r3)

Copyright © 1999,2000 Gray Research LLC. All Rights Reserved.

jsgray@ Jan Gray with FPGAs Teaching Processor and ... · Verilog and VHDL synthesis, ......

Documents

Transcript of jsgray@ Jan Gray with FPGAs Teaching Processor and ... · Verilog and VHDL synthesis, ......