-
CS250, UC Berkeley Fall '20 – Lecture 06, Reconfigurable Architectures
CS250 VLSI Systems Design
Fall 2020
John Wawrzynek
with
Arya Reais-Parsi
-
Brings us to "reconfigurable computing"
■ What is it?
■ Standard definition:
Computing via a post-fabrication and spatially programmed connection of processing elements.
‣ ASIC implementations excluded – not post-fabrication programmable.
‣ FPGA implementation of a processor core to run a program excluded – not a direct spatial mapping of the problem.
■ Does this include arrays of processors?
■ This definition restricts RC to mapping to "fine-grained" devices (such as FPGAs); however, many of the same principles apply to arrays of processors.
2
-
Spatial Computation
■ Example:
grade = 0.2 × mt1 + 0.2 × mt2
      + 0.2 × mt3 + 0.4 × project;
■ A hardware resource is allocated and dedicated for each operator (in this case, multiplier or adder) in the compute graph.
3
[Figure: compute graph – four multipliers (0.2 × mt1, 0.2 × mt2, 0.2 × mt3, 0.4 × proj) feeding a tree of three adders that produces grade.]
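As an illustration (not from the slides), the spatial mapping can be sketched in Python, with each function call standing in for one dedicated hardware operator; in real hardware all seven operators would work concurrently:

```python
# Hypothetical sketch of the spatial mapping: one dedicated "unit" per
# operator in the compute graph, all of which would run concurrently in
# actual hardware.
def mul(a, b):
    return a * b  # stands in for one dedicated multiplier

def add(a, b):
    return a + b  # stands in for one dedicated adder

def grade_spatial(mt1, mt2, mt3, project):
    # Four multipliers feed a tree of three adders.
    p1 = mul(0.2, mt1)
    p2 = mul(0.2, mt2)
    p3 = mul(0.2, mt3)
    p4 = mul(0.4, project)
    return add(add(p1, p2), add(p3, p4))
```

Because every operator has its own resource, the latency is set by the depth of the graph (one multiply plus two adds), not by the number of operators.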
-
Temporal Computation
■ A hardware resource is time-multiplexed to implement the actions of the operators in the compute graph.
■ Typical in a sequential processor/software solution, however also possible in reconfigurable logic.
■ In reconfigurable logic it might be necessary to serialize a computation:
■ Limited chip resources
■ Limited I/O bandwidth
4
acc1 = mt1 + mt2;
acc1 = acc1 + mt3;
acc1 = 0.2 × acc1;
acc2 = 0.4 × proj;
grade = acc1 + acc2;
[Figure: abstract compute graph (multipliers and adders over 0.2, mt1, mt2, mt3, 0.4, proj, producing grade) and its implementation – a controller sequencing a single ALU over registers acc1 and acc2.]
Reconfigurable Computing permits the full range of spatial, temporal, and mixed computing solutions to best match implementation to task specifics and available hardware.
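The serialized grade computation above can likewise be sketched as a hypothetical Python model: a small instruction list plays the role of the controller, and one function application per step plays the role of the time-multiplexed ALU.

```python
# Hypothetical sketch of temporal computation: one ALU, time-multiplexed
# over the operators, driven by a controller (the instruction list).
def run_temporal(mt1, mt2, mt3, project):
    regs = {"mt1": mt1, "mt2": mt2, "mt3": mt3, "proj": project}
    program = [  # (dest, op, src_a, src_b): one ALU operation per step
        ("acc1", "+", "mt1", "mt2"),
        ("acc1", "+", "acc1", "mt3"),
        ("acc1", "*", "acc1", 0.2),
        ("acc2", "*", "proj", 0.4),
        ("grade", "+", "acc1", "acc2"),
    ]
    for dest, op, a, b in program:
        va = regs[a] if isinstance(a, str) else a  # register or constant
        vb = regs[b] if isinstance(b, str) else b
        regs[dest] = va + vb if op == "+" else va * vb
    return regs["grade"]
```

Here the latency is five ALU steps instead of the graph depth, which is the price paid for using a single shared resource.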
-
RC, Processors, & ASIC
5
-
RC Strategy
1. Exploit cases where an operation can be bound and then reused a large number of times.
2. Customize for operator type, width, and interconnect.
3. Low-overhead exploitation of application parallelism.
6
-
Hybrid Approach
■ 90/10 rule:
■ 90 percent of the program runtime is consumed by 10 percent of the code (inner loops).
■ Only small portions of an application become the performance bottlenecks.
■ Usually, these portions of code are data-processing intensive with relatively fixed dataflow patterns (little control).
■ The other 90 percent of the code is not performance critical.
7
⇒ Hybrid processor-core + reconfigurable-array
-
Garp – Hybrid Processor
8
Function              Speedup
strlen (len 16)        1.77
strlen (len 1024)     14
sort                   2.1
image median filter   26.9
DES (ECB mode)        19.6
image dithering       16.3
Speedups over a 4-way superscalar UltraSparc on the same process and comparable die size and memory system.
"Garp: A MIPS Processor with a Reconfigurable Coprocessor", in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '97), April 16–18, 1997.
• Pre-generated circuits for common program kernels cached within the reconfigurable array and used to accelerate MIPS programs.
• Nanosecond-scale configuration swap time.
• Speedup tied to a single execution thread.
-
GarpCC (T. Callahan)
9
Compilation time < 2× processor-only compilation time.
Kernels from wavelet image compression. Speedups relative to MIPS processor only.
“The Garp Architecture and C Compiler”, IEEE Computer, April 2000.
Kernel             raw   net
forward_wavelet_1  2     1.9
forward_wavelet_2  4.1   3.6
init_image         6.4   6.4
forward_wavelet_3  4.1   3.6
forward_wavelet_4  5.2   4.1
entropy_encode_1   4     4
block_quantize     2.8   2.6
RLE_encode         5.8   3.4
entropy_encode_2   2.9   1.5
-
Advantages of RC over Processor Core
■ Conventional processors have several sources of inefficiency:
■ Heavy time-multiplexing of function units (ALUs).
■ Instruction issue overhead.
■ Memory hierarchy to deal with memory latency.
■ Operator mismatch.
10
[Figure: peak (raw) performance comparison.]
-
Advantages of RC
■ Relative to microprocessors: on average, a higher percentage of peak (or raw) computational density is achieved with reconfigurable devices:
■ Fine-grain flexibility leads to exploitation of problem-specific parallelism at many levels.
■ Also, many different computation models (or patterns) can be supported. In general, it is possible to match problem characteristics to hardware, through the use of problem-specific architectures and low-level circuit specialization.
11
■ Spatial mapping of computation, versus multiplexing of function units (as in processors), relieves pressure on memory capacity and bandwidth, and promotes local communication patterns.
-
Fine-grained Reconfigurable Fabrics
Homogeneous fine-grained arrays are maximally flexible:
a. Admit a wide variety of computational architecture models: arrays of processors, hybrid approaches, hard-wired dataflow, systolic processing, vector processing, etc.
b. Admit a wide variety of parallelism modes: SIMD, MIMD, bit-level, etc. Resources can be deployed to lower latency when required for tight feedback loops (not possible with many parallel architectures that optimize for throughput).
c. Support many compilation/resource-management models: statically compiled, dynamically mapped.
12
Safe bet as a future standard device.
-
FPGAs are Reconfigurable
1. Volume/cost graphs don't accurately capture the potential real costs and other advantages.
2. Commercial applications have not taken advantage of reconfigurability:
• Xilinx/Altera (Intel) haven't done much to help.
• Methodologies/tools nearly nonexistent.
Reconfiguration uses:
‣ Field upgrades ⇒ product life extension, changing requirements.
‣ In-system board-level testing and field diagnostics.
‣ Tolerance to faults.
‣ Risk management in system development.
‣ Runtime reconfiguration ⇒ higher silicon efficiency.
‣ Time-multiplexed pre-designed circuits make maximum use of resources.
‣ Runtime specialized circuit generation.
13
Seemingly obvious point but …
-
Multi-modal Computing Tasks
■ Mini/Micro-UAVs
■ One piece of silicon for all of sensor processing, navigation, communications, planning, logging, etc.
■ At different times, different tasks take priority and consume a higher percentage of resources.
■ Other example: hand-held multi-function device with GPS, smart image capture/analysis, communications.
14
A premier application for reconfigurable devices is one with constrained size/weight that needs multiple functions at near-ASIC performance.
Multiple ASICs are too expensive/big. A processor is too slow. Fine-grained reconfigurable devices have the flexibility to efficiently match task parallelism over a wide variety of tasks – deployed as needed and reconfigured as needed.
Mars-rover
-
Sounds great, what's the catch?
■ Lack of a programming model with convenient and effective tools.
■ Most successful computing applications using reconfigurable devices involve substantial "hand mapping" – essentially circuit design.
■ A complex issue, but perhaps changing the fabric design can help.
15
-
Rapid Runtime Reconfiguration
■ Might permit even higher efficiency through hardware sharing (multiplexing) and on-the-fly circuit specialization.
■ Largely unexploited (unproven) to date.
■ A few research projects have explored this idea.
■ Need to be careful – multiplexing adds cost.
■ Remember the "Binding Time Principle": the earlier the "instruction" is bound, the less area and delay required for the implementation.
16
-
Rapid Runtime Reconfiguration
1. Time-multiplexing resources allows more efficient use of silicon (in ways ASICs typically do not):
a. Low-duty-cycle or "off critical path" computations time-share the fabric while the critical path stays mapped in:
17
Why dynamic reconfiguration?
[Figure: amount of reconfigurable fabric versus total runtime, marking the size of maximum efficiency.]
-
Rapid Runtime Reconfiguration
b. Coarse data-dependent control flow maps in only the useful dataflow:
c. The allowable task footprint may change as other tasks come and go or faults occur.
Fabric virtualization allows automatic migration up and down in device sizes and eases application development.
18
If-then-else
-
Rapid Runtime Reconfiguration
2. Runtime Circuit Specialization:
• Example: fixed-coefficient multipliers in an adaptive filter, changing value at a low rate.
• Aggressive constant propagation (based perhaps on runtime profiling) reduces circuit size and delay.
• Could use "branch/value/range prediction" to map the most common case and fault in exceptional cases.
• Can be template based – "fill in the blanks" – but better if we put PPR in the runtime loop!
• Array-HW-assisted place and route may make it possible.
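To make the constant-propagation idea concrete, here is a hypothetical Python sketch of specializing a constant-coefficient multiplier into shift-and-add form, the software analogue of generating a smaller circuit once the coefficient is known (the function names are illustrative, not from any tool):

```python
# Hypothetical sketch of runtime circuit specialization: bake a known
# coefficient into a shift-and-add multiplier. In hardware, a constant
# with few set bits yields a much smaller, faster circuit than a general
# multiplier; here the "circuit" is the returned closure.
def specialize_multiplier(coeff):
    # "Constant propagation": record which bits of the coefficient are set.
    shifts = [i for i in range(coeff.bit_length()) if (coeff >> i) & 1]
    def mul(x):
        # One shifted term (one adder input) per set bit of the constant.
        return sum(x << s for s in shifts)
    return mul

mul_by_10 = specialize_multiplier(10)  # 10 = 0b1010 -> two shift-add terms
```

When an adaptive filter's coefficient changes slowly, regenerating such a specialized "circuit" per update can still be a net win, which is the slide's point.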
19
-
SCORE:
■ A computation model for reconfigurable systems
■ abstracts out: physical hardware details
‣ especially size and number of resources
■ Goal
■ achieve device independence
■ approach density/efficiency of raw hardware
■ allow application performance to scale based on system resources (w/out human intervention)
20
Stream Computations Organized for Reconfigurable Execution
-
SCORE – Virtualized Fabric Model
21
If-else
High silicon efficiency:
♦ Only active parts of the data-flow consume resources.
♦ The high-duty-cycle critical path of the computation stays mapped, and remaining resources are shared by lower-duty-cycle paths.
♦ Particularly effective for a multi-tasking environment with time-varying task requirements.
♦ Fabric virtualization with demand paging:
• Get the most out of available resources by automatically time-multiplexing.
• Automatic migration up and down in device sizes.
• Eases application development.
-
SCORE Basics
■ Abstract computation is a dataflow graph (Kahn process network)
■ stream links between operators
■ dynamic dataflow rates
■ Compiler breaks up the computation into compute pages
■ unit of scheduling and virtualization
■ stream links between pages
■ Virtual compute pages are "demand-paged" into available hardware resources as needed.
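A hypothetical Python sketch of the SCORE-style picture (illustrative only): operators linked by streams, modeled with generators, where each generator stands in for a compute page connected by stream links:

```python
# Hypothetical sketch: SCORE-style operators connected by streams, modeled
# with Python generators. Each generator stands in for one compute page;
# lazy evaluation loosely mirrors dynamic dataflow rates.
def scale(stream, c):
    for x in stream:        # consume one token from the input stream
        yield x * c         # emit one token on the output stream

def add_streams(a, b):
    for x, y in zip(a, b):  # fire only when both input streams have tokens
        yield x + y

# Two scaling "pages" feeding an adder "page": out = 2*x + 3*x per token.
inputs = [1, 2, 3]
out = list(add_streams(scale(iter(inputs), 2), scale(iter(inputs), 3)))
```

A real SCORE runtime schedules such pages onto finite fabric; the generator model only captures the stream-linked dataflow-graph abstraction.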
22
-
Virtual Hardware Model
■ Dataflow graph is arbitrarily large
■ Hardware has finite resources
■ resources vary among implementations
■ Dataflow graph is scheduled on the hardware
■ Happens automatically (software)
■ physical resources are abstracted in the compute model
■ Graph composition and node size are data dependent
23
-
Architecture Model
24
Hybrid processor: conventional RISC core, reconfigurable array, memory.
-
Serial Implementation
25
-
Spatial Implementation
26
-
Architecture Model (cont.)
■ The architecture model and SCORE compute model permit scaling over a wide range of IC processes and die sizes.
■ A 0.13 µm process with a 16 mm × 16 mm payload suggests:
■ 256 compute/memory tiles
■ total of 32K logic cells, 0.5 Gbit memory
■ RISC core with 32 Kbit I/D cache, area equivalent to 8 tiles.
27
-
Configurable System on a Chip
■ This micro-architecture and chip is an example CSoC
■ SCORE provides a general framework for SoC families
■ interconnect/architecture fabric
■ software model
‣ compute model for application assembly/scaling
‣ OS/runtime
■ both for
‣ standard cores
‣ custom, application-specific components (hardcoded accelerators)
28
-
Key Idea: Interconnect Fabric
■ Standard/common interconnect fabric
■ Mix-and-match nodes on the fabric
■ provide different resource balance
■ match needs of particular applications
■ All use a common compute model
■ share software and infrastructure
29
-
Sample Hybrid CSoC – vision2000
30
-
Xilinx Versal 2020
31
-
Reconfigurable Array Design Research
‣ Architecture
‣ Configurable Logic Blocks
‣ Alternatives to LUTs (less area, delay)
32
Many of these topics have been studied, but remain open (or secret).
As with most design decisions, it comes down to design-space exploration and cost/benefit analysis – ideally, finding the Pareto-optimal frontier.
In FPGA design, this is complicated by the fact that decisions are interrelated. For instance, CLB design strongly affects interconnect needs.
-
Reconfigurable Array Research
‣ Architecture
‣ Configurable Logic Blocks
‣ LUT size:
33
E. Ahmed and J. Rose, "The effect of LUT and cluster size on deep-submicron FPGA performance and density," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 3, pp. 288-298, March 2004, doi: 10.1109/TVLSI.2004.824300.
“Finally, our results show that a LUT size of 4 to 6 and cluster size of between 3-10 provides the best area-delay product for an FPGA”
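For readers new to the term, a k-input LUT is simply a 2^k-entry truth table indexed by the input bits, so it can implement any k-input Boolean function. A minimal Python sketch (illustrative, not from the cited paper):

```python
# Hypothetical sketch: a k-input LUT is a 2**k-entry truth table indexed
# by the input bits -- the basic logic element the cited studies size.
def make_lut(truth_table):
    # truth_table has 2**k entries for a k-input LUT.
    def lut(*bits):
        index = 0
        for b in bits:                  # pack the input bits into an index
            index = (index << 1) | (b & 1)
        return truth_table[index]
    return lut

# 2-input XOR as a 4-entry table: f(0,0)=0, f(0,1)=1, f(1,0)=1, f(1,1)=0.
xor2 = make_lut([0, 1, 1, 0])
```

The area/delay trade-off the paper measures follows directly from this structure: LUT area grows roughly as 2^k, while larger k lets one LUT absorb more logic levels.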
-
Reconfigurable Array Research
‣ Architecture
‣ Configurable Logic Blocks
‣ Internal interconnection among LUTs/FFs
34
Wenyi Feng, Jonathan Greene, and Alan Mishchenko. 2018. Improving FPGA Performance with a S44 LUT Structure. In FPGA’18: 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 25–27, 2018, Monterey, CA, USA. ACM, New York, NY, USA, 6 pages. DOI: https://doi.org/10.1145/3174243.3174272
“we show that mapping to a 7-input LUT structure can approach the performance of 6-input LUTs while retaining the area and static power advantage of 4-input LUTs ”
-
Reconfigurable Array Research
‣ Architecture
‣ Configurable Logic Blocks
‣ ALU- versus LUT-based, processor cores
‣ Hybrid fine-grained/coarse-grained?
35
Marshall, Alan; Stansfield, Tony; Kostarnov, Igor; Vuillemin, Jean; Hutchings, Brad. (1999). A Reconfigurable Arithmetic Array for Multimedia Applications. 135-143. 10.1145/296399.296444.
A. Podobas, K. Sano and S. Matsuoka, "A Survey on Coarse-Grained Reconfigurable Architectures From a Performance Perspective," in IEEE Access, vol. 8, pp. 146719-146743, 2020, doi: 10.1109/ACCESS.2020.3012084.
-
Reconfigurable Array Research
‣ Architecture
‣ Interconnection Network (relatively unexplored)
‣ Clos networks, fat-trees, other ad hoc topologies
‣ Tool place-and-route time critical
‣ Configuration Structure
‣ Partial reconfiguration granularity
‣ dynamic reconfiguration (reconfigure while running)
‣ multiple contexts
‣ debugging interface
36
-
Reconfigurable Array Research
‣ Implementation
‣ Standard Cells versus Full Custom
‣ Hybrid
37
Kim, Jin & Anderson, Jason. (2017). Synthesizable Standard Cell FPGA Fabrics Targetable by the Verilog-to-Routing CAD Flow. ACM Transactions on Reconfigurable Technology and Systems. 10. 1-23. 10.1145/3024063.
X. Tang, E. Giacomin, A. Alacchi, B. Chauviere and P. Gaillardon, "OpenFPGA: An Opensource Framework Enabling Rapid Prototyping of Customizable FPGAs," 2019 29th International Conference on Field Programmable Logic and Applications (FPL), Barcelona, Spain, 2019, pp. 367-374, doi: 10.1109/FPL.2019.00065.
Tile Area
-
Reconfigurable Array Research
‣ Implementation
‣ Full-custom fabric layout generation (process portability, rapid DSE):
‣ Sticks
‣ BAG – Berkeley Analog Generator
38
E. Chang et al., "BAG2: A process-portable framework for generator-based AMS circuit design," 2018 IEEE Custom Integrated Circuits Conference (CICC), San Diego, CA, 2018, pp. 1-8, doi: 10.1109/CICC.2018.8357061.
-
Reconfigurable Array Research
‣ The "RISC" of FPGAs
‣ Reduced Complexity Reconfigurable Array (RCRA)
‣ Compared to state-of-the-art commercial arrays, can a very simple array compete with (or beat) them on:
‣ performance (clock frequency), area efficiency, power efficiency,
‣ area/delay product, power/delay product?
‣ Do the PPR (partition, place, & route) tools speed up?
‣ Ideas:
‣ If local interconnect is efficient, then perhaps clustering is not necessary.
‣ Are LUTs overkill?
‣ Simpler interconnect?
39
-
Class Projects
‣ For purposes of learning:
‣ Redesign a classic (simple array):
‣ ex: Xilinx XC2064, XC6200
‣ Advanced:
‣ first attempt at an RCRA
‣ multi-context optimized for dynamic reconfiguration
‣ hybrid coarse-grain/fine-grain array
‣ How many projects?
‣ How many teams?
‣ How to divide up functionality and/or be redundant?
40
Do we want to do architecture research or simply focus on implementation issues? Do you want to do research in implementation issues?
-
Project Proposals
41
-
End of Lecture 6
42