-
CS250, UC Berkeley Fall '20 – Lecture 06, Reconfigurable Architectures
CS250 VLSI Systems Design
Fall 2020
John Wawrzynek
with
Arya Reais-Parsi
-
Brings us to "reconfigurable computing"
■ What is it?
■ Standard definition:
Computing via a post-fabrication and spatially programmed connection of processing elements.
‣ ASIC implementations excluded – not post-fabrication programmable.
‣ FPGA implementation of a processor core to run a program excluded – not a direct spatial mapping of the problem.
■ Does this include arrays of processors?
■ This definition restricts RC to mapping to "fine-grained" devices (such as FPGAs); however, many of the same principles apply to arrays of processors.
2
-
Spatial Computation
■ Example:
grade = 0.2 × mt1 + 0.2 × mt2
      + 0.2 × mt3 + 0.4 × project;
■ A hardware resource is allocated and dedicated for each operator (in this case, multiplier or adder) in the compute graph.
3
[Figure: compute graph – four multipliers (0.2 × mt1, 0.2 × mt2, 0.2 × mt3, 0.4 × proj) feeding a tree of three adders that produces grade.]
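As an illustration (not from the slides), the spatial mapping can be sketched in Python, with each function call standing in for one dedicated hardware operator; in real hardware all seven operators would work concurrently:

```python
# Hypothetical sketch of the spatial mapping: one dedicated "unit" per
# operator in the compute graph, all of which would run concurrently in
# actual hardware.
def mul(a, b):
    return a * b  # stands in for one dedicated multiplier

def add(a, b):
    return a + b  # stands in for one dedicated adder

def grade_spatial(mt1, mt2, mt3, project):
    # Four multipliers feed a tree of three adders.
    p1 = mul(0.2, mt1)
    p2 = mul(0.2, mt2)
    p3 = mul(0.2, mt3)
    p4 = mul(0.4, project)
    return add(add(p1, p2), add(p3, p4))
```

Because every operator has its own resource, the latency is set by the depth of the graph (one multiply plus two adds), not by the number of operators.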
-
Temporal Computation
■ A hardware resource is time-multiplexed to implement the actions of the operators in the compute graph.
■ Typical in a sequential processor/software solution, however also possible in reconfigurable logic.
■ In reconfigurable logic it might be necessary to serialize a computation:
■ Limited chip resources
■ Limited I/O bandwidth
4
acc1 = mt1 + mt2;
acc1 = acc1 + mt3;
acc1 = 0.2 × acc1;
acc2 = 0.4 × proj;
grade = acc1 + acc2;
[Figure: abstract compute graph (multipliers and adders over 0.2, mt1, mt2, mt3, 0.4, proj, producing grade) and its implementation – a controller sequencing a single ALU over registers acc1 and acc2.]
Reconfigurable Computing permits the full range of spatial, temporal, and mixed computing solutions to best match implementation to task specifics and available hardware.
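The serialized grade computation above can likewise be sketched as a hypothetical Python model: a small instruction list plays the role of the controller, and one function application per step plays the role of the time-multiplexed ALU.

```python
# Hypothetical sketch of temporal computation: one ALU, time-multiplexed
# over the operators, driven by a controller (the instruction list).
def run_temporal(mt1, mt2, mt3, project):
    regs = {"mt1": mt1, "mt2": mt2, "mt3": mt3, "proj": project}
    program = [  # (dest, op, src_a, src_b): one ALU operation per step
        ("acc1", "+", "mt1", "mt2"),
        ("acc1", "+", "acc1", "mt3"),
        ("acc1", "*", "acc1", 0.2),
        ("acc2", "*", "proj", 0.4),
        ("grade", "+", "acc1", "acc2"),
    ]
    for dest, op, a, b in program:
        va = regs[a] if isinstance(a, str) else a  # register or constant
        vb = regs[b] if isinstance(b, str) else b
        regs[dest] = va + vb if op == "+" else va * vb
    return regs["grade"]
```

Here the latency is five ALU steps instead of the graph depth, which is the price paid for using a single shared resource.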
-
RC, Processors, & ASIC
5
-
RC Strategy
1. Exploit cases where an operation can be bound and then reused a large number of times.
2. Customize for operator type, width, and interconnect.
3. Low-overhead exploitation of application parallelism.
6
-
Hybrid Approach
■ 90/10 rule:
■ 90 percent of the program runtime is consumed by 10 percent of the code (inner loops).
■ Only small portions of an application become the performance bottlenecks.
■ Usually, these portions of code are data-processing intensive with relatively fixed dataflow patterns (little control).
■ The other 90 percent of the code is not performance critical.
7
⇒ Hybrid processor-core + reconfigurable-array
-
Garp – Hybrid Processor
8
Function              Speedup
strlen (len 16)        1.77
strlen (len 1024)     14
sort                   2.1
image median filter   26.9
DES (ECB mode)        19.6
image dithering       16.3
Speedups over a 4-way superscalar UltraSparc on the same process and comparable die size and memory system.
"Garp: A MIPS Processor with a Reconfigurable Coprocessor", in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '97), April 16–18, 1997.
• Pre-generated circuits for common program kernels cached within the reconfigurable array and used to accelerate MIPS programs.
• Nanosecond-scale configuration swap time.
• Speedup tied to a single execution thread.
-
GarpCC (T. Callahan)
9
Compilation time < 2× processor-only compilation time.
Kernels from wavelet image compression. Speedups relative to MIPS processor only.
“The Garp Architecture and C Compiler”, IEEE Computer, April 2000.
Kernel             raw   net
forward_wavelet_1  2     1.9
forward_wavelet_2  4.1   3.6
init_image         6.4   6.4
forward_wavelet_3  4.1   3.6
forward_wavelet_4  5.2   4.1
entropy_encode_1   4     4
block_quantize     2.8   2.6
RLE_encode         5.8   3.4
entropy_encode_2   2.9   1.5
-
Advantages of RC over Processor Core
■ Conventional processors have several sources of inefficiency:
■ Heavy time-multiplexing of function units (ALUs).
■ Instruction issue overhead.
■ Memory hierarchy to deal with memory latency.
■ Operator mismatch.
10
[Figure: peak (raw) performance comparison.]
-
Advantages of RC
■ Relative to microprocessors: on average, a higher percentage of peak (or raw) computational density is achieved with reconfigurable devices:
■ Fine-grain flexibility leads to exploitation of problem-specific parallelism at many levels.
■ Also, many different computation models (or patterns) can be supported. In general, it is possible to match problem characteristics to hardware, through the use of problem-specific architectures and low-level circuit specialization.
11
■ Spatial mapping of computation, versus multiplexing of function units (as in processors), relieves pressure on memory capacity and bandwidth, and promotes local communication patterns.
-
Fine-grained Reconfigurable Fabrics
Homogeneous fine-grained arrays are maximally flexible:
a. Admit a wide variety of computational architecture models: arrays of processors, hybrid approaches, hard-wired dataflow, systolic processing, vector processing, etc.
b. Admit a wide variety of parallelism modes: SIMD, MIMD, bit-level, etc. Resources can be deployed to lower latency when required for tight feedback loops (not possible with many parallel architectures that optimize for throughput).
c. Support many compilation/resource-management models: statically compiled, dynamically mapped.
12
Safe bet as a future standard device.
-
FPGAs are Reconfigurable
1. Volume/cost graphs don't accurately capture the potential real costs and other advantages.
2. Commercial applications have not taken advantage of reconfigurability:
• Xilinx/Altera (Intel) haven't done much to help.
• Methodologies/tools nearly nonexistent.
Reconfiguration uses:
‣ Field upgrades ⇒ product life extension, changing requirements.
‣ In-system board-level testing and field diagnostics.
‣ Tolerance to faults.
‣ Risk management in system development.
‣ Runtime reconfiguration ⇒ higher silicon efficiency.
‣ Time-multiplexed pre-designed circuits make maximum use of resources.
‣ Runtime specialized circuit generation.
13
Seemingly obvious point but …
-
Multi-modal Computing Tasks
■ Mini/Micro-UAVs
■ One piece of silicon for all of sensor processing, navigation, communications, planning, logging, etc.
■ At different times, different tasks take priority and consume a higher percentage of resources.
■ Other example: hand-held multi-function device with GPS, smart image capture/analysis, communications.
14
A premier application for reconfigurable devices is one with constrained size/weight that needs multiple functions at near-ASIC performance.
Multiple ASICs are too expensive/big. A processor is too slow. Fine-grained reconfigurable devices have the flexibility to efficiently match task parallelism over a wide variety of tasks – deployed as needed and reconfigured as needed.
Mars-rover
-
Sounds great, what's the catch?
■ Lack of a programming model with convenient and effective tools.
■ Most successful computing applications using reconfigurable devices involve substantial "hand mapping" – essentially circuit design.
■ A complex issue, but perhaps changing the fabric design can help.
15
-
Rapid Runtime Reconfiguration
■ Might permit even higher efficiency through hardware sharing (multiplexing) and on-the-fly circuit specialization.
■ Largely unexploited (unproven) to date.
■ A few research projects have explored this idea.
■ Need to be careful – multiplexing adds cost.
■ Remember the "Binding Time Principle": the earlier the "instruction" is bound, the less area and delay required for the implementation.
16
-
Rapid Runtime Reconfiguration
1. Time-multiplexing resources allows more efficient use of silicon (in ways ASICs typically do not):
a. Low-duty-cycle or "off critical path" computations time-share the fabric while the critical path stays mapped in:
17
Why dynamic reconfiguration?
[Figure: amount of reconfigurable fabric versus total runtime, marking the size of maximum efficiency.]
-
Rapid Runtime Reconfiguration
b. Coarse data-dependent control flow maps in only the useful dataflow:
c. The allowable task footprint may change as other tasks come and go or faults occur.
Fabric virtualization allows automatic migration up and down in device sizes and eases application development.
18
If-then-else
-
Rapid Runtime Reconfiguration
2. Runtime Circuit Specialization:
• Example: fixed-coefficient multipliers in an adaptive filter, changing value at a low rate.
• Aggressive constant propagation (based perhaps on runtime profiling) reduces circuit size and delay.
• Could use "branch/value/range prediction" to map the most common case and fault in exceptional cases.
• Can be template based – "fill in the blanks" – but better if we put PPR in the runtime loop!
• Array-HW-assisted place and route may make it possible.
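To make the constant-propagation idea concrete, here is a hypothetical Python sketch of specializing a constant-coefficient multiplier into shift-and-add form, the software analogue of generating a smaller circuit once the coefficient is known (the function names are illustrative, not from any tool):

```python
# Hypothetical sketch of runtime circuit specialization: bake a known
# coefficient into a shift-and-add multiplier. In hardware, a constant
# with few set bits yields a much smaller, faster circuit than a general
# multiplier; here the "circuit" is the returned closure.
def specialize_multiplier(coeff):
    # "Constant propagation": record which bits of the coefficient are set.
    shifts = [i for i in range(coeff.bit_length()) if (coeff >> i) & 1]
    def mul(x):
        # One shifted term (one adder input) per set bit of the constant.
        return sum(x << s for s in shifts)
    return mul

mul_by_10 = specialize_multiplier(10)  # 10 = 0b1010 -> two shift-add terms
```

When an adaptive filter's coefficient changes slowly, regenerating such a specialized "circuit" per update can still be a net win, which is the slide's point.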
19
-
SCORE:
■ A computation model for reconfigurable systems
■ abstracts out: physical hardware details
‣ especially size and number of resources
■ Goal
■ achieve device independence
■ approach density/efficiency of raw hardware
■ allow application performance to scale based on system resources (w/out human intervention)
20
Stream Computations Organized for Reconfigurable Execution
-
SCORE – Virtualized Fabric Model
21
If-else
High silicon efficiency:
♦ Only active parts of the data-flow consume resources.
♦ The high-duty-cycle critical path of the computation stays mapped, and remaining resources are shared by lower-duty-cycle paths.
♦ Particularly effective for a multi-tasking environment with time-varying task requirements.
♦ Fabric virtualization with demand paging:
• Get the most out of available resources by automatically time-multiplexing.
• Automatic migration up and down in device sizes.
• Eases application development.
-
SCORE Basics
■ Abstract computation is a dataflow graph (Kahn process network)
■ stream links between operators
■ dynamic dataflow rates
■ Compiler breaks up the computation into compute pages
■ unit of scheduling and virtualization
■ stream links between pages
■ Virtual compute pages are "demand-paged" into available hardware resources as needed.
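A hypothetical Python sketch of the SCORE-style picture (illustrative only): operators linked by streams, modeled with generators, where each generator stands in for a compute page connected by stream links:

```python
# Hypothetical sketch: SCORE-style operators connected by streams, modeled
# with Python generators. Each generator stands in for one compute page;
# lazy evaluation loosely mirrors dynamic dataflow rates.
def scale(stream, c):
    for x in stream:        # consume one token from the input stream
        yield x * c         # emit one token on the output stream

def add_streams(a, b):
    for x, y in zip(a, b):  # fire only when both input streams have tokens
        yield x + y

# Two scaling "pages" feeding an adder "page": out = 2*x + 3*x per token.
inputs = [1, 2, 3]
out = list(add_streams(scale(iter(inputs), 2), scale(iter(inputs), 3)))
```

A real SCORE runtime schedules such pages onto finite fabric; the generator model only captures the stream-linked dataflow-graph abstraction.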
22
-
Virtual Hardware Model
■ Dataflow graph is arbitrarily large
■ Hardware has finite resources
■ resources vary among implementations
■ Dataflow graph is scheduled on the hardware
■ Happens automatically (software)
■ physical resources are abstracted in the compute model
■ Graph composition and node size are data dependent
23
-
Architecture Model
24
Hybrid processor: conventional RISC core, reconfigurable array, memory.
-
Serial Implementation
25
-
Spatial Implementation
26
-
Architecture Model (cont.)
■ The architecture model and SCORE compute model permit scaling over a wide range of IC processes and die sizes.
■ A 0.13 µm process with a 16 mm × 16 mm payload suggests:
■ 256 compute/memory tiles
■ total of 32K logic cells, 0.5 Gbit memory
■ RISC core with 32 Kbit I/D cache, area equivalent to 8 tiles.
27
-
Configurable System on a Chip
■ This micro-architecture and chip is an example CSoC
■ SCORE provides a general framework for SoC families
■ interconnect/architecture fabric
■ software model
‣ compute model for application assembly/scaling
‣ OS/runtime
■ both for
‣ standard cores
‣ custom, application-specific components (hardcoded accelerators)
28
-
Key Idea: Interconnect Fabric
■ Standard/common interconnect fabric
■ Mix-and-match nodes on the fabric
■ provide different resource balance
■ match needs of particular applications
■ All use a common compute model
■ share software and infrastructure
29
-
Sample Hybrid CSoC – vision2000
30
-
Xilinx Versal 2020
31
-
Reconfigurable Array Design Research
‣ Architecture
‣ Configurable Logic Blocks
‣ Alternatives to LUTs (less area, delay)
32
Many of these topics have been studied, but remain open (or secret).
As with most design decisions, it comes down to design-space exploration and cost/benefit analysis – ideally, finding the Pareto-optimal frontier.
In FPGA design, this is complicated by the fact that decisions are interrelated. For instance, CLB design strongly affects interconnect needs.
-
Reconfigurable Array Research
‣ Architecture
‣ Configurable Logic Blocks
‣ LUT size:
33
E. Ahmed and J. Rose, "The effect of LUT and cluster size on deep-submicron FPGA performance and density," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 3, pp. 288-298, March 2004, doi: 10.1109/TVLSI.2004.824300.
“Finally, our results show that a LUT size of 4 to 6 and cluster size of between 3-10 provides the best area-delay product for an FPGA”
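For readers new to the term, a k-input LUT is simply a 2^k-entry truth table indexed by the input bits, so it can implement any k-input Boolean function. A minimal Python sketch (illustrative, not from the cited paper):

```python
# Hypothetical sketch: a k-input LUT is a 2**k-entry truth table indexed
# by the input bits -- the basic logic element the cited studies size.
def make_lut(truth_table):
    # truth_table has 2**k entries for a k-input LUT.
    def lut(*bits):
        index = 0
        for b in bits:                  # pack the input bits into an index
            index = (index << 1) | (b & 1)
        return truth_table[index]
    return lut

# 2-input XOR as a 4-entry table: f(0,0)=0, f(0,1)=1, f(1,0)=1, f(1,1)=0.
xor2 = make_lut([0, 1, 1, 0])
```

The area/delay trade-off the paper measures follows directly from this structure: LUT area grows roughly as 2^k, while larger k lets one LUT absorb more logic levels.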
-
Reconfigurable Array Research
‣ Architecture
‣ Configurable Logic Blocks
‣ Internal interconnection among LUTs/FFs
34
Wenyi Feng, Jonathan Greene, and Alan Mishchenko. 2018. Improving FPGA Performance with a S44 LUT Structure. In FPGA’18: 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 25–27, 2018, Monterey, CA, USA. ACM, New York, NY, USA, 6 pages. DOI: https://doi.org/10.1145/3174243.3174272
“we show that mapping to a 7-input LUT structure can approach the performance of 6-input LUTs while retaining the area and static power advantage of 4-input LUTs ”
-
Reconfigurable Array Research
‣ Architecture
‣ Configurable Logic Blocks
‣ ALU- versus LUT-based, processor cores
‣ Hybrid fine-grained/coarse-grained?
35
Marshall, Alan; Stansfield, Tony; Kostarnov, Igor; Vuillemin, Jean; Hutchings, Brad. (1999). A Reconfigurable Arithmetic Array for Multimedia Applications. 135-143. 10.1145/296399.296444.
A. Podobas, K. Sano and S. Matsuoka, "A Survey on Coarse-Grained Reconfigurable Architectures From a Performance Perspective," in IEEE Access, vol. 8, pp. 146719-146743, 2020, doi: 10.1109/ACCESS.2020.3012084.
-
Reconfigurable Array Research
‣ Architecture
‣ Interconnection Network (relatively unexplored)
‣ Clos networks, fat-trees, other ad hoc topologies
‣ Tool place-and-route time critical
‣ Configuration Structure
‣ Partial reconfiguration granularity
‣ dynamic reconfiguration (reconfigure while running)
‣ multiple contexts
‣ debugging interface
36
-
Reconfigurable Array Research
‣ Implementation
‣ Standard Cells versus Full Custom
‣ Hybrid
37
Kim, Jin & Anderson, Jason. (2017). Synthesizable Standard Cell FPGA Fabrics Targetable by the Verilog-to-Routing CAD Flow. ACM Transactions on Reconfigurable Technology and Systems. 10. 1-23. 10.1145/3024063.
X. Tang, E. Giacomin, A. Alacchi, B. Chauviere and P. Gaillardon, "OpenFPGA: An Opensource Framework Enabling Rapid Prototyping of Customizable FPGAs," 2019 29th International Conference on Field Programmable Logic and Applications (FPL), Barcelona, Spain, 2019, pp. 367-374, doi: 10.1109/FPL.2019.00065.
Tile Area
-
Reconfigurable Array Research
‣ Implementation
‣ Full-custom fabric layout generation (process portability, rapid DSE):
‣ Sticks
‣ BAG – Berkeley Analog Generator
38
E. Chang et al., "BAG2: A process-portable framework for generator-based AMS circuit design," 2018 IEEE Custom Integrated Circuits Conference (CICC), San Diego, CA, 2018, pp. 1-8, doi: 10.1109/CICC.2018.8357061.
-
Reconfigurable Array Research
‣ The "RISC" of FPGAs
‣ Reduced Complexity Reconfigurable Array (RCRA)
‣ Compared to state-of-the-art commercial arrays, can a very simple array compete with (or beat) them on:
‣ performance (clock frequency), area efficiency, power efficiency,
‣ area/delay product, power/delay product?
‣ Do the PPR (partition, place, & route) tools speed up?
‣ Ideas:
‣ If local interconnect is efficient, then perhaps clustering is not necessary.
‣ Are LUTs overkill?
‣ Simpler interconnect?
39
-
Class Projects
‣ For purposes of learning:
‣ Redesign a classic (simple array):
‣ ex: Xilinx XC2064, XC6200
‣ Advanced:
‣ first attempt at an RCRA
‣ multi-context optimized for dynamic reconfiguration
‣ hybrid coarse-grain/fine-grain array
‣ How many projects?
‣ How many teams?
‣ How to divide up functionality and/or be redundant?
40
Do we want to do architecture research or simply focus on implementation issues? Do you want to do research in implementation issues?
-
Project Proposals
41
-
End of Lecture 6
42