Transcript of CS250 VLSI Systems Design, Fall 2020, Lecture 05: Reconfigurable Architecture
-
CS250, UC Berkeley Fall ‘20. Lecture 04, Reconfigurable Architectures 2
CS250 VLSI Systems Design
Fall 2020
John Wawrzynek
with
Arya Reais-Parsi
-
CS250, UC Berkeley Fall ‘20. Lecture 05, Reconfigurable Architecture
Reconfigurable Fabric Architecture: Degrees of Freedom
1. Logic Blocks: capacity and internal structure of combinational logic circuits and state element(s); clustering and internal interconnect.
2. Interconnection Network Architecture: circuit-switched, not packet-switched; topology of the network.
3. Configuration Architecture: how is programming information loaded and distributed; configuration "depth".
4. Hard blocks (RAM, ALUs, processor cores, …): function(s), count, and how they are integrated into the fabric.
-
Xilinx Virtex-5
Colors represent different types of resources: Logic Block, RAM, DSP (ALUs), Clocking, I/O, Serial I/O + PCI.
A routing fabric runs throughout the chip to wire everything together.
-
Interconnection Topologies
‣ Traditional: Island Style
‣ From flexlogic: Clos Network, which "uses about half the area of the traditional interconnect and uses only 5-7 metal routing layers"
-
Fat-Tree Based Interconnect
‣ Use "Rent's rule" for proper thickness
Lessons: 1) for efficiency, need to "flatten" lower levels of the tree; 2) critical path might be long.
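As a rough sketch of how Rent's rule sizes the tree: the number of external terminals of a group of g logic blocks grows as T = t * g^p, so channel width must grow toward the root. The values t = 4 and p = 0.6 below are illustrative assumptions, not figures from the lecture.

```python
# Rent's rule sketch: T = t * g^p external terminals for g logic blocks.
# t (terminals per block) and p (Rent exponent) are illustrative values.
def rent_terminals(t, g, p):
    """Estimate external terminals T for a group of g logic blocks."""
    return t * g ** p

# Channel width at each fat-tree level: a subtree 4x larger needs about
# 4^p ~ 2.3x the wires (for p = 0.6), so upper levels must be "fatter".
for level in range(4):
    g = 4 ** level          # subtree size quadruples per level
    print(level, round(rent_terminals(4, g, 0.6), 1))
```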
-
Embedded Hard Blocks
‣ Many important functions are not efficient when implemented in the reconfigurable fabric: multiplication, large memory, processor cores, …
‣ Dedicated blocks take relatively little area, and therefore are worth including even though they could go unused.
-
Virtex DSP48E Slice
Efficient implementation of multiply, add, and bit-wise logical operations.
-
Block RAM Overview
❑ 36K bits of data total; can be configured as:
▪ 2 independent 18Kb RAMs, or one 36Kb RAM.
❑ Each 36Kb block RAM can be configured as:
▪ 64Kx1 (when cascaded with an adjacent 36Kb block RAM), 32Kx1, 16Kx2, 8Kx4, 4Kx9, 2Kx18, or 1Kx36 memory.
❑ Each 18Kb block RAM can be configured as:
▪ 16Kx1, 8Kx2, 4Kx4, 2Kx9, or 1Kx18 memory.
❑ Write and Read are synchronous operations.
❑ The two ports are symmetrical and totally independent (can have different clocks), sharing only the stored data.
❑ Each port can be configured in one of the available widths, independent of the other port. The read port width can be different from the write port width for each port.
❑ The memory content can be initialized or cleared by the configuration bitstream.
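The aspect ratios above all multiply out to the block's capacity. As a small sketch (assuming, as in Xilinx block RAM, that the x9/x18/x36 widths include one parity bit per byte and so use the full 36Kb, while x1/x2/x4 expose 32Kb of data), the configurations can be checked programmatically:

```python
# Valid width/depth pairs for one 36Kb block RAM, per the list above.
# Assumption: widths 9/18/36 include parity bits (full 36Kb used);
# widths 1/2/4 expose 32Kb of data bits.
BRAM36_CONFIGS = {
    1: 32 * 1024,   # 32Kx1
    2: 16 * 1024,   # 16Kx2
    4: 8 * 1024,    # 8Kx4
    9: 4 * 1024,    # 4Kx9
    18: 2 * 1024,   # 2Kx18
    36: 1 * 1024,   # 1Kx36
}

for width, depth in BRAM36_CONFIGS.items():
    bits = width * depth
    # Every configuration accounts for either 32Kb data or 36Kb total.
    assert bits in (32 * 1024, 36 * 1024), (width, depth)
    print(f"{depth}x{width}: {bits} bits")
```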
-
Ultra-RAM Blocks
The UltraRAM block is a dual-port synchronous 288Kb RAM with a fixed configuration of 4,096 deep and 72 bits wide.
-
First Commercial Hybrid FPGA
‣ Xilinx Virtex-II Pro
‣ January 2002
‣ 130nm process
-
State-of-the-Art Xilinx FPGAs
Virtex UltraScale
-
Configuration Architecture
‣ How are the programming bits loaded and distributed?
‣ Configuration depth (number of stored on-chip configurations)
‣ The same interface can often provide read-back to save state / debug
‣ Design challenge:
‣ Configurations are very large (100's of Mbits)
‣ Moving many bits over the chip interface requires time and energy
Many commercial FPGAs also have an internal reconfiguration controller that allows dynamic self-reconfiguration.
-
Internal Reconfiguration
‣ Traditionally, long shift chains: slow, relatively energy efficient
‣ "Random access" structures have been tried: they permit fine-grain partial reconfiguration
Connections to logic blocks, programmable interconnection points, …
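A toy model of the trade-off above: a serial shift chain must stream the entire bitstream through the device even to change one bit, while a random-access structure rewrites only the affected frames. The frame size and word width below are illustrative assumptions, not real device parameters.

```python
# Sketch contrasting serial shift-chain loading with random-access
# (frame-addressed) configuration. Numbers are illustrative only.
def shift_chain_cycles(total_config_bits):
    # One bit per cycle, and the whole chain must be traversed
    # even for a tiny change.
    return total_config_bits

def random_access_cycles(frame_bits, frames_changed, word_width=32):
    # Only the modified frames are written, one word per cycle.
    return frames_changed * (frame_bits // word_width)

total = 100 * 10**6   # "100's of Mbits", per the configuration slide
print("full shift-chain load:", shift_chain_cycles(total), "cycles")
print("10-frame partial write:",
      random_access_cycles(frame_bits=3200, frames_changed=10), "cycles")
```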
-
Xilinx Configuration Layout
‣ A "frame" is the unit of reconfiguration
‣ Serially loaded into the chip
‣ Permits "partial reconfiguration"
‣ The XC6000 took this to the limit, with word-level configuration granularity.
-
Multi-context FPGAs ("3-D FPGA")
‣ Rapid dynamic reconfiguration possible.
‣ What's the execution and programming model?
"Garp: a MIPS processor with a reconfigurable coprocessor", Proceedings of the 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 1997.
-
Brings us to "reconfigurable computing"
■ What is it?
■ Standard definition: computing via a post-fabrication and spatially programmed connection of processing elements.
‣ ASIC implementations excluded: not post-fabrication programmable.
‣ FPGA implementation of a processor core to run a program excluded: not a direct spatial mapping of the problem.
■ Does this include arrays of processors? This definition restricts RC to mapping to "fine-grained" devices (such as FPGAs); however, many of the same principles apply to arrays of processors.
-
Spatial Computation
■ Example:
grade = 0.2 × mt1 + 0.2 × mt2 + 0.2 × mt3 + 0.4 × project;
■ A hardware resource is allocated and dedicated for each operator (in this case, multiplier or adder) in the compute graph.
[Figure: compute graph with four multipliers (0.2 × mt1, 0.2 × mt2, 0.2 × mt3, 0.4 × proj) feeding three adders that produce grade.]
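The spatial mapping above can be sketched in software by giving every operator in the compute graph its own dedicated instance; here ordinary Python functions stand in for hardware units.

```python
# Spatial-computation sketch: each call site below stands for a
# physically distinct unit in the fabric (four multipliers, three adders).
def mul(a, b):
    return a * b

def add(a, b):
    return a + b

def grade_spatial(mt1, mt2, mt3, project):
    p1, p2 = mul(0.2, mt1), mul(0.2, mt2)       # four parallel multipliers
    p3, p4 = mul(0.2, mt3), mul(0.4, project)
    s1, s2 = add(p1, p2), add(p3, p4)           # two parallel adders
    return add(s1, s2)                          # final adder

print(grade_spatial(80, 90, 70, 85))  # weighted course grade
```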
-
Temporal Computation
■ A hardware resource is time-multiplexed to implement the actions of the operators in the compute graph.
■ Typical in a sequential processor/software solution, but also possible in reconfigurable logic.
■ In reconfigurable logic it might be necessary to serialize a computation, due to:
■ Limited chip resources
■ Limited I/O bandwidth

acc1 = mt1 + mt2;
acc1 = acc1 + mt3;
acc1 = 0.2 x acc1;
acc2 = 0.4 x proj;
grade = acc1 + acc2;

[Figure: the abstract computation graph for grade and its implementation as a controller plus a single time-shared ALU with registers acc1 and acc2.]

Reconfigurable computing permits the full range of spatial, temporal, and mixed computing solutions, to best match the implementation to task specifics and available hardware.
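The serialized sequence above can likewise be sketched as a controller stepping a single time-shared ALU through an instruction list. The instruction encoding here is illustrative, not from the lecture.

```python
# Temporal-computation sketch: one shared ALU, time-multiplexed by a
# controller over the five-step program shown above.
def alu(op, a, b):
    # The single function unit either adds or multiplies each step.
    return a + b if op == "add" else a * b

def grade_temporal(mt1, mt2, mt3, proj):
    regs = {"mt1": mt1, "mt2": mt2, "mt3": mt3, "proj": proj}
    # (op, src_a, src_b, dst); src_a may be a literal coefficient.
    program = [
        ("add", "mt1", "mt2", "acc1"),
        ("add", "acc1", "mt3", "acc1"),
        ("mul", 0.2, "acc1", "acc1"),
        ("mul", 0.4, "proj", "acc2"),
        ("add", "acc1", "acc2", "grade"),
    ]
    for op, a, b, dst in program:
        lhs = a if isinstance(a, float) else regs[a]
        regs[dst] = alu(op, lhs, regs[b])
    return regs["grade"]

print(grade_temporal(80, 90, 70, 85))  # same result, one ALU, five steps
```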
-
RC, Processors, & ASIC
-
RC Strategy
1. Exploit cases where an operation can be bound and then reused a large number of times.
2. Customize for operator type, width, and interconnect.
3. Low-overhead exploitation of application parallelism.
-
Hybrid Approach
■ 90/10 rule: 90 percent of the program runtime is consumed by 10 percent of the code (inner loops).
■ Only small portions of an application become the performance bottlenecks.
■ Usually, these portions of code are data-processing intensive with relatively fixed dataflow patterns (little control).
■ The other 90 percent of the code is not performance critical.
⇒ Hybrid processor-core reconfigurable-array
-
Garp: Hybrid Processor

Function               Speedup
strlen (len 16)        1.77
strlen (len 1024)      14
sort                   2.1
image median filter    26.9
DES (ECB mode)         19.6
image dithering        16.3

Speedups over a 4-way superscalar UltraSparc on the same process with comparable die size and memory system.

• Pre-generated circuits for common program kernels are cached within the reconfigurable array and used to accelerate MIPS programs.
• nsec configuration swap time.
• Speedup is tied to a single execution thread.

"Garp: A MIPS Processor with a Reconfigurable Coprocessor", in Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM ‘97, April 16-18, 1997).
-
GarpCC (T. Callahan)

Kernel               raw   net
forward_wavelet_1    2     1.9
forward_wavelet_2    4.1   3.6
init_image           6.4   6.4
forward_wavelet_3    4.1   3.6
forward_wavelet_4    5.2   4.1
entropy_encode_1     4     4
block_quantize       2.8   2.6
RLE_encode           5.8   3.4
entropy_encode_2     2.9   1.5

Kernels from wavelet image compression. Speedups (raw and net) are relative to the MIPS processor alone. Compilation time < 2x processor-only compilation time.

"The Garp Architecture and C Compiler", IEEE Computer, April 2000.
-
Advantages of RC over a Processor Core
■ Conventional processors have several sources of inefficiency:
■ Heavy time-multiplexing of function units (ALUs).
■ Instruction issue overhead.
■ Memory hierarchy to deal with memory latency.
■ Operator mismatch.
[Figure: peak (raw) performance comparison.]
-
Advantages of RC
■ Relative to microprocessors: on average, a higher percentage of peak (or raw) computational density is achieved with reconfigurable devices.
■ Fine-grain flexibility leads to exploitation of problem-specific parallelism at many levels.
■ Also, many different computation models (or patterns) can be supported. In general, it is possible to match problem characteristics to hardware, through the use of problem-specific architectures and low-level circuit specialization.
■ Spatial mapping of computation, versus multiplexing of function units (as in processors), relieves pressure on memory capacity and bandwidth, and promotes local communication patterns.
-
Advantages of RC
■ Modern FPGAs make good system-level components:
■ Relatively large number of I/Os (many parallel memory ports); high-bandwidth communications.
■ Built-in microprocessors and other blocks.
■ Machines based on these components can easily scale peak performance by riding Moore's curve (FPGAs are process drivers).
■ Low-level redundancy could permit defect tolerance and great cost savings.
■ Is there still room for research in novel devices for RC?
-
FPGAs are Reconfigurable
Seemingly obvious point, but…
1. Volume/cost graphs don't accurately capture the potential real costs and other advantages.
2. Commercial applications have not taken advantage of reconfigurability:
• Xilinx/Altera (Intel) haven't done much to help.
• Methodologies/tools nearly nonexistent.
Reconfiguration uses:
‣ Field upgrades ⇒ product-life extension, changing requirements.
‣ In-system board-level testing and field diagnostics.
‣ Tolerance to faults.
‣ Risk management in system development.
‣ Runtime reconfiguration ⇒ higher silicon efficiency:
‣ Time-multiplexed pre-designed circuits make maximum use of resources.
‣ Runtime specialized circuit generation.
-
Multi-modal Computing Tasks
■ Mini/micro-UAVs: one piece of silicon for all of sensor processing, navigation, communications, planning, logging, etc.
■ At different times, different tasks take priority and consume a higher percentage of resources.
■ Other example: a hand-held multi-function device with GPS, smart image capture/analysis, communications.
A premier application for reconfigurable devices is one with constrained size/weight that needs multiple functions at near-ASIC performance. Multiple ASICs are too expensive/big; a processor is too slow. Fine-grained reconfigurable devices have the flexibility to efficiently match task parallelism over a wide variety of tasks, deployed as needed and reconfigured as needed.
[Figure: Mars rover]
-
Sounds great, what's the catch?
■ Lack of a programming model with convenient and effective tools.
■ Most successful computing applications using reconfigurable devices involve substantial "hand mapping": essentially circuit design.
■ A complex issue, but perhaps changing the fabric design can help.
-
Fine-grained Reconfigurable Fabrics
Homogeneous fine-grained arrays are maximally flexible:
a. Admit a wide variety of computational architecture models: arrays of processors, hybrid approaches, hard-wired dataflow, systolic processing, vector processing, etc.
b. Admit a wide variety of parallelism modes: SIMD, MIMD, bit-level, etc. Resources can be deployed to lower latency when required for tight feedback loops (not possible with many parallel architectures that optimize for throughput).
c. Support many compilation/resource-management models: statically compiled, dynamically mapped.
A safe bet as a future standard device.
-
Rapid Runtime Reconfiguration
■ Might permit even higher efficiency through hardware sharing (multiplexing) and on-the-fly circuit specialization.
■ Largely unexploited (unproven) to date.
■ A few research projects have explored this idea.
■ Need to be careful: multiplexing adds cost.
■ Remember the "Binding Time Principle": the earlier the "instruction" is bound, the less area and delay required for the implementation.
-
Rapid Runtime Reconfiguration
Why dynamic reconfiguration?
1. Time-multiplexing resources allows more efficient use of silicon (in ways ASICs typically do not):
a. Low-duty-cycle or "off critical path" computations time-share the fabric while the critical path stays mapped in.
[Figure: amount of reconfigurable fabric versus total runtime, showing the size of maximum efficiency.]
-
Rapid Runtime Reconfiguration
b. Coarse data-dependent control flow (e.g., if-then-else) maps in only the useful dataflow.
c. The allowable task footprint may change as other tasks come and go or faults occur.
Fabric virtualization allows automatic migration up and down in device sizes, and eases application development.
-
Rapid Runtime Reconfiguration
2. Runtime Circuit Specialization:
• Example: fixed-coefficient multipliers in an adaptive filter, changing value at a low rate.
• Aggressive constant propagation (based perhaps on runtime profiling) reduces circuit size and delay.
• Could use "branch/value/range prediction" to map the most common case and fault in exceptional cases.
• Can be template based ("fill in the blanks"), but better if we put PPR in the runtime loop!
• Array-HW-assisted place and route may make it possible.
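As a sketch of what constant propagation into a multiplier means: once a filter coefficient is known, a general multiplier can be specialized into a few shift-and-add terms. The decomposition below is the standard binary method; the coefficient value and bit width are illustrative assumptions.

```python
# Runtime-specialization sketch: propagate a known (non-negative integer)
# coefficient into a multiplier as shift-and-add terms, the way a
# specialized circuit would replace a general multiplier.
def specialize_const_mult(coeff, width=16):
    """Return shift amounts s such that x*coeff == sum(x << s)."""
    return [i for i in range(width) if (coeff >> i) & 1]

def apply_specialized(x, shifts):
    # Each shift term stands for one hardwired adder input.
    return sum(x << s for s in shifts)

shifts = specialize_const_mult(10)      # 10 = 0b1010 -> shifts [1, 3]
print(apply_specialized(7, shifts))     # computes 7 * 10 with 2 adds
```

A coefficient with few set bits specializes into a very small circuit, which is why low-rate coefficient updates in adaptive filters are such a good match for this technique.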
-
SCORE: Virtualized Fabric Model
High silicon efficiency:
♦ Only active parts of the dataflow consume resources.
♦ The high-duty-cycle critical path of the computation stays mapped, and remaining resources are shared by lower-duty-cycle paths.
♦ Particularly effective for a multi-tasking environment with time-varying task requirements.
♦ Fabric virtualization with demand paging:
• Get the most out of available resources by automatically time-multiplexing.
• Automatic migration up and down in device sizes.
• Eases application development.
[Figure: if-else dataflow example.]
-
SCORE: Stream Computations Organized for Reconfigurable Execution
■ A computation model for reconfigurable systems
■ Abstracts out physical hardware details
‣ especially the size and number of resources
■ Goal
■ achieve device independence
■ approach the density/efficiency of raw hardware
■ allow application performance to scale based on system resources (without human intervention)
-
SCORE Basics
■ The abstract computation is a dataflow graph:
■ stream links between operators
■ dynamic dataflow rates
■ The compiler breaks up the computation into compute pages:
■ the unit of scheduling and virtualization
■ stream links between pages
■ Virtual compute pages are "demand-paged" into available hardware resources as needed.
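A minimal sketch of the stream-linked operator idea, with Python generators standing in for operators and streams. The operator names here are illustrative, not the SCORE API.

```python
# Dataflow-graph sketch: operators connected by streams with dynamic
# rates, modeled as chained generators. Each stage could live on its
# own compute page and be demand-paged onto the fabric.
def source(data):
    yield from data                 # produces the input stream

def scale(stream, k):
    for x in stream:                # one token in, one token out
        yield k * x

def accumulate(stream):
    total = 0
    for x in stream:                # stateful operator: running sum
        total += x
        yield total

# Compose a small graph: source -> scale -> accumulate.
out = list(accumulate(scale(source([1, 2, 3]), 2)))
print(out)  # running sums of the scaled stream
```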
-
Virtual Hardware Model
■ The dataflow graph is arbitrarily large.
■ Hardware has finite resources; resources vary among implementations.
■ The dataflow graph is scheduled on the hardware. This happens automatically (in software); physical resources are abstracted in the compute model.
■ Graph composition and node size are data dependent.
-
Architecture Model
Hybrid processor: conventional RISC core, reconfigurable array, memory.
-
Serial Implementation
-
Spatial Implementation
-
Architecture Model (cont.)
■ The architecture model and SCORE compute model permit scaling over a wide range of IC processes and die sizes.
■ A 0.13um process with a 16mm x 16mm payload suggests:
■ 256 compute/memory tiles
■ a total of 32K logic cells, 0.5Gbit of memory
■ a RISC core with 32Kbit I/D caches, area equivalent to 8 tiles
-
Configurable System on a Chip
■ This micro-architecture and chip is an example CSoC.
■ SCORE provides a general framework for SoC families:
■ interconnect/architecture fabric
■ software model
‣ compute model for application assembly/scaling
‣ OS/runtime
■ both for
‣ standard cores
‣ custom, application-specific components (hardcoded accelerators)
-
Key Idea: Interconnect Fabric
■ Standard/common interconnect fabric.
■ Mix-and-match nodes on the fabric:
■ provide different resource balances
■ match the needs of particular applications
■ All use a common compute model:
■ share software and infrastructure
-
Sample Hybrid CSoC: vision 2000
-
Xilinx Versal 2020
-
End of Lecture 5