A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT
Nikolaos Vassiliadis
N. Kavvadias, G. Theodoridis, S. Nikolaidis
Section of Electronics and Computers, Department of Physics,
Aristotle University of Thessaloniki,
54124 Thessaloniki, Greece
[email protected]@skiathos.physics.auth.gr
Algarve, Portugal, February 22-23, 2005

Outline
  Motivations
  Proposed Architecture
  Software Development Environment
  Demonstration
  Results
  Conclusions

Motivations: Quest for Performance and Flexibility
  A Large Portion of the Computational Complexity is Concentrated in Small Kernels Covering Small Parts of the Overall Code
    Performance Improved by Accelerating These Kernels
  Many Algorithms Show Significant Instruction Level Parallelism (ILP)
    Performance Improved by Parallel Execution
  Traditional Processors Have Computation Clock Slack
    Performance Improved by Chaining of Operations (Spatial Computation)
  Extending Embedded Processors With Application Specific Function Units
  Reconfigurable Instruction Set Processors for Performance with Maximum Flexibility

Proposed Architecture
  Reconfigurable Instruction Set Processor (RISP)
  Core Processor
    32-bit load/store RISC architecture
    5 Pipeline Stages
    Single-Issue Execution
  Reconfigurable Logic Coupling
    Reconfigurable Function Unit (RFU) approach => Low Communication Overhead
    Tightly Coupled => RFU Fits in two RISC pipeline stages => Better Utilization of the Pipeline Stages
  RFU
    1-D Array of Coarse Grain Processing Elements (PEs)
    PE Functionality Configurable at Design Time to meet Application requirements
    Exploits Instruction Level Parallelism (Spatial & Temporal Computation)

Proposed Architecture
  Core Processor
    Commonly Used Function Units
    Control Logic Properly Extended to Handle Reconfigurable Instructions
    4-Read-1-Write Register File
  Core / RFU Interface
    Receives & Delivers Control and Data Signals
  Tightly Coupled RFU
    Configuration-Processing-Interconnection Layers
    Operates & Delivers Results in two Concurrent Pipeline Stages
  [Figure: block diagram of the core processor and the RFU. Core labels: Control Logic, Register File, ALU, MUX, Multiplier, Shifter, Data Memory, four Pipeline Registers. RFU labels: Core / RFU Interface, Processing & Interconnect Layers, Configuration Layer. Signal labels: Write Back Data, Control Signals, I_DATA_INBUS, Operands, 1st Stage Result, 2nd Stage Result, Re OpCode, Status Signals, Configuration Bits.]

Standard And Reconfigurable Instructions
  Re='0' => Standard Instruction
    Control Logic: Configure Core Datapath
    Operands: Source1-2 & Destination
    ReOpCode = "nop"
  Re='1' => Reconfigurable Instruction
    Control Logic: Configure Interface
    Operands: Source1-4 & Destination
    ReOpCode = "OpCode"
  Three Types of Reconfigurable Instructions
    Complex Computational Operations
    Complex Addressing Modes
    Complex Control Flow Operations
  Each Instruction Can Be Multicycle
  32-Bit Instruction Word Format: Re | OpCode | Source 1 | Source 2 | Destination | Source 3 | Source 4
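
The instruction word format above can be sketched as a pair of pack/unpack helpers. The slide gives only the field order, so the field widths below (1-bit Re, 6-bit OpCode, five 5-bit register fields, 32 bits in total) are assumptions chosen for illustration, and the names `encode`, `decode` and `is_reconfigurable` are hypothetical.

```python
# Field layout, most significant field first.
# ASSUMPTION: the widths (1 + 6 + 5*5 = 32 bits) are illustrative only;
# the slides specify the field order but not the field sizes.
FIELDS = [
    ("re", 1),
    ("opcode", 6),
    ("src1", 5),
    ("src2", 5),
    ("dest", 5),
    ("src3", 5),
    ("src4", 5),
]


def encode(**values):
    """Pack named fields into one 32-bit word; missing fields default to 0."""
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode(word):
    """Unpack a 32-bit word into a dict of field values."""
    fields = {}
    shift = 32
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


def is_reconfigurable(word):
    """Re = 1 routes the instruction to the RFU, Re = 0 to the core datapath."""
    return decode(word)["re"] == 1
```

With this layout a standard instruction simply leaves Source 3 and Source 4 at zero, while a reconfigurable instruction can name up to four source registers.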

Reconfigurable Function Unit (RFU)
  Embedded RFU for Dynamic Extension of the Instruction Set
  Executes Multiple-Input-Single-Output (MISO) Reconfigurable Instructions
  1-D Array of Coarse Grain Reconfigurable Blocks
  Comprised of Three Layers
    Processing Layer
    Interconnection Layer
    Configuration Layer

RFU-Processing Layer
  PE Basic Structure
    Configurable PE Functionality for the Targeted Application
    Unregistered Output => Spatial Computation
    Registered Output => Temporal Computation
    Floating PEs => Can Operate in Both Core Pipeline Stages on Demand
    Local Memory for Read Only Values
    Execute Long Chains of Operations in One Processor Cycle
  [Figure: PE basic structure. Labels: PE, Register, MUX, Operand1, Operand2, Function Sel, Spatial-Temporal Sel, Result.]
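
The PE behaviour listed above can be illustrated with a small behavioural model. This is a sketch under assumptions, not the authors' VHDL: the function set, the 32-bit wrap-around and the class and method names are hypothetical. It shows only the difference between an unregistered (spatial) output, which is visible in the same cycle and can be chained, and a registered (temporal) output, which becomes visible one cycle later.

```python
MASK32 = 0xFFFFFFFF

# ASSUMPTION: an illustrative function set. The slides state only that
# PE functionality (ALU, shifter, multiplier) is fixed at design time.
FUNCTIONS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "and": lambda a, b: a & b,
    "shl": lambda a, b: a << (b & 31),
    "mul": lambda a, b: a * b,
}


class ProcessingElement:
    """Behavioural model of one PE with an output register and output mux."""

    def __init__(self, function_sel, temporal):
        self.function_sel = function_sel  # plays the role of "Function Sel"
        self.temporal = temporal          # "Spatial-Temporal Sel": True = registered
        self.register = 0

    def step(self, operand1, operand2):
        """Advance one cycle and return the value visible on Result."""
        value = FUNCTIONS[self.function_sel](operand1, operand2) & MASK32
        if not self.temporal:
            return value                  # spatial: combinational output
        visible = self.register           # temporal: previous cycle's value
        self.register = value
        return visible


# Two spatially chained PEs compute (3 + 4) << 1 within a single step.
adder = ProcessingElement("add", temporal=False)
shifter = ProcessingElement("shl", temporal=False)
chained = shifter.step(adder.step(3, 4), 1)
```

Chaining unregistered PEs in this way is what lets several dependent operations complete inside one processor cycle, provided the combined delay fits the clock period.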

RFU-Interconnection Layer
  1-D Array of PEs
  Operands from Register File
  Constant Values from Local Memory
  Input Network
  Operand Select
  Output Network => Delivers Results to Corresponding Pipeline Stages
  [Figure: RFU interconnection layer. Labels: Input Network, Output Network, Feedback Network, PE Basic Structure with Operand Select, Operand1, Operand2 and PE Result (shown twice), Operands, Constants, 1st Stage Operands, 2nd Stage Operands, 1st Stage Result, 2nd Stage Result.]

RFU-Configuration Layer
  Configuration Bits Local Storage Structure
  Multi-Context Configuration Layer
  Coarse Grain => Small Number of Configuration Bits => Negligible Overhead to Download New Contexts
  [Figure: RFU configuration layer. Labels: External Configuration Memory, Configuration Controller, Configuration Bits Local Storage (Configuration 0, Configuration 1, Configuration 2, Configuration 3), Configuration Bits.]
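
A multi-context configuration store of this kind can be modelled in a few lines. The sketch below is an assumption-laden illustration rather than the actual controller: the class name, its methods and the default of four local contexts (taken from the four slots drawn in the figure) are hypothetical, and the 134-bit word width is borrowed from the evaluation configuration reported later in the slides.

```python
class ConfigurationLayer:
    """Local storage holding several configuration contexts at once."""

    def __init__(self, num_contexts=4, word_bits=134):
        # ASSUMPTION: slot count and word width are illustrative defaults.
        self.word_bits = word_bits
        self.contexts = [0] * num_contexts
        self.active = 0

    def _check_slot(self, slot):
        if not 0 <= slot < len(self.contexts):
            raise IndexError(f"no context slot {slot}")

    def download(self, slot, word):
        """Copy one configuration word from external memory into a slot."""
        self._check_slot(slot)
        if not 0 <= word < (1 << self.word_bits):
            raise ValueError(f"word does not fit in {self.word_bits} bits")
        self.contexts[slot] = word

    def select(self, slot):
        """Switch the active context; no external memory access is needed."""
        self._check_slot(slot)
        self.active = slot

    def configuration_bits(self):
        """Bits currently driving the processing and interconnection layers."""
        return self.contexts[self.active]
```

The point of keeping several contexts locally is that `select` is cheap: switching between already downloaded reconfigurable instructions does not touch the external configuration memory.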

Architecture Synthesis & Evaluation
  A Hardware Model (VHDL) was Designed for Evaluation Purposes

  Configuration                        Value
  Granularity                          32 bits
  Number of Processing Elements        8
  Processing Elements Functionality    ALU, Shifter, Multiplier
  Configuration Contexts               16 words of 134 bits
  Local Memory Size                    8 constants of 32 bits
  Number of Provided Local Operands    4

  Component                    Area (mm^2)
  Processor Core               0.134
  RFU Processing Layer         0.186
  RFU Interconnection Layer    0.125
  RFU Configuration Layer      0.137
  RFU Total                    0.448

  The Model was Synthesized with STM 0.13 um Process
  The RFU Area Overhead is 3.3x the Area of the Core Processor
  No Caches were taken into account
  No Overhead to Core Critical Path
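
The reported figures can be cross-checked with a few lines of arithmetic. The snippet uses only numbers from the tables above; the variable names are illustrative.

```python
# Areas in mm^2, as listed in the component table.
core_area = 0.134
rfu_layers = {
    "processing": 0.186,
    "interconnection": 0.125,
    "configuration": 0.137,
}

# The three RFU layers should add up to the reported RFU total (0.448).
rfu_total = sum(rfu_layers.values())

# RFU area relative to the core, reported on the slide as 3.3x.
overhead_ratio = rfu_total / core_area

# Local configuration storage: 16 contexts of 134 bits each.
context_storage_bits = 16 * 134
context_storage_bytes = context_storage_bits / 8
```

The layer areas sum to the stated RFU total, the ratio rounds to the stated 3.3x, and the whole set of 16 contexts occupies 2144 bits (268 bytes), which is consistent with the claim that coarse granularity keeps configuration overhead small.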

Software Development Environment
  Application Code (C)
  => Front-End Compilation: MachSUIF + Machine Independent Optimizations
  => Application CDFG (SUIFvm)
  => Application Analysis: Static (Count Instructions), Dynamic (Estimate Frequencies)
  => Weighted Application CDFG (SUIFvm)
  => Instruction Generation: Instruction Generation (MaxMISO), Instruction Selection (Max Gain)
  => Application CDFG (SUIFvm + Instruction Extensions)
  => Mapping + Code Generation: RFU (MaxMISO Mapping), Core (Code Generation)
  => Executable Code
  => Simulation-Profiling
  Feedback: Revise for Finer Results
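
The flow names MaxMISO as its instruction generation step. The following is a minimal sketch of one common way to partition a data-flow graph into maximal multiple-input-single-output subgraphs: start from each sink, then absorb every predecessor whose consumers all lie inside the growing subgraph. It is not taken from the authors' tool; the function name `max_misos` and the example graph are hypothetical.

```python
from collections import deque


def max_misos(succs):
    """Partition a DAG into MaxMISOs.

    succs maps each operation node to the list of nodes consuming its result.
    Returns a list of sets; each set has exactly one node whose result
    leaves the set (the MISO output).
    """
    nodes = set(succs)
    for targets in succs.values():
        nodes.update(targets)
    succ_sets = {n: set(succs.get(n, ())) for n in nodes}
    preds = {n: set() for n in nodes}
    for n in nodes:
        for s in succ_sets[n]:
            preds[s].add(n)

    # Sinks (nodes with no consumers) are the first candidate outputs.
    worklist = deque(sorted(n for n in nodes if not succ_sets[n]))
    assigned = set()
    result = []
    while worklist:
        root = worklist.popleft()
        if root in assigned:
            continue
        miso = {root}
        frontier = [root]
        while frontier:
            node = frontier.pop()
            for p in sorted(preds[node]):
                # Absorb p only if every consumer of p is already inside:
                # otherwise the subgraph would gain a second output.
                if p not in miso and p not in assigned and succ_sets[p] <= miso:
                    miso.add(p)
                    frontier.append(p)
        assigned |= miso
        result.append(miso)
        # Producers left outside become outputs of their own MaxMISOs.
        for node in sorted(miso):
            for p in sorted(preds[node]):
                if p not in miso:
                    worklist.append(p)
    return result


# Hypothetical data-flow graph: "shl" feeds two different outputs,
# so it cannot be merged into either of them.
example = {
    "and": ["add"],
    "shl": ["add", "or"],
    "add": ["xor"],
    "sub": ["xor"],
    "xor": [],
    "or": [],
}
partitions = max_misos(example)
```

In this example `and`, `add`, `sub` and `xor` collapse into one candidate instruction, while `shl` and `or` remain separate. A real flow would then rank the candidates, which the slide refers to as selection by maximum gain.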

Demonstration: RFU Execution
  Largest MaxMISO for a Quantization Kernel
  Execution on the Core => six cycles
  Execution on the Core+RFU => one cycle
  Performance Improvements
  Reduced Instruction Memory Accesses
  [Figure: mapping of the MaxMISO onto the RFU. Operation nodes: ADD, SUB, NEG, SHIFT, NEG, SHIFT; inputs: registers and constants. Annotations: Map to PEs, ILP + Spatial Computation, 1st Execution Stage; Map to PEs, ILP + Spatial Computation, 2nd Execution Stage; Temporal Computation; Deliver Result in 2nd Pipeline Stage.]

Results
  [Figure: Normalized Energy Consumption, Core vs. Core+RFU, for the CRC, FIR, FFT, QUANT and VLC kernels; vertical axis from 0 to 1.]
  Energy Consumption Dominated by Memory Accesses
  Speed-Ups for Several Kernels, Core vs. Core+RFU

  Kernel      CRC     FIR     FFT     QUANT   VLC
  Speed-Up    1.6x    1.8x    2.8x    1.9x    1.7x
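
The per-kernel speed-ups can be summarized with a geometric mean, which is the usual way to average ratios. The summary value is not given on the slide; it is derived here from the five reported numbers only, and the variable names are illustrative.

```python
import math

# Speed-ups of Core+RFU over Core, as reported for each kernel.
speedups = {"CRC": 1.6, "FIR": 1.8, "FFT": 2.8, "QUANT": 1.9, "VLC": 1.7}

# Geometric mean: n-th root of the product of the n ratios.
geometric_mean = math.prod(speedups.values()) ** (1 / len(speedups))

# Kernel that benefits most from the RFU.
best_kernel = max(speedups, key=speedups.get)
```

The geometric mean comes out at roughly 1.9x, with FFT as the kernel that benefits most.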

Conclusions
  A RISC Processor Enhanced by a Run-Time Reconfigurable Function Unit
  1-D Reconfigurable Array of Coarse Grain Processing Elements
  Multiple-Input-Single-Output Reconfigurable Instructions
  Specific Software Development Environment
  Low-Cost Performance and Energy Consumption Improvements
  Next Step => Expand to VLIW Execution to Boost Achieved Speed-Ups