Compiler needs to drive the ISA - sable.mcgill.caclump/cdp2010/CASCON_Shriraman.pdf · Compiler...
Transcript of Compiler needs to drive the ISA - sable.mcgill.caclump/cdp2010/CASCON_Shriraman.pdf · Compiler...
Compiler needs to drive the ISAArrvindh Shriraman
Simon Fraser University, Jan 1
with input fromSandhya Dwarkadas, Michael Scott, Engin Ipek, and Chen Ding
1
Special thanks to feedback from NSF ACAR II Panel,“What next in Instruction-Level-Parallelism research?”
Brief Introduction
I am a computer architectyou guys probably know more about compilers!
While many collaborators, opinions and mistakes are my own
Non Goal : What should the architecture stack should look like in the future ?
Goal : get compiler designers to actively participate in defining HW-SW interface
2
Outline
Inefficiency of current microprocessorsEnergy wallRISC ISAWhere does energy go ?
Big and Morphable ISA
Heterogeneous execution substrate
3
Energy Wall2014 2020
Total Si
Power/Si
1 4 16
1 1 0.6
Usable Si
200845nm 22nm 11nm
100% 25% 4%
Simple by designfocus on pipelineability and latency
Instructions encode little workno parallelism within instructionsInstruction-parallelism reaching its limits
RISC Instructions
5
Classic study: Tejas and Smith
(HW Overheads)2
Instr.Parallelism
RISC Instructions (contd..)
OO HWneeds to scale resources quadraticallyneeds to reduce misses and branch penalty quadratically
Pipelining small instructions is inefficientfetch consumes 4x more energy than execute
6
10%
15%
6%
31%
38%Fetch
Pipeline
Reg.
D$
Exe
Energy
Current ISAs - Do we need them ?
7
Tension between hardware and software
Software allowed to provide only limited info on data types and operation semantics
Not necessarily useful for hardware eitherenergy is the #1 constraint, not area or clockHW needs to work hard to mine parallelismdoes not support energy-efficient functional units
Hardware
ApplicationsISA
Agenda
8
Need to perform more work per instruction
ISA needs app-specific information
Hardware needs to cut-down overheads
Outline
Inefficiency of current microprocessors
Dynamic Morphable ISADecoupled Multi-level ISABig InstructionsApplication-based ISA extensionsChallenges
Heterogeneous execution substrate
9
De-coupled ISAs
Two-levels of ISA abstractionApp ISA for customization, more info, and portabilityHW ISA for platform-specific opt. and accelerators
Always-present virtual machinecan use rich SW program informationcan use feedback from HW for dynamic optimization
10
Hardware
Applications
Hardware ISA
Application ISACode translator and
optimizer
Application ISA
Big Instructionscompound instructions that encode more work
Can capture relationship between operationsprovides information about operation parallelismenables hardware to exploit parallelism
11
D1 D2 D3 D4
SSE, AVX VLIW, EPICD1 D2 D3 D4
Austin TRIPSD1 D2
D3
Application ISA
Custom instructionslibrary runtimes, C++ Boost, MATLAB, R, Sphinxgenerate application-specific instructions
Static instructionsmany applications have a small hot code-regions70% code shared between applications
Dynamic instructionsenable software to add new instructions to ISA90% coverage with 40,000 static instruction*
12* GreenDroid - HotChips 2010
Research Agenda (1/2)
What is the basic application ISAX86, LLVM, RISC, Java bytecode, Nvidia PTX ?
How to specify and modify application ISA ?pragmas, macro language ?generate from hot path ?
13
How to enable software to make use of new ISA?compiler generators ? Is that so crazy or feasible ?how to alter the backend of a compiler at runtime ?
Research Agenda (2/2)
Reconsider Instruction semantics do we need precise exceptions at the instruction level ?atomicity of which operation guaranteed ?major impact on hardware complexity
14
Example Designs
IBM System 38Application ISA : TMIImplementation ISA : IBM AS-400 decendants
Transmeta CMS (ahead of its time ?) Application ISA : x86 for compatibilityImplementation ISA: VLIW for energy efficiency
Nvidia CUDAApplication ISA : PTXImplementation ISA : Nvidia GPUs
15
Outline
Inefficiency of current microprocessors
Dynamic Morphable ISA
Heterogeneous execution substratesArchitectureProgram state managementExamples
16
Accelerator-based architecture
Dedicated specialized hardware to improve performance under same energy
Granularity 80s, floating point unit, co-processors90s-00s : Vector units.Future: GPU, DSP, and more...
17
CPU
Signal CryptoAccelerator
GPU
Processor
New Challenge : Program StateMulticore CPU (e.g., Power6)
Hardware cachingFine-grained sharing
18
Accelerators (e.g, GPU)
Scratchpad memoriesCoarse-grain sharing
Minimal programmer effort
SW cannot directly optimizeSW can optimize
Weak memory models
Latency-optimized Throughput-optimizedsmall reg files (<128bytes)large caches
large reg files. (2MB) small cache all L1s = L2 = 768KB
70% of on-chip mem.software controlled
90% of on-chip mem.hardware controlled
per L1 = 64KB L2 = 4MB
Program state challenges
Where to stage data required by big instructions ?registers, on-chip cache, memorywhat about bandwidth ?
SW runtime to manage on-chip caches ?similar to GC, except for physical storage,on-chip memory shared by many tasksneed to organize private and shared-data
Managing program state in SW too hard ?can we afford not to, HW too reactive and unoptimal
19
New Challenges: Task Mapping
20
0
2.5
5.0
7.5
10.0
0 10 20 30 40 50 60 70 80 90 100
Spe
edup
% on CPU
Adaptive Task Mapping [Micro 2010]
Matrix-Multiplication
Task : Bag of instructionsHow to choose which tasks to run on accelerator?
How to choose accelerator?
Example systemsFocus : design time chip generator for embedded
applications are reasonably stableprofile application and create HW for hot-code
Industry : Tensilica Systemsnew language to specify extension units and instructionsautomate integration of special-purpose HW w/ CPUgenerates compiler backend for new instructions
Academia : Conservation Cores [UCSD]generated HW from profiled hot-code in programcold-code on CPU; hot-code on specialized HW
21
Summary
Decouple ISA
Application ISAHLLs can communicate more informationenable more dynamic optimizationsallow for dynamic extensions for app-customization
Heterogeneous accelerator coresnew compute units driven by software librariesSW needs to manage on-chip program-state
22
Hardware
Applications
Hardware ISA
Application ISASoftware code translator
23