An Embedded Low Power/Cost 16-Bit Data/Instruction ... · Data/Instruction Microprocessor...
-
Upload
nguyendien -
Category
Documents
-
view
220 -
download
0
Transcript of An Embedded Low Power/Cost 16-Bit Data/Instruction ... · Data/Instruction Microprocessor...
12nd ASP-DAC, Jan. 26, 2007
An Embedded Low Power/Cost 16-Bit Data/Instruction Microprocessor Compatible with ARM7 Software Tools
An Embedded Low Power/Cost 16-Bit Data/Instruction Microprocessor Compatible with ARM7 Software Tools
Fu-Ching Yang and Ing-Jer Huang
Dept. of Computer Science and Engineering.
National Sun Yat-Sen University, Kaohsiung Taiwan
[email protected]@cse.nsysu.edu.tw
2/22
OutlineOutline
MotivationMotivation
Related WorkRelated Work
Proposed SolutionProposed Solution–– Register file mergingRegister file merging–– Architecture modificationArchitecture modification–– Software tool reuseSoftware tool reuse
Experimental ResultExperimental Result
ConclusionConclusion
3/22
OutlineOutline
MotivationMotivation
Related WorkRelated Work
Proposed SolutionProposed Solution–– Register file mergingRegister file merging–– Architecture modificationArchitecture modification–– Software tool reuseSoftware tool reuse
Experimental ResultExperimental Result
ConclusionConclusion
4/22
MotivationMotivation
ShortShort--precision oriented applicationsprecision oriented applications–– Over Over 50%50% of the operations are of the operations are less than 16 bitsless than 16 bits
Brooks, D. and Martonosi, M.,”Dynamically exploiting narrow width operands to Dynamically exploiting narrow width operands to improve processor power and performanceimprove processor power and performance,,”” Proc. of Intl. Symposium on Fifth Proc. of Intl. Symposium on Fifth HighHigh--Performance Computer ArchitecturePerformance Computer Architecture, pp. 13, pp. 13--22, 1999.22, 1999.
About 50% of the operations are less than 16-bit
Bitwidths for SPECint95 on 64-bit Alpha.
5/22
OutlineOutline
MotivationMotivation
Related WorkRelated Work
Proposed SolutionProposed Solution–– Register file mergingRegister file merging–– Architecture modificationArchitecture modification–– Software tool reuseSoftware tool reuse
Experimental ResultExperimental Result
ConclusionConclusion
6/22
Related workRelated work
DynamicDynamic explore data bit widthexplore data bit width
Static compiler analysisStatic compiler analysis–– ARM7/ARM9 cores:ARM7/ARM9 cores:–– 3232--bit RISC+CISC instruction setbit RISC+CISC instruction set–– 1616--bit RISCbit RISC–– 3232--bit databit data
What if the major data types in your What if the major data types in your SoCSoC are are 8 8 or or 16 16 bitsbits??–– 3232--bit data path may be an bit data path may be an overkilleroverkiller
–– bad forbad for performanceperformance, , powerpower, , chip sizechip size, , etcetc. .
7/22
Memory bandwidth bottleneckMemory bandwidth bottleneck
ARM7/ARM9 cores perform wellARM7/ARM9 cores perform well–– Memory bandwidth : Memory bandwidth : 32/6432/64
What if your What if your SoCSoC can not afford such bandwidthcan not afford such bandwidth–– In In costcost--sensitivesensitive applicationsapplications–– 1616--bitbit bus: bus: 22 cycles/per instruction (or data) fetchcycles/per instruction (or data) fetch–– 88--bitbit bus: bus: 44 cycles/per instruction (or data) fetch cycles/per instruction (or data) fetch
8/22
OutlineOutline
MotivationMotivation
Related WorkRelated Work
Proposed SolutionProposed Solution–– Register file mergingRegister file merging–– Architecture modificationArchitecture modification–– Software tool reuseSoftware tool reuse
Experimental ResultExperimental Result
ConclusionConclusion
9/22
Proposed solutionProposed solution
A 16A 16--bit Thumb Microprocessor bit Thumb Microprocessor –– SYS16TMSYS16TM–– 1616--bit instruction set: same as THUMBbit instruction set: same as THUMB–– 1616--bit data pathbit data path–– An An extra low costextra low cost ARMARM--based solutionbased solution–– CompatibleCompatible with most with most ARMARM’’ss existing development existing development
toolstools
Architecture challengesArchitecture challenges1.1. Register file bank mergingRegister file bank merging2.2. Pipeline TrimmingPipeline Trimming3.3. Reuse Reuse ARMARM’’ss develop environmentdevelop environment
10/22
Register file bank mergingRegister file bank merging
R0~R12 : 16 bitsR0~R12 : 16 bits
SP, LR, PC : 32 bitsSP, LR, PC : 32 bits
CPSR, SPSR : 32 bitsCPSR, SPSR : 32 bits
Define new instruction to access high halfDefine new instruction to access high half--word word of SP, LR, PCof SP, LR, PC
11/22
Architecture ChangesArchitecture Changes
Fetch stageFetch stage–– Replace the Replace the adderadder with added by 2with added by 2
Decode stageDecode stage–– Remove the Remove the THUMB THUMB decompressordecompressor–– Modify the Decoder to decode THUMB directlyModify the Decoder to decode THUMB directly–– Modify the register file to Modify the register file to single modesingle mode
Execution stageExecution stage–– Modify Modify multipliermultiplier–– Remove the Remove the adderadder after multiplierafter multiplier–– Cut the Cut the shifter before ALU pathshifter before ALU path
14/22
Coding rules for reusing ARM7’s Software EnvironmentCoding rules for reusing ARM7’s Software Environment
Linker
C code
Assemblycode
Object EXEAssembler
Compiler
Declare Declare ““short short ”” instead of instead of ““intint”” in C languagein C languagethe registers in SYS16TM are the registers in SYS16TM are 16 bits16 bits
••Remove shift instruction patternsRemove shift instruction patterns••Generated by conventional 32Generated by conventional 32--bit compilerbit compiler••Clear upper 16 bits valueClear upper 16 bits value
•• Emulating 16Emulating 16--bit computation with a 32bit computation with a 32--bit data pathbit data path
Interrupt table maintain the Interrupt table maintain the same addressingsame addressing
C codeObject
Assemblycode
15/22
OutlineOutline
MotivationMotivation
Related WorkRelated Work
Proposed SolutionProposed Solution–– Register file mergingRegister file merging–– Architecture modificationArchitecture modification–– Software tool reuseSoftware tool reuse
Experimental ResultExperimental Result
ConclusionConclusion
16/22
Chip size, clock frequencyChip size, clock frequency
TSMC 0.18TSMC 0.18uum Processm Process ARM7ARM7 SYS16TMSYS16TM SYS16TM/ASYS16TM/ARM7RM7
50,61250,612 20,23220,232
137MHz137MHz
PowerPower 5.33 5.33 mWmW 2.7 2.7 mWmW 51%51%
91 MHz91 MHz
40%40%
151%151%
Gate countGate count
Max. Clock rate (MHz)Max. Clock rate (MHz)
The gate count reductionThe gate count reduction
F e tc h D e c o d e E x e c u t io n0 .0
0 .2
0 .4
0 .6
0 .8
1 .0
4 2 %3 4 %
7 7 %
Gat
e co
unt r
educ
tion
P ip e l in e S ta g e
17/22
Static code Size : SYS16TM VS. ARM7Static code Size : SYS16TM VS. ARM7
SYS16TMSYS16TM’’s code size is smaller : s code size is smaller : 61%61%
qsortdijkstra
hanoiGCD
fibonacci
knightjpeg encoder
050
100150200250300350400450
35004000450050005500600065007000
Cod
e S
ize
Benchmark
ARM7 code size SYS16TM code size
18/22
Executed Internal Instruction countExecuted Internal Instruction count
SYS16TM requires SYS16TM requires 1.12 times1.12 times instruction count instruction count comparing to ARM7 on averagecomparing to ARM7 on average
qsortdijkstra
hanoiGCD
fibonacci
knightjpeg encoder
0.00.20.40.60.81.01.21.41.61.82.0
Inst
ruct
ion
Rat
io
Benchmark
SYS16TM/ARM7
19/22
When considering limited memory bandwidthWhen considering limited memory bandwidth
With 32With 32--bit memory bandwidthbit memory bandwidth–– ARM7 is better than SYS16TM by ARM7 is better than SYS16TM by 6%6%
With 16With 16--bit memory bandwidthbit memory bandwidth–– SYS16TM is better than ARM7 by SYS16TM is better than ARM7 by 64%64%
qsortdijkstra
hanoiGCD
fibonacci
knightjpeg encoder
0.00.20.40.60.81.01.21.41.61.8
Cyc
le c
ount
ratio
Benchmark
ARM32 / SYS16 ARM16 / SYS16
20/22
Microprocessor energy consumptionMicroprocessor energy consumption
Consider the realConsider the real--time applicationtime application–– Assume both processors have to complete the applications at the Assume both processors have to complete the applications at the
same timesame time
qsortd ijks tra
hano iG C D fibonacc i
kn igh tjpeg encoder
0 .00 .10 .20 .30 .40 .50 .60 .70 .80 .91 .0
Ene
rgy
Rat
io
B e n ch m a rk
S Y S 1 6 T M /A R M 3 2 S Y S 1 6 T M /A R M 1 6
Memory Memory bandwidthbandwidth
Energy consumption ratio Energy consumption ratio (SYS16TM/ARM7)(SYS16TM/ARM7)
3232--bitbit 55%55%1616--bitbit 30%30%
21/22
OutlineOutline
MotivationMotivation
Related WorkRelated Work
Proposed SolutionProposed Solution–– Register file mergingRegister file merging–– Architecture modificationArchitecture modification–– Software tool reuseSoftware tool reuse
Experimental ResultExperimental Result
ConclusionConclusion
22/22
Conclusions:SYS16TM VS. ARM7Conclusions:SYS16TM VS. ARM7
Proposed a 16Proposed a 16--bit data 16bit data 16--bit instruction THUMB bit instruction THUMB microprocessormicroprocessor
FeaturesFeatures–– 60%60% smaller in gate count smaller in gate count –– 39%39% smaller in the program size smaller in the program size –– faster in the maximal clock rate under the same faster in the maximal clock rate under the same
technologytechnology–– 51%51% for 0.18for 0.18μμmm
–– 33%33% faster in cycles (@ same clock rate) under faster in cycles (@ same clock rate) under 1616--bit bit memory bandwidthmemory bandwidth
–– More than More than 49%49% power reduction power reduction –– 70%70% energy reduction on energy reduction on 1616--bitbit memory bandwidthmemory bandwidth–– Utilizing the Utilizing the existing ARM7's softwareexisting ARM7's software development development
environment with minor patchingenvironment with minor patching