A Multigrain Parallelizing Compiler with Power Control for Multicore Processors

Hironori Kasahara
Professor, Department of Computer Science
Director, Advanced Chip-Multiprocessor Research Institute
Waseda University, Tokyo, Japan
http://www.kasahara.cs.waseda.ac.jp
Feb. 5, 2008, 13:30-15:00 at Intel Headquarters
Hironori Kasahara

<Personal History>
B.S. (1980, Waseda), M.S. (1982, Waseda), Ph.D. (1985, EE, Waseda). Research Associate (1983, Waseda), JSPS Special Research Fellow (1985), Visiting Scholar (1985, Univ. of California at Berkeley), Assistant Prof. (1986, Waseda), Assoc. Prof. (1988, Waseda), Visiting Research Scholar (1989-1990, Center for Supercomputing R&D, Univ. of Illinois at Urbana-Champaign), Prof. (1997-, Dept. of CS, Waseda). IFAC World Congress Young Author Prize (1987), IPSJ Sakai Memorial Special Award (1997), STARC Industry-Academia Cooperative Research Award (2004).

<Activities for Societies>
IPSJ: SIG Computer Architecture (Chair), Trans. of IPSJ Editorial Board (HG Chair), Journal of IPSJ Editorial Board (HWG Chair), 2001 Journal of IPSJ Special Issue on Parallel Processing (Chair of Editorial Board / Guest Editor), JSPP2000 (Program Chair), etc.
ACM: International Conference on Supercomputing (ICS) (Program Committee, esp. '96 ENIAC 50th Anniversary Co-Program Chair).
IEEE: Computer Society Japan Chapter Chair, Tokyo Section Board Member, SC07 PC.
OTHER: PCs of many conferences on supercomputing and parallel processing.

<Activities for Governments>
METI: IT Policy Proposal Forum (Architecture/HPC WG Chair), Super Advanced Electronic Basis Technology Investigation Committee.
NEDO: Millennium Project IT21 "Advanced Parallelizing Compiler" (Project Leader), Computer Strategy WG (Chair), Multicore for Realtime Consumer Electronics (Project Leader), etc.
MEXT: Earth Simulator project evaluation committee, 10PFLOPS supercomputer evaluation committee.
JAERI: Research accomplishment evaluation committee, CCSE 1st class invited researcher.
JST: Scientific Research Fund Sub-Committee, COINS Steering Committee, Precursory Research for Embryonic Science and Technology (Research Area Adviser).
Cabinet Office: CSTP Expert Panel on Basic Policy, Information & Communication Field Promotion Strategy, R&D Infrastructure WG, Software & Security WG.

<Papers>
Papers 158, Invited Talks 64, Tech. Reports 114, Symposium 25, Newspapers/TV/Web News/Magazine 139; IEEE Trans. Computers, IPSJ Trans., ISSCC, Cool Chips, Supercomputing, ACM ICS, etc.
Multi-core Everywhere
Multi-core from embedded to supercomputers

• Consumer Electronics (Embedded): Mobile Phone, Game, Digital TV, Car Navigation, DVD, Camera; IBM/Sony/Toshiba Cell, Fujitsu FR1000, NEC/ARM MPCore & MP211, Panasonic UniPhier, Renesas SH multi-core (4-core RP1, 8-core RP2), Tilera Tile64, SPI Storm-1 (16 VLIW cores)
• PCs, Servers: Intel Quad Xeon, Core 2 Quad, Montvale, Tukwila, 80-core; AMD Quad-Core Opteron, Phenom
• WSs, Deskside & High-end Servers: IBM Power4, 5, 5+, 6; Sun Niagara (SPARC T1, T2), Rock
• Supercomputers: IBM BlueGene/L (Lawrence Livermore National Laboratory, scheduled to start operation in April 2005)

OSCAR Type Multi-core Chip by Renesas in METI/NEDO Multicore for Real-time Consumer Electronics Project (Leader: Prof. Kasahara)
Supercomputers
• Earth Simulator: 40TFLOPS, 2002, 5120 vector processors
• IBM Blue Gene/L: 360TFLOPS, 2005, a supercomputer based on low-power chip multiprocessors at Lawrence Livermore National Laboratory (scheduled to start operation in April 2005): 65,536 processor chips with 2 processors integrated per chip, i.e., 128K processors performing 360 trillion floating-point operations per second, with 1024 processor chips per cabinet; BG/P follows in 2008.

High-quality application software, productivity, cost-performance and low power consumption are important, e.g., for mobile phones and games.
Compiler-cooperated multi-core processors are promising to realize the above features.
Market of Consumer Electronics
1 Trillion Dollars in 2010 (World Wide)

[Bar chart: worldwide market size (shipment value) of consumer electronics for '02-'06, 0 to 2 trillion yen: mobile phones, digital cameras, digital video, flat-panel TVs, DVD recorders, car navigation systems]

Annual average growth rates ('03 to '07):

Product                                   '03     '07    Annual growth rate
Digital still cameras (M units)            49      76    12%
DVDs for PCs (recordable) (M units)        27     114    43%
DVD recorders (M units)                    3.6     33    74%
Digital TVs (M units)                       6      27    45%
Mobile phones (M units)                   490     670     8%
Semiconductor demand for cars (B$)         14.0    20.9  11%

Source: NEDO Roadmap Report Meeting, Electronics and Information Technology Development Department, "Technology Development Strategy", May 11, 2005.
Roadmap of Compiler Cooperative Multicore Project (2000-2010)

• Millennium Project IT21, NEDO Advanced Parallelizing Compiler (Waseda Univ., Fujitsu, Hitachi, JIPDEC, AIST): parallelizing compiler development for multiprocessor servers, applied to supercomputer compilers.
• STARC compiler cooperative practical research, following basic research (STARC: Semiconductor Technology Academic Research Center; Fujitsu, Toshiba, NEC, Renesas, Panasonic, Sony, etc.): compiler cooperative chip multiprocessor (Waseda Univ., Fujitsu, NEC, Toshiba, Panasonic, Sony), from architecture and compiler R&D plan to practical use.
• NEDO (2004.07-2007.06) Heterogeneous Multiprocessor (Waseda Univ., Hitachi): soft real-time multicore architecture and compiler, from plan to practical use.
• NEDO (2005.06-2008.03) Multicore Technology for Realtime Consumer Electronics (Waseda Univ., Hitachi, Renesas, Fujitsu, NEC, Toshiba, Panasonic): power-saving multicore architecture, parallelizing compiler and API, from R&D to practical use.
• NEDO (2007.02-2010.03) Heterogeneous Multicore for Consumer Electronics (Waseda Univ., Hitachi, Renesas, Tokyo Inst. of Tech.): heterogeneous multicore architecture and compiler R&D.
METI/NEDO Advanced Parallelizing Compiler (APC) Technology Project
Millennium Project IT21, 2000.9.8-2003.3.31 (Waseda Univ., Fujitsu, Hitachi, AIST)

Background and Problems:
① Adoption of parallel processing as a core technology from PCs to HPC
② Increasing importance of software in IT
③ Need for improvement of cost-performance and usability

Contents of Research and Development:
① R&D of an advanced parallelizing compiler: multigrain parallelization, data localization, overhead hiding
② R&D of performance evaluation technology for parallelizing compilers

<Purpose> Improvement of ① effective performance, ② cost-performance and ③ ease of use. Project goal: double the effective performance.
[Chart: theoretical maximum (hardware peak) performance vs. effective performance of HPC, 1G to 1T over 1980-2000; effective performance slows down relative to peak performance.]

Ripple Effects:
① Development of competitive next-generation PCs and HPC
② Putting the innovative automatic parallelizing compiler technology to practical use
③ Development and market acquisition of future single-chip multiprocessors
④ Boosting R&D in many fields: IT, biotech, devices, earth environment, next-generation VLSI design, financial engineering, weather forecasting, new clean energy, space development, automobiles, electronic commerce, etc.
Performance of APC Compiler on IBM pSeries690 16-Processor High-end Server

• IBM XL Fortran for AIX Version 8.1
- Sequential execution: -O5 -qarch=pwr4
- Automatic loop parallelization: -O5 -qsmp=auto -qarch=pwr4
- OSCAR compiler: -O5 -qsmp=noauto -qarch=pwr4 (su2cor: -O4 -qstrict)

3.5 times speedup on average against XL Fortran automatic loop parallelization.
[Bar chart: speedup ratio (0.0-12.0) of XL Fortran (max) vs. APC (max), with execution times, for tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 (SPEC95) and wupwise, swim2k, mgrid2k, applu2k, sixtrack, apsi2k (SPEC2000)]
Performance of Multigrain Parallel Processing for 102.swim on IBM pSeries690

[Graph: speedup ratio (0.00-16.00) of XLF(AUTO) vs. OSCAR over 1-16 processors]
Performance of APC Compiler on Sun Ultra80 4-Processor Workstation

• Sun Forte Developer 6 Update 2
- Sequential execution: -fast
- Automatic loop parallelization: -fast -autopar -reduction -stackvar
- OSCAR compiler: -fast -explicitpar -mp=openmp -stackvar

[Bar chart: speedup ratio (0.0-10.0) of Forte6 (max) vs. APC (max), with execution times, for tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 (SPEC95) and wupwise, swim2k, mgrid2k, applu2k, sixtrack, apsi2k (SPEC2000)]
OSCAR Parallelizing Compiler

Improves effective performance, cost-performance and productivity and reduces consumed power through:
• Multigrain Parallelization: exploitation of parallelism from the whole program by use of coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism.
• Data Localization: automatic data distribution for distributed shared memory, cache and local memory on multiprocessor systems.
• Data Transfer Overlapping: data transfer overhead hiding by overlapping task execution and data transfer using DMA or data pre-fetching.
• Power Reduction: reduction of consumed power by compiler control of frequency, voltage and power shutdown with hardware supports.
Generation of Coarse Grain Tasks

A program is decomposed into macro-tasks (MTs):
• Block of Pseudo Assignments (BPA): basic block (BB)
• Repetition Block (RB): outermost natural loop
• Subroutine Block (SB): subroutine

The decomposition is hierarchical: the program is first partitioned into BPAs, RBs and SBs for coarse-grain parallelization (1st layer); the loop body of each RB and the body of each SB are decomposed again into BPAs, RBs and SBs (2nd and 3rd layers, over the total system); BPAs can further be parallelized at near-fine grain among statements, and RBs at loop level.
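As a rough illustration of this hierarchical decomposition, the sketch below models macro-tasks with a small recursive C structure; the type and function names are hypothetical and do not reflect the OSCAR compiler's actual internals.

#include <stdio.h>

/* Hypothetical model of OSCAR-style macro-tasks (MTs):
   BPA = basic block, RB = outermost natural loop, SB = subroutine. */
typedef enum { MT_BPA, MT_RB, MT_SB } MTKind;

typedef struct MacroTask {
    MTKind kind;
    int id;                      /* id within its layer             */
    struct MacroTask **children; /* inner MTs (next layer), if any  */
    int n_children;
} MacroTask;

/* Print the hierarchy layer by layer (1st layer, 2nd layer, ...). */
static void dump(const MacroTask *mt, int layer) {
    static const char *names[] = { "BPA", "RB", "SB" };
    printf("%*slayer %d: MT%d (%s)\n",
           2 * layer, "", layer + 1, mt->id, names[mt->kind]);
    for (int i = 0; i < mt->n_children; i++)
        dump(mt->children[i], layer + 1);
}

int main(void) {
    MacroTask b1 = { MT_BPA, 2, NULL, 0 }, b2 = { MT_BPA, 3, NULL, 0 };
    MacroTask *body[] = { &b1, &b2 };
    MacroTask loop = { MT_RB, 1, body, 2 };  /* an RB whose body splits into two BPAs */
    dump(&loop, 0);
    return 0;
}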
Earliest Executable Condition Analysis for Coarse Grain Tasks (Macro-tasks)

The compiler first builds a Macro Flow Graph (MFG): its nodes are macro-tasks (blocks of pseudo assignment statements, repetition blocks, subroutine blocks) connected by control-flow edges, data-dependency edges and conditional branches. Earliest executable condition analysis then derives a Macro Task Graph (MTG), in which each macro-task carries the earliest condition under which it may start, expressed as AND/OR combinations of data dependences and extended control dependences. A macro-task can thus begin as soon as the branches it depends on are resolved and the data it needs are produced, instead of waiting for the original control flow.
[Figure: a 14-node Macro Flow Graph (solid edges: data dependency; dotted edges: control flow; small circles: conditional branches) and the Macro Task Graph derived from it (solid edges: data dependency; dotted edges: extended control dependency; arcs across incoming edges: AND/OR conditions).]
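A minimal sketch of how such an earliest executable condition, an OR of ANDs over "predecessor finished" and "branch resolved toward this task" events, could be evaluated at run time. All names and the runtime representation are illustrative assumptions, not the compiler's implementation.

#include <stdbool.h>
#include <stdio.h>

#define N_MT 16
static bool done[N_MT];        /* done[m]: macro-task m has finished             */
static int  branch_dir[N_MT];  /* resolved branch direction of m (0: unresolved) */

/* One conjunct: either "MT pred has finished" (data dependence) or
   "the branch in MT pred was resolved to direction dir" (control dep.). */
typedef struct { int pred; int dir; bool is_branch; } Term;
typedef struct { const Term *terms; int n; } Clause;   /* AND of terms  */
typedef struct { const Clause *clauses; int n; } EEC;  /* OR of clauses */

static bool may_start(const EEC *c) {
    for (int i = 0; i < c->n; i++) {
        bool ok = true;
        for (int j = 0; j < c->clauses[i].n; j++) {
            const Term *t = &c->clauses[i].terms[j];
            ok = ok && (t->is_branch ? branch_dir[t->pred] == t->dir
                                     : done[t->pred]);
        }
        if (ok) return true;   /* one satisfied clause is enough */
    }
    return false;
}

int main(void) {
    /* "MT6 may start when MT2 has finished, OR MT1's branch took direction 1." */
    Term t_data = { 2, 0, false }, t_branch = { 1, 1, true };
    Clause cl[] = { { &t_data, 1 }, { &t_branch, 1 } };
    EEC mt6 = { cl, 2 };
    branch_dir[1] = 1;                                 /* MT1's branch resolved */
    printf("MT6 may start: %d\n", may_start(&mt6));    /* prints 1 */
    return 0;
}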
Automatic Processor Assignment in 103.su2cor

• Using 14 processors
- Coarse-grain parallelization within DO400 of subroutine LOOPS
[Figure: MTG of su2cor LOOPS DO400 with node types DOALL, sequential loop, BB and SB; coarse-grain parallelism PARA_ALD = 4.3]
Data-Localization: Loop Aligned Decomposition

• Decompose multiple loops (Doall and Doseq) into CARs and LRs considering inter-loop data dependence.
- Most data in an LR can be passed through local memory (LM).
- LR: Localizable Region, CAR: Commonly Accessed Region

Target loops:

C RB1 (Doall)
      DO I=1,101
        A(I)=2*I
      ENDDO
C RB2 (Doseq)
      DO I=1,100
        B(I)=B(I-1)+A(I)+A(I+1)
      ENDDO
C RB3 (Doall)
      DO I=2,100
        C(I)=B(I)+B(I-1)
      ENDDO

Aligned decomposition into LRs and CARs:
• RB1: LR DO I=1,33 | CAR DO I=34,35 | LR DO I=36,66 | CAR DO I=67,68 | LR DO I=69,101
• RB2: LR DO I=1,33 | CAR DO I=34,34 | LR DO I=35,66 | CAR DO I=67,67 | LR DO I=68,100
• RB3: LR DO I=2,34 | LR DO I=35,67 | LR DO I=68,100
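The split points above follow from the dependence distances. As a worked illustration (a sketch of the reasoning, not OSCAR's actual algorithm), one can start from the block partition of RB3 and propagate the consumed index ranges backward; iterations needed by two adjacent groups become the CARs:

#include <stdio.h>

typedef struct { int lo, hi; } Range;

/* Block partition of RB3 chosen for three data-localization groups. */
static const Range rb3[3] = { {2,34}, {35,67}, {68,100} };

int main(void) {
    Range rb2[3], rb1[3];
    for (int p = 0; p < 3; p++) {
        /* C(I)=B(I)+B(I-1): RB3 range [lo,hi] consumes B(lo-1..hi). */
        rb2[p] = (Range){ rb3[p].lo - 1, rb3[p].hi };
        /* B(I)=B(I-1)+A(I)+A(I+1): consumes A(lo..hi+1).            */
        rb1[p] = (Range){ rb2[p].lo, rb2[p].hi + 1 };
    }
    /* Iterations consumed by two adjacent groups form the CARs.     */
    for (int p = 0; p + 1 < 3; p++) {
        printf("RB2 CAR between groups %d and %d: I=%d..%d\n",
               p, p + 1, rb2[p + 1].lo, rb2[p].hi);
        printf("RB1 CAR between groups %d and %d: I=%d..%d\n",
               p, p + 1, rb1[p + 1].lo, rb1[p].hi);
    }
    return 0;   /* prints exactly the CAR ranges listed above */
}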
Data Localization

[Figure: an MTG, the MTG after loop division, and a schedule of the divided macro-tasks for two processors (PE0, PE1). Macro-tasks belonging to the same Data Localization Group (dlg0-dlg3) are assigned to the same processor so that the localized data can stay in its local memory or cache.]
An Example of Data Localization for SPEC95 Swim

(a) An example of a target loop group for data localization:

      DO 200 J=1,N
      DO 200 I=1,M
        UNEW(I+1,J) = UOLD(I+1,J)+
     1    TDTS8*(Z(I+1,J+1)+Z(I+1,J))*(CV(I+1,J+1)+CV(I,J+1)+CV(I,J)
     2    +CV(I+1,J))-TDTSDX*(H(I+1,J)-H(I,J))
        VNEW(I,J+1) = VOLD(I,J+1)-TDTS8*(Z(I+1,J+1)+Z(I,J+1))
     1    *(CU(I+1,J+1)+CU(I,J+1)+CU(I,J)+CU(I+1,J))
     2    -TDTSDY*(H(I,J+1)-H(I,J))
        PNEW(I,J) = POLD(I,J)-TDTSDX*(CU(I+1,J)-CU(I,J))
     1    -TDTSDY*(CV(I,J+1)-CV(I,J))
  200 CONTINUE

      DO 210 J=1,N
        UNEW(1,J) = UNEW(M+1,J)
        VNEW(M+1,J+1) = VNEW(1,J+1)
        PNEW(M+1,J) = PNEW(1,J)
  210 CONTINUE

      DO 300 J=1,N
      DO 300 I=1,M
        UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
        VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
        POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
  300 CONTINUE

(b) Image of the alignment of the arrays (UN, VN, PN, UO, VO, PO, CU, CV, Z, H, U, V, P) over the 0-4MB cache space accessed by the target loops: cache line conflicts occur among arrays which share the same location on the cache.
Data Layout for Removing Line Conflict Misses by Array Dimension Padding

Declaration part of the arrays in SPEC95 swim:

before padding:
      PARAMETER (N1=513, N2=513)
after padding:
      PARAMETER (N1=513, N2=544)

      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     *       UNEW(N1,N2), VNEW(N1,N2),
     1       PNEW(N1,N2), UOLD(N1,N2),
     *       VOLD(N1,N2), POLD(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)

[Figure: images of the arrays on the 4MB cache before and after padding; box: access range of DLG0. Padding the second dimension shifts each array's cache alignment so that the access ranges of a data localization group no longer conflict.]
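To see why the padding helps, the sketch below computes where each of the 13 back-to-back COMMON arrays starts within the cache, assuming for illustration a 4MB cache treated as direct-mapped and 4-byte REAL elements (both are assumptions of this sketch, not statements about the actual target machine):

#include <stdio.h>

int main(void) {
    const long CACHE = 4L << 20;          /* 4MB cache              */
    const long n2[2] = { 513, 544 };      /* before / after padding */
    for (int v = 0; v < 2; v++) {
        long bytes = 513L * n2[v] * 4;    /* size of one array      */
        printf("%s padding (array = %ld bytes):\n",
               v ? "after" : "before", bytes);
        for (int a = 0; a < 13; a++)      /* base offset of each array in the cache */
            printf("  array %2d -> cache offset %7ld\n", a, (a * bytes) % CACHE);
    }
    return 0;
}

Under these assumptions, before padding arrays four apart land only about 16KB apart in the cache (4 x 1,052,676 mod 4MB = 16,400 bytes), so loops that touch many arrays per iteration keep evicting each other's lines; after padding (4 x 1,116,288 mod 4MB = 270,848 bytes) the array bases are spread out and the conflicts disappear.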
APC Compiler Organization

APC Multigrain Parallelizing Compiler:
• Data dependence analysis: inter-procedural, affine partitioning
• Multigrain parallelization: coarse grain, medium grain, hierarchical parallelism
• Automatic data distribution: cache optimization, DSM optimization
• Scheduling: static scheduling, dynamic scheduling, conditional speculative execution, architecture support
• OpenMP (FORTRAN) code generation

A parallelizing tuning system (program visualization technique, techniques for profiling and utilizing runtime information, feedback-directed selection technique of compiler directives) feeds runtime information back to the compiler.

The generated OpenMP code runs on a variety of shared-memory parallel machines through native compilers such as IBM XL Fortran Ver. 8.1 (IBM pSeries690), Sun Forte 6 Update 2 (Sun Ultra 80) and IBM XL Fortran Ver. 7.1 (IBM RS6000).
Image of Generated OpenMP Code for Hierarchical Multigrain Parallel Processing

[Figure: eight threads T0-T7 are forked once by an OpenMP SECTIONS construct, one SECTION per thread. In the 1st layer the threads form thread group 0 (T0-T3) and thread group 1 (T4-T7); the 1st-layer macro-tasks MT1_1, MT1_2 (DOALL), MT1_3 (SB) and MT1_4 (RB) are executed by the groups under centralized scheduling, with SYNC SEND / SYNC RECV between the groups. In the 2nd layer, MT1_3 is decomposed into MT1_3_1-MT1_3_6 and MT1_4 into MT1_4_1-MT1_4_4, executed inside each thread group under distributed scheduling.]
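A minimal C/OpenMP sketch of this "fork once, then form hierarchical thread groups" code shape; the real compiler emits OpenMP Fortran, and the task bodies and names here are placeholders (compile with an OpenMP-enabled compiler, e.g. cc -fopenmp):

#include <omp.h>
#include <stdio.h>

/* Placeholder macro-task body. */
static void mt(const char *name, int tid) { printf("T%d runs %s\n", tid, name); }

int main(void) {
    /* Single parallel region: threads are forked once and never rejoined
       until the end, as in the SECTIONS image above. */
    #pragma omp parallel num_threads(8)
    {
        int tid   = omp_get_thread_num();
        int group = tid / 4;   /* 1st layer: group 0 = T0-T3, group 1 = T4-T7 */

        if (group == 0) {
            mt("MT1_1", tid);
            /* ... SYNC SEND to thread group 1 would go here ...             */
            mt("MT1_2 (DOALL slice)", tid);    /* each thread takes a slice  */
        } else {
            /* ... SYNC RECV from thread group 0 ...                         */
            mt("MT1_3_x (2nd-layer MT, distributed scheduling)", tid);
        }
        #pragma omp barrier
        mt("MT1_4 (RB slice)", tid);
    }
    return 0;
}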
Performance of OSCAR Multigrain Parallelizing Compiler on an IBM p550q 8-Core Deskside Server

2.7 times speedup against the loop-parallelizing compiler on 8 cores.
[Bar chart: speedup ratio (0-9) of loop parallelization vs. multigrain parallelization for tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 (SPEC95) and swim, mgrid, applu, apsi (SPEC2000)]
OSCAR Compiler Performance on 24-Processor IBM p690 High-end SMP Server

[Figure: four 8-way MCMs, each with eight processor cores sharing L2, L3 and memory slots on GX buses, assembled into a 32-way pSeries 690]
[Bar chart: speedup ratio (0-25) of XL Fortran Ver. 8.1 vs. OSCAR for tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 (SPEC95) and swim, mgrid, applu, apsi (SPEC2000)]

4.82 times speedup against loop parallelization.
Performance on SGI Altix 450 Montecito 16-Processor cc-NUMA Server

Compiler options for the Intel Compiler:
- Automatic parallelization: -fast -parallel
- OpenMP codes generated by OSCAR: -fast -openmp

[Bar chart: speedup ratio (0-9) of Intel Ver. 8.1 vs. OSCAR for tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 (SPEC95) and swim, mgrid, applu, apsi (SPEC2000)]

• The OSCAR compiler gave 1.86 times speedup against the Intel Fortran Itanium Compiler revision 8.1.
Performance of OSCAR Compiler on 16-Core SGI Altix 450 Montvale Server

Compiler options for the Intel Compiler:
- Automatic parallelization: -fast -parallel
- OpenMP codes generated by OSCAR: -fast -openmp

[Bar chart: speedup ratio (0-6) of Intel Ver. 10.1 vs. OSCAR for tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 (SPEC95) and swim, mgrid, applu, apsi (SPEC2000)]

• The OSCAR compiler gave 2.32 times speedup against the Intel Fortran Itanium Compiler revision 10.1.
NEC/ARM MPCore Embedded 4-Core SMP

[Bar chart: speedup ratio (0-4.5) of g77 vs. OSCAR on 1-4 cores for tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d (SPEC95)]

3.48 times speedup by the OSCAR compiler on 4 cores against sequential processing.
Power Reduction by Power Supply, Clock Frequency and Voltage Control by OSCAR Compiler

• Shortest execution time mode
[Figure: frequency/voltage schedule of macro-tasks over time]
OSCAR Multi-Core Architecture

[Figure: chip multiprocessors CMP 0 - CMP m, each containing processor elements PE0-PEn. Each PE has a CPU, DTC, LDM/D-cache, LPM/I-cache, DSM, network interface and FVR, connected by an intra-chip connection network (multiple buses, crossbar, etc.) to an on-chip CSM / L2 cache with its own FVR. Chips, off-chip CSMs (CSM j) and I/O devices (I/O CMP k) are connected by an inter-chip connection network (crossbar, buses, multistage network, etc.).]

CSM: centralized shared memory; DSM: distributed shared memory; LDM: local data memory; LPM: local program memory; DTC: data transfer controller; FVR: frequency/voltage control register.
An Example of Machine Parameters for the Power Saving Scheme

• Functions of the multiprocessor:
- The frequency of each processor can be changed to several levels.
- The voltage is changed together with the frequency.
- Each processor can be powered on/off.

state            FULL   MID    LOW    OFF
frequency        1      1/2    1/4    0
voltage          1      0.87   0.71   0
dynamic energy   1      3/4    1/2    0
static power     1      1      1      0

• State transition overhead:

delay time [u.t.]                     energy overhead [μJ]
        FULL   MID    LOW    OFF             FULL  MID  LOW  OFF
FULL    0      40k    40k    80k      FULL   0     20   20   40
MID     40k    0      40k    80k      MID    20    0    20   40
LOW     40k    40k    0      80k      LOW    20    20   0    40
OFF     80k    80k    80k    0        OFF    40    40   40   0
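Given such a table, the compiler can slow down macro-tasks that have slack. The sketch below picks, for a single macro-task in real-time mode, the lowest-dynamic-energy state from the table that still meets the task's deadline; the cycle count and deadline are made-up numbers and this is not OSCAR's actual scheduling algorithm:

#include <stdio.h>

typedef struct { const char *name; double freq, dyn_energy; } PState;

/* FULL/MID/LOW rows of the machine-parameter table above (OFF cannot run work). */
static const PState states[] = {
    { "FULL", 1.0,  1.0  },
    { "MID",  0.5,  0.75 },
    { "LOW",  0.25, 0.5  },
};

int main(void) {
    double cycles   = 1.0e6;   /* work of the macro-task at FULL speed [u.t.] */
    double deadline = 3.0e6;   /* slack allowed by the real-time schedule     */
    const PState *best = &states[0];
    for (int i = 0; i < 3; i++) {
        double t = cycles / states[i].freq;    /* lower frequency, longer MT  */
        if (t <= deadline && states[i].dyn_energy < best->dyn_energy)
            best = &states[i];                 /* keep the cheapest feasible  */
    }
    printf("chosen state: %s (time %.0f u.t., relative dynamic energy %.2f)\n",
           best->name, cycles / best->freq, best->dyn_energy);
    return 0;   /* here: MID (2e6 u.t. <= deadline, energy 0.75) */
}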
Speed-up in Fastest Execution Mode

[Bar chart: speedup ratio (0-4.5) without power saving (w/o Saving) and with power saving (w Saving) on 1, 2 and 4 processors for tomcatv, swim, applu and mpeg2enc; maximum about 3.54]
Consumed Energy in Fastest Execution Mode

[Bar charts: consumed energy without and with power saving for 1, 2 and 4 processors; tomcatv, swim and applu in joules (0-200 J) and mpeg2_encode in millijoules (0-1600 mJ)]
Energy Reduction by OSCAR Compiler in Real-time Processing Mode (10% Leakage)

• Deadline = sequential execution time; leakage power: 10%
METI/NEDO National Project: Multicore for Real-time Consumer Electronics

<Goal> R&D of compiler cooperative multicore processor technology for consumer electronics like mobile phones, games, DVDs, digital TVs and car navigation systems (2005.7-2008.3)**.
<Period> From July 2005 to March 2008.
** Hitachi, Renesas, Fujitsu, Toshiba, Panasonic, NEC

[Figure: the OSCAR-type multicore architecture: multicore chips CMP 0 - CMP m with processor cores PC0-PCn, each core containing a CPU, DSM (distributed shared memory), LDM/D-cache (local data memory / L1 data cache), LPM/I-cache (local program memory / instruction cache), NI (network interface) and DTC (data transfer controller); an on-chip CSM / L2 cache; IntraCCN (intra-chip connection network: multiple buses, crossbar, etc.); and InterCCN (inter-chip connection network: multiple buses, crossbar, multistage network, etc.) connecting CSMs (centralized shared memory), I/O multicore chips (CSP k) and I/O devices. The new multicore processors aim at high performance, low power consumption, short HW/SW development periods, application sharing among chips, high reliability, and performance improvement along with semiconductor integration; they also target multicore integrated automotive ECUs. The developed multicore chips go into consumer electronics.]

<Features>
• Good cost performance
• Short hardware and software development periods
• Low power consumption
• Scalable performance improvement with the advancement of semiconductor technology
• Use of the same parallelizing compiler for multi-cores from different vendors using a newly developed API
NEDO Multicore Technology for Realtime Consumer Electronics: R&D Organization (2005.7-2008.3)

Integrated R&D Steering Committee. Chair: Hironori Kasahara (Waseda Univ.), Project Leader; Sub-chair: Kunio Uchiyama (Hitachi), Project Sub-leader.

Grant: Multicore Architecture and Compiler Research and Development. Project Leader: Prof. Hironori Kasahara, Waseda Univ.
• Standard Multicore Architecture & API Committee. Chair: Prof. Hironori Kasahara, Waseda Univ.; members: Hitachi, Fujitsu, Toshiba, NEC, Panasonic, Renesas Technology.
• Architecture & Compiler R&D Group. Group Leader: Associate Prof. Keiji Kimura, Waseda Univ. (outsourcing: Fujitsu).

Subsidy: Evaluation Environment for Multicore Technology Research and Development. Project Sub-leader: Dr. Kunio Uchiyama, Chief Researcher, Hitachi.
• R&D Steering Committee.
• Test Chip Development Group. Group Leader: Renesas Technology.
• Evaluation System Development Group. Group Leader: Hitachi.
Fujitsu FR-1000 Multicore Processor

[Figure: the FR550 VLIW processor core: integer operation units (instructions 0-3) and media operation units (instructions 4-7), GR register file, 32KB I-cache, 32KB D-cache, DMA-E, DMA-I and local bus IF. The FR-V multicore processor (FR1000) integrates four FR550 cores through a memory crossbar switch with two memory controllers and an IO chip on fast I/O buses and a bus crossbar. Memory bus: 64bit x 2ch / 266MHz; system bus: 64bit / 178MHz.]
Panasonic UniPhier
CELL Processor Overview

• Power Processor Element (PPE): a 64-bit Power Architecture core with VMX for conventional processing; the Power core processes OS and control tasks; 2-way multi-threaded; 32KB+32KB L1 and 512KB L2.
• Synergistic Processor Elements (SPE) for high Flops/Watt: 8 SPEs offer high performance; dual-issue RISC architecture; 128-bit SIMD (16-way); 128 x 128-bit general registers; 256KB local store; dedicated DMA engines.
[Figure: 8 SPEs (SPU + LS, 16B/cycle each) and the PPE (PPU, L1, L2; 32B/cycle to L2) on the EIB (up to 96B/cycle), with the MIC to dual XDR memory (16B/cycle) and the BIC to RRAC I/O (16B/cycle x2)]
1987 OSCAR (Optimally Scheduled Advanced Multiprocessor)

[Figure: OSCAR system organization: a HOST COMPUTER and CONTROL & I/O PROCESSOR; three banks of simultaneously readable CENTRALIZED SHARED MEMORY (CSM1-CSM3, same address across banks, with a read & write request arbitrator); and 16 processor elements PE1-PE16 on the system bus, organized as 5-PE clusters (SPC1-SPC3) and 8-PE processor clusters (LPC1-LPC2). Each PE is a 5 MFLOPS, 32-bit RISC processor (64 registers) with two banks of program memory, data memory, stack memory, a DMA controller, dual-port distributed shared memory and a bus interface.]
OSCAR PE (Processor Element)

[Figure: PE block diagram: bus interface to the system bus, local buses 1 and 2, IPU, FPU, register file, instruction control unit, data path, DMA controller, LPM, LSM, LDM, DSM and the instruction bus.]

DMA: DMA controller; LPM: local program memory (128KW x 2 banks); INSC: instruction control unit; LDM: local data memory (256KW); DP: data path; IPU: integer processing unit; FPU: floating processing unit; REG: register file (64 registers); LSM: local stack memory (4KW); DSM: distributed shared memory (2KW).
1987 OSCAR PE Board
OSCAR Memory Space

[Figure: system and local memory maps. In the system memory space (00000000-FFFFFFFF), the DSM of each PE (PE0 at 00000000-00100000, then PE1 through PE15), a broadcast area and the control processor (CP) are mapped, followed by the CSM1-3 accessing area (03000000-03400000). In the local memory space, each PE sees its own DSM (00000000-00000800), LPM bank 0 and bank 1, its LDM and the control/system area; the remaining regions are unused.]
OSCAR Multi-Core Architecture
(Same architecture slide as shown earlier: CMPs with PEs containing CPU, DTC, LDM/D-cache, LPM/I-cache, DSM, network interface and FVRs; on-chip CSM / L2 cache; intra-chip and inter-chip connection networks.)
API and Parallelizing Compiler in the METI/NEDO Advanced Multicore for Realtime Consumer Electronics Project

[Figure: compilation flow. A sequential application program (a subset of the C language) for real-time consumer electronics (image, audio, secure streaming, etc.) is translated by the Waseda OSCAR compiler into parallel codes for each vendor, annotated with an API that specifies data assignment, data transfer and power reduction control. Each vendor's tool chain (an API decoder in front of its own sequential/backend compiler) then produces machine codes for its chip: Renesas SH multicore, Fujitsu FR-V, CELL, etc. The scheduled tasks on Proc0-Proc2 show compiler-determined execution with stop and slow (frequency-reduced) periods and data transfer by the DTC (DMAC), on top of the real-time, secure Waseda OS.]
OSCAR Multigrain Parallelizing Compiler

• Automatic parallelization
- Multigrain parallel processing
- Data localization
- Data transfer overlapping
- Compiler-controlled power saving scheme
• Compiler cooperative multi-core architectures
- OSCAR multi-core architecture
- OSCAR heterogeneous multiprocessor architecture
- Commercial SMP machines
Processor Block Diagram (RP1)

[Figure: four 600MHz cores (#0-#3), each with CPU, FPU, 32KB I$, 32KB D$, 8KB IL, 16KB DL, CCN, 128KB URAM and a per-core LCPG (LCPG0-3); a snoop controller (SNC) connects the cores to the 300MHz on-chip system bus (SHwy), together with DBSC (DDR2, 32bit), PCIExpress (4 lanes), hardware IP, GCPG and LBSC (SRAM, 32bit).]

CCN: cache controller; IL, DL: instruction/data local memory; URAM: user RAM; GCPG: global clock pulse generator; LCPG: local CPG for each core; LBSC: SRAM controller; DBSC: DDR2 controller.
Chip Overview (SH4A Multicore SoC)

Process technology:   90nm, 8-layer, triple-Vth, CMOS
Chip size:            97.6mm2 (9.88mm x 9.88mm)
Supply voltage:       1.0V (internal), 1.8/3.3V (I/O)
Power consumption:    0.6 mW/MHz/CPU @ 600MHz (90nm G)
Clock frequency:      600MHz
CPU performance:      4320 MIPS (Dhrystone 2.1)
FPU performance:      16.8 GFLOPS
I/D cache:            32KB 4-way set-associative (each)
ILRAM/OLRAM:          8KB/16KB (each CPU)
URAM:                 128KB (each CPU)
Package:              FCBGA 554-pin, 29mm x 29mm

ISSCC07 Paper No. 5.3, Y. Yoshida, et al., "A 4320MIPS Four-Processor Core SMP/AMP with Individually Managed Clock Frequency for Low Power Consumption".
Performance on a Developed SH Multi-core (RP1: SH-X3) Using Compiler and API

Audio AAC* encoder: speedups of 1.00, 1.91, 2.95 and 3.82 on 1, 2, 3 and 4 processors.
Image Susan smoothing**: speedups of 1.00, 1.86, 2.70 and 3.43 on 1, 2, 3 and 4 processors.

*) ISO Advanced Audio Coding. **) MiBench embedded application benchmark by Michigan Univ.
RP2 Chip Photo and Specifications

Process technology:   90nm, 8-layer, triple-Vth, CMOS
Chip size:            104.8mm2 (10.61mm x 9.88mm)
CPU core size:        6.6mm2 (3.36mm x 1.96mm)
Supply voltage:       1.0V-1.4V (internal), 1.8/3.3V (I/O)
Power domains:        17 (8 CPUs, 8 URAMs, common)

[Chip photo: cores #0-#7 with ILRAM, DLRAM, I$, D$ and URAM; SNC0/SNC1, SHWY, VSWC, LBSC, DBSC, GCPG and DDR pads]
8-Core RP2 Chip Block Diagram

[Figure: two 4-core clusters (cluster #0: cores #0-#3; cluster #1: cores #4-#7). Each core has a CPU, FPU, 16KB I$, 16KB D$, local memories (I: 8K, D: 32K), CCN/BAR and 64KB user RAM (URAM). Each cluster has a snoop controller (SNC0/SNC1) and a local clock pulse generator (LCPG0/LCPG1); each core has a power control register (PCR0-PCR7); barrier synchronization lines connect the clusters. The on-chip system bus (SuperHyway) connects DDR2, SRAM and DMA controllers.]

LCPG: local clock pulse generator; PCR: power control register; CCN/BAR: cache controller / barrier register; URAM: user RAM.
Processing Performance on the Developed Multicore Using Automatic Parallelizing Compiler

Speedup against single-core execution for audio AAC* encoding: 1.0 on 1 core, 1.9 on 2 cores, 3.6 on 4 cores and 5.8 on 8 cores.
*) Advanced Audio Coding
Power Reduction Using Power Shutdown and Voltage Control by OSCAR Parallelizing Compiler

[Graph: consumed power [W] over time without power control and with power control (resume standby: power shutdown and voltage lowering). Average power drops from 5.82 W to 0.81 W: an 86.0% power reduction.]
OSCAR Heterogeneous Multicore

• OSCAR type memory architecture
- LPM: local program memory
- LDM: local data memory
- DSM: distributed shared memory
- CSM: centralized shared memory (on chip and/or off chip)
• DTU: data transfer unit
• Interconnection network: multiple buses, split transaction buses, crossbar, ...
Static Scheduling of Coarse Grain Tasks for a Heterogeneous Multi-core

[Figure: an MTG of 13 macro-tasks, each marked "for CPU" (MT1, MT2, MT3, MT6, MT8, MT11) or "for DRP" (MT4, MT5, MT7, MT9, MT10, MT12, MT13), statically scheduled onto CPU0, CPU1 and a DRP accelerator so that CPU tasks and DRP tasks run in parallel while the MTG's dependences are satisfied.]
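A toy sketch in the spirit of this slide: list scheduling in which each macro-task is pre-assigned a unit kind (CPU or DRP) and placed on the unit of that kind that frees up earliest, respecting its dependence. The task set, costs and the single-predecessor restriction are made-up simplifications, not OSCAR's actual scheduler:

#include <stdio.h>

enum { CPU, DRP };
typedef struct { int kind, cost, dep; } MT;   /* dep: predecessor MT index, -1 if none */

int main(void) {
    MT mt[] = { {CPU,4,-1}, {CPU,3,0}, {DRP,2,0}, {DRP,2,1}, {CPU,3,2} };
    int n = 5;
    int unit_free[3] = { 0, 0, 0 };           /* CPU0, CPU1, DRP0 ready times */
    int unit_kind[3] = { CPU, CPU, DRP };
    int finish[5];

    for (int t = 0; t < n; t++) {             /* tasks in topological order   */
        int ready = mt[t].dep < 0 ? 0 : finish[mt[t].dep];
        int best = -1;
        for (int u = 0; u < 3; u++)           /* earliest-free unit of the right kind */
            if (unit_kind[u] == mt[t].kind &&
                (best < 0 || unit_free[u] < unit_free[best]))
                best = u;
        int start = ready > unit_free[best] ? ready : unit_free[best];
        finish[t] = start + mt[t].cost;
        unit_free[best] = finish[t];
        printf("MT%d -> %s%d, start %d, finish %d\n", t + 1,
               mt[t].kind == CPU ? "CPU" : "DRP", best == 2 ? 0 : best,
               start, finish[t]);
    }
    return 0;
}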
An Image of Static Schedule for Heterogeneous Multi-core with Data Transfer Overlapping and Power Control

[Gantt-style figure over time: macro-tasks, overlapped DTU data transfers, and the frequency/voltage state of each core]
Compiler Performance on an OSCAR Heterogeneous Multi-core

25.2 times speedup using 4 SH general-purpose cores and 4 DRP accelerators against a single SH core.
[Bar chart: execution clock cycles and speedups against 1 SH core (general-purpose high-performance SH4A cores) for configurations from 1CPU up to 8CPU+8DRP, under three on-chip/off-chip CSM configurations (STB x1, BUS x3, STB x3); measured speedups include 1.00 (1CPU), 2.90, 5.2, 6.1, 9.1, 10.1-10.3 and 16.1-17.2, up to 41.3 times in the largest configuration.]
Power Reduction by OSCAR Compiler (4SHs+4DRPs)

0.78 W: 22% power reduction by compiler control.
[Bar chart: average power consumption [W] broken down into DRP, clock, FPU, IEU, BPU and register, with frequency/voltage control off (FVOFF) vs. on (FVON), for off-chip and on-chip CSM configurations (STB x1, Bus x3, STB x3); e.g., 1.01 W without control vs. 0.78 W with control. (2007.7.30)]
Conclusions

Compiler cooperative, low-power, high-effective-performance multi-core processors will become more important in a wide range of information systems, from games, mobile phones and automobiles to peta-scale supercomputers.
Parallelizing compilers are essential for the realization of:
• Good cost performance
• Short hardware and software development periods
• Low power consumption
• High software productivity
• Scalable performance improvement with advancement in semiconductor integration technology

Key technologies in multi-core compilers: multigrain parallelization, data localization, data transfer overlapping using DMA, and low power control technologies.