A Multigrain Parallelizing Compiler with Power Control for Multicore Processors

Hironori Kasahara
Professor, Department of Computer Science
Director, Advanced Chip-Multiprocessor Research Institute
Waseda University, Tokyo, Japan
http://www.kasahara.cs.waseda.ac.jp
Feb. 5, 2008, 13:30-15:00 at Intel Headquarters
Hironori Kasahara

<Personal History>
B.S. (1980, Waseda), M.S. (1982, Waseda), Ph.D. (1985, EE, Waseda). Research Associate (1983, Waseda), JSPS Special Research Fellow (1985), Visiting Scholar (1985, Univ. of California at Berkeley), Assistant Prof. (1986, Waseda), Assoc. Prof. (1988, Waseda), Visiting Research Scholar (1989-1990, Center for Supercomputing R&D, Univ. of Illinois at Urbana-Champaign), Prof. (1997-, Dept. of CS, Waseda). IFAC World Congress Young Author Prize (1987), IPSJ Sakai Memorial Special Award (1997), STARC Industry-Academia Cooperative Research Award (2004).

<Activities for Societies>
IPSJ: SIG Computer Architecture (Chair), Trans. of IPSJ Editorial Board (HG Chair), Journal of IPSJ Editorial Board (HWG Chair), 2001 Journal of IPSJ Special Issue on Parallel Processing (Chair of Editorial Board / Guest Editor), JSPP2000 (Program Chair), etc.
ACM: International Conference on Supercomputing (ICS) (Program Committee, esp. '96 ENIAC 50th Anniversary Co-Program Chair).
IEEE: Computer Society Japan Chapter Chair, Tokyo Section Board Member, SC07 PC.
OTHER: PCs of many conferences on supercomputing and parallel processing.

<Activities for Governments>
METI: IT Policy Proposal Forum (Architecture/HPC WG Chair), Super Advanced Electronic Basis Technology Investigation Committee.
NEDO: Millennium Project IT21 "Advanced Parallelizing Compiler" (Project Leader), Computer Strategy WG (Chair), Multicore for Realtime Consumer Electronics (Project Leader), etc.
MEXT: Earth Simulator project evaluation committee, 10PFLOPS supercomputer evaluation committee.
JAERI: Research accomplishment evaluation committee, CCSE 1st class invited researcher.
JST: Scientific Research Fund Sub-Committee, COINS Steering Committee, Precursory Research for Embryonic Science and Technology (Research Area Adviser).
Cabinet Office: CSTP Expert Panel on Basic Policy, Information & Communication Field Promotion Strategy, R&D Infrastructure WG, Software & Security WG.

<Papers>
Papers 158, Invited Talks 64, Tech. Reports 114, Symposium 25, Newspapers/TV/Web News/Magazine 139; IEEE Trans. Computers, IPSJ Trans., ISSCC, Cool Chips, Supercomputing, ACM ICS, etc.
Multi-core Everywhere
Multi-core from embedded to supercomputers

• Consumer Electronics (Embedded): Mobile Phone, Game, Digital TV, Car Navigation, DVD, Camera; IBM/Sony/Toshiba Cell, Fujitsu FR1000, NEC/ARM MPCore & MP211, Panasonic UniPhier, Renesas SH multi-core (4-core RP1, 8-core RP2), Tilera Tile64, SPI Storm-1 (16 VLIW cores)
• PCs, Servers: Intel Quad Xeon, Core 2 Quad, Montvale, Tukwila, 80-core; AMD Quad-Core Opteron, Phenom
• WSs, Deskside & High-end Servers: IBM Power4, 5, 5+, 6; Sun Niagara (SPARC T1, T2), Rock
• Supercomputers: IBM BlueGene/L (Lawrence Livermore National Laboratory, scheduled to start operation in April 2005)

OSCAR Type Multi-core Chip by Renesas in METI/NEDO Multicore for Real-time Consumer Electronics Project (Leader: Prof. Kasahara)
Supercomputers
• Earth Simulator: 40TFLOPS, 2002, 5120 vector processors
• IBM Blue Gene/L: 360TFLOPS, 2005, a supercomputer based on low-power chip multiprocessors at Lawrence Livermore National Laboratory (scheduled to start operation in April 2005): 65,536 processor chips with 2 processors integrated per chip, i.e., 128K processors performing 360 trillion floating-point operations per second, with 1024 processor chips per cabinet; BG/P follows in 2008.

High-quality application software, productivity, cost-performance and low power consumption are important, e.g., for mobile phones and games.
Compiler-cooperated multi-core processors are promising to realize the above features.
Market of Consumer Electronics
1 Trillion Dollars in 2010 (World Wide)

[Bar chart: worldwide market size (shipment value) of consumer electronics for '02-'06, 0 to 2 trillion yen: mobile phones, digital cameras, digital video, flat-panel TVs, DVD recorders, car navigation systems]

Annual average growth rates ('03 to '07):

Product                                   '03     '07    Annual growth rate
Digital still cameras (M units)            49      76    12%
DVDs for PCs (recordable) (M units)        27     114    43%
DVD recorders (M units)                    3.6     33    74%
Digital TVs (M units)                       6      27    45%
Mobile phones (M units)                   490     670     8%
Semiconductor demand for cars (B$)         14.0    20.9  11%

Source: NEDO Roadmap Report Meeting, Electronics and Information Technology Development Department, "Technology Development Strategy", May 11, 2005.
Roadmap of Compiler Cooperative Multicore Project (2000-2010)

• Millennium Project IT21, NEDO Advanced Parallelizing Compiler (Waseda Univ., Fujitsu, Hitachi, JIPDEC, AIST): parallelizing compiler development for multiprocessor servers, applied to supercomputer compilers.
• STARC compiler cooperative practical research, following basic research (STARC: Semiconductor Technology Academic Research Center; Fujitsu, Toshiba, NEC, Renesas, Panasonic, Sony, etc.): compiler cooperative chip multiprocessor (Waseda Univ., Fujitsu, NEC, Toshiba, Panasonic, Sony), from architecture and compiler R&D plan to practical use.
• NEDO (2004.07-2007.06) Heterogeneous Multiprocessor (Waseda Univ., Hitachi): soft real-time multicore architecture and compiler, from plan to practical use.
• NEDO (2005.06-2008.03) Multicore Technology for Realtime Consumer Electronics (Waseda Univ., Hitachi, Renesas, Fujitsu, NEC, Toshiba, Panasonic): power-saving multicore architecture, parallelizing compiler and API, from R&D to practical use.
• NEDO (2007.02-2010.03) Heterogeneous Multicore for Consumer Electronics (Waseda Univ., Hitachi, Renesas, Tokyo Inst. of Tech.): heterogeneous multicore architecture and compiler R&D.
METI/NEDO Advanced Parallelizing Compiler (APC) Technology Project
Millennium Project IT21, 2000.9.8-2003.3.31 (Waseda Univ., Fujitsu, Hitachi, AIST)

Background and Problems:
① Adoption of parallel processing as a core technology from PCs to HPC
② Increasing importance of software in IT
③ Need for improvement of cost-performance and usability

Contents of Research and Development:
① R&D of an advanced parallelizing compiler: multigrain parallelization, data localization, overhead hiding
② R&D of performance evaluation technology for parallelizing compilers

<Purpose> Improvement of ① effective performance, ② cost-performance and ③ ease of use. Project goal: double the effective performance.
[Chart: theoretical maximum (hardware peak) performance vs. effective performance of HPC, 1G to 1T over 1980-2000; effective performance slows down relative to peak performance.]

Ripple Effects:
① Development of competitive next-generation PCs and HPC
② Putting the innovative automatic parallelizing compiler technology to practical use
③ Development and market acquisition of future single-chip multiprocessors
④ Boosting R&D in many fields: IT, biotech, devices, earth environment, next-generation VLSI design, financial engineering, weather forecasting, new clean energy, space development, automobiles, electronic commerce, etc.
Performance of APC Compiler on IBM pSeries690 16-Processor High-end Server

• IBM XL Fortran for AIX Version 8.1
- Sequential execution: -O5 -qarch=pwr4
- Automatic loop parallelization: -O5 -qsmp=auto -qarch=pwr4
- OSCAR compiler: -O5 -qsmp=noauto -qarch=pwr4 (su2cor: -O4 -qstrict)

3.5 times speedup on average against XL Fortran automatic loop parallelization.
[Bar chart: speedup ratio (0.0-12.0) of XL Fortran (max) vs. APC (max), with execution times, for tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 (SPEC95) and wupwise, swim2k, mgrid2k, applu2k, sixtrack, apsi2k (SPEC2000)]
Performance of Multigrain Parallel Processing for 102.swim on IBM pSeries690

[Graph: speedup ratio (0.00-16.00) of XLF(AUTO) vs. OSCAR over 1-16 processors]
Performance of APC Compiler on Sun Ultra80 4-Processor Workstation

• Sun Forte Developer 6 Update 2
- Sequential execution: -fast
- Automatic loop parallelization: -fast -autopar -reduction -stackvar
- OSCAR compiler: -fast -explicitpar -mp=openmp -stackvar

[Bar chart: speedup ratio (0.0-10.0) of Forte6 (max) vs. APC (max), with execution times, for tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 (SPEC95) and wupwise, swim2k, mgrid2k, applu2k, sixtrack, apsi2k (SPEC2000)]
OSCAR Parallelizing Compiler

Improves effective performance, cost-performance and productivity and reduces consumed power through:
• Multigrain Parallelization: exploitation of parallelism from the whole program by use of coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism.
• Data Localization: automatic data distribution for distributed shared memory, cache and local memory on multiprocessor systems.
• Data Transfer Overlapping: data transfer overhead hiding by overlapping task execution and data transfer using DMA or data pre-fetching.
• Power Reduction: reduction of consumed power by compiler control of frequency, voltage and power shutdown with hardware supports.
Generation of Coarse Grain Tasks

A program is decomposed into macro-tasks (MTs):
• Block of Pseudo Assignments (BPA): basic block (BB)
• Repetition Block (RB): outermost natural loop
• Subroutine Block (SB): subroutine

The decomposition is hierarchical: the program is first partitioned into BPAs, RBs and SBs for coarse-grain parallelization (1st layer); the loop body of each RB and the body of each SB are decomposed again into BPAs, RBs and SBs (2nd and 3rd layers, over the total system); BPAs can further be parallelized at near-fine grain among statements, and RBs at loop level.
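As a rough illustration of this hierarchical decomposition, the sketch below models macro-tasks with a small recursive C structure; the type and function names are hypothetical and do not reflect the OSCAR compiler's actual internals.

#include <stdio.h>

/* Hypothetical model of OSCAR-style macro-tasks (MTs):
   BPA = basic block, RB = outermost natural loop, SB = subroutine. */
typedef enum { MT_BPA, MT_RB, MT_SB } MTKind;

typedef struct MacroTask {
    MTKind kind;
    int id;                      /* id within its layer             */
    struct MacroTask **children; /* inner MTs (next layer), if any  */
    int n_children;
} MacroTask;

/* Print the hierarchy layer by layer (1st layer, 2nd layer, ...). */
static void dump(const MacroTask *mt, int layer) {
    static const char *names[] = { "BPA", "RB", "SB" };
    printf("%*slayer %d: MT%d (%s)\n",
           2 * layer, "", layer + 1, mt->id, names[mt->kind]);
    for (int i = 0; i < mt->n_children; i++)
        dump(mt->children[i], layer + 1);
}

int main(void) {
    MacroTask b1 = { MT_BPA, 2, NULL, 0 }, b2 = { MT_BPA, 3, NULL, 0 };
    MacroTask *body[] = { &b1, &b2 };
    MacroTask loop = { MT_RB, 1, body, 2 };  /* an RB whose body splits into two BPAs */
    dump(&loop, 0);
    return 0;
}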
Earliest Executable Condition Analysis for Coarse Grain Tasks (Macro-tasks)

The compiler first builds a Macro Flow Graph (MFG): its nodes are macro-tasks (blocks of pseudo assignment statements, repetition blocks, subroutine blocks) connected by control-flow edges, data-dependency edges and conditional branches. Earliest executable condition analysis then derives a Macro Task Graph (MTG), in which each macro-task carries the earliest condition under which it may start, expressed as AND/OR combinations of data dependences and extended control dependences. A macro-task can thus begin as soon as the branches it depends on are resolved and the data it needs are produced, instead of waiting for the original control flow.
[Figure: a 14-node Macro Flow Graph (solid edges: data dependency; dotted edges: control flow; small circles: conditional branches) and the Macro Task Graph derived from it (solid edges: data dependency; dotted edges: extended control dependency; arcs across incoming edges: AND/OR conditions).]
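A minimal sketch of how such an earliest executable condition, an OR of ANDs over "predecessor finished" and "branch resolved toward this task" events, could be evaluated at run time. All names and the runtime representation are illustrative assumptions, not the compiler's implementation.

#include <stdbool.h>
#include <stdio.h>

#define N_MT 16
static bool done[N_MT];        /* done[m]: macro-task m has finished             */
static int  branch_dir[N_MT];  /* resolved branch direction of m (0: unresolved) */

/* One conjunct: either "MT pred has finished" (data dependence) or
   "the branch in MT pred was resolved to direction dir" (control dep.). */
typedef struct { int pred; int dir; bool is_branch; } Term;
typedef struct { const Term *terms; int n; } Clause;   /* AND of terms  */
typedef struct { const Clause *clauses; int n; } EEC;  /* OR of clauses */

static bool may_start(const EEC *c) {
    for (int i = 0; i < c->n; i++) {
        bool ok = true;
        for (int j = 0; j < c->clauses[i].n; j++) {
            const Term *t = &c->clauses[i].terms[j];
            ok = ok && (t->is_branch ? branch_dir[t->pred] == t->dir
                                     : done[t->pred]);
        }
        if (ok) return true;   /* one satisfied clause is enough */
    }
    return false;
}

int main(void) {
    /* "MT6 may start when MT2 has finished, OR MT1's branch took direction 1." */
    Term t_data = { 2, 0, false }, t_branch = { 1, 1, true };
    Clause cl[] = { { &t_data, 1 }, { &t_branch, 1 } };
    EEC mt6 = { cl, 2 };
    branch_dir[1] = 1;                                 /* MT1's branch resolved */
    printf("MT6 may start: %d\n", may_start(&mt6));    /* prints 1 */
    return 0;
}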
Automatic Processor Assignment in 103.su2cor

• Using 14 processors
- Coarse-grain parallelization within DO400 of subroutine LOOPS
[Figure: MTG of su2cor LOOPS DO400 with node types DOALL, sequential loop, BB and SB; coarse-grain parallelism PARA_ALD = 4.3]
Data-Localization: Loop Aligned Decomposition

• Decompose multiple loops (Doall and Doseq) into CARs and LRs considering inter-loop data dependence.
- Most data in an LR can be passed through local memory (LM).
- LR: Localizable Region, CAR: Commonly Accessed Region

Target loops:

C RB1 (Doall)
      DO I=1,101
        A(I)=2*I
      ENDDO
C RB2 (Doseq)
      DO I=1,100
        B(I)=B(I-1)+A(I)+A(I+1)
      ENDDO
C RB3 (Doall)
      DO I=2,100
        C(I)=B(I)+B(I-1)
      ENDDO

Aligned decomposition into LRs and CARs:
• RB1: LR DO I=1,33 | CAR DO I=34,35 | LR DO I=36,66 | CAR DO I=67,68 | LR DO I=69,101
• RB2: LR DO I=1,33 | CAR DO I=34,34 | LR DO I=35,66 | CAR DO I=67,67 | LR DO I=68,100
• RB3: LR DO I=2,34 | LR DO I=35,67 | LR DO I=68,100
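The split points above follow from the dependence distances. As a worked illustration (a sketch of the reasoning, not OSCAR's actual algorithm), one can start from the block partition of RB3 and propagate the consumed index ranges backward; iterations needed by two adjacent groups become the CARs:

#include <stdio.h>

typedef struct { int lo, hi; } Range;

/* Block partition of RB3 chosen for three data-localization groups. */
static const Range rb3[3] = { {2,34}, {35,67}, {68,100} };

int main(void) {
    Range rb2[3], rb1[3];
    for (int p = 0; p < 3; p++) {
        /* C(I)=B(I)+B(I-1): RB3 range [lo,hi] consumes B(lo-1..hi). */
        rb2[p] = (Range){ rb3[p].lo - 1, rb3[p].hi };
        /* B(I)=B(I-1)+A(I)+A(I+1): consumes A(lo..hi+1).            */
        rb1[p] = (Range){ rb2[p].lo, rb2[p].hi + 1 };
    }
    /* Iterations consumed by two adjacent groups form the CARs.     */
    for (int p = 0; p + 1 < 3; p++) {
        printf("RB2 CAR between groups %d and %d: I=%d..%d\n",
               p, p + 1, rb2[p + 1].lo, rb2[p].hi);
        printf("RB1 CAR between groups %d and %d: I=%d..%d\n",
               p, p + 1, rb1[p + 1].lo, rb1[p].hi);
    }
    return 0;   /* prints exactly the CAR ranges listed above */
}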
Data Localization

[Figure: an MTG, the MTG after loop division, and a schedule of the divided macro-tasks for two processors (PE0, PE1). Macro-tasks belonging to the same Data Localization Group (dlg0-dlg3) are assigned to the same processor so that the localized data can stay in its local memory or cache.]
An Example of Data Localization for SPEC95 Swim

(a) An example of a target loop group for data localization:

      DO 200 J=1,N
      DO 200 I=1,M
        UNEW(I+1,J) = UOLD(I+1,J)+
     1    TDTS8*(Z(I+1,J+1)+Z(I+1,J))*(CV(I+1,J+1)+CV(I,J+1)+CV(I,J)
     2    +CV(I+1,J))-TDTSDX*(H(I+1,J)-H(I,J))
        VNEW(I,J+1) = VOLD(I,J+1)-TDTS8*(Z(I+1,J+1)+Z(I,J+1))
     1    *(CU(I+1,J+1)+CU(I,J+1)+CU(I,J)+CU(I+1,J))
     2    -TDTSDY*(H(I,J+1)-H(I,J))
        PNEW(I,J) = POLD(I,J)-TDTSDX*(CU(I+1,J)-CU(I,J))
     1    -TDTSDY*(CV(I,J+1)-CV(I,J))
  200 CONTINUE

      DO 210 J=1,N
        UNEW(1,J) = UNEW(M+1,J)
        VNEW(M+1,J+1) = VNEW(1,J+1)
        PNEW(M+1,J) = PNEW(1,J)
  210 CONTINUE

      DO 300 J=1,N
      DO 300 I=1,M
        UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
        VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
        POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
  300 CONTINUE

(b) Image of the alignment of the arrays (UN, VN, PN, UO, VO, PO, CU, CV, Z, H, U, V, P) over the 0-4MB cache space accessed by the target loops: cache line conflicts occur among arrays which share the same location on the cache.
Data Layout for Removing Line Conflict Misses by Array Dimension Padding

Declaration part of the arrays in SPEC95 swim:

before padding:
      PARAMETER (N1=513, N2=513)
after padding:
      PARAMETER (N1=513, N2=544)

      COMMON U(N1,N2), V(N1,N2), P(N1,N2),
     *       UNEW(N1,N2), VNEW(N1,N2),
     1       PNEW(N1,N2), UOLD(N1,N2),
     *       VOLD(N1,N2), POLD(N1,N2),
     2       CU(N1,N2), CV(N1,N2),
     *       Z(N1,N2), H(N1,N2)

[Figure: images of the arrays on the 4MB cache before and after padding; box: access range of DLG0. Padding the second dimension shifts each array's cache alignment so that the access ranges of a data localization group no longer conflict.]
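To see why the padding helps, the sketch below computes where each of the 13 back-to-back COMMON arrays starts within the cache, assuming for illustration a 4MB cache treated as direct-mapped and 4-byte REAL elements (both are assumptions of this sketch, not statements about the actual target machine):

#include <stdio.h>

int main(void) {
    const long CACHE = 4L << 20;          /* 4MB cache              */
    const long n2[2] = { 513, 544 };      /* before / after padding */
    for (int v = 0; v < 2; v++) {
        long bytes = 513L * n2[v] * 4;    /* size of one array      */
        printf("%s padding (array = %ld bytes):\n",
               v ? "after" : "before", bytes);
        for (int a = 0; a < 13; a++)      /* base offset of each array in the cache */
            printf("  array %2d -> cache offset %7ld\n", a, (a * bytes) % CACHE);
    }
    return 0;
}

Under these assumptions, before padding arrays four apart land only about 16KB apart in the cache (4 x 1,052,676 mod 4MB = 16,400 bytes), so loops that touch many arrays per iteration keep evicting each other's lines; after padding (4 x 1,116,288 mod 4MB = 270,848 bytes) the array bases are spread out and the conflicts disappear.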
APC Compiler Organization

APC Multigrain Parallelizing Compiler:
• Data dependence analysis: inter-procedural, affine partitioning
• Multigrain parallelization: coarse grain, medium grain, hierarchical parallelism
• Automatic data distribution: cache optimization, DSM optimization
• Scheduling: static scheduling, dynamic scheduling, conditional speculative execution, architecture support
• OpenMP (FORTRAN) code generation

A parallelizing tuning system (program visualization technique, techniques for profiling and utilizing runtime information, feedback-directed selection technique of compiler directives) feeds runtime information back to the compiler.

The generated OpenMP code runs on a variety of shared-memory parallel machines through native compilers such as IBM XL Fortran Ver. 8.1 (IBM pSeries690), Sun Forte 6 Update 2 (Sun Ultra 80) and IBM XL Fortran Ver. 7.1 (IBM RS6000).
Image of Generated OpenMP Code for Hierarchical Multigrain Parallel Processing

[Figure: eight threads T0-T7 are forked once by an OpenMP SECTIONS construct, one SECTION per thread. In the 1st layer the threads form thread group 0 (T0-T3) and thread group 1 (T4-T7); the 1st-layer macro-tasks MT1_1, MT1_2 (DOALL), MT1_3 (SB) and MT1_4 (RB) are executed by the groups under centralized scheduling, with SYNC SEND / SYNC RECV between the groups. In the 2nd layer, MT1_3 is decomposed into MT1_3_1-MT1_3_6 and MT1_4 into MT1_4_1-MT1_4_4, executed inside each thread group under distributed scheduling.]
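A minimal C/OpenMP sketch of this "fork once, then form hierarchical thread groups" code shape; the real compiler emits OpenMP Fortran, and the task bodies and names here are placeholders (compile with an OpenMP-enabled compiler, e.g. cc -fopenmp):

#include <omp.h>
#include <stdio.h>

/* Placeholder macro-task body. */
static void mt(const char *name, int tid) { printf("T%d runs %s\n", tid, name); }

int main(void) {
    /* Single parallel region: threads are forked once and never rejoined
       until the end, as in the SECTIONS image above. */
    #pragma omp parallel num_threads(8)
    {
        int tid   = omp_get_thread_num();
        int group = tid / 4;   /* 1st layer: group 0 = T0-T3, group 1 = T4-T7 */

        if (group == 0) {
            mt("MT1_1", tid);
            /* ... SYNC SEND to thread group 1 would go here ...             */
            mt("MT1_2 (DOALL slice)", tid);    /* each thread takes a slice  */
        } else {
            /* ... SYNC RECV from thread group 0 ...                         */
            mt("MT1_3_x (2nd-layer MT, distributed scheduling)", tid);
        }
        #pragma omp barrier
        mt("MT1_4 (RB slice)", tid);
    }
    return 0;
}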
Performance of OSCAR Multigrain Parallelizing Compiler on an IBM p550q 8-Core Deskside Server

2.7 times speedup against the loop-parallelizing compiler on 8 cores.
[Bar chart: speedup ratio (0-9) of loop parallelization vs. multigrain parallelization for tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 (SPEC95) and swim, mgrid, applu, apsi (SPEC2000)]
OSCAR Compiler Performance on 24-Processor IBM p690 High-end SMP Server

[Figure: four 8-way MCMs, each with eight processor cores sharing L2, L3 and memory slots on GX buses, assembled into a 32-way pSeries 690]
[Bar chart: speedup ratio (0-25) of XL Fortran Ver. 8.1 vs. OSCAR for tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 (SPEC95) and swim, mgrid, applu, apsi (SPEC2000)]

4.82 times speedup against loop parallelization.
Performance on SGI Altix 450 Montecito 16-Processor cc-NUMA Server

Compiler options for the Intel Compiler:
- Automatic parallelization: -fast -parallel
- OpenMP codes generated by OSCAR: -fast -openmp

[Bar chart: speedup ratio (0-9) of Intel Ver. 8.1 vs. OSCAR for tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 (SPEC95) and swim, mgrid, applu, apsi (SPEC2000)]

• The OSCAR compiler gave 1.86 times speedup against the Intel Fortran Itanium Compiler revision 8.1.
Performance of OSCAR Compiler on 16-Core SGI Altix 450 Montvale Server

Compiler options for the Intel Compiler:
- Automatic parallelization: -fast -parallel
- OpenMP codes generated by OSCAR: -fast -openmp

[Bar chart: speedup ratio (0-6) of Intel Ver. 10.1 vs. OSCAR for tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5 (SPEC95) and swim, mgrid, applu, apsi (SPEC2000)]

• The OSCAR compiler gave 2.32 times speedup against the Intel Fortran Itanium Compiler revision 10.1.
NEC/ARM MPCore Embedded 4-Core SMP

[Bar chart: speedup ratio (0-4.5) of g77 vs. OSCAR on 1-4 cores for tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d (SPEC95)]

3.48 times speedup by the OSCAR compiler on 4 cores against sequential processing.
Power Reduction by Power Supply, Clock Frequency and Voltage Control by OSCAR Compiler

• Shortest execution time mode
[Figure: frequency/voltage schedule of macro-tasks over time]
OSCAR Multi-Core Architecture

[Figure: chip multiprocessors CMP 0 - CMP m, each containing processor elements PE0-PEn. Each PE has a CPU, DTC, LDM/D-cache, LPM/I-cache, DSM, network interface and FVR, connected by an intra-chip connection network (multiple buses, crossbar, etc.) to an on-chip CSM / L2 cache with its own FVR. Chips, off-chip CSMs (CSM j) and I/O devices (I/O CMP k) are connected by an inter-chip connection network (crossbar, buses, multistage network, etc.).]

CSM: centralized shared memory; DSM: distributed shared memory; LDM: local data memory; LPM: local program memory; DTC: data transfer controller; FVR: frequency/voltage control register.
An Example of Machine Parameters for the Power Saving Scheme

• Functions of the multiprocessor:
- The frequency of each processor can be changed to several levels.
- The voltage is changed together with the frequency.
- Each processor can be powered on/off.

state            FULL   MID    LOW    OFF
frequency        1      1/2    1/4    0
voltage          1      0.87   0.71   0
dynamic energy   1      3/4    1/2    0
static power     1      1      1      0

• State transition overhead:

delay time [u.t.]                     energy overhead [μJ]
        FULL   MID    LOW    OFF             FULL  MID  LOW  OFF
FULL    0      40k    40k    80k      FULL   0     20   20   40
MID     40k    0      40k    80k      MID    20    0    20   40
LOW     40k    40k    0      80k      LOW    20    20   0    40
OFF     80k    80k    80k    0        OFF    40    40   40   0
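Given such a table, the compiler can slow down macro-tasks that have slack. The sketch below picks, for a single macro-task in real-time mode, the lowest-dynamic-energy state from the table that still meets the task's deadline; the cycle count and deadline are made-up numbers and this is not OSCAR's actual scheduling algorithm:

#include <stdio.h>

typedef struct { const char *name; double freq, dyn_energy; } PState;

/* FULL/MID/LOW rows of the machine-parameter table above (OFF cannot run work). */
static const PState states[] = {
    { "FULL", 1.0,  1.0  },
    { "MID",  0.5,  0.75 },
    { "LOW",  0.25, 0.5  },
};

int main(void) {
    double cycles   = 1.0e6;   /* work of the macro-task at FULL speed [u.t.] */
    double deadline = 3.0e6;   /* slack allowed by the real-time schedule     */
    const PState *best = &states[0];
    for (int i = 0; i < 3; i++) {
        double t = cycles / states[i].freq;    /* lower frequency, longer MT  */
        if (t <= deadline && states[i].dyn_energy < best->dyn_energy)
            best = &states[i];                 /* keep the cheapest feasible  */
    }
    printf("chosen state: %s (time %.0f u.t., relative dynamic energy %.2f)\n",
           best->name, cycles / best->freq, best->dyn_energy);
    return 0;   /* here: MID (2e6 u.t. <= deadline, energy 0.75) */
}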
Speed-up in Fastest Execution Mode

[Bar chart: speedup ratio (0-4.5) without power saving (w/o Saving) and with power saving (w Saving) on 1, 2 and 4 processors for tomcatv, swim, applu and mpeg2enc; maximum about 3.54]
Consumed Energy in Fastest Execution Mode

[Bar charts: consumed energy without and with power saving for 1, 2 and 4 processors; tomcatv, swim and applu in joules (0-200 J) and mpeg2_encode in millijoules (0-1600 mJ)]
Energy Reduction by OSCAR Compiler in Real-time Processing Mode (10% Leakage)

• Deadline = sequential execution time; leakage power: 10%
METI/NEDO National Project: Multicore for Real-time Consumer Electronics

<Goal> R&D of compiler cooperative multicore processor technology for consumer electronics like mobile phones, games, DVDs, digital TVs and car navigation systems (2005.7-2008.3)**.
<Period> From July 2005 to March 2008.
** Hitachi, Renesas, Fujitsu, Toshiba, Panasonic, NEC

[Figure: the OSCAR-type multicore architecture: multicore chips CMP 0 - CMP m with processor cores PC0-PCn, each core containing a CPU, DSM (distributed shared memory), LDM/D-cache (local data memory / L1 data cache), LPM/I-cache (local program memory / instruction cache), NI (network interface) and DTC (data transfer controller); an on-chip CSM / L2 cache; IntraCCN (intra-chip connection network: multiple buses, crossbar, etc.); and InterCCN (inter-chip connection network: multiple buses, crossbar, multistage network, etc.) connecting CSMs (centralized shared memory), I/O multicore chips (CSP k) and I/O devices. The new multicore processors aim at high performance, low power consumption, short HW/SW development periods, application sharing among chips, high reliability, and performance improvement along with semiconductor integration; they also target multicore integrated automotive ECUs. The developed multicore chips go into consumer electronics.]

<Features>
• Good cost performance
• Short hardware and software development periods
• Low power consumption
• Scalable performance improvement with the advancement of semiconductor technology
• Use of the same parallelizing compiler for multi-cores from different vendors using a newly developed API
NEDO Multicore Technology for Realtime Consumer Electronics: R&D Organization (2005.7-2008.3)

Integrated R&D Steering Committee. Chair: Hironori Kasahara (Waseda Univ.), Project Leader; Sub-chair: Kunio Uchiyama (Hitachi), Project Sub-leader.

Grant: Multicore Architecture and Compiler Research and Development. Project Leader: Prof. Hironori Kasahara, Waseda Univ.
• Standard Multicore Architecture & API Committee. Chair: Prof. Hironori Kasahara, Waseda Univ.; members: Hitachi, Fujitsu, Toshiba, NEC, Panasonic, Renesas Technology.
• Architecture & Compiler R&D Group. Group Leader: Associate Prof. Keiji Kimura, Waseda Univ. (outsourcing: Fujitsu).

Subsidy: Evaluation Environment for Multicore Technology Research and Development. Project Sub-leader: Dr. Kunio Uchiyama, Chief Researcher, Hitachi.
• R&D Steering Committee.
• Test Chip Development Group. Group Leader: Renesas Technology.
• Evaluation System Development Group. Group Leader: Hitachi.
Fujitsu FR-1000 Multicore Processor

[Figure: the FR550 VLIW processor core: integer operation units (instructions 0-3) and media operation units (instructions 4-7), GR register file, 32KB I-cache, 32KB D-cache, DMA-E, DMA-I and local bus IF. The FR-V multicore processor (FR1000) integrates four FR550 cores through a memory crossbar switch with two memory controllers and an IO chip on fast I/O buses and a bus crossbar. Memory bus: 64bit x 2ch / 266MHz; system bus: 64bit / 178MHz.]
Panasonic UniPhier
CELL Processor Overview

• Power Processor Element (PPE): a 64-bit Power Architecture core with VMX for conventional processing; the Power core processes OS and control tasks; 2-way multi-threaded; 32KB+32KB L1 and 512KB L2.
• Synergistic Processor Elements (SPE) for high Flops/Watt: 8 SPEs offer high performance; dual-issue RISC architecture; 128-bit SIMD (16-way); 128 x 128-bit general registers; 256KB local store; dedicated DMA engines.
[Figure: 8 SPEs (SPU + LS, 16B/cycle each) and the PPE (PPU, L1, L2; 32B/cycle to L2) on the EIB (up to 96B/cycle), with the MIC to dual XDR memory (16B/cycle) and the BIC to RRAC I/O (16B/cycle x2)]
1987 OSCAR (Optimally Scheduled Advanced Multiprocessor)

[Figure: OSCAR system organization: a HOST COMPUTER and CONTROL & I/O PROCESSOR; three banks of simultaneously readable CENTRALIZED SHARED MEMORY (CSM1-CSM3, same address across banks, with a read & write request arbitrator); and 16 processor elements PE1-PE16 on the system bus, organized as 5-PE clusters (SPC1-SPC3) and 8-PE processor clusters (LPC1-LPC2). Each PE is a 5 MFLOPS, 32-bit RISC processor (64 registers) with two banks of program memory, data memory, stack memory, a DMA controller, dual-port distributed shared memory and a bus interface.]
OSCAR PE (Processor Element)

[Figure: PE block diagram: bus interface to the system bus, local buses 1 and 2, IPU, FPU, register file, instruction control unit, data path, DMA controller, LPM, LSM, LDM, DSM and the instruction bus.]

DMA: DMA controller; LPM: local program memory (128KW x 2 banks); INSC: instruction control unit; LDM: local data memory (256KW); DP: data path; IPU: integer processing unit; FPU: floating processing unit; REG: register file (64 registers); LSM: local stack memory (4KW); DSM: distributed shared memory (2KW).
1987 OSCAR PE Board
OSCAR Memory Space

[Figure: system and local memory maps. In the system memory space (00000000-FFFFFFFF), the DSM of each PE (PE0 at 00000000-00100000, then PE1 through PE15), a broadcast area and the control processor (CP) are mapped, followed by the CSM1-3 accessing area (03000000-03400000). In the local memory space, each PE sees its own DSM (00000000-00000800), LPM bank 0 and bank 1, its LDM and the control/system area; the remaining regions are unused.]
OSCAR Multi-Core Architecture
(Same architecture slide as shown earlier: CMPs with PEs containing CPU, DTC, LDM/D-cache, LPM/I-cache, DSM, network interface and FVRs; on-chip CSM / L2 cache; intra-chip and inter-chip connection networks.)
API and Parallelizing Compiler in the METI/NEDO Advanced Multicore for Realtime Consumer Electronics Project

[Figure: compilation flow. A sequential application program (a subset of the C language) for real-time consumer electronics (image, audio, secure streaming, etc.) is translated by the Waseda OSCAR compiler into parallel codes for each vendor, annotated with an API that specifies data assignment, data transfer and power reduction control. Each vendor's tool chain (an API decoder in front of its own sequential/backend compiler) then produces machine codes for its chip: Renesas SH multicore, Fujitsu FR-V, CELL, etc. The scheduled tasks on Proc0-Proc2 show compiler-determined execution with stop and slow (frequency-reduced) periods and data transfer by the DTC (DMAC), on top of the real-time, secure Waseda OS.]
OSCAR Multigrain Parallelizing Compiler

• Automatic parallelization
- Multigrain parallel processing
- Data localization
- Data transfer overlapping
- Compiler-controlled power saving scheme
• Compiler cooperative multi-core architectures
- OSCAR multi-core architecture
- OSCAR heterogeneous multiprocessor architecture
- Commercial SMP machines
Processor Block Diagram (RP1)

[Figure: four 600MHz cores (#0-#3), each with CPU, FPU, 32KB I$, 32KB D$, 8KB IL, 16KB DL, CCN, 128KB URAM and a per-core LCPG (LCPG0-3); a snoop controller (SNC) connects the cores to the 300MHz on-chip system bus (SHwy), together with DBSC (DDR2, 32bit), PCIExpress (4 lanes), hardware IP, GCPG and LBSC (SRAM, 32bit).]

CCN: cache controller; IL, DL: instruction/data local memory; URAM: user RAM; GCPG: global clock pulse generator; LCPG: local CPG for each core; LBSC: SRAM controller; DBSC: DDR2 controller.
Chip Overview (SH4A Multicore SoC)

Process technology:   90nm, 8-layer, triple-Vth, CMOS
Chip size:            97.6mm2 (9.88mm x 9.88mm)
Supply voltage:       1.0V (internal), 1.8/3.3V (I/O)
Power consumption:    0.6 mW/MHz/CPU @ 600MHz (90nm G)
Clock frequency:      600MHz
CPU performance:      4320 MIPS (Dhrystone 2.1)
FPU performance:      16.8 GFLOPS
I/D cache:            32KB 4-way set-associative (each)
ILRAM/OLRAM:          8KB/16KB (each CPU)
URAM:                 128KB (each CPU)
Package:              FCBGA 554-pin, 29mm x 29mm

ISSCC07 Paper No. 5.3, Y. Yoshida, et al., "A 4320MIPS Four-Processor Core SMP/AMP with Individually Managed Clock Frequency for Low Power Consumption".
Performance on a Developed SH Multi-core (RP1: SH-X3) Using Compiler and API

Audio AAC* encoder: speedups of 1.00, 1.91, 2.95 and 3.82 on 1, 2, 3 and 4 processors.
Image Susan smoothing**: speedups of 1.00, 1.86, 2.70 and 3.43 on 1, 2, 3 and 4 processors.

*) ISO Advanced Audio Coding. **) MiBench embedded application benchmark by Michigan Univ.
RP2 Chip Photo and Specifications

Process technology:   90nm, 8-layer, triple-Vth, CMOS
Chip size:            104.8mm2 (10.61mm x 9.88mm)
CPU core size:        6.6mm2 (3.36mm x 1.96mm)
Supply voltage:       1.0V-1.4V (internal), 1.8/3.3V (I/O)
Power domains:        17 (8 CPUs, 8 URAMs, common)

[Chip photo: cores #0-#7 with ILRAM, DLRAM, I$, D$ and URAM; SNC0/SNC1, SHWY, VSWC, LBSC, DBSC, GCPG and DDR pads]
8-Core RP2 Chip Block Diagram

[Figure: two 4-core clusters (cluster #0: cores #0-#3; cluster #1: cores #4-#7). Each core has a CPU, FPU, 16KB I$, 16KB D$, local memories (I: 8K, D: 32K), CCN/BAR and 64KB user RAM (URAM). Each cluster has a snoop controller (SNC0/SNC1) and a local clock pulse generator (LCPG0/LCPG1); each core has a power control register (PCR0-PCR7); barrier synchronization lines connect the clusters. The on-chip system bus (SuperHyway) connects DDR2, SRAM and DMA controllers.]

LCPG: local clock pulse generator; PCR: power control register; CCN/BAR: cache controller / barrier register; URAM: user RAM.
Processing Performance on the Developed Multicore Using Automatic Parallelizing Compiler

Speedup against single-core execution for audio AAC* encoding: 1.0 on 1 core, 1.9 on 2 cores, 3.6 on 4 cores and 5.8 on 8 cores.
*) Advanced Audio Coding
Power Reduction Using Power Shutdown and Voltage Control by OSCAR Parallelizing Compiler

[Graph: consumed power [W] over time without power control and with power control (resume standby: power shutdown and voltage lowering). Average power drops from 5.82 W to 0.81 W: an 86.0% power reduction.]
OSCAR Heterogeneous Multicore

• OSCAR type memory architecture
- LPM: local program memory
- LDM: local data memory
- DSM: distributed shared memory
- CSM: centralized shared memory (on chip and/or off chip)
• DTU: data transfer unit
• Interconnection network: multiple buses, split transaction buses, crossbar, ...
Static Scheduling of Coarse Grain Tasks for a Heterogeneous Multi-core

[Figure: an MTG of 13 macro-tasks, each marked "for CPU" (MT1, MT2, MT3, MT6, MT8, MT11) or "for DRP" (MT4, MT5, MT7, MT9, MT10, MT12, MT13), statically scheduled onto CPU0, CPU1 and a DRP accelerator so that CPU tasks and DRP tasks run in parallel while the MTG's dependences are satisfied.]
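A toy sketch in the spirit of this slide: list scheduling in which each macro-task is pre-assigned a unit kind (CPU or DRP) and placed on the unit of that kind that frees up earliest, respecting its dependence. The task set, costs and the single-predecessor restriction are made-up simplifications, not OSCAR's actual scheduler:

#include <stdio.h>

enum { CPU, DRP };
typedef struct { int kind, cost, dep; } MT;   /* dep: predecessor MT index, -1 if none */

int main(void) {
    MT mt[] = { {CPU,4,-1}, {CPU,3,0}, {DRP,2,0}, {DRP,2,1}, {CPU,3,2} };
    int n = 5;
    int unit_free[3] = { 0, 0, 0 };           /* CPU0, CPU1, DRP0 ready times */
    int unit_kind[3] = { CPU, CPU, DRP };
    int finish[5];

    for (int t = 0; t < n; t++) {             /* tasks in topological order   */
        int ready = mt[t].dep < 0 ? 0 : finish[mt[t].dep];
        int best = -1;
        for (int u = 0; u < 3; u++)           /* earliest-free unit of the right kind */
            if (unit_kind[u] == mt[t].kind &&
                (best < 0 || unit_free[u] < unit_free[best]))
                best = u;
        int start = ready > unit_free[best] ? ready : unit_free[best];
        finish[t] = start + mt[t].cost;
        unit_free[best] = finish[t];
        printf("MT%d -> %s%d, start %d, finish %d\n", t + 1,
               mt[t].kind == CPU ? "CPU" : "DRP", best == 2 ? 0 : best,
               start, finish[t]);
    }
    return 0;
}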
An Image of Static Schedule for Heterogeneous Multi-core with Data Transfer Overlapping and Power Control

[Gantt-style figure over time: macro-tasks, overlapped DTU data transfers, and the frequency/voltage state of each core]
Compiler Performance on an OSCAR Heterogeneous Multi-core

25.2 times speedup using 4 SH general-purpose cores and 4 DRP accelerators against a single SH core.
[Bar chart: execution clock cycles and speedups against 1 SH core (general-purpose high-performance SH4A cores) for configurations from 1CPU up to 8CPU+8DRP, under three on-chip/off-chip CSM configurations (STB x1, BUS x3, STB x3); measured speedups include 1.00 (1CPU), 2.90, 5.2, 6.1, 9.1, 10.1-10.3 and 16.1-17.2, up to 41.3 times in the largest configuration.]
Power Reduction by OSCAR Compiler (4SHs+4DRPs)

0.78 W: 22% power reduction by compiler control.
[Bar chart: average power consumption [W] broken down into DRP, clock, FPU, IEU, BPU and register, with frequency/voltage control off (FVOFF) vs. on (FVON), for off-chip and on-chip CSM configurations (STB x1, Bus x3, STB x3); e.g., 1.01 W without control vs. 0.78 W with control. (2007.7.30)]
Conclusions

Compiler cooperative, low-power, high-effective-performance multi-core processors will become more important in a wide range of information systems, from games, mobile phones and automobiles to peta-scale supercomputers.
Parallelizing compilers are essential for the realization of:
• Good cost performance
• Short hardware and software development periods
• Low power consumption
• High software productivity
• Scalable performance improvement with advancement in semiconductor integration technology

Key technologies in multi-core compilers: multigrain parallelization, data localization, data transfer overlapping using DMA, and low power control technologies.