
Challenges for High Performance Processors

Hiroshi NAKAMURA
Research Center for Advanced Science and Technology, The University of Tokyo

Page 2

2007.11.2. France-Japan PAAP Workshop

What's the challenge?

Our primary goal: performance. How?

increase the number and/or operating frequency of functional units

AND supply the functional units with sufficient data (bandwidth)

Problems:

Memory Wall: system performance is limited by poor memory performance

Power Wall: power consumption is approaching the cooling limit

Page 3

Memory Wall Problem

Performance improvement:
CPU: 55% / year
DRAM: 7% / year

[Chart: relative performance (1 to 1,000,000, log scale) vs. year, for CPU and memory]

Page 4

Example of Memory Wall: performance of a 2 GHz Pentium 4 for a[i] = b[i] + c[i]

[Chart: performance (MFLOPS, 0-500) vs. vector length (10 to 1,000,000), showing three regions: L1 hit, L2 hit, and cache miss]

Once the vectors no longer fit in cache, performance drops to about 1/6, despite non-blocking caches and out-of-order issue: a lack of effective memory throughput.
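The 1/6 drop above follows from a back-of-the-envelope roofline model. The sketch below uses illustrative peak and bandwidth figures (assumptions, not measurements of the Pentium 4): the triad moves 24 bytes per floating-point add, so once data comes from DRAM the achievable rate is capped by bandwidth, not by the functional units.

```python
# Roofline-style estimate for a[i] = b[i] + c[i].
# Peak and bandwidth values below are illustrative assumptions.
BYTES_PER_FLOP = 3 * 8   # read b[i], read c[i], write a[i]: 24 bytes per add

def mflops(peak_mflops, bandwidth_mb_per_s):
    """Attainable MFLOPS = min(compute peak, bandwidth / bytes per flop)."""
    return min(peak_mflops, bandwidth_mb_per_s / BYTES_PER_FLOP)

in_cache = mflops(500.0, 16000.0)   # cache-resident: compute-bound at 500
from_dram = mflops(500.0, 2000.0)   # DRAM-resident: bandwidth-bound
print(in_cache, from_dram, in_cache / from_dram)  # ratio is 6.0
```

With these assumed numbers the in-cache/from-DRAM ratio comes out to exactly 6, matching the shape of the measured curve.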

Page 5

Recap: Memory Wall Problem

growing gap between processor and memory speed

in High Performance Computing (HPC), performance is limited by memory:
long access latency of main memory
lack of main-memory throughput

making full use of wide-bandwidth local (on-chip) memory is indispensable
on-chip memory space is a valuable resource, not large enough for HPC
should exploit data locality

ex. Itanium2/Montecito: huge L3 cache (12MB x 2)

Page 6

Does cache work well in HPC?

Data placement and replacement by hardware works well in many cases, but is not the best for HPC:

unfortunate line conflicts occur although most data accesses are regular
ex. data used only once flushes out other useful data

the off-chip transfer size (cache line) is fixed:
for consecutive data, a larger transfer size is preferable
for non-consecutive data, large line transfers incur unnecessary data transfer, a waste of bandwidth

Most HPC applications exhibit regularity in data access, which the cache sometimes fails to exploit.

Page 7

SCIMA (Software Controlled Integrated Memory Architecture) [Kondo-ICCD2000]
(joint work with Prof. Boku @ Univ. of Tsukuba and others)

Overview of SCIMA:
addressable SCM (Software Controllable Memory) in addition to the ordinary cache
SCM occupies a part of the logical address space
no inclusion relation between SCM and cache
SCM and cache are reconfigurable at the granularity of a way

[Figure: processor chip with registers, ALU/FPU, reconfigurable Cache/SCM, and an NIA connected to the network; SCM is mapped into the address space alongside off-chip memory (DRAM)]

Page 8

Data Transfer Instructions

load/store: between registers and SCM/Cache
line transfer: between Cache and off-chip memory
page-load/page-store (new): between SCM and off-chip memory

large-granularity transfer: wider effective bandwidth by reducing latency stalls
block-stride transfer: avoids unnecessary data transfer, so on-chip memory is utilized more effectively

[Figure: registers connect to SCM/Cache via load/store; Cache connects to off-chip memory via line transfer; SCM connects to off-chip memory via page-load/page-store]
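The block-stride transfer can be sketched as follows. The function name and semantics are my own illustration of the idea, not the actual SCIMA instruction definition: it gathers only the useful words of a strided access pattern into a contiguous on-chip region, instead of fetching whole fixed-size cache lines that include the gaps.

```python
# Sketch of a block-stride page-load: copy `nblocks` blocks of `block`
# contiguous words, separated by `stride`, from "off-chip memory" into
# a contiguous "SCM" region. Hypothetical names, for illustration only.
def page_load_stride(mem, base, block, stride, nblocks):
    scm = []
    for b in range(nblocks):
        start = base + b * stride
        scm.extend(mem[start:start + block])   # only the useful words move
    return scm

mem = list(range(100))
# Gather 4-word blocks at stride 10: no bandwidth spent on the gaps,
# unlike a fixed-size line transfer that would fetch the padding too.
print(page_load_stride(mem, 0, 4, 10, 3))
```

For this call the result is the three blocks 0-3, 10-13, 20-23 packed back to back; a 10-word-line transfer of the same region would have moved 30 words for 12 useful ones.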

Page 9

Strategy of Software Control

SCM must be controlled by software. Arrays are classified into 6 groups by reusability and access pattern:

                 not-reusable                     reusable
consecutive  (1) use SCM as a stream buffer   (4) reserve SCM for reused data
stride       (2) use SCM as a stream buffer   (5) reserve SCM for reused data
irregular    (3) do not use SCM               (6) reserve SCM for reused data

First, apply (1) and (2): allocate a small stream buffer in SCM.
Second, apply (4), (5), and (6): allocate the rest of SCM for reused data.

A prototype semi-automatic compiler exists: users specify hints on the reusability of data arrays.
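The allocation policy above can be sketched in a few lines. This is my own paraphrase of the strategy, not the SCIMA compiler itself; the function names, sizes, and the fall-back-to-cache rule are illustrative assumptions.

```python
# Sketch of the 6-group policy: classify each array, then carve up SCM.
def classify(reusable, pattern):
    if not reusable:
        # groups (1)-(3): stream through SCM, or skip it for irregular data
        return "stream-buffer" if pattern in ("consecutive", "stride") else "no-SCM"
    return "reserved"   # groups (4)-(6): keep reused data resident in SCM

def allocate_scm(arrays, scm_size, stream_buf=4096):
    """arrays: list of (name, size, reusable, pattern). Returns name -> region."""
    plan, free = {}, scm_size
    for name, size, reusable, pattern in arrays:
        kind = classify(reusable, pattern)
        if kind == "stream-buffer":
            plan[name] = ("stream", stream_buf)      # small shared buffer
        elif kind == "reserved" and size <= free:
            plan[name] = ("resident", size)          # reserve SCM space
            free -= size
        else:
            plan[name] = ("cache", 0)                # fall back to the cache
    return plan

print(allocate_scm([("a", 8192, True, "consecutive"),
                    ("b", 4096, False, "stride"),
                    ("c", 999999, True, "irregular")], scm_size=48 * 1024))
```

Here "a" gets a resident SCM region, "b" streams through a small buffer, and "c" is too large for the remaining SCM, so it falls back to the cache.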

Page 10

Results of Memory Traffic

Memory traffic decreases by 1% - 61% in SCIMA:
unnecessary memory traffic is suppressed
thanks to fully exploiting data reusability

[Chart: normalized memory traffic, split into cache misses and page-load/store, for Cache vs. SCIMA at line sizes 32B and 128B, on CG, FT, and QCD]

Assumptions:
cache model: cache size = 64KB (4-way), SCM size = 0KB
SCIMA model: cache size = 16KB (1-way), SCM size = 48KB
total number of ways: 4; line size: 32B or 128B
benchmark programs: CG, FT, QCD

Page 11

Results of Performance

Breakdown of execution time:
CPU busy time
latency stall: elapsed time due to memory latency
throughput stall: elapsed time due to lack of throughput

SCIMA is 1.3 - 2.5 times faster than the cache model:
latency stalls are reduced by large-granularity data transfer
throughput stalls are reduced by suppressing unnecessary data transfer

[Chart: normalized execution time (CPU busy time / latency stall / throughput stall) for Cache vs. SCIMA at line sizes 32B and 128B, on CG, FT, and QCD]

Assumptions: load/store latency: 2 cycles; bus throughput: 4B/cycle; memory latency: 40 cycles
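The latency-stall reduction from large-granularity transfer can be worked out directly from the stated assumptions (40-cycle memory latency, 4 B/cycle bus); the transfer sizes below are illustrative.

```python
# Why large-granularity transfers widen effective bandwidth: the fixed
# latency is amortized over more bytes. Uses the slide's assumptions.
LATENCY = 40      # cycles of fixed cost per transfer
BUS = 4           # bytes per cycle of bus throughput

def effective_bw(transfer_bytes):
    """Bytes per cycle actually achieved, latency included."""
    return transfer_bytes / (LATENCY + transfer_bytes / BUS)

for size in (32, 128, 4096):
    print(size, round(effective_bw(size), 2))
```

A 32B line achieves only 0.67 B/cycle of the 4 B/cycle peak, a 128B line 1.78, while a 4KB page-sized transfer reaches 3.85, i.e. nearly the full bus bandwidth.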

Page 12

Power Wall

Next focus: power consumption of processors
Is there any room for power reduction? If yes, how to reduce it?

[Chart: trends of heat density; Itanium (130W) marked]

Page 13

Observation (1) – Moore's Law –

Number of transistors: doubles every 18 months

Page 14

Observation (2) – frequency –

Frequency: doubles every 3 years
Number of transistors: doubles every 18 months (x4 every 3 years)

Number of switching events on a chip: x8 every 3 years

Page 15

Observation (3) – performance –

Number of switching events on a chip: x8 every 3 years
Effective performance: ~x4 every 3 years
("microprocessor performance improved 55% per year", from "Computer Architecture: A Quantitative Approach" by J. Hennessy and D. Patterson, Morgan Kaufmann)

Unnecessary switching, i.e. the opportunity for power reduction: doubles every 3 years
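Observations (1)-(3) can be checked with three lines of arithmetic: over one 3-year step, transistors quadruple (doubling every 18 months) and frequency doubles, so raw switching grows 8x, while delivered performance at 55%/year grows only about 4x.

```python
# Worked arithmetic for Observations (1)-(3), one 3-year process step.
transistors = 2 ** (36 / 18)         # doubles every 18 months: x4
frequency = 2.0                      # doubles every 3 years
switching = transistors * frequency  # x8 raw switching

performance = 1.55 ** 3              # 55%/year: ~x3.72, i.e. roughly x4
waste_growth = switching / performance
print(switching, round(performance, 2), round(waste_growth, 2))  # 8.0 3.72 2.15
```

The gap between switching (x8) and performance (~x4) is the "unnecessary switching" of the slide: it roughly doubles every 3 years.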

Page 16

Evidence for the observation (unnecessary switching: x2 / 3 years) [Zyuban00] @ ISLPED'00

[Chart: access energy per instruction (nJ) vs. issue width (4-12), split into committed and flushed instructions, across the rename map table, bypass mechanism, load/store window, issue window, register file, and functional units]

Energy per instruction increases when exploiting ILP for higher performance:
at the functional units: no increase
at the issue window and register file: increase
instructions flushed by incorrect prediction: increase (a waste of power)

Page 17

Registers

The register file consumes a lot of power:
roughly speaking, power ∝ (number of registers) x (number of ports)
high-performance wide-issue superscalar processors need more registers and more read/write ports

Open question: in HPC, what is the best way to feed many functional units (or accelerators), from the perspective of register file design?
scalar registers with SIMD operations
vector registers with vector operations
...

Personal impression:
vector registers are accessed in a well-organized fashion, so it is easy to reduce the number of ports by sub-banking
but can vector operations make good use of local on-chip memory? (at least, traditional vector processors never could!)
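The registers-times-ports rule and the sub-banking argument can be sketched numerically. The constants below (register counts, port counts, bank counts) are toy assumptions chosen only to illustrate the proportionality, not a circuit model.

```python
# Sketch of power ∝ registers x ports, and why sub-banking helps.
def rf_power(num_regs, num_ports, unit=1.0):
    """Toy register-file power model: proportional to registers x ports."""
    return unit * num_regs * num_ports

# A wide-issue scalar register file: many ports into one monolithic array.
monolithic = rf_power(num_regs=128, num_ports=16)

# A sub-banked vector register file: well-organized vector access lets
# each of 8 banks get by with only 2 ports.
subbanked = 8 * rf_power(num_regs=16, num_ports=2)

print(monolithic / subbanked)  # 8.0
```

Same total register capacity, but the sub-banked organization needs an eighth of the modeled power, which is the point of the "personal impression" above.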

Page 18

Dual Core helps …

Rule of thumb: a 1% voltage reduction allows a 1% frequency reduction, cutting power by 3% at a 0.66% performance cost.

In the same process technology:
single core: Voltage = 1, Freq = 1, Area = 1, Power = 1, Perf = 1
dual core (two cores sharing a cache): Voltage = -15%, Freq = -15%, Area = 2, Power = ~1, Perf = ~1.8
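The dual-core figures can be checked against the standard dynamic-power relation P ∝ V²f (a sketch; the slide's rounder "Power = 1, Perf = ~1.8" comes from its linear 1%-to-3% rule of thumb and an optimistic parallel speedup assumption).

```python
# Dual-core rule of thumb worked out with the P ∝ V^2 * f model.
def power(v, f, cores=1):
    """Relative dynamic power for `cores` cores at relative voltage v, freq f."""
    return cores * v * v * f

def perf(f, cores=1):
    """Relative performance, assuming near-perfect parallel speedup."""
    return cores * f

base = power(1.0, 1.0)
dual = power(0.85, 0.85, cores=2)   # voltage and frequency both -15%
print(round(dual / base, 2), perf(0.85, cores=2))  # 1.23 1.7
```

With the cubic model, two slowed-down cores draw about 1.23x the power of one full-speed core while delivering ~1.7x the throughput, i.e. the same direction as the slide's numbers: far better performance per watt than one wide core.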

Page 19

Multi-Core helps more …

[Figure: one large core with cache vs. four small cores (C1-C4) sharing a cache; each small core has Power = 1/4 and Performance = 1/2 of the large core]

Multi-core is power efficient:
better power and thermal management
no need for wider instruction issue

Page 20

Leakage Problem

How to attack the leakage problem?

[Figure: a CMOS inverter with input 0; even the OFF transistor conducts leakage current from VDD. Source: IEEE Computer Magazine]

[Chart from [Borkar-MICRO05]: power and power density (W, W/cm2, up to 1400) vs. technology node (90nm to 16nm) for a 10mm die, split into active power, sub-threshold leakage (SD Lkg), and gate-oxide leakage (SiO2 Lkg); leakage grows to dominate]

Page 21

Introduction of our research: "Innovative Power Control for Ultra Low-Power and High-Performance System LSIs"

a 5-year project started October 2006, supported by JST (Japan Science and Technology Agency) as a CREST (Core Research for Evolutional Science and Technology) program

Objective: drastic power reduction of high-performance system LSIs by innovative power control, through tight cooperation across design levels including circuit, architecture, and system software.

Members:
Prof. H. Nakamura (U. Tokyo): architecture & compiler [leader]
Prof. M. Namiki (Tokyo Univ. of Agri. Tech.): OS
Prof. H. Amano (Keio Univ.): architecture & F/E design
Prof. K. Usami (Shibaura I.T.): circuit & B/E design

Page 22

How to reduce leakage: Power Gating

Focusing on power gating for reducing leakage:
insert a power switch (PS) between VDD and GND
turn off the PS when the circuit sleeps

[Figure: logic gates between VDD and a virtual GND; a sleep-controlled power switch connects virtual GND to GND]

Page 23

Run-time Power Gating (RTPG)

Control the power switches at run time.

Coarse grain: e.g., a mobile processor by Renesas (independent power domains for the BB module, MPEG module, ...)
Fine grain (our target): power gating within a module

[Figure: circuits A, B, and C, each with its own power switch driven by a sleep-control circuit]

Page 24

Fine-grain Run-time Power Gating

Longer sleep time is preferable: leakage savings must outweigh the overheads (the power penalty of wakeup).

Evaluation through a real chip had not been reported. Our test vehicle: a 32b x 32b multiplier.
either or both operands (input data) are often shorter than 16 bits
the circuit portions computing the upper bits of the product need not operate; leaving them on wastes leakage power
by detecting zeros in the upper 16 bits of the operands, power-gate the internal multiplier array
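The sleep-time trade-off above reduces to a break-even calculation: gating pays off only when the leakage saved while asleep exceeds the energy spent waking up. The two constants below are illustrative assumptions, not measured values from the chip.

```python
# Break-even analysis for run-time power gating (illustrative constants).
LEAKAGE_SAVED = 0.5    # nJ of leakage saved per sleeping cycle
WAKEUP_COST = 10.0     # nJ to recharge the virtual-GND rail on wakeup

def net_saving(sleep_cycles):
    """Energy saved (nJ) by a sleep of the given length; negative = loss."""
    return LEAKAGE_SAVED * sleep_cycles - WAKEUP_COST

def breakeven_cycles():
    """Minimum sleep length for power gating to be worthwhile."""
    return WAKEUP_COST / LEAKAGE_SAVED

print(breakeven_cycles(), net_saving(100))  # 20.0 40.0
```

This is why the compiler work mentioned later tries to lengthen sleep intervals: sleeps shorter than the break-even point cost energy rather than saving it.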

Page 25

Test chip "Pinnacle"

Technology: STARC 90nm CMOS
Multiplier core: area 0.544 x 0.378 mm2, 15,000 cells
Design time: 4.5 months
Design members: 3 Master students, 1 Bachelor student, 1 Faculty

[Chart: measured power dissipation (2.0-4.0 mW) at 25C, 85C, and 125C for Sequence 1 (no sleep), Sequence 2 (domain H sleeps), and Sequence 3 (domains H and M sleep); dies without and with FG-RTPG compared]

Real measurement exhibits good power reduction.

Current status:
designing a pipelined microprocessor with FG-RTPG
compiler (instruction scheduler) to increase sleep time

Page 26

Low-Power Linux Scheduler based on Statistical Modeling
(co-optimization of system software and architecture)

Objective: a process scheduler that reduces power consumption by DVFS (dynamic voltage and frequency scaling) of each process while satisfying its performance constraint.

How to find the lowest frequency that still satisfies the performance constraint?
it depends on hardware and program characteristics
the performance ratio differs from the frequency ratio (memory time does not scale with the core clock)
hard to find the answer straightforwardly
solution: modeling by statistical analysis of hardware events
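The core idea can be sketched as follows. This is my own illustration of the approach, not the actual statistical model: split execution time into a part that scales with the core clock and a memory part estimated from hardware event counts; since the memory part does not shrink when the clock slows, the lowest admissible frequency can then be solved per process.

```python
# Sketch: pick the lowest DVFS frequency meeting a performance constraint.
def relative_perf(freq_ratio, mem_fraction):
    """Performance at freq_ratio * f_max, given the fraction of time spent
    waiting on memory at f_max (estimated from event counters)."""
    cpu = (1 - mem_fraction) / freq_ratio   # CPU part stretches as f drops
    return 1.0 / (cpu + mem_fraction)       # memory part stays constant

def lowest_freq(mem_fraction, threshold, freqs=(0.5, 0.6, 0.75, 0.85, 1.0)):
    for r in freqs:                          # try the slowest setting first
        if relative_perf(r, mem_fraction) >= threshold:
            return r
    return 1.0

# A memory-bound process tolerates a lower clock than a CPU-bound one.
print(lowest_freq(0.6, threshold=0.9), lowest_freq(0.1, threshold=0.9))
```

With a 90% performance threshold, the memory-bound process (60% memory time) can run at 0.85x frequency, while the CPU-bound one (10% memory time) must stay at full speed, which is exactly why a single frequency ratio cannot be read off the performance constraint directly.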

Page 27

Evaluation Result

Specified threshold: the black dotted line in the chart.
Performance stays within the threshold in all cases except mgrid, which falls 3-7% below it.
An accurate model is obtained; a Linux scheduler using this model has been developed.

[Chart: relative performance (0.4-1.0) vs. performance threshold (0.5-1.0) for mcf, bzip2, swim, mgrid, and matrix (50/600/1000), measured on a Pentium M 760 (max 2.00 GHz, FSB 533 MHz)]

Page 28

Summary

Challenges for high performance processors: Memory Wall and Power Wall

One solution to the memory wall:
make good use of on-chip memory with software controllability

Solutions to the power wall:
many cores relax the problem, but leakage current is becoming a big problem
new research and approaches are required
our project "Innovative Power Control for Ultra Low-Power and High-Performance System LSIs" was introduced
