
Challenges for High Performance Processors

Hiroshi NAKAMURA
Research Center for Advanced Science and Technology, The University of Tokyo

Page 2

2007.11.2. France-Japan PAAP Workshop

What's the challenge?

Our primary goal: performance. How?

increase the number and/or operating frequency of functional units

AND supply the functional units with sufficient data (bandwidth)

Problems:

Memory Wall: system performance is limited by poor memory performance

Power Wall: power consumption is approaching the cooling limit

Page 3

Memory Wall Problem

Performance improvement:
CPU: 55% / year
DRAM: 7% / year

[Chart: relative performance (1 to 1,000,000, log scale) vs. year, for CPU and memory]

Page 4

Example of Memory Wall: performance of a 2 GHz Pentium 4 for a[i] = b[i] + c[i]

[Chart: performance (MFLOPS, 0-500) vs. vector length (10 to 1,000,000), showing three regions: L1 hit, L2 hit, and cache miss]

Once the vectors no longer fit in cache, performance drops to about 1/6, despite non-blocking caches and out-of-order issue: a lack of effective memory throughput.
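The 1/6 drop above follows from a back-of-the-envelope roofline model. The sketch below uses illustrative peak and bandwidth figures (assumptions, not measurements of the Pentium 4): the triad moves 24 bytes per floating-point add, so once data comes from DRAM the achievable rate is capped by bandwidth, not by the functional units.

```python
# Roofline-style estimate for a[i] = b[i] + c[i].
# Peak and bandwidth values below are illustrative assumptions.
BYTES_PER_FLOP = 3 * 8   # read b[i], read c[i], write a[i]: 24 bytes per add

def mflops(peak_mflops, bandwidth_mb_per_s):
    """Attainable MFLOPS = min(compute peak, bandwidth / bytes per flop)."""
    return min(peak_mflops, bandwidth_mb_per_s / BYTES_PER_FLOP)

in_cache = mflops(500.0, 16000.0)   # cache-resident: compute-bound at 500
from_dram = mflops(500.0, 2000.0)   # DRAM-resident: bandwidth-bound
print(in_cache, from_dram, in_cache / from_dram)  # ratio is 6.0
```

With these assumed numbers the in-cache/from-DRAM ratio comes out to exactly 6, matching the shape of the measured curve.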

Page 5

Recap: Memory Wall Problem

growing gap between processor and memory speed

in High Performance Computing (HPC), performance is limited by memory:
long access latency of main memory
lack of main-memory throughput

making full use of wide-bandwidth local (on-chip) memory is indispensable
on-chip memory space is a valuable resource, not large enough for HPC
should exploit data locality

ex. Itanium2/Montecito: huge L3 cache (12MB x 2)

Page 6

Does cache work well in HPC?

Data placement and replacement by hardware works well in many cases, but is not the best for HPC:

unfortunate line conflicts occur although most data accesses are regular
ex. data used only once flushes out other useful data

the off-chip transfer size (cache line) is fixed:
for consecutive data, a larger transfer size is preferable
for non-consecutive data, large line transfers incur unnecessary data transfer, a waste of bandwidth

Most HPC applications exhibit regularity in data access, which the cache sometimes fails to exploit.

Page 7

SCIMA (Software Controlled Integrated Memory Architecture) [Kondo-ICCD2000]
(joint work with Prof. Boku @ Univ. of Tsukuba and others)

Overview of SCIMA:
addressable SCM (Software Controllable Memory) in addition to the ordinary cache
SCM occupies a part of the logical address space
no inclusion relation between SCM and cache
SCM and cache are reconfigurable at the granularity of a way

[Figure: processor chip with registers, ALU/FPU, reconfigurable Cache/SCM, and an NIA connected to the network; SCM is mapped into the address space alongside off-chip memory (DRAM)]

Page 8

Data Transfer Instructions

load/store: between registers and SCM/Cache
line transfer: between Cache and off-chip memory
page-load/page-store (new): between SCM and off-chip memory

large-granularity transfer: wider effective bandwidth by reducing latency stalls
block-stride transfer: avoids unnecessary data transfer, so on-chip memory is utilized more effectively

[Figure: registers connect to SCM/Cache via load/store; Cache connects to off-chip memory via line transfer; SCM connects to off-chip memory via page-load/page-store]
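The block-stride transfer can be sketched as follows. The function name and semantics are my own illustration of the idea, not the actual SCIMA instruction definition: it gathers only the useful words of a strided access pattern into a contiguous on-chip region, instead of fetching whole fixed-size cache lines that include the gaps.

```python
# Sketch of a block-stride page-load: copy `nblocks` blocks of `block`
# contiguous words, separated by `stride`, from "off-chip memory" into
# a contiguous "SCM" region. Hypothetical names, for illustration only.
def page_load_stride(mem, base, block, stride, nblocks):
    scm = []
    for b in range(nblocks):
        start = base + b * stride
        scm.extend(mem[start:start + block])   # only the useful words move
    return scm

mem = list(range(100))
# Gather 4-word blocks at stride 10: no bandwidth spent on the gaps,
# unlike a fixed-size line transfer that would fetch the padding too.
print(page_load_stride(mem, 0, 4, 10, 3))
```

For this call the result is the three blocks 0-3, 10-13, 20-23 packed back to back; a 10-word-line transfer of the same region would have moved 30 words for 12 useful ones.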

Page 9

Strategy of Software Control

SCM must be controlled by software. Arrays are classified into 6 groups by reusability and access pattern:

                 not-reusable                     reusable
consecutive  (1) use SCM as a stream buffer   (4) reserve SCM for reused data
stride       (2) use SCM as a stream buffer   (5) reserve SCM for reused data
irregular    (3) do not use SCM               (6) reserve SCM for reused data

First, apply (1) and (2): allocate a small stream buffer in SCM.
Second, apply (4), (5), and (6): allocate the rest of SCM for reused data.

A prototype semi-automatic compiler exists: users specify hints on the reusability of data arrays.
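The allocation policy above can be sketched in a few lines. This is my own paraphrase of the strategy, not the SCIMA compiler itself; the function names, sizes, and the fall-back-to-cache rule are illustrative assumptions.

```python
# Sketch of the 6-group policy: classify each array, then carve up SCM.
def classify(reusable, pattern):
    if not reusable:
        # groups (1)-(3): stream through SCM, or skip it for irregular data
        return "stream-buffer" if pattern in ("consecutive", "stride") else "no-SCM"
    return "reserved"   # groups (4)-(6): keep reused data resident in SCM

def allocate_scm(arrays, scm_size, stream_buf=4096):
    """arrays: list of (name, size, reusable, pattern). Returns name -> region."""
    plan, free = {}, scm_size
    for name, size, reusable, pattern in arrays:
        kind = classify(reusable, pattern)
        if kind == "stream-buffer":
            plan[name] = ("stream", stream_buf)      # small shared buffer
        elif kind == "reserved" and size <= free:
            plan[name] = ("resident", size)          # reserve SCM space
            free -= size
        else:
            plan[name] = ("cache", 0)                # fall back to the cache
    return plan

print(allocate_scm([("a", 8192, True, "consecutive"),
                    ("b", 4096, False, "stride"),
                    ("c", 999999, True, "irregular")], scm_size=48 * 1024))
```

Here "a" gets a resident SCM region, "b" streams through a small buffer, and "c" is too large for the remaining SCM, so it falls back to the cache.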

Page 10

Results of Memory Traffic

Memory traffic decreases by 1% - 61% in SCIMA:
unnecessary memory traffic is suppressed
thanks to fully exploiting data reusability

[Chart: normalized memory traffic, split into cache misses and page-load/store, for Cache vs. SCIMA at line sizes 32B and 128B, on CG, FT, and QCD]

Assumptions:
cache model: cache size = 64KB (4-way), SCM size = 0KB
SCIMA model: cache size = 16KB (1-way), SCM size = 48KB
total number of ways: 4; line size: 32B or 128B
benchmark programs: CG, FT, QCD

Page 11

Results of Performance

Breakdown of execution time:
CPU busy time
latency stall: elapsed time due to memory latency
throughput stall: elapsed time due to lack of throughput

SCIMA is 1.3 - 2.5 times faster than the cache model:
latency stalls are reduced by large-granularity data transfer
throughput stalls are reduced by suppressing unnecessary data transfer

[Chart: normalized execution time (CPU busy time / latency stall / throughput stall) for Cache vs. SCIMA at line sizes 32B and 128B, on CG, FT, and QCD]

Assumptions: load/store latency: 2 cycles; bus throughput: 4B/cycle; memory latency: 40 cycles
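The latency-stall reduction from large-granularity transfer can be worked out directly from the stated assumptions (40-cycle memory latency, 4 B/cycle bus); the transfer sizes below are illustrative.

```python
# Why large-granularity transfers widen effective bandwidth: the fixed
# latency is amortized over more bytes. Uses the slide's assumptions.
LATENCY = 40      # cycles of fixed cost per transfer
BUS = 4           # bytes per cycle of bus throughput

def effective_bw(transfer_bytes):
    """Bytes per cycle actually achieved, latency included."""
    return transfer_bytes / (LATENCY + transfer_bytes / BUS)

for size in (32, 128, 4096):
    print(size, round(effective_bw(size), 2))
```

A 32B line achieves only 0.67 B/cycle of the 4 B/cycle peak, a 128B line 1.78, while a 4KB page-sized transfer reaches 3.85, i.e. nearly the full bus bandwidth.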

Page 12

Power Wall

Next focus: power consumption of processors
Is there any room for power reduction? If yes, how to reduce it?

[Chart: trends of heat density; Itanium (130W) marked]

Page 13

Observation (1) – Moore's Law –

Number of transistors: doubles every 18 months

Page 14

Observation (2) – frequency –

Frequency: doubles every 3 years
Number of transistors: doubles every 18 months (x4 every 3 years)

Number of switching events on a chip: x8 every 3 years

Page 15

Observation (3) – performance –

Number of switching events on a chip: x8 every 3 years
Effective performance: ~x4 every 3 years
("microprocessor performance improved 55% per year", from "Computer Architecture: A Quantitative Approach" by J. Hennessy and D. Patterson, Morgan Kaufmann)

Unnecessary switching, i.e. the opportunity for power reduction: doubles every 3 years
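Observations (1)-(3) can be checked with three lines of arithmetic: over one 3-year step, transistors quadruple (doubling every 18 months) and frequency doubles, so raw switching grows 8x, while delivered performance at 55%/year grows only about 4x.

```python
# Worked arithmetic for Observations (1)-(3), one 3-year process step.
transistors = 2 ** (36 / 18)         # doubles every 18 months: x4
frequency = 2.0                      # doubles every 3 years
switching = transistors * frequency  # x8 raw switching

performance = 1.55 ** 3              # 55%/year: ~x3.72, i.e. roughly x4
waste_growth = switching / performance
print(switching, round(performance, 2), round(waste_growth, 2))  # 8.0 3.72 2.15
```

The gap between switching (x8) and performance (~x4) is the "unnecessary switching" of the slide: it roughly doubles every 3 years.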

Page 16

Evidence for the observation (unnecessary switching: x2 / 3 years) [Zyuban00] @ ISLPED'00

[Chart: access energy per instruction (nJ) vs. issue width (4-12), split into committed and flushed instructions, across the rename map table, bypass mechanism, load/store window, issue window, register file, and functional units]

Energy per instruction increases when exploiting ILP for higher performance:
at the functional units: no increase
at the issue window and register file: increase
instructions flushed by incorrect prediction: increase (a waste of power)

Page 17

Registers

The register file consumes a lot of power:
roughly speaking, power ∝ (number of registers) x (number of ports)
high-performance wide-issue superscalar processors need more registers and more read/write ports

Open question: in HPC, what is the best way to feed many functional units (or accelerators), from the perspective of register file design?
scalar registers with SIMD operations
vector registers with vector operations
...

Personal impression:
vector registers are accessed in a well-organized fashion, so it is easy to reduce the number of ports by sub-banking
but can vector operations make good use of local on-chip memory? (at least, traditional vector processors never could!)
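The registers-times-ports rule and the sub-banking argument can be sketched numerically. The constants below (register counts, port counts, bank counts) are toy assumptions chosen only to illustrate the proportionality, not a circuit model.

```python
# Sketch of power ∝ registers x ports, and why sub-banking helps.
def rf_power(num_regs, num_ports, unit=1.0):
    """Toy register-file power model: proportional to registers x ports."""
    return unit * num_regs * num_ports

# A wide-issue scalar register file: many ports into one monolithic array.
monolithic = rf_power(num_regs=128, num_ports=16)

# A sub-banked vector register file: well-organized vector access lets
# each of 8 banks get by with only 2 ports.
subbanked = 8 * rf_power(num_regs=16, num_ports=2)

print(monolithic / subbanked)  # 8.0
```

Same total register capacity, but the sub-banked organization needs an eighth of the modeled power, which is the point of the "personal impression" above.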

Page 18

Dual Core helps …

Rule of thumb: a 1% voltage reduction allows a 1% frequency reduction, cutting power by 3% at a 0.66% performance cost.

In the same process technology:
single core: Voltage = 1, Freq = 1, Area = 1, Power = 1, Perf = 1
dual core (two cores sharing a cache): Voltage = -15%, Freq = -15%, Area = 2, Power = ~1, Perf = ~1.8
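The dual-core figures can be checked against the standard dynamic-power relation P ∝ V²f (a sketch; the slide's rounder "Power = 1, Perf = ~1.8" comes from its linear 1%-to-3% rule of thumb and an optimistic parallel speedup assumption).

```python
# Dual-core rule of thumb worked out with the P ∝ V^2 * f model.
def power(v, f, cores=1):
    """Relative dynamic power for `cores` cores at relative voltage v, freq f."""
    return cores * v * v * f

def perf(f, cores=1):
    """Relative performance, assuming near-perfect parallel speedup."""
    return cores * f

base = power(1.0, 1.0)
dual = power(0.85, 0.85, cores=2)   # voltage and frequency both -15%
print(round(dual / base, 2), perf(0.85, cores=2))  # 1.23 1.7
```

With the cubic model, two slowed-down cores draw about 1.23x the power of one full-speed core while delivering ~1.7x the throughput, i.e. the same direction as the slide's numbers: far better performance per watt than one wide core.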

Page 19

Multi-Core helps more …

[Figure: one large core with cache vs. four small cores (C1-C4) sharing a cache; each small core has Power = 1/4 and Performance = 1/2 of the large core]

Multi-core is power efficient:
better power and thermal management
no need for wider instruction issue

Page 20

Leakage Problem

How to attack the leakage problem?

[Figure: a CMOS inverter with input 0; even the OFF transistor conducts leakage current from VDD. Source: IEEE Computer Magazine]

[Chart from [Borkar-MICRO05]: power and power density (W, W/cm2, up to 1400) vs. technology node (90nm to 16nm) for a 10mm die, split into active power, sub-threshold leakage (SD Lkg), and gate-oxide leakage (SiO2 Lkg); leakage grows to dominate]

Page 21

Introduction of our research: "Innovative Power Control for Ultra Low-Power and High-Performance System LSIs"

a 5-year project started October 2006, supported by JST (Japan Science and Technology Agency) as a CREST (Core Research for Evolutional Science and Technology) program

Objective: drastic power reduction of high-performance system LSIs by innovative power control, through tight cooperation across design levels including circuit, architecture, and system software.

Members:
Prof. H. Nakamura (U. Tokyo): architecture & compiler [leader]
Prof. M. Namiki (Tokyo Univ. of Agri. Tech.): OS
Prof. H. Amano (Keio Univ.): architecture & F/E design
Prof. K. Usami (Shibaura I.T.): circuit & B/E design

Page 22

How to reduce leakage: Power Gating

Focusing on power gating for reducing leakage:
insert a power switch (PS) between VDD and GND
turn off the PS when the circuit sleeps

[Figure: logic gates between VDD and a virtual GND; a sleep-controlled power switch connects virtual GND to GND]

Page 23

Run-time Power Gating (RTPG)

Control the power switches at run time.

Coarse grain: e.g., a mobile processor by Renesas (independent power domains for the BB module, MPEG module, ...)
Fine grain (our target): power gating within a module

[Figure: circuits A, B, and C, each with its own power switch driven by a sleep-control circuit]

Page 24

Fine-grain Run-time Power Gating

Longer sleep time is preferable: leakage savings must outweigh the overheads (the power penalty of wakeup).

Evaluation through a real chip had not been reported. Our test vehicle: a 32b x 32b multiplier.
either or both operands (input data) are often shorter than 16 bits
the circuit portions computing the upper bits of the product need not operate; leaving them on wastes leakage power
by detecting zeros in the upper 16 bits of the operands, power-gate the internal multiplier array
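The sleep-time trade-off above reduces to a break-even calculation: gating pays off only when the leakage saved while asleep exceeds the energy spent waking up. The two constants below are illustrative assumptions, not measured values from the chip.

```python
# Break-even analysis for run-time power gating (illustrative constants).
LEAKAGE_SAVED = 0.5    # nJ of leakage saved per sleeping cycle
WAKEUP_COST = 10.0     # nJ to recharge the virtual-GND rail on wakeup

def net_saving(sleep_cycles):
    """Energy saved (nJ) by a sleep of the given length; negative = loss."""
    return LEAKAGE_SAVED * sleep_cycles - WAKEUP_COST

def breakeven_cycles():
    """Minimum sleep length for power gating to be worthwhile."""
    return WAKEUP_COST / LEAKAGE_SAVED

print(breakeven_cycles(), net_saving(100))  # 20.0 40.0
```

This is why the compiler work mentioned later tries to lengthen sleep intervals: sleeps shorter than the break-even point cost energy rather than saving it.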

Page 25

Test chip "Pinnacle"

Technology: STARC 90nm CMOS
Multiplier core: area 0.544 x 0.378 mm2, 15,000 cells
Design time: 4.5 months
Design members: 3 Master students, 1 Bachelor student, 1 Faculty

[Chart: measured power dissipation (2.0-4.0 mW) at 25C, 85C, and 125C for Sequence 1 (no sleep), Sequence 2 (domain H sleeps), and Sequence 3 (domains H and M sleep); dies without and with FG-RTPG compared]

Real measurement exhibits good power reduction.

Current status:
designing a pipelined microprocessor with FG-RTPG
compiler (instruction scheduler) to increase sleep time

Page 26

Low-Power Linux Scheduler based on Statistical Modeling
(co-optimization of system software and architecture)

Objective: a process scheduler that reduces power consumption by DVFS (dynamic voltage and frequency scaling) of each process while satisfying its performance constraint.

How to find the lowest frequency that still satisfies the performance constraint?
it depends on hardware and program characteristics
the performance ratio differs from the frequency ratio (memory time does not scale with the core clock)
hard to find the answer straightforwardly
solution: modeling by statistical analysis of hardware events
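The core idea can be sketched as follows. This is my own illustration of the approach, not the actual statistical model: split execution time into a part that scales with the core clock and a memory part estimated from hardware event counts; since the memory part does not shrink when the clock slows, the lowest admissible frequency can then be solved per process.

```python
# Sketch: pick the lowest DVFS frequency meeting a performance constraint.
def relative_perf(freq_ratio, mem_fraction):
    """Performance at freq_ratio * f_max, given the fraction of time spent
    waiting on memory at f_max (estimated from event counters)."""
    cpu = (1 - mem_fraction) / freq_ratio   # CPU part stretches as f drops
    return 1.0 / (cpu + mem_fraction)       # memory part stays constant

def lowest_freq(mem_fraction, threshold, freqs=(0.5, 0.6, 0.75, 0.85, 1.0)):
    for r in freqs:                          # try the slowest setting first
        if relative_perf(r, mem_fraction) >= threshold:
            return r
    return 1.0

# A memory-bound process tolerates a lower clock than a CPU-bound one.
print(lowest_freq(0.6, threshold=0.9), lowest_freq(0.1, threshold=0.9))
```

With a 90% performance threshold, the memory-bound process (60% memory time) can run at 0.85x frequency, while the CPU-bound one (10% memory time) must stay at full speed, which is exactly why a single frequency ratio cannot be read off the performance constraint directly.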

Page 27

Evaluation Result

Specified threshold: the black dotted line in the chart.
Performance stays within the threshold in all cases except mgrid, which falls 3-7% below it.
An accurate model is obtained; a Linux scheduler using this model has been developed.

[Chart: relative performance (0.4-1.0) vs. performance threshold (0.5-1.0) for mcf, bzip2, swim, mgrid, and matrix (50/600/1000), measured on a Pentium M 760 (max 2.00 GHz, FSB 533 MHz)]

Page 28

Summary

Challenges for high performance processors: Memory Wall and Power Wall

One solution to the memory wall:
make good use of on-chip memory with software controllability

Solutions to the power wall:
many cores relax the problem, but leakage current is becoming a big problem
new research and approaches are required
our project "Innovative Power Control for Ultra Low-Power and High-Performance System LSIs" was introduced
