A Distributed Control Path Architecture for VLIW Processors


Page 1: A Distributed Control Path Architecture for VLIW Processors

University of Michigan, Electrical Engineering and Computer Science

A Distributed Control Path Architecture for VLIW Processors

Hongtao Zhong, Kevin Fan, Scott Mahlke, and Michael Schlansker*

Advanced Computer Architecture Laboratory, University of Michigan

*HP Laboratories

Page 2: A Distributed Control Path Architecture for VLIW Processors

Motivation

• VLIW scaling problem
  ► Centralized resources
  ► Highly ported structures
  ► Wire delays

[Figure: function units (FU), register file, and instruction fetch/decode of a centralized VLIW, shown at two FU counts.]

Page 3: A Distributed Control Path Architecture for VLIW Processors

Multicluster VLIW

• Distribute register files
• Cluster function units
• Distribute data caches
• Clusters communicate through an interconnection network
• Used in TI C6x, Lx/ST200, Analog Devices TigerSHARC

[Figure: Cluster 0 and Cluster 1, each with its own FUs and register file, joined by an interconnection network and fed by a single instruction fetch/decode unit.]

Page 4: A Distributed Control Path Architecture for VLIW Processors

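Control Path Scaling Problem

• Larger I-cache
• Latency
  ► Long wires for control signal distribution
• Code compression
  ► Hardware cost, power
  ► Grow quadratically with the number of FUs

[Figure: centralized control path. A PC indexes an I-cache holding compressed operations (A, B, C, ..., X); an align/shift network routes the fetched operations into the IR, where unused slots hold NOPs.]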
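To make the role of the align/shift network concrete, here is a minimal sketch of NOP expansion. The slide does not give the instruction encoding, so the slot-mask format, the function name `expand`, and the operation names are all assumptions made for illustration only.

```python
NOP = "NOP"

def expand(compressed_ops, slot_mask, num_fus):
    """Place each stored operation in its FU slot and fill the rest with NOPs.

    Hypothetical encoding: bit i of slot_mask is set when FU i receives the
    next stored operation. This is not the encoding used by the authors.
    """
    ops = iter(compressed_ops)
    return [next(ops) if slot_mask & (1 << fu) else NOP for fu in range(num_fus)]

# Two stored operations feeding FU 0 and FU 2 of a 4-wide machine.
print(expand(["A", "B"], 0b0101, 4))
```

In a centralized design every stored operation may need to reach any FU slot, which is one way to picture why the routing cost grows quickly with the number of FUs.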

Page 5: A Distributed Control Path Architecture for VLIW Processors

Straightforward Approach

• Distribute I-fetch in a spirit similar to the distribution of the data path
  ► Local communication of controls
  ► Reduce latency, hardware cost, power
• Used in Multiflow Trace 14/300 processors

[Figure: clusters of FUs and register files joined by an interconnection network, each with its own I-cache and IR, driven by a PC.]

Page 6: A Distributed Control Path Architecture for VLIW Processors

DVLIW Approach

• Simple distribution has problems
  ► Doesn’t support code compression
  ► PC still a centralized resource

[Figure: the simple distribution (one PC, per-cluster I-cache and IR) alongside DVLIW, where each cluster has its own PC (PC0, PC1), I-cache, align/shift logic, and IR; clusters are joined by an interconnection network.]

Page 7: A Distributed Control Path Architecture for VLIW Processors

DVLIW Execution Model

• Clusters execute in lock-step
  ► When one cluster stalls, all clusters stall
• Clusters collectively execute one thread
  ► Each cluster runs an instruction stream
  ► Compiler orchestrates the execution of streams
  ► Compiler manages communication
  ► Lightweight synchronization
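The lock-step rule can be illustrated with a small simulation. This is only a sketch: the function `run_lockstep`, the stall table, and the operation names are hypothetical and do not model the real stall sources (such as cache misses) in any detail.

```python
def run_lockstep(streams, stalls):
    """Advance all clusters together; one stalled cluster holds back every cluster.

    streams: one list of operations per cluster, all of equal length.
    stalls:  {(cluster, index): cycles} extra cycles a cluster needs before it
             can issue streams[cluster][index] (hypothetical stall model).
    Returns one entry per cycle: a tuple of issued operations, or None when
    all clusters wait.
    """
    length = len(streams[0])
    assert all(len(s) == length for s in streams)
    pending = dict(stalls)
    trace = []
    index = 0
    while index < length:
        stalled = [c for c in range(len(streams)) if pending.get((c, index), 0) > 0]
        if stalled:
            for c in stalled:
                pending[(c, index)] -= 1
            trace.append(None)  # every cluster waits this cycle
        else:
            trace.append(tuple(s[index] for s in streams))
            index += 1
    return trace

# Cluster 1 stalls for two cycles before its second operation.
print(run_lockstep([["a0", "a1"], ["b0", "b1"]], {(1, 1): 2}))
```

Because the clusters never drift apart, the compiler can schedule inter-cluster communication statically, which fits the lightweight synchronization the slide mentions.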

Page 8: A Distributed Control Path Architecture for VLIW Processors

DVLIW Benefits

• Completely decentralized architecture
  ► Distributed data path
  ► Distributed control path
• Supports arbitrary code compression
• Exploits ILP on a multicore-style system
  ► Good for embedded applications
  ► Low cost
  ► Compiler support

Page 9: A Distributed Control Path Architecture for VLIW Processors

DVLIW Architecture

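[Figure: four VLIW clusters (Cluster 0 to Cluster 3) connected to a banked L2. The detail view of one cluster shows a PC with next-PC logic and a br_target input, an L1 I-cache, align/shift logic, an IR, register files, FUs, an ICM unit, and an L1 D-cache, with links to the banked L2 and to neighboring clusters.]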

Page 10: A Distributed Control Path Architecture for VLIW Processors

Code Organization

• Code for each cluster is consecutive in memory
• Operations in the same MultiOp are stored in different memory locations
• Each cluster computes its own next PC

[Figure: memory layout of MultiOps A (A1 to A5) and B (B1 to B4).
Conventional VLIW (one PC): A1 A2 A3 A4 A5 B1 B2 B3 B4
DVLIW (PC0 and PC1): A1 A2 A3 B1 B2 A4 A5 B3 B4]
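The two layouts can be sketched as follows. The operation names come from the slide's figure, but the cluster assignment (A1 to A3 and B1, B2 on cluster 0; the rest on cluster 1) is my reading of that figure, and the function names are hypothetical.

```python
# Each MultiOp is a list of (operation name, cluster id).
# The cluster assignment below is an assumption read off the figure.
multiops = [
    [("A1", 0), ("A2", 0), ("A3", 0), ("A4", 1), ("A5", 1)],  # MultiOp A
    [("B1", 0), ("B2", 0), ("B3", 1), ("B4", 1)],              # MultiOp B
]

def conventional_layout(multiops):
    """Single instruction stream: operations of one MultiOp are adjacent."""
    return [name for multiop in multiops for name, _ in multiop]

def dvliw_layout(multiops, num_clusters):
    """One stream per cluster: each cluster's operations are consecutive."""
    streams = [[] for _ in range(num_clusters)]
    for multiop in multiops:
        for name, cluster in multiop:
            streams[cluster].append(name)
    return streams

print(conventional_layout(multiops))
print(dvliw_layout(multiops, 2))
```

Each per-cluster stream would be fetched through that cluster's own PC, so operations of the same MultiOp end up at unrelated addresses in different streams.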

Page 11: A Distributed Control Path Architecture for VLIW Processors

Branch Mechanism

• Maintain correct execution order
  ► All clusters transfer control in the same cycle
  ► All clusters branch to the same logical MultiOp
• Unbundled branch in HPL-PD

[Figure: a branch unbundled into three operations:
PBR btr1, TARGET     (each cluster specifies its own target)
CMPP pr0, (x>100)?   (broadcast to all clusters)
BR btr1, pr0         (replicated in each cluster)]

Page 12: A Distributed Control Path Architecture for VLIW Processors

Branch Handling Example

Conventional VLIW:
  …
  pbr btr1, BB2
  cmpp pr0, (x>100)?
  …
  br btr1, pr0

DVLIW, Cluster 0:
  …
  pbr btr1, BB2
  cmpp pr0, (x>100)?
  bcast pr0
  br btr1, pr0

DVLIW, Cluster 1:
  …
  pbr btr1, BB2’
  …
  …
  br btr1, pr0
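The example can be mimicked in a few lines. This is a sketch under assumptions: the function `take_branch`, the numeric addresses, and the fall-through handling are invented for illustration; only the pbr/cmpp/bcast/br sequence and the threshold of 100 come from the slide.

```python
def take_branch(x, targets, fallthroughs):
    """Return the next PC of every cluster after the branch.

    targets[c]:      cluster c's local address of the target block (BB2, BB2’)
    fallthroughs[c]: cluster c's local address of the fall-through block
    Both address lists are hypothetical.
    """
    num_clusters = len(targets)
    btr1 = list(targets)              # pbr btr1, <local target> in each cluster
    pr0_cluster0 = x > 100            # cmpp pr0, (x>100)? on cluster 0 only
    pr0 = [pr0_cluster0] * num_clusters   # bcast pr0 to all clusters
    # br btr1, pr0 is replicated: every cluster decides from the same predicate.
    return [btr1[c] if pr0[c] else fallthroughs[c] for c in range(num_clusters)]

print(take_branch(150, targets=[0x40, 0x90], fallthroughs=[0x10, 0x60]))
print(take_branch(50, targets=[0x40, 0x90], fallthroughs=[0x10, 0x60]))
```

Since all clusters see the same predicate value, they either all jump to their own copy of the target block or all fall through, which keeps them on the same logical MultiOp.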

Page 13: A Distributed Control Path Architecture for VLIW Processors

Sleep Mode

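• Idle blocks after distribution
• Put cluster into sleep mode
  ► Compiler managed
  ► Save energy
  ► Reduce code size
• Mode change happens at block boundary

[Figure: per-cluster blocks for Cluster 0 and Cluster 1, each ending in a BR, with SLEEP and WAKE marking where a cluster is put to sleep and woken.]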

Page 14: A Distributed Control Path Architecture for VLIW Processors

Experimental Setup

• Trimaran toolset
• Processor configuration
  ► 4 clusters; 2 INT, 1 FP, 1 MEM, 1 BR per cluster
  ► 16K L1 I-cache total
  ► Perfect data cache assumed
• Power model
  ► Verilog for instruction align/shift logic
  ► Wire model
  ► CACTI cache model
• 21 benchmarks from MediaBench and SPECINT2000

Page 15: A Distributed Control Path Architecture for VLIW Processors

Change in Global Communication Bits

[Figure: bar chart. Y-axis: traffic ratio, log scale from 1 to 1,000,000. Series: Icmove increase, Fetch reduction, Total reduction. X-axis: MediaBench (rawcaudio, rawdaudio, g721encode, g721decode, gsmencode, gsmdecode, epic, unepic, pegwitdec, pegwitenc, rasta, cjpeg, djpeg, mpeg2enc, mpeg2dec), SPECINT (164.gzip, 256.bzip2, 181.mcf, 197.parser, 255.vortex, 300.twolf), and the average.]

Page 16: A Distributed Control Path Architecture for VLIW Processors

Normalized Energy Consumption on Control Path

Control path energy = (align/shift logic energy) + (wire energy) + (I-cache energy)

Chart callouts: 40% saving, 67% saving, 80% saving, 21% saving

[Figure: bar chart. Y-axis: normalized energy consumption, 0 to 1. X-axis: rawcaudio, rawdaudio, g721encode, g721decode, gsmencode, gsmdecode, epic, unepic, pegwitdec, pegwitenc, rasta, cjpeg, djpeg, mpeg2enc, mpeg2dec, 164.gzip, 256.bzip2, 181.mcf, 197.parser, 255.vortex, 300.twolf, and the average.]

Page 17: A Distributed Control Path Architecture for VLIW Processors

Normalized Code Size

• Baseline: conventional VLIW with compressed encoding
• Traditional method (single PC): 7x increase
• DVLIW: 40% increase

[Figure: bar chart. Y-axis: normalized code size, 0 to 10. Series: Traditional Method, DVLIW. X-axis: rawcaudio, rawdaudio, g721encode, g721decode, gsmencode, gsmdecode, epic, unepic, pegwitdec, pegwitenc, rasta, cjpeg, djpeg, mpeg2enc, mpeg2dec, 164.gzip, 256.bzip2, 181.mcf, 197.parser, 255.vortex, 300.twolf, and the average.]

Page 18: A Distributed Control Path Architecture for VLIW Processors

Result Summary

• DVLIW benefits
  ► Order of magnitude reduction in global communication
  ► 40% savings in control path energy
  ► 5x code size reduction vs. simple distribution
• Small overhead for ILP execution on CMP
  ► 3% increase in execution cycles
  ► 4% increase in I-cache stalls
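The 5x code size figure appears to follow from the Normalized Code Size slide. The check below rests on two assumptions of mine: that "7x increase" means 7 times the baseline size and "40% increase" means 1.4 times the baseline, and that the "traditional method (single PC)" is the same thing as "simple distribution".

```python
# Assumed reading of the Normalized Code Size slide (baseline = 1.0).
traditional = 7.0   # "7x increase", taken as 7 times the baseline
dvliw = 1.4         # "40% increase", taken as 1.4 times the baseline

reduction = traditional / dvliw
print(f"DVLIW code is about {reduction:.1f}x smaller than the single-PC distribution")
```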

Page 19: A Distributed Control Path Architecture for VLIW Processors

Conclusions

• DVLIW removes the last centralized resource in a multicluster VLIW
  ► Fully distributed control path
  ► Scalable architecture
• More energy efficient
• Stylized CMP architecture
  ► Exploit ILP
  ► Multiple instruction streams
  ► Compiler orchestrated

Page 20: A Distributed Control Path Architecture for VLIW Processors

Thank You

• For more information
  ► http://cccp.eecs.umich.edu