Carnegie Mellon

Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads

Antonia Zhai, Christopher B. Colohan,

J. Gregory Steffan† and Todd C. Mowry

School of Computer Science, Carnegie Mellon University

†Dept. Elec. & Comp. Engineering, University of Toronto


Motivation

Chip-level multiprocessing is becoming commonplace

We need parallel programs

Examples: UltraSPARC IV (2 UltraSPARC III cores), IBM Power 4, Sun MAJC, SiByte SB-1250

Can multithreaded processors improve the performance of a single application?



Why Is Automatic Parallelization Difficult?

One solution: Thread-Level Speculation

Automatic parallelization today

Must statically prove threads are independent

Constructing proofs is difficult due to ambiguous data dependences:
• Complex control flow
• Pointers and indirect references
• Runtime inputs

Optimistic compiler?

Limited only by true dependences



Example

while (...) {
  …
  x = hash[index1];
  …
  hash[index2] = y;
  ...
}

[Timeline: time runs downward; each thread ends with check_dep()]

Processor 1, Thread 1: … = hash[3]   … hash[10] = ...
Processor 2, Thread 2: … = hash[19]  … hash[21] = ...
Processor 3, Thread 3: … = hash[33]  … hash[30] = ...
Processor 4, Thread 4: … = hash[10]  … hash[25] = ...

Threads 5, 6 and 7 follow.

Thread 4 Retry (Thread 4 loads hash[10], which Thread 1 stores)
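The retry of Thread 4 can be illustrated with a small sketch. It assumes each thread is summarized by a single load index and a single store index, and it conservatively treats any match between a later thread's load and an earlier thread's store as a violation, regardless of timing. The names `epoch_access` and `must_retry` are hypothetical and do not describe how check_dep() is actually implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one speculative thread (one loop iteration). */
typedef struct {
    int load_index;   /* index read by    x = hash[index1]  */
    int store_index;  /* index written by hash[index2] = y  */
} epoch_access;

/*
 * Conservative dependence check (an assumption of this sketch): thread
 * `later` must retry if any logically earlier thread in the same batch
 * stores to the index that `later` loaded. Real TLS hardware tracks when
 * accesses actually happen; this sketch ignores timing entirely.
 */
bool must_retry(const epoch_access *epochs, size_t later)
{
    assert(epochs != NULL);
    for (size_t earlier = 0; earlier < later; earlier++) {
        if (epochs[earlier].store_index == epochs[later].load_index)
            return true;
    }
    return false;
}
```

Filling an array with the four threads from the slide, {3, 10}, {19, 21}, {33, 30} and {10, 25}, makes `must_retry` report a violation only for the fourth entry, whose load of hash[10] matches the first thread's store.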



Frequently Dependent Scalars

Can identify scalars that always cause dependences

[Diagram: time runs downward. Producer: …=a; a=…   Consumer: …=a; a=…]


Dependent scalars should be synchronized [ASPLOS’02]

[Diagram: time runs downward. Producer: …=a; a=…; Signal(a)   Consumer: Wait(a); …=a; a=…]


Dataflow analysis allows us to deal with complex control flow [ASPLOS’02]

[Diagram: time runs downward. Producer: …=a; a=…   Consumer: …=a; a=…]

Communicating Memory-Resident Values

Synchronize?

Speculate?

Will speculation succeed?

[Diagram: time runs downward. Producer: Load *p; Store *q   Consumer: Load *p; Store *q]

Speculation vs. Synchronization

Speculation succeeds: efficient

[Diagram: time runs downward; Load *p and Store *q pairs shown under Sequential Execution and under Speculative Parallel Execution]



Speculation fails: inefficient

[Diagram: time runs downward; timelines of repeated Load *p and Store *q pairs, with a violation marked]



Frequent dependences: Synchronize

Infrequent dependences: Speculate

[Diagram: time runs downward; timelines of repeated Load *p and Store *q pairs]

Performance Potential

Reducing failed speculation improves performance

Detailed simulation:
• TLS support
• 4-processor CMP
• 4-way issue, out-of-order superscalar
• 10-cycle communication latency

[Bar chart: normalized regional execution time (0 to 100) per benchmark, comparing Original against Perfect memory value prediction. Benchmarks: m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, go]

Hardware vs. Compiler Inserted Synchronization

[Diagram: three producer/consumer timelines, each showing Store *q, Load *p and Memory, with time running downward:
• Speculation
• Hardware-inserted Synchronization [HPCA’02], marked (stall)
• Compiler-inserted Synchronization [CGO’04], marked Signal() and Wait()]

Issues in Synchronizing Memory-Resident Values

Static analysis
• Which instructions to synchronize?
• Inter-procedural dependences

Runtime
• Detecting and recovering from improper synchronization

[Diagram: producer/consumer timeline with Store *q and Load *p]

Outline

Static analysis

Runtime checks

Results

Conclusions

[Diagram: producer/consumer timeline with Store *q and Load *p]

Compiler Passes

[Diagram: compiler pipeline from foo.c to foo.exe. Labels: Front End, Create Threads, Profile Data Dependences, Decide what to Synchronize, Insert Synchronization, Schedule Instructions, Back End]

Example

work()

push (head, entry)

do {
  push(&set, element);
  work();
} while (test);

Example

work() {
  if (condition(&set))
    push(&set, element);
}

push(head, entry)

Example


push(head, entry) {
  entry->next = *head;
  *head = entry;
}

Memory accesses, labeled with their call path:

Load *head (push)
Store *head (push)
Load *head (work, push)
Store *head (work, push)

Compiler Passes

[Diagram: compiler pipeline from foo.c to foo.exe. Labels: Front End, Thread Creating, Insert Synchronization, Instruction Scheduling, Back End]

Example





Load *head (push)
Store *head (push)
Load *head (work, push)
Store *head (work, push)

Profile Information
====================================================================
Source                    Destination                Frequency
Store *head (push)        Load *head (push)          990
Store *head (push)        Load *head (work, push)    10
Store *head (work, push)  Load *head (push)          10

Compiler Passes

[Diagram: compiler pipeline from foo.c to foo.exe. Labels: Front End, Thread Creating, Insert Synchronization, Back End]

Dependence Graph

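Nodes: Load *head (push), Store *head (push), Load *head (work, push), Store *head (work, push)

Edges, weighted by profiled frequency:
Store *head (push) -> Load *head (push): 990
Store *head (push) -> Load *head (work, push): 10
Store *head (work, push) -> Load *head (push): 10

Pairs that need to be synchronized can be extracted from the dependence graph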

Infrequent dependences: occur in less than 5% of iterations
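The 5% rule can be sketched as a filter over profiled edges. This is only an illustration: the `dep_edge` type and `should_synchronize` function are hypothetical names, the total iteration count is an input the slides do not state, and treating exactly 5% as frequent is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One profiled store-to-load dependence edge (illustrative layout). */
typedef struct {
    const char *store;  /* producer access, with its call path */
    const char *load;   /* consumer access, with its call path */
    unsigned    count;  /* iterations in which the dependence occurred */
} dep_edge;

/*
 * Returns true when the edge is frequent enough to synchronize, i.e. it
 * occurred in at least 5% of iterations. The comparison is done in
 * integers (100 * count >= 5 * total) to avoid floating point.
 */
bool should_synchronize(const dep_edge *e, unsigned total_iterations)
{
    assert(e != NULL);
    if (total_iterations == 0)
        return false;
    return 100ULL * e->count >= 5ULL * total_iterations;
}
```

If the profile were gathered over 1000 iterations (an assumed figure), the edge with frequency 990 would be synchronized and the two edges with frequency 10 (1% of iterations) would be left to speculation.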



Compiler Passes

[Diagram: compiler pipeline from foo.c to foo.exe. Labels: Front End, Thread Creating, Insert Synchronization, Back End]

Example





Store *head (push) -> Load *head (push): 990

Synchronize these

push_clone(head, entry) {
  wait();
  entry->next = *head;
  *head = entry;
  signal(head, *head);
}

push_clone(&set, element);
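A compilable sketch of the cloned function follows. The primitives are stand-ins: `tls_wait` and `tls_signal` are hypothetical names chosen to avoid clashing with the C library's `wait` and `signal`, and here they only record what would be forwarded, whereas real TLS support would stall the consumer and send the address and value to the next thread.

```c
#include <assert.h>
#include <stddef.h>

typedef struct entry {
    struct entry *next;
    int value;
} entry;

/* Last (address, value) pair handed to tls_signal; stand-in for forwarding. */
entry **tls_forwarded_addr;
entry  *tls_forwarded_value;

/* Stand-in: real hardware would block until the previous thread signals. */
static void tls_wait(void)
{
}

/* Stand-in: real hardware would forward the pair to the next thread. */
static void tls_signal(entry **addr, entry *value)
{
    tls_forwarded_addr = addr;
    tls_forwarded_value = value;
}

/* Synchronized clone of push(): wait, update the list head, then signal. */
void push_clone(entry **head, entry *e)
{
    assert(head != NULL && e != NULL);
    tls_wait();
    e->next = *head;           /* Load *head  */
    *head = e;                 /* Store *head */
    tls_signal(head, *head);   /* forward the address and the new value */
}
```

Only the call sites on the frequently dependent path would call `push_clone`; other callers would keep using the unsynchronized `push`.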



Outline

• Static analysis

Runtime checks

Results

Conclusions

[Diagram: producer/consumer timeline with Store *q and Load *p]

Runtime Checks

• Store *q and Load *p access the same memory address
• No store modifies the forwarded address between Store *q and Load *p

Signal(q, *q);

The producer forwards the address to ensure a match between the load and the store

[Diagram: producer/consumer timeline with Store *q and Load *p; time runs downward]
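The two checks can be sketched as the consumer's choice between the forwarded value and the value in memory. This is an illustration in software of what the slides describe as a hardware-supported check; `forwarded_pair`, `consumer_load` and the `intervening_store` flag are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the producer sends with Signal(q, *q). */
typedef struct {
    const int *addr;   /* q: the address the producer stored to */
    int        value;  /* *q at the time of the signal          */
} forwarded_pair;

/*
 * Consumer side of Load *p. The forwarded value is used only when both
 * runtime checks pass: the addresses match (q == p) and no store has
 * modified that address since the producer's Store *q. Otherwise the
 * load falls back to the value currently in memory.
 */
int consumer_load(const int *p, forwarded_pair f, bool intervening_store)
{
    assert(p != NULL);
    if (f.addr == p && !intervening_store)
        return f.value;
    return *p;
}
```

In this sketch a mismatch is harmless: the consumer simply reads memory, and ordinary speculation remains responsible for catching any true dependence that was missed.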



Ensuring Correctness

• Store *q and Load *p access the same memory address

No store modifies the forwarded address between Store *q and Load *p

Hardware support: similar to memory conflict buffer [Gallagher et al., ASPLOS’94]

[Diagram: producer/consumer timeline with Store *q, Store *x and Load *p]


Hardware support: TLS hardware already knows which locations are stored to




[Diagram: producer/consumer timeline with Store *q, Store *y and Load *p]

Outline

• Static analysis

• Runtime checks

Results

Conclusions

[Diagram: producer/consumer timeline with Store *q and Load *p]

Experimental Framework

Underlying architecture
• 4-processor, single-chip multiprocessor
• speculation supported through coherence

Simulator
• superscalar, similar to MIPS R14K
• 10-cycle communication latency
• models all bandwidth and contention

Benchmarks
• SPECint95 and SPECint2000, -O3 optimization

detailed simulation

[Diagram: P and C blocks connected by a Crossbar]

Parallel Region Coverage

[Bar chart: parallel region coverage (0 to 100) per benchmark. Benchmarks: go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp]

Coverage is significant

Average coverage: 54%

Compiler-Inserted Synchronization

U = No synchronization inserted
C = Compiler-Inserted Synchronization

[Stacked bar chart: normalized regional execution time (0 to 100) per benchmark, with bars U and C broken down into Failed Speculation, Synchronization Stall, Other and Busy. Benchmarks: go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp. Improvements annotated on the chart: 10%, 46%, 13%, 5%, 8%, 5%, 21%]

Seven benchmarks speed up by 5% to 46%

Compiler- vs. Hardware-Inserted Synchronization

C = Compiler-Inserted Synchronization
H = Hardware-Inserted Synchronization

[Stacked bar chart: normalized regional execution time (0 to 100) per benchmark, with bars C and H broken down into Failed Speculation, Other and Busy. Benchmarks: go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp. Annotations mark one group as "Hardware does better" and another as "Compiler does better"]

Compiler and hardware [HPCA’02] each benefit different benchmarks

Combining Hardware and Compiler Synchronization

C = Compiler-inserted synchronization
H = Hardware-inserted synchronization
B = Combining both

The combination is more robust than either technique individually

[Stacked bar chart: normalized regional execution time (0 to 100) per benchmark, with bars C, H and B broken down into Failed Speculation, Other and Busy. Benchmarks: go, m88ksim, gzip_comp, gzip_decomp, perlbmk, gap]

Related Work

[Diagram: related work classified as Compiler-inserted or Hardware-inserted, and as Centralized Table or Distributed Table]

Zhai et al., CGO’04
Cytron, ICPP’86
Moshovos et al., ISCA’97
Cintra & Torrellas, HPCA’02
Steffan et al., HPCA’02
Tsai & Yew, PACT’96

Conclusions

Compiler-inserted synchronization for memory-resident value communication:

Effective in reducing speculation failure
• Half of the benchmarks speed up by 5% to 46% (regional)

Combining hardware and compiler techniques is more robust
• Neither consistently outperforms the other
• Can be combined to track the best performer

Memory-resident value communication should be addressed with the combined efforts of the compiler and the hardware

Questions?



The Potential of Instruction Scheduling

E = Early
L = Late

[Stacked bar chart: bars E, C and L per benchmark (0 to 100), broken down into Failed Speculation, Other and Busy. Benchmarks: go, m88ksim, ijpeg, gzip_comp_R, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp]

Scheduling instructions has additional benefit for some benchmarks

Program Performance

U = Un-optimized
B = Both compiler and hardware

[Stacked bar chart: bars U, C, H and B per benchmark (0 to 100), broken down into Failed Speculation, Other and Busy. Benchmarks: go, m88ksim, ijpeg, gzip_comp_R, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, bzip2_decomp, twolf]

Which Technique Synchronizes This Load?

U = Un-optimized
B = Both compiler and hardware

[Stacked bar chart: bars U, C, H and B per benchmark (0 to 100), with loads classified as Synchronized by neither technique, Synchronized by compiler, Synchronized by hardware, or Synchronized by both. Benchmarks: go, m88ksim, ijpeg, gzip_comp_R, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, twolf]


Hardware support: similar to memory conflict buffer [Gallagher et al., ASPLOS’94]

[Diagram: producer/consumer timeline with Store *q, Store *x and Load *p]

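Hardware support: use the forwarded value only if the synchronized pair is dependent

[Diagram: the producer executes Store *q, Store *x, Signal(q); Signal(*q), and the consumer executes Load *p. A flowchart at the consumer has two tests, "Local Store to *p" and "q == p", each with YES and NO branches, leading to either "Use Forwarded Value" or "Use Memory Value"]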

Issues in Synchronizing Memory-Resident Values

• Inserting synchronization using compilers

• Ensuring correctness

Reducing synchronization cost

[Diagram: producer/consumer timeline with Store *q and Load *p]

Reducing Cost of Synchronization

[Diagram: producer/consumer timelines shown Before Instruction Scheduling and After Instruction Scheduling]

Instruction scheduling algorithms are described in [ASPLOS’02]

The Potential of Instruction Scheduling

E = Perfectly predicting synchronized memory-resident values
C = Compiler-inserted synchronization
L = Consumer stalls until previous thread commits

[Stacked bar chart: normalized regional execution time (0 to 100) per benchmark, with bars E, C and L broken down into Failed Speculation, Other and Busy. Benchmarks: m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gap]

Scheduling instructions could offer additional benefit

Using More Accurate Profiling Information

U = No Instruction Scheduling
R = Compiler-Inserted Synchronization (profiled with the ref input set)

[Stacked bar chart: normalized regional execution time (0 to 100) for gzip_comp, with bars U, C and R broken down into Failed Speculation, Other and Busy]

gzip_comp is the only benchmark sensitive to profiling input
