
Carnegie Mellon

Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads

Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan† and Todd C. Mowry

School of Computer Science, Carnegie Mellon University

†Dept. Elec. & Comp. Engineering, University of Toronto

Compiler Optimization of Memory-Resident Value Communication… - 2 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Motivation

Chip-level multiprocessing is becoming commonplace

Examples: UltraSPARC IV (2 UltraSPARC III cores), IBM Power 4, SUN MAJC, Sibyte SB-1250

We need parallel programs

Can multithreaded processors improve the performance of a single application?

Why Is Automatic Parallelization Difficult?

Automatic parallelization today
  Must statically prove threads are independent
  Constructing proofs is difficult due to ambiguous data dependences:
    Complex control flow
    Pointers and indirect references
    Runtime inputs

Optimistic compiler?
  Limited only by true dependences

One solution: Thread-Level Speculation

Example

while (...) {
  …
  x = hash[index1];
  …
  hash[index2] = y;
  ...
}

Speculative threads on four processors (time runs downward):

Processor 1       Processor 2       Processor 3       Processor 4
Thread 1          Thread 2          Thread 3          Thread 4
… = hash[3]       … = hash[19]      … = hash[33]      … = hash[10]
hash[10] = ...    hash[21] = ...    hash[30] = ...    hash[25] = ...
check_dep()       check_dep()       check_dep()       check_dep()

Thread 5          Thread 6          Thread 7          Thread 4 Retry
… = hash[31]      … = hash[9]       … = hash[27]      … = hash[10]
hash[12] = ...    hash[44] = ...    hash[32] = ...    hash[25] = ...
check_dep()       check_dep()       check_dep()       check_dep()

Thread 4 loads hash[10], which Thread 1 stores, so Thread 4 is retried.
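The dependence check in this example can be sketched in C. This is an illustration only, not the hardware mechanism from the talk: the `Epoch` type and `first_violation` function are hypothetical names, and the sketch conservatively assumes that all threads in a batch run concurrently, so any load of a slot that a logically earlier thread stores counts as a violation.

```c
#include <stddef.h>

/* One speculative thread (loop iteration): the hash slot it loads
   and the hash slot it stores. Hypothetical type for illustration. */
typedef struct {
    int load_index;
    int store_index;
} Epoch;

/* Returns the position (0-based, in sequential order) of the first thread
   that loads a slot stored by a logically earlier thread in the same batch,
   or -1 if no such cross-thread dependence exists. */
int first_violation(const Epoch *epochs, size_t n) {
    for (size_t later = 1; later < n; later++) {
        for (size_t earlier = 0; earlier < later; earlier++) {
            if (epochs[earlier].store_index == epochs[later].load_index) {
                return (int)later;
            }
        }
    }
    return -1;
}
```

With the first batch from the slide, `{3,10}, {19,21}, {33,30}, {10,25}`, the function reports position 3, which is Thread 4.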

Frequently Dependent Scalars

Producer: …=a, then a=…
Consumer: …=a, then a=…
(time runs downward)

Can identify scalars that always cause dependences

Frequently Dependent Scalars

Producer: …=a, then a=…, then Signal(a)
Consumer: Wait(a), then …=a, then a=…
(time runs downward)

Dependent scalars should be synchronized [ASPLOS’02]
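A software analogy of Signal(a)/Wait(a) can be sketched with POSIX threads. This is an assumption-laden illustration, not the TLS mechanism itself: the `Channel` type and the `signal_a`, `wait_a` and `run_iterations` names are hypothetical, and the loop body is replaced by a stand-in increment.

```c
#include <pthread.h>

/* Hypothetical one-shot channel that forwards scalar a between
   consecutive threads (each thread is one loop iteration). */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int ready;  /* set once the producer has signaled */
    int value;  /* forwarded value of a */
} Channel;

/* Signal(a): publish the producer's final value of a. */
static void signal_a(Channel *c, int a) {
    pthread_mutex_lock(&c->lock);
    c->value = a;
    c->ready = 1;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

/* Wait(a): block until the previous thread has signaled. */
static int wait_a(Channel *c) {
    pthread_mutex_lock(&c->lock);
    while (!c->ready) {
        pthread_cond_wait(&c->cond, &c->lock);
    }
    int a = c->value;
    pthread_mutex_unlock(&c->lock);
    return a;
}

typedef struct { Channel *in; Channel *out; } Links;

/* One iteration: ...=a; a=...; synchronized around the dependence. */
static void *iteration(void *arg) {
    Links *links = arg;
    int a = wait_a(links->in);  /* consumer side */
    a = a + 1;                  /* stand-in for the real loop body */
    signal_a(links->out, a);    /* producer side */
    return NULL;
}

#define N_THREADS 4

/* Runs N_THREADS iterations as threads; returns the final value of a. */
int run_iterations(int initial_a) {
    Channel ch[N_THREADS + 1];
    Links links[N_THREADS];
    pthread_t tid[N_THREADS];
    for (int i = 0; i <= N_THREADS; i++) {
        pthread_mutex_init(&ch[i].lock, NULL);
        pthread_cond_init(&ch[i].cond, NULL);
        ch[i].ready = 0;
        ch[i].value = 0;
    }
    signal_a(&ch[0], initial_a);
    for (int i = 0; i < N_THREADS; i++) {
        links[i].in = &ch[i];
        links[i].out = &ch[i + 1];
        pthread_create(&tid[i], NULL, iteration, &links[i]);
    }
    for (int i = 0; i < N_THREADS; i++) {
        pthread_join(tid[i], NULL);
    }
    int result = wait_a(&ch[N_THREADS]);
    for (int i = 0; i <= N_THREADS; i++) {
        pthread_cond_destroy(&ch[i].cond);
        pthread_mutex_destroy(&ch[i].lock);
    }
    return result;
}
```

Each thread stalls only until its predecessor signals, so work outside the wait-to-signal window could still overlap; in this stand-in the whole body sits inside that window, so the threads effectively serialize.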

Frequently Dependent Scalars

Producer: …=a, then a=…
Consumer: …=a, then a=…
(time runs downward)

Dataflow analysis allows us to deal with complex control flow [ASPLOS’02]

Communicating Memory-Resident Values

Producer: Load *p, then Store *q
Consumer: Load *p, then Store *q
(time runs downward)

Synchronize?

Speculate?

Will speculation succeed?

Speculation vs. Synchronization

(Diagram: sequential execution beside speculative parallel execution; each thread performs Load *p and Store *q, and time runs downward.)

Speculation succeeds: efficient

Speculation vs. Synchronization

(Diagram: sequential execution beside speculative parallel execution; each thread performs Load *p and Store *q, and a violation causes threads to repeat them.)

Speculation fails: inefficient

Speculation vs. Synchronization

(Diagram: sequential execution beside speculative parallel execution; each thread performs Load *p and Store *q, and time runs downward.)

Frequent dependences: Synchronize

Infrequent dependences: Speculate

Performance Potential

Detailed simulation:
• TLS support
• 4-processor CMP
• 4-way issue, out-of-order superscalar
• 10-cycle communication latency

(Chart: normalized regional execution time, 0 to 100, comparing the original code with perfect memory value prediction, for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap and bzip2_comp.)

Reducing failed speculation improves performance

Hardware vs. Compiler Inserted Synchronization

(Diagram: three producer/consumer timelines in which the producer's Store *q and the consumer's Load *p communicate through memory.)

Speculation: no synchronization between Store *q and Load *p
Hardware-inserted synchronization [HPCA’02]: the consumer's Load *p stalls
Compiler-inserted synchronization [CGO’04]: the producer executes Signal() and the consumer executes Wait()

Issues in Synchronizing Memory-Resident Values

Static analysis
  Which instructions to synchronize?
  Inter-procedural dependences

Runtime
  Detecting and recovering from improper synchronization


Outline

Static analysis

Runtime checks

Results

Conclusions


Compiler Passes

foo.c → Front End → (passes below) → Back End → foo.exe

Passes between the front end and the back end:
  Create Threads
  Profile Data Dependences
  Decide what to Synchronize
  Insert Synchronization
  Schedule Instructions

Example

do {
  push(&set, element);
  work();
} while (test);

work()

push(head, entry)

Example

do {
  push(&set, element);
  work();
} while (test);

work() {
  if (condition(&set))
    push(&set, element);
}

push(head, entry)

Example

do {
  push(&set, element);
  work();
} while (test);

work() {
  if (condition(&set))
    push(&set, element);
}

push(head, entry) {
  entry->next = *head;
  *head = entry;
}

Memory accesses to *head, labeled by call path:
  Load *head (push), Store *head (push)
  Load *head (work, push), Store *head (work, push)

Compiler Passes

foo.c → Front End → (passes below) → Back End → foo.exe

Passes between the front end and the back end:
  Thread Creating
  Profile Data Dependences
  Decide what to Synchronize
  Insert Synchronization
  Instruction Scheduling

Example

do {
  push(&set, element);
  work();
} while (test);

work() {
  if (condition(&set))
    push(&set, element);
}

push(head, entry) {
  entry->next = *head;
  *head = entry;
}

Memory accesses to *head, labeled by call path:
  Load *head (push), Store *head (push)
  Load *head (work, push), Store *head (work, push)

Profile Information
========================================================
Source                     Destination                Frequency
Store *head (push)         Load *head (push)          990
Store *head (push)         Load *head (work, push)    10
Store *head (work, push)   Load *head (push)          10

Compiler Passes

foo.c → Front End → (passes below) → Back End → foo.exe

Passes between the front end and the back end:
  Thread Creating
  Profile Data Dependences
  Decide what to Synchronize
  Insert Synchronization
  Instruction Scheduling

Dependence Graph

Nodes: Load *head (push), Store *head (push), Load *head (work, push), Store *head (work, push)

Edge weights (profiled dependence frequencies): 990, 10, 10

Pairs that need to be synchronized can be extracted from the dependence graph

Infrequent dependences: occur in less than 5% of iterations
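The 5% rule can be sketched as a filter over profiled dependence edges. This is a minimal illustration under stated assumptions: the `DepEdge` type and `select_pairs` function are hypothetical, and the total iteration count (1000 in the usage note) is assumed because the slides give only the per-edge frequencies.

```c
#include <stddef.h>

/* A profiled store-to-load dependence edge. Hypothetical type. */
typedef struct {
    const char *store;         /* source access, e.g. "Store *head (push)" */
    const char *load;          /* destination access */
    unsigned long frequency;   /* iterations in which the dependence occurred */
} DepEdge;

/* Sets synchronize[i] to 1 when edge i occurs in at least 5% of iterations
   (frequent: synchronize) and to 0 otherwise (infrequent: speculate).
   Returns the number of edges chosen for synchronization. */
size_t select_pairs(const DepEdge *edges, size_t n,
                    unsigned long total_iterations, int *synchronize) {
    size_t chosen = 0;
    for (size_t i = 0; i < n; i++) {
        /* Integer comparison of frequency/total >= 5/100 avoids rounding. */
        unsigned long long lhs = (unsigned long long)edges[i].frequency * 100u;
        unsigned long long rhs = (unsigned long long)total_iterations * 5u;
        synchronize[i] = (lhs >= rhs) ? 1 : 0;
        chosen += (size_t)synchronize[i];
    }
    return chosen;
}
```

With the frequencies 990, 10 and 10 and an assumed 1000 profiled iterations, only the first edge (99%) is selected; the two 1% edges are left to speculation.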

Compiler Passes

foo.c → Front End → (passes below) → Back End → foo.exe

Passes between the front end and the back end:
  Thread Creating
  Profile Data Dependences
  Decide what to Synchronize
  Insert Synchronization
  Instruction Scheduling

Example

do {
  push(&set, element);
  work();
} while (test);

work() {
  if (condition(&set))
    push(&set, element);
}

push(head, entry) {
  entry->next = *head;
  *head = entry;
}

Synchronize these: Store *head (push) and Load *head (push), frequency 990

The call in the loop body is replaced by:

push_clone(&set, element);

push_clone(head, entry) {
  wait();
  entry->next = *head;
  *head = entry;
  signal(head, *head);
}
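A compilable, single-threaded sketch of the cloning step follows. The `Node` type and the `tls_wait` and `tls_signal` stubs are hypothetical stand-ins for the slide's wait() and signal(); they only record what would be forwarded and do not model real TLS hardware.

```c
#include <stddef.h>

typedef struct Node {
    struct Node *next;
} Node;

/* Record of the last forwarded (address, value) pair. Hypothetical. */
Node **forwarded_addr = NULL;
Node *forwarded_value = NULL;

/* Stub: a real consumer would block here for the previous thread's signal. */
static void tls_wait(void) {
}

/* Stub: a real producer would forward the pair to the next thread. */
static void tls_signal(Node **addr, Node *value) {
    forwarded_addr = addr;
    forwarded_value = value;
}

/* Original push: still used on the infrequent path through work(). */
void push(Node **head, Node *entry) {
    entry->next = *head;
    *head = entry;
}

/* Clone used at the frequently dependent call site in the loop body. */
void push_clone(Node **head, Node *entry) {
    tls_wait();
    entry->next = *head;      /* Load *head */
    *head = entry;            /* Store *head */
    tls_signal(head, *head);  /* forward both the address and the new value */
}
```

Cloning keeps the synchronization off the call path through work(), whose dependences are infrequent and are left to speculation.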


Outline

• Static analysis

Runtime checks

Results

Conclusions


Runtime Checks

Producer: Store *q, then Signal(q, *q);
Consumer: Load *p

Producer forwards the address to ensure a match between the load and the store

Store *q and Load *p access the same memory address

No store modifies the forwarded address between Store *q and Load *p

Ensuring Correctness

• Store *q and Load *p access the same memory address

No store modifies the forwarded address between Store *q and Load *p

(Diagram: producer and consumer timelines with Store *q, Load *p and an additional Store *x.)

Hardware support: similar to memory conflict buffer [Gallagher et al., ASPLOS’94]

Ensuring Correctness

• Store *q and Load *p access the same memory address

No store modifies the forwarded address between Store *q and Load *p

(Diagram: producer and consumer timelines with Store *q, Load *p and an additional Store *y.)

Hardware support: TLS hardware already knows which locations are stored to


Outline

• Static analysis

• Runtime checks

Results

Conclusions


Experimental Framework

Benchmarks
  SPECint95 and SPECint2000, -O3 optimization

Underlying architecture
  4-processor, single-chip multiprocessor
  speculation supported through coherence

Simulator
  superscalar, similar to MIPS R14K
  10-cycle communication latency
  models all bandwidth and contention

detailed simulation

(Diagram: units labeled P and C connected by a crossbar.)

Parallel Region Coverage

(Chart: parallel region coverage, 0 to 100, for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap and bzip2_comp.)

Coverage is significant

Average coverage: 54%

Compiler-Inserted Synchronization

(Chart: normalized regional execution time, 0 to 100, broken down into Failed Speculation, Synchronization Stall, Other and Busy, with bars U and C for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap and bzip2_comp. Speedup labels on the chart: 10%, 46%, 13%, 5%, 8%, 5%, 21%.)

U = No synchronization inserted

C = Compiler-inserted synchronization

Seven benchmarks speed up by 5% to 46%

Compiler- vs. Hardware-Inserted Synchronization

(Chart: normalized regional execution time, 0 to 100, broken down into Failed Speculation, Synchronization Stall, Other and Busy, with bars C and H for go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap and bzip2_comp. Annotations mark where hardware does better and where the compiler does better.)

C = Compiler-inserted synchronization

H = Hardware-inserted synchronization

Compiler and hardware [HPCA’02] each benefit different benchmarks

Combining Hardware and Compiler Synchronization

(Chart: normalized regional execution time, 0 to 100, broken down into Failed Speculation, Synchronization Stall, Other and Busy, with bars C, H and B for go, m88ksim, gzip_comp, gzip_decomp, perlbmk and gap.)

C = Compiler-inserted synchronization

H = Hardware-inserted synchronization

B = Combining both

The combination is more robust than each technique individually

Related Work

(Diagram: prior work arranged by compiler-inserted vs. hardware-inserted synchronization and by centralized table vs. distributed table.)

Compiler-inserted
  Zhai et al., CGO’04
  Cytron, ICPP’86

Hardware-inserted
  Moshovos et al., ISCA’97
  Cintra & Torrellas, HPCA’02
  Steffan et al., HPCA’02

Tsai & Yew, PACT’96

Conclusions

Compiler-inserted synchronization for memory-resident value communication:

Effective in reducing speculation failure
  Half of the benchmarks speed up by 5% to 46% (regional)

Combining hardware and compiler techniques is more robust
  Neither consistently outperforms the other
  Can be combined to track the best performer

Memory-resident value communication should be addressed with the combined efforts of the compiler and the hardware


Questions?

The Potential of Instruction Scheduling

(Chart: execution time, 0 to 100, broken down into Failed Speculation, Synchronization Stall, Other and Busy, with bars E, C and L for go, m88ksim, ijpeg, gzip_comp_R, gzip_decomp, vpr_place, mcf, crafty, parser, perlbmk, gap, gzip_comp, gcc and bzip2_comp.)

E = Early

C = Compiler-inserted synchronization

L = Late

Scheduling instructions has additional benefit for some benchmarks

Program Performance

(Chart: execution time, 0 to 100, broken down into Failed Speculation, Synchronization Stall, Other and Busy, with bars U, C, H and B for go, m88ksim, ijpeg, gzip_comp_R, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, bzip2_decomp, twolf and gzip_comp.)

U = Un-optimized

C = Compiler-inserted synchronization

H = Hardware-inserted synchronization

B = Both compiler and hardware

Which Technique Synchronizes This Load?

(Chart: 0 to 100, with bars U, C, H and B for go, m88ksim, ijpeg, gzip_comp_R, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, twolf and gzip_comp. Each bar is divided into: synchronized by neither technique, synchronized by compiler, synchronized by hardware, synchronized by both.)

U = Un-optimized

C = Compiler-inserted synchronization

H = Hardware-inserted synchronization

B = Both compiler and hardware

Ensuring Correctness

• Store *q and Load *p access the same memory address

No store modifies the forwarded address between Store *q and Load *p

(Diagram: producer and consumer timelines with Store *q, Load *p and an additional Store *x.)

Hardware support: similar to memory conflict buffer [Gallagher et al., ASPLOS’94]

Ensuring Correctness

• Store *q and Load *p access the same memory address

No store modifies the forwarded address between Store *q and Load *p

(Diagram: producer and consumer timelines with Store *q, Signal(q); Signal(*q), Store *x and Load *p.)

Hardware support: use the forwarded value only if the synchronized pair is dependent

Decision at Load *p:
  Local store to *p? YES: use memory value
  Local store to *p? NO: check q == p
    q == p? YES: use forwarded value
    q == p? NO: use memory value
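The consumer-side decision can be sketched as a small C function. This is one reading of the slide's flowchart, not the hardware design: the `Forwarded` type, the `synchronized_load` function and the `local_store_to_p` flag are hypothetical, and the flag stands in for what the hardware would track itself.

```c
#include <stdbool.h>

/* The (address, value) pair sent by the producer via Signal(q); Signal(*q).
   Hypothetical type for illustration. */
typedef struct {
    const int *addr;  /* q */
    int value;        /* *q at the time of the signal */
} Forwarded;

/* Value the consumer's Load *p should observe.
   local_store_to_p: true if the consumer itself stored to *p before the load. */
int synchronized_load(const int *p, Forwarded fwd, bool local_store_to_p) {
    if (!local_store_to_p && fwd.addr == p) {
        return fwd.value;  /* the pair is dependent: use the forwarded value */
    }
    return *p;             /* otherwise fall back to the memory value */
}
```

For example, with `fwd = {&x, 42}`, a load from `&x` with no local store yields 42, while a load from any other address, or a load after a local store to `x`, reads memory.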


Issues in Synchronizing Memory-Resident Values

• Inserting synchronization using compilers

• Ensuring correctness

Reducing synchronization cost


Reducing Cost of Synchronization

(Diagram: producer and consumer timelines before instruction scheduling and after instruction scheduling.)

Instruction scheduling algorithms are described in [ASPLOS’02]

The Potential of Instruction Scheduling

(Chart: normalized regional execution time, 0 to 100, broken down into Failed Speculation, Synchronization Stall, Other and Busy, with bars E, C and L for m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place and gap.)

E = Perfectly predicting synchronized memory-resident values

C = Compiler-inserted synchronization

L = Consumer stalls until previous thread commits

Scheduling instructions could offer additional benefit

Using More Accurate Profiling Information

(Chart: normalized regional execution time, 0 to 100, broken down into Failed Speculation, Synchronization Stall, Other and Busy, with bars U, C and R for gzip_comp.)

U = No Instruction Scheduling

C = Compiler-inserted synchronization

R = Compiler-inserted synchronization (profiled with the ref input set)

gzip_comp is the only benchmark sensitive to profiling input