Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads
![Page 1: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/1.jpg)
Carnegie Mellon
Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads
Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan† and Todd C. Mowry
School of Computer Science, Carnegie Mellon University
†Dept. Elec. & Comp. Engineering, University of Toronto
![Page 2: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/2.jpg)
Compiler Optimization of Memory-Resident Value Communication… - 2 - Zhai, Colohan, Steffan and
Mowry
Carnegie Mellon
Motivation
Chip-level multiprocessing is becoming commonplace:
• UltraSPARC IV (2 UltraSPARC III cores)
• IBM Power 4
• Sun MAJC
• SiByte SB-1250
We need parallel programs
Can multithreaded processors improve the performance of a single application?
![Page 3: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/3.jpg)
Why Is Automatic Parallelization Difficult?
Automatic parallelization today:
• Must statically prove threads are independent
• Constructing proofs is difficult due to ambiguous data dependences:
  • Complex control flow
  • Pointers and indirect references
  • Runtime inputs
Optimistic compiler?
• Limited only by true dependences
One solution: Thread-Level Speculation
![Page 4: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/4.jpg)
Example
while (...) {
  …
  x = hash[index1];
  …
  hash[index2] = y;
  ...
}
Speculative threads running on Processor 1 through Processor 4, in time order:
Thread 1: … = hash[3] … hash[10] = ... check_dep()
Thread 2: … = hash[19] … hash[21] = ... check_dep()
Thread 3: … = hash[33] … hash[30] = ... check_dep()
Thread 4: … = hash[10] … hash[25] = ... check_dep()
Thread 5: … = hash[31] … hash[12] = ... check_dep()
Thread 6: … = hash[9] … hash[44] = ... check_dep()
Thread 7: … = hash[27] … hash[32] = ... check_dep()
Thread 4 Retry: … = hash[10] … hash[25] = ... check_dep()
Thread 4 loads hash[10], which Thread 1 stores, so its dependence check fails and it is retried.
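The retry of Thread 4 above can be modeled in a few lines of C. This is a minimal sketch, not the hardware mechanism: it assumes each speculative thread performs exactly one load and one store on `hash`, and the names `epoch_t`, `violates`, and `first_violation` are hypothetical. In a real TLS system the check happens in hardware while the threads run, not in a software loop.

```c
#include <assert.h>
#include <stddef.h>

/* One speculative thread (epoch) from the slide: it loads hash[load_idx]
 * and later stores hash[store_idx]. Illustrative type, not from the talk. */
typedef struct {
    int load_idx;
    int store_idx;
} epoch_t;

/* Returns 1 if `later` read a location that `earlier` (a logically earlier,
 * concurrently running epoch) writes: a true read-after-write dependence
 * that the speculation got wrong. */
int violates(const epoch_t *earlier, const epoch_t *later)
{
    return earlier->store_idx == later->load_idx;
}

/* Scans a window of `n` concurrently running epochs in sequential order and
 * returns the index of the first epoch that must be retried, or -1 if none. */
int first_violation(const epoch_t *window, size_t n)
{
    assert(window != NULL || n == 0);
    for (size_t later = 1; later < n; later++)
        for (size_t earlier = 0; earlier < later; earlier++)
            if (violates(&window[earlier], &window[later]))
                return (int)later;
    return -1;
}
```

Applied to the first four threads of the slide, {3, 10}, {19, 21}, {33, 30}, {10, 25}, `first_violation` reports index 3 (Thread 4), because Thread 1 stores hash[10] and Thread 4 loads it.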
![Page 5: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/5.jpg)
Frequently Dependent Scalars
Can identify scalars that always cause dependences
[Diagram: Producer and Consumer threads over time; each executes "… = a" followed by "a = …"]
![Page 6: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/6.jpg)
Frequently Dependent Scalars
Dependent scalars should be synchronized [ASPLOS’02]
[Diagram: Producer and Consumer threads over time; each executes "… = a" followed by "a = …", with Signal(a) in the Producer and Wait(a) in the Consumer]
![Page 7: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/7.jpg)
Frequently Dependent Scalars
Dataflow analysis allows us to deal with complex control flow [ASPLOS’02]
[Diagram: Producer and Consumer threads over time; each executes "… = a" followed by "a = …"]
![Page 8: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/8.jpg)
Communicating Memory-Resident Values
• Synchronize?
• Speculate?
Will speculation succeed?
[Diagram: Producer and Consumer threads over time; each executes "Load *p" followed by "Store *q"]
![Page 9: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/9.jpg)
Speculation vs. Synchronization
Speculation succeeds: efficient
[Diagram: Sequential Execution (four Load *p / Store *q pairs, one after another) versus Speculative Parallel Execution (the four Load *p / Store *q pairs side by side)]
![Page 10: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/10.jpg)
Speculation vs. Synchronization
Speculation fails: inefficient
[Diagram: Sequential Execution (four Load *p / Store *q pairs, one after another) versus Speculative Parallel Execution, where a violation causes Load *p / Store *q pairs to be executed repeatedly]
![Page 11: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/11.jpg)
Speculation vs. Synchronization
• Frequent dependences: Synchronize
• Infrequent dependences: Speculate
[Diagram: Sequential Execution versus Speculative Parallel Execution of four Load *p / Store *q pairs]
![Page 12: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/12.jpg)
Performance Potential
Reducing failed speculation improves performance
Detailed simulation:
• TLS support
• 4-processor CMP
• 4-way issue, out-of-order superscalar
• 10-cycle communication latency
[Chart: Normalized Regional Execution Time (0 to 100), "Original" versus "Perfect memory value prediction"; benchmarks: m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, go]
![Page 13: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/13.jpg)
Hardware vs. Compiler Inserted Synchronization
[Diagram: three Producer/Consumer timelines, each with Store *q in the Producer and Load *p in the Consumer communicating through Memory:
• Speculation
• Hardware-inserted Synchronization [HPCA’02], showing a (stall)
• Compiler-inserted Synchronization [CGO’04], using Signal() and Wait()]
![Page 14: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/14.jpg)
Issues in Synchronizing Memory-Resident Values
Static analysis:
• Which instructions to synchronize?
• Inter-procedural dependences
Runtime:
• Detecting and recovering from improper synchronization
[Diagram: Producer's Store *q and Consumer's Load *p over time]
![Page 15: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/15.jpg)
Outline
• Static analysis
• Runtime checks
• Results
• Conclusions
[Diagram: Producer's Store *q and Consumer's Load *p over time]
![Page 16: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/16.jpg)
Compiler Passes
[Diagram: foo.c is compiled to foo.exe. Passes shown: Front End, Back End, Insert Synchronization, Profile Data Dependences, Create Threads, Schedule Instructions, Decide what to Synchronize]
![Page 17: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/17.jpg)
Example
do { push(&set, element); work(); } while (test);
work()
push(head, entry)
![Page 18: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/18.jpg)
Example
do { push(&set, element); work(); } while (test);
work() { if (condition(&set)) push(&set, element); }
push(head, entry)
![Page 19: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/19.jpg)
Example
do { push(&set, element); work(); } while (test);
work() { if (condition(&set)) push(&set, element); }
push(head, entry) { entry->next = *head; *head = entry; }
Accesses to *head, labeled by call path:
• Load *head(push) and Store *head(push): push called directly from the loop
• Load *head(work, push) and Store *head(work, push): push called from work()
![Page 20: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/20.jpg)
Compiler Passes
[Diagram: foo.c is compiled to foo.exe. Passes shown: Front End, Back End, Insert Synchronization, Profile Data Dependences, Thread Creating, Instruction Scheduling, Decide what to Synchronize]
![Page 21: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/21.jpg)
Example
do { push(&set, element); work(); } while (test);
work() { if (condition(&set)) push(&set, element); }
push(head, entry) { entry->next = *head; *head = entry; }
Accesses to *head: Load *head(push), Store *head(push), Load *head(work, push), Store *head(work, push)
Profile Information
========================================================
Source                     Destination                Frequency
Store *head(push)          Load *head(push)           990
Store *head(push)          Load *head(work, push)     10
Store *head(work, push)    Load *head(push)           10
![Page 22: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/22.jpg)
Compiler Passes
[Diagram: foo.c is compiled to foo.exe. Passes shown: Front End, Back End, Insert Synchronization, Profile Data Dependences, Thread Creating, Instruction Scheduling, Decide what to Synchronize]
![Page 23: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/23.jpg)
Dependence Graph
Nodes: Load *head(push), Store *head(push), Load *head(work, push), Store *head(work, push)
Edge weights (profiled frequencies): 990, 10, 10
Pairs that need to be synchronized can be extracted from the dependence graph
Infrequent dependences: occur in less than 5% of iterations
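One way the 5% rule could be applied to the profiled graph is sketched below. The total number of profiled iterations is not given on the slides, so it is passed in as a parameter; `dep_edge_t`, `is_frequent`, and `select_sync_pairs` are hypothetical names, and the compiler's actual selection pass may differ.

```c
#include <assert.h>
#include <stddef.h>

/* One edge of the profiled dependence graph: a store that fed a load in a
 * later thread, and how many times the profiler observed it. */
typedef struct {
    const char *store;
    const char *load;
    unsigned long count;
} dep_edge_t;

/* A dependence is treated as frequent when it occurs in at least
 * `threshold_pct` percent of the profiled iterations (the slide calls
 * anything below 5% infrequent). */
int is_frequent(const dep_edge_t *e, unsigned long iterations,
                unsigned threshold_pct)
{
    assert(iterations > 0);
    return e->count * 100 >= (unsigned long)threshold_pct * iterations;
}

/* Copies the frequent edges into `out` (capacity >= n) and returns how many
 * were kept. These are the store/load pairs to synchronize; the remaining
 * edges are left to speculation. */
size_t select_sync_pairs(const dep_edge_t *edges, size_t n,
                         unsigned long iterations, unsigned threshold_pct,
                         dep_edge_t *out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (is_frequent(&edges[i], iterations, threshold_pct))
            out[kept++] = edges[i];
    return kept;
}
```

Assuming, for illustration only, that the profile covers 1000 iterations, the edge with frequency 990 sits at 99% and is selected, while the two edges with frequency 10 sit at 1% and are left to speculation.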
![Page 24: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/24.jpg)
Compiler Passes
[Diagram: foo.c is compiled to foo.exe. Passes shown: Front End, Back End, Insert Synchronization, Profile Data Dependences, Thread Creating, Instruction Scheduling, Decide what to Synchronize]
![Page 25: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/25.jpg)
Example
do { push(&set, element); work(); } while (test);
work() { if (condition(&set)) push(&set, element); }
push(head, entry) { entry->next = *head; *head = entry; }
Store *head(push) and Load *head(push), frequency 990: synchronize these
push_clone(head, entry) { wait(); entry->next = *head; *head = entry; signal(head, *head); }
Call to the clone: push_clone(&set, element);
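The cloning step can be written out as compilable C. This is a sketch under assumptions: `tls_wait` and `tls_signal` are hypothetical stand-ins for the slide's wait() and signal() that only count calls, and `node_t`, `loop_body`, and the explicit `condition` flag are introduced for illustration. It also assumes that only the call site in the loop uses the clone, while the call inside work() keeps the unsynchronized push and is left to speculation.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node {
    struct node *next;
} node_t;

/* Hypothetical stand-ins for the TLS primitives: they only count calls.
 * A real system would stall in tls_wait() and forward (addr, value) to
 * the next thread in tls_signal(). */
int wait_calls = 0;
int signal_calls = 0;

void tls_wait(void) { wait_calls++; }

void tls_signal(node_t **addr, node_t *value)
{
    (void)addr;
    (void)value;
    signal_calls++;
}

/* Original push: its accesses to *head are left to speculation. */
void push(node_t **head, node_t *entry)
{
    entry->next = *head;   /* Load *head  */
    *head = entry;         /* Store *head */
}

/* Clone used at the frequently dependent call site. */
void push_clone(node_t **head, node_t *entry)
{
    assert(head != NULL && entry != NULL);
    tls_wait();               /* wait for the previous thread's signal */
    entry->next = *head;
    *head = entry;
    tls_signal(head, *head);  /* forward the address and the new value */
}

/* work() still calls the unsynchronized push. */
void work(node_t **set, node_t *element, int condition)
{
    if (condition)
        push(set, element);
}

/* One loop iteration, i.e. one speculative thread. */
void loop_body(node_t **set, node_t *a, node_t *b, int condition)
{
    push_clone(set, a);
    work(set, b, condition);
}
```

Cloning keeps the synchronization cost on the one call path that the profile marks as frequently dependent; every other caller of push is unchanged.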
![Page 26: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/26.jpg)
Outline
• Static analysis
• Runtime checks
• Results
• Conclusions
[Diagram: Producer's Store *q and Consumer's Load *p over time]
![Page 27: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/27.jpg)
Runtime Checks
• Store *q and Load *p access the same memory address
• No store modifies the forwarded address between Store *q and Load *p
Producer forwards the address to ensure a match between the load and the store: Signal(q, *q);
[Diagram: Producer's Store *q and Consumer's Load *p over time]
![Page 28: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/28.jpg)
Ensuring Correctness
• Store *q and Load *p access the same memory address
• No store modifies the forwarded address between Store *q and Load *p
Hardware support: similar to memory conflict buffer [Gallagher et al., ASPLOS’94]
[Diagram: Producer and Consumer timelines with Store *q, Store *x and Load *p]
![Page 29: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/29.jpg)
Ensuring Correctness
• Store *q and Load *p access the same memory address
• No store modifies the forwarded address between Store *q and Load *p
Hardware support: TLS hardware already knows which locations are stored to
[Diagram: Producer and Consumer timelines with Store *q, Store *y and Load *p]
![Page 30: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/30.jpg)
Outline
• Static analysis
• Runtime checks
• Results
• Conclusions
[Diagram: Producer's Store *q and Consumer's Load *p over time]
![Page 31: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/31.jpg)
Experimental Framework
Underlying architecture:
• 4-processor, single-chip multiprocessor
• speculation supported through coherence
Simulator:
• superscalar, similar to MIPS R14K
• 10-cycle communication latency
• models all bandwidth and contention (detailed simulation)
Benchmarks:
• SPECint95 and SPECint2000, -O3 optimization
[Diagram: blocks labeled P and C connected by a Crossbar]
![Page 32: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/32.jpg)
Parallel Region Coverage
Coverage is significant
Average coverage: 54%
[Chart: Parallel Region Coverage (0 to 100) per benchmark: go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp]
![Page 33: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/33.jpg)
Compiler-Inserted Synchronization
Seven benchmarks speed up by 5% to 46%
[Chart: Normalized Regional Execution Time (0 to 100), broken down into Failed Speculation, Synchronization Stall, Other, and Busy. Bars per benchmark: U = No synchronization inserted, C = Compiler-Inserted Synchronization. Benchmarks: go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp. Speedups marked on the chart: 10%, 46%, 13%, 5%, 8%, 5%, 21%]
![Page 34: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/34.jpg)
Compiler- vs. Hardware-Inserted Synchronization
Compiler and hardware [HPCA’02] each benefit different benchmarks
[Chart: Normalized Regional Execution Time (0 to 100), broken down into Failed Speculation, Synchronization Stall, Other, and Busy. Bars per benchmark: C = Compiler-Inserted Synchronization, H = Hardware-Inserted Synchronization. Benchmarks: go, m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp. Annotations: "Hardware does better", "Compiler does better"]
![Page 35: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/35.jpg)
Combining Hardware and Compiler Synchronization
The combination is more robust than either technique individually
[Chart: Normalized Regional Execution Time (0 to 100), broken down into Failed Speculation, Synchronization Stall, Other, and Busy. Bars per benchmark: C = Compiler-inserted synchronization, H = Hardware-inserted synchronization, B = Combining both. Benchmarks: go, m88ksim, gzip_comp, gzip_decomp, perlbmk, gap]
![Page 36: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/36.jpg)
Related Work
Compiler-inserted:
• Zhai et al., CGO’04
• Cytron, ICPP’86
Hardware-inserted:
• Moshovos et al., ISCA’97
• Cintra & Torrellas, HPCA’02
• Steffan et al., HPCA’02
Also shown: Tsai & Yew, PACT’96
[Diagram labels: Centralized Table, Distributed Table]
![Page 37: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/37.jpg)
Conclusions
Compiler-inserted synchronization for memory-resident value communication:
• Effective in reducing speculation failure
• Half of the benchmarks speed up by 5% to 46% (regional)
Combining hardware and compiler techniques is more robust:
• Neither consistently outperforms the other
• Can be combined to track the best performer
Memory-resident value communication should be addressed with the combined efforts of the compiler and the hardware
![Page 38: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/38.jpg)
Questions?
![Page 39: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/39.jpg)
The Potential of Instruction Scheduling
Scheduling instructions has additional benefit for some benchmarks
[Chart: execution time (0 to 100), broken down into Failed Speculation, Synchronization Stall, Other, and Busy. Bars per benchmark: E = Early, C = Compiler-Inserted Synchronization, L = Late. Benchmarks: go, m88ksim, ijpeg, gzip_comp_R, gzip_decomp, vpr_place, mcf, crafty, parser, perlbmk, gap, gzip_comp, gcc, bzip2_comp]
![Page 40: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/40.jpg)
Program Performance
[Chart: execution time (0 to 100), broken down into Failed Speculation, Synchronization Stall, Other, and Busy. Bars per benchmark: U = Un-optimized, C = Compiler-Inserted Synchronization, H = Hardware-Inserted Synchronization, B = Both compiler and hardware. Benchmarks: go, m88ksim, ijpeg, gzip_comp_R, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, bzip2_decomp, twolf, gzip_comp]
![Page 41: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/41.jpg)
Which Technique Synchronizes This Load?
[Chart: loads (0 to 100) classified as Synchronized by neither technique, Synchronized by compiler, Synchronized by hardware, or Synchronized by both. Bars per benchmark: U = Un-optimized, C = Compiler-Inserted Synchronization, H = Hardware-Inserted Synchronization, B = Both compiler and hardware. Benchmarks: go, m88ksim, ijpeg, gzip_comp_R, gzip_decomp, vpr_place, gcc, mcf, crafty, parser, perlbmk, gap, bzip2_comp, twolf, gzip_comp]
![Page 42: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/42.jpg)
Ensuring Correctness
• Store *q and Load *p access the same memory address
• No store modifies the forwarded address between Store *q and Load *p
Hardware support: similar to memory conflict buffer [Gallagher et al., ASPLOS’94]
[Diagram: Producer and Consumer timelines with Store *q, Store *x and Load *p]
![Page 43: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/43.jpg)
Ensuring Correctness
• Store *q and Load *p access the same memory address
• No store modifies the forwarded address between Store *q and Load *p
Hardware support: use the forwarded value only if the synchronized pair is dependent
[Flowchart for the Consumer's Load *p, with decisions "q == p" and "Local Store to *p" (YES/NO) leading to "Use Forwarded Value" or "Use Memory Value"]
[Diagram: Producer and Consumer timelines with Store *q, Signal(q); Signal(*q), Store *x and Load *p]
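The decision in the flowchart can be expressed as a single function. The branch order is not recoverable from the transcript, so this minimal sketch assumes the reading that matches the two correctness conditions: the forwarded value is used only when q == p and the consumer has made no local store to *p. It further assumes the forwarded pair (q, *q) arrives as a small record and that a flag reports local stores; `forwarded_t`, `resolve_load`, and `local_store_to_p` are hypothetical names, and the slides place this logic in hardware, not software.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What the producer sends with Signal(q); Signal(*q). */
typedef struct {
    const void *addr;  /* q: the address the producer stored to */
    intptr_t value;    /* *q at the time of the signal */
} forwarded_t;

/* Returns the value the consumer's `Load *p` should observe.
 * The forwarded value is used only when the synchronized pair really is
 * dependent (q == p) and the consumer has not overwritten *p itself. */
intptr_t resolve_load(const forwarded_t *fwd, const intptr_t *p,
                      int local_store_to_p)
{
    assert(fwd != NULL && p != NULL);
    if (fwd->addr == (const void *)p && !local_store_to_p)
        return fwd->value;  /* use forwarded value */
    return *p;              /* use memory value */
}
```

When the addresses differ, the load falls back to an ordinary speculative load, so a wrong guess by the compiler about which store feeds which load costs performance but not correctness.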
![Page 44: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/44.jpg)
Issues in Synchronizing Memory-Resident Values
• Inserting synchronization using compilers
• Ensuring correctness
• Reducing synchronization cost
[Diagram: Producer's Store *q and Consumer's Load *p]
![Page 45: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/45.jpg)
Reducing Cost of Synchronization
Instruction scheduling algorithms are described in [ASPLOS’02]
[Diagram: Producer and Consumer timelines, Before Instruction Scheduling and After Instruction Scheduling]
![Page 46: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/46.jpg)
The Potential of Instruction Scheduling
Scheduling instructions could offer additional benefit
[Chart: Normalized Regional Execution Time (0 to 100), broken down into Failed Speculation, Synchronization Stall, Other, and Busy. Bars per benchmark: E = Perfectly predicting synchronized memory-resident values, C = Compiler-inserted synchronization, L = Consumer stalls until previous thread commits. Benchmarks: m88ksim, ijpeg, gzip_comp, gzip_decomp, vpr_place, gap]
![Page 47: Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads](https://reader036.fdocuments.in/reader036/viewer/2022081603/56815c1d550346895dc9f2d6/html5/thumbnails/47.jpg)
Using More Accurate Profiling Information
gzip_comp is the only benchmark sensitive to profiling input
[Chart: Normalized Regional Execution Time (0 to 100) for gzip_comp, broken down into Failed Speculation, Synchronization Stall, Other, and Busy. Bars: U = No Instruction Scheduling, C = Compiler-Inserted Synchronization, R = Compiler-Inserted Synchronization (profiled with the ref input set)]