Physical Experimentation with Prefetching Helper Threads on Intel's Hyper-Threaded Processors
Transcript of "Physical Experimentation with Prefetching Helper Threads on Intel's Hyper-Threaded Processors"
MASS Lab. Kim Ik Hyun.
Outline

- Introduction
- Software infrastructures for experiments
- Experimental Framework
- Performance Evaluation
- Conclusions
Introduction

- Background: the speed gap between the processor and the memory system results in large memory latency.
- Helper thread: a pre-execution technique that prefetches cache blocks to tolerate memory latency.
Helper thread

A helper thread must:
- Be able to detect dynamic program behavior at run time: the same static load incurs a different number of cache misses in different time phases.
- Be invoked judiciously to avoid potential performance degradation: the hyper-threaded processor's resources are shared or partitioned in multi-threading mode.
- Have a low-overhead thread synchronization mechanism: helper threads need to be activated and synchronized very frequently.
Outline

- Introduction
- Software infrastructures for experiments
- Experimental Framework
- Performance Evaluation
- Conclusions
Software infrastructures for experiments - compiler

Compiler steps to construct helper threads:
1. The loads that incur a large number of cache misses and also account for a large fraction of the total execution time are identified.
2. Loop selection by the compiler.
3. The pre-computation slices and the live-in variables are identified.
4. Trigger points are placed in the original program and the helper thread code is generated.
Delinquent load identification
- The top cache-missing loads, known as delinquent loads, are identified through profile feedback (Intel VTune Performance Analyzer).
- The compiler module identifies the delinquent loads and also keeps track of the cycle costs associated with them.
- The delinquent loads that account for a large portion of the entire execution time are selected as helper-thread targets.
Loop selection
The key criterion is to minimize the overhead of thread management.
- One goal is to minimize the number of helper thread invocations; this can be accomplished by ensuring that the trip count of the outer loop encompassing the candidate loop is small.
- Another goal is that the helper thread, once invoked, runs for an adequate number of cycles; it is desirable to choose a loop that iterates a reasonably large number of times.
Loop selection algorithm
- The analysis starts from the innermost loop that contains the delinquent loads.
- It keeps searching for the next outer loop, until the loop trip count exceeds a threshold or until the next outer loop's trip count is less than twice the trip count of the currently processed loop.
- Searching ends when the analysis reaches the outermost loop within the procedure boundary.
Slicing
Slicing identifies the instructions to be executed in the helper threads.
1. Within the selected loop, the compiler module starts from a delinquent load and traverses the dependence edges backwards.
2. Only the statements that affect the address computation of the delinquent load are selected.
3. All stores to heap objects or global variables are removed from the slice.
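A minimal sketch of what these steps leave behind, assuming a pointer-chasing loop; the code and names are illustrative, not the paper's generated output. Only the address computation (the pointer chase) survives in the helper slice, and the store through `out` is removed:

```c
#include <assert.h>
#include <stddef.h>

typedef struct node { int key; struct node *next; } node;

/* Main-thread loop, kept in full.  The load n->key is delinquent. */
long main_loop(node *n, long *out) {
    long sum = 0;
    for (; n != NULL; n = n->next) {
        sum += n->key;   /* delinquent load */
        *out = sum;      /* store to memory: removed from the slice */
    }
    return sum;
}

/* Helper-thread p-slice: only the statements feeding the address
 * computation remain; each target line is touched with a prefetch hint. */
void helper_slice(node *n) {
    for (; n != NULL; n = n->next)
        __builtin_prefetch(&n->key, 0, 1);  /* GCC/Clang builtin */
}
```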
Live-in variable identification and code generation

- The live-in variables to the helper thread are identified.
- The constructed helper threads are attached to the application program as separate code.
Helper thread execution
When the main thread encounters a trigger point:
- it first passes the function pointer of the corresponding helper thread and the live-in variables, then wakes up the helper thread;
- the helper thread indirectly jumps to the designated helper thread code region, reads in the live-ins, and starts execution.
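The trigger protocol above can be sketched as follows, using POSIX threads in place of the Win32 SetEvent()/WaitForSingleObject() calls used in the experiments; all names here are illustrative.

```c
#include <pthread.h>
#include <stdatomic.h>

typedef void (*slice_fn)(void *livein);

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static slice_fn pending_fn;      /* function pointer passed by the main thread */
static void    *pending_livein;  /* live-in variables for the slice */
static int      done;

/* Helper thread: sleeps until a trigger arrives, then jumps indirectly
 * to the designated slice code with its live-ins. */
static void *helper_main(void *arg) {
    (void)arg;
    pthread_mutex_lock(&mtx);
    while (!done) {
        while (!pending_fn && !done)
            pthread_cond_wait(&cv, &mtx);   /* ~ WaitForSingleObject() */
        if (pending_fn) {
            slice_fn fn = pending_fn; void *in = pending_livein;
            pending_fn = 0;
            pthread_mutex_unlock(&mtx);
            fn(in);                         /* indirect jump to slice code */
            pthread_mutex_lock(&mtx);
        }
    }
    pthread_mutex_unlock(&mtx);
    return 0;
}

/* Trigger point inserted in the main thread. */
static void trigger(slice_fn fn, void *livein) {
    pthread_mutex_lock(&mtx);
    pending_fn = fn; pending_livein = livein;
    pthread_cond_signal(&cv);               /* ~ SetEvent() */
    pthread_mutex_unlock(&mtx);
}

static void shutdown_helper(void) {
    pthread_mutex_lock(&mtx);
    done = 1;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mtx);
}

/* Tiny demo slice: records the live-in it received. */
static _Atomic long seen;
static void demo_slice(void *livein) { atomic_store(&seen, *(long *)livein); }

static long run_demo(void) {
    pthread_t h;
    long livein = 42;
    pthread_create(&h, 0, helper_main, 0);
    trigger(demo_slice, &livein);
    while (atomic_load(&seen) == 0) ;       /* wait until the helper has run */
    shutdown_helper();
    pthread_join(h, 0);
    return atomic_load(&seen);
}
```

The mutex/condvar pair stands in for the event object; the later slides note that this user-level path is still costly, which is what motivates the experimental hardware synchronization mechanism.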
Software infrastructures for experiments - EmonLite

- A light-weight mechanism to monitor dynamic events such as cache misses at a very fine sampling granularity.
- Profiles through direct use of the performance monitoring events supported on Intel processors.
- Lets the compiler instrument any location in the program code to read directly from the Performance Monitoring Counters (PMCs).
- Supports dynamic optimizations such as dynamic throttling of helper thread activation and termination.
EmonLite vs. VTune
- VTune provides only a summary sampling profile for the entire program execution, whereas EmonLite provides a chronology of the performance monitoring events.
- VTune's sampling-based profiling relies on PMC buffer overflow to trigger an event exception handler registered with the OS; EmonLite reads the counter values directly from the PMCs by executing four assembly instructions.
Components of EmonLite
- emonlite_begin(): initializes and programs a set of EMON-related Model Specific Registers (MSRs).
- emonlite_sample(): reads the counter values from the PMCs and is inserted in the user code of interest.
Implementation of EmonLite
1. The delinquent loads are identified and appropriate loops are selected.
2. The compiler inserts the instrumentation code into the user program.
3. The compiler inserts code to read the PMC values once every few iterations.
Example of EmonLite code instrumentation
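A hypothetical sketch of the instrumentation pattern just described: sample the counters once every few loop iterations and record a chronology. Here read_pmc() stands in for the handful of instructions that would read a real PMC via RDPMC; it is an injected function so the sketch runs anywhere, and all names are assumptions, not EmonLite's actual API.

```c
#include <assert.h>

#define SAMPLE_PERIOD 4   /* assumed sampling interval, in iterations */
#define MAX_SAMPLES   64

typedef unsigned long long (*pmc_read_fn)(void);

static unsigned long long samples[MAX_SAMPLES];  /* the recorded chronology */
static int nsamples;

/* Would program the EMON-related MSRs on real hardware. */
static void emonlite_begin(void) { nsamples = 0; }

/* Reads the counter and appends it to the chronology. */
static void emonlite_sample(pmc_read_fn read_pmc) {
    if (nsamples < MAX_SAMPLES)
        samples[nsamples++] = read_pmc();
}

/* A user loop instrumented the way steps 2-3 describe. */
static int instrumented_loop(int n, pmc_read_fn read_pmc) {
    emonlite_begin();
    for (int i = 0; i < n; i++) {
        /* ... loop body containing the delinquent load ... */
        if (i % SAMPLE_PERIOD == 0)
            emonlite_sample(read_pmc);
    }
    return nsamples;
}

/* Fake monotonically increasing counter so the sketch is runnable. */
static unsigned long long fake_ctr;
static unsigned long long fake_pmc(void) { return fake_ctr += 100; }
```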
Example usage of EmonLite
Outline

- Introduction
- Software infrastructures for experiments
- Experimental Framework
- Performance Evaluation
- Conclusions
Experimental Framework
System configuration
Experimental Framework
Hardware management in Intel hyper-threaded processors
Experimental Framework
Two thread synchronization mechanisms:
- The Win32 APIs SetEvent() and WaitForSingleObject() can be used for thread management.
- A hardware mechanism, actually implemented in real silicon as an experimental feature.
Experimental Framework
- SPEC CPU2000 benchmarks: MCF and BZIP2 from SPEC CINT2000, ART from SPEC CFP2000
- MST and EM3D from the Olden benchmark suite
- Best compile options
- VTune Analyzer
Thread pinning
- The Windows OS periodically reschedules a user thread on different logical processors.
- A user thread and its helper thread, as two OS threads, could potentially compete with each other to be scheduled on the same logical processor.
- The compiler therefore adds a call to the Win32 API SetThreadAffinityMask() to manage thread affinity.
Thread pinning
Normalized execution time without thread pinning
Helper threading scenarios
Static trigger
- Loop-based trigger: inter-thread synchronization occurs only once for every instance of the targeted loop.
- Sample-based trigger: the helper thread is invoked once every few iterations of the targeted loop.
Helper threading scenarios
Dynamic trigger
- Helper threads may not always be beneficial.
- Building on the sample-based trigger, the main thread dynamically decides whether or not to invoke a helper thread for a particular sample period.
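One possible shape of that per-period decision, fed by EmonLite samples; the miss-rate metric and threshold are assumed tuning choices, not values from the paper:

```c
#include <assert.h>

/* Assumed threshold: below this many misses per thousand iterations,
 * prefetching is unlikely to pay for the helper-thread overhead. */
#define MISSES_PER_KILO_ITER_THRESHOLD 50

/* Decide at the start of a sample period whether to wake the helper,
 * based on the counters observed during the previous period. */
int should_invoke_helper(unsigned long long misses_last_period,
                         unsigned long long iters_last_period) {
    if (iters_last_period == 0)
        return 1;  /* no history yet: default to prefetching */
    unsigned long long miss_rate =
        (misses_last_period * 1000) / iters_last_period;
    return miss_rate >= MISSES_PER_KILO_ITER_THRESHOLD;
}
```

The main thread would call this at each trigger point, so helper invocation throttles itself off during phases where the targeted loads happen to hit in the cache.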
Outline

- Introduction
- Software infrastructures for experiments
- Experimental Framework
- Performance Evaluation
- Conclusions
Performance Evaluation
Speedup of static trigger
Performance Evaluation
- LO vs. LH: LH, by 1.8%
- SO vs. SH: SH, by 5.5%
- LO vs. SO: LO (except EM3D)
- LH vs. SH: as the thread synchronization cost becomes even lower, the sample-based trigger is expected to be more effective.
Dynamic behavior of performance events with and without helper threading
Outline

- Introduction
- Software infrastructures for experiments
- Experimental Framework
- Performance Evaluation
- Conclusions
Conclusions
Impediments to speedup:
- Potential resource contention with the main thread must be minimized so as not to degrade the performance of the main thread.
- Dynamic throttling of helper thread invocation is important for achieving effective prefetching benefit without suffering potential slowdown.
- Having very light-weight thread synchronization and switching mechanisms is crucial.
Conclusions
Future work:
- Run-time mechanisms to develop a practical dynamic throttling framework.
- Lighter-weight user-level thread synchronization mechanisms.