Temperature-Sensitive Loop Parallelization for Chip Multiprocessors
Sri HK Narayanan, Guilin Chen, Mahmut Kandemir, Yuan Xie
Embedded Mobile Computing Center (EMC2), The Pennsylvania State University
International Conference on Computer Design (ICCD), October 2-5, 2005, San Jose
Outline
- Motivation
- Related Works
- Our Approach
- Example
- Experimental Results & Conclusion
Motivation
- Thermal hotspots are a cause for concern
  - Caused by increasing power density
  - Can result in permanent chip damage
- How to avoid damage
  - Cooling techniques
- How to prevent hotspots
  - Hardware techniques
  - This paper proposes a compiler-directed technique to avoid hotspots in CMPs
Related work: Dynamic Thermal Management
- When one unit overheats, migrate its functionality to a distant, spare unit
  - Dual pipeline (Intel, ISQED '02)
  - Spare register file (Skadron et al. 2003)
  - Separate core (CMP) (Heo et al., ISLPED 2003)
  - Microarchitectural clusters (Intel, ICCD 2004)
- Raises many interesting issues
  - Cost-benefit tradeoff for extra area
  - Use both resources (scheduling)
  - Run-time thermal sensing/estimation
- Yesterday's UC Riverside paper in Session 2.2 proposes a run-time thermal tracking method
Related work: Design-time techniques
- MDL @ PSU:
  - Thermal-Aware IP Virtualization and Placement for Networks-on-Chip Architecture, ICCD 2004
  - Thermal-Aware Allocation and Scheduling for MPSOC Design, DATE 2005
  - Thermal-Aware Floorplanning Using Genetic Algorithms, ISQED 2005
  - Thermal-Aware Voltage-Island Architecting, the other paper in this session
- Other groups:
  - Thermal-Aware High Level Synthesis (Northwestern Univ., Memik, R. Dick) (ISLPED 2005, ASP-DAC 2006)
  - Many more in this conference
- Industry:
  - Gradient Design Automation (a start-up that showcased at DAC 2005)
CMP
- "Intel researchers and scientists are experimenting with many tens of cores, potentially even hundreds of cores per die, per single processor die ..." (Justin R. Rattner, Intel director of the Corporate Technology Group, Spring 2005 IDF)
- Last night: panel discussion on CMP
- Industry examples:
This paper: compiler approach
- Temperature- and performance-sensitive loop scheduling
  - Schedules different loop iterations on the CMP
  - Data-locality aware and hence performance aware
- Intuition behind the approach
  - Let "hot" cores idle while cool cores work
  - Static scheduling of parallelized loop iterations at compile time
How can the compiler schedule temperature-aware code?
- This work targets loop-intensive programs run on embedded CMPs
- Loop nests are divided into chunks, each taking a profiled number of cycles
- Let the starting temperature of a processor be Tc
- The temperature after executing the chunk is Tc' = F(Tc, chunk cycles, floorplan, power)
- The chunk cycles and the power are obtained by profiling the code
- The floorplan and physical parameters remain constant
Thermal modeling
- Want a good model of chip temperature
  - That accounts for adjacency and package
  - That does not require detailed designs
  - That is fast enough for practical use
- A compact model based on thermal R, C (HotSpot)
- Parameterized to automatically derive a model based on various
  - Architectures
  - Power models
  - Floorplans
  - Thermal packages
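Temperature Estimation
- The temperature of each block depends on the power consumption and the location of the blocks.
- The thermal resistance $R^t_{ij}$ of PE$_i$ with respect to PE$_j$ can be represented by the units of temperature rise at PE$_i$ due to one unit of power dissipated at PE$_j$.

$$
R^t =
\begin{bmatrix}
R^t_{11} & R^t_{12} & \cdots & R^t_{1m} \\
R^t_{21} & R^t_{22} & \cdots & R^t_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
R^t_{m1} & R^t_{m2} & \cdots & R^t_{mm}
\end{bmatrix}
$$

$$
\begin{bmatrix} T_1 \\ T_2 \\ \vdots \\ T_m \end{bmatrix}
=
\begin{bmatrix}
R^t_{11} & R^t_{12} & \cdots & R^t_{1m} \\
R^t_{21} & R^t_{22} & \cdots & R^t_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
R^t_{m1} & R^t_{m2} & \cdots & R^t_{mm}
\end{bmatrix}
\begin{bmatrix} P_1 \\ P_2 \\ \vdots \\ P_m \end{bmatrix}
$$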
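As a rough illustration of the relation above, the following C sketch computes the temperature vector as the product of a thermal resistance matrix and a power vector. The function name `estimate_temps`, the fixed size `M`, and any sample values are assumptions made for illustration; in the flow described in these slides the matrix would come from a thermal model such as HotSpot, not from hand-written constants.

```c
#include <assert.h>
#include <stddef.h>

#define M 2 /* number of processing elements (illustrative assumption) */

/* Computes T = R * P, i.e. T[i] = sum over j of R[i][j] * P[j]:
 * the temperature at PE i is the sum of the contributions of the
 * power dissipated at every PE j. All names here are hypothetical. */
void estimate_temps(double R[M][M], double P[M], double T[M])
{
    assert(R != NULL && P != NULL && T != NULL);
    for (size_t i = 0; i < M; i++) {
        T[i] = 0.0;
        for (size_t j = 0; j < M; j++)
            T[i] += R[i][j] * P[j];
    }
}
```

The off-diagonal entries are what make the model account for adjacency: power dissipated on one core raises the temperature of its neighbours as well.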
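Running Example

Jacobi's algorithm:

    for (i = 1; i <= 600; i++)
      for (j = 1; j <= 1000; j++)
        B[i][j] = (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) / 4;

Parallelized algorithm for 5 cores:

    for (i = k*120 + 1; i <= (k+1)*120; i++)
      for (j = 1; j <= 1000; j++)
        B[i][j] = (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]) / 4;

Basic parallel schedule (rows are time slots, columns are core numbers, entries are iteration chunk numbers):

| Time | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 |
|------|----|----|----|----|----|----|----|----|
| 1    | 0  | 6  | 12 | 18 | 24 |    |    |    |
| 2    | 1  | 7  | 13 | 19 | 25 |    |    |    |
| 3    | 2  | 8  | 14 | 20 | 26 |    |    |    |
| 4    | 3  | 9  | 15 | 21 | 27 |    |    |    |
| 5    | 4  | 10 | 16 | 22 | 28 |    |    |    |
| 6    | 5  | 11 | 17 | 23 | 29 |    |    |    |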
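The mapping behind the basic schedule can be sketched in a few lines of C. The chunk size of 20 iterations is an assumption inferred from the 600 outer-loop iterations and the chunk numbers 0 to 29; the function names are hypothetical and only serve the illustration.

```c
#include <assert.h>

enum { ITERS = 600, CHUNKS = 30, CORES_USED = 5 };
enum {
    CHUNK_ITERS = ITERS / CHUNKS,          /* 20 iterations per chunk (assumed) */
    CHUNKS_PER_CORE = CHUNKS / CORES_USED  /* 6 chunks per core */
};

/* First value of the loop index i covered by chunk c. */
int chunk_first_iter(int c)
{
    assert(c >= 0 && c < CHUNKS);
    return c * CHUNK_ITERS + 1;
}

/* Last value of the loop index i covered by chunk c (inclusive). */
int chunk_last_iter(int c)
{
    assert(c >= 0 && c < CHUNKS);
    return (c + 1) * CHUNK_ITERS;
}

/* Basic schedule: core k runs chunks 6k .. 6k+5 back to back. */
int basic_core(int c)
{
    assert(c >= 0 && c < CHUNKS);
    return c / CHUNKS_PER_CORE;
}

/* Time slot (1-based) in which chunk c runs under the basic schedule. */
int basic_slot(int c)
{
    assert(c >= 0 && c < CHUNKS);
    return c % CHUNKS_PER_CORE + 1;
}
```

Under this mapping the six chunks of core 0 cover i = 1 to 120, which agrees with the bounds `k*120+1` and `(k+1)*120` of the parallelized loop for k = 0.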
Analysis of Basic Schedule

Analysis
- Great locality
- Uses only 5 processors
- Will definitely overheat

| Time | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 |
|------|----|----|----|----|----|----|----|----|
| 1    | 0  | 6  | 12 | 18 | 24 |    |    |    |
| 2    | 1  | 7  | 13 | 19 | 25 |    |    |    |
| 3    | 2  | 8  | 14 | 20 | 26 |    |    |    |
| 4    | 3  | 9  | 15 | 21 | 27 |    |    |    |
| 5    | 4  | 10 | 16 | 22 | 28 |    |    |    |
| 6    | 5  | 11 | 17 | 23 | 29 |    |    |    |

Assumptions in the example
1. The initial temperature is 0.
2. The threshold temperature is 2.
3. An idle slot reduces the temperature by 1 degree (but not below 0).
4. So at most 2 active slots can be scheduled back to back on one core.
5. The ideal number of active processors at any time is 5.
6. In Jacobi's algorithm, consecutive iteration chunks exhibit locality.
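The toy thermal model in these assumptions is small enough to write down directly. The sketch below assumes that an active slot raises the temperature by 1 degree, which the slides do not state explicitly but which follows from assumptions 2 and 4; the function names are hypothetical.

```c
#include <assert.h>

enum { T_INIT = 0, T_THRESHOLD = 2 };

/* One time slot of the toy model: an active slot heats the core by
 * 1 degree (assumed), an idle slot cools it by 1 degree but never
 * below 0. */
int next_temp(int temp, int active)
{
    assert(temp >= 0);
    if (active)
        return temp + 1;
    return temp > 0 ? temp - 1 : 0;
}

/* Peak temperature reached by a core that runs n chunks back to back,
 * starting from the initial temperature. */
int peak_after_consecutive(int n)
{
    int temp = T_INIT;
    int peak = T_INIT;
    for (int s = 0; s < n; s++) {
        temp = next_temp(temp, 1);
        if (temp > peak)
            peak = temp;
    }
    return peak;
}
```

In the basic schedule every used core runs 6 chunks in a row, so `peak_after_consecutive(6)` exceeds `T_THRESHOLD`, which is the "will definitely overheat" observation in the analysis.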
Pure Temperature Aware Scheduling

Algorithm
- Start with the time slot at 0 and all iterations unscheduled
- While unscheduled iterations exist
  - Select the coolest A processors whose temperature is below the threshold
  - Schedule chunks on those processors in the current time slot
  - Reduce the number of chunks left to schedule
  - Increase the time slot by 1

Analysis
- Poor locality
- 1 extra time slot is used
- No temperature problems
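A minimal C sketch of this loop on the toy example follows. It assumes 8 cores, 30 chunks, A = 5 (the ideal number of active processors from the example assumptions), a threshold of 2, and a 1-degree rise per active slot; ties between equally cool cores go to the lowest core index, which is also an assumption. The function name and the `-1` idle marker are hypothetical.

```c
#include <assert.h>

enum { CORES = 8, CHUNKS = 30, ACTIVE = 5, THRESHOLD = 2, MAX_SLOTS = 64 };

/* Fills schedule[s][p] with the chunk run by core p in slot s (0-based),
 * or -1 if the core idles. Returns the number of time slots used. */
int temperature_aware_schedule(int schedule[MAX_SLOTS][CORES])
{
    int temp[CORES] = {0}; /* every core starts at temperature 0 */
    int next_chunk = 0;
    int slot = 0;

    while (next_chunk < CHUNKS) {
        assert(slot < MAX_SLOTS);
        int active[CORES] = {0};
        int left = CHUNKS - next_chunk;
        int wanted = left < ACTIVE ? left : ACTIVE;

        /* Select up to `wanted` coolest cores below the threshold. */
        for (int n = 0; n < wanted; n++) {
            int best = -1;
            for (int p = 0; p < CORES; p++) {
                if (active[p] || temp[p] >= THRESHOLD)
                    continue;
                if (best < 0 || temp[p] < temp[best])
                    best = p;
            }
            if (best < 0)
                break; /* no eligible core left in this slot */
            active[best] = 1;
        }

        /* Place chunks on the selected cores and update temperatures. */
        for (int p = 0; p < CORES; p++) {
            if (active[p]) {
                schedule[slot][p] = next_chunk++;
                temp[p] += 1;
            } else {
                schedule[slot][p] = -1;
                if (temp[p] > 0)
                    temp[p] -= 1;
            }
        }
        slot++;
    }
    return slot;
}
```

On the toy example this uses 7 time slots, one more than the basic schedule, and consecutive chunks end up on different cores, which is why locality suffers.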
Pure Temperature Aware Scheduling

Original schedule: core Pk (k = 0 to 4) runs chunks 6k to 6k+5 in time slots 1 to 6; P5 to P7 stay idle.

Pure temperature aware schedule (chunks placed in each time slot):

| Slot | Chunks             |
|------|--------------------|
| 1    | 0, 1, 2, 3, 4      |
| 2    | 5, 6, 7, 8, 9      |
| 3    | 10, 11, 12, 13, 14 |
| 4    | 15, 16, 17, 18, 19 |
| 5    | 20, 21, 22, 23     |
| 6    | 24, 25, 26, 27, 28 |
| 7    | 29                 |
Pure Locality Aware Scheduling

Algorithm
- Start with a clean slate
- For each iteration chunk
  - Schedule it on the processor with which it has the greatest locality, keeping at most two chunks together
  - If more slots are required (when all processors are exhausted), increase the schedule length
  - Otherwise move to the next processor

Schedule snapshots:
- C = { I0, I1, I2, I3, I4 }: empty schedule with time slots 1 to 6
- C = { I4, I5, I6, I7, I8 }: chunk 0 in slot 1, chunk 1 in slot 2, chunk 2 in slot 4, chunk 3 in slot 6
- C = { I26, I27, I28, I29 }: slots 1 to 6 filled as in the final schedule, chunks 26 to 29 not yet placed
- C = { }: final schedule

Final schedule (chunks placed in each time slot):

| Time | Chunks           |
|------|------------------|
| 1    | 0, 4, 8, 12, 16  |
| 2    | 1, 5, 9, 13, 17  |
| 3    | 20, 22, 24       |
| 4    | 2, 6, 10, 14, 18 |
| 5    | 21, 23, 25       |
| 6    | 3, 7, 11, 15, 19 |
| 7    | 26, 28           |
| 8    | 27, 29           |

Analysis
- Very good locality
- However, 2 extra time slots are used
- No temperature problems
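Locality and temperature aware scheduling

Algorithm
- Use temperature aware scheduling to obtain the schedulable slots
- Use locality aware scheduling to assign chunks to these slots

Schedulable slots, C = { I0, I1, I2, I3, I4 }:

| Time | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 |
|------|----|----|----|----|----|----|----|----|
| 1    | ■  | ■  | ■  | ■  | ■  |    |    |    |
| 2    | ■  | ■  |    |    |    | ■  | ■  | ■  |
| 3    |    |    | ■  | ■  | ■  | ■  | ■  |    |
| 4    | ■  | ■  | ■  | ■  |    |    |    | ■  |
| 5    |    |    |    |    | ■  | ■  | ■  | ■  |
| 6    | ■  | ■  | ■  | ■  | ■  |    |    |    |
| 7    |    |    |    |    |    | ■  |    |    |

Final schedule, C = { }:

| Time | P0 | P1 | P2 | P3 | P4 | P5 | P6 | P7 |
|------|----|----|----|----|----|----|----|----|
| 1    | 0  | 4  | 8  | 12 | 16 |    |    |    |
| 2    | 1  | 5  |    |    |    | 20 | 24 | 27 |
| 3    |    |    | 9  | 13 | 17 | 21 | 25 |    |
| 4    | 2  | 6  | 10 | 14 |    |    |    | 28 |
| 5    |    |    |    |    | 18 | 22 | 26 | 29 |
| 6    | 3  | 7  | 11 | 15 | 19 |    |    |    |
| 7    |    |    |    |    |    | 23 |    |    |

Analysis: best of both worlds
- Great locality
- No temperature problems
- Good performance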
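The two steps can be sketched in C on the toy example. This is a simplified illustration under stated assumptions, not the authors' implementation: the first function is a temperature-aware pass that only marks which (slot, core) cells may be used (8 cores, 30 chunks, 5 active cores per slot, threshold 2, 1-degree rise per active slot, ties to the lowest core index), and the second is a simple locality heuristic that hands consecutive chunk numbers to the same core. All names are hypothetical.

```c
#include <assert.h>

enum { CORES = 8, CHUNKS = 30, ACTIVE = 5, THRESHOLD = 2, MAX_SLOTS = 64 };

/* Step 1: temperature-aware pass. Sets cell[s][p] to 1 where core p may
 * run a chunk in slot s and to 0 elsewhere; returns the slots used. */
int find_schedulable_slots(int cell[MAX_SLOTS][CORES])
{
    int temp[CORES] = {0};
    int remaining = CHUNKS;
    int slot = 0;

    while (remaining > 0) {
        assert(slot < MAX_SLOTS);
        int wanted = remaining < ACTIVE ? remaining : ACTIVE;

        for (int p = 0; p < CORES; p++)
            cell[slot][p] = 0;

        /* Pick the coolest eligible cores, one at a time. */
        for (int n = 0; n < wanted; n++) {
            int best = -1;
            for (int p = 0; p < CORES; p++)
                if (!cell[slot][p] && temp[p] < THRESHOLD &&
                    (best < 0 || temp[p] < temp[best]))
                    best = p;
            if (best < 0)
                break;
            cell[slot][best] = 1;
            remaining--;
        }

        /* Active cores heat up by 1, idle cores cool by 1 (floor 0). */
        for (int p = 0; p < CORES; p++) {
            if (cell[slot][p])
                temp[p] += 1;
            else if (temp[p] > 0)
                temp[p] -= 1;
        }
        slot++;
    }
    return slot;
}

/* Step 2: locality pass. Walks the marked cells core by core and hands
 * out consecutive chunk numbers, so neighbouring chunks share a core.
 * Afterwards cell[s][p] holds a chunk number, or -1 for an idle cell. */
void assign_by_locality(int cell[MAX_SLOTS][CORES], int slots)
{
    int chunk = 0;
    for (int p = 0; p < CORES; p++)
        for (int s = 0; s < slots; s++)
            cell[s][p] = cell[s][p] ? chunk++ : -1;
    assert(chunk == CHUNKS);
}
```

On the toy example this heuristic yields the same chunk numbers per time slot as the final schedule above (for instance 1, 5, 20, 24, 27 in slot 2), while keeping the 7-slot length of the temperature-aware pass.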
Phase 1 - Profiling
- Input: the application code (example below)
- Output: cycle times, chunk sizes, energy consumption

Phase 2 - Temperature Sensitive Scheduling
- Scheduler + HotSpot, using the profile and the architecture details
- Output: temperature sensitive schedule

Phase 3 - Locality Based Scheduling
- Scheduler
- Output: temperature & locality sensitive schedule

Phase 4 - Code Generation
- Code generator + Omega Library
- Output: optimized, temperature sensitive code

Example input code:

    #include <stdio.h>

    #define N 5000
    #define ITER 1

    int du1[N], du2[N], du3[N];
    int au1[N][N][2], au2[N][N][2], au3[N][N][2];
    int a11 = 1, a12 = -1, a13 = -1;
    int a21 = 2, a22 = 3, a23 = -3;
    int a31 = 5, a32 = -5, a33 = -2;
    int l;
    int sig = 1;

    int main()
    {
      int kx;
      int ky;
      int kz;

      printf("Thread:%d\n", mp_numthreads());

      /* Initialization loop */
      for (kx = 0; kx < N; kx = kx + 1) {
        for (ky = 0; ky < N; ky = ky + 1) {
          for (kz = 0; kz <= 1; kz = kz + 1) {
            au1[kx][ky][kz] = 1;
            au2[kx][ky][kz] = 1;
            au3[kx][ky][kz] = 1;
          }
        }
      }
    } /* main */
Experiments

Five loop-intensive codes were tested.

| Benchmark | Cycles (millions) | Energy (J) |
|-----------|-------------------|------------|
| 3step-log | 1487              | 1894686.2  |
| Adi       | 438               | 1239551.1  |
| Btrix     | 1351              | 80918.1    |
| Eflux     | 56                | 80918.1    |
| Tsf       | 1799              | 2548001.6  |
adi - Threshold Temperature 88 ºC
[Plot: x-axis "Percentage of Execution" (0 to 100); y-axis ticks from 60 to 150; series: base, temperature-sensitive]
eflux - Threshold Temperature 88 ºC
[Plot: x-axis "Percentage of Execution" (0 to 100); y-axis ticks from 70 to 90; series: base, temperature-sensitive]
adi - Threshold Temperature 88 ºC
[Plot: x-axis "Percentage of Execution"; y-axis ticks from 78 to 88]
eflux - Threshold Temperature 88 ºC
[Plot: x-axis "Percentage of Execution"; y-axis ticks from 71.5 to 74.5]
Sensitivity Analysis adi - Threshold Temperature 87 ºC
Sensitivity Analysis adi - Threshold Temperature 86 ºC
Sensitivity Analysis adi - Threshold Temperature 85 ºC
Sensitivity Analysis adi - Threshold Temperature 84 ºC
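Experiments

| Benchmark Name | Peak Temperature (ºC), Original | Peak Temperature (ºC), Optimized | Average Temperature (ºC), Original | Average Temperature (ºC), Optimized |
|----------------|------|------|-------|------|
| 3step-log      | 95.5 | 80.7 | 80.7  | 78.7 |
| adi            | 146.1 | 86.8 | 100.5 | 85.0 |
| btrix          | 84.9 | 78.9 | 74.1  | 73.9 |
| eflux          | 84.9 | 74.2 | 76.4  | 73.7 |
| tsf            | 87.6 | 74.2 | 80.0  | 73.0 |
| average        | 99.8 | 78.9 | 81.2  | 76.9 |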
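Experiments

| Benchmark Name | Extra Energy Consumption | Extra Execution Cycles |
|----------------|--------------------------|------------------------|
| 3step-log      | 2.40%                    | 1.80%                  |
| adi            | 2.40%                    | 9.10%                  |
| btrix          | 0.80%                    | 0.60%                  |
| eflux          | 7.40%                    | 4.00%                  |
| tsf            | 1.60%                    | 1.20%                  |
| average        | 2.90%                    | 3.30%                  |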
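Conclusion
- Implemented a compiler-directed scheduling algorithm that is both temperature sensitive and performance aware.
- Achieves reductions in both average and peak chip temperature: across the five benchmarks, the mean peak temperature falls from 99.8 to 78.9 ºC and the mean average temperature from 81.2 to 76.9 ºC, for 2.9% extra energy and 3.3% extra execution cycles on average.
- This allows software to take up the burden of preventing chip damage due to thermal effects.
  - Chips can be aggressively scaled
  - Cooling costs can be reduced
  - Lowers the need for hardware-based thermal management schemes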
Thank you!