
IMPROVING THE THERMAL PERFORMANCE OF A FORCED CONVECTION AIR COOLED SOLUTION – PART 2: EFFECT ON SYSTEM-LEVEL PERFORMANCE

John Edward Fernandes, University of Texas at Arlington, Arlington, TX, USA

Saeed Ghalambor, University of Texas at Arlington, Arlington, TX, USA

Richard Eiland, University of Texas at Arlington, Arlington, TX, USA

Dereje Agonafer, University of Texas at Arlington, Arlington, TX, USA

Veerendra Mulay, Facebook Inc., Menlo Park, CA, USA

ABSTRACT

The heat sink assembly of an air-cooled CPU is modified to improve thermal performance of the module-level solution. This modification is employed in a dual-socket server that relies on system fans to move air for forced convection cooling of all heat-generating components on the motherboard. Currently, in the data center industry, the focus is on reducing power consumption through application of energy-efficient cooling solutions. Fans installed in the server operate as a function of CPU die temperatures and represent a parasitic load that must be minimized. Improvement in system-level performance can be quantified in terms of reduced fan and server power consumption. The server is subjected to varying CPU utilizations, and the corresponding average fan speeds and power consumption are reported. Similarly, the reduction in CPU junction temperature and server power at a given utilization can be computed by operating the fans at a constant speed. The difference in thermal performance and power consumption between the baseline and modified heat sink configurations was found to be negligible when a TIM is applied. However, in the absence of a TIM, the modified assembly delivered as much as a 24.4% reduction in CPU die temperature and a 6.2% reduction in server power consumption. In addition, there is no discernible variation in server power consumption between the baseline (with TIM) and modified (with and without TIM) heat sink assemblies. Thus, the modified configuration has possible applications in systems where a TIM may be undesirable or difficult to apply.

Keywords: Server, back plate, interfacial pressure, TIM performance, power consumption

INTRODUCTION

The growth of global commerce, social interaction, news sources and other industries, and their increasing dependence on information technology systems over the last decade, has contributed to the rise of large data centers. These data center facilities, which house and provide infrastructure support for compute, storage and networking devices, are responsible for a significant portion of national and global energy consumption. Recent estimates have reported that data centers account for around 2% (between 1.7% and 2.2%) of the total national electricity consumption [1]. A substantial portion of this total energy is consumed by cooling resources, which maintain safe operating temperatures of the IT hardware and the silicon devices contained within. In general, silicon devices reach their functional limits in the 85 to 105°C range and experience permanent damage when operated at temperatures 15 to 25°C higher than that [2]. The traditional method of air cooling these components requires forced convection over extended surfaces (heat sinks) to remove adequate amounts of heat for safe operation. Figure 1 shows the packaging architecture typical of modern flip chip CPU devices. Heat generated by the die is transferred by conduction through the first thermal interface material (TIM 1), a case-integrated heat spreader (IHS), a second thermal interface material (TIM 2) and the heat sink, and is finally removed by convection to the ambient air.

The overall resistance of the package to heat flow is usually characterized by a junction-to-ambient thermal resistance:

$$R_{ja} = \frac{T_j - T_a}{P_{module}} \qquad (1)$$

Proceedings of the ASME 2013 International Technical Conference and Exhibition on Packaging and Integration of Electronic and Photonic Microsystems

InterPACK2013 July 16-18, 2013, Burlingame, CA, USA

IPACK2013-73107


In Eq. (1), Rja is the junction-to-ambient thermal resistance, Tj the die junction temperature, Ta the temperature of the ambient air flowing through the heat sink, and Pmodule the power (heat) generated by the silicon device. This cumulative package thermal resistance can be broken down into constituents such as the junction-to-case (Rjc), case-to-base (Rcb), and base-to-ambient (Rba) thermal resistances. In typical applications, Rcb, which is the thermal resistance through TIM 2, can be a significant contributor to the total thermal resistance of the package.
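As a small worked illustration of the junction-to-ambient definition, the sketch below evaluates Rja = (Tj - Ta) / Pmodule. The function name and the temperature values are hypothetical and chosen only for illustration; the 95 W figure is the rated TDP quoted for the CPUs in this study, and treating TDP as the dissipated power is itself an assumption.

```python
def junction_to_ambient_resistance(t_junction_c, t_ambient_c, power_w):
    """Return R_ja in K/W, computed as (T_j - T_a) / P_module."""
    if power_w <= 0:
        raise ValueError("power_w must be positive")
    return (t_junction_c - t_ambient_c) / power_w


# Hypothetical operating point: 70.0 C junction, 25.0 C ambient.
# 95 W is the rated TDP from the paper, used here as an assumed heat load.
r_ja = junction_to_ambient_resistance(70.0, 25.0, 95.0)
print(f"R_ja = {r_ja:.3f} K/W")  # prints "R_ja = 0.474 K/W"
```

For a fixed power and ambient temperature, any reduction in Rja translates directly into a lower junction temperature, which is the quantity the fan control in the server responds to.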

Fig. 1: Typical packaging architecture for modern CPUs

Despite this large contribution to the overall thermal resistance, TIM 2 plays an important role by reducing the thermal contact resistance between the CPU package and heat sink. When a conforming TIM such as grease is applied between the CPU and heat sink surfaces, it fills the voids left by the surface roughness of the two materials and forms a better heat transfer path. Rcb, also expressed as RTIM, can be further divided into [3]

$$R_{TIM} = \frac{BLT}{k_{TIM}} + R_{contact,1} + R_{contact,2} \qquad (2)$$

where Rcontact,1 and Rcontact,2 are the contact resistances between the TIM and the CPU and heat sink surfaces, respectively, kTIM is the TIM thermal conductivity and BLT is the bond line thickness of the TIM. To minimize RTIM, the thinnest possible material layer is desired. The BLT can be minimized by applying high pressure between the heat sink and CPU package. Many TIM manufacturers report this pressure-BLT relation in their data sheets as a guide for appropriate application. Part 1 of this work demonstrated the variation in surface contact achieved with various heat sink and backplate assemblies.
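The dependence of RTIM on bond line thickness can be sketched numerically, assuming the common form RTIM = BLT / kTIM plus the two contact resistances. In the sketch below every term is treated as an area-specific resistance in m²·K/W, which is an assumption about units. The BLT and contact resistance values are hypothetical; only the 4.5 W/m·K conductivity corresponds to the grease reported later in this paper.

```python
def tim_resistance(blt_m, k_tim_w_per_mk, r_contact_1, r_contact_2):
    """Area-specific R_TIM in m^2*K/W: BLT / k_TIM plus both contact terms."""
    if blt_m < 0:
        raise ValueError("blt_m must be non-negative")
    if k_tim_w_per_mk <= 0:
        raise ValueError("k_tim_w_per_mk must be positive")
    return blt_m / k_tim_w_per_mk + r_contact_1 + r_contact_2


K_GREASE = 4.5      # W/(m*K), manufacturer value quoted in the paper
R_CONTACT = 5.0e-6  # m^2*K/W per interface, hypothetical

# Hypothetical bond line thicknesses of 50 um and 25 um.
for blt_um in (50.0, 25.0):
    r_tim = tim_resistance(blt_um * 1e-6, K_GREASE, R_CONTACT, R_CONTACT)
    print(f"BLT = {blt_um:4.1f} um -> R_TIM = {r_tim:.3e} m^2*K/W")
```

Halving the BLT halves only the bulk term, so the benefit of higher clamping pressure tapers off once the contact resistances dominate.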

Table 1. Summary of Part 1

| Design | Heat Sink Bolts | Deformation of Backplate (x10^-3 in) | Interfacial Contact, CPU0 (%) | Interfacial Contact, CPU1 (%) |
|---|---|---|---|---|
| Baseline | 2 | 4.55 | 37.24 | 39.24 |
| ILM configuration | 4 | 0.0086 | 72.53 | 78.41 |

The intent of this study (Part 2) is to determine the thermal performance and overall system benefits of the improved contact area achieved with the modified heat sink design proposed in Part 1. In theory, the greater contact area due to uniform pressure should lead to a reduction in RTIM and a lower total package resistance. Only two configurations from Part 1 will be investigated, as summarized in Table 1. The initial baseline configuration consists of a heat sink with two points of loading, less than 50% surface contact at the interface and significant deformation of the backplate. The modified ILM configuration increased the number of heat sink bolts to four and moved them closer to the interface surface, resulting in almost twice the interfacial surface contact and minimal backplate deflection.
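The "almost twice" statement can be checked directly against the interfacial contact percentages in Table 1. The short sketch below uses only those four tabulated values; the dictionary layout and function name are illustrative.

```python
# Interfacial contact (%) from Table 1 of this paper.
contact_pct = {
    "Baseline": {"CPU0": 37.24, "CPU1": 39.24},
    "ILM configuration": {"CPU0": 72.53, "CPU1": 78.41},
}


def contact_ratio(cpu):
    """Ratio of ILM to baseline interfacial contact for one CPU socket."""
    return contact_pct["ILM configuration"][cpu] / contact_pct["Baseline"][cpu]


for cpu in ("CPU0", "CPU1"):
    print(f"{cpu}: {contact_ratio(cpu):.2f}x")
# prints "CPU0: 1.95x" and "CPU1: 2.00x"
```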

Fig. 2: Intel-based Open Compute server [4]

Server under Study

The system under study (as described in Part 1) is the Intel-based Open Compute server [4], shown in Fig. 2. The server contains two CPUs, each with a rated thermal design power (TDP) of 95 W [5]. These components are the primary heat producers in the system and are cooled by two extruded aluminum heat sinks. As discussed, the primary method of cooling is forced convection by four 60 x 60 x 25.4 mm fans [6]. The four-wire fans are operated with a pulse width modulation (PWM) control algorithm. The primary control signal directing operation of the fans is the junction temperature of the CPU packages. As Tj approaches a designated threshold temperature, the PWM duty cycle increases to supply additional cooling air flow to the heat sinks. As with the cooling load of the data center as a whole, the power drawn by the internal server fans is a parasitic load that does not contribute to the primary compute function of the server. For this reason, improved thermal solutions are sought to lower the overall package thermal resistance and hence minimize the required cooling power.

EXPERIMENTAL PROCEDURES AND EQUIPMENT

To evaluate the performance of the initial baseline and modified ILM heat sink assemblies, CPU temperatures, internal server fan speeds and total server power consumption are monitored. Two test methods are used to provide the desired information. Figure 3 shows a simplified block diagram of the test setup employed in this study.

Method A: Externally controlled, internally powered fans

In order to control the effect of convection on the heat sink assemblies, the on-board server fans are externally controlled by an Agilent 33210A arbitrary waveform generator. The control wire of each of the four fans is connected to a breadboard to receive a PWM signal from the generator. The function generator supplies a 5.25 V peak-to-peak pulse at 25 kHz, as specified by the CPU manufacturer [7], to replicate the signal the fans would receive in situ. The remaining fan wires for power and speed sensing stay connected to the motherboard. With the fan speed held constant, any difference in CPU junction temperature between the two configurations should be indicative of an improved thermal solution.
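The pulse parameters that correspond to a given duty cycle follow from the 25 kHz carrier frequency. The sketch below computes the period and the high and low times for the 16% and 50% PWM settings used later in this study; the function name is illustrative, and how these values are entered on a particular function generator is not covered here.

```python
def pwm_timing(frequency_hz, duty_percent):
    """Return (period, high time, low time) in seconds for a PWM signal."""
    if frequency_hz <= 0:
        raise ValueError("frequency_hz must be positive")
    if not 0 <= duty_percent <= 100:
        raise ValueError("duty_percent must be between 0 and 100")
    period_s = 1.0 / frequency_hz
    high_s = period_s * duty_percent / 100.0
    return period_s, high_s, period_s - high_s


PWM_FREQUENCY_HZ = 25_000  # 25 kHz, as stated for Method A

for duty in (16, 50):  # duty cycles used in the constant-speed tests
    period, high, low = pwm_timing(PWM_FREQUENCY_HZ, duty)
    print(f"{duty}% duty: period {period * 1e6:.1f} us, "
          f"high {high * 1e6:.1f} us, low {low * 1e6:.1f} us")
```

At 25 kHz the period is 40 µs, so a 16% duty cycle corresponds to a high time of 6.4 µs per cycle.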

Fig. 3: Representative diagram of test setup employed to enable testing using Methods A and B

Method B: Internally controlled and powered fans

This method replicates the actual performance of the server as would be seen in production. The on-board fan control algorithm is allowed to operate freely. The primary focus of this method is the fan speeds and total server power, which will provide an indication of any system-level improvements of the modified ILM configuration.

Initially, each test method is performed with a thin layer of thermal grease applied using a stencil pattern process. Additional trials of each method are also performed with no thermal interface material applied in order to further highlight the improved thermal performance of the modified heat sink assembly.

In both methods, synthetic compute loads are generated using the ‘lookbusy’ program [8]. This program provides a means of synthetically stressing individual subsystems of the server, such as the CPUs, memory and hard disks, at specified levels. For this test, since the primary focus is on the CPU cooling solution, only the loading on the CPUs is varied. Five design points are chosen: idle (near-zero), 25, 50, 75 and 98% CPU utilization (UCPU). CPU utilization is monitored in two-second intervals using the ‘mpstat’ command available on the Linux operating system. A system health monitoring tool, provided by the motherboard manufacturer, reports the CPU die and case temperatures and the speeds of all four server fans in two-second intervals. The total server power is recorded with a Yokogawa CW121 power meter by connecting voltage and current clamps to the incoming power feed to the server. Power consumption data is recorded in five-second intervals. Additionally, the server inlet ambient temperature is recorded in one-minute intervals using an Omega USB data logger.

Fig. 3: Variation of average (a) CPU0 die temperature and (b) normalized server power consumption with CPU utilization for all heat sink assemblies with TIM applied and tested using Method A (constant fan speed – 16% PWM signal)

For each design point, the CPU die temperature is allowed to increase until steady state is reached. For Method A, steady state is characterized by a constant Tj. For Method B, steady state is usually characterized by an oscillatory chip temperature caused by the response of the fan speed control to the temperature signal from the CPUs; however, in some tests the fan speed control settles at a more constant value. For each method, data is recorded throughout the test, but the results presented are primarily those at steady operation. The steady state period is allowed to proceed for five minutes, and average values of CPU temperature, fan speeds and total server power are taken over it. Three runs of each design point are performed to ensure repeatability and consistency of results.
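The averaging described above can be sketched as follows, assuming each run is logged as a list of readings at a fixed sampling interval and that the final five minutes of the log are the steady-state period. The function names and the synthetic temperature traces are hypothetical; they are not measurements from this study.

```python
from statistics import mean, stdev


def steady_state_average(samples, sample_interval_s, window_s=300.0):
    """Mean of the readings in the last window_s seconds of a log."""
    n = int(round(window_s / sample_interval_s))
    if n < 1 or len(samples) < n:
        raise ValueError("log is shorter than the averaging window")
    return mean(samples[-n:])


def average_over_runs(runs, sample_interval_s, window_s=300.0):
    """Mean and sample standard deviation of the per-run averages."""
    per_run = [steady_state_average(r, sample_interval_s, window_s) for r in runs]
    return mean(per_run), stdev(per_run)


# Synthetic die temperature logs at 2 s intervals (hypothetical data):
# a warm-up ramp followed by a steady plateau that differs slightly per run.
ramp = [40.0 + 0.2 * i for i in range(100)]
runs = [ramp + [plateau] * 200 for plateau in (60.0, 60.5, 59.5)]

avg, spread = average_over_runs(runs, sample_interval_s=2.0)
print(f"steady-state mean = {avg:.2f} C, run-to-run SD = {spread:.2f} C")
```

With two-second sampling the five-minute window holds 150 readings, while the five-second power log would contribute 60 readings to the same window.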

RESULTS

Three heat sink assemblies were tested using the two methods described in the previous section. The first assembly (see Fig. A1 in the appendix) represents the original configuration with the base case heat sink design (BCHS) and the original backplate of 0.092 in thickness. The second assembly (see Fig. A2) consists of the ILM configuration with the modified heat sink presented in Part 1 and a 0.25 in thick backplate (referred to as ‘ILM’ in subsequent plots). The third assembly is the same as the second, except that the base of the modified heat sink has been lapped to a smooth finish within 0.002 in planarity in an attempt to further minimize the surface contact resistance at the TIM 2-heat sink interface (henceforth referred to as ‘ILM + Lapped’). Over the course of testing, the mean temperature of the air entering the server was found to be around 23.4°C (SD = 0.35°C). With such minimal variation, the effect of inlet air temperature on server thermal performance and power consumption can be assumed to be negligible. Thus, any improvements in performance reported in the following sections can be attributed to the type of heat sink assembly and TIM employed.

TIM Applied:

A high performance thermal grease with a manufacturer-specified thermal conductivity of 4.5 W/m·K was used in the following tests. The results of testing using Method A, in which the fan speeds were held fixed at a 16% PWM signal (roughly 2600 rpm), are shown in Fig. 3. In Fig. 3(a), the ILM assembly shows a consistently higher CPU0 die temperature, by as much as 2°C, in comparison to the other assemblies. While this is contrary to what is expected, such minor variations in chip temperature and performance can be considered within experimental error and do not present conclusive evidence of either an improved or a degraded thermal solution. Additionally, the normalized total server power, presented in Fig. 3(b), shows no discernible change.

Method B produced similar results for the fan speed and total system-level power consumption. Figure 4(a) shows the variation in average fan speed across all design points for each of the heat sink assemblies tested. Although a variation of up to 300 rpm in fan speed is seen between the assemblies tested, this is within the error commonly observed with the tachometer internal to the server that reports this information. Figure 4(b) shows results similar to those of Method A in that no discernible variation in total system-level power consumption is observed between the assemblies tested. Based on the results from both methods of testing, we can conclude that a lower performance TIM (as opposed to grease) needs to be employed to noticeably accentuate the difference in thermal performance between the three assemblies being tested.

No TIM Applied (Air Gap Only):

In order to amplify the improved thermal performance of the modified heat sink design, the heat sink assemblies were tested in the absence of a TIM. In theory, this should lead to a significant increase in RTIM, because air replaces the grease as the interstitial medium: the contact resistance terms increase, and the much lower thermal conductivity of air raises the BLT/kTIM term in the expression for RTIM.

Fig. 4: Variation of average (a) fan speed and (b) normalized server power consumption with CPU utilization for all heat sink assemblies with TIM applied and tested using Method B (fan control algorithm active)


Figure 5 shows the results of testing all assemblies when the fans are externally controlled to a 50% PWM signal (around 5300 rpm). Significant reductions in CPU die temperature are observed with the ILM configurations relative to the BCHS, ranging from 17.9% at idle to 24.4% at maximum CPU load. These improvements also become apparent in the total server power, especially at the higher CPU utilizations. The 19.7°C reduction in junction temperature at maximum CPU loading results in a 6.2% reduction in the total server power, presumably due to reduced leakage current.
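The two figures quoted for maximum CPU load, a 19.7°C reduction and a 24.4% reduction, can be related by simple arithmetic. The sketch below assumes that the percentage is taken relative to the BCHS die temperature expressed in °C; the paper does not state this, and the resulting temperatures are back-calculated under that assumption rather than reported values.

```python
def implied_baseline_temperature(delta_c, reduction_percent):
    """Baseline temperature (C) implied by an absolute and a percent reduction."""
    if reduction_percent <= 0:
        raise ValueError("reduction_percent must be positive")
    return delta_c / (reduction_percent / 100.0)


def percent_reduction(baseline_c, modified_c):
    """Percent reduction of modified_c relative to baseline_c."""
    return 100.0 * (baseline_c - modified_c) / baseline_c


# Assumption: 24.4% is computed on the Celsius value of the BCHS temperature.
baseline = implied_baseline_temperature(19.7, 24.4)
modified = baseline - 19.7
print(f"implied BCHS temperature ~ {baseline:.1f} C")
print(f"implied ILM temperature  ~ {modified:.1f} C")
print(f"check: {percent_reduction(baseline, modified):.1f}% reduction")
```

Under this assumption the BCHS die temperature without a TIM would be roughly 81°C at maximum load, against roughly 61°C for the ILM configuration.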

When the fan speed algorithm operates freely in the Method B test setup, the BCHS requires significantly more air flow to maintain the internal target temperature than the ILM configurations, as seen in Fig. 6(a). The increase in fan speed results in more fan power, as is evident in the total server power consumption in Fig. 6(b). In addition, it is evident that the lapping operation does not improve the thermal performance of the modified heat sinks. Thus, increasing capital expenditure for heat sink manufacturing with the addition of a lapping operation would be unwarranted, and we therefore limit the following discussion to the BCHS and ILM assemblies only.

Fig. 5: Variation of average (a) CPU0 die temperature and (b) normalized server power consumption with CPU utilization for all heat sink assemblies with no TIM applied and tested using Method A (constant fan speed – 50% PWM signal)

Fig. 6: Variation of average (a) fan speed and (b) normalized server power consumption with CPU utilization for all heat sink assemblies with no TIM applied and tested using Method B (fan control algorithm active)

DISCUSSION

The initial results with a high performance thermal grease proved inconclusive as to whether the modified heat sink assembly improves thermal performance. This bodes well for TIM manufacturers, as it indicates that the TIM performs sufficiently well in filling air gaps and minimizing the contact resistance due to asperities in the contacting surfaces. Figure 7 provides a comparison of heat sink thermal performance with and without TIM application with the fans controlled at a 16% PWM signal (around 2600 rpm). Employment of a TIM results in up to a 7°C reduction in junction temperature over the range of CPU utilizations possible. Figure 7(b) shows the resulting effect on total server power consumption.

Fig. 7: Performance comparison of BCHS assembly (with TIM applied) and ILM configuration (with and without TIM application) under constant fan speed operation (16% PWM signal)

Fig. 8: Performance comparison of BCHS assembly (with TIM applied) and unlapped ILM configuration (with and without TIM application) under normal server operation

Performance comparisons of the heat sinks with and without the TIM when the fan speed control algorithm operates offer interesting results. Figure 8(a) shows the differences in fan speeds when no TIM is applied to the modified heat sink installed in the ILM configuration. Beyond 25% CPU loading, the fans run faster when no TIM is applied. The fan speed increases in order to maintain the CPU die temperatures within specified operating limits. It is interesting to see in Fig. 8(b) that the increase in fan speed is not large enough to produce appreciable changes in the total server power consumption. Comparable performance with and without TIMs may have implications for applications where a TIM may be undesirable or difficult to apply. Further testing in dynamic loading simulations with the fan speed control operating may provide additional insight into the feasibility of operating the CPUs without a TIM.

An important aspect of the test setup was the torque required to install and assemble the heat sink and backplate combinations. Different loads are applied for assembly of the configurations tested. First, four PEM nuts in the independent loading mechanism (ILM) are torqued to specified values for retention of the CPU package against the socket. Second, the spring screws or nuts holding the heat sink in place are torqued as required to ensure contact between the IHS and the heat sink base. Table 2 summarizes the torque applied at each location. For the baseline configuration, these values are based on the recommendations of the chip [9] and motherboard manufacturers. For the ILM configuration, however, the reported values are the maximum loads under which the server is fully operational. Curiously, the maximum torque applied to each nut of the ILM heat sink is 2.3 kgf-cm with a TIM and 2.9 kgf-cm without grease at the base.

Table 2. Summary of loading torques

| Design | ILM PEM Nut Applied Torque (kgf-cm) | Heat Sink Retention Torque (kgf-cm) |
|---|---|---|
| Baseline | 10.4 | 12.8 |
| ILM Configuration | 1.2 | 2.3 – 2.9 |

CONCLUSIONS

In Part 1 of this work, a modified heat sink assembly with increased loading locations and backplate thickness was designed to improve interfacial contact between the heat sink base and CPU IHS for a server. In this work, the thermal performance of the improved contact was tested and evaluated. Initially, when a thermal interface material was applied, the modified ILM design showed negligible difference when compared with the base case scenario in terms of CPU junction temperature and total server power. However, tested performance without a TIM applied showed up to a 24.4% reduction in die temperature with the modified design at a given CPU loading and fixed fan speed. Results also showed that lapping the base of the modified heat sink does not deliver a noticeable improvement in thermal performance. Additionally, when synthetic loads were applied and the fan speed control algorithm was allowed to operate freely, total server power consumption was comparable for the baseline (with TIM) and ILM (with and without TIM application) configurations. This work has possible implications for applications where improved thermal performance is desired but a TIM is not practical to use. Continued work with dynamic CPU utilization loading and refined assembly techniques may further improve this as a viable cooling solution.

REFERENCES

[1] Koomey, J.G., ‘Growth in Data Center Electricity Use 2005 to 2010’, a report by Analytics Press, completed at the request of The New York Times, 2011.

[2] ASHRAE TC 9.9, ‘IT Equipment Thermal Management and Controls [White Paper]’, 2012.

[3] Gwinn, J.P. and Webb, R.L., ‘Performance and testing of thermal interface materials’, Microelectronics Journal, v.34, pp. 215-222, 2003.

[4] Li, H. and Michael, A., ‘Intel Motherboard Hardware v1.0’, Open Compute Project. http://www.opencompute.org/projects/intel-motherboard/

[5] Intel® Xeon® Processor X5650. http://ark.intel.com/products/47922/Intel-Xeon-Processor-X5650-12M-Cache-2_66-GHz-6_40-GTs-Intel-QPI

[6] Delta Electronics QFR0612UH DC brushless fan. http://www.delta.com.tw/product/cp/dcfans/download/pdf/QFR/QFR60x60x25.4mm.pdf

[7] Intel Specification, ‘4-Wire Pulse Width Modulation (PWM) Controlled Fans’, Revision 1.2, 2004.

[8] ‘Lookbusy – a Synthetic Load Generator’. Accessed from: http://www.devin.com/lookbusy/

[9] Intel Specification, ‘Intel® Xeon® Processor 5500/5600 Series Thermal/Mechanical Design Guide’, March 2010.


APPENDIX

Fig. A1: Heat sink (BCHS) assembly native to server under test

Fig. A2: Modified heat sink installed in ILM configuration
