Novel Graphite-based TIM for High Performance Computing



ABSTRACT

A new class of compressible graphite thermal interface

material (TIM) was installed, tested, and qualified for use in

a new series of IBM high performance computing (HPC)

systems. Each system incorporated several high power

graphics processing unit (300W GPU) assemblies in which a

graphite TIM provided direct contact between bare die GPU

devices and either air cooled heatsinks or water cooled cold

plates. GPU hardware was removed from the system and

exposed to a battery of thermal and mechanical stress tests,

and reinstalled for in-system power age and power cycle tests,

to quantify TIM reliability within a 3-year service life. After

the GPUs were subjected to thermal-mechanical tests, which

spanned accelerated thermal cycling (ATC), deep thermal

cycling (DTC), thermal ship shock, temperature/humidity

exposures, and system shock/vibration tests, the components

were periodically reinstalled into systems to monitor power

stability and to assess thermal reliability. For in-system tests,

continuous power and thermal monitors were incorporated for

all power cycle/power age regimens. Control groups of GPUs

mounted with conventional grease-based TIMs were exposed

to the same battery of thermal-mechanical and in-system

server tests. All GPU hardware used for testing shared a

common mounting design for TIM and cooling hardware

attachments that provided a constant spring force clamping

mechanism over the GPU bare die device area while enabling

module flexure under load throughout stress test and system

operating temperature ranges. Thermal effectiveness was

measured by periodically monitoring the power draw of each

GPU module and an internal device (junction) temperature

over the course of the simulated life cycle. At the end of each

evaluation, all GPU assemblies were then disassembled and

assessed for TIM condition, which was then correlated with

the final thermal resistance and power measurements. Based

on comparison of both initial build performance and final test

results, an optimized mounting construction was developed

that incorporates the compressible graphite TIM.

KEY WORDS: Thermal Interface Material, Thermal

Interface Reliability, Contact Resistance, ASTM D5470,

Compressible Graphite TIM

Mark Hoffmeyer is with IBM Corporation, Systems Technology Group,

Rochester, MN 55901, [email protected], 507-253-6686.

Prashanth Subramanian is with Advanced Energy Technologies LLC.,

Lakewood, OH 44107, [email protected],

216-618-9975.

Rick Beyerle is with Advanced Energy Technologies LLC., Lakewood, OH

44107, [email protected], 216-529-3719.

INTRODUCTION

System Application & Hardware Description

Recently announced IBM High Performance Computing

(HPC) systems [1] incorporate NVIDIA GP100 Graphics

Processing Units (GPUs) powered by Pascal architecture [2]

to accelerate deep learning and computation intensive

applications. Unlike PCIe graphics cards, which mount

vertically and contain built-in fans, the NVLink graphics

platform presents a flat surface bare die for the integrator

(IBM) for high power liquid or forced air cooling. More

precisely, the TSMC-based CoWoS® [3,4] GPU architecture

employs a 2.5D multi-chip device area consisting of a large

bare die GPU chip coupled with four, quadruple stacked

3-dimensional integrated circuit (3DIC) memory chips

that are collectively mounted atop a silicon interposer. The

CoWoS® construction is incorporated onto a 55mm organic

ball grid array (BGA) package that is subsequently surface

mount assembled to a GPU card assembly (Figure 1).

Proximity of the memory to the GPU chip creates a high

performance, high power (300W) package with the

communication speeds necessary for optimized

computational acceleration. Prior to system integration, a

cooling solution must be affixed to the bare die device surface

area and GPU card assembly. While IBM’s HPC system

designs facilitate either air or water cooling, a serviceable low

thermal impedance TIM is critical in the attachment stack up

[5-7]. For air cooled applications, finned heatsinks

customized for both front and rear system installation having

planarized mounting pedestals coupled to high capacity

thermal transport heat pipes are used with a TIM1 solution.

These heatsinks prevent any of the GPUs from thermal

throttling. For water cooled systems, planarized mounting

pedestals on aluminum heat spreaders are used in conjunction

with interconnected cold plates using both TIM1 and TIM2

interfaces between devices, spreaders, and cold plates for an

efficient cooling path.

Phil Mann is with IBM Corporation, Systems Technology Group, Rochester,

MN 55901, [email protected], 507-253-4636.

Advanced Energy Technologies LLC is a subsidiary of GrafTech

International Holdings, Inc.

Novel Graphite-based TIM for High Performance Computing

Mark Hoffmeyer (IBM), Prashanth Subramanian (AET), Rick Beyerle (AET), Phil Mann (IBM)

IBM Systems & Technology Group, Rochester, MN 55901

[email protected], [email protected]

Advanced Energy Technologies LLC, Lakewood, OH 44107

[email protected], [email protected]

978-1-5090-2994-5/$31.00 ©2017 IEEE 243 16th IEEE ITHERM Conference


Fig. 1. Nvidia GP100 GPU Card Assembly with (A) 55 mm BGA package

with device area consisting of a (B) GPU chip and (C) four memory chip

stacks mounted on a silicon interposer.

GPU Application: Thermal Interface Considerations

The bare die GPU packaging assembly described above

presents several significant challenges to ensure creation of

an efficient and reliable bare die thermal interface material

(TIM1) cooling solution. Because the large device surface

areas are not flat, and have a convex surface curvature with

potential non-uniform heat dissipation characteristics, a

preferred TIM1 solution must provide adequate gap filling

capability to accommodate out of flat conditions.

Furthermore, to minimize potential for packaging interaction

reliability issues, the TIM1 solution, including its thermal

performance, compression, and gap filling characteristics

must also be compatible with mechanical load constraints

used to affix the external cooling hardware to the bare die

GPU card assembly.

A specific and limited load range must be carefully selected

and designed into components used to secure the TIM1

interface and affiliated hardware cooling solution. The

specific load range used must prevent shock and vibration

damage, and also minimize the potential for detrimental chip-

package-card assembly interaction issues, including effects

that can undermine stable and reliable TIM1 performance.

Excess loads used to affix an external cooling solution can

drive stresses sufficient to prompt device damage and/or

deformation of the packaging over time, including stresses

that result from materials creep in the overall assembled stack

of the organic GPU BGA package, the GPU board assembly,

and affiliated BGA interconnections. These mechanically

derived stresses can also superimpose with thermal-

mechanical derived stresses that arise from continuous, or

cyclical in service run time and result in complex strains on

the TIM1 that can also prompt thermal performance

deterioration at the device interface. Hardware shape changes

driven by thermal loads and coefficient of thermal expansion

(CTE) mismatch between device, package substrate, and the

GPU card assembly, can drive dynamic strains on the TIM1,

especially with repeated in service system power cycling. An

optimal TIM1 selection for this application must be resilient

to complex, superimposed, static and dynamic chip-package-

heatsink-board assembly interactions to ensure consistent and

reliable function. Given this application description and

various packaging restrictions, the advantages of using

eGRAF® HITHERM™ HT-C3200 thermal interface

material (HT-C3200) [8], as the TIM1, are described in

greater detail as compared to use of other potential TIM1

solutions.

APPLICATION REQUIREMENTS, RESTRICTIONS,

AND TIM SELECTION

A. Compression, Gap Filling Capability, and Thermal

Performance

Testing conducted within IBM using both Instron and

Thermomechanical Analysis (TMA) methods confirms that

HT-C3200 TIM sheets are highly compressible at the low

pressures and load ranges required for the GPU bare die

cooling application. Compression data collected via TMA at

25°C and 75°C (see Figure 2) indicates the material readily

fills a five mil (127 micrometer) gap* at loads ranging from

10-30 psi.


Fig. 2. TMA Compressibility Data for AET HT-C3200. (A) Shows bond line

thickness; (B) shows TIM thickness [9,10]. A compression range and TIM

gap-filling capability in excess of 6 mils (150 micrometers) is observed

independent of applied temperature. *The post-DMA and Instron test sample

thickness measurements indicate a peak compression of the material to

approximately 55-60 micrometers.

Unlike most gel, grease, and phase change TIM alternatives

that rely on intimate contact of conductive filler particles that

are realized under high pressures and small, uniform gaps, the

HT-C3200 material conducts heat most effectively under

a relatively low range of gaps and pressures. Figure 3 illustrates

this performance behavior as collected from material samples

using two test methods. These methods include ASTM

D5470 [11], and use of a custom built IBM thermal tester to

provide in-situ measurements. The custom tool employs top

and bottom copper column sections with affixed stages that

sandwich the TIM under various fixed loads or fixed gaps.

The top section is heated to a controlled temperature, while

the bottom section is cooled with a thermoelectric cooler.

Thermocouples positioned at regular intervals in both column

sections measure the thermal profile across the entire column.

From these data at thermal equilibrium, a temperature drop

across the TIM material can be calculated and resultant

thermal performance properties of the TIM can be identified.
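The column measurement described above can be sketched as a meter-bar style calculation: fit a straight line to the thermocouple readings in each copper section, extrapolate both lines to the faces that touch the TIM, and divide the temperature drop by the heat flux. This is a minimal sketch under stated assumptions, not the IBM tool's actual data reduction. The function names, the thermocouple layout, and the copper conductivity of 390 W/(m·K) are all hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def tim_thermal_impedance(z_hot, t_hot, z_cold, t_cold,
                          z_hot_face, z_cold_face, k_bar=390.0):
    """Estimate TIM thermal impedance in K·cm²/W from column thermocouple data.

    z_* are positions along the column axis in metres, t_* are steady-state
    temperatures in °C. z_hot_face and z_cold_face locate the column faces
    that touch the TIM. k_bar is an assumed copper conductivity in W/(m·K).
    """
    slope_hot, icpt_hot = fit_line(z_hot, t_hot)
    slope_cold, icpt_cold = fit_line(z_cold, t_cold)

    # Extrapolate each fitted profile to the face in contact with the TIM.
    t_face_hot = slope_hot * z_hot_face + icpt_hot
    t_face_cold = slope_cold * z_cold_face + icpt_cold

    # Fourier's law: heat flux (W/m²) from the mean gradient of both sections.
    heat_flux = k_bar * 0.5 * (abs(slope_hot) + abs(slope_cold))

    # Impedance in K·m²/W, converted to K·cm²/W.
    return (t_face_hot - t_face_cold) / heat_flux * 1.0e4
```

Averaging the two gradients assumes negligible side losses between the sections; a large mismatch between them would suggest the column had not reached thermal equilibrium.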


Fig. 3. HT-C3200 TIM thermal performance as a function of applied pressure

measured (A) in situ and (B) by ASTM D5470. The in situ thermal resistance

stabilizes at loads of 15 psi or higher.

As shown in Figure 3, stable, low thermal resistance occurs at

pressures as low as 15 psi (103 kPa), with a corresponding

thermal resistance of approximately 0.2 K·cm²/W when the

TIM is compressed by approximately 3 mils (75

micrometers), see Figure 2.
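To put the quoted figures in context, the short sketch below converts the pressure and compression values and estimates the temperature drop across the TIM for a 300 W device. The 0.2 K·cm²/W and 300 W values come from the text. The 6 cm² heated area is a hypothetical placeholder, since the device area is not stated here, and uniform heat flux is assumed.

```python
PSI_TO_KPA = 6.894757   # 1 psi expressed in kPa
MIL_TO_UM = 25.4        # 1 mil expressed in micrometers


def tim_temperature_drop(power_w, heated_area_cm2, impedance_k_cm2_per_w):
    """Temperature drop across the TIM in K, assuming uniform heat flux."""
    heat_flux_w_per_cm2 = power_w / heated_area_cm2
    return impedance_k_cm2_per_w * heat_flux_w_per_cm2


# 300 W and 0.2 K·cm²/W are from the text; the 6 cm² area is hypothetical.
drop_k = tim_temperature_drop(300.0, 6.0, 0.2)
print(f"15 psi = {15 * PSI_TO_KPA:.0f} kPa, 3 mils = {3 * MIL_TO_UM:.0f} µm")
print(f"Estimated drop across the TIM: {drop_k:.1f} K")
```

Because the drop scales inversely with heated area, any non-uniform heat dissipation across the device would raise the local value above this average.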

Overall, this material performance data, when coupled

with systems modeling and parts characterization work

conducted within IBM, shows that the above attributes meet

the gap filling and thermal performance requirements

for the GPU application, especially so when additional

controls are placed on flatness of external cooling solution

surfaces such that the spreader and heatsink pedestals that

come in contact with the material are flat to within

approximately 25 micrometers.

B. Other Design and Process Considerations

In addition to having adequate gap filling capability and

good thermal performance characteristics, the HT-C3200

material is also well suited for GPU assembly processing.

These attributes include:

1. Ease of assembly with minimal surface preparation: The

material is readily placed onto intended surfaces without

need for additional processing equipment.

2. Surface preparation simplicity: use of the HT-C3200

material does not require complex cleaning of thermal

interface surfaces (e.g. plasma). Simple surface cleaning

preparation using IPA in conjunction with a

particle/shed free cloth wipe to remove potential debris

or potential bulk contamination films ensures adequate

and consistent performance.

3. Chemical stability and chemical compatibility in the

presence of other materials: Because the HT-C3200

TIM is pure graphite, it is chemically inert and does not

degrade in the presence of other chemicals, or materials

that might come in contact with it within a GPU

assembly throughout in service operation temperatures.

This is an important feature for the GPU application

because additional TIMs must be used in close proximity

on spreader plates used in the water-cooled GPU

application. Because the material is generally inert in air

to temperatures in excess of 300°C, it will remain stable

when exposed to extreme temperature or humidity.

4. Cost: total installed cost is low when compared to

most TIM solutions, especially when coupled with ease

of use and GPU assembly process considerations.

C. Product Application Development and Proper Usage

of HT-C3200 as a Thermal Solution

Although the HT-C3200 has many thermal advantages for

use as a TIM solution, it must be noted that the material is

inelastic, and recovers only a fraction – about three percent –

of its original shape or thickness profile when a load is

removed. As such, the use of static, fixed displacement

controlled loads for creating TIM bond lines in assemblies is

not recommended, because a reduced or inconsistent mechanical

load on the TIM may develop over time from card and

component level hardware relaxation, creep, or short

term dynamic shape changes that can take place in service.

These collective changes can create TIM instability and

increased TIM thermal interface resistance, as they drive the

potential formation of low contact force or local air gaps

between the inelastic TIM, the device, and cooling solution

surfaces.

Inferior HT-C3200 TIM performance was clearly

observed on early GPU assemblies that had heatsinks attached

using two different types of cooling hardware fasteners, that

when used in combination, created a fixed displacement load

control condition. Initial builds of GPU hardware assemblies

used four corner spring screws to attach the heatsink, and to

provide a constant spring load on the TIM. The initial builds

also incorporated four, optional, non-influencing fasteners

(NIFs) [12,13] as additional attachment points. These NIF

attachment points, shown in Figure 4, lock the heatsink into a

fixed position with respect to the module device surface, and

provide an added measure of protection against physical

damage that may arise on hardware exposed to random but

significant shock and vibration events that can occur during

component shipping, system shipping, or from unusual

hardware usage scenarios.

Fig. 4. Exploded view of a GPU card-heatsink assembly including: (A) GPU

card, (B) heatsink, (C) optional NIFs (4X), (D) spring loaded attachment

points (4X), and (E) the HT-C3200 TIM used between the device surface and

the heatsink pedestal surface.


With NIFs engaged, the attached cooling solution became

a fixed displacement load design when coupled with the 4-

corner spring screws originally used to provide a constant

spring load on the TIM. This fixed displacement load

condition created unacceptable thermal performance of the

HT-C3200 TIM when GPU assemblies were installed and run

in HPC systems at required powers and fan speeds. When

NIFs were removed from these assemblies to provide a

constant spring load on the inelastic TIM, a markedly

improved, and acceptable in-system thermal performance was

achieved. Figure 5 illustrates the change in HT-C3200

thermal performance as a function of GPU heatsink load.

Confirmation of shape changes that occur on hardware and

the affiliated root cause for hardware load dependent thermal

performance differences of HT-C3200 TIM was provided

using Moiré interferometry on a GPU assembly. Specifically,

Moiré analysis of GPU hardware taken through a heating and

cooling excursion used to simulate device power on / power

off cycling shows that dynamic GPU shape changes occur and

are directly linked to strains that develop in the hardware

packaging from aggregate CTE mismatch between the

organic GPU board, the GPU BGA package, and the large

2.5D silicon on silicon device area on the BGA module.

These relative CTE mismatch driven shape changes are

shown in Figure 6.

Fig. 5. HT-C3200 thermal performance vs. cooling solution hardware attach

design. Temperatures shown are average GPU device Tj measured on GPUs

installed at front and rear locations within the system.

Given generally known CTE differences between an

organic BGA package, an organic printed wiring board

(PWB) assembly, and a large 2.5D silicon on silicon device

area, dynamic Moiré analysis indicates the GPU device area

flattens by approximately 1.2 micrometers for every degree of

temperature increase, and shows this change is reversible

upon cooling. GPU device surfaces are convex at room

temperature, with the convex shape resulting from an

aggregate CTE mismatch derived elastic tensile and

compressive strain distribution present between the organic

carrier and the stacked silicon device. As temperature is

increased, elastic strains between board, BGA carrier and the

Si device are reduced, and result in the part flattening, while

decreases in temperature prompt additional strain on the

assembly and increased part convexity. Since the spring

actuation and NIF assembly of the GPU thermal solution

attachment all takes place and locks in the relative position of

the heatsink to the device area at room temperature, a binding

or buckling condition is created between the hardware as the

GPU assembly attempts to flatten in response to external

heating or power driven temperature rises and affiliated

changes to CTE mismatch induced strain distributions.

These overall effects drive a loss of sufficient contact with the

inelastic HT-C3200 TIM and result in thermal performance

degradation. However, once the NIFs are released, the

hardware binding condition is eliminated and proper thermal

performance can be restored using a pure, constant spring

load control in the design to ensure the TIM stays in contact

with the device and cooling hardware surfaces during in

service operation.
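The scale of this effect can be illustrated with a rough comparison between the reversible shape change and the springback available from the inelastic TIM. The 1.2 micrometer per degree flattening rate and the three percent recovery are taken from the text. The 25°C assembly temperature, 80°C operating temperature, and 200 micrometer starting thickness are hypothetical values chosen only for illustration, and a flatness change is not identical to a local gap.

```python
# Values quoted in the text.
FLATTENING_UM_PER_DEG_C = 1.2   # device area flattening from the Moiré analysis
RECOVERY_FRACTION = 0.03        # elastic recovery of the graphite TIM


def flatness_change_um(assembly_temp_c, operating_temp_c):
    """Reversible change in device flatness between two temperatures (µm)."""
    return FLATTENING_UM_PER_DEG_C * (operating_temp_c - assembly_temp_c)


def elastic_recovery_um(initial_thickness_um):
    """Springback available from the TIM once load is relieved (µm)."""
    return RECOVERY_FRACTION * initial_thickness_um


# Hypothetical inputs: 25 °C assembly, 80 °C operation, 200 µm starting thickness.
shape_change = flatness_change_um(25.0, 80.0)
springback = elastic_recovery_um(200.0)
print(f"shape change {shape_change:.1f} µm, TIM springback {springback:.1f} µm")
```

Under these assumptions the shape change is roughly ten times the available springback, which is consistent with the observation that a locked heatsink position loses contact while a spring load that follows the device does not.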

Note that dynamic CTE mismatch induced shape changes

can be largely responsible for TIM degradation issues when

phase change materials (PCMs), gels, or greases are used [14-

17] as they can drive TIM strain related phenomena such as

grease pumping, gel interface adhesion loss, and related PCM

“healing” instabilities as well.

Based on the above considerations and data analyses,

proper use of HT-C3200 material must be coupled with

constant spring force loads that are incorporated into a given

hardware attachment design. In addition, due to the

inelasticity of the material, reuse is not recommended in the

event that hardware disassembly, rework, or replacement

steps are required. Reused material will not provide gap

filling on hardware with different shape or topographic

profiles.

Fig. 6. Summary of Moiré interferometer analysis of a GPU assembly

illustrating relative CTE driven flatness changes vs. temperature. GPU device

areas are convex at room temperature. A negative change in flatness indicates

less convexity and improved flatness.

PRODUCT QUALIFICATION SUMMARY

Because the general performance characteristics of the

HT-C3200 TIM material, along with its additional processing

advantages, are shown to offer a possible GPU TIM solution,

it was selected as a candidate material for product

qualification test work using the four corner spring screw

heatsink attachment design in the absence of NIF attachments.

An alternate, conventional, high performance TIM grease

solution was also tested in parallel in the same application

product form factor. Details of the testing and test results are

shown below.


A. GPU TIM Product Qualification

The overall TIM qualification took place in two key stages

termed T1 and T2. The T1 stage consisted of early GPU

hardware assembled with a split of hardware samples built

using the HT-C3200 TIM, and samples built using a high

performance thermal grease TIM sandwiched between bare

die GPU devices and pedestals of air cooled heatsinks. All

were assembled using the four-corner spring loaded hardware

in the absence of NIFs, as previously discussed.

These hardware sets were also built with both front and rear

GPU heatsinks as shown in Figure 7, as installed in an IBM

Power™ System 822LC HPC server (Minsky).

A sequence of in-situ accelerated stress tests was run,

simultaneously using all four heatsinks, front and rear, for

functional verification. The test sequence began with System

Level Shock & Vibration (S&V), then proceeded to

Temperature & Humidity (T&H), Thermal Ship Shock (TSS),

Accelerated Thermal Cycling (ATC), and Deep Thermal

Cycling (DTC). Slightly accelerated Power Cycle/Power Age

Tests (PA/PC) were used for additional in-situ system GPU

stressing and continuous functional monitoring of the GPU

hardware.

Fig. 7. Photo of front (F) and rear (R) GPU-heatsink assemblies installed

into an IBM HPC server system.

In all cases, GPU powers and operating junction

temperature (Tj) measurements were collected and monitored

as a figure of merit, with a change in average Tj of 5°C or

more used as the pass/fail criterion, adjusting for slight ambient

temperature fluctuations present in the system operating

environment. A plan view of the IBM 2S2U Minsky HPC

system computer electronics complex is shown in Figure 8

and shows all four GPUs installed (without heatsinks) in front

and rear slot locations.
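The pass/fail check described above can be sketched as follows. The 5°C threshold is from the text. How the ambient adjustment was actually applied is not stated, so this sketch assumes it is a simple subtraction of the mean ambient temperature from the mean Tj at each checkpoint; the function names and sample readings are hypothetical.

```python
from statistics import mean

TJ_SHIFT_LIMIT_C = 5.0  # pass/fail threshold quoted in the text


def tj_shift(baseline_tj, baseline_ambient, current_tj, current_ambient):
    """Ambient-corrected change in average junction temperature (°C).

    Assumes the correction is a subtraction of mean ambient from mean Tj.
    """
    baseline_rise = mean(baseline_tj) - mean(baseline_ambient)
    current_rise = mean(current_tj) - mean(current_ambient)
    return current_rise - baseline_rise


def fails_criterion(shift_c, limit_c=TJ_SHIFT_LIMIT_C):
    """True when the corrected Tj shift reaches or exceeds the limit."""
    return shift_c >= limit_c


# Hypothetical readings at a baseline and a later checkpoint.
shift = tj_shift([70.0, 71.0, 69.0], [25.0, 25.0, 25.0],
                 [73.0, 74.0, 72.0], [26.0, 26.0, 26.0])
print(f"corrected shift {shift:.1f} °C, fail = {fails_criterion(shift)}")
```

Comparing temperature rise above ambient rather than raw Tj keeps a warmer test room from being counted as TIM degradation.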

It is important to note that power fluctuations were fairly

common on early GPU hardware used in development, so

preliminary functional testing and functional verification

measurements on GPUs within HPC systems and T1

hardware were not necessarily tested at peak powers. Typical

preliminary testing occurred using a range of powers

spanning approximately 180-280W. Neither the heatsinks

nor early GPUs used for T1 tests had optimized flatness, so

additional variability in thermal measurements could arise

from an increased gap filling range required for the TIM

solutions. Despite these shortcomings, the primary goal of

the T1 testing was to identify a single TIM solution to bring

forward for final (T2) product qualification. This T2

qualification would consist of all power stable, production

level GPU hardware capable of running at full speeds and

powers in the range of 280-300 W.

B. T1 GPU Test Summary

A summary of the T1 Test plan, sample sizes used, and

corresponding test results are shown in Figure 9. All parts

built with both the HT-C3200 and grease TIM solutions

passed in-situ PA/PC system testing, but 25% of the parts

built with grease showed some evidence of modest

temperature degradation that neared the pass/fail criteria.

Fig. 8. Plan view of the 2S2U HPC system showing GPUs installed at front

(F3, F7) and rear (R2, R6) slot locations. Air flows from right to left in the

diagram.

An

example of the hot cycle thermal stability resulting from in

situ system PA/PC tests is illustrated in Figure 10 for an

assembly built with the HT-C3200 TIM and tested through

more than twelve hundred full power on/power idle cycles

using a 1-hour cycle consisting of 15 minute ramp and dwell

periods.

T1 Product Test Summary

                                              HT-C3200                  Grease
Stress Test                                   Samples  Fails (∆T>5°C)   Samples  Fails (∆T>5°C)
S&V                                           2        0                2        0
TSS (-40 to 65°C, 10 cycles)                  5        0                5        0
T&H (50/80, 200 hrs)                          5        0                5        0
ATC (0 to 100°C, 500 cycles)                  5        0                5        4
System PA/PC (35 to 80°C, up to 1350 cycles)  8        0                8        2
ATC (0 to 75°C, 500 cycles)                   n/a      n/a              5        3
DTC (-50 to 100°C, 200 cycles)                6        0                n/a      n/a

Fig. 9. Summary of T1 product qualification test plan and test results


All parts passed sequential TSS and T&H tests. However,

monitoring of parts subjected to follow-on ATC tests showed

that 80% of the parts built with the grease TIM suffered

significant thermal performance degradation with first

failures occurring after 170 cycles of ATC and 80% of parts

failing the criteria after 500 cycles. An example of this

deterioration is shown in Figure 11 for a GPU with grease

TIM1 that ceased to function after exposure to 500 cycles of

0-100°C ATC.

Fig. 10. Example PA/PC temperature and power for a GPU built with the

HT-C3200 TIM. Fluidic diodes on the system fans constrain airflow to drive

higher in-system operating temperatures for stress acceleration. A system fan

speed adjustment was made at approximately 350 cycles into the test to

prevent GPU Tjs from running in excess of 80°C.

Fig. 11. Grease TIM thermal performance deterioration vs. HT-C3200 TIM

thermal stability after exposure to ATC test legs as functionally tested on

example parts in system slot positions R6 (grease) and F3 (HT-C3200). The

example part built with the grease TIM failed to power on after exposure to

a 500+ cycle ATC checkpoint.

In contrast, all parts made with the HT-C3200 TIM

remained stable through 500 cycles of 0-100°C ATC. An

example of this stability is also shown in Figure 11 for

comparison. Because notable thermal degradation in ATC

tests was identified in GPU test cells built with grease,

supplemental ATC test cells on additional parts made using

the grease TIM were also run using an intermediate 0-75°C

ATC stress regimen. These results, together with the 0-100°C ATC data and the power cycling results, were used to derive a rough ATC acceleration factor estimate for grease TIMs for use in this and other product applications. As shown in Figure 9, parts built with the grease TIM held adequate thermal performance slightly longer in 0-75°C ATC than in 0-100°C ATC, but 60% of the parts still failed the thermal performance pass/fail criteria by 500 cycles.
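The text does not say which model was used for the acceleration factor estimate. As a hedged illustration only, the sketch below uses the Coffin-Manson form commonly applied to thermal cycling, AF = (ΔT_stress / ΔT_reference)^n. Whether this form describes grease pump-out is an assumption, and the exponent `n = 2.0` is a placeholder rather than a value from the paper; the function names are hypothetical.

```python
def coffin_manson_af(delta_t_stress, delta_t_ref, exponent):
    """Coffin-Manson style acceleration factor between two cycling ranges.

    delta_t_stress: temperature swing of the harsher test, in kelvin or degC
    delta_t_ref:    temperature swing of the milder test or field condition
    exponent:       empirical exponent (ASSUMED; not given in the source)
    """
    if delta_t_stress <= 0 or delta_t_ref <= 0:
        raise ValueError("temperature swings must be positive")
    return (delta_t_stress / delta_t_ref) ** exponent


def equivalent_cycles(stress_cycles, acceleration_factor):
    """Cycles at the reference condition equivalent to the stress cycles."""
    return stress_cycles * acceleration_factor


if __name__ == "__main__":
    # 0-100 degC ATC (swing of 100) versus 0-75 degC ATC (swing of 75).
    n = 2.0  # illustrative placeholder exponent, not a measured value
    af = coffin_manson_af(100.0, 75.0, n)
    print(f"AF (0-100 vs 0-75 degC ATC, n={n}): {af:.2f}")
    print(f"500 cycles at 0-100 degC ~ {equivalent_cycles(500, af):.0f} "
          "cycles at 0-75 degC under this assumed model")
```

In practice the exponent would be fitted from the cycles-to-failure observed in the two ATC regimens and the power cycling data, which is the kind of cross-comparison the paragraph above describes.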

Another set of parts built with the HT-C3200 TIM was

subjected to -50-100°C DTC stress to further assess its overall

robustness in the GPU application. In all cases tested, no fails

were encountered using the HT-C3200 TIM.

C. Post-T1 GPU Testing Failure Analysis

All parts built with grease TIMs were disassembled after

stressing to identify root causes of thermal degradation. Parts

built with the HT-C3200 TIMs were also disassembled to

look for potential changes to the interface material as a result

of the stress test exposures. In this latter case, nothing

remarkable was found. The HT-C3200 appeared to be fully

intact and compressed with no signs of any erosion or

deterioration. In fact, test data show that HT-C3200 that is intentionally crumpled (heavily creased) prior to use performs as well as pristine material. This attribute matters for assembly manufacturing and quality control because the material is somewhat delicate to handle. Minor handling issues in volume manufacturing may therefore produce one or two creases in the material, especially at or near the corners of TIM preforms.

The photos in Figure 12 show the general condition of post-stress-test HT-C3200 TIM, along with TIM that was intentionally and significantly damaged, before and after use. Figure 13 shows the corresponding thermal performance comparison of the intentionally damaged vs. undamaged material when used and tested in the same GPU assembly.

Fig. 12. Photos of HT-C3200 TIM pieces (A) prior to installation, (B) after

assembly and stress tests, (C) prior to installation, intentionally crumpled, (D)

intentionally crumpled piece after assembly, compression, and thermal

performance test. Compression eliminated most of the creases and caused

the transfer of an image of laser scribed device information onto the TIM.

Upon disassembly of hardware built with the grease TIM,

grease pumping with significant depletion of grease away


from the GPU device area was found to be the obvious root

cause mechanism for thermal performance deterioration on

most parts exposed to the ATC stress tests.

An example of significant grease pumping observed on samples after 500 cycles of 0-100°C ATC is provided in Figure 14A. For comparison, the grease TIM coverage on a typical uncycled part is shown in Figure 14B. Figure 14A shows the hardware sample that produced the GPU6 thermal data shown in Figure 11.

Fig. 13. Undamaged vs. intentionally crumpled HT-C3200 thermal

performance test comparison as installed into a common GPU.

Fig. 14. Thermal grease pump-out (A) after 500-cycle ATC test and footprint

(B) as built.

D. T2 In-System GPU Test Summary

Given the T1 test results and physical failure analysis findings, the high performance grease was dropped from further consideration as a TIM for final product qualification. Twelve additional production level parts assembled with HT-C3200 graphite TIM were put into -50-100°C DTC, and four of these parts were also subjected to system-level S&V tests. These two tests, coupled with prior T1 results, defined the final qualification for product introduction.

In addition to stress test exposures and functional assessments similar to those described above for T1 testing, once parts completed the T2 stress tests and the associated final thermal measurements, they underwent an additional 90 days of run time monitoring in IBM Joint Engineering-Manufacturing Test machines, which were themselves subjected to extensive power cycling, corners testing, and altitude chamber testing. No issues were observed with any of the GPU hardware. All device Tjs remained stable to within 2-4°C throughout all tests and monitoring exercises. This range includes ambient temperature variability on the development lab data center floor. Excerpts of specific T2 test results are shown in Figures 15 and 16 and include GPU Tj stability measurements before and after DTC and S&V tests.
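A pre- vs. post-stress comparison of this kind can be sketched as a small calculation of junction temperature drift and junction-to-inlet thermal resistance, R = (Tj - T_inlet) / P. The code below is a minimal sketch under stated assumptions: the `Reading` structure, the use of inlet air temperature as the reference, the 4°C stability band (taken from the upper end of the 2-4°C range above), and all numeric readings are hypothetical and are not data from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One in-system measurement of a GPU (all fields are illustrative)."""
    tj_c: float      # junction temperature, degC
    inlet_c: float   # reference (inlet air) temperature, degC
    power_w: float   # GPU power draw, W


def thermal_resistance(r):
    """Junction-to-inlet thermal resistance in degC/W."""
    if r.power_w <= 0:
        raise ValueError("power must be positive")
    return (r.tj_c - r.inlet_c) / r.power_w


def tj_drift(pre, post):
    """Change in junction temperature from pre- to post-stress, degC."""
    return post.tj_c - pre.tj_c


def is_stable(pre, post, band_c=4.0):
    """True if the Tj change stays within the assumed stability band."""
    return abs(tj_drift(pre, post)) <= band_c


if __name__ == "__main__":
    # Hypothetical readings for a 300 W GPU before and after a stress leg.
    pre = Reading(tj_c=70.0, inlet_c=25.0, power_w=300.0)
    post = Reading(tj_c=72.0, inlet_c=25.0, power_w=300.0)
    print(f"R pre  = {thermal_resistance(pre):.3f} degC/W")
    print(f"R post = {thermal_resistance(post):.3f} degC/W")
    print(f"Tj drift = {tj_drift(pre, post):+.1f} degC, "
          f"stable = {is_stable(pre, post)}")
```

Normalizing by power, as `thermal_resistance` does, helps separate genuine interface degradation from changes in GPU workload between measurements.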

Fig. 15. Example of thermal stability from 4 different GPU locations

(shown in Figure 8) built with HT-C3200 TIM, as tested in an HPC system

pre- vs. post-DTC T2 qualification test. Corresponding GPU power levels

for the parts when tested before and after DTC exposures are also shown.

Fig. 16. Example thermal stability of front and rear positioned GPUs in an

HPC system as measured before and after system level S&V test.

SUMMARY AND CONCLUSIONS

A new, compressible, graphite thermal interface material

has been tested and qualified for use as a TIM1 on large, 2.5D

silicon on silicon bare die BGA packages and affiliated GPU

card assemblies that are integrated into recently announced

IBM high performance computing systems to provide

accelerated processing function. Evaluation, test, and qualification of this new compressible TIM within and outside of the GPU application show that it can be successfully


used in high power applications that require thermal gap filling capability up to 0.005 inches (approximately 125 micrometers) when coupled with cooling hardware attached at loads as low as 15-25 psi. Because the material is intrinsically inelastic, it should be used with mechanical packaging designs that incorporate a constant spring load for attachment of cooling hardware, to ensure consistent and reliable TIM function. Although not discussed here, this interface material has also been tested and integrated into water cooled versions of the IBM HPC server referenced in this paper as a TIM1 GPU solution, and as a TIM2 cooling solution between GPU heat spreaders and cold plate assemblies.
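For readers working in SI units, the gap and load figures quoted above convert as follows. This is a simple arithmetic sketch using standard conversion factors; the function names are illustrative.

```python
INCH_TO_UM = 25_400.0    # exact: 1 inch = 25.4 mm = 25,400 micrometers
PSI_TO_KPA = 6.894757    # 1 psi is approximately 6.894757 kPa


def inches_to_um(inches):
    """Convert a length in inches to micrometers."""
    return inches * INCH_TO_UM


def psi_to_kpa(psi):
    """Convert a pressure in psi to kilopascals."""
    return psi * PSI_TO_KPA


if __name__ == "__main__":
    print(f"0.005 in = {inches_to_um(0.005):.0f} um")
    print(f"15 psi = {psi_to_kpa(15):.1f} kPa")
    print(f"25 psi = {psi_to_kpa(25):.1f} kPa")
```

The exact conversion of 0.005 inches is 127 micrometers, so the 125 micrometer figure in the text is a rounded value; the 15-25 psi load range corresponds to roughly 103-172 kPa.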

ACKNOWLEDGEMENT

The authors gratefully acknowledge the support of IBM

development engineering, manufacturing engineering, and

contract engineering support from Sarah Czaplewski,

Timothy Jennings, Eric Campbell, David Braun, David

Barron, Steve Miranda, Matthew Scheckel, Dave Nickel,

Jeffrey Johnson, and Matthew Farmer, for their assistance

with materials properties evaluations, hardware assembly,

stress test support, test support, and system thermal

performance monitoring that took place throughout the course

of this development effort. We also would like to thank

Martin Smalc and Larry Jones of the AET Innovation and

Technology Center for guidance and assistance with using the

D5470 instrument to characterize the compressible graphite

materials. We would also like to thank Andy Reynolds, Jason

Murphy, and Julian Norley for their support and guidance.
