
I have a manufacturing situation where we perform a functional test on a board, and we are getting frequent overtemperature failures from a BGA package with a heatsink on it. I would like to be able to determine whether the overtemperature is caused by bad thermal contact with the heatsink OR by the IC itself generating more heat than we expect.

Here are the details:

  • Large BGA package that dissipates A LOT of power. Very sensitive to heat sink seating
  • BGA package is a part that is picked by our supplier to meet our specified voltage/power requirements.
  • There is variation in power dissipation across devices. It is unknown whether this variation is caused by heat-sink application or by differences between individual ICs. The device seems to have characteristics of thermal runaway: higher temperature and higher current consumption go hand in hand (the voltage rails are steady).
  • Heat sink is a copper vapour phase chamber with fins. TIM is a high-performance thermal grease. We have a controlled environment in a chassis with fans forcing air at a constant RPM.
  • I have a way to measure the die temperature of the device to a resolution of 1 °C, and I can heat the device up "at will" by running an automated test.

What I would like to do is to perform a test that checks the efficacy of the heat-sink to rule out the heat sink (or TIM or seating) as a problem. One way to do this is to re-apply another "known-good" heat sink and retest, but that is dependent on operator skill for repeatability, and has other manufacturing workflow problems.

Here's an idea for measuring the effectiveness of the heat sink, I'd like to get some input on whether it will be a good idea and/or what would be a better way to test this.

  • The device has a "textbook" heat-up/cool-down curve that fits nicely to an RC time constant. In the plot below, the device starts at "idle", then I make it "do its job" in a functional test, and then I turn the function off after 5 minutes. [Plot: typical RC-time-constant heat-up/cool-down curve]

  • I am most interested in the cooling curve, because once the device starts to cool I know that the core part of the IC is no longer generating heat. The cooling curve is just the package cooling down through the heatsink and PCB, and I assume the heatsink dominates the heat transfer, especially early on. In other words, the cooling curve is a measure of the cooling performance of the heat sink and not much else. Moreover, the other heat paths (e.g. cooling through the PCB) vary less from test to test than the heatsink does.

  • When I normalize the curves to range between zero and one, set the time origin to the onset of cooling, and look only at the first 80 seconds of cooling, I get nice straight lines in a log plot (see the fitting sketch below the plot). The time constant for a cool-running device is 36 s with a standard deviation of <5% over a dozen runs. The time constant for a device whose heat sink was deliberately impaired so that it runs a few degrees hot was 39 s, with a similar standard deviation.

[Plot: normalized cooling curves on a log scale]
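For concreteness, here is a minimal sketch of that fitting procedure in Python. The sample data is made up (it decays with roughly a 36 s time constant) and the idle temperature is a placeholder; in practice both would come from the automated test log.

```python
import numpy as np

# Hypothetical die-temperature log: time in s from the onset of cooling,
# temperature in deg C.  In practice this comes from the automated test.
t = np.array([0, 5, 10, 20, 30, 40, 50, 60, 70, 80], dtype=float)
T = np.array([92.0, 83.3, 75.8, 63.4, 54.1, 47.1, 41.7, 37.7, 34.6, 32.3])

T_hot = T[0]      # die temperature when the functional load is switched off
T_idle = 25.0     # idle temperature the curve decays toward (placeholder)

# Normalize to 0..1 as described above: theta(t) ~ exp(-t / tau)
theta = (T - T_idle) / (T_hot - T_idle)

# A single-pole decay is a straight line on a log plot, so the time
# constant falls out of a linear fit of ln(theta) against t.
slope, intercept = np.polyfit(t, np.log(theta), 1)
tau = -1.0 / slope
print(f"fitted cooling time constant: {tau:.1f} s")   # ~36 s for this data
```

A screening limit on tau relative to the known-good population (mean and standard deviation over many runs) would then be the pass/fail criterion.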

Now the question: if I get a hot-running device and measure a time constant that is the same as a cool-running device's, can I rule out the heat sink and its application as the problem?

I should clarify that this is in a manufacturing context, not design (DVT). The focus is on determining the cause of failures.

  • Unless you are driving substantial electrical power through off-chip I/O or EM fields, your FPGA should be nearly 100% efficient as a heater, so one that is actually generating more heat will draw more supply power. If you are seeing higher supply current at the same die temperature before it heats up, then I'd think you have a chip that produces fewer computes-per-watt. Commented Aug 4, 2014 at 18:20
  • Has the manufacturer provided you a transient thermal model of the die/package? Commented Aug 4, 2014 at 18:33
  • @ChrisStratton, thanks, I think it is possible that we're getting fewer computes per watt on some devices. Although this is not an FPGA, I guess I can still assume that any compute/switching device is effectively 100% efficient as a heater? I do know for a fact that hotter devices draw more current, but I haven't tried to see whether the onset of high current occurs before the die-temperature increase (that would be difficult to measure).
    – Angelo
    Commented Aug 4, 2014 at 18:40
  • @SpehroPefhany, thanks, if I had that model, what could I do with it? I guess I would then need to determine a model for the heat sink we apply, and then put them together to get an idea of what to expect from the performance of the system?
    – Angelo
    Commented Aug 4, 2014 at 18:44
  • It might help rule out any dynamic effects from the chip being different amounts hotter than the package (two or more time constants). Of course, you're measuring the die temperature. Since it's a heat-pipe HS there might not be that much thermal mass in the HS itself. If you have a thermocouple on the heat sink there should be a consistent relationship between die temperature and heat-sink temperature if the HS and attach are consistent (ah, @EEDeveloper has suggested this) – it's a more direct measurement. For failure analysis you probably want to rule out heat-pipe leakage. Commented Aug 4, 2014 at 19:19

2 Answers


Maybe, maybe not, but I'd ask why you are not correlating hot chips with power supply currents, and why you're not putting a temperature sensor on the heatsink. If the thermal path from the die to the heatsink is impaired you'll get a different temperature differential between the die and the heatsink. Likewise, if the chip is drawing more current you should be able to predict the final temperature of the die based on normal thermal behavior. And measuring the heatsink temp doesn't require a dedicated contact sensor: a temporary one will do, or a non-contact IR unit should work, since the emissivity of the heat sinks should be pretty uniform.
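To make that concrete, here is a rough sketch of the cross-check in Python. The thermal-resistance values are placeholders, not characterized numbers for this part; they would have to come from measurements on known-good boards.

```python
# Placeholder thermal resistances -- measure these on known-good boards first.
theta_die_to_hs = 0.10   # deg C / W, die (junction) to heat-sink surface via TIM
theta_hs_to_amb = 0.25   # deg C / W, heat sink to chassis air at the fixed fan RPM

P = 40.0                 # W, measured supply power into the device
                         # (rails are steady, so P is essentially V * I)
T_ambient = 30.0         # deg C, air temperature inside the chassis

T_hs_expected  = T_ambient + P * theta_hs_to_amb
T_die_expected = T_hs_expected + P * theta_die_to_hs

print(f"expected heat-sink temperature: {T_hs_expected:.1f} C")
print(f"expected die temperature:       {T_die_expected:.1f} C")

# Interpretation:
#  - die much hotter than expected, heat sink about as expected
#      -> suspect the die-to-heat-sink interface (TIM / seating)
#  - die and heat sink both hotter than expected, in proportion to P
#      -> the device is simply dissipating more power than assumed
```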

As to why the maybes, consider the following model:

[CircuitLab schematic: lumped thermal model — die thermal capacity, die-to-heatsink resistance, heat-sink thermal capacity, heat-sink-to-ambient resistance]

If the thermal resistance from the die to the heatsink is much larger than the thermal resistance of the heatsink to ambient, and the thermal capacity of the die is much less than the capacity of the heat sink (and I would guess both to be true), then the heat-sink capacity, discharging through its resistance to ambient, is the dominant factor in determining the thermal time constant of the heatsink, and thus of the die. In this case, increases in the die/HS thermal resistance will have only small effects on the time constant of the die, but will cause the die to run hotter. You'll have to work out the values for your board to see if this is the case.
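To put rough numbers on that, here is a minimal simulation of a two-node version of the model above (die capacity with die-to-heat-sink resistance, heat-sink capacity with heat-sink-to-ambient resistance). All component values are invented purely for illustration; only the trend matters.

```python
import numpy as np

def steady_state_and_tau(R_die_hs, R_hs_amb, C_die, C_hs, P,
                         T_amb=30.0, dt=0.01, t_end=120.0):
    """Cool-down of the two-node thermal network (die --R_die_hs-- heat sink
    --R_hs_amb-- ambient) by explicit Euler, starting from the steady state
    reached with power P in the die.  Returns (steady-state die temperature,
    time constant fitted to the first 80 s, as in the question)."""
    T_die = T_amb + P * (R_die_hs + R_hs_amb)   # steady state with heater on
    T_hs = T_amb + P * R_hs_amb
    T_die_ss = T_die

    n = int(t_end / dt)
    t = np.arange(n) * dt
    trace = np.empty(n)
    for i in range(n):                          # heater off: power = 0
        q_dh = (T_die - T_hs) / R_die_hs        # W, die -> heat sink
        q_ha = (T_hs - T_amb) / R_hs_amb        # W, heat sink -> ambient
        T_die -= q_dh / C_die * dt
        T_hs += (q_dh - q_ha) / C_hs * dt
        trace[i] = T_die

    theta = (trace - T_amb) / (trace[0] - T_amb)
    early = t <= 80.0
    slope, _ = np.polyfit(t[early], np.log(theta[early]), 1)
    return T_die_ss, -1.0 / slope

# Invented values: small die capacity, large heat-sink capacity.  "Poor contact"
# has the die-to-heat-sink resistance four times higher than "good contact".
for label, r in [("good contact", 0.05), ("poor contact", 0.20)]:
    T_ss, tau = steady_state_and_tau(R_die_hs=r, R_hs_amb=0.5,
                                     C_die=5.0, C_hs=80.0, P=60.0)
    print(f"{label}: steady-state die temp {T_ss:.1f} C, cooling tau {tau:.1f} s")
# With these numbers the steady-state die temperature moves by several degrees
# while the fitted time constant barely changes.
```

Under those assumptions, a tau-only screen could pass a board whose die-to-heat-sink contact is marginal, which is the "maybe not" above.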

  • Thanks! I think that last paragraph is the missing piece of information. I am definitely trying to screen for variations in die/HS resistance by looking at the time constant of the die-temperature decay. I've been trying to avoid putting a thermocouple on the HS because these devices are hard to get to (in a chassis) and this is a production environment.
    – Angelo
    Commented Aug 4, 2014 at 19:54

If I understand correctly, you want to rule out bad thermal contact between the BGA and the heat sink, right?

If so, consider attaching small thermocouples to a heat-sink fin and to the bottom of the PCB under the BGA. As the BGA heats up, some heat goes upward into the heat sink and some goes downward into the PCB, depending on the thermal resistance of each path.

The path into the PCB is constant from unit to unit. If the heat-sink path has good or bad thermal contact, the two thermocouples should show differences in the shape of their rise/fall curves as well as in their amplitude.
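As a rough illustration of that split (all numbers invented): the heat divides between the two paths roughly inversely with their thermal resistances, like a current divider, so a degraded heat-sink path pushes proportionally more heat into the PCB and the PCB-side thermocouple runs relatively hotter.

```python
def heat_split(p_total, r_up, r_down):
    """Split of dissipated power between the heat-sink path (up) and the
    PCB path (down), treating them as parallel thermal resistances from
    the die to the same ambient."""
    q_up = p_total * r_down / (r_up + r_down)
    return q_up, p_total - q_up

P = 40.0        # W dissipated in the die (invented)
R_down = 4.0    # C/W, die -> balls -> PCB -> air (assumed roughly constant)

print("good contact:", heat_split(P, r_up=0.4, r_down=R_down))  # most heat goes up
print("poor contact:", heat_split(P, r_up=1.6, r_down=R_down))  # more heat into the PCB
```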

Hope it helps
