The newest article http://www.theinquirer.net/gb/inquirer/ ... -defective says that, good cooling or not, these Nvidia cards may not last the warranty period.
It's a godawful long two-page article, so I'll pick some random snippets that cover the high points. Follow the link if something I quoted doesn't seem to make sense; they explain it several different ways in the article.
The defective parts appear to make up the entire line-up of Nvidia parts on 65nm and 55nm processes, no exceptions. The question is not whether or not these parts are defective; it is simply what the failure rate of each line is, with field reports on specific parts hitting up to 40 per cent early-life failures. This is obviously not acceptable.
The defective Nvidia chips use a type of bump called high lead, and Nvidia is now transitioning to a type called eutectic (the article links to background on both).
Getting back to the stress, it is what makes bumps fracture. Think of the old trick of taking a fork and bending it back and forth. It bends several times, then it breaks. The same thing happens to bumps. Heating leads to stress, aka bending; then the chip cools and the bump bends back. Eventually this thermal cycling kills chips.
Once again, if you did your engineering right, this won't happen in any timeframe that matters to mere humans; if it takes ten years of on-and-off switching to make it happen, once-a-day power cycling won't matter in our lifetimes. Chip makers tend to engineer for timelines like the ten-year horizon, and are pretty safe in assuming the chip will live for five years of casual use.
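To put rough numbers on that, here is a back-of-envelope sketch in Python. This is mine, not the article's: solder fatigue is commonly modeled with a Coffin-Manson-style relation, where cycles to failure fall off as a power of the temperature swing per cycle, and every parameter below is an invented placeholder, not a measured value for any real part.

# Back-of-envelope Coffin-Manson-style fatigue estimate.
# All numbers are illustrative placeholders, NOT measured values
# for any Nvidia part.

def cycles_to_failure(delta_t, a=1.0e7, exponent=2.0):
    """Cycles to failure for a given temperature swing (degrees C).

    Coffin-Manson form: N_f = a * delta_T^(-exponent). Both 'a' and
    'exponent' are material- and package-dependent; these values are
    made up so that a 40 C swing gives roughly 6,000 cycles.
    """
    return a * delta_t ** (-exponent)

for swing in (30, 40, 60):  # idle-to-load temperature swings in C
    n_f = cycles_to_failure(swing)
    print(f"{swing} C swing: ~{n_f:,.0f} cycles, "
          f"~{n_f / 365:.0f} years at one power cycle per day")

With sane parameters the bumps outlive the card even at a cycle a day; the thing to notice is how fast the lifetime collapses as the temperature swing grows, which is why a hot-running part with marginal packaging dies young.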
If you pick an underfill that is too soft, it doesn't provide enough mechanical support for the bumps; they crack and your chip dies an early death. Pick one that is too hard and it rips the polyimide layer off. In the words of one packaging engineer talked to for this article, if you used too hard an underfill, the chip "wouldn't survive the first heat cycle". The magic is in the middle: you have to pick a bowl of porridge, er, underfill, that is strong enough to provide the support you need, but not so strong as to rip layers off your chip. Like we said, package engineering is not for the faint of heart, but it can make baby bear happy.
That brings us to the billion-dollar question: why not simply change bump types to eutectic if they are that much better, which they are, in some ways? The answer is in the current capacity, more specifically the average current capacity. We mentioned this earlier, and the idea ties into the hot spots and functional units.
Take a hypothetical simple CPU that has an integer unit and a floating point unit. If you are doing heavy integer work, the power bumps that supply that part of the chip will be loaded heavily while the FP bumps will not be doing much of anything at all. When the FP load gets heavy, the opposite happens.
The layout of the bumps is designed so that neither set will be overloaded at peak times; in fact, neither will get all that close to its maximum.
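As a toy illustration of that sizing, again mine rather than the article's, with every number invented:

# Toy model: two functional units, each fed by its own set of power
# bumps. All figures are made up for illustration.

int_bumps, fp_bumps = 300, 300   # bumps feeding each unit
bump_limit = 0.12                # hypothetical rating, amps per bump

# (integer unit current, FP unit current) in amps at two extremes
workloads = {"integer-heavy": (30.0, 8.0), "FP-heavy": (8.0, 30.0)}

for name, (i_int, i_fp) in workloads.items():
    print(f"{name}: int bumps {i_int / int_bumps:.3f} A/bump, "
          f"FP bumps {i_fp / fp_bumps:.3f} A/bump, limit {bump_limit} A")

Each set of bumps is sized for its own unit's worst case, so even when one unit is flat out, the per-bump current stays under the rating with room to spare.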
The problem with eutectic bumps is that they have a lower current capacity, and the closer you get to it, the worse the problem of electromigration becomes.
If Nvidia wants to swap in eutectic bumps for the high lead they are using, there is a slight problem: they are well over the current capacity of the new bumps.
If the chip actually powers up without letting the smoke out, the first time you fire up a massive game of Telengard it will most assuredly go pop. In the rare case that the gods of luck are staring right at you and the thing doesn't fry immediately, electromigration will ensure it has the lifespan of a mayfly, basically worse than the current crop of defective Nvidia chips.
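To see how you end up over the limit, here is the rough shape of the arithmetic, with numbers I invented for illustration: divide board power by core voltage to get total current, spread that across the power bumps, and compare the per-bump average against each bump type's rating.

# Rough per-bump current estimate. Every number below is a
# hypothetical placeholder, not a real spec for any part.

gpu_power_w  = 105.0   # assumed GPU power draw, watts
core_voltage = 1.15    # assumed core voltage
power_bumps  = 900     # assumed number of bumps carrying core power

total_current = gpu_power_w / core_voltage   # I = P / V, about 91 A
per_bump = total_current / power_bumps       # average amps per bump

high_lead_limit = 0.12   # hypothetical high-lead rating, amps per bump
eutectic_limit  = 0.08   # hypothetical eutectic rating, amps per bump

print(f"average current per bump: {per_bump:.3f} A")
print(f"high-lead margin: {high_lead_limit / per_bump:.2f}x")
print(f"eutectic margin:  {eutectic_limit / per_bump:.2f}x")

With those invented figures the high-lead bumps sit just inside their rating, while a straight eutectic swap lands about 20 per cent over it, and electromigration gets worse the closer you run to the limit even before you cross it.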
What do you do? You can either cut the power used by the GPU way, way down, i.e. clock it at a point where no one would ever buy it, or rearrange where the bumps go. The rearrangement is not a trivial thing, and may require moving large parts of the chip around, basically a partial relayout. This is expensive, time consuming, and likely can't be done and validated in the time the chip is on sale.
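For a sense of why the downclock option is a non-starter, continuing the invented numbers from the sketch above:

# How far the clock would have to drop to fit eutectic bumps,
# using the hypothetical figures from the previous sketch.

per_bump       = 0.101   # average amps per bump, from before
eutectic_limit = 0.08    # hypothetical eutectic rating, amps per bump
margin         = 1.2     # assumed headroom below the rating, since
                         # running at the limit feeds electromigration

# At a fixed voltage, dynamic power and current scale roughly linearly
# with clock frequency (I ~ f), so the required cut is just a ratio.
scale = (eutectic_limit / margin) / per_bump
print(f"clock would need to drop to ~{scale:.0%} of stock")

Giving up a third of the clock just to stand still is why the relayout, painful as it is, even gets considered.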
The other option is basically just as bad: you need a power plane or power grid on the die. This is a metal layer that distributes power across the die, and it means adding a layer to the chip. That means expense, slightly lower yield, and other potential detrimental effects on power draw and clocking.
All of these things can be dealt with if you see this coming when you start making the GPU. It is pretty painfully obvious that Nvidia didn't, otherwise they wouldn't have used high lead bumps and gotten into the hole that they are in. They have switched to eutectic bumps, but given the way it is being done, and the supplier grumbles we are hearing, it appears to be very poorly planned. It will be interesting to see the lifespan of these new parts.