What's that all about? [heavy Nvidia quality problems]

They make noise, too.

Moderators: NeilBlanchard, Ralf Hutter, sthayashi, Lawrence Lee

mexell
Posts: 307
Joined: Sat Jun 24, 2006 11:52 am
Location: (near) Berlin, Germany

What's that all about? [heavy Nvidia quality problems]

Post by mexell » Wed Aug 13, 2008 1:09 am

In case you haven't noticed:

Inquirer link 1
Inquirer link 2
Inquirer link 3

mexell
Posts: 307
Joined: Sat Jun 24, 2006 11:52 am
Location: (near) Berlin, Germany

Post by mexell » Wed Aug 13, 2008 1:13 am

I've just stumbled across something related to that:

Link to heise.de

Basically, for the non-German-speaking crowd here: it's about Nvidia having to write off $200M for estimated warranty replacements due to poor thermal and mechanical design of the die-to-PCB connections, which can lead to premature failure of their cards.

Edit: OEMs like HP are already extending warranty periods for at-risk Nvidia products.

mexell
Posts: 307
Joined: Sat Jun 24, 2006 11:52 am
Location: (near) Berlin, Germany

Post by mexell » Wed Aug 13, 2008 1:17 am

Channel vendors demand card makers recall faulty Nvidia products
- Monica Chen, Digitimes, 25.07.08, paid subscribers only

Looks like Nvidia has a hot problem (pun intended).

Ethyriel
Posts: 93
Joined: Sun Dec 24, 2006 12:47 am
Location: Arizona

Post by Ethyriel » Wed Aug 13, 2008 1:24 am

Even though he was right the first time around, I'll take any claims Charlie at The Inquirer makes about Nvidia with a huge grain of salt. Apparently he has it out for them, and The Inquirer is known for posting just about anything no matter how little substantiation it has.

That said... damn, I JUST bought this 9800+. But my old 8800GTS 640 never had any problems, and it got a LOT hotter (25-30°C hotter, at twice the RPM :shock:)

Eh, we'll see. Maybe I'll just use my EVGA Step-Up for a die-shrunk GTX260 when they hit.

edit: that post is about the new desktop-chip claims; obviously Nvidia has already owned up to their mobile woes

mexell
Posts: 307
Joined: Sat Jun 24, 2006 11:52 am
Location: (near) Berlin, Germany

Post by mexell » Wed Aug 13, 2008 1:49 am

Well, it seems like The Inquirer is not the only one on this train. Digitimes and heise.de are both well respected and by no means tabloid press. The most significant fact published so far, in my opinion, is that major brands like HP and Lenovo have voluntarily extended their warranties on certain products.

Ethyriel
Posts: 93
Joined: Sun Dec 24, 2006 12:47 am
Location: Arizona

Post by Ethyriel » Wed Aug 13, 2008 1:59 am

Oh, I agree, there's definitely a problem with the mobile stuff, Nvidia has come out with it and warranties are being extended. On the desktop side, all we have is 'Charlie's sources' saying they're seeing far higher than normal failures. I'm skeptical until I see something from someone who doesn't hate Nvidia. So far it's The Inquirer, and others linking to The Inquirer.

It wouldn't surprise me, though. Maybe a little bit if it's not fixed on the latest chips like GTX200 and the 9800GTX+ shrink.

mexell
Posts: 307
Joined: Sat Jun 24, 2006 11:52 am
Location: (near) Berlin, Germany

Post by mexell » Wed Aug 13, 2008 2:17 am

Yeah. Let's see what comes. Hopefully Nvidia can cope with this well, because if not, it would leave AMD more or less alone in the field.

I have to say I'm more than happy that AMD is back in the competition. What happens when dominant companies become complacent could be seen in the Netburst era.

dhanson865
Posts: 2198
Joined: Thu Feb 10, 2005 11:20 am
Location: TN, USA

Post by dhanson865 » Wed Aug 13, 2008 11:36 am

From the NVIDIA conference call (http://seekingalpha.com/article/90644-n ... pt?page=-1)...
Desktop GPU units declined by 20% but more importantly, desktop ASPs declined by 25% quarter to quarter.
and
Desktop GPU revenue was down 40% quarter to quarter and down 25% year to year.

It is unlikely that 40% of Nvidia's customers suddenly started buying IGPs, so that means a BIG switch to ATI. Since AMD's Q2 ended a month earlier, and only just a few weeks after the release of the 4xxx series, the full impact on AMD's results will only be seen in the Q3 results. Nvidia also hasn't felt a full quarter of 4xxx impact, so their Q3 results will probably be worse than Q2.
I'd point out that it's unlikely that 20% of customers switched, as the bigger percentages are in dollars, not customers, but the quote is interesting nonetheless. It's also probably worth mentioning that ATI/AMD isn't the only possible recipient of those sales.

tim851
Posts: 543
Joined: Wed Aug 13, 2008 11:45 am
Location: 128.0.0.1

Post by tim851 » Wed Aug 13, 2008 11:55 am

What happens when dominant companies become complacent could be seen in the Netburst era.
Not a good example. Intel had good competition before the P4 in the K6s and later the Athlons. Netburst was an "innovative" idea to engage the enemy: all hail the almighty clock speed. Throughout most of the P4's life span, AMD had the better CPUs. No dominance here.

The best example of a lack of competition would IMHO be Vista. I'm not even talking about the actual quality of the product. I'm just considering where XP was in 2001, and how the biggest IT company spent six years developing the successor, and I think to myself: that's it?!

mexell
Posts: 307
Joined: Sat Jun 24, 2006 11:52 am
Location: (near) Berlin, Germany

Post by mexell » Wed Aug 13, 2008 11:34 pm

dhanson865 wrote:I'd point out that it's unlikely that 20% switched as the bigger percentages are in dollars not customers but the quote is interesting nonetheless. It's also probably worth mentioning that ATI/AMD isn't the only other possible recipient of those sales.
Well, I think for losses in discrete graphics there's clearly only one competitor around to pick up Nvidia's lost sales. For losses in integrated graphics, there are Intel and AMD, both focused on different markets, although it will be interesting to see what happens when Larrabee turns up. AMD has a certain technical advantage over Nvidia in both integrated and discrete solutions at the moment. The Radeon HD 4xxx series is quite a blow to Nvidia, as is the 780 series of integrated GPUs.

mexell
Posts: 307
Joined: Sat Jun 24, 2006 11:52 am
Location: (near) Berlin, Germany

Post by mexell » Wed Aug 13, 2008 11:42 pm

tim851 wrote:Not a good example. Intel had good competition before the P4 in the K6s and later the Athlons. Netburst was an "innovative" idea to engage the enemy: all hail the almighty clock speed. Throughout most of the P4's life span, AMD had the better CPUs. No dominance here.

The best example of a lack of competition would IMHO be Vista. I'm not even talking about the actual quality of the product. I'm just considering where XP was in 2001, and how the biggest IT company spent six years developing the successor, and I think to myself: that's it?!
ACK, although:
- Even during the times of AMD's greatest technical advantage over Intel, market share spoke quite a different language. That was, in my opinion, due to Intel's more "complete" lineup, from notebooks up to big iron. The only part of the market in which AMD was competitive was the desktop. Intel's counterstrike with the Core architecture was, and is, quite devastating for AMD, and then there are the home-made Barcelona/Phenom release delays.
- Vista has developed into quite a mature product over the last 1.5 years. The initial problems are only partly Microsoft's fault. Everyone who developed for Windows could have known about the technical changes introduced with Vista in time to provide truly "Vista-ready" products.

Tzupy
*Lifetime Patron*
Posts: 1561
Joined: Wed Jan 12, 2005 10:47 am
Location: Bucharest, Romania

Post by Tzupy » Sat Aug 30, 2008 8:59 am

According to this article at the Inq, the solder problems are widespread:
http://www.theinquirer.net/gb/inquirer/ ... -parts-bad

soloman02
Posts: 37
Joined: Wed Apr 02, 2008 9:12 pm
Location: NH, USA
Contact:

Post by soloman02 » Sat Aug 30, 2008 12:56 pm

I hope not, since I have an 8800GT. Now I'm afraid to shut down my computer before bed to save money on electricity. If this is true, it looks as though I may be better off leaving my computer on all the time.

thejamppa
Posts: 3142
Joined: Mon Feb 26, 2007 9:20 am
Location: Missing in Finnish wilderness, howling to moon with wolf brethren and walking with brother bears
Contact:

Post by thejamppa » Sat Aug 30, 2008 1:25 pm

I really don't believe half of what Charlie says... but somehow I feel kind of relieved that I went for the HD 4850 and not the 9800GTX+ as I had planned at one point. The only reason for not getting the 9800GTX+ was local unavailability.

kel
Posts: 100
Joined: Wed Mar 26, 2008 6:32 am
Location: Switzerland

Post by kel » Sat Aug 30, 2008 7:25 pm

I've quickly scanned the heise article (I'm a native German speaker) - it goes into a lot more detail on what causes the problem. In short:

GPUs get hotter than CPUs (thus GPUs have higher quality requirements than CPUs and pose new challenges to chip design), and that leads to increased thermal stress on the joints between the GPU and the board, which can lead to breakage at this point. This also means that cooler chips are less at risk than hotter ones.

If I read this correctly, this would mean that a well-cooled 9600GT (i.e. 60°C under load) would probably be pretty safe, while high-end cards or purely passively cooled cards would be at higher risk? (This is pure speculation on my side.)

This makes me wonder though:
- Do Nvidia laptop chips really get this hot? (I have no idea, I never owned one.)
- Does this mean that the new ATI cards might also run into problems down the line, as they seem to run very hot too from what I've been reading...

At least that's what I'm getting from the article. It's bloody late though, and it's probably a good idea if I reread it tomorrow ;-)

AZBrandon
Friend of SPCR
Posts: 867
Joined: Sun Mar 21, 2004 5:47 pm
Location: Phoenix, AZ

Post by AZBrandon » Sun Aug 31, 2008 12:05 am

kel wrote:This makes me wonder though:
- Do nvidia laptop chips really get this hot? (i have no idea, i never owned one)
The original articles from a couple of weeks ago said the problem became public because several laptop makers were experiencing failure rates way out of the ordinary, and that because of the nature of the problem, laptops fail much sooner and more frequently than full-size PCIe cards due to the greater heat inside laptops.

It may be a while before we really get the whole story on exactly what's going on here, or given how shy nvidia is about anything other than attacking Intel and AMD in the press, perhaps we'll never know the true story. It certainly came at a great time for AMD/ATI however, that nvidia has an apparent quality control problem right at the same time AMD has competitive alternatives available.

shathal
Posts: 1083
Joined: Wed Apr 14, 2004 11:36 am
Location: Reading, UK

Post by shathal » Mon Sep 01, 2008 3:32 am

Nice warnings.

Think I'll definitely be getting a Radeon for now then :).

Thanks folks :).

CA_Steve
Moderator
Posts: 7651
Joined: Thu Oct 06, 2005 4:36 am
Location: St. Louis, MO

Post by CA_Steve » Mon Sep 01, 2008 6:40 am

kel wrote:I've quickly scanned the heise article (I'm a native German speaker) - it goes into a lot more detail on what causes the problem. In short:

GPUs get hotter than CPUs (thus GPUs have higher quality requirements than CPUs and pose new challenges to chip design), and that leads to increased thermal stress on the joints between the GPU and the board, which can lead to breakage at this point. This also means that cooler chips are less at risk than hotter ones.

If I read this correctly, this would mean that a well-cooled 9600GT (i.e. 60°C under load) would probably be pretty safe, while high-end cards or purely passively cooled cards would be at higher risk? (This is pure speculation on my side.)

This makes me wonder though:
- Do Nvidia laptop chips really get this hot? (I have no idea, I never owned one.)
- Does this mean that the new ATI cards might also run into problems down the line, as they seem to run very hot too from what I've been reading...

At least that's what I'm getting from the article. It's bloody late though, and it's probably a good idea if I reread it tomorrow ;-)
Yes, you can greatly reduce the risk by reducing the amount of thermal cycling/stress. Note that this was the BIOS fix for the laptops - they ramped up the fans earlier, at a lower temperature point.

Stay away from lousy cooling solutions and my guess is that this problem won't exist for you.

dhanson865
Posts: 2198
Joined: Thu Feb 10, 2005 11:20 am
Location: TN, USA

Post by dhanson865 » Mon Sep 01, 2008 5:17 pm

The newest article http://www.theinquirer.net/gb/inquirer/ ... -defective says that, good cooling or not, these Nvidia cards may not last the warranty period.

It's a godawfully long two-page article, so I'll pick some snippets that cover the high points. Follow the link if something I quoted doesn't seem to make sense; they explain it several different ways in the article.
The defective parts appear to make up the entire line-up of Nvidia parts on 65nm and 55nm processes, no exceptions. The question is not whether or not these parts are defective, it is simply the failure rates of each line, with field reports on specific parts hitting up to 40 per cent early life failures. This is obviously not acceptable.
The Nvidia defective chips use a type of bump called high lead, and are now transitioning to a type called eutectic, see here and here.
Getting back to the stress, it is what makes bumps fracture. Think of the old trick of taking a fork and bending it back and forth. It bends several times, then it breaks. The same thing happens to bumps. Heating leads to stress, aka bending, and then it cools and bends back. Eventually this thermal cycling kills chips.

Once again, if you did your engineering right, this won't happen in any timeframe that matters to mere humans; if it takes ten years of on-and-off switching to make it happen, once-a-day power cycling won't matter in our lifetimes. Chip makers tend to engineer for timelines like the ten-year horizon, and are pretty safe in assuming a chip will live for five years of casual use.
If you pick an underfill that is too soft, it doesn't provide enough mechanical support for the bumps; they crack and your chip dies an early death. Pick one that is too hard and it rips the polyimide layer off. In the words of one packaging engineer talked to for this article, if you used too hard an underfill, the chip "wouldn't survive the first heat cycle". The magic is in the middle: you have to pick a bowl of porridge, er, underfill, that is strong enough to provide the support you need, but not so strong as to rip layers off your chip. Like we said, package engineering is not for the faint of heart, but it can make baby bear happy.

That brings us to the billion-dollar question: why not simply change bump types to eutectic if they are that much better, which they are, in some ways? The answer is in the current capacity, more specifically the average current capacity. We mentioned this earlier, and the idea ties into the hot spots and functional units.

Take a hypothetical simple CPU that has an integer and a floating-point unit. If you are doing heavy integer work, the power bumps that supply that part of the chip will be loaded heavily and the FP bumps will not be doing much of anything at all. When the FP load gets heavy, the opposite happens.

The layout of the bumps is designed so that neither set will be overloaded at peak times, and in fact won't get all that close to their maximum.
The problem with eutectic bumps is that they have a lower current capacity, and the closer you get to it, the worse the problem of electromigration becomes.
If Nvidia wants to swap in eutectic bumps for the high-lead ones they are using, there is a slight problem: they are well over the current capacity of the new bumps.

If the chip actually powers up without letting the smoke out, the first time you fire up a massive game of Telengard, it will most assuredly go pop. In the rare case that the gods of luck are staring right at you and the thing doesn't fry immediately, electromigration will ensure it has the lifespan of a mayfly, basically worse than the current crop of defective Nvidia chips.

What do you do? You can either cut the power used by the GPU way, way down, i.e. clock it at a point where no one would ever buy it, or rearrange where the bumps go. The rearrangement is not a trivial thing, and may require moving large parts of the chip around, basically a partial re-layout. This is expensive, time-consuming, and likely can't be done and validated in the time the chip is on sale.

The other option is basically just as bad: you need a power plane or power grid on the die. This is a metal layer that distributes power across the die, and it means adding a layer to the chip. That means expense, slightly lower yield, and possible other detrimental effects on power draw and clocking.

All of these things can be dealt with if you see them coming when you start making the GPU. It is pretty painfully obvious that Nvidia didn't, otherwise they wouldn't have used high-lead bumps and gotten into the hole they are in. They have switched to eutectic bumps, but given the way it is being done, and the supplier grumbles we are hearing, it appears to be very poorly planned. It will be interesting to see the lifespan of these new parts.
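The fork analogy in the quoted piece is classic low-cycle fatigue, and the usual engineering shorthand for it is a Coffin-Manson-type relation: cycles to failure fall off as a power of the temperature swing. Here's a rough sketch of the idea; the exponent and the reference point are generic assumed round numbers for illustration, not Nvidia-specific data:

```python
# Coffin-Manson-style estimate: N_f = N_ref * (dT_ref / dT)**m.
# N_ref, dT_ref and the exponent m are assumed values for illustration
# only; real numbers depend on the bump alloy and the package.

def cycles_to_failure(delta_t, n_ref=10_000, dt_ref=40.0, m=2.0):
    """Thermal cycles to failure for an idle-to-load swing of
    delta_t degrees C, relative to an assumed reference point."""
    return n_ref * (dt_ref / delta_t) ** m

for dt in (20, 40, 60, 80):
    print(f"swing {dt:>2} C -> ~{cycles_to_failure(dt):8.0f} cycles")
```

Under a model like this, halving the temperature swing (which is all the laptop fan-speed "fix" really does) roughly quadruples the number of survivable cycles: it postpones the fractures rather than preventing them.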

CA_Steve
Moderator
Posts: 7651
Joined: Thu Oct 06, 2005 4:36 am
Location: St. Louis, MO

Post by CA_Steve » Mon Sep 01, 2008 7:42 pm

If the Inquirer article holds true (and that's a large assumption in itself :D ), then Nvidia failed at a pretty basic level. It's not like these issues are new to any IC manufacturer.

Tzupy
*Lifetime Patron*
Posts: 1561
Joined: Wed Jan 12, 2005 10:47 am
Location: Bucharest, Romania

Post by Tzupy » Tue Sep 02, 2008 7:04 am

According to this Inq article, if you exceed about 80°C the underfill becomes 10x softer:
http://www.theinquirer.net/gb/inquirer/ ... tive-chips
nVidia is supposed to be switching to a better underfill that withstands higher temps.

Edit: this newer article claims that the 'BIOS fix' isn't a real fix, but just makes the laptop fans run faster, so the laptops should live until after the warranty expires; after that it's not their problem anymore:
http://www.theinquirer.net/gb/inquirer/ ... /nv-should

dhanson865
Posts: 2198
Joined: Thu Feb 10, 2005 11:20 am
Location: TN, USA

Post by dhanson865 » Wed Sep 10, 2008 8:57 am

Class Action lawsuit filed in California.
"A lawsuit filed in a California court on Tuesday alleges Nvidia concealed the existence of a serious defect in its graphics-chip line for at least eight months 'in a series of false and misleading statements made to the investing public.' The lawsuit contends that Nvidia CEO Jen-Hsun Huang and CFO Marvin Burkett knew as early as November 2007 about a flaw that exists in the packaging used with some of the company's graphics chips that caused them to fail at unusually high rates. Nvidia publicly acknowledged the flaw on July 2, when it announced plans to take a one-time charge of up to $200 million to cover warranty costs related to the problem. That announcement caused Nvidia's stock price to fall by 31 percent to $12.98 and reduced the company's market capitalization by $3 billion, the lawsuit said. The lawsuit seeks class-action status against Nvidia and unspecified damages."
Sorry, this is the lawsuit for duped stock buyers, not duped product buyers. The duped-product lawsuit is in room 12.
Joking aside: if the part makes it past the warranty period, you have little recourse as a customer. While from a securities-law point of view it's illegal to say "we're doing great" while knowing your main product line is failing, unless the failing parts are in a safety-critical application (e.g. child car seats) there is no law mandating a recall/replacement/settlement for selling a crappy product.
Which specific chips are affected?

No one knows for sure, and Nvidia isn't telling. The Inquirer says practically all of them, but their author has a history with Nvidia, so there's quite a potential for bias there. The running theory is that the problem is due to the thermal properties of a substrate material. This substrate material supposedly expands and contracts at a different rate than the surrounding material in the chip package. Over time, this stresses the silicon or solder points, eventually causing failure of the part. Laptop parts are definitely affected; you only need to look in notebook manufacturers' forums and you'll see an incredible number of posts from owners of notebooks with, for example, 8600 GT mobile parts.

Desktop parts may also be affected, since they're all based on the same core silicon with (supposedly) the same substrate materials. It's possible that the problems aren't as apparent (at least not yet) due to the different thermal conditions you'd see in a tower chassis compared to a notebook. The very popular 8800GTs out there may start failing en masse in three months, six months, a year's time, or maybe never. Because Nvidia won't say specifically which parts are affected, whether it's all parts or only certain manufacturing runs, etc., we have only speculation and rumor to go on.
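The expansion-mismatch theory above can be put into rough numbers. A common first-order estimate for flip-chip joints is that the shear strain per thermal cycle scales as the CTE mismatch, times the temperature swing, times the distance from the die's neutral point, divided by the joint height. Every figure below is an illustrative assumption, not a measured Nvidia value:

```python
# First-order shear-strain estimate for a flip-chip solder bump.
# All numbers are illustrative assumptions, not measured values.

def bump_shear_strain(delta_alpha_ppm, delta_t, dnp_mm, height_mm):
    """Shear strain ~ CTE mismatch (ppm/K) * temperature swing (K)
    * distance from neutral point / bump height (dimensionless)."""
    return delta_alpha_ppm * 1e-6 * delta_t * dnp_mm / height_mm

# Silicon die (~3 ppm/K) on an organic substrate (~17 ppm/K),
# a 60 K swing between idle and load, a bump 7 mm from the die
# centre, 0.08 mm bump height -- all assumed round numbers.
strain = bump_shear_strain(17 - 3, 60, 7.0, 0.08)
print(f"shear strain per cycle: {strain:.3f}")  # ~0.074, i.e. ~7%
```

Strains of a few percent per cycle are exactly the regime where solder fatigues, which is why both the temperature swing and the die size matter: bumps at the corners of a big, hot die take the worst of it.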

shathal
Posts: 1083
Joined: Wed Apr 14, 2004 11:36 am
Location: Reading, UK

Post by shathal » Thu Sep 11, 2008 4:12 am

Nicely collected, dhanson865.

"Good" to see that so little changes in NVIDIA's generally "screw you" attitude. Let them (deservedly) suffer where it hurts in return (not that it's likely to make them act better, but hope remains...).

dhanson865
Posts: 2198
Joined: Thu Feb 10, 2005 11:20 am
Location: TN, USA

Post by dhanson865 » Tue Sep 30, 2008 8:28 am

"According to a research report out of UCLA, released this morning, Nvidia's high-lead bump packaging could last anywhere from 1/10th to 1/100th as long as AMD's advanced eutectic bump approach."
It could just mean that if failures occur along a normal distribution, which they probably do, the failure rate at each point in time is approximately 10-100x higher than for the ATI cards, which would be a Big Deal.

Most companies offer at least a year-long warranty; if they see significant failures in that year, like 10-100x higher than normal, that may put too much pressure on their warranty policy.

And let's not forget nVidia's partners in selling cards (you know, all the non-nVidia nVidia cards). Those people may see high failure rates of nVidia parts, and all of a sudden using another chipset just got a heckuva lot more attractive.

So, the moral of the story is, there is no set 'time' that a card will die. It's not like after 10 months all of them will just conk out. But if there are higher failure rates than normal in their warranty period, not to mention harm done to their reputation, it could end up costing them greatly.
This study does NOT specifically address or study AMD or NVidia's Chips.

It does not specifically address or test the exact chemical makeup of chips belonging to AMD or NVidia.

The conclusions being drawn about the relative life spans of those manufacturers' chips appear to belong strictly to the bloggers who want a big headline, and not to the authors of the study. The study authors specifically note that in order to determine the life span of real chips, the real chips in question should be studied. Quote:

"For life-time prediction, the real microstructure of these two kinds of flip chip solder joint should be studied and actual failure rate should be measured. "

The study states that they are ignoring various factors that would come into play in the real world in order to simplify the study, and that they are making a number of assumptions about various testing conditions and about the makeup of the materials themselves.

From reading the study linked, it's not even clear to me that they actually tested anything, and it appears from their wording to be only a theoretical exercise.

In no way should the results of this study be used to state that brand X's chips will have a longer lifespan than brand Z's chips.
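To put the warranty point above in concrete (entirely made-up) numbers: under a simple exponential lifetime model, cutting the mean life by the 10x-100x factors quoted earlier turns a negligible one-year failure probability into a near-certain one. The baseline mean life here is invented purely for illustration:

```python
import math

# Exponential lifetime model: P(fail by t) = 1 - exp(-t / mean_life).
# The baseline mean life and the 10x/100x factors are illustrative,
# echoing the 1/10th-1/100th range quoted above; nothing is measured.

def p_fail(years, mean_life_years):
    """Probability of failure within `years` for a given mean life."""
    return 1.0 - math.exp(-years / mean_life_years)

baseline = 20.0  # assumed healthy part: ~20-year mean life
for factor in (1, 10, 100):
    p = p_fail(1.0, baseline / factor)
    print(f"mean life {baseline / factor:>5.1f} y -> "
          f"{p:6.1%} fail within a 1-year warranty")
```

The absolute numbers are meaningless, but the shape of the effect is the point: the failure probability within a fixed warranty window is extremely sensitive to mean life, so even "only" 10x shorter life would be ruinous for warranty costs.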

dhanson865
Posts: 2198
Joined: Thu Feb 10, 2005 11:20 am
Location: TN, USA

Post by dhanson865 » Tue Oct 14, 2008 9:33 am

"HP has revealed faults with 38 different models in its slimline PC range, sparking speculation that Nvidia's faulty GPU problems have spread beyond laptops. HP's official statement says the problems are 'attributable to the computer's motherboard' and that affected machines 'may not boot or may not display video' — the same kind of terminology used to describe the previous faults with laptop GPUs. Both HP and Nvidia have declined to comment. But in a filing to the US Securities and Exchange Commission (SEC) earlier this year, Nvidia admitted 'there can be no assurance that we will not discover defects in other MCP or GPU products.'"

Note: the linked story (updated since this submission) says that Yes, the problems are now confirmed to be rooted in the Nvidia GPUs.

tehcrazybob
Friend of SPCR
Posts: 356
Joined: Wed Jan 16, 2008 8:56 pm
Location: Council Bluffs, Iowa
Contact:

Post by tehcrazybob » Tue Oct 14, 2008 11:21 am

So, does this confirm that the problem has spread past laptop-integrated GPU chips and is now affecting desktop cards? The delay doesn't surprise me much due to the average laptop having much less effective cooling than even the hottest desktop.

For people who own one of these cards, are there any suggestions to reduce the chance of failure? For instance, I have very effective cooling (50-55°C under load) and my computer is almost never shut off or put in standby (sorry, environment). As a result, my thermal-stress cycle count should be extremely low.

I'm also curious to see what happens with warranties if the problem becomes extremely widespread or NVidia confirms it (probably a bad idea, sadly).

mexell
Posts: 307
Joined: Sat Jun 24, 2006 11:52 am
Location: (near) Berlin, Germany

Post by mexell » Tue Oct 14, 2008 11:15 pm

Slimline PCs may well be equipped with mobile graphics chips, although given the looser space constraints one could argue that they should get more effective cooling to mitigate these problems, or at least postpone them (OT: can I say it like that?)
