My long, sad tale of instability woes...

A forum just for SPCR's folding team... by request.

Moderators: NeilBlanchard, Ralf Hutter, sthayashi, Lawrence Lee

bcassell
Patron of SPCR
Posts: 70
Joined: Tue Mar 23, 2004 1:01 pm
Location: San Jose, CA, USA

Post by bcassell » Wed May 05, 2004 1:35 pm

I'm not sure if this is the right forum to post this in, but hey -- it directly affects my folding abilities =).

Ok, so this story is kind of long -- bear with me. I put together the first iteration of my current setup about 4-5 months ago. These were the specs:

A64 3000+
Asus K8V Deluxe
2x 512 Corsair Value PC3200
Radeon 9600 pro
Audigy sound card (the original one).
Antec 3700BQE case w/ stock power supply (350W Smartpower).
2x Seagate 7200.7 SATA 80gig drives
Lite-On CDRW/DVD combo drive.

The computer ran fine for a while (a few weeks to a month). Then I started to notice stability problems. I ran all the normal tests and found that prime95 wouldn't run for more than a few minutes without an error (memtest, however, could run for 36 hours straight with the fsb @ 220mhz with no errors). Adding voltage didn't help prime95, so I started underclocking. I had to get down to 800mhz before prime95 would run overnight. This really bothered me, but I was very busy, and the computer was mostly stable at stock speeds, so I just ran it like that for a month or two. Over those months, though, it seemed to get worse. Folding@Home would occasionally abort work units, Mozilla would sometimes crash, etc. I was about ready to just break down and buy a new cpu when I came across something that led me to my first replacement...

I had an Asus K8V with the faulty capacitors. If you're not familiar with what I'm talking about, Asus made a batch of K8Vs that had bad capacitors on them -- capacitors that were used for voltage regulation. I figured this HAD to be my problem. So I got ready to send the board back to Asus and ordered an AOpen AK86-L for a replacement. When I got the AOpen board I put everything together and... still had stability problems. It was the same thing. Same symptoms. Oh crap. So this led to my second replacement...

After some consideration, I decided that I had run my poor cpu on a motherboard with bad voltage regulation for just shy of 4 months. This probably was not healthy. So, I ordered a replacement cpu. Another A64 3000+. I was so happy this Monday when that box from Newegg arrived. But after replacing the cpu, I am STILL having the SAME instability problems.

So, at this point, I have no clue. The only thing I could possibly suspect would be my power supply. I really find it hard to imagine that my power supply could be causing small cpu errors, while everything else seems to run fine! Please, if anyone has any suggestions I would REALLY appreciate some help. I've already spent way too much money on this machine (having bought 2 motherboards and 2 cpus) and I really just want a stable computer!!! Thanks,

Bryan

MikeC
Site Admin
Posts: 12285
Joined: Sun Aug 11, 2002 3:26 pm
Location: Vancouver, BC, Canada

Post by MikeC » Wed May 05, 2004 1:55 pm

Oh boy...

You need to do some systematic diagnostics. For this, the best thing is to have a reasonably close system that works perfectly well, and just swap components from the bad system to the good one. (or vice versa).

The PSU is a definite potential culprit. Ditto the memory, motherboard and maybe VGA. Probably not the rest, and not the CPU.

Swap each of the above components out one at a time, and run the one specific app/operation which always tells you that you have the problem.
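One way to keep the swapping systematic is to write the plan down before touching any hardware. Below is a minimal Python sketch of the "one component at a time" idea. The suspect list follows the order given above; the function names and the "known-good"/"original" labels are made up for illustration and do not come from any tool.

```python
# Suspect components, in the order suggested in the post above.
SUSPECTS = ["PSU", "memory", "motherboard", "VGA"]


def swap_plan(suspects):
    """Return one test configuration per suspect.

    In each configuration exactly one part is replaced by a known-good
    one and everything else stays original, so a change in behaviour
    can be pinned on that single part.
    """
    plan = []
    for part in suspects:
        config = {name: ("known-good" if name == part else "original")
                  for name in suspects}
        plan.append((part, config))
    return plan


def culprits(results):
    """Given {swapped_part: test_passed}, list parts whose swap cured the fault."""
    return [part for part, passed in results.items() if passed]


if __name__ == "__main__":
    for part, config in swap_plan(SUSPECTS):
        print(f"swap {part}: {config}")
```

Each run should use the same failing test (here that would be Prime95), otherwise the results of different swaps cannot be compared.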

SpyderCat
Posts: 208
Joined: Sun Feb 23, 2003 12:22 pm
Location: The Netherlands

Re: My long, sad tale of instability woes...

Post by SpyderCat » Wed May 05, 2004 2:14 pm

bcassell wrote: ....... These were the specs:
A64 3000+
Asus K8V Deluxe
2x 512 Corsair Value PC3200
Radeon 9600 pro
Audigy sound card (the original one).
Antec 3700BQE case w/ stock power supply (350W Smartpower).
2x Seagate 7200.7 SATA 80gig drives
Lite-On CDRW/DVD combo drive.
............

The only thing I could possibly suspect would be my power supply. I really find it hard to imagine that my power supply could be causing small cpu errors, while everything else seems to run fine!
........
Bryan
When you suspect the PSU, why is it you don't give specifics on it?

In the forums dealing with the AMD64 they state over and over again that you need a PSU with a strong 12 volt rail.
20 Amps or more on the 12 volt rail is recommended there.

My rig is perfectly stable with an 18 Amp (max) 12 volt rail
And I undervolted the CPU to 1.3 volts, albeit running @ full speed.
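For reference, a rail's current rating converts to power by simple multiplication. A small Python sketch, assuming the amp figure is the rated maximum printed on the PSU label (the thread does not give the 12 V rating of the 350W Smartpower, so that number would have to be read off the unit itself):

```python
def rail_watts(amps, volts=12.0):
    """Maximum power a rail can deliver at its rated current."""
    return amps * volts


if __name__ == "__main__":
    # The two figures mentioned in this post.
    for amps in (18, 20):
        print(f"{amps} A on the 12 V rail -> {rail_watts(amps):.0f} W")
```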

Normally I end with "Have Fun" but I'll pass this time. :?

haysdb
Patron of SPCR
Posts: 2425
Joined: Fri Aug 29, 2003 11:09 pm
Location: Earth

Post by haysdb » Wed May 05, 2004 2:32 pm

Bryan, I can sure empathize with you. I had LOTS of stability issues with my folding farm. For ME, it ended up being flakey motherboards. To this day I don't understand it, but as soon as I got rid of the problematic motherboards, I have had no problems. I am using all of the same CPUs, memory, and power supplies I started with. The only thing that's different is the motherboards.

What I could not believe is that I could have so MANY flakey motherboards. The odds just seemed wildly against it. I must have RMA'd AT LEAST three motherboards, and it might have been more like five.

I agree with you, it shouldn't be the power supply. Power supplies are sometimes the culprit, but modern motherboards are not nearly as reliant on stable power as they used to be, so it's either a REALLY crappy power supply or a poorly designed motherboard. IMHO.

David

Wrah
Patron of SPCR
Posts: 316
Joined: Thu Apr 10, 2003 1:56 am

Post by Wrah » Wed May 05, 2004 2:56 pm

You have 2 sticks of RAM. Have you tried running it with both sticks separately (meaning one at a time)? And have you tested with Memtest?

haysdb
Patron of SPCR
Posts: 2425
Joined: Fri Aug 29, 2003 11:09 pm
Location: Earth

Re: My long, sad tale of instability woes...

Post by haysdb » Wed May 05, 2004 2:58 pm

bcassell wrote:(memtest, however, could run for 36 hours straight with the fsb @ 220mhz with no errors).

dukla2000
*Lifetime Patron*
Posts: 1465
Joined: Sun Mar 09, 2003 12:27 pm
Location: Reading.England.EU

Post by dukla2000 » Wed May 05, 2004 3:24 pm

Bryan - most of the above applies.

But also, in the absence of a stable system you can swap components with, and anyway, you should check/report all your voltages and temps as best as possible. Ideally check the voltages (at least +12V and +5V) with a multimeter, at worst with MBM or similar. The temps are harder to get an independent reading for, but MBM will probably do.

If Prime95 is still a reliable failure after a few minutes then you can use that as your primary diagnostic tool. Start MBM before you fire Prime, set the readout period to 1 or 2 seconds and get a feel for the 'normal' readings. Any wild fluctuations are cause for concern; absolute readings (from MBM) in themselves are not necessarily a problem.
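The "get a feel for normal, then watch for wild swings" approach can be sketched in Python. This assumes you have copied a series of readings for one rail into a list by hand; the sample values are invented, and the 3% threshold is an arbitrary illustration rather than a figure from MBM or any specification.

```python
from statistics import median


def flag_fluctuations(samples, threshold=0.03):
    """Return (index, value) for samples far from the typical reading.

    The median of the series stands in for the 'normal' value; a sample
    is flagged when it deviates from it by more than `threshold`
    (a fraction, so 0.03 means 3%).
    """
    if not samples:
        return []
    baseline = median(samples)
    return [(i, v) for i, v in enumerate(samples)
            if abs(v - baseline) / baseline > threshold]


if __name__ == "__main__":
    readings_12v = [12.05, 12.03, 12.06, 11.50, 12.04]  # made-up example data
    print(flag_fluctuations(readings_12v))
```

Using the median rather than the nominal rail voltage matches the point made above: the absolute numbers a motherboard sensor reports matter less than how much they move around.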

To try to get a stable system, can you run without {soundcard; CDRW; 1 stick RAM; 1 SATA} for any reasonable period of time? If Prime fails with 'standard' BIOS settings, 1 drive & 1 stick of memory then there is a big vote for the PSU (cause you have swapped CPU & mobo once already).

mas92264
Patron of SPCR
Posts: 659
Joined: Fri Sep 26, 2003 5:26 pm
Location: Palm Springs, CA, USA

Post by mas92264 » Wed May 05, 2004 3:39 pm

Same deal here with stability problems on 4 of my folders. 2 of them are open air boards and the passive nb heatsink would get nearly hot enough to burn my fingers (one was locking up and the other was aborting units.) I perched an 80 mm fan on the mobo and sorta pointed it at the nb heatsink for one and just set a 40 mm fan on top of the hs on the other. Problems with those 2 disappeared.

Apparently had a corrupted os file on one (wouldn't re-boot) and replaced the hd and reinstalled the os. OK now.

The most puzzling one was my SN45G. Was literally aborting every wu after folding for months with no changes no problems. Reloaded the chipset and video drivers and this seemed to fix it. I hope.

Which brings up an interesting point. I wasn't checking the log file and went for days aborting wu's without knowing it. Now, I'm checking them every day or so on my problem folders.
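Checking the log by hand every day is easy to forget, so here is a minimal Python sketch of a scanner. The marker strings are assumptions about what an aborted unit looks like in the client log and should be adjusted to whatever your own log actually prints; the script takes the log path as its only argument.

```python
import sys

# Assumed marker strings; change these to match your client's log output.
DEFAULT_MARKERS = ("EARLY_UNIT_END", "UNSTABLE_MACHINE")


def find_problem_lines(lines, markers=DEFAULT_MARKERS):
    """Return (line_number, text) for every line containing any marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in markers):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Usage: python check_log.py path/to/logfile
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as handle:
            for number, text in find_problem_lines(handle):
                print(f"{number}: {text}")
```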

Don't know if this will help; just some added field info.

M

unregistered
Posts: 542
Joined: Mon Aug 11, 2003 5:54 pm

Post by unregistered » Wed May 05, 2004 4:06 pm

You have probably fixed your hardware problem :D

Are you running XP?

For some reason, even though XP has been labeled "stable", it gets corrupted by hardware problems; at least that has been my observation/experience.

If it makes you feel any better, my clothes washer went down Friday night, and I couldn't fix it or get parts til Monday AM. BUT my hot water heater started leaking onto my hallway floor Sunday. My HTPC crapped out on me Saturday night, my main PC froze up on me Monday at lunch time (ram with data corruption) and ALL THIS after I finally got around to tearing the roof (shingles) off of my house on Friday!!!!!! This is why folding has become my primary hobby :D

Life is great! I still got one folder going and everything else except the HTPC and roof are taken care of.

bcassell
Patron of SPCR
Posts: 70
Joined: Tue Mar 23, 2004 1:01 pm
Location: San Jose, CA, USA

Post by bcassell » Wed May 05, 2004 5:07 pm

WOW. I didn't expect this many responses. Thanks in advance to everyone!

So, first thing: I've tried hard to do as many tests as I can, but the fact is, between work, etc. I just don't have that much time to debug my computer at home. So, with that said, I will try to do as many suggested tests as possible and I'll post here as soon as I do them =).

So, to some specifics....
MikeC wrote:You need to do some systematic diagnostics. For this, the best thing is to have a reasonably close system that works perfectly well, and just swap components from the bad system to the good one. (or vice versa).
Unfortunately, I don't have any systems that are even close. The only other system I have at home is a dual p3 800 system =(. I don't know anyone locally that has a system that's close either =(.
SpyderCat wrote:When you suspect the PSU, why is it you don't give specifics on it?
In my original post I stated that I had a 3700BQE with the stock 350W Antec Smartpower PS.
haysdb wrote:Bryan, I can sure empathize with you. I had LOTS of stability issues with my folding farm. For ME, it ended up being flakey motherboards.
Of course I'm keeping in mind that it could be the new motherboard as well, but man... how unlucky would I have to be? =P. Before I purchased the replacement motherboard I did some research and I could find almost no reported problems with the AOpen board (though obviously that proves nothing...)
Wrah wrote:You have 2 sticks of RAM. Have you tried running it with both sticks separately (meaning one at a time)?
Unfortunately, I haven't had time to test it with each stick of ram separately. That's definitely on my list of tests to do... it just seems unlikely to me that the ram could be the problem, considering how long I ran memtest for (with overclocked memory as well).
dukla2000 wrote:But also, in the absence of a stable system you can swap components with, and anyway, you should check/report all your voltages and temps as best as possible. Ideally check the voltages (at least +12V and +5V) with a multimeter, at worst with MBM or similar. The temps are harder to get an independent reading for, but MBM will probably do.
Unfortunately I don't have a working multimeter right now. I've been meaning to pick one up and this is probably as good an excuse as any. If I do get hold of one, I'll definitely take those measurements. As for the voltages reported in MBM, they are perfectly within tolerances, no problem, even starting/stopping Prime95 and folding@home. I've never seen MBM report a voltage that was close to out of spec.

As for temperatures, according to MBM the max I've ever seen my cpu at is 54C. When I was running these tests it was at 52C (this is on a very warm day). Just to make sure though, I turned my Zalman 7000 up to ~11v and my 120mm Evercool (exhaust) up to ~11v and the temp dropped to 47C and I still had problems. These problems were also occurring all winter, when my max load cpu temps were ~44C.
dukla2000 wrote: To try get a stable system can you run without {soundcard; CDRW; 1 stick RAM; 1 SATA} for any reasonable period of time? If Prime fails with 'standard' BIOS settings, 1 drive & 1 stick of memory then there is a big vote for the PSU (cause you have swapped CPU & mobo once already).
This will probably be my next test. If I run a minimal system and still have the problems then I can effectively rule out a lot of the possibilities suggested here.
mas92264 wrote:The most puzzling one was my SN45G. Was literally aborting every wu after folding for months with no changes no problems. Reloaded the chipset and video drivers and this seemed to fix it. I hope.
This reminds me of something I forgot to mention initially, and this is probably important, so if you're skimming through this post READ THIS PART: -- When I first put together the system I did a fresh install of winXP (and loaded all the latest chipset, video drivers etc). When I got my replacement motherboard I ALSO did a clean install of winXP with the then newer chipset, video drivers etc. Personally, it seems like the odds of this being a software problem are very low...

Ok, thanks again for all your help. If I missed something just let me know and I'll respond.

Bryan

P.S. -- I really feel the need to mention that I am just in awe of these message boards. For a point of reference, I also posted this problem in the Anandtech.com forums (only other place I could really think of, I don't post on hardware forums that often). In the anandtech forums I got two responses, both of them basically telling me to re-install winXP. I'm simply amazed by the amount of genuinely helpful and insightful advice I'm getting here =)

Beyonder
Posts: 757
Joined: Wed Sep 11, 2002 11:56 pm
Location: EARTH.

Post by Beyonder » Fri May 07, 2004 5:41 pm

Bryan,
I would also look in the event log of your computer to see if you can isolate any information about the instability. A lot of people overlook this aspect of troubleshooting, and it's actually quite valuable to determine issues you may be having.

That being said, what everybody else has said is probably where I would start. Good luck, and hope you get the box running like a champ. :D

bcassell
Patron of SPCR
Posts: 70
Joined: Tue Mar 23, 2004 1:01 pm
Location: San Jose, CA, USA

Post by bcassell » Fri May 07, 2004 9:46 pm

Ok, I've been super busy at work (already worked 70 hours this week), so I haven't had any time to test until today. I borrowed a multimeter from work and took some readings:

Line      Idle (V)  Load (V)
12V       12.09     12.06
5V        5.08      5.08

"Idle" was when my computer was just sitting in windows. "Load" was running prime95. I don't have time to do any more tests at the moment because, well, I need to sleep. I can probably bring my comp in to work tomorrow and swap the PS for one there (though we don't really have any quality ones), but it's looking to me like it's probably not the PS? I don't know, I didn't notice any spikes, but I was only watching for about 30 seconds =P. Any opinions on what this data means?

Bryan
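As a quick sanity check on those numbers, here is a minimal Python sketch comparing the readings with the nominal rail voltages. It assumes the +/-5% tolerance that the ATX specification gives for the positive rails. Keep in mind that a DC multimeter reading only shows the steady-state level; it says nothing about ripple or brief spikes.

```python
NOMINAL = {"12V": 12.0, "5V": 5.0}
TOLERANCE = 0.05  # +/-5%, assumed from the ATX spec for positive rails

# Multimeter readings from the post above, in volts.
READINGS = {
    "12V": {"idle": 12.09, "load": 12.06},
    "5V": {"idle": 5.08, "load": 5.08},
}


def deviation_pct(rail, measured):
    """Signed deviation from the nominal voltage, in percent."""
    return (measured - NOMINAL[rail]) / NOMINAL[rail] * 100.0


def within_tolerance(rail, measured, tolerance=TOLERANCE):
    """True if the reading lies inside nominal +/- tolerance."""
    return abs(measured - NOMINAL[rail]) <= NOMINAL[rail] * tolerance


if __name__ == "__main__":
    for rail, states in READINGS.items():
        for state, volts in states.items():
            status = "ok" if within_tolerance(rail, volts) else "OUT OF SPEC"
            print(f"{rail} {state}: {volts:.2f} V "
                  f"({deviation_pct(rail, volts):+.2f}%) {status}")
```

All four readings sit within about 2% of nominal, and the 12 V rail moves by only 0.03 V between idle and load.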

dukla2000
*Lifetime Patron*
Posts: 1465
Joined: Sun Mar 09, 2003 12:27 pm
Location: Reading.England.EU

Post by dukla2000 » Sat May 08, 2004 12:02 am

bcassell wrote:Any opinions on what this data means?
Nothing to be concerned about - in fact psu looks rock solid. Together with your previous observation that things are pretty stable in MBM you have eliminated the usual PSU overload (or crap psu) symptom.

You need to keep testing each component: this is not a complete exoneration of the psu (it could be spiking 20V AC or some ridiculous ripple numbers or ...) but I would tend to focus elsewhere at the moment. The easiest places are things you can swap out: if you can borrow a work psu then do it just for 100% elimination of your psu as the problem. Despite the MemTest stability you have I would worry more about your memory at this stage - can you borrow a stick from work for an hour? (Cause so far you have swapped mobo, CPU and just now PSU.)

And keep up the good work: it is important to all our communities to have folk paying more taxes :lol:

isp
Posts: 94
Joined: Wed Apr 14, 2004 11:48 am
Location: Columbus, OH United States

Post by isp » Sat May 08, 2004 5:10 am

I had stability issues with a system last night; as it turned out, I just needed to change the thermal compound.

I had a suspicion that cheap stuff sidewinder gave me was no good... :evil:

Went from 48c load to 38c! :D

bcassell
Patron of SPCR
Posts: 70
Joined: Tue Mar 23, 2004 1:01 pm
Location: San Jose, CA, USA

Post by bcassell » Sun May 09, 2004 12:17 am

Update time...

I ran memtest overnight last night just for kicks -- no errors.

So I brought my computer in to work today to do some testing (I have a P4C system at work so I figured I could test ram/PS at least). The first thing I tested was ram because, well, it's the easiest =). Turns out this problem is somehow related to ram. I'd try to explain but I'll just present data and you can draw your own conclusions...

My A64 system was running 2 512 mb sticks of Corsair ram. I'll call them C1 and C2. My P4 system was running 2 512mb sticks of Kingston ram. I'll call them K1 and K2. Keep in mind that these tests aren't exactly consistent and that run times for them varied between ~20 min and ~2 hours depending on what I was doing at the time. So obviously the results could be flawed. With that said, here's my test data (S indicates success, F failure):

Machine     C1     C2     C1+C2  K1     K2     K1+K2
P4          S      S      S      S      S      S
A64         F      S      F      S      S      S

As you can see, it looks like my A64 machine was not happy with one of the Corsair sticks. As hard as I tried, though, I couldn't get it to fail in the P4. So... for now I just swapped the ram between the two machines =). I think further testing is definitely necessary.
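The reasoning behind that conclusion can be sketched in a few lines of Python: for each machine, find the sticks that are present in every failing configuration. The table data is copied from above; the function name is made up for illustration.

```python
# Test matrix from the post: S = success, F = failure.
RESULTS = {
    "P4":  {"C1": "S", "C2": "S", "C1+C2": "S", "K1": "S", "K2": "S", "K1+K2": "S"},
    "A64": {"C1": "F", "C2": "S", "C1+C2": "F", "K1": "S", "K2": "S", "K1+K2": "S"},
}


def sticks_in_failures(results):
    """Map each machine to the set of sticks present in all of its failing runs."""
    suspects = {}
    for machine, row in results.items():
        failing = [set(config.split("+"))
                   for config, outcome in row.items() if outcome == "F"]
        suspects[machine] = set.intersection(*failing) if failing else set()
    return suspects


if __name__ == "__main__":
    for machine, sticks in sticks_in_failures(RESULTS).items():
        print(machine, sorted(sticks) if sticks else "no failures")
```

On this data the only common factor is C1 in the A64 machine. Since C1 passed in the P4, the table alone cannot say whether the stick itself is marginal or whether that particular stick and board simply do not get along.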

Also, there was one really odd occurrence. After swapping ram in and out for a while, I noticed that my A64 machine's cpu usage (performance tab in task manager) was hovering around 40-60%. When I looked in the "Processes" tab though, nothing was using the cpu! System idle was listed as 99%, yet the performance tab clearly showed cpu usage. I turned on the kernel times option and, sure enough, all the cpu time was spent in the kernel. I swapped ram a few times and this still kept happening. When I took my computer home tonight, though, and booted it up, it was no longer happening. Has anybody ever heard of anything like this? I've never seen anything like it.

And as a final note, I tried overclocking my cpu to 2.2ghz (from 2ghz) just now and it failed prime95 after 13 minutes =(. Oh well, at least it's stable at default speed (or so I hope). And I have a spare A64 cpu sitting around that I can try some other time! =P

Thanks again for all your help,

Bryan

dukla2000
*Lifetime Patron*
Posts: 1465
Joined: Sun Mar 09, 2003 12:27 pm
Location: Reading.England.EU

Post by dukla2000 » Sun May 09, 2004 2:22 pm

Presumably those were MemTest tests? So things should be stable - good luck.

The "high CPU usage" could be a swerveball - Windows confused about how to report the A64 doing its Cool'n'Quiet thing :?:

bcassell
Patron of SPCR
Posts: 70
Joined: Tue Mar 23, 2004 1:01 pm
Location: San Jose, CA, USA

Post by bcassell » Sun May 09, 2004 8:23 pm

dukla2000 wrote:Presumably those were MemTest tests? So things should be stable - good luck.

The "high CPU usage" could be a swerveball - Windows confused about how to report the A64 doing its Cool'n'Quiet thing :?:
Actually, those tests were prime95. I should have mentioned that. As I said before, my ram passed memtest86 for 36 hours @ 220mhz, and just two nights ago I tested again for 12 hours @ 200mhz. So, memtest would NOT detect errors with the memory. Prime95 would, however, fail. I suppose I should probably run memtest with the Corsair ram in the P4 machine just to be sure =).

As for the cpu usage... I don't have cool'n'quiet enabled, so it wasn't that. The more I think about it, I think it only happened when I had one stick of ram in... so I might try booting my computer with only one stick of ram to see if it happens again. Anyway, yes my computer seems stable for now, but I'll probably have to swap the ram back considering it came from my machine at work. Although, if the Corsair ram proves to be 100% stable in my p4 machine at work, my boss probably won't care, he's pretty understanding about this stuff (letting me dissect my work machine in the first place to run tests =)).

Bryan

mas92264
Patron of SPCR
Posts: 659
Joined: Fri Sep 26, 2003 5:26 pm
Location: Palm Springs, CA, USA

Post by mas92264 » Mon May 10, 2004 10:30 am

Had a similar deal with running 2-2-2 memory with "Auto" memory settings in the bios. Changed to 2.5-3-3 and this cured my stability problems on an Intel 865 chipset board which was ditching wu's and failing Prime95.
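To put those timings in perspective, each figure is a count of memory clock cycles, so it converts to time by dividing by the clock. A small Python sketch, assuming the memory runs at a 200 MHz clock (DDR400/PC3200); the post does not state the actual speed, so treat that default as an assumption.

```python
def latency_ns(cycles, clock_mhz=200.0):
    """Convert a timing given in clock cycles to nanoseconds."""
    return cycles * 1000.0 / clock_mhz


if __name__ == "__main__":
    for label, timings in (("2-2-2", (2, 2, 2)), ("2.5-3-3", (2.5, 3, 3))):
        as_ns = ", ".join(f"{latency_ns(t):.1f}" for t in timings)
        print(f"{label}: {as_ns} ns")
```

The relaxed setting gives the modules a few extra nanoseconds per operation, which can be enough margin for sticks that are not quite stable at the tighter values.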

M

trodas
Posts: 509
Joined: Sun Dec 14, 2003 6:21 am
Location: Czech republic

Post by trodas » Wed May 12, 2004 1:28 pm

Well, if memtest (assuming you're using the latest 3.1a) did not fail, then stop looking for the culprit in the memory - it's not there :roll:
The fact that the machine can run memtest for 36 hours indicates one thing - the machine can be very stable until you stress the CPU, is that right? :P
(I must admit at this point that after I read you're using only a 350W PSU, my first idea was to replace it with at least a 400W+ one - I'm using a 431W Enermax, so...)
When you measured the voltages you did not measure the 3.3V rail, and believe it or not, this is absolutely the major one - it powers the chipset (on older mobos with a 3.3V chipset) and the voltage regulators, which power the rest of the machine :wink:
Nothing is more important than the 3.3V rail, believe me. Try it again, please. I still have a suspicion that the PSU isn't keeping up, because over time it gets dustier and then can't handle the power demand...

Apart from that, what is cooling your mobo's NB? Since the A64 has a built-in memory controller, it could possibly fail Prime95 as if it were the CPU's fault when the chipset gets overheated... However, if it can pass a 36 hour memtest, this can hardly be the issue, but just check it out... :wink:

And one more idea - the hardest stability problem I ever had was with a machine where I added an "additional" brass mobo mounting screw below the mobo so that I could push in the floppy cable w/o bending the mobo. Sure, a great idea, but after a while it turned out that the mobo, once heated, would bend, touch the case and cause a crash. An almost random one. Sometimes it went 14 days w/o this happening, sometimes it crashed 3x per day. It took me nearly 2 years of exchanging components to find this out. (Of course I found the problem only after I had exchanged almost everything, from PSU to HDD to CPU to memory to gfx card to soundcard to NIC...) So, if you get desperate, remove everything from the case and run the computer on the desk :wink:

Other than that, try removing everything you can from the machine to see if it helps.

If your machine seems stable now, it could be that your PSU cable had just pulled out a little or something like that, and you fixed it while re-opening the machine so many times :wink:
I wish you luck!

PS. My bro is having random crashes these days too. We exchanged almost everything, but they kept happening. Then we swapped the CPU and - from that point on, no crashes. Weird. Even weirder is that the voltage option disappeared from the BIOS :roll:
