new FAH rig is unstable


kittle
*Lifetime Patron*
Posts: 336
Joined: Thu Nov 09, 2006 4:44 pm
Location: San Jose, CA

new FAH rig is unstable

Post by kittle » Wed Oct 01, 2008 12:07 pm

I got a new system which will eventually make its way to my nephew, but in the meantime I wanted to run F@H and grab a few extra points.

Problem is the system reboots itself like clockwork after processing 2% of a WU.

I'm running the newest 6.22 Deino beta client in console mode.

I've shut off the Windows reboot on BSOD, so that's not the issue. I've checked temps: CPU cores run 45-48, GPU is 39-40.
Memtest86 ran for 1 hour with no errors.
I can play WoW for hours on end with no problems.

specs:
-DFI Lanparty JR P45-T2RS mATX motherboard
-E8400 (stock speeds)
-G.SKILL 2x1GB DDR2-1066 PC2-8500
-HD3870 512MB
-WD WD5000AAKS 500GB SATA
-SeaSonic S12 SS-550HT 550W
-Xigmatek HDT-SD964CPU
-Scythe S-FLEX SFF21F exhaust Fan

Ideas? Suggestions?
My nephew won't be running F@H, but this has me worried the system is unstable somewhere.

Vicotnik
*Lifetime Patron*
Posts: 1831
Joined: Thu Feb 13, 2003 6:53 am
Location: Sweden

Post by Vicotnik » Wed Oct 01, 2008 5:12 pm

If you have the time, try memtest86 for a longer period. I usually test overnight, since errors can sometimes emerge after several hours.

aristide1
*Lifetime Patron*
Posts: 4284
Joined: Fri Apr 04, 2003 6:21 pm
Location: Undisclosed but sober in US

Post by aristide1 » Wed Oct 01, 2008 7:03 pm

You should be able to complete Prime 95 on each core for 24 hours before attempting to fold.

What memory voltage are you running at?

kittle
*Lifetime Patron*
Posts: 336
Joined: Thu Nov 09, 2006 4:44 pm
Location: San Jose, CA

Post by kittle » Wed Oct 01, 2008 10:02 pm

I'll leave Prime95 running overnight and see how it goes.
Then I'll try memtest86 overnight.

But the guy who built the system said he ran Prime95 overnight.

aristide1
*Lifetime Patron*
Posts: 4284
Joined: Fri Apr 04, 2003 6:21 pm
Location: Undisclosed but sober in US

Post by aristide1 » Fri Oct 03, 2008 3:49 pm

You need to run one Prime 95 instance for each core; one is inadequate.

Your northbridge may be overheating or have inadequate voltage, or even both. If you grab the Northbridge heatsink with your fingers you should be able to maintain grip, but if you must let go in like 10 seconds or less it's probably too hot.

Run CPU-Z while you're running Prime 95; this way you see voltages under load. DDR2-1066 rarely runs at the board's lowest voltages at 1066. You could configure it as DDR2-800 and see if that gets you past 2 checkpoints.

I've run Prime 95 for 18 hours before my first failure. A small bump in memory voltage didn't fix it, but a small bump in NB voltage did.

There is also a non-DeinoMPI version of SMP; that's the one I use without any issues.

kittle
*Lifetime Patron*
Posts: 336
Joined: Thu Nov 09, 2006 4:44 pm
Location: San Jose, CA

Post by kittle » Sat Oct 04, 2008 10:50 am

So you're saying I should run _2_ copies of Prime95?

I ran 1 copy with 2 worker threads for 24 hours. No issues.
I ran memtest86 for about 20 hours. No errors.

I now have the single-core FAH client running (2 copies). It ran overnight with no problems and completed a couple of WUs as well.

I've tried the old "smpd" and the new "Deino" versions of the SMP client. They both fail the same way: spontaneous reboot.

Edit:
Memory voltage is 2.001 in the BIOS. SpeedFan reports 1.8 (I'm inclined to believe the BIOS).
Also, memory appears to be configured for 800MHz in the BIOS (I can't tell from CPU-Z).

aristide1
*Lifetime Patron*
Posts: 4284
Joined: Fri Apr 04, 2003 6:21 pm
Location: Undisclosed but sober in US

Post by aristide1 » Sat Oct 04, 2008 12:01 pm

kittle wrote:So you're saying I should run _2_ copies of Prime95?
Yes, one directed at each core.
kittle wrote:Memory voltage is 2.001 in the BIOS. SpeedFan reports 1.8 (I'm inclined to believe the BIOS).
Why? More important is why such a large spread? But if you can go up 0.05 volts, I would.
kittle wrote:Also, memory appears to be configured for 800MHz in the BIOS (I can't tell from CPU-Z)
On the memory tab, double the displayed frequency.
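
As a rough illustration of that doubling rule: DDR memory transfers data twice per clock, so the frequency CPU-Z shows on the memory tab is half the DDR2 rating. This is only a sketch, and the function name is made up for the example.

```python
def effective_ddr_rate(dram_clock_mhz: float) -> float:
    """Return the DDR2-xxx rating for a real DRAM clock in MHz.

    DDR transfers data on both clock edges, so the rating is 2x the clock.
    """
    return 2 * dram_clock_mhz


# A displayed 400 MHz means the memory is running as DDR2-800;
# a displayed 533 MHz means DDR2-1066.
for shown in (400, 533):
    print(f"CPU-Z shows {shown} MHz -> DDR2-{effective_ddr_rate(shown):.0f}")
```

So if the memory tab reads about 400 MHz, the BIOS setting of 800 is being applied.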

You should check the folding forum for spontaneous reboot issues as well.

There's still your NB.

zoatebix
Posts: 99
Joined: Thu Jun 08, 2006 1:57 pm
Location: Virginia

Post by zoatebix » Sat Oct 04, 2008 7:37 pm

The need to run multiple copies of Prime95 is alleviated if you have version 25.6. It's beta, it's multi-threaded, and it's available here: http://files.extremeoverclocking.com/file.php?f=103

There's also a multi-threaded CPU stress-tester based on the Gromacs core used by Folding@Home : http://www.gromacs.org/component/option ... Itemid,26/

If the stress tester works, you'll know that you have a software problem and not hardware problem, right?

Good luck!
George

kittle
*Lifetime Patron*
Posts: 336
Joined: Thu Nov 09, 2006 4:44 pm
Location: San Jose, CA

Post by kittle » Sun Oct 05, 2008 3:28 pm

Welp, that's the version of Prime95 I was running. It ran for 24 hours with no issues.

I've got the Gromacs tester running now:

CPU stress tester 2.0 (-h for help)
Architecture: ia32/x86 (32bit)
Copyright (c) Erik Lindahl <[email protected]> 2004-2007
Found 2 CPUs. (-n overrides #threads)
Executing 2 threads indefinitely.
Tested 2.01901e+012 FP operations.

I'll report back later with results

zoatebix
Posts: 99
Joined: Thu Jun 08, 2006 1:57 pm
Location: Virginia

Post by zoatebix » Sun Oct 05, 2008 9:16 pm

kittle wrote:Welp, that's the version of Prime95 I was running. It ran for 24 hours with no issues.
I had a feeling you'd say that :D

Good luck with the Gromacs stress tester! I wish I could help you more, but I've never had any SMP folding hang-ups bleed into the operation of the rest of my computer...

EDIT: On an unrelated note, wouldn't it be much better to Fold on the GPU and one core instead of messing with SMP? If you ever get fed up with tracking down this bug, ignoring the problem and avoiding its causes may be a good option...

kittle
*Lifetime Patron*
Posts: 336
Joined: Thu Nov 09, 2006 4:44 pm
Location: San Jose, CA

Post by kittle » Mon Oct 06, 2008 8:38 am

zoatebix wrote:EDIT: On an unrelated note, wouldn't it be much better to Fold on the GPU and one core instead of messing with SMP? If you ever get fed up with tracking down this bug, ignoring the problem and avoiding its causes may be a good option...
In theory, yeah... but
a) This PC isn't going to stay around all that long. It's getting a new home come Christmas.
b) I have other issues with the GPU client (nvidia + XP x64 ~ 10ppd).

And lastly, the reboot problem has me worried there's something buried in the hardware. If it was just a crashing FAH client, I wouldn't feel so bad.


Updates:
I ran the Gromacs tester for about 4 hours with no issues, then I fired up the SMP client and it rebooted again right on schedule after processing 2% of the WU.
So it's now back to running 2 single-core instances.

kittle
*Lifetime Patron*
Posts: 336
Joined: Thu Nov 09, 2006 4:44 pm
Location: San Jose, CA

Post by kittle » Tue Oct 07, 2008 10:46 am

More updates.

Now the single-core FAH processes are causing the PC to reboot, where before I was able to complete a few WUs.

1 step forward... 3 steps back.

I messed with the BIOS settings a little, then realized I had no clue what I was doing, so I reset the CMOS back to its defaults and went from there. No change in stability.
However, I notice that CPU-Z shows a core frequency of 2GHz when this is supposed to be a 3GHz CPU. The BIOS shows 3GHz on startup, so I'm confused. Is CPU-Z outdated? Or is there some funky power-saving mode going on even though something is running?

zoatebix
Posts: 99
Joined: Thu Jun 08, 2006 1:57 pm
Location: Virginia

Post by zoatebix » Wed Oct 08, 2008 8:14 am

kittle wrote:b) I have other issues with the GPU client (nvidia + XP x64 ~ 10ppd).
:?: In your first post you said you had an HD3870 512MB...


This may be apocryphal, but I recall hearing that Folding at Home will not kick a processor out of SpeedStep or Cool'n'Quiet when it's set to run at Idle priority. If cpu-z reports that the voltage is also lower than what you're expecting, then we've identified at least one part of a potential suite of problems. Try monkeying around with the power-management console on the windows control panel (or disabling "Enhanced EIST" or somesuch in the bios), or setting F@H's Core "Priority" to low. The last option might be easiest...

kittle
*Lifetime Patron*
Posts: 336
Joined: Thu Nov 09, 2006 4:44 pm
Location: San Jose, CA

Post by kittle » Thu Oct 09, 2008 11:49 am

zoatebix wrote:
kittle wrote:b) I have other issues with the GPU client (nvidia + XP x64 ~ 10ppd).
:?: In your first post you said you had an HD3870 512MB...
I have several PCs - only this new one with the HD3870 is causing my current set of problems.

updates from the regular FAH forums:
http://foldingforum.org/viewtopic.php?f=46&t=6073

In a nutshell, it sounds like everyone is grasping at straws.

The CnQ settings might have something to do with it as well.
I've gone into the Windows power control panel applet and set everything to "always on".
I do see the voltages in CPU-Z jump up to normal-looking levels when I start FAH. But if what you're saying is true, then the system may try to drop back to CnQ mode later on and be playing havoc with things.

I'll look for something relating to 'EIST' in the BIOS and try to disable it when I get home tonight.

Any other names that setting can go by? The BIOS in this motherboard has ZERO documentation and was not written by English-speaking people, so other abbreviations may not make sense to me.

zoatebix
Posts: 99
Joined: Thu Jun 08, 2006 1:57 pm
Location: Virginia

Post by zoatebix » Thu Oct 09, 2008 12:49 pm

http://csd.dficlub.org/forum/index.php

Someone there can probably help you out with bios settings. Some of the helpful people who used to hang out on DFI-Street (later DIY-Street; later still merged with http://forums.overclockersclub.com/) are over there now.

Post 4 of this thread will help you navigate "Genie Bios" sub-menu a little bit, at least: http://csd.dficlub.org/forum/showthread.php?t=8008 Note I only said "A little bit." RGone's obviously not done with that thread...

If my theory is right (it probably isn't - I'm grasping at straws, too), EIST, Enhanced EIST, and SpeedStep are the keywords you want to look for.

George

kittle
*Lifetime Patron*
Posts: 336
Joined: Thu Nov 09, 2006 4:44 pm
Location: San Jose, CA

Post by kittle » Thu Oct 09, 2008 8:00 pm

Well, I found an EIST setting in the BIOS and turned it off, but there was no visible change in the reboot frequency.

I'm way beyond the info posted in that thread -- what I need is the WHY of all those numbers and settings.

Somebody in the FAH forums mentioned fussing with the memory divider. Not sure where that is; nothing in the BIOS talks about a divider of any kind.

zoatebix
Posts: 99
Joined: Thu Jun 08, 2006 1:57 pm
Location: Virginia

Post by zoatebix » Fri Oct 10, 2008 8:11 am

kittle wrote:Well, I found an EIST setting in the BIOS and turned it off, but there was no visible change in the reboot frequency.
Is CPU-Z still reporting that your CPU is running at a lower speed than you expect while you're folding?
kittle wrote:Somebody in the FAH forums mentioned fussing with the memory divider. Not sure where that is; nothing in the BIOS talks about a divider of any kind.
Just from looking at RGone's pictures, I'm positive that the memory divider is controlled by the "DRAM Speed" setting on the first Genie Bios page.
kittle wrote:I'm way beyond the info posted in that thread -- what I need is the WHY of all those numbers and settings.
I saw you posted over at DFI Club. I should have warned you to make a signature! Sorry!

Anywho, DFI boards are top-notch performers but they're notoriously exacting and fussy. I think the only place where you're guaranteed to find people familiar with the quirks of your board is over there. I'm not surprised that RGone blamed your power supply, but maybe someone else there will humor you.

kittle
*Lifetime Patron*
Posts: 336
Joined: Thu Nov 09, 2006 4:44 pm
Location: San Jose, CA

Post by kittle » Fri Oct 10, 2008 8:55 am

Hrm. I totally missed the signature "requirements"... I'll go make one.

Interesting on the power supply too. I may have to go to Fry's and buy an 800W power supply for "testing", or maybe I could try my server-grade PSU that was replaced with a quieter one. It's only 420W, but it has a lot more amps on all the channels.

I did some more fiddling last night and I'm getting the impression that something is indeed failing, because the reboot timing is now random instead of at the same spot all the time.
I also managed to find the NB heatsink -- buried between the graphics card and the HSF. It was barely warm to the touch, so no heat issues there.

zoatebix
Posts: 99
Joined: Thu Jun 08, 2006 1:57 pm
Location: Virginia

Post by zoatebix » Mon Oct 13, 2008 8:25 pm

kittle wrote:interesting on the power supply too...
I mentioned it because RGone has a history of mistrusting Seasonic S12s. He could very well be right, but in my very, very limited experience (one DFI board (a Lanparty UT NF4 SLI-DR) and one old-school Seasonic S12 500W), the power supply was up to the task.

SebRad
Patron of SPCR
Posts: 1121
Joined: Sun Nov 09, 2003 7:18 am
Location: UK

Post by SebRad » Tue Oct 14, 2008 1:25 am

Hi, I have a few random thoughts. The PSU, as long as it's not faulty somehow, is way more than big enough. I estimate your system at 150-200W full load. By all means try another PSU for testing. The E8400 is 9x333 = 3GHz. EIST (Enhanced Intel SpeedStep Technology) allows it to run at 6x333 = 2GHz when idle. I think XP can only set 6x or 9x, but I think when I was running Vista it could set 7x and 8x as well if it saw fit. (I have an E6600, 9x266 = 2.4GHz.) You can likely fix the CPU multiplier in the BIOS.
You list your memory as 1066. Have you tried setting the DRAM frequency lower, to 800 or even 667? Try with one DIMM at a time, and try different memory slots. Have you tried under-clocking, maybe setting the CPU to 9x266 = 2.4GHz, to see if that makes any difference?
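
The multiplier arithmetic above can be sketched in a few lines. This is just an illustration of core clock = multiplier x FSB using the rounded FSB figures quoted in this post; the function name is invented for the example.

```python
def core_clock_mhz(multiplier: int, fsb_mhz: int) -> int:
    """Core clock is the CPU multiplier times the front-side bus clock."""
    return multiplier * fsb_mhz


# (label, multiplier, FSB in MHz) as quoted in the post above
cases = [
    ("E8400 full speed", 9, 333),
    ("E8400 with EIST at idle", 6, 333),
    ("E8400 under-clocked", 9, 266),
]

for label, mult, fsb in cases:
    mhz = core_clock_mhz(mult, fsb)
    print(f"{label}: {mult} x {fsb} = {mhz} MHz (~{mhz / 1000:.1f} GHz)")
```

The 6x case comes out at roughly 2GHz, which matches the 2GHz reading kittle saw in CPU-Z.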
What OS are you running? (sorry if you've mentioned it and I missed it)
Have you reinstalled the OS, or tried a different one?
Personally I find Memtest86 not to be that stressful, and it can fail to show errors for hours; I'm barely that patient! I find Orthos to be a pretty quick way of picking up stability problems, and 3D Mark is also quite sensitive. I also find the Windows install process to be quite a good test: if it goes through fine the system is likely stable, and if the Windows install fails randomly it usually indicates a hardware problem.
One other thing I've found is that you can have problems from lack of southbridge cooling. /Long Ramble My Dad's Abit NF-M2 AMD system wasn't very stable for about the first year! It would be fine for a day or two or three and then crash to a blue screen. Putting it under load, e.g. folding@home, tended to make it worse. I eventually decided to investigate properly and found that heavy cooling of the north and south bridges fixed the stability issue. The size and fit of the northbridge cooling seemed fine, but the southbridge cooler was pretty weedy and, once off, had only a tiny dot of thermal paste under it. Replacing it with an old, much bigger northbridge heatsink and good thermal paste has fixed the issue, and the machine is now stable, overclocked, folding 24/7 for days on end! My current PC (Asus P5B-E Plus) has had the southbridge heatsink swapped for a Zalman NB47J. I don't think I had issues as such, but I did notice errors in the System Event Log relating to disk issues; I don't have these any more. I had a similar experience with my previous AMD Socket A system, except that one could actually crash. A better sink on the southbridge seemed to help. It may well be that my quiet systems with limited air flow don't cool the southbridge as the motherboard designer intended. /Long Ramble.
Anyway, good luck!
Seb

cloneman
Posts: 448
Joined: Sat May 21, 2005 9:48 am

Post by cloneman » Tue Oct 14, 2008 7:51 am

Could the GPU have anything to do with this? Try stressing that, maybe...

Just putting it out there

kittle
*Lifetime Patron*
Posts: 336
Joined: Thu Nov 09, 2006 4:44 pm
Location: San Jose, CA

Post by kittle » Tue Oct 14, 2008 10:06 am

Well over the weekend I did the following:

Swapped the PSU for my noisy Supermicro server-grade PSU - no changes
Moved the video card to a different slot - no changes
Uninstalled video drivers - no changes
Put in my old Asus EN6800 card - no changes
Moved the old card to the 2nd PCIe slot - no changes
Ran with 1 stick of RAM - no changes
Swapped the memory sticks - no changes

I also noticed the problems seem to be behaving like a heat issue. From a cold start, F@H will run for an hour or so, then reboot. After that, the reboots come much more frequently until I run out of patience and just shut the thing off.
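
One way to check the "reboots get more frequent" impression is to write down each boot time and look at the gaps between them. The sketch below does that. The timestamps are made up purely for illustration (they are not taken from this machine), and the function names are invented for the example.

```python
from datetime import datetime


def uptimes_minutes(boot_times):
    """Return the minutes between each pair of consecutive boot times."""
    stamps = [datetime.strptime(t, "%Y-%m-%d %H:%M") for t in boot_times]
    return [(b - a).total_seconds() / 60 for a, b in zip(stamps, stamps[1:])]


def is_shrinking(intervals):
    """True if every interval is shorter than the one before it."""
    return all(later < earlier for earlier, later in zip(intervals, intervals[1:]))


# Hypothetical boot times noted by hand; replace with real observations.
example_boots = [
    "2008-10-11 10:00",
    "2008-10-11 11:05",
    "2008-10-11 11:30",
    "2008-10-11 11:42",
]

gaps = uptimes_minutes(example_boots)
print("minutes between reboots:", gaps)
print("getting shorter each time:", is_shrinking(gaps))
```

Steadily shrinking gaps would be consistent with something warming up, though they would not prove it on their own.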

@SebRad
I fiddled with the memory timings a little, but I know very little about what to do and what not to do when setting up memory timings.
I think I briefly tried underclocking the CPU to 2.8GHz -- but again no change in behavior.
The memory is listed as 1066, but I want to run it at 800 -- and that's the way the system was shipped to me.

I'm running Win XP with SP2.

As for the cooling - where is the SB and/or NB on the motherboard? How do I tell one from the other? Where does one find acceptable temps? The monitoring tools have a "chipset" temperature which sits steady at 40-41C no matter what I'm doing.
There are 3 extra heatsinks on the motherboard: one tiny black heatsink, and 2 that are connected via a heatpipe near the CPU socket. None of them seem to get really warm while the system is running. I can easily grab hold with no fear of burning.

I can try 3dmark and/or orthos to see what happens. I assume they are freely downloadable?

zoatebix
Posts: 99
Joined: Thu Jun 08, 2006 1:57 pm
Location: Virginia

Post by zoatebix » Tue Oct 14, 2008 8:42 pm

Image

They're named "North" and "South" because of their relative positions in a standard tower case - the closer chip to the top of the case is the northbridge or Memory Controller Hub (MCH), and the other is the southbridge or I/O Controller Hub (ICH).

Disjointed thoughts:
Is Orthos necessary now that Prime95 25.6 is around?

Looping test 5 will find errors faster than running the whole Memtest86+ suite. Or at least that's what the old Athlon 64 Overclocking Guide on DFI/DIY Street (and a little experience) taught me.

OK - SMP folding launches four processes and the Gromacs stress tester only launches 2... If two instances of the tester won't crash your system, we're in the same boat we've been in all thread. If they do cause a reboot, you'll have your first result that can eliminate MPICH, Deino, or something else closely tied to Folding@Home as the culprit.
EDIT: Nevermind - I forgot that single core folding causes you to reboot, too.

Links:
Another stress tester, OCCT: http://www.ocbase.com/perestroika_en/index.php?Download

Yet another stress tester, Intel's Thermal Analysis Tool:
http://www.techpowerup.com/downloads/392/mirrors.php

Probably less useful links:
3DMark 2001 SE: http://www.futuremark.com/download/3dmark2001/
3DMark 2006: http://www.futuremark.com/download/3dmark06/

kittle
*Lifetime Patron*
Posts: 336
Joined: Thu Nov 09, 2006 4:44 pm
Location: San Jose, CA

Post by kittle » Wed Oct 15, 2008 9:39 am

OK, thanks for the NB vs SB descriptions. I can figure out which one is which by looking at the motherboard manual.

And there are actually 3 controller chips on the board with cooling fins on them. None of them get hot enough that I can't grab and hold tightly while the system is under load.

I also ran a raytraced video render Monday night and all day yesterday. It completed without any issues. This one would stress the CPU and, a little bit, the HD.

Seeing as all the other stress test programs run w/o issue, I highly suspect some kind of software issue with FAH. The guys at the FAH forums are starting to admit they are stumped -- so I see that as a little progress.
