Error while folding - advise please

ColdFlame
Posts: 451
Joined: Wed May 21, 2003 9:39 pm
Location: Somewhere in Time

Post by ColdFlame » Tue Nov 25, 2003 1:32 am

Could you please tell me whether this error is caused by:
1) -advmethods
2) overclocking (no crashes in games, stable)

[09:25:37] Completed 405000 out of 500000 steps (81)
[09:29:57] Quit 101 - Fatal error:
[09:29:57] Step 408732, time 817.464 (ps) LINCS WARNING
[09:29:57] relative constraint deviation after LINCS:
[09:29:57] max 0.000000 (between atoms 1 and 2) rms 1.#QNAN0
[09:29:57]
[09:29:57] Simulation instability has been encountered. The run has entered a
[09:29:57] state from which no further progress can be made.
[09:29:57] If you often see other project units terminating early like this
[09:29:57] too, you may wish to check the stability of your computer (issues
[09:29:57] such as high temperature, overclocking, etc.).
[09:29:57] Going to send back what have done.
[09:29:57] logfile size: 97138
[09:29:57] - Writing 97816 bytes of core data to disk...
[09:29:57] ... Done.
[09:29:57]
[09:29:57] Folding@home Core Shutdown: EARLY_UNIT_END
[09:29:59] CoreStatus = 72 (114)
[09:29:59] Sending work to server
[09:29:59] + Attempting to send results
[09:30:05] + Results successfully sent
[09:30:05] Thank you for your contribution to Folding@home.

Thanks!

ColdFlame
Posts: 451
Joined: Wed May 21, 2003 9:39 pm
Location: Somewhere in Time

Post by ColdFlame » Tue Nov 25, 2003 1:38 am

Google! Google! Google!

They say I should do this:
1) wait and see if it goes away with the next WU
2) return my system to stock speed
3) disable SSE

haysdb
Patron of SPCR
Posts: 2425
Joined: Fri Aug 29, 2003 11:09 pm
Location: Earth

Post by haysdb » Tue Nov 25, 2003 2:03 am

-advmethods is almost certainly not the culprit.

I don't necessarily agree with the 3 points in your last post, especially 1 and 3, and returning to stock speed seems a little extreme. I'd maybe back off just a tiny bit on the clock speed, or bump the voltage one notch. "Wait and see if it goes away?" Like it was some sort of cosmic accident that won't ever happen again? Disable SSE? Like SSE is the problem? No, I definitely don't buy options 1 and 3 at all.

David

unregistered
Posts: 542
Joined: Mon Aug 11, 2003 5:54 pm

Post by unregistered » Tue Nov 25, 2003 2:11 am

I had a similar "Failed" WU Sunday night/Mon AM.

I'm not OC'ed or running hot, and I have had no system instability. This warning is the first and only "instability" issue that F@H has come up with, and the only one for my computer in recent months. Maybe my computer was tired of those 2500 frame WUs?

[05:30:03]
[05:30:03] Assembly optimizations on if available.
[05:30:03] Entering M.D.
[05:30:09] Protein: p1037_A21unf_337_99
[05:30:09]
[05:30:09] Writing local files
[05:30:09] Extra 3DNow boost OK.
[05:30:09] Writing local files
[05:30:09] Completed 0 out of 2500000 steps (0)
[05:53:10] Writing local files
[05:53:10] Completed 25000 out of 2500000 steps (1)
[06:13:43] Quit 101 - Fatal error:
[06:13:43] Step 47468, time 94.936 (ps) LINCS WARNING
[06:13:43] relative constraint deviation after LINCS:
[06:13:43] max 0.000000 (between atoms 1 and 2) rms 1.#QNAN0
[06:13:43]
[06:13:43] Simulation instability has been encountered. The run has entered a
[06:13:43] state from which no further progress can be made.
[06:13:43] If you often see other project units terminating early like this
[06:13:43] too, you may wish to check the stability of your computer (issues
[06:13:43] such as high temperature, overclocking, etc.).
[06:13:43] Going to send back what have done.
[06:13:43] logfile size: 8535
[06:13:43] - Writing 9211 bytes of core data to disk...
[06:13:43] ... Done.
[06:13:43]
[06:13:43] Folding@home Core Shutdown: EARLY_UNIT_END
[06:13:45] CoreStatus = 72 (114)
[06:13:45] Sending work to server

haysdb
Patron of SPCR
Posts: 2425
Joined: Fri Aug 29, 2003 11:09 pm
Location: Earth

Post by haysdb » Tue Nov 25, 2003 2:22 am

Interesting. I checked the logs on my machines and found this:

[09:37:33] Completed 1150000 out of 2500000 steps (46)
[09:38:54] Quit 101 - Fatal error:
[09:38:54] Step 1153971, time 2307.94 (ps) LINCS WARNING
[09:38:54] relative constraint deviation after LINCS:
[09:38:54] max 0.000000 (between atoms 1 and 2) rms 1.#QNAN0
[09:38:54]
[09:38:54] Simulation instability has been encountered. The run has entered a
[09:38:54] state from which no further progress can be made.

Looks like I need to "eat some of my own dog food" and either back off on the FSB, bump the voltage a notch (I am undervolted), or leave it alone and see if it happens again!
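
For anyone who wants to check their own logs the same way, here is a minimal sketch in Python. It assumes the log lines look like the excerpts posted in this thread; the function name `find_early_ends` and the sample lines are only for illustration, and other core versions may word these messages differently.

```python
import re

# Patterns copied from the log excerpts in this thread (an assumption:
# other cores or client versions may format these lines differently).
PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps")
FAILURE = re.compile(r"Step (\d+), time [\d.]+ \(ps\)\s+LINCS WARNING")


def find_early_ends(lines):
    """Return one dict per LINCS failure: the failing step, the total
    number of steps in the WU, and how far along the WU was (percent)."""
    failures = []
    total = None
    for line in lines:
        m = PROGRESS.search(line)
        if m:
            # Remember the WU size from the most recent progress line.
            total = int(m.group(2))
            continue
        m = FAILURE.search(line)
        if m:
            step = int(m.group(1))
            percent = 100.0 * step / total if total else None
            failures.append({"step": step, "total": total, "percent": percent})
    return failures


sample = [
    "[09:37:33] Completed 1150000 out of 2500000 steps (46)",
    "[09:38:54] Quit 101 - Fatal error:",
    "[09:38:54] Step 1153971, time 2307.94 (ps) LINCS WARNING",
]
for failure in find_early_ends(sample):
    print(failure)
```

To scan a real log, pass an open file instead of the sample list (the log file name depends on your client install, so I have not hard-coded one).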

David
Last edited by haysdb on Tue Nov 25, 2003 8:11 am, edited 1 time in total.

FJC
Posts: 89
Joined: Tue Nov 04, 2003 9:06 am
Location: MI, USA

Post by FJC » Tue Nov 25, 2003 5:56 am

From what I've read, folding can be more of a true stability test than most or all of the benchmark tests out there, especially the Gromacs cores...

It might be worthwhile to just drop to stock speed for a WU or two to see if that resolves it. If it does, you certainly know it was due to overclocking stability issues, and you can work up from there.

Backing off only slightly may of course work, but it may take many many iterations of that - and if the issue isn't the overclocking (i.e., it's bad memory, or something like that), you may waste a lot of time.

dukla2000
*Lifetime Patron*
Posts: 1465
Joined: Sun Mar 09, 2003 12:27 pm
Location: Reading.England.EU

Post by dukla2000 » Tue Nov 25, 2003 6:50 am

IIRC ColdFlame's Google results, 1 & 3 are propagated by Stanford. Both are semi-admitting it may not be the most stable code in the world, but in reality the only times I have had this were on an over-overclocked system.

So back off the o/c or raise the VCore.

There was a stability thread here recently, in the midst of which I discovered 1 of my boxen wasn't as stable as I thought. Maybe on the weekend I will update that thread with specific details but I currently believe stability testing is:
1) 10 passes MemTest86 subtest 5 (for Athlons, picks up cache probs)
2) 2 passes Goldmemory (fairly quick)
3) 24 hours Prime with 'max heat' settings

haysdb
Patron of SPCR
Posts: 2425
Joined: Fri Aug 29, 2003 11:09 pm
Location: Earth

Post by haysdb » Tue Nov 25, 2003 9:00 am

FJC wrote:It might be worthwhile to just drop to stock speed for a WU or two to see if that resolves it. If it does, you certainly know it was due to overclocking stability issues, and you can work up from there.

Backing off only slightly may of course work, but it may take many many iterations of that - and if the issue isn't the overclocking (i.e., it's bad memory, or something like that), you may waste a lot of time.

I do not disagree with you, FJC. In my case, this particular machine is overclocked by 4% but, more importantly, undervolted into the 1.42 V range. With this processor on this motherboard with this configuration (memory timings, etc.), these settings were determined by running Prime95 until a stable setting was found and then backing off one "click" on the voltage. I did not run 24 hours of Prime95 or MemTest to confirm stability; that was laziness on my part, plus not wanting to sacrifice 24 hours of Folding production. I suspected this might still be on the edge of rock-solid stable, so I have increased vCore by 0.025 V and will continue to watch the logs.

Either approach will take multiple iterations.

David

FJC
Posts: 89
Joined: Tue Nov 04, 2003 9:06 am
Location: MI, USA

Post by FJC » Tue Nov 25, 2003 9:33 am

What a dedicated lot we Folders are. :)

And I know exactly what you mean about not wanting to miss valuable folding time!

NeilBlanchard
Moderator
Posts: 7681
Joined: Mon Dec 09, 2002 7:11 pm
Location: Maynard, MA, Eaarth

Lower your overclock a tick or two

Post by NeilBlanchard » Tue Nov 25, 2003 1:36 pm

Hello:

I had this happen on two identical machines, and since I lowered the OC by a whole 2 MHz I have not had an error. In my case, it may very well be the PC2100 that I am running at 152 MHz (304 DDR) at CL2 Turbo. When I ran it at 154 MHz, I got several corruptions.
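
To put those memory numbers in perspective, here is a rough sketch of the arithmetic in Python. It assumes the comparison is against the nominal PC2100 rating (DDR266, a 133.33 MHz clock with two transfers per clock); the function names are just for illustration.

```python
# PC2100 is DDR266: a nominal 133.33 MHz memory clock, two transfers per clock.
NOMINAL_MHZ = 400 / 3


def ddr_rating(clock_mhz):
    """Effective DDR rate for a given memory clock."""
    return 2 * clock_mhz


def overclock_percent(clock_mhz, nominal_mhz=NOMINAL_MHZ):
    """How far above the nominal clock the memory is running, in percent."""
    return 100.0 * (clock_mhz - nominal_mhz) / nominal_mhz


for clock in (152, 154):
    print(clock, ddr_rating(clock), round(overclock_percent(clock), 1))
```

Under that assumption, 152 MHz is about 14% over the rated memory clock and 154 MHz is about 15.5% over, so the 2 MHz step is a small change on top of an already large overclock of the RAM.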

ColdFlame
Posts: 451
Joined: Wed May 21, 2003 9:39 pm
Location: Somewhere in Time

Post by ColdFlame » Tue Nov 25, 2003 1:47 pm

Hey guys. Thanks for your replies. The good news is that Stanford gives partial credit for incomplete WUs, and it is fair. I mean my WU was 80% complete and I got 80% of full credit. So if you are concerned about the score then you are fine.

Now, it seems that this is caused by overclocking because it only started happening once I o/ced my AMD box. Neil and several others are saying the same. So I will lower the FSB a couple of MHz and see what happens.

Thanks!

haysdb
Patron of SPCR
Posts: 2425
Joined: Fri Aug 29, 2003 11:09 pm
Location: Earth

Post by haysdb » Tue Nov 25, 2003 5:42 pm

I agree about lowering the FSB by 1 or 2 MHz. It can definitely be the difference between perfectly stable and an occasional error.

David
