Fatal error: Box exploding.

haysdb · Post by **haysdb** » Sat Jan 17, 2004 3:44 am

That's one scary error message! I almost ducked!

[00:19:11] Quit 101 - Fatal error: Box exploding.
[00:19:11] 
[00:19:11] Simulation instability has been encountered. The run has entered a
[00:19:11]   state from which no further progress can be made.

This is from my notoriously unstable "blade #2".

David

lm · Post by lm » Sat Jan 17, 2004 5:05 am

I'm quite sure the client cannot detect all errors, so you are most likely sending bad results on your highly unstable machines?

mormakil · Post by **mormakil** » Sat Jan 17, 2004 6:20 am

lm wrote:I'm quite sure the client cannot detect all errors, so you are most likely sending bad results on your highly unstable machines?

If that's true, the whole project would be useless. I mean, what do you want simulation results for if they may be wrong? I don't know what they do to check the results, but they must have a way.

lm · Post by lm » Sat Jan 17, 2004 7:57 am

The only 100% sure way to check a result would be to rerun the simulation on a completely stable, working computer. Now, I think we can assume most of the results will be correct, and for a large part of the errors that do happen, either the machine crashes, the program crashes, or the sanity checks in the client catch them. But what if some value changes only slightly, so that the end result is still completely different from the correct one? I'm quite sure that highly unstable machines should stay away from these projects. At the very least, I think you should check with them whether I'm talking bullshit or I'm correct.
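To make the "some value changes only a little" worry concrete, here is a minimal sketch of how a tiny error can grow in an iterated calculation. It uses the logistic map as a stand-in for a chaotic system; this is an assumption for illustration only and has nothing to do with the actual Folding@home or GROMACS code. The function names and the perturbation size are hypothetical.

```python
# Illustration only: the logistic map stands in for "a long iterated
# simulation". It is NOT the Folding@home or GROMACS calculation.

def logistic_trajectory(x0, steps, r=4.0):
    """Iterate x -> r * x * (1 - x) and return every value visited."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def max_divergence(x0, perturbation, steps):
    """Largest gap between a clean run and a run whose start is nudged.

    The nudge plays the role of one silent arithmetic error on an
    unstable machine (hypothetical scenario).
    """
    clean = logistic_trajectory(x0, steps)
    nudged = logistic_trajectory(x0 + perturbation, steps)
    return max(abs(a - b) for a, b in zip(clean, nudged))


if __name__ == "__main__":
    # Same starting point, nudged by one part in a trillion.
    print("after 5 steps:  ", max_divergence(0.2, 1e-12, 5))
    print("after 100 steps:", max_divergence(0.2, 1e-12, 100))
```

Over a handful of steps the two runs are indistinguishable, but over a long run they end up in completely different places, and neither run crashes or looks obviously broken on its own.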

Sam Williams · Post by **Sam Williams** » Sat Jan 17, 2004 8:25 am

Don't these big projects send out each work unit to a couple of different computers? I certainly got the impression that that was how SETI@Home worked... That would obviously show up any errors - unless of course the two computers both made exactly the same errors.

I think that's how it's done - not sure though.

herosformula · Post by **herosformula** » Sat Jan 17, 2004 9:56 am

SETI@home routinely sends the same data packet to several different users around the world. If all the results agree, then at least they have some basis to assume that the program is working as intended.
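The "send the same packet to several users and see if they agree" idea can be sketched as a simple majority check. This is a rough illustration and not SETI@home's actual validator; the function name, the `quorum` and `decimals` parameters, and the host names are all hypothetical.

```python
from collections import Counter


def validate_by_quorum(results, quorum=2, decimals=6):
    """Return the agreed value if at least `quorum` hosts match, else None.

    results: mapping of host name -> numeric result for the SAME work unit.
    Values are rounded to `decimals` places so that harmless floating-point
    noise between machines does not block agreement (an assumed tolerance).
    """
    counts = Counter(round(value, decimals) for value in results.values())
    if not counts:
        return None  # nothing returned yet
    value, votes = counts.most_common(1)[0]
    return value if votes >= quorum else None


if __name__ == "__main__":
    returned = {"host_a": 1.2345671, "host_b": 1.2345673, "host_c": 9.9}
    print(validate_by_quorum(returned))  # host_c is outvoted
```

As Sam notes, this only fails when enough machines make exactly the same mistake. The cost is that every work unit is computed at least `quorum` times.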

In any system, there can be flaws. The GROMACS core itself could have a fatal flaw that renders all of the results that we have been working on useless, but this is not how you work towards a goal. If you expect that everything you do will be 100% correct, you will soon find yourself buried in misery.

haysdb · Post by **haysdb** » Sat Jan 17, 2004 2:41 pm

An interesting point has been raised. Aborted WU's are one thing, but is it possible that WU's can complete but be wrong? That would be "a bad thing."

David

Sam Williams · Post by **Sam Williams** » Sat Jan 17, 2004 7:10 pm

haysdb wrote:is it possible that WU's can complete but be wrong?

Thinking about it, it almost certainly is. There's no internal error-checking done by the client, is there? It just calculates once and sends. I guess the implication of this is that every prospective Folding machine should really be put through a Prime95 check first, to make sure its output is going to be useful.

haysdb · Post by **haysdb** » Sun Jan 18, 2004 12:16 am

Thread: Are WUs "double checked" once submitted?

Bruce, Site Admin @ Folding-community.org wrote:Stanford has elected to work very hard to discard bad results before they get them rather than to waste large quantities of time reissuing work and duplicating work.

There are many checks for WU coherence which cause WUs to be discarded. The client does check your arithmetic results periodically while the WU is processing to discard bad results...

<snip>

If a corrupt WU got past all the checks, it would probably appear as "strange," at this point, compared to similar WUs.

David

mormakil · Post by **mormakil** » Sun Jan 18, 2004 5:35 am

Bruce wrote:If a corrupt WU got past all the checks, it would probably appear as "strange," at this point, compared to similar WUs.

They must know what they're doing, but as always happens in science, you rely on things tested by others, which in turn rely on others' experiments... until some "rogue" scientist runs a contradictory experiment and throws everything stated before in the trash. And I wonder how they compare WUs and see something "strange" among them, but I guess that would require us to study a related field. In this specialized world, we must trust the knowledge of others.
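On the question of how one might spot a "strange" result among similar ones: one generic approach is to flag values that sit far from the median of the group. The sketch below is only a guess at the general idea, not Stanford's actual method; the function name, the threshold, and the sample numbers are hypothetical.

```python
from statistics import median


def flag_strange(values, threshold=5.0):
    """Return indices of values that lie far from the group's median.

    Distance is measured in units of the median absolute deviation (MAD),
    which a few wild values cannot easily distort. `threshold` is an
    arbitrary cutoff chosen for this illustration.
    Expects a non-empty list of numbers.
    """
    mid = median(values)
    mad = median(abs(v - mid) for v in values)
    if mad == 0:
        # At least half the values are identical; anything else stands out.
        return [i for i, v in enumerate(values) if v != mid]
    return [i for i, v in enumerate(values) if abs(v - mid) / mad > threshold]


if __name__ == "__main__":
    # Made-up summary numbers for six similar work units.
    summaries = [10.1, 9.9, 10.0, 10.2, 9.8, 42.0]
    print(flag_strange(summaries))
```

A check like this can only catch results that are wrong by a lot. A result that is subtly wrong but still plausible would pass, which is the case lm was worried about.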

Sam Williams · Post by **Sam Williams** » Sun Jan 18, 2004 2:31 pm

Ah, so at least if you *do* churn out garbage results, the system should catch them. However, this still means that you could in theory run a box which churned out garbage every WU, and never be any the wiser. It'd net you the points, but you'd be contributing nothing of real-world value, except to the electric company and the heating of your home. :-\

haysdb · Post by **haysdb** » Sun Jan 18, 2004 3:08 pm

It's a scary thought Sam.

For this precise reason, the boards which have not proven themselves 100% stable are being returned. This includes the Shuttle board, and both Abit boards, even though one hasn't given me any problem. The fact that the other one HAS is enough.

I am in search of Athlon micro-ATX motherboards which are STABLE at FSB 333, and even modestly overclocked. The one ASUS board has been stable, so another is on the way, but I am opening my eyes to NForce boards. I figure they deserve a shot. The KM400 boards have been given a fair shake IMO, and largely come up wanting.

David