SMP segmentation faults

Posted: Sun Jul 08, 2007 11:52 am
by vanhelmont
My last two work units terminated with similar errors at 67%. Here's what I copied from the terminal:

[13:06:32] Writing local files
[13:06:33] Completed 335000 out of 500000 steps (67 percent)
[0]0:Return code = 0, signaled with Segmentation fault
[0]1:Return code = 0, signaled with Segmentation fault
[0]2:Return code = 0, signaled with Segmentation fault
[0]3:Return code = 0, signaled with Segmentation fault
[13:23:53] CoreStatus = 0 (0)
[13:23:53] Client-core communications error: ERROR 0x0
[13:23:53] Deleting current work unit & continuing...
[0]0:Return code = 0, signaled with Quit
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 18
[13:28:21] - Preparing to get new work unit...
[13:28:21] + Attempting to get work packet
[13:28:21] - Connecting to assignment server
[13:28:21] - Successful: assigned to (171.64.65.56).
[13:28:21] + News From Folding@Home: Welcome to Folding@Home
[13:28:21] Loaded queue successfully.
[13:28:33] + Closed connections
[13:28:38]
[13:28:38] + Processing work unit
[13:28:38] Core required: FahCore_a1.exe
[13:28:38] Core found.
[13:28:38] Working on Unit 05 [July 8 13:28:38]
[13:28:38] + Working ...
[13:28:38]
[13:28:38] *------------------------------*
[13:28:38] Folding@Home Gromacs SMP Core
[13:28:38] Version 1.73 (November 27, 2006)
[13:28:38]
[13:28:38] Preparing to commence simulation
[13:28:38] - Ensuring status. Please wait.
[13:28:55] - Looking at optimizations...
[13:28:55] - Working with standard loops on this execution.
[13:28:55] - Previous termination of core was improper.
[13:28:55] - Going to use standard loops.
[13:28:55] - Files status OK
[13:28:57] (decompressed 537.8 percent)
[13:28:57] - Starting from initial work packet
[13:28:57]
[13:28:57] Project: 2608 (Run 0, Clone 40, Gen 30)

It appears to be running the same thing over and over, and crashing at the same place each time. I'm going to stop it, delete it, and try to get a new, different one.
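One way to confirm that the failures land on the same step is to scan the saved terminal output for the last progress line before each group of segmentation-fault lines. The following is only a rough sketch in Python, assuming the log lines look like the excerpt above; the function name `last_progress_before_crash` and both regular expressions are made up for this example and are not part of the client.

```python
import re

# Patterns taken from the log excerpt in this post; other client
# versions may word these lines differently (assumption).
PROGRESS = re.compile(r"Completed (\d+) out of (\d+) steps")
SEGFAULT = re.compile(r"signaled with Segmentation fault")


def last_progress_before_crash(log_text):
    """Return one (step, total) pair per crash, or None if no progress was logged."""
    crashes = []
    last = None        # most recent progress line since the previous crash
    in_crash = False   # True while reading consecutive segfault lines
    for line in log_text.splitlines():
        match = PROGRESS.search(line)
        if match:
            last = (int(match.group(1)), int(match.group(2)))
            in_crash = False
        elif SEGFAULT.search(line):
            if not in_crash:
                # The per-process segfault lines count as a single crash.
                crashes.append(last)
                last = None
                in_crash = True
        else:
            in_crash = False
    return crashes
```

If every pair in the result is identical, the crash is repeating at the same step, which is at least consistent with a bad work unit rather than random instability.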

Has this happened to anyone else? Is it my machine or a bad work unit?

Thanks

Posted: Mon Jul 09, 2007 3:59 pm
by vg30et
I can't say that I've seen your particular error, but here are some situations where I've seen SMP bomb out on me:

1. Network connection drops and/or when establishing a VPN connection
2. Attempting to restart FAH when FAH processes are already running, or when the MPICH2 service isn't started
3. Group policy or another automated script removes administrative rights from the account under which FAH was installed
4. Unstable processor/memory from excessively high temps or too high of an overclock

I think that in your case, a complete removal and reinstallation should work. My usual steps are:
- Stop FAH and make sure no FAH processes are still running (or reboot)
- Stop the "MPICH2 Process Manager, Argonne National Lab" service
- From Add/Remove Programs, uninstall Folding at Home
- Delete the Folding at Home folder from c:\program files\ or your install directory
- Reinstall using the .exe, then run install.bat (this should remove and reinstall the MPICH2 service)
- Reboot (optional) and run fah.exe
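
For the first step above, a quick way to check for leftover processes is to filter the output of the Windows `tasklist` command. This is a minimal Python sketch, not something shipped with FAH: the helper names are hypothetical, `fah.exe` and `FahCore_a1.exe` come from this thread, and `mpiexec.exe` is an assumption about how the SMP client launches the core.

```python
import csv
import io
import subprocess

# Image names to look for (compared in lower case).
# mpiexec.exe is an assumption, not confirmed in this thread.
FAH_IMAGES = {"fah.exe", "fahcore_a1.exe", "mpiexec.exe"}


def running_fah_processes(tasklist_csv):
    """Return (image name, PID) pairs for FAH-related rows in tasklist CSV output."""
    found = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if len(row) >= 2 and row[0].lower() in FAH_IMAGES:
            found.append((row[0], int(row[1])))
    return found


def read_tasklist():
    """Windows only: /FO CSV selects CSV output, /NH drops the header row."""
    result = subprocess.run(["tasklist", "/FO", "CSV", "/NH"],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

On a Windows machine, `running_fah_processes(read_tasklist())` should return an empty list before you move on to the uninstall.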

Posted: Mon Jul 09, 2007 6:23 pm
by Plissken
A while back I got a work unit that errored out, and it happened again a couple of days later on the same one (can't remember the # though). As far as I can tell I haven't gotten that WU since, and I've had no problems since then, so I blame it on a bad WU.
The other two things that have caused SMP not to run for me are:
- Password change on the PC, then you need to run install.bat again
- Expired / timed out FAH.EXE (Beta period expired), need to download it again.

Posted: Mon Jul 09, 2007 10:37 pm
by peteamer
Ivoshiee over at the folding forum is having the same problem.
No solution though...


Pete

Posted: Tue Jul 10, 2007 1:16 pm
by vanhelmont
I did control-c to stop the folding processes, renamed the directory with the folding executable and data, made a new folding directory, opened the archive in it, and ran the executable. It got a different work unit, which finished today without problems.
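
For anyone who wants to script the same steps, here is a rough Python sketch of the rename-and-unpack part. It assumes the client has already been stopped and that the download is in a format `shutil.unpack_archive` recognizes; the function name and the backup naming scheme are made up for illustration.

```python
import shutil
import time
from pathlib import Path


def fresh_client_dir(client_dir, archive):
    """Rename client_dir out of the way, recreate it, and unpack archive into it.

    Returns the path of the renamed (backup) directory.
    """
    client_dir = Path(client_dir)
    # Hypothetical naming scheme: append a timestamp so old data is kept.
    backup = client_dir.with_name(
        client_dir.name + time.strftime("-bad-%Y%m%d-%H%M%S"))
    client_dir.rename(backup)   # keep the old work unit data instead of deleting it
    client_dir.mkdir()
    # The archive format is guessed from the file extension.
    shutil.unpack_archive(str(archive), str(client_dir))
    return backup
```

Renaming rather than deleting keeps the failed work unit's files around in case someone asks for them later.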

Must have been a bad work unit. I had done about 20 SMP work units before, and this one failed twice, at the same point each time. Lost 4 days of folding, though.

The work unit I got today is the same project, but a different clone and run than the bad one. I'll see what happens.