SMP segmentation faults

vanhelmont
Posts: 26
Joined: Thu Aug 03, 2006 7:06 pm

Post by vanhelmont » Sun Jul 08, 2007 11:52 am

My last two work units terminated with similar errors at 67%. Here's what I copied from the terminal:

[13:06:32] Writing local files
[13:06:33] Completed 335000 out of 500000 steps (67 percent)
[0]0:Return code = 0, signaled with Segmentation fault
[0]1:Return code = 0, signaled with Segmentation fault
[0]2:Return code = 0, signaled with Segmentation fault
[0]3:Return code = 0, signaled with Segmentation fault
[13:23:53] CoreStatus = 0 (0)
[13:23:53] Client-core communications error: ERROR 0x0
[13:23:53] Deleting current work unit & continuing...
[0]0:Return code = 0, signaled with Quit
[0]1:Return code = 0, signaled with Quit
[0]2:Return code = 0, signaled with Quit
[0]3:Return code = 18
[13:28:21] - Preparing to get new work unit...
[13:28:21] + Attempting to get work packet
[13:28:21] - Connecting to assignment server
[13:28:21] - Successful: assigned to (171.64.65.56).
[13:28:21] + News From Folding@Home: Welcome to Folding@Home
[13:28:21] Loaded queue successfully.
[13:28:33] + Closed connections
[13:28:38]
[13:28:38] + Processing work unit
[13:28:38] Core required: FahCore_a1.exe
[13:28:38] Core found.
[13:28:38] Working on Unit 05 [July 8 13:28:38]
[13:28:38] + Working ...
[13:28:38]
[13:28:38] *------------------------------*
[13:28:38] Folding@Home Gromacs SMP Core
[13:28:38] Version 1.73 (November 27, 2006)
[13:28:38]
[13:28:38] Preparing to commence simulation
[13:28:38] - Ensuring status. Please wait.
[13:28:55] - Looking at optimizations...
[13:28:55] - Working with standard loops on this execution.
[13:28:55] - Previous termination of core was improper.
[13:28:55] - Going to use standard loops.
[13:28:55] - Files status OK
[13:28:57] (decompressed 537.8 percent)
[13:28:57] - Starting from initial work pa- Starting from initial work packet
[13:28:57]
[13:28:57] Project: 2608 (Run 0, Clone 40, Gen 30)

It appears to be running the same thing over and over, and crashing at the same place each time. I'm going to stop it, delete it, and try to get a new, different one.
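For anyone who wants to check their own log for the same pattern, here is a rough Python sketch. It only assumes the two line shapes visible in the paste above ("Completed N out of M steps" and "signaled with Segmentation fault"); the function names `crash_points` and `looks_like_bad_unit` are made up for illustration and are not part of the FAH client. It records the last completed step before each burst of segmentation faults and flags the case where every crash happened at the same step.

```python
import re

# Matches progress lines such as:
# "[13:06:33] Completed 335000 out of 500000 steps (67 percent)"
STEP_RE = re.compile(r"Completed\s+(\d+)\s+out of\s+(\d+)\s+steps")


def crash_points(log_lines):
    """Return the last completed step seen before each burst of segfault lines.

    Consecutive "Segmentation fault" lines (one per MPI rank) count as a
    single crash. A crash with no preceding progress line yields None.
    """
    points = []
    last_step = None
    in_fault = False
    for line in log_lines:
        match = STEP_RE.search(line)
        if match:
            last_step = int(match.group(1))
            in_fault = False
        elif "Segmentation fault" in line:
            if not in_fault:
                points.append(last_step)
                in_fault = True
        else:
            in_fault = False
    return points


def looks_like_bad_unit(points):
    """Heuristic only: two or more crashes, all at the same known step."""
    return len(points) >= 2 and None not in points and len(set(points)) == 1


if __name__ == "__main__":
    sample = [
        "[13:06:33] Completed 335000 out of 500000 steps (67 percent)",
        "[0]0:Return code = 0, signaled with Segmentation fault",
        "[0]1:Return code = 0, signaled with Segmentation fault",
        "[13:23:53] CoreStatus = 0 (0)",
    ]
    print(crash_points(sample))
```

Crashing at the same step twice is suggestive rather than conclusive: it fits a bad work unit, but it does not rule out a hardware or install problem, and the sketch assumes the crashes all belong to the same work unit.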

Has this happened to anyone else? Is it my machine or a bad work unit?

Thanks
An idle processor is the devil's workshop.

Contribute to medical research by putting your idle processor to work on folding@home! (http://www.silentpcreview.com/forums/viewtopic.php?t=11630)

vg30et
Posts: 105
Joined: Tue Aug 22, 2006 5:14 am

Post by vg30et » Mon Jul 09, 2007 3:59 pm

I can't say that I've seen your particular error, but here are some situations where I've seen SMP bomb out on me:

1. Network connection drops and/or when establishing a VPN connection
2. Attempting to restart FAH when FAH processes are already running or the MPICH2 service isn't started
3. Group policy or another automated script removes administrative rights from the account under which FAH was installed
4. Unstable processor/memory from excessively high temps or too high an overclock

I think that in your case, a complete removal and reinstallation should work. My usual steps are:
- Stop FAH and make sure no FAH processes are still running (or reboot)
- Stop the "MPICH2 Process Manager, Argonne National Lab" service
- From Add/Remove Programs, uninstall Folding at Home
- Delete the Folding at Home folder from c:\program files\ or your install directory
- Reinstall using the .exe and run install.bat (this should remove and reinstall the MPICH2 service)
- Reboot (optional) and run fah.exe

Plissken
Friend of SPCR
Posts: 235
Joined: Tue Dec 19, 2006 6:22 pm
Location: Seattle

Post by Plissken » Mon Jul 09, 2007 6:23 pm

A while back I got a work unit that errored out, then it happened again a couple of days later on the same one (can't remember the # though). I haven't gotten that WU again, as far as I can tell, and I've had no problems since, so I blame it on a bad WU.
The other 2 things that have caused SMP not to run for me are:
- Password change on the PC, then you need to run install.bat again
- Expired / timed out FAH.EXE (Beta period expired), need to download it again.
Antec P180B | Corsair HX520 | Gigabyte DS3L | E8400 @ 3.6 GHz | Noctua NH-U12F | 2GB Ballistix DDR2-800 | EVGA 8800GTS | 2x Samsung HD501LJ

peteamer
*Lifetime Patron*
Posts: 1740
Joined: Sun Dec 21, 2003 11:24 am
Location: 'Sunny' Cornwall U.K.

Post by peteamer » Mon Jul 09, 2007 10:37 pm

Ivoshiee over at the folding forum is having the same problem.
No solution though...


Pete

vanhelmont
Posts: 26
Joined: Thu Aug 03, 2006 7:06 pm

Post by vanhelmont » Tue Jul 10, 2007 1:16 pm

I did control-c to stop the folding processes, renamed the directory with the folding executable and data, made a new folding directory, opened the archive in it, and ran the executable. It got a different work unit, which finished today without problems.
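In case it helps anyone repeat this, here is a minimal Python sketch of the same rename-and-re-extract steps. The paths and the name `start_fresh` are hypothetical, and it assumes the client archive is stored outside the folding directory. It does not stop the running client, so do that first (control-c, as above).

```python
import shutil
from pathlib import Path


def start_fresh(work_dir, archive, backup_suffix=".bad"):
    """Set the old folding directory aside and unpack a clean copy.

    work_dir: directory holding the executable and work data (hypothetical path).
    archive:  client archive located OUTSIDE work_dir (hypothetical path).
    Returns the path the old directory was renamed to.
    """
    work_dir = Path(work_dir)
    backup = work_dir.with_name(work_dir.name + backup_suffix)
    if backup.exists():
        raise FileExistsError(f"backup directory already exists: {backup}")

    # Keep the old data instead of deleting it, in case it is needed later.
    work_dir.rename(backup)

    # Recreate an empty directory and unpack the client into it.
    work_dir.mkdir()
    shutil.unpack_archive(str(archive), str(work_dir))
    return backup
```

Renaming rather than deleting keeps the failed work unit's files around for inspection. The fresh directory starts without the old queue, which is presumably why the client fetched a different unit.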

Must have been a bad work unit. I had done about 20 SMP work units before, and this one failed twice, at the same point each time. Lost 4 days of folding, though.

The work unit I got today is the same project, but a different clone and run than the bad one. I'll see what happens.