Help - Deleted Work Units

A forum just for SPCR's folding team... by request.

Moderators: NeilBlanchard, Ralf Hutter, sthayashi, Lawrence Lee

dukla2000
*Lifetime Patron*
Posts: 1465
Joined: Sun Mar 09, 2003 12:27 pm
Location: Reading.England.EU

Post by dukla2000 » Wed May 07, 2003 4:08 am

OK, using Console & Electron Microscope. Have handled 12 WU on one of my systems, 4 of which have been deleted ('nil points'!).

Crazy thing is it runs and 'looks' normal, but when it gets to 100% nothing happens. So I took to shutting down & restarting, at which stage the unit was deleted. Now I have been watching the log on that system more closely, and I noticed the early warning is that actually sometimes the FAHconsole stops writing to the log, even though EM reports Frames clocking as usual. When I catch the 'no log' early enough (before 100%) and restart, it restarts as per EM reporting.

An example log extract is below (only blank lines edited out): everything is fine up to 90 frames, then nothing in the log (till I manually shut down). In the meantime EM reported frames counting up to 100, at which stage nothing happened, so I shut down (& lost my points :evil: ) Any offers?

...
[10:11:23] Completed 225000 out of 250000 steps (90)
Folding@home Client Shutdown.
--- Opening Log file [May 7 11:42:50]
# Windows Console Edition
Folding@home Client Version 3.24
http://foldingathome.stanford.edu
email:[email protected]
[11:42:50] - Ask before connecting: No
[11:42:50] - Use IE connection settings: Yes
[11:42:50] - User name: Dukla2000 (Team 31574)
[11:42:50] - User ID = 55B302820D53F25D
[11:42:50] - Machine ID: 1
[11:42:50]
[11:42:50] Loaded queue successfully.
[11:42:50] + Benchmarking ...
[11:42:54]
[11:42:54] + Processing work unit
[11:42:54] Core required: FahCore_78.exe
[11:42:54] Core found.
[11:42:54] Working on Unit 01 [May 7 11:42:54]
[11:42:54] + Working ...
[11:42:54]
[11:42:54] *------------------------------*
[11:42:54] Folding@home Gromacs Core
[11:42:54] Version 1.45 (April 21, 2003)
[11:42:54]
[11:42:54] Preparing to commence simulation
[11:42:54] - Looking at optimizations...
[11:42:55] - Created dyn
[11:42:55] - Files status OK
[11:42:55]
[11:42:55] Folding@home Core Shutdown: MISSING_WORK_FILES
[11:42:58] CoreStatus = 74 (116)
[11:42:58] The core could not find the work files specified. Removing from queue
[11:42:58] Deleting current work unit & continuing...
[11:43:02] + Attempting to get work packet
[11:43:02] - Connecting to assignment server
[11:43:15] - Successful: assigned to (171.64.122.125).
[11:43:15] + News From Folding@Home: Welcome to Folding@Home
[11:43:15] Loaded queue successfully.
[11:43:16] - Deadline time not received.
[11:43:17] + Closed connections
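
The "no log" early warning described above could be checked automatically instead of by eye. This is only a minimal sketch in Python, not anything the client provides: the function name `log_is_stalled` and the idea of using the log file's last-modified time are assumptions for illustration, and the log path is whatever file your client writes to.

```python
import os
import time


def log_is_stalled(log_path, max_idle_seconds, now=None):
    """Return True if the log file has not been written to recently.

    log_path: path to the client's log file (assumed; use your own).
    max_idle_seconds: how long a silence counts as a stall.
    now: current time in seconds since the epoch (defaults to time.time()).
    """
    if now is None:
        now = time.time()
    # getmtime gives the last modification time of the file in epoch seconds.
    last_write = os.path.getmtime(log_path)
    return (now - last_write) > max_idle_seconds
```

Pick a threshold comfortably longer than one normal frame, so an ordinary gap between "Completed" lines is not reported as a stall. A True result only says the log has gone quiet; it does not say why.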

Wrah
Patron of SPCR
Posts: 316
Joined: Thu Apr 10, 2003 1:56 am

Post by Wrah » Wed May 07, 2003 7:01 am

I've had a lot of similar problems too. Just a question: is your CPU running overclocked, or have you adjusted the memory timings manually? And do you have the problems with all cores or only gromacs?

The gromacs core is VERY sensitive. I had my system overclocked to the limit while still being really 100% stable. But every gromacs core I got went wrong. After running a few hours the frame time would suddenly go from 10 mins to 2 hours, or a frame would not finish at all anymore. After restarting, it would either give the MISSING_WORK_FILES error or simply an EARLY_UNIT_END, and it would download a new WU. After looking around on the FAH forums I reset my memory timings to default and throttled my CPU back just a tad. Since then all my problems are gone. (well, at least where gromacs is concerned :))
Last edited by Wrah on Wed May 07, 2003 7:16 am, edited 1 time in total.

Post by dukla2000 » Wed May 07, 2003 7:15 am

No overclocking: that box is a Thunderbird 1.4 and I never could overclock it. Also memory timings (PC133 SDRAM) are nothing adventurous: 2 sticks of Crucial CL2 running at standard SPD settings.

I have to admit I undervolt that CPU: spec is 1.75V and I often set the jumpers down so MBM reports 1.70V. Have never traced any problem to that, but coincidentally I set it back to stock VCore yesterday, before the log I posted, so it can't be that. (Was trying to debug a hang in The SIMS, played by some of my clan: turns out it was a corrupt family :D )

Also should have mentioned I had the first Deleted units when running the Graphical Client version.

Also should have mentioned that I have disabled hdd power down (figured maybe stuff was getting lost in hdd cache) with no effect.

Also should have mentioned that my other 2 boxes (1 of which is overclocked mildly - XP2100 to XP2700) have never deleted a unit. So I can set things up OK sometimes!

Am convinced it is something to do with the box but am stumped.

Mr_Smartepants
Posts: 539
Joined: Fri Apr 04, 2003 6:35 am
Location: Cambridgeshire, England

Post by Mr_Smartepants » Wed May 07, 2003 7:16 am

It might be a bug with the Fahcore_xx.exe?
I noticed when I got dealt a heavy WU (1245 steps!) that my client stopped recording at 21% done (let it work all night and no progress). I noticed in the logfile that it was using FahCore_78.exe and I thought "hey that's new!". So I gave it until Monday and when it still didn't finish I just deleted the whole install and reinstalled the client from scratch. Now it's using FahCore_65.exe and there's no problem.

Might have been a corrupted core? Maybe a bug?

Post by dukla2000 » Wed May 07, 2003 7:24 am

Wrah wrote:And do you have the problems with all cores or only gromacs?
With all cores. Looking through the 10 unit record, I have a deleted Genome, Tinker & 2 Gromacs.

The deleted Tinker was Project 632, I also have a successful Tinker for Project 632.

The deleted Gromacs are Projects 540 & 537, while Projects 670 & 542 completed OK.

Based on your o/c problem I may disable some of the silent options on that system for a few days to see if it helps. CPU (or at least diode in socket) is only running 43C @1.75V but let me see if I can blast it into mid 30s for a while.

Post by Wrah » Wed May 07, 2003 7:39 am

The o/c problem I read about was specific to the gromacs core though.
Or you can run your system with some very safe settings, something like fsb back to 100. At least to eliminate one of the hardware components being a possible cause.
You can also try looking around on the F@H forum.

Post by Wrah » Wed May 07, 2003 7:43 am

Just did a bit of reading there (sigh, should actually be working, but can't resist the f@h addiction), and it seems they just got new versions of some cores out with bugfixes. If you delete the core exe files from the folding@home directory, you will force it to download the new versions. See if that helps.
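
For anyone wanting to try the "delete the core exe files" step without losing the old files, here is a minimal Python sketch that moves them aside instead of deleting them. The function name `set_aside_cores` and both directory arguments are made up for illustration; only the `FahCore_*.exe` naming comes from the log earlier in the thread. Whether the client then fetches fresh cores is as reported above, not something this sketch checks, and it is probably wise to stop the client first.

```python
import glob
import os
import shutil


def set_aside_cores(client_dir, backup_dir):
    """Move every FahCore_*.exe from client_dir into backup_dir.

    Returns the list of file names that were moved, in sorted order.
    Both directory paths are supplied by the caller (hypothetical here).
    """
    os.makedirs(backup_dir, exist_ok=True)
    moved = []
    pattern = os.path.join(client_dir, "FahCore_*.exe")
    for path in sorted(glob.glob(pattern)):
        name = os.path.basename(path)
        # Moving rather than deleting keeps the old core for comparison.
        shutil.move(path, os.path.join(backup_dir, name))
        moved.append(name)
    return moved
```

Nothing else in the client directory is touched, so the queue and log stay where they are.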

Post by dukla2000 » Wed May 07, 2003 8:09 am

Wrah wrote:Just did a bit of reading there (sigh, should actually be working, but can't resist the f@h addiction), and it seems they just got new versions of some cores out with bugfixes. If you delete the core exe files from the folding@home directory, you will force it to download the new versions. See if that helps.
Thanks - and to Mr Smartepants. Have decided to go for the D&C when it finishes the massive 6 point WU it is currently on.

Calling Ghostbusters

Post by dukla2000 » Fri May 09, 2003 3:47 am

OK, so I uninstalled the graphical client version, deleted the corexx.exe as well as the log. Everything was looking great: figured that it was a bit early (only 1.5 WU) to assume & post success.

Then decided to have a look at my time per frame: occasionally EM has got excited that a frame was taking too long, but it was recovering automatically. Extracted the log into Excel and calc'd the time per frame - sure as hell the occasional frame was taking abnormally long, e.g. on the current WU (Protein 385, p540_BBA5_N in water) the usual time per frame is 13 minutes 20-30 seconds. The abnormal frames are 24 to 27 minutes. Then noticed that in fact yesterday afternoon the log had stopped at frame 301 (on a previous Tinker) and restarted at frame 310: the machine had hung while the family were playing and had to be rebooted.
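
The time-per-frame calculation can also be done straight from the log without a spreadsheet. A rough Python sketch follows, assuming progress lines look like the `[10:11:23] Completed 225000 out of 250000 steps (90)` line in the extract earlier in the thread. The names `frame_times` and `slow_frames` and the 1.5x-median cutoff are illustrative choices, not anything from the client or EM.

```python
import re

# Matches e.g. "[10:11:23] Completed 225000 out of 250000 steps (90)"
FRAME_RE = re.compile(
    r"^\[(\d{2}):(\d{2}):(\d{2})\] Completed (\d+) out of (\d+) steps"
)


def frame_times(lines):
    """Return seconds elapsed between consecutive 'Completed' lines."""
    stamps = []
    for line in lines:
        m = FRAME_RE.match(line.strip())
        if m:
            h, mi, s = (int(m.group(i)) for i in (1, 2, 3))
            stamps.append(h * 3600 + mi * 60 + s)
    # The log has no date, so wrap differences that cross midnight.
    return [(cur - prev) % 86400 for prev, cur in zip(stamps, stamps[1:])]


def slow_frames(deltas, factor=1.5):
    """Return indices of frames slower than factor times the median."""
    if not deltas:
        return []
    ordered = sorted(deltas)
    median = ordered[len(ordered) // 2]
    return [i for i, d in enumerate(deltas) if d > factor * median]
```

Feed it the lines of the log file. A client restart in the middle of a unit will show up as one long "frame", so those gaps need to be judged by hand.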

Then noticed that the 'slow' frames are regularly every 2 hours, and yesterday's hang/reboot was on a 2 hour boundary. (The 2 hours is offset from the time F@H starts, not clock time. So before the hang the slow frames were 12h16, 14h16, 16h16 with reboot at 18h16, since the reboot the slow frames are 01h20, 03h30 etc.)
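
The 2-hour pattern can be sanity-checked with a few lines of Python. This is a sketch only: the helper names are invented, the `12h16` time format is just the one used in this post, and the 15-minute tolerance is an assumption, chosen to be roughly one frame since a slow spell can only show up at frame granularity.

```python
def to_minutes(hhmm):
    """Convert a '12h16'-style clock string to minutes since midnight."""
    hours, minutes = hhmm.split("h")
    return int(hours) * 60 + int(minutes)


def gaps_minutes(times):
    """Gaps between consecutive clock times, wrapping past midnight."""
    mins = [to_minutes(t) for t in times]
    return [(b - a) % 1440 for a, b in zip(mins, mins[1:])]


def looks_periodic(gaps, period=120, tolerance=15):
    """True if every gap is within tolerance minutes of period minutes."""
    return bool(gaps) and all(abs(g - period) <= tolerance for g in gaps)
```

With the times quoted above, `gaps_minutes(["12h16", "14h16", "16h16", "18h16"])` gives `[120, 120, 120]`, which `looks_periodic` accepts. A handful of points matching is suggestive rather than proof, so it is worth collecting more slow frames before blaming a 2-hour scheduled job.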

So tried to 'watch' the system coming up to a 'slow' frame, and sure as hell the hdd suddenly starts thrashing. Move the mouse, it stops. After a minute or so the thrashing restarts; check Task Manager or whatever, it stops, and the frame completes in reasonable time.

Now this hdd thrashing is a symptom I have tried to hit before: it predates my F@H enrollment. In fact I noticed it some weeks ago and have rebuilt that system (from formatted hdd) because I figured it was some trash that had been picked up and missed by the virus scanner. Still there. My current suspects are stuff loaded from the registry at boot for Office XP Professional (a full/all components load):
MDM.exe (Machine Debug Manager) or
MOSearch.exe (Microsoft Office Search Service)

Any ideas/thoughts welcome: I am going to kill these processes to test if they are the cause, then try to uninstall the 'full' Office XP: previously I removed MDM from the registry but it seems the next time we fired up Word or some such it got put back!
