All times are UTC - 8 hours




 Post subject: B$%£&^ FILE_IO_ERROR
PostPosted: Mon Feb 16, 2004 12:53 am 
Offline
Friend of SPCR

Joined: Mon Jun 09, 2003 4:00 am
Posts: 175
Location: Reading, UK
Just got into work and checked how the folding was going on my spare workstation. 5 of 7 wu's finished :D - the 2 tinkers I got still need a few more hours.

Having read some posts about performance fractions at breakfast, I decided to look through the log file to see if it was listed anywhere. Now I find this in 2 of the log files:
Code:
[14:46:39] Completed 250000 out of 250000 steps  (100)
[14:46:42] Writing final coordinates.
[14:46:45] Past main M.D. loop
[14:47:45]
[14:47:45] Finished Work Unit:
[14:47:45] - Reading up to 1037136 from "work/wudata_02.arc": Read 1037136
[14:47:45] - Reading up to 1059504 from "work/wudata_02.xtc": Read 1059504
[14:47:45] - Length of file read from disk doesn't match what expected (work/wudata_02.xtc)
[14:47:45]
[14:47:45] Folding@home Core Shutdown: FILE_IO_ERROR
[14:47:49] CoreStatus = 75 (117)
[14:47:49] Error opening or reading from a file.
[14:47:49] Deleting current work unit & continuing...
[14:47:53] Trying to send all finished work units
[14:47:53] + No unsent completed units remaining.


Is there any way to recover these or is that just 80 points lost :x :evil:


 
 Post subject:
PostPosted: Mon Feb 16, 2004 7:55 am 
Offline
Patron of SPCR

Joined: Fri Aug 29, 2003 11:09 pm
Posts: 2425
Location: Earth
geordie, don't shoot the messenger, but those points are gone. Twice? Coincidence or a clue?

The thread Folding@Home Common Errors at folding-community.org says
Quote:
FILE_IO_ERROR:
An error that occurs when disk operations go bad. This is a fairly general error, having many sub-types. It has plummeted in frequency since the release of Gromacs Core 1.46. Now, this error usually happens when a hardware error occurs: something like "Write 0010, read back 0011". If you experience this error, make sure your hard drives are OK: run ScanDisk, CHKDSK, or fsck, make sure the IDE bus is in spec, make sure you're using good IDE cables, and make sure the drive isn't dying.

FILE_IO_ERROR has also been reported to occur if two Console clients working on the same unit are started. This can occur if you accidentally start one client twice on a dually, instead of two clients once.

David

_________________
Join SPCR's Folding Team! <-- Click Here


 
 Post subject:
PostPosted: Mon Feb 16, 2004 8:34 am 
Offline
Friend of SPCR

Joined: Mon Jun 09, 2003 4:00 am
Posts: 175
Location: Reading, UK
I'd pretty much resigned myself to losing the points. It was just wishful thinking that I could recover them somehow. :(

I'm pretty much 100% certain the hardware is fine. I thought I was careful not to start 2 clients with the same cpu id etc also. The client.cfg files are correct and I start each one as a service in firedaemon.

One other coincidence is that the 2 units that failed were of the same type, and were the only 2 wu's of that type I got. I'll have to keep an eye on the situation and see if it happens again. Hopefully this was a one off.

I did have one thought, which was to make a backup of each client's directory before it finishes a wu. Then if it fails I could restore the backup and try again. If that fails I would be inclined to think the machine doesn't like the wu, rather than a random hardware error.
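The backup-and-retry idea above could be sketched roughly as follows. This is a minimal illustration, not a tested recovery procedure: the directory names are hypothetical, and it assumes the client service has been stopped before either copy so that no files change mid-copy.

```python
import shutil
from pathlib import Path


def backup_client_dir(client_dir, backup_dir):
    """Copy a whole client directory to backup_dir, replacing any older backup.

    Assumption: the folding client for this directory is stopped first.
    """
    client_dir, backup_dir = Path(client_dir), Path(backup_dir)
    if backup_dir.exists():
        shutil.rmtree(backup_dir)  # drop the stale backup
    shutil.copytree(client_dir, backup_dir)
    return backup_dir


def restore_client_dir(backup_dir, client_dir):
    """Replace client_dir with the contents of a previous backup."""
    client_dir, backup_dir = Path(client_dir), Path(backup_dir)
    if client_dir.exists():
        shutil.rmtree(client_dir)  # discard the failed state
    shutil.copytree(backup_dir, client_dir)
    return client_dir
```

If the restored unit fails again at the same point, that would point towards the work unit or the machine's handling of it rather than a one-off disk error.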


 
 Post subject:
PostPosted: Mon Feb 16, 2004 8:40 am 
Offline
Friend of SPCR

Joined: Mon Dec 23, 2002 12:55 pm
Posts: 1063
Location: Richland, WA
Is that the registered version of FireDaemon? I thought the 'freeware' version only allowed adding one service.

Just trying to help eliminate any possibilities.

_________________
Z
________


 
 Post subject:
PostPosted: Mon Feb 16, 2004 8:51 am 
Offline
Patron of SPCR

Joined: Fri Aug 29, 2003 11:09 pm
Posts: 2425
Location: Earth
Were these two units on the same PC?

What protein? It might be a good idea for some other people to check their logs for this protein. Maybe the problem is more widespread and people just haven't noticed it yet.

David
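For anyone wanting to check their own logs, a small script along these lines could pick out core shutdown lines. It is only a sketch: the pattern is based solely on the log excerpt in the first post, and the function names are made up for illustration. The excerpt does not show where the protein name appears in the log, so this only reports the shutdown status.

```python
import re
from collections import Counter

# Matches lines shaped like the one in the first post:
#   [14:47:45] Folding@home Core Shutdown: FILE_IO_ERROR
SHUTDOWN_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2})\] Folding@home Core Shutdown: (\w+)"
)


def core_shutdowns(log_text):
    """Return (timestamp, status) for every core shutdown line in the text."""
    results = []
    for line in log_text.splitlines():
        match = SHUTDOWN_RE.match(line.strip())
        if match:
            results.append((match.group(1), match.group(2)))
    return results


def count_statuses(log_text):
    """Tally how often each shutdown status occurs."""
    return Counter(status for _, status in core_shutdowns(log_text))


SAMPLE = """\
[14:46:39] Completed 250000 out of 250000 steps  (100)
[14:47:45] Finished Work Unit:
[14:47:45] Folding@home Core Shutdown: FILE_IO_ERROR
[14:47:49] CoreStatus = 75 (117)
"""

print(count_statuses(SAMPLE))
```

To scan a real log, read the file into a string (for example with `open(path, errors="replace").read()`) and pass it to `count_statuses`.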

 
 Post subject:
PostPosted: Mon Feb 16, 2004 9:16 am 
Offline
Friend of SPCR

Joined: Mon Jun 09, 2003 4:00 am
Posts: 175
Location: Reading, UK
Quote:
Is that the registered version of FireDaemon?

Yes. It is the registered 1.6 GA Pro so I can have unlimited services. :D

Quote:
What protein?

Both listed as p1006_ppg10_pfold, a 41.10 point Gromacs unit. They both ran on the same machine as cpu id's 4 and 5. Both running Fah4Console, fahcore_78 v1.55.
Command line is
Code:
Fah4Console -local -service -advmethods -forceSSE -forceasm -verbosity 9

The machine is a dual Xeon 2GHz with HT running Windows 2000 Server. The work units are transferred in from my XP2500+ system at home since the Xeon has no internet access.

I have the fresh copies of the work units as they were downloaded at home. I'll keep one of them and start it again from scratch. See if it fails again. I don't think I'll risk running both again and losing another 80 points. :x

Do you think downloading work on an Athlon XP and doing the work on a Xeon would have any nasty effect? I use the same command line at home to do the sending / receiving of work units. I didn't screw up with the local settings at home. That is cpu 1 and I only take directories for cpu's 2 - 8 to work.


 
 Post subject:
PostPosted: Mon Feb 16, 2004 10:14 am 
Offline
Patron of SPCR

Joined: Fri Aug 29, 2003 11:09 pm
Posts: 2425
Location: Earth
geordie wrote:
Do you think downloading work on an Athlon XP and doing the work on a Xeon would have any nasty effect? I use the same command line at home to do the sending / receiving of work units. I didn't screw up with the local settings at home. That is cpu 1 and I only take directories for cpu's 2 - 8 to work.

I would expect that to work just fine. I have been playing around with this myself, shuffling a F@H directory between a PC with Internet access and one without, using a USB pen drive, but I am only running one instance.

David

 
 Post subject:
PostPosted: Thu Feb 19, 2004 12:16 am 
Offline
Friend of SPCR

Joined: Mon Jun 09, 2003 4:00 am
Posts: 175
Location: Reading, UK
Well I finished one of those p1006_ppg10_pfold work units again and guess what - FILE_IO_ERROR. :evil:

From my small trial:
Every (total 3) p1006_ppg10_pfold unit run has failed at completion with FILE_IO_ERROR
No other work units of any type (total approx 25) have failed for this or any other reason.

Not a huge survey I have to admit, but I'm certainly suspicious of these units. Anyone else got one?


 
 Post subject:
PostPosted: Thu Feb 19, 2004 12:39 am 
Offline
Patron of SPCR

Joined: Fri Aug 29, 2003 11:09 pm
Posts: 2425
Location: Earth
EMIII reports I have done 10 of these, but the most recent was Feb 9 and I don't have that log any more so I cannot confirm what happened at the end, whether it was uploaded successfully or if it aborted.

David

 
 Post subject:
PostPosted: Thu Feb 19, 2004 6:41 am 
Offline

Joined: Thu Nov 13, 2003 7:34 am
Posts: 246
I've been having a spate of problems lately, but mostly on one box. I got a file io error yesterday on a different protein, but it was also a mismatch in the file size. Before that, I was getting several Special Exits and one other I can't remember. I'm tweaking my settings and backing off my overclock a tad to see if that is the problem.

 
 Post subject:
PostPosted: Thu Feb 19, 2004 11:48 am 
Offline

Joined: Sat Oct 04, 2003 1:43 am
Posts: 187
Location: Madrid, Spain
geordie wrote:
Well I finished one of those p1006_ppg10_pfold work units again and guess what - FILE_IO_ERROR. :evil:

From my small trial:
Every (total 3) p1006_ppg10_pfold unit run has failed at completion with FILE_IO_ERROR
No other work units of any type (total approx 25) have failed for this or any other reason.

Not a huge survey I have to admit, but I'm certainly suspicious of these units. Anyone else got one?


Geordie, I got one a really long time ago and had one of my first and only failures, due to overclocking/using SSE on an Athlon. From what I investigated then and what I can recall, some ranges of proteins (10xx and 3xx, I think) seem to be less tolerant of overclocking, an unstable system (even one very close to stable), or using SSE on AMDs. As I see you are using a dual Xeon, the latter shouldn't be a problem. Also, you're using Gromacs Core 1.55; could you try the beta one (1.56)? Also, try running Prime95 to check stability; maybe you have some problem with your memory or whatever, and it only comes to the surface with this kind of protein.

Good luck

_________________
Athlon XP 2500 (Barton), Gigabyte GA-7VAX, Kingston 512MB DDR333, Samsung SP1213N (120GB), Geforce 2 MX 200, Case Aopen H500, Zalman 6000Cu, Antec TruePower 330.
 
 Post subject:
PostPosted: Thu Feb 19, 2004 1:02 pm 
Offline
Patron of SPCR

Joined: Fri Aug 29, 2003 11:09 pm
Posts: 2425
Location: Earth
1.56 is no longer beta. If you delete your existing core, 1.56 will be downloaded.

I agree with mormakil. This sounds like an instability issue. If you can live with your production being cut in half for awhile, fire up Prime95 alongside F@H. Each will use 50% cpu. How long you let it run is up to you. 8 hours of cpu time according to Task Manager is a reasonable goal. If it passes that, you may just be getting some bum work units. My guess is that it will NOT pass that test, in which case you need to take steps to improve stability, by backing off on the OC, increasing voltage, improving cooling, etc.

David

 
 Post subject:
PostPosted: Thu Feb 19, 2004 1:41 pm 
Offline
Friend of SPCR

Joined: Mon Jun 09, 2003 4:00 am
Posts: 175
Location: Reading, UK
Thanks for the suggestions. I might as well run Prime95 over the weekend since there won't be enough work for fah anyway. Better than having the processors idle. If it does fail then I've got problems. It's not OCed at all. It's bloody loud thanks largely to the 120mm NMB B30 and 92mm Sunon POS exhausts. The nasty Intel stock HSFs aren't too nice either with their 60mm screamers. That should be enough cooling I'd have thought! The 4 sticks of Rambus memory are nicely in the airstream too. Unfortunately, being at work I can't really do much with it to shut it up.

I'll be making sure I get hold of 1.56 core. Does the change work mid work unit or do I need to use -pause to get fah to stop after completing?


 
 Post subject:
PostPosted: Thu Feb 19, 2004 1:58 pm 
Offline
Patron of SPCR

Joined: Fri Aug 29, 2003 11:09 pm
Posts: 2425
Location: Earth
Sorry, geordie, bad assumption on my part that you were overclocked, but I can confirm that machines can be unstable even at stock speeds. Prime95 is the best way to verify stability, or INstability.

Core 1.56 should do nothing to improve folding stability on an Intel processor, as it's primarily an Athlon release, with optimizations and some code changes to work around an SSE issue in T'bred and newer cores.

David

 
 Post subject:
PostPosted: Thu Feb 19, 2004 2:25 pm 
Offline
Friend of SPCR

Joined: Mon Jun 09, 2003 4:00 am
Posts: 175
Location: Reading, UK
I run a Barton at home so I'll be getting 1.56 for that at least.

You've got me on a quest to OC that machine to its limits now though David after reading about your Barton blades :)

Mine's an unlocked 2500+ 8) . Currently running at 12.5 * 166 (reported as 2800+ by A7N8X Dlx). CPU temps now stabilise at 51C folding (up from 49C at stock). Still at stock voltage. Cooling is an 80mm Panaflo L1A @ 12v on a Thermalright SK7. I've got DDR400 memory so might as well go to 11 * 200 next and see if that's stable and cool enough for my liking. And hopefully I can keep to 1.650v.

I was disappointed a few months ago when I found I couldn't undervolt on the A7N8X. Now I love this board.


 
 Post subject:
PostPosted: Thu Feb 19, 2004 4:55 pm 
Offline
Patron of SPCR

Joined: Fri Aug 29, 2003 11:09 pm
Posts: 2425
Location: Earth
geordie, I look forward to your report. I was jazzed after successes with two 2500's, then a bit disappointed with lesser success with a 2600. Your 2500 being unlocked provides you some additional flexibility that I didn't have.

At 11*200, mine were ALMOST stable at stock voltage, ultimately requiring only a .025v boost to get them totally stable. I managed to OC the second one to an FSB of 208, about 2.3GHz, but it took an extra .15v, so aside from "bragging rights," I'm not sure it was worth the extra heat and stress. Still, though, KILLER results for a $75 processor!

David

 
 Post subject:
PostPosted: Sun Feb 22, 2004 8:53 am 
Offline
Friend of SPCR

Joined: Mon Jun 09, 2003 4:00 am
Posts: 175
Location: Reading, UK
David, it looks like I may have similar results to you. I went straight to 11 * 200 this afternoon at stock 1.650v. Ran fah with mbm5 to monitor load temps. After an hour I have 65C diode, 50C socket and 24C case temps :shock: . The diode temp is up 5C from 166 * 12.5 (2800+) speed. I have to say this is higher than I would like, but I don't know what diode temp I was running at stock 2500+ speed. I had been relying on ASUS PC Probe until yesterday, but that just reads the socket temp.

Anyway, after that limited stability test I tried moving up to 200 * 11.5, but within a few minutes of running fah and web browsing IE crashed trying to write to memory outside its address space. :( I'm taking this as a sign that there's no chance of stability at 2.3GHz and stock voltage. I've decided I want the best overclock I can get at 1.650v. I don't really want to increase vcore.

Back to 3200+ speed now. Downloaded Prime95 but I'm gonna need some advice as to the settings to use. I clicked "torture test" out of interest but it only seemed to give 20% load for the couple of minutes I left it :?
If Prime95 gives errors then it's decision time. I've got my heart set on having 3200+ speed now, but raising vcore is unappealing. Diode temperature is already hotter than I would like, but then again an 80mm L1A isn't much airflow on the heatsink. Do you have diode temps for your Bartons David?

The other point of note is my memory timings. At 166 fsb I was running 2-3-3-7 in dual channel mode, but at 200 fsb this drops to 2.5-3-3-8. I guess this pretty much eliminates any performance benefit gained by increasing fsb (at the same clock rate). I might compare 200 * 10 to 166 * 12 sometime out of interest, but that will have to wait for a while.
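The clock speeds quoted in this thread come from multiplying the CPU multiplier by the FSB. A quick sketch of that arithmetic, using the nominal FSB figures as written in the posts (the "166" setting is usually closer to 166.67 MHz in practice, so real clocks come out slightly higher); the function name is just for illustration:

```python
def core_clock_mhz(multiplier, fsb_mhz):
    """Core clock in MHz from the CPU multiplier and the FSB in MHz."""
    return multiplier * fsb_mhz


# (multiplier, FSB) pairs mentioned in the thread
settings = [(12.5, 166), (11, 200), (11.5, 200), (10, 200), (12, 166)]

for multiplier, fsb in settings:
    clock = core_clock_mhz(multiplier, fsb)
    print(f"{multiplier} x {fsb} MHz = {clock:.0f} MHz")
```

On these nominal numbers, 200 * 10 and 166 * 12 land within a few MHz of each other, which is what makes that pair a reasonable like-for-like comparison of FSB and memory timings.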


 
 Post subject:
PostPosted: Sun Feb 22, 2004 2:18 pm 
Offline
Patron of SPCR

Joined: Fri Aug 29, 2003 11:09 pm
Posts: 2425
Location: Earth
geordie,

I'm not the OC guru by any means, but I ALWAYS have an opinion... ;)
  • Don't worry about raising vcore by .025 or .5v. Based on my own limited (3 cpu's) experience, 3000+ is possible at stock voltage, but stability at 3200+ is aided by a .025v tweak of vcore.
  • Don't worry too much about the memory timings. At FSB 200, 3-3-3-8 is par for the course unless you have really high quality memory.
  • My own regimen for overclocking goes like this:
    • Memtest-86, test #5 (C-2-5-5<Ret>-0), 5 minutes. For some reason, this test seems to be especially good at identifying memory problems on Athlon processors.
    • Prime95 Torture test. Gross instability will be identified very quickly. I OC until Prime95 errors out within a minute or two, then back off FSB in steps of 2 MHz until it's stable for at least 5 or 10 minutes, then I start serious stability testing. My "Level 8 Stability Test" starts with Prime95 running alone. If I get no errors after 30 to 60 minutes, depending on how impatient I am, I will fire up F@H to run concurrently with Prime95. Each will get 50% cpu so long as you leave Prime95 Priority at 1. After Prime95 accumulates 8 hours of cpu time, I declare "good enough". A "Level 10" stability test would be 24 cpu hours Prime95 at Priority 10. During this test, the PC will be effectively unusable.
David
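The back-off step in the regimen above can be written down as a simple loop. This is only a sketch of the procedure as described: `is_stable` is a hypothetical stand-in for a manual short Prime95 run at a given FSB, and the `floor` value is an arbitrary assumption to stop the loop. The second helper shows why 8 hours of Prime95 cpu time takes longer on the clock when F@H shares the processor.

```python
def find_stable_fsb(start_fsb, is_stable, step=2, floor=133):
    """Lower the FSB in fixed steps until is_stable(fsb) reports success.

    is_stable is a hypothetical callback representing a manual test run;
    returns the first FSB that passes, or None if none does above floor.
    """
    fsb = start_fsb
    while fsb >= floor:
        if is_stable(fsb):
            return fsb
        fsb -= step
    return None


def wall_clock_hours(cpu_hours, cpu_share):
    """Wall-clock time needed to accumulate cpu_hours at a given cpu share."""
    return cpu_hours / cpu_share


# Example: pretend anything at or below 205 MHz passes the short test.
print(find_stable_fsb(210, lambda fsb: fsb <= 205))
# 8 cpu hours with Prime95 getting 50% of the cpu
print(wall_clock_hours(8, 0.5))
```

Passing the short test is only the starting point; the longer runs described above are what actually establish stability.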
