Folding Farm stability problems

Posted: Wed Jan 14, 2004 7:23 pm
by haysdb
There may or may not be any questions here. Mostly I'm just "thinking out loud", planning my attack of a vexing problem. The problem is my diskless farm, which hasn't been what you would call rock-solid stable.

I have mixed emotions about this lack of stability. The part of me that enjoys playing with computers doesn't entirely mind, while another part is frustrated when things break "for no reason". It's the frustrated half speaking here.
  • One of my five blades is "very unstable", rarely completing a WU. It does complete some, but it aborts more often than not.
  • Another is "somewhat unstable", aborting about as many WU's as it completes.
  • Two others are "relatively stable", only aborting an occasional WU, and sometimes going days in a row without any problems.
  • Only one is "very stable". The "very stable" board is the only 266 FSB CPU in the farm, which seems like a clue.
  • Each of the boards "freezes" once in a while, with, so far, no discernible pattern.
Each of the five "blades" has a slightly different configuration. No two are exactly alike. While this has its downsides, I think how frustrated I would be if I had FIVE unstable blades rather than just two.
  • The least expensive motherboard (Biostar M7VIZ), with the cheapest no-name memory (PMI), is one of the "relatively stable" boards, while
  • One of the boards with premium memory (Crucial) and a good quality 300W psu (Q-Technology) is the least stable.
  • One of my two Abit boards is mostly stable, while the other is not. In that case the only difference is the cpu - one has an 1800+, the other a 2600+ Barton, but that could trigger a problem in any other component, so that in itself doesn't tell me anything.
I'm going to swap some components around tonight, partly in an attempt to improve overall stability, partly in an attempt to diagnose which components may be weak.

The Linux server has also had some problems, so I am including that in the component shuffle also.

For this first round, I am going to play a little 3-way swap of CPU's, replacing the 2000+ in the Server with an 1800+ that has been rock solid as a folder, moving the 2000+ to one of the boards that has been unstable with a 2500+ Barton, and moving the Barton to the former stability champion the 1800+ came out of. Whether those three machines improve, stay the same, or get worse, should provide some useful data.
  • None of the blades are overclocked by so much as 1 MHz.
    The memory is all PC2700 (DDR333) memory, and therefore "within spec" for the FSB333 CPU's.
  • The heatsinks are three Arctic Cooling Copper Silent, one Speeze, and an Alpha. Figuring the Alpha as the best quality cooler of the lot, I put it on the "somewhat unstable" board, along with a medium speed fan, but I noticed it aborted at least one WU last night, so the heatsink was not "THE problem."
I have put a spreadsheet together showing each configuration, hoping some "pattern" would emerge, but nothing has. The two most unstable boards are different brands, with different power supplies, and different heatsinks. Only the cpu's and memory are the same, or similar. The memory is Crucial. I don't know for sure whether the CPU's are 2500's or 2600's. I've forgotten what I bought.

Nope, no questions in there anywhere, but feel free to throw your 2 cents at me if you see any clues, have a suggestion, or just want to hurl insults about how stupid I am.

David

I bet it's the power supply

Posted: Wed Jan 14, 2004 8:03 pm
by NeilBlanchard
Hello David:

My guess is that the power supplies are not up to the task. The total wattage may seem okay, but maybe one of the "legs" is weak and can't keep up -- what voltage does the RAM run on? Or the CPU? Maybe just try a 250-300 watter of known quality on the worst offender and see how it goes...

Posted: Wed Jan 14, 2004 8:48 pm
by mas92264
Re: Instability

2 of my win2k boxes lock up (dead kybrd and mouse) about once a week. Another win2k box locks up (black screen) about every 10 days/2 weeks. These 3 are all 333 fsb, clean os installs. They do nothing but fold. However, I never lose the protein. Reset and it picks up from where it left off. 2 are via mobos, one is nforce 2 (it seems to be the worst offender.)

Of my 2 xp pro boxes (333 fsb) one has locked up once, maybe twice (nforce2.) The other one (at the ofc, 333fsb, via) never has locked up.

Let's see. That's 5, so far. 2 more are p4 2.4's, xp pro. These never lock up. The rest are all via boards, 266 fsb, 1 is 98se and the rest are win2k. 1 of these, win2k, locks up every couple of weeks, or so. The rest of the 266 boards are fine. All of these run 24/7, we never close.

No common theme here, that I can see. The only weird thing is that they nearly always lock up over night. I know that this makes no sense and probably means nothing.

There are a couple of threads at http://forum.folding-community.org/homepage.php re amd/lockups. The possibilities suggested are all over the map - everything from overheating cpu's to the kind of beer you drink. Apparently amd was contacted and offered no help.

The most likely suspects, imo, are some kind of mystery memory problem or some kind of sse issue (this was suggested as a possibility based on the non-response from amd.)

So, no answers seem to exist and it's a common problem, particularly with amd chips.

DH's loss of proteins, however, differs entirely from my experience.

Just my $0.02

M

Re: I bet it's the power supply

Posted: Wed Jan 14, 2004 9:22 pm
by haysdb
NeilBlanchard wrote:Hello David:

My guess is that the power supplies are not up to the task. The total wattage may seem okay, but maybe one of the "legs" is weak and can't keep up -- what voltage does the RAM run on? Or the CPU? Maybe just try a 250-300 watter of known quality on the worst offender and see how it goes...
I am. The chronically unstable Shuttle MK40V has a 300W Q-Technology power supply, which I used on my main desktop machine, a P4 2.4, until recently. That desktop had 3 hard drives, an AGP card, a MyHD card, a video capture card, and three case fans. It's not impossible that it's "bad", just like the Ethernet controller was bad on said desk...top...um, maybe I will swap the psu with another one. :wink:

David

Posted: Wed Jan 14, 2004 9:51 pm
by haysdb
Mas, I don't know if that makes me feel better, or worse. It's a sad state of affairs, is it not, where we just accept this with a shrug?

I can deal with the OCCASIONAL lockup, because so far, none has resulted in the loss of work. So long as I catch them in a reasonable amount of time, only a fraction of a wu worth of cpu time is lost. It's different when a protein that's nearly finished gets trashed. That makes me :x

Now that you mention it, I remember reading a VERY long thread about the lockup problems, although I don't remember where. Obviously nobody was going to accept responsibility for that problem, not when they can blame someone else for it. It's AMD's fault! No, it's VIA's fault! No, it's the motherboard maker's fault! No way, it's bad memory! No, Microsoft is to blame!

Honestly, it's why I chose an Intel cpu and an Intel motherboard with an Intel chipset (obviously) for my HTPC. It's no guarantee, but it definitely felt like less of a risk than a CPU from AMD, a chipset from VIA or nVidia, and a motherboard from someone else.

I need to get busy. I might not be able to "fix" anything, but maybe I can minimize the issues. Perhaps I can find some combination that will make everyone (every blade) happy.

David

Posted: Wed Jan 14, 2004 11:20 pm
by ColdFlame
Well guys I don't know why you all started saying that you have crashes, etc. - I don't have any problems at all with folding. I have 3 AMD systems, all different and they have yet to produce a problem.

I think if what you are describing is happening then you need to seriously take care of that. It is not normal that you have lockups, etc.

If I ever had a problem with folding it was for 2 reasons:
1) overclocking caused LINCS warnings and aborted WUs - still got points for those though
2) some WUs would lock up my AMD machine hard - but I haven't seen such WUs in a while

Bottom line: you need to troubleshoot your issues. Start with a system that is most unstable and start changing components. Put a more powerful PSU first, because it seems all your components are good quality.

Let me know if I can help somehow, but I think you need to try to reduce the problem to a minimum, i.e.
1) start with the simplest case - the system that locks up most
2) analyze each component and think what can go wrong with it - I think we can exclude CPU and maybe motherboard (make sure you have the latest BIOS everywhere)
3) put a single stick of RAM that is guaranteed to work in another machine
4) put a more powerful PSU from a machine that is guaranteed to work

Run all that and there shouldn't be any issues. If there are then it means that your software most likely is to blame. I have zero experience folding in Linux, so I don't know how stable is the client and the software. I'd imagine that those guys at Stanford develop in Linux though.

Posted: Thu Jan 15, 2004 12:11 am
by haysdb
ColdFlame wrote:Well guys I don't know why you all started saying that you have crashes, etc. - I don't have any problems at all with folding. I have 3 AMD systems, all different and they have yet to produce a problem.
Please don't take this the wrong way ColdFlame, but I can always rely on someone concluding that because they have no problems, that there is no problem. Your sample size is 3. I too have three Athlon machines that have (essentially) given me no problems. Unfortunately, I have three others that have had, and continue to have, problems.

I am prepared to admit that I cut my corners a bit too sharp, and that could be the problem. And yet, here is the rundown of components on my most troublesome folder:
  • Shuttle motherboard
  • AMD 2500+ Barton CPU
  • Crucial DDR PC2700 memory
  • Q-Technology power supply
These all are "respected brands". I'm not overclocking anything. Memory timings are set by SPD. All but one (and it's stable) has a cpu cooler with MikeC's stamp of approval.

If I had one blade with stability issues, I'd say it's just bad luck, but this goes way beyond bad luck.

However, your point that I need to "troubleshoot the issues" is well taken. I agree. I'm doing that. Methodically, one component at a time. One problem is that instability can occur immediately, or it can take a day or two to show up. Take my Shuttle buddy for example. Since the last LINCS WARNING at 26%, it has completed a WU and is 70% through another. Is it now stable? I doubt it since I haven't changed anything.

The "more powerful PSU" is 90% BS. These micro-ATX boards with a 2500+ Barton draw less than 90 watts from the wall. Even at 65% efficiency converting AC to DC, that's 60 watts. I'm not saying the psu's might not be faulty, but saying they aren't powerful enough doesn't compute.
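A quick sanity check of that arithmetic, as a rough sketch only. It uses the two figures quoted above (under 90 W at the wall, 65% efficiency) plus the 300 W rating of the Q-Technology supply; the function name is just illustrative.

```python
def dc_load_watts(wall_watts: float, efficiency: float) -> float:
    """Estimate the DC power delivered to the board from the AC draw at the wall."""
    return wall_watts * efficiency


# Figures quoted in the post: under 90 W at the wall, 65% conversion efficiency.
load = dc_load_watts(90, 0.65)
print(f"Estimated DC load: {load:.1f} W")

# Fraction of a 300 W supply's rated capacity that this load represents.
rated_watts = 300
print(f"Share of a {rated_watts} W PSU: {load / rated_watts:.0%}")
```

This only speaks to total wattage. It says nothing about whether an individual rail is weak or the unit is faulty, which is a separate question.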

IF what I am describing is happening? :?:

David


ColdFlame, I'm not mad at you, I'm just "flustrated", and you had the misfortune to "push the button." I assure you, I will take your advice, because it's good advice. I need to get this sorted. And it's not like I'm lacking proven components, I just need to grit my teeth and take the necessary boards offline long enough to swap the parts.

Posted: Thu Jan 15, 2004 12:37 am
by haysdb
OK. Two blades, the most and least stable in my stable. Maybe I have a folding ranch instead of a folding farm?

Same:
  • CPU: 2500+ Barton
  • HSF: Arctic Cooling Copper Silent, with AS3 as the thermal paste
Different:
  • Motherboard: Unstable: Shuttle MK40V. Stable: Biostar M7VIZ
  • PSU: Unstable: 300W Q-Technology. Stable: 180W Fortron micro-ATX
  • Memory: Unstable: 256MB Crucial DDR PC2700. Stable: 256MB generic DDR PC2700
Right off this just seems backward. Counterintuitively, the board with the no-name memory and the micro-ATX power supply is the stable one. Go figure.
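The same/different breakdown above can also be produced mechanically, which helps once more than two boards are compared. This is a minimal sketch: the dictionary layout and the function name are assumptions, and the values are simply the ones listed in this post.

```python
# Component lists for the two blades, as given in the post.
unstable = {
    "CPU": "2500+ Barton",
    "HSF": "Arctic Cooling Copper Silent",
    "Motherboard": "Shuttle MK40V",
    "PSU": "300W Q-Technology",
    "Memory": "256MB Crucial DDR PC2700",
}
stable = {
    "CPU": "2500+ Barton",
    "HSF": "Arctic Cooling Copper Silent",
    "Motherboard": "Biostar M7VIZ",
    "PSU": "180W Fortron micro-ATX",
    "Memory": "256MB generic DDR PC2700",
}


def split_components(a: dict, b: dict):
    """Return (same, different) component names for two configurations."""
    same = [name for name in a if a[name] == b.get(name)]
    different = [name for name in a if a[name] != b.get(name)]
    return same, different


same, different = split_components(unstable, stable)
print("Same:", same)
print("Different (swap candidates):", different)
```

Each entry in the "different" list is a candidate for a one-at-a-time swap between the two boards.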

I had my choice of which component to swap first, the memory or the PSU. I chose the memory. I have exchanged the DIMM between the two boards, board 1, the stable board, and board 4, the unstable board. That's it. I touched NOTHING else.

I may do the same thing with boards 2 and 5, the next most unstable and stable boards respectively. Like pair #1, the unstable board has a stick of Crucial memory, the stable board a stick of the very cheapest stuff they had.

If it proves to be the memory, I'm going to pull my hair out!

David

Posted: Thu Jan 15, 2004 12:55 am
by TheScarf
Advice in the form of experience, and I'm sure you're doing this anyway: Write down as much detail as possible!! Last weekend I spent at least 30 hours fixing a friend's pc that should've taken a lot less, due to a lack of diligence with notes... Sure I did a couple of the same things about 20 times :oops: I think you summed it up best though, grit your teeth!!

Posted: Thu Jan 15, 2004 1:14 am
by ColdFlame
One thing you might try to test memory is to run MemTest86. The reason I proposed to look at the PSU is because your system consists of only 4 components: CPU, motherboard, RAM and PSU. CPU is definitely ok. Motherboard unlikely. RAM is possible, PSU too.

You can yell at me as much as you like, I won't cry :) I realize it must be very frustrating. The reason I posted what I posted was because from other posts it seemed like errors and lockups are kinda expected and I disagree with that. All my systems are heavily overclocked and none is failing so I'd expect yours to behave well.

Actually, now that I think about it, I think we should blame the software, because the hardware is so different that it is very unlikely that all that various hardware you possess would be faulty.

Is it possible for you to install Windows XP on one of the boxes and see if it gives you grief still?

Posted: Thu Jan 15, 2004 1:49 am
by haysdb
Scarf, I took your advice and started a "journal" of the changes I've made and why, and the next steps to take if the previous change proves ineffective. With five boards and the various permutations of memory, power supply, heat sink, and cpu, I will be completely lost in no time if I don't.
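For anyone who would rather keep such a journal in a script than on paper, here is one possible shape for it. This is a sketch only: the class and field names are made up for illustration, and the single entry is based on the memory swap described earlier in the thread, with the PSU swap as an assumed next step.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Change:
    board: str       # which blade was touched
    change: str      # what was changed
    reason: str      # why it was changed
    next_step: str   # what to try if this change proves ineffective


@dataclass
class Journal:
    entries: List[Change] = field(default_factory=list)

    def log(self, board: str, change: str, reason: str, next_step: str) -> None:
        """Append one change record to the journal."""
        self.entries.append(Change(board, change, reason, next_step))

    def history(self, board: str) -> List[Change]:
        """Return every recorded change for a given board, oldest first."""
        return [entry for entry in self.entries if entry.board == board]


journal = Journal()
# Illustrative entry; the wording is hypothetical.
journal.log(
    board="board 4",
    change="exchanged DIMM with board 1",
    reason="check whether the memory is at fault",
    next_step="swap the PSU",
)
for entry in journal.history("board 4"):
    print(entry)
```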


In analyzing the various configurations, trying to identify something in common with the two unstable boards, I have identified something worth investigating: the size of the North Bridge heatsinks. The heatsinks on the unstable Shuttle and Abit boards are tiny. Could the North Bridge chips be overheating? Only one of the two Abit boards is unstable, but the CPU in the other board is an Athlon 1800+ with a 266 FSB, which would tax the North Bridge less than the 333 MHz FSB. In contrast, the heatsinks on the stable Biostar and Asus boards, are considerably larger.

It seems like a long shot to me, but I have too often made the mistake of saying "Oh, THAT couldn't possibly be a problem," and overlooked simple solutions. I just replaced the NIC in my desktop two days ago. A slightly flakey onboard Ethernet controller had plagued me for MONTHS.

David

Posted: Thu Jan 15, 2004 2:03 am
by haysdb
ColdFlame, thanks for your understanding. Too many people take an expression of frustration personally.

I have swapped memory between the two most and least stable boards, so that should tell me whether it's a memory issue. Or maybe not. I could run memtest86 on them. I have a USB thumb drive these boards will all boot from as if it were a USB floppy drive.

The software being run by each blade is identical. They each load the same boot image from the server, and run the same script to start the same version of F@H. They do each have a copy of the cores. I could try deleting the cores to force them to load a new one. I will add that to my list of things to try, although I don't think there have been any new Gromacs cores since v4 was released.

My favorite theory at the moment is that it's a "thermal issue", but not with the CPU's. With the boards out in the open, some components might not get the airflow over them they would in an enclosed case. I have lots of fans I can use to see if a little extra cooling will help the situation. Again, it's a longshot, but painless to try.

David

Posted: Thu Jan 15, 2004 4:06 am
by CharlieChan
David,

I have not read the complete thread so I may have missed something. You could be having a software problem so I suggest installing linux on each of the unstable blades first. If the blades fold without crashing then you should consider using redhat 8 on the server. I get the feeling it is another one of those glibc problems.

Charlie.

Posted: Thu Jan 15, 2004 4:09 am
by TheScarf
I have too often made the mistake of saying "Oh, THAT couldn't possibly be a problem,"
I like the way you think... :D

Posted: Thu Jan 15, 2004 5:50 am
by CoolGav
It does sound to me to be either one or both of an OS or thermal issue.

I had no end of problems with my i845PE P4 when I replaced the northbridge HSF with a Zalman passive. One day I noticed it was pretty cool and that it was easy to move around. After discarding the pins that were doing a bad job and using the epoxy stuff supplied, things are much better stability wise and the heatsink gets warm now. If you can't get some better heatsinks for the northbridges then try a fan. Seeing that you know how they fail, then trying multiple blades isn't such an issue.

If you try linux off the pen drive (or off a hard drive) then it's probably worth doing that on one of the blades while you try the northbridge solution on another.

I wish you the best of luck to sort out this stability issue. I know from losing a few WUs that it's annoying, but seeing loads disappear is terrible.

Posted: Thu Jan 15, 2004 1:58 pm
by haysdb
CharlieChan wrote:David,

I have not read the complete thread so I may have missed something. You could be having a software problem so I suggest installing linux on each of the unstable blades first. If the blades fold without crashing then you should consider using redhat 8 on the server. I get the feeling it is another one of those glibc problems.

Charlie.
Although I don't see how it could be, I do allow the possibility. The reason for my skepticism is that the instability is mostly confined to two of the five blades. Since they are all VIA KM400 boards booting off the same boot image, it doesn't add up for me that it could be a Linux issue. Yup, it could be, so I won't rule it out, but it seems less likely to me than it being a hardware issue, so I will continue to pursue that for now.

David

Posted: Thu Jan 15, 2004 2:22 pm
by haysdb
I swapped memory between pairs of stable/unstable boards last night, with the only change being that #2 has gone from "unstable" to "completely unstable", aborting 6 of 6 WU's since last night.

#4, the Shuttle board, remains "50% stable"

On #4, I reseated the cpu heatsink and changed to a different power supply. I forgot I was going to delete the core, to force it to download a new one, so that will be the next thing I try. I may just delete everything in the directory and start with a copy of the main exe and config files from one of the stable blade directories.

On #2, the Abit board that's "completely unstable", I've swapped power supplies with one of the 100% stable boards. I also swapped the PSU on another stable board, #5, with a new power supply of the same design used on one of the UNstable boards. I am trying to determine whether my psu's have any effect on folding stability. These two PSU swaps should either reveal a smoking gun or put that issue to bed.

I did put a fan blowing directly across the NB heatsink on #4 last night. It didn't help.


Thanks for the suggestions, one and all. There are solutions to the mysteries of why these two boards aren't stable, and together we will find them.

David

Posted: Thu Jan 15, 2004 4:38 pm
by Mutt_n_head
You know, I'm having problems with all my Athlon XP rigs lately. Maybe the SSE bug I've heard about is rearing its ugly head?

Whatever it is, it seems like AMD'ers are having more problems of late.

I have always wondered how credible that claim about the SSE bug was.

Oh well

Posted: Thu Jan 15, 2004 4:39 pm
by unregistered
Just a thought, I have seen MB's that need more "adequate" grounding than others, especially the el-cheapos.

Posted: Thu Jan 15, 2004 5:33 pm
by haysdb
unregistered wrote:Just a thought, I have seen MB's that need more "adequate" grounding than others, especially the el-cheapos.
Ooo, interesting thought. My boards are ALL el-cheapos. Grounding poses a problem for me since the "case" is plywood. The only metal are the power supplies, plus the boards aren't mounted to anything - they just slide into slots cut in the top and bottom of the "box".

David

Posted: Thu Jan 15, 2004 5:52 pm
by haysdb
Mutt_n_head wrote:You know, I'm having problems with all my Athlon XP rigs lately. Maybe the SSE bug I've heard about is rearing its ugly head?

Whatever it is, it seems like AMD'ers are having more problems of late.

I have always wondered how credible that claim about the SSE bug was.

Oh well
There is too much evidence to discount the "SSE bug" on Athlon processors. SSE is disabled by default on Athlon cpu's for a reason, and we force it back on at our peril.

Anyone who claims Athlon systems don't have "issues" with stability and freezing is just not looking at the available evidence IMO. My three P4 systems have just not had the same issues with Folding as have my Athlons, and if you look around at Folding-community.org, I am not unique.

But it's also true that I have four Athlon systems, including three different generations (Palomino, T'bred, and Barton), which HAVE been stable over a period of many weeks, so it's certainly possible. I just need to figure out what makes THOSE two systems stable and two others unstable.

I am going to order another Asus board. I have just one, but it has been the most solid of the four brands/models of micro-ATX boards I have, so I am curious to see if I was just "lucky" or whether it's just a good board. The Biostar M7VIZ has also been reliable, except for some flakiness with the onboard Ethernet, and I'm not fond of the location of the ATX power connector, so given the choice of another Biostar or another Asus, the nod must go to the latter.

It has been suggested I try an nForce board, but for this farm I am sticking with VIA boards, if for no other reason than my brother works for VIA. :lol: No, I don't get free boards.

David

Posted: Thu Jan 15, 2004 6:58 pm
by ColdFlame
I have encountered the "SSE bug" on my AMD systems and it locked up my rig completely. I'm of inquisitive nature so I did some simple tests, which pointed at one specific WU that I got and the problem was only showing up when I had SSE enabled (using -forceasm switch on 3.24 client).

Now, whether it is AMD or Stanford folks - nobody knows. But in my experience it was clear that it was related to a particular WU and to SSE.

However, besides that one incident I haven't had problems with AMD folding unless I'm overclocking systems way out of spec.

I've not had any stability problems

Posted: Thu Jan 15, 2004 7:34 pm
by NeilBlanchard
Hello:

I have been folding on one overclocked AthlonXP for 8 months and another for 3-4, and an Athlon "Classic" 700 for 8 months, and the only aborted WU's (maybe 3 total) came very early on when I pushed the oc a coupla' hertz too high! This seems like a red herring to me.

What about power *fluctuations* or RF interference -- possibly from one mobo to the other? Do you have them on a good power filter or UPS? At work, we have a microwave oven, and when it is plugged into certain power circuits, it caused my system to reboot when we used the oven! And that is *with* the UPS... Do you have any old cases to install the unstable system in? Have you run the MemTest yet? Or maybe just put the RAM in another slot?

Posted: Thu Jan 15, 2004 9:31 pm
by mas92264
News:

Read through about 5 pages at www.forum.folding-community.org re amd lockups. There were a bunch of guys over there working/testing this issue in May 2003. Their conclusions were that the lockups are definitely due to having SSE enabled on xp Tbreds and xp Bartons. XP Palominos seem to be immune to the problem. There was some preliminary indication that MP's are also immune. No one seems to know why.

Fast forward to January 2004 and there is no resolution. I did find this, posted today (same forum, January 15th.)

"AMD thanks the Folding at Home users for providing information about the freezing problem. Thanks to Prof Pande, AMD has reproduced the problem in our Austin labs. Although not an official workaround, we have observed that if the console application is launched without the -forceasm switch, or if the GUI version advanced properties setting to enable advanced optimizations is not enabled, then the freeze does not occur.

We apologize for any inconvenience and will be working with Prof. Pande as soon as a better workaround is available.

AMD_Mike"

We should expect a resolution any minute now as it only took AMD 7 months to get this far. :x

I doubt that this will be any assistance to haysdb's problem, however, maybe it's a little more light on the problem.

M

Posted: Thu Jan 15, 2004 9:39 pm
by haysdb
Nothing much new to report. The good news is no WU's have aborted in the last 12 hours. The bad news, one of my three stable boards went "offline" only an hour after I went to work. I can't be sure, but I believe that's due to an intermittent problem that board has with its onboard NIC. I wish I had a 7-port KVM switch so I could SEE what the blades are doing.

Neil,
  • All the boards are connected to a Tripp-Lite Isobar surge suppressor. I have considered a UPS, but one large enough for the farm will be pretty expensive. A 7-blade farm looks like it's going to draw about 1000 watts. Speaking of which, when I had the Kill-A-Watt on individual PSU's, I was seeing power draw of 84 to 88 watts, and yet somehow, five of them are drawing 770 watts. Time to methodically check each PSU and see which board is drawing what.
  • One of the unstable boards is separate from the others. It doesn't "fit" in my box (until I make a spacer) so it's sitting on a card table a few feet from the blade box.
  • I've swapped RAM between boards, but haven't tried slot 2 yet. That's on my list to try, but not until boards start showing instability again. As mentioned above, none have busted a WU since the last shuffle.
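To put a number on the power discrepancy mentioned in the first bullet, here is a small sketch. It assumes the 770 W reading covers only the five blades, which the post does not confirm; the server, switch, or fans may be on the same meter. The variable names are illustrative.

```python
def expected_total(per_board_watts: int, boards: int) -> int:
    """Total draw if every board pulled the same wattage at the wall."""
    return per_board_watts * boards


boards = 5
low, high = 84, 88       # per-PSU Kill-A-Watt readings quoted in the post
measured_total = 770     # combined reading quoted in the post

expected_low = expected_total(low, boards)
expected_high = expected_total(high, boards)

print(f"Expected for {boards} boards: {expected_low}-{expected_high} W")
print(f"Measured: {measured_total} W")
print(f"Unaccounted for: {measured_total - expected_high}-"
      f"{measured_total - expected_low} W")
```

A gap that large points either to something else on the metered circuit or to one or more boards drawing far more than the samples measured, which is why checking each PSU individually is the sensible next step.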
David

Posted: Fri Jan 16, 2004 12:12 am
by haysdb
AMD's workaround is to not enable SSE. :|

I have had some instances of freezing, but not a LOT. I remember having two in one day, but with that exception I have had no more than one in any one day (and most days I have none), spread over 5 non-Palomino Athlons.

I do have aborted WU's every day, which is why I am more concerned with instability than freezing.

I have mentioned it in passing, but never made an issue of pointing out that I do have one blade that, so far as I remember, has never failed on a WU. (Actually, I have two, but the other is host to an 1800+ Palomino with a 266 FSB, so much less is asked of it than of the others.) I have stolen its RAM and its PSU to try on other boards, but it has been equally happy with the new parts. It's running a 2600+ at 333 FSB, with a $9 heatsink/fan. It has an exceptionally convenient layout, with the ATX power connector far away from the CPU where it's easy to connect and disconnect, and it has no 12V ATX connector to fuss with. It even has an LED to show me that the board has power. With the caveat that I have just ONE of these boards, I have to say I wish I had SEVEN of them. The board is an Asus A7V8X-MX. It's in the same box with all the other boards, connected to the same surge protector and network switch. Why is it that this one board hasn't fallen victim to power fluctuations, EMI, weak power supplies, bad memory, or glibc problems when three of the others have? In my eyes, this board proves the other motherboards are WEAK. I ordered a second of these boards tonight to see if another sample proves to be as robust as the first.

David

Posted: Fri Jan 16, 2004 1:51 am
by CoolGav
My XP2400+ system uses an ECS K7SEM MATX board, which is now sitting in an MDF (Medium Density Fibreboard for those who haven't seen "Changing Rooms"), custom enclosure. Well, when it was in a standard case it never skipped a beat, but now it's not doing quite so well. I think it's sorted now, but I had my issues with Fedora, which could have been stability related. That board was a cheap one (£30), when I usually spend at least twice that. The Asus boards I've owned haven't actually skipped a beat (okay, I sold my CUSL2 because my PIII-650 wouldn't get over 700MHz). My latest Barton is in an Asus A7V600, a Via Chipset and it is very good. I've also had good luck with Gigabyte boards. If I were in the position (finances, space, time) to build a farm then I wouldn't go with the cheapest I could find, but something that has been designed well.

I hope your new Asus board proves to be stable. Are you able to return the other boards to where you got them from, seeing that you're having these issues which other boards don't suffer from?

Posted: Fri Jan 16, 2004 2:44 am
by haysdb
CoolGav wrote:Are you able to return the other boards to where you got them from, seeing that you're having these issues which other boards don't suffer from?
I will take that up with New Egg and Directron when the time comes. I've pretty much swapped every other component except for cpu's, so I'm 95% convinced these are "board issues".
  • The Biostar just needs to be RMA'd since it's a good board with a slightly flakey network controller.
  • The Shuttle board is my least favorite for several reasons, and I'd prefer to be rid of it. Rather than RMA'ing that board, I will ask Directron to exchange it for something else.
  • The Abit board remains a mystery. I have two of them, and the other has been problem free, albeit with a slower, FSB 266 processor. I may swap chips on those two and see what happens. Then I will either put a FSB 266 cpu in it, RMA it in hopes of a better one, or ask to exchange it for either another ASUS or Biostar.
I have no reason to believe any of my power supplies or memory is bad; the unstable boards have remained unstable regardless of memory or psu, and likewise for the stable boards, which have remained stable even after memory and power supply swaps.

The server froze up again tonight. That's the fourth time in about a week. I have removed the -forceSSE flag from that machine, but what I will probably do at some point, maybe this weekend, is swap the 2000+ T'bred in the server with the 1800+ Palomino in one of my blades. This is based on the premise that Palomino's don't suffer from the "SSE bug". It's possible the lock-up problem is a Linux issue, but I'm betting it's not. If a blade locks up, it's one cpu offline. If the server locks up, it's SIX, and soon to be SEVEN cpu's offline.

David

Posted: Fri Jan 16, 2004 5:29 am
by TheScarf
FWIW, the milk crate board is an ASUS A7V8X-X and for a budget board, gets another vote for rock solid stability. Here's to hoping your next one keeps to the trend....

Posted: Fri Jan 16, 2004 9:12 am
by wgragg
I may be spitting in the wind here, but bios settings can also cause instability problems, especially the vdd and/or vdimm voltages (if I remember right). I have had to raise one of those recently because of a slight instability problem. You might consider checking your voltage levels in BIOS.

Wendell