The downside of running a server :(

A forum just for SPCR's folding team... by request.

haysdb
Patron of SPCR
Posts: 2425
Joined: Fri Aug 29, 2003 11:09 pm
Location: Earth

Post by haysdb » Thu Jan 08, 2004 10:48 am

When the server goes down, EVERYONE goes down.

My Shuttle XPC has a problem - it's freezing up. That's "a bad thing" as a friend of mine says.

It's been running continuously for some weeks now, but maybe the "strain" has been too much for it. At this point I don't even know where to start looking, and right now I have to get to work. For the moment it's running again, but since it has frozen twice in just the last few hours, I don't have much hope it will keep running. I stopped running a FAH client on the server in case it's a heat-related issue. That might keep it cool enough, but I'm just guessing as to the cause at this point.

David

CharlieChan
Patron of SPCR
Posts: 198
Joined: Sun Jul 13, 2003 2:57 am
Location: East Anglia, UK

Post by CharlieChan » Thu Jan 08, 2004 11:31 am

David,

You could always put the farm on a subnet :wink:

Charlie.

Post by haysdb » Thu Jan 08, 2004 1:27 pm

Charlie, that has become a high priority, in fact. That way, if the farm server fails, at least my Windows machines won't be affected. Those machines are, after all, my "real" computers, which handle email, web browsing, video editing, and so on. If the farm goes offline, it's just my precious points at stake.

I already have the second NIC I need, and I can easily switch the Windows machines back to the .0 subnet just by turning the DHCP server back on in my broadband router, so the remaining task is to configure the DHCP server on the Linux machine to handle ONLY the farm clients. I think I just need to wave my hands over the dhcpd.conf file to specify which NIC it should monitor, and I should be good to go.
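A minimal sketch of the intended split, to make the idea concrete. The addresses are assumptions: the thread only mentions a ".0 subnet", so 192.168.0.0/24 for the Windows machines and 192.168.1.0/24 for the farm are hypothetical, as are the names `HOME_NET`, `FARM_NET`, and `owner_of`. It only shows which DHCP server would be responsible for a given address, not how dhcpd itself is configured.

```python
import ipaddress

# Hypothetical addressing; the thread only says ".0 subnet".
HOME_NET = ipaddress.ip_network("192.168.0.0/24")  # Windows machines, served by the broadband router
FARM_NET = ipaddress.ip_network("192.168.1.0/24")  # farm clients, served by the Linux box's second NIC


def owner_of(address: str) -> str:
    """Return which DHCP server is expected to hand out this address."""
    ip = ipaddress.ip_address(address)
    if ip in FARM_NET:
        return "linux-dhcpd"
    if ip in HOME_NET:
        return "router"
    return "unknown"


if __name__ == "__main__":
    for addr in ("192.168.0.10", "192.168.1.20", "10.0.0.5"):
        print(addr, "->", owner_of(addr))
```

With the two ranges kept disjoint like this, a failure of the Linux server only takes the farm addresses with it.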

I have a couple of example dhcpd.conf files to refer to, including one you posted, so at this point I'm not asking for any help. I need to "noodle through" this on my own first.

David

Post by haysdb » Fri Jan 09, 2004 1:29 am

This is puzzling to me. Since rebooting the Linux server some 15 hours ago, it seems OK, but...
  1. When I tried to run a folding console on the server, it locked up, so it's no longer folding.
  2. I can't access either of the other WinXP machines. They are both running, and folding, and have Internet connectivity, but they aren't visible on the LAN. One of them was briefly visible to F@H LogStats, but then "went away" again. I CAN ping them by IP address and by name. This is just weird.
  3. Two of my blades continue to be "unstable", aborting nearly every WU at some point. Neither is overclocked. One is on a known-good 300W power supply, the other on a micro-ATX PSU.
  4. One of my blades appears to have a problem with its network controller. I had trouble with it a few days ago but was eventually able to get it to boot. Now it's not working again.
  5. Two of the blades are stable: one 1800+ and the new ASUS board with a 2600+.
In total, 6 of 9 CPUs are folding, but EMIII and F@H LogStats can only monitor 3 of them.

I'm looking for a pattern to these problems and not seeing one.

Of course, overheating comes to mind as a possible explanation for the "instability" issues, but each of these boards is sitting on a card table, not inside any enclosure, with an Arctic Cooling Copper Silent heatsink and thermally controlled fan. It just doesn't make sense that they are overheating, unless the heatsinks aren't making good contact with the CPUs. I will replace one of them with an Alpha heatsink and an L1A at 12V, and just see if that makes any difference.

David

Post by haysdb » Sun Jan 11, 2004 11:14 pm

  1. The (Shuttle XPC) Linux server is back folding full time. I increased the CPU fan speed in the BIOS from Low to Medium in an attempt to keep temps down, in case the lockups were temperature related. So far so good, but...
  2. The problem with the two WinXP machines has been resolved. I had the names and IP addresses flip-flopped in the hosts file on the server. Duh. Not sure how that happened.
  3. Two blades continue to show occasional instability. I'm still looking for the cause. One is a 2500+ Barton, the other a 2600+ Barton. Both have Crucial DDR333 memory. Two different motherboards, one an Abit, the other a Shuttle. Neither is overclocked. It doesn't make sense.

    Suspecting heat, I replaced the Arctic Cooling heatsink on one with an Alpha PAL and a 2500 RPM fan, but that hasn't fixed the problem. The heatsink is hardly even warm.

David

ColdFlame
Posts: 451
Joined: Wed May 21, 2003 9:39 pm
Location: Somewhere in Time

Post by ColdFlame » Sun Jan 11, 2004 11:21 pm

Could it be that your PSU does not provide enough juice? You could try giving the troublesome board a PSU of its own, to rule the PSU out. What kind of instability are you seeing? Lockups or LINCS errors?

Post by haysdb » Sun Jan 11, 2004 11:50 pm

One of the two boards is on a 180W micro-ATX power supply, but the other is on a 300W Q-Technology power supply.

According to a Kill-A-Watt, each micro-ATX PSU is drawing about 90 watts at the wall. At an estimated 65% efficiency (and it's probably less than that), the PSU is delivering about 60W to the motherboard, or just 1/3 of the PSU's rated power. But to make sure, I put the full-size ATX PSU on one of the two boards.
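The arithmetic behind that estimate, as a small sketch. The 90 W wall reading, the 65% efficiency guess, and the 180 W rating all come from the post; the function name `delivered_watts` is just for illustration.

```python
def delivered_watts(wall_watts: float, efficiency: float) -> float:
    """DC power reaching the board, given the AC draw at the wall and PSU efficiency."""
    return wall_watts * efficiency


WALL_WATTS = 90.0    # Kill-A-Watt reading per micro-ATX PSU
EFFICIENCY = 0.65    # estimated; the real figure is probably lower
RATED_WATTS = 180.0  # rating of the micro-ATX PSU

delivered = delivered_watts(WALL_WATTS, EFFICIENCY)
load_fraction = delivered / RATED_WATTS

if __name__ == "__main__":
    print(f"Delivered to the board: about {delivered:.1f} W")
    print(f"Fraction of rated power: about {load_fraction:.2f}")
```

A lower real-world efficiency would only reduce the delivered figure, so on these numbers the PSU is running at roughly a third of its rating or less.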

Here is the most recent message, from earlier this evening:

[21:28:24] Completed 290000 out of 500000 steps  (58%)
[21:30:52] Quit 101 - Fatal error: NaN detected: (ener[13])
[21:30:52] 
[21:30:52] Simulation instability has been encountered. The run has entered a
[21:30:52]   state from which no further progress can be made.

I have also had a number of LINCS errors. There have been only a few lockups, and I don't remember whether they involved either of these two boards.
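To keep track of which board produces which kind of failure, the client logs could be tallied with something like the sketch below. The patterns are assumptions taken from the excerpt above plus the bare word "LINCS", since the exact wording of a LINCS log line is not quoted in this thread. The names `count_errors` and `last_progress` are hypothetical helpers, not part of any F@H tool.

```python
import re
from collections import Counter

# Sample taken from the log excerpt quoted in this post.
SAMPLE_LOG = """\
[21:28:24] Completed 290000 out of 500000 steps  (58%)
[21:30:52] Quit 101 - Fatal error: NaN detected: (ener[13])
[21:30:52]
[21:30:52] Simulation instability has been encountered. The run has entered a
[21:30:52]   state from which no further progress can be made.
"""

# Assumed patterns; adjust to whatever the real log lines look like.
PATTERNS = {
    "nan": re.compile(r"NaN detected"),
    "instability": re.compile(r"Simulation instability"),
    "lincs": re.compile(r"LINCS", re.IGNORECASE),
}


def count_errors(log_text: str) -> Counter:
    """Count log lines matching each error pattern."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def last_progress(log_text: str):
    """Return (steps_done, steps_total) from the last progress line, or None."""
    matches = re.findall(r"Completed (\d+) out of (\d+) steps", log_text)
    if not matches:
        return None
    done, total = matches[-1]
    return int(done), int(total)


if __name__ == "__main__":
    print(dict(count_errors(SAMPLE_LOG)))
    print(last_progress(SAMPLE_LOG))
```

Run against each board's log, a tally like this would show whether the NaN and LINCS failures cluster on the same two boards.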

David

Post by haysdb » Mon Jan 12, 2004 12:16 am

One of my boards that has been stable just locked up, requiring a reboot.

David
