
Q6600 G0 crunching, some questions and some issues

Posted: Sun Sep 30, 2007 12:46 pm
by whiic
So, it's autumn again and I've started contributing to keeping the house warm and the SPCR score climbing. I have also built a new system and retired my Prescott from folding duty.

Here's my cruncher:
viewtopic.php?t=43624

I've encountered some issues with it:
1) I left it crunching and removed the monitor, mouse and keyboard. Later I plugged them back in and noticed the monitor was stuck in its power-saving state, and the mouse and keyboard LEDs remained unlit. I no longer remember whether the computer managed to shut down with a short tap on the power button (soft off) or whether I had to hard reset it. I disabled monitor stand-by and it hasn't recurred since.
2) I believe I've had that EARLY_UNIT_END & client-core communications error just once. Nothing to worry about, I guess, since others have had similar problems with certain WUs.
3) There's a recurring error (or "non-error") with SMP: when I get a new WU and it starts crunching, it more often than not aborts before the first 1%. Last time it aborted between 4 and 5%, but that was unusual. No error message or any other line of text whatsoever after the last checkpoint. The CPU cores just return to idle (they are not stuck in a loop or anything like that). Restarting the client has always solved it.

The CPU is overclocked to 3.15 GHz @ 1.35 volts in BIOS (SpeedFan reads a lower voltage). It's Prime95 stable... well, at least for the 3 hours that I ran it, so I can't know for sure.

I thought I was already stable at 3.2 GHz with the same voltage, since I had run Prime95 for around 3 hours (or whatever one complete loop takes), but later I managed to find instability when doing fan speed testing while running Prime95 to keep the heat up. Stable at 60 deg C but failed at 65 deg C.

I backed the clock off a bit and kept the original voltage, and it remained stable at a lower fan PWM setting for 3 hours. Running FAH doesn't produce nearly as high temperatures (around 50 deg C typically), so this probably shouldn't be a problem.

Now running the CPU fan at 70% PWM. The temperature difference compared to 100% PWM is just 1 degree, so it's not worth the extra noise. I know OC'd cores need to be kept cooler to remain stable, especially if one does not want to use a higher voltage. And using a higher voltage to increase stability at high temps will increase temps, and so on. But how high is typically OK for a Quad at higher clocks? Do you think that 50 MHz drop actually made any difference, or is it sheer luck it didn't resurface during the next 3 hours of testing? I know finding errors is a hit-or-miss kind of thing and nothing is really certain.

...ah, and it's definitely worth noting that SMP showed the same behaviour even at stock clock (though I had reduced the voltage to 1.1 volts).

Do any of you have these silent abortions using SMP client?

In case you're curious:
with 3.15 GHz, it's pretty accurately 9 minutes per checkpoint.

And one question very off-topic:
I also know that YMMV (Your Mileage May Vary), but do systems with non-exotic cooling solutions last several years crunching with 30+% overclocks? (Say, kept within 50...60 deg C, so no watercooling here. Not even a U120E.) If we forget about old systems (since they are too easy to cool) and modern Core 2s (since they are too new to know about long-term) and focus on, for example, Prescotts (since they run hot): at how high clocks and temperatures have you run them (for several years - not interested in benchmark records)?

Posted: Sun Sep 30, 2007 7:44 pm
by bkh
> Do any of you have these silent abortions using SMP client?

Never experienced that with AMD X2 or Intel C2D at stock speeds running 64-bit Ubuntu. Maybe you have a subtle instability from your overclock.

Posted: Mon Oct 01, 2007 4:33 am
by whiic
But why would it crash at the beginning of a WU, and only there, and then work perfectly after restarting the program?

I have encountered similar silent abortions of crunching at 2.4 GHz, 3.0 GHz, 3.2 GHz and 3.15 GHz. Out of these clocks, only 3.2 GHz was later proven slightly unstable.

I could of course run a WU at stock voltage and stock clock. It'd be quite a waste of energy, since the VID is ridiculously high considering the overclockability and undervoltability of G0. Many people have run G0 at a lower voltage setting than I have, but I'm stuck at 1.1 because of a BIOS limitation. I could run it at perfectly stock settings to see if it'd crash, though, just to eliminate one possibility. I only need to run it for the first 1%, because that's where it mostly happens. Maybe I'll make the test when I finish the WU in progress, as the problem only occurs with a new WU.

I did a droop elimination mod to the motherboard today. Running stability testing at 3.15 GHz with vcore dropped from 1.350 to 1.265 in BIOS. The achieved idle vcore dropped by 0.065 volts; under load, vcore remains the same as before the "pencil mod", 1.255 volts.

Posted: Mon Oct 01, 2007 9:20 am
by floffe
If you're concerned with stability while folding, StressCPU runs pretty much the same code as F@H, so that might be a better stability test for folding than Prime95 would be.

Posted: Wed Oct 03, 2007 7:00 pm
by aristide1
Whiic, there are fans that move more air than the Noctua NH-U12F, perhaps at the same noise level.

The other question is: suppose the slowdown wasn't helping the CPU become more stable, but was instead helping the northbridge be more stable? How hot is that thing? Have you added cooling and voltage, and if you haven't, then why not?
whiic wrote: In case you're curious: with 3.15 GHz, it's pretty accurately 9 minutes per checkpoint.
Wow.

Posted: Thu Oct 04, 2007 2:01 am
by Dutchmm
Well, that makes me feel better .... about my C2D 6750 at stock, where I am getting about 17.5 minutes per checkpoint.

But my CPU temps, reported by lm-sensors, are about 55C above ambient :cry: . As far as I can see, neither Linux nor the BIOS is throttling, but at the weekend, before I put the sides back on, I will try to slide a thin coin under the retention bracket of my U120E.

Mike

Posted: Thu Oct 04, 2007 2:38 am
by Wibla
Those temps are way too high, what's your setup like?

Edit: I'm hovering around 16 minutes/checkpoint on an E6600 @ 3.06 GHz, 2GB RAM at 4-4-4-12 / 680 MHz DDR2

Temps: Fans at 70% (500-800rpm scythe/nexus) = 36/39C (core0/1)

Setup is ...

Posted: Thu Oct 04, 2007 3:52 am
by Dutchmm
Antec P182 (with tri-cools set to low)
Gigabyte P35-DS4
E6750
4 x 1G Kingston DDR2 6400
Thermalright Ultra 120 Extreme (unlapped)
Scythe S-Flex 1200 on Thermalright (probably blowing in the wrong direction)
MSI 7600 GT 256Mb (Passive, but this won't affect CPU temp with the sides open)
Hitachi Deskstar 500G

And no, I don't like the look of the temps either, but it is really really quiet. I can't even hear the crackling of the fires that must be raging in the processor cores!

Mike

Posted: Thu Oct 04, 2007 4:51 am
by Wibla
Well, you should have the fan blowing through the sink towards the back.

You'll probably get better temps with the case closed, a "Kama Bay" in the 5.25" slots (or even just taking out one or two of the blocking panels), and a fan in front of the mid HDD bay (but remove the HDD tray if you're not going to have drives there).

Image

So it fits in between the diskette tray and the upper disk .

Posted: Thu Oct 04, 2007 6:19 am
by Dutchmm
bay?
And I forgot to mention the PSU; it's the same as yours. Your cabling is a bit tidier than mine is, yet. But as I have to take it to bits on Saturday to put a washer under the HS retention bracket - and maybe wipe a little more of the thermal glop off the HS bottom - I haven't bothered to tidy up yet.

Will see if I can get a kama in the local shops.

Best

Mike

Posted: Fri Oct 05, 2007 11:28 am
by Wibla
It's called a Scythe Kama Bay :) and it comes with an 800rpm 120mm Scythe fan (not an S-Flex afaik).

Posted: Fri Oct 05, 2007 1:16 pm
by iamajai
I've had the SMP client go idle sometimes when my IP address or network connection renews or gets dropped. I can get it to happen repeatedly each time I go and reset my router. Is something similar happening at your end?

Posted: Fri Oct 05, 2007 4:14 pm
by whiic
iamajai, yeah. I think the problem is solved. I share one network cable between two computers and only plug it into the Q6600 system when a WU is done, to upload and download, and then disconnect after it's done transferring. So, no wonder it always crashed at 1%. (Once at 3%, but I think I might have left the cable in a little longer and did some web surfing on the computer before removing the network cable.)

I consider the problem solved with high likelihood. Since the old client expired and I installed the new one (which doesn't contain any major updates, just an extension of the client validity date), it has not crashed in a similar fashion when I removed the network cable.

Also, the previous client always said something about "8 consecutive improper terminations of core" when the program was started, no matter how I closed the program before restarting it. And I mean always... not just when it crashed due to the network cable. Weird symptoms. The new installation of the SMP client has had no oddities so far.

aristide1: "How hot is that thing? Have you added cooling and voltage, and if you haven't then why not?"

I have no reliable sensor on the NB, but I placed a thermocouple between the fins and it topped out at 61 deg C (350 MHz FSB). Voltage is set on Auto. I have checked in BIOS that Auto does increase the NB voltage automatically.

I dropped the voltage down to 1.275 again and ran Prime95 at a higher temp of 65...70 deg C (fan PWM locked in SpeedFan to 30%... or was it 40%?). It did not crash like the last time. Maybe it's because I made some changes in BIOS: PCI-E clock locked to 100 MHz, PCI locked to 33 MHz, spread spectrum clocking disabled. Maybe the last one did the trick?

Changing the first two didn't solve an issue I have with the GPU driver crashing and reloading occasionally. Not gaming... mostly idling when it happens. Usually when I'm surfing and touch the mouse wheel: boom! Monitor full of funny things, then black, then back to normal with a notification that the GPU driver was restarted. The GPU or just the GPU driver? I believe I'm using the latest one. Of course I could intentionally try an old one and see if it crashes too.

And I probably should install some utility to monitor GPU temperature, but at least the heatsink on it doesn't burn my fingers. Slightly hot at most. Probably cooler with the case closed, so it shouldn't be a problem... especially when idling.

Posted: Tue Nov 06, 2007 10:31 am
by aristide1
whiic wrote: aristide1: "How hot is that thing? Have you added cooling and voltage, and if you haven't then why not?"

I have no reliable sensor on NB but I placed a thermocouple between the fins and it topped at 61 deg C (350 MHz FSB). Voltage set on Auto. I have checked in BIOS that Auto does increase NB voltage automatically.
Yes, mine is around 60C, and I can't keep my fingers on it for very long. I plan on increasing its cooling before I OC, even if that simply means attaching some small stick-on heatsinks to the current NB heatsink.

Your FSB should stay near multiples of 66 MHz, i.e. you should be running a 333 MHz FSB as opposed to 350, because of the NB divider frequencies. That way there are fewer components on the board itself that are OC'd.

PCI-E locked at 100 MHz, PCI locked at 33 MHz and spread spectrum off are all required as well.

Posted: Sun Feb 03, 2008 5:29 am
by whiic
An update on my cruncher. Updating the GPU BIOS solved the problem with the GPU becoming unresponsive and needing a reset by the driver (not requiring a reboot) or, even worse, crashing Windows because the driver couldn't reset the GPU. The latter would occur quite often while editing with Premiere.

While updating the GPU BIOS did solve the issue with the GPU becoming unresponsive during idle, it has now lost pretty much all 3D acceleration capability. FlightGear, for example, ceased to work: it's full of particles and all 3D objects are completely messed up. The GPU is probably fuxxored.

The dual Ethernet controller on the motherboard is a bit buggy. Maybe I didn't notice it earlier because I used to reboot my computer every few days, but now that I'm leaving the computer unattended for days, I've noticed the network controller crashes and requires a power cycle (a Windows reboot usually is not enough) to get it back online. This has been hurting my crunching lately. At least I hope it has always been that way... it'd be a bummer if my motherboard had started to die. I could try to solve it with a BIOS update... and failing that, I could fix it by adding a PCI Ethernet card. That way crunching would be uninterrupted.

But some positive news as well:

I have managed to cut the interval between checkpoints down to 6½ minutes. Well, 13 minutes between checkpoints, but running two SMP clients simultaneously (with FAH SMP Affinity Changer). The old checkpoint interval was 9 minutes, so (9+9)/13 = 1.38 => a 38% increase in FAH calculating power without extra OC.
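
In case anyone wants to redo the arithmetic, here's a quick Python sketch (the two checkpoint intervals are the only measured numbers; the rest just follows from them, and it assumes every checkpoint is the same amount of work):

Code:
# throughput estimate from checkpoint intervals
old_interval = 9.0    # minutes per checkpoint, one SMP client
new_interval = 13.0   # minutes per checkpoint, two SMP clients running side by side

old_rate = 1.0 / old_interval   # checkpoints per minute, single client
new_rate = 2.0 / new_interval   # two clients, each checkpointing every 13 minutes
print("effective interval: %.1f min" % (1.0 / new_rate))               # ~6.5 min
print("throughput gain: %.0f%%" % (100 * (new_rate / old_rate - 1)))   # ~38%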

My points per day should increase 2500 -> 3500 if these preliminary results obtained from checkpoint intervals are to be trusted. With 3500 PPD I'd be producer #7, between avidan and Dutchmm. :)

Right now I'm below a 2500 PPD average due to the network controller crashing. With a day or two of downtime each week, I can only achieve 2000 PPD (average) with my current setup. I hope I get it fixed soon and keep it at 3500 PPD from now on.

Enjoy it while it lasts ....

Posted: Sun Feb 03, 2008 10:20 am
by Dutchmm
whiic wrote: My points per day should increase 2500 -> 3500 if these preliminary results obtained from checkpoint intervals are to be trusted. With 3500 PPD I'd be producer #7, between avidan and Dutchmm. :)
I doubt I can improve on my 3000 to 3300 PPD until later this summer, when I get a chance to silence and overclock my wife's 2160. But she is running at stock right now, so her 2605 WUs take 43 hours, as compared with 18.5 for my E6750, which is running at 3.2 GHz.

BTW, would you (or any of the other Windows users) consider running a VM appliance so you could use Linux SMP under Windows? I ask because you are running the Q6600 at nearly the same clock as me, but your WU times must be about 21.5 hours. And they ought to be nearly the same as mine.

At the moment, I think I have all the tools to make one for a few more weeks (evaluation licenses :-((), so now would be the time to say.

TTFN

Mike

Posted: Tue Feb 05, 2008 8:37 am
by whiic
I'm doubtful about increasing PPD by running a Linux virtual machine under Windows. Even if Linux might utilize all the cores better, the problem becomes how to make the virtual machine utilize the real machine it's running on. It'd still be running on Windows. It's like trying to boost IDE-133 to 150 MB/s by adding an IDE-to-SATA bridge: it's the weakest link that counts.

The only way I can see a Linux virtual machine improving PPD for crunching would be if the application used different crunching methods, or if using Linux was encouraged by giving extra credit. For example, I believe SMP clients receive extra points compared to multiple instances of the single-core client. It might not be just because it creates better results, but because they might simply want people to beta test the SMP client. The same might apply to Linux clients vs Windows clients.

The old WU time was around 18 hours. In practice, due to the inability to connect to the project (Ethernet controller crashes, ISP downtime, FAH server downtime), it was closer to the 21.5 hours you mentioned.

With a 13-minute interval, my new theoretical WU time will be 21.6 hours, but I'll be uploading two WUs (that is, until one of the instances starts to leave the other one behind for some reason), and thus the WU time that can be used in determining crunching efficiency is around 11 hours.

Even with Affinity Changer's nearly 40% boost in PPD (that is, IF it really gives as big a benefit as I've seen in the early checkpoint times), I'm still FAR from having double the crunching power of your 3.2 GHz Duo (since 3.15 GHz isn't that much slower a clock and I'm running four cores).

Or does your 3000 PPD include not only your overclocked E6750 but also your wife's (stock-clocked) dual-core 2160? Does she crunch under her own name or under yours? Because otherwise I would be shocked... that a 3.2 GHz Duo (Linux) could outperform a 3.15 GHz Quad (Windows, single SMP client, no Affinity Changer).

Still, even if the points are combined, you're getting 18.5-hour WU times with your E6750. That is only half an hour off the WU time of a slightly lower-clocked Quad. This implies that while the Quad may be slightly ahead in crunching power, there's way more wasted performance running it under Windows. Efficiency of the Windows Quad vs the Linux Duo appears to be roughly:
18.5/18.0 * 2/4 * 3.2/3.15 = 52.2%.
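
Spelled out as a rough sketch (it assumes Windows and Linux WUs are worth the same and that WU time translates directly into throughput; the numbers are just the ones quoted above):

Code:
# rough per-core, per-GHz efficiency: Windows Quad vs Linux Duo
duo_hours, duo_cores, duo_ghz = 18.5, 2, 3.2      # E6750, Linux SMP
quad_hours, quad_cores, quad_ghz = 18.0, 4, 3.15  # Q6600, Windows SMP, single client

# WUs per (core * GHz * hour), compared to each other
duo_eff = 1.0 / (duo_hours * duo_cores * duo_ghz)
quad_eff = 1.0 / (quad_hours * quad_cores * quad_ghz)
print("Quad vs Duo efficiency: %.1f%%" % (100 * quad_eff / duo_eff))   # ~52.2%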

If Affinity Changer improves the Quad's score by 38%, it'd be at 72% of the calculating efficiency of the Duo under Linux. Still not perfect, but... it could possibly be attributed to the fact that the Quad isn't a single-chip implementation but uses two separate caches. Or maybe that can't make such a big difference.

I don't know if Windows SMP and Linux SMP work units are identical and whether the points are given on the same basis, with no favouring of either client. Someone else could comment on that.

Posted: Tue Feb 05, 2008 9:17 am
by Dutchmm
3000 PPD is both. I am getting about 1.25 * 1760 from the E6750 (I won't tempt fate with any remarks on the reliability of my network) and about 4/7 of 1760 from my wife's stock 2160. Both running Linux SMP beta 6. Both under my name (well, it was my idea. She just wants to read email and surf).

So no, I am not getting nearly the yield of your Quad. But nor are you getting nearly double the yield of the E6750, which is what I would expect of two virtual linux duos. I believe Neil has theorised elsewhere in this forum that linux SMP gives you more bang for your buck than windows, and this looks like confirmation.

The reason for suggesting a VM is that it has also been theorised that the SMP doesn't scale well above 2 cores (although I have seen it allowed that this may be because Intel is not as good at making scalable SMP chips as AMD). So running two VMs, each attached by affinity to two of the cores produces a higher result. I think I saw someone talking about 4000 PPD, which is certainly more like double my yield.

But it would be an interesting experiment to see what happens if you run 2 windows SMP VMs, each bound to a pair of CPUs. Because, and this is the clincher, if I ever get allocated one of the 2500 pointers on the E6750, I expect to finish it in just over one day. 1.5 tops. Of course, if my wife gets one, it's game over ROFL ... until I turn up the wick on her CPU.

Posted: Tue Feb 05, 2008 11:30 am
by whiic
Ah... 1.25 * 1760 = 2200 PPD. If mine crunches two WUs in 22 hours, 11 hours each, that's roughly 2 * 1760 = 3520 PPD. 3520/2/2200 * 3.2/3.15 = 81% efficiency, Quad vs. Duo.
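
Same thing in sketch form, normalised to PPD per core pair and corrected for the small clock difference (keeping in mind the 3520 PPD is still only projected from checkpoint times):

Code:
# projected efficiency, PPD per core pair and per GHz
duo_ppd = 1.25 * 1760   # ~2200, Dutchmm's E6750
quad_ppd = 2 * 1760     # ~3520, my Q6600 with two clients (projected)
duo_ghz, quad_ghz = 3.2, 3.15

eff = (quad_ppd / 2.0) / duo_ppd * (duo_ghz / quad_ghz)   # the quad has two core pairs, the duo has one
print("%.0f%%" % (100 * eff))                             # ~81%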

Dutchmm: "I believe Neil has theorised elsewhere in this forum that linux SMP gives you more bang for your buck than windows, and this looks like confirmation."

It may be true, but running a VM under Windows doesn't sound that optimal either.

Dutchmm: "The reason for suggesting a VM is that it has also been theorised that the SMP doesn't scale well above 2 cores"

Maybe... but I am running two instances of SMP (each of which has 4 crunching processes) and have the CPU affinities controlled with software designed specifically for maximizing FAH production. Doesn't this pretty much mean there are two cores per SMP client? Also, there are four processes per SMP client and each core supports hyperthreading (two threads per core), so I think that with two simultaneous SMP clients the Quad can be utilized relatively effectively. Emphasis on relatively... and in relation to a single SMP client, that is (not in relation to Linux crunching).
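
For what it's worth, this kind of pinning could also be scripted by hand instead of using a ready-made tool; here's a purely hypothetical sketch using the psutil library (the PIDs are made up for illustration, and this is not what FAH SMP Affinity Changer actually does internally):

Code:
# hypothetical sketch: bind the processes of two SMP client instances to separate core pairs
import psutil

# placeholder parent PIDs -> cores; in practice you'd look the PIDs up in Task Manager first
PINS = {1234: [0, 1], 5678: [2, 3]}

for parent_pid, cores in PINS.items():
    parent = psutil.Process(parent_pid)
    for proc in [parent] + parent.children(recursive=True):
        proc.cpu_affinity(cores)   # restrict the client and its worker processes to one core pair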

Still, VMs are inefficient, so even if Linux could utilize all the CPU power given to the VM, it's still Windows that handles feeding the VMs. Getting rid of Windows could be a solution, but it also makes things that I'm used to more difficult.

So, I think I'll pass on the experiment. If you find evidence of Linux VMs (not just Linux on real hardware) being more efficient than Windows (with FAH Affinity Changer added and two separate SMP clients running), please link me to the discussion and I may reconsider.

Posted: Wed Feb 06, 2008 11:49 am
by floffe
whiic wrote: So, I think I'll pass on the experiment. If you find evidence of Linux VMs (not just Linux on real hardware) being more efficient than Windows (with FAH Affinity Changer added and two separate SMP clients running), please link me to the discussion and I may reconsider.
I don't think anyone has tested this exact setup, but I know for sure that someone here posted benchmarks of the linux SMP client being faster than the windows one, even when run in a VM inside windows. That was a while back, and with one client only, IIRC. I think it had to do with the client being harder to optimise for SMP on windows compared to linux.