A tale of two GTS 250s

A forum just for SPCR's folding team... by request.

Moderators: NeilBlanchard, Ralf Hutter, sthayashi, Lawrence Lee

haysdb
Patron of SPCR
Posts: 2425
Joined: Fri Aug 29, 2003 11:09 pm
Location: Earth

A tale of two GTS 250s

Post by haysdb » Wed Mar 18, 2009 8:42 pm

OK, someone explain this to me.

I have two identical EVGA GTS 250 (9800 GTX+) cards. Each is processing a 5903. One is getting 5500 PPD and the other is getting 4000 PPD. What's up with that? The first is processing a step about every 4:30, while the second takes about 6:00 per step.

Do these projects vary that much or is something whack with my set-up? Both cards are in the same box.
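For reference, here's the arithmetic behind those PPD numbers (a minimal sketch; the 100-steps-per-WU count and the 1888-point base value for a 5903 are assumptions - substitute the real values for your work unit):

Code: Select all

# PPD extrapolated from step time: the WU's point value, scaled to a day.
# 100 steps per WU and 1888 base points for a 5903 are assumptions here.
SECONDS_PER_DAY = 86400

def ppd(points_per_wu, step_seconds, steps_per_wu=100):
    return points_per_wu * SECONDS_PER_DAY / (step_seconds * steps_per_wu)

print(round(ppd(1888, 270)))  # a step every 4:30 -> ~6042 PPD
print(round(ppd(1888, 360)))  # a step every 6:00 -> ~4531 PPD

So the PPD gap tracks the step times almost exactly; the real question is why the step times differ.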

haysdb
Patron of SPCR
Posts: 2425
Joined: Fri Aug 29, 2003 11:09 pm
Location: Earth

Post by haysdb » Wed Mar 18, 2009 9:20 pm

It's apparently just a variation between two work units. I stopped both clients and swapped -gpu 0 and 1 in the shortcuts I use to launch them, so that the two cards swapped work units. The PPD numbers remained the same, so it's not the hardware.
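For anyone who wants to replicate the swap test, it amounts to this (a sketch in Python for illustration only - I actually just edited the shortcut targets; the executable name and folder paths are hypothetical):

Code: Select all

# Two console GPU clients in separate folders, with the -gpu device
# indices exchanged so each card inherits the other client's work unit.
# Executable name and paths are hypothetical; use your own.
import os
import subprocess

CLIENT = "Folding@home-Win32-GPU.exe"  # hypothetical name
DIR_A, DIR_B = r"C:\FAH\gpu-a", r"C:\FAH\gpu-b"  # hypothetical paths

# Originally DIR_A ran with "-gpu 0" and DIR_B with "-gpu 1"; swapped:
subprocess.Popen([os.path.join(DIR_A, CLIENT), "-gpu", "1"], cwd=DIR_A)
subprocess.Popen([os.path.join(DIR_B, CLIENT), "-gpu", "0"], cwd=DIR_B)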

The PPD for one WU has varied between 5100 and almost 6000.

AZBrandon
Friend of SPCR
Posts: 867
Joined: Sun Mar 21, 2004 5:47 pm
Location: Phoenix, AZ

Post by AZBrandon » Thu Mar 19, 2009 12:46 pm

I've noticed variation, but not quite that extreme. On my single GTX+ (roughly the same card you have) I see 4700-5400 PPD most of the time. I don't think I've seen it get down to 4000 unless I start doing a ton of stuff that takes GPU cycles away from FAH. It looks like it's been over 12 hours since your post, though - have you seen it average out more over the WUs you've gotten since then? They did say on the FAH forums that these WUs have a lot of variation, both among different WUs and even from one frame to another within an individual WU. Overall, though, you should be able to complete a 5903 in about 8 hours on a GTX+ on an otherwise idle PC.

haysdb
Patron of SPCR
Posts: 2425
Joined: Fri Aug 29, 2003 11:09 pm
Location: Earth

Post by haysdb » Thu Mar 19, 2009 3:24 pm

Something is screwy; I just haven't figured out what. I'm getting 5700 PPD from one GPU and 3700 from the other.
  • The FAH configurations are identical.
  • The clocks (shader, core, memory) are the same.
  • The GPU core temperatures are running about the same, with similar fan speeds (within 2%).
  • Both are working a 5903.
  • They are at about the same point in the WU - one is at 22%, the other at 25%.
I have to be someplace right now, but later tonight I'll run FurMark benchmarks on each card to see if it identifies any performance difference.

haysdb
Patron of SPCR
Posts: 2425
Joined: Fri Aug 29, 2003 11:09 pm
Location: Earth

Post by haysdb » Fri Mar 20, 2009 8:19 am

I never did discover a definitive reason why the two cards were folding at different rates. I changed some things and they are folding about equally now. In fact, over the last 10 frames their frame times are within 4 seconds of each other on a pair of 5903s. Here are a few of the things I did and tried:
  • Reboot. As always with Windows, sometimes the solution is just to reboot. :|
  • Update drivers. I'm using a beta operating system (Windows 7) so the thought of using leaked beta drivers didn't scare me. I'm now using the leaked 185.20 drivers. So far so good, i.e. my system hasn't crashed in the 8 hours since I installed them.
  • Put all GPU clocks back to stock. After running the shaders at 1998, one of the cards reverted to failsafe settings, including a 602 MHz shader clock. [Interestingly the WU did not abort.] I'm now running the shaders at 1890 (one "click" up from stock 1836), and the GPU core at 700 (down from 750). Memory is still at 1100.
  • Set affinity on the four vmware instances to 0&1, 2&3, 4&5, 6&7. This has resulted in MAYBE a 1% improvement in CPU utilization, so it probably isn't worth the hassle.
    Edit: I now believe setting affinity makes little to no difference. I have not done controlled tests, however.
  • Set priority on the four vmware instances to Low. They were running at Normal priority by default, and may have been starving the GPU clients for CPU time. The GPU clients don't need much, but when they need it, they need it. This doesn't seem to have affected the vmware jobs in any negative way (perhaps because the GPU clients use so little CPU).
  • Set priority on the two GPU clients to Below Normal. The idea was to run them at a slightly higher priority than the CPU clients, which are running at Low, ensuring the GPUs aren't contending with the CPU clients for processor time. Unfortunately, the priority seems to go back to Low at the beginning of each new work unit, so this didn't work.
  • Set affinity on the two GPU clients. I set one to run on cores 0,1,2,3 and the other on cores 4,5,6,7. However, this goes back to default values (one assigned to core 6 and one assigned to core 7) after each WU. It probably isn't important anyway. I may be able to change that in the GPU client config options. (A scripted workaround is sketched just below this list.)
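For anyone who wants to re-assert these settings automatically instead of fixing them in Task Manager after every WU, something like this would do it (a rough sketch using Python's psutil; it assumes the GPU work runs in FahCore_14.exe - check Task Manager for the name your client actually spawns):

Code: Select all

# Re-apply affinity and priority to the GPU folding cores, since both
# seem to reset at each new work unit. The process name is an assumption.
import time
import psutil

GPU_CORE_NAME = "FahCore_14.exe"  # adjust to what your client spawns

def pin_and_prioritize():
    cores = [p for p in psutil.process_iter(["name"])
             if p.info["name"] == GPU_CORE_NAME]
    for i, proc in enumerate(sorted(cores, key=lambda p: p.pid)[:2]):
        proc.cpu_affinity(list(range(4 * i, 4 * i + 4)))  # cores 0-3, then 4-7
        # Below Normal: above the Low-priority vmware/CPU clients,
        # below everything interactive.
        proc.nice(psutil.BELOW_NORMAL_PRIORITY_CLASS)

while True:
    pin_and_prioritize()
    time.sleep(60)  # re-assert once a minute in case a new WU reset things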
Ultimately it may have simply been the reboot that fixed the problem. All of this is purely "anecdotal" since I didn't follow a structured path of changing one thing at a time and re-testing. I often changed a couple of things at once, and I didn't keep a good log of how each step affected everything.

I do think the new drivers have made a difference, although not in a 100% good way. They seem to utilize GPU resources more completely, but at a cost in heat and power draw. My GPU temperatures are higher, the GPU fan speeds are higher, and my Kill A Watt is now reporting a steady power draw of 380 to 410 watts, which in turn has driven my PSU fan up to 1361 RPM. On the bright side, FAH is reporting 13K PPD from the two cards (a pair of 128-shader GTS 250s).
Last edited by haysdb on Fri Mar 27, 2009 5:24 am, edited 1 time in total.

AZBrandon
Friend of SPCR
Posts: 867
Joined: Sun Mar 21, 2004 5:47 pm
Location: Phoenix, AZ

Post by AZBrandon » Fri Mar 20, 2009 9:53 am

Are you running Windows 7 64-bit? The 185.20 drivers are now known to cause one card to run faster than the other in Vista 64 and Windows 7 64, so even if they're equal right now, once you get a new WU on one of the cards, it may pick up speed and run faster than the other. However, 13k PPD sounds more like both cards are running at bonus speeds.

PPD bonus thread on Vista/7 64-bit with 185.20

haysdb
Patron of SPCR
Posts: 2425
Joined: Fri Aug 29, 2003 11:09 pm
Location: Earth

Post by haysdb » Fri Mar 20, 2009 9:22 pm

Yes, Windows 7 64-bit.

Both cards have long since picked up new WUs.

18,713 points posted today. Not a bad day. :shock: I was doing 2K a MONTH with my old 2.4 GHz Northwood rig. I post more in 3 hours now than I was doing in a month.

AZBrandon
Friend of SPCR
Posts: 867
Joined: Sun Mar 21, 2004 5:47 pm
Location: Phoenix, AZ

Post by AZBrandon » Sun Mar 22, 2009 9:21 pm

In the days since, it has been called into question whether the 185.20 drivers are producing accurate results. Just curious, which driver version are you using now? Those are really high points from just a pair of cards, wow. My 9800GTX+ has done around 6700 on its best single day in the last month or so, so your output is quite a lot higher.

haysdb
Patron of SPCR
Posts: 2425
Joined: Fri Aug 29, 2003 11:09 pm
Location: Earth

Post by haysdb » Sun Mar 22, 2009 10:24 pm

Could you point me to a source re: the 185.20 drivers producing inaccurate results?

Edit: I found a couple of threads. There's nothing definitive at this point.

haysdb
Patron of SPCR
Posts: 2425
Joined: Fri Aug 29, 2003 11:09 pm
Location: Earth

Post by haysdb » Wed Mar 25, 2009 2:19 pm

A couple of points:

I was not using the 185.20 drivers when I posted about the wildly varying PPD numbers, so those drivers could not have been the cause.

I'm still using the 185.20 drivers, but my production the last few days has been in the 4000s, occasionally low 5000s. At the moment they are working on 5903s and FahMon is reporting 4667 PPD for one and 4774 for the other. The 6500 PPD numbers reported during the first day after installing those drivers are now nowhere to be found.

haysdb
Patron of SPCR
Posts: 2425
Joined: Fri Aug 29, 2003 11:09 pm
Location: Earth

Post by haysdb » Thu Mar 26, 2009 7:14 am

A project 5779 WU is really working one of my cards this morning. The GPU temp is touching 72C and the fan is running at 60%, which is clearly audible. The good news is it's flying through this work unit at 2 minutes per step, so it will be finished shortly.

This did give me a good opportunity to experiment with the clocks. I lowered shaders/core/memory from the stock 1836/756/1100 to 1728/700/1000, and the temperatures dropped MAYBE 2 degrees while the fan speed didn't change at all. Step times increased from 2:02 to 2:12 (8%) without any real effect on the GPU temperature. This is an unexpected result.
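Checking the math on that slowdown (a quick sketch; it assumes points per WU are fixed, so PPD scales inversely with step time):

Code: Select all

# Step time went from 2:02 to 2:12 after the underclock. With a fixed
# point value per WU, PPD scales inversely with step time.
before, after = 122, 132                   # seconds per step
print(f"{(after - before) / before:.1%}")  # 8.2% longer per step
print(f"{before / after:.1%}")             # ~92.4% of the original PPD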

Wondering if the reverse would hold, I bumped the shader clock up one "step" to 1890. The fan speed increased by another 2%.

I'm about done with overclocking (and underclocking) my graphics cards. The default settings appear to me to be "optimum," i.e. there is a price to increasing the clocks but little benefit to lowering them.


In contrast, the second of the two identical cards is cruising along at 58C with the fan at 39%. I *barely* hear the fans when both are running at no more than 40%. They basically drop to the level of the CPU and PSU fans at that speed, both of which are running at around 1200 RPM, give or take 100 depending on load. Like right now the CPU fan is running at 1278, but my CPU utilization is 97%, so that's to be expected. The WUs my graphics cards are working on aren't stressing them like the 5779 was a while ago, so system power demand is down a bit and the PSU fan is running at only 1110 RPM. I can live with that. It's when it gets up around 1300 that I start hearing it.


My point is - and this is what I find intriguing about combining silent computing with folding - it's all about matching components and tuning them to get the most you can with heat, power, and noise all within the limits you define. I can run my CPU at 3.2 GHz; any higher and the fan speed becomes intrusive. I can handle a pair of GTS 250 graphics cards. My PSU could probably handle GTX 260s, and I know the new 770W Zalman can, but would they run as quietly as the 250s? The 250s are right at the point where I don't usually hear them, but sometimes I do. I think 260s would push the fan on my current Enermax PSU to levels I would find intrusive, although the Zalman might be fine with the higher power draw. It's fun (for me anyway) juggling all the variables and optimizing my system.

haysdb
Patron of SPCR
Posts: 2425
Joined: Fri Aug 29, 2003 11:09 pm
Location: Earth

Post by haysdb » Fri Mar 27, 2009 5:39 am

AZBrandon wrote:Are you running Windows 7 64-bit? The 185.20 drivers are now known to cause one card to run faster than the other in Vista 64 and Windows 7 64, so even if they're equal right now, once you get a new WU on one of the cards, it may pick up speed and run faster than the other. However, 13k PPD sounds more like both cards are running at bonus speeds.

PPD bonus thread on Vista/7 64-bit with 185.20
I had the situation for a day or two where one card seemed to be running "faster" than the other, but that's no longer true. My points have also gone back to the 4000s, sometimes the low 5000s. I'm still running the 185.20 drivers. My theory? There were, and still are, some "exceptional" work units. I had one yesterday that drove my GPU temperatures into the 70s and was giving me ~6000 PPD, IIRC. That same GPU is folding at 56 degrees now and giving me 3600 PPD according to FahMon. I don't remember what project that was, but I'm seeing large variation even within a project, say between a pair of 5904s.

There may yet prove to be "issues" with the 185.20 drivers with regard to Folding@Home, but I remain unconvinced they are a problem.

haysdb
Patron of SPCR
Posts: 2425
Joined: Fri Aug 29, 2003 11:09 pm
Location: Earth

Post by haysdb » Fri Mar 27, 2009 5:54 am

http://foldingforum.org/viewtopic.php?f ... 165#p90893
ihaque - Pande Group member wrote:There was a bug in the core that interacted with some changed behavior in the 185.20 driver. The good news is that it doesn't appear that this particular error "contaminated" any runs; the effect here was that some results were simply not calculated (and therefore not returned), not that returned results were corrupt. Core execution was faster precisely because not all the calculations were done.

I've released an updated version (1.25) of the core that appears to fix the bug; your clients should automatically update when they get a new 59xx WU assigned, or you can manually force an update by stopping the client, deleting FahCore_14.exe, and restarting. Please reply to this thread if you still see the PPD "doubling" with the 1.25 core, as the core results are definitely missing some returned data under those conditions.
This would explain why my points have gone back to normal.

Turns out it wasn't the drivers at all, but a bug in the core that the 185.20 drivers exposed. Running those drivers did prove to be less useful to the project, though, since points were awarded for work that wasn't actually done.

I will be reinstalling Windows 7. When I do I will go back to using the "approved" drivers.

AZBrandon
Friend of SPCR
Posts: 867
Joined: Sun Mar 21, 2004 5:47 pm
Location: Phoenix, AZ

Post by AZBrandon » Fri Mar 27, 2009 11:42 am

haysdb wrote:My points have also gone back to the 4000s, sometimes the low 5000s. I'm still running the 185.20 drivers. My theory? There were, and still are, some "exceptional" work units. I had one yesterday that drove my GPU temperatures into the 70s and was giving me ~6000 PPD, IIRC. That same GPU is folding at 56 degrees now and giving me 3600 PPD according to FahMon. I don't remember what project that was, but I'm seeing large variation even within a project, say between a pair of 5904s.
If you read the descriptions over on FoldingForum, they say all the 590X series WUs will have large variations in frame time within the WU. They are often consistent for a long time, but then have random frames here and there that complete either very quickly or very slowly. IIRC they said a variation of 50% is considered normal from frame to frame, but over the course of the whole WU it should still average out, with something like 10% variation from WU to WU.

Right now there are two main WU types on the nvidia side, and the easiest way to tell them apart is by which core they use. You'll see this in the log file for FAH. If it loads fah_core11, it's the older style, probably a 768-point WU; you may see anywhere from 5400-6000 PPD, and it should run at a steady, very high temperature. The fah_core14 WUs are now worth 1888 points and will fluctuate more. I've seen 4700-5400 PPD reported on a frame-by-frame basis, averaging out around 5200 PPD for the whole WU most of the time on those guys. Temperatures are significantly lower with fah_core14 WUs than fah_core11 WUs.
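To put rough numbers on that frame-to-frame noise, here's a little simulation (the 1888-point value is from above; the 100-frame count and the +/-50% spread are illustrative assumptions, not measurements):

Code: Select all

# Why per-frame PPD swings wildly while the whole-WU average stays stable:
# simulate a 1888-point, 100-frame WU whose frame times vary +/-50% around
# a 5:30 mean. Frame count and spread are illustrative assumptions.
import random

POINTS, FRAMES, DAY = 1888, 100, 86400
MEAN_FRAME = 330  # seconds per frame (5:30)

times = [MEAN_FRAME * random.uniform(0.5, 1.5) for _ in range(FRAMES)]

per_frame_ppd = [POINTS * DAY / (FRAMES * t) for t in times]
whole_wu_ppd = POINTS * DAY / sum(times)

print(round(min(per_frame_ppd)), round(max(per_frame_ppd)))  # huge swings
print(round(whole_wu_ppd))  # but the whole WU lands near ~4900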
