Intel's new Capabilities Assessment Tools

The Silent Front

EVALUATING THE INTEL CATs

Earlier, I mentioned that understanding the results of a benchmark requires answering two questions:

  • What hardware components are being measured?
  • What kind of performance is being measured?

Now that both of Intel's tools have been described, these questions are easy enough to answer. The tools are similar enough that the answers are the same for both.

The answer to the first question is that they measure a whole system, not a single component. For measuring hardware in isolation, timedemos and synthetic benchmarks are still just as valid as they always were. However, instead of trying to filter out the effects of other components, the tools measure all the components together as a whole. Let me repeat that in different words: the G-CAT and DH-CAT are designed for testing PC systems, not individual components.

So, what's the use of testing whole systems if users buy only single components at a time? I can think of two uses:

  1. The vast majority of PC users do not build their own systems — they buy them pre-configured from the major OEMs. DIYers make up only 1-2% of the PC marketplace. The CATs are designed to provide useful results to a wide market segment, not just enthusiasts.
  2. Testing a whole system while changing only a single component is a good method for judging the effect of that component on a system. This shows the actual effect of the component on the system rather than its isolated performance, which may be limited by the rest of the system.

Intel's new CATs will probably be adopted as benchmarks by the mainstream tech media (such as the glossy computer magazines in drug stores) and by OEMs as part of their marketing. They will be useful for consumers who have no time or interest in doing extensive research before making a purchase. Certainly, something needs to replace Intel's long focus on CPU clock speed as a simple measure of PC performance.

The second use is of direct interest to a hardware review web site like SPCR.

In the past, the approach to evaluating performance has been to test and judge the hardware in isolation. When testing a graphics card, for example, the rest of the test bench was usually as fast a machine as possible to reduce the possibility of it affecting the benchmark. But this approach is less valid for Intel's tools because the rest of the system is supposed to influence the results.

Using the G-CAT or DH-CAT to evaluate a specific component requires maintaining a static system and changing only the component being tested between tests. Any change in the result can then be attributed to the new component. However, it is not the absolute performance of the hardware that is being evaluated. Instead, the CATs measure the impact of that hardware on a specific system. In other words, the result will only change if the new hardware makes a perceptible difference in how the system performs.

The answer to the second question — What kind of performance is being measured? — is that, in essence, user experience is being measured. Strictly speaking, that's not quite true; the G-CAT uses FRAPS to measure frame rate, and the DH-CAT measures dropped frames, frame delays and variations in image quality. However, with the research that Intel has done, these raw measurements are interpreted in more meaningful ways. These benchmarks are most useful to everyday users who are not technically inclined but want to know what kind of system they need to do what they want.
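
To give a concrete sense of the difference between raw data and interpreted results, here is a minimal sketch in Python of the kind of post-processing that turns per-frame timings into more digestible figures. This is not Intel's code or model; the function, the sample data and the 50 ms hitch threshold are illustrative assumptions of my own.

```python
def summarize_frametimes(frame_times_ms, hitch_threshold_ms=50.0):
    """Summarize a list of per-frame render times (in milliseconds)."""
    if not frame_times_ms:
        raise ValueError("no frame data")

    total_seconds = sum(frame_times_ms) / 1000.0
    avg_fps = len(frame_times_ms) / total_seconds

    # "1% low": average frame rate over the slowest 1% of frames, a common
    # way of exposing stutter that a plain average hides.
    slowest = sorted(frame_times_ms, reverse=True)
    count = max(1, len(slowest) // 100)
    one_percent_low_fps = 1000.0 / (sum(slowest[:count]) / count)

    # Frames slower than the (assumed) threshold count as visible hitches.
    hitches = sum(1 for t in frame_times_ms if t > hitch_threshold_ms)

    return {"avg_fps": round(avg_fps, 1),
            "one_percent_low_fps": round(one_percent_low_fps, 1),
            "hitches": hitches}


if __name__ == "__main__":
    # Fabricated sample: mostly ~16.7 ms (60 FPS) frames plus one 80 ms hitch.
    sample = [16.7, 16.5, 17.0, 16.8, 80.0, 16.6, 16.9, 16.7, 16.4, 16.8]
    print(summarize_frametimes(sample))
```

Even this toy example shows why a plain average frame rate is not enough: the single 80 ms hitch barely moves the average, but it shows up clearly in the 1% low figure and the hitch count, which is much closer to what a user actually notices.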

RELIABILITY, REPEATABILITY

Anyone who is seriously considering using one of Intel's tools will inevitably ask: how accurate are they? The short answer is that there is simply no way of knowing without actually using them. Ultimately, the accuracy of these tools will not be known until (and unless) they have been exposed to the wild. A huge range of systems need to be tested under a huge number of circumstances before any serious attempt at answering this question can be made.

However, a guess at the accuracy of the tools can be made by examining two factors:

  1. How closely the tools model the experiences of actual users, i.e. how reliably they report what they intend to report.
  2. How much inter-run variance there is and whether it is possible to generate different results on the same machine, i.e. how repeatable the testing is.

Reliability

The reliability of the tools will be governed by how thorough the research into user experience is, and how well developed the prediction models are. This is the question of how well the tools predict the user experience from the data. If the confidence intervals produced by the G-CAT and the capability table produced by the DH-CAT actually do predict how end users of the tested system would judge the system, the tools can be considered reliable.

We are cautiously optimistic that the tools will produce reliable results. Ultimately, we only have Intel's word about the research behind the tools, but everything that we've been shown has indicated that a lot of care has been taken to conduct the research in a way that produces statistically valid and useful results. In fact, I was asked to refrain from using the term "market research" to describe the research because of the imprecision and sloppiness that the term implies.

If anything, this is the most important thing for Intel to get right, as they will be suspected of biasing the benchmark if the tools do not produce impartial results. Intel is certainly making an effort to remain impartial: Some of the code in the tools will be available for close examination, and much research has been contracted to independent firms. Perhaps the best proof of Intel's good intentions is one of the people behind the tools. Dave Salvator was recently hired by Intel after a stint with the tech web site ExtremeTech. Prominently on his resume is a piece published on ET in 2003 about nVidia’s GeForce 5000 series using some questionable optimizations on the 3DMark03 benchmark. Personally, I find it unlikely that Intel would want to release a biased benchmark with this fellow around.

Repeatability

Repeatability refers to the amount of variance between different test runs. No matter how reliably the tools interpret the data, the results are useless if the data is not accurate in the first place. The key to obtaining accurate data is to make sure that a single system always produces the same data. Unfortunately, the tools' focus on real-world testing will inevitably hurt their repeatability, especially for the G-CAT, which bases its results on actual gaming sessions.

There is no better way to simulate "real game play" than to actually play the game. Unfortunately, this creates a problem. Gaming is a random activity, which means that tests are not repeatable. Every session of testing will produce a slightly different result because the game will not be played the same way every time.

Timedemos and even scripted bots produce "gameplay" that is the same every time. For this reason they are reliable and repeatable, which is very nice for testing purposes. However, these tests are often run with AI and physics disabled and do not accurately reflect the load on the system during actual gaming.

So, should the realism that comes with the randomness of gaming be included in the test, or should it be sacrificed in the name of repeatability? This is a question that cannot be answered without examining the tool in person, so I will not try to answer it here. Instead, I would like to put it in a slightly different way for each tool.

For the G-CAT: Will all tests on a single system have confidence intervals that overlap, no matter what game, what level, and what settings are used? Intel admits that a player death during the testing can skew results (and so recommends re-running such tests), but are there other factors that also affect the end result?
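
To make that question concrete, the following sketch shows one simple way a tester could check whether two sets of runs on the same system produce overlapping intervals. This is not Intel's method, and the run numbers are fabricated; it is only meant to illustrate what "overlapping confidence intervals" means in practice.

```python
from statistics import mean, stdev


def confidence_interval(samples, z=1.96):
    """Approximate 95% confidence interval for the mean of the samples.

    Uses a normal approximation; a proper small-sample treatment would use
    a t-distribution, and Intel's actual model is not public.
    """
    m = mean(samples)
    margin = z * stdev(samples) / len(samples) ** 0.5
    return (m - margin, m + margin)


def intervals_overlap(a, b):
    """True if two (low, high) intervals share any common ground."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    # Fabricated average-FPS results from repeated sessions on one system,
    # e.g. the same game played on two different levels.
    level_one_runs = [58.2, 61.0, 59.5, 60.3, 57.8]
    level_two_runs = [55.9, 62.4, 60.1, 58.7, 61.5]

    ci_one = confidence_interval(level_one_runs)
    ci_two = confidence_interval(level_two_runs)
    print("Level 1 CI: %.1f to %.1f FPS" % ci_one)
    print("Level 2 CI: %.1f to %.1f FPS" % ci_two)
    print("Intervals overlap:", intervals_overlap(ci_one, ci_two))
```

If repeated sessions on the same hardware consistently produce overlapping intervals like these, the randomness of real gameplay is being averaged out; if they do not, the tool's results cannot be trusted to characterize the system.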

For the DH-CAT: Will the capability level (and extra-credit score) stay consistent across different tests on the same platform?


