Intel's new Capabilities Assessment Tools

The Silent Front

Sept 13 - Oct 24, 2005 by Devon Cooke with Mike Chin

The gigahertz race has ended, and Intel is heading in a new direction. Instead of focusing its marketing efforts purely on processor speed, Intel is taking advantage of its presence in all sectors of the tech industry (and, perhaps, of the success of Centrino with its integrated suite of components) to sell its products together as a platform rather than individually.

Mike Chin, Editor of SPCR, recently reported on this phenomenon in his article, Paradigm Shift at Intel: IDF Fall 2005. Among the elements of the paradigm shift he mentioned was a new approach to platform testing that moves away from synthetic benchmarks and timedemos. In their place, Intel recommends using benchmarks based on how a platform will actually be used. Two tools have been announced, one for evaluating gaming performance and one that assesses a system's capabilities for use as a home theater PC.

Both of these tools are of interest in their own right but of limited use to SPCR; we generally focus on noise and heat rather than pure performance. What is relevant to SPCR is the thinking behind these tools: Intel is attempting to "Objectify the Subjective", to put down in formal, verifiable terms how a particular system will behave in terms of the end user's experience. As Mike pointed out, this echoes SPCR's raison d'être: "The essence of our interest is the enhancement of the computing experience... You could call it ergonomics in the broadest sense."

What follows is a preliminary evaluation of these tools, based on some firsthand experience, Intel's press presentation at IDF in August, and a series of exchanges between members of Intel's CAT development team and SPCR.

METHODOLOGY

There is no shortage of benchmarks on the web. Most performance-centered web sites run at least half a dozen benchmarks in a typical hardware review. So, why does Intel think it's worth it to design its own tools? Current benchmarks tend to be of two varieties: Synthetic benchmarks and timedemos. Both of these do a good job of measuring the raw technical capabilities of a piece of hardware. Synthetic benchmarks are excellent for testing things like data throughput and memory latency, while timedemos do a reasonable job of testing complex loads that combine several aspects of performance at once.

The problem with these benchmarks is that they are difficult to interpret: What is the actual, subjective effect of a 500 MB/s increase in memory bandwidth, for example? It may be a 25% increase compared to a 2 GB/s baseline, but what does this mean in terms of actual user experience? Will a user actually be 25% more satisfied? Probably not; as long as a system does what is needed, the amount of extra resources is almost irrelevant. On the other hand, if the 25% increase is enough to suddenly make a new game playable, the user may be 100% more satisfied. After all, he can do something he couldn't before!

The problem is similar with timedemos, which typically report average frames per second. Often the results are 100 FPS or higher — high enough that some frames may never actually be displayed on the monitor, which typically refreshes 60-85 times per second. So, once again, users must assume that a higher average FPS will translate into a better overall experience, but it is almost impossible to tell what frame rate they need for their purposes.

The goal of Intel's new tools is to integrate the hard data from a traditional benchmark with a model that predicts the user's actual experience. To do this, the tools rely heavily on research about how (and when) raw performance affects the usability of a system.

While Intel's tools measure the same kinds of things that other benchmarks do — latency, bandwidth, FPS, etc. — they do not simply report the result. Instead, they make an effort to interpret the results in a way that is meaningful to the end user. Thus, instead of saying, "System X ran the demo at 24 FPS and System Y ran it at 100 FPS", Intel's tools would say "System X would be rated 'Poor' by an average user, but System Y would be rated 'Excellent'".

A significant amount of research has been done to correlate the raw data of traditional benchmarks to actual user experience. What is new is not how measurement is done, but how it is interpreted.
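
To make the distinction concrete, here is a deliberately simple sketch in Python of the difference between reporting a number and interpreting it. The cut-off values and labels are invented for illustration; they are not Intel's, and, as the research described below shows, a real model cannot rely on average FPS alone.

    # Toy illustration only: an invented mapping from a raw measurement to a
    # qualitative rating. Intel's actual models are based on user research,
    # not on arbitrary cut-offs like these.
    def interpret_avg_fps(avg_fps: float) -> str:
        """Translate a raw average-FPS number into a user-facing label."""
        if avg_fps >= 60:
            return "Excellent"
        if avg_fps >= 40:
            return "Good"
        if avg_fps >= 30:
            return "Fair"
        return "Poor"

    # A traditional benchmark stops at the number; a CAT-style tool adds the label.
    for system, fps in [("System X", 24), ("System Y", 100)]:
        print(f"{system}: {fps} FPS -> rated '{interpret_avg_fps(fps)}'")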

Intel wants to call these "Capabilities Assessment Tools" rather than benchmarks. That said, the definition of "benchmark" at dictionary.com fits Intel's tools as well as any other benchmark: "A standard by which something can be measured or judged". Our opinion is that Intel's tools are benchmarks, but we'll call them by Intel's names:

  • Gaming Capabilities Assessment Tool (G-CAT)
  • Digital Home Capabilities Assessment Tool (DH-CAT)

G-CAT

With any benchmark, it is a good idea to ask what is being measured. If you can't answer this question, you are unlikely to understand what the results mean. There are actually two questions here:

  • What hardware components are being measured?
  • What kind of performance is being measured?

These are related, as different things can be measured on different kinds of hardware.

Traditionally, the approach has been to test individual components separately. A benchmark that tests the VGA card attempts to isolate it from the rest of the system, for example by disabling CPU-intensive tasks such as physics and AI. Individual tests are then run for each type of performance: Memory bandwidth, clock speed, latency, rendering time, etc. Finally, the results are compared to similar tests of competing components, and some judgment is made between them.

This is a useful approach when considering a single piece of hardware. Assuming that no other components have the potential to affect the tests (i.e. there are no bottlenecks that come from other components), such tests can help judge between two similar products, at least as far as performance is concerned.

What these tests cannot do is create a link between the performance numbers and their usefulness in real-world applications; they cannot predict whether the gaming experience for a specific game with specific settings will be different when a GeForce 6800GT is used instead of a 6600GT. In fact, such a link is impossible: Games are not played on a VGA card in isolation; the other system components also affect the gaming experience.

This is why it is important to ask what kind of performance is being measured. Most benchmarks measure a specific aspect of performance on a specific piece of hardware. But, most users aren't interested in the raw performance numbers; what they want to know is whether that hardware will make a noticeable difference in their system.

So, why not measure what is noticeable? This is the question that motivated the research behind Intel's Gaming Capabilities Assessment Tool.

The G-CAT is unusual in two respects:

  • Results are based on three-minute sessions of actual gameplay.
  • Results are given not in average frames per second (although that data is available), but on a five-point "user satisfaction" scale.

The G-CAT uses software that already exists: FRAPS, a video capture / benchmark application that collects statistical data about frame rate during a gaming session. Test sessions are three minutes long and use specific game settings, although an experimental mode that allows other game settings to be used is also available. Once the three minutes are up, the statistics from FRAPS are loaded into the G-CAT, which then transforms them into a gaming experience rating based on research about how actual gamers rate their experience.
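
As a rough sketch of the first half of that pipeline, the Python snippet below turns a per-frame timestamp log (such as the frametime log FRAPS can record) into per-frame frame rates, the kind of raw data the G-CAT's models then interpret. The file name and column layout are assumptions for illustration, not the tool's actual format.

    import csv

    def load_frame_times_ms(path: str) -> list[float]:
        """Read cumulative per-frame timestamps in milliseconds, one per frame."""
        with open(path, newline="") as f:
            reader = csv.reader(f)
            next(reader)                          # skip the header row
            return [float(row[1]) for row in reader]

    def instantaneous_fps(timestamps_ms: list[float]) -> list[float]:
        """Convert cumulative timestamps into an instantaneous FPS value per frame."""
        fps = []
        for prev, cur in zip(timestamps_ms, timestamps_ms[1:]):
            delta = cur - prev                    # time this frame was on screen, in ms
            if delta > 0:
                fps.append(1000.0 / delta)
        return fps

    timestamps = load_frame_times_ms("frametimes.csv")   # hypothetical log file
    fps_series = instantaneous_fps(timestamps)
    print(f"{len(fps_series)} frames, average {sum(fps_series) / len(fps_series):.1f} FPS")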

THE RESEARCH BEHIND THE G-CAT

Intel contracted the research firm Gartner to conduct a study on how gamers respond to changes in frame rate. In December 2004, Gartner conducted a large-scale test at the Cyberathlete Professional League event. Approximately 175 people participated in the test, which invited them to play three different games (Doom 3, Half-Life 2, and Unreal Tournament 2004) on five different systems and then rate their gaming experience on a five-point scale. FRAPS was used to collect statistics about the frame rate during each gaming session.

All of the usual statistical safeguards were in place: Users were not told in advance what kind of machine they were using, the sequence in which the games and systems were tested was varied, and the five-point scale was deliberately left vague so that participants were not shepherded to a particular result. Test cases where the user died in the game were also excluded to ensure that they did not affect the results.

Once the test was complete, the users' opinions were plotted against the average frame rate to see if a relationship between the two could be found. Surprisingly, no model could be found that could predict how the users would rate their experience on the basis of average frame rate. To quote Aashish Rao, who presented the tool at a special presentation for the press: "Average FPS cannot be used to predict the gaming experience on a PC". So, another measurement needed to be found that could predict how users would react.


User satisfaction depends on more than just average FPS.

PREDICTING USER RESPONSES

What Intel came up with is still related to frame rate, but it is no longer the average. Instead, two separate mathematical models are used to predict how actual users would react: The Threshold Model and the Bayesian Model.


Intel summarizes the Pros and Cons of the two models in this table.

The Threshold Model takes into account the fact that frame rate is irrelevant as long as it is high enough that variations in it are imperceptible to humans. In each of the three games tested, frame rate had no effect on how the users rated their experience so long as it was above 40-45 FPS. Below this threshold, frame rate did affect the user experience. So, instead of using the average frame rate to predict user experience, the Threshold Model uses the number of frames below the threshold (in a three-minute period). The higher the number of frames below the threshold, the lower the users would rate their experience.
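
Below is a minimal sketch of the Threshold Model idea, assuming per-frame FPS values for a three-minute session (for example, derived as in the earlier snippet). The 40 FPS cut-off reflects the 40-45 FPS range from the study; how Intel maps the resulting count onto the five-point scale is not public, so no rating is computed here.

    THRESHOLD_FPS = 40.0    # lower end of the 40-45 FPS range found in the study

    def frames_below_threshold(fps_series: list[float]) -> int:
        """Count the frames rendered more slowly than the perceptibility threshold."""
        return sum(1 for fps in fps_series if fps < THRESHOLD_FPS)

    def threshold_summary(fps_series: list[float]) -> str:
        below = frames_below_threshold(fps_series)
        share = below / len(fps_series)
        return f"{below} of {len(fps_series)} frames ({share:.1%}) fell below {THRESHOLD_FPS:.0f} FPS"

    # The more frames fall below the threshold, the lower the predicted rating;
    # a session with no such frames should rate as well as one averaging 100+ FPS.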


Average FPS does not predict how users experience the game, especially above 60 FPS. Note that this is how high the average frame rate needs to be; the minimum is around 40-45 FPS.


Graphing the number of FPS against time makes it easy to see when the frame rate drops below the threshold.

This model turned out to correspond well to the data collected in the Gartner study, but it still takes only a single factor into account. In hopes of finding a more accurate model, the Bayesian Model was developed. This model takes the variability of the frame rate into account as well as the speed, but its mathematical complexity makes it difficult to understand exactly how it works. Although it works well for most scenarios, there are still a few cases where its error of prediction can be quite high.

The end result of running the tool is a frame rate graph from FRAPS, plus two "Gaming Experience" ratings, one for each model. In keeping with the statistical methods behind the tool, the "score" is not a single number but a confidence interval. This shows the margin of error and allows different results to be properly compared.
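
For readers unfamiliar with the statistics, here is a minimal Python sketch of how a confidence interval is formed around a mean rating; the per-user ratings are invented, and only the mechanics of the interval are the point. Two systems whose intervals overlap cannot be meaningfully ranked against each other, while a narrower interval indicates a more reliable prediction.

    from math import sqrt
    from statistics import mean, stdev

    def confidence_interval_95(ratings: list[float]) -> tuple[float, float]:
        """Approximate 95% confidence interval for the mean of a sample."""
        m = mean(ratings)
        sem = stdev(ratings) / sqrt(len(ratings))    # standard error of the mean
        return (m - 1.96 * sem, m + 1.96 * sem)

    predicted = [4, 5, 4, 3, 4, 5, 4, 4, 3, 5]       # hypothetical ratings on the 1-5 scale
    low, high = confidence_interval_95(predicted)
    print(f"Gaming Experience: {mean(predicted):.1f} (95% CI: {low:.1f} to {high:.1f})")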


Both models produced statistically identical results in this test, although the confidence interval of the Bayesian Model is much smaller (more reliable).

Neither model is perfect. However, both reflect the actual user's experience, not just the average frame rate, which turns out to be a poor indicator of the gaming experience.


