Sept 13~Oct 24, 2005 by Devon Cooke with Mike Chin
The gigahertz race has ended, and Intel is heading in a new direction. Instead
of focusing its marketing efforts purely on processor speed, Intel is taking
advantage of its presence in all sectors of the tech industry (and perhaps
the success of Centrino with its integrated suite of components) to sell its products
together as a platform rather than individually.
Mike Chin, Editor of SPCR, recently reported on this phenomenon in his article,
Paradigm Shift at Intel:
IDF Fall 2005. Among the elements of the paradigm shift he mentioned
was a new approach to platform testing that moves away from synthetic benchmarks
and timedemos. In their place, Intel recommends using benchmarks based on how
a platform will actually be used. Two tools have been announced,
one for evaluating gaming performance and one that assesses a system's capabilities
for use as a home theater PC.
Both of these tools are of interest in their own right but of limited use to
SPCR; we generally focus on noise and heat rather than pure performance. What
is relevant to SPCR is the thinking behind these tools: Intel is attempting
to "Objectify the Subjective", to put down in formal, verifiable terms
how a particular system will behave in terms of the end user's experience. As
Mike pointed out, this echoes SPCR's raison d'être: "The essence
of our interest is the enhancement of the computing experience... You could
call it ergonomics in the broadest sense."
What follows is a preliminary
evaluation of these tools, based on some firsthand experience, Intel's
press presentation about these tools at IDF in August, and a series of exchanges between Intel's CAT development team members and SPCR.
METHODOLOGY
There is no shortage of benchmarks on the web. Most performance-centered web
sites run at least half a dozen benchmarks in a typical hardware review. So,
why does Intel think it's worth it to design its own tools? Current benchmarks
tend to be of two varieties: Synthetic benchmarks and timedemos. Both of these
do a good job of testing and compiling the raw technical capabilities of a piece
of hardware. Synthetic benchmarks are excellent for testing things like data
throughput and memory latency, while timedemos do a reasonable job of testing
complex loads that combine the various performance aspects of a piece of hardware.
The problem with these benchmarks is that they are difficult to interpret:
What is the actual, subjective effect of a 500 Mbps increase in memory bandwidth,
for example? It may be a 25% increase compared to a 2 Gbps baseline, but what
does this mean in terms of actual user experience? Will a user actually be 25%
more satisfied with it? Probably not; as long as a system does
what is needed, the amount of extra resources is almost irrelevant.
On the other hand, if the 25% increase is enough to suddenly make a new game
playable, the user may be 100% more satisfied. After all, he can do something
he couldn't before!
The problem is similar with timedemos, which typically report average frames
per second. Often the results are 100 FPS or higher — high enough that some
frames may never actually be displayed on the monitor, which typically refreshes
60-85 times per second. So, once again, users must assume that a higher average
FPS will translate into a better overall experience, but it is almost impossible
to tell what frame rate they need for their purposes.
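The refresh-rate point can be illustrated with a minimal sketch. It assumes a 60 Hz monitor (the low end of the 60-85 range mentioned above) and treats the displayed frame rate as simply capped at the refresh rate, which is a simplification. The function name and the sample numbers are hypothetical, chosen only for illustration.

```python
REFRESH_HZ = 60  # assumed monitor refresh rate


def displayed_fps(rendered_fps, refresh_hz=REFRESH_HZ):
    """Cap each per-second frame rate at what the monitor can show."""
    return [min(fps, refresh_hz) for fps in rendered_fps]


# Hypothetical timedemo results: per-second frame rates for two systems.
run_a = [100] * 5
run_b = [150] * 5

print(displayed_fps(run_a))
print(displayed_fps(run_b))
```

Under these assumptions, both runs put the same number of frames on screen each second, even though the second reports a 50% higher average.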
The goal of Intel's new tools is to integrate the hard data from a traditional
benchmark with a model that predicts the user's actual experience. To do this,
the tools rely heavily on research about how (and when) raw performance affects
the usability of a system.
While Intel's tools measure the same kinds of things that
other benchmarks do — latency, bandwidth, FPS, etc. — they do not
simply report the result. Instead, they make an effort to
interpret the results in a way that is meaningful to the end user. Thus,
instead of saying, "System X ran the demo at 24 FPS and System Y ran it
at 100 FPS", Intel's tools would say "System X would be rated 'Poor'
by an average user, but System Y would be rated 'Excellent'".
A significant amount of
research has been done to correlate the raw data of traditional
benchmarks to actual user experience. What is
new is not how measurement is done, but how it is interpreted.
Intel wants to call these "Capabilities
Assessment Tools" rather than benchmarks. That said, the
definition of "benchmark" at dictionary.com fits Intel's tools
as well as any other benchmark: "A standard by which something can be measured
or judged". Our opinion is that Intel's tools are benchmarks, but we'll call them by Intel's names:
- Gaming Capabilities Assessment Tool (G-CAT)
- Digital Home Capabilities Assessment Tool (DH-CAT)
G-CAT
With any benchmark, it is a good idea to ask what is being measured. If you
can't answer this question, then you are unlikely to understand what the results
mean. There are actually two questions here:
- What hardware components are being measured?
- What kind of performance is being measured?
These are related, as different things can be measured on different
kinds of hardware.
Traditionally, the approach has been to test individual components separately.
A benchmark that tests the VGA card attempts to isolate it from
the rest of the system, for example, by disabling CPU-intensive tasks, such
as physics and AI. Then, individual tests are run for each type of performance:
Memory bandwidth, clock speed, latency, rendering time, etc. Then, the results
are compared to similar tests from competing components, and some judgment
is made between them.
This is a useful approach when considering a single piece of hardware. Assuming
that no other components have the potential to affect the tests (i.e. there
are no bottlenecks that come from other components), such tests can help judge
between two similar products, at least as far as performance is concerned.
What these tests cannot do is create a link between the performance numbers
and their usefulness in real-world applications; they cannot predict whether the
gaming experience for a specific game with specific settings will be different
when a GeForce 6800GT is used instead of a 6600GT. In fact, such a link is impossible:
Games are not played on a VGA card in isolation; the other system components
also affect the gaming experience.
This is why it is important to ask what kind of performance is being measured.
Most benchmarks measure a specific aspect of performance on a specific piece
of hardware. But, most users aren't interested in the raw performance numbers;
what they want to know is whether that hardware will make a noticeable difference
in their system.
So, why not measure what is noticeable? This is the question that motivated
the research behind Intel's Gaming Capabilities Assessment Tool.
The G-CAT is unusual in two respects:
- Results are based on three-minute sessions of actual gameplay.
- Results are given not in average frames per second (although the data is
available), but in a five-point "user satisfaction" scale.
The G-CAT uses software that already exists: FRAPS,
a video capture / benchmark application that collects statistical data about
frame rate during a gaming session. Test sessions are three minutes long and
use specific game settings, although an experimental mode that allows other
game settings to be used is also available. Once the three minutes are up, the
statistics from FRAPS are loaded into the G-CAT, which then transforms them
into a gaming experience rating based on research about how actual gamers rate
their experience.
THE RESEARCH BEHIND THE G-CAT
Intel contracted the
research firm Gartner to conduct a study on how gamers respond to changes
in frame rate. In December 2004 Gartner conducted a large-scale test at the Cyberathlete
Professional League event. Approximately 175 people participated in the test, which
invited the participants to play three different games, Doom 3, Half-Life
2, and Unreal Tournament 2004, on five different systems and then
rate their gaming experience on a five point scale. FRAPS was used to collect
statistics about the frame rate during each gaming session.
All of the usual statistical safeguards were in place: Users were not told
in advance what kind of machine they were using, the sequence in which the games
and systems were tested was varied, and the five-point scale was deliberately
left vague so that participants were not shepherded to a particular result.
Test cases where the user died in the game were also excluded to
ensure that they did not affect the results.
Once the test was complete, the users' opinions were plotted against the average
frame rate to see if a relationship between the two could be found. Surprisingly,
no model could be found that could predict how the users would rate their experience
on the basis of average frame rate. To quote Aashish Rao, who presented the
tool at a special presentation for the press: "Average FPS cannot be used to
predict the gaming experience on a PC". So, another measurement needed to be
found that could predict how users would react.

User satisfaction depends on more than just average FPS.
PREDICTING USER RESPONSES
What Intel came up with is still related to frame rate, but it is no longer
the average. Instead, two separate mathematical models are used to predict
how actual users would react: The Threshold Model and the Bayesian
Model.

Intel summarizes the Pros and Cons of the two models in this table.
The Threshold Model takes into account the fact that differences in frame rate are irrelevant
as long as they are imperceptible to humans. In each of the three games tested,
frame rate had no effect on how the users rated their experience so long as
it was above 40-45 fps. Below this threshold, frame rate did
affect the user experience. So, instead of using the average
frame rate to predict user experience, the Threshold Model uses the number
of frames below the threshold (in a three minute period). The higher the number
of frames below the threshold, the lower the users would rate their experience.

Average FPS does not predict how users experience the game, especially
above 60 FPS.
Note that this is how high the average frame rate needs to be —
the minimum is around 40-45 FPS.

Graphing the number of FPS against time makes it easy to see when the
frame rate drops below the threshold.
This model turned out to correspond well to the data collected in the Gartner
study, but it still takes only a single factor into account. In hopes of finding
a more accurate model, the Bayesian Model was developed. This model takes
the variability of the frame rate into account as well as the speed, but its
mathematical complexity makes it difficult to understand exactly how it works.
Although it works well for most scenarios, there are still a few cases where
its error of prediction can be quite high.
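The Bayesian Model itself is not spelled out here, so the following is not a reconstruction of it. It is only a sketch of why variability carries information that the average does not: two hypothetical sessions with the same mean frame rate can differ widely in how much the frame rate swings. The sample values and the `summarize` helper are invented for illustration.

```python
from statistics import mean, pstdev


def summarize(fps_samples):
    """Return the average frame rate and its spread (population std. dev.)."""
    return {"mean": mean(fps_samples), "stdev": pstdev(fps_samples)}


# Hypothetical per-second frame rates for two sessions.
smooth = [58, 60, 62, 60, 60]
choppy = [100, 20, 100, 20, 60]

print("smooth:", summarize(smooth))
print("choppy:", summarize(choppy))
```

Both sessions average 60 FPS, but the second has a far larger standard deviation, which is the kind of difference a model based on the average alone cannot see.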
The end result of running the tool is a frame rate graph from FRAPS, plus two
"Gaming Experience" ratings, one for each model. In keeping with the
statistical methods behind the tool, the "score" is not a single number but a confidence
interval. This shows the margin of error and allows different results
to be properly compared.
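One rough way to compare two such results is to check whether their confidence intervals overlap. The sketch below assumes each rating is reported as a low and high bound on the five-point scale. The `Interval` type, the `overlaps` helper, and the numbers are hypothetical, and an overlap check is a heuristic rather than a formal significance test.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A confidence interval on the five-point rating scale."""
    low: float
    high: float


def overlaps(a: Interval, b: Interval) -> bool:
    """True when the two intervals share at least one point."""
    return a.low <= b.high and b.low <= a.high


# Hypothetical ratings for illustration only.
wide = Interval(3.2, 4.1)
narrow = Interval(3.5, 3.9)
poor = Interval(1.5, 2.0)

print(overlaps(wide, narrow))  # True: the results cannot be told apart
print(overlaps(poor, narrow))  # False: the results clearly differ
```

A narrower interval makes it easier to separate two systems, since there is less room for the ranges to overlap.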

Both models produced statistically identical results in this test, although
the confidence interval of the Bayesian Model is much smaller (more reliable).
Neither model is perfect. However, both reflect the user's actual
experience, not just the average frame rate, which turns out to be a poor
indicator of the gaming experience.