I think that Devonavar's test is a very good idea in many (most) respects, though flawed in others. First, I would like to speak about the good.
Devonavar is actually underestimating the power of his proposed test. If we restructured the test slightly, we could learn quite a lot not only about "nuance" but about the listeners as well. I propose a test with several musical clips, covering different compression schemes and different types of music, and a large number of listeners (this method is a hog for data).
Rather than getting into the deep math, let me try to tell you why Devonavar's idea is intuitively appealing. Assume there are two unobserved quantities we wish to estimate: our sensitivity to musical information and the degree to which compression schemes affect our enjoyment of music. Assume we have multiple compression schemes and music (and uncompressed music as well, of course) as above, and many listeners. Truly terrible compression schemes would be correctly identified by all, and very good compression (say true lossless) would be correctly identified by none--telling us much about the quality of the compression, though little about the listeners. The middling compression would separate the listeners well, however, with the better listeners progressively clustering towards each other in their answers and the less able clustering as well. From this test, we could determine both the best listeners (the "golden ears") and the best and worst compression (or music), along with the degree to which a given clip separates the listeners into the two groups. This is called a two-parameter item response theory (IRT) model and is the basis for standardized tests such as the SAT, GRE, LSAT, etc., wherein the questions on these tests stand in for the music in our proposed test. When scoring the results, those who administer these tests are able to estimate the difficulty of each question, how sharply it separates stronger from weaker test takers (its discrimination parameter), and the ability of those tested. So cool (... for stats nerds).
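For the stats nerds, here is a minimal sketch of the response curve behind a two-parameter IRT model, assuming the standard logistic form. The names `p_correct` and `simulate_responses` and all of the numbers are made up for illustration; nothing here is fitted to real listening data. Note that the plain two-parameter form has no guessing floor, so it only approximates the lossless case, where listeners would really be answering at chance.

```python
import math
import random


def p_correct(theta, a, b):
    """Two-parameter logistic item response function.

    theta: listener ability (hypothetical "golden ear" score)
    a:     discrimination of the clip (how sharply it separates listeners)
    b:     difficulty of the clip (ability at which P(correct) = 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def simulate_responses(abilities, clips, seed=0):
    """Simulate a 0/1 response matrix: one row per listener, one column per clip."""
    rng = random.Random(seed)
    return [
        [1 if rng.random() < p_correct(theta, a, b) else 0 for (a, b) in clips]
        for theta in abilities
    ]


# Illustrative (discrimination, difficulty) pairs, not estimates:
# an obviously bad codec, a middling one, and a near-transparent one.
clips = [(1.5, -10.0), (1.5, 0.0), (1.5, 10.0)]
abilities = [-2.0, 0.0, 2.0]

responses = simulate_responses(abilities, clips)
for theta, row in zip(abilities, responses):
    print(theta, row)
```

Only the middle clip has a success probability that changes much across these listeners, which is the sense in which middling compression is the informative case.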
But here is the problem: we must assume that the test is truly a valid estimator of our unobserved parameter(s). Here, I am less certain. First, if the system is too poor to reveal the differences in the clips, we would be led astray--for example, I doubt I would be moved emotionally by any music played through a Bose Wave Radio. (BTW, this is yet another reason why demo-ing gear in "Big Box" retail stores such as Best Buy or Circuit City is such a farce: in order to switch easily amongst electronics and speakers, these retailers use very lossy, long lengths of cable and switchers, which are almost sure to obscure the differences between equipment by dumbing it all down. But I digress.)
Second, even if a suitable system were assembled (contrary to the anti-audiophile mafia, this could be done quite easily and for very little money--perhaps as little as $2-3k), would we still be testing the differences in the music? Well, indeed we would be if we stuck to Devonavar's suggestion and all listened on the same system--though given our geographic dispersion this seems unfortunately unlikely.
The problem is easy to see: assume that there is uncertainty regarding our ability to perceive an unobserved but "true" parameter, "sound quality". Setting aside the vagaries of compression for the moment, let us study the uncertainty surrounding the listeners (call this var_l for variance-listener) and that of the stereo (var_s). If we do not account for the stereo, then all we observe is the total amount of error, and we would incorrectly attribute all of it to the listeners: listeners' uncertainty = (var_s + var_l), assuming the two sources of error are uncorrelated (corr(s, l) = 0), which may or may not be correct. This would lead us to a possibly biased conclusion, and one that certainly overestimates the variability in "hearing". By terribly abusing some statistical ideas and terms (IRT does not work in quite the way I describe...), you could think of this as (incorrectly) expanding our confidence intervals and leading to Type II errors (incorrect acceptance of the null). This could be overcome by using several listeners at each of several stereos, each playing several clips of music, and then estimating a three parameter IRT model... though a) I have never done that and b) I think that it would be difficult to get quality results due to inefficiency (but I really am not sure about that... see point a).
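A quick simulation shows the variance bookkeeping above. It is only a sketch: it assumes the listener and stereo errors are independent, normally distributed, and simply add, and the standard deviations (1.0 and 0.5) and the function name `simulate_errors` are invented for illustration.

```python
import random
import statistics


def simulate_errors(n, sd_listener, sd_stereo, seed=0):
    """Draw independent listener and stereo errors and add them together."""
    rng = random.Random(seed)
    listener = [rng.gauss(0.0, sd_listener) for _ in range(n)]
    stereo = [rng.gauss(0.0, sd_stereo) for _ in range(n)]
    # What we actually observe if every listener uses a different stereo.
    observed = [l + s for l, s in zip(listener, stereo)]
    return listener, stereo, observed


listener, stereo, observed = simulate_errors(20000, sd_listener=1.0, sd_stereo=0.5)

var_l = statistics.pvariance(listener)    # true listener variance
var_s = statistics.pvariance(stereo)      # stereo variance we failed to model
var_obs = statistics.pvariance(observed)  # what we would wrongly call "hearing"

print("var_l:", round(var_l, 3))
print("var_s:", round(var_s, 3))
print("var_l + var_s:", round(var_l + var_s, 3))
print("observed:", round(var_obs, 3))
```

With uncorrelated errors the observed variance lands close to var_l + var_s, so treating it as pure listener variance overstates how much the listeners differ.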
Finally (or, if you skipped that over-long statistical detour, next), I take offense at the way that nuance and emotion are used here. In fact, the difference between equipment can be quite profound, not limited in the way that Devonavar conceives of it. For example, when switching between my roommate’s $100.00 Samsung DVD player as a transport and my EMM Labs CDSD transport, I was struck 1) by how incredibly alike they sounded (like 99.99 percent), and how alike I am sure they would measure (though I did not measure them), but then again 2) by how profound these very small differences were on an emotional level (to use Devonavar's term), or in my ability to connect with the music. Much the same might be said about cable elevators or my Arcici equipment rack: small differences, yet profound ones nonetheless. Of course I am sure these would fail to impress yeha, as he himself has said he is far more concerned with how a piece of equipment measures than how it sounds, but for lovers of music, I do not think that Devonavar's idea of small and nearly-but-not-quite unimportant differences really captures this. Whether these differences justify the $7000.00 difference between the EMM Labs and the Samsung is a question that each person has to answer for himself or herself, but to write this off as only a minor difference, a splitting of hairs, or a matter of nuance, is to confuse a difference in degree (which is, again, slight) with a difference in kind (which is profound).