Subjective Evaluation of the MPEG Audio Codec
Mark Neidengard
June, 1996
California Institute of Technology
Introduction
The MPEG audio/video codecs are a set of compression and decompression routines for reducing the bandwidth requirements of multimedia communication. They do this by discarding information that the Moving Picture Experts Group has deemed imperceptible to humans. Offering spatial savings far greater than conventional "lossless" techniques, MPEG will form the basis of the High Definition Television (HDTV) and Digital Versatile Disc (DVD) systems. As an audio/video consumer, I am likely to be affected by these systems; for Projects in Music and Science I elected to scrutinize the audio segment of the MPEG standard to obtain a preliminary grasp of how "lossy" MPEG audio compression would affect my listening enjoyment.
Conventional lossless schemes do not offer the user a choice of compression ratio: their effectiveness is determined solely by the type of data being compressed. On raw 16-bit audio data, the popular lossless "gzip" package achieves only around a 10% reduction in size. By contrast, lossy MPEG audio permits the user to specify the compression ratio of its output, in excess of 95% if desired. In general, the only way to increase the compression ratio is to discard more audio data. I therefore hypothesized that audio should sound "worse" the more it is compressed. To test this theory, I needed a protocol that permits listeners to compare audio samples at different compression settings, as well as a definition of what "better" and "worse" would mean for my purposes.
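To illustrate the lossless baseline, the following Python sketch measures the savings gzip achieves on a raw PCM file. It is illustrative only: the file name is hypothetical, and the standard gzip module stands in for the command-line package mentioned above.

    import gzip

    # Hypothetical raw audio file: 16-bit linear PCM, as used in this experiment.
    with open("sample.raw", "rb") as f:
        raw = f.read()

    compressed = gzip.compress(raw, compresslevel=9)

    # "Savings" is the fractional size reduction; for raw CD audio this
    # typically lands near the roughly 10% figure quoted above.
    savings = 1.0 - len(compressed) / len(raw)
    print(f"gzip savings: {savings:.1%}")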
Preparation of Test Material

The MPEG standard currently defines two Levels (I and II), each of which contains several different "layers" of compression/decompression. The higher layers offer improved fidelity at the cost of increased computational requirements. Level I specifies three layers, the third of which was dropped from Level II because cheap processing hardware was not available. For the purposes of my experiment, the lower two layers of Levels I and II function identically; I will refer only to layer number hereafter. The standard also specifies several optional mechanisms for further reducing spatial requirements, including "joint stereo", which combines sonic information above 2 kHz into a single channel.
The encoder used was the MPEG/audio "working package" from the Independent Group. The decoder used was Tobias Bading's MAPlay 1.1. Both are freely available over the Internet, and I verified that both passed a battery of conformance tests, also available on the Internet. For the experiment, I used layer 2 with all optional features disabled: the highest-fidelity algorithm currently at my disposal, and the one likely to be used to encode HDTV and DVD material. To prepare the test material, data was transferred digitally from CD to hard disk and manipulated on a Hewlett-Packard 9000/735 workstation. The originals were then compressed at the various compression ratios and decompressed again to produce the test samples. Both original and processed samples were then transferred digitally to audio DAT using a Silicon Graphics Indigo 2. All audio data was stored as 16-bit linear PCM at 44,100 samples per second.
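For reference, the compression ratios used in the experiment translate into approximate output bitrates under the stored format described above. The Python sketch below is simple arithmetic, not part of the experimental apparatus; note that MPEG audio encoders select from a fixed table of allowed bitrates, so in practice the nearest permitted setting would be used.

    # Uncompressed bitrate: 44,100 samples/s x 16 bits x 2 channels.
    SAMPLE_RATE = 44100
    BITS_PER_SAMPLE = 16
    CHANNELS = 2

    raw_kbps = SAMPLE_RATE * BITS_PER_SAMPLE * CHANNELS / 1000  # 1411.2 kbps

    # Approximate target bitrates for the ratios used in the experiment.
    for ratio in (4, 6, 12):
        print(f"{ratio}:1 compression -> about {raw_kbps / ratio:.0f} kbps")

At these settings, 4:1 corresponds to roughly 353 kbps, 6:1 to roughly 235 kbps, and 12:1 to roughly 118 kbps.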
Experimental Protocol
The protocol I settled on was a type of A/B test. Subjects listen to three different audio selections, chosen to emphasize areas of the sound I judged would suffer under lossy compression. Each selection is presented at three different compression ratios (4:1, 6:1, and 12:1). For each selection, the subject is presented with an alternation between original and compressed versions, where the ordering of compressed versions varies from selection to selection. Also, the 4:1 version is introduced into the sequence an extra time as a check on the stability of the subject's answers.
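A minimal sketch of how one selection's presentation sequence might be generated follows. The function and the seeded shuffle are my own illustration; the actual orderings were fixed in advance when the test DAT was assembled.

    import random

    RATIOS = ["4:1", "6:1", "12:1"]

    def presentation_order(seed):
        """Build one selection's alternation of original and compressed versions.

        The three ratios appear in a per-selection order, with an extra 4:1
        added as a stability check; the original precedes each compressed
        version so the subject always hears the reference for comparison.
        """
        rng = random.Random(seed)
        versions = RATIOS + ["4:1"]   # duplicate 4:1 as the stability check
        rng.shuffle(versions)
        sequence = []
        for version in versions:
            sequence.append("original")
            sequence.append(version)
        return sequence

    # Example: a different seed per selection varies the ordering.
    print(presentation_order(seed=1))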
I originally ran my experiment with five audio selections, but the first round of subjects reported that the test was too fatiguing. Upon review of their data, I settled upon the following selections: "Blumlein Microphone Demonstration" (from the Caltech Music Lab Microphone Demo album), a minute-long section of "Blood Roses" (from Tori Amos's album Boys for Pele), and a thirty-second-long section of Mussorgsky's "Pictures at an Exhibition" (from the album performed by James Boyk). The Mussorgsky is played both at the beginning and the end of the test with different sample orderings as an additional check on the stability of the subjects' responses.
Through discussion in class, I decided to focus my experiment on measuring the accuracy of the MPEG audio codecs. I originally intended to ask listeners to rank the compressed samples in order of preference, but realized that preference is meaningless for characterizing the effect of MPEG audio compression: changes wrought by the compression might sound "better" to a particular listener, even though by definition any change in the sound is a degradation. The metric I settled upon instead was "perceived accuracy", measured on a scale from 1 (highly unlike the original) to 7 (indistinguishable from the original). I tabulated the number of subjects assigning each rank to each compressed sample to discover how perceived accuracy varies with compression ratio.
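A sketch of this tabulation, using made-up rankings purely to show the bookkeeping (the real data appear in the Results section):

    from collections import Counter

    # Made-up example rankings: compression ratio -> one 1-7 rank per subject.
    rankings = {
        "4:1":  [7, 6, 7, 5, 6],
        "6:1":  [6, 6, 5, 5, 7],
        "12:1": [4, 6, 3, 5, 5],
    }

    # Count how many subjects assigned each rank (1-7) to each sample.
    for ratio, ranks in rankings.items():
        counts = Counter(ranks)
        print(ratio, [counts.get(rank, 0) for rank in range(1, 8)])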
To measure this perceived accuracy, I identified five audio characteristics which should show variation as audio quality degrades: imaging accuracy, ambiance, intelligibility, percussive sharpness, and overall quality. Each was ranked separately on the experimental questionnaire for each compressed version of each different selection (5 criteria x 4 selections x 4 compressed versions = 80 rankings in all). After review of the preliminary data, I decided that imaging accuracy and intelligibility were redundant and possibly ambiguous; they have therefore been discarded (reducing the number of rankings to 3 criteria x 4 selections x 4 compressed versions = 48).
Subjects were outfitted with AKG 240 headphones, connected to the Caltech Music Lab headphone amplifier (designed and constructed by David Barksdale as a previous class project). The amplifier was driven by the analog line output of the Apogee D/A converter, which in turn received a coaxial digital feed from the Sony DAT deck.
Results
I collected data from a total of thirteen subjects, all drawn from the pool of undergraduates at the California Institute of Technology.
To decide how much validity to attach to the results, I will first consider the "stability" checks implemented in the experiment. For each piece of music, subjects judged three criteria for each of four compressed versions. For each subject, I calculated the standard deviation of the rankings for each of the 3 criteria x 4 selections = 12 criterion/selection pairs. I then calculated the difference, in standard deviations, between the rankings assigned to the two 4:1 samples for each pair, a quantity I will call "error", and averaged it over all 12 pairs for each subject. Finally, I calculated the average error over the three criteria for each of the two sets of Mussorgsky samples, and took the difference as a measure of the "settling in" of the subjects.
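The error computation can be summarized in a short Python sketch. The data layout is an assumption made for illustration, and the sketch uses the sample standard deviation; the report does not specify which form was used.

    import statistics

    def average_error(ratings):
        """Average "error" for one subject.

        `ratings` maps (selection, criterion) -> the four rankings for that
        pair, ordered [4:1, 6:1, 12:1, duplicate 4:1] (layout assumed here).
        The error for one pair is the gap between the two 4:1 rankings,
        expressed in units of that pair's own standard deviation.
        """
        errors = []
        for ranks in ratings.values():
            sd = statistics.stdev(ranks)
            if sd == 0:
                errors.append(0.0)  # identical rankings across versions: no error
            else:
                errors.append(abs(ranks[0] - ranks[3]) / sd)
        return statistics.mean(errors)

The settling factor is then the average error over the three criteria of the first Mussorgsky set minus that of the second.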
In the average error metric, lower numbers indicate more stable rankings by the subject, with zero indicating completely identical rankings on all criteria for all the sets of identical samples. For the settling factor metric, greater magnitudes indicate greater differences between the first and second Mussorgsky sets; positive numbers indicate a decrease of average error between the first and second sets, negative numbers indicate the opposite, and zero indicates no difference in error, on average, between the sets. Numbers are presented on a per-subject basis:

[Chart: average error and settling factor for each of the thirteen subjects]
As indicated in the chart above, four of the subjects were within one of their own standard deviations, on average, in their rankings of the identical sample sets. Most of the remaining subjects clustered around 1.3 standard deviations of average error, and two subjects exceeded 1.5 standard deviations of error on average. All subjects showed some change between the first and second Mussorgsky sets, in widely varying degrees (nearly two standard deviations in one case). It is distressing to note that three of the subjects actually increased in average error on the second Mussorgsky set: this may indicate fatigue. It is also distressing that only three of the thirteen subjects displayed both average error and settling factor within one standard deviation; this will be discussed below.
[Graphs: distribution of perceived-accuracy rankings at each compression ratio, all thirteen subjects]

As can be seen, the data has very little of the character suggested by the hypothesis. The subjects tended to identify the low compression with high quality, but did not consistently note the converse. In fact, the distribution of responses becomes flatter on the whole as the compression ratio increases, in some instances indicating that the higher-compression samples sound more accurate than the lower-compression ones. Because of the dubious stability of some of the subjects' responses, I decided to repeat the analysis using only the data from the subjects whose average error and settling factor were both within one of their own standard deviations. The results follow:

[Graphs: distribution of perceived-accuracy rankings at each compression ratio, stable subjects only]
These graphs show more of the expected character, but lack a sufficient sample base to have much persuasive force. Data generated with an intermediate number of responses falls, unsurprisingly, between the two sets of graphs.
Conclusions
Clearly, this experiment has failed to do more than hint at the expected decline in perceived accuracy with increasing compression ratio. Logically, it follows that either the effect does not exist, my experiment was not designed in a way that would reveal it, my experiment was executed improperly, or my pool of subjects was unable to perceive it. Having heard the effect myself and having solid information-theoretic reasons for expecting it, I am highly disinclined to accept the first possibility. As to the second possibility, my experiment contains the essential element of comparing an original against copies made with different compression ratios. It is possible that further study could suggest a group of musical selections which would more clearly highlight (or discredit) the effect. It is also possible that the length of the experiment, approximately twenty-four minutes, is simply too great for experimental subjects to sit through; the negative settling factors may lend credence to this idea. Another possibility is that subjects need more repetition of a musical selection to eliminate initial hesitation in scoring: this would involve prepending some number of "trial run" compressed versions to the actual experimental regime. A fourth possibility is that the differences between 4:1 and 12:1 compression are too subtle to affect the responses of untrained listeners, and that a wider compression range would elicit the hypothesized response.

To reduce the possibility of experimental error, every effort was made to limit outside influences on the test. The experimental venue, the Caltech Music Lab, was kept silent to within the (low) limits of plumbing noise throughout the occasions the experiment was held. Subjects used headphones to further limit disturbances from outside noise, and the entire test was played without interruption from DAT to remove the need for direct operator intervention. I have personally verified that the material on the DAT contains audible degradation of perceived accuracy with increasing compression ratio, although my assessment is still awaiting independent expert review. One remaining possibility is that the subjects were not made to understand specifically what I was asking them to report: comments made by some subjects after the experiment lead me to suspect that they were implicitly judging "quality" rather than "accuracy", despite the explanation on the questionnaire. Perhaps further clarification of this concept is necessary to ensure the subjects make the judgments called for by the experiment.
The possibility that the subjects were unable to perceive the differences is worthy of serious consideration. At the outset of designing my experiment, I decided to use "untrained" listeners in an attempt to give my results some validity for the general public. It has been argued in many venues that the human animal is surprisingly good at listening to sound, even if he may not be trained in how to articulate his impressions. Unfortunately, the pool of experimental subjects, none of whom have any formal ear training, did not present any sort of concerted response to the experimental stimulus. It may simply be that the subjects were not attuned to the criteria I was asking about, or that they were unable to articulate their perceptions on the questionnaire provided. To be sure, most untrained listeners do not normally judge sound by "accuracy" against some standard, but by a much vaguer personal preference scale (how "good" it sounds to them). It is not inconceivable that MPEG audio compression creates a "preferable" distortion in the sound; it may be difficult to get an untrained listener to ignore this "pleasant" distortion and concentrate instead on accuracy.
Several directions for further inquiry exist. One option is simply to continue the experiment with more subjects: it is not inconceivable that the unsatisfactory clarity of the results stems merely from sampling bias that would be corrected by aggregating more data. Another avenue would be to repeat the experiment with trained listeners to ascertain whether at least some percentage of the population is sensitive to the changes wrought by the compression. If a strong positive indication of quality loss could be detected through either method, the experiment could be repeated with the reputedly higher-fidelity layer 3 to see if the effect was lessened.
Despite the disappointing lack of proof of the hypothesized decline in accuracy with compression ratio, the reactions of the experimental subjects have served as an important validation of the MPEG effort as a whole. During discussions after the experiment, the subjects reported that while some of the compressed versions sounded "better" than others, all of them sounded "pretty darn good", and all were quite acceptable for use in a broadcast environment. Moreover, several trained listeners commented during the experimental design that, while the compression had its problems, it was not "unreasonable" in its representation of audio material. It seems that, whether or not MPEG audio compression is capable of delivering its spatial savings while retaining high fidelity, its results are at least plausible to the casual listener. And given the immense compression advantage MPEG audio holds over alternative techniques, some loss of fidelity may be an acceptable compromise for many consumer-level applications.
This report was prepared for Caltech's EE/MU 107 "Projects in Music and Science" class, taught by Jim Boyk.