I have spent a lot of time focusing on the Red Book standard of CD Audio, which calls for 16-bit data at a 44.1kHz sample rate. In doing so, I imagine I have given implicit support to the commonly-held notion that this standard is perfectly adequate for use in high-end audio applications. After all, this is what Sony and Philips promised us back in 1982: “Pure, Perfect Sound – Forever”. While the ‘forever’ part referred to the actual CD disc itself, the ‘pure, perfect sound’ part was clearly attributed to their new digital audio standard.
But if this really is ‘pure and perfect’ as the promise grandly claims, why, then, do we have all these Hi-Rez audio formats floating around? Why do we need 24-bit formats if 16-bit is pure and perfect? Why all these high sample rates if what’s-his-name’s theory says 44.1kHz is plenty high enough? Why bother with DSD and its phenomenally complicated underpinnings? These are good questions, and they need to be addressed.
The argument for the superiority of 24-bit over 16-bit is perhaps the least contentious and best understood aspect. Simply put, every extra bit of bit depth allows us to push the noise floor down by an additional 6dB. So the 8 extra bits offered by a 24-bit format translate to a whopping 48dB reduction in the noise floor. Very roughly speaking, the very best analog electronics, microphones, and magnetic tape all have a noise floor which is of the order of 20dB above that of 24-bit audio, and 20dB below that of 16-bit audio. In simple terms, this means that 16-bit audio cannot come close to capturing the full dynamic range of the best analog signals you might ask it to encode, but 24-bit audio has it handily covered with plenty of margin in hand.
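For anyone who wants to see the arithmetic spelled out, the numbers fall straight out of the textbook rule of thumb for an ideal quantizer. A minimal Python sketch (the 1.76dB constant is a refinement that the simple 6dB-per-bit rule ignores):

```python
def dynamic_range_db(bits):
    """Theoretical dynamic range of an ideal quantizer: roughly 6.02 dB per
    bit, plus a small constant that the 6 dB rule of thumb leaves out."""
    return 6.02 * bits + 1.76

print(dynamic_range_db(16))                         # ~98 dB
print(dynamic_range_db(24))                         # ~146 dB
print(dynamic_range_db(24) - dynamic_range_db(16))  # ~48 dB, i.e. 8 bits x 6 dB
```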
So, in areas of noise floor and dynamic range, going from 16-bit to 24-bit buys us a great deal of potential benefit at a cost of only 50% in additional file size. I say ‘potential’ benefit, because you do require seriously high quality source material in order to take advantage of those benefits. But if you’re reading this, the chances are you have more than just a passing interest in ‘seriously high quality source material’!
Moving on to the sample rate. There are two immediately obvious benefits to increasing the sample rate. First of all, there is actually a further reduction in the noise floor to be had. Without going into all the ifs and buts, this amounts to an additional 3dB of noise floor reduction for every doubling of the sample rate (I’ll sketch out where that figure comes from below). However, since moving to 24-bit sampling has bought us all the noise floor reduction we can handle, this aspect of increased sample rate is not really very exciting. The second obvious benefit is more nuanced: by increasing the sampling rate we also increase the range of frequencies we can capture. This bears examination in more detail.
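As promised, here is where the 3dB-per-doubling figure comes from: the total quantization noise power is essentially fixed, but it is spread evenly from DC up to the Nyquist frequency, so the share that lands inside the audio band shrinks as the sample rate rises. A minimal sketch, assuming plain un-shaped (white) quantization noise and no other changes:

```python
import math

def inband_noise_reduction_db(fs_old, fs_new):
    """With white quantization noise spread evenly from 0 Hz to fs/2, the
    fraction falling inside a fixed audio band scales as 1/fs, so raising
    fs lowers the in-band noise by 10*log10(fs_new/fs_old) dB."""
    return 10 * math.log10(fs_new / fs_old)

print(inband_noise_reduction_db(44_100, 88_200))    # ~3.0 dB per doubling
print(inband_noise_reduction_db(44_100, 176_400))   # ~6.0 dB for two doublings
```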
Suppose we double the sample rate from 44.1kHz to 88.2kHz, and thereby double the Nyquist frequency from 22.05kHz to 44.1kHz. This means we can now capture a range of audio frequencies that extend quite a way above the nominal upper limit of audibility. This is where things start to get contentious. The main issue is that the vast majority of scientific study tends to support the general notion that as human beings we don’t hear any frequencies above 20kHz. Furthermore, that only applies to what Vinny Gambini memorably referred to as “yoots”. Once you get to my age, you’ll be lucky if you can still make it to 15kHz. However, the topic of frequency audibility does throw up some occasional odd results.
For example, in a scientific paper from 2000, Tsutomu Oohashi et al. reported that, by measuring a number of subjects’ EEG and PET scan responses, they were able to show that when those subjects were played Gamelan music (known to be rich in ultrasonic harmonics) their brains did in fact respond to frequencies above 22kHz. The problem is that this paper’s results – referred to as the ‘Hypersonic Effect’ – have not been satisfactorily replicated, and in addition some valid technical criticisms have been made of its methodology. Even the authors themselves showed no apparent interest in following up on their work.
Overall, this, like other arguments of its ilk, doesn’t appear to provide any plausible basis upon which to build a case in favor of high sample rate PCM formats.
Various other studies have looked at ways of assessing whether or not the human ear/brain is capable of resolving sonic effects due to relative timing factors – such as the discernibility of time misalignment between signals from spatially displaced speakers (Milind Kunchur, 2007) – and a reasonably consistent picture emerges that we are sensitive to timing differences in the range of 1-10μs. This has been argued to imply that our brains are able to process audio signals with a bandwidth approaching 100kHz (roughly the reciprocal of a 10μs timing resolution). This is interesting because it suggests (in a roundabout kind of way) that we may be able to perceive certain effects of hypersonic harmonics of audible frequencies, without necessarily being able to detect those harmonics in isolation (i.e. Oohashi’s idea redux).
Any treatment of timing-related issues opens the door to the discussion of phase, since phase and timing are one and the same thing. A phase error applied to a particular frequency simply means that that frequency is delayed (or advanced) in time. For example, a time delay of 10μs is exactly the same as a phase lag of 0.628 radians for a 10kHz frequency. Interesting things happen when you start to apply different time delays (i.e. different phase errors) to the different individual frequencies within a waveform. What happens is that the frequency response doesn’t change per se, but the shape of the waveform does (a quick numerical demonstration follows below). So the question arises – are phase errors which change the shape of a waveform, but not its frequency content, audible? It has taken a while, but a consensus seems to be emerging that yes, they are.
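Here is that demonstration: the sketch builds a simple two-tone test signal, delays just the 10kHz component by 10μs (the 0.628 radian example above), and confirms that the magnitude spectrum is unchanged while the waveform shape is not. The particular tones, levels, and sample rate are arbitrary choices for illustration only:

```python
import numpy as np

fs = 192_000                      # sample rate for the sketch (arbitrary)
t = np.arange(0, 0.01, 1 / fs)    # 10 ms of signal

f1, f2 = 1_000.0, 10_000.0        # two tones: 1 kHz plus a 10 kHz component
delay = 10e-6                     # 10 us delay applied to the 10 kHz tone only
phi = 2 * np.pi * f2 * delay      # corresponding phase lag: ~0.628 rad

original = np.sin(2 * np.pi * f1 * t) + 0.5 * np.sin(2 * np.pi * f2 * t)
shifted = np.sin(2 * np.pi * f1 * t) + 0.5 * np.sin(2 * np.pi * f2 * t - phi)

# The magnitude spectra are identical (to within numerical precision)...
mag_original = np.abs(np.fft.rfft(original))
mag_shifted = np.abs(np.fft.rfft(shifted))
print("max spectral difference:", np.max(np.abs(mag_original - mag_shifted)))

# ...but the waveform shapes are not.
print("max waveform difference:", np.max(np.abs(original - shifted)))
```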
The connection between the audibility of phase and the improved fidelity of high sample-rate PCM recording is a slightly tortuous one, but I believe it is very important. Recall that Nyquist/Shannon sampling theory requires the very simple assumption that the signal is band-limited to the Nyquist frequency. Therefore, in any practical implementation, a signal has to be passed through a low-pass filter before it can be digitally sampled. I’m running short of room in this column, but basically, the lower the sampling rate, the more aggressive this low-pass ‘anti-aliasing’ filter needs to be, because the Nyquist frequency gets closer and closer to the upper limit of the audio band. With 44.1kHz sampling, the two are very close indeed.
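To put some rough numbers on ‘more aggressive’, the sketch below uses the standard Kaiser-window length estimate to guess how long a linear-phase FIR anti-aliasing filter would need to be if it is to pass 20kHz untouched and be fully attenuated by the Nyquist frequency. The 96dB attenuation target and the 20kHz passband edge are illustrative assumptions of my own, not any manufacturer’s actual specification:

```python
import math

def kaiser_fir_length(passband_hz, stopband_hz, fs_hz, atten_db=96.0):
    """Ballpark FIR length for a given transition band, using the standard
    Kaiser-window length estimate: N ~ (A - 7.95) / (2.285 * delta_omega)."""
    delta_f = stopband_hz - passband_hz              # transition width, Hz
    delta_w = 2 * math.pi * delta_f / fs_hz          # ...in radians/sample
    return math.ceil((atten_db - 7.95) / (2.285 * delta_w))

# Pass the audio band up to 20 kHz; be fully attenuated by the Nyquist frequency.
for fs in (44_100, 88_200, 176_400):
    taps = kaiser_fir_length(20_000, fs / 2, fs)
    print(f"fs = {fs:>7,} Hz -> transition band {fs / 2 - 20_000:>8,.0f} Hz, ~{taps} taps")
```

The exact tap counts matter less than the trend: the roughly 2kHz-wide transition band at 44.1kHz demands a far steeper filter than the leisurely transition bands available at the higher rates.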
As a pretty good rule of thumb, the more aggressive the filter, the greater the phase (and other) errors it introduces. These errors are at their worst at the frequencies above the top of the audio band where the filter is doing most of its work, but the residual phase errors do leak down into the audio band itself. My contention is that the ‘sound’ of PCM audio – to the extent that it adds a distinct sound signature of its own – is the ‘sound’ of the band-limiting filters that the signal must go through before being digitally encoded. And I attribute a significant element of that ‘sound signature’ – if not the whole ball of wax – to phase errors. Progressively increasing the sample rate allows you to use progressively gentler anti-aliasing filters, whose phase errors may be less and less sonically intrusive.
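As one way of visualising the point, the sketch below designs two elliptic low-pass filters, one squeezed into the narrow transition band that 44.1kHz sampling allows and one given the breathing room of a 96kHz rate, and compares how much their group delay varies across the audio band. The elliptic topology, the ripple and attenuation figures, and the 96kHz comparison rate are purely illustrative assumptions on my part, not a model of any real converter’s filter:

```python
import numpy as np
from scipy import signal

for label, fs in (("44.1 kHz", 44_100), ("96 kHz", 96_000)):
    # Low-pass that passes 20 kHz and reaches full attenuation just below the
    # Nyquist frequency (an illustrative stand-in for an anti-aliasing filter).
    order, wn = signal.ellipord(20_000, 0.99 * fs / 2, gpass=0.5, gstop=80, fs=fs)
    b, a = signal.ellip(order, 0.5, 80, wn, btype="low", fs=fs)

    # Group delay across the audio band (100 Hz to 20 kHz), in microseconds.
    freqs = np.linspace(100, 20_000, 512)
    _, gd_samples = signal.group_delay((b, a), w=freqs, fs=fs)
    gd_us = gd_samples / fs * 1e6
    print(f"{label}: order {order}, in-band group-delay spread "
          f"~{gd_us.max() - gd_us.min():.0f} us")
```

A filter whose group delay is not constant across the band is, by definition, applying different time delays (different phase errors) to different frequencies.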
Interestingly, with DSD, because the sample rate is so high, we can consider it to have an effectively linear phase response across the entire audio band and into the hypersonic range. Indeed, this may well be a significant reason why its adherents prefer its sound over high sample-rate PCM.
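To put the ‘so high’ into perspective, these are the standard base-rate (DSD64) numbers:

```python
dsd64_rate = 64 * 44_100        # 2,822,400 one-bit samples per second
audio_band = 20_000             # nominal top of the audio band, Hz

print(f"DSD64 sample rate:  {dsd64_rate:,} Hz")
print(f"Nyquist frequency:  {dsd64_rate // 2:,} Hz")
print(f"Nyquist / 20 kHz:   {dsd64_rate / 2 / audio_band:.0f}x")
```

With the Nyquist frequency sitting some seventy times above the top of the audio band, whatever band-limiting is applied can be kept far away from the frequencies we actually listen to.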
I believe that end-to-end (i.e. Mic-to-Speaker) phase linearity is going to be the next major advance in digital audio technology, one which we will see flower during the next decade. In fact, it may be that this is at least one of the precepts behind the nascent MQA technology.