Perceptual Acoustics: About the Trickery Used in Audio Compression.

Arved Deecke


If audiophiles were asked what troubles them in their lives, audio compression in digital formats might very well be the number one piece of technology to come up. This article is intentionally placed in the category of Hi-Fi history because I am personally convinced that the days of MP3 and other "lossy formats" are numbered for any serious music application: storage and bandwidth have become so abundant that compression at the expense of quality really fulfills a need from the past.


There are two ways of looking at MP3, the best-known compression standard. The first is the most apparent one: it’s a gruesome hack into music that is really unforgivable and has done its part to provide a rupture in audiophile culture like no other technology before. The other, and the one I am really more interested in at this point, is this: given the outright huge exclusion of a large part of the audible signal, why and how does it even work at all, and well enough for that matter to have become a de facto standard for a brief moment in time?


A lot of why MP3 works as well as it does is because it piggybacks on our brain’s own compression system, which helps us separate the relevant from the noise. The set of tricks and shortcuts employed by compression codecs is known as perceptual coding. MP3, or MPEG-1 and MPEG-2 Audio Layer III as the official technical term goes, uses psychoacoustic models to discard or reduce the precision of components less audible to human hearing, and then records the remaining information in an efficient manner. The intent is to omit only information that is not audible to most people. This is of course the core of the attack on our audiophile ego. We are not like most people. We are the audiophiles, the chosen ones; we have Chuck Norris hearing abilities.


So here are some of MP3’s dirty secrets:


Masking:




Clapping your hands can sound rather loud. A gun fired at the same time, however, will mask the hand-clap completely. Not surprisingly, this has been known for some time: as early as 1894, the American physicist Alfred M. Mayer reported that a tone could be rendered inaudible by another tone of lower frequency. The amount in dB by which a barely audible sound needs to be raised to be audible in the presence of a louder sound is called the masking threshold. This threshold depends on the loudness of the masking sound, its frequency and the frequency of the quieter sound. The curves show the masking threshold attributed to human hearing for a sound at 1200 Hz. So yes, one of the things MP3 encoding does is use such curves and do away with anything that is below the masking threshold. On an MP3 you won’t be able to hear the third violin, second desk, turning the page of the score during a Mahler symphony. The makers of the standard will maintain that you would not hear that while sitting at the original live performance at Carnegie Hall either. Chuck Norris, as a reference, will hear the taxi driver waiting outside turn his newspaper to page three.
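To make the idea of a masking threshold a little more tangible, here is a minimal sketch in Python. It is a toy model of my own choosing, not the psychoacoustic model of the MP3 standard: the threshold is a simple triangle on the Bark scale (using Traunmüller's approximation), and the offset and slope values are illustrative assumptions. The function names are hypothetical.

```python
def hz_to_bark(f_hz):
    """Traunmueller's approximation of the Bark critical-band scale."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def masking_threshold_db(masker_hz, masker_db, probe_hz,
                         offset_db=10.0, lower_slope=27.0, upper_slope=12.0):
    """Toy masking threshold: a triangle on the Bark scale around the masker.

    offset_db and the two slopes (dB per Bark) are illustrative assumptions,
    NOT values taken from the MP3 standard. The shallower upper slope mimics
    the fact that a tone masks higher frequencies more easily than lower ones.
    """
    dz = hz_to_bark(probe_hz) - hz_to_bark(masker_hz)
    slope = upper_slope if dz >= 0 else lower_slope
    return masker_db - offset_db - slope * abs(dz)


def is_masked(masker_hz, masker_db, probe_hz, probe_db):
    """True if the quiet probe tone falls below the toy threshold."""
    return probe_db < masking_threshold_db(masker_hz, masker_db, probe_hz)


# A loud 1200 Hz tone at 80 dB next to a quiet neighbour and a distant tone.
near_is_masked = is_masked(1200, 80, 1300, 40)   # close in frequency
far_is_masked = is_masked(1200, 80, 4000, 40)    # far away in frequency
```

In this sketch the quiet tone right next to the masker is flagged as inaudible and could be dropped, while the same tone far away in frequency has to be kept.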



Joint Stereo



In an ideal world of audiophile bliss, the right channel is completely independent from the left channel, the two being recorded by two different microphones in a binaural setting that mimics the anatomy of the human head when it comes to hearing.


Minute cues help the human auditory system localize sound. They include differences in loudness, onset time delays, phase delays and group delays, as well as cues from the pinna, which absorbs some of the overtones coming from the back versus the front. The human ability to determine the origin of complex sounds in a large variety of different acoustic environments is really quite remarkable, and many audiophiles will agree that stereo imaging is one of the most undervalued attributes of a great-sounding system.


And of course MP3 has to mess with that. Now the temptation is clear: two channels really require twice the amount of data compared to one, so why wouldn’t we want to combine the two and only keep those differences that really help with the stereo effect?


The first piece of information that is thrown overboard is the stereo information in the bass frequencies, which humans other than Chuck Norris struggle to localize anyway, a fact that gave rise to the existence of subwoofers. But what happens then starts to get rude to the sophisticated listener: in extreme cases the codec reduces all stereo information to left-right panning by varying the loudness of different frequencies between left and right. No phase delays, no group delays, no ear-muffled overtones, all gone. No need to be Chuck Norris here: the stereo image on most MP3s is toast.
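The two flavours of joint stereo can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not the bitstream logic of a real encoder: the mid/side step uses a plain divide-by-two scaling for readability, and the "intensity" step boils a whole band down to one mono signal plus a single balance value. All function names are hypothetical.

```python
def to_mid_side(left, right):
    """Mid carries what both channels share, side carries the difference."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def from_mid_side(mid, side):
    """Exact inverse of to_mid_side: nothing is lost by this step alone."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


def intensity_stereo(left, right):
    """Collapse a band to one mono signal plus a single left/right balance.

    All timing and phase differences between the channels are discarded;
    only the loudness balance (1.0 = hard left, 0.0 = hard right) survives.
    """
    mono = [(l + r) / 2 for l, r in zip(left, right)]
    energy_l = sum(l * l for l in left)
    energy_r = sum(r * r for r in right)
    total = energy_l + energy_r
    pan = energy_l / total if total else 0.5
    return mono, pan


left = [0.5, 0.25, -0.5, 1.0]
right = [0.5, 0.25, -0.25, 0.5]

mid, side = to_mid_side(left, right)
restored = from_mid_side(mid, side)
mono, pan = intensity_stereo(left, right)
```

Mid/side on its own is harmless: when both channels are similar the side signal is small and cheap to store, and the original channels come back exactly. The intensity variant is where the stereo image gets hurt, because one number per band is all that remains of the difference between left and right.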



Missing Fundamental



Human hearing has adapted to work well in reality. And just like optical illusions happen when we look at something that couldn’t exist in reality, acoustic illusions are very real, and MP3 uses such trickery. The timbre of a musical instrument, or any sound at all, really, is determined by the overtones of a fundamental frequency.


And just like our optical system is incapable of counting the legs on the elephant shown in the picture, we simply don’t seem to have much choice other than to invent a fundamental frequency 1f if the overtones 2f, 3f, 4f, 5f, etc. are present.


The makers of MP3 of course know that and many times simply deprive us of the fundamental frequency. This actually works quite well when listening to music on headphones that don’t offer a bodily sensation of bass notes. On the right set of loudspeakers, however, there may well be a difference in what happens when we listen to uncompressed music: goosebumps.
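The missing fundamental can be illustrated numerically. The sketch below, with a sample rate and a 200 Hz fundamental that I picked purely for the demo, builds a tone from the overtones 2f to 5f only and then asks at which delay the waveform repeats itself. The repetition period still corresponds to 1f, even though no energy was ever placed there. This says nothing about how an encoder decides what to drop; it only shows why the pitch survives.

```python
import math

SAMPLE_RATE = 8000      # Hz, assumed for this demo
FUNDAMENTAL = 200.0     # Hz, deliberately NOT present in the signal
HARMONICS = (2, 3, 4, 5)


def tone_without_fundamental(n_samples):
    """Sum of the overtones 2f..5f; the fundamental 1f is left out."""
    return [sum(math.sin(2 * math.pi * h * FUNDAMENTAL * n / SAMPLE_RATE)
                for h in HARMONICS)
            for n in range(n_samples)]


def strongest_period(signal, max_lag):
    """Return the lag (in samples) at which the signal best matches itself."""
    n = len(signal) - max_lag

    def autocorr(lag):
        return sum(signal[i] * signal[i + lag] for i in range(n))

    return max(range(1, max_lag + 1), key=autocorr)


signal = tone_without_fundamental(860)
period = strongest_period(signal, 60)
implied_pitch_hz = SAMPLE_RATE / period
```

The waveform repeats every 40 samples, which at 8000 samples per second is the period of a 200 Hz tone: the pattern of the overtones implies the fundamental all by itself.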



Undersampling



MP3 encoders and decoders can choose how many times per second an acoustic signal gets sampled. The most common sampling rate is 44.1 kHz, meaning that roughly 44 thousand times every second a value is returned after decompression back into the analogue realm. That does sound like quite a lot; however, it means that, in theory, a 22 kHz sound can only be represented by two points per cycle, in the best of cases one at its maximum and the other at its minimum. Now granted, only Chuck Norris, the bat whisperer, can hear 22 kHz, but there are two problems with 44.1 kHz. One is that even at frequencies audible to middle-aged men like me, around 16 kHz, fewer than three samples are going to be offered per wave cycle, and they are almost definitely not going to fall at the best times for the wave to be represented. The other is that these sudden rises and falls are not handled very well by the digital and analogue filters always present in the sound reproduction chain, and they can definitely cause audible problems for more people than just Chuck Norris. And you certainly don’t have to have Norris hearing to hear when undersampling is taken too far. As shown in the image above, an undersampled wave of a certain frequency can actually be encoded as a lower-frequency wave when undersampling is so extreme that aliasing occurs.
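The arithmetic behind this is easy to check. The following sketch assumes the CD-style rate of 44.1 kHz and a 30 kHz test tone that I chose as an example of a frequency too high for that rate; the function names are hypothetical. It computes the number of samples per wave cycle and the lower frequency that a too-high tone gets mistaken for.

```python
import math


def samples_per_cycle(sample_rate_hz, tone_hz):
    """How many samples land on one full cycle of the tone."""
    return sample_rate_hz / tone_hz


def alias_frequency(sample_rate_hz, tone_hz):
    """Frequency in 0..fs/2 that the sampled tone is indistinguishable from."""
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


def sampled_cosine(tone_hz, sample_rate_hz, count):
    """The first `count` samples of a cosine at the given frequency."""
    return [math.cos(2 * math.pi * tone_hz * n / sample_rate_hz)
            for n in range(count)]


per_cycle_16k = samples_per_cycle(44100, 16000)   # fewer than three
alias_of_30k = alias_frequency(44100, 30000)      # folds down to 14100 Hz

# The two tones produce the same sample values, so after sampling
# nothing is left to tell them apart.
too_high = sampled_cosine(30000, 44100, 50)
impostor = sampled_cosine(14100, 44100, 50)
largest_difference = max(abs(a - b) for a, b in zip(too_high, impostor))
```

A 16 kHz tone gets about 2.76 samples per cycle at 44.1 kHz, and a 30 kHz tone sampled at that rate yields exactly the samples of a 14.1 kHz tone, which is the aliasing described above.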


In my previous interview with him, David Chesky of Chesky Records recommended a 96 kHz sampling rate to avoid such issues.



Pre-echoes or the Modified discrete cosine transform (MDCT)


Yes, that is a big word. The normal discrete cosine transform, simply put, takes a block of samples and replaces it with a finite number of individual cosine waves, not unlike a Fourier transform. Now in reality an MP3 codec does not look at an entire piece of music when encoding; it looks at blocks of usually 576 samples at a time. The transition between the blocks can lead to audible artifacts, and hence the MDCT overlaps two such blocks to avoid this issue, yet creates another: pre-echoes. A pre-echo is the artifact of a sound that is audible prior to the sound actually occurring. Chuck Norris can hear a rolling thunder on MP3 a day before the lightning strikes, but you and I will notice the problem mostly with fast-rising, high-pitched sounds like cymbals perhaps.
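Here is a small, deliberately slow sketch of an MDCT with 50% overlapping blocks, written under my own assumptions: a sine window, a tiny hop size of 8 samples instead of 576, and a crude uniform quantizer standing in for the lossy part. It is not the MP3 filter bank, which is a hybrid design, but it shows both halves of the story: without quantization the overlapped blocks add back up to the original, and with coarse quantization a single click smears noise across its block, including the samples before the click.

```python
import math

N = 8  # hop size; kept tiny for readability (the text mentions 576 for MP3)

# Sine window: its squares over the two overlapping halves sum to one.
WINDOW = [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]


def _basis(n, k):
    return math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))


def mdct(block):
    """2N input samples (windowed here) -> N coefficients."""
    return [sum(WINDOW[n] * block[n] * _basis(n, k) for n in range(2 * N))
            for k in range(N)]


def imdct(coeffs):
    """N coefficients -> 2N windowed samples that still contain aliasing."""
    return [WINDOW[n] * 2.0 / N * sum(coeffs[k] * _basis(n, k) for k in range(N))
            for n in range(2 * N)]


def codec(signal, step=0.0):
    """Encode and decode with overlapping blocks; step > 0 quantizes coarsely."""
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * N + 1, N):
        coeffs = mdct(signal[start:start + 2 * N])
        if step:
            coeffs = [round(c / step) * step for c in coeffs]
        for n, value in enumerate(imdct(coeffs)):
            out[start + n] += value   # overlap-add cancels the aliasing
    return out


# Silence with one sharp click, a stand-in for a cymbal hit.
click_at = 2 * N + 4
signal = [0.0] * (4 * N)
signal[click_at] = 1.0

lossless = codec(signal)
lossy = codec(signal, step=0.25)

# Only samples covered by two blocks (N .. 3N) are fully reconstructed.
reconstruction_error = max(abs(a - b)
                           for a, b in zip(signal[N:3 * N], lossless[N:3 * N]))
pre_echo = max(abs(v) for v in lossy[N:click_at])
```

The original signal is perfectly silent before the click, so anything non-zero in `pre_echo` is sound that the decoder plays before the event that caused it.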



Acoustic Beats



Ever sat in a twin-engine aircraft whose engines ran at slightly different speeds? What we hear then is not two separate engines running, but a noise that swells at a frequency equal to the difference between both frequencies, in other words much slower than either. This of course can be, and is being, used in audio compression by not storing the information for each individual wave, but the superposition of the two. This can be done in several steps or layers, where the combination of these phantom beat notes leads to new phantom beat notes, drastically reducing the amount of data needed. Chuck Norris can already hear intermodulation distortion caused by this compression technique when Yehudi Menuhin and Gidon Kremer play the same note on the same violin at the same time.
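The beat itself is plain trigonometry: the sum of two sines equals one sine at the average frequency, multiplied by a slow envelope. The sketch below checks that identity numerically for two engine tones of 100 Hz and 103 Hz, values I made up for the example. It illustrates only the physics of the beat, not how any codec stores it.

```python
import math


def beat_frequency(f1_hz, f2_hz):
    """Rate at which the combined loudness swells and fades."""
    return abs(f1_hz - f2_hz)


def two_tones(f1_hz, f2_hz, t):
    """The two engines simply added together."""
    return math.sin(2 * math.pi * f1_hz * t) + math.sin(2 * math.pi * f2_hz * t)


def carrier_times_envelope(f1_hz, f2_hz, t):
    """The same signal written as one tone with a slowly varying loudness."""
    carrier = math.sin(2 * math.pi * (f1_hz + f2_hz) / 2 * t)
    envelope = 2 * math.cos(2 * math.pi * (f1_hz - f2_hz) / 2 * t)
    return carrier * envelope


beats_per_second = beat_frequency(100.0, 103.0)

# Compare both descriptions at 1000 instants across one second.
largest_difference = max(
    abs(two_tones(100.0, 103.0, n / 1000) - carrier_times_envelope(100.0, 103.0, n / 1000))
    for n in range(1000)
)
```

Both descriptions agree to within rounding error, and the loudness swells three times per second, which is the difference between the two engine frequencies.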


Now, I can already hear the analog crowd crying victory over the apparently abhorrent use of compression. Not so fast: arguably the most aggressive compression technique used in Hi-Fi history comes from the inherent inability of a vinyl groove to encode any reasonable amount of bass compared to its natural ability to produce screeching sounds at the higher frequencies. To fix this, the RIAA curve was introduced, which provides the standard equalization used on all records today. The bass is cut and the treble boosted when the record is made; on playback the bass is vastly amplified, while the higher frequencies are suppressed. With this, music can be literally compressed into very narrow and closely spaced grooves, increasing the amount of music that can be stored. The digital world certainly wasn’t the first to push the boundaries of storing more music than the format itself can hold.
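How drastic that equalization is can be estimated from the three time constants commonly quoted for the RIAA curve: 3180, 318 and 75 microseconds. The sketch below computes the playback gain relative to 1 kHz from those values. Treat it as a back-of-the-envelope calculation with hypothetical function names, not as a phono stage design.

```python
import math

# RIAA time constants in seconds (3180 us, 318 us, 75 us)
T1, T2, T3 = 3180e-6, 318e-6, 75e-6


def _magnitude(f_hz):
    """Magnitude of the playback (de-emphasis) response, not yet normalized."""
    w = 2 * math.pi * f_hz
    return math.sqrt(1 + (w * T2) ** 2) / (
        math.sqrt(1 + (w * T1) ** 2) * math.sqrt(1 + (w * T3) ** 2))


def riaa_playback_gain_db(f_hz):
    """Playback gain in dB relative to 1 kHz."""
    return 20 * math.log10(_magnitude(f_hz) / _magnitude(1000.0))


bass_boost_db = riaa_playback_gain_db(20.0)       # deep bass on playback
treble_cut_db = riaa_playback_gain_db(20000.0)    # top of the audio band
```

On playback the deepest bass ends up boosted by roughly 19 dB and the highest treble cut by roughly 20 dB relative to 1 kHz, which means the groove itself carries a signal that is tilted by almost 40 dB across the audio band.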


I personally conclude that good MP3s still sound remarkably good given the massively intrusive nature of the format. They also typically offer an 11 to 1 compression compared to uncompressed music and definitely have a place in applications like telephone systems or public address systems, where bandwidth matters more than sound quality. According to Moore’s law, computational power and other attributes tend to double every 18 months, and for those who really have a passion for great-sounding music, there really isn’t much of a reason to live with MP3 limitations anymore. As good as an MP3 might be, there is always the emotional toll we pay for not knowing what we are missing, even if the music sounds great.
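The 11 to 1 figure can be reproduced with a line of arithmetic, assuming CD-quality source material (44.1 kHz, 16 bits, two channels) and an MP3 encoded at 128 kbit/s. Both are my assumptions about where the number comes from; other bitrates give other ratios.

```python
def pcm_bitrate(sample_rate_hz=44100, bits_per_sample=16, channels=2):
    """Bits per second of uncompressed PCM audio (defaults: CD quality)."""
    return sample_rate_hz * bits_per_sample * channels


def compression_ratio(mp3_bitrate=128_000):
    """Uncompressed bitrate divided by the assumed MP3 bitrate."""
    return pcm_bitrate() / mp3_bitrate


cd_bits_per_second = pcm_bitrate()
ratio_at_128k = compression_ratio()
```

Uncompressed CD audio runs at 1,411,200 bits per second, so a 128 kbit/s file is about eleven times smaller.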


It’s time to pay a visit to our friends at HD-Tracks, who not only offer some of the best recordings in the world but also know how to keep it all when selling it to you in a digital format.



Arved Deecke is the founder of the Danish/Mexican loudspeaker company KVART & BØLGE, which makes audiophile quarter-wave loudspeakers and sound systems at a price anyone can afford. In his free time he blogs about all things related to sound, music and audio.



