Introduction to Digital Audio

Sound Waves

When a sound source vibrates it transmits its motion to molecules in the surrounding medium. The elastic nature of the medium causes molecules adjacent to the disturbance to start vibrating as well. This "chain reaction" forms a displacement wave that moves away from the source in all directions. This motion can be easily understood by visualizing the "rippling effect" that occurs when you throw a rock into a still pond. The radiating waves consist of repeated compressions and rarefactions of molecules in the medium brought about by the motion of the source. When the molecules in the medium are compressed the pressure in that vicinity is higher than in regions of the medium at equilibrium. Similarly, when motion pulls molecules further apart (rarefaction) the pressure at that locality is less than it is at equilibrium. This smooth, continuous change in pressure can be graphed over time as an oscillation centered around the equilibrium.

Example 1a: Two graphs of a sound wave superimposed on top of each other. The blue plot depicts the density of molecules in the medium at an instant in time. The black plot graphs changing pressure as a function of time. Maximum compression, rarefaction and equilibrium are labeled C, R and E, respectively.

The X axis of the graph in Example 1a plots time in seconds. The Y axis of the graph represents the amplitude of the waveform, a measurement of how much energy is carried in the waveform. Since this energy can be measured in many ways (pressure, displacement, voltage, etc.) the Y axis is often depicted in terms of a simple 1 to -1 scale, where 1 means maximum compression and -1 means maximum rarefaction.

Waveforms like the one depicted in Example 1a repeat themselves at regular intervals of time. This type of wave is called a periodic waveform. The unit of regularity in a periodic waveform is called its cycle. A wave like Example 1a oscillates in a completely uniform manner (at a constant rate). This type of wave is called a sinewave. But musical instruments that produce the harmonic series actually generate a periodic wave that oscillates at multiple rates. These types of waves are called complex waveforms. Note that not all complex waves are periodic, that is, it is possible for a wave to have no discernible regularity in its oscillation. These types of waves are called aperiodic waveforms, or noise.

Example 1b: Plots of three different types of waves. The blue and green plots both depict periodic waveforms, a sinewave at 6 Hz (blue) and a complex wave (green) containing the first, third and fifth harmonics and whose first harmonic is traveling at 1 Hz. The red plot depicts an aperiodic waveform with no discernible regularity, also called noise.
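A complex periodic wave like the green plot in Example 1b can be sketched in code by summing sinewaves at harmonic multiples of a fundamental. The 1 Hz fundamental and odd harmonics come from the example; the 1/n amplitude weighting is an illustrative assumption, not something specified by the figure.

```python
import math

def complex_wave(t, fundamental=1.0, harmonics=(1, 3, 5)):
    """Sum sinewaves at the given harmonics of the fundamental.
    Each harmonic n is weighted by 1/n (an illustrative choice)."""
    return sum(math.sin(2 * math.pi * n * fundamental * t) / n
               for n in harmonics)

# Evaluate one cycle of the 1 Hz complex wave at 100 points.
samples = [complex_wave(i / 100) for i in range(100)]
```

Because every component completes a whole number of cycles per second, the summed wave repeats once per second, i.e. it is periodic at 1 Hz.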

The amount of time a waveform needs to travel through one cycle is called the period of the wave. The number of cycles per second of a wave is called the frequency and is measured in Hertz. Since the graph in Example 1a shows two cycles of a waveform over two seconds of duration the wave is oscillating at the rate of one cycle per second, or 1 Hertz. (The frequency of 1 Hz is not audible by humans.) The frequency and period of a waveform are reciprocal values, that is, the period of a wave can be calculated as 1/frequency. For example, the period of a 440 Hertz waveform is 1/440th of a second.
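The reciprocal relationship between period and frequency is easy to verify in code; the function names below are simply illustrative.

```python
def period(frequency_hz):
    """Period in seconds of a wave with the given frequency in Hertz."""
    return 1.0 / frequency_hz

def frequency(period_s):
    """Frequency in Hertz of a wave with the given period in seconds."""
    return 1.0 / period_s

print(period(440))     # 1/440th of a second, about 0.00227 s
print(frequency(2.0))  # a 2-second period means 0.5 Hz
```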

Analog Recording

Natural sound is ephemeral, it only exists while a sound source is physically vibrating. Once the sound source stops moving the medium returns to its normal state of equilibrium and no trace of the wave's pressure fluctuations remains. Since the wave has ceased to exist so has the sound. In order to record sound the pressure waves in the air must be transformed into some other medium that does not "return to equilibrium" once it has been distorted by a wave. An analog recording accomplishes this task by literally translating a pressure wave into an analogous medium. This process can easily be visualized by imagining a pen writing on a wax tablet: as the pen moves, its point etches into the surface of the wax and captures every motion that the hand makes. The etched grooves in the surface of the wax remain once writing has stopped. Note that the wax recording actually records motion in three dimensions: horizontal and vertical (as with paper) and depth, which captures the instantaneous pressure changes applied to the pen as it moved over the wax surface. The pressure recording is quite similar to the original Edison wax cylinder recordings. In this process an incoming pressure wave activated a membrane that communicated its vibrations to a needle that then etched these vibrations into a continually rotating wax cylinder. Over the course of a few decades the analog recording process improved tremendously. One of the major improvements was to use powered microphones that transformed air pressure into voltage that could then be amplified before the signal was recorded. Analog recording reached its apex in the late 1950's with the 33 1/3 RPM LP (long playing) record and the tape recorder.

The analog recording process translates vibrations in air into a more permanent medium. This means an analog recording faithfully reproduces all the subtleties and richness contained in the original waveform. Unfortunately, the fact that the analog recording process uses a physical object to accomplish this goal has several disadvantages.

Digital Recording

Sound is digitally recorded by converting a continuously varying analog signal into a discrete series of numerical quanta called samples. A sample is simply a number that represents a "snapshot" of a waveform's amplitude at some instant in time. Since digital samples are simply numbers they can be stored in a computer's memory, saved to a sound file on the computer's hard drive or on a digital audio CD.

To digitally record a waveform a continuous signal is processed by an analog to digital converter (ADC), a device which takes snapshots, or samples, of the input signal at some specified rate, or number of times per second. Each sample that the ADC outputs can then be saved in the same sequential order to produce a "recording" of the original waveform. Once the recording has stopped a digital to analog converter (DAC) can then process the stored samples at the same rate they were recorded in order to produce an output waveform.
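The ADC's job can be sketched as evaluating the continuous signal at evenly spaced instants in time. The code below samples a 1 Hz sinewave 8 times per second, as in Example 2; the function names here are illustrative, not part of any real converter's API.

```python
import math

def adc(signal, sampling_rate, duration):
    """Sample a continuous signal (a function of time in seconds)
    sampling_rate times per second for the given duration."""
    n = int(sampling_rate * duration)
    return [signal(i / sampling_rate) for i in range(n)]

one_hz = lambda t: math.sin(2 * math.pi * t)
samples = adc(one_hz, sampling_rate=8, duration=1.0)
print(len(samples))  # 8 samples for one second of sound
```

A DAC would perform the inverse step, reading these stored numbers back out at 8 per second to reconstruct an output waveform.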

Example 2. A continuous waveform is sampled 8 times per second by an Analog to Digital Converter (ADC) and the samples are saved in sequential order. The samples are then read by a Digital to Analog converter (DAC) to produce an output waveform based on the original. The output waveform is not an exact duplicate (analog) of the original waveform which is shown as a dotted line in the second graph.

Note that, since the ADC does not capture all the values of the original waveform, the output waveform produced from the recording will not be an exact copy, or analog, of the original. The sampling rate determines the number of samples that are taken from the wave per second. The more a waveform is sampled per second the closer the digital recording is to the original. The sampling resolution is the maximum number of digits (bits) each sample can represent. The sampling resolution determines how accurately the amplitudes in the waveform can be represented. An 8-bit sampling resolution means that the continuous values of the input signal will be quantized to 2^8, or 256, amplitude values. 16-bit resolution quantizes to 2^16, or 65,536, amplitude values and is therefore much more accurate than 8-bit recording. The lower the sampling resolution the more likely the quantized sample will deviate from the actual value in the wave. This difference is an error and ends up as noise in the audio recording. As a general guideline, the signal-to-noise ratio for a sampling resolution of n bits is SNR (dB) = 20 * log10(2^n), which is approximately n * 6, so the 8-bit signal-to-noise ratio is 8 * 6 = 48 dB and the 16-bit ratio is 96 dB.
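Quantization and the resulting signal-to-noise ratio can be sketched in code. The quantize function below maps a sample in the -1 to 1 range onto the nearest of 2^n evenly spaced levels; this particular rounding scheme is a simplification chosen for illustration.

```python
import math

def quantize(sample, bits):
    """Round a sample in [-1.0, 1.0] to the nearest of 2**bits levels."""
    levels = 2 ** bits
    step = 2.0 / (levels - 1)          # spacing between adjacent levels
    return round(sample / step) * step

def snr_db(bits):
    """Approximate signal-to-noise ratio for an n-bit recording."""
    return 20 * math.log10(2 ** bits)  # close to bits * 6.02

print(2 ** 8, 2 ** 16)                      # 256 and 65536 amplitude values
print(round(snr_db(8)), round(snr_db(16)))  # roughly 48 and 96 dB
```

Notice that the quantization error of any one sample can never exceed half a step, which is why adding bits (shrinking the step) directly lowers the noise floor.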

Example 3. A 1 Hz wave is sampled at two different sampling rates and sampling resolutions. The first cycle (blue) uses a sampling rate of 50 Hz and a 5-bit sampling resolution (2^5, or 32, amplitude values). The second cycle is sampled at 25 Hz with a 2-bit sampling resolution (2^2, or 4, amplitude values). As can be seen in the graph the blue samples are more accurate in both the time and amplitude domains than the red samples.

Despite the fact that the digital recording is only an approximation of the continuous waveform it has several advantages over an analog recording: the samples do not degrade over time, they can be copied perfectly, and they can be edited and analyzed by computer programs.

Sampling Rate

The number of times a source waveform is sampled per second is called the sampling rate of the recording. Sampling rates of 22050, 44100 and 48000 are typical rates used in the recording industry. A sampling rate of 44100 means that, for every second of sound, the digital recording stores 44,100 "snapshots", or samples, of the original waveform. Of course if the recording is in stereo then the digital recording will actually store double that number of samples, or 88,200 samples per second for a sampling rate of 44100. To record Mahler's Second in digital stereo requires a lot of numbers!
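The storage cost is easy to estimate. The sketch below assumes CD-quality settings (44100 Hz, 16-bit, stereo) and an 80-minute performance; the duration is an assumption for illustration, not the actual length of the Mahler.

```python
def storage_bytes(seconds, sampling_rate=44100, bits=16, channels=2):
    """Bytes needed for uncompressed audio at the given settings."""
    return seconds * sampling_rate * (bits // 8) * channels

eighty_minutes = 80 * 60
size = storage_bytes(eighty_minutes)
print(size)                  # 846720000 bytes
print(size / (1024 ** 2))    # roughly 807 megabytes of samples
```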

Obviously, the more times the original wave is sampled per cycle (ie the faster the sampling rate) the more faithful the digital recording will be to the original waveform. But since each sample must be stored there is trade off between the fidelity of the recording and the amount of storage resources consumed by the recording process.

How many samples per second are actually needed to make a good digital recording?

It turns out that the answer to this question depends, in part, on the highest frequency that the original waveform contains. To see why this is so, consider two different waveforms that are recorded with a fixed sampling rate of 8 Hz, or 8 times per second.

Example 4. A sampling rate of 8 Hz is applied to a 4 Hz wave (blue) and a 1/2 Hz wave (red). The 1/2 Hz wave is better represented by the sampling because there are more points per cycle of its waveform.

Example 4 shows that for a given sampling rate (in this case 8 Hz) a recording of the blue wave will be worse than for the red wave because the blue wave will have fewer samples per cycle recorded than the red one. In other words, for a fixed sampling rate the higher the frequency the worse the recording because each cycle of a higher frequency will be represented by fewer and fewer sample points. Is there an actual limit to how many points can represent a given waveform?
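The samples-per-cycle figure behind Example 4 is just the sampling rate divided by the frequency of the wave:

```python
def samples_per_cycle(sampling_rate, frequency):
    """How many sample points land in each cycle of the waveform."""
    return sampling_rate / frequency

print(samples_per_cycle(8, 4))    # 2.0 points per cycle for the 4 Hz wave
print(samples_per_cycle(8, 0.5))  # 16.0 points per cycle for the 1/2 Hz wave
```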

The answer to this question is "yes", and it is explained by the Nyquist Theorem.

The Nyquist Theorem

The Nyquist theorem states that the maximum frequency that can be represented in a digital recording is 1/2 the frequency of the sampling rate. This frequency is called the Nyquist Limit. Another way of stating the Nyquist Theorem is that there must be a minimum of 2 sample points per cycle of a waveform. To see why this is so, consider the following graph which depicts the results of sampling a wave below, at, and above the Nyquist Limit:

Example 5. A 1 Hz waveform (gray) at three different sampling rates. The blue plot samples every 1/4 cycle, below the Nyquist limit (more than 2 points per cycle). The green plot samples every 1/2 cycle, exactly the Nyquist limit. The red plot samples every 3/4 cycle, above the Nyquist limit. The red plot can be seen to describe a different frequency.

The Nyquist Limit and Aliasing

Frequencies that lie above the Nyquist Limit (1/2 the sampling rate) are reflected back or aliased to frequencies below the limit. Since aliased frequencies are not part of the original signal they distort the recording.

Example 6. A sinewave glissando up to the sampling rate of 10000 Hz. Frequencies above the Nyquist limit (5000 Hz, green line) are reflected, or aliased, to incorrect frequencies (red) below the limit.

Example 6 Audio


Since aliased components are no longer at their original positions relative to frequencies below the limit they distort the spectrum and so they are disastrous for a recording. For example, with a 20 kHz sampling rate both 15 kHz and 25 kHz would be folded over to 5 kHz in the digital recording. For this reason an analog signal is often sent through a lowpass filter before it is sampled to remove frequencies above the Nyquist limit that would otherwise distort the recording.
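The folding described above can be computed directly. The sketch below reflects any input frequency back into the representable range, reproducing the 20 kHz example:

```python
def aliased_frequency(freq, sampling_rate):
    """Frequency heard after sampling: components above the Nyquist
    limit (half the sampling rate) are reflected back below it."""
    nyquist = sampling_rate / 2
    f = freq % sampling_rate  # sampling cannot tell f from f + k * rate
    return f if f <= nyquist else sampling_rate - f

print(aliased_frequency(15000, 20000))  # 5000, folded down
print(aliased_frequency(25000, 20000))  # 5000, folded down
print(aliased_frequency(4000, 20000))   # 4000, below the limit, unchanged
```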

Example 7. A major chord is arpeggiated in two different octaves. The upper two notes in the second octave (gray notes) are above the Nyquist limit and are aliased to frequencies that distort the chord (red notes).

Example 7 Audio


Sound Files

A sound file, or audio file, is a computer file that contains audio samples. The samples may be the result of a digital recording as described earlier in this document, or they may result from direct computation performed by a computer program. Audio samples that result from direct computation are said to be synthesized.

Since audio samples are just numerical "snapshots" of a waveform, it is impossible to tell by examining an audio file whether it was the result of a recording or a synthesis program. Nor is it possible to directly deduce what the actual source of the waveform is (if any) or anything about the musical notation that produced it. Since none of that information was "encoded" in the actual sound pressure wave it cannot be read directly from the samples.

The fact that an audio file simply contains normalized numerical information means that it can serve as grist for many different types of computer programs to manipulate, transform, edit and analyze. Today there are many programs that allow users to manipulate audio samples. One of the most interesting things that can be done with audio files is to process them using audio editing programs or synthesis languages like CLM and Csound. Manipulating samples in this fashion is reminiscent of a style of music composition called musique concrète invented by Pierre Schaeffer in the late 1940's using analog recordings. In this type of composition a piece of music is literally assembled out of bits and pieces of sound that are gathered, processed and mixed together.

Sound Files Formats

Although all audio files contain audio samples there are a number of different strategies, or formats, for storing the samples. Some formats are designed to quickly access large "chunks" of audio data. Other strategies involve compressing samples so that they take up less room on the computer's hard drive.

Table of common audio file formats

Name                           Description                                    File Type
Audio Interchange File Format  Standard Mac digital audio                     .aiff
Waveform                       Standard Windows digital audio                 .wav
System Sound                   System sound format for Mac and Linux (older)  .snd
MPEG I, Layer 3                Compressed audio (lossy), proprietary codec    .mp3
Ogg Vorbis                     Compressed audio (lossy), open standard        .ogg
Free Lossless Audio Codec      Compressed audio (lossless), open standard     .flac
RealAudio or RealMedia         Compressed, streaming audio for Internet       .ra or .rm
u-Law                          Compressed audio for Internet (older)          .au
Roshal Archive                 Compressed archival format, proprietary        .rar

Sound File Headers

Since samples are just numbers representing a pressure wave, it's not possible to tell by looking at them where they came from, or even what sampling resolution and sampling rate they were recorded at! For that reason all sound files contain a short introductory portion called the header of the sound file. Information that might be included in a header includes the sampling rate, the sampling resolution, the number of channels, and the format of the sample data.
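Python's standard wave module can read exactly this kind of header information from a .wav file. The sketch below writes a short file of 16-bit mono silence and then reads its header back; the filename is arbitrary.

```python
import wave

# Write one second of 16-bit mono silence at 44100 Hz.
with wave.open("example.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)  # 2 bytes per sample = 16-bit resolution
    w.setframerate(44100)
    w.writeframes(b"\x00\x00" * 44100)

# Read the header back: sampling rate, resolution and channel count.
with wave.open("example.wav", "rb") as w:
    print(w.getframerate(), w.getsampwidth() * 8, w.getnchannels())
```

None of these values could be recovered from the raw samples alone, which is precisely why the header exists.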

Sound File Data

The header of the sound file contains global information about the samples that the sound file contains. The sample data itself usually starts directly after the header portion of the file. There are a number of different strategies to store this data. One method is to separate the various audio channels into different chunks, or blocks, in the file. Another method involves interleaving samples from each channel into a single stream of audio samples.
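Interleaving can be sketched in a few lines: given a left and a right channel of equal length, samples alternate L, R, L, R through the stream.

```python
def interleave(left, right):
    """Merge two equal-length channels into one L, R, L, R ... stream."""
    stream = []
    for l, r in zip(left, right):
        stream.extend((l, r))
    return stream

print(interleave([1, 2, 3], [4, 5, 6]))  # [1, 4, 2, 5, 3, 6]
```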

How a particular format chooses to store data should not normally concern the user. However, not all programs can read or write all of the different formats so it's important to become acquainted with the more common sound file types. There are also a number of sound file conversion programs, like sox, that convert sound files from one format to another.