How is Audio Represented Digitally

Sound waves are captured by a microphone, which converts the acoustic energy into an analog electrical signal. The analog signal is then fed into an ADC (analog-to-digital converter), where two critical processes occur: sampling and quantization.

Sampling

Definition

Sampling is the process of measuring the amplitude of an analog signal at regular intervals. These intervals are determined by the sample rate, expressed in Hertz (Hz). For example, a sample rate of 44.1 kHz means the signal is sampled 44,100 times per second.

Purpose

By sampling the audio signal, we create a series of discrete data points that approximate the continuous analog waveform.
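As a minimal sketch of what those discrete data points look like, the snippet below samples a pure sine tone at regular intervals. The function name `sampleSine` and its parameters are hypothetical and chosen only for illustration; a real ADC measures a physical signal rather than evaluating a formula.

```typescript
// Sample a pure sine tone once every 1 / sampleRate seconds.
function sampleSine(
  frequencyHz: number,
  sampleRate: number,
  durationSeconds: number
): Float32Array {
  const sampleCount = Math.floor(sampleRate * durationSeconds);
  const samples = new Float32Array(sampleCount);
  for (let n = 0; n < sampleCount; n++) {
    const t = n / sampleRate; // time of the n-th sample, in seconds
    samples[n] = Math.sin(2 * Math.PI * frequencyHz * t);
  }
  return samples;
}

// Half a second of a 440 Hz tone at 44.1 kHz
const tone = sampleSine(440, 44100, 0.5);
console.log(tone.length); // 22050
```

Each element of the returned array is one amplitude measurement, so doubling the sample rate doubles the number of data points for the same duration.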

Implication

The Nyquist-Shannon sampling theorem states that the sample rate must be more than twice the highest frequency component in the audio signal for the original signal to be reconstructed accurately. For example, human hearing typically extends up to about 20 kHz, so the standard CD sample rate of 44.1 kHz covers the audible range with some margin.
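To illustrate what goes wrong when the sample rate is too low, the sketch below computes the frequency at which a pure tone appears after sampling (its alias). The helper name `apparentFrequency` is hypothetical, and the model assumes an ideal sampler with no anti-aliasing filter in front of it.

```typescript
// Frequency at which a pure tone shows up after being sampled at sampleRate.
function apparentFrequency(toneHz: number, sampleRate: number): number {
  // Sampling cannot distinguish f from f + k * sampleRate ...
  const folded = toneHz % sampleRate;
  // ... and frequencies mirror around sampleRate / 2.
  return Math.min(folded, sampleRate - folded);
}

console.log(apparentFrequency(20000, 44100)); // 20000 (below 22,050 Hz, preserved)
console.log(apparentFrequency(30000, 44100)); // 14100 (above 22,050 Hz, aliased)
```

A 30 kHz tone sampled at 44.1 kHz is indistinguishable from a 14.1 kHz tone, which is why content above half the sample rate is normally filtered out before sampling.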

Quantization

Definition

Quantization is the process of converting each sampled amplitude value into a digital value. This involves assigning a specific numerical value (quantization level) to each sample, based on its amplitude.

Purpose

The range of possible amplitude values is divided into discrete steps. Each step is assigned a digital value. The bit depth determines the number of possible quantization levels. For instance, a 16-bit system can represent 65,536 (2^16) different levels.
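The mapping from amplitude to quantization level can be sketched as follows. This assumes amplitudes normalized to the range -1.0 to 1.0 and signed integer codes, which is one common convention rather than the only one; `quantizationLevels` and `quantize` are hypothetical names.

```typescript
// Number of distinct levels available at a given bit depth.
function quantizationLevels(bitDepth: number): number {
  return 2 ** bitDepth;
}

// Map a normalized amplitude to the nearest signed integer level.
function quantize(amplitude: number, bitDepth: number): number {
  const half = 2 ** (bitDepth - 1); // 32768 for 16-bit audio
  const code = Math.round(amplitude * half);
  // Signed codes run from -half to half - 1, so clamp to that range.
  return Math.max(-half, Math.min(half - 1, code));
}

console.log(quantizationLevels(16)); // 65536
console.log(quantize(0.5, 16)); // 16384
console.log(quantize(1.0, 16)); // 32767 (clamped to the top level)
```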

Implication

Quantization introduces a small amount of error, known as quantization noise, because the process involves rounding the true amplitude value to the nearest quantization level. Higher bit depths can reduce this error, leading to higher fidelity audio.
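The effect of bit depth on rounding error can be checked numerically. The sketch below quantizes a sweep of amplitudes and reports the largest round-trip error; `maxQuantizationError` is a hypothetical helper, and the sweep is limited to the range -0.9 to 0.9 so that clipping at full scale does not affect the result.

```typescript
// Largest round-trip error when amplitudes are rounded to bitDepth bits.
function maxQuantizationError(bitDepth: number, probes: number = 10001): number {
  const half = 2 ** (bitDepth - 1);
  let worst = 0;
  for (let i = 0; i < probes; i++) {
    // Evenly spaced amplitudes between -0.9 and 0.9
    const amplitude = -0.9 + (1.8 * i) / (probes - 1);
    const code = Math.round(amplitude * half); // nearest quantization level
    const restored = code / half; // amplitude the level stands for
    worst = Math.max(worst, Math.abs(amplitude - restored));
  }
  return worst;
}

const error8 = maxQuantizationError(8);
const error16 = maxQuantizationError(16);
console.log(error8 > error16); // true
```

Because rounding goes to the nearest level, the error never exceeds half a step, and each additional bit halves the step size.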

Terminology

The sample rate is the number of samples of audio carried per second. It’s measured in Hertz (Hz).

Channels refers to the number of separate audio channels (e.g., mono, stereo, surround sound) in the recording. Mono means a single channel: all audio is combined into one channel.

Bit depth refers to the number of bits used to represent each audio sample.
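These three terms together determine how much data uncompressed audio takes. As a small worked example (the function name `pcmBytesPerSecond` is hypothetical):

```typescript
// Bytes needed per second of uncompressed audio.
function pcmBytesPerSecond(
  sampleRate: number,
  channels: number,
  bitDepth: number
): number {
  return sampleRate * channels * (bitDepth / 8);
}

console.log(pcmBytesPerSecond(44100, 2, 16)); // 176400 (CD-quality stereo)
console.log(pcmBytesPerSecond(8000, 1, 8)); // 8000 (8 kHz mono, 8 bits per sample)
```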

Audio Encoding

Audio encoding refers to the process of converting audio data into a format that can be easily stored, transmitted, and decoded by audio playback devices. This process often involves compression to reduce file size while trying to maintain the quality of the original audio. There are several popular audio encoding formats, each with its own specific use cases and characteristics.

Here is one example, PCM:

  • Description: PCM (Pulse Code Modulation) is the most straightforward form of digital audio encoding. It represents the amplitude of the audio signal at uniformly spaced intervals.

  • Usage: It’s the standard form of digital audio in computers, CDs, digital telephony, and other digital audio applications.

Note that audio encoding is not the same as audio format. An audio format refers to the entire structure of the audio file, which includes the encoding, but also encompasses other elements like metadata, file headers, and containers. For example, a WAV file typically uses PCM encoding, and has its own header that specifies audio sample rate, number of samples, etc.
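To make the distinction concrete, here is a minimal sketch that reads a few fields from a WAV header. It assumes the simplest layout, a 44-byte header in which the "fmt " chunk immediately follows the RIFF/WAVE tags; real files may carry extra chunks that this sketch does not handle. The names `WavFormat` and `readCanonicalWavHeader` are hypothetical.

```typescript
interface WavFormat {
  channels: number;
  sampleRate: number;
  bitsPerSample: number;
}

// Read format fields from a canonical 44-byte WAV header (assumed layout).
function readCanonicalWavHeader(bytes: Uint8Array): WavFormat {
  if (bytes.byteLength < 44) {
    throw new Error("Too short for a canonical 44-byte WAV header");
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // Read a four-character ASCII tag such as "RIFF"
  const tag = (offset: number): string =>
    String.fromCharCode(
      view.getUint8(offset),
      view.getUint8(offset + 1),
      view.getUint8(offset + 2),
      view.getUint8(offset + 3)
    );
  if (tag(0) !== "RIFF" || tag(8) !== "WAVE" || tag(12) !== "fmt ") {
    throw new Error("Not a canonical WAV header");
  }
  return {
    channels: view.getUint16(22, true), // little-endian fields
    sampleRate: view.getUint32(24, true),
    bitsPerSample: view.getUint16(34, true),
  };
}
```

The header is what makes this a WAV file; the samples that follow it are the PCM-encoded audio itself.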

PCM Audio Representation

When audio is played, it is typically decoded into PCM (Pulse Code Modulation). This holds for most digital audio systems, regardless of the original audio format or encoding method.

There are generally two types of PCM audio representation:

  • Float 32 Array: uses a 32-bit floating-point number for each sample, with values nominally between -1.0 and 1.0. In a web environment, PCM is represented in this format when capturing the mic stream and setting up playback.
  • Unsigned 8 Array: uses an array of 8-bit unsigned integers (i.e., bytes), where each sample can span multiple bytes. For example, in mono PCM audio with a bit depth of 16 bits, each sample takes two bytes. This is a lower-level representation and is often used in programming for audio processing.
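As a small illustration of how one sample looks in each representation, the snippet below writes a single 16-bit sample into a two-byte array and reads it back as a float. Little-endian byte order is an assumption here, as is the scaling factor of 32,768.

```typescript
// One 16-bit sample with value 16384 (half of positive full scale)
const bytes = new Uint8Array(2);
new DataView(bytes.buffer).setInt16(0, 16384, true); // true = little-endian
console.log(Array.from(bytes)); // [0, 64], i.e. 0x00 0x40, low byte first

// The same sample as a floating-point value
const asFloat = new DataView(bytes.buffer).getInt16(0, true) / 32768;
console.log(asFloat); // 0.5
```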

Here’s a code snippet to convert between these two formats, assuming 16-bit little-endian samples:

export function convertUnsigned8ToFloat32(array: Uint8Array): Float32Array {
  // Each 16-bit sample occupies two bytes; a trailing odd byte is ignored
  const targetArray = new Float32Array(Math.floor(array.byteLength / 2));

  // A DataView is used to read our 16-bit little-endian samples
  // out of the Uint8Array buffer. Passing byteOffset and byteLength keeps
  // this correct when the Uint8Array is a view into a larger buffer.
  const sourceDataView = new DataView(array.buffer, array.byteOffset, array.byteLength);

  // Loop through, get values, and divide by 32,768 to scale into [-1, 1)
  for (let i = 0; i < targetArray.length; i++) {
    targetArray[i] = sourceDataView.getInt16(i * 2, true) / 32768;
  }
  return targetArray;
}

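export function convertFloat32ToUnsigned8(array: Float32Array): Uint8Array {
  const buffer = new ArrayBuffer(array.length * 2);
  const view = new DataView(buffer);

  for (let i = 0; i < array.length; i++) {
    // Scale to the 16-bit range, then clamp: without the clamp, an input of
    // 1.0 becomes 32768, which wraps around to -32768 in setInt16
    const scaled = Math.round(array[i] * 32768);
    const value = Math.max(-32768, Math.min(32767, scaled));
    view.setInt16(i * 2, value, true); // true for little-endian
  }

  return new Uint8Array(buffer);
}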

Audio Specs Retell AI Uses

  • Phone Calls: internally uses μ-law encoded, 8,000 Hz mono raw audio.
  • Web Calls: uses mono raw 16-bit PCM in an unsigned 8 array representation; the user can specify a sample rate between 8,000 and 44,100 Hz.
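For readers unfamiliar with μ-law, the idea is to compress amplitudes logarithmically so that quiet sounds keep more resolution when stored in few bits. The sketch below shows the continuous companding curve with μ = 255 on normalized amplitudes. It is an illustration of the principle only: it is not the byte-level telephony codec, and it is not taken from Retell AI's implementation. The names `MU`, `muLawCompress`, and `muLawExpand` are hypothetical.

```typescript
const MU = 255; // companding parameter commonly used for μ-law

// Compress a normalized amplitude in [-1, 1] along the μ-law curve.
function muLawCompress(x: number): number {
  const clamped = Math.max(-1, Math.min(1, x));
  return (Math.sign(clamped) * Math.log1p(MU * Math.abs(clamped))) / Math.log1p(MU);
}

// Invert the curve to recover the original amplitude.
function muLawExpand(y: number): number {
  return (Math.sign(y) * (Math.pow(1 + MU, Math.abs(y)) - 1)) / MU;
}

console.log(muLawCompress(0.01)); // about 0.23: quiet input is boosted
console.log(muLawExpand(muLawCompress(0.3))); // approximately 0.3
```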