How to Use Fourier Transforms to Analyze Audio Like a Physicist

[Cover image: close-up of a professional audio editing interface with waveform displays.]

As developers, we often interact with data, but sometimes that data takes the form of vibrations in the air: sound. Understanding audio programmatically is a powerful skill, and at its heart lies a mathematical tool that physicists have relied on for over two centuries: the Fourier Transform.

This post will guide you through using Fourier Transforms (specifically, the Fast Fourier Transform or FFT) to dissect audio signals. We’ll approach this with a “physicist’s” mindset – focusing on the underlying principles, practical implementation, and clear interpretation of results, all with runnable Python examples.

The Core Idea: Decomposing Sound

Imagine a complex piece of music. To our ears, it’s a rich, interwoven tapestry of sounds. But to a physicist (or a Fourier Transform), it’s just a sum of simple, pure sine waves, each with a specific frequency, amplitude, and phase.

The Fourier Transform is a mathematical operation that takes a signal from the time domain (how its amplitude changes over time) and converts it into the frequency domain (what frequencies are present in the signal, and at what strength).

Why is this useful for audio? Because our perception of pitch is directly related to frequency, and loudness to amplitude. By transforming an audio signal, we can answer questions like:

  • What is the dominant pitch?
  • Are there multiple instruments playing specific notes?
  • Is there background noise, and what kind?
  • How does the sound’s frequency content change over time?

The workhorse for practical Fourier Transforms on discrete (digital) data is the Fast Fourier Transform (FFT), an efficient algorithm for computing the Discrete Fourier Transform (DFT).
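
Before moving on, it helps to see exactly what the FFT computes. Below is a minimal sketch of the DFT straight from its definition, X[k] = sum_n x[n] * exp(-2j * pi * k * n / N), checked against numpy's result (the random test signal is purely illustrative):

import numpy as np

def naive_dft(x):
    """O(N^2) DFT from the definition; np.fft.fft computes the same values in O(N log N)."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape((N, 1))  # column vector, so k * n broadcasts to an N x N matrix
    return np.exp(-2j * np.pi * k * n / N) @ x

x = np.random.default_rng(0).standard_normal(512)
print(np.allclose(naive_dft(x), np.fft.fft(x)))  # True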

Prerequisites and Setup

You’ll need a basic Python environment. We’ll rely on the following libraries:

  • numpy: For numerical operations, especially array manipulation and the FFT itself.
  • scipy: For additional signal processing tools, like audio file reading and more advanced transforms (STFT).
  • matplotlib: For plotting our frequency spectra.

You can install them via pip:

pip install numpy scipy matplotlib

For our initial examples, we’ll generate audio signals programmatically to ensure reproducibility and simplicity. Later, we’ll show how to load a real .wav file.

1. Generating a Simple Audio Signal

Let’s start by creating a pure sine wave, which is the simplest possible sound. This will be our “test tone.”

import numpy as np
import matplotlib.pyplot as plt
from scipy.io.wavfile import write as write_wav # For saving audio, optional

# Define parameters for our audio
samplerate = 44100  # samples per second (Hz) - standard for CD quality audio
duration = 1.0      # seconds
frequency = 440.0   # Hz (A4 note)

# Generate time array
t = np.linspace(0., duration, int(samplerate * duration), endpoint=False)

# Generate the sine wave signal.
# An amplitude of 0.5 keeps values comfortably inside [-1, 1],
# the standard normalized range for floating-point audio.
amplitude = 0.5
signal = amplitude * np.sin(2 * np.pi * frequency * t)

# --- Optional: Save to WAV file to hear it ---
# Scale to 16-bit integer range (common for WAV files)
# Remember to normalize to avoid clipping if your amplitude is > 1
scaled_signal = np.int16(signal * 32767)
# write_wav("a4_sine.wav", samplerate, scaled_signal)

print(f"Generated signal with shape: {signal.shape}")
print(f"Sample rate: {samplerate} Hz")
print(f"Duration: {duration} seconds")
print(f"First 10 samples: {signal[:10]}")
Generated signal with shape: (44100,)
Sample rate: 44100 Hz
Duration: 1.0 seconds
First 10 samples: [ 0.          0.12450849  0.24584288  0.36098059  0.46700147  0.5611481
  0.64082531  0.70462061  0.75134709  0.78010834]

This code generates a 1-second pure tone at 440 Hz (A4), sampled at 44100 Hz. The signal array now holds the amplitude values over time.

2. Performing the Fast Fourier Transform (FFT)

Now for the main event: applying the FFT. numpy.fft.fft is our primary tool.

import numpy as np
import matplotlib.pyplot as plt

# --- Re-use the signal generation from above for context ---
samplerate = 44100
duration = 1.0
frequency = 440.0
t = np.linspace(0., duration, int(samplerate * duration), endpoint=False)
signal = 0.5 * np.sin(2 * np.pi * frequency * t)
# --- End re-use ---

# Perform the FFT
# The FFT output will be complex numbers.
# N is the number of samples in the signal.
N = len(signal)
yf = np.fft.fft(signal)

# Calculate the frequencies corresponding to the FFT output
# np.fft.fftfreq returns the sample frequencies.
# The first half of the array holds 0 Hz (DC) and the positive frequencies;
# the second half holds the negative frequencies (the conjugate mirror).
xf = np.fft.fftfreq(N, 1 / samplerate)

print(f"FFT output shape: {yf.shape}")
print(f"Frequency axis shape: {xf.shape}")
print(f"First 5 FFT magnitudes: {np.abs(yf[:5])}")
print(f"First 5 frequencies: {xf[:5]}")
FFT output shape: (44100,)
Frequency axis shape: (44100,)
First 5 FFT magnitudes: [2.97746197e-12 1.48873099e-12 5.09700305e-13 1.15783350e-12
 3.65935639e-12]
First 5 frequencies: [ 0.  1.  2.  3.  4.]

Understanding the Output:

  • yf: This array contains the complex-valued output of the FFT. Each complex number represents the amplitude and phase of a sine wave at a specific frequency.
  • xf: This array contains the frequencies (in Hz) that correspond to each bin in yf.

Important Notes:

  1. Symmetry: The FFT output for a real-valued signal (like audio) is conjugate-symmetric: the second half of yf mirrors the first half with negated phase (representing negative frequencies), so its magnitudes are identical. For most audio analysis, we only care about the positive frequencies, up to the Nyquist frequency.
  2. Nyquist Frequency: The highest frequency that can be accurately represented in a digital signal is half of the sampling rate (samplerate / 2). This is known as the Nyquist frequency. For our 44100 Hz sample rate, the Nyquist frequency is 22050 Hz. (A short demonstration follows this list.)
  3. Magnitude: To get the “strength” or amplitude of each frequency component, we take the absolute value of the complex FFT output: np.abs(yf).
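
To see aliasing (point 2) in action, here is a minimal sketch; the 30 kHz tone is an illustrative choice well above Nyquist:

import numpy as np

samplerate = 44100  # Nyquist frequency: 22050 Hz
t = np.linspace(0., 1.0, samplerate, endpoint=False)

# A 30 kHz tone exceeds Nyquist, so it aliases down to |30000 - 44100| = 14100 Hz
alias_tone = np.sin(2 * np.pi * 30000 * t)

yf = np.fft.fft(alias_tone)
xf = np.fft.fftfreq(len(alias_tone), 1 / samplerate)
peak_idx = np.argmax(np.abs(yf[:len(alias_tone) // 2]))
print(f"Apparent frequency: {xf[peak_idx]:.0f} Hz")  # 14100 Hz, not 30000 Hz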

3. Interpreting and Plotting the Frequency Spectrum

Now let’s visualize our results. We’ll plot the magnitude (amplitude) of the FFT against its corresponding frequencies.

import numpy as np
import matplotlib.pyplot as plt

# --- Re-use signal and FFT calculation ---
samplerate = 44100
duration = 1.0
frequency = 440.0
t = np.linspace(0., duration, int(samplerate * duration), endpoint=False)
signal = 0.5 * np.sin(2 * np.pi * frequency * t)

N = len(signal)
yf = np.fft.fft(signal)
xf = np.fft.fftfreq(N, 1 / samplerate)
# --- End re-use ---

# Plotting the magnitude spectrum
plt.figure(figsize=(12, 6))

# Plot only the positive frequencies (up to Nyquist frequency)
# Use slice `[:N//2]` for both frequency and magnitude arrays.
plt.plot(xf[:N//2], np.abs(yf[:N//2]))

plt.title('Frequency Spectrum of a 440 Hz Sine Wave')
plt.xlabel('Frequency (Hz)')
plt.ylabel('Amplitude (Magnitude)')
plt.grid(True)
plt.xlim(0, 1000) # Limit x-axis to zoom in on the expected frequency
plt.show()

# Find the peak frequency
# Note: for a pure sine wave of amplitude A, the peak magnitude is A * N / 2,
# because the FFT splits the wave's energy between its positive- and
# negative-frequency bins.
# Let's find the frequency with the maximum amplitude in the positive spectrum.
peak_idx = np.argmax(np.abs(yf[1:N//2])) + 1 # +1 because we start from index 1 (excluding DC component)
peak_frequency = xf[peak_idx]
peak_magnitude = np.abs(yf[peak_idx])

print(f"Identified peak frequency: {peak_frequency:.2f} Hz")
print(f"Peak magnitude: {peak_magnitude:.2f} (Expected for 0.5 amplitude: {N/2 * 0.5:.2f})")
Identified peak frequency: 440.00 Hz
Peak magnitude: 11025.00 (Expected for 0.5 amplitude: 11025.00)

[A plot window would open, showing a single sharp peak at 440 Hz on the frequency axis.]

The plot clearly shows a single, sharp peak at 440 Hz, confirming our generated signal. This is exactly what we expect from a pure sine wave. The height of the peak (magnitude) is proportional to the amplitude of the sine wave and the number of samples.
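
Since audio signals are real-valued, numpy also offers np.fft.rfft, which computes only the non-negative-frequency half directly and skips the redundant mirror. A minimal sketch of the same analysis using it:

import numpy as np

samplerate = 44100
t = np.linspace(0., 1.0, samplerate, endpoint=False)
signal = 0.5 * np.sin(2 * np.pi * 440.0 * t)

# rfft returns N//2 + 1 bins; rfftfreq gives the matching non-negative frequencies
yf = np.fft.rfft(signal)
xf = np.fft.rfftfreq(len(signal), 1 / samplerate)

peak_idx = np.argmax(np.abs(yf[1:])) + 1  # skip the DC bin at index 0
print(f"Peak: {xf[peak_idx]:.2f} Hz")  # 440.00 Hz
print(f"Recovered amplitude: {2 * np.abs(yf[peak_idx]) / len(signal):.2f}")  # 0.50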

4. Analyzing More Complex Sounds

Real-world audio is rarely a pure sine wave. Let’s analyze a signal composed of two different sine waves.

import numpy as np
import matplotlib.pyplot as plt

samplerate = 44100
duration = 1.0
freq1 = 440.0
freq2 = 880.0 # An octave higher (A5 note)
amplitude1 = 0.4
amplitude2 = 0.3

t = np.linspace(0., duration, int(samplerate * duration), endpoint=False)
signal = amplitude1 * np.sin(2 * np.pi * freq1 * t) + \
         amplitude2 * np.sin(2 * np.pi * freq2 * t)

N = len(signal)
yf = np.fft.fft(signal)
xf = np.fft.fftfreq(N, 1 / samplerate)

plt.figure(figsize=(12, 6))
plt.plot(xf[:N//2], np.abs(yf[:N//2]))
plt.title('Frequency Spectrum of Two Sine Waves (440 Hz & 880 Hz)')
plt.xlabel('Frequency (Hz)')
plt.ylabel('Amplitude (Magnitude)')
plt.grid(True)
plt.xlim(0, 1000) # Zoom in to see both peaks
plt.show()

# Find the two highest peaks
magnitudes = np.abs(yf[1:N//2]) # Exclude DC component (index 0)
sorted_indices = np.argsort(magnitudes)[::-1] # Sort in descending order of magnitude

print(f"Top 2 peak frequencies and magnitudes:")
for i in range(2):
    peak_idx_relative = sorted_indices[i] # Index relative to magnitudes array
    peak_idx_absolute = peak_idx_relative + 1 # Actual index in xf
    print(f"- Freq: {xf[peak_idx_absolute]:.2f} Hz, Mag: {magnitudes[peak_idx_relative]:.2f}")
Top 2 peak frequencies and magnitudes:
- Freq: 440.00 Hz, Mag: 8820.00
- Freq: 880.00 Hz, Mag: 6615.00

[A plot window would open, showing two sharp peaks: one at 440 Hz and another at 880 Hz. The 440 Hz peak would be taller.]

As expected, we now see two distinct peaks at 440 Hz and 880 Hz, with their magnitudes reflecting the original amplitudes (scaled by N/2). This demonstrates the power of the Fourier Transform to identify individual components within a composite signal.
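
For signals with more than a couple of components, hand-sorting magnitudes becomes brittle. scipy.signal.find_peaks is a more general tool; here is a minimal sketch on the same two-tone signal (the 10% height threshold is an arbitrary illustrative choice):

import numpy as np
from scipy.signal import find_peaks

samplerate = 44100
t = np.linspace(0., 1.0, samplerate, endpoint=False)
signal = 0.4 * np.sin(2 * np.pi * 440.0 * t) + 0.3 * np.sin(2 * np.pi * 880.0 * t)

N = len(signal)
magnitudes = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(N, 1 / samplerate)

# Keep only peaks at least 10% as tall as the strongest bin
peaks, _ = find_peaks(magnitudes, height=0.1 * magnitudes.max())
for idx in peaks:
    print(f"{freqs[idx]:.2f} Hz (amplitude ~ {2 * magnitudes[idx] / N:.2f})")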

Analyzing a Real WAV File

While generating signals is great for understanding, you’ll eventually want to analyze actual audio. scipy.io.wavfile is a good starting point for simple .wav files. For more robust audio loading (e.g., various formats, deeper control), consider the soundfile library; a short sketch with it closes out this section.

Let’s assume you have an audio file named my_audio.wav. (If you don’t, uncomment the write_wav call in Section 1 to create a4_sine.wav.)

import numpy as np
import matplotlib.pyplot as plt
from scipy.io.wavfile import read as read_wav # For reading audio

# Path to your audio file
audio_file_path = "a4_sine.wav" # Replace with your actual file path

try:
    samplerate, data = read_wav(audio_file_path)
except FileNotFoundError:
    print(f"Error: '{audio_file_path}' not found. Please create or provide a valid WAV file.")
    print("Hint: You can uncomment and run the `write_wav` line in Section 1 to create 'a4_sine.wav'.")
    exit()

# If stereo, take only one channel (e.g., the first channel)
if data.ndim > 1:
    data = data[:, 0]

# Normalize data to float between -1 and 1 if it's integer type
if data.dtype == np.int16:
    data = data.astype(np.float32) / 32768.0
elif data.dtype == np.int32:
    data = data.astype(np.float32) / 2147483648.0 # Max value for int32

signal = data
N = len(signal)

# Check if signal is long enough for meaningful FFT
if N == 0:
    print("Error: Audio file is empty.")
    exit()
if N < 2: # Need at least 2 samples for FFT
    print("Error: Audio file too short for FFT.")
    exit()

yf = np.fft.fft(signal)
xf = np.fft.fftfreq(N, 1 / samplerate)

plt.figure(figsize=(12, 6))
plt.plot(xf[:N//2], np.abs(yf[:N//2]))
plt.title(f'Frequency Spectrum of {audio_file_path}')
plt.xlabel('Frequency (Hz)')
plt.ylabel('Amplitude (Magnitude)')
plt.grid(True)
plt.xlim(0, samplerate / 2) # Show full spectrum up to Nyquist
plt.ylim(bottom=0) # Ensure y-axis starts at 0
plt.show()

# Find the overall dominant frequency
peak_idx = np.argmax(np.abs(yf[1:N//2])) + 1 # +1 because we start from index 1
dominant_frequency = xf[peak_idx]
print(f"Dominant frequency in '{audio_file_path}': {dominant_frequency:.2f} Hz")
Dominant frequency in 'a4_sine.wav': 440.00 Hz

[A plot window would open, showing the frequency spectrum of the loaded WAV file,
which for 'a4_sine.wav' would again be a peak at 440 Hz.]

This example shows how to load, normalize, and analyze an actual audio file. The normalization step is crucial as audio files often store data as integers, which need to be converted to floats for standard signal processing.
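
If you have the soundfile library installed (pip install soundfile), the loading and normalization steps collapse into a single call. A minimal sketch, assuming the a4_sine.wav file from Section 1 exists:

import soundfile as sf

# sf.read returns float64 data already scaled to [-1.0, 1.0] for integer WAVs
data, samplerate = sf.read("a4_sine.wav")
if data.ndim > 1:
    data = data[:, 0]  # stereo: keep only the first channel
print(f"Loaded {len(data)} samples at {samplerate} Hz")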

5. Going Deeper: The Short-Time Fourier Transform (STFT)

A single FFT gives you the overall frequency content of an entire audio clip. But what if the frequencies change over time, like in music or speech? A single FFT would average everything out, losing the temporal information.

This is where the Short-Time Fourier Transform (STFT) comes in. Instead of processing the entire signal at once, the STFT:

  1. Divides the signal into small, overlapping “frames” or “windows.”
  2. Applies an FFT to each frame.
  3. Stacks these individual FFTs to create a 2D representation: a spectrogram.

A spectrogram shows how the frequency content of a signal changes over time. Time is typically on the x-axis, frequency on the y-axis, and color represents the magnitude (amplitude) of each frequency at each time point.

scipy.signal.stft is designed for this.

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import stft
from scipy.io.wavfile import write as write_wav

# Generate a signal with changing frequency (frequency sweep or chirp)
samplerate = 44100
duration = 4.0
t = np.linspace(0, duration, int(samplerate * duration), endpoint=False)

# Sweep linearly from 200 Hz to 2000 Hz
start_freq = 200
end_freq = 2000
# For a linear chirp, integrate the instantaneous frequency to get the phase.
# (Writing sin(2*pi*f(t)*t) with f(t) linear in t would actually sweep twice
# as far, ending at 3800 Hz rather than 2000 Hz.)
phase = 2 * np.pi * (start_freq * t + (end_freq - start_freq) / (2 * duration) * t**2)
signal_chirp = 0.5 * np.sin(phase)

# Save the chirp to WAV to hear it
# scaled_chirp = np.int16(signal_chirp * 32767)
# write_wav("chirp.wav", samplerate, scaled_chirp)

# --- Perform STFT ---
# nperseg: length of each segment (window)
# noverlap: number of points to overlap between segments
# Note: Larger nperseg gives better frequency resolution but worse time resolution.
#       Smaller nperseg gives better time resolution but worse frequency resolution.
#       A common choice for nperseg is a power of 2, e.g., 1024 or 2048.
#       Overlap is often nperseg // 2 or 3 * nperseg // 4.

f, t_stft, Zxx = stft(signal_chirp, fs=samplerate, nperseg=2048, noverlap=2048//2)

# Zxx contains complex values. We need the magnitude (amplitude)
# Convert magnitude to decibels (dB) for better visualization of dynamic range
# dB = 20 * log10(amplitude)
# Add a small epsilon to avoid log(0)
spectrogram_magnitude = np.abs(Zxx)
spectrogram_db = 20 * np.log10(spectrogram_magnitude + 1e-10)

plt.figure(figsize=(14, 7))
plt.pcolormesh(t_stft, f, spectrogram_db, shading='gouraud', cmap='viridis')
plt.title('Spectrogram of a Frequency Chirp')
plt.ylabel('Frequency (Hz)')
plt.xlabel('Time (s)')
plt.ylim(0, 3000) # Limit Y axis for better view
plt.colorbar(label='Magnitude (dB)')
plt.show()

print(f"STFT frequency bins shape: {f.shape}")
print(f"STFT time bins shape: {t_stft.shape}")
print(f"Spectrogram Zxx shape: {Zxx.shape}")
STFT frequency bins shape: (1025,)
STFT time bins shape: (172,)
Spectrogram Zxx shape: (1025, 172)

[A plot window would open, showing a spectrogram. It would display a clear,
diagonal line starting at low frequency and low time, sweeping upwards
to high frequency and high time, indicating the increasing frequency over duration.]

The spectrogram clearly visualizes the frequency sweeping from 200 Hz to 2000 Hz over 4 seconds. This is an incredibly powerful tool for analyzing speech, music, and environmental sounds where frequency content is dynamic.

Note on Windowing: The nperseg parameter defines the window size. When you extract a segment of a signal for FFT, you’re effectively multiplying it by a “window function” (scipy.signal.stft applies a Hann window by default; Hamming and Blackman are common alternatives). This reduces spectral leakage, an artifact where energy from a strong frequency component “leaks” into neighboring frequency bins. Different window types trade sidelobe suppression against main-lobe width; the sketch below shows the effect.
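
Here is a minimal sketch of spectral leakage (the 445 Hz tone and 0.1 s duration are chosen so the tone falls between the 10 Hz-wide frequency bins; both values are illustrative):

import numpy as np

samplerate = 44100
duration = 0.1  # 0.1 s of signal -> FFT bins spaced 10 Hz apart
t = np.linspace(0., duration, int(samplerate * duration), endpoint=False)
signal = np.sin(2 * np.pi * 445.0 * t)  # 445 Hz lies between the 440 and 450 Hz bins

rect = np.abs(np.fft.rfft(signal))                            # rectangular (no) window
hann = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))  # Hann window

# Compare the energy that "leaks" into bins more than 100 Hz from the tone
freqs = np.fft.rfftfreq(len(signal), 1 / samplerate)
far = np.abs(freqs - 445.0) > 100
print(f"Leakage (rectangular window): {rect[far].sum():.1f}")
print(f"Leakage (Hann window):        {hann[far].sum():.1f}")  # far smaller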

Limitations and Caveats

While powerful, Fourier Transforms aren’t a silver bullet. Be aware of these common pitfalls:

  • Resolution Trade-offs:
    • Frequency Resolution: Better frequency resolution (finer detail in frequencies) requires longer nperseg (more samples per FFT), meaning you analyze larger chunks of time.
    • Time Resolution: Better time resolution (more precise timing of events) requires shorter nperseg (fewer samples per FFT), meaning you get coarser frequency detail.
    • You always have to balance these two for STFT. A physicist would call this the uncertainty principle of signal processing (made quantitative in the note after this list).
  • Windowing Effects: As mentioned, windowing functions are essential, but they subtly modify the spectrum. For a pure sine wave, a rectangular window (no windowing) will show “sidelobes” around the main peak due to spectral leakage. Other windows reduce sidelobes at the expense of slightly wider main lobes.
  • Nyquist-Shannon Sampling Theorem: Your samplerate must be at least twice the highest frequency present in your signal to accurately capture it. If you sample below this, you’ll experience aliasing, where higher frequencies incorrectly appear as lower frequencies.
  • Logarithmic Scale: For human perception and for visualizing wide dynamic ranges in audio (e.g., quiet sounds alongside loud ones), plotting magnitudes in decibels (dB) using a logarithmic scale (20 * np.log10(amplitude)) is usually preferred over a linear scale. It mimics how our ears perceive loudness.
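
To put numbers on the resolution trade-off mentioned above: a window of nperseg samples at sample rate fs spans T = nperseg / fs seconds, and the resulting FFT bins are spaced delta_f = fs / nperseg = 1 / T apart, so delta_f * T = 1 always. For our STFT example (nperseg = 2048 at 44100 Hz), that means roughly 46.4 ms of time blur and 21.5 Hz of frequency spacing. This is the discrete cousin of the Gabor uncertainty relation, sigma_t * sigma_f >= 1 / (4 * pi), which bounds the RMS widths of any signal in time and frequency.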

Conclusion

The Fourier Transform is a cornerstone of digital signal processing, providing a lens to see the hidden frequency components of any time-domain signal. For audio, it’s indispensable for understanding pitch, timbre, and how sounds evolve. By following a physicist’s approach – understanding the underlying math, running practical experiments, and interpreting results carefully – you can unlock deep insights into the audio world.

Experiment with different audio files, adjust parameters like nperseg and noverlap for the STFT, and observe the fascinating changes in the frequency domain. Happy analyzing!
