Audio Analysis Basics

Using SciPy for introductory audio analysis.

Audio is an important component of video that I had yet to explore with Python. To address this, I put together a script that analyzes the audio of various clips, continuing my cut counter experiment on the production style of the shows my kids were watching. The analysis is only an introduction, but it covers several key metrics: duration, RMS (Root Mean Square), Zero Crossing Rate, Dynamic Range, and Beats Per Minute (BPM).

Each of these metrics offers unique insights into the characteristics of the audio. Here's a quick overview:

Duration: Measures the total length of the audio file in seconds. A fundamental metric for comparing pacing across different media types.

RMS (Root Mean Square): Reflects the average level of the audio. It captures the overall energy of the sound signal and serves as a rough proxy for perceived loudness.

Zero Crossing Rate (ZCR): Counts how many times per second the audio waveform crosses zero amplitude. High ZCR values usually indicate high-frequency or noisy content such as sibilance or sharp percussive sounds.

Dynamic Range: Indicates the difference between the loudest and quietest parts of the audio. Higher dynamic range suggests greater contrast in volume, common in cinematic or orchestral audio.

Beats Per Minute (BPM): Measures the tempo of the audio. Higher BPM values reflect faster-paced audio, while lower BPM values suggest calmer or slower-paced tracks.
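As a rough illustration of the first four metrics above, here is a minimal NumPy sketch. The function names, the 50 ms window, and the choice to define dynamic range as the ratio (in dB) between the loudest and quietest non-silent windows are assumptions made for illustration; the script described here may compute them differently.

```python
import numpy as np

def duration_seconds(samples, sample_rate):
    """Length of the clip in seconds."""
    return len(samples) / sample_rate

def rms(samples):
    """Root mean square of the samples (a proxy for average level)."""
    x = np.asarray(samples, dtype=np.float64)
    return float(np.sqrt(np.mean(x ** 2)))

def zero_crossing_rate(samples, sample_rate):
    """Number of sign changes per second."""
    x = np.asarray(samples, dtype=np.float64)
    signs = np.signbit(x)
    crossings = np.count_nonzero(signs[1:] != signs[:-1])
    return crossings / (len(x) / sample_rate)

def dynamic_range_db(samples, sample_rate, window_s=0.05, floor=1e-6):
    """Loudest vs. quietest windowed RMS level, in dB (assumed definition)."""
    x = np.asarray(samples, dtype=np.float64)
    win = max(1, int(round(window_s * sample_rate)))
    n_frames = len(x) // win
    frames = x[: n_frames * win].reshape(n_frames, win)
    levels = np.sqrt(np.mean(frames ** 2, axis=1))
    levels = levels[levels > floor]  # skip digital silence to avoid log(0)
    if levels.size == 0:
        return 0.0
    return float(20 * np.log10(levels.max() / levels.min()))
```

These functions expect a one-dimensional (mono) array of samples. Other definitions of dynamic range, such as peak level versus overall RMS, are equally common and will give different numbers.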

The script takes an audio track, processes it, and calculates the metrics:

Stereo to Mono Conversion: If the audio track is stereo, it is converted to mono for uniform processing.

Normalization: Samples (the individual units of audio data) are normalized to ensure fair comparison across tracks.

Calculations and Output: The metrics are displayed in the terminal for quick insights and logged in a CSV file for future evaluation.
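The three steps above can be sketched roughly as follows, assuming the input is a WAV file. The helper names `load_mono_normalized` and `append_row` are hypothetical, and peak normalization (dividing by the largest absolute sample) is an assumption, since the exact normalization used is not specified here.

```python
import csv
import os

import numpy as np
from scipy.io import wavfile

def load_mono_normalized(path):
    """Read a WAV file, mix it down to mono, and peak-normalize to [-1, 1]."""
    sample_rate, data = wavfile.read(path)
    samples = data.astype(np.float64)
    if samples.ndim == 2:  # stereo data has shape (n_samples, n_channels)
        samples = samples.mean(axis=1)
    peak = np.max(np.abs(samples)) if samples.size else 0.0
    if peak > 0:
        samples = samples / peak  # assumed normalization scheme
    return sample_rate, samples

def append_row(csv_path, row):
    """Append one dict of metrics to a CSV, writing the header on first use."""
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row.keys()))
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Peak normalization makes level metrics relative to each clip's own loudest sample, which is worth keeping in mind when comparing RMS values across tracks.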

I used SciPy's scipy.io module for audio processing. Other libraries such as Essentia and pydub offer robust tools for audio analysis, but installation and compatibility issues led me to choose SciPy. I look forward to exploring Essentia in future projects, as it holds great potential for more advanced analysis.
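SciPy does not ship a ready-made tempo estimator, so BPM has to be derived by hand. One rough approach, sketched below, autocorrelates an energy-based onset envelope and picks the strongest lag within a plausible tempo range. The function name `estimate_bpm`, the hop size, and the 60 to 180 BPM search range are assumptions for illustration, and this is not necessarily the method the script uses.

```python
import numpy as np
from scipy.signal import correlate

def estimate_bpm(samples, sample_rate, hop=512, min_bpm=60.0, max_bpm=180.0):
    """Rough tempo estimate from the autocorrelation of an onset envelope."""
    x = np.asarray(samples, dtype=np.float64)
    n_frames = len(x) // hop
    if n_frames < 4:
        return 0.0
    # Short-time energy, one value per hop-sized frame.
    energy = np.sum(x[: n_frames * hop].reshape(n_frames, hop) ** 2, axis=1)
    # Onset strength: keep only increases in energy between frames.
    onset = np.maximum(np.diff(energy), 0.0)
    onset = onset - onset.mean()
    # Autocorrelation, keeping non-negative lags only.
    ac = correlate(onset, onset, mode="full", method="fft")[len(onset) - 1:]
    frames_per_second = sample_rate / hop
    lag_min = max(1, int(np.floor(frames_per_second * 60.0 / max_bpm)))
    lag_max = min(len(ac) - 1, int(np.ceil(frames_per_second * 60.0 / min_bpm)))
    if lag_max <= lag_min:
        return 0.0
    best_lag = lag_min + int(np.argmax(ac[lag_min: lag_max + 1]))
    return 60.0 * frames_per_second / best_lag
```

Estimates like this are prone to octave errors (reporting half or double the true tempo) and are unreliable on dialogue-heavy clips with no steady beat, so the result is best read as a pacing indicator rather than a precise tempo.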

The goal was to analyze the audio of video clips, including films and kids’ shows (live-action and animation). My initial premise, that older productions have subtler audio and visual stimulation, held up. However, the results also came with a few surprises.

One notable finding was that Zero Crossing Rate (ZCR) appears to be a useful indicator of audio fatigue. High ZCR values correspond to high-frequency content, or shrill tones, that can make audio grating to listen to over extended periods. Among the kids’ shows analyzed, modern productions generally had higher ZCR values, indicating sharper, more frequent high-frequency content.

Here is the list of the kids’ shows analyzed, along with their respective years:

  • Blippi (2014)
  • Walt Disney: Casey at the Bat (1946)
  • Walt Disney: Silly Symphonies (1935)
  • Ninja Turtles (1987)
  • Puppy Dog Pals (2017)
  • Bluey (2018)
  • StoryBots (2015)
  • Recess (1997)
  • Power Rangers (1993)

Findings:

Preferred Shows for Kids: My personal preference for Bluey and StoryBots was validated. Both shows scored well across metrics, balancing pacing and dynamic range while avoiding excessively high ZCR values.

Lowest Zero Crossing Rate: The winner for the lowest ZCR was Walt Disney’s Silly Symphonies (1935). Its audio, characterized by smooth transitions and rich orchestration, stood out for its calm and pleasant soundscape.

Modern vs. Classic Shows: Older productions, like Silly Symphonies and Casey at the Bat, demonstrated lower audio stimulation and more balanced dynamics. Modern productions tended to favor higher ZCR values, likely due to their fast-paced, high-energy design aimed at capturing and holding the attention of younger viewers.

This project has highlighted the differences in audio design across decades, offering a quantifiable way to evaluate the pacing, energy, and overall auditory experience of video content. Beyond personal preferences, the metrics provide a deeper understanding of how audio affects viewer engagement and comfort.

For parents, choosing shows like Bluey, StoryBots, or classic Disney productions is likely to mean a more pleasant auditory experience for both kids and adults. On the other hand, shows with high ZCR and fast BPM might engage kids momentarily but could lead to fatigue over time.

This project has opened the door to exploring more advanced audio processing techniques in the future. Before wrapping up, I put together a quick script that generates an SRT subtitle file displaying RMS and ZCR values, paired with a classic example of dynamics: In the Hall of the Mountain King by Edvard Grieg, as performed by the London Symphony Orchestra. It serves as a visual explanation of these two metrics. Notice how the ZCR value rises when the cymbals come in.
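For readers who want to try something similar, here is a minimal sketch of how windowed RMS and ZCR values could be written out as SRT cues. The function names, the one-second window, and the cue text layout are assumptions for illustration rather than the script used for the video.

```python
import numpy as np

def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def metrics_to_srt(samples, sample_rate, window_s=1.0):
    """Build SRT text with one cue per window showing RMS and ZCR."""
    x = np.asarray(samples, dtype=np.float64)
    win = max(1, int(round(window_s * sample_rate)))
    cues = []
    for index, start in enumerate(range(0, len(x), win), start=1):
        chunk = x[start:start + win]
        rms = np.sqrt(np.mean(chunk ** 2))
        signs = np.signbit(chunk)
        crossings = np.count_nonzero(signs[1:] != signs[:-1])
        zcr = crossings / (len(chunk) / sample_rate)
        t0 = start / sample_rate
        t1 = (start + len(chunk)) / sample_rate
        cues.append(
            f"{index}\n{srt_timestamp(t0)} --> {srt_timestamp(t1)}\n"
            f"RMS: {rms:.3f}  ZCR: {zcr:.0f}/s\n"
        )
    return "\n".join(cues)
```

Writing the returned string to a `.srt` file next to the video lets most players overlay the values as subtitles during playback.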
