
Introduction to Audio Processing for Machine Learning


Data download & exploration

We'll use a sample audio dataset from Open Speech and Language Resources (OpenSLR) for our analysis. Let's begin by downloading and unzipping the data:

In [1]:
# Only on Linux, Mac and Windows WSL
!rm -rf openslr-sample.tgz openslr-sample
!wget -O openslr-sample.tgz
!tar -zxf openslr-sample.tgz
!rm openslr-sample.tgz
--2019-06-16 10:11:23--  Resolving... Connecting... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/dl/u5lkq2f64ljw7au/openslr-sample.tgz [following]
--2019-06-16 10:11:23--  Reusing existing connection.
HTTP request sent, awaiting response... 302 Found
--2019-06-16 10:11:24--  Resolving... Connecting... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10463024 (10.0M) [application/binary]
Saving to: 'openslr-sample.tgz'

openslr-sample.tgz  100%[===================>]   9.98M  7.58MB/s    in 1.3s

2019-06-16 10:11:26 (7.58 MB/s) - 'openslr-sample.tgz' saved [10463024/10463024]

We can listen to audio files directly within Jupyter using a display widget.

In [2]:
import os
from IPython.display import Audio

DATA_DIR = 'openslr-sample'
In [3]:

audio_files = os.listdir(DATA_DIR)
len(audio_files), audio_files[:10]
In [4]:
example = DATA_DIR + "/" + audio_files[0]
In [5]:
Audio(DATA_DIR + "/" + audio_files[1])
In [7]:
Audio(DATA_DIR + "/" + audio_files[23])

Audio signals & sampling

We'll use the librosa library to process and play around with audio files.

In [8]:
import librosa
In [9]:
y, sr = librosa.load(example, sr=None)
In [10]:
print("Sample rate  :", sr)
print("Signal Length:", len(y))
print("Duration     :", len(y)/sr, "seconds")
Sample rate  : 16000
Signal Length: 60800
Duration     : 3.8 seconds

Audio is a continuous wave that is "sampled" by measuring the amplitude of the wave at regular points in time. The number of samples taken per second is called the "sample rate" and can be thought of as the resolution of the audio. The higher the sample rate, the closer our discrete digital representation is to the true continuous sound wave. Sample rates generally range from 8,000 to 44,100 Hz but can go higher or lower.
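
To make this concrete, here's a minimal sketch (the 100 Hz tone and the two sample rates are arbitrary choices for illustration): the same one-second tone is represented by more samples at a higher sample rate.

```python
import numpy as np

freq = 100  # an arbitrary 100 Hz tone for illustration
for sr in (8000, 44100):
    t = np.linspace(0, 1, sr, endpoint=False)  # 1 second of evenly spaced sample times
    y = np.sin(2 * np.pi * freq * t)           # sample the continuous wave at those times
    print(sr, len(y))                          # one amplitude measurement per time step
```

The array holding the 44,100 Hz version is over five times longer for the same one second of sound, which is exactly the "resolution" trade-off described above.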

Our signal is just a numpy array with the amplitude of the wave.

In [11]:
print("Type  :", type(y))
print("Signal: ", y)
print("Shape :", y.shape)
Type  : <class 'numpy.ndarray'>
Signal: [0. 0. 0. ... 0.00048828 0.00048828 0.00045776]
Shape : (60800,)

We can also display and play a numpy array using the Audio widget.

In [12]:
Audio(y, rate=sr)

Let's try some experiments now. Can you guess what the following will sound like?

In [13]:
Audio(y, rate=sr/2)
In [14]:
Audio(y, rate=sr*2)
In [15]:
y_new, sr_new = librosa.load(example, sr=sr*2)
Audio(y_new, rate=sr_new)
In [16]:
y_new, sr_new = librosa.load(example, sr=sr/2)
Audio(y_new, rate=sr_new)

Waveforms, amplitude vs magnitude

A waveform is a curve showing the amplitude of the soundwave (y-axis) at time T (x-axis). Let's check out the waveform of our audio clip.

In [17]:
%matplotlib inline
import matplotlib.pyplot as plt
import librosa.display
In [18]:
plt.figure(figsize=(15, 5))
librosa.display.waveplot(y, sr=sr);
Notebook Image

Amplitude and magnitude are often confused, but the difference is simple: the amplitude of a wave is its distance, positive or negative, from equilibrium (zero in our case), and magnitude is the absolute value of the amplitude. In audio, we sample the amplitude.
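
A one-line illustration of the distinction (the sample values here are made up):

```python
import numpy as np

amplitude = np.array([-0.5, 0.0, 0.25])  # signed distance from equilibrium (what we sample)
magnitude = np.abs(amplitude)            # absolute value of the amplitude
print(magnitude)
```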

Frequency and Pitch

Most of us remember frequency from physics as cycles per second of a wave. It's the same for sound, but really hard to see in the above image. How many cycles are there? How can there be cycles if it's not regular? The reality is that sound is extremely complex, and the above recording of human speech is the combination of many different frequencies added together. To talk about frequency and pitch, it's easier to start with a pure tone, so let's make one.

Human hearing ranges from 20 Hz to 20,000 Hz (Hz = hertz = cycles per second). The higher the frequency, the more cycles per second, and the "higher" the pitch sounds to us. To demonstrate, let's make a sound at 500 Hz and another at 5000 Hz.

In [19]:
import numpy as np
In [20]:
# Adapted from
# An amazing open-source resource, especially if music is your sub-domain.
def make_tone(freq, clip_length=1, sr=16000):
    t = np.linspace(0, clip_length, int(clip_length*sr), endpoint=False)
    return 0.1*np.sin(2*np.pi*freq*t)
clip_500hz = make_tone(500)
clip_5000hz = make_tone(5000)
In [21]:
Audio(clip_500hz, rate=sr)
In [22]:
Audio(clip_5000hz, rate=sr)

500 cycles per second at 16,000 samples per second means 1 cycle = 16000/500 = 32 samples. Let's see 2 cycles (64 samples).

In [23]:
plt.figure(figsize=(15, 5))
plt.plot(clip_500hz[0:64]);  # first 64 samples = 2 full cycles at 500 Hz
Notebook Image

Now let's look at 5000Hz.

In [24]:
plt.figure(figsize=(15, 5))
plt.plot(clip_5000hz[0:64]);  # the same 64-sample window packs in 20 cycles at 5000 Hz
Notebook Image

Now let's put the two sounds together.

In [25]:
plt.figure(figsize=(15, 5))
plt.plot((clip_500hz + clip_5000hz)[0:64]);
Notebook Image
In [26]:
Audio(clip_500hz + clip_5000hz, rate=sr)

Pitch is a musical term for the human perception of frequency. This focus on perception instead of measured values may seem vague and unscientific, but it is hugely important for machine learning, because most of what we're interested in (speech, sound classification, music, etc.) is inseparable from human hearing and how it functions.

Let's do an experiment: increase the frequency of the above tones by 500 Hz each and see how much this moves our perception of them.

In [27]:
clip_500_to_1000 = np.concatenate([make_tone(500), make_tone(1000)])
clip_5000_to_5500 = np.concatenate([make_tone(5000), make_tone(5500)])
In [28]:
# first half of the clip is 500hz, 2nd is 1000hz
Audio(clip_500_to_1000, rate=sr)
In [29]:
# first half of the clip is 5000hz, 2nd is 5500hz
Audio(clip_5000_to_5500, rate=sr)

Notice that the pitch of the first clip seems to change more, even though both were shifted by the same amount. This makes intuitive sense: the frequency of the first was doubled, while the frequency of the second increased by only 10%. Like other forms of human perception, hearing is not linear but logarithmic. This will be important later, as the range of frequencies from 100-200 Hz conveys about as much information to us as the range from 10,000-20,000 Hz.
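
One way to quantify this (a sketch, using the octave, i.e. a doubling of frequency, as the perceptual unit): the perceived shift is roughly proportional to the logarithm of the frequency ratio, not to the difference in Hz.

```python
import math

# Perceived pitch shift is roughly proportional to the frequency ratio, in octaves
shift_low = math.log2(1000 / 500)    # 500 -> 1000 Hz: a full octave
shift_high = math.log2(5500 / 5000)  # 5000 -> 5500 Hz: a small fraction of an octave
print(shift_low, round(shift_high, 3))
```

Both clips were shifted by the same 500 Hz, yet the first moves a full octave while the second moves only about an eighth of one.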

Mel scale

The mel scale is a human-centered metric of audio perception that was developed by asking participants to judge how far apart different tones sounded.


Frequency (Hz)   Mel Equivalent
20               0
160              250
394              500
670              750
1000             1000
1420             1250
1900             1500
2450             1750
3120             2000
4000             2250
5100             2500
6600             2750
9000             3000
14000            3250
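
There is no single canonical mel formula, but a widely used analytic approximation (the HTK convention, an assumption here; it deviates from the measured table at low frequencies) is mel = 2595 * log10(1 + f/700):

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style analytic approximation of the mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

print(hz_to_mel(1000))  # ~1000 mel, by construction
```

The formula is anchored so that 1000 Hz maps to 1000 mel, matching the table above; below roughly 500 Hz it is close to linear, and above that it grows logarithmically.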


Just like frequency, human perception of loudness occurs on a logarithmic scale. The same increase in the amplitude of a wave is perceived differently depending on whether the original sound is soft or loud.

Decibels measure the ratio of power between two sounds: each 10x increase in the energy of the wave (multiplicative) results in a 10 dB increase in loudness (additive). Thus a sound that is 20 dB louder has 100x (10*10) the energy, and a sound that is 25 dB louder has 10^2.5 = 316.23x more energy.
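
The arithmetic above can be checked directly (a minimal sketch; `db_gain` and `energy_ratio` are names chosen here for illustration):

```python
import math

def db_gain(energy_ratio):
    # each 10x increase in energy adds 10 dB
    return 10.0 * math.log10(energy_ratio)

print(db_gain(100))      # 20.0 dB for 100x the energy
print(db_gain(10**2.5))  # 25.0 dB for ~316.23x the energy
```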


Spectrogram - visual representation of audio

We'll plot time on the x-axis and frequency on the y-axis, and use color to represent the amplitude of each frequency.

In [30]:
sg0 = librosa.stft(y)
sg_mag, sg_phase = librosa.magphase(sg0)
librosa.display.specshow(sg_mag, sr=sr, y_axis='linear', x_axis='time');  # plot the magnitude spectrogram
Notebook Image

Next, we use the mel scale instead of raw frequency.

In [31]:
sg1 = librosa.feature.melspectrogram(S=sg_mag, sr=sr)
librosa.display.specshow(sg1, sr=sr, y_axis='mel', x_axis='time');  # plot the mel spectrogram
Notebook Image

Next, let's use the decibel scale and label the x & y axes.

In [32]:
sg2 = librosa.amplitude_to_db(sg1, ref=np.min)
librosa.display.specshow(sg2, sr=16000, y_axis='mel', fmax=8000, x_axis='time')
plt.colorbar(format='%+2.0f dB')
plt.title('Mel spectrogram');
Notebook Image

Every point in the plot represents the energy at the frequency of its y-coordinate at the time of its x-coordinate.

In [33]:
sg2.min(), sg2.max(), sg2.mean()
(1.3055977504610752, 81.30559775046108, 29.65997874062258)

The spectrogram itself is nothing special: it's simply a 2D numpy array.

In [34]:
type(sg2), sg2.shape
(numpy.ndarray, (128, 119))
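
The shape follows from how the STFT slices the signal: 128 is the number of mel bands (librosa's `melspectrogram` default), and 119 is the number of frames, which we can predict from the signal length and librosa's default STFT hop length of 512 samples (both defaults are assumptions here):

```python
# Predict the spectrogram's frame count from the signal length (a sketch,
# assuming librosa's defaults: hop_length=512 and centered framing)
signal_length = 60800  # len(y) from earlier
hop_length = 512       # librosa's default STFT hop
n_frames = 1 + signal_length // hop_length
print(n_frames)  # 119, matching sg2.shape[1]
```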

In fact, we can view it as an image.

In [35]:
plt.imshow(sg2);  # treat the spectrogram array directly as an image
Notebook Image

It looks flipped because the y-axis is inverted. Also, the ticks on the y-axis now represent mel bin indices rather than frequencies, and the ticks on the x-axis represent frame indices rather than times.

While there's a lot more to explore in audio processing, we'll stop here: we have successfully converted the audio into images, and we can now apply the same models we use for computer vision to audio.

In [36]:
# Clean up (for Kaggle)
# !rm -rf {DATA_DIR}

Save and Commit

In [37]:
import jovian
In [ ]:
jovian.commit()
[jovian] Saving notebook..