
Audio Classification using FastAI

Use the 'Clone' button if you want to run this notebook on a local/cloud machine, or use the 'Run' button to run it on BinderHub or Kaggle.


To begin, we download a set of utility functions for I/O and for converting audio files to spectrogram images.

In [1]:
!rm -rf

Next, we import the necessary modules and functions:

In [2]:
%matplotlib inline
import os
from pathlib import Path
from IPython.display import Audio
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
from utils import read_file, transform_path

Next, let's download the data. We'll use the Free Spoken Digit Dataset (FSDD):

In [3]:
!rm -rf free-spoken-digit-dataset-master
!unzip -q
!rm -rf
In [4]:
AUDIO_DIR = Path('free-spoken-digit-dataset-master/recordings')
IMG_DIR = Path('imgs')
!mkdir -p {IMG_DIR}

Let's see how many recordings we have, and some sample files.

In [5]:
fnames = os.listdir(str(AUDIO_DIR))
len(fnames), fnames[:5]

As before, we can play a recording using the Audio widget:

In [6]:
fn = fnames[94]
Audio(str(AUDIO_DIR/fn))

We can use the read_file helper function to read the audio file and do some preprocessing:

In [26]:
# ??read_file
In [7]:
x, sr = read_file(fn, AUDIO_DIR)
x.shape, sr, x.dtype
((3467,), 8000, dtype('float32'))
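A quick sanity check: the number of samples divided by the sample rate gives the clip's duration. The numbers below mirror the output above.

```python
# Duration of a clip = number of samples / sample rate.
n_samples = 3467
sr = 8000
duration_s = n_samples / sr
print(f"{duration_s:.3f} s")  # → 0.433 s
```

So these recordings are short, well under half a second each.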

Next, let's define a function to convert it into a mel spectrogram, and save the resulting image as a PNG file.

In [8]:
def log_mel_spec_tfm(fname, src_path, dst_path):
    x, sample_rate = read_file(fname, src_path)
    n_fft = 1024
    hop_length = 256
    n_mels = 40
    fmin = 20
    fmax = sample_rate / 2 
    mel_spec_power = librosa.feature.melspectrogram(y=x, sr=sample_rate, n_fft=n_fft,
                                                    hop_length=hop_length, n_mels=n_mels,
                                                    power=2.0, fmin=fmin, fmax=fmax)
    mel_spec_db = librosa.power_to_db(mel_spec_power, ref=np.max)
    dst_fname = dst_path / (fname[:-4] + '.png')
    plt.imsave(dst_fname, mel_spec_db)
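The mel scale groups frequencies into perceptually spaced bins. For reference, here is a sketch of the HTK-style Hz-to-mel conversion; note that librosa's melspectrogram defaults to the Slaney variant, so these numbers are illustrative rather than the exact filterbank used above.

```python
import numpy as np

def hz_to_mel(f_hz):
    """HTK mel scale: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# 40 bins spaced evenly in mel between fmin=20 Hz and fmax=4000 Hz
# (4000 Hz is the Nyquist frequency for 8 kHz audio).
mel_points = np.linspace(hz_to_mel(20), hz_to_mel(4000), 40)
bin_centers_hz = mel_to_hz(mel_points)
print(bin_centers_hz[0], bin_centers_hz[-1])  # endpoints map back to 20 and 4000
```

Because the spacing is even in mel but logarithmic in Hz, the bins are narrow at low frequencies and wide near Nyquist.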

Here's an example audio file converted to PNG.

In [9]:
log_mel_spec_tfm(fn, AUDIO_DIR, IMG_DIR)
img = plt.imread(str(IMG_DIR/(fn[:-4] + '.png')))
plt.imshow(img, origin='lower');
Notebook Image

Now we can apply the log_mel_spec_tfm transformation to the entire dataset.

In [10]:
transform_path(AUDIO_DIR, IMG_DIR, log_mel_spec_tfm, fnames=fnames, delete=True)
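transform_path comes from the downloaded utils module, so its exact behaviour isn't shown here. A minimal sketch of what such a helper might look like, assuming it simply applies the given transform to every file; the delete flag clearing the destination first is an assumption:

```python
from pathlib import Path

def transform_path(src_dir, dst_dir, transform_fn, fnames=None, delete=False):
    """Hypothetical re-implementation: apply transform_fn(fname, src, dst)
    to every file in src_dir, optionally clearing dst_dir first."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    if delete:
        for p in dst_dir.glob('*'):
            p.unlink()
    dst_dir.mkdir(parents=True, exist_ok=True)
    if fnames is None:
        fnames = [p.name for p in src_dir.iterdir()]
    for fname in fnames:
        transform_fn(fname, src_dir, dst_dir)
```

The real helper also shows the progress bar seen in the output; that part is omitted here.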

Image Classifier

From this point onwards, we use a standard FastAI CNN image classifier.

In [13]:
import fastai
In [14]:
from fastai.vision import *

The label can be extracted using a regular expression, and for the validation set, we'll pick all the recordings of one of the speakers.

In [15]:
digit_pattern = r'(\d+)_\w+_\d+\.png$'
In [33]:
data = (ImageList.from_folder(IMG_DIR)
        .split_by_valid_func(lambda fname: 'nicolas' in str(fname))
data.c, data.classes
(10, ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'])

Let's see what a batch of data looks like.

In [34]:
# Shape of batch
xs, ys = data.one_batch()
xs.shape, ys.shape
(torch.Size([64, 3, 128, 64]), torch.Size([64]))
In [35]:
# Stats
xs.min(), xs.max(), xs.mean(), xs.std()
(tensor(0.0039), tensor(0.9903), tensor(0.4042), tensor(0.1950))
In [36]:
# Sample batch
data.show_batch(4, figsize=(5,9), hide_axis=False)
Notebook Image

Now we're ready to define and train the model. We'll use the ResNet18 architecture.

In [37]:
learn = cnn_learner(data, models.resnet18, metrics=accuracy)

We start by fine-tuning the classification layers.

In [38]:

Let's unfreeze and train some more.

In [40]:

It's looking pretty good. Let's look at the confusion matrix.

In [23]:
interp = ClassificationInterpretation.from_learner(learn)
In [25]:
interp.plot_confusion_matrix(figsize=(10, 10), dpi=60)
Notebook Image
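For reference, overall accuracy is the trace of the confusion matrix divided by its sum. A small sketch with a made-up 3×3 matrix (not the actual results above):

```python
import numpy as np

# Made-up confusion matrix: rows = actual class, columns = predicted class.
cm = np.array([[48,  1,  1],
               [ 2, 45,  3],
               [ 0,  2, 48]])

accuracy = np.trace(cm) / cm.sum()           # correct predictions / all predictions
per_class = cm.diagonal() / cm.sum(axis=1)   # recall for each class
print(accuracy)  # → 0.94
print(per_class)
```

The per-class view makes it easy to spot which digits the model confuses most.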
In [30]:
# Clean up (Kaggle)
# !rm -rf {AUDIO_DIR}
# !rm -rf {IMG_DIR}

Save & Commit

In [41]:
import jovian
In [ ]:
[jovian] Saving notebook..