Audio Classification using FastAI

Use the 'Clone' button if you want to run this notebook on a local/cloud machine, or use the 'Run' button to run it on BinderHub or Kaggle.

Source: https://github.com/sevenfx/fastai_audio

To begin, we download a bunch of utility functions for I/O and conversion of audio files to spectrogram images.

In [1]:
!rm -rf utils.py
!wget https://raw.githubusercontent.com/sevenfx/fastai_audio/master/notebooks/utils.py
--2019-06-16 10:19:54--  https://raw.githubusercontent.com/sevenfx/fastai_audio/master/notebooks/utils.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.156.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.156.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7182 (7.0K) [text/plain]
Saving to: 'utils.py'

utils.py            100%[===================>]   7.01K  --.-KB/s    in 0s

2019-06-16 10:19:54 (26.0 MB/s) - 'utils.py' saved [7182/7182]

Now we can import the necessary modules and functions.

In [2]:
%matplotlib inline
import os
from pathlib import Path
from IPython.display import Audio
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
from utils import read_file, transform_path

Next, let's download the data. We'll use the free spoken digits database: https://github.com/Jakobovski/free-spoken-digit-dataset

In [3]:
!rm -rf free-spoken-digit-dataset-master master.zip
!wget https://github.com/Jakobovski/free-spoken-digit-dataset/archive/master.zip
!unzip -q master.zip
!rm -rf master.zip
!ls
--2019-06-16 10:20:04--  https://github.com/Jakobovski/free-spoken-digit-dataset/archive/master.zip
Resolving github.com (github.com)... 13.234.176.102
Connecting to github.com (github.com)|13.234.176.102|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/Jakobovski/free-spoken-digit-dataset/zip/master [following]
--2019-06-16 10:20:04--  https://codeload.github.com/Jakobovski/free-spoken-digit-dataset/zip/master
Resolving codeload.github.com (codeload.github.com)... 192.30.253.120
Connecting to codeload.github.com (codeload.github.com)|192.30.253.120|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: 'master.zip'

master.zip              [       <=>          ]   9.03M  1.53MB/s    in 9.4s

2019-06-16 10:20:14 (984 KB/s) - 'master.zip' saved [9465098]

__pycache__                        free-spoken-digit-dataset-master
audio-classification-fastai.ipynb  utils.py
In [4]:
AUDIO_DIR = Path('free-spoken-digit-dataset-master/recordings')
IMG_DIR = Path('imgs')
!mkdir -p {IMG_DIR}

Let's see how many recordings we have, and some sample files.

In [5]:
fnames = os.listdir(str(AUDIO_DIR))
len(fnames), fnames[:5]
Out[5]:
(2000,
 ['5_nicolas_9.wav',
  '3_yweweler_14.wav',
  '4_yweweler_38.wav',
  '3_yweweler_28.wav',
  '4_yweweler_10.wav'])

As before, we can play a recording using the Audio widget.

In [6]:
fn = fnames[94]
print(fn)
Audio(str(AUDIO_DIR/fn))
4_jackson_32.wav
Out[6]:

We can use the read_file helper function to read an audio file and do some basic preprocessing.

In [26]:
# ??read_file
In [7]:
x, sr = read_file(fn, AUDIO_DIR)
x.shape, sr, x.dtype
Out[7]:
((3467,), 8000, dtype('float32'))
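
The actual implementation of read_file lives in utils.py (the commented-out ??read_file above would show its source). As a rough, purely illustrative sketch, a helper like this typically loads the clip at its native sample rate and trims silence; the function below is an assumption, not the real code.

In [ ]:
# Hypothetical sketch of a read_file-style helper (the real one is in utils.py)
def read_file_sketch(fname, path):
    # Load at the file's native sample rate (these recordings are 8 kHz)
    x, sr = librosa.load(str(path / fname), sr=None)
    # Optionally strip leading/trailing silence
    x, _ = librosa.effects.trim(x)
    return x, sr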

Next, let's define a function to convert it into a mel spectrogram, and save the resulting image as a PNG file.

In [8]:
def log_mel_spec_tfm(fname, src_path, dst_path):
    x, sample_rate = read_file(fname, src_path)
    
    # Spectrogram parameters
    n_fft = 1024            # FFT window size
    hop_length = 256        # samples between successive frames
    n_mels = 40             # number of mel bands
    fmin = 20               # lowest frequency (Hz)
    fmax = sample_rate / 2  # highest frequency (Nyquist)
    
    # Power mel spectrogram, converted to decibels (log scale)
    mel_spec_power = librosa.feature.melspectrogram(x, sr=sample_rate, n_fft=n_fft, 
                                                    hop_length=hop_length, 
                                                    n_mels=n_mels, power=2.0, 
                                                    fmin=fmin, fmax=fmax)
    mel_spec_db = librosa.power_to_db(mel_spec_power, ref=np.max)
    
    # Save the spectrogram as a PNG image
    dst_fname = dst_path / (fname[:-4] + '.png')
    plt.imsave(dst_fname, mel_spec_db)

Here's an example audio file converted to PNG.

In [9]:
log_mel_spec_tfm(fn, AUDIO_DIR, IMG_DIR)
img = plt.imread(str(IMG_DIR/(fn[:-4] + '.png')))
plt.imshow(img, origin='lower');
(Output: log-mel spectrogram image of the recording)
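
To make the parameters concrete: with librosa's default centred framing, the spectrogram has n_mels rows and roughly 1 + len(x) // hop_length columns, so the 3467-sample clip loaded above yields a 40 x 14 image. A quick sanity check:

In [ ]:
# Sanity-check the spectrogram dimensions for the clip loaded earlier
mel = librosa.feature.melspectrogram(x, sr=sr, n_fft=1024, hop_length=256,
                                     n_mels=40, fmin=20, fmax=sr / 2)
mel.shape, 1 + len(x) // 256   # expected: ((40, 14), 14)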

Now we can apply the log_mel_spec_tfm transformation to the entire dataset.

In [10]:
transform_path(AUDIO_DIR, IMG_DIR, log_mel_spec_tfm, fnames=fnames, delete=True)
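
transform_path is another helper from utils.py. In essence it maps the given transform over every filename, writing the images into IMG_DIR (and, with delete=True, clearing any previous output first). A minimal sketch of that behaviour, assuming this is all it does:

In [ ]:
# Hypothetical sketch of transform_path (the real helper is in utils.py)
import shutil

def transform_path_sketch(src_path, dst_path, transform_fn, fnames, delete=False):
    if delete and dst_path.exists():
        shutil.rmtree(str(dst_path))            # start from a clean output directory
    dst_path.mkdir(parents=True, exist_ok=True)
    for fname in fnames:                        # apply the transform to every recording
        transform_fn(fname, src_path, dst_path)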
In [11]:
os.listdir(str(IMG_DIR))[:10]
Out[11]:
['2_theo_12.png',
 '7_jackson_24.png',
 '6_yweweler_45.png',
 '7_jackson_30.png',
 '7_jackson_18.png',
 '8_yweweler_36.png',
 '5_theo_36.png',
 '3_jackson_45.png',
 '3_nicolas_12.png',
 '5_theo_22.png']

Image Classifier

From this point onwards, we use a standard FastAI CNN image classifier.

In [13]:
import fastai
fastai.__version__
Out[13]:
'1.0.53.post2'
In [14]:
from fastai.vision import *

The label (the spoken digit) can be extracted from the filename using a regular expression. For the validation set we'll use a random 20% split; alternatively, we could hold out all the recordings of one speaker.

In [15]:
digit_pattern = r'(\d+)_\w+_\d+\.png$'
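
The capture group grabs the leading digit of the filename, which is the spoken digit. For example:

In [ ]:
import re
re.search(digit_pattern, '4_jackson_32.png').group(1)   # -> '4'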
In [33]:
data = (ImageList.from_folder(IMG_DIR)
        .split_by_rand_pct(0.2)
        #.split_by_valid_func(lambda fname: 'nicolas' in str(fname))
        .label_from_re(digit_pattern)
        .transform(size=(128,64))
        .databunch())
data.c, data.classes
Out[33]:
(10, ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'])
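
If you'd rather validate on a speaker the model has never heard (the commented-out line above), the same pipeline works with a filename-based split, e.g. holding out all of nicolas's recordings:

In [ ]:
# Alternative: hold out one speaker ('nicolas') as the validation set
data_speaker = (ImageList.from_folder(IMG_DIR)
                .split_by_valid_func(lambda fname: 'nicolas' in str(fname))
                .label_from_re(digit_pattern)
                .transform(size=(128, 64))
                .databunch())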

Let's see what a batch of data looks like.

In [34]:
# Shape of batch
xs, ys = data.one_batch()
xs.shape, ys.shape
Out[34]:
(torch.Size([64, 3, 128, 64]), torch.Size([64]))
In [35]:
# Stats
xs.min(), xs.max(), xs.mean(), xs.std()
Out[35]:
(tensor(0.0039), tensor(0.9903), tensor(0.4042), tensor(0.1950))
In [36]:
# Sample batch
data.show_batch(4, figsize=(5,9), hide_axis=False)
(Output: grid of sample spectrogram images with their digit labels)

Now we're ready to define and train the model. We'll use the ResNet18 architecture.

In [37]:
learn = cnn_learner(data, models.resnet18, metrics=accuracy)

We start by training just the newly added classification layers; the pretrained backbone stays frozen.

In [38]:
learn.fit_one_cycle(4)

Let's unfreeze and train some more.

In [40]:
learn.unfreeze()
learn.fit_one_cycle(4)

It's looking pretty good. Let's look at the confusion matrix.

In [23]:
interp = ClassificationInterpretation.from_learner(learn)
In [25]:
interp.plot_confusion_matrix(figsize=(10, 10), dpi=60)
(Output: 10x10 confusion matrix of predicted vs. actual digits)
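
To classify a new recording, convert it to a spectrogram with the same transform and run learn.predict on the resulting image. A rough sketch, reusing the file fn from earlier (its spectrogram is already in IMG_DIR):

In [ ]:
# Predict the digit for a single recording's spectrogram
img = open_image(IMG_DIR / (fn[:-4] + '.png'))
pred_class, pred_idx, probs = learn.predict(img)
pred_class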
In [30]:
# Clean up (Kaggle)
# !rm -rf {AUDIO_DIR}
# !rm -rf {IMG_DIR}

Save & Commit

In [41]:
import jovian
In [ ]:
jovian.commit()
[jovian] Saving notebook..