Use the 'Clone' button if you want to run this notebook on a local/cloud machine, or use the 'Run' button to run it on BinderHub or Kaggle.
To begin, we download a small set of utility functions for audio I/O and for converting audio files to spectrogram images.
!rm -rf utils.py
!wget https://raw.githubusercontent.com/sevenfx/fastai_audio/master/notebooks/utils.py
--2019-06-16 10:19:54-- https://raw.githubusercontent.com/sevenfx/fastai_audio/master/notebooks/utils.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.156.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.156.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7182 (7.0K) [text/plain]
Saving to: 'utils.py'
utils.py 100%[===================>] 7.01K --.-KB/s in 0s
2019-06-16 10:19:54 (26.0 MB/s) - 'utils.py' saved [7182/7182]
Now we can import the necessary modules and functions.
%matplotlib inline
import os
from pathlib import Path
from IPython.display import Audio
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
from utils import read_file, transform_path
Next, let's download the data. We'll use the Free Spoken Digit Dataset: https://github.com/Jakobovski/free-spoken-digit-dataset
!rm -rf free-spoken-digit-dataset-master master.zip
!wget https://github.com/Jakobovski/free-spoken-digit-dataset/archive/master.zip
!unzip -q master.zip
!rm -rf master.zip
!ls
--2019-06-16 10:20:04-- https://github.com/Jakobovski/free-spoken-digit-dataset/archive/master.zip
Resolving github.com (github.com)... 13.234.176.102
Connecting to github.com (github.com)|13.234.176.102|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/Jakobovski/free-spoken-digit-dataset/zip/master [following]
--2019-06-16 10:20:04-- https://codeload.github.com/Jakobovski/free-spoken-digit-dataset/zip/master
Resolving codeload.github.com (codeload.github.com)... 192.30.253.120
Connecting to codeload.github.com (codeload.github.com)|192.30.253.120|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: 'master.zip'
master.zip [ <=> ] 9.03M 1.53MB/s in 9.4s
2019-06-16 10:20:14 (984 KB/s) - 'master.zip' saved [9465098]
__pycache__ free-spoken-digit-dataset-master
audio-classification-fastai.ipynb utils.py
AUDIO_DIR = Path('free-spoken-digit-dataset-master/recordings')
IMG_DIR = Path('imgs')
!mkdir -p {IMG_DIR}
Let's see how many recordings we have, and some sample files.
fnames = os.listdir(str(AUDIO_DIR))
len(fnames), fnames[:5]
(2000,
['5_nicolas_9.wav',
'3_yweweler_14.wav',
'4_yweweler_38.wav',
'3_yweweler_28.wav',
'4_yweweler_10.wav'])
As before, we can play the recording using the Audio widget.
fn = fnames[94]
print(fn)
Audio(str(AUDIO_DIR/fn))
4_jackson_32.wav
We can use the read_file helper function to read the audio file and do some preprocessing.
# ??read_file
x, sr = read_file(fn, AUDIO_DIR)
x.shape, sr, x.dtype
((3467,), 8000, dtype('float32'))
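The actual implementation lives in utils.py (inspect it with ??read_file). As a rough, hypothetical sketch of what such a helper might do, assuming it simply wraps librosa.load at the file's native sample rate and trims leading/trailing silence:

# Hypothetical sketch only -- the real read_file is defined in utils.py.
import librosa
import numpy as np
from pathlib import Path

def read_file_sketch(fname, path, trim=True):
    x, sr = librosa.load(str(Path(path) / fname), sr=None)  # keep the original sample rate
    if trim:
        x, _ = librosa.effects.trim(x, top_db=60)            # strip leading/trailing silence
    return x.astype(np.float32), sr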
Next, let's define a function to convert it into a mel spectrogram, and save the resulting image as a PNG file.
def log_mel_spec_tfm(fname, src_path, dst_path):
    x, sample_rate = read_file(fname, src_path)
    # STFT / mel filterbank parameters
    n_fft = 1024
    hop_length = 256
    n_mels = 40
    fmin = 20
    fmax = sample_rate / 2
    # Power mel spectrogram, then convert to decibels relative to the peak
    mel_spec_power = librosa.feature.melspectrogram(y=x, sr=sample_rate, n_fft=n_fft,
                                                    hop_length=hop_length,
                                                    n_mels=n_mels, power=2.0,
                                                    fmin=fmin, fmax=fmax)
    mel_spec_db = librosa.power_to_db(mel_spec_power, ref=np.max)
    # Save the spectrogram as a PNG image
    dst_fname = dst_path / (fname[:-4] + '.png')
    plt.imsave(dst_fname, mel_spec_db)
Here's an example audio file converted to PNG.
log_mel_spec_tfm(fn, AUDIO_DIR, IMG_DIR)
img = plt.imread(str(IMG_DIR/(fn[:-4] + '.png')))
plt.imshow(img, origin='lower');
Now we can apply the log_mel_spec_tfm transformation to the entire dataset.
transform_path(AUDIO_DIR, IMG_DIR, log_mel_spec_tfm, fnames=fnames, delete=True)
HBox(children=(IntProgress(value=0, max=2000), HTML(value='')))
os.listdir(str(IMG_DIR))[:10]
['2_theo_12.png',
'7_jackson_24.png',
'6_yweweler_45.png',
'7_jackson_30.png',
'7_jackson_18.png',
'8_yweweler_36.png',
'5_theo_36.png',
'3_jackson_45.png',
'3_nicolas_12.png',
'5_theo_22.png']
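transform_path also comes from utils.py. As a minimal sketch of what it plausibly does, assuming it just maps the transform over the file names (with delete=True clearing any previously generated images); the real helper also shows the progress bar above:

# Hypothetical sketch only -- the real transform_path is defined in utils.py.
def transform_path_sketch(src_path, dst_path, transform_fn, fnames=None, delete=False):
    if delete:
        for f in os.listdir(str(dst_path)):       # clear previously generated images
            os.remove(str(dst_path / f))
    if fnames is None:
        fnames = os.listdir(str(src_path))
    for fname in fnames:                          # e.g. log_mel_spec_tfm(fname, src, dst)
        transform_fn(fname, src_path, dst_path)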
From this point onwards, we use a standard FastAI CNN image classifier.
import fastai
fastai.__version__
'1.0.53.post2'
from fastai.vision import *
The label can be extracted using a regular expression. For the validation set, we'll use a random 20% split; alternatively, we could hold out all the recordings of one speaker (see the commented-out line below).
digit_pattern = r'(\d+)_\w+_\d+\.png$'
data = (ImageList.from_folder(IMG_DIR)
        .split_by_rand_pct(0.2)
        # Alternative: hold out all of one speaker's recordings for validation
        # .split_by_valid_func(lambda fname: 'nicolas' in str(fname))
        .label_from_re(digit_pattern)
        .transform(size=(128, 64))
        .databunch())
data.c, data.classes
(10, ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'])
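As a quick sanity check of the pattern, here's the label extracted from one of the generated file names (a standalone snippet using Python's re module; the capture group is what becomes the class label):

import re
re.search(digit_pattern, '7_jackson_24.png').group(1)  # -> '7'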
Let's see what a batch of data looks like.
# Shape of batch
xs, ys = data.one_batch()
xs.shape, ys.shape
(torch.Size([64, 3, 128, 64]), torch.Size([64]))
# Stats
xs.min(), xs.max(), xs.mean(), xs.std()
(tensor(0.0039), tensor(0.9903), tensor(0.4042), tensor(0.1950))
# Sample batch
data.show_batch(4, figsize=(5,9), hide_axis=False)
Now we're ready to define and train the model. We'll use the ResNet18 architecture.
learn = cnn_learner(data, models.resnet18, metrics=accuracy)
We start by training just the newly added classification head; the pretrained layers are frozen by default.
learn.fit_one_cycle(4)
Let's unfreeze and train some more.
learn.unfreeze()
learn.fit_one_cycle(4)
It's looking pretty good. Let's look at the confusion matrix.
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(10, 10), dpi=60)
# Clean up (Kaggle)
# !rm -rf {AUDIO_DIR}
# !rm -rf {IMG_DIR}
import jovian
jovian.commit()
[jovian] Saving notebook..