This post is the fourth in a series of tutorials on building deep learning models with PyTorch, an open source neural networks library. Check out the full series:
In the previous tutorial, we trained a logistic regression model to identify handwritten digits from the MNIST dataset with an accuracy of around 86%.
However, we also noticed that it's quite difficult to improve the accuracy beyond 87%, due to the limited power of the model. In this post, we'll try to improve upon it using a feedforward neural network.
If you want to follow along and run the code as you read, you can clone this notebook, install the required dependencies using conda, and start Jupyter by running the following commands on the terminal:
pip install jovian --upgrade # Install the jovian library
jovian clone fdaae0bf32cf4917a931ac415a5c31b0 # Download notebook
cd 04-feedforward-nn # Enter the created directory
jovian install # Install the dependencies
conda activate 04-feedforward-nn # Activate virtual env
jupyter notebook # Start Jupyter
On older versions of conda, you might need to run source activate 04-feedforward-nn to activate the virtual environment. For a more detailed explanation of the above steps, check out the System setup section in the first notebook.
The data preparation is identical to the previous tutorial. We begin by importing the required modules & classes.
import torch
import numpy as np
import torchvision
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
from torch.utils.data.sampler import SubsetRandomSampler
from torch.utils.data.dataloader import DataLoader
We download the data and create a PyTorch dataset using the MNIST class from torchvision.datasets.
dataset = MNIST(root='data/',
download=True,
transform=ToTensor())
Next, we define and use a function split_indices to pick a random 20% fraction of the images for the validation set.
def split_indices(n, val_pct):
# Determine size of validation set
n_val = int(val_pct*n)
# Create random permutation of 0 to n-1
idxs = np.random.permutation(n)
# Pick first n_val indices for validation set
return idxs[n_val:], idxs[:n_val]
train_indices, val_indices = split_indices(len(dataset), val_pct=0.2)
print(len(train_indices), len(val_indices))
print('Sample val indices: ', val_indices[:20])
48000 12000
Sample val indices: [13541 26766 27540 1931 58020 16756 22475 54824 35811 28772 8400 27130
57761 32223 11259 58824 46588 18089 24000 6632]
We can now create PyTorch data loaders for each of the subsets using a SubsetRandomSampler, which samples elements randomly from a given list of indices, while creating batches of data.
batch_size=100
# Training sampler and data loader
train_sampler = SubsetRandomSampler(train_indices)
train_dl = DataLoader(dataset,
batch_size,
sampler=train_sampler)
# Validation sampler and data loader
valid_sampler = SubsetRandomSampler(val_indices)
valid_dl = DataLoader(dataset,
batch_size,
sampler=valid_sampler)
To improve upon logistic regression, we'll create a neural network with one hidden layer. Here's what this means:
Instead of using a single nn.Linear object to transform a batch of inputs (pixel intensities) into a batch of outputs (class probabilities), we'll use two nn.Linear objects. Each of these is called a layer in the network.
The first layer (also known as the hidden layer) will transform the input matrix of shape batch_size x 784 into an intermediate output matrix of shape batch_size x hidden_size, where hidden_size is a preconfigured parameter (e.g. 32 or 64).
The intermediate outputs are then passed into a non-linear activation function, which operates on individual elements of the output matrix.
The result of the activation function, which is also of size batch_size x hidden_size, is passed into the second layer (also known as the output layer), which transforms it into a matrix of size batch_size x 10, identical to the output of the logistic regression model.
Introducing a hidden layer and an activation function allows the model to learn more complex, multi-layered and non-linear relationships between the inputs and the targets.
The activation function we'll use here is called a Rectified Linear Unit or ReLU, and it has a really simple formula: relu(x) = max(0, x), i.e. if an element is negative, we replace it by 0; otherwise we leave it unchanged.
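Since ReLU operates on individual elements, its effect is easy to verify on a small tensor. Here's a quick check using torch.nn.functional.relu (imported as F, matching the model definition below):
import torch
import torch.nn.functional as F
t = torch.tensor([[-1.5,  0.0, 2.3],
                  [ 0.7, -0.2, 4.1]])
print(F.relu(t))
# tensor([[0.0000, 0.0000, 2.3000],
#         [0.7000, 0.0000, 4.1000]])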
To define the model, we extend the nn.Module class, just as we did with logistic regression.
import torch.nn.functional as F
import torch.nn as nn
class MnistModel(nn.Module):
"""Feedfoward neural network with 1 hidden layer"""
def __init__(self, in_size, hidden_size, out_size):
super().__init__()
# hidden layer
self.linear1 = nn.Linear(in_size, hidden_size)
# output layer
self.linear2 = nn.Linear(hidden_size, out_size)
def forward(self, xb):
# Flatten the image tensors
xb = xb.view(xb.size(0), -1)
# Get intermediate outputs using hidden layer
out = self.linear1(xb)
# Apply activation function
out = F.relu(out)
# Get predictions using output layer
out = self.linear2(out)
return out
We'll create a model that contains a hidden layer with 32 activations.
input_size = 784
num_classes = 10
model = MnistModel(input_size, hidden_size=32,
out_size=num_classes)
Let's take a look at the model's parameters. We expect to see one weight and bias matrix for each of the layers.
for t in model.parameters():
print(t.shape)
torch.Size([32, 784])
torch.Size([32])
torch.Size([10, 32])
torch.Size([10])
Let's try and generate some outputs using our model. We'll take the first batch of 100 images from our dataset, and pass them into our model.
for images, labels in train_dl:
outputs = model(images)
loss = F.cross_entropy(outputs, labels)
print('Loss:', loss.item())
break
print('outputs.shape : ', outputs.shape)
print('Sample outputs :\n', outputs[:2].data)
Loss: 2.33284068107605
outputs.shape : torch.Size([100, 10])
Sample outputs :
tensor([[ 0.2022, 0.0180, 0.1954, -0.0176, -0.1575, -0.1170, -0.2710, 0.1300,
-0.0899, -0.2078],
[ 0.0385, 0.1787, 0.1076, -0.0136, -0.1613, -0.1780, -0.3600, 0.1198,
-0.1288, -0.2146]])
As the sizes of our models and datasets increase, we need to use GPUs to train our models within a reasonable amount of time. GPUs contain hundreds of cores that are optimized for performing expensive matrix operations on floating point numbers in a short time, which makes them ideal for training deep neural networks with many layers. You can use GPUs for free on Kaggle kernels or Google Colab, or rent GPU-powered machines on services like Google Cloud Platform, Amazon Web Services or Paperspace.
We can check if a GPU is available and the required NVIDIA CUDA drivers are installed using torch.cuda.is_available.
torch.cuda.is_available()
False
Let's define a helper function to ensure that our code uses the GPU if available, and defaults to using the CPU if it isn't.
def get_default_device():
"""Pick GPU if available, else CPU"""
if torch.cuda.is_available():
return torch.device('cuda')
else:
return torch.device('cpu')
device = get_default_device()
device
device(type='cpu')
Next, let's define a function that can move data and model to a chosen device.
def to_device(data, device):
"""Move tensor(s) to chosen device"""
if isinstance(data, (list,tuple)):
return [to_device(x, device) for x in data]
return data.to(device, non_blocking=True)
for images, labels in train_dl:
print(images.shape)
images = to_device(images, device)
print(images.device)
break
torch.Size([100, 1, 28, 28])
cpu
Finally, we define a DeviceDataLoader class to wrap our existing data loaders and move data to the selected device, as batches are accessed. Interestingly, we don't need to extend an existing class to create a PyTorch data loader. All we need is an __iter__ method to retrieve batches of data, and a __len__ method to get the number of batches.
class DeviceDataLoader():
"""Wrap a dataloader to move data to a device"""
def __init__(self, dl, device):
self.dl = dl
self.device = device
def __iter__(self):
"""Yield a batch of data after moving it to device"""
for b in self.dl:
yield to_device(b, self.device)
def __len__(self):
"""Number of batches"""
return len(self.dl)
We can now wrap our data loaders using DeviceDataLoader.
train_dl = DeviceDataLoader(train_dl, device)
valid_dl = DeviceDataLoader(valid_dl, device)
Tensors that have been moved to the GPU's RAM have a device property which includes the word cuda. Let's verify this by looking at a batch of data from valid_dl.
for xb, yb in valid_dl:
print('xb.device:', xb.device)
print('yb:', yb)
break
xb.device: cpu
yb: tensor([7, 7, 8, 8, 0, 8, 1, 0, 8, 1, 4, 2, 3, 8, 3, 3, 2, 3, 3, 2, 2, 7, 4, 2,
9, 2, 7, 1, 8, 1, 4, 0, 1, 6, 3, 2, 3, 7, 9, 9, 2, 6, 8, 9, 7, 8, 7, 4,
5, 0, 0, 3, 5, 5, 6, 2, 9, 2, 1, 6, 6, 5, 3, 3, 8, 9, 6, 9, 1, 7, 2, 0,
2, 7, 3, 5, 0, 1, 3, 5, 1, 3, 9, 1, 5, 4, 4, 3, 6, 1, 3, 7, 1, 6, 9, 1,
3, 3, 4, 7])
As with logistic regression, we can use cross entropy as the loss function and accuracy as the evaluation metric for our model. The training loop is also identical, so we can reuse the loss_batch, evaluate and fit functions from the previous tutorial.
The loss_batch function calculates the loss and metric value for a batch of data, and optionally performs gradient descent if an optimizer is provided.
def loss_batch(model, loss_func, xb, yb, opt=None, metric=None):
# Generate predictions
preds = model(xb)
# Calculate loss
loss = loss_func(preds, yb)
if opt is not None:
# Compute gradients
loss.backward()
# Update parameters
opt.step()
# Reset gradients
opt.zero_grad()
metric_result = None
if metric is not None:
# Compute the metric
metric_result = metric(preds, yb)
return loss.item(), len(xb), metric_result
The evaluate function calculates the overall loss (and a metric, if provided) for the validation set.
def evaluate(model, loss_fn, valid_dl, metric=None):
with torch.no_grad():
# Pass each batch through the model
results = [loss_batch(model, loss_fn, xb, yb, metric=metric)
for xb,yb in valid_dl]
# Separate losses, counts and metrics
losses, nums, metrics = zip(*results)
# Total size of the dataset
total = np.sum(nums)
# Avg. loss across batches
avg_loss = np.sum(np.multiply(losses, nums)) / total
avg_metric = None
if metric is not None:
# Avg. of metric across batches
avg_metric = np.sum(np.multiply(metrics, nums)) / total
return avg_loss, total, avg_metric
The fit function contains the actual training loop, as defined in the previous tutorials. We'll make a couple more enhancements to the fit function:
Instead of defining the optimizer manually, we'll pass in the learning rate and create an optimizer inside the fit function. This will allow us to train the model with different learning rates, if required.
We'll record the validation loss and accuracy at the end of every epoch, and return the history as the output of the fit function.
def fit(epochs, lr, model, loss_fn, train_dl,
valid_dl, metric=None, opt_fn=None):
losses, metrics = [], []
# Instantiate the optimizer
if opt_fn is None: opt_fn = torch.optim.SGD
    opt = opt_fn(model.parameters(), lr=lr)
for epoch in range(epochs):
# Training
for xb,yb in train_dl:
loss,_,_ = loss_batch(model, loss_fn, xb, yb, opt)
# Evaluation
result = evaluate(model, loss_fn, valid_dl, metric)
val_loss, total, val_metric = result
# Record the loss & metric
losses.append(val_loss)
metrics.append(val_metric)
# Print progress
if metric is None:
print('Epoch [{}/{}], Loss: {:.4f}'
.format(epoch+1, epochs, val_loss))
else:
print('Epoch [{}/{}], Loss: {:.4f}, {}: {:.4f}'
.format(epoch+1, epochs, val_loss,
metric.__name__, val_metric))
return losses, metrics
We also define an accuracy function which calculates the overall accuracy of the model on an entire batch of outputs, so that we can use it as a metric in fit.
def accuracy(outputs, labels):
_, preds = torch.max(outputs, dim=1)
return torch.sum(preds == labels).item() / len(preds)
Before we train the model, we need to ensure that the data and the model's parameters (weights and biases) are on the same device (CPU or GPU). We can reuse the to_device function to move the model's parameters to the right device.
# Model (on GPU)
model = MnistModel(input_size, hidden_size=32, out_size=num_classes)
to_device(model, device)
MnistModel(
(linear1): Linear(in_features=784, out_features=32, bias=True)
(linear2): Linear(in_features=32, out_features=10, bias=True)
)
Let's see how the model performs on the validation set with the initial set of weights and biases.
val_loss, total, val_acc = evaluate(model, F.cross_entropy,
valid_dl, metric=accuracy)
print('Loss: {:.4f}, Accuracy: {:.4f}'.format(val_loss, val_acc))
Loss: 2.3052, Accuracy: 0.0907
The initial accuracy is around 10%, which is what one might expect from a randomly initialized model (since it has a 1 in 10 chance of getting a label right by guessing randomly). This also matches the initial loss of around 2.3 we saw above: the cross entropy of a near-uniform prediction over 10 classes is ln(10) ≈ 2.30.
We are now ready to train the model. Let's train for 5 epochs and look at the results. We can use a relatively high learning rate of 0.5.
losses1, metrics1 = fit(5, 0.5, model, F.cross_entropy,
train_dl, valid_dl, accuracy)
Epoch [1/5], Loss: 0.2222, accuracy: 0.9346
Epoch [2/5], Loss: 0.1698, accuracy: 0.9498
Epoch [3/5], Loss: 0.1502, accuracy: 0.9534
Epoch [4/5], Loss: 0.1435, accuracy: 0.9577
Epoch [5/5], Loss: 0.1396, accuracy: 0.9557
95% is pretty good! Let's train the model for 5 more epochs at a lower learning rate of 0.1, to further improve the accuracy.
losses2, metrics2 = fit(5, 0.1, model, F.cross_entropy,
train_dl, valid_dl, accuracy)
Epoch [1/5], Loss: 0.1156, accuracy: 0.9627
Epoch [2/5], Loss: 0.1184, accuracy: 0.9628
Epoch [3/5], Loss: 0.1146, accuracy: 0.9639
Epoch [4/5], Loss: 0.1130, accuracy: 0.9639
Epoch [5/5], Loss: 0.1119, accuracy: 0.9649
We can now plot the accuracies to study how the model improves over time.
import matplotlib.pyplot as plt
# Replace these values with your results
accuracies = [val_acc] + metrics1 + metrics2
plt.plot(accuracies, '-x')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.title('Accuracy vs. No. of epochs');
Our current model outperforms the logistic regression model (which could only reach around 86% accuracy) by a huge margin! It quickly reaches an accuracy of 96%, but doesn't improve much beyond this. To improve the accuracy further, we need to make the model more powerful. As you can probably guess, this can be achieved by increasing the size of the hidden layer, or adding more hidden layers. I encourage you to try out both these approaches and see which one works better.
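To illustrate the second approach, here's a minimal sketch of a model with two hidden layers. The class name and the layer sizes (64 and 32) are arbitrary choices for illustration, not tuned values:
class DeepMnistModel(nn.Module):
    """Feedforward neural network with 2 hidden layers (illustrative sketch)"""
    def __init__(self, in_size, hidden1, hidden2, out_size):
        super().__init__()
        self.linear1 = nn.Linear(in_size, hidden1)   # first hidden layer
        self.linear2 = nn.Linear(hidden1, hidden2)   # second hidden layer
        self.linear3 = nn.Linear(hidden2, out_size)  # output layer
    def forward(self, xb):
        # Flatten the image tensors
        xb = xb.view(xb.size(0), -1)
        # Apply ReLU after each hidden layer
        out = F.relu(self.linear1(xb))
        out = F.relu(self.linear2(out))
        # The output layer produces the class scores
        return self.linear3(out)
# 64 and 32 are starting points to experiment with, not tuned values
model2 = DeepMnistModel(input_size, hidden1=64, hidden2=32, out_size=num_classes)
This variant can be trained with the same fit function, since nothing in the training loop depends on the model's internal structure.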
As a final step, we can save and commit our work using the jovian library.
!pip install jovian --upgrade -q
import jovian
jovian.commit()
[jovian] Saving notebook..
Here is a summary of the topics covered in this tutorial:
We created a neural network with one hidden layer to improve upon the logistic regression model from the previous tutorial. We also used the ReLU activation function to introduce non-linearity into the model, allowing it to learn more complex relationships between the inputs (pixel intensities) and outputs (class probabilities).
We defined some utilities like get_default_device, to_device and DeviceDataLoader to leverage a GPU if available, by moving the input data and model parameters to the appropriate device.
We were able to use the exact same training loop: the fit function we had defined earlier to train our model and evaluate it using the validation dataset.
There's a lot of scope to experiment here, and I encourage you to use the interactive nature of Jupyter to play around with the various parameters. Here are a few ideas:
Try changing the size of the hidden layer, or add more hidden layers and see if you can achieve a higher accuracy.
Try changing the batch size and learning rate to see if you can achieve the same accuracy in fewer epochs.
Compare the training times on a CPU vs. GPU. Do you see a significant difference? How does it vary with the size of the dataset and the size of the model (no. of weights and parameters)?
Try building a model for a different dataset, such as the CIFAR10 or CIFAR100 datasets (a minimal CIFAR10 sketch follows below).
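For the last idea, here's a minimal sketch of loading CIFAR10 with torchvision. CIFAR10 images are colour images of size 3x32x32, so the flattened input size becomes 3 * 32 * 32 = 3072; the rest of the model can stay the same:
from torchvision.datasets import CIFAR10
from torchvision.transforms import ToTensor
# CIFAR10 ships with torchvision.datasets, just like MNIST
cifar10 = CIFAR10(root='data/', download=True, transform=ToTensor())
# 3 colour channels x 32 x 32 pixels = 3072 inputs, 10 classes
cifar_model = MnistModel(in_size=3*32*32, hidden_size=32, out_size=10)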
Here are some references for further reading:
A visual proof that neural networks can compute any function, also known as the Universal Approximation Theorem.
But what is a neural network? - A visual and intuitive introduction to what neural networks are and what the intermediate layers represent
Stanford CS229 Lecture notes on Backpropagation - for a more mathematical treatment of how gradients are calculated and weights are updated for neural networks with multiple layers.