Training Deep Neural Networks on a GPU with PyTorch

Part 4 of "PyTorch: Zero to GANs"

This post is the fourth in a series of tutorials on building deep learning models with PyTorch, an open source neural networks library. Check out the full series:

  1. PyTorch Basics: Tensors & Gradients
  2. Linear Regression & Gradient Descent
  3. Image Classification using Logistic Regression
  4. Training Deep Neural Networks on a GPU
  5. Coming soon.. (CNNs, RNNs, GANs etc.)

In the previous tutorial, we trained a logistic regression model to identify handwritten digits from the MNIST dataset with an accuracy of around 86%.

However, we also noticed that it's quite difficult to improve the accuracy beyond 87%, due to the limited power of the model. In this post, we'll try to improve upon it using a feedforward neural network.

System Setup

If you want to follow along and run the code as you read, you can clone this notebook, install the required dependencies using conda, and start Jupyter by running the following commands on the terminal:

pip install jovian --upgrade    # Install the jovian library 
jovian clone fdaae0bf32cf4917a931ac415a5c31b0  # Download notebook
cd 04-feedforward-nn            # Enter the created directory 
jovian install                  # Install the dependencies
conda activate 04-feedfoward-nn # Activate virtual env
jupyter notebook                # Start Jupyter

On older versions of conda, you might need to run source activate 04-feedfoward-nn to activate the virtual environment. For a more detailed explanation of the above steps, check out the System setup section in the first notebook.

Preparing the Data

The data preparation is identical to the previous tutorial. We begin by importing the required modules & classes.

In [1]:
import torch
import numpy as np
import torchvision
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
from torch.utils.data.sampler import SubsetRandomSampler
from torch.utils.data import DataLoader

We download the data and create a PyTorch dataset using the MNIST class from torchvision.datasets.

In [2]:
dataset = MNIST(root='data/', download=True, transform=ToTensor())

Next, we define and use a function split_indices to pick a random 20% fraction of the images for the validation set.

In [3]:
def split_indices(n, val_pct):
    # Determine size of validation set
    n_val = int(val_pct*n)
    # Create random permutation of 0 to n-1
    idxs = np.random.permutation(n)
    # Pick first n_val indices for validation set
    return idxs[n_val:], idxs[:n_val]
In [4]:
train_indices, val_indices = split_indices(len(dataset), val_pct=0.2)

print(len(train_indices), len(val_indices))
print('Sample val indices: ', val_indices[:20])
48000 12000
Sample val indices:  [13541 26766 27540 1931 58020 16756 22475 54824 35811 28772 8400 27130 57761 32223 11259 58824 46588 18089 24000 6632]
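
As a quick sanity check on `split_indices`, the two index arrays should be disjoint and together cover every index exactly once. The sketch below repeats the function so it runs on its own; the seed and the small `n` are arbitrary choices for illustration, not values from the notebook.

```python
import numpy as np

def split_indices(n, val_pct):
    # Same logic as above: shuffle 0..n-1, then cut off the first n_val indices
    n_val = int(val_pct * n)
    idxs = np.random.permutation(n)
    return idxs[n_val:], idxs[:n_val]

np.random.seed(0)  # arbitrary seed, only to make the run repeatable
train_idx, val_idx = split_indices(50, val_pct=0.2)

# Sizes of the two subsets
print(len(train_idx), len(val_idx))
# Number of indices that appear in both subsets (should be 0)
print(len(set(train_idx) & set(val_idx)))
# Together the subsets cover every index from 0 to n-1
print(sorted(np.concatenate([train_idx, val_idx])) == list(range(50)))
```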

We can now create PyTorch data loaders for each of the subsets using a SubsetRandomSampler, which samples elements randomly from a given list of indices, while creating batches of data.

In [5]:

batch_size = 100

# Training sampler and data loader
train_sampler = SubsetRandomSampler(train_indices)
train_dl = DataLoader(dataset, batch_size, sampler=train_sampler)

# Validation sampler and data loader
valid_sampler = SubsetRandomSampler(val_indices)
valid_dl = DataLoader(dataset, batch_size, sampler=valid_sampler)
In [7]:
def get_train_dl(ds, bs, smplr):
  # Pass the sampler by keyword: the third positional argument of DataLoader is shuffle
  return DataLoader(ds, bs, sampler=smplr)


To improve upon logistic regression, we'll create a neural network with one hidden layer. Here's what this means:

  • Instead of using a single nn.Linear object to transform a batch of inputs (pixel intensities) into a batch of outputs (class probabilities), we'll use two nn.Linear objects. Each of these is called a layer in the network.

  • The first layer (also known as the hidden layer) will transform the input matrix of shape batch_size x 784 into an intermediate output matrix of shape batch_size x hidden_size, where hidden_size is a preconfigured parameter (e.g. 32 or 64).

  • The intermediate outputs are then passed into a non-linear activation function, which operates on individual elements of the output matrix.

  • The result of the activation function, which is also of size batch_size x hidden_size, is passed into the second layer (also known as the output layer), which transforms it into a matrix of size batch_size x 10, identical to the output of the logistic regression model.
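
The shapes described in the list above can be traced with a short sketch. It assumes a batch size of 100 and a hidden size of 32, uses random numbers in place of real images, and uses ReLU as the activation (the function introduced later in this section). The variable names are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, in_size, hidden_size, out_size = 100, 784, 32, 10

hidden_layer = nn.Linear(in_size, hidden_size)
output_layer = nn.Linear(hidden_size, out_size)

xb = torch.randn(batch_size, in_size)   # stand-in for a batch of flattened images
hidden = hidden_layer(xb)               # batch_size x hidden_size
activated = F.relu(hidden)              # same shape, applied element by element
outputs = output_layer(activated)       # batch_size x 10

print(hidden.shape, activated.shape, outputs.shape)
```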

Introducing a hidden layer and an activation function allows the model to learn more complex, multi-layered and non-linear relationships between the inputs and the targets. Here's what it looks like visually:

The activation function we'll use here is called a Rectified Linear Unit or ReLU, and it has a really simple formula: relu(x) = max(0,x) i.e. if an element is negative, we replace it by 0, otherwise we leave it unchanged.
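
To see the formula in action, here is a small sketch that applies `F.relu` to a hand-written tensor and compares it with clamping at zero, which is another way of writing max(0, x). The input values are made up for illustration.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[-2.0, -0.5, 0.0],
                  [ 1.5,  3.0, -1.0]])

# Negative elements become 0, the rest are left unchanged
print(F.relu(x))
# The same result written as a clamp with a lower bound of 0
print(torch.clamp(x, min=0))
```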

To define the model, we extend the nn.Module class, just as we did with logistic regression.

In [8]:
import torch.nn.functional as F
import torch.nn as nn
In [9]:
class MnistModel(nn.Module):
    """Feedforward neural network with 1 hidden layer"""
    def __init__(self, in_size, hidden_size, out_size):
        # Initialize nn.Module before registering any layers
        super().__init__()
        # hidden layer
        self.linear1 = nn.Linear(in_size, hidden_size)
        # output layer
        self.linear2 = nn.Linear(hidden_size, out_size)
    def forward(self, xb):
        # Flatten the image tensors
        xb = xb.view(xb.size(0), -1)
        # Get intermediate outputs using hidden layer
        out = self.linear1(xb)
        # Apply activation function
        out = F.relu(out)
        # Get predictions using output layer
        out = self.linear2(out)
        return out

We'll create a model that contains a hidden layer with 32 activations.

In [10]:
input_size = 784
num_classes = 10

model = MnistModel(input_size, hidden_size=32, out_size=num_classes)

Let's take a look at the model's parameters. We expect to see one weight and bias matrix for each of the layers.

In [11]:
for t in model.parameters():
    print(t.shape)
torch.Size([32, 784])
torch.Size([32])
torch.Size([10, 32])
torch.Size([10])
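
These shapes also tell us how many numbers the model has to learn. As a rough sketch, the helper below (a hypothetical function, not part of the notebook) counts the weights and biases of a linear layer from its input and output sizes.

```python
def linear_param_count(in_features, out_features):
    # Weight matrix has out_features x in_features entries, plus one bias per output
    return out_features * in_features + out_features

hidden = linear_param_count(784, 32)   # 32*784 + 32
output = linear_param_count(32, 10)    # 10*32 + 10
print(hidden, output, hidden + output)  # 25120 330 25450
```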

Let's try and generate some outputs using our model. We'll take the first batch of 100 images from our dataset, and pass them into our model.

In [12]:
for images, labels in train_dl:
    outputs = model(images)
    loss = F.cross_entropy(outputs, labels)
    print('Loss:', loss.item())
    break

print('outputs.shape : ', outputs.shape)
print('Sample outputs :\n', outputs[:2].data)
Loss: 2.33284068107605
outputs.shape :  torch.Size([100, 10])
Sample outputs :
 tensor([[ 0.2022,  0.0180,  0.1954, -0.0176, -0.1575, -0.1170, -0.2710,  0.1300, -0.0899, -0.2078],
        [ 0.0385,  0.1787,  0.1076, -0.0136, -0.1613, -0.1780, -0.3600,  0.1198, -0.1288, -0.2146]])
In [13]:
for images, labels in get_train_dl(dataset, 25, train_sampler):
    outputs = model(images)
    loss = F.cross_entropy(outputs, labels)
    print('Loss:', loss.item())
    break

print('outputs.shape : ', outputs.shape)
print('Sample outputs :\n', outputs[:2].data)
Loss: 2.3188652992248535
outputs.shape :  torch.Size([25, 10])
Sample outputs :
 tensor([[ 0.0931,  0.0554,  0.3005, -0.0124, -0.1851, -0.1277, -0.2825,  0.1635, -0.0648, -0.3079],
        [ 0.1045,  0.1444,  0.0801, -0.0846, -0.0444,  0.0795, -0.2522,  0.1933, -0.0204, -0.4069]])

Using a GPU

As the sizes of our models and datasets increase, we need to use GPUs to train our models within a reasonable amount of time. GPUs contain hundreds of cores that are optimized for performing expensive matrix operations on floating point numbers in a short time, which makes them ideal for training deep neural networks with many layers. You can use GPUs for free on Kaggle kernels or Google Colab, or rent GPU-powered machines on services like Google Cloud Platform, Amazon Web Services or Paperspace.

We can check if a GPU is available and the required NVIDIA CUDA drivers are installed using torch.cuda.is_available.

In [14]:
torch.cuda.is_available()

Let's define a helper function to ensure that our code uses the GPU if available, and defaults to using the CPU if it isn't.

In [15]:
def get_default_device():
    """Pick GPU if available, else CPU"""
    if torch.cuda.is_available():
        return torch.device('cuda')
    else:
        return torch.device('cpu')
In [16]:
device = get_default_device()

Next, let's define a function that can move data and model to a chosen device.

In [17]:
def to_device(data, device):
    """Move tensor(s) to chosen device"""
    if isinstance(data, (list,tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)
In [18]:
for images, labels in train_dl:
    print(images.shape)
    images = to_device(images, device)
    print(images.device)
    break
torch.Size([100, 1, 28, 28])
cpu
In [20]:
for images, labels in get_train_dl(dataset, 25, train_sampler):
    print(images.shape)
    images = to_device(images, device)
    print(images.device)
    break
torch.Size([25, 1, 28, 28])
cpu
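
A data loader yields each batch as a pair of tensors (images and labels), which is why `to_device` handles lists and tuples recursively. The sketch below repeats the function so it runs on its own, and assumes the CPU as the target device so that it works on any machine; the batch itself is made up.

```python
import torch

def to_device(data, device):
    """Move tensor(s) to chosen device"""
    if isinstance(data, (list, tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)

device = torch.device('cpu')  # assumption: CPU, so the sketch runs anywhere
# Made-up (images, labels) pair shaped like a batch of 4 MNIST images
batch = (torch.zeros(4, 1, 28, 28), torch.tensor([7, 2, 1, 0]))

# Both tensors in the pair are moved in one call
images, labels = to_device(batch, device)
print(images.device, labels.device)
```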

Finally, we define a DeviceDataLoader class to wrap our existing data loaders and move data to the selected device as batches are accessed. Interestingly, we don't need to extend an existing class to create a PyTorch data loader. All we need is an __iter__ method to retrieve batches of data, and a __len__ method to get the number of batches.

In [21]:
class DeviceDataLoader():
    """Wrap a dataloader to move data to a device"""
    def __init__(self, dl, device):
        self.dl = dl
        self.device = device
    def __iter__(self):
        """Yield a batch of data after moving it to device"""
        for b in self.dl: 
            yield to_device(b, self.device)

    def __len__(self):
        """Number of batches"""
        return len(self.dl)
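
The same pattern works for any wrapper, with or without PyTorch. As a minimal sketch, the hypothetical `MappedLoader` class below applies an arbitrary function to each batch of a plain Python list. Because `__iter__` is a generator function, every `for` loop gets a fresh iterator, so the wrapper can be reused across epochs.

```python
class MappedLoader:
    """Hypothetical wrapper: applies fn to every batch of an underlying iterable"""
    def __init__(self, batches, fn):
        self.batches = batches
        self.fn = fn

    def __iter__(self):
        # Yield each batch after transforming it
        for b in self.batches:
            yield self.fn(b)

    def __len__(self):
        # Number of batches
        return len(self.batches)

loader = MappedLoader([[1, 2], [3, 4], [5]], fn=sum)
print(len(loader))            # 3
print([b for b in loader])    # [3, 7, 5]
print([b for b in loader])    # [3, 7, 5] again: iteration restarts each time
```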

We can now wrap our data loaders using DeviceDataLoader.

In [22]:
train_dl = DeviceDataLoader(train_dl, device)
valid_dl = DeviceDataLoader(valid_dl, device)

Tensors that have been moved to the GPU's RAM have a device property which includes the word cuda. Let's verify this by looking at a batch of data from valid_dl.

In [23]:
for xb, yb in valid_dl:
    print('xb.device:', xb.device)
    print('yb:', yb)
    break
xb.device: cpu
yb: tensor([7, 7, 8, 8, 0, 8, 1, 0, 8, 1, 4, 2, 3, 8, 3, 3, 2, 3, 3, 2, 2, 7, 4, 2, 9, 2, 7, 1, 8, 1, 4, 0, 1, 6, 3, 2, 3, 7, 9, 9, 2, 6, 8, 9, 7, 8, 7, 4, 5, 0, 0, 3, 5, 5, 6, 2, 9, 2, 1, 6, 6, 5, 3, 3, 8, 9, 6, 9, 1, 7, 2, 0, 2, 7, 3, 5, 0, 1, 3, 5, 1, 3, 9, 1, 5, 4, 4, 3, 6, 1, 3, 7, 1, 6, 9, 1, 3, 3, 4, 7])

Training the Model

As with logistic regression, we can use cross entropy as the loss function and accuracy as the evaluation metric for our model. The training loop is also identical, so we can reuse the loss_batch, evaluate and fit functions from the previous tutorial.

The loss_batch function calculates the loss and metric value for a batch of data, and optionally performs gradient descent if an optimizer is provided.

In [24]:
def loss_batch(model, loss_func, xb, yb, opt=None, metric=None):
    # Generate predictions
    preds = model(xb)
    # Calculate loss
    loss = loss_func(preds, yb)
    if opt is not None:
        # Compute gradients
        loss.backward()
        # Update parameters
        opt.step()
        # Reset gradients
        opt.zero_grad()
    metric_result = None
    if metric is not None:
        # Compute the metric
        metric_result = metric(preds, yb)
    return loss.item(), len(xb), metric_result

The evaluate function calculates the overall loss (and a metric, if provided) for the validation set.

In [25]:
def evaluate(model, loss_fn, valid_dl, metric=None):
    with torch.no_grad():
        # Pass each batch through the model
        results = [loss_batch(model, loss_fn, xb, yb, metric=metric)
                   for xb,yb in valid_dl]
        # Separate losses, counts and metrics
        losses, nums, metrics = zip(*results)
        # Total size of the dataset
        total = np.sum(nums)
        # Avg. loss across batches 
        avg_loss = np.sum(np.multiply(losses, nums)) / total
        avg_metric = None
        if metric is not None:
            # Avg. of metric across batches
            avg_metric = np.sum(np.multiply(metrics, nums)) / total
    return avg_loss, total, avg_metric
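
Weighting each batch by its size matters whenever the batches are not all the same size, for example when the dataset size is not a multiple of the batch size. The sketch below uses made-up numbers to show how the weighted average differs from a plain average of the per-batch losses.

```python
# Made-up per-batch results: one full batch and one smaller final batch
losses = [0.50, 1.00]
nums = [100, 50]

total = sum(nums)
# Each loss counts in proportion to the number of samples it was computed on
weighted_avg = sum(l * n for l, n in zip(losses, nums)) / total
# A plain mean treats the small batch as if it were as large as the full one
simple_avg = sum(losses) / len(losses)

print(round(weighted_avg, 4), round(simple_avg, 4))  # 0.6667 0.75
```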

The fit function contains the actual training loop, as defined in the previous tutorials. We'll make a couple more enhancements to the fit function:

  • Instead of defining the optimizer manually, we'll pass in the learning rate and create an optimizer inside the fit function. This will allow us to train the model with different learning rates, if required.

  • We'll record the validation loss and accuracy at the end of every epoch, and return the history as the output of the fit function.

In [26]:
def fit(epochs, lr, model, loss_fn, train_dl, 
        valid_dl, metric=None, opt_fn=None):
    losses, metrics = [], []
    # Instantiate the optimizer
    if opt_fn is None: opt_fn = torch.optim.SGD
    opt = opt_fn(model.parameters(), lr=lr)
    for epoch in range(epochs):
        # Training
        for xb,yb in train_dl:
            loss,_,_ = loss_batch(model, loss_fn, xb, yb, opt)

        # Evaluation
        result = evaluate(model, loss_fn, valid_dl, metric)
        val_loss, total, val_metric = result
        # Record the loss & metric
        losses.append(val_loss)
        metrics.append(val_metric)
        # Print progress
        if metric is None:
            print('Epoch [{}/{}], Loss: {:.4f}'
                  .format(epoch+1, epochs, val_loss))
        else:
            print('Epoch [{}/{}], Loss: {:.4f}, {}: {:.4f}'
                  .format(epoch+1, epochs, val_loss, 
                          metric.__name__, val_metric))
    return losses, metrics

We also define an accuracy function which calculates the overall accuracy of the model on an entire batch of outputs, so that we can use it as a metric in fit.

In [27]:
def accuracy(outputs, labels):
    _, preds = torch.max(outputs, dim=1)
    return torch.sum(preds == labels).item() / len(preds)
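
To make the metric concrete, here is a small sketch that runs `accuracy` on a hand-written batch of three outputs with two classes. The numbers are made up; the function is repeated so that the example runs on its own.

```python
import torch

def accuracy(outputs, labels):
    # Index of the largest output in each row is the predicted class
    _, preds = torch.max(outputs, dim=1)
    return torch.sum(preds == labels).item() / len(preds)

outputs = torch.tensor([[0.1, 0.9],    # predicted class 1
                        [0.8, 0.2],    # predicted class 0
                        [0.3, 0.7]])   # predicted class 1
labels = torch.tensor([1, 0, 0])       # the third prediction is wrong

print(accuracy(outputs, labels))       # 2 of 3 correct
```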

Before we train the model, we need to ensure that the data and the model's parameters (weights and biases) are on the same device (CPU or GPU). We can reuse the to_device function to move the model's parameters to the right device.

In [28]:
# Model (on GPU)
model = MnistModel(input_size, hidden_size=32, out_size=num_classes)
to_device(model, device)
MnistModel(
  (linear1): Linear(in_features=784, out_features=32, bias=True)
  (linear2): Linear(in_features=32, out_features=10, bias=True)
)

Let's see how the model performs on the validation set with the initial set of weights and biases.

In [29]:
val_loss, total, val_acc = evaluate(model, F.cross_entropy, 
                                    valid_dl, metric=accuracy)
print('Loss: {:.4f}, Accuracy: {:.4f}'.format(val_loss, val_acc))
Loss: 2.3052, Accuracy: 0.0907

The initial accuracy is around 10%, which is what one might expect from a randomly initialized model (since it has a 1 in 10 chance of getting a label right by guessing randomly).

We are now ready to train the model. Let's train for 5 epochs and look at the results. We can use a relatively high learning rate of 0.5.
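
The initial loss of about 2.30 can be sanity-checked in the same way. A model that assigns equal probability to each of the 10 classes has a cross-entropy loss of -ln(1/10), regardless of the true label. The sketch below computes that value with the standard library.

```python
import math

num_classes = 10
uniform_prob = 1 / num_classes

# Cross entropy is -log of the probability assigned to the correct class
expected_loss = -math.log(uniform_prob)
print(round(expected_loss, 4))  # 2.3026
```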

In [30]:
losses1, metrics1 = fit(5, 0.5, model, F.cross_entropy, 
                        train_dl, valid_dl, accuracy)
Epoch [1/5], Loss: 0.2222, accuracy: 0.9346
Epoch [2/5], Loss: 0.1698, accuracy: 0.9498
Epoch [3/5], Loss: 0.1502, accuracy: 0.9534
Epoch [4/5], Loss: 0.1435, accuracy: 0.9577
Epoch [5/5], Loss: 0.1396, accuracy: 0.9557

95% is pretty good! Let's train the model for 5 more epochs at a lower learning rate of 0.1, to further improve the accuracy.

In [31]:
losses2, metrics2 = fit(5, 0.1, model, F.cross_entropy, 
                        train_dl, valid_dl, accuracy)
Epoch [1/5], Loss: 0.1156, accuracy: 0.9627
Epoch [2/5], Loss: 0.1184, accuracy: 0.9628
Epoch [3/5], Loss: 0.1146, accuracy: 0.9639
Epoch [4/5], Loss: 0.1130, accuracy: 0.9639
Epoch [5/5], Loss: 0.1119, accuracy: 0.9649

We can now plot the accuracies to study how the model improves over time.

In [32]:
import matplotlib.pyplot as plt
In [33]:
# Replace these values with your results
accuracies = [val_acc] + metrics1 + metrics2
plt.plot(accuracies, '-x')
plt.title('Accuracy vs. No. of epochs');
[Plot: Accuracy vs. No. of epochs]

Our current model outperforms the logistic regression model (which could only reach around 86% accuracy) by a huge margin! It quickly reaches an accuracy of 96%, but doesn't improve much beyond this. To improve the accuracy further, we need to make the model more powerful. As you can probably guess, this can be achieved by increasing the size of the hidden layer, or adding more hidden layers. I encourage you to try out both these approaches and see which one works better.
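
As a starting point for the second approach, here is a minimal sketch of a model with two hidden layers. The class name `MnistModelDeep` and the layer sizes 64 and 32 are assumptions made for illustration; they are not part of the original notebook, and no claim is made here about the accuracy this variant reaches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MnistModelDeep(nn.Module):
    """Hypothetical variant with 2 hidden layers (not part of the original notebook)"""
    def __init__(self, in_size, hidden1, hidden2, out_size):
        super().__init__()
        self.linear1 = nn.Linear(in_size, hidden1)
        self.linear2 = nn.Linear(hidden1, hidden2)
        self.linear3 = nn.Linear(hidden2, out_size)

    def forward(self, xb):
        xb = xb.view(xb.size(0), -1)       # flatten the image tensors
        out = F.relu(self.linear1(xb))     # first hidden layer + activation
        out = F.relu(self.linear2(out))    # second hidden layer + activation
        return self.linear3(out)           # output layer (no activation)

deep_model = MnistModelDeep(784, hidden1=64, hidden2=32, out_size=10)
# Random stand-in for a batch of 5 MNIST images
print(deep_model(torch.randn(5, 1, 28, 28)).shape)
```

Since the constructor and `forward` follow the same conventions as `MnistModel`, an instance can be moved with `to_device` and passed to `fit` in the same way.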

Commit and upload the notebook

As a final step, we can save and commit our work using the jovian library.

In [38]:
!pip install jovian --upgrade -q
In [39]:
import jovian
In [ ]:
jovian.commit()
[jovian] Saving notebook..

Summary and Further Reading

Here is a summary of the topics covered in this tutorial:

  • We created a neural network with one hidden layer to improve upon the logistic regression model from the previous tutorial. We also used the ReLU activation function to introduce non-linearity into the model, allowing it to learn more complex relationships between the inputs (pixel intensities) and outputs (class probabilities).

  • We defined some utilities like get_default_device, to_device and DeviceDataLoader to leverage a GPU if available, by moving the input data and model parameters to the appropriate device.

  • We were able to use the exact same training loop: the fit function we had defined earlier to train our model and evaluate it using the validation dataset.

There's a lot of scope to experiment here, and I encourage you to use the interactive nature of Jupyter to play around with the various parameters. Here are a few ideas:

  • Try changing the size of the hidden layer, or add more hidden layers and see if you can achieve a higher accuracy.

  • Try changing the batch size and learning rate to see if you can achieve the same accuracy in fewer epochs.

  • Compare the training times on a CPU vs. GPU. Do you see a significant difference? How does it vary with the size of the dataset and the size of the model (no. of weights and parameters)?

  • Try building a model for a different dataset, such as the CIFAR10 or CIFAR100 datasets.
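
For the CPU vs. GPU comparison suggested in the list above, one possible approach is to time a fixed amount of work on each device. The sketch below times matrix multiplications rather than a full training run; the helper name `time_matmul`, the matrix size and the number of repeats are all assumptions. GPU operations are launched asynchronously, so the code calls `torch.cuda.synchronize()` before reading the clock.

```python
import time
import torch

def time_matmul(device, n=512, repeats=10):
    """Average seconds per n x n matrix multiplication on the given device"""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device.type == 'cuda':
        torch.cuda.synchronize()   # finish pending GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    if device.type == 'cuda':
        torch.cuda.synchronize()   # wait for the queued GPU work to finish
    return (time.perf_counter() - start) / repeats

print('cpu :', time_matmul(torch.device('cpu')))
if torch.cuda.is_available():
    print('cuda:', time_matmul(torch.device('cuda')))
```

The first call on a GPU often includes one-time start-up costs, so it can help to run the function once as a warm-up and discard that result.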

Here are some references for further reading:
