
Insurance cost prediction using linear regression

Make a submission here: https://jovian.ai/learn/deep-learning-with-pytorch-zero-to-gans/assignment/assignment-2-train-your-first-model

In this assignment we're going to use information like a person's age, sex, BMI, no. of children and smoking habit to predict the price of yearly medical bills. This kind of model is useful for insurance companies to determine the yearly insurance premium for a person. The dataset for this problem is taken from Kaggle.

We will create a model with the following steps:

  1. Download and explore the dataset
  2. Prepare the dataset for training
  3. Create a linear regression model
  4. Train the model to fit the data
  5. Make predictions using the trained model

This assignment builds upon the concepts from the first two lessons. It will help to review the Jupyter notebooks that accompany those lessons.

As you go through this notebook, you will find a ??? in certain places. Your job is to replace the ??? with appropriate code or values, to ensure that the notebook runs properly end-to-end. In some cases, you'll be required to choose some hyperparameters (learning rate, batch size, etc.). Try experimenting with the hyperparameters to get the lowest loss.

# Uncomment and run the appropriate command for your operating system, if required

# Linux / Binder
# !pip install numpy matplotlib pandas torch==1.7.0+cpu torchvision==0.8.1+cpu torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html

# Windows
# !pip install numpy matplotlib pandas torch==1.7.0+cpu torchvision==0.8.1+cpu torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html

# MacOS
# !pip install numpy matplotlib pandas torch torchvision torchaudio
import torch
import jovian
import torchvision
import torch.nn as nn
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import torch.nn.functional as F
from datetime import datetime
from torchvision.datasets.utils import download_url
from torch.utils.data import DataLoader, TensorDataset, random_split
project_name='02-insurance-linear-regression-assignment' # will be used by jovian.commit

Step 1: Download and explore the data

Let us begin by downloading the data. We'll use the download_url function from torchvision to get the data as a CSV (comma-separated values) file.

DATASET_URL = "https://hub.jovian.ml/wp-content/uploads/2020/05/insurance.csv"
DATA_FILENAME = "insurance.csv"
download_url(DATASET_URL, '.')
Using downloaded and verified file: ./insurance.csv

To load the dataset into memory, we'll use the read_csv function from the pandas library. The data will be loaded as a Pandas dataframe. See this short tutorial to learn more: https://data36.com/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection/

dataframe_raw = pd.read_csv(DATA_FILENAME)
dataframe_raw.head()

We're going to customize the data slightly, so that every participant receives a slightly different version of the dataset. Fill in your name below as a string (enter at least 5 characters).

your_name = 'aditya' # at least 5 characters

The customize_dataset function will customize the dataset slightly using your name as a source of random numbers.

def customize_dataset(dataframe_raw, rand_str):
    dataframe = dataframe_raw.copy(deep=True)
    # drop some rows
    dataframe = dataframe.sample(int(0.95*len(dataframe)), random_state=int(ord(rand_str[0])))
    # scale input
    dataframe.bmi = dataframe.bmi * ord(rand_str[1])/100.
    # scale target
    dataframe.charges = dataframe.charges * ord(rand_str[2])/100.
    # drop column
    if ord(rand_str[3]) % 2 == 1:
        dataframe = dataframe.drop(['region'], axis=1)
    return dataframe
dataframe = customize_dataset(dataframe_raw, your_name)
dataframe.head()

Let us answer some basic questions about the dataset.

Let's use the shape property provided by pandas to find the number of rows and columns.

Q: How many rows does the dataset have?

df_rows, df_columns = dataframe.shape
num_rows = df_rows
print(num_rows)
1271

Q: How many columns does the dataset have?

num_cols = df_columns
print(num_cols)
7

Q: What are the column titles of the input variables?

To get a list of all column names, we use the columns property, which returns an Index object, and convert it into a list.

Since we need to predict the medical charges from the other columns, every column except 'charges' (the output) is an input.

df_cols_list = list(dataframe.columns)
print(df_cols_list)
['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges']
input_cols = df_cols_list.copy() # copy the list so we don't mutate df_cols_list
input_cols.remove('charges')
print('Input Columns are -> ',input_cols)
Input Columns are -> ['age', 'sex', 'bmi', 'children', 'smoker', 'region']

Q: Which of the input columns are non-numeric or categorical variables?

We can get an idea of the categorical features by looking at the head. Here 'sex', 'smoker', and 'region' are non-numeric.

But let's explore some common pandas functions to identify these through code (we can remove or add exceptions we identify at the end). This will also help us get the hang of the pandas library.

#to get the datatypes of the columns in the dataframe
dataframe.dtypes 
age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object
# the private _get_numeric_data helper returns just the numeric columns
numeric_cols = list(dataframe._get_numeric_data().columns)
print(numeric_cols)

['age', 'bmi', 'children', 'charges']

These are the numeric columns.

Let's use set and its built-in functions to take the difference between the set of all columns and the set of numeric columns, which gives us the categorical columns.

all_cols_set = set(df_cols_list)
print('All Columns Set ->', all_cols_set)

numeric_cols_set = set(numeric_cols)
print('\nNumeric Columns Set ->', numeric_cols_set)

categorical_cols_set = all_cols_set.difference(numeric_cols_set)
print('\nCategorical Columns Set ->', categorical_cols_set)
All Columns Set -> {'bmi', 'region', 'children', 'sex', 'age', 'smoker', 'charges'}
Numeric Columns Set -> {'bmi', 'children', 'age', 'charges'}
Categorical Columns Set -> {'smoker', 'sex', 'region'}
categorical_cols = list(categorical_cols_set)
print('Categorical Columns ->', categorical_cols)
Categorical Columns -> ['smoker', 'sex', 'region']
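
As a cross-check, pandas also exposes the public select_dtypes method, which does the same job without relying on the private _get_numeric_data helper; a minimal sketch:

# Public-API alternative: select columns by dtype
print('Numeric ->', list(dataframe.select_dtypes(include='number').columns))
print('Categorical ->', list(dataframe.select_dtypes(include='object').columns))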

Q: What are the column titles of output/target variable(s)?

output_cols = ['charges']
print('Output Columns ->', output_cols)
Output Columns -> ['charges']

Q: What is the minimum, maximum and average value of the charges column? Can you show the distribution of values in a graph? Use this data visualization cheatsheet for reference: https://jovian.ml/aakashns/dataviz-cheatsheet

# Write your answer here

min_charges = dataframe["charges"].min()
max_charges = dataframe["charges"].max()
mean_charges = dataframe["charges"].mean()
print(f"\nMinimun charges -> {min_charges:.2f}")
print(f"\nMaximum charges -> {max_charges:.2f}")
print(f"\nAverage charges -> {mean_charges:.2f}")

dataframe.describe()
Minimum charges -> 1177.97
Maximum charges -> 66958.95
Average charges -> 13979.91
plt.figure(figsize=(14,8))
sns.histplot(dataframe.charges, kde=True) # distplot is deprecated; histplot is its replacement
[Notebook image: distribution of the charges column]

Step 2: Prepare the dataset for training

We need to convert the data from the Pandas dataframe into PyTorch tensors for training. To do this, the first step is to convert it to NumPy arrays. If you've filled out input_cols, categorical_cols and output_cols correctly, the following function will perform the conversion to NumPy arrays.

def dataframe_to_arrays(dataframe):
    # Make a copy of the original dataframe
    dataframe1 = dataframe.copy(deep=True)
    # Convert non-numeric categorical columns to numbers
    for col in categorical_cols:
        dataframe1[col] = dataframe1[col].astype('category').cat.codes
    # Extract input & outupts as numpy arrays
    inputs_array = dataframe1[input_cols].to_numpy()
    targets_array = dataframe1[output_cols].to_numpy()
    return inputs_array, targets_array

Read through the Pandas documentation to understand how we're converting categorical variables into numbers.
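
As a quick sanity check of what .astype('category').cat.codes does, here's a small sketch on the smoker column (the integer codes follow the alphabetical order of the labels, so 'no' -> 0 and 'yes' -> 1):

# Sanity check: string labels become integer codes
smoker_codes = dataframe['smoker'].astype('category').cat.codes
print(list(zip(dataframe['smoker'].head(3), smoker_codes.head(3))))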

inputs_array, targets_array = dataframe_to_arrays(dataframe)
inputs_array, targets_array
(array([[55.   ,  0.   , 32.775,  2.   ,  0.   ,  1.   ],
        [63.   ,  0.   , 36.85 ,  0.   ,  0.   ,  2.   ],
        [54.   ,  1.   , 39.6  ,  1.   ,  0.   ,  3.   ],
        ...,
        [58.   ,  1.   , 32.01 ,  1.   ,  0.   ,  2.   ],
        [32.   ,  0.   , 44.22 ,  0.   ,  0.   ,  2.   ],
        [35.   ,  1.   , 17.86 ,  1.   ,  0.   ,  1.   ]]),
 array([[12882.0638625],
        [14582.366925 ],
        [10973.0796   ],
        ...,
        [12543.957195 ],
        [ 4193.88669  ],
        [ 5372.32542  ]]))

Q: Convert the numpy arrays inputs_array and targets_array into PyTorch tensors. Make sure that the data type is torch.float32.

We can convert these NumPy arrays to tensors using the from_numpy function provided by PyTorch.

inputs = torch.from_numpy(inputs_array.astype(np.float32))
targets = torch.from_numpy(targets_array.astype(np.float32))
inputs.dtype, targets.dtype
(torch.float32, torch.float32)

Next, we need to create PyTorch datasets & data loaders for training & validation. We'll start by creating a TensorDataset.

dataset = TensorDataset(inputs, targets)

Q: Pick a number between 0.1 and 0.2 to determine the fraction of data that will be used for creating the validation set. Then use random_split to create training & validation datasets.

val_percent = 0.1 # between 0.1 and 0.2
val_size = int(num_rows * val_percent)
train_size = num_rows - val_size


train_ds, val_ds = random_split(dataset=dataset, lengths=[train_size, val_size]) # Use the random_split function to split dataset into 2 parts of the desired length
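
Note that random_split picks a different split on every run. If you want a reproducible split (useful when comparing hyperparameter runs), recent PyTorch versions also accept a seeded generator; a minimal sketch (the seed 42 is arbitrary):

# Optional: reproducible split using a seeded generator
# generator = torch.Generator().manual_seed(42)
# train_ds, val_ds = random_split(dataset, [train_size, val_size], generator=generator)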

Finally, we can create data loaders for training & validation.

Q: Pick a batch size for the data loader.

We have 1271 rows.

Let's experiment with batch size and log metrics accordingly.

Batch Sizes tried -> [100, 200, 400]

batch_size = 100
train_loader = DataLoader(train_ds, batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size)

Let's look at a batch of data to verify everything is working fine so far.

for xb, yb in train_loader:
    print("inputs:", xb)
    print("targets:", yb)
    break
inputs: tensor([[52.0000,  0.0000, 31.2000,  0.0000,  0.0000,  3.0000],
        [18.0000,  0.0000, 33.8800,  0.0000,  0.0000,  2.0000],
        [61.0000,  1.0000, 28.3100,  1.0000,  1.0000,  1.0000],
        ...,
        [42.0000,  0.0000, 33.1550,  1.0000,  0.0000,  0.0000],
        [28.0000,  0.0000, 33.4000,  0.0000,  0.0000,  3.0000]])
targets: tensor([[10107.2158],
        [12056.7666],
        [30312.0977],
        ...,
        [ 8021.3882],
        [ 3330.6189]])

Step 3: Create a Linear Regression Model

Our model itself is a fairly straightforward linear regression (we'll build more complex models in the next assignment).

input_size = len(input_cols)
output_size = len(output_cols)
print('Input Size is {inp_size} and Target Size is {out_size}'.format(inp_size=input_size, out_size=output_size))
Input Size is 6 and Target Size is 1

Q: Complete the class definition below by filling out the constructor (__init__), forward, training_step and validation_step methods.

Hint: Think carefully about picking a good loss function (it's not cross entropy). Maybe try 2-3 of them and see which one works best. See https://pytorch.org/docs/stable/nn.functional.html#loss-functions

Choosing a loss function

The choice of loss function depends on factors like the presence of outliers, the machine learning algorithm, the time efficiency of gradient descent, the ease of finding derivatives, and the confidence of predictions.

Let's look at some of the common regression losses.

  • Mean Squared Error / Quadratic Loss / L2 Loss: one of the most commonly used regression loss functions, calculated as the mean of the squared differences between the target values and the predicted values.

  • Mean Absolute Error / L1 Loss: the mean of the absolute differences between the target and predicted values; it measures the average magnitude of the errors without considering their direction.

    **MSE vs MAE** [Link](https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0)
    Squared error is easier to optimize, but absolute error is more robust to outliers. When a few outliers (or errors greater than 1) are present, squaring amplifies them and causes volatility in the loss, resulting in high error.

  • Huber Loss / Smooth Mean Absolute Error: less sensitive to outliers than squared error loss, while still being differentiable at 0.

  • Log-Cosh Loss: similar to L2 but smoother.

  • Quantile Loss: useful when we are interested in predicting an interval instead of only point predictions.

If the outliers represent anomalies that are important for the business and should be detected, we should use MSE. On the other hand, if we believe the outliers just represent corrupted data, we should choose MAE as the loss.

So MAE is useful when the data is corrupted with outliers. The problem with MAE is that its gradient has the same magnitude everywhere, i.e. the gradient stays large even for small losses, which is especially a problem for neural nets.

Inference (for this dataset): mse_loss returns a very high loss (and sometimes NaN) while l1_loss does not because the charges are large and heavily skewed, with big outliers (as seen in the graph a few sections above). Squaring a large residual amplifies it to a much higher value, whereas l1_loss doesn't scale as heavily.
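
To make the amplification concrete, here's a tiny sketch (with made-up numbers, not from our dataset) comparing the two losses when one target is an outlier:

# Illustrative only: a single large residual dominates MSE far more than L1
demo_preds = torch.tensor([[10000.], [12000.], [11000.]])
demo_targets = torch.tensor([[10500.], [12500.], [50000.]])  # last target is an outlier
print('L1 loss :', F.l1_loss(demo_preds, demo_targets).item())   # ~1.3e4
print('MSE loss:', F.mse_loss(demo_preds, demo_targets).item())  # ~5.1e8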

# Let's write a conditional loss function to help us easily test different cases
current_loss_type = 'l1'

def loss_function(outputs, targets):
    # Dispatch on current_loss_type; fall back to mse_loss, the standard choice for linear regression
    if current_loss_type == 'l1':
        return F.l1_loss(outputs, targets)
    elif current_loss_type == 'mse':
        return F.mse_loss(outputs, targets)
    elif current_loss_type == 'poisson':
        return F.poisson_nll_loss(outputs, targets)
    elif current_loss_type == 'smooth_l1_loss':
        return F.smooth_l1_loss(outputs, targets)
    return F.mse_loss(outputs, targets)
class InsuranceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(in_features=input_size, out_features=output_size)                  # fill this (hint: use input_size & output_size defined above)
        
    def forward(self, xb):
        out = self.linear(xb)                         # fill this
        return out
    
    def training_step(self, batch):
        inputs, targets = batch 
        # Generate predictions
        out = self(inputs)          
        # Calculate loss using the conditional loss function defined above
        loss = loss_function(outputs=out, targets=targets)
        return loss
    
    def validation_step(self, batch):
        inputs, targets = batch
        # Generate predictions
        out = self(inputs)
        # Calculate loss
        loss = loss_function(outputs=out, targets=targets)
        return {'val_loss': loss.detach()}
        
    def validation_epoch_end(self, outputs):
        batch_losses = [x['val_loss'] for x in outputs]
        epoch_loss = torch.stack(batch_losses).mean()   # Combine losses
        return {'val_loss': epoch_loss.item()}
    
    def epoch_end(self, epoch, result, num_epochs):
        # Print result every 1000th epoch (and the final epoch)
        if (epoch+1) % 1000 == 0 or epoch == num_epochs-1:
            print("Epoch [{}], val_loss: {:.4f}".format(epoch+1, result['val_loss']))

Let us create a model using the InsuranceModel class. You may need to come back later and re-run the next cell to reinitialize the model, in case the loss becomes nan or infinity.

model = InsuranceModel()

Let's check out the weights and biases of the model using model.parameters.

list(model.parameters())
[Parameter containing:
 tensor([[-0.2372,  0.1008, -0.2199,  0.3357, -0.0358, -0.0196]],
        requires_grad=True), Parameter containing:
 tensor([0.0288], requires_grad=True)]
model.state_dict()
OrderedDict([('linear.weight',
              tensor([[-0.2372,  0.1008, -0.2199,  0.3357, -0.0358, -0.0196]])),
             ('linear.bias', tensor([0.0288]))])

If we have previously saved weights, we can load them from the saved file. (On a fresh run, skip the next cell: the weights file is only created in the "Saving the Model" section below.)

model.load_state_dict(torch.load('insurance-linear-weights-biases.pth'))
model.state_dict()
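
Alternatively, here's a guarded version that won't crash on a fresh run; a minimal sketch (WEIGHTS_PATH simply mirrors the filename used by torch.save further below):

import os

WEIGHTS_PATH = 'insurance-linear-weights-biases.pth'  # same filename used when saving below
if os.path.exists(WEIGHTS_PATH):
    model.load_state_dict(torch.load(WEIGHTS_PATH))   # restore saved weights & biases
else:
    print('No saved weights found; keeping the random initialization.')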

Step 4: Train the model to fit the data

To train our model, we'll use the same fit function explained in the lecture. That's the benefit of defining a generic training loop - you can use it for any problem.

def evaluate(model, val_loader):
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    
    history = []
    optimizer = opt_func(model.parameters(), lr)

    for epoch in range(epochs):

        # Training Phase 
        for batch in train_loader:
            start_time = datetime.now()
            loss = model.training_step(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            end_time = datetime.now()
            batch_time = end_time - start_time  # duration of the most recent batch (returned for logging)

        # Validation phase
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result, epochs)
        history.append(result)

    
    return history, str(batch_time)

Q: Use the evaluate function to calculate the loss on the validation set before training.

result = evaluate(model=model, val_loader=val_loader) # Use the evaluate function
print(result)
{'val_loss': 13911.75}

We are now ready to train the model. You may need to run the training loop many times, for different numbers of epochs and with different learning rates, to get a good result. Also, if your loss becomes too large (or nan), you may have to re-initialize the model by running the cell model = InsuranceModel(). Experiment with this for a while, and try to get as low a loss as possible.

Q: Train the model 4-5 times with different learning rates & for different numbers of epochs.

Hint: Vary learning rates by orders of 10 (e.g. 1e-2, 1e-3, 1e-4, 1e-5, 1e-6) to figure out what works.

Trying with different batch sizes first.

Batch sizes tried (changed when creating the data loader) -> [200, 100]

Epochs tried -> [100, 1000, 10000, 100000, 25000] ... losses were decreasing roughly linearly at first, so we tried lower values

Learning rates tried -> [1e-5, 1e-4, 1e-6]

The learning rate scales the size of each gradient-descent update to the weights, mainly to avoid huge fluctuations or magnified losses. The smaller the learning rate, the smaller each adjustment, which makes the changes more precise but training slower.

So, we observed that with a learning rate of 1e-4 the loss settled fastest (at around 25000 epochs). Let's try combining two learning rates (1e-4 for 25000 epochs, then 1e-6 for 25000 epochs) and observe what happens.
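
One way to run such comparisons systematically is a small sweep that re-initializes the model for each learning rate and records the final validation loss; a minimal sketch (the epoch count is kept small on purpose, and the trial_* names are ours):

# Sketch: short trial runs to compare learning rates before a long final run
for trial_lr in [1e-4, 1e-5, 1e-6]:
    trial_model = InsuranceModel()  # fresh random weights for each trial
    trial_history, _ = fit(1000, trial_lr, trial_model, train_loader, val_loader)
    print('lr =', trial_lr, '-> val_loss =', trial_history[-1]['val_loss'])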

FINAL RUN


epochs_1 = 30000
lr_1 = 1e-4

epochs_2 = 20000
lr_2 = 1e-6

epochs_3 = 20000
lr_3 = 1e-7

total_start_time = datetime.now()

history1, batch_time1 = fit(epochs_1, lr_1, model, train_loader, val_loader)

history2, batch_time2 = fit(epochs_2, lr_2, model, train_loader, val_loader)

history3, batch_time3 = fit(epochs_3, lr_3, model, train_loader, val_loader)

history = [result] + history1 + history2 + history3
batch_time = ', '.join([batch_time1, batch_time2, batch_time3]) # one entry per run, for logging
epochs = epochs_1 + epochs_2 + epochs_3
lr = ', '.join(map(str, [lr_1, lr_2, lr_3])) # record all three learning rates

total_end_time = datetime.now()
total_time = total_end_time - total_start_time
Epoch [1000], val_loss: 11021.6680
Epoch [2000], val_loss: 9077.9141
Epoch [3000], val_loss: 8074.0264
Epoch [4000], val_loss: 7764.4932
Epoch [5000], val_loss: 7663.5732
...
Epoch [25000], val_loss: 7317.7041
Epoch [26000], val_loss: 7303.8721
Epoch [27000], val_loss: 7290.0859
Epoch [28000], val_loss: 7276.2974
Epoch [29000], val_loss: 7262.6641
Epoch [30000], val_loss: 7249.3076
Epoch [1000], val_loss: 7249.1836
Epoch [2000], val_loss: 7249.0615
...
Epoch [19000], val_loss: 7246.9775
Epoch [20000], val_loss: 7246.8564
Epoch [1000], val_loss: 7246.8457
Epoch [2000], val_loss: 7246.8350
...
Epoch [19000], val_loss: 7246.6514
Epoch [20000], val_loss: 7246.6411

EARLIER TEST RUNS

epochs = 10000
lr = 1e-5

total_start_time = datetime.now()

history2, batch_time2 = fit(epochs, lr, model, train_loader, val_loader)

history = history2
batch_time = batch_time2

total_end_time = datetime.now()
total_time = total_end_time - total_start_time
epochs = 100000
lr = 1e-5

total_start_time = datetime.now()

history3, batch_time3 = fit(epochs, lr, model, train_loader, val_loader)

history = history3
batch_time = batch_time3

total_end_time = datetime.now()
total_time = total_end_time - total_start_time
epochs = 100000
lr = 1e-4

total_start_time = datetime.now()

history4, batch_time4 = fit(epochs, lr, model, train_loader, val_loader)

history = history4
batch_time = batch_time4

total_end_time = datetime.now()
total_time = total_end_time - total_start_time

epochs_1 = 25000
lr_1 = 1e-4

epochs_2 = 25000
lr_2 = 1e-6

total_start_time = datetime.now()

history51, batch_time51 = fit(epochs_1, lr_1, model, train_loader, val_loader)

history52, batch_time52 = fit(epochs_2, lr_2, model, train_loader, val_loader)

history = [result] + history51 + history52
batch_time = ', '.join([batch_time51, batch_time52])
epochs = epochs_1 + epochs_2
lr = ', '.join(map(str, [lr_1, lr_2]))

total_end_time = datetime.now()
total_time = total_end_time - total_start_time

Plotting Loss vs. Epochs

Let's plot a graph of the losses for the current run to see the trend across epochs.

losses = [result['val_loss'] for result in history]
plt.plot(losses, '-x')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.title('Loss vs. No. of epochs');
[Notebook image: plot of validation loss vs. number of epochs]

Q: What is the final validation loss of your model?

val_loss = history[-1]['val_loss'] # last entry of the combined history from all three runs
print(f"Final validation loss -> {val_loss:.4f}")
Final validation loss -> 7246.6411

Let's log the final validation loss to Jovian and commit the notebook. (During hyperparameter tuning, we kept changing which history value to log.)

jovian.log_metrics(val_loss=history[-1]['val_loss'], epochs=epochs, batch_time = batch_time, total_time = str(total_time))
[jovian] Metrics logged.
jovian.log_hyperparams(batch_size=batch_size, learning_rate=lr, loss_function=current_loss_type)
[jovian] Hyperparams logged.

Now scroll back up, re-initialize the model, and try different sets of values for batch size, number of epochs, learning rate, etc. Commit each experiment and use the "Compare" and "View Diff" options on Jovian to compare the different results.

Step 5: Make predictions using the trained model

Q: Complete the following function definition to make predictions on a single input

def predict_single(input, target, model):
    inputs = input.unsqueeze(0)
    predictions = model(inputs)                # fill this
    prediction = predictions[0].detach()
    print("Input:", input)
    print("Target:", target)
    print("Prediction:", prediction)
input, target = val_ds[0]
predict_single(input, target, model)
Input: tensor([57.0000, 0.0000, 29.8100, 0.0000, 1.0000, 2.0000])
Target: tensor([28910.6094])
Prediction: tensor([12043.2451])
input, target = val_ds[10]
predict_single(input, target, model)
Input: tensor([48.0000, 1.0000, 30.2000, 2.0000, 0.0000, 3.0000])
Target: tensor([9416.7461])
Prediction: tensor([10178.0312])
input, target = val_ds[23]
predict_single(input, target, model)
Input: tensor([43.0000, 0.0000, 32.5600, 3.0000, 1.0000, 2.0000])
Target: tensor([42988.3516])
Prediction: tensor([9170.4199])

Are you happy with your model's predictions? Try to improve them further.

Saving the Model

Once we have trained the model for long enough and achieved a reasonable loss, it's good practice to save the weights and biases. This saves us from always having to retrain from scratch.

The three main functions required are:

  • model.state_dict() to get the current weights and biases
  • torch.save() to save the weights to a path
  • model.load_state_dict(torch.load('insurance-linear-weights-biases.pth')) to load the saved weights.
torch.save(model.state_dict(), 'insurance-linear-weights-biases.pth')

FINAL COMMIT (with weights)

jovian.commit(project=project_name, environment=None, outputs=['insurance-linear-weights-biases.pth'])
[jovian] Detected Colab notebook...
[jovian] Uploading colab notebook to Jovian...
[jovian] Uploading additional outputs...
[jovian] Attaching records (metrics, hyperparameters, dataset etc.)
[jovian] Committed successfully! https://jovian.ai/adityaramesh12/02-insurance-linear-regression-assignment

Submit the Assignment

jovian.submit(project=project_name, assignment="zerotogans-a2")

(Optional) Step 6: Try another dataset & blog about it

While this last step is optional for the submission of your assignment, we highly recommend that you do it. Try to replicate this notebook for a different linear regression or logistic regression problem. This will help solidify your understanding, and give you a chance to differentiate the generic patterns in machine learning from problem-specific details. You can use one of the starter notebooks provided in the course (just change the dataset).

There are many good sources online for finding datasets; Kaggle (the source of this assignment's dataset) is one place to start.

We also recommend that you write a blog about your approach to the problem. Here is a suggested structure for your post (feel free to experiment with it):

  • Interesting title & subtitle
  • Overview of what the blog covers (which dataset, linear regression or logistic regression, intro to PyTorch)
  • Downloading & exploring the data
  • Preparing the data for training
  • Creating a model using PyTorch
  • Training the model to fit the data
  • Your thoughts on how to experiment with different hyperparameters to reduce loss
  • Making predictions using the model

As with the previous assignment, you can embed Jupyter notebook cells & outputs from Jovian into your blog.

Don't forget to share your work on the forum: https://jovian.ai/forum/t/linear-regression-and-logistic-regression-notebooks-and-blog-posts/14039

jovian.commit(project=project_name, environment=None)
jovian.commit(project=project_name, environment=None) # try again, kaggle fails sometimes
# RESET METRICS
jovian.reset('hyperparams', 'metrics')