agnieszka08/02-insurance-linear-regression - Jovian

Insurance cost prediction using linear regression

In this assignment we're going to use information like a person's age, sex, BMI, no. of children and smoking habit to predict the price of yearly medical bills. This kind of model is useful for insurance companies to determine the yearly insurance premium for a person. The dataset for this problem is taken from: https://www.kaggle.com/mirichoi0218/insurance

We will create a model with the following steps:

  1. Download and explore the dataset
  2. Prepare the dataset for training
  3. Create a linear regression model
  4. Train the model to fit the data
  5. Make predictions using the trained model

This assignment builds upon the concepts from the first 2 lectures. It will help to review these Jupyter notebooks:

As you go through this notebook, you will find a ??? in certain places. Your job is to replace the ??? with appropriate code or values, to ensure that the notebook runs properly end-to-end. In some cases, you'll be required to choose some hyperparameters (learning rate, batch size etc.). Try to experiment with the hyperparameters to get the lowest loss.

# Uncomment and run the commands below if imports fail
# !conda install numpy pytorch torchvision cpuonly -c pytorch -y
# !pip install matplotlib --upgrade --quiet
!pip install jovian --upgrade --quiet
import torch
import jovian
import torchvision
import torch.nn as nn
import pandas as pd
import matplotlib.pyplot as plt
import torch.nn.functional as F
from torchvision.datasets.utils import download_url
from torch.utils.data import DataLoader, TensorDataset, random_split
project_name='02-insurance-linear-regression' # will be used by jovian.commit

Step 1: Download and explore the data

Let us begin by downloading the data. We'll use the download_url function from torchvision (imported above from torchvision.datasets.utils) to get the data as a CSV (comma-separated values) file.

DATASET_URL = "https://hub.jovian.ml/wp-content/uploads/2020/05/insurance.csv"
DATA_FILENAME = "insurance.csv"
download_url(DATASET_URL, '.')

To load the dataset into memory, we'll use the read_csv function from the pandas library. The data will be loaded as a Pandas dataframe. See this short tutorial to learn more: https://data36.com/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection/

dataframe_raw = pd.read_csv(DATA_FILENAME)
dataframe_raw.head()

We're going to customize the data slightly, so that every participant receives a slightly different version of the dataset. Fill in your name below as a string (enter at least 5 characters).

your_name = 'Agnieszka' # at least 5 characters

The customize_dataset function will customize the dataset slightly using your name as a source of random numbers.

def customize_dataset(dataframe_raw, rand_str):
    dataframe = dataframe_raw.copy(deep=True)
    # drop some rows
    dataframe = dataframe.sample(int(0.95*len(dataframe)), random_state=int(ord(rand_str[0])))
    # scale input
    dataframe.bmi = dataframe.bmi * ord(rand_str[1])/100.
    # scale target
    dataframe.charges = dataframe.charges * ord(rand_str[2])/100.
    # drop column
    if ord(rand_str[3]) % 2 == 1:
        dataframe = dataframe.drop(['region'], axis=1)
    return dataframe
dataframe = customize_dataset(dataframe_raw, your_name)
dataframe.head()
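
To see what customize_dataset does for a particular name, the character codes can be worked out by hand. The following is a minimal sketch that mirrors the arithmetic in the function above for the name used in this notebook; the helper name customization_factors is hypothetical and not part of the assignment.

```python
def customization_factors(rand_str):
    """Return the values customize_dataset derives from the first four characters."""
    return {
        "random_state": ord(rand_str[0]),           # seed for the row sampling
        "bmi_scale": ord(rand_str[1]) / 100.0,      # multiplier applied to bmi
        "charges_scale": ord(rand_str[2]) / 100.0,  # multiplier applied to charges
        "drop_region": ord(rand_str[3]) % 2 == 1,   # region is dropped for odd codes
    }

factors = customization_factors("Agnieszka")
print(factors)
```

For 'Agnieszka' this gives a sampling seed of 65, a bmi multiplier of 1.03, a charges multiplier of 1.10, and the region column dropped, which is consistent with region not appearing among the columns of this notebook's dataframe.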

Let us answer some basic questions about the dataset.

Q: How many rows does the dataset have?

dataframe
df = dataframe
num_rows = df.shape[0]
print(num_rows)
1271

Q: How many columns does the dataset have?

num_cols = df.shape[1]
print(num_cols)
6

Q: What are the column titles of the input variables?

input_cols = list(df.columns[0:5])
print(input_cols)
['age', 'sex', 'bmi', 'children', 'smoker']

Q: Which of the input columns are non-numeric or categorical variables?

Hint: sex is one of them. List the columns that are not numbers.

categorical_cols = list(df.dtypes[df.dtypes == "object"].index)
print(categorical_cols)
['sex', 'smoker']

Q: What are the column titles of output/target variable(s)?

output_cols = [df.columns[5]]
print(output_cols)
['charges']

Q: (Optional) What is the minimum, maximum and average value of the charges column? Can you show the distribution of values in a graph? Use this data visualization cheatsheet for reference: https://jovian.ml/aakashns/dataviz-cheatsheet

# Write your answer here
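
One possible way to approach the optional question is sketched below. It runs on a small made-up dataframe (df_demo is a hypothetical stand-in, and its values are invented) so that it is self-contained; in the notebook the same calls would be applied to the charges column of df. The distribution is shown as text-only bin counts here, and a call such as plt.hist on the column would draw the corresponding histogram.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's dataframe; the values are made up.
df_demo = pd.DataFrame({"charges": [1200.0, 4500.0, 9800.0, 15000.0, 42000.0]})

# Minimum, maximum and average in a single call.
summary = df_demo["charges"].agg(["min", "max", "mean"])
print(summary)

# Bucket the values into equal-width bins as a text-only view of the distribution.
bin_counts = pd.cut(df_demo["charges"], bins=4).value_counts(sort=False)
print(bin_counts)
```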

Remember to commit your notebook to Jovian after every step, so that you don't lose your work.

jovian.commit(project=project_name, environment=None)
[jovian] Attempting to save notebook..
[jovian] Please enter your API key ( from https://jovian.ml/ ):
API KEY: ········
[jovian] Updating notebook "agnieszka08/02-insurance-linear-regression" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Committed successfully! https://jovian.ml/agnieszka08/02-insurance-linear-regression

Step 2: Prepare the dataset for training

We need to convert the data from the Pandas dataframe into PyTorch tensors for training. To do this, the first step is to convert it to numpy arrays. If you've filled out input_cols, categorical_cols and output_cols correctly, the following function will perform the conversion to numpy arrays.

def dataframe_to_arrays(dataframe):
    # Make a copy of the original dataframe
    dataframe1 = dataframe.copy(deep=True)
    # Convert non-numeric categorical columns to numbers
    for col in categorical_cols:
        dataframe1[col] = dataframe1[col].astype('category').cat.codes
    # Extract inputs & outputs as numpy arrays
    inputs_array = dataframe1[input_cols].to_numpy()
    targets_array = dataframe1[output_cols].to_numpy()
    return inputs_array, targets_array

Read through the Pandas documentation to understand how we're converting categorical variables into numbers.
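
As a small illustration of that conversion, the sketch below applies the same astype('category').cat.codes call to a made-up dataframe (demo is hypothetical, not the insurance data). The categories are sorted, and each value is replaced by the index of its category.

```python
import pandas as pd

# Made-up rows, used only to show how the encoding behaves.
demo = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["no", "yes", "no", "no"],
})

for col in ["sex", "smoker"]:
    as_category = demo[col].astype("category")
    # Print the sorted categories next to the integer code assigned to each row.
    print(col, list(as_category.cat.categories), list(as_category.cat.codes))

sex_codes = list(demo["sex"].astype("category").cat.codes)
smoker_codes = list(demo["smoker"].astype("category").cat.codes)
```

Here 'female' and 'no' map to 0, while 'male' and 'yes' map to 1, because the codes follow the sorted order of the category labels.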

inputs_array, targets_array = dataframe_to_arrays(df)
inputs_array, targets_array
(array([[55.     ,  0.     , 33.75825,  2.     ,  0.     ],
        [64.     ,  1.     , 39.04215,  0.     ,  0.     ],
        [55.     ,  1.     , 38.84645,  3.     ,  0.     ],
        ...,
        [56.     ,  1.     , 26.71305,  0.     ,  0.     ],
        [18.     ,  0.     , 31.21415,  0.     ,  0.     ],
        [43.     ,  1.     , 26.8109 ,  0.     ,  0.     ]]),
 array([[13495.495475],
        [15631.589545],
        [33069.938605],
        ...,
        [12281.959415],
        [ 2424.109545],
        [ 7521.10557 ]]))

Q: Convert the numpy arrays inputs_array and targets_array into PyTorch tensors. Make sure that the data type is torch.float32.

inputs = torch.from_numpy(inputs_array).type(torch.float32)
targets = torch.from_numpy(targets_array).type(torch.float32)
inputs.dtype, targets.dtype
(torch.float32, torch.float32)
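
The reason for the explicit cast can be seen in a short sketch. NumPy arrays of Python floats default to float64, and torch.from_numpy keeps that data type, whereas the parameters of nn.Linear default to float32, so mixing the two would typically fail with a dtype mismatch. The array below is made up for illustration.

```python
import numpy as np
import torch

demo_array = np.array([[55.0, 0.0, 33.75], [64.0, 1.0, 39.04]])  # float64 by default

as_is = torch.from_numpy(demo_array)     # keeps float64 and shares memory with the array
as_float32 = as_is.type(torch.float32)   # returns a float32 copy

print(as_is.dtype, as_float32.dtype)
```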

Next, we need to create PyTorch datasets & data loaders for training & validation. We'll start by creating a TensorDataset.

dataset = TensorDataset(inputs, targets)

Q: Pick a number between 0.1 and 0.2 to determine the fraction of data that will be used for creating the validation set. Then use random_split to create training & validation datasets.

val_percent = 0.12 # between 0.1 and 0.2
val_size = int(num_rows * val_percent)
train_size = num_rows - val_size

train_ds, val_ds = random_split(dataset, [train_size, val_size]) # Use the random_split function to split dataset into 2 parts of the desired length
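
The sketch below shows how the split sizes follow from the validation fraction, using random tensors as a stand-in for the real dataset (all names ending in _demo are hypothetical). The seeded generator is an assumption added here so that the split is reproducible between runs; the assignment itself does not require it.

```python
import torch
from torch.utils.data import TensorDataset, random_split

num_rows_demo = 1271                          # row count reported earlier in this notebook
demo_inputs = torch.randn(num_rows_demo, 5)   # random stand-in for the input features
demo_targets = torch.randn(num_rows_demo, 1)  # random stand-in for the charges
demo_dataset = TensorDataset(demo_inputs, demo_targets)

val_percent_demo = 0.12
val_size_demo = int(num_rows_demo * val_percent_demo)
train_size_demo = num_rows_demo - val_size_demo

# Assumption: a fixed seed, so the same rows land in each part on every run.
generator = torch.Generator().manual_seed(42)
train_demo, val_demo = random_split(
    demo_dataset, [train_size_demo, val_size_demo], generator=generator
)
print(len(train_demo), len(val_demo))
```

With 1271 rows and a fraction of 0.12, this yields 152 validation rows and 1119 training rows. The two lengths passed to random_split must add up to the size of the dataset.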

Finally, we can create data loaders for training & validation.

Q: Pick a batch size for the data loader.

batch_size = 16
train_loader = DataLoader(train_ds, batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size)

Let's look at a batch of data to verify everything is working fine so far.

for xb, yb in train_loader:
    print("inputs:", xb)
    print("targets:", yb)
    break
inputs: tensor([[36.0000,  1.0000, 36.2560,  1.0000,  1.0000],
        [42.0000,  1.0000, 38.2954,  2.0000,  0.0000],
        [31.0000,  1.0000, 27.6915,  1.0000,  0.0000],
        [52.0000,  0.0000, 32.6819,  2.0000,  0.0000],
        [29.0000,  1.0000, 33.0733,  2.0000,  0.0000],
        [51.0000,  0.0000, 35.1230,  0.0000,  0.0000],
        [37.0000,  0.0000, 35.1282,  1.0000,  0.0000],
        [25.0000,  1.0000, 28.3765,  0.0000,  0.0000],
        [40.0000,  0.0000, 30.4880,  0.0000,  0.0000],
        [59.0000,  0.0000, 33.3669,  3.0000,  0.0000],
        [37.0000,  1.0000, 28.8657,  2.0000,  0.0000],
        [59.0000,  1.0000, 29.6486,  0.0000,  0.0000],
        [19.0000,  0.0000, 30.6940,  0.0000,  0.0000],
        [27.0000,  1.0000, 31.4150,  0.0000,  0.0000],
        [40.0000,  1.0000, 30.7970,  2.0000,  0.0000],
        [35.0000,  0.0000, 44.6402,  2.0000,  0.0000]])
targets: tensor([[42580.0938],
        [ 7878.2134],
        [ 4885.3345],
        [12306.4229],
        [ 4877.3076],
        [10211.9180],
        [ 6723.5884],
        [ 2775.4863],
        [ 6502.0386],
        [16049.6953],
        [ 6824.2920],
        [13342.5752],
        [ 1918.9115],
        [ 2743.4243],
        [ 7260.3970],
        [ 6431.6094]])

Let's save our work by committing to Jovian.

jovian.commit(project=project_name, environment=None)
[jovian] Attempting to save notebook..
[jovian] Updating notebook "agnieszka08/02-insurance-linear-regression" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Attaching records (metrics, hyperparameters, dataset etc.)
[jovian] Committed successfully! https://jovian.ml/agnieszka08/02-insurance-linear-regression

Step 3: Create a Linear Regression Model

Our model itself is a fairly straightforward linear regression (we'll build more complex models in the next assignment).

input_size = len(input_cols)
output_size = len(output_cols)

Q: Complete the class definition below by filling out the constructor (__init__), forward, training_step and validation_step methods.

Hint: Think carefully about picking a good loss function (it's not cross entropy). Maybe try 2-3 of them and see which one works best. See https://pytorch.org/docs/stable/nn.functional.html#loss-functions
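
To make the hint concrete, here is a minimal sketch comparing three regression losses from torch.nn.functional on two made-up predictions. It is not a recommendation of one loss over another, only a way to see how differently they weigh a large error.

```python
import torch
import torch.nn.functional as F

# Made-up predictions and targets: one error of size 1 and one of size 3.
preds = torch.tensor([[2.0], [4.0]])
targets = torch.tensor([[1.0], [7.0]])

l1 = F.l1_loss(preds, targets)             # mean absolute error
mse = F.mse_loss(preds, targets)           # mean squared error
smooth = F.smooth_l1_loss(preds, targets)  # quadratic for small errors, linear for large ones

print(l1.item(), mse.item(), smooth.item())
```

The three values are 2.0, 5.0 and 1.5. The squared error is dominated by the larger of the two errors, which matters for a target like charges, where a few very large bills can pull the fit strongly.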

class InsuranceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(input_size, output_size)    # fill this (hint: use input_size & output_size defined above)
        
    def forward(self, xb):
        out = self.linear(xb)    # fill this         
        return out
    
    def training_step(self, batch):
        inputs, targets = batch 
        # Generate predictions
        out = self(inputs)          
        # Calculate loss
        loss = F.l1_loss(out, targets)        # fill this
        return loss
    
    def validation_step(self, batch):
        inputs, targets = batch
        # Generate predictions
        out = self(inputs)
        # Calculate loss
        loss = F.l1_loss(out, targets)          # fill this    
        return {'val_loss': loss.detach()}
        
    def validation_epoch_end(self, outputs):
        batch_losses = [x['val_loss'] for x in outputs]
        epoch_loss = torch.stack(batch_losses).mean()   # Combine losses
        return {'val_loss': epoch_loss.item()}
    
    def epoch_end(self, epoch, result, num_epochs):
        # Print result every 20th epoch
        if (epoch+1) % 20 == 0 or epoch == num_epochs-1:
            print("Epoch [{}], val_loss: {:.4f}".format(epoch+1, result['val_loss']))
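
The way validation_epoch_end combines the per-batch losses can be tried in isolation. The sketch below uses three made-up batch results (batch_outputs is hypothetical) in the same dictionary format that validation_step returns.

```python
import torch

# Made-up per-batch results in the format returned by validation_step.
batch_outputs = [
    {"val_loss": torch.tensor(2.0)},
    {"val_loss": torch.tensor(4.0)},
    {"val_loss": torch.tensor(6.0)},
]

batch_losses = [x["val_loss"] for x in batch_outputs]
stacked = torch.stack(batch_losses)   # shape [3]: one entry per batch
epoch_loss = stacked.mean().item()    # plain Python float

print(tuple(stacked.shape), epoch_loss)
```

Note that this is a mean of per-batch means, so every batch counts equally. If the last batch is smaller than the others, its samples weigh slightly more than the rest, which is usually a negligible effect.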

Let us create a model using the InsuranceModel class. You may need to come back later and re-run the next cell to reinitialize the model, in case the loss becomes nan or infinity.

model = InsuranceModel()

Let's check out the weights and biases of the model using model.parameters.

list(model.parameters())
[Parameter containing:
 tensor([[ 0.1845, -0.2419,  0.2803, -0.1663,  0.2530]], requires_grad=True),
 Parameter containing:
 tensor([0.3481], requires_grad=True)]
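
The shapes of these parameters follow from the layer sizes. As a small sketch, a stand-alone nn.Linear layer with 5 inputs and 1 output (demo_layer is a hypothetical name) has one weight per input feature plus a single bias.

```python
import torch.nn as nn

demo_layer = nn.Linear(5, 1)   # 5 input features, 1 output, as in this notebook

# List each parameter with its shape.
for name, param in demo_layer.named_parameters():
    print(name, tuple(param.shape))

num_params = sum(p.numel() for p in demo_layer.parameters())
print(num_params)
```

The weight has shape (1, 5) and the bias has shape (1,), for 6 trainable values in total. The initial values are random, so they differ every time the model is created.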

One final commit before we train the model.

jovian.commit(project=project_name, environment=None)
[jovian] Attempting to save notebook..
[jovian] Updating notebook "agnieszka08/02-insurance-linear-regression" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Attaching records (metrics, hyperparameters, dataset etc.)
[jovian] Committed successfully! https://jovian.ml/agnieszka08/02-insurance-linear-regression

Step 4: Train the model to fit the data

To train our model, we'll use the same fit function explained in the lecture. That's the benefit of defining a generic training loop - you can use it for any problem.

def evaluate(model, val_loader):
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        # Training Phase 
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Validation phase
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result, epochs)
        history.append(result)
    return history

Q: Use the evaluate function to calculate the loss on the validation set before training.

result = evaluate(model, val_loader) # Use the evaluate function
print(result)
{'val_loss': 13715.078125}

We are now ready to train the model. You may need to run the training loop many times, for different numbers of epochs and with different learning rates, to get a good result. Also, if your loss becomes too large (or nan), you may have to re-initialize the model by running the cell model = InsuranceModel(). Experiment with this for a while, and try to get to as low a loss as possible.

Q: Train the model 4-5 times with different learning rates & for different numbers of epochs.

Hint: Vary learning rates by orders of 10 (e.g. 1e-2, 1e-3, 1e-4, 1e-5, 1e-6) to figure out what works.
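
A sweep over learning rates can be sketched as follows. This is an illustration on synthetic data, not the insurance dataset: the data, the helper final_loss, the number of epochs and the use of full-batch updates are all assumptions made to keep the example short. A fresh model is created for each learning rate so that the runs do not build on one another.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Synthetic regression data (made up for illustration): 200 rows, 5 features.
x = torch.randn(200, 5)
true_w = torch.tensor([[3.0], [-2.0], [0.5], [1.0], [4.0]])
y = x @ true_w + 10.0

def final_loss(lr, epochs=50):
    """Train a fresh linear model with full-batch SGD and return the last L1 loss."""
    model = nn.Linear(5, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr)
    for _ in range(epochs):
        loss = F.l1_loss(model(x), y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return F.l1_loss(model(x), y).item()

# One run per learning rate, each starting from a new model.
results = {lr: final_loss(lr) for lr in [1e-1, 1e-2, 1e-3, 1e-4]}
for lr, loss in results.items():
    print(f"lr={lr:g}  final loss={loss:.4f}")
```

On this toy problem the smallest learning rates barely move the loss within 50 epochs. Which learning rate works best on the insurance data depends on the scale of the inputs and targets, so it has to be found by experiment.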

epochs = 500
lr = 1e-2
history1 = fit(epochs, lr, model, train_loader, val_loader)
Epoch [20], val_loss: 7106.9009
Epoch [40], val_loss: 6913.0254
Epoch [60], val_loss: 6749.6626
Epoch [80], val_loss: 6606.2285
Epoch [100], val_loss: 6503.7251
Epoch [120], val_loss: 6441.5757
Epoch [140], val_loss: 6408.4736
Epoch [160], val_loss: 6391.2383
Epoch [180], val_loss: 6383.2383
Epoch [200], val_loss: 6377.3638
Epoch [220], val_loss: 6374.8853
Epoch [240], val_loss: 6372.1782
Epoch [260], val_loss: 6370.6328
Epoch [280], val_loss: 6368.8628
Epoch [300], val_loss: 6367.3042
Epoch [320], val_loss: 6365.7427
Epoch [340], val_loss: 6364.3779
Epoch [360], val_loss: 6362.4321
Epoch [380], val_loss: 6361.3770
Epoch [400], val_loss: 6359.6831
Epoch [420], val_loss: 6358.6504
Epoch [440], val_loss: 6356.7329
Epoch [460], val_loss: 6356.6445
Epoch [480], val_loss: 6354.9004
Epoch [500], val_loss: 6353.3931
epochs = 200
lr = 1e-2
history2 = fit(epochs, lr, model, train_loader, val_loader)
Epoch [20], val_loss: 6351.5757
Epoch [40], val_loss: 6350.9990
Epoch [60], val_loss: 6349.5327
Epoch [80], val_loss: 6347.6069
Epoch [100], val_loss: 6346.2695
Epoch [120], val_loss: 6345.3438
Epoch [140], val_loss: 6344.4854
Epoch [160], val_loss: 6342.7612
Epoch [180], val_loss: 6341.3765
Epoch [200], val_loss: 6340.3320
epochs = 300
lr = 1e-3
history3 = fit(epochs, lr, model, train_loader, val_loader)
Epoch [20], val_loss: 6339.8804
Epoch [40], val_loss: 6339.6792
Epoch [60], val_loss: 6339.4854
Epoch [80], val_loss: 6339.3594
Epoch [100], val_loss: 6339.1533
Epoch [120], val_loss: 6339.0444
Epoch [140], val_loss: 6338.9092
Epoch [160], val_loss: 6338.7808
Epoch [180], val_loss: 6338.5679
Epoch [200], val_loss: 6338.4136
Epoch [220], val_loss: 6338.3130
Epoch [240], val_loss: 6338.1719
Epoch [260], val_loss: 6338.1069
Epoch [280], val_loss: 6337.9390
Epoch [300], val_loss: 6337.7495
epochs = 500
lr = 1e-3
history4 = fit(epochs, lr, model, train_loader, val_loader)
Epoch [20], val_loss: 6337.6221
Epoch [40], val_loss: 6337.5269
Epoch [60], val_loss: 6337.3716
Epoch [80], val_loss: 6337.2524
Epoch [100], val_loss: 6337.0981
Epoch [120], val_loss: 6336.9727
Epoch [140], val_loss: 6336.8145
Epoch [160], val_loss: 6336.6797
Epoch [180], val_loss: 6336.6006
Epoch [200], val_loss: 6336.4912
Epoch [220], val_loss: 6336.3091
Epoch [240], val_loss: 6336.1958
Epoch [260], val_loss: 6336.0508
Epoch [280], val_loss: 6335.9165
Epoch [300], val_loss: 6335.7900
Epoch [320], val_loss: 6335.7310
Epoch [340], val_loss: 6335.5601
Epoch [360], val_loss: 6335.4541
Epoch [380], val_loss: 6335.2793
Epoch [400], val_loss: 6335.1890
Epoch [420], val_loss: 6335.0195
Epoch [440], val_loss: 6334.9414
Epoch [460], val_loss: 6334.7827
Epoch [480], val_loss: 6334.6533
Epoch [500], val_loss: 6334.5034
epochs = 500
lr = 1e-1
history5 = fit(epochs, lr, model, train_loader, val_loader)
Epoch [20], val_loss: 6322.6865
Epoch [40], val_loss: 6314.7295
Epoch [60], val_loss: 6308.4272
Epoch [80], val_loss: 6314.9355
Epoch [100], val_loss: 6289.5894
Epoch [120], val_loss: 6280.1904
Epoch [140], val_loss: 6288.4116
Epoch [160], val_loss: 6272.7744
Epoch [180], val_loss: 6263.8457
Epoch [200], val_loss: 6271.7568
Epoch [220], val_loss: 6256.6572
Epoch [240], val_loss: 6248.0991
Epoch [260], val_loss: 6241.8711
Epoch [280], val_loss: 6232.0454
Epoch [300], val_loss: 6240.0728
Epoch [320], val_loss: 6233.3281
Epoch [340], val_loss: 6227.7451
Epoch [360], val_loss: 6215.4922
Epoch [380], val_loss: 6212.0962
Epoch [400], val_loss: 6196.8066
Epoch [420], val_loss: 6198.1738
Epoch [440], val_loss: 6185.0630
Epoch [460], val_loss: 6211.8906
Epoch [480], val_loss: 6200.0449
Epoch [500], val_loss: 6169.6621

Q: What is the final validation loss of your model?

val_loss = 6169.6621

Let's log the final validation loss to Jovian and commit the notebook.

jovian.log_metrics(val_loss=val_loss)
[jovian] Metrics logged.
jovian.commit(project=project_name, environment=None)
[jovian] Attempting to save notebook..
[jovian] Updating notebook "agnieszka08/02-insurance-linear-regression" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Attaching records (metrics, hyperparameters, dataset etc.)
[jovian] Committed successfully! https://jovian.ml/agnieszka08/02-insurance-linear-regression

Now scroll back up, re-initialize the model, and try a different set of values for batch size, number of epochs, learning rate etc. Commit each experiment and use the "Compare" and "View Diff" options on Jovian to compare the different results.

Step 5: Make predictions using the trained model

Q: Complete the following function definition to make predictions on a single input

def predict_single(input, target, model):
    # Add a batch dimension so the model receives a batch containing one row
    inputs = input.unsqueeze(0)
    predictions = model(inputs)                # fill this
    prediction = predictions[0].detach()
    print("Input:", input)
    print("Target:", target)
    print("Prediction:", prediction)
input, target = val_ds[0]
predict_single(input, target, model)
Input: tensor([61.0000, 1.0000, 24.3647, 0.0000, 0.0000])
Target: tensor([14442.5635])
Prediction: tensor(14341.0400)
input, target = val_ds[10]
predict_single(input, target, model)
Input: tensor([20.0000, 1.0000, 30.6271, 0.0000, 0.0000])
Target: tensor([1946.4849])
Prediction: tensor(2447.5469)
input, target = val_ds[23]
predict_single(input, target, model)
Input: tensor([28.0000, 0.0000, 26.7131, 1.0000, 0.0000])
Target: tensor([4547.0059])
Prediction: tensor(5608.6177)
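
The role of unsqueeze(0) in predict_single can be seen from the tensor shapes alone. The sketch below uses an untrained stand-in layer (demo_model is hypothetical), so the predicted value it prints is meaningless; only the shapes are of interest. Wrapping the call in torch.no_grad() is an optional alternative to calling detach afterwards.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
demo_model = nn.Linear(5, 1)   # untrained stand-in for the trained model
single = torch.tensor([61.0, 1.0, 24.3647, 0.0, 0.0])   # one row, shape [5]

batched = single.unsqueeze(0)        # shape [1, 5]: a batch containing one row
with torch.no_grad():                # no gradients are needed for inference
    out = demo_model(batched)        # shape [1, 1]: one prediction per row

print(tuple(batched.shape), tuple(out.shape), tuple(out[0].shape))
```

Indexing the output with [0] then selects the prediction for the only row in the batch, which has shape (1,), matching the shape of the target.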

Are you happy with your model's predictions? Try to improve them further.

(Optional) Step 6: Try another dataset & blog about it

While this last step is optional for the submission of your assignment, we highly recommend that you do it. Try to clean up & replicate this notebook (or this one, or this one) for a different linear regression or logistic regression problem. This will help solidify your understanding, and give you a chance to differentiate the generic patterns in machine learning from problem-specific details.

Here are some sources to find good datasets:

We also recommend that you write a blog about your approach to the problem. Here is a suggested structure for your post (feel free to experiment with it):

  • Interesting title & subtitle
  • Overview of what the blog covers (which dataset, linear regression or logistic regression, intro to PyTorch)
  • Downloading & exploring the data
  • Preparing the data for training
  • Creating a model using PyTorch
  • Training the model to fit the data
  • Your thoughts on how to experiment with different hyperparameters to reduce loss
  • Making predictions using the model

As with the previous assignment, you can embed Jupyter notebook cells & outputs from Jovian into your blog.

Don't forget to share your work on the forum: https://jovian.ml/forum/t/share-your-work-here-assignment-2/4931

jovian.commit(project=project_name, environment=None)
jovian.commit(project=project_name, environment=None) # try again, kaggle fails sometimes
[jovian] Attempting to save notebook..
[jovian] Updating notebook "agnieszka08/02-insurance-linear-regression" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Attaching records (metrics, hyperparameters, dataset etc.)
[jovian] Committed successfully! https://jovian.ml/agnieszka08/02-insurance-linear-regression