
Logistic Regression with Scikit Learn - Machine Learning with Python

This tutorial is a part of Zero to Data Science Bootcamp by Jovian and Machine Learning with Python: Zero to GBMs

The following topics are covered in this tutorial:

  • Downloading a real-world dataset from Kaggle
  • Exploratory data analysis and visualization
  • Splitting a dataset into training, validation & test sets
  • Filling/imputing missing values in numeric columns
  • Scaling numeric features to a \((0,1)\) range
  • Encoding categorical columns as one-hot vectors
  • Training a logistic regression model using Scikit-learn
  • Evaluating a model using a validation set and test set
  • Saving a model to disk and loading it back

How to run the code

This tutorial is an executable Jupyter notebook hosted on Jovian. You can run this tutorial and experiment with the code examples in a couple of ways: using free online resources (recommended) or on your computer.

Option 1: Running using free online resources (1-click, recommended)

The easiest way to start executing the code is to click the Run button at the top of this page and select Run on Colab. You will be prompted to connect your Google Drive account so that this notebook can be placed into your drive for execution.

Option 2: Running on your computer locally

To run the code on your computer locally, you'll need to set up Python, download the notebook and install the required libraries. We recommend using the Conda distribution of Python. Click the Run button at the top of this page, select the Run Locally option, and follow the instructions.

Problem Statement

This tutorial takes a practical and coding-focused approach. We'll learn how to apply logistic regression to a real-world dataset from Kaggle:

QUESTION: The Rain in Australia dataset contains about 10 years of daily weather observations from numerous Australian weather stations. Here's a small sample from the dataset:

As a data scientist at the Bureau of Meteorology, you are tasked with creating a fully-automated system that can use today's weather data for a given location to predict whether it will rain at the location tomorrow.

EXERCISE: Before proceeding further, take a moment to think about how you can approach this problem. List five or more ideas that come to your mind below:

  1. ???
  2. ???
  3. ???
  4. ???
  5. ???

Linear Regression vs. Logistic Regression

In the previous tutorial, we attempted to predict a person's annual medical charges using linear regression. In this tutorial, we'll use logistic regression, which is better suited for classification problems like predicting whether it will rain tomorrow. Identifying whether a given problem is a classification or regression problem is an important first step in machine learning.

Classification Problems

Problems where each input must be assigned a discrete category (also called label or class) are known as classification problems.

Here are some examples of classification problems:

  • Rainfall prediction: Predicting whether it will rain tomorrow using today's weather data (classes are "Will Rain" and "Will Not Rain")
  • Breast cancer detection: Predicting whether a tumor is "benign" (noncancerous) or "malignant" (cancerous) using information like its radius, texture etc.
  • Loan Repayment Prediction - Predicting whether applicants will repay a home loan based on factors like age, income, loan amount, no. of children etc.
  • Handwritten Digit Recognition - Identifying which digit from 0 to 9 a picture of handwritten text represents.

Can you think of some more classification problems?

EXERCISE: Replicate the steps followed in this tutorial with each of the above datasets.

Classification problems can be binary (yes/no) or multiclass (picking one of many classes).

Regression Problems

Problems where a continuous numeric value must be predicted for each input are known as regression problems.

Here are some examples of regression problems:

Can you think of some more regression problems?

EXERCISE: Replicate the steps followed in the previous tutorial with each of the above datasets.

Linear Regression for Solving Regression Problems

Linear regression is a commonly used technique for solving regression problems. In a linear regression model, the target is modeled as a linear combination (or weighted sum) of input features. The predictions from the model are evaluated using a loss function like the Root Mean Squared Error (RMSE).

Here's a visual summary of how a linear regression model is structured:

For a mathematical discussion of linear regression, watch this YouTube playlist
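
As a rough illustration (not part of the original notebook, using made-up numbers), here's how a linear combination of input features and the RMSE loss could be computed with NumPy:

import numpy as np

# Made-up feature values (2 examples, 3 features), weights, bias and targets
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
w = np.array([0.5, -0.2, 0.1])
b = 2.0
targets = np.array([3.0, 4.5])

predictions = X @ w + b                                # weighted sum of the input features
rmse = np.sqrt(np.mean((predictions - targets) ** 2))  # root mean squared error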

Logistic Regression for Solving Classification Problems

Logistic regression is a commonly used technique for solving binary classification problems. In a logistic regression model:

  • we take a linear combination (or weighted sum) of the input features
  • we apply the sigmoid function to the result to obtain a number between 0 and 1
  • this number represents the probability of the input being classified as "Yes"
  • instead of RMSE, the cross entropy loss function is used to evaluate the results

Here's a visual summary of how a logistic regression model is structured (source):

The sigmoid function applied to the linear combination of inputs has the following formula:

\(\sigma(z) = \frac{1}{1 + e^{-z}}\)

The sigmoid function is also known as the logistic function, hence the name logistic regression. For a mathematical discussion of logistic regression, sigmoid activation and cross entropy, check out this YouTube playlist. Logistic regression can also be applied to multi-class classification problems, with a few modifications.
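
Here's a minimal sketch of the sigmoid function (illustrative, not part of the original notebook): however large or small the linear combination is, the output always lies between 0 and 1.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

sigmoid(np.array([-5.0, 0.0, 5.0]))  # -> array([0.00669285, 0.5, 0.99330715])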

Machine Learning Workflow

Whether we're solving a regression problem using linear regression or a classification problem using logistic regression, the workflow for training a model is exactly the same:

  1. We initialize a model with random parameters (weights & biases).
  2. We pass some inputs into the model to obtain predictions.
  3. We compare the model's predictions with the actual targets using the loss function.
  4. We use an optimization technique (like least squares, gradient descent etc.) to reduce the loss by adjusting the weights & biases of the model
  5. We repeat steps 2 to 4 till the predictions from the model are good enough.
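
To make the workflow concrete, here's a schematic sketch of these steps for a logistic regression model written in plain NumPy (purely illustrative; scikit-learn performs an equivalent but far more robust optimization internally):

import numpy as np

def train_logistic_regression(X, y, lr=0.1, epochs=1000):
    # X: matrix of numeric features, y: array of 0/1 targets
    rng = np.random.default_rng(42)
    w = rng.normal(scale=0.01, size=X.shape[1])      # 1. initialize random weights
    b = 0.0                                          #    and bias
    for _ in range(epochs):                          # 5. repeat until good enough
        probs = 1 / (1 + np.exp(-(X @ w + b)))       # 2. predictions (sigmoid of weighted sum)
        loss = -np.mean(y * np.log(probs) +          # 3. compare with targets (cross entropy)
                        (1 - y) * np.log(1 - probs))
        w -= lr * (X.T @ (probs - y)) / len(y)       # 4. gradient descent step
        b -= lr * np.mean(probs - y)
    return w, b, loss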

Classification and regression are both supervised machine learning problems, because they use labeled data. Machine learning applied to unlabeled data is known as unsupervised learning (image source).

In this tutorial, we'll train a logistic regression model using the Rain in Australia dataset to predict whether or not it will rain at a location tomorrow, using today's data. This is a binary classification problem.

Let's install the scikit-learn library which we'll use to train our model.

In [2]:
!pip install scikit-learn --upgrade --quiet
|████████████████████████████████| 22.3MB 139kB/s

Downloading the Data

We'll use the opendatasets library to download the data from Kaggle directly within Jupyter. Let's install and import opendatasets.

In [4]:
!pip install opendatasets --upgrade --quiet
In [5]:
import opendatasets as od
In [23]:
od.version()
Out[23]:
'0.1.20'

The dataset can now be downloaded using od.download. When you execute od.download, you will be asked to provide your Kaggle username and API key. Follow these instructions to create an API key: http://bit.ly/kaggle-creds

In [6]:
dataset_url = 'https://www.kaggle.com/jsphyg/weather-dataset-rattle-package'
In [8]:
od.download(dataset_url)
Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds Your Kaggle username: abhishekramachandra Your Kaggle Key: ··········
100%|██████████| 3.83M/3.83M [00:00<00:00, 150MB/s]
Downloading weather-dataset-rattle-package.zip to ./weather-dataset-rattle-package

Once the above command is executed, the dataset is downloaded and extracted to the directory weather-dataset-rattle-package.

In [9]:
import os
In [10]:
data_dir = './weather-dataset-rattle-package'
In [11]:
os.listdir(data_dir)
Out[11]:
['weatherAUS.csv']
In [12]:
train_csv = data_dir + '/weatherAUS.csv'

Let's load the data from weatherAUS.csv using Pandas.

In [13]:
!pip install pandas --quiet
In [14]:
import pandas as pd
In [15]:
raw_df = pd.read_csv(train_csv)
In [16]:
raw_df
Out[16]:

The dataset contains over 145,000 rows and 23 columns, including date, numeric and categorical columns. Our objective is to create a model to predict the value in the column RainTomorrow.

Let's check the data types and missing values in the various columns.

In [17]:
raw_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145460 entries, 0 to 145459
Data columns (total 23 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   Date           145460 non-null  object
 1   Location       145460 non-null  object
 2   MinTemp        143975 non-null  float64
 3   MaxTemp        144199 non-null  float64
 4   Rainfall       142199 non-null  float64
 5   Evaporation    82670 non-null   float64
 6   Sunshine       75625 non-null   float64
 7   WindGustDir    135134 non-null  object
 8   WindGustSpeed  135197 non-null  float64
 9   WindDir9am     134894 non-null  object
 10  WindDir3pm     141232 non-null  object
 11  WindSpeed9am   143693 non-null  float64
 12  WindSpeed3pm   142398 non-null  float64
 13  Humidity9am    142806 non-null  float64
 14  Humidity3pm    140953 non-null  float64
 15  Pressure9am    130395 non-null  float64
 16  Pressure3pm    130432 non-null  float64
 17  Cloud9am       89572 non-null   float64
 18  Cloud3pm       86102 non-null   float64
 19  Temp9am        143693 non-null  float64
 20  Temp3pm        141851 non-null  float64
 21  RainToday      142199 non-null  object
 22  RainTomorrow   142193 non-null  object
dtypes: float64(16), object(7)
memory usage: 25.5+ MB

While we should be able to fill in missing values for most columns, it might be a good idea to discard the rows where the value of RainTomorrow or RainToday is missing to make our analysis and modeling simpler (since one of them is the target variable, and the other is likely to be very closely related to the target variable).

In [18]:
raw_df.dropna(subset=['RainToday', 'RainTomorrow'], inplace=True)

How would you deal with the missing values in the other columns?

Exploratory Data Analysis and Visualization

Before training a machine learning model, it's always a good idea to explore the distributions of various columns and see how they are related to the target column. Let's explore and visualize the data using the Plotly, Matplotlib and Seaborn libraries. Follow these tutorials to learn how to use these libraries:

In [19]:
!pip install plotly matplotlib seaborn --quiet
In [20]:
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
In [21]:
px.histogram(raw_df, x='Location', title='Location vs. Rainy Days', color='RainToday')
In [22]:
px.histogram(raw_df, 
             x='Temp3pm', 
             title='Temperature at 3 pm vs. Rain Tomorrow', 
             color='RainTomorrow')
In [24]:
px.histogram(raw_df, 
             x='RainTomorrow', 
             color='RainToday', 
             title='Rain Tomorrow vs. Rain Today')
In [25]:
px.scatter(raw_df.sample(2000), 
           title='Min Temp. vs Max Temp.',
           x='MinTemp', 
           y='MaxTemp', 
           color='RainToday')
In [26]:
px.scatter(raw_df.sample(2000), 
           title='Temp (3 pm) vs. Humidity (3 pm)',
           x='Temp3pm',
           y='Humidity3pm',
           color='RainTomorrow')

What interpretations can you draw from the above charts?

EXERCISE: Visualize all the other columns of the dataset and study their relationship with the RainToday and RainTomorrow columns.

In [ ]:
 
In [ ]:
 
In [ ]:
 

Let's save our work before continuing.

In [27]:
!pip install jovian --upgrade --quiet
In [28]:
import jovian
In [29]:
jovian.commit()
[jovian] Detected Colab notebook... [jovian] Please enter your API key ( from https://jovian.ai/ ): API KEY: ·········· [jovian] Uploading colab notebook to Jovian... Committed successfully! https://jovian.ai/abhishekramachandra98/python-sklearn-logistic-regression-29842

(Optional) Working with a Sample

When working with massive datasets containing millions of rows, it's a good idea to work with a sample initially, to quickly set up your model training notebook. If you'd like to work with a sample, just set the value of use_sample to True.

In [30]:
use_sample = True
In [31]:
sample_fraction = 0.1
In [32]:
if use_sample:
    raw_df = raw_df.sample(frac=sample_fraction).copy()

Make sure to set use_sample to False and re-run the notebook end-to-end once you're ready to use the entire dataset.

Training, Validation and Test Sets

While building real-world machine learning models, it is quite common to split the dataset into three parts:

  1. Training set - used to train the model, i.e., compute the loss and adjust the model's weights using an optimization technique.

  2. Validation set - used to evaluate the model during training, tune model hyperparameters (optimization technique, regularization etc.), and pick the best version of the model. Picking a good validation set is essential for training models that generalize well. Learn more here.

  3. Test set - used to compare different models or approaches and report the model's final accuracy. For many datasets, test sets are provided separately. The test set should reflect the kind of data the model will encounter in the real-world, as closely as feasible.

As a general rule of thumb, you can use around 60% of the data for the training set, 20% for the validation set and 20% for the test set. If a separate test set is already provided, you can use a 75%-25% training-validation split.

When rows in the dataset have no inherent order, it's common practice to pick random subsets of rows for creating test and validation sets. This can be done using the train_test_split utility from scikit-learn. Learn more about it here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [33]:
!pip install scikit-learn --upgrade --quiet
In [34]:
from sklearn.model_selection import train_test_split
In [35]:
train_val_df, test_df = train_test_split(raw_df, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_val_df, test_size=0.25, random_state=42)
In [36]:
print('train_df.shape :', train_df.shape)
print('val_df.shape :', val_df.shape)
print('test_df.shape :', test_df.shape)
train_df.shape : (8447, 23)
val_df.shape : (2816, 23)
test_df.shape : (2816, 23)

However, while working with dates, it's often a better idea to separate the training, validation and test sets with time, so that the model is trained on data from the past and evaluated on data from the future.

For the current dataset, we can use the Date column in the dataset to create another column for year. We'll pick the last two years for the test set, and one year before it for the validation set.

In [37]:
plt.title('No. of Rows per Year')
sns.countplot(x=pd.to_datetime(raw_df.Date).dt.year);
Notebook Image
In [38]:
year = pd.to_datetime(raw_df.Date).dt.year

train_df = raw_df[year < 2015]
val_df = raw_df[year == 2015]
test_df = raw_df[year > 2015]
In [39]:
print('train_df.shape :', train_df.shape)
print('val_df.shape :', val_df.shape)
print('test_df.shape :', test_df.shape)
train_df.shape : (9710, 23)
val_df.shape : (1722, 23)
test_df.shape : (2647, 23)

While not a perfect 60-20-20 split, we have ensured that the validation and test sets both contain data for all 12 months of the year.

In [41]:
train_df
Out[41]:
In [42]:
val_df
Out[42]:
In [43]:
test_df
Out[43]:

Let's save our work before continuing.

In [44]:
jovian.commit()
[jovian] Detected Colab notebook... [jovian] Uploading colab notebook to Jovian... Committed successfully! https://jovian.ai/abhishekramachandra98/python-sklearn-logistic-regression-29842

Identifying Input and Target Columns

Often, not all the columns in a dataset are useful for training a model. In the current dataset, we can ignore the Date column, since we only want to use today's weather conditions to make a prediction about whether it will rain the next day.

Let's create a list of input columns, and also identify the target column.

In [45]:
input_cols = list(train_df.columns)[1:-1]
target_col = 'RainTomorrow'
In [46]:
print(input_cols)
['Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm', 'RainToday']
In [47]:
target_col
Out[47]:
'RainTomorrow'

We can now create inputs and targets for the training, validation and test sets for further processing and model training.

In [48]:
train_inputs = train_df[input_cols].copy()
train_targets = train_df[target_col].copy()
In [49]:
val_inputs = val_df[input_cols].copy()
val_targets = val_df[target_col].copy()
In [50]:
test_inputs = test_df[input_cols].copy()
test_targets = test_df[target_col].copy()
In [51]:
train_inputs
Out[51]:
In [52]:
train_targets
Out[52]:
13075      No
30695      No
26162     Yes
119147    Yes
53511      No
         ... 
31644      No
79484      No
94909      No
12800      No
15776     Yes
Name: RainTomorrow, Length: 9710, dtype: object

Let's also identify which of the columns are numerical and which ones are categorical. This will be useful later, as we'll need to convert the categorical data to numbers for training a logistic regression model.

In [53]:
!pip install numpy --quiet
In [54]:
import numpy as np
In [55]:
numeric_cols = train_inputs.select_dtypes(include=np.number).columns.tolist()[:-1]
categorical_cols = train_inputs.select_dtypes('object').columns.tolist()

Let's view some statistics for the numeric columns.

In [ ]:
train_inputs[numeric_cols].describe()
Out[]:

Do the ranges of the numeric columns seem reasonable? If not, we may have to do some data cleaning as well.

Let's also check the number of categories in each of the categorical columns.

In [ ]:
train_inputs[categorical_cols].nunique()
Out[]:
Location       49
WindGustDir    16
WindDir9am     16
WindDir3pm     16
RainToday       2
dtype: int64

Let's save our work before continuing.

In [ ]:
jovian.commit()
[jovian] Detected Colab notebook... [jovian] Uploading colab notebook to Jovian... Committed successfully! https://jovian.ai/aakashns/python-sklearn-logistic-regression

Imputing Missing Numeric Data

Machine learning models can't work with missing numerical data. The process of filling missing values is called imputation.

There are several techniques for imputation, but we'll use the most basic one: replacing missing values with the average value in the column using the SimpleImputer class from sklearn.impute.

In [ ]:
from sklearn.impute import SimpleImputer
In [ ]:
imputer = SimpleImputer(strategy = 'mean')

Before we perform imputation, let's check the no. of missing values in each numeric column.

In [ ]:
raw_df[numeric_cols].isna().sum()
Out[]:
MinTemp            468
MaxTemp            307
Rainfall             0
Evaporation      59694
Sunshine         66805
WindGustSpeed     9105
WindSpeed9am      1055
WindSpeed3pm      2531
Humidity9am       1517
Humidity3pm       3501
Pressure9am      13743
Pressure3pm      13769
Cloud9am         52625
Cloud3pm         56094
Temp9am            656
dtype: int64

These values are spread across the training, test and validation sets. You can also check the no. of missing values individually for train_inputs, val_inputs and test_inputs.

In [ ]:
train_inputs[numeric_cols].isna().sum()
Out[]:
MinTemp            314
MaxTemp            187
Rainfall             0
Evaporation      36331
Sunshine         40046
WindGustSpeed     6828
WindSpeed9am       874
WindSpeed3pm      1069
Humidity9am       1052
Humidity3pm       1116
Pressure9am       9112
Pressure3pm       9131
Cloud9am         34988
Cloud3pm         36022
Temp9am            574
dtype: int64

The first step in imputation is to fit the imputer to the data i.e. compute the chosen statistic (e.g. mean) for each column in the dataset.

In [ ]:
imputer.fit(raw_df[numeric_cols])
Out[]:
SimpleImputer()

After calling fit, the computed statistic for each column is stored in the statistics_ property of imputer.

In [ ]:
list(imputer.statistics_)
Out[]:
[12.18482386562048,
 23.235120301822324,
 2.349974074310839,
 5.472515506887154,
 7.630539861047281,
 39.97051988882308,
 13.990496092519967,
 18.631140782316862,
 68.82683277087672,
 51.44928834695453,
 1017.6545771543717,
 1015.2579625879797,
 4.431160817585808,
 4.499250233195188,
 16.98706638787991]

The missing values in the training, test and validation sets can now be filled in using the transform method of imputer.

In [ ]:
train_inputs[numeric_cols] = imputer.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = imputer.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = imputer.transform(test_inputs[numeric_cols])

The missing values are now filled in with the mean of each column.

In [ ]:
train_inputs[numeric_cols].isna().sum()
Out[]:
MinTemp          0
MaxTemp          0
Rainfall         0
Evaporation      0
Sunshine         0
WindGustSpeed    0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Cloud9am         0
Cloud3pm         0
Temp9am          0
dtype: int64

EXERCISE: Apply some other imputation techniques and observe how they change the results of the model. You can learn more about other imputation techniques here: https://scikit-learn.org/stable/modules/impute.html

Scaling Numeric Features

Another good practice is to scale numeric features to a small range of values e.g. \((0,1)\) or \((-1,1)\). Scaling numeric features ensures that no particular feature has a disproportionate impact on the model's loss. Optimization algorithms also work better in practice with smaller numbers.
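
For reference, min-max scaling transforms each value using the formula \(x' = \frac{x - x_{min}}{x_{max} - x_{min}}\), where \(x_{min}\) and \(x_{max}\) are the smallest and largest values observed in that column.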

The numeric columns in our dataset have varying ranges.

In [ ]:
raw_df[numeric_cols].describe()
Out[]:

Let's use MinMaxScaler from sklearn.preprocessing to scale values to the \((0,1)\) range.

In [ ]:
from sklearn.preprocessing import MinMaxScaler
In [ ]:
?MinMaxScaler
In [ ]:
scaler = MinMaxScaler()

First, we fit the scaler to the data i.e. compute the range of values for each numeric column.

In [ ]:
scaler.fit(raw_df[numeric_cols])
Out[]:
MinMaxScaler()

We can now inspect the minimum and maximum values in each column.

In [ ]:
print('Minimum:')
list(scaler.data_min_)
Minimum:
Out[]:
[-8.5,
 -4.8,
 0.0,
 0.0,
 0.0,
 6.0,
 0.0,
 0.0,
 0.0,
 0.0,
 980.5,
 977.1,
 0.0,
 0.0,
 -7.2]
In [ ]:
print('Maximum:')
list(scaler.data_max_)
Maximum:
Out[]:
[33.9,
 48.1,
 371.0,
 145.0,
 14.5,
 135.0,
 130.0,
 87.0,
 100.0,
 100.0,
 1041.0,
 1039.6,
 9.0,
 9.0,
 40.2]

We can now separately scale the training, validation and test sets using the transform method of scaler.

In [ ]:
train_inputs[numeric_cols] = scaler.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = scaler.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = scaler.transform(test_inputs[numeric_cols])

We can now verify that values in each column lie in the range \((0,1)\)

In [ ]:
train_inputs[numeric_cols].describe()
Out[]:

Let's save our work before continuing.

In [ ]:
jovian.commit()
[jovian] Detected Colab notebook... [jovian] Uploading colab notebook to Jovian... Committed successfully! https://jovian.ai/aakashns/python-sklearn-logistic-regression

Encoding Categorical Data

Since machine learning models can only be trained with numeric data, we need to convert categorical data to numbers. A common technique is to use one-hot encoding for categorical columns.

One hot encoding involves adding a new binary (0/1) column for each unique category of a categorical column.
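
Here's a toy illustration (not part of the original notebook) of what one-hot encoding produces for a column like RainToday, using the pd.get_dummies helper from Pandas; below we'll use scikit-learn's OneHotEncoder instead, since it can be fitted once and reused on new data.

import pandas as pd

# Each unique category ('No' and 'Yes') becomes its own binary column
pd.get_dummies(pd.Series(['No', 'Yes', 'No'], name='RainToday'))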

In [ ]:
raw_df[categorical_cols].nunique()
Out[]:
Location       49
WindGustDir    16
WindDir9am     16
WindDir3pm     16
RainToday       2
dtype: int64

We can perform one hot encoding using the OneHotEncoder class from sklearn.preprocessing.

In [ ]:
from sklearn.preprocessing import OneHotEncoder
In [ ]:
?OneHotEncoder
In [ ]:
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')

First, we fit the encoder to the data i.e. identify the full list of categories across all categorical columns.

In [ ]:
encoder.fit(raw_df[categorical_cols])
Out[]:
OneHotEncoder(handle_unknown='ignore', sparse=False)
In [ ]:
encoder.categories_
Out[]:
[array(['Adelaide', 'Albany', 'Albury', 'AliceSprings', 'BadgerysCreek',
        'Ballarat', 'Bendigo', 'Brisbane', 'Cairns', 'Canberra', 'Cobar',
        'CoffsHarbour', 'Dartmoor', 'Darwin', 'GoldCoast', 'Hobart',
        'Katherine', 'Launceston', 'Melbourne', 'MelbourneAirport',
        'Mildura', 'Moree', 'MountGambier', 'MountGinini', 'Newcastle',
        'Nhil', 'NorahHead', 'NorfolkIsland', 'Nuriootpa', 'PearceRAAF',
        'Penrith', 'Perth', 'PerthAirport', 'Portland', 'Richmond', 'Sale',
        'SalmonGums', 'Sydney', 'SydneyAirport', 'Townsville',
        'Tuggeranong', 'Uluru', 'WaggaWagga', 'Walpole', 'Watsonia',
        'Williamtown', 'Witchcliffe', 'Wollongong', 'Woomera'],
       dtype=object),
 array(['E', 'ENE', 'ESE', 'N', 'NE', 'NNE', 'NNW', 'NW', 'S', 'SE', 'SSE',
        'SSW', 'SW', 'W', 'WNW', 'WSW', nan], dtype=object),
 array(['E', 'ENE', 'ESE', 'N', 'NE', 'NNE', 'NNW', 'NW', 'S', 'SE', 'SSE',
        'SSW', 'SW', 'W', 'WNW', 'WSW', nan], dtype=object),
 array(['E', 'ENE', 'ESE', 'N', 'NE', 'NNE', 'NNW', 'NW', 'S', 'SE', 'SSE',
        'SSW', 'SW', 'W', 'WNW', 'WSW', nan], dtype=object),
 array(['No', 'Yes'], dtype=object)]

The encoder has created a list of categories for each of the categorical columns in the dataset.

We can generate column names for each individual category using get_feature_names.

In [ ]:
encoded_cols = list(encoder.get_feature_names(categorical_cols))
print(encoded_cols)
['Location_Adelaide', 'Location_Albany', 'Location_Albury', 'Location_AliceSprings', 'Location_BadgerysCreek', 'Location_Ballarat', 'Location_Bendigo', 'Location_Brisbane', 'Location_Cairns', 'Location_Canberra', 'Location_Cobar', 'Location_CoffsHarbour', 'Location_Dartmoor', 'Location_Darwin', 'Location_GoldCoast', 'Location_Hobart', 'Location_Katherine', 'Location_Launceston', 'Location_Melbourne', 'Location_MelbourneAirport', 'Location_Mildura', 'Location_Moree', 'Location_MountGambier', 'Location_MountGinini', 'Location_Newcastle', 'Location_Nhil', 'Location_NorahHead', 'Location_NorfolkIsland', 'Location_Nuriootpa', 'Location_PearceRAAF', 'Location_Penrith', 'Location_Perth', 'Location_PerthAirport', 'Location_Portland', 'Location_Richmond', 'Location_Sale', 'Location_SalmonGums', 'Location_Sydney', 'Location_SydneyAirport', 'Location_Townsville', 'Location_Tuggeranong', 'Location_Uluru', 'Location_WaggaWagga', 'Location_Walpole', 'Location_Watsonia', 'Location_Williamtown', 'Location_Witchcliffe', 'Location_Wollongong', 'Location_Woomera', 'WindGustDir_E', 'WindGustDir_ENE', 'WindGustDir_ESE', 'WindGustDir_N', 'WindGustDir_NE', 'WindGustDir_NNE', 'WindGustDir_NNW', 'WindGustDir_NW', 'WindGustDir_S', 'WindGustDir_SE', 'WindGustDir_SSE', 'WindGustDir_SSW', 'WindGustDir_SW', 'WindGustDir_W', 'WindGustDir_WNW', 'WindGustDir_WSW', 'WindGustDir_nan', 'WindDir9am_E', 'WindDir9am_ENE', 'WindDir9am_ESE', 'WindDir9am_N', 'WindDir9am_NE', 'WindDir9am_NNE', 'WindDir9am_NNW', 'WindDir9am_NW', 'WindDir9am_S', 'WindDir9am_SE', 'WindDir9am_SSE', 'WindDir9am_SSW', 'WindDir9am_SW', 'WindDir9am_W', 'WindDir9am_WNW', 'WindDir9am_WSW', 'WindDir9am_nan', 'WindDir3pm_E', 'WindDir3pm_ENE', 'WindDir3pm_ESE', 'WindDir3pm_N', 'WindDir3pm_NE', 'WindDir3pm_NNE', 'WindDir3pm_NNW', 'WindDir3pm_NW', 'WindDir3pm_S', 'WindDir3pm_SE', 'WindDir3pm_SSE', 'WindDir3pm_SSW', 'WindDir3pm_SW', 'WindDir3pm_W', 'WindDir3pm_WNW', 'WindDir3pm_WSW', 'WindDir3pm_nan', 'RainToday_No', 'RainToday_Yes']

All of the above columns will be added to train_inputs, val_inputs and test_inputs.

To perform the encoding, we use the transform method of encoder.

In [ ]:
train_inputs[encoded_cols] = encoder.transform(train_inputs[categorical_cols])
val_inputs[encoded_cols] = encoder.transform(val_inputs[categorical_cols])
test_inputs[encoded_cols] = encoder.transform(test_inputs[categorical_cols])

We can verify that these new columns have been added to our training, test and validation sets.

In [ ]:
pd.set_option('display.max_columns', None)
In [ ]:
test_inputs
Out[]:

Let's save our work before continuing.

In [ ]:
jovian.commit()
[jovian] Detected Colab notebook... [jovian] Uploading colab notebook to Jovian... Committed successfully! https://jovian.ai/aakashns/python-sklearn-logistic-regression

Saving Processed Data to Disk

It can be useful to save processed data to disk, especially for really large datasets, to avoid repeating the preprocessing steps every time you start the Jupyter notebook. The parquet format is a fast and efficient format for saving and loading Pandas dataframes.

In [ ]:
print('train_inputs:', train_inputs.shape)
print('train_targets:', train_targets.shape)
print('val_inputs:', val_inputs.shape)
print('val_targets:', val_targets.shape)
print('test_inputs:', test_inputs.shape)
print('test_targets:', test_targets.shape)
train_inputs: (97988, 123)
train_targets: (97988,)
val_inputs: (17089, 123)
val_targets: (17089,)
test_inputs: (25710, 123)
test_targets: (25710,)
In [ ]:
!pip install pyarrow --quiet
In [ ]:
train_inputs.to_parquet('train_inputs.parquet')
val_inputs.to_parquet('val_inputs.parquet')
test_inputs.to_parquet('test_inputs.parquet')
In [ ]:
%%time
pd.DataFrame(train_targets).to_parquet('train_targets.parquet')
pd.DataFrame(val_targets).to_parquet('val_targets.parquet')
pd.DataFrame(test_targets).to_parquet('test_targets.parquet')
CPU times: user 33.3 ms, sys: 903 µs, total: 34.2 ms Wall time: 37.7 ms

We can read the data back using pd.read_parquet.

In [ ]:
%%time

train_inputs = pd.read_parquet('train_inputs.parquet')
val_inputs = pd.read_parquet('val_inputs.parquet')
test_inputs = pd.read_parquet('test_inputs.parquet')

train_targets = pd.read_parquet('train_targets.parquet')[target_col]
val_targets = pd.read_parquet('val_targets.parquet')[target_col]
test_targets = pd.read_parquet('test_targets.parquet')[target_col]
CPU times: user 304 ms, sys: 161 ms, total: 465 ms Wall time: 300 ms

Let's verify that the data was loaded properly.

In [ ]:
print('train_inputs:', train_inputs.shape)
print('train_targets:', train_targets.shape)
print('val_inputs:', val_inputs.shape)
print('val_targets:', val_targets.shape)
print('test_inputs:', test_inputs.shape)
print('test_targets:', test_targets.shape)
train_inputs: (97988, 123)
train_targets: (97988,)
val_inputs: (17089, 123)
val_targets: (17089,)
test_inputs: (25710, 123)
test_targets: (25710,)
In [ ]:
val_inputs
Out[]:
In [ ]:
val_targets
Out[]:
2133      No
2134      No
2135      No
2136      No
2137      No
          ..
144913    No
144914    No
144915    No
144916    No
144917    No
Name: RainTomorrow, Length: 17089, dtype: object

Training a Logistic Regression Model

Logistic regression is a commonly used technique for solving binary classification problems. In a logistic regression model:

  • we take a linear combination (or weighted sum) of the input features
  • we apply the sigmoid function to the result to obtain a number between 0 and 1
  • this number represents the probability of the input being classified as "Yes"
  • instead of RMSE, the cross entropy loss function is used to evaluate the results

Here's a visual summary of how a logistic regression model is structured (source):

The sigmoid function applied to the linear combination of inputs has the following formula:

\(\sigma(z) = \frac{1}{1 + e^{-z}}\)

To train a logistic regression model, we can use the LogisticRegression class from Scikit-learn.

In [ ]:
from sklearn.linear_model import LogisticRegression
In [ ]:
?LogisticRegression
In [ ]:
model = LogisticRegression(solver='liblinear')

We can train the model using model.fit.

In [ ]:
model.fit(train_inputs[numeric_cols + encoded_cols], train_targets)
Out[]:
LogisticRegression(solver='liblinear')

model.fit uses the following workflow for training the model (source):

  1. We initialize a model with random parameters (weights & biases).
  2. We pass some inputs into the model to obtain predictions.
  3. We compare the model's predictions with the actual targets using the loss function.
  4. We use an optimization technique (like least squares, gradient descent etc.) to reduce the loss by adjusting the weights & biases of the model
  5. We repeat steps 2 to 4 till the predictions from the model are good enough.

For a mathematical discussion of logistic regression, sigmoid activation and cross entropy, check out this YouTube playlist. Logistic regression can also be applied to multi-class classification problems, with a few modifications.

Let's check the weights and biases of the trained model.

In [ ]:
print(numeric_cols + encoded_cols)
['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am', 'Location_Adelaide', 'Location_Albany', 'Location_Albury', 'Location_AliceSprings', 'Location_BadgerysCreek', 'Location_Ballarat', 'Location_Bendigo', 'Location_Brisbane', 'Location_Cairns', 'Location_Canberra', 'Location_Cobar', 'Location_CoffsHarbour', 'Location_Dartmoor', 'Location_Darwin', 'Location_GoldCoast', 'Location_Hobart', 'Location_Katherine', 'Location_Launceston', 'Location_Melbourne', 'Location_MelbourneAirport', 'Location_Mildura', 'Location_Moree', 'Location_MountGambier', 'Location_MountGinini', 'Location_Newcastle', 'Location_Nhil', 'Location_NorahHead', 'Location_NorfolkIsland', 'Location_Nuriootpa', 'Location_PearceRAAF', 'Location_Penrith', 'Location_Perth', 'Location_PerthAirport', 'Location_Portland', 'Location_Richmond', 'Location_Sale', 'Location_SalmonGums', 'Location_Sydney', 'Location_SydneyAirport', 'Location_Townsville', 'Location_Tuggeranong', 'Location_Uluru', 'Location_WaggaWagga', 'Location_Walpole', 'Location_Watsonia', 'Location_Williamtown', 'Location_Witchcliffe', 'Location_Wollongong', 'Location_Woomera', 'WindGustDir_E', 'WindGustDir_ENE', 'WindGustDir_ESE', 'WindGustDir_N', 'WindGustDir_NE', 'WindGustDir_NNE', 'WindGustDir_NNW', 'WindGustDir_NW', 'WindGustDir_S', 'WindGustDir_SE', 'WindGustDir_SSE', 'WindGustDir_SSW', 'WindGustDir_SW', 'WindGustDir_W', 'WindGustDir_WNW', 'WindGustDir_WSW', 'WindGustDir_nan', 'WindDir9am_E', 'WindDir9am_ENE', 'WindDir9am_ESE', 'WindDir9am_N', 'WindDir9am_NE', 'WindDir9am_NNE', 'WindDir9am_NNW', 'WindDir9am_NW', 'WindDir9am_S', 'WindDir9am_SE', 'WindDir9am_SSE', 'WindDir9am_SSW', 'WindDir9am_SW', 'WindDir9am_W', 'WindDir9am_WNW', 'WindDir9am_WSW', 'WindDir9am_nan', 'WindDir3pm_E', 'WindDir3pm_ENE', 'WindDir3pm_ESE', 'WindDir3pm_N', 'WindDir3pm_NE', 'WindDir3pm_NNE', 'WindDir3pm_NNW', 'WindDir3pm_NW', 'WindDir3pm_S', 'WindDir3pm_SE', 'WindDir3pm_SSE', 'WindDir3pm_SSW', 'WindDir3pm_SW', 'WindDir3pm_W', 'WindDir3pm_WNW', 'WindDir3pm_WSW', 'WindDir3pm_nan', 'RainToday_No', 'RainToday_Yes']
In [ ]:
print(model.coef_.tolist())
[[0.9829272985765378, -1.6136822077921629, 3.2569451328900922, 0.7391772272261374, -1.665730067365724, 6.712782346582703, -0.8945848254028183, -1.4786777149489574, 0.5085762943945592, 5.668985885134319, 5.7512710157968305, -9.442234740198247, -0.15422870180288917, 1.2692578677045214, 0.9609387600221121, 0.5968095184788006, -0.5433599933400493, 0.48409445642112614, 0.012646371301369185, 0.3420937583172522, -0.3502912727984089, 0.18144823211158378, 0.4258617593285847, -0.004902265798005033, 0.015432475688749822, 0.25380114477159266, -0.018375820646252074, -0.03048369335905766, -0.46729041060947873, -0.1441966650545995, -0.5908198550000147, -0.7446331726196194, -0.24989076701188495, -0.32868637877351403, -0.5709379172058887, 0.08019115981179582, 0.014039302320957656, 0.05995223838401963, -0.8771230773722085, -0.441465682398046, 0.011834632183502208, -0.45949004429924983, -0.4601851982479873, -0.07469554670856815, 0.1945916834015849, 0.4456783445702781, 0.6073735657904817, 0.4303986182093421, -0.02091355539749858, 0.25314382242140443, -0.31939859295381545, 0.4066468101809617, -0.05796390284596123, -0.11141022335853902, -0.7155350954107347, 0.35870406948747874, 0.19602783718192243, 0.18236866703794213, 0.18190786142141765, -0.25252488233915393, 0.02166676526308533, 0.6916194371797169, -0.7935662642200941, -0.18413292480846205, -0.15310880860720139, -0.15142785979614481, -0.05549105244610673, -0.21757291016650984, -0.22518473624344681, -0.3149627139958612, -0.16058102195006552, -0.15315196181408183, -0.1022968291689486, -0.02692553646989396, -0.04355934274171627, -0.08850280682601555, -0.08752613615596141, -0.22240076310905402, -0.2584519201796868, -0.19965055383835098, 0.09685428219750397, -0.30933313646401334, -0.02164366098214061, -0.32062146009339976, 0.03047995275483984, -0.017576520724588637, 0.14834043391209928, -0.06361435278904533, -0.05755840214226519, -0.4025960977215561, -0.30093759728804326, -0.3989761738612584, -0.1874253680639446, -0.05350143660530763, -0.08160359659583666, -0.05459095297873519, -0.010293967501836074, -0.2624883341682838, -0.23360446213357397, -0.1432678478499558, -0.23204114668787276, 0.05016385730005804, -0.29928087812066456, -0.06875521792622655, 0.2978706769198813, 0.2315582031241263, -0.26774717714717844, -0.23979456498260399, -0.3676332980176867, -0.32160415828281486, -0.37623569898788223, -0.18106033292198612, -0.027693711569770855, -0.2815457267737701, 0.0967308127452155, -1.4301555545398859, -0.933785116780864]]
In [ ]:
print(model.intercept_)
[-2.36394067]

Each weight is applied to the value in a specific column of the input. The larger a weight's magnitude, the greater that column's impact on the prediction; positive weights push the prediction towards "Yes", while negative weights push it towards "No".
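
As a quick way to interpret the model (not shown in the original notebook), we can put the column names and weights side by side and sort them:

weight_df = pd.DataFrame({
    'feature': numeric_cols + encoded_cols,
    'weight': model.coef_.tolist()[0]
})
weight_df.sort_values('weight', ascending=False).head(10)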

Making Predictions and Evaluating the Model

We can now use the trained model to make predictions on the training, validation and test sets.

In [ ]:
X_train = train_inputs[numeric_cols + encoded_cols]
X_val = val_inputs[numeric_cols + encoded_cols]
X_test = test_inputs[numeric_cols + encoded_cols]
In [ ]:
train_preds = model.predict(X_train)
In [ ]:
train_preds
Out[]:
array(['No', 'No', 'No', ..., 'No', 'No', 'No'], dtype=object)
In [ ]:
train_targets
Out[]:
0         No
1         No
2         No
3         No
4         No
          ..
144548    No
144549    No
144550    No
144551    No
144552    No
Name: RainTomorrow, Length: 97988, dtype: object

We can output a probabilistic prediction using predict_proba.

In [ ]:
train_probs = model.predict_proba(X_train)
train_probs
Out[]:
array([[0.93950704, 0.06049296],
       [0.94333134, 0.05666866],
       [0.95980416, 0.04019584],
       ...,
       [0.98729985, 0.01270015],
       [0.98358003, 0.01641997],
       [0.87598858, 0.12401142]])

The numbers above indicate the probabilities for the target classes "No" and "Yes".

In [ ]:
model.classes_
Out[]:
array(['No', 'Yes'], dtype=object)
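
The predicted class can also be derived from these probabilities using a custom threshold; model.predict effectively uses 0.5 for binary classification. Here's a small illustration (not part of the original notebook):

threshold = 0.5
custom_preds = np.where(train_probs[:, 1] > threshold, 'Yes', 'No')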

We can test the accuracy of the model's predictions by computing the percentage of matching values in train_preds and train_targets.

This can be done using the accuracy_score function from sklearn.metrics.
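
For intuition, this is equivalent to the following manual computation (illustrative only):

# Fraction of predictions that exactly match the targets
(train_preds == train_targets).mean()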

In [ ]:
from sklearn.metrics import accuracy_score
In [ ]:
accuracy_score(train_targets, train_preds)
Out[]:
0.8519002326815528

The model achieves an accuracy of about 85.2% on the training set. We can visualize the breakdown of correctly and incorrectly classified inputs using a confusion matrix.

In [ ]:
from sklearn.metrics import confusion_matrix
In [ ]:
confusion_matrix(train_targets, train_preds, normalize='true')
Out[]:
array([[0.94613466, 0.05386534],
       [0.477475  , 0.522525  ]])

Let's define a helper function to generate predictions, compute the accuracy score and plot a confusion matrix for a given set of inputs.

In [ ]:
def predict_and_plot(inputs, targets, name=''):
    preds = model.predict(inputs)
    
    accuracy = accuracy_score(targets, preds)
    print("Accuracy: {:.2f}%".format(accuracy * 100))
    
    cf = confusion_matrix(targets, preds, normalize='true')
    plt.figure()
    sns.heatmap(cf, annot=True)
    plt.xlabel('Prediction')
    plt.ylabel('Target')
    plt.title('{} Confusion Matrix'.format(name));
    
    return preds
In [ ]:
train_preds = predict_and_plot(X_train, train_targets, 'Training')
Accuracy: 85.19%
Notebook Image

Let's compute the model's accuracy on the validation and test sets too.

In [ ]:
val_preds = predict_and_plot(X_val, val_targets, 'Validation')
Accuracy: 85.41%
Notebook Image
In [ ]:
test_preds = predict_and_plot(X_test, test_targets, 'Test')
Accuracy: 84.25%
Notebook Image

The accuracy of the model on the validation and test sets is above 84%, which suggests that our model generalizes well to data it hasn't seen before.

But how good is 84% accuracy? While this depends on the nature of the problem and on business requirements, a good way to verify whether a model has actually learned something useful is to compare its results to a "random" or "dumb" model.

Let's create two models: one that guesses randomly and another that always returns "No". Both of these models completely ignore the inputs given to them.

In [ ]:
def random_guess(inputs):
    return np.random.choice(["No", "Yes"], len(inputs))
In [ ]:
def all_no(inputs):
    return np.full(len(inputs), "No")

Let's check the accuracies of these two models on the test set.

In [ ]:
accuracy_score(test_targets, random_guess(X_test))
Out[]:
0.4970050563982886
In [ ]:
accuracy_score(test_targets, all_no(X_test))
Out[]:
0.7734344612991054

Our random model achieves an accuracy of about 50%, and our "always No" model achieves an accuracy of about 77%, which is simply the fraction of rows in the test set where RainTomorrow is "No".

Thankfully, our model is better than a "dumb" or "random" model! This is not always the case, so it's a good practice to benchmark any model you train against such baseline models.

EXERCISE: Initialize the LogisticRegression model with different arguments and try to achieve a higher accuracy. The arguments used for initializing the model are called hyperparameters (to differentiate them from weights and biases - parameters that are learned by the model during training). You can find the full list of arguments here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [ ]:
 
In [ ]:
 

EXERCISE: Train a logistic regression model using just the numeric columns from the dataset. Does it perform better or worse than the model trained above?

In [ ]:
 
In [ ]:
 

EXERCISE: Train a logistic regression model using just the categorical columns from the dataset. Does it perform better or worse than the model trained above?

In [ ]:
 
In [ ]:
 

EXERCISE: Train a logistic regression model without feature scaling. Also try a different strategy for missing data imputation. Does it perform better or worse than the model trained above?

In [ ]:
 
In [ ]:
 

Let's save our work before continuing.

In [ ]:
jovian.commit()
[jovian] Detected Colab notebook... [jovian] Uploading colab notebook to Jovian... Committed successfully! https://jovian.ai/aakashns/python-sklearn-logistic-regression

Making Predictions on a Single Input

Once the model has been trained to a satisfactory accuracy, it can be used to make predictions on new data. Consider the following dictionary containing data collected from the Katherine weather department today.

In [ ]:
new_input = {'Date': '2021-06-19',
             'Location': 'Katherine',
             'MinTemp': 23.2,
             'MaxTemp': 33.2,
             'Rainfall': 10.2,
             'Evaporation': 4.2,
             'Sunshine': np.nan,
             'WindGustDir': 'NNW',
             'WindGustSpeed': 52.0,
             'WindDir9am': 'NW',
             'WindDir3pm': 'NNE',
             'WindSpeed9am': 13.0,
             'WindSpeed3pm': 20.0,
             'Humidity9am': 89.0,
             'Humidity3pm': 58.0,
             'Pressure9am': 1004.8,
             'Pressure3pm': 1001.5,
             'Cloud9am': 8.0,
             'Cloud3pm': 5.0,
             'Temp9am': 25.7,
             'Temp3pm': 33.0,
             'RainToday': 'Yes'}

The first step is to convert the dictionary into a Pandas dataframe, similar to raw_df. This can be done by passing a list containing the given dictionary to the pd.DataFrame constructor.

In [ ]:
new_input_df = pd.DataFrame([new_input])
In [ ]:
new_input_df
Out[]:

We've now created a Pandas dataframe with the same columns as raw_df (except RainTomorrow, which needs to be predicted). The dataframe contains just one row of data, containing the given input.

We must now apply the same transformations applied while training the model:

  1. Imputation of missing values using the imputer created earlier
  2. Scaling numerical features using the scaler created earlier
  3. Encoding categorical features using the encoder created earlier
In [ ]:
new_input_df[numeric_cols] = imputer.transform(new_input_df[numeric_cols])
new_input_df[numeric_cols] = scaler.transform(new_input_df[numeric_cols])
new_input_df[encoded_cols] = encoder.transform(new_input_df[categorical_cols])
In [ ]:
X_new_input = new_input_df[numeric_cols + encoded_cols]
X_new_input
Out[]:

We can now make a prediction using model.predict.

In [ ]:
prediction = model.predict(X_new_input)[0]
In [ ]:
prediction
Out[]:
'Yes'

Our model predicts that it will rain tomorrow in Katherine! We can also check the probability of the prediction.

In [ ]:
prob = model.predict_proba(X_new_input)[0]
In [ ]:
prob
Out[]:
array([0.48885322, 0.51114678])

Looks like our model isn't too confident about its prediction!

Let's define a helper function to make predictions for individual inputs.

In [ ]:
def predict_input(single_input):
    input_df = pd.DataFrame([single_input])
    input_df[numeric_cols] = imputer.transform(input_df[numeric_cols])
    input_df[numeric_cols] = scaler.transform(input_df[numeric_cols])
    input_df[encoded_cols] = encoder.transform(input_df[categorical_cols])
    X_input = input_df[numeric_cols + encoded_cols]
    pred = model.predict(X_input)[0]
    prob = model.predict_proba(X_input)[0][list(model.classes_).index(pred)]
    return pred, prob

We can now use this function to make predictions for individual inputs.

In [ ]:
new_input = {'Date': '2021-06-19',
             'Location': 'Launceston',
             'MinTemp': 23.2,
             'MaxTemp': 33.2,
             'Rainfall': 10.2,
             'Evaporation': 4.2,
             'Sunshine': np.nan,
             'WindGustDir': 'NNW',
             'WindGustSpeed': 52.0,
             'WindDir9am': 'NW',
             'WindDir3pm': 'NNE',
             'WindSpeed9am': 13.0,
             'WindSpeed3pm': 20.0,
             'Humidity9am': 89.0,
             'Humidity3pm': 58.0,
             'Pressure9am': 1004.8,
             'Pressure3pm': 1001.5,
             'Cloud9am': 8.0,
             'Cloud3pm': 5.0,
             'Temp9am': 25.7,
             'Temp3pm': 33.0,
             'RainToday': 'Yes'}
In [ ]:
predict_input(new_input)
Out[]:
('Yes', 0.6316581522137068)

EXERCISE: Try changing the values in new_input and observe how the predictions and probabilities change. Try different values of location, temperature, humidity, pressure etc. Try to get an intuitive feel of which columns have the greatest effect on the result of the model.

In [ ]:
raw_df.Location.unique()
Out[]:
array(['Albury', 'BadgerysCreek', 'Cobar', 'CoffsHarbour', 'Moree',
       'Newcastle', 'NorahHead', 'NorfolkIsland', 'Penrith', 'Richmond',
       'Sydney', 'SydneyAirport', 'WaggaWagga', 'Williamtown',
       'Wollongong', 'Canberra', 'Tuggeranong', 'MountGinini', 'Ballarat',
       'Bendigo', 'Sale', 'MelbourneAirport', 'Melbourne', 'Mildura',
       'Nhil', 'Portland', 'Watsonia', 'Dartmoor', 'Brisbane', 'Cairns',
       'GoldCoast', 'Townsville', 'Adelaide', 'MountGambier', 'Nuriootpa',
       'Woomera', 'Albany', 'Witchcliffe', 'PearceRAAF', 'PerthAirport',
       'Perth', 'SalmonGums', 'Walpole', 'Hobart', 'Launceston',
       'AliceSprings', 'Darwin', 'Katherine', 'Uluru'], dtype=object)
In [ ]:
 
In [ ]:
 

Let's save our work before continuing.

In [ ]:
jovian.commit()
[jovian] Detected Colab notebook... [jovian] Uploading colab notebook to Jovian... Committed successfully! https://jovian.ai/aakashns/python-sklearn-logistic-regression

Saving and Loading Trained Models

We can save the parameters (weights and biases) of our trained model to disk, so that we needn't retrain the model from scratch each time we wish to use it. Along with the model, it's also important to save imputers, scalers, encoders and even column names. Anything that will be required while generating predictions using the model should be saved.

We can use the joblib module to save and load Python objects on the disk.

In [ ]:
import joblib

Let's first create a dictionary containing all the required objects.

In [ ]:
aussie_rain = {
    'model': model,
    'imputer': imputer,
    'scaler': scaler,
    'encoder': encoder,
    'input_cols': input_cols,
    'target_col': target_col,
    'numeric_cols': numeric_cols,
    'categorical_cols': categorical_cols,
    'encoded_cols': encoded_cols
}

We can now save this to a file using joblib.dump

In [ ]:
joblib.dump(aussie_rain, 'aussie_rain.joblib')
Out[]:
['aussie_rain.joblib']

The object can be loaded back using joblib.load

In [ ]:
aussie_rain2 = joblib.load('aussie_rain.joblib')

Let's use the loaded model to make predictions on the original test set.

In [ ]:
test_preds2 = aussie_rain2['model'].predict(X_test)
accuracy_score(test_targets, test_preds2)
Out[]:
0.8424737456242707

As expected, we get the same result as the original model.

Let's save our work before continuing. We can upload our trained models to Jovian using the outputs argument.

In [ ]:
jovian.commit(outputs=['aussie_rain.joblib'])
[jovian] Detected Colab notebook... [jovian] Uploading colab notebook to Jovian... [jovian] Uploading additional outputs... Committed successfully! https://jovian.ai/aakashns/python-sklearn-logistic-regression

Putting it all Together

While we've covered a lot of ground in this tutorial, the number of lines of code for processing the data and training the model is fairly small. Each step requires no more than 3-4 lines of code.

Data Preprocessing

In [ ]:
import opendatasets as od
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Download the dataset
od.download('https://www.kaggle.com/jsphyg/weather-dataset-rattle-package')
raw_df = pd.read_csv('weather-dataset-rattle-package/weatherAUS.csv')
raw_df.dropna(subset=['RainToday', 'RainTomorrow'], inplace=True)

# Create training, validation and test sets
year = pd.to_datetime(raw_df.Date).dt.year
train_df, val_df, test_df = raw_df[year < 2015], raw_df[year == 2015], raw_df[year > 2015]

# Create inputs and targets
input_cols = list(train_df.columns)[1:-1]
target_col = 'RainTomorrow'
train_inputs, train_targets = train_df[input_cols].copy(), train_df[target_col].copy()
val_inputs, val_targets = val_df[input_cols].copy(), val_df[target_col].copy()
test_inputs, test_targets = test_df[input_cols].copy(), test_df[target_col].copy()

# Identify numeric and categorical columns
numeric_cols = train_inputs.select_dtypes(include=np.number).columns.tolist()[:-1]
categorical_cols = train_inputs.select_dtypes('object').columns.tolist()

# Impute missing numerical values
imputer = SimpleImputer(strategy = 'mean').fit(raw_df[numeric_cols])
train_inputs[numeric_cols] = imputer.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = imputer.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = imputer.transform(test_inputs[numeric_cols])

# Scale numeric features
scaler = MinMaxScaler().fit(raw_df[numeric_cols])
train_inputs[numeric_cols] = scaler.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = scaler.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = scaler.transform(test_inputs[numeric_cols])

# One-hot encode categorical features
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore').fit(raw_df[categorical_cols])
encoded_cols = list(encoder.get_feature_names(categorical_cols))
train_inputs[encoded_cols] = encoder.transform(train_inputs[categorical_cols])
val_inputs[encoded_cols] = encoder.transform(val_inputs[categorical_cols])
test_inputs[encoded_cols] = encoder.transform(test_inputs[categorical_cols])

# Save processed data to disk
train_inputs.to_parquet('train_inputs.parquet')
val_inputs.to_parquet('val_inputs.parquet')
test_inputs.to_parquet('test_inputs.parquet')
pd.DataFrame(train_targets).to_parquet('train_targets.parquet')
pd.DataFrame(val_targets).to_parquet('val_targets.parquet')
pd.DataFrame(test_targets).to_parquet('test_targets.parquet')

# Load processed data from disk
train_inputs = pd.read_parquet('train_inputs.parquet')
val_inputs = pd.read_parquet('val_inputs.parquet')
test_inputs = pd.read_parquet('test_inputs.parquet')
train_targets = pd.read_parquet('train_targets.parquet')[target_col]
val_targets = pd.read_parquet('val_targets.parquet')[target_col]
test_targets = pd.read_parquet('test_targets.parquet')[target_col]

Skipping, found downloaded files in "./weather-dataset-rattle-package" (use force=True to force download)

EXERCISE: Try to explain each line of code in the above cell in your own words. Scroll back to relevant sections of the notebook if needed.

Model Training and Evaluation

In [ ]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

# Select the columns to be used for training/prediction
X_train = train_inputs[numeric_cols + encoded_cols]
X_val = val_inputs[numeric_cols + encoded_cols]
X_test = test_inputs[numeric_cols + encoded_cols]

# Create and train the model
model = LogisticRegression(solver='liblinear')
model.fit(X_train, train_targets)

# Generate predictions and probabilities
train_preds = model.predict(X_train)
train_probs = model.predict_proba(X_train)
accuracy_score(train_targets, train_preds)

# Helper function to predict, compute accuracy & plot confusion matrix
def predict_and_plot(inputs, targets, name=''):
    preds = model.predict(inputs)
    accuracy = accuracy_score(targets, preds)
    print("Accuracy: {:.2f}%".format(accuracy * 100))
    cf = confusion_matrix(targets, preds, normalize='true')
    plt.figure()
    sns.heatmap(cf, annot=True)
    plt.xlabel('Prediction')
    plt.ylabel('Target')
    plt.title('{} Confusion Matrix'.format(name));    
    return preds

# Evaluate on validation and test set
val_preds = predict_and_plot(X_val, val_targets, 'Validation')
test_preds = predict_and_plot(X_test, test_targets, 'Test')

# Save the trained model & load it back
aussie_rain = {'model': model, 'imputer': imputer, 'scaler': scaler, 'encoder': encoder,
               'input_cols': input_cols, 'target_col': target_col, 'numeric_cols': numeric_cols,
               'categorical_cols': categorical_cols, 'encoded_cols': encoded_cols}
joblib.dump(aussie_rain, 'aussie_rain.joblib')
aussie_rain2 = joblib.load('aussie_rain.joblib')
Accuracy: 85.41%
Accuracy: 84.25%
Notebook Image
Notebook Image

EXERCISE: Try to explain each line of code in the above cell in your own words. Scroll back to relevant sections of the notebook if needed.

Prediction on Single Inputs

In [ ]:
def predict_input(single_input):
    input_df = pd.DataFrame([single_input])
    input_df[numeric_cols] = imputer.transform(input_df[numeric_cols])
    input_df[numeric_cols] = scaler.transform(input_df[numeric_cols])
    input_df[encoded_cols] = encoder.transform(input_df[categorical_cols])
    X_input = input_df[numeric_cols + encoded_cols]
    pred = model.predict(X_input)[0]
    prob = model.predict_proba(X_input)[0][list(model.classes_).index(pred)]
    return pred, prob

new_input = {'Date': '2021-06-19',
             'Location': 'Launceston',
             'MinTemp': 23.2,
             'MaxTemp': 33.2,
             'Rainfall': 10.2,
             'Evaporation': 4.2,
             'Sunshine': np.nan,
             'WindGustDir': 'NNW',
             'WindGustSpeed': 52.0,
             'WindDir9am': 'NW',
             'WindDir3pm': 'NNE',
             'WindSpeed9am': 13.0,
             'WindSpeed3pm': 20.0,
             'Humidity9am': 89.0,
             'Humidity3pm': 58.0,
             'Pressure9am': 1004.8,
             'Pressure3pm': 1001.5,
             'Cloud9am': 8.0,
             'Cloud3pm': 5.0,
             'Temp9am': 25.7,
             'Temp3pm': 33.0,
             'RainToday': 'Yes'}

predict_input(new_input)
Out[]:
('Yes', 0.6316581522137068)

Let's save our work using Jovian.

In [ ]:
jovian.commit()
[jovian] Detected Colab notebook... [jovian] Uploading colab notebook to Jovian... Committed successfully! https://jovian.ai/aakashns/python-sklearn-logistic-regression

Summary and References

Logistic regression is a commonly used technique for solving binary classification problems. In a logistic regression model:

  • we take a linear combination (or weighted sum) of the input features
  • we apply the sigmoid function to the result to obtain a number between 0 and 1
  • this number represents the probability of the input being classified as "Yes"
  • instead of RMSE, the cross entropy loss function is used to evaluate the results

Here's a visual summary of how a logistic regression model is structured (source):

To train a logistic regression model, we can use the LogisticRegression class from Scikit-learn. We covered the following topics in this tutorial:

  • Downloading a real-world dataset from Kaggle
  • Exploratory data analysis and visualization
  • Splitting a dataset into training, validation & test sets
  • Filling/imputing missing values in numeric columns
  • Scaling numeric features to a \((0,1)\) range
  • Encoding categorical columns as one-hot vectors
  • Training a logistic regression model using Scikit-learn
  • Evaluating a model using a validation set and test set
  • Saving a model to disk and loading it back

Check out the following resources to learn more:

Try training logistic regression models on the following datasets:

  • Breast cancer detection: Predicting whether a tumor is "benign" (noncancerous) or "malignant" (cancerous) using information like its radius, texture etc.
  • Loan Repayment Prediction - Predicting whether applicants will repay a home loan based on factors like age, income, loan amount, no. of children etc.
  • Handwritten Digit Recognition - Identifying which digit from 0 to 9 a picture of handwritten text represents.
In [ ]: