
Cross-Validation with Linear Regression

This notebook demonstrates how to do cross-validation (CV), using linear regression as an example. CV is not specific to linear regression; it is used with almost all modelling techniques, such as decision trees, SVMs, etc. We will mainly use sklearn to perform cross-validation.

This notebook is divided into the following parts:

0. Experiments to understand overfitting
1. Building a linear regression model without cross-validation
2. Problems in the current approach
3. Cross-validation: A quick recap
4. Cross-validation in sklearn:
   • 4.1 K-fold CV
   • 4.2 Hyperparameter tuning using CV
   • 4.3 Other CV schemes
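
Before diving in, here's a quick preview of the k-fold CV workflow covered in part 4. This is a minimal sketch on synthetic data (not the housing data used below); the shapes, coefficients, and noise level are made up purely for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# synthetic data: 100 samples, 3 features, a known linear relationship plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 5-fold CV: each fold is held out once while the model trains on the rest
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=folds, scoring='r2')
print(scores)         # one R^2 score per fold
print(scores.mean())  # average estimate of generalisation performance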

0. Experiments to Understand Overfitting

In this section, let's quickly run through some experiments with polynomial regression to see what overfitting looks like.

# import all libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import re

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import scale
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

import warnings  # suppress warnings
warnings.filterwarnings('ignore')

# import Housing.csv
housing = pd.read_csv('Housing.csv')
housing.head()

# number of observations
len(housing.index)
545
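
As a minimal sketch of where these experiments are headed (assuming the dataset's numeric 'area' and 'price' columns), the snippet below fits polynomials of increasing degree on a single feature and compares train vs. test R². Training R² can only improve with degree, while test R² typically flattens or drops once the model starts fitting noise — that gap is overfitting.

X = housing[['area']].values
y = housing['price'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

for degree in [1, 2, 5, 10]:
    # scale to [0, 1] first so high-degree polynomial terms stay well-conditioned
    model = make_pipeline(MinMaxScaler(), PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print("degree %2d: train R^2 = %.3f, test R^2 = %.3f"
          % (degree, model.score(X_train, y_train), model.score(X_test, y_test)))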