Introduction

Pipelines are a simple way to keep your data preprocessing and modeling code organized. Specifically, a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step.

Many data scientists hack together models without pipelines, but pipelines have some important benefits. Those include:

  1. Cleaner Code: Accounting for data at each step of preprocessing can get messy. With a pipeline, you won't need to manually keep track of your training and validation data at each step.
  2. Fewer Bugs: There are fewer opportunities to misapply a step or forget a preprocessing step.
  3. Easier to Productionize: It can be surprisingly hard to transition a model from a prototype to something deployable at scale. We won't go into the many related concerns here, but pipelines can help.
  4. More Options for Model Validation: You will see an example in the next tutorial, which covers cross-validation.
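As a rough illustration of the points above, here is a minimal sketch of a scikit-learn pipeline. The toy arrays and the choice of median imputation, standard scaling, and linear regression are assumptions made for this example only; they are not necessarily the steps used later in this tutorial.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: two numeric features, one missing value.
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y_train = np.array([1.0, 2.0, 3.0, 4.0])
X_valid = np.array([[5.0, np.nan]])

# Each step is a (name, estimator) pair; the final step is the model.
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

# fit() learns the imputation values, scaling parameters, and model weights
# from the training data only.
pipe.fit(X_train, y_train)

# predict() reapplies the already-fitted preprocessing to the new rows
# before calling the model, so no step can be forgotten or misapplied.
preds = pipe.predict(X_valid)
```

Because the whole bundle exposes a single `fit` and `predict`, the validation data never has to be transformed by hand at each step.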

Table of Contents

  1. About the Dataset
  2. Performing Train Test Split
  3. Preprocessing W/O Pipelines
     3.1 [Imputing Numeric Columns](#ImputeNum)
     3.2 [Scaling Numeric Columns](#Scaling-Num)
     3.3 [Imputing Categorical Columns](#Impute-Cat)
     3.4 [Encoding Categorical Columns](#Encoding-Cat)
  4. Model Implementation W/O Pipelines
  5. Pipeline
     5.1 [Pipeline Implementation](#Pipeline-Implementation)
     5.2 [Model Implementation with Pipelines](#Implementation-with-pipeline)
  6. Summary
  7. References

1: About the Dataset


The dataset comes from Kaggle and can be accessed from here. The data contains information from the 1990 California census.

The dataset contains the following columns:

  1. longitude: A measure of how far west a house is; a more negative value is farther west

  2. latitude: A measure of how far north a house is; a higher value is farther north

  3. housingMedianAge: Median age of a house within a block; a lower number is a newer building

  4. totalRooms: Total number of rooms within a block

  5. totalBedrooms: Total number of bedrooms within a block

  6. population: Total number of people residing within a block

  7. households: Total number of households (groups of people residing within a home unit) within a block

  8. medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars)

  9. medianHouseValue: Median house value for households within a block (measured in US Dollars)

  10. oceanProximity: Location of the house relative to the ocean/sea

medianHouseValue is the target column; the remaining columns are the features.
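To make the column roles concrete, the sketch below separates the target from the features and groups the features by type. The three-row frame and its values are made up for illustration, and the column names simply follow the list above; the headers in the actual CSV may be spelled differently.

```python
import pandas as pd

# Tiny hypothetical frame using a few of the columns described above.
df = pd.DataFrame({
    "medianIncome": [8.3, 7.2, 5.6],
    "oceanProximity": ["NEAR BAY", "INLAND", "NEAR OCEAN"],
    "medianHouseValue": [452600.0, 358500.0, 341300.0],
})

target = "medianHouseValue"

# Features are every column except the target.
X = df.drop(columns=[target])
y = df[target]

# Numeric and categorical columns need different preprocessing,
# so it helps to list them separately up front.
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()
```

Here `oceanProximity` is the only non-numeric feature, so it is the one that would need categorical imputation and encoding.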

2: Performing Train Test Split


# Importing all the required libraries
!pip install opendatasets --quiet
import opendatasets as od
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")