Predicting the Sale Price of Bulldozers Using Machine Learning

In this notebook, we'll work through an example ML project with the goal of predicting the sale price of bulldozers.

1. Problem definition

How well can we predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have sold for?

2. Data

The data is downloaded from the Kaggle Bluebook for Bulldozers competition:
https://www.kaggle.com/c/bluebook-for-bulldozers/data

The data for this competition is split into three parts:

Train.csv is the training set, which contains data through the end of 2011.
Valid.csv is the validation set, which contains data from January 1, 2012 to April 30, 2012. You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
Test.csv is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.

The key fields in train.csv are:

SalesID: the unique identifier of the sale
MachineID: the unique identifier of a machine. A machine can be sold multiple times.
SalePrice: what the machine sold for at auction (only provided in train.csv)
saledate: the date of the sale
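
The date-based split described above can be reproduced from saledate alone. The following is a minimal sketch on made-up rows: the frame `toy`, its values, and the names `train_end`, `valid_end`, `train_part` and `valid_part` are illustrative assumptions, not part of the competition files.

```python
import pandas as pd

# Toy rows standing in for the real data; all values are made up
toy = pd.DataFrame({
    "SalesID": [1, 2, 3, 4],
    "SalePrice": [9500.0, 14000.0, 50000.0, 16000.0],
    "saledate": ["2006-11-16", "2011-12-30", "2012-01-05", "2012-04-28"],
})

# Convert the date strings to datetime64 so they can be compared
toy["saledate"] = pd.to_datetime(toy["saledate"])

# Cutoffs taken from the split described in the text
train_end = pd.Timestamp("2011-12-31")
valid_end = pd.Timestamp("2012-04-30")

# Training rows: sales through the end of 2011
train_part = toy[toy["saledate"] <= train_end]

# Validation rows: sales from January 1, 2012 to April 30, 2012
valid_part = toy[(toy["saledate"] > train_end) & (toy["saledate"] <= valid_end)]

print(len(train_part), len(valid_part))
```

Splitting by date rather than at random keeps every validation sale later than every training sale, which matches the goal of predicting future prices.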

3. Evaluation

The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.

Note: The goal for most regression evaluation metrics is to minimize the error. For example, our goal for this project will be to build an ML model that minimizes RMSLE.
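
To make the metric concrete, here is a minimal sketch of RMSLE in NumPy. It assumes the common log(1 + x) form of the log error, which the text above does not spell out. The helper name `rmsle` and the example prices are illustrative, not taken from the competition.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) computes log(1 + x), so a value of 0 stays well defined
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Made-up prices, for illustration only
actual = [10000.0, 25000.0, 60000.0]
predicted = [12000.0, 24000.0, 50000.0]

score = rmsle(actual, predicted)
print(f"RMSLE: {score:.4f}")
```

Because the error is measured on the log scale, it penalizes relative differences: being off by 10% on a cheap machine counts about the same as being off by 10% on an expensive one.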

4. Features

Kaggle provides a data dictionary detailing all of the features of the dataset. You can view this data dictionary on Kaggle.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
# Import training and validation sets
df = pd.read_csv("data/bluebook-for-bulldozers/TrainAndValid.csv", low_memory=False)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412698 entries, 0 to 412697
Data columns (total 53 columns):
 #   Column                    Non-Null Count   Dtype
---  ------                    --------------   -----
 0   SalesID                   412698 non-null  int64
 1   SalePrice                 412698 non-null  float64
 2   MachineID                 412698 non-null  int64
 3   ModelID                   412698 non-null  int64
 4   datasource                412698 non-null  int64
 5   auctioneerID              392562 non-null  float64
 6   YearMade                  412698 non-null  int64
 7   MachineHoursCurrentMeter  147504 non-null  float64
 8   UsageBand                 73670 non-null   object
 9   saledate                  412698 non-null  object
 10  fiModelDesc               412698 non-null  object
 11  fiBaseModel               412698 non-null  object
 12  fiSecondaryDesc           271971 non-null  object
 13  fiModelSeries             58667 non-null   object
 14  fiModelDescriptor         74816 non-null   object
 15  ProductSize               196093 non-null  object
 16  fiProductClassDesc        412698 non-null  object
 17  state                     412698 non-null  object
 18  ProductGroup              412698 non-null  object
 19  ProductGroupDesc          412698 non-null  object
 20  Drive_System              107087 non-null  object
 21  Enclosure                 412364 non-null  object
 22  Forks                     197715 non-null  object
 23  Pad_Type                  81096 non-null   object
 24  Ride_Control              152728 non-null  object
 25  Stick                     81096 non-null   object
 26  Transmission              188007 non-null  object
 27  Turbocharged              81096 non-null   object
 28  Blade_Extension           25983 non-null   object
 29  Blade_Width               25983 non-null   object
 30  Enclosure_Type            25983 non-null   object
 31  Engine_Horsepower         25983 non-null   object
 32  Hydraulics                330133 non-null  object
 33  Pushblock                 25983 non-null   object
 34  Ripper                    106945 non-null  object
 35  Scarifier                 25994 non-null   object
 36  Tip_Control               25983 non-null   object
 37  Tire_Size                 97638 non-null   object
 38  Coupler                   220679 non-null  object
 39  Coupler_System            44974 non-null   object
 40  Grouser_Tracks            44875 non-null   object
 41  Hydraulics_Flow           44875 non-null   object
 42  Track_Type                102193 non-null  object
 43  Undercarriage_Pad_Width   102916 non-null  object
 44  Stick_Length              102261 non-null  object
 45  Thumb                     102332 non-null  object
 46  Pattern_Changer           102261 non-null  object
 47  Grouser_Type              102193 non-null  object
 48  Backhoe_Mounting          80712 non-null   object
 49  Blade_Type                81875 non-null   object
 50  Travel_Controls           81877 non-null   object
 51  Differential_Type         71564 non-null   object
 52  Steering_Controls         71522 non-null   object
dtypes: float64(3), int64(5), object(45)
memory usage: 166.9+ MB
df.isna().sum()
SalesID                          0
SalePrice                        0
MachineID                        0
ModelID                          0
datasource                       0
auctioneerID                 20136
YearMade                         0
MachineHoursCurrentMeter    265194
UsageBand                   339028
saledate                         0
fiModelDesc                      0
fiBaseModel                      0
fiSecondaryDesc             140727
fiModelSeries               354031
fiModelDescriptor           337882
ProductSize                 216605
fiProductClassDesc               0
state                            0
ProductGroup                     0
ProductGroupDesc                 0
Drive_System                305611
Enclosure                      334
Forks                       214983
Pad_Type                    331602
Ride_Control                259970
Stick                       331602
Transmission                224691
Turbocharged                331602
Blade_Extension             386715
Blade_Width                 386715
Enclosure_Type              386715
Engine_Horsepower           386715
Hydraulics                   82565
Pushblock                   386715
Ripper                      305753
Scarifier                   386704
Tip_Control                 386715
Tire_Size                   315060
Coupler                     192019
Coupler_System              367724
Grouser_Tracks              367823
Hydraulics_Flow             367823
Track_Type                  310505
Undercarriage_Pad_Width     309782
Stick_Length                310437
Thumb                       310366
Pattern_Changer             310437
Grouser_Type                310505
Backhoe_Mounting            331986
Blade_Type                  330823
Travel_Controls             330821
Differential_Type           341134
Steering_Controls           341176
dtype: int64
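
Raw counts are easier to compare as a share of all rows. In the output above, for example, UsageBand is missing in 339,028 of 412,698 rows, roughly 82%. The sketch below shows one way to get that view on a small made-up frame: `toy`, its values, and the name `missing_pct` are illustrative assumptions. On the real data the same expression would be applied to `df`.

```python
import pandas as pd

# Toy frame standing in for the real data; all values are made up
toy = pd.DataFrame({
    "SalePrice": [9500.0, 14000.0, 50000.0, 16000.0],
    "UsageBand": ["Low", None, None, None],
    "Enclosure": ["OROPS", "EROPS", None, "OROPS"],
})

# isna() marks missing cells as True; the column mean of those booleans
# is the fraction of missing values, which we scale to a percentage
missing_pct = (toy.isna().mean() * 100).sort_values(ascending=False)

print(missing_pct.round(1))
```

Sorting in descending order puts the sparsest columns first, which helps when deciding which columns need imputation or closer inspection.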