Predicting the Sale Price of Bulldozers Using Machine Learning
In this notebook, we work through an example machine learning project with the goal of predicting the sale price of bulldozers.
1. Problem definition
How well can we predict the future sale price of a bulldozer, given its characteristics and previous examples of how much similar bulldozers have sold for?
2. Data
The data is downloaded from the Kaggle Bluebook for Bulldozers competition:
https://www.kaggle.com/c/bluebook-for-bulldozers/data
The data for this competition is split into three parts:
Train.csv is the training set, which contains data through the end of 2011.
Valid.csv is the validation set, which contains data from January 1, 2012 to April 30, 2012. You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
Test.csv is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 to November 2012. Your score on the test set determines your final rank for the competition.
The key fields in train.csv are:
SalesID: the unique identifier of the sale
MachineID: the unique identifier of a machine. A machine can be sold multiple times
SalePrice: what the machine sold for at auction (only provided in train.csv)
saledate: the date of the sale
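The date-based split described above can be sketched with a boolean mask on saledate. This is a minimal illustration rather than the competition's own tooling: the `sales` frame, its values, and the ISO date strings are made up for the example, and the real files may store dates in a different format.

```python
import pandas as pd

# Tiny made-up frame standing in for the real data (all values are hypothetical)
sales = pd.DataFrame({
    "SalesID": [1, 2, 3, 4],
    "SalePrice": [10000.0, 15000.0, 20000.0, 25000.0],
    "saledate": ["2006-11-16", "2011-12-31", "2012-01-01", "2012-04-30"],
})

# Convert the date strings to datetimes so they can be compared
sales["saledate"] = pd.to_datetime(sales["saledate"])

# Training data runs through the end of 2011; validation data starts January 1, 2012
cutoff = pd.Timestamp("2012-01-01")
train = sales[sales["saledate"] < cutoff]
valid = sales[sales["saledate"] >= cutoff]

print(f"train rows: {len(train)}, validation rows: {len(valid)}")
```

Splitting by date rather than at random keeps the validation rows strictly later than the training rows, which matches the goal of predicting future sale prices.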
3. Evaluation
The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.
Note: The goal for most regression evaluation metrics is to minimize the error. For example, our goal for this project is to build a machine learning model that minimizes RMSLE.
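To make the metric concrete, here is a minimal sketch of RMSLE in NumPy. It assumes the common log(1 + x) convention (the same one used by scikit-learn's mean_squared_log_error); the function name `rmsle` and the example prices are made up for illustration and are not taken from the dataset.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) computes log(1 + x), which stays defined when a value is 0
    log_diff = np.log1p(y_pred) - np.log1p(y_true)
    return float(np.sqrt(np.mean(log_diff ** 2)))

# Hypothetical auction prices, for illustration only
actual = [10000.0, 25000.0, 60000.0]
predicted = [12000.0, 24000.0, 50000.0]
score = rmsle(actual, predicted)
print(f"RMSLE: {score:.4f}")
```

Because the error is computed on logarithms, it measures relative rather than absolute differences: being off by $1,000 on a $10,000 machine costs more than being off by $1,000 on a $100,000 machine.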
4. Features
Kaggle provides a data dictionary detailing all of the features of the dataset. You can view this data dictionary on Kaggle.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
# Import training and validation sets
df = pd.read_csv("data/bluebook-for-bulldozers/TrainAndValid.csv", low_memory=False)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 412698 entries, 0 to 412697
Data columns (total 53 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SalesID 412698 non-null int64
1 SalePrice 412698 non-null float64
2 MachineID 412698 non-null int64
3 ModelID 412698 non-null int64
4 datasource 412698 non-null int64
5 auctioneerID 392562 non-null float64
6 YearMade 412698 non-null int64
7 MachineHoursCurrentMeter 147504 non-null float64
8 UsageBand 73670 non-null object
9 saledate 412698 non-null object
10 fiModelDesc 412698 non-null object
11 fiBaseModel 412698 non-null object
12 fiSecondaryDesc 271971 non-null object
13 fiModelSeries 58667 non-null object
14 fiModelDescriptor 74816 non-null object
15 ProductSize 196093 non-null object
16 fiProductClassDesc 412698 non-null object
17 state 412698 non-null object
18 ProductGroup 412698 non-null object
19 ProductGroupDesc 412698 non-null object
20 Drive_System 107087 non-null object
21 Enclosure 412364 non-null object
22 Forks 197715 non-null object
23 Pad_Type 81096 non-null object
24 Ride_Control 152728 non-null object
25 Stick 81096 non-null object
26 Transmission 188007 non-null object
27 Turbocharged 81096 non-null object
28 Blade_Extension 25983 non-null object
29 Blade_Width 25983 non-null object
30 Enclosure_Type 25983 non-null object
31 Engine_Horsepower 25983 non-null object
32 Hydraulics 330133 non-null object
33 Pushblock 25983 non-null object
34 Ripper 106945 non-null object
35 Scarifier 25994 non-null object
36 Tip_Control 25983 non-null object
37 Tire_Size 97638 non-null object
38 Coupler 220679 non-null object
39 Coupler_System 44974 non-null object
40 Grouser_Tracks 44875 non-null object
41 Hydraulics_Flow 44875 non-null object
42 Track_Type 102193 non-null object
43 Undercarriage_Pad_Width 102916 non-null object
44 Stick_Length 102261 non-null object
45 Thumb 102332 non-null object
46 Pattern_Changer 102261 non-null object
47 Grouser_Type 102193 non-null object
48 Backhoe_Mounting 80712 non-null object
49 Blade_Type 81875 non-null object
50 Travel_Controls 81877 non-null object
51 Differential_Type 71564 non-null object
52 Steering_Controls 71522 non-null object
dtypes: float64(3), int64(5), object(45)
memory usage: 166.9+ MB
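The summary above lists saledate with dtype object, meaning the dates were read as plain strings. One possible way to get real datetimes is the parse_dates argument of pd.read_csv. The sketch below uses a small in-memory CSV with made-up rows instead of the real file, and the `saleYear` column name is only an illustration.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for the real file (rows are hypothetical)
csv_text = (
    "SalesID,SalePrice,saledate\n"
    "1,10000,2006-11-16\n"
    "2,15000,2012-01-01\n"
)

# parse_dates asks pandas to convert the listed columns to datetimes while loading
demo = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])
print(demo.dtypes)

# With a datetime column, date parts are available through the .dt accessor
demo["saleYear"] = demo["saledate"].dt.year
print(demo[["saledate", "saleYear"]])
```

The same argument could be passed alongside low_memory=False when loading the full file, assuming its date strings are in a format pandas can parse.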
df.isna().sum()
SalesID 0
SalePrice 0
MachineID 0
ModelID 0
datasource 0
auctioneerID 20136
YearMade 0
MachineHoursCurrentMeter 265194
UsageBand 339028
saledate 0
fiModelDesc 0
fiBaseModel 0
fiSecondaryDesc 140727
fiModelSeries 354031
fiModelDescriptor 337882
ProductSize 216605
fiProductClassDesc 0
state 0
ProductGroup 0
ProductGroupDesc 0
Drive_System 305611
Enclosure 334
Forks 214983
Pad_Type 331602
Ride_Control 259970
Stick 331602
Transmission 224691
Turbocharged 331602
Blade_Extension 386715
Blade_Width 386715
Enclosure_Type 386715
Engine_Horsepower 386715
Hydraulics 82565
Pushblock 386715
Ripper 305753
Scarifier 386704
Tip_Control 386715
Tire_Size 315060
Coupler 192019
Coupler_System 367724
Grouser_Tracks 367823
Hydraulics_Flow 367823
Track_Type 310505
Undercarriage_Pad_Width 309782
Stick_Length 310437
Thumb 310366
Pattern_Changer 310437
Grouser_Type 310505
Backhoe_Mounting 331986
Blade_Type 330823
Travel_Controls 330821
Differential_Type 341134
Steering_Controls 341176
dtype: int64
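Raw counts are easier to compare as a share of all rows. For example, Blade_Extension is missing in 386715 of 412698 rows, roughly 94%, while Enclosure is missing in only 334. A minimal sketch of that calculation follows, using a small made-up frame (the `demo` values are hypothetical) rather than the full dataset.

```python
import numpy as np
import pandas as pd

# Small made-up frame with gaps, standing in for the real data
demo = pd.DataFrame({
    "SalesID": [1, 2, 3, 4],
    "UsageBand": ["a", np.nan, np.nan, np.nan],
    "Enclosure": ["x", "y", np.nan, "x"],
})

# isna() gives a True/False frame; the column mean of booleans is the fraction missing
missing_share = demo.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running `df.isna().mean().sort_values(ascending=False)` on the full DataFrame would rank all 53 columns the same way.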