New York City Taxi Fare Prediction
Dataset Link: https://www.kaggle.com/c/new-york-city-taxi-fare-prediction
We'll train a machine learning model to predict the fare for a taxi ride in New York City, given information like pickup date & time, pickup location, drop location and number of passengers.
This dataset is taken from a Kaggle competition organized by Google Cloud. It contains over 55 million rows of training data. We'll attempt to achieve a respectable score in the competition using just a fraction of the data. Along the way, we'll also look at some practical tips for machine learning. Most of the ideas & techniques covered in this notebook are derived from other public notebooks & blog posts.
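One possible way to work with just a fraction of a very large CSV file is to skip rows at random while reading it. The sketch below is an illustration under stated assumptions, not necessarily the approach used later in this notebook: the helper name `make_row_skipper`, the 1% fraction and the `train.csv` path in the comment are all hypothetical. It relies on the `skiprows` parameter of `pandas.read_csv`, which accepts a callable that receives a row index and returns `True` for rows to skip.

```python
import random


def make_row_skipper(sample_fraction, seed=42):
    """Build a callable suitable for pandas.read_csv(skiprows=...).

    The callable keeps the header (row index 0) and keeps each remaining
    row with probability `sample_fraction`. The name and defaults here
    are illustrative assumptions.
    """
    rng = random.Random(seed)  # private generator, so results are repeatable

    def skip_row(row_index):
        if row_index == 0:
            return False  # never skip the header line
        # Skip the row unless the random draw falls within the fraction
        return rng.random() > sample_fraction

    return skip_row


# Hypothetical usage (the file path is an assumption):
#   import pandas as pd
#   df = pd.read_csv("train.csv", skiprows=make_row_skipper(0.01))
```

Because the rows are chosen while the file is being read, the full dataset never has to fit in memory. The trade-off is that the sample size is only approximately the requested fraction.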
To run this notebook, select "Run" > "Run on Colab" and connect your Google Drive account with Jovian. Make sure to use the GPU runtime if you plan on using a GPU.
You can find the completed version of this notebook here: https://jovian.ai/aakashns/nyc-taxi-fare-prediction-filled
TIP #1: Create an outline for your notebook & for each section before you start coding
Here's an outline of the project:
- Download the dataset
- Explore & analyze the dataset
- Prepare the dataset for ML training
- Train hardcoded & baseline models
- Make predictions & submit to Kaggle
- Perform feature engineering
- Train & evaluate different models
- Tune hyperparameters for the best models
- Train on a GPU with the entire dataset
- Document & publish the project online
1. Download the Dataset
Steps:
- Install required libraries
- Download data from Kaggle
- View dataset files
- Load training set with Pandas
- Load test set with Pandas
Install Required Libraries
!pip install jovian opendatasets pandas numpy scikit-learn xgboost --quiet
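After installing, it can be useful to confirm that each library is importable before moving on. The following is a minimal sketch using only the standard library. The helper name `find_missing` is an assumption introduced for illustration, and the mapping reflects the fact that `scikit-learn` is imported as `sklearn`.

```python
import importlib.util

# Map each pip package name to the module name used when importing it.
REQUIRED_MODULES = {
    "jovian": "jovian",
    "opendatasets": "opendatasets",
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",  # pip name differs from import name
    "xgboost": "xgboost",
}


def find_missing(modules=REQUIRED_MODULES):
    """Return the pip names of packages whose module cannot be located."""
    return [
        pip_name
        for pip_name, module_name in modules.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = find_missing()
print("Missing packages:", ", ".join(missing) if missing else "none")
```

If any package is reported as missing, re-running the install cell (and restarting the runtime if needed) is usually the first thing to try.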