New York City Taxi Fare Prediction

Dataset Link: https://www.kaggle.com/c/new-york-city-taxi-fare-prediction

We'll train a machine learning model to predict the fare for a taxi ride in New York City, given information like pickup date & time, pickup location, drop location and number of passengers.

This dataset is taken from a Kaggle competition organized by Google Cloud. It contains over 55 million rows of training data. We'll attempt to achieve a respectable score in the competition using just a fraction of the data. Along the way, we'll also look at some practical tips for machine learning. Most of the ideas & techniques covered in this notebook are derived from other public notebooks & blog posts.

To run this notebook, select "Run" > "Run on Colab" and connect your Google Drive account with Jovian. Make sure to use the GPU runtime if you plan on using a GPU.

You can find the completed version of this notebook here: https://jovian.ai/aakashns/nyc-taxi-fare-prediction-filled

TIP #1: Create an outline for your notebook & for each section before you start coding

Here's an outline of the project:

  1. Download the dataset
  2. Explore & analyze the dataset
  3. Prepare the dataset for ML training
  4. Train hardcoded & baseline models
  5. Make predictions & submit to Kaggle
  6. Perform feature engineering
  7. Train & evaluate different models
  8. Tune hyperparameters for the best models
  9. Train on a GPU with the entire dataset
  10. Document & publish the project online

1. Download the Dataset

Steps:

  • Install required libraries
  • Download data from Kaggle
  • View dataset files
  • Load training set with Pandas
  • Load test set with Pandas
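Since the plan is to work with just a fraction of the 55 million training rows, here is a minimal sketch of one way to read a random sample of a large CSV with Pandas without loading the whole file into memory. The helper name `load_sample`, its parameters, and the two-column synthetic data are assumptions made for illustration; they are not taken from the competition files.

```python
import io
import random

import pandas as pd


def load_sample(csv_source, fraction=0.01, seed=42, **read_csv_kwargs):
    """Read roughly `fraction` of the data rows from a CSV, keeping the header.

    `load_sample` is a hypothetical helper, not part of Pandas.
    """
    rng = random.Random(seed)

    def skip_row(row_idx):
        # Row 0 is the header and is always kept; every other row is
        # skipped with probability (1 - fraction).
        return row_idx > 0 and rng.random() > fraction

    return pd.read_csv(csv_source, skiprows=skip_row, **read_csv_kwargs)


# Synthetic stand-in data (NOT the real Kaggle file): a header plus 1,000 rows.
rows = [f"{2.5 + i % 50},{1 + i % 4}" for i in range(1000)]
csv_text = "fare_amount,passenger_count\n" + "\n".join(rows)

sample_df = load_sample(io.StringIO(csv_text), fraction=0.1)
print(len(sample_df), "of 1000 rows loaded")
```

On the real data, the same function could be given the path to the downloaded training CSV instead of the in-memory buffer. A fixed seed keeps the sample reproducible between runs.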

Install Required Libraries

!pip install jovian opendatasets pandas numpy scikit-learn xgboost --quiet
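Because the install command runs with `--quiet`, it can be handy to confirm what actually got installed. The following is a small sketch using the standard library's `importlib.metadata`; the `PACKAGES` list and the `installed_versions` helper are illustrative names, not part of any of these libraries.

```python
from importlib import metadata

# Distribution names exactly as passed to pip in the install command.
PACKAGES = ["jovian", "opendatasets", "pandas", "numpy", "scikit-learn", "xgboost"]


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as missing can be reinstalled without `--quiet` to see the full pip output.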