Course Project - Real-World Machine Learning Project

Ask questions related to the project on this thread; submissions can be made via the link given above.

:page_facing_up: Real-World Machine Learning Project

In this project, you’ll apply and practice the following concepts:

  • Perform data cleaning & feature engineering
  • Train, compare & tune multiple models
  • Document and publish your work online

:ledger: Detailed Instructions

In the course project, you will apply the machine learning skills covered in this course by training an ML model on a real-world dataset. Follow these steps to complete your project:

  1. Pick a large real-world dataset from Kaggle (see the “Recommended Datasets” section below) and download it using opendatasets (a minimal sketch of the overall workflow is shown after this list). Your training set should contain at least 50,000 rows and 5 columns of data.

  2. Read the dataset description, understand the problem statement and describe the modeling objective clearly. You can also browse through existing notebooks created by others for inspiration.

  3. Perform exploratory data analysis, gather insights about the data, perform feature engineering, create a training-validation split, and prepare the data for modeling.

  4. Train & evaluate different machine learning models, tune hyperparameters and reduce overfitting to improve the model.

  5. Report the final performance of your best model(s), show sample predictions, and save model weights. Summarize your work, share links to references, and suggest ideas for future work.

  6. Publish your Jupyter notebook to Jovian, make a submission below and share your project with the community. Optionally, you may also write a blog post and contribute to the Jovian official blog.
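
Here is a minimal sketch of this workflow, assuming a regression problem; the dataset URL, file path, and column names are placeholders that you would replace with your own:

```python
# A minimal sketch of the end-to-end workflow, assuming a regression problem.
# The dataset URL, file path, and column names below are placeholders.
import opendatasets as od
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Step 1: download a Kaggle dataset (prompts for your Kaggle username & API key)
od.download('https://www.kaggle.com/c/some-competition')  # placeholder URL

# Step 3: prepare the data and create a training/validation split
df = pd.read_csv('some-competition/train.csv')  # placeholder path
input_cols = ['feature_1', 'feature_2']         # placeholder feature columns
target_col = 'target'                           # placeholder target column
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

# Step 4: train & evaluate at least two different types of models
models = {
    'ridge': Ridge(alpha=1.0),
    'random_forest': RandomForestRegressor(n_estimators=100, max_depth=10,
                                           random_state=42, n_jobs=-1),
}
for name, model in models.items():
    model.fit(train_df[input_cols], train_df[target_col])
    preds = model.predict(val_df[input_cols])
    print(name, 'validation MAE:', mean_absolute_error(val_df[target_col], preds))

# Step 5: save the weights of your best model
joblib.dump(models['random_forest'], 'best_model.joblib')
```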

There is no starter notebook for the course project. Please use the “New” button on Jovian to create a new notebook, “Run on Colab” to execute it, and jovian.commit to record versions. Please review the “Evaluation Criteria” and “Recommended Datasets” sections below carefully before starting your project.
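
For reference, recording a version from within the notebook is a one-liner (the project name below is just a placeholder):

```python
import jovian

# Saves a snapshot of the notebook to your Jovian profile;
# 'my-course-project' is a placeholder project name.
jovian.commit(project='my-course-project')
```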

:dart: Evaluation Criteria

Your submission must satisfy the following criteria:

  • Training set should contain at least 50,000 rows of data and 5 columns
  • Notebook must include all the steps listed in the project guidelines above
  • Notebook must be executed end-to-end with error-free outputs for all cells
  • You must train at least 2 different types of machine learning models
  • You must tune at least 2 different hyperparameters for your chosen model
  • Your model’s performance on the validation set must be reasonably good
  • Your project must be documented extensively using markdown cells
  • Notebook must include references to relevant notebooks/tutorials/documentation sites
  • Your notebook must not be plagiarized (i.e., directly copied) from another project

:computer: Join the Jovian Discord Server to interact with the course team, share resources, and attend the study hours :point_right: Jovian Discord Server

Hello, for the course project, I am thinking of doing a Random Forest-based classification of the text in the ‘comments’ column of a large table.

I know that part of the project would normally involve some preliminary exploratory data analysis, a bit like what was done in the course Zero to Pandas. Since I am dealing with a text analysis problem, much of that may not apply; what type of preliminary analysis would be expected in this case instead? Thanks.

Hello,
my kernel keeps dying when I transform train_input[numeric_cols] and test_input[numeric_cols] to impute missing numerical data using SimpleImputer for my course project.

Kernel doesn’t automatically restart either.

Please guide me to the easiest solution.

Please help me. Why can’t I do this?

Are you using a very large dataset? Please do the course project on Colab; you can use Colab’s computing power and RAM for large datasets. Also, if the dataset is very large, use a sample of it to train, as in the sketch below.
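
For example, here is a quick sketch of sampling with pandas (the file name and sampling fraction are placeholders):

```python
import pandas as pd

df = pd.read_csv('train.csv')                     # placeholder file name
sample_df = df.sample(frac=0.1, random_state=42)  # keep 10% of the rows
```

Fitting the imputer and the models on sample_df instead of the full dataframe should reduce memory usage considerably.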

Hey, the link in od.download() seems incorrect. Please provide the correct Kaggle link.

Thanks, I will try with a smaller sample for the training set.

If my training dataset has 60 columns and 760k rows, will Google Colab be able to handle it?

Hello, can someone please share the documentation for tuning hyperparameters for LinearRegression from sklearn?
Thanks.

This link is from the Jovian site’s related datasets. I used the link from Kaggle, but it gave an error, so I decided to choose this link instead.

Hello, I am getting the following error:
```
:1: DataConversionWarning:

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
```
when I fit my inputs and targets with RandomForestRegressor.
Please guide & provide a solution.

I think yes, Colab will be able to handle that.

Please check the documentation of LinearRegression; you will see some parameters. Go through the documentation and search for the parameters that can be changed. There are not a lot of parameters in LinearRegression, but if you use Lasso/Ridge regression you will see a lot more.
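
For instance, here is a small sketch of tuning the regularization strength of Ridge with GridSearchCV (the toy data and parameter grid are just for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Toy data just to make the sketch runnable; substitute your own inputs/targets.
X_train = np.random.rand(200, 5)
y_train = np.random.rand(200)

# Search over the regularization strength 'alpha' with 5-fold cross-validation.
search = GridSearchCV(Ridge(), param_grid={'alpha': [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
```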

@thasnihakeem2017 Can you share the error you are getting when you use the original kaggle link?

@prachin2002patel As the error already mentions, the shape of the targets is a 2-d array; convert it into a 1-d array.
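
For example, with a toy array (with a single-column DataFrame you would typically call train_targets.values.ravel() before passing the targets to fit):

```python
import numpy as np

y = np.array([[1.0], [2.0], [3.0]])  # column vector, shape (3, 1)
y_1d = y.ravel()                     # flattened to shape (3,), as sklearn expects
```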

Hey, please go to this website → Walmart Recruiting - Store Sales Forecasting | Kaggle, open the Rules tab on that page, and accept the rules for this competition (if you have not added a phone number to your Kaggle profile, you will probably have to add one first).
After you have accepted the rules, run this cell again → od.download("https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting") and provide your Kaggle username and key. Hopefully you will now be able to download the dataset; the full step is shown below for reference.
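
For reference, the complete download step looks like this (your API key can be generated from your Kaggle account page via “Create New API Token”):

```python
import opendatasets as od

# Prompts for your Kaggle username and API key, then downloads the
# competition files into a local folder.
od.download("https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting")
```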


Please help me solve this problem.
Why am I not able to display train_df?

Can you display test_df, sampleSubmission_df, etc.?


I restarted my Colab notebook and the problem was solved.
Thanks for your response.

I am doing the final project, and some doubts have arisen. The project I have chosen is Walmart sales forecasting. My doubts appear at the moment of making the submission on Kaggle: in my first submission I obtained a WMAE of almost 4000, but after tuning the hyperparameters that score increased a lot. I imagine this happens due to overfitting; I think the model reached its limit and the tuning only makes things worse. I have used XGBoost and Random Forest, because models like LinearRegression gave bad results. I thought that doing a little feature engineering could improve the score, but in fact it got worse. Is there a tutorial on how to do feature engineering correctly?

Never mind, I made a dumb mistake while merging the dataframes; that’s the reason for the bad scores.