Dataset Selection

Dear Course Support Team,

I reviewed the Recommended Datasets for the Machine Learning Course Project. The list is diverse, but I wonder which dataset is most manageable given the project's time and resource constraints, since some of the datasets may not fit the project acceptance criteria.

For example, the majority of the datasets either require several hours of model training without any interruption in the Jupyter Notebook connection to the server (so a GPU-powered machine is needed), or require extensive data preparation, merging, and math-heavy feature engineering beyond the scope of the course.
Most of them need transformations beyond imputation, scaling, and encoding, which may cause problems in the middle of the project, possibly to the point of having to switch to another dataset in order to meet the project criteria.

I would be grateful for any recommendation on selecting a manageable dataset.

Thank you
Mohsen

Hey Mohsen,
You can choose any dataset you like; there is no restriction other than the total number of rows and columns. I also don't think you need a GPU-powered machine or several hours of model training, as all of these datasets are tabular, with no images and no need for a neural network. Some steps might take 10 to 20 minutes, but they would hardly ever take hours. Moreover, we are not asking you to train the best possible model: you have to train a model that is at least better than a random guess and submit it for evaluation (with enough documentation). You can keep working on the same project after it is evaluated to improve the model's accuracy and score.
PS: The Recommended Datasets list is for those who are doing a project like this for the first time and are unsure where to get data. You don't have to choose your project dataset from the recommended list if there is another dataset you want to work on.
Thank You.
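
As a rough illustration of the "better than a random guess" bar mentioned above, here is a minimal sketch for a classification project. It is not the official evaluation method: it assumes accuracy as the metric and uses a majority-class baseline as a stand-in for random guessing (usually a slightly stricter bar). The function names and the toy labels are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == majority_label)
    return hits / len(y_test)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def beats_baseline(y_train, y_test, y_pred):
    """True if the model's accuracy is strictly above the majority-class baseline."""
    return accuracy(y_test, y_pred) > majority_baseline_accuracy(y_train, y_test)


# Toy labels for illustration only (not from any course dataset).
y_train = [0, 0, 0, 1, 1]
y_test = [0, 0, 1, 1]
y_pred = [0, 0, 1, 0]  # predictions from some hypothetical model

baseline_acc = majority_baseline_accuracy(y_train, y_test)
model_acc = accuracy(y_test, y_pred)
print(f"baseline={baseline_acc:.2f} model={model_acc:.2f}")
```

Reporting both numbers side by side in the project documentation makes it easy to see that the model clears the baseline.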

Hi Birajde,

Thank you for your clarification and information.
I will make sure to follow the guidelines and project criteria above.

Thank you.

Hi Birajde,
The recommended datasets cover non-medical topics, and I am having difficulty choosing one of them. I am more interested in this one: COVID-19 World Vaccination Progress | Kaggle, which has a future application in my work. Its shape is 36901 x 15. I am also working with more than three national databases with millions of records, but they are not on Kaggle because they are paid. Can I use any of them? It would be more beneficial if I could apply what we were taught to my real applications.
What do you think?
Thank you
Eman Toraih

Yes, you can use any dataset; you don't need to select one from Kaggle. You can get data from anywhere for your project.

Thank you, Birajde. Can we request a time extension? Eman

Sure, take your time completing the project.

Hi @birajde,
Hope you are doing fine!
Some of the recommended datasets do not have more than 50 thousand rows, for example “West Nile virus detection”. Is it okay to use datasets that have about 30 thousand rows?
Thank you!

If the dataset is in the recommended dataset list, you can use it for the project.
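
For anyone bringing their own data, a quick way to check a CSV file's shape before committing to it might look like the sketch below. It uses only the Python standard library and assumes the first line of the file is a header. The `min_rows` and `min_cols` arguments are placeholders: take the actual limits from the project criteria, not from this example.

```python
import csv
import io


def csv_shape(handle):
    """Return (data_rows, columns) for a CSV file object, assuming a header line."""
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None:  # empty file
        return (0, 0)
    n_rows = sum(1 for _ in reader)
    return (n_rows, len(header))


def meets_size_criteria(shape, min_rows, min_cols):
    """Check a (rows, columns) shape against minimum limits supplied by the caller."""
    rows, cols = shape
    return rows >= min_rows and cols >= min_cols


# Tiny in-memory sample for illustration; for a real file, use
# open("your_file.csv", newline="") instead of io.StringIO.
sample = io.StringIO(
    "country,date,vaccinations\n"
    "A,2021-01-01,10\n"
    "B,2021-01-02,20\n"
)
shape = csv_shape(sample)
print(shape)
```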