Recommended Datasets for Course Project

Where to find datasets?

You can find interesting datasets on Kaggle:

The data should be in CSV format and should contain at least 3 columns and 150 rows.
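A quick way to check these minimums once a file is loaded, as a minimal sketch. The helper name `meets_project_minimums` and the small in-memory frame are illustrative assumptions, not part of the course material; in practice the frame would come from `pd.read_csv` on your own file.

```python
import pandas as pd

MIN_ROWS = 150     # minimum number of rows stated above
MIN_COLUMNS = 3    # minimum number of columns stated above


def meets_project_minimums(df: pd.DataFrame) -> bool:
    """Return True if the frame has at least MIN_ROWS rows and MIN_COLUMNS columns."""
    n_rows, n_cols = df.shape
    return n_rows >= MIN_ROWS and n_cols >= MIN_COLUMNS


# Hypothetical stand-in for a frame loaded with pd.read_csv(...)
sample_df = pd.DataFrame({"a": range(200), "b": range(200), "c": range(200)})
print(meets_project_minimums(sample_df))
```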

You can also create a new dataset on Kaggle by uploading a CSV file here: https://www.kaggle.com/datasets?new=true (make sure to keep your dataset public, otherwise it will not be downloadable)

How to download a dataset within Jupyter?

Datasets can be downloaded within Jupyter using the opendatasets Python library. Here’s some sample code for downloading the US Elections Dataset:

import opendatasets as od
dataset_url = 'https://www.kaggle.com/tunguz/us-elections-dataset'
od.download(dataset_url)

Some interesting datasets

Other sources to look for datasets:

If you use an external source other than Kaggle, you’ll create a new dataset on Kaggle by uploading a CSV file here: https://www.kaggle.com/datasets?new=true (make sure to keep your dataset public, otherwise it will not be downloadable using opendatasets)

Downloading Personal data for EDA

You can also analyze your own personal data for exploratory data analysis, from the following sources:

Use this thread for sharing interesting datasets.

Inspirational project indeed!

I keep getting the error message “UnicodeDecodeError” when trying to create a DataFrame from my ‘candidate csv’ file for the project using pd.read_csv(). I am kinda frustrated! Can anyone help out, please?

@vincent-kizza try this

import pandas as pd
df = pd.read_csv('file_name.csv', engine='python')

You can try using pd.read_csv('file.csv', encoding="utf-8")

If that doesn’t work, please post the entire traceback that you are getting.
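Since utf-8 is already the default encoding for pd.read_csv, a UnicodeDecodeError usually means the file was saved in some other encoding. Here is a minimal sketch of trying a few common candidates in turn. The helper name, the candidate list and the demo file are assumptions for illustration. Note that latin-1 can decode any byte sequence, so a result obtained with it may still contain wrong characters.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each encoding in order; return (DataFrame, encoding that worked)."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc), enc
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next candidate
    raise last_error


# Demo: write a small file in cp1252 so that plain utf-8 decoding fails.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w", encoding="cp1252") as f:
    f.write("name,city\nJosé,Málaga\n")

df, used_encoding = read_csv_with_fallback(demo_path)
print(used_encoding)
```

Once you know which encoding works, you can pass it directly to pd.read_csv in your notebook.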

I played around with the Steam API and grabbed some information about my playtimes on there:

Because the playtime table only contained appids, I merged it with the table for appnames.
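A merge like that could look roughly like this in pandas. The column names (`appid`, `name`, `playtime_forever`) and the sample rows are assumptions for illustration, not the actual data pulled from the Steam API.

```python
import pandas as pd

# Hypothetical playtime table: only app ids, no names.
playtime_df = pd.DataFrame({
    "appid": [10, 20, 30],
    "playtime_forever": [120, 45, 300],
})

# Hypothetical lookup table mapping app ids to app names.
appnames_df = pd.DataFrame({
    "appid": [10, 20, 30, 40],
    "name": ["Game A", "Game B", "Game C", "Game D"],
})

# A left join keeps every playtime row and attaches the matching name.
merged_df = playtime_df.merge(appnames_df, on="appid", how="left")
print(merged_df)
```

A left join is used here so that no playtime rows are lost if an app id is missing from the names table; such rows would simply get a missing name.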

Check out the comparison I’ve made between Windows and Linux hours played! Turns out that if I played a game both on Linux and Windows, I usually used Linux more than Windows!
Obviously I’m hardly scratching the surface with this small exercise. For example, I did not download any pricing information to add to my dataset, which could be done in the future!

With your suggestion, I now get the following traceback

File “”, line 2
ple_2015=pd.read_csv(‘ple.csv’,engine=‘python3’)
^
SyntaxError: invalid character in identifier

With your suggestion, I get the following traceback

File “”, line 2
ple_2015=pd.read_csv(‘ple.csv’,encoding=“utf-8”)
^
SyntaxError: invalid character in identifier

For some reason the quotation marks in the code are not the standard ASCII quotation marks. Python is interpreting those as invalid characters.

I’m not sure which keyboard or language you are using, but you need to make sure the quotation marks are the following:

single quotation mark: '
double quotation mark: "
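If retyping the quotes is tedious, pasted code can also be cleaned up programmatically. This is a small sketch with a hypothetical helper name that maps the common typographic quotes to their ASCII equivalents. It replaces every curly quote, including any that were meant to appear inside a string, so check the result before running it.

```python
# Map typographic ("curly") quotes to the plain ASCII quotes Python expects.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def straighten_quotes(code: str) -> str:
    """Return the code with curly quotes replaced by ASCII quotes."""
    return code.translate(QUOTE_MAP)


# The line from the traceback above, written with escapes for the curly quotes.
pasted = "ple_2015=pd.read_csv(\u2018ple.csv\u2019,encoding=\u201cutf-8\u201d)"
print(straighten_quotes(pasted))
```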

I think this happens when someone is being lazy and copies directly from the forum.

The forum must use a different encoding for the quotes or something :stuck_out_tongue:

I’m having problems with downloading data from this link https://www.kaggle.com/lava18/google-play-store-apps. I want to download both googleplaystore.csv and googleplaystore_user_reviews.csv, but I don’t know how to download data from Kaggle. Is there any way to import the files from within a notebook?

Given a column with values such as 10,000,100 or 1000+, how do I convert these into integers and put them back into the same column?
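One possible approach in pandas, as a sketch: strip the commas and the plus sign, then convert to integers. The column name `Installs` and the sample values are assumptions, and this assumes every value consists only of digits apart from those two characters.

```python
import pandas as pd

# Hypothetical column holding numbers stored as formatted strings.
df = pd.DataFrame({"Installs": ["10,000,100", "1000+", "500"]})

# Remove commas and '+' signs, convert the cleaned strings to integers,
# and write the result back into the same column.
df["Installs"] = (
    df["Installs"]
    .str.replace(r"[,+]", "", regex=True)
    .astype(int)
)
print(df["Installs"].tolist())
```

If the column also contains non-numeric entries, `astype(int)` will raise an error, so those rows would need to be handled first.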

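You can use the io library.
import io
import pandas as pd

and replace the file-reading code with
with io.open(file, 'r', encoding="utf-8") as raw_data:  # file is the path to your CSV
    df = pd.read_csv(raw_data)

That should work, provided the file is actually UTF-8 encoded.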

Can we use a dataset that is one or two years old, say from 2018 or 2019? Or should we use the latest dataset?

As we learnt how to drop columns, is it possible to drop rows in a data frame? If yes, how?

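This is how to drop the row with index label 1 in name_df. Note that drop returns a new DataFrame rather than changing name_df in place, so assign the result:

name_df = name_df.drop(1, axis=0)

Multiple rows

rows_list = [1, 2, 3, 4, 5]
name_df = name_df.drop(rows_list, axis=0)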

Hi, we can pick a dataset from the “Some interesting datasets” list, right?
I have picked this dataset: Google Play Store Android Apps Data. My course project can be found here. Please let me know if I have to use a different dataset. Thanks!

Hello - my binder keeps failing to run:

  1. I created a new notebook myself and after trying to run… I keep getting this error:
    “Sorry, https%3A%2F%2Fjovian.ai%2Fapi%2Fgit%2F3998a84f4ea04268ab733aa72d6d82a9_1.git/0b0acc728b73737bd4c2bc4bc6af256e55761597 has been temporarily disabled from launching. Please contact admins for more info!”

  2. I duplicated Aakash’s notebook, and when I try to run it I get the error message below:
    Sorry, https%3A%2F%2Fjovian.ai%2Fapi%2Fgit%2F0ec1e098693a4e0ba763871445f52b12_1.git/c2f11e6e23ca0843aae3a851fd483fbbf71342d2 has been temporarily disabled from launching. Please contact admins for more info!

Can someone help? Thanks!