Make submissions here: https://jovian.ml/learn/data-analysis-with-python-zero-to-pandas/assignment/course-project
This is the starter notebook for the course project for Data Analysis with Python: Zero to Pandas. For the course project, you will pick a real-world dataset of your choice and apply the concepts learned in this course to perform exploratory data analysis. Use this starter notebook as an outline for your project (you can also start with an empty new notebook). Focus on documentation and presentation: this Jupyter notebook will also serve as a project report, so include detailed explanations wherever possible using Markdown cells.
Find and download an interesting real-world dataset (see the Recommended Datasets section below for ideas).
The dataset should contain tabular data (rows & columns), preferably in CSV/JSON/XLS or another format that can be read using Pandas. If it's not in a compatible format, you may have to write some code to convert it to a suitable format.
The dataset should contain at least 3 columns and 150 rows of data. You can also combine data from multiple sources to create a large enough dataset.
Upload your notebook to your Jovian.ml profile using jovian.commit.
Make a submission here: https://jovian.ml/learn/data-analysis-with-python-zero-to-pandas/assignment/course-project
Share your work on the forum: https://jovian.ml/forum/t/course-project-on-exploratory-data-analysis-discuss-and-share-your-work/11684
Browse through projects shared by other participants and give feedback
Use the following resources for finding interesting datasets:
Refer to these projects for inspiration:
Analyzing your browser history using Pandas & Seaborn by Kartik Godawat
WhatsApp Chat Data Analysis by Prajwal Prashanth
Understanding the Gender Divide in Data Science Roles by Aakanksha N S
Your submission will be evaluated using the following criteria:
NOTE: Remove this cell containing the instructions before making your submission. You can do this using the "Edit > Delete Cells" menu option.
Write an introduction to your project here: describe the dataset, where you got it from, what you're trying to do with it, and which tools & techniques you're using. You can also mention the course and what you've learned from it.
As a first step, let's upload our Jupyter notebook to Jovian.ml.
project_name = "usage-of-programming-languages"
!pip install jovian --upgrade -q
!pip install pandas
Requirement already satisfied: pandas in /srv/conda/envs/notebook/lib/python3.7/site-packages (1.1.3)
Requirement already satisfied: python-dateutil>=2.7.3 in /srv/conda/envs/notebook/lib/python3.7/site-packages (from pandas) (2.8.0)
Requirement already satisfied: numpy>=1.15.4 in /srv/conda/envs/notebook/lib/python3.7/site-packages (from pandas) (1.19.2)
Requirement already satisfied: pytz>=2017.2 in /srv/conda/envs/notebook/lib/python3.7/site-packages (from pandas) (2020.1)
Requirement already satisfied: six>=1.5 in /srv/conda/envs/notebook/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas) (1.12.0)
import jovian
jovian.commit(project=project_name)
[jovian] Attempting to save notebook..
TODO
import pandas as pd
user_lang_df = pd.read_csv('user-languages.csv')
user_lang_df.info()
user_lang_df.describe()
user_lang_df.head()
# Summarize each language column: mean weight and number of users with a positive weight
language_rows = []
for col in user_lang_df.columns:
    if col != 'user_id':
        current_col = user_lang_df.loc[user_lang_df[col] > 0.0, col]
        language_rows.append({'language': col, 'mean': current_col.mean(), 'count': current_col.count()})
# Build the frame once; DataFrame.append was removed in pandas 2.0
language_df = pd.DataFrame(language_rows, columns=['language', 'mean', 'count'])
language_df
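The per-column summary above can also be computed without an explicit loop. The following is a minimal sketch on a made-up miniature of the dataset (the `sample_df` frame and its values are assumptions for illustration, not the real `user-languages.csv`): it masks non-positive weights and lets pandas compute the mean and count for all columns at once.

```python
import pandas as pd

# Hypothetical miniature of user-languages.csv: one row per user,
# one column per language, values are usage weights (0.0 = unused).
sample_df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "python": [0.5, 0.0, 0.25],
    "java": [0.0, 0.0, 1.0],
})

usage = sample_df.drop(columns="user_id")
positive = usage.where(usage > 0.0)  # non-positive weights become NaN

summary_df = pd.DataFrame({
    "language": positive.columns,
    "mean": positive.mean().to_numpy(),    # NaN values are skipped by default
    "count": positive.count().to_numpy(),  # counts non-NaN cells only
})
print(summary_df)
```

On a wide frame this avoids one filtering pass per column, at the cost of building a masked copy of the data.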
valid_languages_df = pd.read_csv('languages.csv')
valid_languages_df
valid_languages_df['name_lower'] = valid_languages_df['name'].str.lower()
valid_existing_languages_in_projects_df = language_df.sort_values(by=['mean'], ascending=False).merge(valid_languages_df, left_on="language", right_on="name_lower", how="inner")
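# Take an explicit copy so the new column below is not written to a slice of the merged frame
top_valid_existing_languages_in_projects_df = valid_existing_languages_in_projects_df.head(100)[['count', 'language', 'mean', 'name']].copy()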
top_valid_existing_languages_in_projects_df['mean * count'] = top_valid_existing_languages_in_projects_df['count'] * top_valid_existing_languages_in_projects_df['mean']
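The inner merge above silently drops every column name that has no match in the list of valid languages. As a rough sketch of how to inspect what is dropped, the example below runs a left merge with `indicator=True` on two small made-up frames (`stats_df` and `names_df` are hypothetical stand-ins for `language_df` and `valid_languages_df`).

```python
import pandas as pd

# Hypothetical stand-ins for language_df and valid_languages_df.
stats_df = pd.DataFrame({"language": ["python", "django", "c++"],
                         "count": [3, 2, 1]})
names_df = pd.DataFrame({"name": ["Python", "C++", "Haskell"]})
names_df["name_lower"] = names_df["name"].str.lower()

# A left merge keeps every row of stats_df; the _merge column records
# whether a matching lower-cased name was found.
checked_df = stats_df.merge(
    names_df, left_on="language", right_on="name_lower",
    how="left", indicator=True,
)
matched = checked_df[checked_df["_merge"] == "both"]
unmatched = checked_df[checked_df["_merge"] == "left_only"]
print(matched[["language", "name"]])
print(unmatched["language"].tolist())
```

Rows marked `both` are the ones an inner merge would keep; rows marked `left_only` are names such as frameworks or tools that are not in the reference list.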
import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.barplot(x="name", y="mean * count", data=top_valid_existing_languages_in_projects_df.head(10))
# fig.show()
How many unique users?
user_lang_df['user_id'].nunique()
Shape?
user_lang_df.shape
The number of unique user IDs equals the number of rows, so each user appears exactly once in the dataset.
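The same uniqueness check can be made explicit in code rather than by comparing two outputs by eye. This is a small sketch on a made-up frame (`sample_df` is an assumption for illustration); it uses `Series.is_unique` and `Series.duplicated`, which would also list offending rows if any ID were repeated.

```python
import pandas as pd

# Hypothetical miniature of the user table.
sample_df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "python": [0.5, 0.0, 0.25],
})

n_unique = sample_df["user_id"].nunique()  # number of distinct IDs
n_rows = sample_df.shape[0]                # number of rows
all_unique = sample_df["user_id"].is_unique

# Rows whose user_id occurs more than once (empty when all IDs are unique).
duplicates = sample_df[sample_df["user_id"].duplicated(keep=False)]
print(n_unique, n_rows, all_unique, len(duplicates))
```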
existing_languages_df = valid_existing_languages_in_projects_df.language
technology_set = set()
technology_dict = dict()
# For every user, collect the set of languages with a positive weight and count each distinct set
for i, row in user_lang_df.iterrows():
    current_set_of_languages = frozenset(
        language for language in existing_languages_df if row[language] > 0.0
    )
    technology_set.add(current_set_of_languages)
    technology_dict[current_set_of_languages] = technology_dict.get(current_set_of_languages, 0) + 1
technology_dict
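Counting how often each combination of languages occurs is a natural fit for `collections.Counter`. The sketch below shows the idea on made-up data (`sample_df` and the `languages` list are assumptions): each row is turned into a `frozenset` of the languages with a positive weight, and the counter tallies identical sets.

```python
from collections import Counter

import pandas as pd

# Hypothetical miniature of the user table.
sample_df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "python": [0.5, 0.2, 0.0, 0.9],
    "java": [0.5, 0.8, 1.0, 0.0],
})
languages = ["python", "java"]

# Boolean frame: True where the user has a positive weight for the language.
used = sample_df[languages] > 0.0

# One frozenset per user; frozensets are hashable, so Counter can tally them.
combo_counts = Counter(
    frozenset(lang for lang, flag in zip(languages, row) if flag)
    for row in used.itertuples(index=False, name=None)
)
print(combo_counts.most_common())
```

Iterating with `itertuples` over a boolean frame is usually faster than `iterrows` because it avoids building a Series for every row.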
# Build the frame in one step; sorting each combo keeps its label stable across runs
technology_combo_freq_df = pd.DataFrame(
    [{'language_combo': ",".join(sorted(combo)), 'frequency': frequency}
     for combo, frequency in technology_dict.items()],
    columns=['language_combo', 'frequency'],
)
technology_combo_freq_df
technology_combo_freq_df = technology_combo_freq_df.sort_values(by = ['frequency'], ascending = False).head(100)
technology_combo_freq_df = technology_combo_freq_df.reset_index(drop=True)  # renumber rows without keeping the old index as a column
technology_combo_freq_df
Top 10 language combos
import seaborn as sns
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(20, 20))  # create the figure before plotting so the bars are drawn on it
ax = sns.barplot(x="language_combo", y="frequency", data=technology_combo_freq_df.head(10), palette="Blues_d")
new_labels = technology_combo_freq_df.head(10).language_combo.tolist()
plt.legend(title='language combo vs frequency', loc='upper right', labels=new_labels)
plt.show()
# Filter first, then sort, so the boolean mask lines up with the frame it indexes
invalid_existing_languages_in_projects_df = language_df[~language_df.language.isin(valid_languages_df['name_lower'])].sort_values(by=['count', 'mean'], ascending=False)
invalid_existing_languages_in_projects_df.head(100)
import seaborn as sns
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(20, 20))  # create the figure before plotting
ax = sns.scatterplot(data=invalid_existing_languages_in_projects_df.head(100), x="count", y="mean", hue="language")
# Reuse the hue entries seaborn already created and place the legend just outside the axes
ax.legend(title='most unpopular languages: count vs mean', loc='upper left', bbox_to_anchor=(1.02, 1.0), ncol=4)
plt.show()
The data below comes from Wikipedia (https://en.wikipedia.org/wiki/Programming_languages_used_in_most_popular_websites). The table was converted to CSV with https://www.convertcsv.com/html-table-to-csv.htm
most_popular_website_languages_df = pd.read_csv("most-popular-website-languages.csv")
most_popular_website_languages_df
Only the front-end and back-end columns matter for this analysis. Let's first rename them to remove the \r and \n characters from the column names.
most_popular_website_languages_df = most_popular_website_languages_df.rename(columns={"Front-end\r\n(Client-side)": "Front-end", "Back-end\r\n(Server-side)": "Back-end"})
most_popular_website_languages_df
We need only the 3rd and 4th columns, so we reduce the dataframe to those two.
most_popular_website_languages_df = most_popular_website_languages_df[["Front-end", "Back-end"]]
most_popular_website_languages_df
The cells contain unwanted characters: spaces, text in square brackets, and text in parentheses. We remove these before splitting the values.
most_popular_website_languages_df = most_popular_website_languages_df.replace(regex=r' ', value='')  # drop spaces
# Non-greedy patterns remove each bracket pair separately instead of everything between the first "[" and the last "]"
most_popular_website_languages_df = most_popular_website_languages_df.replace(regex=r'\[.*?\]', value='')
most_popular_website_languages_df = most_popular_website_languages_df.replace(regex=r'\(.*?\)', value='')
most_popular_website_languages_df
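When stripping bracketed or parenthesized text with a regular expression, it is worth checking whether the quantifier is greedy. The sketch below uses made-up cell values (the `sample_df` contents are assumptions, not rows from the Wikipedia table) to compare a non-greedy pattern with a greedy one on a cell that contains more than one bracket pair.

```python
import pandas as pd

# Hypothetical cells in the style of the Wikipedia table.
sample_df = pd.DataFrame({
    "Back-end": ["PHP[1], Hack (HHVM), Python[2]", "Java, Scala[3]"],
})

# Non-greedy: each [...] and (...) group is removed on its own.
cleaned_df = (
    sample_df
    .replace(regex=r"\s+", value="")
    .replace(regex=r"\[.*?\]", value="")
    .replace(regex=r"\(.*?\)", value="")
)

# Greedy: one match spans from the first "[" to the last "]" in the cell.
greedy_df = (
    sample_df
    .replace(regex=r"\s+", value="")
    .replace(regex=r"\[.*\]", value="")
)

print(cleaned_df["Back-end"].tolist())
print(greedy_df["Back-end"].tolist())
```

With the greedy pattern, the first cell loses every language after the first bracket, which would lower the counts computed later.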
big_tech_language_freq_dict = dict()
big_tech_language_list = list()
# Collect every front-end and back-end language, one list entry per mention
for i, row in most_popular_website_languages_df.iterrows():
    big_tech_language_list += row["Front-end"].split(",")
    big_tech_language_list += row["Back-end"].split(",")
big_tech_language_list
# Count how many times each language is mentioned
for language in big_tech_language_list:
    big_tech_language_freq_dict[language] = big_tech_language_freq_dict.get(language, 0) + 1
big_tech_language_freq_dict
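The split-and-count steps above can be condensed with `collections.Counter` from the standard library. This is a minimal sketch on made-up rows (`sample_rows` is an assumption standing in for the cleaned front-end and back-end cells); it also skips empty strings, which can appear when a cell has a stray comma.

```python
from collections import Counter

# Hypothetical cleaned (front-end, back-end) cells, one pair per website.
sample_rows = [
    ("JavaScript", "Python,Java"),
    ("JavaScript", "PHP,Python"),
    ("JavaScript,TypeScript", "Java"),
]

language_counts = Counter(
    language
    for front_end, back_end in sample_rows
    for cell in (front_end, back_end)
    for language in cell.split(",")
    if language  # skip empty strings left by stray commas
)
print(language_counts.most_common(3))
```

A `Counter` is a `dict` subclass, so `.keys()` and `.values()` work the same way as on a plain frequency dictionary.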
import jovian
# jovian.commit()
jovian.commit(project=project_name)
TODO
!pip install matplotlib seaborn numpy
from matplotlib import pyplot as plt
import numpy as np
%matplotlib inline
# Materialize labels and values as lists so they keep a fixed, matching order
big_tech_languages = list(big_tech_language_freq_dict.keys())
big_tech_languages_frequencies = list(big_tech_language_freq_dict.values())
# Creating plot
fig = plt.figure(figsize =(12, 12))
plt.pie(big_tech_languages_frequencies, labels = big_tech_languages)
# show plot
plt.show()
# Look up each big-tech language in language_df, keeping the original capitalization as the label
github_usage_with_big_tech_languages_df = pd.concat(
    [language_df[language_df['language'] == language.lower()].assign(language=language)
     for language in big_tech_languages],
    ignore_index=True,
)
github_usage_with_big_tech_languages_df
github_usage_with_big_tech_languages = github_usage_with_big_tech_languages_df.language
github_usage_with_big_tech_languages_frequencies = github_usage_with_big_tech_languages_df['count'].astype(int)
# Creating plot
fig = plt.figure(figsize =(12, 12))
plt.pie(github_usage_with_big_tech_languages_frequencies, labels = github_usage_with_big_tech_languages)
# show plot
plt.show()
import jovian
jovian.commit()
TODO
import jovian
jovian.commit()
TODO
import jovian
jovian.commit()
TODO
import jovian
jovian.commit()