Make submissions here: https://jovian.ml/learn/data-analysis-with-python-zero-to-pandas/assignment/course-project
This is the starter notebook for the course project for Data Analysis with Python: Zero to Pandas. For the course project, you will pick a real-world dataset of your choice and apply the concepts learned in this course to perform exploratory data analysis. Use this starter notebook as an outline for your project (you can also start with an empty new notebook). Focus on documentation and presentation: this Jupyter notebook will also serve as a project report, so make sure to include detailed explanations wherever possible using Markdown cells.
Find and download an interesting real-world dataset (see the Recommended Datasets section below for ideas).
The dataset should contain tabular data (rows & columns), preferably in CSV/JSON/XLS or another format that can be read using Pandas. If it's not in a compatible format, you may have to write some code to convert it to the desired format.
The dataset should contain at least 3 columns and 150 rows of data. You can also combine data from multiple sources to create a large enough dataset.
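The size requirement and the idea of combining sources can be checked with a few lines of Pandas. This is a minimal sketch: the helper `meets_size_requirement` and the two small tables are made up for illustration and are not part of the course material.

```python
import pandas as pd

MIN_ROWS = 150     # minimum number of rows stated in the requirement above
MIN_COLUMNS = 3    # minimum number of columns stated in the requirement above


def meets_size_requirement(df: pd.DataFrame) -> bool:
    """Return True if df has at least MIN_ROWS rows and MIN_COLUMNS columns."""
    n_rows, n_columns = df.shape
    return n_rows >= MIN_ROWS and n_columns >= MIN_COLUMNS


# Two made-up tables standing in for data from different sources.
part_a = pd.DataFrame({"id": range(100), "value": range(100), "source": "a"})
part_b = pd.DataFrame({"id": range(100, 200), "value": range(100), "source": "b"})

# Stack the tables row-wise; ignore_index=True builds a fresh 0..n-1 index.
combined = pd.concat([part_a, part_b], ignore_index=True)

print(meets_size_requirement(part_a))    # False: only 100 rows
print(meets_size_requirement(combined))  # True: 200 rows, 3 columns
```

Row-wise concatenation only makes sense when the sources share the same columns; otherwise a merge on a common key is usually the better fit.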
Upload your notebook to your Jovian.ml profile using jovian.commit.
Make a submission here: https://jovian.ml/learn/data-analysis-with-python-zero-to-pandas/assignment/course-project
Share your work on the forum: https://jovian.ml/forum/t/course-project-on-exploratory-data-analysis-discuss-and-share-your-work/11684
Browse through projects shared by other participants and give feedback
Use the following resources for finding interesting datasets:
Refer to these projects for inspiration:
Analyzing your browser history using Pandas & Seaborn by Kartik Godawat
WhatsApp Chat Data Analysis by Prajwal Prashanth
Understanding the Gender Divide in Data Science Roles by Aakanksha N S
Your submission will be evaluated using the following criteria:
NOTE: Remove this cell containing the instructions before making your submission. You can do so using the "Edit > Delete Cells" menu option.
IPL Season 13 finally started on Sept. 19, 2020, and the cricket mood is on. While watching the first match, I had the idea of analyzing an IPL dataset, and luckily I found one on Kaggle that contains data on the matches held between 2008 and 2019. This notebook analyzes that dataset. I hope you like my work.
As a first step, let's upload our Jupyter notebook to Jovian.ml.
project_name = "ipl-data-analysis"
!pip install jovian --upgrade -q
import jovian
jovian.commit(project=project_name)
[jovian] Attempting to save notebook..
[jovian] Creating a new project "ashutoshkrris/ipl-data-analysis"
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Error: Failed to read Anaconda environment using command: "conda env export -n base --no-builds"
[jovian] Committed successfully! https://jovian.ml/ashutoshkrris/ipl-data-analysis
Let us first import all the libraries which we'll be using in the entire project.
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
Let's first load our dataset and take a look at it to get an overview. We will also discard a few columns that won't help us in our data visualization.
ipl_df = pd.read_csv('dataset/matches.csv')
ipl_df.head(5)
Let us describe the dataset. Each row holds the details of one match. The columns include the Season, the city and venue where the match was held, the date on which it was played, the two teams that played it, and information about the toss, the winner and the umpires.
ipl_df.shape
(756, 18)
So, we have 756 rows and 18 columns in total.
ipl_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 756 entries, 0 to 755
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 756 non-null int64
1 Season 756 non-null object
2 city 749 non-null object
3 date 756 non-null object
4 team1 756 non-null object
5 team2 756 non-null object
6 toss_winner 756 non-null object
7 toss_decision 756 non-null object
8 result 756 non-null object
9 dl_applied 756 non-null int64
10 winner 752 non-null object
11 win_by_runs 756 non-null int64
12 win_by_wickets 756 non-null int64
13 player_of_match 752 non-null object
14 venue 756 non-null object
15 umpire1 754 non-null object
16 umpire2 754 non-null object
17 umpire3 119 non-null object
dtypes: int64(4), object(14)
memory usage: 106.4+ KB
The umpire3 column has only 119 non-null values out of 756, so we can discard it without any issue. We will also discard the umpire1 and umpire2 columns, since they won't be useful in our data analysis.
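The same conclusion can be reached numerically instead of reading it off the info() output. The sketch below uses a tiny made-up DataFrame rather than the real matches.csv, and the 50% threshold is an arbitrary assumption chosen for illustration.

```python
import pandas as pd

# Made-up sample with the same kind of gaps as the umpire columns.
sample_df = pd.DataFrame({
    "winner": ["A", "B", None, "A"],
    "umpire1": ["X", "Y", "Z", "X"],
    "umpire3": [None, None, None, "W"],
})

missing_counts = sample_df.isna().sum()   # number of missing values per column
missing_share = sample_df.isna().mean()   # fraction of rows missing per column

# Columns where more than half of the values are missing
# (the 0.5 threshold is an assumption, not a rule from the notebook).
mostly_empty = missing_share[missing_share > 0.5].index.tolist()

print(missing_counts)
print(mostly_empty)
```

On the real data, the same two calls on ipl_df would show how sparse each column is before deciding what to drop.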
discard_columns = ['umpire1','umpire2','umpire3']
ipl_df = ipl_df.drop(discard_columns, axis=1)
ipl_df.head()
Earlier, the dataset had three columns called umpire1, umpire2 and umpire3. We do not need them in our analysis: umpire3 is NaN in most rows, and umpire1 and umpire2 won't be useful in our analysis. So we have discarded them, and our dataset now contains 15 columns.
ipl_df.team1.value_counts()
Mumbai Indians 101
Kings XI Punjab 91
Chennai Super Kings 89
Royal Challengers Bangalore 85
Kolkata Knight Riders 83
Delhi Daredevils 72
Rajasthan Royals 67
Sunrisers Hyderabad 63
Deccan Chargers 43
Pune Warriors 20
Rising Pune Supergiants 15
Gujarat Lions 14
Kochi Tuskers Kerala 7
Delhi Capitals 6
Name: team1, dtype: int64
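These are all the teams that have played in the 12 seasons of IPL covered by the dataset. A few of them, such as Delhi Capitals, Gujarat Lions and Kochi Tuskers Kerala, appear under these names in only one or two seasons, which is why their numbers are so low. Note that these counts cover only the matches in which a team is listed as team1, not every match it played.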
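Since team1 holds only one side of each fixture, a fuller count of matches per team needs both team columns. The sketch below uses a made-up three-row table, not the real data. The `renamed` mapping rests on the assumption that Delhi Daredevils and Delhi Capitals should be treated as one franchise under two names.

```python
import pandas as pd

# Made-up fixtures; the real notebook would use ipl_df instead.
matches = pd.DataFrame({
    "team1": ["Mumbai Indians", "Chennai Super Kings", "Mumbai Indians"],
    "team2": ["Chennai Super Kings", "Delhi Daredevils", "Delhi Capitals"],
})

# Assumed mapping: count the renamed franchise under a single name.
renamed = {"Delhi Daredevils": "Delhi Capitals"}

# Stack both team columns into one Series, unify names, then count.
teams = pd.concat([matches["team1"], matches["team2"]]).replace(renamed)
matches_per_team = teams.value_counts()

print(matches_per_team)
```

Every match contributes two entries to the stacked Series, so the counts always add up to twice the number of rows.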
ipl_df.result.value_counts()
normal 743
tie 9
no result 4
Name: result, dtype: int64
The result column in the dataset specifies whether the match ended normally, ended in a tie between the teams, or had no result, for example because it was cancelled due to rain or another unavoidable reason.
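To see how rare ties and abandoned matches are, the counts can be turned into percentages. This small sketch rebuilds the counts from the output above as a Series, so it runs without the CSV file.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
result_counts = pd.Series({"normal": 743, "tie": 9, "no result": 4})

# Share of each outcome as a percentage of all matches, rounded to 2 decimals.
result_share = (result_counts / result_counts.sum() * 100).round(2)

print(result_share)
```

On the real DataFrame, ipl_df.result.value_counts(normalize=True) should give the same shares as fractions in one call.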
import jovian
jovian.commit()
[jovian] Attempting to save notebook..
Now that our dataset is good to go, we can analyze it using plots, pie charts and graphs.
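As a starting point for that kind of analysis, here is a hedged sketch of a bar chart of matches per season. The rows and the season labels are made up for illustration, and the real Season values in matches.csv may be formatted differently.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows with placeholder season labels; the real notebook would use ipl_df.
sample = pd.DataFrame(
    {"Season": ["2017", "2017", "2018", "2019", "2019", "2019"]}
)

# Count matches per season and sort by season label so the bars appear in order.
matches_per_season = sample["Season"].value_counts().sort_index()

fig, ax = plt.subplots(figsize=(9, 5))
ax.bar(matches_per_season.index, matches_per_season.values)
ax.set_xlabel("Season")
ax.set_ylabel("Number of matches")
ax.set_title("Matches per season (sample data)")
```

The same pattern (value_counts followed by a bar chart) carries over to other categorical columns such as toss_decision or winner.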
import jovian
jovian.commit()