Make submissions here: https://jovian.ml/learn/data-analysis-with-python-zero-to-pandas/assignment/course-project
This is the starter notebook for the course project for Data Analysis with Python: Zero to Pandas. For the course project, you will pick a real-world dataset of your choice and apply the concepts learned in this course to perform exploratory data analysis. Use this starter notebook as an outline for your project (you can also start with an empty new notebook). Focus on documentation and presentation: this Jupyter notebook will also serve as a project report, so make sure to include detailed explanations wherever possible using Markdown cells.
Find and download an interesting real-world dataset (see the Recommended Datasets section below for ideas).
The dataset should contain tabular data (rows & columns), preferably in CSV/JSON/XLS or other formats that can be read using Pandas. If it's not in a compatible format, you may have to write some code to convert it to a desired format.
The dataset should contain at least 3 columns and 150 rows of data. You can also combine data from multiple sources to create a large enough dataset.
Upload your notebook to your Jovian.ml profile using
Make a submission here: https://jovian.ml/learn/data-analysis-with-python-zero-to-pandas/assignment/course-project
Share your work on the forum: https://jovian.ml/forum/t/course-project-on-exploratory-data-analysis-discuss-and-share-your-work/11684
Browse through projects shared by other participants and give feedback
Use the following resources for finding interesting datasets:
Refer to these projects for inspiration:
Analyzing your browser history using Pandas & Seaborn by Kartik Godawat
WhatsApp Chat Data Analysis by Prajwal Prashanth
Understanding the Gender Divide in Data Science Roles by Aakanksha N S
Your submission will be evaluated using the following criteria:
NOTE: Remove this cell containing the instructions before making your submission. You can do this using the "Edit > Delete Cells" menu option.
IPL Season 13 finally started on Sept. 19, 2020, and the cricket mood is on. While watching the first match, I had the idea of analyzing an IPL dataset, and luckily I found one on Kaggle that contains data on the matches held between 2008 and 2019. That is the dataset I will analyze here. I hope you like my work.
As a first step, let's upload our Jupyter notebook to Jovian.ml.
project_name = "ipl-data-analysis"
!pip install jovian --upgrade -q
[jovian] Attempting to save notebook..
[jovian] Creating a new project "ashutoshkrris/ipl-data-analysis"
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Error: Failed to read Anaconda environment using command: "conda env export -n base --no-builds"
Let us first import all the libraries which we'll be using in the entire project.
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
Let's first load our dataset and take a look at it to get an overview of what it contains. We will also discard a few columns that won't help us in our data visualization.
ipl_df = pd.read_csv('dataset/matches.csv')
ipl_df.head(5)
Let us explain the dataset. Each row describes one match: the Season, the City and Venue in which the match was held, the Date on which it was held, the two teams that played, and information about the toss, the winner and the umpires.
So, we have 756 rows and 18 columns in total.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 756 entries, 0 to 755
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   id               756 non-null    int64
 1   Season           756 non-null    object
 2   city             749 non-null    object
 3   date             756 non-null    object
 4   team1            756 non-null    object
 5   team2            756 non-null    object
 6   toss_winner      756 non-null    object
 7   toss_decision    756 non-null    object
 8   result           756 non-null    object
 9   dl_applied       756 non-null    int64
 10  winner           752 non-null    object
 11  win_by_runs      756 non-null    int64
 12  win_by_wickets   756 non-null    int64
 13  player_of_match  752 non-null    object
 14  venue            756 non-null    object
 15  umpire1          754 non-null    object
 16  umpire2          754 non-null    object
 17  umpire3          119 non-null    object
dtypes: int64(4), object(14)
memory usage: 106.4+ KB
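A summary like the one above is the kind of output `DataFrame.info()` prints, and the row and column totals can be read from `DataFrame.shape`. The sketch below shows both on a tiny made-up frame; `toy_df` and its values are illustrative assumptions, not rows from matches.csv.

```python
import pandas as pd

# Toy stand-in for the matches data (values are made up for illustration).
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "city": ["Mumbai", None, "Delhi", "Chennai"],
    "winner": ["Mumbai Indians", "Delhi Daredevils", None, "Chennai Super Kings"],
    "umpire3": [None, None, None, "Example Umpire"],
})

n_rows, n_cols = toy_df.shape   # (number of rows, number of columns)
non_null = toy_df.count()       # non-null values per column, as reported by info()

toy_df.info()                   # prints the per-column summary
print(f"{n_rows} rows, {n_cols} columns")
```

Comparing `non_null` with `n_rows` is a quick way to spot columns that are mostly empty.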
We see that the umpire3 column has only 119 non-null values out of 756, so we can discard it without any issue. We will also discard the umpire1 and umpire2 columns, since they won't be useful in our data analysis.
discard_columns = ['umpire1','umpire2','umpire3']
ipl_df = ipl_df.drop(discard_columns, axis=1)
Earlier we saw that the dataset had three columns called umpire1, umpire2 and umpire3. We do not need them in our analysis: umpire3 is mostly NaN, and the other two are not relevant here. So we have discarded them, and our dataset now contains 15 columns.
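As a minimal sketch of this step, the share of missing values per column can be checked before dropping anything. The frame below is a made-up example (`toy_df`, `missing_share` and `trimmed_df` are hypothetical names), not the real matches data.

```python
import pandas as pd

# Made-up frame with a mostly empty umpire3 column, mirroring the situation above.
toy_df = pd.DataFrame({
    "winner": ["A", "B", "A", "C"],
    "umpire1": ["U1", "U2", "U1", None],
    "umpire2": ["U3", "U4", "U3", "U4"],
    "umpire3": [None, None, None, "U5"],
})

# Fraction of missing values in each column (0.0 = complete, 1.0 = empty).
missing_share = toy_df.isna().mean()

# drop() returns a new DataFrame; toy_df itself is left unchanged.
discard_columns = ["umpire1", "umpire2", "umpire3"]
trimmed_df = toy_df.drop(columns=discard_columns)

print(missing_share)
print(trimmed_df.columns.tolist())
```

`drop(columns=...)` is equivalent to `drop(..., axis=1)` as used in the notebook; it just names the axis explicitly.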
Mumbai Indians                 101
Kings XI Punjab                 91
Chennai Super Kings             89
Royal Challengers Bangalore     85
Kolkata Knight Riders           83
Delhi Daredevils                72
Rajasthan Royals                67
Sunrisers Hyderabad             63
Deccan Chargers                 43
Pune Warriors                   20
Gujarat Lions                   14
Rising Pune Supergiant           8
Kochi Tuskers Kerala             7
Rising Pune Supergiants          7
Delhi Capitals                   6
Name: team1, dtype: int64
These are all the teams that played in the 12 IPL seasons from 2008 to 2019, together with the number of matches in which each appears in the team1 column. A few of them, such as Delhi Capitals, Gujarat Lions and Kochi Tuskers Kerala, appear under these names in only one or two seasons, which is why their counts are so low.
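Counts like these come from `value_counts()` on a single column, so they cover only the matches where a team is listed as team1. One possible way to count every appearance is to stack team1 and team2 before counting. This is a sketch on made-up fixtures (`toy_df` is a hypothetical stand-in), not the notebook's own code.

```python
import pandas as pd

# Three made-up fixtures; each team appears in either the team1 or team2 column.
toy_df = pd.DataFrame({
    "team1": ["Mumbai Indians", "Mumbai Indians", "Delhi Capitals"],
    "team2": ["Delhi Capitals", "Chennai Super Kings", "Chennai Super Kings"],
})

# Appearances in the team1 column only.
team1_counts = toy_df["team1"].value_counts()

# Appearances in either column: stack both columns, then count.
matches_played = pd.concat([toy_df["team1"], toy_df["team2"]]).value_counts()

print(team1_counts)
print(matches_played)
```

In this toy data Chennai Super Kings never appears as team1, so it is absent from the first count but present in the second.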
normal       743
tie            9
no result      4
Name: result, dtype: int64
The result column in the dataset specifies whether the match ended normally, ended in a tie, or had no result because it was cancelled due to rain or some other unavoidable reason.
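To put these counts in proportion, `value_counts(normalize=True)` returns shares instead of raw counts. The sketch below rebuilds a result column from the three counts reported above rather than loading matches.csv; the `results` Series is an assumption made so the example is self-contained.

```python
import pandas as pd

# Rebuild a result column from the counts shown above (743 / 9 / 4).
results = pd.Series(
    ["normal"] * 743 + ["tie"] * 9 + ["no result"] * 4,
    name="result",
)

result_counts = results.value_counts()                # raw counts per category
result_share = results.value_counts(normalize=True)   # fraction of all matches

print(result_counts)
print(result_share.round(3))
```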