Jovian
⭐️
Sign In

Data Analysis with Python: Zero to Pandas - Course Project Guidelines

(remove this cell before submission)

Make submissions here: https://jovian.ml/learn/data-analysis-with-python-zero-to-pandas/assignment/course-project

This is the starter notebook for the course project for Data Analysis with Python: Zero to Pandas. For the course project, you will pick a real-world dataset of your choice and apply the concepts learned in this course to perform exploratory data analysis. Use this starter notebook as an outline for your project (you can also start with an empty new notebook). Focus on documentation and presentation - this Jupyter notebook will also serve as a project report, so make sure to include detailed explanations whererver possible using Markdown cells.

Step 1: Select a real-world dataset
  • Find and download an interesting real-world dataset (see the Recommended Datasets section below for ideas).

  • The dataset should contain tabular data (rowsn & columns), preferably in CSV/JSON/XLS or other formats that can be read using Pandas. If it's not in a compatible format, you may have to write some code to convert it to a desired format.

  • The dataset should contain at least 3 columns and 150 rows of data. You can also combine data from multiple sources to create a large enough dataset.

Step 2: Perform data preparation & cleaning
  • Load the dataset into a data frame using Pandas
  • Explore the number of rows & columns, ranges of values etc.
  • Handle missing, incorrect and invalid data
  • Perform any additional steps (parsing dates, creating additional columns, merging multiple dataset etc.)
Step 3: Perform exploratory Analysis & Visualization
  • Compute the mean, sum, range and other interesting statistics for numeric columns
  • Explore distributions of numeric columns using histograms etc.
  • Explore relationship between columns using scatter plots, bar charts etc.
  • Make a note of interesting insights from the exploratory analysis
Step 4: Ask & answer questions about the data
  • Ask at least 5 interesting questions about your dataset
  • Answer the questions either by computing the results using Numpy/Pandas or by plotting graphs using Matplotlib/Seaborn
  • Create new columns, merge multiple dataset and perform grouping/aggregation wherever necessary
  • Wherever you're using a library function from Pandas/Numpy/Matplotlib etc. explain briefly what it does
Step 5: Summarize your inferences & write a conclusion
  • Write a summary of what you've learned from the analysis
  • Include interesting insights and graphs from previous sections
  • Share ideas for future work on the same topic using other relevant datasets
  • Share links to resources you found useful during your analysis
Step 6: Make a submission & share your work
(Optional) Step 7: Write a blog post

Recommended Datasets

Use the following resources for finding interesting datasets:

Example Projects

Refer to these projects for inspiration:

Evaluation Criteria

Your submission will be evaluated using the following criteria:

  • Dataset must contain at least 3 columns and 150 rows of data
  • You must ask and answer at least 5 questions about the dataset
  • Your submission must include at least 5 visualizations (graphs)
  • Your submission must include explanations using markdown cells, apart from the code.
  • Your work must not be plagiarized i.e. copy-pasted for somewhere else.

NOTE: Remove this cell containing the instructions before making your submission. You can do using the "Edit > Delete Cells" menu option.

IPL Matches Data Analysis

As finally, this year IPL Season 13 has started on Sept. 19, 2020 , the cricket mood is on. While watching the first match itself, the idea of analyzing IPL dataset struck my mind and luckily I found one dataset on Kaggle which contains the data of matches held between 2008-2019. So, I shall be analyzing that dataset only. Hope you like my work.

As a first step, let's upload our Jupyter notebook to Jovian.ml.

In [1]:
project_name = "ipl-data-analysis"
In [2]:
!pip install jovian --upgrade -q
WARNING: You are using pip version 20.1.1; however, version 20.2.3 is available. You should consider upgrading via the 'd:\data analysis with python (jovian)\zerotopandas\scripts\python.exe -m pip install --upgrade pip' command.
In [3]:
import jovian
In [4]:
jovian.commit(project=project_name)
[jovian] Attempting to save notebook.. [jovian] Creating a new project "ashutoshkrris/ipl-data-analysis" [jovian] Uploading notebook.. [jovian] Capturing environment..
[jovian] Error: Failed to read Anaconda environment using command: "conda env export -n base --no-builds"
[jovian] Committed successfully! https://jovian.ml/ashutoshkrris/ipl-data-analysis

Importing Libraries

Let us first import all the libraries which we'll be using in the entire project.

In [6]:
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Data Preparation and Cleaning

Lets's first load our dataset and take a look on it to have an overview of what our dataset looks like. We will also discard few columns which won't help us in our data visualization.

In [33]:
ipl_df = pd.read_csv('dataset/matches.csv')
ipl_df.head(5)
Out[33]:

Let us explain the dataset. So, basically we have a lot of rows and columns here in the dataset. It includes the Season, City , Venue in which the match was held, the Date on which the match was held, the teams between which the match was played , information related to toss , winner and umpires.

In [34]:
ipl_df.shape
Out[34]:
(756, 18)

So, we have 756 rows and 18 columns in total.

In [35]:
ipl_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 756 entries, 0 to 755 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 756 non-null int64 1 Season 756 non-null object 2 city 749 non-null object 3 date 756 non-null object 4 team1 756 non-null object 5 team2 756 non-null object 6 toss_winner 756 non-null object 7 toss_decision 756 non-null object 8 result 756 non-null object 9 dl_applied 756 non-null int64 10 winner 752 non-null object 11 win_by_runs 756 non-null int64 12 win_by_wickets 756 non-null int64 13 player_of_match 752 non-null object 14 venue 756 non-null object 15 umpire1 754 non-null object 16 umpire2 754 non-null object 17 umpire3 119 non-null object dtypes: int64(4), object(14) memory usage: 106.4+ KB

We see that in the umpire3 column, we have only 119 non-null objects. So we can discard them without any issue. Also, we will discard the umpire1 and umpire2 columns since they won't be useful in our data analysis.

In [36]:
discard_columns = ['umpire1','umpire2','umpire3']
In [37]:
ipl_df = ipl_df.drop(discard_columns, axis=1)
In [38]:
ipl_df.head()
Out[38]:

Earlier we see that, we had three columns called umpire1 , umpire2 and umpire3. But we do not need them in our analysis as many of their rows contained NaN values. So, we have discarded them and our dataset now contains 15 columns.

In [39]:
ipl_df.team1.value_counts()
Out[39]:
Mumbai Indians                 101
Kings XI Punjab                 91
Chennai Super Kings             89
Royal Challengers Bangalore     85
Kolkata Knight Riders           83
Delhi Daredevils                72
Rajasthan Royals                67
Sunrisers Hyderabad             63
Deccan Chargers                 43
Pune Warriors                   20
Rising Pune Supergiants         15
Gujarat Lions                   14
Kochi Tuskers Kerala             7
Delhi Capitals                   6
Name: team1, dtype: int64

We can see that, these are the all teams that have played in the last 12 seasons of IPL. Few of them like Delhi Capitals, Gujarat Lions, Kochi Tuskers Kerala didn't play in more than 1-2 seasons. That's why their numbers are so low.

In [40]:
ipl_df.result.value_counts()
Out[40]:
normal       743
tie            9
no result      4
Name: result, dtype: int64

The result column in the dataset specifies whether the matched ended normally or there was a tie between the teams or the match was cancelled due to rain or some unavoidable reasons.

In [31]:
import jovian
In [ ]:
jovian.commit()
[jovian] Attempting to save notebook..

Exploratory Analysis and Visualization

Now that our dataset is good to go, we can analyze it using plots, pie charts and graphs.

In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
import jovian
In [ ]:
jovian.commit()

Asking and Answering Questions

TODO

In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
import jovian
In [ ]:
jovian.commit()

Inferences and Conclusion

TODO

In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
import jovian
In [ ]:
jovian.commit()

References and Future Work

TODO

In [ ]:
import jovian
In [ ]:
jovian.commit()
In [ ]: