Make submissions here: https://jovian.ml/learn/data-analysis-with-python-zero-to-pandas/assignment/course-project
This is the starter notebook for the course project for Data Analysis with Python: Zero to Pandas. For the course project, you will pick a real-world dataset of your choice and apply the concepts learned in this course to perform exploratory data analysis. Use this starter notebook as an outline for your project (you can also start with an empty new notebook). Focus on documentation and presentation - this Jupyter notebook will also serve as a project report, so make sure to include detailed explanations whererver possible using Markdown cells.
Find and download an interesting real-world dataset (see the Recommended Datasets section below for ideas).
The dataset should contain tabular data (rowsn & columns), preferably in CSV/JSON/XLS or other formats that can be read using Pandas. If it's not in a compatible format, you may have to write some code to convert it to a desired format.
The dataset should contain at least 3 columns and 150 rows of data. You can also combine data from multiple sources to create a large enough dataset.
Upload your notebook to your Jovian.ml profile using
Make a submission here: https://jovian.ml/learn/data-analysis-with-python-zero-to-pandas/assignment/course-project
Share your work on the forum: https://jovian.ml/forum/t/course-project-on-exploratory-data-analysis-discuss-and-share-your-work/11684
Browse through projects shared by other participants and give feedback
Use the following resources for finding interesting datasets:
Refer to these projects for inspiration:
Analyzing your browser history using Pandas & Seaborn by Kartik Godawat
WhatsApp Chat Data Analysis by Prajwal Prashanth
Understanding the Gender Divide in Data Science Roles by Aakanksha N S
Your submission will be evaluated using the following criteria:
NOTE: Remove this cell containing the instructions before making your submission. You can do using the "Edit > Delete Cells" menu option.
As finally, this year IPL Season 13 has started on Sept. 19, 2020 , the cricket mood is on. While watching the first match itself, the idea of analyzing IPL dataset struck my mind and luckily I found one dataset on Kaggle which contains the data of matches held between 2008-2019. So, I shall be analyzing that dataset only. Hope you like my work.
As a first step, let's upload our Jupyter notebook to Jovian.ml.
project_name = "ipl-data-analysis"
!pip install jovian --upgrade -q
WARNING: You are using pip version 20.1.1; however, version 20.2.3 is available. You should consider upgrading via the 'd:\data analysis with python (jovian)\zerotopandas\scripts\python.exe -m pip install --upgrade pip' command.
[jovian] Attempting to save notebook.. [jovian] Creating a new project "ashutoshkrris/ipl-data-analysis" [jovian] Uploading notebook.. [jovian] Capturing environment..
[jovian] Error: Failed to read Anaconda environment using command: "conda env export -n base --no-builds"
Let us first import all the libraries which we'll be using in the entire project.
import numpy as np import pandas as pd import seaborn as sns import matplotlib import matplotlib.pyplot as plt %matplotlib inline sns.set_style('darkgrid') matplotlib.rcParams['font.size'] = 14 matplotlib.rcParams['figure.figsize'] = (9, 5) matplotlib.rcParams['figure.facecolor'] = '#00000000'
Lets's first load our dataset and take a look on it to have an overview of what our dataset looks like. We will also discard few columns which won't help us in our data visualization.
ipl_df = pd.read_csv('dataset/matches.csv') ipl_df.head(5)
Let us explain the dataset. So, basically we have a lot of rows and columns here in the dataset. It includes the Season, City , Venue in which the match was held, the Date on which the match was held, the teams between which the match was played , information related to toss , winner and umpires.
So, we have 756 rows and 18 columns in total.
<class 'pandas.core.frame.DataFrame'> RangeIndex: 756 entries, 0 to 755 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 756 non-null int64 1 Season 756 non-null object 2 city 749 non-null object 3 date 756 non-null object 4 team1 756 non-null object 5 team2 756 non-null object 6 toss_winner 756 non-null object 7 toss_decision 756 non-null object 8 result 756 non-null object 9 dl_applied 756 non-null int64 10 winner 752 non-null object 11 win_by_runs 756 non-null int64 12 win_by_wickets 756 non-null int64 13 player_of_match 752 non-null object 14 venue 756 non-null object 15 umpire1 754 non-null object 16 umpire2 754 non-null object 17 umpire3 119 non-null object dtypes: int64(4), object(14) memory usage: 106.4+ KB
We see that in the umpire3 column, we have only 119 non-null objects. So we can discard them without any issue. Also, we will discard the umpire1 and umpire2 columns since they won't be useful in our data analysis.
We have another dataset that contains data of each ball bowled in the last 12 seasons of IPL. Let's see how it looks like.
deliveries_df = pd.read_csv('dataset/deliveries.csv') deliveries_df.head()
So we have 179078 rows and 21 columns in this dataset. Well, the number of rows is that high as it contains data of each ball.
<class 'pandas.core.frame.DataFrame'> RangeIndex: 179078 entries, 0 to 179077 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 match_id 179078 non-null int64 1 inning 179078 non-null int64 2 batting_team 179078 non-null object 3 bowling_team 179078 non-null object 4 over 179078 non-null int64 5 ball 179078 non-null int64 6 batsman 179078 non-null object 7 non_striker 179078 non-null object 8 bowler 179078 non-null object 9 is_super_over 179078 non-null int64 10 wide_runs 179078 non-null int64 11 bye_runs 179078 non-null int64 12 legbye_runs 179078 non-null int64 13 noball_runs 179078 non-null int64 14 penalty_runs 179078 non-null int64 15 batsman_runs 179078 non-null int64 16 extra_runs 179078 non-null int64 17 total_runs 179078 non-null int64 18 player_dismissed 8834 non-null object 19 dismissal_kind 8834 non-null object 20 fielder 6448 non-null object dtypes: int64(13), object(8) memory usage: 28.7+ MB
# for ipl_df dataframe discard_columns = ['umpire1','umpire2','umpire3']
ipl_df = ipl_df.drop(discard_columns, axis=1)
Earlier we see that, we had three columns called umpire1 , umpire2 and umpire3. But we do not need them in our analysis as many of their rows contained NaN values. So, we have discarded them and our dataset now contains 15 columns.
Mumbai Indians 101 Kings XI Punjab 91 Chennai Super Kings 89 Royal Challengers Bangalore 85 Kolkata Knight Riders 83 Delhi Daredevils 72 Rajasthan Royals 67 Sunrisers Hyderabad 63 Deccan Chargers 43 Pune Warriors 20 Rising Pune Supergiants 15 Gujarat Lions 14 Kochi Tuskers Kerala 7 Delhi Capitals 6 Name: team1, dtype: int64
We can see that, these are the all teams that have played in the last 12 seasons of IPL. Few of them like Delhi Capitals, Gujarat Lions, Kochi Tuskers Kerala didn't play in more than 1-2 seasons. That's why their numbers are so low.
normal 743 tie 9 no result 4 Name: result, dtype: int64
The result column in the dataset specifies whether the matched ended normally or there was a tie between the teams or the match was cancelled due to rain or some unavoidable reasons.
[jovian] Attempting to save notebook.. [jovian] Updating notebook "ashutoshkrris/ipl-data-analysis" on https://jovian.ml/ [jovian] Uploading notebook.. [jovian] Capturing environment..
[jovian] Error: Failed to read Anaconda environment using command: "conda env export -n base --no-builds"
Now that our dataset is good to go, we can analyze it using plots, pie charts and graphs.
Let's load our cleaned dataset again first.
city_host = ipl_df.city.value_counts() city_host
Mumbai 101 Kolkata 77 Delhi 74 Bangalore 66 Hyderabad 64 Chennai 57 Jaipur 47 Chandigarh 46 Pune 38 Durban 15 Bengaluru 14 Visakhapatnam 13 Centurion 12 Ahmedabad 12 Rajkot 10 Mohali 10 Indore 9 Dharamsala 9 Johannesburg 8 Cape Town 7 Cuttack 7 Ranchi 7 Port Elizabeth 7 Abu Dhabi 7 Raipur 6 Sharjah 6 Kochi 5 Kanpur 4 East London 3 Kimberley 3 Nagpur 3 Bloemfontein 2 Name: city, dtype: int64
plt.figure(figsize=(15,6)) plt.xticks(rotation=90) plt.title('Number of matches hosted') count=0 cities=pd.DataFrame(city_host) cities['name']=city_host.index for i in cities['city']: plt.text(count-0.4,i+1,str(i),color='black') count+=1 sns.barplot(city_host.index, city_host);
So, we can see in the barplot that Mumbai has hosted the maximum number of IPL matches followed by Kolkata and Delhi.
batting_ings = deliveries_df.groupby(['match_id','batsman']).apply(lambda x : np.sum(x['batsman_runs'])).reset_index(name='Innings Runs') sorted_batting_ings = batting_ings.sort_values(by='Innings Runs',ascending=False) top_batsmen_scores = sorted_batting_ings[:10] ball_faced = deliveries_df.groupby(['match_id','batsman']).apply(lambda x : x['batsman_runs'].count()).reset_index(name='Balls Faced') batsmen_performance = pd.merge(top_batsmen_scores, ball_faced, how='inner', left_on=['match_id','batsman'], right_on=['match_id','batsman']) batsmen_performance['Strike Rate for Match'] = batsmen_performance['Innings Runs']*100 / batsmen_performance['Balls Faced'] batsmen_innings = pd.merge(batsmen_performance, deliveries_df, how='inner',left_on=['match_id','batsman'],right_on=['match_id','batsman']) batsmen_innings_table = batsmen_innings.iloc[:,1:8] batsmen_innings_table2 = batsmen_innings_table.drop_duplicates() batsmen_innings_table2
x=batsmen_innings_table2['batsman'] y1=batsmen_innings_table2['Innings Runs'] plt.figure(figsize=(12,6)) plt.scatter(x,y1) plt.xlabel('Batsmen',size=15) plt.ylabel('Innings Score',size=15) plt.title('IPL Best batting performances in an Inning') plt.xticks(rotation=60) plt.legend(['Runs']);
We can see that Chris Gayle is the one who has scored the maximum runs(175*) in a match followed by Brendon McCullum and AB de Villiers.
Well, if you are an IPL fan, you must have been aware of this Gayle Storm that hit the Chinnaswamy Stadium on April 23,2013. Gayle scored an unbeaten 175 against Pune Warriors with 17 sixes and 13 fours in just 66 balls and a strike rate of 265.15 . Isn't that something the Universe Boss can only do??
In the above graph, we can see two players, Chris Gayle and AB de Villers, have appeared twice and they both play for Royal Challengers Bangalore. But there's a fun fact, despite having world class players like Gayle, Kohli and ABD , RCB has never won any season of IPL.
bowling_wickets = deliveries_df[deliveries_df['dismissal_kind']!='run out'] bowling_total = bowling_wickets.groupby('bowler').apply(lambda x : x['dismissal_kind'].dropna()).reset_index(name='Wickets') bowling_wicket_count = bowling_total.groupby('bowler').count().reset_index() bowling_top = bowling_wicket_count.sort_values(by='Wickets',ascending=False) top_bowlers = bowling_top.loc[:,['bowler','Wickets']][0:10] top_bowlers
Since, run outs are not counted in the bowlers account, we can discard all those such dismissals.
Let's plot it.
plt.figure(figsize=(12,6)) plt.scatter(top_bowlers['bowler'],top_bowlers['Wickets'],color='r'); plt.plot(top_bowlers['bowler'],top_bowlers['Wickets'],color='g'); plt.xticks(rotation=60) plt.xlabel('Top 10 Bowlers') plt.ylabel('Wickets Taken') plt.title('Top 10 Bowlers in last 12 seasons');
Lasith Malinga, the Sri Lankan fast bowler, has taken the maximum number(170) wickets in the last 12 seasons of the IPL, followed by Amit Mishra and Harbhajan Singh.
matches_played=pd.concat([ipl_df['team1'],ipl_df['team2']]) matches_played=matches_played.value_counts().reset_index() matches_played.columns=['Team','Total Matches'] matches_played['wins']=ipl_df['winner'].value_counts().reset_index()['winner'] matches_played.set_index('Team',inplace=True)
win_percentage = round(matches_played['wins']/matches_played['Total Matches'],3)*100 win_percentage
Team Mumbai Indians 58.3 Royal Challengers Bangalore 55.6 Kolkata Knight Riders 51.7 Kings XI Punjab 47.7 Chennai Super Kings 50.0 Delhi Daredevils 46.6 Rajasthan Royals 45.6 Sunrisers Hyderabad 53.7 Deccan Chargers 38.7 Pune Warriors 32.6 Rising Pune Supergiants 43.3 Gujarat Lions 40.0 Delhi Capitals 62.5 Kochi Tuskers Kerala 42.9 dtype: float64
We can see the Win percentage of each team. When we look at it closely, we find that Delhi Capitals has the best stats. But when we see the number of matches played by Delhi Capitals, we find that it is quite low as compared with the other teams. This is so because earlier Delhi used to play with the name of Delhi Daredevils and then renamed to Delhi Capitals.
We can see the same data in a bar plot more easily, so let's plot it.
plt.figure(figsize=(12,6)) plt.xticks(rotation=90) plt.title('Matches Played vs. Matches Won') sns.barplot(matches_played['Total Matches'].index, win_percentage);
In this graph, we can clearly see that Delhi Capitals has the best stats. But now that we know the reason, we can say that Mumbai Indians have the best winning percentage as compared to other teams.
high_scores = deliveries_df.groupby(['match_id', 'inning','batting_team','bowling_team'])['total_runs'].sum().reset_index() high_scores = high_scores[high_scores['total_runs']>=200] high_scores.nlargest(10,'total_runs')
Royal Challengers Bangalore has scored the maximum runs in one innings against Pune Warriors India, followed by Kolkata Knight Riders against Kings XI Punjab and then again Royal Challengers Bangalore against Gujarat Lions.
Well, the match in which RCB scored 263 runs is the same match win which Chris Gayle scored unbeated 175.
[jovian] Attempting to save notebook..
Now, let us answer some basic questions that one might want to know about the IPL seasons.
winner = ipl_df.winner.value_counts() winner
Mumbai Indians 109 Chennai Super Kings 100 Kolkata Knight Riders 92 Royal Challengers Bangalore 84 Kings XI Punjab 82 Rajasthan Royals 75 Delhi Daredevils 67 Sunrisers Hyderabad 58 Deccan Chargers 29 Rising Pune Supergiants 15 Gujarat Lions 13 Pune Warriors 12 Delhi Capitals 10 Kochi Tuskers Kerala 6 Name: winner, dtype: int64
We can see that Mumbai Indians have won the maximum number of matches in the last 12 seasons of IPL, followed by Chennai Super Kings and Kolkata Knight Riders. So, we may say that Mumbai Indians has been the most successful team in the IPL. We can also see that few teams like Kochi Tuskers Kerala, Delhi Capitals, Pune Warriors, etc. have won very less number of matches. The reason for this is they had played the IPL for just 1 or 2 seasons.
Let's plot the above data using a barplot.
plt.figure(figsize=(12,6)) plt.xticks(rotation=75) plt.title('Which team won the maximum number of matches in all seasons?') sns.barplot(winner.index, winner);
m_o_m = ipl_df.player_of_match.value_counts() m_o_m
CH Gayle 21 AB de Villiers 20 RG Sharma 17 MS Dhoni 17 DA Warner 17 .. Q de Kock 1 DL Vettori 1 M Ur Rahman 1 WPUJC Vaas 1 CRD Fernando 1 Name: player_of_match, Length: 226, dtype: int64
Whoh! The GayleStorm Chris Gayle has become Man of the Match most number of times, followed by Mr. 360 AB de Villiers, our very own Hitman Rohit Sharma, Thalaiva MS Dhoni and Reverend David Warner.
But here's a quick fun fact : Despite having outstanding performance in IPL, Chris Gayle had gone unsold twice in the IPL 2018 auction before being taken by Kings XI Punjab at his base price of ₹2 Crores.
Now, let's again plot a barplot to see the same data more clearly.
m_o_m_10 = m_o_m[:10] plt.figure(figsize=(12,6)) plt.xticks(rotation=75) plt.title('Top 10 players with most number of Man of the Match title') sns.barplot(m_o_m_10.index, m_o_m_10);
win = ipl_df.winner == ipl_df.toss_winner win_counts = win.value_counts() win_counts
True 393 False 363 dtype: int64
Out of 756 matches in the last 12 seasons, we can see that the toss winning team has won the match 393 times but has lost the match 363 times. Well, the difference is not that much.
Let's see the data in percentage first and then plot a pie chart for it.
True 51.98 False 48.02 dtype: float64
plt.figure(figsize=(12,6)) plt.title('Does winning the toss increase the chances of winning the match?') plt.pie(win_counts,labels=win_counts.index,autopct='%1.1f%%', startangle=180);
batsmen = deliveries_df.groupby('batsman').apply(lambda x : np.sum(x['batsman_runs'])).reset_index(name='Runs') batsmen_sorted=batsmen.sort_values(by='Runs',ascending=False) top_batsmen=batsmen_sorted[:5] top_batsmen
plt.figure(figsize=(12,6)) plt.title('Top 5 Batsmen') sns.barplot(top_batsmen['batsman'],top_batsmen['Runs']);
Thus, we understood that why Virat Kohli is called the Run Machine. He has scored 5434 runs in total, followed by Suresh Raina with slightly less 5415 runs and Rohit Sharma with 4914 runs.