As finally, this year IPL Season 13 has started on Sept. 19, 2020 , the cricket mood is on. While watching the first match itself, the idea of analyzing IPL dataset struck my mind and luckily I found one dataset on Kaggle which contains the data of matches held between 2008-2019. So, I shall be analyzing that dataset only. Hope you like my work.
As a first step, let's upload our Jupyter notebook to Jovian.ml.
project_name = "ipl-data-analysis"
!pip install jovian --upgrade -q
WARNING: You are using pip version 20.1.1; however, version 20.2.3 is available.
You should consider upgrading via the 'd:\data analysis with python (jovian)\zerotopandas\scripts\python.exe -m pip install --upgrade pip' command.
import jovian
jovian.commit(project=project_name)
[jovian] Attempting to save notebook..
[jovian] Creating a new project "ashutoshkrris/ipl-data-analysis"
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Error: Failed to read Anaconda environment using command: "conda env export -n base --no-builds"
[jovian] Committed successfully! https://jovian.ml/ashutoshkrris/ipl-data-analysis
Let us first import all the libraries which we'll be using in the entire project.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
Lets's first load our dataset and take a look on it to have an overview of what our dataset looks like. We will also discard few columns which won't help us in our data visualization.
from urllib.request import urlretrieve
urlretrieve('https://github.com/ashutoshkrris/Data-Analysis-with-Python/raw/master/zerotopandas-course-project-starter/dataset/deliveries.csv','deliveries.csv')
urlretrieve('https://github.com/ashutoshkrris/Data-Analysis-with-Python/raw/master/zerotopandas-course-project-starter/dataset/matches.csv','matches.csv')
('matches.csv', <http.client.HTTPMessage at 0x215d829af40>)
ipl_df = pd.read_csv('matches.csv')
ipl_df.head(5)
Let us explain the dataset. So, basically we have a lot of rows and columns here in the dataset. It includes the Season, City , Venue in which the match was held, the Date on which the match was held, the teams between which the match was played , information related to toss , winner and umpires.
ipl_df.shape
(756, 18)
So, we have 756 rows and 18 columns in total.
ipl_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 756 entries, 0 to 755
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 756 non-null int64
1 Season 756 non-null object
2 city 749 non-null object
3 date 756 non-null object
4 team1 756 non-null object
5 team2 756 non-null object
6 toss_winner 756 non-null object
7 toss_decision 756 non-null object
8 result 756 non-null object
9 dl_applied 756 non-null int64
10 winner 752 non-null object
11 win_by_runs 756 non-null int64
12 win_by_wickets 756 non-null int64
13 player_of_match 752 non-null object
14 venue 756 non-null object
15 umpire1 754 non-null object
16 umpire2 754 non-null object
17 umpire3 119 non-null object
dtypes: int64(4), object(14)
memory usage: 106.4+ KB
We see that in the umpire3 column, we have only 119 non-null objects. So we can discard them without any issue. Also, we will discard the umpire1 and umpire2 columns since they won't be useful in our data analysis.
We have another dataset that contains data of each ball bowled in the last 12 seasons of IPL. Let's see how it looks like.
deliveries_df = pd.read_csv('deliveries.csv')
deliveries_df.head()
deliveries_df.shape
(179078, 21)
So we have 179078 rows and 21 columns in this dataset. Well, the number of rows is that high as it contains data of each ball.
deliveries_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 179078 entries, 0 to 179077
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 match_id 179078 non-null int64
1 inning 179078 non-null int64
2 batting_team 179078 non-null object
3 bowling_team 179078 non-null object
4 over 179078 non-null int64
5 ball 179078 non-null int64
6 batsman 179078 non-null object
7 non_striker 179078 non-null object
8 bowler 179078 non-null object
9 is_super_over 179078 non-null int64
10 wide_runs 179078 non-null int64
11 bye_runs 179078 non-null int64
12 legbye_runs 179078 non-null int64
13 noball_runs 179078 non-null int64
14 penalty_runs 179078 non-null int64
15 batsman_runs 179078 non-null int64
16 extra_runs 179078 non-null int64
17 total_runs 179078 non-null int64
18 player_dismissed 8834 non-null object
19 dismissal_kind 8834 non-null object
20 fielder 6448 non-null object
dtypes: int64(13), object(8)
memory usage: 28.7+ MB
# for ipl_df dataframe
discard_columns = ['umpire1','umpire2','umpire3']
ipl_df = ipl_df.drop(discard_columns, axis=1)
ipl_df.head()
Earlier we see that, we had three columns called umpire1 , umpire2 and umpire3. But we do not need them in our analysis as many of their rows contained NaN values. So, we have discarded them and our dataset now contains 15 columns.
ipl_df.team1.value_counts()
Mumbai Indians 101
Kings XI Punjab 91
Chennai Super Kings 89
Royal Challengers Bangalore 85
Kolkata Knight Riders 83
Delhi Daredevils 72
Rajasthan Royals 67
Sunrisers Hyderabad 63
Deccan Chargers 43
Pune Warriors 20
Rising Pune Supergiants 15
Gujarat Lions 14
Kochi Tuskers Kerala 7
Delhi Capitals 6
Name: team1, dtype: int64
We can see that, these are the all teams that have played in the last 12 seasons of IPL. Few of them like Delhi Capitals, Gujarat Lions, Kochi Tuskers Kerala didn't play in more than 1-2 seasons. That's why their numbers are so low.
ipl_df.result.value_counts()
normal 743
tie 9
no result 4
Name: result, dtype: int64
The result column in the dataset specifies whether the matched ended normally or there was a tie between the teams or the match was cancelled due to rain or some unavoidable reasons.
import jovian
jovian.commit()
[jovian] Attempting to save notebook..
[jovian] Updating notebook "ashutoshkrris/ipl-data-analysis" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Error: Failed to read Anaconda environment using command: "conda env export -n base --no-builds"
[jovian] Committed successfully! https://jovian.ml/ashutoshkrris/ipl-data-analysis
Now that our dataset is good to go, we can analyze it using plots, pie charts and graphs.
Let's load our cleaned dataset again first.
ipl_df.sample(10)
city_host = ipl_df.city.value_counts()
city_host
Mumbai 101
Kolkata 77
Delhi 74
Bangalore 66
Hyderabad 64
Chennai 57
Jaipur 47
Chandigarh 46
Pune 38
Durban 15
Bengaluru 14
Visakhapatnam 13
Centurion 12
Ahmedabad 12
Rajkot 10
Mohali 10
Indore 9
Dharamsala 9
Johannesburg 8
Cape Town 7
Cuttack 7
Ranchi 7
Port Elizabeth 7
Abu Dhabi 7
Raipur 6
Sharjah 6
Kochi 5
Kanpur 4
East London 3
Kimberley 3
Nagpur 3
Bloemfontein 2
Name: city, dtype: int64
plt.figure(figsize=(15,6))
plt.xticks(rotation=90)
plt.title('Number of matches hosted')
count=0
cities=pd.DataFrame(city_host)
cities['name']=city_host.index
for i in cities['city']:
plt.text(count-0.4,i+1,str(i),color='black')
count+=1
sns.barplot(city_host.index, city_host);
So, we can see in the barplot that Mumbai has hosted the maximum number of IPL matches followed by Kolkata and Delhi.
batting_ings = deliveries_df.groupby(['match_id','batsman']).apply(lambda x : np.sum(x['batsman_runs'])).reset_index(name='Innings Runs')
sorted_batting_ings = batting_ings.sort_values(by='Innings Runs',ascending=False)
top_batsmen_scores = sorted_batting_ings[:10]
ball_faced = deliveries_df.groupby(['match_id','batsman']).apply(lambda x : x['batsman_runs'].count()).reset_index(name='Balls Faced')
batsmen_performance = pd.merge(top_batsmen_scores, ball_faced, how='inner', left_on=['match_id','batsman'], right_on=['match_id','batsman'])
batsmen_performance['Strike Rate for Match'] = batsmen_performance['Innings Runs']*100 / batsmen_performance['Balls Faced']
batsmen_innings = pd.merge(batsmen_performance, deliveries_df, how='inner',left_on=['match_id','batsman'],right_on=['match_id','batsman'])
batsmen_innings_table = batsmen_innings.iloc[:,1:8]
batsmen_innings_table2 = batsmen_innings_table.drop_duplicates()
batsmen_innings_table2
x=batsmen_innings_table2['batsman']
y1=batsmen_innings_table2['Innings Runs']
plt.figure(figsize=(12,6))
plt.scatter(x,y1)
plt.xlabel('Batsmen',size=15)
plt.ylabel('Innings Score',size=15)
plt.title('IPL Best batting performances in an Inning')
plt.xticks(rotation=60)
plt.legend(['Runs']);
We can see that Chris Gayle is the one who has scored the maximum runs(175*) in a match followed by Brendon McCullum and AB de Villiers.
Well, if you are an IPL fan, you must have been aware of this Gayle Storm that hit the Chinnaswamy Stadium on April 23,2013. Gayle scored an unbeaten 175 against Pune Warriors with 17 sixes and 13 fours in just 66 balls and a strike rate of 265.15 . Isn't that something the Universe Boss can only do??
In the above graph, we can see two players, Chris Gayle and AB de Villers, have appeared twice and they both play for Royal Challengers Bangalore. But there's a fun fact, despite having world class players like Gayle, Kohli and ABD , RCB has never won any season of IPL.
bowling_wickets = deliveries_df[deliveries_df['dismissal_kind']!='run out']
bowling_total = bowling_wickets.groupby('bowler').apply(lambda x : x['dismissal_kind'].dropna()).reset_index(name='Wickets')
bowling_wicket_count = bowling_total.groupby('bowler').count().reset_index()
bowling_top = bowling_wicket_count.sort_values(by='Wickets',ascending=False)
top_bowlers = bowling_top.loc[:,['bowler','Wickets']][0:10]
top_bowlers
Since, run outs are not counted in the bowlers account, we can discard all those such dismissals.
Let's plot it.
plt.figure(figsize=(12,6))
plt.scatter(top_bowlers['bowler'],top_bowlers['Wickets'],color='r');
plt.plot(top_bowlers['bowler'],top_bowlers['Wickets'],color='g');
plt.xticks(rotation=60)
plt.xlabel('Top 10 Bowlers')
plt.ylabel('Wickets Taken')
plt.title('Top 10 Bowlers in last 12 seasons');
Lasith Malinga, the Sri Lankan fast bowler, has taken the maximum number(170) wickets in the last 12 seasons of the IPL, followed by Amit Mishra and Harbhajan Singh.
matches_played=pd.concat([ipl_df['team1'],ipl_df['team2']])
matches_played=matches_played.value_counts().reset_index()
matches_played.columns=['Team','Total Matches']
matches_played['wins']=ipl_df['winner'].value_counts().reset_index()['winner']
matches_played.set_index('Team',inplace=True)
matches_played.reset_index()
win_percentage = round(matches_played['wins']/matches_played['Total Matches'],3)*100
win_percentage
Team
Mumbai Indians 58.3
Royal Challengers Bangalore 55.6
Kolkata Knight Riders 51.7
Kings XI Punjab 47.7
Chennai Super Kings 50.0
Delhi Daredevils 46.6
Rajasthan Royals 45.6
Sunrisers Hyderabad 53.7
Deccan Chargers 38.7
Pune Warriors 32.6
Rising Pune Supergiants 43.3
Gujarat Lions 40.0
Delhi Capitals 62.5
Kochi Tuskers Kerala 42.9
dtype: float64
We can see the Win percentage of each team. When we look at it closely, we find that Delhi Capitals has the best stats. But when we see the number of matches played by Delhi Capitals, we find that it is quite low as compared with the other teams. This is so because earlier Delhi used to play with the name of Delhi Daredevils and then renamed to Delhi Capitals.
We can see the same data in a bar plot more easily, so let's plot it.
plt.figure(figsize=(12,6))
plt.xticks(rotation=90)
plt.title('Matches Played vs. Matches Won')
sns.barplot(matches_played['Total Matches'].index, win_percentage);
In this graph, we can clearly see that Delhi Capitals has the best stats. But now that we know the reason, we can say that Mumbai Indians have the best winning percentage as compared to other teams.
high_scores = deliveries_df.groupby(['match_id', 'inning','batting_team','bowling_team'])['total_runs'].sum().reset_index()
high_scores = high_scores[high_scores['total_runs']>=200]
high_scores.nlargest(10,'total_runs')
Royal Challengers Bangalore has scored the maximum runs in one innings against Pune Warriors India, followed by Kolkata Knight Riders against Kings XI Punjab and then again Royal Challengers Bangalore against Gujarat Lions.
Well, the match in which RCB scored 263 runs is the same match win which Chris Gayle scored unbeated 175.
First of all, in the dataset we have a column called Season which contains data as : 'IPL-2017'. But we want the data as : 2017 . So, we will add a new column called Year in the dataset which contains the year in which match was held.
season = ipl_df.Season.values
year = []
for i in range(len(season)):
year.append(int(season[i][4:8]))
ipl_df['Year'] = year
ipl_df
Now that we have a column called Year, we can compare in which year what was the maximum run by which a team won?
y = 2008
season=[]
win_by_runs_max=[]
while y<2020:
season.append(y)
win_by_runs_max.append(ipl_df[ipl_df['Year']==y]['win_by_runs'].max())
y += 1
win_by_runs_max
[140, 92, 98, 111, 86, 130, 93, 138, 144, 146, 102, 118]
We have got a list of runs . We can plot it right away.
plt.plot(season,win_by_runs_max)
plt.scatter(season,win_by_runs_max)
count=0
while count < 12:
plt.text(season[count]+0.1,win_by_runs_max[count],str(win_by_runs_max[count]),size=14)
count+=1
plt.xticks(range(2008,2020),fontsize=14)
plt.xlabel('Season')
plt.ylabel('Runs')
plt.title('Biggest win by runs per season',fontsize=14)
plt.show()
import jovian
jovian.commit()
[jovian] Attempting to save notebook..
[jovian] Updating notebook "ashutoshkrris/ipl-data-analysis" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Error: Failed to read Anaconda environment using command: "conda env export -n base --no-builds"
[jovian] Committed successfully! https://jovian.ml/ashutoshkrris/ipl-data-analysis
Now, let us answer some basic questions that one might want to know about the IPL seasons.
winner = ipl_df.winner.value_counts()
winner
Mumbai Indians 109
Chennai Super Kings 100
Kolkata Knight Riders 92
Royal Challengers Bangalore 84
Kings XI Punjab 82
Rajasthan Royals 75
Delhi Daredevils 67
Sunrisers Hyderabad 58
Deccan Chargers 29
Rising Pune Supergiants 15
Gujarat Lions 13
Pune Warriors 12
Delhi Capitals 10
Kochi Tuskers Kerala 6
Name: winner, dtype: int64
We can see that Mumbai Indians have won the maximum number of matches in the last 12 seasons of IPL, followed by Chennai Super Kings and Kolkata Knight Riders. So, we may say that Mumbai Indians has been the most successful team in the IPL. We can also see that few teams like Kochi Tuskers Kerala, Delhi Capitals, Pune Warriors, etc. have won very less number of matches. The reason for this is they had played the IPL for just 1 or 2 seasons.
Let's plot the above data using a barplot.
plt.figure(figsize=(12,6))
plt.xticks(rotation=75)
plt.title('Which team won the maximum number of matches in all seasons?')
sns.barplot(winner.index, winner);
m_o_m = ipl_df.player_of_match.value_counts()
m_o_m
CH Gayle 21
AB de Villiers 20
RG Sharma 17
MS Dhoni 17
DA Warner 17
..
Q de Kock 1
DL Vettori 1
M Ur Rahman 1
WPUJC Vaas 1
CRD Fernando 1
Name: player_of_match, Length: 226, dtype: int64
Whoh! The GayleStorm Chris Gayle has become Man of the Match most number of times, followed by Mr. 360 AB de Villiers, our very own Hitman Rohit Sharma, Thalaiva MS Dhoni and Reverend David Warner.
But here's a quick fun fact : Despite having outstanding performance in IPL, Chris Gayle had gone unsold twice in the IPL 2018 auction before being taken by Kings XI Punjab at his base price of ₹2 Crores.
Now, let's again plot a barplot to see the same data more clearly.
m_o_m_10 = m_o_m[:10]
plt.figure(figsize=(12,6))
plt.xticks(rotation=75)
plt.title('Top 10 players with most number of Man of the Match title')
sns.barplot(m_o_m_10.index, m_o_m_10);
win = ipl_df.winner == ipl_df.toss_winner
win_counts = win.value_counts()
win_counts
True 393
False 363
dtype: int64
Out of 756 matches in the last 12 seasons, we can see that the toss winning team has won the match 393 times but has lost the match 363 times. Well, the difference is not that much.
Let's see the data in percentage first and then plot a pie chart for it.
round((win_counts/win.count())*100,2)
True 51.98
False 48.02
dtype: float64
plt.figure(figsize=(12,6))
plt.title('Does winning the toss increase the chances of winning the match?')
plt.pie(win_counts,labels=win_counts.index,autopct='%1.1f%%', startangle=180);
batsmen = deliveries_df.groupby('batsman').apply(lambda x : np.sum(x['batsman_runs'])).reset_index(name='Runs')
batsmen_sorted=batsmen.sort_values(by='Runs',ascending=False)
top_batsmen=batsmen_sorted[:5]
top_batsmen
plt.figure(figsize=(12,6))
plt.title('Top 5 Batsmen')
sns.barplot(top_batsmen['batsman'],top_batsmen['Runs']);
Thus, we understood that why Virat Kohli is called the Run Machine. He has scored 5434 runs in total, followed by Suresh Raina with slightly less 5415 runs and Rohit Sharma with 4914 runs.
We know that a team can win by runs only if it bats first. So, to answer this question, we will first separate the teams which have batted first.
batting_first_team = ipl_df[ipl_df.win_by_runs != 0]
batting_first_team.head()
Now that we have a separate dataframe where we have teams who have batted first, we can plot a histogram and visualize the graph.
plt.figure(figsize=(12,6))
plt.hist(batting_first_team['win_by_runs'])
plt.xlabel('Runs')
plt.ylabel('Number of matches')
plt.show()
We can see that in almost around 140 matches, teams have won with a margin of 0-20 runs whereas only one or two team has won match with a margin of 120-140 runs. One of such matches is RCB vs PWI as we have mentioned it already.
Similarly to answer this question, we have to find which team has bowled first and we will separate them out.
bowling_first_team = ipl_df[ipl_df.win_by_wickets != 0]
bowling_first_team.head()
plt.figure(figsize=(12,6))
plt.hist(bowling_first_team['win_by_wickets'])
plt.xlabel('Wickets')
plt.ylabel('Number of matches')
plt.show()
We can see that around 85 matches have been won by 6 wickets, 70 matches with 5 wickets, 80 matches with 7 wickets. We can also find that 10 matches have been won by 10 wickets meaning the opening batsmen were enough for the opponents.
import jovian
jovian.commit()
[jovian] Attempting to save notebook..
[jovian] Updating notebook "ashutoshkrris/ipl-data-analysis" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Error: Failed to read Anaconda environment using command: "conda env export -n base --no-builds"
[jovian] Committed successfully! https://jovian.ml/ashutoshkrris/ipl-data-analysis
These are the few conclusions that I can draw from the above analysis.
This analysis can be further more improved by adding more visualizations as the dataset is quite interesting and clean in itself.
import jovian
jovian.commit()
[jovian] Attempting to save notebook..
[jovian] Updating notebook "ashutoshkrris/ipl-data-analysis" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Error: Failed to read Anaconda environment using command: "conda env export -n base --no-builds"
[jovian] Committed successfully! https://jovian.ml/ashutoshkrris/ipl-data-analysis
import jovian
jovian.commit()
[jovian] Attempting to save notebook..