This notebook analyses the matches played in the Indian Premier League (IPL) from 2008 to 2019. The dataset used in this analysis is taken from https://www.kaggle.com/nowke9/ipldata. The download contains two files: matches.csv, which has information about the matches and their results, and deliveries.csv, which has ball-by-ball data for all seasons. For this project, I have analysed the data from matches.csv.
The analysis done in this project is from a historical point of view, giving readers an overview of what has happened in the IPL. Python, together with Pandas, Matplotlib and Seaborn, has been used to give a visual as well as numeric representation of the data.
I learned these tools through the course Data Analysis with Python: Zero to Pandas, conducted by Jovian.ml in partnership with freeCodeCamp. The course was offered at no cost and made my journey of learning really easy and interesting.
This is an executable Jupyter notebook hosted on Jovian.ml, a platform for sharing data science projects. You can run and experiment with the code in a couple of ways: using free online resources (recommended) or on your own computer.
The easiest way to start executing this notebook is to click the "Run" button at the top of this page, and select "Run on Binder". This will run the notebook on mybinder.org, a free online service for running Jupyter notebooks. You can also select "Run on Colab" or "Run on Kaggle".
Install Conda by following these instructions. Add Conda binaries to your system PATH, so you can use the conda command on your terminal.
Create a Conda environment and install the required libraries by running these commands on the terminal:
conda create -n zerotopandas -y python=3.8
conda activate zerotopandas
pip install jovian jupyter numpy pandas matplotlib seaborn opendatasets --upgrade
Clone the notebook into a local directory:
jovian clone notebook-owner/notebook-id
Enter the newly created directory and start the Jupyter notebook:
cd directory-name
jupyter notebook
You can now access Jupyter's web interface by clicking the link that shows up on the terminal or by visiting http://localhost:8888 on your browser. Click on the notebook file (it has a .ipynb extension) to open it.
!pip install jovian --upgrade --quiet
!pip install pandas --upgrade
Requirement already up-to-date: pandas in c:\users\s\anaconda3\envs\courseproject\lib\site-packages (1.1.2)
Requirement already satisfied, skipping upgrade: numpy>=1.15.4 in c:\users\s\anaconda3\envs\courseproject\lib\site-packages (from pandas) (1.19.2)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.7.3 in c:\users\s\anaconda3\envs\courseproject\lib\site-packages (from pandas) (2.8.1)
Requirement already satisfied, skipping upgrade: pytz>=2017.2 in c:\users\s\anaconda3\envs\courseproject\lib\site-packages (from pandas) (2020.1)
Requirement already satisfied, skipping upgrade: six>=1.5 in c:\users\s\anaconda3\envs\courseproject\lib\site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
!pip install matplotlib seaborn --upgrade --quiet
#Importing the libraries (tools) to be used
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
Let's load the CSV file using the Pandas library. We'll use the name matches_raw_df for the data frame, to indicate that this is unprocessed data which we might clean, filter and modify to prepare a data frame that's ready for analysis.
We will read the matches.csv file using read_csv().
matches_raw_df = pd.read_csv('matches.csv')
matches_raw_df
# know the no. of rows and columns using shape
matches_raw_df.shape
(756, 18)
So, the dataset has 756 rows (matches) and 18 columns. Let's find the names of those columns.
#Getting the list of columns
matches_raw_df.columns
Index(['id', 'season', 'city', 'date', 'team1', 'team2', 'toss_winner',
'toss_decision', 'result', 'dl_applied', 'winner', 'win_by_runs',
'win_by_wickets', 'player_of_match', 'venue', 'umpire1', 'umpire2',
'umpire3'],
dtype='object')
#Know the no. of columns using len
len(matches_raw_df.columns)
18
#Know about data
matches_raw_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 756 entries, 0 to 755
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 756 non-null int64
1 season 756 non-null int64
2 city 749 non-null object
3 date 756 non-null object
4 team1 756 non-null object
5 team2 756 non-null object
6 toss_winner 756 non-null object
7 toss_decision 756 non-null object
8 result 756 non-null object
9 dl_applied 756 non-null int64
10 winner 752 non-null object
11 win_by_runs 756 non-null int64
12 win_by_wickets 756 non-null int64
13 player_of_match 752 non-null object
14 venue 756 non-null object
15 umpire1 754 non-null object
16 umpire2 754 non-null object
17 umpire3 119 non-null object
dtypes: int64(5), object(13)
memory usage: 106.4+ KB
#Using isnull() to find the columns having null values
#Using sum() to find the total no. of null values for each column
matches_raw_df.isnull().sum()
id 0
season 0
city 7
date 0
team1 0
team2 0
toss_winner 0
toss_decision 0
result 0
dl_applied 0
winner 4
win_by_runs 0
win_by_wickets 0
player_of_match 4
venue 0
umpire1 2
umpire2 2
umpire3 637
dtype: int64
Almost all columns except umpire3 have no or very few null values. The null values could be due to missing information or data-entry errors. One thing that catches my eye is that although there are no null values in the result column, there are some in winner and player_of_match. Let's find out why.
#Using value_counts() on result to find the different values in the result column and their total no.
matches_raw_df.result.value_counts()
normal 743
tie 9
no result 4
Name: result, dtype: int64
So, out of 756 matches, 4 ended as no result, mainly due to rain. Therefore, these 4 matches have neither a winner nor a player of the match.
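To confirm this explanation rather than infer it from the counts, we can filter the rows where winner is missing and look at their result values. The following is a minimal sketch on made-up rows that only mimic three columns of matches.csv; the team and player names are placeholders, not real data.

```python
import pandas as pd

# Made-up sample rows that mimic three columns of matches.csv (not real IPL data)
sample_df = pd.DataFrame({
    'result': ['normal', 'tie', 'no result', 'normal', 'no result'],
    'winner': ['Team A', 'Team B', None, 'Team C', None],
    'player_of_match': ['P1', 'P2', None, 'P3', None],
})

# Keep only the rows where winner is missing
missing_winner_df = sample_df[sample_df.winner.isnull()]

# Check whether every such row is a 'no result' match
all_no_result = (missing_winner_df.result == 'no result').all()

print(missing_winner_df)
print('All missing winners are no-result matches:', all_no_result)
```

Running the same filter on matches_raw_df should show whether the 4 missing winners line up with the 4 no-result matches.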
#Few stats about columns with integer type data
matches_raw_df.describe()
For our analysis, the umpire3 column isn't needed. So we will drop the column using drop() by passing the column name and axis value.
matches_df = matches_raw_df.drop('umpire3', axis = 1)
matches_df
We will use matches_df for our analysis from here on.
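It is worth noting that drop() returns a new data frame and leaves the original untouched, which is why matches_raw_df still has the umpire3 column. Here is a small sketch on a made-up two-row frame; drop(columns=[...]) is used as an equivalent spelling of drop(..., axis = 1).

```python
import pandas as pd

# Made-up two-row frame with the same kind of columns as matches.csv
raw_df = pd.DataFrame({
    'id': [1, 2],
    'winner': ['Team A', 'Team B'],
    'umpire3': [None, 'Umpire X'],
})

# drop() returns a new frame; raw_df itself is not modified
clean_df = raw_df.drop(columns=['umpire3'])

print(list(raw_df.columns))    # the raw frame still has umpire3
print(list(clean_df.columns))  # the cleaned frame does not
```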
import jovian
jovian.commit(project = 'ipl data analysis', files = ['matches.csv'])
[jovian] Attempting to save notebook..
[jovian] Updating notebook "srijansrj5901/ipl-data-analysis" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ml/srijansrj5901/ipl-data-analysis
Let's find out how many matches have been played in each IPL season from 2008 to 2019.
We will group the rows by season using groupby() and then count the no. of matches for each season using count() on id.
matches_per_season = matches_df.groupby('season').id.count()
plt.figure(figsize=(12,6))
plt.xticks(rotation=75)
plt.title('Matches Per Season')
match_per_season_plot = sns.barplot(x = matches_per_season.index, y = matches_per_season)
match_per_season_plot.set(xlabel = 'Seasons', ylabel = 'No. of Matches');
Roughly 60 matches have been played each season. However, we see a spike in the number of matches from 2011 to 2013. This is because two new franchises, Pune Warriors and Kochi Tuskers Kerala, were introduced in 2011, increasing the number of teams to 10.
Kochi Tuskers Kerala played only the 2011 season and Pune Warriors left after 2013, bringing the number of teams back down to 8 from 2014.
Before the start of the 2016 season, two teams, Chennai Super Kings and Rajasthan Royals, were banned for two seasons. To make up for them, two new teams, Rising Pune Supergiants and Gujarat Lions, entered the competition.
When Chennai Super Kings and Rajasthan Royals returned, these two teams were removed from the competition.
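For readers new to groupby(), the per-season count used above can be cross-checked with value_counts() on the season column. This is only a sketch on made-up ids and seasons, so the numbers are not the real IPL counts.

```python
import pandas as pd

# Made-up match ids and seasons (not real IPL data)
sample_df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'season': [2008, 2008, 2009, 2009, 2009],
})

# Approach used in this notebook: group by season and count the ids
by_groupby = sample_df.groupby('season').id.count()

# Alternative: count how often each season appears, then order by season
by_value_counts = sample_df.season.value_counts().sort_index()

print(by_groupby)
print(by_value_counts)
```

Both give the same numbers as long as id has no missing values, since count() skips nulls.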
One of the most significant events in any cricket match is the toss, which happens at the very start of a match. The captain who wins the toss chooses whether his team bats first or second. Let's see what teams have chosen to do across different seasons after winning the toss.
We will again group the rows by season and then count the different values of toss_decision by using value_counts(). To find the percentage, we will divide the above result by matches_per_season.
toss_decision_percentage = matches_df.groupby('season').toss_decision.value_counts().sort_index() / matches_per_season * 100
toss_decision_percentage
season toss_decision
2008 bat 44.827586
field 55.172414
2009 bat 61.403509
field 38.596491
2010 bat 65.000000
field 35.000000
2011 bat 34.246575
field 65.753425
2012 bat 50.000000
field 50.000000
2013 bat 59.210526
field 40.789474
2014 bat 31.666667
field 68.333333
2015 bat 42.372881
field 57.627119
2016 bat 18.333333
field 81.666667
2017 bat 18.644068
field 81.355932
2018 bat 16.666667
field 83.333333
2019 bat 16.666667
field 83.333333
dtype: float64
toss_decision_percentage.unstack().plot(kind = 'bar', figsize=(12,6), title = 'Toss Decisions', xlabel = 'Seasons', ylabel = 'Percentage');
Interesting!
For 2008-2013, teams showed no clear preference between batting first and second. In this period, batting first was chosen more often in 2009, 2010 and 2013, while fielding was chosen more often in 2008 and 2011. Things were even in 2012.
This could be put down to the fact that the IPL, and T20 cricket in general, was in its early stages. So, teams were probably still learning and trying to figure out which option favoured them.
However, since 2014, teams have overwhelmingly chosen to bat second. Since 2016 in particular, teams have chosen to field more than 80% of the time.
With the use of data analysis and an increasing trend in ODIs to bat second, as there is a fixed target to achieve, teams chose more and more to bat second. This made the batsmen's task easier, as they now had a clear idea of how to chase the target put in front of them.
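The toss percentages discussed above were computed by dividing value_counts() by matches_per_season. As an alternative, here is a minimal sketch using pd.crosstab() with normalize='index', which divides each season's row by its own total. The toss data below is made up for illustration.

```python
import pandas as pd

# Made-up toss decisions for two seasons (not real IPL data)
toss_df = pd.DataFrame({
    'season': [2008, 2008, 2008, 2008, 2009, 2009, 2009, 2009, 2009],
    'toss_decision': ['bat', 'field', 'field', 'field',
                      'bat', 'bat', 'bat', 'field', 'field'],
})

# normalize='index' turns the counts in each row (season) into fractions of that row
toss_share = pd.crosstab(toss_df.season, toss_df.toss_decision, normalize='index') * 100

print(toss_share.round(1))
```

The result is already a season-by-decision table, so it can be plotted directly with plot(kind = 'bar') without calling unstack().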
We saw how teams in the recent past have chosen to bat second more than 4 out of 5 times. Did this decision translate into results? Let's see.
We will filter the data frame using the required conditions, then group the rows by season and count the winners.
wins_batting_second = matches_df[(matches_df.win_by_runs == 0) & (matches_df.result == 'normal')].groupby('season').winner.count() / matches_per_season * 100
wins_batting_first = matches_df[(matches_df.win_by_wickets == 0) & (matches_df.result == 'normal')].groupby('season').winner.count() / matches_per_season * 100
combined_wins_df = pd.concat([wins_batting_first, wins_batting_second], axis = 1)
combined_wins_df.columns = ['batting_first', 'batting_second']
combined_wins_df
combined_wins_df.plot(kind = 'bar', figsize=(12,6), title = 'Wins', xlabel = 'Seasons', ylabel = 'Percentage');
We saw earlier that for 2008-2013, teams were undecided between batting first and second. This is partially visible in the results as well. The wins from batting first are very close to those from batting second. However, there is just one season where teams batting first won more, with things being equal in 2013.
Again, since 2014, results have favoured teams chasing, with 2015 the only exception. Leaving out 2015, things have been overwhelmingly in favour of teams batting second.
So, teams have been justified in choosing to bat second more often.
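The comparison above relies on the fact that a win by runs means the winner batted first, while a win by wickets means the winner chased. A minimal sketch of that labelling on made-up rows follows; the column name won_batting is hypothetical. Note one difference from the notebook's calculation: this sketch divides by the number of normal results per season, not by all matches in the season.

```python
import pandas as pd

# Made-up results (not real IPL data)
sample_df = pd.DataFrame({
    'season': [2018, 2018, 2018, 2019, 2019],
    'result': ['normal', 'normal', 'tie', 'normal', 'normal'],
    'win_by_runs': [10, 0, 0, 0, 0],
    'win_by_wickets': [0, 5, 0, 7, 3],
})

# Ties and no-results have no batting-first or batting-second winner
normal_df = sample_df[sample_df.result == 'normal'].copy()

# A positive run margin means the winner defended a total, i.e. batted first
normal_df['won_batting'] = normal_df.win_by_runs.gt(0).map({True: 'first', False: 'second'})

# Share of wins per season, as a season-by-label table
win_share = (
    normal_df.groupby('season').won_batting
    .value_counts(normalize=True)
    .unstack(fill_value=0) * 100
)

print(win_share)
```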
In leagues across different sports, there is always talk about teams with "history", meaning teams that have played the most in the league and continue to do so. Let's find such teams in the IPL.
We will count the different values for team1 and team2 using value_counts(), add the two counts and sort them in descending order using sort_values().
total_matches_played = (matches_df.team2.value_counts() + matches_df.team1.value_counts()).sort_values(ascending = False)
total_matches_played
Mumbai Indians 187
Royal Challengers Bangalore 180
Kolkata Knight Riders 178
Kings XI Punjab 176
Chennai Super Kings 164
Delhi Daredevils 161
Rajasthan Royals 147
Sunrisers Hyderabad 108
Deccan Chargers 75
Pune Warriors 46
Gujarat Lions 30
Rising Pune Supergiant 16
Delhi Capitals 16
Rising Pune Supergiants 14
Kochi Tuskers Kerala 14
dtype: int64
plt.figure(figsize=(12,6))
plt.title('Total Matches Played')
total_matches_played_plot = sns.barplot(y = total_matches_played.index, x = total_matches_played)
total_matches_played_plot.set(ylabel = 'Teams', xlabel = 'No. of Matches');
Mumbai Indians have played the most matches. They are followed by Royal Challengers Bangalore, Kolkata Knight Riders, Kings XI Punjab and Chennai Super Kings.
Chennai Super Kings and Rajasthan Royals could have been higher had they not been banned.
You will see there are two teams from Delhi, Delhi Daredevils and Delhi Capitals. This is due to the change in owners and team name in 2018.
It is a similar story for Deccan Chargers and Sunrisers Hyderabad: Deccan Chargers were removed from the IPL ahead of the 2013 season and Sunrisers took their place.
Also, there are two teams with almost the same name: Rising Pune Supergiants and Rising Pune Supergiant. They are the same team with the same owners; the change has more to do with superstition.
In the 2016 season, Rising Pune Supergiants finished 7th. For 2017, the owners changed the captain and also dropped the 's' from Supergiants. It paid off, as they finished as runners-up that season!
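If we wanted to treat the two Pune spellings as one team, the names could be unified with replace() before counting. The sketch below uses made-up fixtures; the name_map dictionary is an assumption about how one might merge the names, and this notebook keeps them separate. It also uses add(..., fill_value=0), so that a team appearing in only one of team1 and team2 is not turned into NaN by the addition.

```python
import pandas as pd

# Made-up fixtures (not real IPL data); 'Team C' appears only as team2
fixtures_df = pd.DataFrame({
    'team1': ['Rising Pune Supergiants', 'Team A', 'Rising Pune Supergiant', 'Team B'],
    'team2': ['Team A', 'Team B', 'Team A', 'Team C'],
})

# Hypothetical mapping from the old spelling to the new one
name_map = {'Rising Pune Supergiants': 'Rising Pune Supergiant'}
unified_df = fixtures_df.replace(name_map)

# fill_value=0 treats a team missing from one column as having 0 matches there
played = (
    unified_df.team1.value_counts()
    .add(unified_df.team2.value_counts(), fill_value=0)
    .astype(int)
    .sort_values(ascending=False)
)

print(played)
```

Whether to merge renamed teams such as Delhi Daredevils and Delhi Capitals in the same way is a judgment call that depends on the question being asked.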
Now, teams may have a lot of history, but it's their "legacy", i.e. how often they win, that makes them popular and attracts new and neutral fans.
We will use value_counts() on winner to find the different winners and the no. of matches they have won.
most_wins = matches_df.winner.value_counts()
most_wins
Mumbai Indians 109
Chennai Super Kings 100
Kolkata Knight Riders 92
Royal Challengers Bangalore 84
Kings XI Punjab 82
Rajasthan Royals 75
Delhi Daredevils 67
Sunrisers Hyderabad 58
Deccan Chargers 29
Gujarat Lions 13
Pune Warriors 12
Rising Pune Supergiant 10
Delhi Capitals 10
Kochi Tuskers Kerala 6
Rising Pune Supergiants 5
Name: winner, dtype: int64
So Mumbai have the highest number of wins. But a better metric to judge teams by would be the win percentage.
We will divide most_wins by total_matches_played to find the win_percentage for each team.
win_percentage = (most_wins / total_matches_played).sort_values(ascending = False) * 100
win_percentage
Rising Pune Supergiant 62.500000
Delhi Capitals 62.500000
Chennai Super Kings 60.975610
Mumbai Indians 58.288770
Sunrisers Hyderabad 53.703704
Kolkata Knight Riders 51.685393
Rajasthan Royals 51.020408
Royal Challengers Bangalore 46.666667
Kings XI Punjab 46.590909
Gujarat Lions 43.333333
Kochi Tuskers Kerala 42.857143
Delhi Daredevils 41.614907
Deccan Chargers 38.666667
Rising Pune Supergiants 35.714286
Pune Warriors 26.086957
dtype: float64
plt.figure(figsize=(12,6))
plt.title('Win Percentage')
win_percentage_plot = sns.barplot(y = win_percentage.index, x = win_percentage)
win_percentage_plot.set(ylabel = 'Teams', xlabel = 'Percentage');
Rising Pune Supergiant and Delhi Capitals have the highest win percentage. This is largely because they have played very few matches, especially Rising Pune Supergiant, who technically became a new team after dropping the 's'.
Chennai Super Kings, despite playing two fewer seasons than Mumbai Indians, have only 9 fewer victories. They, along with Mumbai Indians, are the only two teams in the top 5 that were also part of the IPL in 2008.
Chennai and Mumbai are the teams with legacy.
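Since a handful of matches can inflate a win percentage, one option is to rank only the teams above a minimum number of matches. The following is a rough sketch with made-up numbers; MIN_MATCHES is a hypothetical cut-off chosen only for illustration.

```python
import pandas as pd

# Made-up win and match counts (not real IPL data)
wins = pd.Series({'Team A': 10, 'Team B': 60, 'Team C': 45})
played = pd.Series({'Team A': 16, 'Team B': 100, 'Team C': 90})

MIN_MATCHES = 50  # hypothetical threshold, not taken from the dataset

win_pct = wins / played * 100

# Keep only teams that have played enough matches, then rank them
qualified = win_pct[played >= MIN_MATCHES].sort_values(ascending=False)

print(win_pct.round(1))
print(qualified.round(1))
```

Here Team A has the best raw percentage but drops out of the ranking because it has played only 16 matches.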
import jovian
jovian.commit(project = 'ipl data analysis', files = ['matches.csv'])
[jovian] Attempting to save notebook..
[jovian] Updating notebook "srijansrj5901/ipl-data-analysis" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ml/srijansrj5901/ipl-data-analysis
We now know a few things about our data. Let's find out some more!
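Steps:
1. Use groupby() to group the rows according to seasons.
2. Use tail(1) to keep the last row of each season, i.e. the final.
3. Sort the result by season using sort_values().
4. Use value_counts() on winner to count the titles won by each team.
ipl_win = matches_df.groupby('season').tail(1).sort_values('season', ascending = True)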
ipl_win
ipl_winners = ipl_win.winner.value_counts()
ipl_winners
Mumbai Indians 4
Chennai Super Kings 3
Kolkata Knight Riders 2
Rajasthan Royals 1
Sunrisers Hyderabad 1
Deccan Chargers 1
Name: winner, dtype: int64
plt.figure(figsize=(18, 4))
plt.xlabel('Teams')
plt.ylabel('No. of Times')
plt.title('IPL Champions')
sns.barplot( x = ipl_winners.index, y = ipl_winners);
Mumbai and Chennai, our legacy teams, have won the IPL at least 3 times each. Sunrisers Hyderabad are the only team that joined the league later and have won the trophy.
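The tail(1) step used to find the champions assumes that the final is the last row of each season in the file. A more defensive variant, sketched below on made-up rows, sorts by date first. The ISO date strings here are an assumption for illustration; the date format in matches.csv should be checked before parsing it with pd.to_datetime().

```python
import pandas as pd

# Made-up rows, deliberately out of date order (not real IPL data)
sample_df = pd.DataFrame({
    'season': [2018, 2018, 2019, 2019],
    'date': ['2018-05-27', '2018-04-07', '2019-03-23', '2019-05-12'],
    'winner': ['Team A', 'Team B', 'Team C', 'Team D'],
})

# Parse the dates so that sorting is chronological and not alphabetical
sample_df['date'] = pd.to_datetime(sample_df['date'])

# After sorting by date, the last row of each season is its latest match
finals_df = sample_df.sort_values('date').groupby('season').tail(1)
champions = finals_df.set_index('season').winner

print(champions)
```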
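Steps:
1. Use groupby() to group the rows according to seasons.
2. Use value_counts() on toss_winner to count the tosses won by each team in each season.
matches_df.groupby('season').toss_winner.value_counts().plot(kind ='barh', figsize = (30, 60))
plt.title('Tosses Won per Season', size = 30)
# In a horizontal bar chart the counts are on the x-axis and the categories on the y-axis
plt.xlabel('No. of Matches', size = 30)
plt.ylabel('Seasons and Teams', size = 30);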
#Double Click on the graph below to zoom
Except in 2012, 2015 and 2019, the IPL winning teams have been amongst the top two in terms of tosses won that season. In 2012 and 2015, Kolkata Knight Riders and Mumbai Indians respectively were 6th best (won 7) in winning tosses, while in 2019 Mumbai were 4th best (won 8).
Kolkata and Mumbai in 2013 and Chennai in 2019 have won the most tosses in a single season: 12.
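Reading the top toss winners off a very tall bar chart is error-prone, so the same counts can also be arranged as a season-by-team table and queried directly. This is a minimal sketch on made-up toss data.

```python
import pandas as pd

# Made-up toss winners for two seasons (not real IPL data)
toss_df = pd.DataFrame({
    'season': [2018, 2018, 2018, 2019, 2019, 2019],
    'toss_winner': ['Team A', 'Team A', 'Team B', 'Team B', 'Team C', 'Team B'],
})

# One row per season, one column per team; absent combinations become 0
toss_table = toss_df.groupby('season').toss_winner.value_counts().unstack(fill_value=0)

# Team with the most tosses won in each season, and how many it won
top_toss_team = toss_table.idxmax(axis=1)
top_toss_count = toss_table.max(axis=1)

print(toss_table)
print(top_toss_team)
```

When two teams are level in a season, idxmax() reports only the first of them, so ties still need a look at toss_table itself.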
In cricket, teams can win by runs or wickets. We will look at both the scenarios.
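Steps:
1. Keep only the matches won by runs by filtering for win_by_runs != 0.
2. Sort the rows by win_by_runs in descending order using sort_values().
3. Use head(10) to pick the 10 biggest wins.
highest_wins_by_runs_df = matches_raw_df[matches_raw_df.win_by_runs != 0].sort_values('win_by_runs', ascending = False)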
highest_wins_by_runs_df
plt.figure(figsize=(25, 10))
plt.xlabel('Seasons',size=30)
plt.ylabel('Runs',size=30)
plt.title('Highest Wins By Runs', size = 30)
sns.scatterplot(x = 'season',y = 'win_by_runs', data = highest_wins_by_runs_df, s =150, color = 'black');
sns.scatterplot(x = 'season',y = 'win_by_runs', data = highest_wins_by_runs_df.head(10), s =220, color = 'red');
# Label the 10 biggest wins with the name of the winning team
top_10_by_runs = highest_wins_by_runs_df.head(10)
for winner, season, runs in zip(top_10_by_runs.winner, top_10_by_runs.season, top_10_by_runs.win_by_runs):
    plt.annotate(winner, (season + 0.1, runs - 1), size = 20)
The biggest margin of victory by runs is 146 runs. In the 2017 season, Mumbai Indians defeated Delhi Daredevils by this margin. Royal Challengers Bangalore have 3 victories amongst the top 5, mainly because they had the services of arguably the best top 3 batsmen ever assembled in a T20 team.
If we look at margins of victory by wickets, it is fairly common to chase a total with all the wickets remaining. The top 10 wins in the list below all have a margin of 10 wickets.
largest_wins_by_wickets = matches_raw_df.sort_values('win_by_wickets', ascending = False).head(10)
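Because many matches share the same 10-wicket margin, head(10) keeps only ten of the tied rows. If we want every match at the top margin, nlargest() with keep='all' is one option. Here is a small sketch on made-up rows, asking for the top 2 to keep the example short.

```python
import pandas as pd

# Made-up results with three matches tied on a 10-wicket margin (not real IPL data)
sample_df = pd.DataFrame({
    'winner': ['Team A', 'Team B', 'Team C', 'Team D', 'Team E'],
    'win_by_wickets': [10, 9, 10, 10, 7],
})

# Default behaviour: exactly 2 rows, even though 3 matches share the top margin
top_2 = sample_df.nlargest(2, 'win_by_wickets')

# keep='all' returns every row tied with the selected values
top_2_with_ties = sample_df.nlargest(2, 'win_by_wickets', keep='all')

print(top_2)
print(top_2_with_ties)
```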