Sign In

Indian Premier League Data Analysis

Jovian logo freeCodeCamp logo IPL logo

This notebook provides the data analysis of matches that have taken place in Indian Premier League (IPL) from 2008 to 2019. The dataset used in this analysis is taken from Once downloaded, there are two different datasets. One having information about the matches and the results (matches.csv). The other one has ball-by-ball data for all seasons (deliveries.csv). For this project, I have analysed the data from matches.csv.

The analysis done in this project is from a historical point of view, giving readers an overview of what has happended in the IPL. Tools such as Pandas, Matplotlib and Seaborn along with Python have been used to give a visual as well as numeric representation of the data in front of us.

The learnings about these tools have been received through the course Data Analysis with Python: Zero to Pandas conducted by The course was offered at no cost and made my journey of learning really easy and interesting. The course was done in partnership with freeCodeCamp.

How to run the code

This is an executable Jupyter notebook hosted on, a platform for sharing data science projects. You can run and experiment with the code in a couple of ways: using free online resources (recommended) or on your own computer.

Option 1: Running using free online resources (1-click, recommended)

The easiest way to start executing this notebook is to click the "Run" button at the top of this page, and select "Run on Binder". This will run the notebook on, a free online service for running Jupyter notebooks. You can also select "Run on Colab" or "Run on Kaggle".

Option 2: Running on your computer locally
  1. Install Conda by following these instructions. Add Conda binaries to your system PATH, so you can use the conda command on your terminal.

  2. Create a Conda environment and install the required libraries by running these commands on the terminal:

conda create -n zerotopandas -y python=3.8 conda activate zerotopandas pip install jovian jupyter numpy pandas matplotlib seaborn opendatasets --upgrade

  1. Press the "Clone" button above to copy the command for downloading the notebook, and run it on the terminal. This will create a new directory and download the notebook. The command will look something like this:

jovian clone notebook-owner/notebook-id

  1. Enter the newly created directory using cd directory-name and start the Jupyter notebook.

jupyter notebook

You can now access Jupyter's web interface by clicking the link that shows up on the terminal or by visiting http://localhost:8888 on your browser. Click on the notebook file (it has a .ipynb extension) to open it.

In [1]:
!pip install jovian --upgrade --quiet
In [2]:
!pip install pandas --upgrade
Requirement already up-to-date: pandas in c:\users\s\anaconda3\envs\courseproject\lib\site-packages (1.1.2) Requirement already satisfied, skipping upgrade: numpy>=1.15.4 in c:\users\s\anaconda3\envs\courseproject\lib\site-packages (from pandas) (1.19.2) Requirement already satisfied, skipping upgrade: python-dateutil>=2.7.3 in c:\users\s\anaconda3\envs\courseproject\lib\site-packages (from pandas) (2.8.1) Requirement already satisfied, skipping upgrade: pytz>=2017.2 in c:\users\s\anaconda3\envs\courseproject\lib\site-packages (from pandas) (2020.1) Requirement already satisfied, skipping upgrade: six>=1.5 in c:\users\s\anaconda3\envs\courseproject\lib\site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
In [3]:
!pip install matplotlib seaborn --upgrade --quiet
In [1]:
#Importing the libraries (tools) to be used
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Data Preparation and Cleaning

Let's load the CSV files using the Pandas library. We'll use the name matches_raw_df for the data frame, to indicate that this is unprocessed data that which we might clean, filter and modify to prepare a data frame that's ready for analysis.

We will read the matches.csv file using read_csv().

In [2]:
matches_raw_df = pd.read_csv('matches.csv')
In [3]:
In [4]:
# know the no. of rows and columns using shape
(756, 18)

So, the dataset has 756 rows (matches) and 18 columns. Let's find the names of those columns.

In [5]:
#Getting the list of columns
Index(['id', 'season', 'city', 'date', 'team1', 'team2', 'toss_winner',
       'toss_decision', 'result', 'dl_applied', 'winner', 'win_by_runs',
       'win_by_wickets', 'player_of_match', 'venue', 'umpire1', 'umpire2',
In [6]:
#Know the no. of columns using len
In [7]:
#Know about data
<class 'pandas.core.frame.DataFrame'> RangeIndex: 756 entries, 0 to 755 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 756 non-null int64 1 season 756 non-null int64 2 city 749 non-null object 3 date 756 non-null object 4 team1 756 non-null object 5 team2 756 non-null object 6 toss_winner 756 non-null object 7 toss_decision 756 non-null object 8 result 756 non-null object 9 dl_applied 756 non-null int64 10 winner 752 non-null object 11 win_by_runs 756 non-null int64 12 win_by_wickets 756 non-null int64 13 player_of_match 752 non-null object 14 venue 756 non-null object 15 umpire1 754 non-null object 16 umpire2 754 non-null object 17 umpire3 119 non-null object dtypes: int64(5), object(13) memory usage: 106.4+ KB
In [8]:
#Using isnull() to find the columns having null values
#Using sum() to find the total no. of null values for each column
id                   0
season               0
city                 7
date                 0
team1                0
team2                0
toss_winner          0
toss_decision        0
result               0
dl_applied           0
winner               4
win_by_runs          0
win_by_wickets       0
player_of_match      4
venue                0
umpire1              2
umpire2              2
umpire3            637
dtype: int64

Almost all columns except umpire3 have none or very few null values. The null values coud be because of no information or wrong data entry. One thing that catches my eyes is the fact that though there are no null values for result columns, there are some for winner and player_of_match. Let's find out why.

In [9]:
#Using value_counts() on result to find the different values in the result column and their total no.
normal       743
tie            9
no result      4
Name: result, dtype: int64

So, out of 756 matches 4 ended as no result, mainly due to rain. Therefore, we have no winners as well as player of the match for these 4 matches.

In [10]:
#Few stats about columns with integer type data

For our analysis, umpire3 column isn't needed. So we will drop the column using drop() by passing the column name and axis value.

In [11]:
matches_df = matches_raw_df.drop('umpire3', axis = 1)
In [12]:

We will use matches_df for our analysis from here on.

In [6]:
import jovian
In [49]:
jovian.commit(project = 'ipl data analysis', files = ['matches.csv'])
[jovian] Attempting to save notebook.. [jovian] Updating notebook "srijansrj5901/ipl-data-analysis" on [jovian] Uploading notebook.. [jovian] Capturing environment.. [jovian] Uploading additional files... [jovian] Committed successfully!

Exploratory Analysis and Visualization


Let's found how many matches have been played in the IPL each of the season from 2008 to 2019.

We will group the rows by season using groupby() and then count the no. of matches for each season using count() on id.

In [22]:
matches_per_season = matches_df.groupby('season').id.count()
In [25]:
plt.title('Matches Per Season')
match_per_season_plot = sns.barplot(x = matches_per_season.index, y = matches_per_season)
match_per_season_plot.set(xlabel = 'Seasons', ylabel = 'No. of Matches');

Each season, almost 60 matches have been played. However, we see a spike in the number of matches from 2011 to 2013. This is due to the fact that two new franchises, Pune Warrior and Kochi Tuskers Kerala were introduced, increasing the number of teams to 10.

However, Kochi was removed in the very next season, while Pune Warriors were removed in 2013, bringing the number down to 8 from 2014.

Before the start of 2016 season, two teams, Chennai Super Kings and Rajasthan Royals were banned for two seasons. To make up for them two new teams, Rising Pune Supergiants and Gujarat Lions entered the competition.

When Chennai Super Kings and Rajasthan Royals returned, these two teams were removed from the competition.


One of the most significant happenings in any cricket match is the toss, which happens at the very start of a match. The toss winner can choose whether he wants to bat first or second. Let's see what teams have chosen to do across differrent seasons after winning the toss.

We will again group the rows by season and then count the different values of toss_decision by using value_counts(). To find the percentage, we will divide the above result with matches_per_season.

In [23]:
toss_decision_percentage = matches_df.groupby('season').toss_decision.value_counts().sort_index() / matches_per_season * 100
In [24]:
season  toss_decision
2008    bat              44.827586
        field            55.172414
2009    bat              61.403509
        field            38.596491
2010    bat              65.000000
        field            35.000000
2011    bat              34.246575
        field            65.753425
2012    bat              50.000000
        field            50.000000
2013    bat              59.210526
        field            40.789474
2014    bat              31.666667
        field            68.333333
2015    bat              42.372881
        field            57.627119
2016    bat              18.333333
        field            81.666667
2017    bat              18.644068
        field            81.355932
2018    bat              16.666667
        field            83.333333
2019    bat              16.666667
        field            83.333333
dtype: float64
In [26]:
MultiIndex([(2008,   'bat'),
            (2008, 'field'),
            (2009,   'bat'),
            (2009, 'field'),
            (2010,   'bat'),
            (2010, 'field'),
            (2011,   'bat'),
            (2011, 'field'),
            (2012,   'bat'),
            (2012, 'field'),
            (2013,   'bat'),
            (2013, 'field'),
            (2014,   'bat'),
            (2014, 'field'),
            (2015,   'bat'),
            (2015, 'field'),
            (2016,   'bat'),
            (2016, 'field'),
            (2017,   'bat'),
            (2017, 'field'),
            (2018,   'bat'),
            (2018, 'field'),
            (2019,   'bat'),
            (2019, 'field')],
           names=['season', 'toss_decision'])
In [27]:
In [28]:
toss_decision_percentage.unstack().plot(kind = 'bar', figsize=(12,6), title = 'Toss Decisions', xlabel = 'Seasons', ylabel = 'Percentage');


For 2008-2013, teams seem to have been favouring both batting first and second. For this period, the percentage of times batting first was chosen is more in seasons 2009, 2010 and 2013. While fielding was chosen more in 2008 and 2011. Things were even stevens for 2012.

This could be put to the fact that IPL and T20 cricket in general was in its budding stages. So, teams were probably learning and trying to figure out which would favour them.

However, since 2014, teams have overwhelmingly chosen to bat second. Especially since 2016, teams have chosen to field for more than 80% of the times.

With the use of data analysis and an incresing trend in ODIs to bat second as there is a fixed target to achieve, teams chose more and more to bat first. This made the batsmen tasks easier as they could now have a clear thought of how to scale the target put in front of them.


We saw how teams in the recent past have chosen to bat second more than 4 out of 5 times. Did this decision of theirs tranformed in results? Let's see

We will filter the dataframe using the required conditions, then grouping them by season and finding the count of winners.

In [42]:
filter1 = (matches_df.win_by_wickets == 0) & (matches_df.result == 'normal') 
In [43]:
wins_batting_first = matches_df[filter1].groupby('season').winner.count() / matches_per_season * 100
In [44]:
filter2 = (matches_df.win_by_runs == 0) & (matches_df.result == 'normal')
In [50]:
wins_fielding_first = matches_df[filter2].groupby('season').winner.count() / matches_per_season * 100
In [51]:
combined_wins_df = pd.concat([wins_batting_first, wins_fielding_first], axis = 1)
combined_wins_df.columns = ['batting_first', 'fielding_first']
In [52]:
combined_wins_df.plot(kind = 'bar', figsize=(12,6), title = 'Wins', xlabel = 'Seasons', ylabel = 'Percentage');

We saw earlier that for 2008-2013, teams were in conundrum to chose bat first or field first. This is partially visisble in the results as well. The wins from batting first are very close to that from fielding first. However, there is just one season where teams batting first won more, with things being equal in 2013.

Again, since 2014, things have been in favour of teams chasing except 2015. Leaving out 2015, things have been overwhelmingly in favour of teams fie.

So, teams chosing batting second more have been justified in their decisions.

Teams with "History"

In leagues across different sports, there is always a talk about teams with "history" as in teams that have played the most in the league and continue to do so. Let's find such teams in the IPL.

We will count the different values for team1 and team2 using value_counts() and sort them in descending order using sort_values().

In [34]:
total_matches_played = (matches_df.team2.value_counts() + matches_df.team1.value_counts()).sort_values(ascending = False)
Mumbai Indians                 187
Royal Challengers Bangalore    180
Kolkata Knight Riders          178
Kings XI Punjab                176
Chennai Super Kings            164
Delhi Daredevils               161
Rajasthan Royals               147
Sunrisers Hyderabad            108
Deccan Chargers                 75
Pune Warriors                   46
Gujarat Lions                   30
Rising Pune Supergiant          16
Delhi Capitals                  16
Rising Pune Supergiants         14
Kochi Tuskers Kerala            14
dtype: int64
In [35]:
plt.title('Total Matches Played')
total_matches_played_plot = sns.barplot(y = total_matches_played.index, x = total_matches_played)
total_matches_played_plot.set(ylabel = 'Teams', xlabel = 'No. of Matches');

Mumbai Indians are the team that have played the most number of matches. They are followed by Royal Challengers Bangalore, Kolkata Knight Riders, Kings XI Punjab and Chennai Super Kings.

Chennai Super Kings and Rajasthan Royals could have been higher had they not been banned.

You will see there are two teams from Delhi, Delhi Daredevils and Delhi Capitals. This is due to the change in owners and team name in 2018.

Similar story for Deccan Chargers and Sunrisers Hyderabad with Deccan Chargers being removed from IPL from 2013 and Sunrisers coming in their place.

Also, there are two teams with almost same names. Rising Pune Supergiants and Rising Pune Supergiant. Well, they are same teams, no owners change. But it is more to do with superstitions.

In the 2016 season, Rising Pune Supergiants finished 7th. The owners changed the captain for 2017 as well as dropped the 's' from Supergiants. Well, it paid off as they finished as runner-up that season!!!

Teams with "Legacy"

Now, teams may have a lot of history but it's their "legacy" i.e. how often they win that makes them popular and attract the new and neutral fans.

We will use value_counts() on winner to find the different winners and the no. of matches they have won.

In [38]:
most_wins = matches_df.winner.value_counts()
In [39]:
Mumbai Indians                 109
Chennai Super Kings            100
Kolkata Knight Riders           92
Royal Challengers Bangalore     84
Kings XI Punjab                 82
Rajasthan Royals                75
Delhi Daredevils                67
Sunrisers Hyderabad             58
Deccan Chargers                 29
Gujarat Lions                   13
Pune Warriors                   12
Rising Pune Supergiant          10
Delhi Capitals                  10
Kochi Tuskers Kerala             6
Rising Pune Supergiants          5
Name: winner, dtype: int64

So Mumbai have the highest number of wins. But a better metric to judge would be the win percentage.

We will divide most_wins by total_matches_played to find the win_percentage for each team.

In [40]:
win_percentage = (most_wins / total_matches_played).sort_values(ascending = False) * 100
Rising Pune Supergiant         62.500000
Delhi Capitals                 62.500000
Chennai Super Kings            60.975610
Mumbai Indians                 58.288770
Sunrisers Hyderabad            53.703704
Kolkata Knight Riders          51.685393
Rajasthan Royals               51.020408
Royal Challengers Bangalore    46.666667
Kings XI Punjab                46.590909
Gujarat Lions                  43.333333
Kochi Tuskers Kerala           42.857143
Delhi Daredevils               41.614907
Deccan Chargers                38.666667
Rising Pune Supergiants        35.714286
Pune Warriors                  26.086957
dtype: float64
In [41]:
plt.title('Win Percentage')
win_percentage_plot = sns.barplot(y = win_percentage.index, x = win_percentage)
total_matches_played_plot.set(ylabel = 'Teams', xlabel = 'Percentage');

Rising Pune Supergiant and Delhi Capitals have the highest win percentage. This is largely due to the fact that they have played really few matches. Especially, Rising Pune Supergiant, who technically beacame a new team after leaving out the 's'.

Chennai Super Kings despite playing two less seasons than Mumbai Indians had only 9 less victories. They, along with Mumbai Indians are the only two teams in top 5 that were also part of IPL in 2008.

Chennai and Mumbai are the teams with legacy.

In [42]:
import jovian
In [45]:
jovian.commit(project = 'ipl data analysis', files = ['matches.csv'])
[jovian] Attempting to save notebook.. [jovian] Updating notebook "srijansrj5901/ipl-data-analysis" on [jovian] Uploading notebook.. [jovian] Capturing environment.. [jovian] Uploading additional files... [jovian] Committed successfully!

Asking and Answering Questions

We now know few things about aour data. Let's know some more!!!

Q. Who have been the winners of IPL?


  • Using groupby() to group the rows according to seasons.
  • Find the last match of each season i.e the final using tail(1).
  • Sort the values as per season using sort_index().
  • Count the different winners and the times they won using value_counts() on winner.
In [46]:
ipl_win = matches_df.groupby('season').tail(1).sort_values('season', ascending = True)
In [47]:
ipl_winners = ipl_win.winner.value_counts()
Mumbai Indians           4
Chennai Super Kings      3
Kolkata Knight Riders    2
Rajasthan Royals         1
Sunrisers Hyderabad      1
Deccan Chargers          1
Name: winner, dtype: int64
In [49]:
plt.figure(figsize=(18, 4))
plt.ylabel('No. of Times')
plt.title('IPL Champions')
sns.barplot( x  = ipl_winners.index, y = ipl_winners);

Mumbai and Chennai, our legacy teams, have won the IPL atleast 3 times. Sunrisers Hyderabad are the only team that have joined the league later and have won the trophy.

Q. Which have been the most and the least consistent teams across all seasons?


  • Use crosstab() to create a data frame between different values of winner and season.
  • Plot the same as a heatmap.
In [28]:
matches_won_each_season = pd.crosstab(matches_df['winner'], matches_df['season'])
In [30]:
In [55]:
plt.figure(figsize=(15, 9))
plt.title('Matches Won Each Season')
sns.heatmap(matches_won_each_season, annot = True, cmap = 'flare', fmt = 'd', cbar_kws={"orientation": "horizontal"});

Chennai Super Kings have been the most consistent team, winning atleast 8 matches in each of the season they have played. This is backed up by the fact that they are the only team to reach the playoffs satage every season.

At the other end of the spectrum are 3 teams. Delhi Daredevils, Kings XI Punjab and Rajasthan Royals. All three of them have had two seasons, where they have performed really well. However, they have been pretty average in rest of the seasons by far.

Q. Which team has been the luckiest (won most no. of tosses) over the seasons?


  • Using groupby() to group the rows according to seasons.
  • Find the different toss winners and how many times they won for every season by using value_counts() on toss_winner.
In [50]:
matches_df.groupby('season').toss_winner.value_counts().plot(kind ='barh', figsize = (30, 60))
plt.title('Tosses Won per Season', size = 30)
plt.xlabel('Seasons and Teams', size = 30)
plt.ylabel('No. of Matches', size = 30);

#Double Click on the graph below to zoom

Except 2012, 2015 and 2019, the IPL winning teams have been amongst the top two in terms of toss win percentage. In 2012 and 2015, Kolkata Knight Riders and Mumbai Indians were 6th best (won 7) in winning tosses while in 2019 Mumbai were 4th best (won 8).

Kolkata and Mumbai in 2013 and Chennai in 2019 have won the most no. of tosses in a season - 12.

Q. What has been the largest margin of victory in IPL?

In cricket, teams can win by runs or wickets. We will look at both the scenarios.


  • Filter the dataframe using the required consition.
  • Sort the values in descending order.
  • Find the top 10 in the list using head(10).
In [51]:
highest_wins_by_runs_df = matches_raw_df[matches_raw_df.win_by_runs != 0].sort_values('win_by_runs', ascending = False)
In [52]:
plt.figure(figsize=(25, 10))
plt.title('Highest Wins By Runs', size = 30)
sns.scatterplot(x = 'season',y =  'win_by_runs', data = highest_wins_by_runs_df, s =150, color = 'black');
sns.scatterplot(x = 'season',y =  'win_by_runs', data = highest_wins_by_runs_df.head(10), s =220, color = 'red');
for i in range(highest_wins_by_runs_df.head(10).shape[0]):
    plt.annotate(highest_wins_by_runs_df.winner.tolist()[i], (highest_wins_by_runs_df.season.tolist()[i]+0.1, highest_wins_by_runs_df.win_by_runs.tolist()[i] - 1) , size = 20)