This notebook provides the data analysis of matches that have taken place in Indian Premier League (IPL) from 2008 to 2019. The dataset used in this analysis is taken from https://www.kaggle.com/nowke9/ipldata. Once downloaded, there are two different datasets. One having information about the matches and the results (matches.csv). The other one has ball-by-ball data for all seasons (deliveries.csv). For this project, I have analysed the data from matches.csv.
The analysis done in this project is from a historical point of view, giving readers an overview of what has happended in the IPL. Tools such as Pandas, Matplotlib and Seaborn along with Python have been used to give a visual as well as numeric representation of the data in front of us.
The learnings about these tools have been received through the course Zero to Pandas (http://zerotopandas.com) conducted by Jovian.ml (https://jovian.ml). The course was offered at no cost and made my journey of learning really easy and interesting. The course was done in partnership with freeCodeCamp (https://www.freecodecamp.org).
!pip install jovian --upgrade --quiet
!pip install pandas --upgrade
Requirement already up-to-date: pandas in c:\users\s\anaconda3\envs\courseproject\lib\site-packages (1.1.2) Requirement already satisfied, skipping upgrade: numpy>=1.15.4 in c:\users\s\anaconda3\envs\courseproject\lib\site-packages (from pandas) (1.19.2) Requirement already satisfied, skipping upgrade: python-dateutil>=2.7.3 in c:\users\s\anaconda3\envs\courseproject\lib\site-packages (from pandas) (2.8.1) Requirement already satisfied, skipping upgrade: pytz>=2017.2 in c:\users\s\anaconda3\envs\courseproject\lib\site-packages (from pandas) (2020.1) Requirement already satisfied, skipping upgrade: six>=1.5 in c:\users\s\anaconda3\envs\courseproject\lib\site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
!pip install matplotlib seaborn --upgrade --quiet
#Importing the libraries (tools) to be used import pandas as pd import matplotlib import matplotlib.pyplot as plt import seaborn as sns %matplotlib inline sns.set_style('darkgrid') matplotlib.rcParams['font.size'] = 14 matplotlib.rcParams['figure.figsize'] = (9, 5) matplotlib.rcParams['figure.facecolor'] = '#00000000'
Let's load the CSV files using the Pandas library. We'll use the name
matches_raw_df for the data frame, to indicate that this is unprocessed data that which we might clean, filter and modify to prepare a data frame that's ready for analysis.
matches_raw_df = pd.read_csv('matches.csv')
So, the dataset has 756 rows (matches) and 18 columns. Let's find the names of those columns.
Index(['id', 'season', 'city', 'date', 'team1', 'team2', 'toss_winner', 'toss_decision', 'result', 'dl_applied', 'winner', 'win_by_runs', 'win_by_wickets', 'player_of_match', 'venue', 'umpire1', 'umpire2', 'umpire3'], dtype='object')
#Know about data matches_raw_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 756 entries, 0 to 755 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 756 non-null int64 1 season 756 non-null int64 2 city 749 non-null object 3 date 756 non-null object 4 team1 756 non-null object 5 team2 756 non-null object 6 toss_winner 756 non-null object 7 toss_decision 756 non-null object 8 result 756 non-null object 9 dl_applied 756 non-null int64 10 winner 752 non-null object 11 win_by_runs 756 non-null int64 12 win_by_wickets 756 non-null int64 13 player_of_match 752 non-null object 14 venue 756 non-null object 15 umpire1 754 non-null object 16 umpire2 754 non-null object 17 umpire3 119 non-null object dtypes: int64(5), object(13) memory usage: 106.4+ KB
#Find which columns have how many null values matches_raw_df.isnull().sum()
id 0 season 0 city 7 date 0 team1 0 team2 0 toss_winner 0 toss_decision 0 result 0 dl_applied 0 winner 4 win_by_runs 0 win_by_wickets 0 player_of_match 4 venue 0 umpire1 2 umpire2 2 umpire3 637 dtype: int64
Almost all columns except
umpire3 have none or very few null values. The null values coud be because of no information or wrong data entry. One thing that catches my eyes is the fact that though there are no null values for
result columns, there are some for
player_of_match. Let's find out why.
normal 743 tie 9 no result 4 Name: result, dtype: int64
So, out of 756 matches 4 ended as no result, mainly due to rain. Therefore, we have no winners as well as player of the match for these 4 matches.
#Few stats about columns with integer type data matches_raw_df.describe()
For our analysis,
umpire3 column isn't needed. So we will drop the column.
matches_df = matches_raw_df.drop('umpire3', axis = 1)
We will use
matches_df for our analysis from here on.
jovian.commit(project = 'ipl matches data analysis', files = ['matches.csv'])
[jovian] Attempting to save notebook.. [jovian] Updating notebook "srijansrj5901/ipl-matches-data-analysis" on https://jovian.ml/ [jovian] Uploading notebook.. [jovian] Capturing environment.. [jovian] Uploading additional files... [jovian] Committed successfully! https://jovian.ml/srijansrj5901/ipl-matches-data-analysis
Let's found how many matches have been played in the IPL each of the season from 2008 to 2019.
matches_per_season = matches_df.groupby('season').season.count()
plt.figure(figsize=(12,6)) plt.xticks(rotation=75) plt.title('Matches Per Season') match_per_season_plot = sns.barplot(x = matches_per_season.index, y = matches_per_season) match_per_season_plot.set(xlabel = 'Seasons', ylabel = 'No. of Matches');
Each season, almost 60 matches have been played. However, we see a spike in the number of matches from 2011 to 2013. This is due to the fact that two new franchises, Pune Warrior and Kochi Tuskers Kerala were introduced, increasing the number of teams to 10.
However, they were removed from 2014, bringing the number down to 8.
Before the start of 2016 season, two teams, Chennai Super Kings and Rajasthan Royals were banned for two seasons. To make up for them two new teams, Rising Pune Supergiants and Gujarat Lions entered the competition.
When Chennai Super Kings and Rajasthan Royals returned, these two teams were removed from the competition.
One of the most significant happenings in any cricket match is the toss, which happens at the very start of a match. The toss winner can choose whether he wants to bat first or second. Let's see what teams have chosen to do across differrent seasons after winning the toss.
toss_decision_percentage = matches_df.groupby('season').toss_decision.value_counts().sort_index() / matches_per_season * 100
season toss_decision 2008 bat 44.827586 field 55.172414 2009 bat 61.403509 field 38.596491 2010 bat 65.000000 field 35.000000 2011 bat 34.246575 field 65.753425 2012 bat 50.000000 field 50.000000 2013 bat 59.210526 field 40.789474 2014 bat 31.666667 field 68.333333 2015 bat 42.372881 field 57.627119 2016 bat 18.333333 field 81.666667 2017 bat 18.644068 field 81.355932 2018 bat 16.666667 field 83.333333 2019 bat 16.666667 field 83.333333 dtype: float64
toss_decision_percentage.unstack().plot(kind = 'bar', figsize=(12,6), title = 'Toss Decisions', xlabel = 'Seasons', ylabel = 'Percentage');
For 2008-2013, teams seem to have been favouring both batting first and second. For this period, the percentage of times batting first was chosen is more in seasons 2009, 2010 and 2013. While fielding was chosen more in 2008 and 2011. Things were even stevens for 2012.
This could be put to the fact that IPL and T20 cricket in general was in its budding stages. So, teams were probably learning and trying to figure out which would favour them.
However, since 2014, teams have overwhelmingly chosen to bat second. Especially since 2016, teams have chosen to field for more than 80% of the times.
With the use of data analysis and an incresing trend in ODIs to bat second as there is a fixed target to achieve, teams chose more and more to bat first. This made the batsmen tasks easier as they could now have a clear thought of how to scale the target put in front of them.
We saw how teams in the recent past have chosen to bat second more than 4 out of 5 times. Did this decision of theirs tranformed in results? Let's see
wins_batting_second = matches_df[(matches_df.win_by_runs == 0) & (matches_df.result == 'normal')].groupby('season').winner.count() / matches_per_season * 100
wins_batting_first = matches_df[(matches_df.win_by_wickets == 0) & (matches_df.result == 'normal')].groupby('season').winner.count() / matches_per_season * 100
combined_wins_df = pd.concat([wins_batting_first, wins_batting_second], axis = 1) combined_wins_df.columns = ['batting_first', 'batting_second'] combined_wins_df
combined_wins_df.plot(kind = 'bar', figsize=(12,6), title = 'Wins', xlabel = 'Seasons', ylabel = 'Percentage');
We saw earlier that for 2008-2013, teams were in conundrum to chose bat first or second. This is partially visisble in the results as well. The wins from batting first are very close to that from batting second. However, there is just one season where teams batting first won more, with things being equal in 2013.
Again, since 2014, things have been in favour of teams chasing except 2015. Leaving out 2015, things have been overwhelmingly in favour of teams batting second.
So, teams chosing batting second more have been justified in their decisions.
In leagues across different sports, there is always a talk about teams with "history" as in teams that have played the most in the league and continue to do so. Let's find such teams in the IPL.
total_matches_played = (matches_df.team2.value_counts() + matches_df.team1.value_counts()).sort_values(ascending = False) total_matches_played
Mumbai Indians 187 Royal Challengers Bangalore 180 Kolkata Knight Riders 178 Kings XI Punjab 176 Chennai Super Kings 164 Delhi Daredevils 161 Rajasthan Royals 147 Sunrisers Hyderabad 108 Deccan Chargers 75 Pune Warriors 46 Gujarat Lions 30 Rising Pune Supergiant 16 Delhi Capitals 16 Rising Pune Supergiants 14 Kochi Tuskers Kerala 14 dtype: int64
plt.figure(figsize=(12,6)) plt.title('Total Matches Played') total_matches_played_plot = sns.barplot(y = total_matches_played.index, x = total_matches_played) total_matches_played_plot.set(ylabel = 'Teams', xlabel = 'No. of Matches');
Mumbai Indians are the team that have played the most number of matches. They are followed by Royal Challengers Bangalore, Kolkata Knight Riders, Kings XI Punjab and Chennai Super Kings.
Chennai Super Kings and Rajasthan Royals could have been higher had they not been banned.
You will see there are two teams from Delhi, Delhi Daredevils and Delhi Capitals. This is due to the change in owners and team name in 2018.
Similar story for Deccan Chargers and Sunrisers Hyderabad with Deccan Chargers being removed from IPL from 2013 and Sunrisers coming in their place.
Also, there are two teams with almost same names. Rising Pune Supergiants and Rising Pune Supergiant. Well, they are same teams, no owners change. But it is more to do with superstitions.
In the 2016 season, Rising Pune Supergiants finished 7th. The owners changed the captain for 2017 as well as dropped the 's' from Supergiants. Well, it paid off as they finished as runner-up that season!!!
Now, teams may have a lot of history but it's their "legacy" i.e. how often they win that makes them popular and attract the new and neutral fans.
most_wins = matches_df.winner.value_counts()
Mumbai Indians 109 Chennai Super Kings 100 Kolkata Knight Riders 92 Royal Challengers Bangalore 84 Kings XI Punjab 82 Rajasthan Royals 75 Delhi Daredevils 67 Sunrisers Hyderabad 58 Deccan Chargers 29 Gujarat Lions 13 Pune Warriors 12 Rising Pune Supergiant 10 Delhi Capitals 10 Kochi Tuskers Kerala 6 Rising Pune Supergiants 5 Name: winner, dtype: int64
So Mumbai have the highest number of wins. But a better metric to judge would be the win percentage.
win_percentage = (most_wins / total_matches_played).sort_values(ascending = False) * 100 win_percentage
Rising Pune Supergiant 62.500000 Delhi Capitals 62.500000 Chennai Super Kings 60.975610 Mumbai Indians 58.288770 Sunrisers Hyderabad 53.703704 Kolkata Knight Riders 51.685393 Rajasthan Royals 51.020408 Royal Challengers Bangalore 46.666667 Kings XI Punjab 46.590909 Gujarat Lions 43.333333 Kochi Tuskers Kerala 42.857143 Delhi Daredevils 41.614907 Deccan Chargers 38.666667 Rising Pune Supergiants 35.714286 Pune Warriors 26.086957 dtype: float64
plt.figure(figsize=(12,6)) plt.title('Win Percentage') win_percentage_plot = sns.barplot(y = win_percentage.index, x = win_percentage) total_matches_played_plot.set(ylabel = 'Teams', xlabel = 'Percentage');
Rising Pune Supergiant and Delhi Capitals have the highest win percentage. This is largely due to the fact that they have played really few matches. Especially, Rising Pune Supergiant, who technically beacame a new team after leaving out the 's'.
Chennai Super Kings despite playing two less seasons than Mumbai Indians had only 9 less victories. They, along with Mumbai Indians are the only two teams in top 5 that were also part of IPL in 2008.
Chennai and Mumbai are the teams with legacy.
[jovian] Attempting to save notebook.. [jovian] Updating notebook "srijansrj5901/ipl-matches-data-analysis" on https://jovian.ml/ [jovian] Uploading notebook.. [jovian] Capturing environment.. [jovian] Committed successfully! https://jovian.ml/srijansrj5901/ipl-matches-data-analysis
We now know few things about aour data. Let's know some more!!!
ipl_win = matches_df.groupby('season').tail(1).sort_values('season', ascending = True) ipl_win
ipl_winners = ipl_win.winner.value_counts() ipl_winners
Mumbai Indians 4 Chennai Super Kings 3 Kolkata Knight Riders 2 Sunrisers Hyderabad 1 Rajasthan Royals 1 Deccan Chargers 1 Name: winner, dtype: int64
plt.figure(figsize=(15, 4)) plt.xlabel('Teams') plt.ylabel('No. of Times') plt.title('IPL Champions') sns.barplot( x = ipl_winners.index, y = ipl_winners);
Mumbai and Chennai, our legacy teams, have won the IPL atleast 3 times. Sunrisers Hyderabad are the only team that have joined the league later and have won the trophy.
matches_df.groupby('season').toss_winner.value_counts().plot(kind ='barh', figsize = (30, 60)) plt.title('Tosses Won per Season', size = 30) plt.xlabel('Seasons and Teams', size = 30) plt.ylabel('No. of Matches', size = 30); #Double Click on the graph below to zoom
Except 2012, 2015 and 2019, the IPL winning teams have been amongst the top two in terms of toss win percentage. In 2012 and 2015, Kolkata Knight Riders and Mumbai Indians were 6th best (won 7) in winning tosses while in 2019 Mumbai were 4th best (won 8).
Kolkata and Mumbai in 2013 and Chennai in 2019 have won the most no. of tosses in a season - 12.
In cricket, teams can win by runs or wickets. We will look at both the scenarios.
highest_wins_by_runs_df = matches_raw_df[matches_raw_df.win_by_runs != 0].sort_values('win_by_runs', ascending = False) highest_wins_by_runs_df
plt.figure(figsize=(25, 10)) plt.xlabel('Seasons',size=30) plt.ylabel('Runs',size=30) plt.title('Highest Wins By Runs', size = 30) sns.scatterplot(x = 'season',y = 'win_by_runs', data = highest_wins_by_runs_df, s =150, color = 'black'); sns.scatterplot(x = 'season',y = 'win_by_runs', data = highest_wins_by_runs_df.head(10), s =220, color = 'red'); for i in range(highest_wins_by_runs_df.head(10).shape): plt.annotate(highest_wins_by_runs_df.winner.tolist()[i], (highest_wins_by_runs_df.season.tolist()[i]+0.1, highest_wins_by_runs_df.win_by_runs.tolist()[i] - 1) , size = 20)