Jovian
⭐️
Sign In
Learn data science and machine learning by building real-world projects on Jovian

Premier League - An Exploratory Data Analysis

Season 2020-21 of English Premier League - Arguably the world's most entertaining football has come to an end. Congratulations to Manchester City for winning the league. It has been a different season without fans in the stadium, but it has given us some nail biting finishes, wild and breathtaking matches, memorable goals and cherishable moments. Before we get our studs and shin pads ready for next season, why not check some interesting trends in the 29 seasons of the beautiful game played so far and learn step-by-step approach to doing Exploratory Data Analysis along the way? Let's dig in.

The dataset we're using is a dataset on kaggle containing information about more than 10,000 Premier League matches played. The dataset can be found on https://www.kaggle.com/irkaal/english-premier-league-results.

We will use python libraries NumPy,Pandas,Matplotlib and Seaborn in this project. If you read through, you would be able to perform EDA on any dataset of your choice!

Downloading the Dataset

Let's start by downloading the dataset from kaggle. Here we use the opendatasets library made for python to download the same. By passing the URL of the kaggle page for the dataset to opendatasets.download(), we will download the dataset to our Jupyter notebook directly. To execute following cells select the cell and press Shift + Enter. Make sure you execute all the cells.

In [2]:
!pip install jovian opendatasets --upgrade --quiet

Let's begin by downloading the data, and listing the files within the dataset.

In [3]:
dataset_url = 'https://www.kaggle.com/irkaal/english-premier-league-results' 
import opendatasets as od
od.download(dataset_url)
Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds Your Kaggle username: adityapatkar Your Kaggle Key: ··········
100%|██████████| 289k/289k [00:00<00:00, 35.6MB/s]
Downloading english-premier-league-results.zip to ./english-premier-league-results

The dataset has been downloaded and extracted.

In [4]:
data_dir = './english-premier-league-results'
In [5]:
import os
os.listdir(data_dir)
Out[5]:
['results.parquet', 'results.csv']

Our data resides in results.csv file.

In [6]:
project_name = "English Premier League"
In [7]:
!pip install jovian --upgrade -q
In [8]:
import jovian
In [ ]:
jovian.commit(project=project_name)
[jovian] Detected Colab notebook... [jovian] Please enter your API key ( from https://jovian.ai/ ): API KEY: ·········· [jovian] Uploading colab notebook to Jovian... [jovian] Capturing environment.. [jovian] Committed successfully! https://jovian.ai/aditya-pat10/english-premier-league

Data Preparation and Cleaning

In any data analysis project, while working with real world raw data, it is very necessary to ready the data for our analysis. There could be wrong values, missing values that need to be dealt with. Along with that, we might want to add new columns which are useful for our analysis to the dataset, we might want to merge a few datasets together, this should be done as a preliminary step.

In our case, let's start by converting the dataset into a Pandas dataframe. Pandas is a python library which gives us handy functions for data cleaning, merging, operations etc. It creates onject called DataFrame which is basically data in tabular form. We can read different types of files eg. CSV, JSON, XLSX etc. and create a dataframe using the same.

To know more about Pandas, check my Notion notebook. (https://www.notion.so/Pandas-79a9e56f662240bf990fdc8fde1ad40d)

In [9]:
import pandas as pd
import numpy as np
In [10]:
EPL_data = pd.read_csv('./english-premier-league-results/results.csv', encoding = 'unicode_escape')

Let's take a look at our dataset

In [11]:
EPL_data
Out[11]:

The dataset contains results of all the matches 1993-94 season to 2020-21 season. Labels to note :

  • FTHG : Full time home goals. *FTAG : Full time away goals. *FTR : Full time Result. *HTHG : Half time home goals. *HTAG : Half time home goals. *HTR : Half time home goals. *HS : Home team shots. *AS : Away team shots. *HST : Home team shots on target. *AST : Away team shots on target. *HC : Home team corners. *AC : Away team corners. *HF : Home team Fouls. *AF : Away team Fouls. *HY : Home team yellow cards. *AY : Away team yellow cards. *HR : Home team red cards. *AR : Away team red cards.

We can check the shape of our dataset using the .shape method. It returns a tuple in the form (Number of Rows, Number of columns) As we can see, the data contains information about more than 10000 Premier League matches.

In [12]:
EPL_data.shape
Out[12]:
(10794, 23)

.info method provides us a way to describe data, it shows us name of the column, number of non-null values and type of the data.

In [13]:
EPL_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10794 entries, 0 to 10793 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Season 10794 non-null object 1 DateTime 10794 non-null object 2 HomeTeam 10794 non-null object 3 AwayTeam 10794 non-null object 4 FTHG 10794 non-null int64 5 FTAG 10794 non-null int64 6 FTR 10794 non-null object 7 HTHG 9870 non-null float64 8 HTAG 9870 non-null float64 9 HTR 9870 non-null object 10 Referee 7970 non-null object 11 HS 7970 non-null float64 12 AS 7970 non-null float64 13 HST 7970 non-null float64 14 AST 7970 non-null float64 15 HC 7970 non-null float64 16 AC 7970 non-null float64 17 HF 7970 non-null float64 18 AF 7970 non-null float64 19 HY 7970 non-null float64 20 AY 7970 non-null float64 21 HR 7970 non-null float64 22 AR 7970 non-null float64 dtypes: float64(14), int64(2), object(7) memory usage: 1.9+ MB

As we can see, there are a few columns with null values. This must be as record keeping of exhaustive stats started later than inaugration of Premier League. Let's see number of null cells for every column using isna function and delete the rows which contain the null values using .drop method. We have to provide axis argument where 0 = rows and 1 = columns.

In [14]:
pd.isna(EPL_data).sum()
Out[14]:
Season         0
DateTime       0
HomeTeam       0
AwayTeam       0
FTHG           0
FTAG           0
FTR            0
HTHG         924
HTAG         924
HTR          924
Referee     2824
HS          2824
AS          2824
HST         2824
AST         2824
HC          2824
AC          2824
HF          2824
AF          2824
HY          2824
AY          2824
HR          2824
AR          2824
dtype: int64

There are null values in many columns. For the simplicity, let's get rid of all the rows which contain null values. Looks like those stats were not available during a few seasons.

In [15]:
EPL_data = EPL_data.drop(range(0,(2824)), axis =0)
pd.isna(EPL_data).sum()
Out[15]:
Season      0
DateTime    0
HomeTeam    0
AwayTeam    0
FTHG        0
FTAG        0
FTR         0
HTHG        0
HTAG        0
HTR         0
Referee     0
HS          0
AS          0
HST         0
AST         0
HC          0
AC          0
HF          0
AF          0
HY          0
AY          0
HR          0
AR          0
dtype: int64

We have verified that all the null values have been dropped. Let's see how our dataset looks now.

In [16]:
EPL_data.head()
Out[16]:

The exact time is not really that important for us. We can just extract the month and day using .month and .day attributes of DateTime ojects. We will then drop the DateTime column as it is not needed.

In [17]:
EPL_data['Month'] = pd.DatetimeIndex(EPL_data['DateTime']).month
EPL_data['Day'] = pd.DatetimeIndex(EPL_data['DateTime']).day
EPL_data = EPL_data.drop('DateTime', axis = 1)
EPL_data.head()
Out[17]:

Our data is now ready for visualizations. Let's save our work before proceeding.

Exploratory Analysis and Visualization

We have now prepared our dataset such that we can use it to get interesting insights into it. Let's compute a few things that will tell us more about the data. For that we will be using two data visualisation libraries, namely matplotlib and seaborn. These libraries contain useful methods to plot graphs and charts. Matplotlib provides basic functionality while Seaborn builds on top of matplotlib to give advanced functionality in very few lines of code.

To know more about matplotlib and seaborn, check out my Notion notebook. (https://www.notion.so/Data-Visualization-34175fa561104d0295a6c8d46ea43045) You can also go through the documentation for both by simply visiting their respective websites.

Data insights using .describe()

Before plotting graphs, Let's first get some insights from our dataset using the .describe() method provided by pandas.

In [19]:
EPL_data.describe()
Out[19]:

This gives us some interesting preliminary insights into our data. As we can see,

  1. On an average the home team scores more goals and takes more shots.
  2. The away team commits more fouls.
  3. Away team gets more yellow and red cards.
  4. Most goals scored by one team in a match are 9.
  5. Most goals scored at hal time are 5.

Let's importmatplotlib.pyplot and seaborn.

In [20]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (20, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Season-wise analysis

Let's look at the data and compare the trends based on season. Here we will first sort the data according to the season by using .groupby() function.

In [21]:
EPL_by_season =  EPL_data.groupby(['Season']).sum()
EPL_by_season
Out[21]:
Goals scored per season

Let's compare number of goals scored per season. We will create a new column which would be the addition of goals scored by home team and goals scored by away team.

In [22]:
EPL_by_season['Goals'] = EPL_by_season['FTHG'] + EPL_by_season['FTAG']
EPL_by_season = EPL_by_season.reset_index()

Now that we have the dataframe, let's plot the bar graph. Here we're using the barplot() function from seaborn library. As arguments we'll provide Goals series for x axis, Season series for y axis and a data argument. Seaborn has good support for pandas. By providing a dataframe in the 'data' argument, seaborn will automatically find the x and y series' from the given dataframe.

In [23]:
sns.barplot('Goals','Season',   data = EPL_by_season)
plt.title('Number of Goals Scored Per Season') 
plt.ylabel('Goals')
/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning
Out[23]:
Text(0, 0.5, 'Goals')
Notebook Image

As we can see above,

  1. Number of goals scored per season have been fairly similar, around 1000.
  2. 2018-19 season saw the highest number of goals scored.
  3. 2005-06 season saw the lowest number of goals scored.
Home Goals vs Away Goals

We will plot a double barplot for the same. Before we do that, SNS requires us to melt our data which lets us use the comparison parameter as hue. It will get clear when you compare the data frame below with our original data frame.

In [24]:
EPL_melted = EPL_by_season[['Season', 'FTHG', 'FTAG']]
EPL_melt = EPL_melted.melt("Season",var_name="H/A",value_name="Goals")
EPL_melt
Out[24]:
In [25]:
sns.barplot('Goals', 'Season', hue = 'H/A',data = EPL_melt)
plt.title('Number of Goals Scored Per Season') 
plt.ylabel('Season')
plt.legend(['Home Goals', 'Away Goals'])
/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning
Out[25]:
<matplotlib.legend.Legend at 0x7fadde1a43d0>
Notebook Image
  1. We can note that Goals scored by home team are almost always more than goals scored by away team.
  2. In 2020-21 Season, the games were played behind close doors. The absense of fans clearly reflects in the stats. Goals scored by away team are more than those scored by home team for this season.
Plotting a heatmap for distribution of goals per month

Now let's prepare our data for a heatmap that shows us the goals scored per month with respect to each season. For that, we use a special funcion provided by pandas called .pivot() which converts the dataframe into a 2D matrix, with first argument (series) being rows, 2nd argument being columns and third argument being the values corresponding to series 1 and series 2. Check it out below.

In [26]:
EPL_heat = EPL_data.groupby(['Season','Month']).sum().reset_index()
EPL_heat['Goals'] = EPL_heat['FTHG'] + EPL_heat['FTAG']
Final = EPL_heat.pivot('Season','Month', 'Goals' )
Final
Out[26]:

Now we plot a heatmap by passing the dataset to the heatmap() function provided by seaborn. annot argument can be set to True if you want to show the values in the blocks. Try it out!

In [27]:
sns.heatmap(Final, annot = True, cmap = 'Blues',fmt='g')
plt.title('Goals Distribution based on month') 
Out[27]:
Text(0.5, 1.0, 'Goals Distribution based on month')
Notebook Image

The above graphs tells us more about the data.

  1. No games are played in the months of June and July, thus no goals are scored.
  2. Almost every year, December has seen the most number of goals. This is because december sees the most number of matches played.
  3. The 2019-20 season was disrupted due to the pandemic. Games were played in June and July instead of April and May under project restart.
  4. The season ends mid-May, thus the low number of goals scored in May.

Let's quickly plot another graph of Goals scored per game. It is fairly simple.

In [28]:
EPL_count = EPL_data.groupby(['Season','Month']).count().reset_index()
EPL_heat['G/M'] = round(EPL_heat['Goals'] / EPL_count['FTHG'])
Final = EPL_heat.pivot('Season', 'Month', 'G/M')
Final
Out[28]:
In [29]:
sns.heatmap(Final, annot = True, cmap = 'Blues',fmt='g')
plt.title('Goals per match based on month') 
Out[29]:
Text(0.5, 1.0, 'Goals per match based on month')
Notebook Image

Here we can observe that Goals per game remains fairly obvious which shows that there are no real visible "Settling period", "Thriving period" or "Burnout period" or at least it doesn't affect the scoreline much. Pretty interesting that September witnessed the highest goals per game twice.

Shot comparison using line chart

Now, let's explore how the Shots taken by home team compare to those taken by away team for every Season. We will use the same EPL_by_season dataframe that we created before.

We can plot multiple the lines in same graph by writing all sns.lineplot() calls in the same cell as seen below.

In [55]:
import warnings
warnings.filterwarnings('ignore')
sns.lineplot(EPL_by_season.Season, EPL_by_season.HS, data = EPL_by_season)
sns.lineplot(EPL_by_season.Season, EPL_by_season.AS, data = EPL_by_season)
sns.lineplot(EPL_by_season.Season, EPL_by_season.HST, data = EPL_by_season)
sns.lineplot(EPL_by_season.Season, EPL_by_season.AST, data = EPL_by_season)
plt.legend(['Home Shots','Away Shots','Home Shots on Target','Away Shots on Target' ])
plt.ylabel('Shots')
plt.title('Home Shots vs Away Shots') 
plt.xticks(rotation=60)

Out[55]:
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
 <a list of 21 Text major ticklabel objects>)
Notebook Image

This is a very interesting chart. It tells us that,

  1. Around 60% of the shots taken end up on target.
  2. See how no line intersects. Shots taken by home team and shots on target by home team are always more than away team. There's clearly a home advantage.
  3. The lowest difference between the lines come during 2020-21 season. This season there were no fans allowed. This shows how fans literally play the role of 12th man of the team.
  4. Even the the shots lines remain fairly flat, there's a remarkable fall in Shots on target when it comes to 2013-14 season.

Team-wise comparison

Let's see how the stats compare based on teams that have played in the top division of English football. We would need to create two groupby objects because we don't have a teams column. We have HomeTeam and AwayTeam columns. We will use these objects further in our analysis.

In [32]:
EPL_home = EPL_data.groupby(EPL_data.HomeTeam)
EPL_away = EPL_data.groupby(EPL_data.AwayTeam)
Total goals scored by each team over all seasons

We will sum the goals scored by home teams from the EPL_home object and goals scored by away teams from the EPL_away object. We will plot a simple bar graph to visualise the same.

In [33]:
EPL_team_goals = EPL_home['FTHG'].sum() + EPL_away['FTAG'].sum()
EPL_team_goals = EPL_team_goals.reset_index()
EPL_team_goals.rename( columns={0 :'Goals'}, inplace=True )
EPL_team_goals = EPL_team_goals.sort_values(['Goals'], ascending=False)
In [34]:
matplotlib.rcParams['figure.figsize'] = (20, 20)
sns.barplot("Goals",'HomeTeam', data = EPL_team_goals )
plt.title('Overall Goals Scored')
plt.ylabel('Team')

/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning
Out[34]:
Text(0, 0.5, 'Team')
Notebook Image

As we can see,

  1. Manchester United has scored the highest goals in Premier League since 2000-01 season till date.
  2. The traditional "big-six" teams occupy the top six positions.
  3. Bradford has scored the least goals in premier league.
Teamwise shot conversion rate of teams

Let's find which team has been the most lethal. Shot conversion rate is basically how many shots taken by a team end up in the net. We will create a new column which will divide goals scored by shots taken by the team.

In [35]:
EPL_team_HomeSCR = EPL_home['FTHG', 'HS'].sum().reset_index() 
EPL_team_AwaySCR = EPL_away['FTAG', 'AS'].sum().reset_index() 
EPL_team_HomeSCR.rename( columns={'HomeTeam' :'Team'}, inplace=True )
EPL_team_AwaySCR.rename( columns={'AwayTeam' :'Team'}, inplace=True )
EPL_team_SCR = pd.concat([EPL_team_HomeSCR, EPL_team_AwaySCR['FTAG'], EPL_team_AwaySCR['AS']], axis = 1)
EPL_team_SCR['SCR'] = (EPL_team_SCR['FTHG'] + EPL_team_SCR['FTAG']) / (EPL_team_SCR['AS'] + EPL_team_SCR['HS']) * 100
EPL_team_SCR
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead. """Entry point for launching an IPython kernel. /usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:2: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
Out[35]:
In [36]:
EPL_team_SCR = EPL_team_SCR.sort_values(['SCR'], ascending=False)
matplotlib.rcParams['figure.figsize'] = (20, 20)
sns.barplot("SCR",'Team', data = EPL_team_SCR )
plt.title('Shot Conversion Rate')
plt.ylabel('Team')
/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning
Out[36]:
Text(0, 0.5, 'Team')
Notebook Image

We can see that

  1. Over the seasons Arsenal has been the most lethal team followed by Manchester United.
  2. Huddersfield has been the most wasteful with the shots they have gotten.
  3. Brighton have always been called out by pandits for playing really attacking football but not converting the chances into goals. Above graph proves that.
Home win percentage

Teams refer to their home ground as fortress. Let's see who has defended their fortress the most. We will calculate percentage of home matches that a team has won. We will use the loc method of dataframe to find the rows where FTR is H or in simple language, the home team has won.

In [37]:
home_win=100*round((EPL_data.loc[EPL_data['FTR']=='H'].groupby('HomeTeam')['FTR'].count()/EPL_data.groupby('HomeTeam')['FTR'].count()),3)
home_win.sort_values(ascending = False)
Out[37]:
HomeTeam
Man United          69.9
Arsenal             67.1
Chelsea             66.4
Liverpool           63.1
Man City            62.3
Tottenham           58.1
Everton             49.6
Newcastle           44.9
Ipswich             44.7
Stoke               42.6
Fulham              42.3
Leicester           42.3
Leeds               41.5
West Ham            41.1
Blackburn           41.1
Portsmouth          40.6
Charlton            40.6
Reading             40.4
Southampton         39.5
Bolton              38.8
Swansea             38.3
Birmingham          37.6
Sheffield United    37.5
Middlesbrough       37.4
Aston Villa         36.7
Wolves              36.4
Bournemouth         35.8
Burnley             35.3
Norwich             33.3
Watford             33.3
Crystal Palace      32.7
Wigan               31.6
West Brom           30.8
Hull                30.5
Sunderland          30.5
Cardiff             28.9
Brighton            28.9
Blackpool           26.3
QPR                 26.3
Derby               24.6
Bradford            21.1
Huddersfield        21.1
Coventry            21.1
Name: FTR, dtype: float64
In [38]:
home_win.sort_values(ascending=True).plot(kind = 'barh')
plt.xlabel('Percentage')
plt.ylabel('Team')
plt.legend(['% Wins'])
plt.title("Home Ground Win %")
Out[38]:
Text(0.5, 1.0, 'Home Ground Win %')
Notebook Image
  1. Manchester United has won the most % of the games played at Old Trafford, followed by Arsenal at Emirates and Highbury, and Chelsea at Stamford Bridge.
  2. Coventry holds the unwanted record of winning the least games of those played at St. Andrew's stadium and Ricoh Arena.

Deep Dive

Now we basically know fundamental details about the dataset we have. It's interesting how just a few lines of code can make sense of a huge dataset containing raw data and lots of values. The visualisations help to a great extent. We started with knowing nothing about the data except the columns it contained. Now just a few lines of codes later, we know a lot more about what goes on in the Premier League and the major trends.

Now let's dive deeper and answer a few interesting questions about the dataset which can help us understand the data even better.

Q1: Which Referee gave the most red cards?

To answer this, we can group the data based on Referee and sum the values. We will plot the graph for red cards.

In [40]:
EPL_ref = EPL_data.groupby(['Referee']).sum().reset_index()
EPL_ref['RC'] = EPL_ref['HR'] + EPL_ref['AR']
EPL_ref = EPL_ref.loc[EPL_ref['RC'] != 0].sort_values(['RC'], ascending=False)
matplotlib.rcParams['figure.figsize'] = (20, 30)
sns.barplot("RC",'Referee', data = EPL_ref )
plt.title('Red Cards given by Referees')
plt.ylabel('Referee')
plt.xlabel('Red Cards')
/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning
Out[40]:
Text(0.5, 0, 'Red Cards')
Notebook Image

Quite a lot of referees, but it is Mike Dean who tops the list, being the only referee to award more than 100 red cards.

Q2: Which season saw the most comebacks?

Let's define a comeback as a situation where a team which is trailing at half time ends up winning the game.

We will find results where,

  1. HTR is not equal to FTR
  2. FTR or HTR is not equal to D or draw.
In [41]:
EPL_data['Comeback']=np.where((EPL_data['FTR']!=EPL_data['HTR'])&(EPL_data['FTR']!='D') & (EPL_data['HTR']!='D'),1,0)
EPL_comeback = EPL_data.groupby('Season').sum().reset_index()

In [42]:
plt.rcParams['figure.figsize'] = 25, 10;
sns.barplot('Season', 'Comeback', data = EPL_comeback)
plt.title('Comebacks per season');
plt.xticks(rotation=60)
plt.ylabel('Comebacks')

/usr/local/lib/python3.7/dist-packages/seaborn/_decorators.py:43: FutureWarning: Pass the following variables as keyword args: x, y. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. FutureWarning
Out[42]:
Text(0, 0.5, 'Comebacks')