This is the Zero to GBMs Course-Project notebook.

All of the coursework was done on my local system, not in a cloud environment.

Jovian M.L Project (PUBG Finish Placement Prediction)

You are given over 65,000 games' worth of anonymized player data, split into training and testing sets, and asked to predict final placement from final in-game stats and initial player ratings.

What's the best strategy to win in PUBG? Should you sit in one spot and hide your way into victory, or do you need to be the top shot? Let's let the data do the talking!

Link to dataset: HERE

Data Description

Features
  • DBNOs - Number of enemy players knocked.

  • assists - Number of enemy players this player damaged that were killed by teammates.

  • boosts - Number of boost items used.

  • damageDealt - Total damage dealt. Note: Self-inflicted damage is subtracted.

  • headshotKills - Number of enemy players killed with headshots.

  • heals - Number of healing items used.

  • Id - Player’s Id

  • killPlace - Ranking in match of number of enemy players killed.

  • killPoints - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.

  • killStreaks - Max number of enemy players killed in a short amount of time.

  • kills - Number of enemy players killed.

  • longestKill - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.

  • matchDuration - Duration of match in seconds.

  • matchId - ID to identify match. There are no matches that are in both the training and testing set.

  • matchType - String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.

  • rankPoints - Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.

  • revives - Number of times this player revived teammates.

  • rideDistance - Total distance traveled in vehicles measured in meters.

  • roadKills - Number of kills while in a vehicle.

  • swimDistance - Total distance traveled by swimming measured in meters.

  • teamKills - Number of times this player killed a teammate.

  • vehicleDestroys - Number of vehicles destroyed.

  • walkDistance - Total distance traveled on foot measured in meters.

  • weaponsAcquired - Number of weapons picked up.

  • winPoints - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.

  • groupId - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.

  • numGroups - Number of groups we have data for in the match.

  • maxPlace - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
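
The "None" rules described for killPoints, rankPoints and winPoints can be applied before modeling by turning the sentinel values into NaN. The following is a minimal sketch on a toy frame, not part of the original pipeline; the helper name mask_rating_sentinels and the toy values are made up for illustration.

```python
import pandas as pd


def mask_rating_sentinels(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where the 'None' sentinels of the rating columns are NaN."""
    out = df.copy()
    # rankPoints == -1 stands for "None" according to the data description.
    has_rank = out["rankPoints"] != -1
    for col in ("killPoints", "winPoints"):
        # A 0 only counts as "None" when rankPoints holds a real value.
        is_sentinel = has_rank & (out[col] == 0)
        out[col] = out[col].astype("float64").mask(is_sentinel)
    out["rankPoints"] = out["rankPoints"].astype("float64").mask(~has_rank)
    return out


# Toy rows (hypothetical values), one per case in the description.
toy = pd.DataFrame({
    "killPoints": [1241, 0, 0],
    "rankPoints": [-1, 1484, -1],
    "winPoints": [1466, 0, 0],
})
masked = mask_rating_sentinels(toy)
print(masked)
```

In the third toy row rankPoints is -1, so the 0 in killPoints and winPoints is kept as a real value, which is how the description reads.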

Target
  • winPlacePerc - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
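
To make the target definition concrete, here is a small sketch of one formula that is consistent with the description above (1 for 1st place, 0 for last place, scaled by maxPlace). The exact formula is an assumption, since the description does not state it, and the function name win_place_perc is hypothetical.

```python
def win_place_perc(place: int, max_place: int) -> float:
    """Percentile placement: 1.0 for 1st place, 0.0 for the worst place.

    Assumed formula (not confirmed by the data description):
        (max_place - place) / (max_place - 1)
    """
    if max_place < 2:
        raise ValueError("max_place must be at least 2 for the ratio to be defined")
    if not 1 <= place <= max_place:
        raise ValueError("place must lie between 1 and max_place")
    return (max_place - place) / (max_place - 1)


print(win_place_perc(1, 28))   # first place
print(win_place_perc(28, 28))  # last place
print(win_place_perc(14, 27))  # middle of a 27-place match
```

Because the ratio uses maxPlace rather than numGroups, a match with skipped placements would leave gaps between the possible target values, which matches the "missing chunks" remark in the description.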

Since the target column is a continuous value, this is a regression problem. First, though, we need to look at the data and perform EDA and other steps as required.

In [1]:
# Required Imports
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
%matplotlib inline

warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', 200)
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
In [2]:
# Loading the datasets
train_df = pd.read_csv("train_V2.csv")
test_df = pd.read_csv("test_V2.csv")

print(f"Training data Shape: {train_df.shape}\nTest data shape: {test_df.shape}")
Training data Shape: (4446966, 29)
Test data shape: (1934174, 28)
In [3]:
# First five rows of the train data
train_df.head()
Out[3]:
In [4]:
# First five rows of the test data
test_df.head()
Out[4]:

Approach Map

  • Exploratory Data Analysis

  • Outlier removal

  • Feature Engineering

  • Model selection, training & prediction

1. Exploratory Data Analysis

In [5]:
# Get dtypes of all columns
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4446966 entries, 0 to 4446965
Data columns (total 29 columns):
 #   Column           Dtype
---  ------           -----
 0   Id               object
 1   groupId          object
 2   matchId          object
 3   assists          int64
 4   boosts           int64
 5   damageDealt      float64
 6   DBNOs            int64
 7   headshotKills    int64
 8   heals            int64
 9   killPlace        int64
 10  killPoints       int64
 11  kills            int64
 12  killStreaks      int64
 13  longestKill      float64
 14  matchDuration    int64
 15  matchType        object
 16  maxPlace         int64
 17  numGroups        int64
 18  rankPoints       int64
 19  revives          int64
 20  rideDistance     float64
 21  roadKills        int64
 22  swimDistance     float64
 23  teamKills        int64
 24  vehicleDestroys  int64
 25  walkDistance     float64
 26  weaponsAcquired  int64
 27  winPoints        int64
 28  winPlacePerc     float64
dtypes: float64(6), int64(19), object(4)
memory usage: 983.9+ MB

We can observe that the majority of the columns (25 of 29) are numeric, of type int64 or float64.
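
The same observation can be made programmatically instead of reading it off the info() output. A minimal sketch on a toy frame (the column values are made up); on the real data the same two lines would be applied to train_df.

```python
import pandas as pd

# Toy stand-in for train_df with a mix of string and numeric columns.
toy = pd.DataFrame({
    "Id": ["a", "b"],
    "matchType": ["solo", "duo"],
    "kills": [0, 2],
    "damageDealt": [0.0, 91.5],
})

# How many columns there are of each dtype.
dtype_counts = toy.dtypes.astype(str).value_counts()

# Share of columns that pandas considers numeric.
numeric_share = toy.select_dtypes("number").shape[1] / toy.shape[1]

print(dtype_counts)
print(f"Numeric columns: {numeric_share:.0%}")
```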

In [6]:
# Statistical info about the numeric columns
train_df.describe()
Out[6]:
In [7]:
# NaN values count
train_df.isna().sum()
Out[7]:
Id                 0
groupId            0
matchId            0
assists            0
boosts             0
damageDealt        0
DBNOs              0
headshotKills      0
heals              0
killPlace          0
killPoints         0
kills              0
killStreaks        0
longestKill        0
matchDuration      0
matchType          0
maxPlace           0
numGroups          0
rankPoints         0
revives            0
rideDistance       0
roadKills          0
swimDistance       0
teamKills          0
vehicleDestroys    0
walkDistance       0
weaponsAcquired    0
winPoints          0
winPlacePerc       1
dtype: int64

There is only one missing value, in winPlacePerc, so we can simply drop that row.

In [8]:
# Dropping the only row with missing value
train_df.dropna(inplace=True)
test_df.dropna(inplace=True)
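
A slightly more targeted variant is to inspect the affected row first and then drop only on the target column, so that a NaN in some other column can never remove rows by accident. This is a sketch on a toy frame with made-up values, not the cell that was actually run.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df with a single missing target value.
toy = pd.DataFrame({
    "kills": [0, 3, 1],
    "winPlacePerc": [0.25, np.nan, 1.0],
})

# Look at the offending row(s) before removing them.
missing_rows = toy[toy["winPlacePerc"].isna()]
print(missing_rows)

# Drop rows only where the target itself is missing.
cleaned = toy.dropna(subset=["winPlacePerc"]).reset_index(drop=True)
print(cleaned.shape)
```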

Since there are only 4 object-type columns, we can explore them first.

In [9]:
object_types = train_df.select_dtypes("object").columns.to_list()
object_types
Out[9]:
['Id', 'groupId', 'matchId', 'matchType']
In [10]:
# Working with different IDs
for ids in object_types[:-1]:
    print(f"Unique points in {ids} column are {train_df[ids].nunique()}")
Unique points in Id column are 4446965
Unique points in groupId column are 2026744
Unique points in matchId column are 47964

Some insights from the above operation:

  • The Id column contains only unique values, which makes sense as each player is assigned a unique "PlayerID" in the game.

  • The number of unique groupId values (2,026,744) is a little under half the number of unique Id values (4,446,965), which may indicate that most players prefer to play in groups.
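
The second point can be checked more directly by counting how many rows share each group within a match. The sketch below uses a tiny made-up frame with the same three ID columns; on the full data the unique counts above work out to roughly 2.19 rows per groupId.

```python
import pandas as pd

# Toy stand-in for the ID columns of train_df (values are made up).
toy = pd.DataFrame({
    "Id": ["p1", "p2", "p3", "p4", "p5"],
    "groupId": ["g1", "g1", "g2", "g3", "g3"],
    "matchId": ["m1", "m1", "m1", "m2", "m2"],
})

# Number of players in each group of each match.
group_sizes = toy.groupby(["matchId", "groupId"]).size()

# Share of players who belong to a group with more than one member.
players_in_teams = group_sizes[group_sizes > 1].sum()
share_in_teams = players_in_teams / len(toy)

print(group_sizes)
print(f"Players in multi-member groups: {share_in_teams:.0%}")
```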

In [11]:
# Let's look at the last column, "matchType"
print(f"There are {train_df['matchType'].nunique()} categories in the Match type column")
There are 16 categories in the Match type column
In [12]:
# Value count of each category
train_df["matchType"].value_counts()
Out[12]:
squad-fpp           1756186
duo-fpp              996691
squad                626526
solo-fpp             536761
duo                  313591
solo                 181943
normal-squad-fpp      17174
crashfpp               6287
normal-duo-fpp         5489
flaretpp               2505
normal-solo-fpp        1682
flarefpp                718
normal-squad            516
crashtpp                371
normal-solo             326
normal-duo              199
Name: matchType, dtype: int64
In [13]:
# Visualization of the above data
sns.countplot(x="matchType", data=train_df)
plt.title("Match type Category-wise frequency")
plt.xticks(rotation=75)
plt.show()
Notebook Image

As suggested above, most players seem to prefer squad matches. We can also observe that the 16 categories can be narrowed down to just 3, namely:

  • Squad

  • Duo

  • Solo

In [14]:
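# Narrowing the present 16 categories to 3 categories.
# Anything that is neither solo nor duo (including the crash and flare modes) is mapped to 'squad'.
mapper = lambda x: 'solo' if ('solo' in x) else 'duo' if ('duo' in x) else 'squad'
train_df['matchType'] = train_df['matchType'].apply(mapper)
# Pass the column as x explicitly; recent seaborn versions expect keyword arguments here
sns.countplot(x=train_df['matchType'])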
plt.title('Count of different types of match')
plt.show()
Notebook Image
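
To see exactly where each of the 16 raw categories ends up, the mapping rule can be written as a named function and applied to the category names listed earlier. This is an equivalent sketch of the same rule; the function name simplify_match_type is made up for illustration.

```python
from collections import Counter

# The 16 raw categories from the matchType value counts.
RAW_MATCH_TYPES = [
    "squad-fpp", "duo-fpp", "squad", "solo-fpp", "duo", "solo",
    "normal-squad-fpp", "crashfpp", "normal-duo-fpp", "flaretpp",
    "normal-solo-fpp", "flarefpp", "normal-squad", "crashtpp",
    "normal-solo", "normal-duo",
]


def simplify_match_type(match_type: str) -> str:
    """Collapse a raw matchType string into 'solo', 'duo' or 'squad'."""
    if "solo" in match_type:
        return "solo"
    if "duo" in match_type:
        return "duo"
    # Everything else, including the crash and flare modes, counts as squad.
    return "squad"


mapping = {raw: simplify_match_type(raw) for raw in RAW_MATCH_TYPES}
bucket_sizes = Counter(mapping.values())

for raw, simple in mapping.items():
    print(f"{raw:>18} -> {simple}")
print(bucket_sizes)
```

Note that the crash and flare modes contain neither "solo" nor "duo" in their names, so they fall through to the squad bucket by default.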
In [15]:
# Applying the same to test set
test_df["matchType"] = test_df["matchType"].apply(mapper)
test_df.head()
Out[15]:

Exploring the numeric columns

In [16]:
# Let's visualize the correlation of the features with target
plt.figure(figsize=(15, 15))

# Use numeric columns only; the ID and matchType columns are strings
correlation = train_df.select_dtypes("number").corr()
sns.heatmap(correlation, annot=True, fmt="0.1f", cmap="summer")
plt.title("Correlation Matrix Heatmap")
plt.show()
Notebook Image

From the above heatmap we can infer the following:

  • kills has a strong positive correlation with win %age.

  • boosts has a strong positive correlation with win %age.

  • weaponsAcquired has a strong positive correlation with win %age.

  • walkDistance has the strongest positive correlation with win %age.

  • heals has a strong positive correlation with win %age.

  • damageDealt has a strong positive correlation with win %age.
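
Rather than reading values off a 25 x 25 heatmap, the target's column of the correlation matrix can be sorted directly. The sketch below uses synthetic data (randomly generated, not the PUBG data) just to show the pattern; on the real data the same expression would be applied to train_df.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train_df: the target is built to depend on walkDistance.
rng = np.random.default_rng(0)
n = 500
walk = rng.uniform(0, 4000, n)
toy = pd.DataFrame({
    "walkDistance": walk,
    "kills": rng.poisson(1.0, n),
    "matchType": rng.choice(["solo", "duo", "squad"], n),
    "winPlacePerc": np.clip(walk / 4000 + rng.normal(0, 0.1, n), 0, 1),
})

# Correlation of every numeric feature with the target, strongest first.
target_corr = (
    toy.select_dtypes("number")
    .corr()["winPlacePerc"]
    .drop("winPlacePerc")
    .sort_values(ascending=False)
)
print(target_corr)
```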

Let's analyze each one of them.

Analyzing the Kills
In [17]:

print("The average kill by a player are:", train_df["kills"].mean())
print(f"The minimum kills by a player are: {train_df['kills'].min()} and maximum kills are: {train_df['kills'].max()}")
The average kill by a player are: 0.9247835321393355
The minimum kills by a player are: 0 and maximum kills are: 72
In [18]:
# Unique kills value
train_df["kills"].unique()
Out[18]:
array([ 0,  1,  4,  2,  9,  3,  5,  6,  8,  7, 14, 13, 15, 12, 21, 11, 10,
       17, 20, 24, 18, 16, 22, 19, 23, 35, 31, 27, 25, 48, 42, 30, 26, 65,
       39, 33, 28, 29, 34, 57, 55, 56, 36, 38, 37, 44, 66, 41, 50, 53, 43,
       32, 40, 47, 45, 46, 49, 72])
In [19]:
train_df["kills"].value_counts()
Out[19]:
0     2529721
1      928079
2      472466
3      232441
4      124543
5       66577
6       37960
7       21816
8       12779
9        7644
10       4599
11       2799
12       1755
13       1137
14        757
15        484
16        325
17        234
18        165
19        112
20        109
22         77
21         70
23         47
24         44
25         27
26         27
28         22
27         21
30         13
29         13
31         13
33         12
36          8
38          7
35          7
34          5
41          5
37          5
32          4
53          4
40          4
39          4
43          3
42          3
56          2
55          2
44          2
46          2
57          2
49          1
45          1
47          1
48          1
50          1
66          1
65          1
72          1
Name: kills, dtype: int64

We can see that kills takes quite a lot of distinct values. Since most players have between 0 and 6 kills, to visualize the distribution we can group every value greater than 6 under a single label such as "7 or more".

In [20]:
kills = train_df["kills"].copy()
# x > 6 means 7 kills or more, so label the bucket accordingly
kills = kills.apply(lambda x: "7 or more" if x > 6 else x)
kills.value_counts()
Out[20]:
0            2529721
1             928079
2             472466
3             232441
4             124543
5              66577
7 or more      55178
6              37960
Name: kills, dtype: int64
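
The same bucketing can also be done with clip and an ordered categorical, which keeps the buckets in numeric order for plotting instead of relying on string sorting. A minimal sketch on a handful of made-up kill counts; the "7+" label is just an illustrative choice.

```python
import pandas as pd

# Toy stand-in for train_df["kills"].
kills = pd.Series([0, 1, 3, 6, 7, 12, 72])

# Cap every value at 7, then label the capped bucket "7+".
capped = kills.clip(upper=7)
labels = capped.map(lambda k: "7+" if k == 7 else str(k))

# An ordered categorical fixes the bucket order as 0, 1, ..., 6, 7+.
order = [str(k) for k in range(7)] + ["7+"]
binned = pd.Categorical(labels, categories=order, ordered=True)

print(pd.Series(binned).value_counts(sort=False))
```

Passing order to the order argument of sns.countplot would give the same ordering on a plot.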
In [21]:
# Visualizing the kills count
sns.countplot(x=kills.astype(str).sort_values())
plt.title("No of kills in the game")
plt.xticks(rotation=45)
plt.show()
Notebook Image

If we take the number of kills as the main factor in winning the game, most players appear to be fairly average at it: more than half of them finish a match without a single kill.
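
As a quick check of this reading, the kill counts printed earlier can be turned into cumulative shares. The sketch below hard-codes those counts (with all values of 7 and above summed into one bucket) instead of reading the CSV again.

```python
from itertools import accumulate

# Counts copied from the kills value_counts() output; "7+" sums every row with 7 or more kills.
kill_counts = {
    "0": 2_529_721,
    "1": 928_079,
    "2": 472_466,
    "3": 232_441,
    "4": 124_543,
    "5": 66_577,
    "6": 37_960,
    "7+": 55_178,
}

total_players = sum(kill_counts.values())

# Running total of players with at most this many kills.
running_totals = dict(zip(kill_counts, accumulate(kill_counts.values())))

share_zero_kills = running_totals["0"] / total_players
share_up_to_two = running_totals["2"] / total_players

print(f"No kills: {share_zero_kills:.1%}")
print(f"Two kills or fewer: {share_up_to_two:.1%}")
```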

In [22]:
sns.scatterplot(x="winPlacePerc", y="kills", data=train_df, alpha=0.7)
plt.show()
Notebook Image

We can look at the longestKill column as well. Long-range kills are typically made with sniper rifles, and even then usually within a range of about 800 meters. So we can drop the players with a longestKill value greater than 800 meters, as they might be playing with hacks or mods.

In [23]:
# Players who killed at a range greater than 800 meters
longest = train_df[train_df["longestKill"] > 800]
longest
Out[23]:
In [24]:
# Visualizing long shots by suspected hackers
sns.histplot(longest["longestKill"], bins=10)
plt.title("Count of players killing at a range >800 meters")
plt.show()
Notebook Image

Kills at these ranges are far beyond what is typical, so these players are most likely using hacks. We can choose to drop them from our dataframe.
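
One way to carry out this drop is with a boolean mask. A minimal sketch on a toy frame with made-up values, assuming the 800-meter cutoff used above; the constant name LONGEST_KILL_LIMIT_M is illustrative.

```python
import pandas as pd

LONGEST_KILL_LIMIT_M = 800  # cutoff in meters, taken from the text above

# Toy stand-in for train_df.
toy = pd.DataFrame({
    "Id": ["a", "b", "c", "d"],
    "longestKill": [0.0, 215.4, 800.0, 1094.0],
})

# Flag the suspicious rows, then keep everything else.
suspect_mask = toy["longestKill"] > LONGEST_KILL_LIMIT_M
filtered = toy.loc[~suspect_mask].reset_index(drop=True)

print(f"Dropped {int(suspect_mask.sum())} of {len(toy)} rows")
print(filtered)
```

The comparison is strict, so a kill at exactly 800 meters is kept, which matches the "> 800" filter used in the cell above.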

In [25]:
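# Boxplot of win placement by kill-count bucket.
kills = train_df.copy()

# The last bin is open-ended: with an upper edge of 60, players with more
# kills (the maximum is 72) would get NaN and be left out of the plot.
kills['killsCategories'] = pd.cut(kills['kills'], [-1, 0, 2, 5, 10, np.inf], labels=[
    '0_kills','1-2_kills', '3-5_kills', '6-10_kills', '10+_kills'])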

plt.figure(figsize=(15,8))
sns.boxplot(x="killsCategories", y="winPlacePerc", data=kills)
plt.show()
Notebook Image

Although there are some outliers in the kills column, kills appears to be positively correlated with win %age.

Analyzing boosts column
In [26]:
print("Average boost items used by players:", train_df["boosts"].mean())
print(f"Minimum no. of boosts item used are {train_df['boosts'].min()} and maximum are {train_df['boosts'].max()}")
Average boost items used by players: 1.1069079698176172
Minimum no. of boosts item used are 0 and maximum are 33
In [27]:
train_df["boosts"].value_counts()
Out[27]:
0     2521323
1      680252
2      491316
3      295883
4      195729
5      120271
6       70111
7       37626
8       18893
9        8638
10       3992
11       1644
12        726
13        295
14        126
15         62
16         30
17         16
18         13
19          6
21          4
20          3
24          2
33          1
28          1
23          1
22          1
Name: boosts, dtype: int64
In [28]:
# Visualizing boosts
sns.countplot(x=train_df["boosts"].sort_values(ascending=False))
plt.show()
Notebook Image

The most common number of boost items used is 0, which means that most of these players probably died too early in the match to use any. If a player wins without using any boosts, they may well be a hacker.

In [29]:
sns.scatterplot(x="winPlacePerc", y="boosts", data=train_df, color="magenta", alpha=0.7)
plt.show()
Notebook Image
Analyzing weaponsAcquired
In [30]:
print("The average number of weapons acquired by players are:", train_df["weaponsAcquired"].mean())
print(f"Minimum no. of weapons acquired are {train_df['weaponsAcquired'].min()} and maximum are {train_df['weaponsAcquired'].max()}")
The average number of weapons acquired by players are: 3.6604884454903512
Minimum no. of weapons acquired are 0 and maximum are 236

The maximum value of weapons acquired is 236, which is implausible for a single match. The minimum is 0, which suggests that these players died before acquiring any weapons (i.e. at the start of the match).

In [31]:
sns.scatterplot(x="winPlacePerc", y="weaponsAcquired", data=train_df, color="orange", alpha=0.7)
plt.show()
Notebook Image

Since each player starts with no weapons and tries to acquire better ones throughout the match, it seems that the more weapons a player acquires, the higher the probability of winning (although there are some outliers that need to be taken care of).

In [32]:
# No weapons acquired but still won the game
no_weapons_won = train_df[(train_df["weaponsAcquired"] == 0) & (train_df["winPlacePerc"] == 1)]
no_weapons_won
Out[32]:

We can observe that around 200 players did not pick up a single weapon but still managed to win the game. Either they were hiding in a corner throughout the game or they are fraudsters.

Analyzing walk distance
In [33]:
print(f"The average walk distance is {train_df['walkDistance'].mean()} meters.")
The average walk distance is 1154.2181186480893 meters.
In [34]:
train_df["walkDistance"].value_counts()
Out[34]:
0.0000       99602
1007.0000      955
1098.0000      945
1047.0000      939
1036.0000      934
             ...  
0.8005           1
0.3570           1
7935.0000        1
0.8721           1
0.9661           1
Name: walkDistance, Length: 38599, dtype: int64

About 99,600 players walked 0 meters, which means that they either died at the very start or were disconnected from the game.

In [35]:
# Visualizing walk distance
data = train_df[train_df['walkDistance'] < train_df['walkDistance'].quantile(0.99)]
plt.figure(figsize=(15,7))
plt.title("Walking Distance Distribution")
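# distplot is deprecated in recent seaborn; histplot with a KDE on a density scale is the closest equivalent
sns.histplot(data['walkDistance'], kde=True, stat="density")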
plt.show()
Notebook Image
In [36]:
sns.scatterplot(x="winPlacePerc", y="walkDistance", data=train_df, color="green", alpha=0.7)
plt.show()