
Analysis of "Pokemon with stats" dataset

TitleImg

This notebook analyzes a dataset comprising all Pokemon of the first 6 generations. The dataset contains 13 columns. Each Pokemon has a unique number that corresponds to its number in the Pokedex, and a Name. Most of the other columns contain the stats of each Pokemon. In addition, there is information about its Types, the Generation it belongs to, and whether it is a legendary Pokemon.

The Dataset originates from Kaggle. Link to Kaggle dataset.

I do not own the copyright of this picture! Link to Banner.

To use Jovian, the jovian library has to be installed and imported.

In [5]:
project_name = "zerotopandas-course-project-Pokemon_v1"
In [2]:
!pip install jovian --upgrade -q
In [3]:
import jovian
In [6]:
jovian.commit(project = project_name)
[jovian] Attempting to save notebook..
[jovian] Error: Failed to detect notebook filename. Please provide the correct notebook filename as the "filename" argument to "jovian.commit".

Additionally, several libraries that are used in this notebook are imported.

In [ ]:
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Later on, a seaborn pairplot is generated, which raises several warnings but works anyway. The warnings module is imported to suppress them.

In [ ]:
import warnings
warnings.filterwarnings("ignore")

The following parameters set the style of the plots generated later on.

In [ ]:
sns.set_style('white')
In [ ]:
plt.rcParams['font.size'] = 12
plt.rcParams['figure.figsize'] = (15, 5)

Data Preparation and Cleaning

Beforehand I downloaded the dataset to my local drive. The dataset is now imported from the file Pokemon.csv.

In [ ]:
Pokemon = pd.read_csv('Pokemon.csv')

Let's take a quick look at the dataset to get an overview of what we are dealing with, using the head function and the shape attribute.

In [ ]:
Pokemon.head(10)
In [ ]:
Pokemon.shape

First, we replace the whitespace in the column names with underscores to avoid problems when accessing columns.

In [ ]:
Pokemon.columns = Pokemon.columns.str.replace(' ', '_')
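
As a small illustration of this renaming step, the sketch below applies the same replacement to a hand-written list of column names. The names are assumptions modelled on the columns used later in this notebook; they are not read from Pokemon.csv.

```python
import pandas as pd

# Hypothetical column names, written by hand for illustration
columns = pd.Index(['Name', 'Type 1', 'Sp. Atk', 'Sp. Def'])

# Replace every space with an underscore (literal match, no regex)
renamed = columns.str.replace(' ', '_', regex=False)

print(list(renamed))
```

Note that only the space is replaced, so the dot stays: the special attack and defense columns end up as Sp._Atk and Sp._Def.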

As we can see, there is also a small problem with the Pokemon names. For Mega evolutions the name is redundant, because the name of the base Pokemon is prepended to it. We correct this by removing everything in front of "Mega".

In [ ]:
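# regex=True is required on pandas >= 2.0, where str.replace defaults to literal matching
Pokemon['Name'] = Pokemon['Name'].str.replace(r".*(?=Mega)", "", regex=True)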
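
To see what the pattern does in isolation, here is a minimal sketch that uses Python's built-in re module instead of pandas. The sample names are hand-written assumptions in the style of the dataset.

```python
import re

# Hand-written sample names (assumed, not read from Pokemon.csv)
names = ['Venusaur', 'VenusaurMega Venusaur', 'Meganium']

# '.*' matches everything up to the point where the lookahead '(?=Mega)'
# succeeds. The lookahead consumes nothing, so 'Mega' itself is kept.
pattern = re.compile(r'.*(?=Mega)')
cleaned = [pattern.sub('', name) for name in names]

print(cleaned)
```

Names without "Mega" are left untouched, and a name that merely starts with "Mega", such as Meganium, is not shortened because there is nothing in front of it to remove.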

Now we set the Name column as the index to easily identify each Pokemon by its name.

In [ ]:
Pokemon = Pokemon.set_index('Name')

As every Pokemon can have one or two different Types, we generate a new column called Type_combined, which contains the combination of both types.

In [ ]:
Pokemon['Type_combined'] = Pokemon[['Type_1', 'Type_2']].fillna('').sum(axis=1)
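
The idea behind this cell can be sketched on a toy frame. The example below uses plain string concatenation with `+` rather than `sum(axis=1)`. This is a swapped-in but equivalent way to join the two columns, and the values are made up for illustration.

```python
import pandas as pd

# Toy data: the second Pokemon has no second type (assumed values)
toy = pd.DataFrame({'Type_1': ['Grass', 'Fire'],
                    'Type_2': ['Poison', None]})

# Fill the missing second type with an empty string, then concatenate
toy['Type_combined'] = toy['Type_1'] + toy['Type_2'].fillna('')

print(toy['Type_combined'].tolist())
```

A Pokemon with a single type simply keeps its Type_1 as the combined type. The order of the two types matters, so "GrassPoison" and "PoisonGrass" would count as different combinations.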
In [ ]:
Pokemon.head()

Exploratory Analysis and Visualization

First, we want to know how many Pokemon are contained in our dataset.

In [ ]:
Number_of_Pokemon = Pokemon.shape[0]
print('There are {} Pokemon in this dataset.'.format(Number_of_Pokemon))

To calculate the basic statistics of the dataset, we use the describe function.

In [ ]:
Pokemon.describe()

Additionally, we can quickly plot all columns against each other to see potential connections and distributions, using the pairplot function from the seaborn library.

In [ ]:
sns.pairplot(Pokemon, hue="Generation");

How many generations of Pokemon are contained in this dataset?

In [ ]:
Pokemon['Generation'].unique()

As categorical data does not seem to be shown in the pairplot, let's see which Types and type combinations our Pokemon possess.

In [ ]:
Pokemon['Type_1'].unique()
In [ ]:
Pokemon['Type_2'].unique()
In [ ]:
Pokemon['Type_combined'].unique()

In the following, most of the analysis will focus on the Total-score column. Therefore we first have to test whether the data in this column follow a normal distribution.

In [ ]:
plt.hist(Pokemon.Total, bins=15)
In [ ]:
p_thresh = 1e-3
p_norm = stats.normaltest(Pokemon.Total)[1]
In [ ]:
if p_norm < p_thresh:
    print('The p-value is smaller than {} and therefore the data do not follow a normal distribution. Be aware of that fact when selecting statistical tests.'.format(p_thresh))
else:
    print('The p-value is greater than or equal to {}. The data seem to be normally distributed. Go on with your statistics.'.format(p_thresh))

As the data in the Total column are not normally distributed, ranks were generated with the rank function for testing significant differences.

In [ ]:
Pokemon['Rank'] = Pokemon['Total'].rank(method='first', ascending=False)
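
The following sketch shows how this ranking treats ties, using four made-up totals. With `ascending=False` the highest total gets rank 1, and with `method='first'` tied values are ranked in the order in which they appear.

```python
import pandas as pd

# Made-up totals: 'A' and 'C' are tied for the highest score
totals = pd.Series([780, 600, 780, 180], index=['A', 'B', 'C', 'D'])

# Rank 1 is the highest total; ties are broken by order of appearance
ranks = totals.rank(method='first', ascending=False)

print(ranks.to_dict())
```

Because of this tie-breaking rule every Pokemon gets a unique rank, but among Pokemon with identical totals the one listed first in the dataset is ranked as the stronger one.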

Asking and Answering Questions

Question 1

Which are the strongest types of Pokemon, by Type_1 and by the Type_combined column? With Type_combined we also take Type_2 into account without being redundant.

In [ ]:
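# numeric_only=True skips the string columns, which recent pandas versions cannot average
Pokemon.groupby('Type_1').mean(numeric_only=True).sort_values(
    by='Total', ascending=False).head(10)
In [ ]:
Pokemon.groupby('Type_combined').mean(numeric_only=True).sort_values(
    by='Total', ascending=False).head(10)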
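
A minimal sketch of this group-and-sort pattern on made-up data is shown below. It passes `numeric_only=True` so that the string column is skipped when averaging, which pandas 2.x requires for frames that contain text columns.

```python
import pandas as pd

# Made-up example data, not taken from Pokemon.csv
toy = pd.DataFrame({
    'Type_1': ['Dragon', 'Bug', 'Dragon', 'Bug'],
    'Type_2': ['Flying', None, None, 'Poison'],  # text column, skipped
    'Total': [600, 200, 700, 400],
})

# Average the numeric columns per type, strongest type first
mean_by_type = (toy.groupby('Type_1')
                   .mean(numeric_only=True)
                   .sort_values(by='Total', ascending=False))

print(mean_by_type)
```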

Question 2

Seeing the mean total scores of the different types raises the question of whether some types are disproportionately stronger than others. Therefore we calculate, for each Type_1, the difference between the mean and the median rank.

In [ ]:
mean_Rank_Type_1 = Pokemon.groupby(by='Type_1')['Rank'].mean()
In [ ]:
mean_Rank_Type_1.sort_values(ascending=True)
In [ ]:
median_Rank_Type_1 = Pokemon.groupby(by='Type_1')['Rank'].median()
In [ ]:
median_Rank_Type_1.sort_values(ascending=True)
In [ ]:
shifted_distribution_rank = mean_Rank_Type_1 - median_Rank_Type_1
In [ ]:
shifted_distribution_rank.sort_values(ascending=False)
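
To make the meaning of this difference concrete, here is a small sketch with two made-up groups. Group A contains one much larger rank value than the rest, while group B is symmetric.

```python
import pandas as pd

# Made-up ranks for two hypothetical types
toy = pd.DataFrame({
    'Type_1': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Rank': [1, 2, 9, 4, 5, 6],
})

grouped = toy.groupby('Type_1')['Rank']

# Mean minus median per group: zero for a symmetric group, positive when
# a few large rank values pull the mean above the median
shift = grouped.mean() - grouped.median()

print(shift.to_dict())
```

For group A the single large value pulls the mean (4) above the median (2), so the difference is positive. For the symmetric group B it is zero.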
In [ ]:
plt.plot(shifted_distribution_rank.sort_values(ascending=False))
plt.hlines(y=0, xmin=0, xmax=17, linestyles='dashed', alpha=0.25)
plt.title('Difference of mean and median rank grouped by Type 1')

Question 3

This plot suggests that when picking a Flying or Dragon Pokemon, the chances of getting a strong Pokemon are considerably larger than when picking a Fairy Pokemon. This raises the question of whether there are significant differences between the different Types. Therefore we run a t-test to check for significant differences.

In [ ]:
Dragon = Pokemon.Rank[Pokemon.Type_1 == 'Dragon']
Flying = Pokemon.Rank[Pokemon.Type_1 == 'Flying']
Fairy = Pokemon.Rank[Pokemon.Type_1 == 'Fairy']
In [ ]:
stats.ttest_ind(Dragon, Flying)
In [ ]:
stats.ttest_ind(Dragon, Fairy)

Testing the types Flying and Dragon leads to a p-value of 0.66, which means there is no significant difference. On the other hand, testing the types Dragon and Fairy gives a p-value of 0.004, which means there is a highly significant difference.
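
How the result of `stats.ttest_ind` can be read is sketched below on made-up samples. The function returns the test statistic and the p-value, and the sample values are chosen only to give two clear-cut cases.

```python
from scipy import stats

# Made-up samples: two identical groups and one clearly shifted group
same_a = [1, 2, 3, 4, 5]
same_b = [1, 2, 3, 4, 5]
shifted = [101, 102, 103, 104, 105]

# The result unpacks into (statistic, p-value)
t_same, p_same = stats.ttest_ind(same_a, same_b)
t_diff, p_diff = stats.ttest_ind(same_a, shifted)

print('identical groups: p = {:.3f}'.format(p_same))
print('shifted groups:   p = {:.3g}'.format(p_diff))
```

Identical groups give a p-value of 1, while the clearly shifted groups give a p-value far below any common significance threshold.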

Question 4

Which are the strongest Pokemon in the game with respect to their Total-scores? As we already know that the total scores do not follow a normal distribution, we use the Rank column for this analysis.

In [ ]:
Pokemon.sort_values(by='Rank', ascending=True).head()

We also plot the Pokemon's ranks against their Total-scores, using Type_1 as the color code.

In [ ]:
Rank_plot = Pokemon.sort_values(by='Rank', ascending=True)[0:100]
In [ ]:
sns.scatterplot(x='Rank',  # keyword arguments are required by recent seaborn versions
                y='Total',
                hue='Type_1',
                s=50,
                alpha=1,
                data=Rank_plot)

plt.title('Ranking Top 100 Pokemon')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

Question 5

How many of the Top 100 Pokemon are legendary?

In [ ]:
RPLsum = Rank_plot.Legendary.sum()
print('The top 100 Pokemon contain {} legendary Pokemon.'.format(RPLsum))
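
Summing a boolean column works because pandas counts True as 1 and False as 0. A minimal sketch with a made-up column:

```python
import pandas as pd

# Made-up legendary flags for five hypothetical Pokemon
legendary = pd.Series([True, False, True, False, False])

count = legendary.sum()   # number of True values
share = legendary.mean()  # fraction of True values

print('{} legendary Pokemon ({:.0%} of the sample)'.format(count, share))
```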

Question 6

As there seem to be a lot of legendary Pokemon among the top 100 ranked Pokemon, we want to know whether there is a considerable difference between legendary and non-legendary Pokemon.

In [ ]:
Rank_plot_non_legend = Pokemon[Pokemon.Legendary == False].sort_values(
    by='Rank', ascending=True)[0:100]
In [ ]:
sns.scatterplot(x='Rank',  # keyword arguments are required by recent seaborn versions
                y='Total',
                hue='Type_1',
                s=50,
                alpha=1,
                data=Rank_plot_non_legend)

plt.title('Ranking Top 100 non-legendary Pokemon')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

In the plot we can see considerable gaps between the dots on the Rank axis. Also, the scale of the Total-score axis is shifted down. Now we calculate the mean Total score for the top 100 and the top 100 non-legendary Pokemon.

In [ ]:
meanTotal = Rank_plot.Total.mean()
meanTotalNL = Rank_plot_non_legend.Total.mean()

print('The mean total score over all Top 100 Pokemon is {}, whereas the mean score of the Top 100 non-legendary Pokemon is {}.'.format(meanTotal, meanTotalNL))

To visualize this difference, we use the Legendary column to color-code the dots in our plot of the Top 100 Pokemon.

In [ ]:
sns.scatterplot(x='Rank',  # keyword arguments are required by recent seaborn versions
                y='Total',
                hue='Legendary',
                s=50,
                alpha=1,
                data=Rank_plot)

plt.title('Ranking Top 100 Pokemon')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

Question 7

Which are the strongest and weakest Pokemon? In addition, which is the strongest non-legendary Pokemon?

In [ ]:
Pokemon.sort_values(by='Rank', ascending=True)[:1]

The strongest overall Pokemon is Mega Mewtwo X.

In [ ]:
Pokemon[Pokemon.Legendary == False].sort_values(by='Rank', ascending=True)[:1]

The strongest non-legendary Pokemon is Mega Tyranitar.

In [ ]:
Pokemon.sort_values(by='Rank', ascending=False)[:1]

The weakest Pokemon over all 6 generations is Sunkern.

Question 8

Which is the strongest and which the weakest generation of Pokemon?

In [ ]:
Pokemon.groupby(by='Generation')['Rank'].mean().sort_values(ascending=False)

Keep in mind that rank 1 is the strongest Pokemon, so a lower mean rank indicates a stronger generation. Generation 2 seems to be the strongest on average and generation 4 the weakest. Except for generation 4, all the others are pretty close together. But this is just a generalization, and there will also be exceptional Pokemon in generation 4.

Question 9

Which of the stats impact the Total-score the most? To answer this question we calculate the correlation between the different stats and plot it as a heatmap.

In [ ]:
corr = Pokemon.iloc[:,3:10].corr()
corr
In [ ]:
plt.matshow(corr);
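
Besides the heatmap, the Total column of the correlation matrix can be read off directly. The sketch below does this on a made-up frame with only two stats. The values are assumptions chosen so that HP varies much more than Attack.

```python
import pandas as pd

# Made-up stats: Total is the sum of HP and Attack
toy = pd.DataFrame({'HP': [10, 20, 30, 40],
                    'Attack': [5, 5, 6, 6]})
toy['Total'] = toy['HP'] + toy['Attack']

# Correlation of every stat with Total, strongest first
corr_with_total = (toy.corr()['Total']
                      .drop('Total')
                      .sort_values(ascending=False))

print(corr_with_total)
```

In this toy example HP correlates more strongly with Total than Attack does, simply because HP contributes most of the variation to the sum.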

It seems that the stats for Attack, Sp._Atk and Sp._Def affect the Total-score the most. A keen eye will also discover that the Total-score is just the sum of all the other stats. To test this hypothesis we calculate the sum, probably in an unnecessarily complicated way, and test whether it is the same as Total.

In [ ]:
Pokemon.iloc[:,3] == Pokemon.iloc[:,4] + Pokemon.iloc[:,5] + Pokemon.iloc[:,6] + Pokemon.iloc[:,7] + Pokemon.iloc[:,8] + Pokemon.iloc[:,9]
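
The same check can be written more compactly by summing the stat columns row-wise and reducing the comparison to a single boolean with `all()`. This is a sketch on a two-row toy frame. The column names follow the renamed columns of this notebook, and the values are illustrative.

```python
import pandas as pd

# Two illustrative rows with the six individual stats and their Total
toy = pd.DataFrame({
    'Total': [318, 405],
    'HP': [45, 60], 'Attack': [49, 62], 'Defense': [49, 63],
    'Sp._Atk': [65, 80], 'Sp._Def': [65, 80], 'Speed': [45, 60],
})

stat_columns = ['HP', 'Attack', 'Defense', 'Sp._Atk', 'Sp._Def', 'Speed']

# Row-wise sum of the stats, compared with Total for every row
matches = toy['Total'] == toy[stat_columns].sum(axis=1)
all_match = bool(matches.all())

print('Total equals the sum of the stats for every row:', all_match)
```

Reducing the comparison with `all()` avoids having to scan a long boolean column by eye.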

Inferences and Conclusion

We now know which are the strongest and weakest Pokemon in the game. A legendary or 'Mega' Pokemon will probably be one of the strongest. We also know which types are disproportionately stronger than others: Flying and Dragon seem to be a good choice. Although the Total score is just the sum of all stats, it correlates most strongly with both attack stats (Attack and Sp._Atk) and with Sp._Def, so these seem to be good indicators when selecting Pokemon. On average, generation 4 seems to be the weakest and the second generation the strongest. But except for generation 4, all the others are pretty close to each other. There is a lot more to learn from this dataset, but that is it for now.

References and Future Work

One could link this dataset with a table containing factors for advantages and disadvantages between different types. One could also link a dataset containing all possible attacks and the Pokemon that can potentially learn them, to create the strongest / all-purpose Pokemon (if possible).

To generate this Notebook the Pandas documentation and the Scipy.stats documentation were very helpful.

In [ ]:
jovian.commit()