
Data Analysis with Python: Zero to Pandas - Course Project Guidelines

Step 3: Perform exploratory Analysis & Visualization
  • Compute the mean, sum, range and other interesting statistics for numeric columns
  • Explore distributions of numeric columns using histograms etc.
  • Explore relationship between columns using scatter plots, bar charts etc.
  • Make a note of interesting insights from the exploratory analysis
Step 4: Ask & answer questions about the data
  • Ask at least 5 interesting questions about your dataset
  • Answer the questions either by computing the results using Numpy/Pandas or by plotting graphs using Matplotlib/Seaborn
  • Create new columns, merge multiple datasets and perform grouping/aggregation wherever necessary
  • Wherever you're using a library function from Pandas/Numpy/Matplotlib etc. explain briefly what it does
Step 5: Summarize your inferences & write a conclusion
  • Write a summary of what you've learned from the analysis
  • Include interesting insights and graphs from previous sections
  • Share ideas for future work on the same topic using other relevant datasets
  • Share links to resources you found useful during your analysis
Step 6: Make a submission & share your work

Evaluation Criteria

Your submission will be evaluated using the following criteria:

  • You must ask and answer at least 5 questions about the dataset
  • Your submission must include at least 5 visualizations (graphs)

Analysis of "Pokemon with stats" dataset

TitleImg

This notebook analyzes a dataset covering all Pokemon of the first 6 generations. The dataset contains 13 columns. Each Pokemon has a number that corresponds to its number in the Pokedex, and a Name. Most of the other columns contain the stats of each Pokemon. The remaining columns give its Types, the Generation it belongs to, and whether it is a legendary Pokemon.

The Dataset originates from Kaggle. Link to Kaggle dataset.

I do not own the copyright of this picture! Link to Banner.

To use Jovian, the jovian library has to be installed and imported.

In [1]:
project_name = "zerotopandas-course-project-Pokemon"
In [2]:
!pip install jovian --upgrade -q
ERROR: Could not install packages due to an EnvironmentError: [WinError 5] Access denied: 'c:\\programdata\\anaconda3\\etc\\jupyter\\nbconfig\\notebook.d\\jovian_nb_ext.json' Consider using the `--user` option or check the permissions.
In [3]:
import jovian
In [ ]:
jovian.commit(project=project_name)
[jovian] Attempting to save notebook..

Additionally, several libraries that are used in this notebook are imported.

In [ ]:
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Later on, a pairplot from seaborn is generated. It throws several warnings but works anyway, so the warnings library is imported to filter them.

In [ ]:
import warnings
warnings.filterwarnings("ignore")  # the pairplot throws annoying warnings but still works

The following cells set the style and the default parameters for all subsequent plots.

In [ ]:
sns.set_style('white')
In [ ]:
plt.rcParams['font.size'] = 12
plt.rcParams['figure.figsize'] = (15, 5)

Data Preparation and Cleaning

Beforehand I downloaded the dataset to my local drive. The dataset is now imported from the file Pokemon.csv.

In [ ]:
Pokemon = pd.read_csv('Pokemon.csv')

Let's take a quick look at the dataset to get an overview, using the head method and the shape attribute.

In [ ]:
Pokemon.head(15)
In [ ]:
Pokemon.shape

First, we replace the whitespace in the column names with underscores to avoid problems when accessing columns.

In [ ]:
Pokemon.columns = Pokemon.columns.str.replace(' ', '_')

As we can see, there is also a small problem with the Pokemon names. For Mega evolutions the name is redundant, because extra text precedes "Mega". We correct this by removing everything in front of "Mega".

In [ ]:
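# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
Pokemon['Name'] = Pokemon['Name'].str.replace(".*(?=Mega)", "", regex=True)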
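
To see what the lookahead pattern `.*(?=Mega)` does, here is a minimal sketch using Python's built-in re module. The sample names and the helper strip_mega_prefix are made up for illustration and are not read from Pokemon.csv.

```python
import re

# Hypothetical sample names in the style of the raw Name column
sample_names = ["VenusaurMega Venusaur", "CharizardMega Charizard X", "Pikachu"]

def strip_mega_prefix(name):
    """Remove everything in front of 'Mega'; names without 'Mega' stay unchanged."""
    # (?=Mega) is a lookahead: it requires 'Mega' to follow but does not consume it
    return re.sub(r".*(?=Mega)", "", name)

cleaned = [strip_mega_prefix(name) for name in sample_names]
print(cleaned)
```

Because the lookahead does not consume "Mega", a name that merely starts with those letters, such as "Meganium", is left untouched.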

Now we set the Name column as the index, so that each Pokemon can easily be identified by its name.

In [ ]:
Pokemon = Pokemon.set_index('Name')

As every Pokemon has one or two Types, we generate a new column called Type_combined, which holds the combination of both types.

In [ ]:
Pokemon['Type_combined'] = Pokemon[['Type_1', 'Type_2']].fillna('').sum(axis=1)
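
The same idea can be written as plain string concatenation, which may be easier to read. This is a sketch on a made-up frame (the rows are invented for illustration), and it assumes that Type_1 is never missing.

```python
import pandas as pd

# Made-up rows for illustration; Type_2 is missing for single-type Pokemon
sample = pd.DataFrame({
    "Type_1": ["Grass", "Fire", "Water"],
    "Type_2": ["Poison", None, "Flying"],
})

# Replace a missing second type by an empty string, then concatenate both columns
sample["Type_combined"] = sample["Type_1"] + sample["Type_2"].fillna("")
print(sample["Type_combined"].tolist())
```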
In [ ]:
Pokemon.head()
In [ ]:
# jovian.commit()

Exploratory Analysis and Visualization

TODO

In [ ]:
Number_of_Pokemon = Pokemon.shape[0]
In [ ]:
print('There are {} Pokemon in this dataset.'.format(Number_of_Pokemon))
In [ ]:
Pokemon.describe()
In [ ]:
sns.pairplot(Pokemon, hue="Generation");
In [ ]:
Pokemon['Generation'].unique()
In [ ]:
Pokemon['Type_1'].unique()
In [ ]:
Pokemon['Type_2'].unique()
In [ ]:
Pokemon['Type_combined'].unique()
In [ ]:
# jovian.commit()

Asking and Answering Questions

TODO

In [ ]:
# numeric_only=True skips the text columns, which cannot be averaged
Pokemon.groupby('Type_1').mean(numeric_only=True).sort_values(
    by='Total', ascending=False).head(10)
In [ ]:
Pokemon.groupby('Type_2').mean(numeric_only=True).sort_values(
    by='Total', ascending=False).head(10)
In [ ]:
Pokemon.groupby('Type_combined').mean(numeric_only=True).sort_values(
    by='Total', ascending=False).head(10)
In [ ]:
Pokemon['Rank'] = Pokemon['Total'].rank(method='first', ascending=False)
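
The rank call above assigns rank 1 to the highest Total. With method='first', tied values get distinct ranks in the order in which they appear. A small sketch with made-up scores (the labels A to D are placeholders, not Pokemon from the dataset):

```python
import pandas as pd

# Made-up total scores with one tie (600 appears twice)
totals = pd.Series([600, 500, 600, 400], index=["A", "B", "C", "D"])

# ascending=False gives rank 1 to the highest score;
# method='first' breaks ties by order of appearance, so every rank is unique
ranks = totals.rank(method="first", ascending=False)
print(ranks)
```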
In [ ]:
Pokemon.head()
In [ ]:
Pokemon.sort_values(by='Total', ascending=False)
In [ ]:
Rank_plot = Pokemon.sort_values(by='Rank', ascending=True)[0:100]
In [ ]:
Rank_plot
In [ ]:
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x='Rank',
                y='Total',
                hue='Type_1',
                s=50,
                alpha=1,
                data=Rank_plot)

plt.title('Ranking Top 100 Pokemon')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
In [ ]:
Rank_plot_non_legend = Pokemon[Pokemon.Legendary == False].sort_values(
    by='Rank', ascending=True)[0:100]
In [ ]:
Rank_plot_non_legend
In [ ]:
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x='Rank',
                y='Total',
                hue='Type_1',
                s=50,
                alpha=1,
                data=Rank_plot_non_legend)

plt.title('Ranking Top 100 non-legendary Pokemon')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
In [ ]:
# Select the Rank column before aggregating, so text columns are never averaged
mean_Rank_Type_1 = Pokemon.groupby(by='Type_1')['Rank'].mean()
In [ ]:
mean_Rank_Type_1.sort_values(ascending=True)
In [ ]:
# Select the Rank column before aggregating, so text columns are never aggregated
median_Rank_Type_1 = Pokemon.groupby(by='Type_1')['Rank'].median()
In [ ]:
median_Rank_Type_1.sort_values(ascending=True)
In [ ]:
# Subtraction aligns both Series on the Type_1 index, so no prior sorting is needed
shifted_distribution_rank = mean_Rank_Type_1 - median_Rank_Type_1
In [ ]:
plt.plot(shifted_distribution_rank.sort_values(ascending=False))
plt.hlines(y=0, xmin=0, xmax=17, linestyles='dashed', alpha=0.25)
plt.title('Difference of mean and median rank grouped by Type 1')
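
The quantity plotted above, mean minus median, is a rough indicator of skew: a few extreme values pull the mean but hardly move the median. A minimal sketch using only the standard library, with made-up rank values for one hypothetical type:

```python
from statistics import mean, median

# Made-up ranks: mostly good (low) ranks plus two very poor (high) ones
ranks = [10, 20, 30, 40, 700, 800]

# A positive difference means a few poorly ranked entries pull the mean rank up
difference = mean(ranks) - median(ranks)
print(difference)
```

A value near zero suggests a roughly symmetric distribution of ranks within the type, while a negative value points to a few very well ranked outliers.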
In [ ]:
# jovian.commit()

Inferences and Conclusion

TODO

  • Strongest/weakest Pokemon
  • Strongest/weakest non-legendary Pokemon
  • Strongest/weakest Generation
  • Strongest/weakest Type
  • What has the largest effect on total score? (correlation matrix)
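
The last question could be approached with DataFrame.corr, which computes pairwise Pearson correlations between numeric columns. The following is only a sketch on invented numbers, not a result from the dataset. The column selection is an assumption, and a high correlation does not by itself show an effect.

```python
import pandas as pd

# Made-up stats for illustration; Total is computed as the sum of the listed stats
sample = pd.DataFrame({
    "HP": [45, 60, 80, 39, 58, 78],
    "Attack": [49, 62, 82, 52, 64, 84],
    "Speed": [45, 60, 80, 65, 80, 100],
})
sample["Total"] = sample[["HP", "Attack", "Speed"]].sum(axis=1)

# Correlation of each stat with Total, strongest first
corr_with_total = sample.corr()["Total"].drop("Total").sort_values(ascending=False)
print(corr_with_total)
```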
In [ ]:
plt.hist(Pokemon.Total, bins= 15);
In [ ]:
p_thresh = 1e-3
p_norm = stats.normaltest(Pokemon.Total)[1]
In [ ]:
if p_norm < p_thresh:
    print('The p-value is smaller than {} and therefore the data do not correspond to a normal distribution. Be aware of that fact when selecting statistical tests.'.format(p_thresh))
else:
    print('The p-value is greater than {}. The data seem to be normally distributed. Go on with your statistics.'.format(p_thresh))

As the values in the Total column are not normally distributed, the ranked values were used to test for significant differences.

In [ ]:
Dragon = Pokemon.Rank[Pokemon.Type_1 == 'Dragon']
Flying = Pokemon.Rank[Pokemon.Type_1 == 'Flying']
Fairy = Pokemon.Rank[Pokemon.Type_1 == 'Fairy']
In [ ]:
stats.ttest_ind(Dragon, Flying)
In [ ]:
stats.ttest_ind(Dragon, Fairy)
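
A t-test assumes roughly normal data within each group. A common rank-based alternative is the Mann-Whitney U test, available as scipy.stats.mannwhitneyu. The sketch below is a suggestion rather than what this notebook does, and it uses made-up samples instead of the Dragon, Flying and Fairy ranks.

```python
import scipy.stats as stats

# Made-up rank samples for two hypothetical groups (not taken from the dataset)
group_a = [3, 8, 15, 21, 30]
group_b = [120, 240, 310, 450, 600]

# Two-sided Mann-Whitney U test: compares the two groups without assuming normality
result = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(result.statistic, result.pvalue)
```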
In [ ]:
# jovian.commit()

References and Future Work

Link the dataset with the possible attacks to infer the strongest possible combination (the all-purpose Pokemon).

In [ ]:
# jovian.commit()