Course Project on EDA with Python
Analysis on Video Game Dataset
This project analyzes the Video Games Sales dataset, using several Python libraries to visualize the data. The dataset used in the project is from Data World.
The libraries I used in the project are:
To install all required libraries, run the following command:
pip install matplotlib seaborn numpy pandas plotly jovian --upgrade
About Data Visualization:
Data visualization is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers. Visualizing data is an essential part of data analysis and machine learning. In this tutorial, we'll use the Python libraries Matplotlib and Seaborn to learn and apply some popular data visualization techniques.
The following tasks are implemented in the project:
project_name = "analysis-on-videogames-sales-data"
!pip install jovian --upgrade -q
import jovian
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import numpy as np
from plotly.offline import init_notebook_mode,iplot
import pandas as pd
%matplotlib inline
Here are several ways to display a dataset stored in CSV format. The first step is to load the data with the pandas read_csv function. The data is stored in a two-dimensional table called a DataFrame.
data = 'VideoGameSales.csv'  # path to the CSV dataset
videogame_df = pd.read_csv(data)  # read the CSV with pandas and store it in videogame_df
videogame_df  # display the DataFrame
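As a rough illustration of what read_csv produces, the sketch below loads a tiny in-memory CSV instead of the real file. The three rows, the game and publisher names, and the variable names sample_csv, sample_df and missing_per_column are all made up for this example; only the column names follow the dataset.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics part of the dataset's column layout.
# The values are invented for illustration only.
sample_csv = io.StringIO(
    "Rank,Name,Platform,Year,Genre,Publisher,Global_Sales\n"
    "1,Game A,PlatformX,2006,Sports,Publisher One,10.5\n"
    "2,Game B,PlatformY,1985,Platform,Publisher One,8.25\n"
    "3,Game C,PlatformX,2008,Racing,Publisher Two,\n"
)

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(sample_csv)

# dtypes shows how pandas interpreted each column
print(sample_df.dtypes)

# isna().sum() counts the missing values in each column
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```

The empty last field in the third row is read as a missing value, which is worth checking for before plotting.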
This cell shows summary statistics for the numeric columns:
videogame_df.describe()
From the above DataFrame, we conclude that:
videogame_df.shape  # display the shape of the data as (rows, columns)
(500, 11)
videogame_df.sort_values(by = ['Name']).head(30)  # sort by the 'Name' column, then display the first 30 rows
videogame_df.head(10)  # display the first 10 rows of the dataset
videogame_df.tail(10)  # display the last 10 rows of the DataFrame
videogame_df[50:60]  # display the rows at positions 50 to 59 (the 51st to the 60th row)
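The slice above is positional and excludes its end point. A small sketch on a made-up frame (demo_df and its single Rank column are assumptions for this example) shows which rows come back:

```python
import pandas as pd

# Hypothetical frame with ranks 1..100, so the row at position i holds rank i + 1
demo_df = pd.DataFrame({"Rank": range(1, 101)})

# Slicing with [] selects rows by position; the end position is excluded
by_slice = demo_df[50:60]

# iloc makes the positional intent explicit and returns the same rows
by_iloc = demo_df.iloc[50:60]

first_rank = int(by_slice["Rank"].iloc[0])
last_rank = int(by_slice["Rank"].iloc[-1])
print(len(by_slice), first_rank, last_rank)
```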
#Method 1 to print all column names
for col in videogame_df.columns:
    print(col)
#Method 2 to print all column names
list(videogame_df.columns)
Rank
Name
Platform
Year
Genre
Publisher
NA_Sales
EUR_Sales
JAP_Sales
IND_Sales
Global_Sales
['Rank',
'Name',
'Platform',
'Year',
'Genre',
'Publisher',
'NA_Sales',
'EUR_Sales',
'JAP_Sales',
'IND_Sales',
'Global_Sales']
x = videogame_df['Name'].unique()  # unique() returns a numpy.ndarray holding each distinct name once
y = videogame_df['Genre'].unique()
z = videogame_df['Publisher'].unique()
print('Total Games by `Name` count(unique) :',len(x))
print('Total Games by `Genre` count(unique) :',len(y))
print('Total Games by `Publisher` count(unique) :',len(z))
Total Games by `Name` count(unique) : 431
Total Games by `Genre` count(unique) : 12
Total Games by `Publisher` count(unique) : 34
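The same counts can be obtained without building the intermediate arrays. This is a minimal sketch on a made-up four-row frame (demo_df and its values are assumptions), using the pandas nunique method:

```python
import pandas as pd

# Hypothetical miniature version of the dataset
demo_df = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game A", "Game C"],
    "Genre": ["Sports", "Racing", "Sports", "Sports"],
    "Publisher": ["Pub One", "Pub One", "Pub Two", "Pub Two"],
})

# nunique() counts the distinct values in each selected column
unique_counts = demo_df[["Name", "Genre", "Publisher"]].nunique()
print(unique_counts)
```

Note that nunique ignores missing values by default, whereas len(series.unique()) counts NaN as one more value, so the two approaches can differ by one on columns with gaps.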
Next, we use the Matplotlib and Seaborn libraries to visualize the dataset.
vg_plot = videogame_df[0:25]
vg_plot
x = vg_plot['Rank']
y = vg_plot['Year']
plt.figure(figsize=(25,8), dpi= 80)
plt.plot(x,y, label = 'Year', color = 'green')
plt.xlabel('Rank')
plt.ylabel('Year')
plt.title('Release Year by Rank for the First 25 Rows')
plt.legend()
plt.show()
We can also get a smooth estimate of the distribution using kernel density estimation, which Seaborn does with sns.kdeplot.
# Draw Plot
plt.figure(figsize=(25,8), dpi= 80)
sns.kdeplot(x=videogame_df.Global_Sales, fill=True, label = 'Global Sales', color="r", alpha=.7)  # fill replaces the deprecated shade argument
# Decoration
plt.title('Overall Global Sales Distribution', fontsize=16)
plt.legend()
plt.show()
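To make the idea behind kernel density estimation concrete, here is a minimal NumPy sketch of a one-dimensional Gaussian KDE. It is not Seaborn's implementation; the helper gaussian_kde_1d, the sample values in demo_sales and the rule-of-thumb bandwidth are assumptions chosen for illustration.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on the points in grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance between every grid point and every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # Standard normal kernel evaluated at each distance
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Average the kernels and rescale by the bandwidth
    return kernel.sum(axis=1) / (len(samples) * bandwidth)


# Hypothetical sales figures (in millions), invented for this example
demo_sales = np.array([0.5, 0.7, 1.2, 1.3, 2.0, 4.5, 8.0])

# Rule-of-thumb bandwidth: sample standard deviation times n ** (-1/5)
bandwidth = demo_sales.std(ddof=1) * len(demo_sales) ** (-1 / 5)

# Evaluate on a grid that extends well past the data on both sides
grid = np.linspace(demo_sales.min() - 5 * bandwidth,
                   demo_sales.max() + 5 * bandwidth, 2000)
density = gaussian_kde_1d(demo_sales, grid, bandwidth)

# A density should integrate to roughly 1 (simple Riemann sum)
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

Each observation contributes one small bell curve, and the smooth line in the plot is the average of those curves; a larger bandwidth gives a smoother, flatter estimate.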
total = vg_plot['Global_Sales']
NA = vg_plot['NA_Sales']
EUR = vg_plot['EUR_Sales']
JAP = vg_plot['JAP_Sales']
IND = vg_plot['IND_Sales']
plt.figure(figsize=(25,8), dpi= 80)
plt.grid(True)
plt.title('Comparison of Regional Sales with Global Sales')
plt.plot(total, label = 'Global')
plt.plot(NA, label = 'AMERICA')
plt.plot(EUR, label = 'EUROPE')
plt.plot(JAP, label = 'JAPAN')
plt.plot(IND, label = 'INDIA')
plt.legend(bbox_to_anchor =(1.0, 1.025), ncol = 2)
plt.figure(figsize=(25,8))
kwargs = dict(histtype='barstacked', alpha=0.3, bins=40)
plt.hist(total, **kwargs)
plt.hist(NA, **kwargs)
plt.hist(EUR, **kwargs)
plt.hist(JAP, **kwargs)
plt.hist(IND, **kwargs)
plt.xlabel('Sales (in millions)')
plt.ylabel('Number of games')
plt.title('Histogram Comparison of Global and Regional Sales')
plt.show()
plt.figure(figsize=(10,7))
x = vg_plot['Year']
y = vg_plot['Global_Sales']
plt.title('Global Sales (in Millions) by Year')
plt.hist2d(x, y, bins=22, cmap='hot_r')
cb = plt.colorbar()
cb.set_label('counts in bin')
The main idea of Seaborn is that it provides high-level commands to create a variety of plot types useful for statistical data exploration, and even some statistical model fitting. Let's take a look with our dataset 'videogame_df' and plot the types available in Seaborn.
plt.figure(figsize=(25,10))
sns.countplot(x='Year', data=videogame_df)  # pass the column by keyword; recent Seaborn versions reject it as a positional argument
plt.title('Number of Games per Year')
plt.show()
#top platforms (name of the platform,total number of games developed for that platform)
topPlatforms_index = videogame_df.Platform.value_counts().head(10).index
topPlatforms_values = videogame_df.Platform.value_counts().head(10).values
#top genres (name of the genre,total number of games developed in that genre)
topGenres_index = videogame_df.Genre.value_counts().head(10).index
topGenres_values = videogame_df.Genre.value_counts().head(10).values
#top game developers/publishers (name of the publisher,total number of games published by that publisher)
topPublisher_index = videogame_df.Publisher.value_counts().head(10).index
topPublisher_values = videogame_df.Publisher.value_counts().head(10).values
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(25,8), facecolor='white')
##top platforms used for games
ax1.vlines(x=topPlatforms_index, ymin=0, ymax=topPlatforms_values, color='#AD0605', linewidth=30)
ax1.set_title('Top 10 Platforms',fontsize=16)
#top genres of Games accordingly
ax2.vlines(x=topGenres_index, ymin=0, ymax=topGenres_values, color='#AB0DD5', linewidth=30)
ax2.set_title('Top 10 Genres',fontsize=16)
plt.show()
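The index and values used for these bar charts both come from the same value_counts result, so it can be computed once and reused. A minimal sketch on made-up data (demo_platforms and the names top_counts, top_index and top_values are assumptions):

```python
import pandas as pd

# Hypothetical platform column, invented for this example
demo_platforms = pd.Series(["A", "B", "A", "C", "A", "B"], name="Platform")

# value_counts() sorts by frequency in descending order,
# so head(n) keeps the n most common categories
top_counts = demo_platforms.value_counts().head(2)

top_index = top_counts.index    # category labels
top_values = top_counts.values  # how often each label occurs
print(list(top_index), list(top_values))
```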
fig, ax = plt.subplots(figsize=(25,8), facecolor='white')
#top publishers of the games
ax.vlines(x=topPublisher_index, ymin=0, ymax=topPublisher_values, linewidth=65, color='#969F79')
ax.set_title('Top 10 Publishers',fontsize=16)
Conclusions from the above bar graphs:
Visualizing the multidimensional relationships among the samples is as easy as calling sns.pairplot:
# Correlate sales across all regions with a Seaborn pairplot, rendered through Matplotlib:
sns.pairplot(videogame_df.loc[0:,['NA_Sales','EUR_Sales','JAP_Sales','IND_Sales','Global_Sales']])
plt.show()
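A pairplot shows the relationships visually; the strength of each linear relationship can also be put into numbers with the pandas corr method. The sketch below uses a made-up four-row frame (demo_sales_df and its values are assumptions) with three of the dataset's column names:

```python
import pandas as pd

# Hypothetical regional sales, invented so the correlations are easy to read
demo_sales_df = pd.DataFrame({
    "NA_Sales": [1.0, 2.0, 3.0, 4.0],
    "EUR_Sales": [2.0, 4.0, 6.0, 8.0],   # rises together with NA_Sales
    "JAP_Sales": [4.0, 3.0, 2.0, 1.0],   # falls as NA_Sales rises
})

# corr() returns the pairwise Pearson correlation between the columns
corr_matrix = demo_sales_df.corr()
print(corr_matrix)
```

Values close to 1 mean two regions rise together, values close to -1 mean one falls as the other rises, and values near 0 suggest little linear relationship.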
Conclusions from the correlation among the regions:
top15 = videogame_df[0:15]
top15
plt.figure(figsize = (18,8))
plt.barh(top15["Name"],top15["IND_Sales"], label = 'Top Games')
plt.title("Top 15 games sold in India",fontdict = {"fontsize":20})
plt.legend()  # add the legend before saving so it appears in the saved image
plt.savefig("Top 15 games sold in India.jpg",dpi = 300)  # save the figure as a JPEG image in the working directory
plt.show()
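Publisher = list(videogame_df.Publisher.unique())
global_sale_of_every_Publisher = pd.Series(dtype = float)
for pub in Publisher :
    # rows that belong to the current publisher
    pub_rows = videogame_df.loc[videogame_df.Publisher == pub]
    # total global sales of that publisher
    global_sale_of_every_Publisher[pub] = pub_rows.Global_Sales.sum()
# sort in descending order so the slice holds the ten best-selling publishers
top_10 = global_sale_of_every_Publisher.sort_values(ascending = False)[:10]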
plt.figure(figsize = (10.5,9))
plt.pie(top_10,labels = top_10.index,autopct = "%.2f%%",textprops = {"fontsize":13},labeldistance = 1.05)
plt.legend(loc = 4,fontsize = 12, bbox_to_anchor =(1.75, 0.82), ncol = 2)
plt.title("Top 10 Publishers of Games",fontdict = {"fontsize":25,"fontweight":100})
plt.savefig("Top 10 Publisher of Games",dpi = 200)
plt.show()
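The per-publisher totals can also be computed without an explicit loop. This is a minimal sketch using groupby and nlargest on a made-up frame (demo_df, the publisher names and the sales figures are assumptions):

```python
import pandas as pd

# Hypothetical sales records, invented for this example
demo_df = pd.DataFrame({
    "Publisher": ["Pub One", "Pub Two", "Pub One", "Pub Three"],
    "Global_Sales": [5.0, 3.0, 2.5, 9.0],
})

# Sum the global sales of each publisher in one step
sales_by_publisher = demo_df.groupby("Publisher")["Global_Sales"].sum()

# nlargest(n) keeps the n publishers with the highest totals, largest first
top_publishers = sales_by_publisher.nlargest(2)
print(top_publishers)
```

The resulting Series can be passed to plt.pie in the same way as top_10, with top_publishers.index as the labels.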
Genre = videogame_df.Genre
Genre = Genre.value_counts()
plt.figure(figsize = (8,8))
labels = Genre.index
colors = ["#eeff00","#51ff00","#00ffdd","#ff9d00","#0033ff","#ff0800","#f700ff","#850012","#c7714a","#04615b","#ab8d5e","#00004a"]
plt.pie(Genre,labels = labels,colors = colors,autopct = "%.2f%%")
plt.title("Percentage of Top Genres of Games",fontdict = {"fontsize":17})
plt.savefig("Top Genres Chart",dpi = 200)
plt.show()
#Pie Plot
# For North America
df1 = pd.DataFrame(videogame_df.groupby('Name')['NA_Sales'].sum())
df1.sort_values(by=['NA_Sales'], inplace=True)
df1 = df1.tail(5)
df1.plot.pie(y='NA_Sales', autopct='%1.1f%%', figsize=(6, 6))
plt.title("Best selling games in North America")
# For Europe Sales
df1 = pd.DataFrame(videogame_df.groupby('Name')['EUR_Sales'].sum())
df1.sort_values(by=['EUR_Sales'], inplace=True)
df1 = df1.tail(5)
df1.plot.pie(y='EUR_Sales', autopct='%1.1f%%', figsize=(6, 6))
plt.title("Best selling games in Europe")
# For India Sales
df1 = pd.DataFrame(videogame_df.groupby('Name')['IND_Sales'].sum())
df1.sort_values(by=['IND_Sales'], inplace=True)
df1 = df1.tail(5)
df1.plot.pie(y='IND_Sales', autopct='%1.1f%%', figsize=(6, 6))
plt.title("Best selling games in India")
# For Japan Sales
df1 = pd.DataFrame(videogame_df.groupby('Name')['JAP_Sales'].sum())
df1.sort_values(by=['JAP_Sales'], inplace=True)
df1 = df1.tail(5)
df1.plot.pie(y='JAP_Sales', autopct='%1.1f%%', figsize=(6, 6))
plt.title("Best selling games in Japan")
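The four regional blocks above repeat the same steps with a different column name, so the selection can be wrapped in a small helper. This is only a sketch: the function top_games_by_region and the demo data are assumptions, and the plotting call is left out so the example runs without a display.

```python
import pandas as pd


def top_games_by_region(df, region, n=5):
    """Return the n games with the highest summed sales in the given region column."""
    return df.groupby("Name")[region].sum().nlargest(n)


# Hypothetical records, invented for this example
demo_df = pd.DataFrame({
    "Name": ["Game A", "Game B", "Game A", "Game C"],
    "NA_Sales": [1.0, 4.0, 2.0, 0.5],
    "JAP_Sales": [3.0, 0.1, 0.2, 2.0],
})

top_na = top_games_by_region(demo_df, "NA_Sales", n=2)
top_jap = top_games_by_region(demo_df, "JAP_Sales", n=2)

for region, top in [("NA_Sales", top_na), ("JAP_Sales", top_jap)]:
    print(region, list(top.index))
```

Each returned Series could then be plotted with its plot.pie method, as in the cells above.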
df_genre = videogame_df.groupby('Genre')

def genreBased(region):
    """Draw a donut chart and a horizontal bar chart of total sales per genre for one region."""
    fig, ax = plt.subplots(ncols=2, figsize=(18, 6))
    # total sales per genre, sorted ascending so the largest bar is drawn at the top
    df_to_plot = df_genre.sum(numeric_only=True).sort_values(by=region)
    df_to_plot[region].plot(kind='barh', ax=ax[1])
    ax[1].set_title(region)
    # labels
    ax[1].set_ylabel('')
    ax[1].tick_params(axis='both', which='major', labelsize=13)
    ax[1].set_xlabel('Total Sales (in millions)', fontsize=15, labelpad=21)
    # spines
    ax[1].spines['top'].set_visible(False)
    ax[1].spines['right'].set_visible(False)
    ax[1].grid(False)
    # annotations: write the rounded total next to each bar
    for pos, value in enumerate(df_to_plot[region]):
        ax[1].annotate(str(round(value, 2)),        # this is the text
                       (value, pos),                # this is the point to label
                       textcoords="offset points",  # how to position the text
                       xytext=(6, 0),               # offset of the text from the point (x, y)
                       ha='left', va="center")
    # donut chart
    theme = plt.get_cmap('Blues')
    ax[0].set_prop_cycle("color", [theme(1. * i / len(df_to_plot)) for i in range(len(df_to_plot))])
    wedges, texts, _ = ax[0].pie(df_to_plot[region], wedgeprops=dict(width=0.45), startangle=-45,
                                 labels=df_to_plot.index, autopct="%.1f%%", textprops={'fontsize': 13})
    plt.tight_layout()

genreBased('Global_Sales')  # upper figure
genreBased('IND_Sales')     # lower figure