Video games are tied to many of our childhoods. We played them when we were small, and many of us still play as adults. But is the industry doing well these days? We can analyze a video game sales dataset and visualize it with graphs to get some insight.
The dataset is taken from https://www.kaggle.com/rishidamarla/video-game-sales
Libraries used in this project: pandas, Matplotlib, seaborn, opendatasets, and jovian.
Thanks to Jovian for the course project.
This is an executable Jupyter notebook hosted on Jovian.ml, a platform for sharing data science projects. You can run and experiment with the code in a couple of ways: using free online resources (recommended) or on your own computer.
The easiest way to start executing this notebook is to click the "Run" button at the top of this page, and select "Run on Binder". This will run the notebook on mybinder.org, a free online service for running Jupyter notebooks. You can also select "Run on Colab" or "Run on Kaggle".
Install Conda by following these instructions. Add the Conda binaries to your system PATH so you can use the conda command on your terminal.
Create a Conda environment and install the required libraries by running these commands on the terminal:
conda create -n zerotopandas -y python=3.8
conda activate zerotopandas
pip install jovian jupyter numpy pandas matplotlib seaborn opendatasets --upgrade
Download the notebook with the jovian clone command, enter the newly created directory, and start the Jupyter notebook:
jovian clone notebook-owner/notebook-id
cd directory-name
jupyter notebook
You can now access Jupyter's web interface by clicking the link that shows up on the terminal or by visiting http://localhost:8888 on your browser. Click on the notebook file (it has a .ipynb extension) to open it.
First, we need to download the dataset. The link is provided in the description above. You can also find many other interesting datasets on Kaggle.
!pip install jovian opendatasets --upgrade --quiet
Let's begin by downloading the data, and listing the files within the dataset.
dataset_url = 'https://www.kaggle.com/rishidamarla/video-game-sales'
The downloader needs your Kaggle username and API key (generated from your Kaggle account page), so you should register a Kaggle account first.
import opendatasets as od
od.download(dataset_url)
Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: akariiiii
Your Kaggle Key: ········
100%|██████████| 476k/476k [00:00<00:00, 63.4MB/s]
Downloading video-game-sales.zip to ./video-game-sales
The dataset has been downloaded and extracted.
data_dir = './video-game-sales'
import os
os.listdir(data_dir)
['Video_Games.csv']
Let us save and upload our work to Jovian before continuing.
project_name = "data-analysis-of-video-game-sales"
!pip install jovian --upgrade -q
import jovian
jovian.commit(project=project_name)
[jovian] Attempting to save notebook..
[jovian] Please enter your API key ( from https://jovian.ml/ ):
API KEY: ········
[jovian] Updating notebook "indexkyou/data-analysis-of-video-game-sales" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ml/indexkyou/data-analysis-of-video-game-sales
First, we load the dataset into a pandas DataFrame and look at what it contains.
import pandas as pd
game_sales_df = pd.read_csv('./video-game-sales/Video_Games.csv')
game_sales_df
Pretty cool: we have 16,719 rows here, one per game entry. Let's check the columns and the info summary to see whether the dataset is ready to work with.
game_sales_df.columns
Index(['Name', 'Platform', 'Year_of_Release', 'Genre', 'Publisher', 'NA_Sales',
'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales', 'Critic_Score',
'Critic_Count', 'User_Score', 'User_Count', 'Developer', 'Rating'],
dtype='object')
game_sales_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16719 entries, 0 to 16718
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 16717 non-null object
1 Platform 16719 non-null object
2 Year_of_Release 16450 non-null float64
3 Genre 16717 non-null object
4 Publisher 16665 non-null object
5 NA_Sales 16719 non-null float64
6 EU_Sales 16719 non-null float64
7 JP_Sales 16719 non-null float64
8 Other_Sales 16719 non-null float64
9 Global_Sales 16719 non-null float64
10 Critic_Score 8137 non-null float64
11 Critic_Count 8137 non-null float64
12 User_Score 10015 non-null object
13 User_Count 7590 non-null float64
14 Developer 10096 non-null object
15 Rating 9950 non-null object
dtypes: float64(9), object(7)
memory usage: 2.0+ MB
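Looking at the info, we can see that several columns contain null values, including Name, Year_of_Release, Genre, and Publisher, and that the score and rating columns are missing for a large share of the rows.
We should remove the rows with null values in the key columns for a cleaner DataFrame.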
# remove rows with a null value in the Year_of_Release, Name, or Publisher column
game_sales_df.dropna(subset=['Year_of_Release', 'Name', 'Publisher'], inplace=True)
game_sales_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16416 entries, 0 to 16718
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 16416 non-null object
1 Platform 16416 non-null object
2 Year_of_Release 16416 non-null float64
3 Genre 16416 non-null object
4 Publisher 16416 non-null object
5 NA_Sales 16416 non-null float64
6 EU_Sales 16416 non-null float64
7 JP_Sales 16416 non-null float64
8 Other_Sales 16416 non-null float64
9 Global_Sales 16416 non-null float64
10 Critic_Score 7982 non-null float64
11 Critic_Count 7982 non-null float64
12 User_Score 9837 non-null object
13 User_Count 7461 non-null float64
14 Developer 9904 non-null object
15 Rating 9767 non-null object
dtypes: float64(9), object(7)
memory usage: 2.8+ MB
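To see how many values each column is missing before deciding what to drop, something like the sketch below can help. It runs on a tiny made-up DataFrame (toy_df and its values are hypothetical, not taken from the dataset) and only illustrates the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; the values are made up for illustration.
toy_df = pd.DataFrame({
    'Name': ['Game A', None, 'Game C', 'Game D'],
    'Year_of_Release': [2008.0, 2009.0, np.nan, 2010.0],
    'Publisher': ['Pub X', 'Pub Y', 'Pub Z', None],
    'Critic_Score': [80.0, np.nan, np.nan, 75.0],
})

# Number of missing values in each column.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Drop rows that miss any of the key columns; other columns may stay null.
key_columns = ['Name', 'Year_of_Release', 'Publisher']
cleaned_df = toy_df.dropna(subset=key_columns)
print(len(cleaned_df))
```

Columns such as Critic_Score keep their nulls here, which matches the choice above to clean only the columns needed for the sales analysis.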
OK, the DataFrame looks good enough. Let's take a closer look at its summary statistics.
game_sales_df.describe()
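Note that describe() summarizes only numeric columns by default, and the info output lists User_Score as object, so it is left out. Here is a minimal sketch of converting such a column to numbers. The sample values, including the 'tbd' placeholder, are assumptions for illustration and may not match what the real column contains.

```python
import pandas as pd

# Made-up scores; the 'tbd' placeholder is an assumption about the raw data.
raw_scores = pd.Series(['8.5', '7.1', 'tbd', None, '9'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN.
numeric_scores = pd.to_numeric(raw_scores, errors='coerce')

print(numeric_scores.dtype)
print(numeric_scores.describe())
```

After a conversion like this, the column would show up in describe() along with the other numeric columns.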
import jovian
jovian.commit()
[jovian] Attempting to save notebook..
[jovian] Updating notebook "indexkyou/data-analysis-of-video-game-sales" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ml/indexkyou/data-analysis-of-video-game-sales
At first glance, the DataFrame is already sorted by Global_Sales. For a better view, let's create a few graphs.
Let's begin by importing matplotlib.pyplot and seaborn.
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 13
matplotlib.rcParams['figure.figsize'] = (36, 20)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
First, let's count the games released each year. It gives a rough picture of when video games were booming and when they were declining.
sns.countplot(x='Year_of_Release', data=game_sales_df)
plt.title('Number of Games Released Each Year')
plt.show()
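A count plot draws one bar per year whose height is the number of rows for that year, which is not the same as the copies sold in that year. The sketch below shows the difference on a small made-up DataFrame (toy_df and its numbers are hypothetical).

```python
import pandas as pd

# Made-up data: three small releases in 2008, one big release in 2009.
toy_df = pd.DataFrame({
    'Year_of_Release': [2008, 2008, 2008, 2009],
    'Global_Sales': [1.0, 0.5, 0.5, 10.0],
})

# What a count plot shows: number of rows (games) per year.
games_per_year = toy_df['Year_of_Release'].value_counts().sort_index()

# Total sales per year: sum of the Global_Sales column within each year.
sales_per_year = toy_df.groupby('Year_of_Release')['Global_Sales'].sum()

print(games_per_year)
print(sales_per_year)
```

In this toy example 2008 has more games but 2009 has more sales, so the two views can tell different stories.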
It seems we don't have much data from 2017 to 2020, so let's remove those years and try another graph for a better view.
# remove games that were released after 2016
game_sales_df.drop(game_sales_df[game_sales_df.Year_of_Release > 2016].index, inplace = True)
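# sum only the numeric columns; text columns such as Name cannot be meaningfully summed
sales_df = game_sales_df.groupby('Year_of_Release', as_index = False).sum(numeric_only = True)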
x_axis = sales_df['Year_of_Release']
y_axis = sales_df['Global_Sales']
plt.figure(figsize=(20,10), dpi= 60)
plt.plot(x_axis, y_axis, label = 'Sales', color = 'green')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.title('Total Game Sales Each Year')
plt.legend()
plt.show()
Let's add the regional sales as well: NA, EU, and JP.
na = sales_df['NA_Sales']
eu = sales_df['EU_Sales']
jp = sales_df['JP_Sales']
total = sales_df['Global_Sales']
plt.title('Sales Comparison Between Region And Global')
plt.plot(x_axis, total, label = 'Global')
plt.plot(x_axis, na, label = 'NA')
plt.plot(x_axis, eu, label = 'EU')
plt.plot(x_axis, jp, label = 'JP')
plt.legend(bbox_to_anchor =(1, 1))
<matplotlib.legend.Legend at 0x7fc7e91dfaf0>
We can see that North America (the NA_Sales column) is the largest market, followed by the EU and JP. JP sales are fairly consistent and don't decline as much. In 2008 and 2009 video games exploded in popularity, so let's look at the top games from those years.
top_games_2008 = game_sales_df.loc[game_sales_df['Year_of_Release'] == 2008]
top_games_2008.sort_values('Global_Sales',ascending = False).head(10)
top_games_2009 = game_sales_df.loc[game_sales_df['Year_of_Release'] == 2009]
top_games_2009.sort_values('Global_Sales',ascending = False).head(10)
In both 2008 and 2009, the best-selling game was on the Wii platform. That's pretty interesting, so let's draw a pie chart of the platforms (we need to combine the two DataFrames first).
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
combine_list = pd.concat([top_games_2008, top_games_2009])
platform_counts = combine_list.Platform.value_counts()
platform_counts
DS 895
Wii 607
X360 318
PS3 300
PS2 287
PSP 261
PC 183
DC 1
XB 1
Name: Platform, dtype: int64
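Stacking two yearly slices and counting platforms can be sketched with pd.concat on a couple of tiny made-up frames. The names games_2008 and games_2009 and their rows are hypothetical.

```python
import pandas as pd

# Made-up yearly slices standing in for the 2008 and 2009 subsets.
games_2008 = pd.DataFrame({'Name': ['Game A', 'Game B'], 'Platform': ['Wii', 'DS']})
games_2009 = pd.DataFrame({'Name': ['Game C'], 'Platform': ['Wii']})

# pd.concat stacks the rows; ignore_index=True renumbers them from 0.
combined = pd.concat([games_2008, games_2009], ignore_index=True)

# Number of games per platform across both years.
platform_counts = combined['Platform'].value_counts()
print(platform_counts)
```

Keep in mind that value_counts counts titles per platform. It says nothing about how many copies those titles sold.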
plt.figure(figsize=(24,12))
plt.title("Platforms by Number of Games Released in 2008 and 2009")
plt.pie(platform_counts, labels=platform_counts.index, autopct='%1.1f%%', startangle=180);
plt.legend(loc = 2,fontsize = 10, bbox_to_anchor = (1, 1), ncol = 2)
<matplotlib.legend.Legend at 0x7fc7eb1eaac0>
top10_platforms = game_sales_df.Platform.value_counts().head(10)
plt.figure(figsize=(24,12))
plt.title("Top 10 Platforms of All Time (by Number of Games)")
plt.pie(top10_platforms, labels=top10_platforms.index, autopct='%1.1f%%', startangle=180);
plt.legend(loc = 2,fontsize = 10, bbox_to_anchor = (1, 1), ncol = 2)
<matplotlib.legend.Legend at 0x7fc7eb109fd0>
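The PS2 holds a large share across the whole dataset, which fits its reputation as the best-selling console of all time. Keep in mind that this chart counts game titles, not copies sold.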
top_publishers = game_sales_df.Publisher.value_counts().head(10)
top_publishers
Electronic Arts 1344
Activision 976
Namco Bandai Games 935
Ubisoft 929
Konami Digital Entertainment 825
THQ 712
Nintendo 700
Sony Computer Entertainment 686
Sega 629
Take-Two Interactive 421
Name: Publisher, dtype: int64
plt.figure(figsize=(12,6))
plt.xticks(rotation=75)
sns.barplot(x=top_publishers.index, y=top_publishers.values);
top_genres = game_sales_df.Genre.value_counts().head(10)
plt.figure(figsize=(12,6))
sns.barplot(x=top_genres.index, y=top_genres.values);
A pie chart suits this kind of comparison better, since it also shows the percentage of each genre.
plt.figure(figsize=(24,12))
plt.title("Top 10 Genre")
plt.pie(top_genres, labels=top_genres.index, autopct='%1.1f%%', startangle=180);
plt.legend(loc = 2,fontsize = 10, bbox_to_anchor = (1, 1), ncol = 2)
<matplotlib.legend.Legend at 0x7fc7eaf40ee0>
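The percentages printed on the pie chart can also be computed directly, which is handy when exact numbers are needed. This is a small sketch on a made-up genre column (the genres Series is hypothetical).

```python
import pandas as pd

# Made-up genre column standing in for game_sales_df.Genre.
genres = pd.Series(['Action', 'Action', 'Sports', 'Racing'], name='Genre')

# normalize=True returns fractions instead of counts; scale to percent.
genre_percent = (genres.value_counts(normalize=True) * 100).round(1)
print(genre_percent)
```

With normalize=True the values add up to 100 percent over all genres present, so taking only the top 10 afterwards would give shares of the whole dataset rather than shares within the top 10.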
Let us save and upload our work to Jovian before continuing
import jovian
jovian.commit()
[jovian] Attempting to save notebook..
[jovian] Updating notebook "indexkyou/data-analysis-of-video-game-sales" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ml/indexkyou/data-analysis-of-video-game-sales
game_sales_2000_to_2016 = game_sales_df[(game_sales_df['Year_of_Release'] >= 2000) & (game_sales_df['Year_of_Release'] <= 2016)]
total_sales_us = game_sales_2000_to_2016.NA_Sales.sum()
total_sales_jp = game_sales_2000_to_2016.JP_Sales.sum()
total_sales_eu = game_sales_2000_to_2016.EU_Sales.sum()
total_sales_others = game_sales_2000_to_2016.Other_Sales.sum()
data = [['NA', total_sales_us],['JP', total_sales_jp],['Others', total_sales_others],['EU', total_sales_eu]]
df = pd.DataFrame(data, columns = ['Name', 'Sales'])
plt.figure(figsize=(24,12))
plt.title("Regional Market Share (2000 to 2016)")
plt.pie(df.Sales, labels=df.Name, autopct='%1.1f%%', startangle=180);
plt.legend(loc = 2,fontsize = 10, bbox_to_anchor = (1, 1), ncol = 2)
<matplotlib.legend.Legend at 0x7fc7ea8b6070>
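The market shares shown in the pie can be worked out as each region's total divided by the sum of all regional totals. Below is a sketch with made-up sales figures (toy_df is hypothetical). Only the column names follow the dataset.

```python
import pandas as pd

# Made-up regional sales for two games, in the same columns as the dataset.
toy_df = pd.DataFrame({
    'NA_Sales': [4.0, 1.0],
    'EU_Sales': [2.0, 1.0],
    'JP_Sales': [1.0, 0.0],
    'Other_Sales': [0.5, 0.5],
})

region_columns = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']

# Total per region, then each region's share of the grand total in percent.
region_totals = toy_df[region_columns].sum()
region_share = region_totals / region_totals.sum() * 100
print(region_share)
```

Summing the four columns in one step avoids creating a separate variable per region.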
After looking at the top 10 genre chart, we can see that Action is the most popular genre. But let's check the top genres in North America first, then compare them with other regions.
# sort_values sorts the DataFrame by the given column; ascending=False puts the largest values first
# head(n) returns the first n rows
# we take the top 1000 games by NA_Sales to find the share of each genre among them
top_1000_us = game_sales_df.sort_values('NA_Sales',ascending = False).head(1000)
top_1000_us
# value_counts : return a Series containing counts of unique values
top_1000_us_genre = top_1000_us.Genre.value_counts()
plt.figure(figsize=(24,12))
plt.title("Genres of the Top 1000 Games by NA Sales")
plt.pie(top_1000_us_genre, labels=top_1000_us_genre.index, autopct='%1.1f%%', startangle=180);
plt.legend(loc = 2,fontsize = 10, bbox_to_anchor = (1, 1), ncol = 2)
<matplotlib.legend.Legend at 0x7fc7eb265490>
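As an alternative to sort_values followed by head, pandas also offers nlargest, which picks the top rows by a column in one call. Here is a sketch on a made-up frame (toy_df and its values are hypothetical).

```python
import pandas as pd

# Made-up games with North American sales.
toy_df = pd.DataFrame({
    'Name': ['Game A', 'Game B', 'Game C', 'Game D'],
    'Genre': ['Action', 'Shooter', 'Action', 'Sports'],
    'NA_Sales': [3.0, 5.0, 1.0, 4.0],
})

# Top 3 rows by NA_Sales, largest first.
top_3 = toy_df.nlargest(3, 'NA_Sales')

# Genre counts among those top rows.
top_3_genres = top_3['Genre'].value_counts()
print(top_3)
print(top_3_genres)
```

Both approaches give the same rows when there are no ties, so this is mostly a matter of readability.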
Looking at the chart, Action and Shooter appear to be the most popular genres in North America. So for a better chance of success there, a new game could combine Action and Shooter, like Overwatch!
First, we should find out who the top publisher in Japan is. Then we can compute the genre breakdown of their best-selling games and draw a chart, which gives a clearer view of the answer.
# sum only the numeric columns for each publisher
top_publishers = game_sales_df.groupby('Publisher').sum(numeric_only = True)
top_publishers_jp = top_publishers.sort_values('JP_Sales',ascending = False).head(10)
top_publishers_jp
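If only the Japanese sales matter for the ranking, the column can be selected before aggregating, so nothing else gets summed. This is a minimal sketch with made-up publishers and numbers (toy_df is hypothetical).

```python
import pandas as pd

# Made-up publishers and Japanese sales figures.
toy_df = pd.DataFrame({
    'Publisher': ['Pub X', 'Pub Y', 'Pub X', 'Pub Z'],
    'JP_Sales': [2.0, 1.5, 1.0, 0.5],
})

# Select JP_Sales first, sum it per publisher, then rank from largest to smallest.
jp_sales_by_publisher = (
    toy_df.groupby('Publisher')['JP_Sales']
    .sum()
    .sort_values(ascending=False)
)
print(jp_sales_by_publisher.head(10))
```

The result is a Series indexed by publisher, which can be passed straight to a bar plot.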
So the top publisher in Japan is Nintendo, with about 457 million copies sold. Next, let's see what their best sellers are.
top_games_nintendo = game_sales_df.loc[game_sales_df['Publisher'] == 'Nintendo'].sort_values('JP_Sales',ascending = False).head(10)
top_games_nintendo
Nintendo's best-selling game in Japan is Pokemon Red/Pokemon Blue, which sold 10.22 million copies there.
top_genre_nintendo = top_games_nintendo.Genre.value_counts()
plt.figure(figsize=(24,12))
plt.title("Genres of Nintendo's Top 10 Games in Japan")
plt.pie(top_genre_nintendo, labels=top_genre_nintendo.index, autopct='%1.1f%%', startangle=180);
plt.legend(loc = 2,fontsize = 10, bbox_to_anchor = (1, 1), ncol = 2)
<matplotlib.legend.Legend at 0x7fc7eaa31a60>
Their focus seems to be Role-Playing (the Pokemon series) and Platform (Mario).
top_game_2008 = game_sales_df.loc[game_sales_df['Year_of_Release'] == 2008].sort_values('Global_Sales',ascending = False).head(1)
top_game_2008
top_game_2015 = game_sales_df.loc[game_sales_df['Year_of_Release'] == 2015].sort_values('Global_Sales',ascending = False).head(1)
top_game_2015
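The same lookup can be done for every year at once instead of one year at a time. The sketch below uses groupby with idxmax on a made-up frame (toy_df and its values are hypothetical) to pull the best-selling row of each year.

```python
import pandas as pd

# Made-up games across two years.
toy_df = pd.DataFrame({
    'Name': ['Game A', 'Game B', 'Game C', 'Game D'],
    'Genre': ['Racing', 'Sports', 'Shooter', 'Action'],
    'Year_of_Release': [2008, 2008, 2015, 2015],
    'Global_Sales': [35.0, 20.0, 14.0, 9.0],
})

# idxmax returns, for each year, the row label of the highest Global_Sales.
top_row_index = toy_df.groupby('Year_of_Release')['Global_Sales'].idxmax()

# Use those labels to fetch the full rows.
top_game_per_year = toy_df.loc[top_row_index]
print(top_game_per_year)
```

This makes it easy to scan how the genre of the yearly best seller changes over time.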
So we now have two different genres: Racing and Shooter. Let's get all the games released between 2008 and 2015.
games_list = game_sales_df[(game_sales_df['Year_of_Release'] >= 2008) & (game_sales_df['Year_of_Release'] <= 2015)]
# sum only the numeric columns for each genre and year
games_list = games_list.groupby(['Genre', 'Year_of_Release'], as_index = False).sum(numeric_only = True)
games_list
racing_games_list = games_list.loc[games_list['Genre'] == 'Racing']
x = racing_games_list['Year_of_Release']
y = racing_games_list['Global_Sales']
plt.figure(figsize=(20,10), dpi= 60)
plt.plot(x, y, label = 'Sales', color = 'green')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.title('Racing Game Sales Trend')
plt.legend()
plt.show()
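To compare Racing with Shooter side by side, one option is to reshape the grouped data so that each genre becomes a column. Here is a sketch using pivot_table on made-up yearly totals (yearly_genre_sales and its numbers are hypothetical, not results from the dataset).

```python
import pandas as pd

# Made-up yearly totals per genre, shaped like the grouped games_list.
yearly_genre_sales = pd.DataFrame({
    'Genre': ['Racing', 'Racing', 'Shooter', 'Shooter'],
    'Year_of_Release': [2008, 2015, 2008, 2015],
    'Global_Sales': [7.0, 1.0, 6.0, 6.5],
})

# One row per year, one column per genre, values are summed Global_Sales.
trend_table = yearly_genre_sales.pivot_table(
    index='Year_of_Release',
    columns='Genre',
    values='Global_Sales',
    aggfunc='sum',
)
print(trend_table)
```

Calling trend_table.plot() on a table like this would draw one line per genre on the same axes, which makes the two trends easier to compare.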