Learn data science and machine learning by building real-world projects on Jovian

Netflix, IMDb Exploratory Data Analysis.

After searching through tens of datasets from kaggle, i stumbled accross this particular data which caught my fancy. Maybe it was due to love for movies that made me download this data.

In this EDA project, i am going analyse this data using several important python libraies such as pandas for reading, cleaning and mainipulation. Sorted_dataframe for sorting out days and months while plotly,seaborn and matplotlib for visualization of data.

Data Reading

In [5]:
import pandas as pd
In [6]:
df= pd.read_csv('netflix-rotten-tomatoes-metacritic-imdb.csv', encoding='utf-8')
df
Out[6]:
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 15480 entries, 0 to 15479 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Title 15480 non-null object 1 Genre 13770 non-null object 2 Tags 15413 non-null object 3 Languages 13545 non-null object 4 Series or Movie 15480 non-null object 5 Hidden Gem Score 13379 non-null float64 6 Country Availability 15461 non-null object 7 Runtime 15479 non-null object 8 Director 10772 non-null object 9 Writer 11150 non-null object 10 Actors 13555 non-null object 11 View Rating 8456 non-null object 12 IMDb Score 13381 non-null float64 13 Rotten Tomatoes Score 6382 non-null float64 14 Metacritic Score 4336 non-null float64 15 Awards Received 6075 non-null float64 16 Awards Nominated For 7661 non-null float64 17 Boxoffice 4007 non-null object 18 Release Date 13373 non-null object 19 Netflix Release Date 15480 non-null object 20 Production House 5149 non-null object 21 Netflix Link 15480 non-null object 22 IMDb Link 13177 non-null object 23 Summary 15471 non-null object 24 IMDb Votes 13379 non-null float64 25 Image 15480 non-null object 26 Poster 11842 non-null object 27 TMDb Trailer 7194 non-null object 28 Trailer Site 7194 non-null object dtypes: float64(7), object(22) memory usage: 3.4+ MB
In [8]:
# This shows the data they are about 15480 data row entries.With the row index ranging from 0 to 15479 and a total of 29 columns
In [9]:
df.describe()
Out[9]:
In [10]:
# check the columns of the data
df.columns
Out[10]:
Index(['Title', 'Genre', 'Tags', 'Languages', 'Series or Movie',
       'Hidden Gem Score', 'Country Availability', 'Runtime', 'Director',
       'Writer', 'Actors', 'View Rating', 'IMDb Score',
       'Rotten Tomatoes Score', 'Metacritic Score', 'Awards Received',
       'Awards Nominated For', 'Boxoffice', 'Release Date',
       'Netflix Release Date', 'Production House', 'Netflix Link', 'IMDb Link',
       'Summary', 'IMDb Votes', 'Image', 'Poster', 'TMDb Trailer',
       'Trailer Site'],
      dtype='object')
In [11]:
# Remove the columns that is not needed
updated_df= df.drop(['Trailer Site','TMDb Trailer','Poster','Image','Summary','IMDb Link','Netflix Link','Release Date','Hidden Gem Score','Awards Received','Awards Nominated For','Metacritic Score','Boxoffice','Rotten Tomatoes Score',"Production House"],axis=1)
In [12]:
updated_df.drop_duplicates(subset='Title',keep='first', inplace=True)
In [13]:
updated_df
Out[13]:
In [14]:
# check the number missing values available in each column
updated_df.isnull().sum()

Out[14]:
Title                      0
Genre                   1706
Tags                      66
Languages               1926
Series or Movie            0
Country Availability      18
Runtime                    1
Director                4612
Writer                  4258
Actors                  1919
View Rating             6914
IMDb Score              2083
Netflix Release Date       0
IMDb Votes              2085
dtype: int64
In [15]:
# remove the nan values in Runtime and country availability column
updated_df = updated_df[updated_df['Runtime'].notnull()]
updated_df=updated_df[updated_df['Tags'].notnull()]

In [16]:
updated_df["Genre"].fillna(value="others", inplace=True)
updated_df["Country Availability"].fillna(value="United States", inplace=True)
updated_df["Languages"].fillna(value="others",inplace=True)
updated_df["IMDb Score"].fillna(value=0,inplace=True)
updated_df["IMDb Votes"].fillna(value=0,inplace=True)
updated_df["Actors"].fillna(value="others",inplace=True)
updated_df["Director"].fillna(value="0",inplace=True)
updated_df['Writer'].fillna(value='others', inplace=True)
updated_df['View Rating'].fillna(value='Unrated', inplace=True)


In [17]:
# check the number missing values available in each column
updated_df.isnull().sum()

Out[17]:
Title                   0
Genre                   0
Tags                    0
Languages               0
Series or Movie         0
Country Availability    0
Runtime                 0
Director                0
Writer                  0
Actors                  0
View Rating             0
IMDb Score              0
Netflix Release Date    0
IMDb Votes              0
dtype: int64
Handling inconsistent enteries
In [18]:
# remove the white-spaces using the replace finction
updated_df['Genre']= updated_df["Genre"].str.replace(" ","")
updated_df['Tags']= updated_df["Tags"].str.replace(" ","")

Handling of date

first, convert both Netflix release date and Release date to dattime data type and then parse them into their respective columns

In [19]:
# covert to datetime datatype
updated_df['Netflix Release Date']= pd.to_datetime(updated_df['Netflix Release Date'])
In [20]:
updated_df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 15004 entries, 0 to 15479 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Title 15004 non-null object 1 Genre 15004 non-null object 2 Tags 15004 non-null object 3 Languages 15004 non-null object 4 Series or Movie 15004 non-null object 5 Country Availability 15004 non-null object 6 Runtime 15004 non-null object 7 Director 15004 non-null object 8 Writer 15004 non-null object 9 Actors 15004 non-null object 10 View Rating 15004 non-null object 11 IMDb Score 15004 non-null float64 12 Netflix Release Date 15004 non-null datetime64[ns] 13 IMDb Votes 15004 non-null float64 dtypes: datetime64[ns](1), float64(2), object(11) memory usage: 1.7+ MB
In [21]:
# parse the netflix release date
updated_df['Year']= updated_df['Netflix Release Date'].dt.year
updated_df['Month']= updated_df['Netflix Release Date'].dt.month
updated_df['Month_name']= updated_df['Netflix Release Date'].dt.month_name()
updated_df['day']= updated_df['Netflix Release Date'].dt.day
updated_df['weekday']= updated_df['Netflix Release Date'].dt.day_name()
updated_df['weekofyear']= updated_df['Netflix Release Date'].dt.isocalendar().week
In [22]:
updated_df
Out[22]:
In [26]:
!pip install plotly
Collecting plotly Downloading plotly-4.14.3-py2.py3-none-any.whl (13.2 MB) |████████████████████████████████| 13.2 MB 6.2 MB/s eta 0:00:01 Requirement already satisfied: six in /opt/conda/lib/python3.8/site-packages (from plotly) (1.15.0) Collecting retrying>=1.3.3 Downloading retrying-1.3.3.tar.gz (10 kB) Building wheels for collected packages: retrying Building wheel for retrying (setup.py) ... done Created wheel for retrying: filename=retrying-1.3.3-py3-none-any.whl size=11429 sha256=da566502cbc9b235d7c7d26993adc31f8ea0e07f8e0cf47b199812425216629e Stored in directory: /home/jovyan/.cache/pip/wheels/c4/a7/48/0a434133f6d56e878ca511c0e6c38326907c0792f67b476e56 Successfully built retrying Installing collected packages: retrying, plotly Successfully installed plotly-4.14.3 retrying-1.3.3

Data Exploration

In [27]:
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib
import plotly.graph_objs as go
import plotly.graph_objs as go
from plotly.offline import iplot
import plotly.express as px
%matplotlib inline
In [28]:
sns.set_style('darkgrid')
matplotlib.rcParams['figure.figsize']=(12,6)
matplotlib.rcParams['figure.facecolor']="#00000000"
matplotlib.rcParams['font.size']= 14
matplotlib.rcParams['font.cursive']="Felipa"

Genre

In [30]:
def split_multicolumn(col_series):
    result_df = col_series.to_frame()
    options = []
    for idx, value in col_series[col_series.notnull()].iteritems():
        for option in value.split(','):
            if not option in result_df.columns:
                options.append(option)
                result_df[option] = False
                
            result_df.at[idx, option] = True
            
    return result_df[options]
In [31]:
split_genre= split_multicolumn(updated_df.Genre)
split_genre
Out[31]:
In [32]:
genre_count = (split_genre.mean().sort_values(ascending=False)*100).round(2)
genre_count
Out[32]:
Drama          40.80
Comedy         32.97
Action         17.90
Thriller       17.28
Romance        15.72
Crime          12.30
Adventure      11.75
others         11.29
Animation      10.74
Fantasy        10.23
Family          9.26
Sci-Fi          7.74
Mystery         7.60
Horror          6.77
Documentary     6.75
Biography       4.09
History         3.39
Music           2.75
Short           2.74
Sport           2.39
War             2.13
Musical         1.47
Reality-TV      1.25
Western         0.71
Game-Show       0.35
Talk-Show       0.17
News            0.13
Adult           0.09
Film-Noir       0.01
dtype: float64
In [33]:
plt.figure(figsize=(15, 8))
plt.xlabel('Percentage[%]')
plt.title("Number of Genre")
plt.ylabel("Genre")
sns.barplot(y=genre_count.index, x=genre_count);
Notebook Image

Tags

In [34]:
split_tags = split_multicolumn(updated_df.Tags)
In [35]:
Tags_count= split_tags.sum().sort_values(ascending=False).head(20)
Tags_count
Out[35]:
TVProgrammes             247
InternationalFilms       120
MoviesbasedonBooks       110
RomanticTVShows           88
CrimeTVShows              77
Dramasbasedonabook        75
TVShows                   73
TVSci-Fi&Fantasy          64
Filmsforages8to10         56
AnimebasedonComics        52
InternationalMovies       50
Moviesforages8to10        49
KidsTV                    48
Filmsforages5to7          47
Dramasbasedonreallife     46
Anime                     45
InternationalDramas       45
TVShowsbasedonBooks       44
BiographicalMovies        37
Dramas                    37
dtype: int64
In [36]:
plt.figure(figsize=(15, 8))
plt.xlabel('Count')
plt.title("Number of tags")
plt.ylabel("Tags")
sns.barplot(y=Tags_count.index, x=Tags_count, palette='inferno');
Notebook Image

Country Availability

In [96]:
split_Country_availability = split_multicolumn(updated_df['Country Availability'])
In [97]:
Country_availability_counts=split_Country_availability.sum().sort_values(ascending=False)
In [98]:
plt.figure(figsize=(15, 8))
plt.title("Countries availability")
plt.ylabel("Countries")
plt.xlabel("counts")
sns.barplot(y=Country_availability_counts.index, x=Country_availability_counts, palette='crest');
Notebook Image

Movies or Series

In [39]:
movies_series = updated_df['Series or Movie'].value_counts()
movies_series
Out[39]:
Movie     11357
Series     3647
Name: Series or Movie, dtype: int64

View Rating

In [40]:
plt.figure(figsize=(15,8))
plt.title("View Rating")
plt.xticks(rotation=75)
sns.countplot(data=updated_df, x='View Rating');
Notebook Image
In [41]:
updated_df
Out[41]:
In [43]:
runtime_count= updated_df.Runtime.value_counts().sort_values(ascending=False)
runtime_count
Out[43]:
1-2 hour        8870
< 30 minutes    3850
> 2 hrs         1952
30-60 mins       332
Name: Runtime, dtype: int64
In [44]:

plt.title("Content Runtime")
sns.countplot(data=updated_df, x="Runtime");

Notebook Image
In [45]:
def count_rows(rows):
    return len(rows)
In [46]:
def heatmap(col1, col2):
    by_cross = updated_df.groupby([col1, col2]).apply(count_rows)
    plt.figure(figsize=(15,9))
    plt.title("Month when contents are released")
    pivot=by_cross.unstack()
    return sns.heatmap(pivot,annot=True)
In [47]:
heatmap('Month_name', 'Year');
Notebook Image
Alternatively
In [48]:
month_count= updated_df['Month_name'].value_counts().reset_index()
month_count
Out[48]:
In [49]:
month_count.columns=['Month','Count']
In [51]:
!pip install sort_dataframeby-monthorweek
!pip install sorted-months-weekdays
Collecting sort_dataframeby-monthorweek Downloading sort_dataframeby_monthorweek-0.4.tar.gz (2.8 kB) Building wheels for collected packages: sort-dataframeby-monthorweek Building wheel for sort-dataframeby-monthorweek (setup.py) ... done Created wheel for sort-dataframeby-monthorweek: filename=sort_dataframeby_monthorweek-0.4-py3-none-any.whl size=3439 sha256=9a9bd87656476059628c858374bf8da95d89da12143c415df722dc58f8333c53 Stored in directory: /home/jovyan/.cache/pip/wheels/f8/84/15/465a1c7eabab8605e82621b7aa658c99010f3564a43ccd06aa Successfully built sort-dataframeby-monthorweek Installing collected packages: sort-dataframeby-monthorweek Successfully installed sort-dataframeby-monthorweek-0.4 Collecting sorted-months-weekdays Downloading sorted_months_weekdays-0.2.tar.gz (2.7 kB) Building wheels for collected packages: sorted-months-weekdays Building wheel for sorted-months-weekdays (setup.py) ... done Created wheel for sorted-months-weekdays: filename=sorted_months_weekdays-0.2-py3-none-any.whl size=3241 sha256=387470316d80b7fd8bcc4d0abd565119a1ce813fe3740fe5226efca75642f2c9 Stored in directory: /home/jovyan/.cache/pip/wheels/9a/0d/2e/309df76f445367ea1777b73df0feb347b9c8d204ead35fdb10 Successfully built sorted-months-weekdays Installing collected packages: sorted-months-weekdays Successfully installed sorted-months-weekdays-0.2
In [52]:
import sort_dataframeby_monthorweek as sd
In [53]:
sorted_month_count= sd.Sort_Dataframeby_Month(month_count, 'Month')
sorted_month_count
Out[53]:
In [54]:
plt.figure(figsize=(20,10))
plt.title("Month when contents are released")
sns.barplot(data=sorted_month_count, y="Count", x="Month", palette='rocket');
Notebook Image
In [55]:
plt.title("The year contents are released")
sns.countplot(data=updated_df,x='Year', palette='viridis');
Notebook Image
In [56]:
weekday_count= updated_df['weekday'].value_counts().reset_index()
weekday_count
Out[56]:
In [57]:
weekday_count.columns=['Weekday', 'Count']
In [58]:
sorted_weekday_count= sd.Sort_Dataframeby_Weekday(weekday_count,'Weekday')
sorted_weekday_count
Out[58]:
In [59]:
plt.title('Day of the week with most content')
sns.barplot(data=sorted_weekday_count, x='Weekday', y='Count',  palette='gist_stern_r');
Notebook Image
In [60]:
weekofyear_count= updated_df.groupby(['weekofyear','weekday']).size().reset_index()
weekofyear_count
Out[60]:
In [61]:
weekofyear_count.columns=['weekofyear','weekday','count']
weekofyear_count
Out[61]:
In [62]:
plt.figure(figsize=(20,10))
plt.title("The week with the most released content")
sns.barplot(data=weekofyear_count, x="weekofyear", y='count', hue='weekday');
Notebook Image
In [63]:
plt.title("The day with the most released content")
sns.histplot(data=updated_df, x='day', kde=True, bins=31);
Notebook Image

asking questions

Top 20 movies based IMDb votes

In [67]:
movies_df= updated_df[updated_df['Series or Movie']=='Movie']
In [68]:
movies_imdb_votes=movies_df.groupby(['Title'])[['IMDb Votes']].sum().reset_index().sort_values('IMDb Votes',ascending=False).head(20)
movies_imdb_votes
Out[68]:

Top 20 movies based on IMDb scores

In [69]:
movie_imdb_score =movies_df.groupby(['Title'])[['IMDb Score']].sum().reset_index().sort_values('IMDb Score',ascending=False).head(20)
movie_imdb_score
Out[69]:

The movies to look out for on netflix

In [71]:
# merge both top 20 imdb scores and imdb votes
movies_imdb_merge=movies_imdb_votes.merge(movie_imdb_score,on='Title')
movies_imdb_merge
Out[71]:

The series to watch out for on netflix

In [72]:

series_df= updated_df[updated_df['Series or Movie'] =='Series']
series_df
Out[72]:
In [73]:
# Top 20 series based imdb votes
series_imdb_votes=series_df.groupby(['Title'])[['IMDb Votes']].sum().reset_index().sort_values('IMDb Votes',ascending=False).head(10)
series_imdb_votes
Out[73]:
In [74]:
# Top 20 series based on imdb score
series_imdb_score =series_df.groupby(['Title'])[['IMDb Score']].sum().reset_index().sort_values('IMDb Score',ascending=False).head(10)
series_imdb_score
Out[74]:
In [76]:
# merge both dataframe
series_imdb_merge=series_imdb_votes.merge(series_imdb_score, on="Title")
series_imdb_merge
Out[76]:

How good were the movies released each year?

In [77]:
# Number of movies released each year
year_count=updated_df['Year'].value_counts().reset_index()
year_count 
Out[77]:
In [78]:
# rename the column
year_count.columns=['Year', "Count"]
year_count
Out[78]:
In [79]:
# create dataframe with imdb score >=7
imdb_score_greater_than_7=updated_df[updated_df['IMDb Score'] >=7]
imdb_score_greater_than_7
Out[79]:
In [81]:

imdb_year_count =imdb_score_greater_than_7['Year'].value_counts().reset_index()
imdb_year_count
Out[81]:
In [82]:
imdb_year_count.columns=['Year',"Count"]
In [83]:
year_count_merge= year_count.merge(imdb_year_count,on="Year").sort_values('Year',ascending=True)
year_count_merge
Out[83]:
In [84]:
year_count_merge.columns=['Year','Initial_year_count','Imdb_year_count']
year_count_merge
Out[84]:
In [95]:
px.line(year_count_merge,y=['Year','Initial_year_count','Imdb_year_count'], x="Year")
In [92]:
# percentage of movies >= 7 that was released each year
year_count_merge['Year_count_percentage'] =((year_count_merge['Imdb_year_count']/year_count_merge['Initial_year_count'])*100).round(2)
                                                                                                      
In [93]:
year_count_merge
Out[93]:

Inference and Conclusion

In this notebook, i have tried to analyse this data and after my analysis, i reach the following conclusion:

  • Most of the contents on Neflix are Drama
  • Most of the contents are available in the United Kingdom
  • Majority of the content are unrated
  • About 75.5% of the contents are movies while the remaining 24.5% are series
  • Majority of the content have the duration of 1-2 hours
  • Most of the content were released on the month of April.
  • Netflix released the most contents in year 2020.
  • 1st and 14th of every month has proven the day netflix releases most of its content.
  • Most of the contents were released on the Tuesday.
  • Based on IMDb ratings, the movies to watch-out for are: 'The Shawshank Redemption', 'The Dark Knight', 'Pulp Fiction','Fiction.', 'The Lord of the Rings: The Return of theKing', 'The Godfather.
  • The series to watch_out for are :Breaking bad and Sherlock.
  • About 42% of the movies released in 2015 have the IMDb score of 7 and above.This means that year 2015 has the highest percentage of good movies.
In [ ]:
 
In [ ]: