
Analysis on Netflix Movies & TV Shows

Netflix is a popular service that people across the world use for entertainment. In this EDA, I will explore the netflix-shows dataset through visualizations and graphs using matplotlib and seaborn.

First, we will install and import necessary packages.

In [1]:
!pip install jovian --upgrade --quiet
In [2]:
import jovian
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import matplotlib
In [3]:
# jovian.commit(files=['../input/netflix-shows/netflix_titles.csv'], project='netflix-movies-and-tv-shows-project')

Now we are ready to load the dataset. We will do this using the standard read_csv function from Pandas. Let's take a glimpse at what the data looks like.

In [4]:
netflix_titles_df = pd.read_csv('../input/netflix-shows/netflix_titles.csv')

After a quick glimpse at the dataset, it looks like a typical movies/shows dataset without user ratings. We can also see that there are NaN values in some columns.

Data Preparation and Cleaning

In [5]:
netflix_titles_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6234 entries, 0 to 6233
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   show_id       6234 non-null   int64
 1   type          6234 non-null   object
 2   title         6234 non-null   object
 3   director      4265 non-null   object
 4   cast          5664 non-null   object
 5   country       5758 non-null   object
 6   date_added    6223 non-null   object
 7   release_year  6234 non-null   int64
 8   rating        6224 non-null   object
 9   duration      6234 non-null   object
 10  listed_in     6234 non-null   object
 11  description   6234 non-null   object
dtypes: int64(2), object(10)
memory usage: 584.6+ KB

There are 6,234 entries and 12 columns to work with for EDA. Right off the bat, there are a few columns that contain null values ('director', 'cast', 'country', 'date_added', 'rating').

In [6]:
netflix_titles_df.nunique()
show_id         6234
type               2
title           6172
director        3301
cast            5469
country          554
date_added      1524
release_year      72
rating            14
duration         201
listed_in        461
description     6226
dtype: int64

We can see that some of the columns contain a lot of unique values. It makes sense that show_id has the most, since it is a unique key used to identify a movie/show. Title, director, cast, country, date_added, listed_in, and description contain many unique values as well.

In [13]:
sns.heatmap(netflix_titles_df.isnull(), cbar=False)
In [15]:
netflix_titles_df.isnull().sum()
show_id            0
type               0
title              0
director        1969
cast             570
country          476
date_added        11
release_year       0
rating            10
duration           0
listed_in          0
description        0
dtype: int64

Above, we can see that null values exist in the dataset. There are a total of 3,036 null values across the entire dataset with 1,969 missing points under 'director', 570 under 'cast', 476 under 'country', 11 under 'date_added', and 10 under 'rating'. We will have to handle all null data points before we can dive into EDA and modeling.
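The per-column and total null counts quoted above can be reproduced with two chained calls. The following is a minimal sketch on a small made-up frame (`toy_df` and its values are hypothetical stand-ins, not rows from the Netflix data):

```python
import pandas as pd

# Hypothetical toy frame standing in for netflix_titles_df.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "rating": ["PG", "R", None, "TV-MA"],
})

# Number of missing values in each column.
nulls_per_column = toy_df.isnull().sum()

# Grand total of missing values across the whole frame.
total_nulls = int(nulls_per_column.sum())

print(nulls_per_column.to_dict())
print(total_nulls)
```

Applied to the real dataset, the same total is simply the sum of the five per-column counts listed above: 1,969 + 570 + 476 + 11 + 10 = 3,036.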

In [16]:
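# Label missing values as unavailable instead of dropping the rows
netflix_titles_df['director'] = netflix_titles_df['director'].fillna('No Director')
netflix_titles_df['cast'] = netflix_titles_df['cast'].fillna('No Cast')
netflix_titles_df['country'] = netflix_titles_df['country'].fillna('Country Unavailable')
# Drop the few rows that are missing 'date_added' or 'rating'
netflix_titles_df.dropna(subset=['date_added', 'rating'], inplace=True)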
In [17]:
netflix_titles_df.isnull().any()
show_id         False
type            False
title           False
director        False
cast            False
country         False
date_added      False
release_year    False
rating          False
duration        False
listed_in       False
description     False
dtype: bool

The easiest way to get rid of null values would be to delete every row with missing data. However, that would discard a large share of the dataset and hurt our EDA. Since 'director', 'cast', and 'country' contain the majority of the null values, I will instead label each of their missing values as unavailable. The other two columns, 'date_added' and 'rating', are missing in only 11 and 10 rows respectively, so I will drop those rows from the dataset. Afterwards, we can see that there are no more null values in the dataset.
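To make the two strategies concrete, here is a small sketch on made-up data (`toy_df` and `cleaned_df` are hypothetical names): one column is filled with a placeholder label, while rows missing the other column are dropped.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, "Y", None],
    "rating": ["PG", "R", None, "TV-MA"],
})

# Keep rows with a missing director by substituting a placeholder label.
toy_df["director"] = toy_df["director"].fillna("No Director")

# Drop only the rows that lack a rating, then renumber the index.
cleaned_df = toy_df.dropna(subset=["rating"]).reset_index(drop=True)

print(cleaned_df)
```

Filling keeps all four directors' rows available for analysis, whereas dropping removes exactly one row here, which mirrors the trade-off described above.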

Since the dataset contains both movies and shows, it'd be nice to have a separate dataset for each so we can take a deep dive into just Netflix movies or just Netflix TV shows. We will create two new datasets: one for movies and one for shows.

In [18]:
# .copy() gives an independent DataFrame, so later column edits do not touch a slice of the original
netflix_movies_df = netflix_titles_df[netflix_titles_df['type'] == 'Movie'].copy()
In [19]:
netflix_shows_df = netflix_titles_df[netflix_titles_df['type'] == 'TV Show'].copy()

In the duration column, movies and shows use different units: a movie's duration is its runtime in minutes, while a show's is its number of seasons. To make EDA easier, I will convert these values into integers for both the movies and shows datasets.

In [20]:
# Strip the unit text so both columns can be stored as integers
netflix_movies_df = netflix_movies_df.assign(duration=netflix_movies_df['duration'].str.replace(' min', '', regex=False).astype(int))
netflix_shows_df = netflix_shows_df.rename(columns={'duration': 'seasons'})
# The regex handles both '1 Season' and 'N Seasons'
netflix_shows_df['seasons'] = netflix_shows_df['seasons'].str.replace(r' Seasons?$', '', regex=True).astype(int)
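As a quick illustration of the conversion, the sketch below runs the same kind of string cleanup on two small hand-written Series (the sample values are assumptions chosen to match the "N min" and "N Season(s)" patterns described above):

```python
import pandas as pd

# Hypothetical sample values in the two duration formats.
movie_durations = pd.Series(["90 min", "123 min"])
show_durations = pd.Series(["1 Season", "2 Seasons", "10 Seasons"])

# Remove the literal ' min' suffix, then cast to integers.
movie_minutes = movie_durations.str.replace(" min", "", regex=False).astype(int)

# The optional 's' in the pattern covers both 'Season' and 'Seasons'.
show_seasons = show_durations.str.replace(r" Seasons?$", "", regex=True).astype(int)

print(movie_minutes.tolist())
print(show_seasons.tolist())
```

Passing `regex` explicitly avoids depending on the default, which differs between pandas versions.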

Exploratory Analysis and Visualization

First, we will begin the analysis on the entire Netflix dataset consisting of both movies and shows. Revisiting the data, let us see what it looks like again.

In [15]:

It'd be interesting to see the comparison between the total number of movies and shows in this dataset just to get an idea of which one is the majority.

In [16]:
g = sns.countplot(x='type', data=netflix_titles_df, palette="pastel");
plt.title("Count of Movies and TV Shows")
plt.xlabel("Type (Movie/TV Show)")
plt.ylabel("Total Count")
In [29]:
plt.title("% of Netflix Titles that are either Movies or TV Shows")
g = plt.pie(netflix_titles_df.type.value_counts(), explode=(0.025,0.025), labels=netflix_titles_df.type.value_counts().index, colors=['skyblue','navajowhite'],autopct='%1.1f%%', startangle=180);

So there are roughly 4,000+ movies and almost 2,000 shows, with movies being the majority. This makes sense, since a show is an ongoing series made up of many episodes but counts as a single title. If we were to count TV show episodes against movies, TV shows would likely come out as the majority. In terms of titles, however, there are far more movies (68.5%) than TV shows (31.5%).
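The counts and percentages behind the two charts can also be read off directly with `value_counts`. This is a minimal sketch on a made-up column (`toy_types` is hypothetical, not the real 'type' column):

```python
import pandas as pd

# Hypothetical stand-in for the 'type' column.
toy_types = pd.Series(["Movie", "Movie", "TV Show", "Movie"], name="type")

# Absolute number of titles per type.
counts = toy_types.value_counts()

# Share of each type as a percentage, rounded to one decimal place.
shares = (toy_types.value_counts(normalize=True) * 100).round(1)

print(counts.to_dict())
print(shares.to_dict())
```

On the real data, the same two calls on `netflix_titles_df['type']` would give the numbers that the bar chart and pie chart display.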

Now, we will explore the ratings, which are based on the film and TV rating systems. The ratings are ordered by the age of the intended audience, from youngest to oldest. We will not include the ratings 'NR' and 'UR' in the visuals since they stand for not rated and unrated content.
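The same ordering and exclusion can be applied to a plain table of counts. The sketch below uses a made-up ratings column (`toy_ratings` is hypothetical) and the rating order from this analysis; `reindex` is one possible way to impose the order and is not necessarily how the plots do it internally.

```python
import pandas as pd

# Audience-age ordering used in this analysis, youngest to oldest.
rating_order = ["G", "TV-Y", "TV-G", "PG", "TV-Y7", "TV-Y7-FV",
                "TV-PG", "PG-13", "TV-14", "R", "NC-17", "TV-MA"]

# Hypothetical stand-in for the 'rating' column.
toy_ratings = pd.Series(["TV-MA", "NR", "PG", "TV-MA", "UR", "G"])

# Exclude the not rated / unrated labels before counting.
rated = toy_ratings[~toy_ratings.isin(["NR", "UR"])]

# Count each rating and arrange the counts in audience-age order,
# filling ratings that never occur with zero.
counts_in_order = rated.value_counts().reindex(rating_order, fill_value=0)

print(counts_in_order)
```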

In [33]:
order =  ['G', 'TV-Y', 'TV-G', 'PG', 'TV-Y7', 'TV-Y7-FV', 'TV-PG', 'PG-13', 'TV-14', 'R', 'NC-17', 'TV-MA']
g = sns.countplot(x='rating', hue='type', data=netflix_titles_df, order=order, palette="pastel");
plt.title("Ratings for Movies & TV Shows")
plt.ylabel("Total Count")
In [57]:
fig, ax =plt.subplots(1,2, figsize=(19, 5))
g1 = sns.countplot(x='rating', data=netflix_movies_df, order=order, palette="Set2", ax=ax[0]);
g1.set_title("Ratings for Movies")
g1.set_ylabel("Total Count")
g2 = sns.countplot(x='rating', data=netflix_shows_df, order=order, palette="Set2", ax=ax[1]);
g2.set_title("Ratings for TV Shows")
g2.set_ylabel("Total Count")

Overall, there is much more content for a mature audience. For the mature audience, there is much more movie content than TV show content. For the younger audience (under the age of 17), it is the opposite: there are slightly more TV shows than movies.
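A cross-tabulation is one way to check comparisons like these numerically rather than by eye. The sketch below is only an illustration on made-up rows (`toy_df` is hypothetical), so its counts say nothing about the real dataset:

```python
import pandas as pd

# Hypothetical rows; not taken from the Netflix data.
toy_df = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show", "Movie", "TV Show"],
    "rating": ["R", "TV-MA", "TV-Y", "TV-MA", "PG", "TV-Y7"],
})

# Rows are ratings, columns are types, cells are title counts.
rating_by_type = pd.crosstab(toy_df["rating"], toy_df["type"])

print(rating_by_type)
```

Running `pd.crosstab` on the 'rating' and 'type' columns of the full dataset would give the exact counts behind each pair of bars.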

In [61]:
netflix_titles_df['year_added'] = pd.DatetimeIndex(netflix_titles_df['date_added']).year
netflix_movies_df['year_added'] = pd.DatetimeIndex(netflix_movies_df['date_added']).year
netflix_shows_df['year_added'] = pd.DatetimeIndex(netflix_shows_df['date_added']).year

Now we will take a look at the amount of content Netflix has added over the previous years. Since we are interested in when Netflix added each title to its platform, we add a 'year_added' column that holds the year from the 'date_added' column, as shown above.
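Here is a small sketch of the year extraction on hand-written dates. It assumes the dates are written like "September 9, 2019"; that format string is an assumption for this example, and `toy_df` is a hypothetical frame.

```python
import pandas as pd

# Hypothetical dates in an assumed "Month D, YYYY" format.
toy_df = pd.DataFrame({
    "date_added": ["September 9, 2019", "January 1, 2016", "December 15, 2017"],
})

# Parse the strings into datetimes, then keep only the year component.
parsed = pd.to_datetime(toy_df["date_added"], format="%B %d, %Y")
toy_df["year_added"] = parsed.dt.year

print(toy_df)
```

`pd.to_datetime(...).dt.year` is an equivalent alternative to wrapping the column in `pd.DatetimeIndex` and reading `.year`.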

In [124]:
netflix_year = netflix_titles_df['year_added'].value_counts().to_frame()
netflix_year.columns = ['releases']
In [123]:
g = sns.lineplot(data=netflix_year.drop(index=2020), x=netflix_year.drop(index=2020).index, y='releases')
plt.title("Total content added across all years (up to 2019)")

Based on the above timeline, we can see that the popular streaming platform started gaining traction after 2014, and the amount of content added each year has grown tremendously since then. I decided to exclude content added during 2020 since the data does not include a full year's worth of data.
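The yearly totals and the exclusion of the partial year can be sketched as follows, using a short made-up list of years (`year_added` here is a hypothetical Series, not the real column):

```python
import pandas as pd

# Hypothetical 'year_added' values.
year_added = pd.Series([2016, 2019, 2019, 2020, 2018, 2019, 2020])

# Titles added per year, sorted chronologically rather than by count.
releases_per_year = year_added.value_counts().sort_index()

# Remove the incomplete year so it does not look like a sudden decline.
complete_years = releases_per_year.drop(index=2020)

print(complete_years.to_dict())
```

Leaving a partial year in a timeline like this would show a drop that reflects the data cutoff rather than a real change.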

In [133]:
# Count titles added per year for each type (long-form data), excluding the partial year 2020
yearly_by_type = netflix_titles_df[netflix_titles_df.year_added != 2020].groupby(['year_added', 'type']).size().reset_index(name='releases')
g = sns.lineplot(data=yearly_by_type, x='year_added', y='releases', hue='type')
In [58]:
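# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
g = sns.histplot(netflix_movies_df.duration, kde=True, stat='density', color='skyblue');
plt.title("Duration Distribution for Netflix Movies")
plt.ylabel("Density")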
plt.xlabel("Duration (minutes)")
In [24]:
filtered_countries = netflix_titles_df.set_index('title').country.str.split(', ', expand=True).stack().reset_index(level=1, drop=True);
In [25]:
filtered_countries = filtered_countries[filtered_countries != 'Country Unavailable']
In [26]:
g = sns.countplot(y=filtered_countries, order=filtered_countries.value_counts().index[:20])
plt.title('Top 20 Countries on Netflix')
plt.ylabel('Country')
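A title can list several countries in one comma-separated string, so each string has to be split before counting. The sketch below shows the idea on made-up rows (`toy_df` is hypothetical). It uses `explode`, which is a swapped-in alternative to the `expand=True` plus `stack` chain used in the notebook.

```python
import pandas as pd

# Hypothetical rows with comma-separated country lists.
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "country": ["United States, India", "India", "Country Unavailable"],
})

# Split each string into a list, then give every country its own row.
countries = toy_df.set_index("title")["country"].str.split(", ").explode()

# Ignore the placeholder used for missing countries.
countries = countries[countries != "Country Unavailable"]

# Number of titles associated with each country.
country_counts = countries.value_counts()

print(country_counts.to_dict())
```

A co-production is counted once for every country it lists, so the totals can exceed the number of titles.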
In [27]:
filtered_genres = netflix_titles_df.set_index('title').listed_in.str.split(', ', expand=True).stack().reset_index(level=1, drop=True);
In [28]:
g = sns.countplot(y=filtered_genres, order=filtered_genres.value_counts().index[:20])
plt.title('Top 20 Genres on Netflix')
plt.ylabel('Genres')
In [30]:
g = sns.countplot(x=netflix_shows_df.seasons, color='skyblue');
plt.title("Netflix TV Shows Seasons")

Asking and Answering Questions

Inferences and Conclusion
