Jovian
⭐️
Sign In
In [93]:
### import numpy as np
import pandas as pd
import seaborn as sn
import ast, json
import matplotlib.pyplot as plt
import numpy as np

from datetime import datetime

% matplotlib inline
In [94]:
credits_df = pd.read_csv('C:/Users/manju.thumminakatti/Downloads/tmdb-5000-movie-dataset/tmdb_5000_credits.csv')
movies_df = pd.read_csv('C:/Users/manju.thumminakatti/Downloads/tmdb-5000-movie-dataset/tmdb_5000_movies.csv')
In [95]:
movies_df.head()
Out[95]:
In [96]:
credits_df.head()
Out[96]:
In [97]:
merged_df = movies_df.merge(credits_df, left_on='id', right_on='movie_id')
merged_df.head()
Out[97]:
In [98]:
merged_df.corr()
Out[98]:
In [99]:
correlation_merged_df = merged_df.corr()
In [100]:
type(correlation_merged_df)
Out[100]:
pandas.core.frame.DataFrame
In [101]:
correlation_merged_df.replace(1,np.nan, inplace=True)
In [102]:
correlation_merged_df.max()
Out[102]:
budget          0.730823
id              0.031202
popularity      0.778130
revenue         0.781487
runtime         0.375046
vote_average    0.375046
vote_count      0.781487
movie_id        0.031202
dtype: float64
In [103]:
correlation_merged_df.min()
Out[103]:
budget         -0.089377
id             -0.270595
popularity      0.031202
revenue        -0.050425
runtime        -0.153536
vote_average   -0.270595
vote_count     -0.004128
movie_id       -0.270595
dtype: float64
In [104]:
merged_df.corr().replace(1, np.nan).unstack().sort_values(ascending = False, kind = "quicksort").head()
Out[104]:
revenue     vote_count    0.781487
vote_count  revenue       0.781487
            popularity    0.778130
popularity  vote_count    0.778130
budget      revenue       0.730823
dtype: float64
In [105]:
merged_df.corr()
Out[105]:
In [106]:
correlation_merged_df.max().max()
Out[106]:
0.7814873262102175
In [107]:
df = correlation_merged_df.unstack()
In [108]:
df.sort_values(ascending=False, kind = "quicksort")
Out[108]:
revenue       vote_count      0.781487
vote_count    revenue         0.781487
              popularity      0.778130
popularity    vote_count      0.778130
budget        revenue         0.730823
revenue       budget          0.730823
popularity    revenue         0.644724
revenue       popularity      0.644724
budget        vote_count      0.593180
vote_count    budget          0.593180
budget        popularity      0.505414
popularity    budget          0.505414
runtime       vote_average    0.375046
vote_average  runtime         0.375046
              vote_count      0.312997
vote_count    vote_average    0.312997
vote_average  popularity      0.273952
popularity    vote_average    0.273952
vote_count    runtime         0.271944
runtime       vote_count      0.271944
budget        runtime         0.269851
runtime       budget          0.269851
              revenue         0.251093
revenue       runtime         0.251093
popularity    runtime         0.225502
runtime       popularity      0.225502
vote_average  revenue         0.197150
revenue       vote_average    0.197150
vote_average  budget          0.093146
budget        vote_average    0.093146
                                ...   
id            vote_count     -0.004128
vote_count    id             -0.004128
              movie_id       -0.004128
movie_id      vote_count     -0.004128
id            revenue        -0.050425
revenue       movie_id       -0.050425
              id             -0.050425
movie_id      revenue        -0.050425
id            budget         -0.089377
budget        movie_id       -0.089377
movie_id      budget         -0.089377
budget        id             -0.089377
id            runtime        -0.153536
runtime       id             -0.153536
              movie_id       -0.153536
movie_id      runtime        -0.153536
              vote_average   -0.270595
id            vote_average   -0.270595
vote_average  id             -0.270595
              movie_id       -0.270595
budget        budget               NaN
id            id                   NaN
              movie_id             NaN
popularity    popularity           NaN
revenue       revenue              NaN
runtime       runtime              NaN
vote_average  vote_average         NaN
vote_count    vote_count           NaN
movie_id      id                   NaN
              movie_id             NaN
Length: 64, dtype: float64

Data Cleansing Process

In [109]:
merged_df
Out[109]:

1) Removing unused columns

In [110]:
del_col_list = ['keywords', 'homepage', 'status', 'tagline', 'original_language', 
                'overview', 'production_companies', 'original_title', 'title_y', 'movie_id']
merged_df = merged_df.drop(del_col_list, axis=1)
merged_df.head()
Out[110]:

2) Removing the duplicacy in the rows (if any)

In [111]:
print(merged_df.shape)
merged_df = merged_df.drop_duplicates(keep='first')
print(merged_df.shape)
(4803, 14) (4803, 14)

3) Handling missing values

In [112]:
# replacing all the zeros from revenue and budget cols.

cols = ['budget', 'revenue']
merged_df[cols] = merged_df[cols].replace(0, np.nan)

# dropping all the rows with na in the columns mentioned above in the list.
merged_df.dropna(subset=cols, inplace=True)
merged_df.shape
Out[112]:
(3229, 14)

4) Manipulating the columns to our requirement

In [113]:
print(type(merged_df.release_date[0]))

# Changing the release_date column to DateTime column

merged_df.release_date = pd.to_datetime(merged_df['release_date'])
print(type(merged_df.release_date[0]))

# Extracting release year from release date to answer one of our question

merged_df["release_year"] = merged_df.release_date.dt.year
print(type(merged_df["release_year"][0]))
<class 'str'> <class 'pandas._libs.tslibs.timestamps.Timestamp'> <class 'numpy.int64'>
In [114]:
# Changing the data type of the below mentioned columns and 
change_cols=['budget', 'revenue']

#changing data type
print("Before processing :\n")
print(merged_df.dtypes)
merged_df[change_cols]=merged_df[change_cols].applymap(np.int64)
print("\n\nAfter processing : \n")
print(merged_df.dtypes)
Before processing : budget float64 genres object id int64 popularity float64 production_countries object release_date datetime64[ns] revenue float64 runtime float64 spoken_languages object title_x object vote_average float64 vote_count int64 cast object crew object release_year int64 dtype: object After processing : budget int64 genres object id int64 popularity float64 production_countries object release_date datetime64[ns] revenue int64 runtime float64 spoken_languages object title_x object vote_average float64 vote_count int64 cast object crew object release_year int64 dtype: object
In [115]:
# we see that there are columns which are in json format,
# let's flatten these json data into easyily interpretable lists

def parse_col_json(column, key, data_df):
    """
    Args:
        column: string
            name of the column to be processed.
        key: string
            name of the dictionary key which needs to be extracted
    """
    for index,json_df in zip(data_df.index,data_df[column].apply(json.loads)):
        list1=[]
        for i in range(len(json_df)):
            list1.append((json_df[i][key]))# the key 'name' contains the name of the genre
        data_df.loc[index,column]=str(list1)
    
parse_col_json('cast', 'name', merged_df)
parse_col_json('genres', 'name', merged_df)
parse_col_json('spoken_languages', 'name', merged_df)
parse_col_json('production_countries', 'name', merged_df)

merged_df.head()
Out[115]:

Identifying relationships between variables / features

  • Which are the 5 most expensive movies? Comparison between the extremes? Exploring the most expensive movies, it can tell you if they were even worth the money spent on them based on their performance and revenue generated.
In [116]:
# Answer to question #1.
# To find out the most expensive movies, we need to look at the budget set for them which is an indicator of expense.

expensive_movies_df = movies_df.sort_values(by ='budget', ascending=False).head()
expensive_movies_df
Out[116]:
In [117]:
# A generic method to find min and max of a column
def find_min_max_in(col, dataframe):
    """
    The function takes in a column and returns the top
    and bottom movies dataframe in that column.
    
    args:
        col: string - column name
    return:
        info_df: dataframe - 
    """
    
    top = dataframe[col].idxmax()
    top_df = pd.DataFrame(dataframe.loc[top])
    top
    
    bottom = dataframe[col].idxmin()
    bottom_df = pd.DataFrame(dataframe.loc[bottom])
    
    info_df = pd.concat([top_df, bottom_df], axis=1)
    return info_df

find_min_max_in('budget', merged_df)
Out[117]:
In [118]:
merged_df['budget'].sort_values().head()
Out[118]:
4238    1
3611    4
3372    7
3419    7
4608    8
Name: budget, dtype: int64
In [119]:
merged_df['budget'].sort_values(ascending = False).head()
Out[119]:
17    380000000
1     300000000
7     280000000
10    270000000
4     260000000
Name: budget, dtype: int64
In [120]:
# A generic method to find n number of min and max of a column
def find_n_min_max(col, dataframe, count):
    """
    The function takes in a column and returns the n number of top
    and n number of bottom movies dataframe in that column.
    
    args:
        col: string - column name
    return:
        info_df: dataframe - 
    """
    min_df = pd.DataFrame()
    max_df = pd.DataFrame()
    min_df['min_' + col] = dataframe[col].sort_values().head(count)
    max_df['max_' + col] = dataframe[col].sort_values(ascending = False).head(count)
    
    info_df = pd.concat([min_df, max_df],sort=True)
    
    return info_df
#     return min_df
#     return max_df

find_n_min_max('budget', merged_df, 5)
Out[120]:
  • Top 5 most profitable movies? Comparison between the min and max profits.
In [121]:
# Method to give n max diff between two columns
def max_diff(min_column, max_column, dataframe, count):
    max_diff = 0
    temp_df = dataframe
    max_diff_df = pd.DataFrame()
    
    for n in range(count):
        for i in temp_df.index:
            diff = temp_df[max_column][i] - temp_df[min_column][i]
            if (diff > max_diff):
                max_diff = diff
                max_diff_x = temp_df.loc[i]
                max_loc = i
        
        max_diff_df = pd.concat([max_diff_df, pd.DataFrame(max_diff_x)], axis = 1)
        temp_df = temp_df.drop([max_loc], axis = 0)
        print("maximum diff is ", max_diff)
        max_diff = 0
        diff = 0
        
    return max_diff_df

max_profit_movie = max_diff('budget', 'revenue', merged_df, 5)
print(merged_df.shape)
maximum diff is 2550965087 maximum diff is 1645034188 maximum diff is 1363528810 maximum diff is 1316249360 maximum diff is 1299557910 (3229, 15)
In [122]:
max_profit_movie
Out[122]:
In [123]:
merged_df['profit'] = merged_df.revenue - merged_df.budget
In [124]:
merged_df.sort_values(by = ['profit'], ascending=False).head(5)
Out[124]:
  • Most Talked about movies
In [125]:
# to find the most talked about movies, we can sort the dataframe on the popularity column
popular_movies_df = movies_df.sort_values(by ='popularity', ascending=False)
popular_movies_df.head()
Out[125]:
  • Average runtime of movies? Comparison between the extremes.
In [126]:
# in terms of runtime
# Average runtime of movies
merged_df['runtime'].mean()
# comparison
find_min_max_in('runtime', merged_df)
Out[126]:
  • Movies which are rated above 7
In [127]:
merged_df[merged_df.vote_average >= 7]
Out[127]:
  • Which year did we have the most profitable movies?
In [128]:
profit_year = merged_df.groupby('release_year').profit.sum()
In [129]:
profit_year.sort_values(ascending=False).head()
Out[129]:
release_year
2014    17029736072
2012    16665370551
2015    16082841939
2013    15191240622
2009    13798015000
Name: profit, dtype: int64
In [130]:
#figure size(width, height)
plt.figure(figsize=(12,6), dpi = 130)

#on x-axis
plt.xlabel('Release Year of Movies in the data set', fontsize = 12)
#on y-axis
plt.ylabel('Profits earned by Movies', fontsize = 12)
#title of the line plot
plt.title('Representing Total Profits earned by all movies Vs Year of their release.')

#plotting the graph
plt.plot(profit_year)

#displaying the line plot
plt.show()
Notebook Image
In [131]:
# Most profitable year from the given dataset.

profit_year.idxmax()
Out[131]:
2014

Following is the list of questions which can lead us in the desired direction: That is - “What is the recipe to make a blockbuster, profitable movie?”

  • Most successful genres — bar plot explaining the frequency of movies in each genre.
  • Frequent Cast in movies — plot showing the spectrum of movies for each actor.
  • Average budget of the profitable movies — what is the average budget allocated.
  • Average duration of the most profitable movies.
  • Language of the most profitable movies.
  • Setting a criteria for successful movie : profit figure of 50 Million dollars
In [132]:
#selecting the movies having profit $50M or more
profit_data = merged_df[merged_df['profit'] >= 50000000]

#reindexing new data
profit_data.index = range(1, len(profit_data)+1)

In [133]:
profit_data.head()
Out[133]:
  • Most successful genre
In [134]:
# formatting the data in the genres columns.
profit_data['genres'] = profit_data['genres'].str.strip('[]').str.replace(' ','').str.replace("'",'')
profit_data['genres'] = profit_data['genres'].str.split(',')
C:\Users\manju.thumminakatti\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy C:\Users\manju.thumminakatti\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy This is separate from the ipykernel package so we can avoid doing imports until
In [135]:
type(profit_data['genres'])
Out[135]:
pandas.core.series.Series
In [136]:
plt.subplots(figsize=(12,10))
list1=[]

# extending the list of genres to collect all the genres of all the profitable movies
for i in profit_data['genres']:
    list1.extend(i)
    

genre_count_series = pd.Series(list1).value_counts()[:10].sort_values(ascending=True)

ax = genre_count_series.plot.barh(
    width=0.9,
    color=sn.color_palette('summer_r',10))

ax
Out[136]:
<matplotlib.axes._subplots.AxesSubplot at 0x20de4753978>
Notebook Image
In [137]:
# formatting the data in the cast columns.
profit_data['cast'] = profit_data['cast'].str.strip('[]').str.replace(' ','').str.replace("'",'')
profit_data['cast'] = profit_data['cast'].str.split(',')

plt.subplots(figsize=(12,10))
list1=[]

# extending the list of casting to collect all the genres of all the profitable movies
for i in profit_data['cast']:
    list1.extend(i)
    

cast_count_series = pd.Series(list1).value_counts()[:10].sort_values(ascending=True)

ax = cast_count_series.plot.barh(
    width=0.9,
    color=sn.color_palette('summer_r',10))

ax
C:\Users\manju.thumminakatti\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy C:\Users\manju.thumminakatti\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy This is separate from the ipykernel package so we can avoid doing imports until
Out[137]:
<matplotlib.axes._subplots.AxesSubplot at 0x20de43d95f8>
Notebook Image
In [138]:
# formatting the data in the spoken_languages columns.
profit_data['spoken_languages'] = profit_data['spoken_languages'].str.strip('[]').str.replace(' ','').str.replace("'",'')
profit_data['spoken_languages'] = profit_data['spoken_languages'].str.split(',')

plt.subplots(figsize=(12,10))
list1=[]

# extending the list of casting to collect all the genres of all the profitable movies
for i in profit_data['spoken_languages']:
    list1.extend(i)
    

spoken_languages_count_series = pd.Series(list1).value_counts()[:10].sort_values(ascending=True)

ax = spoken_languages_count_series.plot.barh(
    width=0.9,
    color=sn.color_palette('summer_r',10))

ax
C:\Users\manju.thumminakatti\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy C:\Users\manju.thumminakatti\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy This is separate from the ipykernel package so we can avoid doing imports until
Out[138]:
<matplotlib.axes._subplots.AxesSubplot at 0x20dde3bdef0>
Notebook Image
In [139]:
profit_data.runtime.mean()
Out[139]:
114.37928286852589
In [140]:
profit_data.revenue.mean()
Out[140]:
262022090.17768925
In [141]:
profit_data.budget.mean()
Out[141]:
63032056.92111554

Hypothesis

This was a very interesting data analysis. We came out with some very interesting facts about movies. After this analysis we can conclude following: For a Movie to be profitable for the producer: Average Budget must be around 63 millon dollar Average duration of the movie must be 114 minutes Any one of these should be in the cast : Samuel Jackson, Robert De Neiro, Morgan Freeman, Bruce Willis Genre must be : Action, Adventure, Thriller, Comedy, Drama. By doing all this the movie might be one of the hits and hence can earn an average revenue of around 262 million dollar.

In [190]:
import jovian
In [ ]:
jovian.commit()
[jovian] Saving notebook..
In [ ]: