TMDB Box Office Prediction

I picked a topic from the recommended datasets list given in the project link.

It is the TMDB competition on Kaggle, which has data for movies released from the 1920s to 2017.

I used linear regression, random forests and XGBoost for this problem, and took inspiration from many of the notebooks on the competition's code page.

The details of the data, according to the Kaggle page, are given below:

In this dataset, you are provided with 7398 movies and a variety of metadata obtained from The Movie Database (TMDB). Movies are labeled with id. Data points include cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries.

You are predicting the worldwide revenue for 4398 movies in the test file.

Note - many movies are remade over the years, therefore it may seem like multiple instance of a movie may appear in the data, however they are different and should be considered separate movies. In addition, some movies may share a title, but be entirely unrelated.

E.g. The Karate Kid (id: 5266) was released in 1986, while a clearly (or maybe just subjectively) inferior remake (id: 1987) was released in 2010. Also, while the Frozen (id: 5295) released by Disney in 2013 may be the household name, don't forget about the less-popular Frozen (id: 139) released three years earlier about skiers who are stranded on a chairlift...

Acknowledgements
The Movie Database

This dataset has been collected from TMDB. The movie details, credits and keywords have been collected from the TMDB Open API. This competition uses the TMDB API but is not endorsed or certified by TMDB. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows. You can try it for yourself here.
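
The note in the description about remakes and shared titles means a title cannot be used as a key, while the id can. A minimal sketch of that check follows, using a toy frame built only from the four movies named above; the column names `id` and `title` are assumptions here, and a real check would run against the competition files instead.

```python
import pandas as pd

# Toy rows: the ids and titles are the ones quoted in the description.
# The column names "id" and "title" are assumed for illustration.
movies = pd.DataFrame(
    {
        "id": [5266, 1987, 5295, 139],
        "title": ["The Karate Kid", "The Karate Kid", "Frozen", "Frozen"],
    }
)

# Titles repeat, so they cannot identify a movie on their own.
shared_titles = movies[movies.duplicated(subset="title", keep=False)]

# Every id is distinct, so each row is a separate movie.
ids_are_unique = movies["id"].is_unique

print(shared_titles.sort_values("title"))
print("ids unique:", ids_are_unique)
```

With `keep=False`, `duplicated` flags every row that shares a title rather than only the later ones, which makes it easier to inspect both versions of a remake side by side.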

Importing libraries

!pip install wordcloud opendatasets scikit-learn --upgrade --quiet

# Standard library
import ast
import datetime  # the module; use datetime.datetime for the class
import os
from collections import Counter

# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# Dataset download and notebook hosting
import opendatasets as od
import jovian

%matplotlib inline
sns.set_style('darkgrid')

# Show wide frames in full when inspecting the data
pd.set_option("display.max_columns", 120)
pd.set_option("display.max_rows", 120)

Downloading the data