from os import path
import sys
!{sys.executable} -m pip install opendatasets --upgrade --quiet
import opendatasets as od
import pandas as pd
import numpy as np

!{sys.executable} -m pip install matplotlib --upgrade --quiet
import matplotlib.pyplot as plt
!{sys.executable} -m pip install seaborn --upgrade --quiet
import seaborn as sns

from datetime import datetime

import locale
locale.setlocale(locale.LC_ALL, '')

!{sys.executable} -m pip install jovian --upgrade --quiet
import jovian

Dataset

I've chosen to analyse the IMDb movie dataset (https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset).

The dataset consists of 4 files:

  • "movies" - contains information about movies: their genre, average rating, year,
  • "ratings" - contains detailed rating information for each movie, broken down by voters' age and gender,
  • "names" - contains information about people (not only from movies),
  • "title_principals" - contains data connecting people with movies, describing their role in the movie and, additionally, the name of their character (if available).

The files come with a .csv extension. CSV stands for Comma Separated Values.
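
To make the format concrete, here is a minimal sketch of how comma-separated text maps onto a pandas DataFrame. The rows and column names below are made up for illustration and are not taken from the IMDb files.

```python
import io
import pandas as pd

# Hypothetical sample invented for this example; not real IMDb data.
sample_csv = io.StringIO(
    "title,year,avg_vote\n"
    "Example Movie,1999,7.5\n"
    "Another Movie,2005,6.1\n"
)

# The first line becomes the column names, each following line becomes a row.
sample_df = pd.read_csv(sample_csv)
print(sample_df.shape)
print(sample_df.dtypes)
```

The first line of the text is treated as the header, and pandas infers a type for each column from the values it finds.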

Downloading

This dataset is fairly large (over 200 MB), so the download might take a while.
Luckily, we only have to download it once :P

if not path.isdir('imdb-extensive-dataset'):
    od.download('https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset')
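
A quick way to confirm that the download worked is to list the CSV files in the target folder. This is only a sketch: `list_csv_files` is a hypothetical helper introduced here, and it assumes the files were saved into the `imdb-extensive-dataset` directory checked above.

```python
from pathlib import Path


def list_csv_files(directory):
    """Return sorted (file name, size in MB) pairs for the CSV files in `directory`."""
    folder = Path(directory)
    if not folder.is_dir():
        # Nothing has been downloaded yet (or the folder has a different name).
        return []
    return sorted(
        (p.name, round(p.stat().st_size / 1024**2, 1))
        for p in folder.glob('*.csv')
    )


# Assumption: the dataset was saved into this folder by the download step.
for name, size_mb in list_csv_files('imdb-extensive-dataset'):
    print(f'{name}: {size_mb} MB')
```

If nothing is printed, the folder is missing or contains no CSV files, and the download step should be run again.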

Check the contents

Let's load the downloaded files as pandas DataFrames. To do so, we'll use the read_csv function.