
Dealing with Large Datasets using Pandas

It is often said that 'Data is the new Oil', and the amount of data produced every day is mind-boggling: at our current pace, roughly 2.5 quintillion bytes of data are created each day. Even more astonishing, about 90% of the world's data was created in the last two years alone.

Being able to handle and engineer such a vast amount of data is a powerful skill.

In this tutorial, we will cover the following topics:

  • Loading datasets into Google Colab
  • Speeding up data loading with pandas.DataFrame
  • Saving memory with pandas (chunking)
  • Loading datasets from intermediate file formats
  • Speeding up data loading with other libraries

Prerequisites:

  • You should be familiar with pandas Series and DataFrames. If you are not familiar with these concepts, have a quick look at this helper notebook.
  • You can find out how to run this notebook on Google Colab with this helper notebook.
import pandas as pd

opendatasets is a Python library for downloading datasets from online sources such as Kaggle and Google Drive with a single Python command.
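
As a quick illustration, here is how a Kaggle dataset might be fetched with opendatasets. The dataset URL below is just a placeholder; on first use, od.download will prompt for your Kaggle username and API key:

import opendatasets as od

# Placeholder URL for illustration; replace it with the Kaggle dataset
# you actually want to work with.
dataset_url = 'https://www.kaggle.com/datasets/some-user/some-large-dataset'

# Downloads and extracts the dataset into the current directory.
od.download(dataset_url)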

There are multiple ways to load your dataset into Colab; two of them are covered below. That said, downloading the dataset directly from a link is usually the ideal approach.
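
For example, if the dataset is hosted as a plain CSV file, pandas can read it straight from the URL with no separate download step. The URL below is an assumed placeholder:

import pandas as pd

# pandas.read_csv accepts HTTP(S) URLs directly, so a publicly hosted
# CSV file can be loaded in one line. Replace the placeholder URL with
# a real link to your dataset.
url = 'https://example.com/path/to/large-dataset.csv'
df = pd.read_csv(url)

# Quick sanity check that the data loaded as expected.
df.head()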