05 Course Project Text Classification - Notebook by Sayak Paul (iamsayak1998)

Created 4 years ago

# Uncomment and run the commands below if imports fail
#!conda install numpy pandas pytorch torchvision cpuonly -c pytorch -y
#!pip install matplotlib --upgrade --quiet
#!pip install torch
#!pip install torchtext

# https://towardsdatascience.com/understanding-pytorch-with-an-example-a-step-by-step-tutorial-81fc5f8c4e8e
# here is an example of sentiment analysis - https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/2%20-%20Upgraded%20Sentiment%20Analysis.ipynb

import torch
import numpy as np
import matplotlib.pyplot as plt
import torch.nn as nn
import torch.nn.functional as F
import pandas as pd

from torch.utils.data import DataLoader, random_split

# text related
from torchtext import data

%matplotlib inline

project_name = '05-course-project-text-classification'

Let's inspect the data

The dataset ships with accompanying code that helps load the data into a dataframe. That snippet can limit the read to the first 1000 rows; here we load the whole file and inspect it to get an intuition.

The file contains Latin-1 characters, so the CSV has to be read with an explicit encoding parameter (encoding='latin-1'), as shown below.

nRowsRead = None # None reads the whole file; set to an integer (e.g. 1000) to read only the first rows
df1 = pd.read_csv('../input/sentiment140/training.1600000.processed.noemoticon.csv', delimiter=',', nrows=nRowsRead, encoding='latin-1', names=["target", "id", "date", "flag", "user", "text"])
df1.dataframeName = 'training.1600000.processed.noemoticon.csv'
nRow, nCol = df1.shape
print(f'There are {nRow} rows and {nCol} columns')

There are 1600000 rows and 6 columns
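
To see why the encoding parameter matters, here is a minimal sketch on a tiny in-memory sample rather than the real file. The two rows and the user names are made up for illustration; only the six-column layout and the latin-1 encoding come from the cell above. The sample is assumed to behave like the real CSV in that it contains a byte that is not valid UTF-8.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same six-column layout (not real dataset rows).
# '\xe9' is the Latin-1 byte for 'é', which is not valid UTF-8 on its own.
raw = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","caf\xe9 was closed"\n'
    '"4","2","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","user_b","great day"\n'
).encode("latin-1")

columns = ["target", "id", "date", "flag", "user", "text"]

# With the default UTF-8 decoding, the read fails on the 0xE9 byte.
try:
    pd.read_csv(io.BytesIO(raw), names=columns)
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# With encoding='latin-1' the same bytes decode cleanly.
sample = pd.read_csv(io.BytesIO(raw), names=columns, encoding="latin-1")

print(f"UTF-8 read succeeded: {utf8_ok}")
print(sample[["target", "text"]])
```

Passing names is what makes pandas treat the first line as data rather than as a header row, which is why the same argument appears in the full read above.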