Text Classification with Bag of Words - Natural Language Processing

"Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data." - Wikipedia

Bag of Words: The bag-of-words (BoW) model represents arbitrary text as a fixed-length vector by counting how many times each vocabulary word appears in it. The vector has one entry per word in the vocabulary, and word order is ignored.
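To make the definition concrete, here is a minimal sketch of the counting step in plain Python. It assumes whitespace tokenization and lowercasing, which is a simplification, and the helper names `build_vocabulary` and `to_bow_vector` are hypothetical ones chosen for illustration, not part of any library.

```python
from collections import Counter


def build_vocabulary(texts):
    """Assign each distinct lowercase token a fixed column index."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def to_bow_vector(text, vocab):
    """Count token occurrences; tokens outside the vocabulary are dropped."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocab)
    for token, count in counts.items():
        index = vocab.get(token)
        if index is not None:
            vector[index] = count
    return vector


texts = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(texts)
vectors = [to_bow_vector(text, vocab) for text in texts]

print(vocab)    # column index assigned to each word
print(vectors)  # one fixed-length count vector per text
```

Both vectors have the same length (the vocabulary size) even though the texts differ in length, which is what lets a standard classifier consume them.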

Outline:

  1. Download and explore a real-world dataset
  2. Apply text preprocessing techniques
  3. Implement the bag of words model
  4. Train ML models for text classification
  5. Make predictions and submit to Kaggle

Dataset: https://www.kaggle.com/c/quora-insincere-questions-classification

Download and Explore the Data

Outline:

  1. Download the dataset from Kaggle to Colab
  2. Explore the data using Pandas
  3. Create a small working sample

Download the Data to Colab

Upload your kaggle.json API token to Colab. To generate one, follow the instructions at https://www.kaggle.com/docs/api#authentication
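Before going further, it can help to confirm the upload worked. The sketch below assumes the token file is in the current working directory and contains the `username` and `key` fields of a standard Kaggle API token; `load_kaggle_credentials` is a hypothetical helper written for this check, not part of the Kaggle package.

```python
import json
from pathlib import Path


def load_kaggle_credentials(path="kaggle.json"):
    """Return the parsed token, or raise if the file is missing or incomplete."""
    token_path = Path(path)
    if not token_path.is_file():
        raise FileNotFoundError(f"{token_path} not found; upload it first")
    credentials = json.loads(token_path.read_text())
    # Assumed token layout: a JSON object with "username" and "key" fields.
    missing = {"username", "key"} - credentials.keys()
    if missing:
        raise ValueError(f"{token_path} is missing fields: {sorted(missing)}")
    return credentials
```

Calling `load_kaggle_credentials()` in a cell right after the upload should return a dictionary. Avoid printing it in a shared notebook, since the key is a secret.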

!ls .
kaggle.json sample_data
import os