!wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Health_and_Personal_Care_5.json.gz
--2021-05-05 13:05:19--  http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Health_and_Personal_Care_5.json.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 85180885 (81M) [application/x-gzip]
Saving to: ‘reviews_Health_and_Personal_Care_5.json.gz’

reviews_Health_and_ 100%[===================>]  81.23M  8.65MB/s    in 8.4s

2021-05-05 13:05:28 (9.64 MB/s) - ‘reviews_Health_and_Personal_Care_5.json.gz’ saved [85180885/85180885]
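import gzip
import pandas as pd

# The download is gzip-compressed JSON Lines (one review object per line),
# so open it through gzip and let pandas parse it with lines=True.
with gzip.open('/content/reviews_Health_and_Personal_Care_5.json.gz') as f:
    df = pd.read_json(f, lines=True)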
import numpy as np
import re
df
# Drop the listed columns; every other column is kept unchanged.
df_all = df.drop(columns=['reviewerName', 'helpful', 'unixReviewTime', 'reviewTime', 'summary', 'asin', 'reviewerID'])
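The load-and-drop steps above can be tried on a small scale without downloading the 81 MB file. The sketch below is illustrative only: the two records are invented, and it assumes the review file's columns are the seven dropped above plus `reviewText` and `overall`. It writes the records to an in-memory gzipped JSON Lines buffer, reads them back the same way, and shows that dropping the seven columns gives the same result as selecting the two remaining ones by name.

```python
import gzip
import io
import json

import pandas as pd

# Two invented records that mimic the assumed column layout of the review file.
records = [
    {"reviewerID": "A1", "asin": "B001", "reviewerName": "Sam",
     "helpful": [0, 0], "reviewText": "Works well.", "overall": 5,
     "summary": "Good", "unixReviewTime": 1400000000,
     "reviewTime": "05 13, 2014"},
    {"reviewerID": "A2", "asin": "B002", "reviewerName": "Kim",
     "helpful": [1, 2], "reviewText": "Stopped working quickly.", "overall": 2,
     "summary": "Poor", "unixReviewTime": 1400086400,
     "reviewTime": "05 14, 2014"},
]

# Write the records as gzipped JSON Lines into an in-memory buffer.
buf = io.BytesIO()
with gzip.open(buf, "wt", encoding="utf-8") as gz:
    for rec in records:
        gz.write(json.dumps(rec) + "\n")
buf.seek(0)

# Read them back exactly as the real file is read: gzip + lines=True.
with gzip.open(buf, "rt", encoding="utf-8") as f:
    toy = pd.read_json(f, lines=True)

# Option 1: drop the unwanted columns (same list as above).
dropped = toy.drop(columns=["reviewerName", "helpful", "unixReviewTime",
                            "reviewTime", "summary", "asin", "reviewerID"])

# Option 2: select the columns to keep by name.
kept = toy[["reviewText", "overall"]]

print(dropped.columns.tolist())
print(kept.equals(dropped))
```

Selecting by name is shorter when only a few columns are needed, while `drop` will raise a `KeyError` if any listed column is missing, which can be a useful check on the file's layout.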