Jovian
⭐️
Sign In

Introduction

Zomato is a popular business in the restaurants space. It allows customer to search for restaurants based on their preferences. It also provides with food deliver services. The USP of Zomato is its reviews on restaurants.

Analysing this data, can help restaurants understand, what customer like and dislike about their restaurant and improve it. It also allows you to compare customer reviews and ratings of your competitior across location, cuisine and type of service provided.

These information can be vital in understanding customer requirements when starting a new restaurant or trying to improve your own restaurant business.

In this notebook, we are using the Zomato Restaurants in Bangalore data from kaggle. The link to download the notebook is as below:

https://www.kaggle.com/himanshupoddar/zomato-bangalore-restaurants

Though, there can be many other analysis done on this data,we will be using the reviews data to understand how NLP techniques can help make sense of the huge trove of data captured through Reviews.

Imports

Importing Basic Libraries

In [18]:
# Data manipulation
import pandas as pd
import numpy as np

# Options for pandas
pd.options.display.max_columns = None
pd.options.display.max_rows = None

pd.options.display.max_colwidth=-1

# Display all cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

from IPython import get_ipython
ipython = get_ipython()

# autoreload extension
if 'autoreload' not in ipython.extension_manager.loaded:
    %load_ext autoreload

%autoreload 2

# Visualizations
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns


import re

Analysis/Modeling

Load the Data

In [3]:
zomato_data=pd.read_csv("zomato.csv")
zomato_data.head()
Out[3]:
In [4]:
zomato_data['listed_in(type)'].value_counts()
Out[4]:
Delivery              25942
Dine-out              17779
Desserts              3593 
Cafes                 1723 
Drinks & nightlife    1101 
Buffet                882  
Pubs and bars         697  
Name: listed_in(type), dtype: int64

We have 51K restaurant reviews in the data. For each restaurant, there is an reviews list containing the review rating and the review comments. Let us extract these into a seperate dataframe. Also, so that we do not lose the other restaurant information, let us create a new resturant ID. Also, let us filter out for restraurant which have "Buffet"

Filter out "Buffet" Resturants

In [5]:
## Filter out all restaurants for Dine-Out
zomato_data=zomato_data[zomato_data['listed_in(type)']=='Buffet']
print(zomato_data.shape)
zomato_data['id']=zomato_data.index
#print(zomato_data.head())
(882, 17)

Extract the reviews data

In [6]:
reviews_data=zomato_data[['reviews_list','id']].reset_index(drop=True)
reviews_data['index']=reviews_data.index
review=reviews_data['reviews_list'][0]
print(review)
[('Rated 4.0', 'RATED\n A beautiful place to dine in.The interiors take you back to the Mughal era. The lightings are just perfect.We went there on the occasion of Christmas and so they had only limited items available. But the taste and service was not compromised at all.The only complaint is that the breads could have been better.Would surely like to come here again.'), ('Rated 4.0', 'RATED\n I was here for dinner with my family on a weekday. The restaurant was completely empty. Ambience is good with some good old hindi music. Seating arrangement are good too. We ordered masala papad, panner and baby corn starters, lemon and corrionder soup, butter roti, olive and chilli paratha. Food was fresh and good, service is good too. Good for family hangout.\nCheers'), ('Rated 2.0', 'RATED\n Its a restaurant near to Banashankari BDA. Me along with few of my office friends visited to have buffet but unfortunately they only provide veg buffet. On inquiring they said this place is mostly visited by vegetarians. Anyways we ordered ala carte items which took ages to come. Food was ok ok. Definitely not visiting anymore.'), ('Rated 4.0', 'RATED\n We went here on a weekend and one of us had the buffet while two of us took Ala Carte. Firstly the ambience and service of this place is great! The buffet had a lot of items and the good was good. We had a Pumpkin Halwa intm the dessert which was amazing. Must try! The kulchas are great here. Cheers!'), ('Rated 5.0', 'RATED\n The best thing about the place is itÃ\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â\x92s ambiance. Second best thing was yummy ? food. We try buffet and buffet food was not disappointed us.\nTest ?. ?? ?? ?? ?? ??\nQuality ?. ??????????.\nService: Staff was very professional and friendly.\n\nOverall experience was excellent.\n\nsubirmajumder85.wixsite.com'), ('Rated 5.0', 'RATED\n Great food and pleasant ambience. Expensive but Coll place to chill and relax......\n\nService is really very very good and friendly staff...\n\nFood : 5/5\nService : 5/5\nAmbience :5/5\nOverall :5/5'), ('Rated 4.0', 'RATED\n Good ambience with tasty food.\nCheese chilli paratha with Bhutta palak methi curry is a good combo.\nLemon Chicken in the starters is a must try item.\nEgg fried rice was also quite tasty.\nIn the mocktails, recommend "Alice in Junoon". Do not miss it.'), ('Rated 4.0', 'RATED\n You canÃ\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â\x92t go wrong with Jalsa. Never been a fan of their buffet and thus always order alacarteÃ\x83Â\x83Ã\x82Â\x82Ã\x83Â\x82Ã\x82Â\x92. Service at times can be on the slower side but food is worth the wait.'), ('Rated 5.0', 'RATED\n Overdelighted by the service and food provided at this place. A royal and ethnic atmosphere builds a strong essence of being in India and also the quality and taste of food is truly authentic. I would totally recommend to visit this place once.'), ('Rated 4.0', 'RATED\n The place is nice and comfortable. Food wise all jalea outlets maintain a good standard. The soya chaap was a standout dish. Clearly one of trademark dish as per me and a must try.\n\nThe only concern is the parking. It very congested and limited to just 5cars. The basement parking is very steep and makes it cumbersome'), ('Rated 4.0', 'RATED\n The place is nice and comfortable. Food wise all jalea outlets maintain a good standard. The soya chaap was a standout dish. Clearly one of trademark dish as per me and a must try.\n\nThe only concern is the parking. It very congested and limited to just 5cars. The basement parking is very steep and makes it cumbersome'), ('Rated 4.0', 'RATED\n The place is nice and comfortable. Food wise all jalea outlets maintain a good standard. The soya chaap was a standout dish. Clearly one of trademark dish as per me and a must try.\n\nThe only concern is the parking. It very congested and limited to just 5cars. The basement parking is very steep and makes it cumbersome')]
In [7]:
type(review)
Out[7]:
str

The list of reviews are present as a string.

Convert the reviews for each restaurant into a list of (rating,review) tuple

In [11]:
import ast
text = ast.literal_eval(review)

type(text)
Out[11]:
list
In [12]:
reviews_data['reviews_list']=reviews_data['reviews_list'].apply(lambda x:ast.literal_eval(x))

The review is present as a list of (rating,review) tuple.

Convert the list of (rating,review) to one row each

In [13]:
reviews = reviews_data['reviews_list'].apply(pd.Series).reset_index().melt(id_vars='index').dropna()
In [14]:
reviews=reviews[['index',"value"]]
reviews.columns=['index','reviews']
reviews.head()
Out[14]:

For each line, extract the ratings and the review

In [15]:
reviews['rating']=reviews['reviews'].apply(lambda x:x[0])
reviews['review_text']=reviews['reviews'].apply(lambda x:x[1])
reviews.head()
Out[15]:

Replace the word "rated " and "RATED\n" in rating and review_text respectively

In [19]:

reviews['rating']=reviews['rating'].apply(lambda x:re.sub("Rated ","",str(x)))
In [20]:
test="RATED\n I rated the restaurant badly because od the ambience"
re.sub("^(rated\n)","",test,flags=re.I) ## Removes only the rated in the beginning of the string
Out[20]:
' I rated the restaurant badly because od the ambience'
In [21]:
reviews['review_text']=reviews['review_text'].apply(lambda x:re.sub("^(rated\n)","",x,flags=re.I))
reviews['review_text']=reviews['review_text'].apply(lambda x:x.strip())

Drop of the reviews columns and also convert "rating" to integer

In [23]:
reviews.drop(['reviews'],inplace=True,axis=1)
In [25]:
reviews['rating']=pd.to_numeric(reviews['rating'])

In [26]:
reviews.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 42261 entries, 0 to 518887 Data columns (total 3 columns): index 42261 non-null int64 rating 42261 non-null float64 review_text 42261 non-null object dtypes: float64(1), int64(1), object(1) memory usage: 2.5+ MB

Text Cleaning

Let us look at another very popular library "spacy"

In [27]:
import spacy

The object “nlp” is used to create documents, access linguistic annotations and different nlp properties

In [28]:
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner']) #Loads the english model and we dont want prser or an ner so we will disable it

Let us go through Text Preprocessing using spacy for one review

In [29]:
review_txt=reviews.iloc[0]['review_text']
review_txt
Out[29]:
'A beautiful place to dine in.The interiors take you back to the Mughal era. The lightings are just perfect.We went there on the occasion of Christmas and so they had only limited items available. But the taste and service was not compromised at all.The only complaint is that the breads could have been better.Would surely like to come here again.'
In [30]:
document = nlp(review_txt)#Convert into an Spacy Document
In [31]:
print([token for token in document]) ## Prints list of all tokens in the document
[A, beautiful, place, to, dine, in, ., The, interiors, take, you, back, to, the, Mughal, era, ., The, lightings, are, just, perfect, ., We, went, there, on, the, occasion, of, Christmas, and, so, they, had, only, limited, items, available, ., But, the, taste, and, service, was, not, compromised, at, all, ., The, only, complaint, is, that, the, breads, could, have, been, better, ., Would, surely, like, to, come, here, again, .]
In [38]:
stopwords=spacy.lang.en.stop_words.STOP_WORDS
doc_cleaned=[token for token in document if  (token.text not in stopwords)]
print(doc_cleaned)
[A, beautiful, place, dine, ., The, interiors, Mughal, era, ., The, lightings, perfect, ., We, went, occasion, Christmas, limited, items, available, ., But, taste, service, compromised, ., The, complaint, breads, better, ., Would, surely, like, come, .]
In [40]:
## Let us get POS tags of the words
pos_tags=[(token.text,token.pos_) for token in document if (token.text not in stopwords)]
print(pos_tags)
[('A', 'DET'), ('beautiful', 'ADJ'), ('place', 'NOUN'), ('dine', 'NOUN'), ('.', 'PUNCT'), ('The', 'DET'), ('interiors', 'NOUN'), ('Mughal', 'PROPN'), ('era', 'NOUN'), ('.', 'PUNCT'), ('The', 'DET'), ('lightings', 'NOUN'), ('perfect', 'ADJ'), ('.', 'PUNCT'), ('We', 'PRON'), ('went', 'VERB'), ('occasion', 'NOUN'), ('Christmas', 'PROPN'), ('limited', 'VERB'), ('items', 'NOUN'), ('available', 'ADJ'), ('.', 'PUNCT'), ('But', 'CCONJ'), ('taste', 'NOUN'), ('service', 'NOUN'), ('compromised', 'VERB'), ('.', 'PUNCT'), ('The', 'DET'), ('complaint', 'NOUN'), ('breads', 'NOUN'), ('better', 'ADJ'), ('.', 'PUNCT'), ('Would', 'VERB'), ('surely', 'ADV'), ('like', 'VERB'), ('come', 'VERB'), ('.', 'PUNCT')]

Let us only extract Adjectives from the document

In [41]:
adj=[token for token in document if (token.text not in stopwords) and token.pos_=="ADJ"]
print(adj)
[beautiful, perfect, available, better]

Let us now lemmatise the document and remove space and punctuation and pronouns and numbers

In [44]:
doc_cleaned=[token.lemma_ for token in document if (token.text not in stopwords) and token.pos_ not in ['PUNCT','SPACE',"PRON","NUM"]]
print(doc_cleaned)
['a', 'beautiful', 'place', 'dine', 'the', 'interior', 'Mughal', 'era', 'the', 'lighting', 'perfect', 'go', 'occasion', 'Christmas', 'limit', 'item', 'available', 'but', 'taste', 'service', 'compromise', 'the', 'complaint', 'bread', 'well', 'Would', 'surely', 'like', 'come']

Create a function to do all the cleaning steps and clean all the reviews

In [59]:
COUNT=0
def cleanText(text):
    global COUNT
    nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner']) 
    
    text=text.lower()
    
    document = nlp(text)
    
    doc_cleaned=[token.lemma_ for token in document if (token.text not in stopwords) and token.pos_ not in ['PUNCT','SPACE',"PRON","NUM"]]
    print(doc_cleaned)
    doc_cleaned=" ".join(doc_cleaned)
    print(doc_cleaned)
    #print(doc_cleaned)
    COUNT=COUNT+1
    if COUNT%1000==0:
        print(COUNT)
    return doc_cleaned
In [60]:
reviews['cleaned_review']=reviews['review_text'].apply(lambda x:cleanText(x))
['beautiful', 'place', 'dine', 'in.the', 'interior', 'mughal', 'era', 'lighting', 'perfect.we', 'go', 'occasion', 'christmas', 'limit', 'item', 'available', 'taste', 'service', 'compromise', 'all.the', 'complaint', 'bread', 'better.would', 'surely', 'like', 'come'] beautiful place dine in.the interior mughal era lighting perfect.we go occasion christmas limit item available taste service compromise all.the complaint bread better.would surely like come ['dinner', 'family', 'turn', 'good', 'choose', 'suitable', 'age', 'people', 'try', 'place', 'like', 'starter', 'service', 'good', 'price', 'affordable', 'recommend', 'restaurant', 'early', 'dinner', 'place', 'little', 'noisy'] dinner family turn good choose suitable age people try place like starter service good price affordable recommend restaurant early dinner place little noisy ['ambience', 'good', 'pocket', 'friendly', 'cafe', 'quantity', 'good', 'dessert', 'good'] ambience good pocket friendly cafe quantity good dessert good ['great', 'food', 'proper', 'karnataka', 'style', 'meal', 'twice', 'fully', 'satisfied', 'star', 'manage'] great food proper karnataka style meal twice fully satisfied star manage ['good', 'restaurant', 'neighbourhood', 'buffet', 'system', 'properly', 'arrange', 'variety', 'dish', 'garba', 'dance', 'puppet', 'good', 'spread', 'dessert', 'live', 'paratha', '/', 'kulcha', 'making'] good restaurant neighbourhood buffet system properly arrange variety dish garba dance puppet good spread dessert live paratha / kulcha making ['food', 'ambience', 'service', 'family', 'lunch', 'place', 'serve', 'buffet', 'order', 'soup', 'babycorn', 'starter', 'butter', 'naan', 'kadai', 'panner', 'veg', 'kohlapuri', 'pease', 'pulav', 'food', 'good', 'service', 'slow', 'wait', 'min', 'order', 'place', 'apt', 'family', 'hangout', 'cheer'] food ambience service family lunch place serve buffet order soup babycorn starter butter naan kadai panner veg kohlapuri pease pulav food good service slow wait min order place apt family hangout cheer ['awesome', 'food', 'great', 'servicefriendly', 'staffsgood', 'quality', 'food', '#', 'complimentary', 'breakfast', '#', 'honey', 'lemon', 'chicken', 'chicken', 'manchow', 'soup', 'perfect', 'place', 'stay', 'family', 'stay', 'bangalore'] awesome food great servicefriendly staffsgood quality food # complimentary breakfast # honey lemon chicken chicken manchow soup perfect place stay family stay bangalore ['restaurant', 'think', 'good', 'buffet', 'affordable', 'cost', 'nice', 'ambience', 'n', 'music', 'staff', 'r', 'awesome', 'vodka', 'panipuri', 'z', 'unique', 'thing', 'pasta', 'z', 'yummy', 'occasion', 'b', '1st', 'choice', 'kid', 'buffet', 'time', 'strongly', 'recommend', 'ds', 'place'] restaurant think good buffet affordable cost nice ambience n music staff r awesome vodka panipuri z unique thing pasta z yummy occasion b 1st choice kid buffet time strongly recommend ds place ['place', 'definitely', 'visit', 'pleasant', 'ambiance', 'courteous', 'staff', 'amazing', 'food', 'quality', 'food', 'portion', 'order', 'fattoush', 'salad', 'tangy', 'dressing', 'love', 'starter', 'order', 'mutton', 'kebab', 'soft', 'juicy', 'succulent', 'mutton', 'accompany', 'hummus', 'mayo', 'main', 'course', 'order', 'kashmiri', 'pulao', 'okay', 'rice', 'sweet', 'apple', 'pomegranate', 'kernel', 'pineapple', 'course', 'serve', 'spicy', 'order', 'chicken', 'tikka', 'masala', 'eat', 'half', 'pack', 'overall', 'food', 'ambiance', 'service'] place definitely visit pleasant ambiance courteous staff amazing food quality food portion order fattoush salad tangy dressing love starter order mutton kebab soft juicy succulent mutton accompany hummus mayo main course order kashmiri pulao okay rice sweet apple pomegranate kernel pineapple course serve spicy order chicken tikka masala eat half pack overall food ambiance service ['buffet', 'lunch', '@', 'empire', 'restaurant', 'great', 'food', 'awesome', 'like', 'recommend', 'love', 'non', 'veg', 'budget', 'food', 'service', 'ambience'] buffet lunch @ empire restaurant great food awesome like recommend love non veg budget food service ambience ['okay', 'like', 'start', 'say', 'restaurant', 'team', 'unorganized', 'worker', 'order', 'food', 'online', 'boneless', 'biryani', 'receive', 'mutton', 'biryani', 'call', 'say', 'tell', 'send', 'receive', 'payment', 'care', 'horrible', 'service', 'biryani', 'taste', 'good', 'service'] okay like start say restaurant team unorganized worker order food online boneless biryani receive mutton biryani call say tell send receive payment care horrible service biryani taste good service ['average', 'place', 'average', 'option', 'average', 'taste', 'service', 'slow', 'get', 'unbearable', 'mosquito', 'daytime', 'beat', 'overall', 'average', 'place', 'specially', 'eat'] average place average option average taste service slow get unbearable mosquito daytime beat overall average place specially eat ['awesome', 'place', 'staff', 'friendly', 'music', 'good', 'think', 'need', 'talk', 'food', 'say', 'barbeque', 'town', 'water', 'melon', 'n', 'pineapple', 'barbeque', 'z', 'favorite', 'complimentary', 'birthday', '/', 'anniversary', 'cake', 'music', 'dance', 'n', 'lod', 'wish', 'love', 'place'] awesome place staff friendly music good think need talk food say barbeque town water melon n pineapple barbeque z favorite complimentary birthday / anniversary cake music dance n lod wish love place ['veg', 'food', 'taste', 'good', 'eat', 'everyday', 'gourmet', 'veg', 'food', 'good', 'prepare', 'long', 'wait', 'weekend', 'holiday', 'bustling', 'noisy', 'branch', 'white', 'wall', 'isckon', 'graffiti', 'pack', 'time', 'late', 'addition', 'appam', 'stew', 'winner', 'sweet', 'toother', 'celebrate', 'dessert', 'spread', 'main', 'course'] veg food taste good eat everyday gourmet veg food good prepare long wait weekend holiday bustling noisy branch white wall isckon graffiti pack time late addition appam stew winner sweet toother celebrate dessert spread main course ['buffet', '24th', 'main', 'ultimate', 'search', 'taste', 'restaurant', 'bangalore', 'decide', 'king', 'buffet', 'dis', 'definitely', 'heaven', 'person', 'specially', 'love', 'andhra', 'food', 'start', 'starter', 'dessert', 'item', 'super', 'mr', 'narayana', 'murty', 'chef', 'main', 'course', 'worth', 'big', 'applause', 'negative', 'point', 'ambience', 'service', 'guy', 'strongly', 'recommend', 'taste'] buffet 24th main ultimate search taste restaurant bangalore decide king buffet dis definitely heaven person specially love andhra food start starter dessert item super mr narayana murty chef main course worth big applause negative point ambience service guy strongly recommend taste ['bedridden', 'scour', 'number', 'restaurant', 'satiate', 'pipe', 'hot', 'soup', 'craving', 'land', 'zaitoon', 'page', 'hard', 'time', 'decide', 'order', 'coz', 'option', 'plenty', 'order', 'simple', 'chicken', 'hot', 'n', 'sour', 'soup', 'turn', 'amazing', 'taste', 'texture', 'balance', 'flavour', 'point', 'ample', 'chicken', 'veggie', 'be', 'honestly', 'crave', 'bowl', 'right', 'word', 'packaging', 'probably', 'little', 'well', 'plain', 'white', 'run', 'mill', 'container', 'justice', 's', 'inside', 'jusssayyin'] bedridden scour number restaurant satiate pipe hot soup craving land zaitoon page hard time decide order coz option plenty order simple chicken hot n sour soup turn amazing taste texture balance flavour point ample chicken veggie be honestly crave bowl right word packaging probably little well plain white run mill container justice s inside jusssayyin ['locate', 'km', 'place', 'know', 'buffet', 'option', 'area', 'good', 'option', 'nearby', 'drive', 'far', 'traffic', 'item', 'variety', 'mention', 'menu', 'section', 'definitely', 'serve', 'variety', 'icecream', 'flavour', 'value', 'money', 'weekday', 'avoid', 'crowd'] locate km place know buffet option area good option nearby drive far traffic item variety mention menu section definitely serve variety icecream flavour value money weekday avoid crowd ['wow', 'wala', 'punjabi', 'taste', 'look', 'correct', 'place', 'jump', 'authentic', 'punjabi', 'flavour', 'malai', 'paneer', 'tikka', 'chicken', 'tikka', 'taste', 'food', 'simply', 'awesome', 'order', 'lassi', 'person', 'finish', 'take', 'kinda', 'time', 'prepare', 'food', 'think', 'acceptable', 'know', 'great', 'food', 'come', 'way', 'ambiance-4', 'food-5', 'service-4'] wow wala punjabi taste look correct place jump authentic punjabi flavour malai paneer tikka chicken tikka taste food simply awesome order lassi person finish take kinda time prepare food think acceptable know great food come way ambiance-4 food-5 service-4
['want', 'nice', 'quiet', 'ambience', 'food', 'good', 'love', 'soup', 'disappoint', 'dessert', 'order', 'litchi', 'rasgulla', 'dessert', 'great', 'concept', 'sweet', 'rasgulla', 'dry', 'look', 'beautiful'] want nice quiet ambience food good love soup disappoint dessert order litchi rasgulla dessert great concept sweet rasgulla dry look beautiful ['itã\x83\x83ã\x82\x83ã\x83\x82ã\x82\x82ã\x83\x83ã\x82\x82ã\x83\x82ã\x82\x92s', 'good', 'place', 'suggest', 'wonderful', 'place', 'hear', 'place', 'time', 'finally', 'get', 'chance', 'visit'] itッゃヂもッもヂをs good place suggest wonderful place hear place time finally get chance visit ['try', 'chicken', 'ghee', 'roast', 'chicken', 'biryani', 'biryani', 'average', 'ghee', 'roast', 'chicken', 'hard', 'chewy', 'overall', 'average', 'place', 'consider', 'food', 'ambience', 'good', 'staff', 'pleasant'] try chicken ghee roast chicken biryani biryani average ghee roast chicken hard chewy overall average place consider food ambience good staff pleasant ['go', 'girlfriend', 'enjoy', 'lot', 'nice', 'bar', 'lounge', 'restaurant', 'menu', 'pricing', 'reasonable', 'serve', 'staff', 'active', 'food', 'delicious', 'choose', 'place', 'spend', 'quality', 'time', 'special'] go girlfriend enjoy lot nice bar lounge restaurant menu pricing reasonable serve staff active food delicious choose place spend quality time special ['colleague', 'seminar', 'n', 'lunch', 'small', 'buffet', 'option', 'dessert', 'food', 'quality', 'okay', 'cost', 'reasonable', 'great', 'place', 'o', 'buffet'] colleague seminar n lunch small buffet option dessert food quality okay cost reasonable great place o buffet ['end', 'late', 'lazy', 'sunday', 'afternoon', 'breakfast.i', 'order', 'paneer', 'masala', 'dosa', 'friend', 'order', 'paddu.the', 'dosa', 'good', 'crispy', 'beautiful', 'golden', 'colour', 'has.i', 'expect', 'aloo', 'replace', 'paneer', 'masala', 'stuff', 'turn', 'paneer', 'grate', 'aloo', 'masala.which', 'feel', 'change', 'paddu', 'take', 'long', 'time', 'blame', 'go', 'breakfast', 'come', 'burn', 'call', 'extra', 'fry', 'disappointed', 'try', 'definitely', 'masala', 'dosa(not', 'paneer', ';)', 'palak', 'paneer', 'order', 'place(keep', 'paddu:50rs', 'piece', 'big', 'paneer', 'masala', 'dosa:70r', 'location', 'actually', 'vijaya', 'bank', 'layout', 'rating', 'base', 'truelly', 'instance'] end late lazy sunday afternoon breakfast.i order paneer masala dosa friend order paddu.the dosa good crispy beautiful golden colour has.i expect aloo replace paneer masala stuff turn paneer grate aloo masala.which feel change paddu take long time blame go breakfast come burn call extra fry disappointed try definitely masala dosa(not paneer ;) palak paneer order place(keep paddu:50rs piece big paneer masala dosa:70r location actually vijaya bank layout rating base truelly instance ['look', 'vegetarian', 'restaurant', 'dinner', 'finalise', 'reach', 'place', 'realise', 'buffet', 'place', 'decide', 'try', 'place', 'spacious', 'seat', 'arrangement', 'nice', 'fine', 'dining', 'casual', 'dining', 'place', 'staff', 'friendly', 'address', 'guest', 'smile', 'service', 'quick', 'prompt', 'serve', 'soup', 'table', 'thing', 'need', 'pick', 'self', 'option', 'soup', 'tomato', 'shorba', 'good', 'good', 'spread', 'huge', 'live', 'counter', 'chaat', 'pasta', 'certain', 'item', 'dessert', 'okayish', 'option', 'limit', 'like', 'dahi', 'bhalla', 'good', 'ambience', 'food', 'service'] look vegetarian restaurant dinner finalise reach place realise buffet place decide try place spacious seat arrangement nice fine dining casual dining place staff friendly address guest smile service quick prompt serve soup table thing need pick self option soup tomato shorba good good spread huge live counter chaat pasta certain item dessert okayish option limit like dahi bhalla good ambience food service ['tiny', 'place', 'right', 'slimsin', 'cafe', 'chinese', 'style', 'bhel', 'super', 'tasty', 'innovative', 'kulcha', 'taco', 'kind', 'visit', 'place', 'definitely', 'try', 'chinese', 'style', 'bhel'] tiny place right slimsin cafe chinese style bhel super tasty innovative kulcha taco kind visit place definitely try chinese style bhel ['good', 'restaurant', 'neighbourhood', 'buffet', 'system', 'properly', 'arrange', 'variety', 'dish', 'garba', 'dance', 'puppet', 'good', 'spread', 'dessert', 'live', 'paratha', '/', 'kulcha', 'making'] good restaurant neighbourhood buffet system properly arrange variety dish garba dance puppet good spread dessert live paratha / kulcha making ['food', 'ambience', 'service', 'family', 'lunch', 'place', 'serve', 'buffet', 'order', 'soup', 'babycorn', 'starter', 'butter', 'naan', 'kadai', 'panner', 'veg', 'kohlapuri', 'pease', 'pulav', 'food', 'good', 'service', 'slow', 'wait', 'min', 'order', 'place', 'apt', 'family', 'hangout', 'cheer'] food ambience service family lunch place serve buffet order soup babycorn starter butter naan kadai panner veg kohlapuri pease pulav food good service slow wait min order place apt family hangout cheer ['place', 'surprise', 'locality', 'serve', 'good', 'food', 'tip', 'ambience', 'reasonable', 'price', 'gosht', 'peshawari', 'seekh', 'deserve', 'special', 'mention', 'indian', 'starter', 'chinese', 'main', 'course', 'average', 'believe', 'indian', 'item', 'worth', 'try'] place surprise locality serve good food tip ambience reasonable price gosht peshawari seekh deserve special mention indian starter chinese main course average believe indian item worth try ['roam', 'neighborhood', 'land', 'lunch', 'spacious', 'restaurant', 'buffet', 'ala', 'carte', 'option', 'order', 'dum', 'aloo', 'paneer', 'lababdar', 'pineapple', 'raita', 'assorted', 'bread', 'food', 'delicious', 'busy', 'place', 'service', 'prompt'] roam neighborhood land lunch spacious restaurant buffet ala carte option order dum aloo paneer lababdar pineapple raita assorted bread food delicious busy place service prompt ['great', 'place', 'team', 'lunch', 'place', 'buffet', 'food', 'cool', 'ambience', 'good', 'variety', 'find', 'live', 'counter', 'chat', 'desert', 'marshmallow', 'available', 'food', 'taste', 'good', 'veg', 'non', 'veg'] great place team lunch place buffet food cool ambience good variety find live counter chat desert marshmallow available food taste good veg non veg ['empire', 'usual', 'bit', 'costly', 'justice', 'good', 'quantity', 'quality', 'packing', 'good', 'hot', 'look', 'forward', 'visit', 'soon'] empire usual bit costly justice good quantity quality packing good hot look forward visit soon ['beautiful', 'place', 'dine', 'in.the', 'interior', 'mughal', 'era', 'lighting', 'perfect.we', 'go', 'occasion', 'christmas', 'limit', 'item', 'available', 'taste', 'service', 'compromise', 'all.the', 'complaint', 'bread', 'better.would', 'surely', 'like', 'come'] beautiful place dine in.the interior mughal era lighting perfect.we go occasion christmas limit item available taste service compromise all.the complaint bread better.would surely like come ['new', 'venue', 'breakfast', 'weekend', 'give', 'warm', 'welcome', 'staff', 'owner', 'start', 'fresh', 'fruit', 'juice', 'plenty', 'option', 'come', 'menu', 'unique', 'different', 'regular', 'breakfast', 'hotel', 'ambience', 'neat', 'standard', 'menu', 'cater', 'taste', 'fashion', 'dish', 'banana', 'kesri', 'bath', 'butter', 'masala', 'dosa', 'filter', 'coffee', 'try', 'spring', 'roll', 'dosa', 'mood', 'try', 'new', 'will', 'disappoint', 'love', 'service'] new venue breakfast weekend give warm welcome staff owner start fresh fruit juice plenty option come menu unique different regular breakfast hotel ambience neat standard menu cater taste fashion dish banana kesri bath butter masala dosa filter coffee try spring roll dosa mood try new will disappoint love service ['ambience', 'good', 'pocket', 'friendly', 'cafe', 'quantity', 'good', 'dessert', 'good'] ambience good pocket friendly cafe quantity good dessert good ['visit', 'restaurant', 'family', 'ambeince', 'absulotley', 'good', 'food', 'quality', 'arrangement', 'amazing', 'thank', 'good', 'hospitality', 'defneatley', 'visit', 'tree', 'restaurant'] visit restaurant family ambeince absulotley good food quality arrangement amazing thank good hospitality defneatley visit tree restaurant
['enjoy', 'buffet', 'spread', 'reasonably', 'price', 'polite', 'service', 'ala', 'carte', 'highly', 'price', 'guess', 'work', 'buffet', 'option'] enjoy buffet spread reasonably price polite service ala carte highly price guess work buffet option ['view', 'good', 'paneer', 'dish', 'fresh', 'tasty', 'biryani', 'good', 'spice', 'level', 'medium', 'service', 'well', 'food', 'ambience-', 'service-3/5'] view good paneer dish fresh tasty biryani good spice level medium service well food ambience- service-3/5 ['delightful', 'starter', 'tasty', 'mock', 'tail', 'find', 'place', 'unfortunately', 'menu', 'available', 'unavailability', 'issue', 'find', 'parking', 'spot', 'issue', 'plan', 'properly'] delightful starter tasty mock tail find place unfortunately menu available unavailability issue find parking spot issue plan properly ['place', 'local', 'like', 'polished', 'dinner', 'buffet', 'variety', 'starter', 'sweet', 'specially', 'foreigner', 'spice', 'mistake', 'star', 'hotel', 'nice', 'ambience', 'courteous', 'staff'] place local like polished dinner buffet variety starter sweet specially foreigner spice mistake star hotel nice ambience courteous staff ['huge', 'seafood', 'fan', 'goan', 'cuisine', 'close', 'heart', 'place', 'great', 'combination', 'good', 'food', 'fantastic', 'margarita', 'great', 'live', 'music', 'wrong', 'matter', 'order', 'personal', 'favourite', 'menu', 'chorizo', 'pao', 'masala', 'fry', 'calamari', 'watermelon', 'margarita', 'pitcher', 'tip', 'go', 'order', 'catch', 'day', 'fish', 'table', 'choose', 'check', 'weight', 'confirm', 'price', 'good', 'time', 'visit', 'lazy', 'weekend', 'afternoon', 'place', 'good'] huge seafood fan goan cuisine close heart place great combination good food fantastic margarita great live music wrong matter order personal favourite menu chorizo pao masala fry calamari watermelon margarita pitcher tip go order catch day fish table choose check weight confirm price good time visit lazy weekend afternoon place good ['go', 'lunch', 'love', 'buffet', 'dish', 'dish', 'scrumptuous', 'specially', 'variety', 'salad', 'inversatile', 'dessert', 'yummy', 'food', 'quality', 'good', 'service'] go lunch love buffet dish dish scrumptuous specially variety salad inversatile dessert yummy food quality good service ['beautiful', 'place', 'ideal', 'date', 'moment', 'enter', 'place', 'feel', 'like', 'middle', 'east', 'costume', 'wear', 'click', 'selfie', 'day', 'go', 'friday', 'dinner', 'post', 'pm', 'live', 'hindi', 'music', 'singer', 'amazing', 'enjoy', 'music', 'food', 'order', 'hukkah', 'cocktail', 'al', 'harira', 'soup', 'non', '-', 'veg', 'mezze', 'platter', 'murg', 'harimichwala', 'dal', 'makhani', 'lamb', 'mousaka', 'dessert', 'order', 'choco', 'bomb', 'baked', 'alaska', 'start', 'drink', 'hukkah', 'smokey', 'add', 'drama', 'enjoy', 'smoke', 'food', 'average', 'like', 'non', '-', 'veg', 'platter', 'lamb', 'mouska', 'interior', 'good', 'hall', 'room', 'inside', 'look', 'like', 'palace', 'room', 'hall', 'book', 'private', 'party', 'overall', 'enjoy', 'ambiance'] beautiful place ideal date moment enter place feel like middle east costume wear click selfie day go friday dinner post pm live hindi music singer amazing enjoy music food order hukkah cocktail al harira soup non - veg mezze platter murg harimichwala dal makhani lamb mousaka dessert order choco bomb baked alaska start drink hukkah smokey add drama enjoy smoke food average like non - veg platter lamb mouska interior good hall room inside look like palace room hall book private party overall enjoy ambiance ['weekday', 'evening', 'quick', 'drink', 'like', 'rooftop', 'ambiance', 'bar', 'x', 'change', 'concept', 'settle', 'table', 'oversee', 'ring', 'road', 'traffic', 'food', 'decent', 'wide', 'choice', 'beverage', 'course', 'depend', 'rate', 'switch', 'beer', 'hard', 'liquor', 'classic', 'cocktail', 'ideal', 'office', 'party', 'catch', 'friend'] weekday evening quick drink like rooftop ambiance bar x change concept settle table oversee ring road traffic food decent wide choice beverage course depend rate switch beer hard liquor classic cocktail ideal office party catch friend ['aloft', 'brown', 't', 'new', 'love', 'amazing', 'banana', 'cake', '&', 'chocolate', 'cupcake', 'think', 'croissant', 'good', 'time'] aloft brown t new love amazing banana cake & chocolate cupcake think croissant good time ['place', 'friend', 'visit', 'central', 'movie', 'shopping', 'ambience', 'place', 'decent', 'food', 'good', 'try', 'lot', 'dish', 'menu', 'favourite', 'tamatar', 'shorba', 'dahi', 'ke', 'kebab', 'crispy', 'corn', 'paneer', 'makhanwala', 'staff', 'pretty', 'courteous', 'service', 'quick', 'definitely', 'recommend', 'place'] place friend visit central movie shopping ambience place decent food good try lot dish menu favourite tamatar shorba dahi ke kebab crispy corn paneer makhanwala staff pretty courteous service quick definitely recommend place ['impromptu', 'meeting', 'decide', 'sit', 'place', 'beer', 'starter', 'good', 'place', 'require', 'sprucing', 'pull', 'crowd', 'lively', 'service', 'attentive', 'fast', 'like', 'place', 'separate', 'smoking', 'room', 'allow', 'people', 'smoke', 'floor', 'unfortunately', 'place', 'bangalore', 'smoking', 'room'] impromptu meeting decide sit place beer starter good place require sprucing pull crowd lively service attentive fast like place separate smoking room allow people smoke floor unfortunately place bangalore smoking room ['place', 'close', 'bud', 'usually', 'prefer', 'open', 'terrace', 'view', 'awesome', 'numerous', 'office', 'party', 'good', 'place', 'hang', 'alacarte', 'buffet', 'veg', 'option', 'buffet', 'limited', 'pleasing', 'come', 'non', 'veg', 'pack', 'punch', 'awesome', 'lucky', 'fish', 'fry', 'honey', 'chicken', 'fry', 'buffet', 'menu', 'drool', 'worthy', 'finger', 'lick', 'good', 'main', 'course', 'average', 'great', 'starter', 'regret', 'have', 'try', 'sure'] place close bud usually prefer open terrace view awesome numerous office party good place hang alacarte buffet veg option buffet limited pleasing come non veg pack punch awesome lucky fish fry honey chicken fry buffet menu drool worthy finger lick good main course average great starter regret have try sure ['good', 'place', 'bellandur', 'punjabi', 'time', 'specially', 'food', '.visited', 'place', 'time', 'order', 'food', 'amazing', 'monday', 'friday', 'serve', 'lunch', 'buffet', 'verity', 'food', 'absolutely', 'great', 'ambience', 'good', 'totally', 'punjabi', 'tradition', 'love', 'place', 'apart', 'food', 'service', 'good', 'staff', 'friendly', 'customer', 'jame', 'manoj', 'staff', 'helpful', 'ask', 'immediately', 'table', 'nice', 'experience', 'visit', 'place', 'notice', 'corporate', 'party', 'think', 'good', 'place', 'family', 'finally', 'personally', 'suggest', 'friend', 'guy', 'prefer', 'completely', 'indian', 'punjabi', 'food', 'pl', 'visit', 'place', 'different', 'experience', 'overall', 'experience'] good place bellandur punjabi time specially food .visited place time order food amazing monday friday serve lunch buffet verity food absolutely great ambience good totally punjabi tradition love place apart food service good staff friendly customer jame manoj staff helpful ask immediately table nice experience visit place notice corporate party think good place family finally personally suggest friend guy prefer completely indian punjabi food pl visit place different experience overall experience
--------------------------------------------------------------------------- KeyboardInterrupt Traceback (most recent call last) <ipython-input-60-e0599f2d2e44> in <module> ----> 1 reviews['cleaned_review_1']=reviews['review_text'].apply(lambda x:cleanText(x)) /anaconda3/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds) 3589 else: 3590 values = self.astype(object).values -> 3591 mapped = lib.map_infer(values, f, convert=convert_dtype) 3592 3593 if len(mapped) and isinstance(mapped[0], Series): pandas/_libs/lib.pyx in pandas._libs.lib.map_infer() <ipython-input-60-e0599f2d2e44> in <lambda>(x) ----> 1 reviews['cleaned_review_1']=reviews['review_text'].apply(lambda x:cleanText(x)) <ipython-input-59-dd1506172a21> in cleanText(text) 2 def cleanText(text): 3 global COUNT ----> 4 nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner']) 5 6 text=text.lower() /anaconda3/lib/python3.6/site-packages/spacy/__init__.py in load(name, **overrides) 25 if depr_path not in (True, False, None): 26 deprecation_warning(Warnings.W001.format(path=depr_path)) ---> 27 return util.load_model(name, **overrides) 28 29 /anaconda3/lib/python3.6/site-packages/spacy/util.py in load_model(name, **overrides) 132 return load_model_from_link(name, **overrides) 133 if is_package(name): # installed as package --> 134 return load_model_from_package(name, **overrides) 135 if Path(name).exists(): # path to model data directory 136 return load_model_from_path(Path(name), **overrides) /anaconda3/lib/python3.6/site-packages/spacy/util.py in load_model_from_package(name, **overrides) 153 """Load a model from an installed package.""" 154 cls = importlib.import_module(name) --> 155 return cls.load(**overrides) 156 157 /anaconda3/lib/python3.6/site-packages/en_core_web_sm/__init__.py in load(**overrides) 10 11 def load(**overrides): ---> 12 return load_model_from_init_py(__file__, **overrides) /anaconda3/lib/python3.6/site-packages/spacy/util.py in load_model_from_init_py(init_file, **overrides) 191 if not model_path.exists(): 192 raise IOError(Errors.E052.format(path=path2str(data_path))) --> 193 return load_model_from_path(data_path, meta, **overrides) 194 195 /anaconda3/lib/python3.6/site-packages/spacy/util.py in load_model_from_path(model_path, meta, **overrides) 162 meta = get_model_meta(model_path) 163 cls = get_lang_class(meta["lang"]) --> 164 nlp = cls(meta=meta, **overrides) 165 pipeline = meta.get("pipeline", []) 166 disable = overrides.get("disable", []) /anaconda3/lib/python3.6/site-packages/spacy/language.py in __init__(self, vocab, make_doc, max_length, meta, **kwargs) 162 if make_doc is True: 163 factory = self.Defaults.create_tokenizer --> 164 make_doc = factory(self, **meta.get("tokenizer", {})) 165 self.tokenizer = make_doc 166 self.pipeline = [] /anaconda3/lib/python3.6/site-packages/spacy/language.py in create_tokenizer(cls, nlp) 77 suffix_search=suffix_search, 78 infix_finditer=infix_finditer, ---> 79 token_match=token_match, 80 ) 81 tokenizer.pyx in spacy.tokenizer.Tokenizer.__init__() tokenizer.pyx in spacy.tokenizer.Tokenizer.add_special_case() vocab.pyx in spacy.vocab.Vocab.make_fused_token() vocab.pyx in spacy.vocab.Vocab.get_by_orth() vocab.pyx in spacy.vocab.Vocab._new_lexeme() /anaconda3/lib/python3.6/site-packages/spacy/lang/lex_attrs.py in is_stop(string, stops) 215 216 --> 217 def is_stop(string, stops=set()): 218 return string.lower() in stops 219 KeyboardInterrupt:
In [55]:
reviews.to_csv("zomato_buffet_cleaned.csv",index=False)
In [56]:
reviews.head()
Out[56]:

There are still some punctuation marks. let us use regex and remove them. This may be because there was no space between the words

In [212]:
import string
def removePunct(text):
    text = text.translate(str.maketrans({key: " {0} ".format(key) for key in string.punctuation}))
    
    text_cleaned="".join([x for x in text if x not in string.punctuation])
    text_cleaned=re.sub(r'[^\x00-\x7F]+',' ', text_cleaned) ## Remove Ascii Characters
    text_cleaned=re.sub('\s+', ' ', text_cleaned).strip()
    return text_cleaned
In [213]:
reviews['cleaned_review']=reviews['cleaned_review'].apply(lambda x:removePunct(x))
reviews.head()
Out[213]:

Understanding the Reviews

What is the distribution of Ratings?

In [214]:
reviews['rating'].value_counts()
Out[214]:
4.0    14837
5.0    13088
3.0    5626 
1.0    2505 
2.0    1851 
4.5    1669 
3.5    1653 
2.5    615  
1.5    417  
Name: rating, dtype: int64

There are too many categories. Let us simplify it

Let us create a new rating category - Positive, Neutral and Negative.

Anythin below 3 is negative, 3 is neutral and above 3 is positive

In [215]:
def getRatingCategory(rating):
    if rating<3:
        return "negative"
    elif rating>3:
        return "positive"
    else:
        return "neutral"
In [216]:
reviews['rating_category']=reviews['rating'].apply(lambda x:getRatingCategory(x))
In [217]:
reviews.head()
Out[217]:
In [218]:
sns.countplot(reviews['rating_category']).set_title("Distribution of Reviews Category")
Out[218]:
Text(0.5,1,'Distribution of Reviews Category')
Notebook Image

majority of the reviews are positive.

Let us now look at what are the common words used in positive and negative reviews

First extract positive and negative reviews into seperate variables
In [219]:
from collections import Counter 

In [220]:
positive_reviews=reviews.loc[reviews['rating_category']=='positive','cleaned_review'].tolist()
positive_reviews[0:5]
Out[220]:
['beautiful place dine in the interior mughal era lighting perfect we go occasion christmas limit item available taste service compromise all the complaint bread better would surely like come',
 'dinner family turn good choose suitable age people try place like starter service good price affordable recommend restaurant early dinner place little noisy',
 'great food proper karnataka style meal twice fully satisfied star manage',
 'good restaurant neighbourhood buffet system properly arrange variety dish garba dance puppet good spread dessert live paratha kulcha making',
 'awesome food great servicefriendly staffsgood quality food complimentary breakfast honey lemon chicken chicken manchow soup perfect place stay family stay bangalore']
In [221]:
negative_reviews=reviews.loc[reviews['rating_category']=='negative','cleaned_review'].tolist()
negative_reviews[0:5]
Out[221]:
['okay like start say restaurant team unorganized worker order food online boneless biryani receive mutton biryani call say tell send receive payment care horrible service biryani taste good service',
 'visit place night dinner party place cozy design like ambience staff courteous buffet menu veg non veg starter veg non veg soup starter okish main course roti rice chow mien gravy sufficient option choose main course desert lack taste service item warm enjoy wait item need replenish quirky item photo booth new restaurant hope visit',
 'look ab find hard time maintain niche create exp joint absolute sense plate table dirty glass come food taste feel like repeat chef come ask feedback honest change work right flavour satisfy extra attentiveness apologetic service make quick modification mood set half hunger go toss hope exp well start',
 'go today impromptu lunch buffet booking wait good min seat manager steward show table service expect introduce concept menu little clueless go ahead ask detail go straight food serve unlimited non alcoholic beverage buffet believe good lunch serve mix italian mexican mediterranean pizza good pink pasta onion ring good pay head food worth surprised have lasagna buffet disaster',
 'visit place lunch party organise friend food pathetic place lack hygiene avoid place']
Tokenise the words and use Counter to keep count of words
In [222]:
test_string="the place was great and I enjoyed being in that place awesome great food"
In [223]:
tokenised_string=test_string.split(" ")
print(tokenised_string)
['the', 'place', 'was', 'great', 'and', 'I', 'enjoyed', 'being', 'in', 'that', 'place', 'awesome', 'great', 'food']
In [224]:
test_counter=Counter(tokenised_string)
test_counter

Out[224]:
Counter({'the': 1,
         'place': 2,
         'was': 1,
         'great': 2,
         'and': 1,
         'I': 1,
         'enjoyed': 1,
         'being': 1,
         'in': 1,
         'that': 1,
         'awesome': 1,
         'food': 1})
In [225]:
test_counter.most_common(2)
Out[225]:
[('place', 2), ('great', 2)]

Let us put the above into a function. To use it, we will have to concatenate all the postive reviews as one string and all negative reviews as another

In [226]:
def getMostCommon(reviews_list,topn=20):
    reviews=" ".join(reviews_list)
    tokenised_reviews=reviews.split(" ")
    
    
    freq_counter=Counter(tokenised_reviews)
    return freq_counter.most_common(topn)
    
In [227]:
top_20_positive_review_words=getMostCommon(positive_reviews,20)
In [228]:
top_20_positive_review_words
Out[228]:
[('good', 37180),
 ('food', 30654),
 ('place', 29353),
 ('service', 14751),
 ('ambience', 12642),
 ('great', 10885),
 ('buffet', 10363),
 ('visit', 10200),
 ('try', 9857),
 ('chicken', 8610),
 ('staff', 8306),
 ('love', 8011),
 ('starter', 7688),
 ('taste', 7527),
 ('nice', 7434),
 ('time', 7038),
 ('veg', 6993),
 ('restaurant', 6779),
 ('like', 6742),
 ('order', 6545)]
In [229]:
top_20_negative_review_words=getMostCommon(negative_reviews,20)
In [230]:
top_20_negative_review_words
Out[230]:
[('food', 5838),
 ('place', 3656),
 ('good', 3538),
 ('service', 2720),
 ('bad', 2700),
 ('buffet', 2169),
 ('order', 2093),
 ('serve', 1698),
 ('restaurant', 1557),
 ('chicken', 1508),
 ('taste', 1491),
 ('time', 1477),
 ('visit', 1466),
 ('like', 1411),
 ('go', 1378),
 ('experience', 1315),
 ('veg', 1258),
 ('ambience', 1256),
 ('starter', 1233),
 ('staff', 1029)]
In [231]:
neg_words=[val[0] for val in top_20_negative_review_words]
pos_words=[val[0] for val in top_20_positive_review_words]

set(neg_words) - set(pos_words)
Out[231]:
{'bad', 'experience', 'go', 'serve'}
In [232]:
set(pos_words) - set(neg_words)
Out[232]:
{'great', 'love', 'nice', 'try'}

There are a few common words as well when we consider unigrams. Let us plot this as a bar plot.

Plotting the Top 50 most common words in negative and positive words
In [233]:
def plotMostCommonWords(reviews_list,topn=50,title="Positive Review",color="blue",axis=None):
    top_words=getMostCommon(reviews_list,topn=topn)
    data=pd.DataFrame()
    data['words']=[val[0] for val in top_words]
    data['freq']=[val[1] for val in top_words]
    if axis!=None:
        sns.barplot(y='words',x='freq',data=data,color=color,ax=axis).set_title(title+" top "+str(topn))
    else:
        sns.barplot(y='words',x='freq',data=data,color=color).set_title(title+" top "+str(topn))
In [234]:
from matplotlib import rcParams

rcParams['figure.figsize'] = 8,6 ## Sets the heigth and width of image


fig,ax=plt.subplots(1,2)
fig.subplots_adjust(wspace=0.5) #Adjusts the space between the two plots
plotMostCommonWords(positive_reviews,20,"Positive Review Unigrams",axis=ax[0])

plotMostCommonWords(negative_reviews,20,"Negative Review Unigrams",color="red",axis=ax[1])

Notebook Image

Some takeaways from the Unigram plot

  1. "Starters","Ambience", "Service" and the "Staff" impact the reviews - they are present in both positive and negative reviews
  2. Words like "great", "love" are present in postive reviews and are not there in the negative ones
  3. Words like "bad","experince" are present in negative reviews and not there in positive reviews.
  4. Top 20 Unigrams on their own are not enough to distinguish between good vs bad reviews.
Dive into the Bigrams

So far we considered only each word as a token. Instead we will now have consecutive pair of words. These are known as bigrams

If we consider n contiguos words it becomes an n-gram

With Bigrams, if "bad" and "experience" occur very frequently togther we will be able to get "bad experience" as a very frequent word

In [235]:
test_review=negative_reviews[0]
test_review
Out[235]:
'okay like start say restaurant team unorganized worker order food online boneless biryani receive mutton biryani call say tell send receive payment care horrible service biryani taste good service'
In [236]:
test_tokens=test_review.split(" ")
[test_tokens[i:] for i in range(2)]
Out[236]:
[['okay',
  'like',
  'start',
  'say',
  'restaurant',
  'team',
  'unorganized',
  'worker',
  'order',
  'food',
  'online',
  'boneless',
  'biryani',
  'receive',
  'mutton',
  'biryani',
  'call',
  'say',
  'tell',
  'send',
  'receive',
  'payment',
  'care',
  'horrible',
  'service',
  'biryani',
  'taste',
  'good',
  'service'],
 ['like',
  'start',
  'say',
  'restaurant',
  'team',
  'unorganized',
  'worker',
  'order',
  'food',
  'online',
  'boneless',
  'biryani',
  'receive',
  'mutton',
  'biryani',
  'call',
  'say',
  'tell',
  'send',
  'receive',
  'payment',
  'care',
  'horrible',
  'service',
  'biryani',
  'taste',
  'good',
  'service']]
In [237]:
tuple(zip(*[test_tokens[i:] for i in range(2)]))
Out[237]:
(('okay', 'like'),
 ('like', 'start'),
 ('start', 'say'),
 ('say', 'restaurant'),
 ('restaurant', 'team'),
 ('team', 'unorganized'),
 ('unorganized', 'worker'),
 ('worker', 'order'),
 ('order', 'food'),
 ('food', 'online'),
 ('online', 'boneless'),
 ('boneless', 'biryani'),
 ('biryani', 'receive'),
 ('receive', 'mutton'),
 ('mutton', 'biryani'),
 ('biryani', 'call'),
 ('call', 'say'),
 ('say', 'tell'),
 ('tell', 'send'),
 ('send', 'receive'),
 ('receive', 'payment'),
 ('payment', 'care'),
 ('care', 'horrible'),
 ('horrible', 'service'),
 ('service', 'biryani'),
 ('biryani', 'taste'),
 ('taste', 'good'),
 ('good', 'service'))
In [238]:
def generateNGram(text,n=2):
    tokens=text.split(" ")
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return ["_".join(ngram) for ngram in ngrams]
In [239]:
positive_reviews_bigrams=[" ".join(generateNGram(review)) for review in positive_reviews]
negative_reviews_bigrams=[" ".join(generateNGram(review)) for review in negative_reviews]

In [240]:
positive_reviews_bigrams[0:5]
Out[240]:
['beautiful_place place_dine dine_in in_the the_interior interior_mughal mughal_era era_lighting lighting_perfect perfect_we we_go go_occasion occasion_christmas christmas_limit limit_item item_available available_taste taste_service service_compromise compromise_all all_the the_complaint complaint_bread bread_better better_would would_surely surely_like like_come',
 'dinner_family family_turn turn_good good_choose choose_suitable suitable_age age_people people_try try_place place_like like_starter starter_service service_good good_price price_affordable affordable_recommend recommend_restaurant restaurant_early early_dinner dinner_place place_little little_noisy',
 'great_food food_proper proper_karnataka karnataka_style style_meal meal_twice twice_fully fully_satisfied satisfied_star star_manage',
 'good_restaurant restaurant_neighbourhood neighbourhood_buffet buffet_system system_properly properly_arrange arrange_variety variety_dish dish_garba garba_dance dance_puppet puppet_good good_spread spread_dessert dessert_live live_paratha paratha_kulcha kulcha_making',
 'awesome_food food_great great_servicefriendly servicefriendly_staffsgood staffsgood_quality quality_food food_complimentary complimentary_breakfast breakfast_honey honey_lemon lemon_chicken chicken_chicken chicken_manchow manchow_soup soup_perfect perfect_place place_stay stay_family family_stay stay_bangalore']
Get the top 20 most frequent bigrams in positive and negative reviews
In [241]:
top_20_positive_bigrams=getMostCommon(positive_reviews_bigrams,topn=20)
top_20_positive_bigrams
Out[241]:
[('main_course', 4469),
 ('food_good', 3374),
 ('good_place', 2724),
 ('visit_place', 2619),
 ('non_veg', 2581),
 ('good_food', 2341),
 ('service_good', 1878),
 ('good_service', 1742),
 ('it_s', 1408),
 ('ambience_good', 1354),
 ('food_service', 1353),
 ('value_money', 1338),
 ('place_good', 1271),
 ('taste_good', 1269),
 ('nice_place', 1269),
 ('ice_cream', 1266),
 ('veg_non', 1074),
 ('good_ambience', 1057),
 ('north_indian', 1045),
 ('great_place', 994)]
In [242]:
top_20_negative_bigrams=getMostCommon(negative_reviews_bigrams,topn=20)
In [243]:
top_20_negative_bigrams
Out[243]:
[('main_course', 679),
 ('non_veg', 602),
 ('visit_place', 460),
 ('food_good', 391),
 ('bad_experience', 372),
 ('ice_cream', 294),
 ('quality_food', 275),
 ('lunch_buffet', 274),
 ('ambience_good', 271),
 ('food_quality', 268),
 ('bad_service', 262),
 ('service_bad', 233),
 ('food_average', 218),
 ('waste_money', 213),
 ('good_service', 210),
 ('veg_starter', 208),
 ('food_serve', 206),
 ('bad_food', 197),
 ('good_thing', 196),
 ('food_service', 193)]
Plotting the top 20 Bigrams for Positive and Negative Reviews
In [249]:
rcParams['figure.figsize'] = 15,20
fig,ax=plt.subplots(1,2)
fig.subplots_adjust(wspace=1) #Adjusts the space between the two plots
plotMostCommonWords(positive_reviews_bigrams,50,"Positive Review Bigrams",axis=ax[0])

plotMostCommonWords(negative_reviews_bigrams,50,"Negative Review Bigrams",color="red",axis=ax[1])

Notebook Image

Key Takeaways from Bigrams

  1. While positive reviews talk about "value_money", negative reviews talk about "waste_money". Positive Reviews talk about "good service", negative reviews talk about "bad service".

Collocations

From simple bigrams, we may not always end up with useful bigrams.

Collocation is a phrase consisting of more than one word but these words more commonly co-occur in a given context than its individual word parts.

For example: non_veg is a more useful and meaningful bigram compares to veg_non. Collocations help us identify these.

Reference : https://medium.com/@nicharuch/collocations-identifying-phrases-that-act-like-individual-words-in-nlp-f58a93a2f84a

To identify important collocations, we will only extract Nouns and Adjectives
In [267]:

COUNT=0
def extract_allowed_pos(text,allowed_pos=['NOUN',"PROPN","ADJ"]):
    global COUNT
    
    
    text=re.sub(r'[^\x00-\x7F]+',' ', text) ## Remove Ascii Characters
    text=re.sub('\s+', ' ', text).strip()
    
    nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner']) 
    
    text=text.lower()
    
    
    document = nlp(text)
    
    doc_cleaned=[token.text for token in document if token.pos_ in allowed_pos]
    #print(doc_cleaned)
    doc_cleaned=" ".join(doc_cleaned)
    #print(doc_cleaned)
    #print(doc_cleaned)
    COUNT=COUNT+1
    if COUNT%1000==0:
        print(COUNT)
    return doc_cleaned
In [268]:
reviews['cleaned_noun_adj_review']=reviews['review_text'].apply(lambda x:extract_allowed_pos(x))
1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000 12000 13000 14000 15000 16000 17000 18000 19000 20000 21000 22000 23000 24000 25000 26000 27000 28000 29000 30000 31000 32000 33000 34000 35000 36000 37000 38000 39000 40000 41000 42000
In [269]:
reviews.to_csv("Cleaned_Reviews_V2.csv",index=False)
In [270]:
 
Out[270]:
In [271]:
reviews['cleaned_noun_adj_review']=reviews['cleaned_noun_adj_review'].apply(lambda x:removePunct(x))

reviews.head()
Out[271]:
Let us now get important bigrams and trigrams using Frequency Method

For identifying collocations, nltk has a function inbuilt. We can use it

In [276]:
import nltk

In [304]:
bigrams = nltk.collocations.BigramAssocMeasures()

## Concat all the positive reviews

positive_reviews_concat=" ".join(positive_reviews_adj_noun)
positive_reviews_tokens=positive_reviews_concat.split(" ")

### Create a bigram Finder
positive_bigramFinder = nltk.collocations.BigramCollocationFinder.from_words(positive_reviews_tokens)


bigram_freq_positive= positive_bigramFinder.ngram_fd.items()

bigram_freq_positive_df=pd.DataFrame()
bigram_freq_positive_df['word']=[val[0] for val in bigram_freq_positive]
bigram_freq_positive_df['freq']=[val[1] for val in bigram_freq_positive]

bigram_freq_positive_df= bigram_freq_positive_df.sort_values('freq',ascending=False)
#bigram_freq_positive_df.head()

In [305]:
## Concat all the positive reviews

negative_reviews_concat=" ".join(negative_reviews_adj_noun)
negative_reviews_tokens=negative_reviews_concat.split(" ")

### Create a bigram Finder
negative_bigramFinder = nltk.collocations.BigramCollocationFinder.from_words(negative_reviews_tokens)


bigram_freq_negative= negative_bigramFinder.ngram_fd.items()

bigram_freq_negative_df=pd.DataFrame()
bigram_freq_negative_df['word']=[val[0] for val in bigram_freq_negative]
bigram_freq_negative_df['freq']=[val[1] for val in bigram_freq_negative]

bigram_freq_negative_df= bigram_freq_negative_df.sort_values('freq',ascending=False)
#bigram_freq_positive_df.head()

In [306]:
bigram_freq_positive_df.head(20)
Out[306]:
In [307]:
bigram_freq_negative_df.head(20)
Out[307]:
Let us now use PMI

PMI (Point wise Mutual Information) : Say there is a set of location names : like New York, New Delhi, San Fransico etc. Using normal tokenissation techniques "New","York","Delhi","San" and "Fransico" become individual words. We want these words to be reporesented as one single token. For this we use PMI.It assigns score for a n-gram based of probability of the words occurring together in entire documents divided bu the probability of the words occurring individually.

Since word like "New" and "York" occurring together is more likely than word "New" and "York" occurring seperately.

But, if random word "abc xyz" occurrs only once in document and that is the only document "abc" and "xyz" occur - the PMI will be very high. To avoid this, we can filter where frequency of words is greater than a particular value. Or and importance of an n-gram can be calculated by multiplying PMI*Freq.

In [308]:
#filter for only those with more than 20 occurences
#positive_bigramFinder.apply_freq_filter(20)
#negative_bigramFinder.apply_freq_filter(20)


In [309]:
bigram_pmi_positive=positive_bigramFinder.score_ngrams(bigrams.pmi)
bigram_pmi_negative=negative_bigramFinder.score_ngrams(bigrams.pmi)


In [310]:
bigram_pmi_positive_df=pd.DataFrame()
bigram_pmi_positive_df['word']=[val[0] for val in bigram_pmi_positive]
bigram_pmi_positive_df['pmi']=[val[1] for val in bigram_pmi_positive]

bigram_pmi_positive_df= bigram_pmi_positive_df.sort_values('pmi',ascending=False)
#bigram_freq_positive_df.head()

In [311]:
bigram_pmi_positive_df.head()
Out[311]:
Let us merge the Frequency and PMI Dataframe for Positive Reviews
In [312]:
positive_bigram=pd.merge(bigram_pmi_positive_df,bigram_freq_positive_df,on='word')
In [314]:
positive_bigram.head(10)
Out[314]:

We can see that there are words that have high PMI score but has low frequency. To consider such instances, let us multiply PMI and Frequency

Merge the frequency and PMI values for the posoitove reviews and create new column which multplies PMI and frequency
In [315]:
positive_bigram['pmi_freq']=positive_bigram['pmi']*positive_bigram['freq']
positive_bigram=positive_bigram.sort_values('pmi_freq',ascending=False)
In [317]:
positive_bigram.head(20)
Out[317]:

Using PMI*freq gives a better result.

In [ ]: