
System Setup

The following Python libraries are required:

  • numpy
  • pandas
  • matplotlib
  • seaborn
  • wordcloud
  • emoji
  • jovian

Run the following command to install all the listed Python libraries:

pip install numpy pandas matplotlib seaborn wordcloud emoji jovian --upgrade

To check that you have all the required libraries, the next cell should run without any errors.

In [1]:
import re
import jovian
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
import emoji
from collections import Counter

How to obtain WhatsApp chat data

  • Open WhatsApp
  • Open a group or inbox
  • Click on the three-dot options button
  • Click on More
  • Click on Export chat
  • Click on Without media
  • Export via email, another IM app, etc.
  • Download the file to your system, rename it to chat-data.txt and put it in a folder

Without media: exports 40k messages
With media: exports 10k messages along with pictures/videos
As I am doing chat data analysis, I went with the `without media` option.

Data Preprocessing

If you run into an empty DataFrame or format errors, use a custom regex and datetime format by referring to the links above, as exports from WhatsApp are not standardized.
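
One way to work out a custom pair is to try a candidate pattern and format on a couple of lines copied from the export before filling in the 'custom' entries. The sketch below assumes a hypothetical bracketed layout (`[dd/mm/yy, HH:MM:SS] name: message`); the sample text, `custom_split` and `custom_format` are illustrative assumptions, not a documented WhatsApp format.

```python
import re
from datetime import datetime

# Hypothetical sample text; replace it with a few lines from your own export.
sample = (
    "[25/12/19, 21:05:13] Alice: Merry Christmas "
    "[25/12/19, 21:06:02] Bob: Same to you"
)

# Candidate values for the 'custom' entries (assumed layout, adjust as needed).
custom_split = r'\[\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2}:\d{2}\]\s'
custom_format = '[%d/%m/%y, %H:%M:%S] '

stamps = re.findall(custom_split, sample)     # every date-time prefix
bodies = re.split(custom_split, sample)[1:]   # everything between the prefixes

# The pattern is usable if every message got exactly one timestamp.
assert len(stamps) == len(bodies) > 0

# strptime raises ValueError if the format does not fit the matched text.
parsed = [datetime.strptime(s, custom_format) for s in stamps]
for when, body in zip(parsed, bodies):
    print(when, '|', body)
```

If `stamps` comes back empty, the regex does not fit the export; if `strptime` raises, the regex fits but the datetime format does not.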

In [2]:
def rawToDf(file, key):
    split_formats = {
        # raw strings keep the regex backslashes from being read as string escapes
        '12hr' : r'\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2}\s[APap][mM]\s-\s',
        '24hr' : r'\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2}\s-\s',
        'custom' : ''
    }
    datetime_formats = {
        '12hr' : '%d/%m/%Y, %I:%M %p - ',
        '24hr' : '%d/%m/%Y, %H:%M - ',
        'custom': ''
    }
    
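    with open(file, 'r', encoding='utf-8') as raw_data: # explicit encoding so emojis and non-ASCII names are read the same way on every platform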
        raw_string = ' '.join(raw_data.read().split('\n')) # converting the list split by newline char. as one whole string as there can be multi-line messages
        user_msg = re.split(split_formats[key], raw_string) [1:] # splits at all the date-time pattern, resulting in list of all the messages with user names
        date_time = re.findall(split_formats[key], raw_string) # finds all the date-time patterns
        
        df = pd.DataFrame({'date_time': date_time, 'user_msg': user_msg}) # exporting it to a df
        
    # converting date-time pattern which is of type String to type datetime,
    # format is to be specified for the whole string where the placeholders are extracted by the method 
    df['date_time'] = pd.to_datetime(df['date_time'], format=datetime_formats[key])
    
    # split user and msg 
    usernames = []
    msgs = []
    for i in df['user_msg']:
        a = re.split(r'([\w\W]+?):\s', i) # lazy pattern match on the first {user_name}: pattern, splitting the user name from the message
        if(a[1:]): # user typed messages
            usernames.append(a[1])
            msgs.append(a[2])
        else: # other notifications in the group(eg: someone was added, some left ...)
            usernames.append("grp_notif")
            msgs.append(a[0])

    # creating new columns         
    df['user'] = usernames
    df['msg'] = msgs

    # dropping the old user_msg col.
    df.drop('user_msg', axis=1, inplace=True)
    
    return df
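
One detail of the user/message split may be worth checking on your own data: `re.split` without `maxsplit` splits at every `text: ` occurrence, so a message body that itself contains a colon followed by a space can get cut short. A minimal sketch of the difference on made-up entries; `split_user_msg` is a hypothetical helper, not part of the notebook.

```python
import re

def split_user_msg(entry):
    """Split 'user: message' at the first ': ' only (hypothetical helper)."""
    parts = re.split(r'([\w\W]+?):\s', entry, maxsplit=1)
    if parts[1:]:                       # a user-typed message
        return parts[1], parts[2]
    return 'grp_notif', parts[0]        # group notification, no 'user: ' prefix

entry = 'Alice: reminder: meeting at 5'

# Without maxsplit, the text after the second colon lands in a later element.
unlimited = re.split(r'([\w\W]+?):\s', entry)
print(unlimited)

print(split_user_msg(entry))
print(split_user_msg('Alice added Bob'))
```

Passing `maxsplit=1` in `rawToDf` would keep such messages whole, at the cost of slightly different word counts than the outputs shown in this notebook.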

Import data

In [3]:
df = rawToDf('chat-data.txt', '12hr')
In [4]:
df.tail()
Out[4]:
In [5]:
df.shape # no. of msgs
Out[5]:
(39999, 3)
In [6]:
me = "Prajwal Prashanth"

Data Cleaning

In [7]:
images = df[df['msg']=="<Media omitted> "] #no. of images, images are represented by <media omitted>
images.shape
Out[7]:
(860, 3)
In [8]:
df["user"].unique()
Out[8]:
array(['Sandesh..!!', 'Sri Hari Colle', 'Prajwal Prashanth', 'Venkat',
       '+91 98863 53469', 'Nikil DB', 'Ktg', 'Billa', 'manish lakshman',
       'Kushal Ramakanth', 'Keshava', 'Abhishek Dharani', 'grp_notif',
       'Srinidhi Nie', 'Kranti Jio', 'Prajwal Kaaadi'], dtype=object)
In [9]:
grp_notif = df[df['user']=="grp_notif"] #no. of grp notifications
grp_notif.shape
Out[9]:
(41, 3)
In [10]:
df.drop(images.index, inplace=True) #removing images
df.drop(grp_notif.index, inplace=True) #removing grp_notif
In [11]:
df.tail()
Out[11]:
In [12]:
df.reset_index(inplace=True, drop=True)
df.shape
Out[12]:
(39098, 3)

Let's discuss what we want to get out of this data

* Is the raw data enough to get that insight?
* If not, what is a possible way to get that insight?
* What is the use of that insight?

Questions from the audience

Q 1) Who is the most active member of the group? Who is the least active?

In [13]:
df.groupby("user")["msg"].count().sort_values(ascending=False)
Out[13]:
user
Sandesh..!!          9257
Sri Hari Colle       9138
Venkat               5259
Nikil DB             4977
Prajwal Prashanth    4383
Billa                1762
Ktg                  1436
manish lakshman      1297
Abhishek Dharani      587
Kushal Ramakanth      342
Prajwal Kaaadi        191
Kranti Jio            182
Srinidhi Nie          103
Keshava                94
+91 98863 53469        90
Name: msg, dtype: int64

Q 2) What is the count of all the emojis that I have used?

In [14]:
emoji_ctr = Counter()
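# emoji.UNICODE_EMOJI exists only in older releases of the emoji package; newer ones expose emoji.EMOJI_DATA
emoji_keys = emoji.EMOJI_DATA.keys() if hasattr(emoji, 'EMOJI_DATA') else emoji.UNICODE_EMOJI.keys()
emojis_list = map(lambda x: ''.join(x.split()), emoji_keys)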
r = re.compile('|'.join(re.escape(p) for p in emojis_list))
for idx, row in df.iterrows():
    if row["user"] == me:
        emojis_found = r.findall(row["msg"])
        for emoji_found in emojis_found:
            emoji_ctr[emoji_found] += 1
In [15]:
for item in emoji_ctr.most_common(10):
    print(item[0] + " - " + str(item[1]))
😂 - 74
🏻 - 30
😢 - 22
✌ - 18
👎 - 18
👍 - 15
😶 - 4
😭 - 3
🏼 - 3
😅 - 2
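
The skin-tone modifiers (🏻, 🏼) show up as separate entries in the counts. One likely reason is that a regex alternation takes the first alternative that matches, so a base emoji can match before the longer base-plus-modifier sequence gets a chance. A minimal sketch of the effect, using a tiny hand-picked emoji list rather than the emoji package data; `count_emojis` is a hypothetical helper.

```python
import re
from collections import Counter

thumbs_up = '\U0001F44D'     # thumbs up
light_skin = '\U0001F3FB'    # light skin tone modifier
emojis = [thumbs_up, light_skin, thumbs_up + light_skin]

# One thumbs up with a skin tone, one without.
text = thumbs_up + light_skin + ' ok ' + thumbs_up

def count_emojis(candidates, text):
    """Count matches of an alternation built from candidates, in the given order."""
    pattern = re.compile('|'.join(re.escape(e) for e in candidates))
    return Counter(pattern.findall(text))

# Original order: the base emoji matches first, leaving the modifier on its own.
unsorted_counts = count_emojis(emojis, text)

# Longest first: the two-code-point sequence is matched as a single emoji.
sorted_counts = count_emojis(sorted(emojis, key=len, reverse=True), text)

print(unsorted_counts)
print(sorted_counts)
```

Sorting the alternatives by length before joining them is a small change, but it would alter the counts shown above.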

Q 3) What can my activity say about my sleep cycle?

In [16]:
df['hour'] = df['date_time'].dt.hour
# the parentheses let the method chain continue onto the next line
(df[df['user']==me].groupby(['hour']).size().sort_index()
 .plot(x="hour", kind='bar'))
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8631472310>
Notebook Image
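
One way to read a possible sleep window off the hourly counts is to look for the longest stretch of consecutive hours with little or no activity, treating the clock as circular. A minimal sketch under that assumption; `longest_quiet_window` and `threshold` are hypothetical names, and quiet hours are only a rough proxy for sleep.

```python
def longest_quiet_window(counts_by_hour, threshold=0):
    """Return (start_hour, length) of the longest run of hours whose
    message count is <= threshold, treating the day as circular."""
    quiet = [counts_by_hour.get(h, 0) <= threshold for h in range(24)]
    if all(quiet):
        return 0, 24
    best_start, best_len = None, 0
    run_start, run_len = None, 0
    # Walk the clock twice so a run that crosses midnight is seen in one piece.
    for i in range(48):
        h = i % 24
        if quiet[h]:
            if run_len == 0:
                run_start = h
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    return best_start, best_len

# Made-up counts: active from 07:00 to 22:59, silent otherwise.
hour_counts = {h: 10 for h in range(7, 23)}
window = longest_quiet_window(hour_counts)
print(window)
```

In this notebook the input could come from the grouped counts plotted above, converted to a dictionary keyed by hour.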

Q 4)

What is the difference in Weekend vs Weekday usage pattern?

How many words do I type on average on weekday vs weekend?

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.weekday.html

In [21]:
df['weekday'] = df['date_time'].apply(lambda x: x.day_name()) # can use day_name or weekday from datetime 
In [22]:
df['is_weekend'] = df.weekday.isin(['Sunday', 'Saturday'])
In [24]:
msgs_per_user = df['user'].value_counts(sort=True)
msgs_per_user
Out[24]:
Sandesh..!!          9257
Sri Hari Colle       9138
Venkat               5259
Nikil DB             4977
Prajwal Prashanth    4383
Billa                1762
Ktg                  1436
manish lakshman      1297
Abhishek Dharani      587
Kushal Ramakanth      342
Prajwal Kaaadi        191
Kranti Jio            182
Srinidhi Nie          103
Keshava                94
+91 98863 53469        90
Name: user, dtype: int64
In [25]:
top5_users = msgs_per_user.index.tolist()[:5]
top5_users
Out[25]:
['Sandesh..!!', 'Sri Hari Colle', 'Venkat', 'Nikil DB', 'Prajwal Prashanth']
In [26]:
df_top5 = df.copy()
df_top5 = df_top5[df_top5.user.isin(top5_users)]
df_top5.head()
Out[26]:
In [27]:
plt.figure(figsize=(30,10))
sns.countplot(x="user", hue="weekday", data=df)
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff51e2a8190>
Notebook Image
In [28]:
df_top5['is_weekend'] = df_top5.weekday.isin(['Sunday', 'Saturday'])
In [29]:
plt.figure(figsize=(20,10))
sns.countplot(x="user", hue="is_weekend", data=df_top5)
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff51c7cd610>
Notebook Image
In [30]:
def word_count(val):
    return len(val.split())
In [31]:
df['no_of_words'] = df['msg'].apply(word_count)
In [32]:
df_top5['no_of_words'] = df_top5['msg'].apply(word_count)
In [33]:
total_words_weekday = df[df['is_weekend']==False]['no_of_words'].sum()
total_words_weekday
Out[33]:
91889
In [34]:
total_words_weekend = df[df['is_weekend']]['no_of_words'].sum()
total_words_weekend
Out[34]:
41129
In [35]:
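total_words_weekday/5 # total weekday words divided by the 5 weekdays (Mon-Fri); not an average per calendar day
Out[35]:
18377.8
In [36]:
total_words_weekend/2 # total weekend words divided by the 2 weekend days (Sat, Sun); not an average per calendar day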
Out[36]:
20564.5
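
Dividing the totals by 5 and 2 compares weekday and weekend volume per day of the week over the whole chat history. If an average per calendar day is wanted instead, one option is to divide by the number of distinct dates in each group. A minimal sketch on made-up data; the `messages` list stands in for the `date_time` and `msg` columns.

```python
from collections import defaultdict
from datetime import datetime

# Made-up (timestamp, message) pairs.
messages = [
    (datetime(2020, 1, 3, 9, 0), 'good morning all'),        # Friday
    (datetime(2020, 1, 3, 21, 30), 'see you tomorrow'),      # Friday
    (datetime(2020, 1, 4, 10, 15), 'match at five today'),   # Saturday
    (datetime(2020, 1, 6, 8, 45), 'back to work'),           # Monday
]

words = defaultdict(int)   # total words, keyed by is_weekend
days = defaultdict(set)    # distinct calendar dates, keyed by is_weekend

for when, msg in messages:
    is_weekend = when.weekday() >= 5     # Monday is 0, so 5 and 6 are Sat/Sun
    words[is_weekend] += len(msg.split())
    days[is_weekend].add(when.date())

avg_weekday = words[False] / len(days[False])
avg_weekend = words[True] / len(days[True])
print(avg_weekday, avg_weekend)
```

Only dates with at least one message are counted here, so completely silent days do not pull the average down.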
In [37]:
df.groupby('user')['no_of_words'].sum().sort_values(ascending=False)
Out[37]:
user
Sandesh..!!          32234
Sri Hari Colle       27111
Venkat               20728
Prajwal Prashanth    17724
Nikil DB             16901
Billa                 4852
manish lakshman       4203
Ktg                   3701
Abhishek Dharani      2001
Kushal Ramakanth      1331
Prajwal Kaaadi         764
Kranti Jio             516
+91 98863 53469        447
Srinidhi Nie           287
Keshava                218
Name: no_of_words, dtype: int64
In [39]:
(df_top5.groupby('user')['no_of_words'].sum()/df_top5.groupby('user').size()).sort_values(ascending=False)
Out[39]:
user
Prajwal Prashanth    4.043806
Venkat               3.941434
Sandesh..!!          3.482122
Nikil DB             3.395821
Sri Hari Colle       2.966842
dtype: float64
In [40]:
wordPerMsg_weekday_vs_weekend = (df_top5.groupby(['user', 'is_weekend'])['no_of_words'].sum()/df_top5.groupby(['user', 'is_weekend']).size())
wordPerMsg_weekday_vs_weekend
Out[40]:
user               is_weekend
Nikil DB           False         3.359782
                   True          3.456009
Prajwal Prashanth  False         4.004094
                   True          4.148179
Sandesh..!!        False         3.507355
                   True          3.429570
Sri Hari Colle     False         2.969789
                   True          2.960444
Venkat             False         4.049866
                   True          3.676913
dtype: float64
In [41]:
wordPerMsg_weekday_vs_weekend.plot(kind='barh')
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff51c51b710>
Notebook Image

Q 5)

Most Usage - Time of Day

In [45]:
x = df.groupby(['hour', 'weekday'])['msg'].size().reset_index()
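x2 = x.pivot(index='hour', columns='weekday', values='msg') # keyword arguments; recent pandas versions no longer accept positional ones here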
x2.head()
Out[45]:
In [46]:
days = ["Monday", 'Tuesday', "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
sns.heatmap(x2[days].fillna(0), robust=True)
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff51c4afe10>
Notebook Image

Q 6)

In any group, do I have any inclination towards responding to someone?

In [47]:
my_msgs_index = np.array(df[df['user']==me].index)
print(my_msgs_index, my_msgs_index.shape)
[ 4 5 11 ... 39073 39076 39077] (4383,)
In [50]:
prev_msgs_index = my_msgs_index - 1
print(prev_msgs_index, prev_msgs_index.shape)
[ 3 4 10 ... 39072 39075 39076] (4383,)
In [51]:
df_replies = df.iloc[prev_msgs_index].copy()
df_replies.shape
Out[51]:
(4383, 7)
In [52]:
df_replies.groupby(["user"])["msg"].size().sort_values().plot(kind='barh')
Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff51c3bda10>
Notebook Image
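
Two caveats of the index-minus-one approach: consecutive messages of my own are counted as replies to myself, and if my message were the very first row, index -1 would wrap around to the last row under `iloc`. A minimal sketch that skips both cases, working on a plain list of user names; `reply_targets` is a hypothetical helper.

```python
from collections import Counter

def reply_targets(users, me):
    """Count whose message immediately precedes each of my messages,
    skipping my own consecutive messages and the very first row."""
    targets = Counter()
    for i in range(1, len(users)):
        if users[i] == me and users[i - 1] != me:
            targets[users[i - 1]] += 1
    return targets

# Made-up sequence of senders in chat order.
users = ['A', 'me', 'me', 'B', 'me', 'A', 'me']
targets = reply_targets(users, 'me')
print(targets)
```

In this notebook the list of senders could be taken from the `user` column in chat order.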

Q 7)

Which are the most common words?

In [58]:
# set.update() returns None, so build the stopword set with union() instead
stopwords = STOPWORDS.union(['lo', 'ge', 'Lo', 'illa', 'yea', 'ella', 'en', 'na', 'En', 'yeah', 'alli', 'ide', 'okay', 'ok', 'will'])

# lowercase every message and join them all into one space-separated string
comment_words = ' '.join(str(val).lower() for val in df.msg.values)
  
  
wordcloud = WordCloud(width = 800, height = 800, 
                background_color ='black', 
                stopwords = stopwords, 
                min_font_size = 10).generate(comment_words) 

In [59]:
wordcloud.to_image()
Out[59]:
Notebook Image
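
The word cloud shows relative frequency only visually. For actual numbers, a plain counter over the lowercased tokens is one option. A minimal sketch on made-up messages; `top_words` is a hypothetical helper, and since WordCloud applies its own tokenization, these counts need not match the cloud exactly.

```python
from collections import Counter

def top_words(messages, stopwords, n=10):
    """Return the n most common whitespace-separated words, ignoring stopwords."""
    counts = Counter()
    for msg in messages:
        for token in str(msg).lower().split():
            if token not in stopwords:
                counts[token] += 1
    return counts.most_common(n)

# Made-up messages and stopwords.
sample_msgs = ['Ok will come', 'come fast', 'Fast fast']
top = top_words(sample_msgs, stopwords={'ok', 'will'}, n=2)
print(top)
```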

Know What They Know (at least by a little)

Assignment-kind

  • One-way or two-way: check the response time between two people
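
As a possible starting point for this exercise, the sketch below measures the delay whenever a message from one person is immediately followed by a message from the other, in each direction. The `rows` list and `response_times` helper are hypothetical; how to treat interleaved messages from other members is left open.

```python
from datetime import datetime

def response_times(rows, sender, responder):
    """Return the delays between a message from `sender` and the next
    message in the chat, whenever that next message is from `responder`."""
    delays = []
    for (t_prev, u_prev), (t_next, u_next) in zip(rows, rows[1:]):
        if u_prev == sender and u_next == responder:
            delays.append(t_next - t_prev)
    return delays

# Made-up (timestamp, user) pairs in chat order.
rows = [
    (datetime(2020, 1, 3, 9, 0), 'A'),
    (datetime(2020, 1, 3, 9, 2), 'B'),
    (datetime(2020, 1, 3, 9, 10), 'A'),
    (datetime(2020, 1, 3, 9, 11), 'C'),
    (datetime(2020, 1, 3, 9, 30), 'B'),
    (datetime(2020, 1, 3, 9, 45), 'A'),
]

a_to_b = response_times(rows, 'A', 'B')   # how quickly B answers A
b_to_a = response_times(rows, 'B', 'A')   # how quickly A answers B
print(a_to_b, b_to_a)
```

Comparing the two lists (for example by their medians) gives a rough sense of whether the responsiveness is one-way or two-way.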
In [5]:
jovian.commit()
[jovian] Saving notebook..
[jovian] Updating notebook "1dbae75b1bc24bd7903e5e3f1ac24048" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ml/PrajwalPrashanth/whatsapp-chat-data-analysis