
System Setup

The following Python libraries are required:

  • numpy
  • pandas
  • matplotlib
  • seaborn
  • wordcloud
  • emoji
  • jovian

Run the following command to install all of the listed libraries:

pip install numpy pandas matplotlib seaborn wordcloud emoji jovian --upgrade

To check whether you have all the required libraries, the next cell should run without any errors:

In [1]:
import re
import jovian
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
import emoji
from collections import Counter

How to obtain WhatsApp chat data

  • Open WhatsApp
  • Open a group/inbox
  • Tap the 3-dot options button
  • Tap "More"
  • Tap "Export chat"
  • Choose "Without media"
  • Export via email/other IMs/....
  • Download it to your system, rename it to chat-data.txt, and put it in a folder

Without media: exports up to 40k messages.
With media: exports up to 10k messages along with pictures/videos.
Since we are doing chat data analysis, I went with the `without media` option.

Data Preprocessing

Use a custom regex and datetime format (referring to the links above) if you run into an empty df or format errors, as WhatsApp exports are not standardized.
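Because the date-time format differs by locale, it is worth sanity-checking the split pattern against one line of your own export before running the full parser. The sample line below is hypothetical; paste a real line from your chat-data.txt in its place:

```python
import re

# Hypothetical sample line from an export; replace with a line from your own chat-data.txt
sample = "10/20/19, 22:24 - Poornachandra: Hello everyone"

pattern = r'\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2}\s-\s'
match = re.findall(pattern, sample)
print(match)  # an empty list means the regex needs adjusting for your locale
```

If the list is empty, tweak the pattern (e.g. 12-hour exports include an am/pm token) until it captures the date-time prefix of your messages.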

In [2]:
def rawToDf(file):
    with open(file, 'r', encoding='utf8') as raw_data:
        # join the newline-split lines back into one string, since messages can span multiple lines
        raw_string = ' '.join(raw_data.read().split('\n'))
        # split at every date-time pattern, giving a list of "user: message" chunks
        user_msg = re.split(r'\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2}\s-\s', raw_string)[1:]
        # find all the date-time tokens themselves
        date_time = re.findall(r'\d{1,2}/\d{1,2}/\d{2,4},\s\d{1,2}:\d{2}\s-\s', raw_string)

        df = pd.DataFrame({'date_time': date_time, 'user_msg': user_msg})

    # convert the date-time strings to datetime; the format must describe the
    # whole token, including the trailing " - "
    try:
        import dateparser  # not in the import cell above; `pip install dateparser` if you want this path
        df['date_time'] = df['date_time'].apply(lambda x: dateparser.parse(x))
    except Exception:
        print("oo")  # dateparser unavailable or failed; fall back to explicit formats
        try:
            df['date_time'] = pd.to_datetime(df['date_time'], format='%m/%d/%y, %H:%M - ')   # e.g. 10/20/19, 22:24 - 
        except ValueError:
            df['date_time'] = pd.to_datetime(df['date_time'], format='%d/%m/%Y, %H:%M - ')   # e.g. 20/10/2019, 22:24 - 

    # split each chunk into user and message
    usernames = []
    msgs = []
    for i in df['user_msg']:
        a = re.split(r'([\w\W]+?):\s', i)  # lazy match up to the first "{user_name}: "
        if a[1:]:  # a user-typed message
            usernames.append(a[1])
            msgs.append(a[2])
        else:  # group notifications (e.g. someone was added, someone left, ...)
            usernames.append("grp_notif")
            msgs.append(a[0])

    # create the new columns
    df['user'] = usernames
    df['msg'] = msgs

    # drop the old user_msg column
    df.drop('user_msg', axis=1, inplace=True)

    return df

Import data

In [3]:
df = rawToDf('poorna.txt')
oo
In [4]:
df.tail()
Out[4]:
In [5]:
df.shape # no. of msgs
Out[5]:
(4608, 3)
In [6]:
me = "Poornachandra" # use your own name exactly as it appears in df["user"].unique()

Data Cleaning

In [7]:
images = df[df['msg']=="<Media omitted> "] #no. of images, images are represented by <media omitted>
images.shape
Out[7]:
(1738, 3)
In [8]:
df["user"].unique()
Out[8]:
array(['grp_notif', 'Milind Chitrak', 'Santos banavasi', 'Pooja Atte',
       'Kushal Chitrak.', '+91 89754 39527', 'Baby Chikkamma',
       'Hariprasad', 'Raghu Mama', 'Neetha Shetty', 'Meena Chitrak',
       '+91 87672 22300', 'Gayatri Chikkamma', 'Poornachandra',
       'Saisudha Chitrak.', '+91 97385 72018', '+91 88847 49720',
       '+91 98709 84057', 'Shubham Shetty', 'Dr Sai Charan',
       'Akshata Setty', 'Ganesh mama2', 'Manu Mama Chitrak',
       'Swetha Chitrak', '+91 98100 02459', 'Bagyashree Gudigar',
       '+91 98200 68823', '+91 89286 24627', 'Shubham', 'Sujatha Mami'],
      dtype=object)
In [9]:
grp_notif = df[df['user']=="grp_notif"] #no. of grp notifications
grp_notif.shape
Out[9]:
(104, 3)
In [10]:
df.drop(images.index, inplace=True) #removing images
df.drop(grp_notif.index, inplace=True) #removing grp_notif
In [11]:
df.tail()
Out[11]:
In [12]:
df.reset_index(inplace=True, drop=True)
df.shape
Out[12]:
(2766, 3)

Let's discuss what we want to get out of this data

  • Is the raw data enough to get that insight?
  • If not, what is a possible way to get it?
  • What is the use of that insight?

Questions from the audience

Q 1) Who is the most active member of the group? Who is the least active?

In [13]:
df.groupby("user")["msg"].count().sort_values(ascending=False)
Out[13]:
user
Raghu Mama            286
Milind Chitrak        279
Santos banavasi       220
Poornachandra         197
Meena Chitrak         196
Kushal Chitrak.       188
Saisudha Chitrak.     176
+91 98709 84057       163
Pooja Atte            152
Gayatri Chikkamma     140
Ganesh mama2          139
Neetha Shetty         116
Hariprasad             97
+91 89754 39527        68
Dr Sai Charan          61
Manu Mama Chitrak      50
+91 88847 49720        46
Shubham Shetty         38
Akshata Setty          30
+91 97385 72018        28
+91 87672 22300        27
+91 98200 68823        21
Baby Chikkamma         18
Shubham                11
Swetha Chitrak         11
+91 98100 02459         5
Bagyashree Gudigar      1
Sujatha Mami            1
+91 89286 24627         1
Name: msg, dtype: int64

Q 2) What is the count of all the emojis that I have used?

In [14]:
emoji_ctr = Counter()
emojis_list = map(lambda x: ''.join(x.split()), emoji.UNICODE_EMOJI.keys())
r = re.compile('|'.join(re.escape(p) for p in emojis_list))
for idx, row in df.iterrows():
    if row["user"] == me:
        emojis_found = r.findall(row["msg"])
        for emoji_found in emojis_found:
            emoji_ctr[emoji_found] += 1
In [15]:
for item in emoji_ctr.most_common(10):
    print(item[0] + " - " + str(item[1]))
😂 - 11 😋 - 11 🤣 - 8 🇮🇳 - 4 😱 - 4 🥶 - 3 🧐 - 3 🐓 - 3 🙂 - 3 🏻 - 2
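Note: recent versions of the emoji library removed `UNICODE_EMOJI` in favour of `EMOJI_DATA`, so the cell above may raise an AttributeError on newer installs. A version-tolerant sketch (the exact shape of the old `UNICODE_EMOJI` dict varied between 1.x releases, so treat this as a starting point):

```python
# emoji < 2.0 exposed emoji.UNICODE_EMOJI; emoji >= 2.0 exposes emoji.EMOJI_DATA
try:
    import emoji
    if hasattr(emoji, 'EMOJI_DATA'):
        all_emojis = set(emoji.EMOJI_DATA.keys())
    else:
        all_emojis = set(emoji.UNICODE_EMOJI.keys())
except ImportError:
    all_emojis = set()  # emoji library not installed

print(len(all_emojis))
```

`all_emojis` can then replace `emoji.UNICODE_EMOJI.keys()` when building the regex in the cell above.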

Q 3) What can my activity say about my sleep cycle?

In [16]:
def to_hour(val):
    return val.hour
In [17]:
df.head()
Out[17]:
In [18]:
df['hour'] = df['date_time'].apply(to_hour)
In [19]:
df[df['user']==me].groupby(['hour']).size().sort_index().plot(x="hour", kind='bar')
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c17f8bea58>
Notebook Image

Q 4)

What is the difference in Weekend vs Weekday usage pattern?

How many words do I type on average on weekday vs weekend?

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.weekday.html

In [20]:
df['weekday'] = df['date_time'].apply(lambda x: x.day_name()) # can use day_name or weekday from datetime 
In [21]:
df['is_weekend'] = df.weekday.isin(['Sunday', 'Saturday'])
In [22]:
msgs_per_user = df['user'].value_counts(sort=True)
msgs_per_user
Out[22]:
Raghu Mama            286
Milind Chitrak        279
Santos banavasi       220
Poornachandra         197
Meena Chitrak         196
Kushal Chitrak.       188
Saisudha Chitrak.     176
+91 98709 84057       163
Pooja Atte            152
Gayatri Chikkamma     140
Ganesh mama2          139
Neetha Shetty         116
Hariprasad             97
+91 89754 39527        68
Dr Sai Charan          61
Manu Mama Chitrak      50
+91 88847 49720        46
Shubham Shetty         38
Akshata Setty          30
+91 97385 72018        28
+91 87672 22300        27
+91 98200 68823        21
Baby Chikkamma         18
Swetha Chitrak         11
Shubham                11
+91 98100 02459         5
Bagyashree Gudigar      1
+91 89286 24627         1
Sujatha Mami            1
Name: user, dtype: int64
In [23]:
top5_users = msgs_per_user.index.tolist()[:5]
top5_users
Out[23]:
['Raghu Mama',
 'Milind Chitrak',
 'Santos banavasi',
 'Poornachandra',
 'Meena Chitrak']
In [24]:
df_top5 = df.copy()
df_top5 = df_top5[df_top5.user.isin(top5_users)]
df_top5.head()
Out[24]:
In [25]:
plt.figure(figsize=(30,10))
sns.countplot(x="user", hue="weekday", data=df)
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c17fb64978>
Notebook Image
In [26]:
df_top5['is_weekend'] = df_top5.weekday.isin(['Sunday', 'Saturday'])
In [27]:
plt.figure(figsize=(20,10))
sns.countplot(x="user", hue="is_weekend", data=df_top5)
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c17fb135c0>
Notebook Image
In [28]:
def word_count(val):
    return len(val.split())
In [29]:
df['no_of_words'] = df['msg'].apply(word_count)
In [30]:
df_top5['no_of_words'] = df_top5['msg'].apply(word_count)
In [31]:
total_words_weekday = df[df['is_weekend']==False]['no_of_words'].sum()
total_words_weekday
Out[31]:
20676
In [32]:
total_words_weekend = df[df['is_weekend']]['no_of_words'].sum()
total_words_weekend
Out[32]:
5926
In [33]:
total_words_weekday/5 # average words on a weekday
Out[33]:
4135.2
In [34]:
total_words_weekend/2 # average words on a weekend
Out[34]:
2963.0
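Dividing by 5 and 2 assumes the chat covers every weekday and weekend day equally. A sketch that instead normalizes by the number of distinct calendar days actually present in the data (shown here on a toy frame; in the notebook you would run the same groupby on df with its date_time, is_weekend and no_of_words columns):

```python
import pandas as pd

# Toy frame standing in for df; one Monday with two messages, one Saturday with one
toy = pd.DataFrame({
    'date_time': pd.to_datetime(['2019-10-21 10:00', '2019-10-21 11:00', '2019-10-26 09:00']),
    'no_of_words': [4, 6, 10],
})
toy['is_weekend'] = toy['date_time'].dt.weekday >= 5  # Saturday=5, Sunday=6

# distinct calendar days actually seen, per weekday/weekend bucket
days_seen = toy.groupby('is_weekend')['date_time'].apply(lambda s: s.dt.date.nunique())
words = toy.groupby('is_weekend')['no_of_words'].sum()
avg_words_per_day = words / days_seen
print(avg_words_per_day)
```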
In [35]:
df.groupby('user')['no_of_words'].sum().sort_values(ascending=False)
Out[35]:
user
Gayatri Chikkamma     5203
Raghu Mama            2715
Ganesh mama2          2546
Meena Chitrak         2232
Santos banavasi       2217
Milind Chitrak        1890
+91 98709 84057       1615
Manu Mama Chitrak     1199
Neetha Shetty         1104
Poornachandra         1026
Kushal Chitrak.        966
Saisudha Chitrak.      831
Pooja Atte             651
Hariprasad             507
+91 89754 39527        453
Shubham Shetty         258
Dr Sai Charan          217
+91 87672 22300        215
Akshata Setty          166
+91 88847 49720        155
+91 97385 72018         95
Baby Chikkamma          93
Swetha Chitrak          78
+91 98200 68823         56
+91 98100 02459         56
Shubham                 34
Bagyashree Gudigar      16
Sujatha Mami             4
+91 89286 24627          4
Name: no_of_words, dtype: int64
In [36]:
(df_top5.groupby('user')['no_of_words'].sum()/df_top5.groupby('user').size()).sort_values(ascending=False)
Out[36]:
user
Meena Chitrak      11.387755
Santos banavasi    10.077273
Raghu Mama          9.493007
Milind Chitrak      6.774194
Poornachandra       5.208122
dtype: float64
In [37]:
wordPerMsg_weekday_vs_weekend = (df_top5.groupby(['user', 'is_weekend'])['no_of_words'].sum()/df_top5.groupby(['user', 'is_weekend']).size())
wordPerMsg_weekday_vs_weekend
Out[37]:
user             is_weekend
Meena Chitrak    False         13.254902
                 True           4.744186
Milind Chitrak   False          6.878788
                 True           6.270833
Poornachandra    False          5.502994
                 True           3.566667
Raghu Mama       False         10.252252
                 True           6.859375
Santos banavasi  False         10.602210
                 True           7.641026
dtype: float64
In [38]:
wordPerMsg_weekday_vs_weekend.plot(kind='barh')
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c1010090f0>
Notebook Image

Q 5)

Most Usage - Time of Day

In [39]:
x = df.groupby(['hour', 'weekday'])['msg'].size().reset_index()
x2 = x.pivot(index='hour', columns='weekday', values='msg') # keyword args; positional pivot args were removed in newer pandas
x2.head()
Out[39]:
In [40]:
days = ["Monday", 'Tuesday', "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
sns.heatmap(x2[days].fillna(0), robust=True)
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c101097cc0>
Notebook Image

Q 6)

In any group, do I have any inclination towards responding to someone?

In [41]:
my_msgs_index = np.array(df[df['user']==me].index)
In [42]:
prev_msgs_index = my_msgs_index - 1
In [43]:
df_replies = df.iloc[prev_msgs_index].copy()
df_replies.shape
Out[43]:
(197, 7)
In [44]:
df_replies.groupby(["user"])["msg"].size().sort_values().plot(kind='barh')
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c10115a550>
Notebook Image
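One caveat with the `my_msgs_index - 1` trick: if two of my messages are consecutive, the "previous message" is also mine, and if my first message sits at index 0 the shift points at -1. A hedged refinement that drops both cases, shown on a toy frame (in the notebook, substitute df and me from above):

```python
import pandas as pd

# Toy chat standing in for df (index already reset, as in the notebook)
toy = pd.DataFrame({
    'user': ['A', 'me', 'me', 'B', 'me'],
    'msg':  ['hi', 'hey', 'again', 'yo', 'sup'],
})
me = 'me'

my_idx = toy.index[toy['user'] == me]
prev_idx = my_idx - 1
# keep only valid previous rows that are not my own messages
prev_idx = [i for i in prev_idx if i >= 0 and toy.loc[i, 'user'] != me]
replied_to = toy.loc[prev_idx, 'user'].value_counts()
print(replied_to)
```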

Q 7)

Which are the most common words?

In [45]:
comment_words = ' '
stopwords = set(STOPWORDS)  # copy first: STOPWORDS.update() mutates in place and returns None
stopwords.update(['lo', 'ge', 'Lo', 'illa', 'yea', 'ella', 'en', 'na', 'En', 'yeah', 'alli', 'ide', 'okay', 'ok', 'will'])  # chat-specific filler words
  
for val in df.msg.values: 
    val = str(val) 
    tokens = val.split() 
        
    for i in range(len(tokens)): 
        tokens[i] = tokens[i].lower() 
          
    for words in tokens: 
        comment_words = comment_words + words + ' '
  
  
wordcloud = WordCloud(width = 800, height = 800, 
                background_color ='black', 
                stopwords = stopwords, 
                min_font_size = 10).generate(comment_words) 

In [46]:
wordcloud.to_image()
Out[46]:
Notebook Image

Know What They Know (at least a little)

Assignment

  • One-way or two-way: check the response time between two people
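A possible starting point for that assignment, sketched on a toy two-person frame (column names match the df built above; the users and timestamps are hypothetical):

```python
import pandas as pd

# Toy chat between two users, standing in for the real df
toy = pd.DataFrame({
    'date_time': pd.to_datetime(['2019-10-20 10:00', '2019-10-20 10:05',
                                 '2019-10-20 10:20', '2019-10-20 10:21']),
    'user': ['A', 'B', 'A', 'B'],
})

# response time: gap between a message and the previous message by the other user
gaps = toy['date_time'].diff()
is_reply = toy['user'] != toy['user'].shift()  # user changed => this message is a reply
response_times = gaps[is_reply].dropna()
print(response_times.mean())  # average time taken to respond
```

Splitting `response_times` by who is replying (group by `user`) would give the one-way versions of the same number.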
In [ ]:
jovian.commit()
[jovian] Saving notebook..
In [ ]: