Project: WhatsApp Message Analysis

Write an introduction to your project here: describe the dataset, where you got it from, what you're trying to do with it, and which tools and techniques you're using. You can also mention the course and what you've learned from it.

As a first step, let's upload our Jupyter notebook to Jovian.ml.

In [1]:
!pip install jovian --upgrade --quiet
!pip install numpy --upgrade --quiet
!pip install pandas --upgrade --quiet
!pip install matplotlib --upgrade --quiet
!pip install seaborn --upgrade --quiet
!pip install wordcloud --upgrade --quiet
!pip install emoji  --upgrade --quiet
!pip install plotly_express --upgrade --quiet
In [2]:
project_name = "whatsapp-chat-analysis-course-project-try"

In [3]:
import jovian
In [4]:
jovian.commit(project=project_name, environment=None, files=["Chat.txt"])
[jovian] Attempting to save notebook..
[jovian] Please enter your API key ( from https://jovian.ml/ ):
API KEY: ········
[jovian] Updating notebook "edsenmichaelcy/whatsapp-chat-analysis-course-project-try" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ml/edsenmichaelcy/whatsapp-chat-analysis-course-project-try

Data Preparation and Cleaning

In [4]:
import os
import re
import datetime as time
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import emoji
import jovian
In [5]:
whatsapp_df = pd.read_fwf('Chat.txt', header = None)

whatsapp_df
Out[5]:
In [6]:
whatsapp_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23330 entries, 0 to 23329
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       23177 non-null  object
 1   1       23087 non-null  object
 2   2       788 non-null    object
dtypes: object(3)
memory usage: 546.9+ KB

Next, we use the info() method provided by pandas to inspect the data types in the dataframe. All three columns are unnamed object (string) columns, so the data needs cleaning: the date has to be parsed into a proper datetime, and the "<Media omitted>" entries have to be removed.

In [7]:
whatsapp_df.shape
Out[7]:
(23330, 3)
In [8]:
def txtTodf(txt_file):
    '''Convert WhatsApp chat log text file to a Pandas dataframe.'''

    # Each record starts with a dd/mm/yyyy date at the start of a line and runs
    # until the next such date (or the end of the file), so messages that span
    # multiple lines stay together.
    pat = re.compile(r'^(\d\d/\d\d/\d\d\d\d.*?)(?=^\d\d/\d\d/\d\d\d\d|\Z)', re.S | re.M)
    # The export is assumed to be UTF-8 encoded
    with open(txt_file, encoding='utf-8') as file:
        data = [m.group(1).strip().replace('\n', ' ') for m in pat.finditer(file.read())]

    timestamps = []
    users = []
    messages = []

    for row in data:

        # timestamp is before the first dash
        timestamps.append(row.split(' - ')[0])

        # sender is between am/pm, dash and colon; system events have no sender
        match = re.search('m - (.*?):', row)
        users.append(match.group(1) if match else '')

        # message content is after the first colon
        parts = row.split(': ', 1)
        messages.append(parts[1] if len(parts) > 1 else '')

    df = pd.DataFrame(list(zip(timestamps, users, messages)), columns=['datetime', 'user', 'message'])
    df['datetime'] = pd.to_datetime(df.datetime, format='%d/%m/%Y, %I:%M %p')

    # remove events not associated with a sender
    df = df[df.user != ''].reset_index(drop=True)

    return df

whatsapp_df = txtTodf('Chat.txt')
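
To see how the regular expression in txtTodf keeps a multi-line message together, here is a minimal sketch on a made-up three-message export. The sample text, names, and timestamps are hypothetical and only follow the layout the function expects.

```python
import re

# Hypothetical export: the second message spans two lines.
sample = (
    "01/02/2020, 9:15 pm - Alice: Hello\n"
    "01/02/2020, 9:16 pm - Bob: First line\n"
    "second line\n"
    "01/02/2020, 9:17 pm - Alice: Bye\n"
)

# A record starts at a line that begins with a date and ends right before
# the next such line (or at the end of the text).
pat = re.compile(r'^(\d\d/\d\d/\d\d\d\d.*?)(?=^\d\d/\d\d/\d\d\d\d|\Z)', re.S | re.M)
records = [m.group(1).strip().replace('\n', ' ') for m in pat.finditer(sample)]

for record in records:
    print(record)
```

The continuation line does not start with a date, so it is folded into Bob's message instead of becoming a record of its own.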
In [9]:
whatsapp_df.head(10)
Out[9]:
In [10]:
whatsapp_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22701 entries, 0 to 22700
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   datetime  22701 non-null  datetime64[ns]
 1   user      22701 non-null  object        
 2   message   22701 non-null  object        
dtypes: datetime64[ns](1), object(2)
memory usage: 532.2+ KB
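
The datetime column is now a real datetime type because txtTodf passes the format string '%d/%m/%Y, %I:%M %p' to pd.to_datetime. As a small sketch of what that format string means, the snippet below parses one made-up timestamp with the standard library's datetime.strptime, which uses the same strftime-style codes. The sample value is hypothetical.

```python
from datetime import datetime

# Hypothetical timestamp in the layout the format string expects.
raw = '01/02/2020, 9:15 pm'

# %d/%m/%Y reads the day first; %I with %p reads a 12-hour clock with am/pm.
parsed = datetime.strptime(raw, '%d/%m/%Y, %I:%M %p')
print(parsed.isoformat())
```

Because the day comes first, this value is read as 1 February rather than 2 January, and "9:15 pm" becomes hour 21.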
In [12]:
jovian.commit(project=project_name, environment=None, files=["Chat.txt"])
[jovian] Attempting to save notebook..
[jovian] Updating notebook "edsenmichaelcy/whatsapp-chat-analysis-course-project-try" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ml/edsenmichaelcy/whatsapp-chat-analysis-course-project-try

Cleaning the image data

In [13]:
# Count the rows that are media placeholders
img = whatsapp_df[whatsapp_df['message'] == "<Media omitted>" ]
img.shape

Out[13]:
(1182, 3)

Since we want to analyze text rather than images, we have to remove the media placeholders from the data. There are 1,182 such rows (the shape above is 1,182 rows by 3 columns).

In [14]:
# Drop all media rows using the drop() method
whatsapp_df.drop(img.index, inplace=True)

We drop all media rows to make the dataset cleaner. Passing inplace=True modifies whatsapp_df directly instead of returning a new copy of the dataset.
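
The difference that inplace=True makes can be sketched on a toy dataframe. The three messages below are made up and only stand in for whatsapp_df.

```python
import pandas as pd

# Toy frame standing in for whatsapp_df (values are made up).
toy = pd.DataFrame({'message': ['hi', '<Media omitted>', 'bye']})
media = toy[toy['message'] == '<Media omitted>']

# Without inplace, drop() returns a new frame and leaves `toy` untouched.
trimmed = toy.drop(media.index)
rows_before = len(toy)

# With inplace=True, `toy` itself is modified and the call returns None.
result = toy.drop(media.index, inplace=True)
rows_after = len(toy)

print(rows_before, len(trimmed), rows_after, result)
```

Both styles remove the same rows; the only difference is whether the original variable changes or a new dataframe has to be assigned.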

In [15]:
whatsapp_df.head(10)
Out[15]:

As you can see, the dataset no longer contains media placeholders. There is one problem left: dropping rows leaves gaps in the index, so the index is no longer consecutive. We fix this with reset_index().

In [16]:
whatsapp_df.reset_index(inplace=True, drop=True)
whatsapp_df.shape
Out[16]:
(21519, 3)
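
A minimal sketch of the index problem and its fix, again on made-up data: after filtering, the row labels keep their old values, and reset_index(drop=True) renumbers them from zero without keeping the old index as a column.

```python
import pandas as pd

# Toy frame standing in for whatsapp_df (values are made up).
toy = pd.DataFrame({'message': ['hi', '<Media omitted>', 'bye', 'ok']})
toy = toy[toy['message'] != '<Media omitted>']

# Label 1 is missing because that row was removed.
gapped_index = list(toy.index)

# drop=True discards the old labels instead of adding them as a column.
toy = toy.reset_index(drop=True)
clean_index = list(toy.index)

print(gapped_index, clean_index)
```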

After cleaning, 21,519 messages remain in the dataset. Now we can start making data-driven decisions!

Let's get started. These are the questions we are going to analyze:

  1. Which users sent the most messages in the group?
  2. Which emojis are used the most, and by which users?
  3. At what times and on which days is WhatsApp used the most?
  4. Which month has the most messages, i.e. which is the busiest month?
  5. At what time do users usually start chatting, and when do they go to sleep?
In [ ]:
jovian.commit(project=project_name, environment=None, files=["Chat.txt"])
[jovian] Attempting to save notebook..

1. Which users sent the most messages in the group?

In [ ]:
#First, find out how many messages and which users are in this chat
totalNumberofMessage = whatsapp_df.message.count()
username   = whatsapp_df["user"].unique()

print('Total number of messages:',totalNumberofMessage)
print('Users involved in the chat:',username)
In [ ]:
whatsapp_df1 = whatsapp_df.copy()
whatsapp_df1['Number_of_messages'] = [1]* whatsapp_df1.shape[0]
whatsapp_df1.drop(columns = 'datetime', inplace = True)
whatsapp_df1 = whatsapp_df1.groupby('user')['Number_of_messages'].count().sort_values(ascending = False).reset_index() 
whatsapp_df1
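
The cell above counts messages per user with a helper column and groupby. As a hedged alternative, the same kind of table can be sketched with value_counts(), shown here on made-up user names rather than the real chat; the column name Number_of_messages is reused from the cell above.

```python
import pandas as pd

# Toy frame standing in for whatsapp_df (user names are made up).
toy = pd.DataFrame({'user': ['A', 'B', 'A', 'C', 'A', 'B']})

# value_counts() counts rows per user and sorts in descending order.
counts = (
    toy['user'].value_counts()
    .rename_axis('user')
    .reset_index(name='Number_of_messages')
)
print(counts)
```

This skips the helper column of ones, since counting rows per group does not need it.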

We will look at this result with two different data visualizations.

The first visualization is a line chart.

In [ ]:
# Use seaborn for styling
sns.set_style("darkgrid")

# Resize the figure
plt.figure(figsize=(12, 9))

# Plot the line chart using plt.plot
plt.plot(whatsapp_df1.user, whatsapp_df1.Number_of_messages, 'o--c')  

# Labels and title for the line chart
plt.xlabel('Users')
plt.ylabel('Total number of messages')

plt.title("The highest number of messages sent by the user")
plt.legend(['Messages sent']);

#plt.savefig('whatsapp_df1_Highest_messages.png', format = 'png')

In [ ]:
#Formatting
sns.set_style("darkgrid")

#Font size, figure size, and a transparent figure background
matplotlib.rcParams['font.size'] = 12
matplotlib.rcParams['figure.figsize'] = (12, 9)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
fig, ax = plt.subplots()

#Create a bar chart; x and y are passed by keyword because newer seaborn releases no longer accept them positionally
sns.barplot(x='user', y='Number_of_messages', hue='user', data=whatsapp_df1, dodge=False, palette="CMRmap", ax=ax)
plt.title("The highest number of messages")

#Change the width of the bars
def change_width(ax, new_value):
    for patch in ax.patches:
        current_width = patch.get_width()
        diff = current_width - new_value

        # we change the bar width
        patch.set_width(new_value)

        # we recenter the bar
        patch.set_x(patch.get_x() + diff * .5)

change_width(ax, .35)

#To save the chart image, call savefig before plt.show()
#plt.savefig('whatsapp_df1_Highest_messages.png', format = 'png')
plt.show()

2. Which emojis are used the most, and by which users?

In [ ]:
#Copy the dataset
whatsapp_df2 = whatsapp_df.copy()

#Count how often each emoji appears across all messages.
#emoji.UNICODE_EMOJI no longer exists in emoji >= 2.0; this assumes emoji >= 2.0 and uses emoji.EMOJI_DATA instead.
emoji_ctr = Counter()
#Longest sequences first, so multi-codepoint emojis are matched before their parts
emojis_list = sorted(emoji.EMOJI_DATA.keys(), key=len, reverse=True)
r = re.compile('|'.join(re.escape(p) for p in emojis_list))
for message in whatsapp_df2["message"]:
    emoji_ctr.update(r.findall(message))
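
The cell above joins every known emoji into one regular-expression alternation. One thing to watch for with this approach is the order of the alternatives, because Python's re module takes the first alternative that matches. The sketch below uses a hypothetical three-entry emoji list (not the emoji package) to show the effect on an emoji that carries a skin-tone modifier.

```python
import re

# Toy emoji set: thumbs-up, thumbs-up with a skin-tone modifier, and the bare modifier.
toy_emojis = ['\U0001F44D', '\U0001F44D\U0001F3FD', '\U0001F3FD']
text = 'ok \U0001F44D\U0001F3FD'

# Alternatives in list order: the short thumbs-up wins and the modifier is counted separately.
naive = re.compile('|'.join(re.escape(e) for e in toy_emojis))

# Alternatives sorted longest first: the full sequence is matched as one emoji.
longest_first = re.compile('|'.join(
    re.escape(e) for e in sorted(toy_emojis, key=len, reverse=True)))

naive_hits = naive.findall(text)
ordered_hits = longest_first.findall(text)
print(len(naive_hits), len(ordered_hits))
```

Sorting the alternatives by length keeps combined emojis from being split into pieces and counted twice.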
In [ ]:
#Collect the ten most used emojis and their counts in a dataframe
emojis_df = pd.DataFrame(emoji_ctr.most_common(10), columns=['emoji', 'number_of_Emoji'])

emojis_df

We will use plotly to create pie charts for the emojis. Link: https://plotly.com/python/pie-charts/

In [ ]:
#This pie chart gives an overall view of which emojis are used the most
fig = px.pie(emojis_df, values='number_of_Emoji', names='emoji',title='Emoji percentage used in chat group')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()
In [ ]:
#Next we want to know which emojis each user uses the most. The first result only holds
#emoji and number_of_Emoji, so we need a dataframe that contains each user and the emojis they use.
whatsapp_df2.head()


#Recount the emojis (this assumes emoji >= 2.0, where emoji.EMOJI_DATA replaces emoji.UNICODE_EMOJI)
emoji_ctr = Counter()
emojis_list = sorted(emoji.EMOJI_DATA.keys(), key=len, reverse=True)
r = re.compile('|'.join(re.escape(p) for p in emojis_list))
for message in whatsapp_df2["message"]:
    emoji_ctr.update(r.findall(message))
In [ ]:
emojis_df = whatsapp_df2.copy()

#Store the emojis found in each message next to its sender, together with their number.
#(Filling the rows with the overall most common emojis would pair emojis with unrelated users.)
emojis_df['emoji'] = emojis_df['message'].apply(r.findall)
emojis_df['number_of_Emoji'] = emojis_df['emoji'].str.len()

emojis_df
In [ ]:
users = emojis_df.user.unique()
for user in users:
    #Select the rows sent by this user
    user_df = emojis_df[emojis_df['user'] == user]
    #Flatten the per-message emoji entries into one list for this user
    total_emojis_list = [a for b in user_df.emoji for a in b]
    #Count each emoji, most frequent first
    emoji_counts = Counter(total_emojis_list).most_common()
    print('Emoji Distribution for', user)
    author_emoji_df = pd.DataFrame(emoji_counts, columns=['emoji', 'count'])
    fig = px.pie(author_emoji_df, values='count', names='emoji')
    fig.update_traces(textposition='inside', textinfo='percent+label')
    fig.show()
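
The per-user counts that feed the pie charts can also be computed in one step. The sketch below is one possible approach, not the notebook's method: it assumes each row's emoji column holds a list of the emojis found in that message, and it uses made-up users and emojis.

```python
import pandas as pd

# Toy frame: one row per message, with the emojis found in it (made-up data).
toy = pd.DataFrame({
    'user': ['A', 'B', 'A'],
    'emoji': [['\U0001F602', '\U0001F602'], ['\U0001F44D'], ['\U0001F44D']],
})

# explode() gives each emoji its own row; size() then counts per (user, emoji) pair.
per_user = (
    toy.explode('emoji')
    .groupby(['user', 'emoji'])
    .size()
    .reset_index(name='count')
)
print(per_user)
```

Filtering per_user by user gives a table in the same shape as author_emoji_df in the loop above.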

3. At what times and on which days is WhatsApp used the most?

4. Which month has the most messages, i.e. which is the busiest month?

5. At what time do users usually start chatting, and when do they go to sleep?
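
Questions 3 to 5 all depend on the datetime column. As a starting point only, here is a rough sketch of how messages could be grouped by hour, weekday, and month using the pandas .dt accessor. The toy timestamps are made up, and this is an assumed approach rather than the notebook's own analysis.

```python
import pandas as pd

# Toy stand-in for whatsapp_df; the timestamps are made up for illustration.
toy = pd.DataFrame({
    'datetime': pd.to_datetime([
        '2020-01-06 09:15', '2020-01-06 21:40', '2020-02-03 21:05',
    ]),
    'user': ['A', 'B', 'A'],
})

# Number of messages per hour of day, per weekday name, and per month name.
messages_per_hour = toy.groupby(toy['datetime'].dt.hour).size()
messages_per_weekday = toy.groupby(toy['datetime'].dt.day_name()).size()
messages_per_month = toy.groupby(toy['datetime'].dt.month_name()).size()

print(messages_per_hour.idxmax(), messages_per_weekday.idxmax(), messages_per_month.idxmax())
```

The idxmax() calls return the busiest hour, weekday, and month; the earliest and latest active hours per user could be derived from the same hour values.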

In [ ]:
jovian.commit(project=project_name, environment=None, files=["Chat.txt"])
In [ ]:
userNumber = whatsapp_df.user.unique()
print("The total number of messages from each user:\n")
for user in userNumber:
    #Select the rows sent by this user
    user_df = whatsapp_df[whatsapp_df['user'] == user]
    
    #Print the user name
    print(f'User name: {user}')
    
    #Print the total number of messages this user sent
    print('Messages', user_df.shape[0])
    
    print()