Write an introduction to your project here: describe the dataset, where you got it from, what you're trying to do with it, and which tools and techniques you're using. You can also mention the course and what you've learned from it.
As a first step, let's upload our Jupyter notebook to Jovian.ml.
!pip install jovian --upgrade --quiet
!pip install numpy --upgrade --quiet
!pip install pandas --upgrade --quiet
!pip install matplotlib --upgrade --quiet
!pip install seaborn --upgrade --quiet
project_name = "whatsapp-chat-analysis-course-project-try"
[jovian] Attempting to save notebook..
[jovian] Please enter your API key ( from https://jovian.ml/ ):
API KEY: ········
[jovian] Updating notebook "edsenmichaelcy/whatsapp-chat-analysis-course-project-try" on https://jovian.ml/
[jovian] Uploading notebook..
[jovian] Capturing environment..
[jovian] Committed successfully! https://jovian.ml/edsenmichaelcy/whatsapp-chat-analysis-course-project-try
import os
import pandas as pd
import re
import datetime as time
import jovian
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
whatsapp_df = pd.read_fwf('Chat.txt', header = None)
whatsapp_df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23330 entries, 0 to 23329
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       23177 non-null  object
 1   1       23087 non-null  object
 2   2       788 non-null    object
dtypes: object(3)
memory usage: 546.9+ KB
Next, we use the info() method provided by pandas to inspect the data types in the dataframe. As the output shows, the file was read into three unnamed columns of type object, so we still need to do some cleaning, in particular of the dates and the "<Media omitted>" entries.
def txtTodf(txt_file):
    '''Convert WhatsApp chat log text file to a Pandas dataframe.'''

    # some regex to account for messages taking up multiple lines:
    # a record starts at a dd/mm/yyyy date and ends before the next one
    pat = re.compile(r'^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^\d\d\/\d\d\/\d\d\d\d|\Z)', re.S | re.M)
    with open(txt_file) as file:
        data = [m.group(1).strip().replace('\n', ' ') for m in pat.finditer(file.read())]

    user = []; message = []; datetime = []

    for row in data:

        # timestamp is before the first dash
        datetime.append(row.split(' - ')[0])

        # sender is between am/pm, dash and colon
        try:
            s = re.search('m - (.*?):', row).group(1)
            user.append(s)
        except AttributeError:
            # no sender found (re.search returned None)
            user.append('')

        # message content is after the first colon
        try:
            message.append(row.split(': ', 1)[1])
        except IndexError:
            # no ': ' separator in this row
            message.append('')

    df = pd.DataFrame(zip(datetime, user, message), columns=['datetime', 'user', 'message'])
    df['datetime'] = pd.to_datetime(df.datetime, format='%d/%m/%Y, %I:%M %p')

    # remove events not associated with a sender
    df = df[df.user != ''].reset_index(drop=True)

    return df

whatsapp_df = txtTodf('Chat.txt')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22701 entries, 0 to 22700
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   datetime  22701 non-null  datetime64[ns]
 1   user      22701 non-null  object
 2   message   22701 non-null  object
dtypes: datetime64[ns](1), object(2)
memory usage: 532.2+ KB
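To see what the parsing step does, here is a minimal sketch on a made-up chat snippet. The sample text, the names in it, and the helper split_record are hypothetical and only for illustration. It assumes the export uses the "dd/mm/yyyy, h:mm am - sender: text" layout that txtTodf expects, and it uses only the standard library so it can run without the real Chat.txt.

```python
import re
from datetime import datetime

# Hypothetical sample in the "dd/mm/yyyy, h:mm am - sender: text" layout
sample = (
    "01/02/2020, 9:15 am - Alice: Hello\n"
    "this line continues Alice's message\n"
    "01/02/2020, 9:16 am - Bob: <Media omitted>\n"
    "01/02/2020, 9:17 am - Alice left\n"
)

# Same idea as in txtTodf: a record starts at a date and stops before the next date
pat = re.compile(r'^(\d\d/\d\d/\d{4}.*?)(?=^\d\d/\d\d/\d{4}|\Z)', re.S | re.M)
records = [m.group(1).strip().replace('\n', ' ') for m in pat.finditer(sample)]

def split_record(row):
    """Return (timestamp, sender, message); sender is '' for system events."""
    timestamp = row.split(' - ')[0]
    match = re.search(r'm - (.*?):', row)
    sender = match.group(1) if match else ''
    parts = row.split(': ', 1)
    message = parts[1] if len(parts) > 1 else ''
    return timestamp, sender, message

parsed = [split_record(r) for r in records]
for p in parsed:
    print(p)

# The timestamp is day-first, matching the format string used in txtTodf
first_time = datetime.strptime(parsed[0][0], '%d/%m/%Y, %I:%M %p')
print(first_time)
```

The two-line message from Alice ends up as a single record, and the "Alice left" event has an empty sender, which is why txtTodf can filter such rows out afterwards.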
# Count the messages that are media placeholders
img = whatsapp_df[whatsapp_df['message'] == "<Media omitted>" ]
img.shape
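Since we want to analyze the text rather than the images, we have to remove the media placeholders from the data. Here there are about 1.2k such rows (1182, the difference between the 22701 rows before and the 21519 rows after removal), each with the three columns.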
# Drop all the media rows using the drop function
whatsapp_df.drop(img.index, inplace=True)
We drop all the media rows to make the dataset cleaner. We pass inplace=True so that the existing dataframe is modified directly instead of a new copy being returned.
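The difference between dropping with and without inplace=True can be seen in a small sketch. The dataframe toy_df and its contents are made up for illustration and are not part of the chat data.

```python
import pandas as pd

# Hypothetical toy data standing in for the parsed chat
toy_df = pd.DataFrame({
    'user': ['Alice', 'Bob', 'Alice', 'Bob'],
    'message': ['Hi', '<Media omitted>', 'How are you?', '<Media omitted>'],
})

# A boolean mask selects the media placeholder rows
media = toy_df[toy_df['message'] == '<Media omitted>']

# Without inplace, drop returns a new dataframe and leaves toy_df untouched
copy_without_media = toy_df.drop(media.index)
rows_before = len(toy_df)

# With inplace=True, toy_df itself is modified and the call returns None
result = toy_df.drop(media.index, inplace=True)
rows_after = len(toy_df)

print(rows_before, rows_after, result)
print(list(toy_df.index))  # the surviving rows keep their original labels
```

Note that the remaining rows keep their old index labels, so the index has gaps after the drop.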
As you can see, the dataset no longer contains media placeholders. However, dropping rows leaves gaps in the index, so it is no longer consecutive. We fix this by calling reset_index().
whatsapp_df.reset_index(inplace=True, drop=True)
whatsapp_df.shape
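As a small illustration of what the drop=True argument changes, here is a sketch on a made-up dataframe (gappy is a hypothetical name) whose index already has gaps.

```python
import pandas as pd

# Hypothetical frame whose index has gaps, as happens after dropping rows
gappy = pd.DataFrame({'message': ['Hi', 'How are you?', 'Bye']}, index=[0, 2, 5])

# drop=False (the default) keeps the old labels in a new 'index' column
kept = gappy.reset_index()

# drop=True discards the old labels and renumbers the rows from 0
renumbered = gappy.reset_index(drop=True)

print(list(kept.columns))
print(list(renumbered.index))
```

Using drop=True avoids carrying the old, now meaningless labels along as an extra column.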
After cleaning, 21519 messages remain in our dataset, so we can now move on to data-driven analysis!