
An Exploratory Data Analysis on e-Commerce events

This dataset contains 900K detailed user events from an electronics store's e-Commerce website

Image credits - rupixen.com from Unsplash

About the data

In this analysis we will use the "e-Commerce events history in electronics store" data by Michael Kechinov on Kaggle from the REES46 Marketing Platform.

The Kaggle dataset is a .csv file with 900K rows and 9 columns.

Each row in the file represents an event. All events are related to products and users, and each event is a many-to-many relation between products and users.

Below are descriptions of the 9 columns:

  1. event_time: Time when the event occurred (in UTC) in the format "2019-10-01 00:00:04+00:00"
  2. event_type: A single event either "view", "cart" or "purchase"
  3. product_id: A unique code for each product
  4. category_id: A unique code for each product category
  5. category_code: A '.' separated text with names of categories and subcategories "computers.network.router"
  6. brand: Brand name of the product "sony"
  7. price: Price of the product in decimals "31.90"
  8. user_id: A permanent unique ID per shopper "1515915625519388267"
  9. user_session: A temporary ID per session "LJuJVLEjPT"

Events in the column event_type can be:

  1. view - a user viewed a product
  2. cart - a user added a product to shopping cart
  3. remove_from_cart - a user removed a product from shopping cart
  4. purchase - a user purchased a product
What does each column tell us?

At time event_time and during session user_session, user user_id purchased (event_type is "purchase") a product product_id of brand brand in category category_code of price price.

Assumptions and notes for this analysis:

  • Each unique session is a visit
  • There are no remove_from_cart events in this dataset
  • A session may have just one purchase event and no related view or cart events.
  • A session can have multiple purchase events
  • category_code is usually present for meaningful categories and skipped for the rest.
  • Price is assumed to be in US Dollars

8 Questions to be answered

  1. How are different event types distributed?
  2. What is the Cart Abandonment Rate?
  3. How many buyers are making repeat purchases?
  4. How is the Price spread across different event types?
  5. How is the Price distributed within each event type?
  6. How is the Price spread across days and hours for each event type?
  7. Which are the most purchased brands, and which brands are most frequently added to the cart?
  8. What is the share of the top 10 purchased brands by price?

Installing the opendatasets library, which helps download datasets from Kaggle using a Kaggle user ID and API key

In [1]:
# !pip install opendatasets --upgrade --quiet

Importing opendatasets with alias od

In [2]:
# import opendatasets as od
In [3]:
# ecomm_data_url = 'https://www.kaggle.com/mkechinov/ecommerce-events-history-in-electronics-store'

Downloading the data file from Kaggle using the .download method

In [4]:
# od.download(ecomm_data_url)

Checking the size of the downloaded file

In [5]:
# !ls -lh ecommerce-events-history-in-electronics-store
In [6]:
# ecom_data_loc = ('ecommerce-events-history-in-electronics-store/events.csv')

Importing Pandas library

In [2]:
import pandas as pd
In [2]:
# %%time
# ecom_df = pd.read_csv(ecom_data_loc)
In [3]:
# ecom_df.head()

Saving the dataframe in feather format

As CSVs are plain text files with no structure, they often take longer to read than binary formats like feather and parquet, which preserve the tabular structure of the data.
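The speed difference between text and binary formats can be illustrated with a quick timing sketch. Stdlib pickle stands in for feather/parquet here so the snippet runs without pyarrow; the real formats give similar gains for the same reason (no text parsing):

```python
import io
import pickle
import time

import numpy as np
import pandas as pd

# A moderately sized numeric frame
df = pd.DataFrame(np.random.rand(200_000, 4), columns=list("abcd"))

# Time a CSV read (text parsing)
csv_text = df.to_csv(index=False)
t0 = time.perf_counter()
pd.read_csv(io.StringIO(csv_text))
csv_time = time.perf_counter() - t0

# Time a binary read (pickle stands in for feather/parquet)
blob = pickle.dumps(df)
t0 = time.perf_counter()
pickle.loads(blob)
bin_time = time.perf_counter() - t0
```

On numeric data like this, the binary round-trip is typically an order of magnitude faster than parsing CSV text.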

In [4]:
!pip install pyarrow --upgrade --quiet
In [5]:
# %%time
# ecom_df.to_feather('ecom_events_data.feather')
Reading the feather file to restore the dataframe created in the previous session

This will save time by skipping the steps of downloading the dataset from Kaggle and parsing the CSV into a dataframe.

In [8]:
ecom_df_feather = pd.read_feather('ecom_events_data.feather')
ecom_df_feather.head()
Out[8]:
In [9]:
type(ecom_df_feather)
Out[9]:
pandas.core.frame.DataFrame
In [10]:
ecom_df1 = ecom_df_feather.copy()
In [12]:
ecom_df1.shape
Out[12]:
(885129, 9)

There are 885,129 records and 9 columns

In [13]:
ecom_df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 885129 entries, 0 to 885128
Data columns (total 9 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   event_time     885129 non-null  object 
 1   event_type     885129 non-null  object 
 2   product_id     885129 non-null  int64  
 3   category_id    885129 non-null  int64  
 4   category_code  648910 non-null  object 
 5   brand          672765 non-null  object 
 6   price          885129 non-null  float64
 7   user_id        885129 non-null  int64  
 8   user_session   884964 non-null  object 
dtypes: float64(1), int64(3), object(5)
memory usage: 60.8+ MB
In [14]:
ecom_df1.event_time.head()
Out[14]:
0    2020-09-24 11:57:06 UTC
1    2020-09-24 11:57:26 UTC
2    2020-09-24 11:57:27 UTC
3    2020-09-24 11:57:33 UTC
4    2020-09-24 11:57:36 UTC
Name: event_time, dtype: object

Converting the event_time column, which is of object dtype, to datetime format

In [15]:
ecom_df1.event_time = ecom_df1.event_time.str.replace(' UTC','')

ecom_df1.event_time
Out[15]:
0         2020-09-24 11:57:06
1         2020-09-24 11:57:26
2         2020-09-24 11:57:27
3         2020-09-24 11:57:33
4         2020-09-24 11:57:36
                 ...         
885124    2021-02-28 23:55:01
885125    2021-02-28 23:58:05
885126    2021-02-28 23:58:09
885127    2021-02-28 23:58:14
885128    2021-02-28 23:59:09
Name: event_time, Length: 885129, dtype: object
In [16]:
ecom_df1.event_time = ecom_df1.event_time.astype('datetime64[ns]')

ecom_df1.event_time
# ecom_df1.event_time.dt.dayofweek
Out[16]:
0        2020-09-24 11:57:06
1        2020-09-24 11:57:26
2        2020-09-24 11:57:27
3        2020-09-24 11:57:33
4        2020-09-24 11:57:36
                 ...        
885124   2021-02-28 23:55:01
885125   2021-02-28 23:58:05
885126   2021-02-28 23:58:09
885127   2021-02-28 23:58:14
885128   2021-02-28 23:59:09
Name: event_time, Length: 885129, dtype: datetime64[ns]
In [17]:
ecom_df1.event_time.dt.dayofweek
Out[17]:
0         3
1         3
2         3
3         3
4         3
         ..
885124    6
885125    6
885126    6
885127    6
885128    6
Name: event_time, Length: 885129, dtype: int64
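As a side note, pd.to_datetime can do the replace-and-cast in one step, since it understands the " UTC" suffix directly. A minimal sketch on hypothetical sample strings in the dataset's raw format:

```python
import pandas as pd

# Hypothetical samples in the raw "YYYY-MM-DD HH:MM:SS UTC" format
raw = pd.Series(["2020-09-24 11:57:06 UTC", "2021-02-28 23:59:09 UTC"])

# utc=True parses the timezone suffix; tz_localize(None) then drops it
# to match the naive datetime64[ns] column produced above.
ts = pd.to_datetime(raw, utc=True).dt.tz_localize(None)
```

This avoids the intermediate str.replace pass over the whole column.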
In [18]:
ecom_df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 885129 entries, 0 to 885128
Data columns (total 9 columns):
 #   Column         Non-Null Count   Dtype         
---  ------         --------------   -----         
 0   event_time     885129 non-null  datetime64[ns]
 1   event_type     885129 non-null  object        
 2   product_id     885129 non-null  int64         
 3   category_id    885129 non-null  int64         
 4   category_code  648910 non-null  object        
 5   brand          672765 non-null  object        
 6   price          885129 non-null  float64       
 7   user_id        885129 non-null  int64         
 8   user_session   884964 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(3), object(4)
memory usage: 60.8+ MB
In [19]:
ecom_df1.category_code.nunique()
Out[19]:
107
In [20]:
ecom_df1.brand.nunique()
Out[20]:
999
In [21]:
ecom_df1.category_code.head()
Out[21]:
0            electronics.telephone
1      computers.components.cooler
2                             None
3    computers.peripherals.printer
4                             None
Name: category_code, dtype: object
In [22]:
ecom_df1['category_code'] = ecom_df1['category_code'].fillna('No Category')

ecom_df1['brand'] = ecom_df1['brand'].fillna('No Brand')
In [24]:
ecom_df1.isnull().sum()
Out[24]:
event_time         0
event_type         0
product_id         0
category_id        0
category_code      0
brand              0
price              0
user_id            0
user_session     165
dtype: int64

There are 165 NA values in user_session. Dropping them.

In [25]:
ecom_df1.dropna(inplace=True)

Checking the NA/NULL values using .isnull() and summing them per column using .sum()

In [26]:
ecom_df1.isnull().sum()
Out[26]:
event_time       0
event_type       0
product_id       0
category_id      0
category_code    0
brand            0
price            0
user_id          0
user_session     0
dtype: int64
In [27]:
len(ecom_df1)
Out[27]:
884964
In [28]:
type(ecom_df1.event_time[0])
Out[28]:
pandas._libs.tslibs.timestamps.Timestamp

Question 1. How are different event types distributed?

In [29]:
ecom_df1.event_type.unique()
Out[29]:
array(['view', 'cart', 'purchase'], dtype=object)
In [30]:
y1 = ecom_df1.groupby('event_type',as_index=False).size()
y1
Out[30]:
In [31]:
ecom_df1['event_type'].value_counts()
Out[31]:
view        793589
cart         54029
purchase     37346
Name: event_type, dtype: int64
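The percentage distribution computed below can also be obtained in one step with value_counts(normalize=True). A sketch on a toy event series (hypothetical values, not the real data):

```python
import pandas as pd

# Toy event series standing in for ecom_df1['event_type']
events = pd.Series(["view", "view", "view", "cart", "purchase", "view"])

# normalize=True returns fractions; scale and round for percentages
dist = (events.value_counts(normalize=True) * 100).round(1)
```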

Here y1['size'] needs to be used instead of y1.size, since .size is a built-in DataFrame attribute (the total number of elements) that shadows the column name.

In [32]:
y1['Dist_event_type'] = (y1['size']/y1['size'].sum()*100).round(1)
y1
Out[32]:
In [33]:
y_event_type1= y1.event_type.str.upper()
y_event_type1
Out[33]:
0        CART
1    PURCHASE
2        VIEW
Name: event_type, dtype: object
In [34]:
y2 = ecom_df1.groupby('event_type', as_index=False).sum(numeric_only=True)

y2['Dist_price'] = (y2.price/y2.price.sum()*100).round(1)
y2
Out[34]:
In [35]:
q1_df = y2.merge(y1, on = 'event_type')
q1_df
Out[35]:
In [38]:
!pip install matplotlib seaborn plotly --upgrade --quiet

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline
In [32]:
sns.set_style("darkgrid")
In [33]:
import matplotlib
In [34]:
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (12, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'
In [35]:
q1_df 
Out[35]:
In [36]:
import numpy as np

# x-axis positions for the event_type bars
x1_axis = np.arange(len(q1_df.event_type))


plt.bar(x1_axis+0.2, q1_df.Dist_event_type, 0.4, label = 'Distribution of Event Type')
plt.bar(x1_axis-0.2, q1_df.Dist_price, 0.4, label = 'Distribution of Price')

for i in range(len(q1_df.event_type)):
    plt.text(i+0.2, q1_df.Dist_event_type[i], q1_df.Dist_event_type[i], ha='center', va ='bottom')
    plt.text(i-0.2, q1_df.Dist_price[i], q1_df.Dist_price[i], ha ='center', va='bottom')

plt.xticks(x1_axis, q1_df.event_type.str.upper())

plt.xlabel("Event Type")
plt.ylabel("Percentage")

plt.ylim([0,110])

plt.legend(loc='upper left')

plt.show()
Notebook Image

As we can see, view events dominate overwhelmingly, accounting for roughly 90% of both the event count and the total price, while cart and purchase events make up only a few percent each.

Question 2. What is the Cart Abandonment Rate?

In [37]:
ecom_df1.event_type.unique()
Out[37]:
array(['view', 'cart', 'purchase'], dtype=object)
In [38]:
q2_series = ecom_df1.groupby('event_type',as_index=True).size()
q2_series
Out[38]:
event_type
cart         54029
purchase     37346
view        793589
dtype: int64
In [39]:
initiated_transactions = q2_series.loc['cart'] + q2_series.loc['purchase']

completed_transactions = q2_series.loc['purchase']


cart_abandonment_rate = (1 - (completed_transactions / initiated_transactions)) * 100

print('The Cart Abandonment Rate is {0:.1f}%'.format(cart_abandonment_rate))
The Cart Abandonment Rate is 59.1%
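The calculation above can be packaged as a small reusable helper. This is a sketch of the same definition used in this notebook (abandonment = share of initiated transactions, cart + purchase, that never completed):

```python
def cart_abandonment_rate(carts: int, purchases: int) -> float:
    """Share of initiated transactions (cart + purchase) that never completed."""
    initiated = carts + purchases
    return (1 - purchases / initiated) * 100

# Counts taken from the groupby output above
rate = cart_abandonment_rate(54029, 37346)
print('The Cart Abandonment Rate is {0:.1f}%'.format(rate))  # 59.1%
```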

Question 3. How many buyers are making repeat purchases ?

Filtering the events where event_type is purchase

In [40]:
ecom_df1_pur = ecom_df1[ecom_df1['event_type'] == 'purchase']
ecom_df1_pur.head(2)
Out[40]:
In [41]:
# ecom_df1_user_id_pur = ecom_df1[(ecom_df1['event_type'] == 'purchase')][['event_type', 'price']]

user_id_pur = ecom_df1_pur[['event_type','price','user_id']].copy()
user_id_pur.head(2)
Out[41]:
In [42]:
user_id_pur.user_id.duplicated().sum()
Out[42]:
16042
In [43]:
user_id_pur.user_id.nunique()
Out[43]:
21304
In [44]:
user_id_pur.user_id.count()
Out[44]:
37346
In [45]:
user_id_pur['freq'] = user_id_pur['user_id'].map(user_id_pur['user_id'].value_counts())
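The map/value_counts round-trip can also be written with groupby + transform, which is the more idiomatic pandas pattern. A sketch on toy user ids (hypothetical values):

```python
import pandas as pd

# Toy purchase events standing in for user_id_pur
df = pd.DataFrame({"user_id": [1, 1, 2, 3, 3, 3]})

# transform("size") broadcasts each user's purchase count back to every row
df["freq"] = df.groupby("user_id")["user_id"].transform("size")
```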
In [46]:
df2 = user_id_pur.groupby(user_id_pur['freq'] == 1).sum(numeric_only=True)
df2
Out[46]:
In [47]:
repeated_price = df2.loc[False,'price']

trail_price = df2.loc[True, 'price']

print(repeated_price,trail_price)
3269825.170000042 1855570.450000015
In [48]:
trail_buyers = df2.loc[True, 'freq']
trail_buyers
Out[48]:
13598
In [49]:
unique_buyers = user_id_pur.user_id.nunique()
unique_buyers
Out[49]:
21304
In [50]:
repeated_buyers = unique_buyers - trail_buyers
repeated_buyers
Out[50]:
7706
In [51]:
user_id_pur.user_id[user_id_pur['user_id'].duplicated(keep=False)].nunique()
Out[51]:
7706
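The trial/repeat split can also be derived directly from value_counts, without building a freq column. A sketch on toy user ids (hypothetical values):

```python
import pandas as pd

# Toy user_ids of purchase events
purchases = pd.Series([1, 1, 2, 3, 3, 3])

counts = purchases.value_counts()
repeat_buyers = (counts > 1).sum()   # users with more than one purchase
trial_buyers = (counts == 1).sum()   # one-time buyers
```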
In [52]:
fig, axes = plt.subplots(1,2,figsize=(16,8))


data1 = [trail_buyers, repeated_buyers]
label1 = 'Trial Buyers', 'Repeated Buyers'
explode1 = (0, 0.1)


axes[0].set_title('Trial and Repeated Buyers')
axes[0].pie(data1, explode=explode1, labels=label1, autopct='%1.1f%%',
        shadow=True, startangle=45)


data2 = [trail_price, repeated_price]
label2 = 'Trial Price', 'Repeated Price'
explode2 = (0, 0.1)


axes[1].set_title('Trial and Repeated Sales')
axes[1].pie(data2, explode=explode2, labels=label2, autopct='%1.1f%%',
        shadow=True, startangle=90);

Notebook Image

Question 4. How is the Price spread across different event types?

In [53]:
ecom_df1.head(2)
Out[53]:
In [54]:
ecom_df1.price.describe()
Out[54]:
count    884964.000000
mean        146.320109
std         296.793600
min           0.220000
25%          26.460000
50%          65.710000
75%         190.490000
max       64771.060000
Name: price, dtype: float64

The standard deviation is much larger than the mean; let's check how price is distributed.

In [55]:
plt.figure(figsize=(16,2))
plt.boxplot(x=ecom_df1.price,vert= False);
plt.title('Distribution of Price');
Notebook Image

Let's see the distribution across the different event types

In [56]:
display(ecom_df1.price[ecom_df1['event_type']=='view'].describe())

display(ecom_df1.price[ecom_df1['event_type']=='cart'].describe()),

display(ecom_df1.price[ecom_df1['event_type']=='purchase'].describe())
count    793589.000000
mean        145.840013
std         307.234051
min           0.220000
25%          26.190000
50%          64.920000
75%         186.760000
max       64771.060000
Name: price, dtype: float64
count    54029.000000
mean       159.647668
std        190.124995
min          0.220000
25%         30.160000
50%         82.970000
75%        226.840000
max       4071.730000
Name: price, dtype: float64
count    37346.000000
mean       137.240819
std        169.934694
min          0.220000
25%         26.220000
50%         64.480000
75%        202.670000
max       3717.650000
Name: price, dtype: float64

The standard deviation is noticeably lower for the cart and purchase events than for views.
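The three separate describe() calls can be collapsed into one groupby. A sketch on a toy frame (hypothetical prices, not the real data):

```python
import pandas as pd

# Toy events standing in for ecom_df1
df = pd.DataFrame({
    "event_type": ["view", "view", "cart", "purchase"],
    "price": [10.0, 20.0, 30.0, 40.0],
})

# One row of summary statistics per event type
stats = df.groupby("event_type")["price"].describe()
```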

Question 5. How is the Price distributed within each event type?

In [57]:
sns.boxplot(x='event_type', y='price', data=ecom_df1);
Notebook Image

A single shared scale doesn't work well for all the event types; let's use an individual scale for each event type to get a better view

In [58]:
y1
Out[58]:
In [59]:
view_price = ecom_df1.price[ecom_df1['event_type']=='view']
cart_price = ecom_df1.price[ecom_df1['event_type']=='cart']
purchase_price = ecom_df1.price[ecom_df1['event_type']=='purchase']
In [60]:
fig, axes = plt.subplots(3, 1, figsize=(16,10))


axes[0].set_title('Distribution of price for the event View')
sns.boxplot(x=view_price, ax=axes[0])

axes[1].set_title('Distribution of price for the event Cart')
sns.boxplot(x=cart_price, ax=axes[1])

axes[2].set_title('Distribution of price for the event Purchase')
sns.boxplot(x=purchase_price, ax=axes[2])

fig.tight_layout(pad=3.0)
Notebook Image

Question 6. How is the Price spread across days and hours for the View, Cart and Purchase event types?

In [61]:
ecom_df1.head(2)
Out[61]:

Filtering the data by event type

In [62]:
view_day_time = ecom_df1[ecom_df1['event_type']=='view'].copy()
view_day_time.head(2)
Out[62]:
In [63]:
view_day_time['day_of_the_week'] = view_day_time['event_time'].dt.day_name()
view_day_time['time'] = view_day_time['event_time'].dt.hour
view_day_time.head(2)
Out[63]:
In [64]:
view_day_time_df = view_day_time[['day_of_the_week','time','price']]
view_day_time_df.head()
Out[64]:
In [65]:
view_day_time_df.price.sum()
Out[65]:
115737029.73000005
In [66]:
view_pivot = view_day_time_df.pivot_table(values = 'price', 
                                         columns ='time', 
                                         index = 'day_of_the_week',
                                         aggfunc= sum)
view_pivot.shape
Out[66]:
(7, 24)
In [67]:
view_pivot.sum().sum()
Out[67]:
115737029.7299994
In [68]:
view_pivot
Out[68]:
In [69]:
view_reindex = view_pivot.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
view_reindex
Out[69]:
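The pivot/reindex step used above can be illustrated on a toy frame (hypothetical values): pivot_table sums price into one cell per (day, hour), and reindex restores the natural weekday order for the heatmap.

```python
import pandas as pd

# Toy events standing in for view_day_time_df
df = pd.DataFrame({
    "day_of_the_week": ["Monday", "Monday", "Sunday"],
    "time": [9, 9, 14],
    "price": [10.0, 5.0, 2.0],
})

# Sum price per (day, hour) cell, then fix the row order
pivot = df.pivot_table(values="price", index="day_of_the_week",
                       columns="time", aggfunc="sum")
pivot = pivot.reindex(["Monday", "Sunday"])
```

Cells with no events come out as NaN, which the heatmaps render as blank.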

Similarly creating the pivot tables for Purchase and Cart

In [70]:
purchase_day_time = ecom_df1[ecom_df1['event_type']=='purchase'].copy()

purchase_day_time['day_of_the_week'] = purchase_day_time['event_time'].dt.day_name()
purchase_day_time['time'] = purchase_day_time['event_time'].dt.hour

purchase_day_time_df = purchase_day_time[['day_of_the_week','time','price']]

purchase_pivot = purchase_day_time_df.pivot_table(values = 'price', 
                                         columns ='time', 
                                         index = 'day_of_the_week',
                                         aggfunc= sum)

purchase_reindex = purchase_pivot.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])

purchase_reindex

Out[70]:
In [71]:
cart_day_time = ecom_df1[ecom_df1['event_type']=='cart'].copy()

cart_day_time['day_of_the_week'] = cart_day_time['event_time'].dt.day_name()
cart_day_time['time'] = cart_day_time['event_time'].dt.hour

cart_day_time_df = cart_day_time[['day_of_the_week','time','price']]

cart_pivot = cart_day_time_df.pivot_table(values = 'price', 
                                         columns ='time', 
                                         index = 'day_of_the_week',
                                         aggfunc= sum)

cart_reindex = cart_pivot.reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])

cart_reindex

Out[71]:
In [72]:
plt.figure(figsize = (18,6))
plt.title('Viewers')

sns.heatmap(view_reindex)
Out[72]:
<AxesSubplot:title={'center':'Viewers'}, xlabel='time', ylabel='day_of_the_week'>
Notebook Image
In [73]:
plt.figure(figsize = (18,6))
plt.title('Cart')

sns.heatmap(cart_reindex,cmap='Blues')
Out[73]:
<AxesSubplot:title={'center':'Cart'}, xlabel='time', ylabel='day_of_the_week'>
Notebook Image
In [74]:
plt.figure(figsize = (18,6))
plt.title('Purchase')

sns.heatmap(purchase_reindex,cmap='Oranges')
Out[74]:
<AxesSubplot:title={'center':'Purchase'}, xlabel='time', ylabel='day_of_the_week'>
Notebook Image

Question 7. Which are the most purchased brands, and which brands are most frequently added to the cart?

In [75]:
ecom_df1.head(2)
Out[75]:
In [76]:
!pip install wordcloud --upgrade --quiet

from wordcloud import WordCloud

Brands that were added to the cart

In [77]:
ecom_df1_brand_cart = ecom_df1.brand[(ecom_df1['brand'] != 'No Brand') & (ecom_df1['event_type'] == 'cart')].str.upper()

text = ecom_df1_brand_cart.value_counts().to_dict()

wordcloud = WordCloud(background_color="white").generate_from_frequencies(text)
# Generate plot
plt.imshow(wordcloud)
plt.title('Brands most frequently added to the cart')
plt.axis("off")
plt.show()