Jovian
⭐️
Sign In

Let me see can i combine and see that can i differentiate the payments using geographical location

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression ## for performing the logistic Regression
from sklearn.metrics import accuracy_score ## to obtain the accuracy score
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
payment= pd.read_csv("olist_order_payments_dataset.csv")
In [3]:
customer=pd.read_csv("olist_customers_dataset.csv")
In [4]:
customer.head()
Out[4]:
In [5]:
payment.head()
Out[5]:
In [6]:
orders= pd.read_csv("olist_orders_dataset.csv")
In [7]:
orders.head()
Out[7]:
In [8]:
merge=pd.merge(orders,payment, how='outer')
In [9]:
merge.head()
Out[9]:
In [10]:
merge.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 103887 entries, 0 to 103886 Data columns (total 12 columns): order_id 103887 non-null object customer_id 103887 non-null object order_status 103887 non-null object order_purchase_timestamp 103887 non-null object order_approved_at 103712 non-null object order_delivered_carrier_date 101999 non-null object order_delivered_customer_date 100755 non-null object order_estimated_delivery_date 103887 non-null object payment_sequential 103886 non-null float64 payment_type 103886 non-null object payment_installments 103886 non-null float64 payment_value 103886 non-null float64 dtypes: float64(3), object(9) memory usage: 10.3+ MB
In [11]:
merge.isnull().sum()
Out[11]:
order_id                            0
customer_id                         0
order_status                        0
order_purchase_timestamp            0
order_approved_at                 175
order_delivered_carrier_date     1888
order_delivered_customer_date    3132
order_estimated_delivery_date       0
payment_sequential                  1
payment_type                        1
payment_installments                1
payment_value                       1
dtype: int64
In [12]:
merge=merge.drop(columns=['order_id','customer_id','order_approved_at','order_delivered_carrier_date'])
In [13]:
merge.head()
Out[13]:
In [14]:
merge.columns=['order_status','purchase_time','delivery_time','estimated_delivery_time','payment_sequence','payment_type','payment_installments','pay_value']
In [15]:
merge['purchase_time']= pd.to_datetime(merge['purchase_time'], errors='coerce')
merge['delivery_time']= pd.to_datetime(merge['delivery_time'], errors='coerce')
merge['estimated_delivery_time']= pd.to_datetime(merge['estimated_delivery_time'], errors='coerce')
In [16]:
merge.head()
Out[16]:
In [17]:
merge.groupby('payment_type')['pay_value'].mean()
Out[17]:
payment_type
boleto         145.034435
credit_card    163.319021
debit_card     142.570170
not_defined      0.000000
voucher         65.703354
Name: pay_value, dtype: float64

I just tried to understand which payment type generates more income to the company anyway more user's uses credit card but it is also important to know how much income is generating in which means

As we seen credit_card 163.31 is an average amount is being spent and form debit_card 142.57 is being spent on an average than debit cards it seems to be 'boleto' has more additional benefit

In [18]:
merge['payment_sequence'].value_counts()
Out[18]:
1.0     99360
2.0      3039
3.0       581
4.0       278
5.0       170
6.0       118
7.0        82
8.0        54
9.0        43
10.0       34
11.0       29
12.0       21
13.0       13
14.0       10
15.0        8
16.0        6
17.0        6
18.0        6
19.0        6
21.0        4
20.0        4
22.0        3
26.0        2
24.0        2
25.0        2
23.0        2
29.0        1
27.0        1
28.0        1
Name: payment_sequence, dtype: int64
In [19]:
merge.groupby('payment_type')['payment_sequence','payment_installments'].mean()
Out[19]:

So, it seems to be thta voucher has been used on an average to make payments and credit card is used to make payment on minimum of 3-4 times which ment installments

Which tells most of the users by high products on credit basis than one type payments

In [20]:
merge['order_status'].value_counts()
Out[20]:
delivered      100757
shipped          1166
canceled          664
unavailable       649
invoiced          325
processing        319
created             5
approved            2
Name: order_status, dtype: int64
In [21]:
merge.groupby('order_status')['estimated_delivery_time'].size()
Out[21]:
order_status
approved            2
canceled          664
created             5
delivered      100757
invoiced          325
processing        319
shipped          1166
unavailable       649
Name: estimated_delivery_time, dtype: int64

Here, you can see there are few products which has become unavailable which means is it lost in transit?? or is it just unavaliable to ship the product but if the product is unavaliable customer won't pay and wait for delivery

In [22]:
merge.groupby('payment_type')['payment_sequence'].size()
Out[22]:
payment_type
boleto         19784
credit_card    76795
debit_card      1529
not_defined        3
voucher         5775
Name: payment_sequence, dtype: int64

We know that credit card is the most frequent way to pay and I just drawn that may be the only type but now I understood that "boleto" is also one of the famous or wanted pay type

But reason I tried to do differently that i wanted to see with payment sequence and installment

In [23]:
merge.head()
Out[23]:
In [26]:
sns.kdeplot(merge.loc[merge['payment_type']== 'credit_card',"payment_installments"],label='credit_card');
sns.kdeplot(merge.loc[merge['payment_type']== 'voucher' ,"payment_installments"],label='voucher');
sns.kdeplot(merge.loc[merge['payment_type']== 'boleto',"payment_installments"],label='boleto');
sns.kdeplot(merge.loc[merge['payment_type']== 'debit_card',"payment_installments"],label='debit_card');
sns.kdeplot(merge.loc[merge['payment_type']== 'not_defined',"payment_installments"],label='not_defined');


# beautifying the labels
plt.xlabel('installments over payment type')
plt.ylabel('No. of installments')
plt.show()
Notebook Image

so it is proveded that if people are taking on installment basis then it would be only credit_card

You cannot use boleto for installments

In [ ]: