In [508]:
import jovian
In [509]:
#Importing the libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
In [510]:
#Suppress warnings
import warnings
warnings.filterwarnings('ignore')
In [511]:
#To view all columns of the dataframe
pd.set_option('display.max_columns',500)
In [512]:
#Importing the dataset
lead = pd.read_csv('Leads.csv')
DATA UNDERSTANDING
In [513]:
#To view the head of the dataset
lead.head()
Out[513]:
In [514]:
#Dimensions of the dataset
lead.shape
Out[514]:
(9240, 37)
In [515]:
#Info of the dataframe for an initial look at missing values and column dtypes
lead.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 37 columns):
Prospect ID                                      9240 non-null object
Lead Number                                      9240 non-null int64
Lead Origin                                      9240 non-null object
Lead Source                                      9204 non-null object
Do Not Email                                     9240 non-null object
Do Not Call                                      9240 non-null object
Converted                                        9240 non-null int64
TotalVisits                                      9103 non-null float64
Total Time Spent on Website                      9240 non-null int64
Page Views Per Visit                             9103 non-null float64
Last Activity                                    9137 non-null object
Country                                          6779 non-null object
Specialization                                   7802 non-null object
How did you hear about X Education               7033 non-null object
What is your current occupation                  6550 non-null object
What matters most to you in choosing a course    6531 non-null object
Search                                           9240 non-null object
Magazine                                         9240 non-null object
Newspaper Article                                9240 non-null object
X Education Forums                               9240 non-null object
Newspaper                                        9240 non-null object
Digital Advertisement                            9240 non-null object
Through Recommendations                          9240 non-null object
Receive More Updates About Our Courses           9240 non-null object
Tags                                             5887 non-null object
Lead Quality                                     4473 non-null object
Update me on Supply Chain Content                9240 non-null object
Get updates on DM Content                        9240 non-null object
Lead Profile                                     6531 non-null object
City                                             7820 non-null object
Asymmetrique Activity Index                      5022 non-null object
Asymmetrique Profile Index                       5022 non-null object
Asymmetrique Activity Score                      5022 non-null float64
Asymmetrique Profile Score                       5022 non-null float64
I agree to pay the amount through cheque         9240 non-null object
A free copy of Mastering The Interview           9240 non-null object
Last Notable Activity                            9240 non-null object
dtypes: float64(4), int64(3), object(30)
memory usage: 2.6+ MB

We can see that many columns in the dataset have missing values; we will observe and treat them accordingly later. Also, most columns are object/categorical and may need to be converted with suitable techniques such as one-hot encoding or simple replacement, which will also be taken care of.
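As a minimal sketch of the one-hot encoding mentioned above (on a hypothetical mini-frame, not the actual lead data), `pd.get_dummies` turns each categorical level into an indicator column:

```python
import pandas as pd

# Hypothetical mini-frame with one categorical column
df = pd.DataFrame({'Lead Origin': ['API', 'Landing Page Submission', 'API']})

# get_dummies creates one indicator column per level;
# drop_first=True removes the redundant (perfectly collinear) first dummy
dummies = pd.get_dummies(df['Lead Origin'], prefix='Lead Origin', drop_first=True)
df = pd.concat([df.drop('Lead Origin', axis=1), dummies], axis=1)
print(df.columns.tolist())
```

With `drop_first=True`, a two-level column collapses to a single 0/1 indicator, which avoids the dummy-variable trap in regression-style models.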

In [516]:
#Initial Stats of the dataframe
lead.describe()
Out[516]:

Among the numerical columns, the initial stats show the presence of outliers. As observed above, for features such as TotalVisits, Total Time Spent on Website and Page Views Per Visit, the values do not increase gradually across percentiles: the jump from the 75th percentile to the maximum is abrupt, indicating outliers in the dataset.
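A standard way to quantify such outliers is the 1.5×IQR rule; here is a hedged sketch on a hypothetical series of visit counts (not the actual lead data):

```python
import pandas as pd

# Hypothetical visit counts with one extreme value
visits = pd.Series([1, 2, 2, 3, 3, 4, 5, 251])

# Interquartile range and the conventional 1.5*IQR fences
q1, q3 = visits.quantile(0.25), visits.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers = visits[(visits < lower) | (visits > upper)]
print(outliers.tolist())
```

The same fences can later drive capping (winsorizing) or row removal, depending on how influential the outliers are.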

In [517]:
#Checking for the percentage of null values for each column.
round(100*(lead.isnull().sum()/len(lead.index)),2)
Out[517]:
Prospect ID                                       0.00
Lead Number                                       0.00
Lead Origin                                       0.00
Lead Source                                       0.39
Do Not Email                                      0.00
Do Not Call                                       0.00
Converted                                         0.00
TotalVisits                                       1.48
Total Time Spent on Website                       0.00
Page Views Per Visit                              1.48
Last Activity                                     1.11
Country                                          26.63
Specialization                                   15.56
How did you hear about X Education               23.89
What is your current occupation                  29.11
What matters most to you in choosing a course    29.32
Search                                            0.00
Magazine                                          0.00
Newspaper Article                                 0.00
X Education Forums                                0.00
Newspaper                                         0.00
Digital Advertisement                             0.00
Through Recommendations                           0.00
Receive More Updates About Our Courses            0.00
Tags                                             36.29
Lead Quality                                     51.59
Update me on Supply Chain Content                 0.00
Get updates on DM Content                         0.00
Lead Profile                                     29.32
City                                             15.37
Asymmetrique Activity Index                      45.65
Asymmetrique Profile Index                       45.65
Asymmetrique Activity Score                      45.65
Asymmetrique Profile Score                       45.65
I agree to pay the amount through cheque          0.00
A free copy of Mastering The Interview            0.00
Last Notable Activity                             0.00
dtype: float64

As seen above, several columns have a large number of missing values. Columns such as Lead Quality, Asymmetrique Activity Index, Asymmetrique Profile Index, Asymmetrique Activity Score, Asymmetrique Profile Score and Tags have a high percentage of missing values. We need to decide on an approach: either drop these columns or impute the missing values with appropriate values. For this, we can understand each feature and its importance by diving deeper.
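The drop-vs-impute decision above can be mechanized with a simple threshold audit; a sketch on a hypothetical frame (the 40% cutoff is an arbitrary illustration, not a rule from the notebook):

```python
import pandas as pd
import numpy as np

# Hypothetical frame: one mostly-missing column, one complete column
df = pd.DataFrame({'mostly_missing': [1, np.nan, np.nan, np.nan],
                   'complete': [1, 2, 3, 4]})

# Percentage of nulls per column, computed as in the notebook
null_pct = round(100 * df.isnull().sum() / len(df.index), 2)

# Columns above the (arbitrary) 40% threshold become candidates for
# dropping, or for a dedicated 'Unknown' level rather than mode imputation
candidates = null_pct[null_pct > 40].index.tolist()
print(candidates)
```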

DATA CLEANING
In [518]:
#Observing the existing different levels for Lead Quality
lead['Lead Quality'].unique()
Out[518]:
array(['Low in Relevance', nan, 'Might be', 'Not Sure', 'Worst',
       'High in Relevance'], dtype=object)
In [519]:
#Assessing the number of entries in lead-quality
from collections import Counter      
Counter(lead['Lead Quality'])
Out[519]:
Counter({'Low in Relevance': 583,
         nan: 4767,
         'Might be': 1560,
         'Not Sure': 1092,
         'Worst': 601,
         'High in Relevance': 637})

51.6% of the values for Lead Quality are missing. We could replace the null values with a new level, say 'Unknown'; however, among the existing levels, 'Might be' and 'Not Sure' already convey much the same meaning as 'Unknown'. Hence we will impute the missing values with 'Not Sure'.

In [520]:
#Replacing the missing values with 'Not Sure' category for 'Lead Quality'
lead['Lead Quality'].replace(np.nan,'Not Sure',inplace=True)
In [521]:
round(100*(lead.isnull().sum()/len(lead.index)),2)
Out[521]:
Prospect ID                                       0.00
Lead Number                                       0.00
Lead Origin                                       0.00
Lead Source                                       0.39
Do Not Email                                      0.00
Do Not Call                                       0.00
Converted                                         0.00
TotalVisits                                       1.48
Total Time Spent on Website                       0.00
Page Views Per Visit                              1.48
Last Activity                                     1.11
Country                                          26.63
Specialization                                   15.56
How did you hear about X Education               23.89
What is your current occupation                  29.11
What matters most to you in choosing a course    29.32
Search                                            0.00
Magazine                                          0.00
Newspaper Article                                 0.00
X Education Forums                                0.00
Newspaper                                         0.00
Digital Advertisement                             0.00
Through Recommendations                           0.00
Receive More Updates About Our Courses            0.00
Tags                                             36.29
Lead Quality                                      0.00
Update me on Supply Chain Content                 0.00
Get updates on DM Content                         0.00
Lead Profile                                     29.32
City                                             15.37
Asymmetrique Activity Index                      45.65
Asymmetrique Profile Index                       45.65
Asymmetrique Activity Score                      45.65
Asymmetrique Profile Score                       45.65
I agree to pay the amount through cheque          0.00
A free copy of Mastering The Interview            0.00
Last Notable Activity                             0.00
dtype: float64
In [522]:
Counter(lead['Asymmetrique Activity Index'])
Out[522]:
Counter({'02.Medium': 3839, '01.High': 821, '03.Low': 362, nan: 4218})
In [523]:
Counter(lead['Asymmetrique Profile Index'])
Out[523]:
Counter({'02.Medium': 2788, '01.High': 2203, '03.Low': 31, nan: 4218})

We can see the same number of null values for the closely related columns Asymmetrique Activity Index and Asymmetrique Profile Index; both columns are mostly empty for the same records. Since around 45.7% of the values are missing, imputing with the mode could introduce bias, so we replace the null values with a new level, 'Unknown'.

In [524]:
#Replacing the missing values with 'Unknown' category for 'Asymmetrique Activity Index'
lead['Asymmetrique Activity Index'].replace(np.nan,'Unknown',inplace=True)
In [525]:
#Replacing the missing values with 'Unknown' category for 'Asymmetrique Profile Index'
lead['Asymmetrique Profile Index'].replace(np.nan,'Unknown',inplace=True)

A similar approach can be taken for the Asymmetrique Activity Score and Asymmetrique Profile Score columns, which have the same number of null values. The interesting point here is that these columns are continuous features.

In [526]:
lead['Asymmetrique Activity Score'].value_counts()
Out[526]:
14.0    1771
15.0    1293
13.0     775
16.0     467
17.0     349
12.0     196
11.0      95
10.0      57
9.0        9
18.0       5
8.0        4
7.0        1
Name: Asymmetrique Activity Score, dtype: int64
In [527]:
sns.boxplot(lead['Asymmetrique Activity Score'])
Out[527]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faf01675748>
Notebook Image
In [528]:
lead['Asymmetrique Profile Score'].value_counts()
Out[528]:
15.0    1759
18.0    1071
16.0     599
17.0     579
20.0     308
19.0     245
14.0     226
13.0     204
12.0      22
11.0       9
Name: Asymmetrique Profile Score, dtype: int64
In [529]:
sns.boxplot(lead['Asymmetrique Profile Score'])
Out[529]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faf013a3438>
Notebook Image

We can see the different unique values and their counts in the dataset. Again, Asymmetrique Activity Score and Asymmetrique Profile Score have a high number of missing values. Since these are continuous features, we will impute them with the column mean rather than 0, so as not to distort the distribution.

In [530]:
sns.distplot(lead['Asymmetrique Activity Score'].dropna())
Out[530]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faf0137e2b0>
Notebook Image
In [531]:
#Replacing the missing values with the column mean for 'Asymmetrique Activity Score'
lead['Asymmetrique Activity Score'].replace(np.nan,lead['Asymmetrique Activity Score'].mean(),inplace=True)
In [532]:
sns.distplot(lead['Asymmetrique Activity Score'])
Out[532]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faf012577f0>
Notebook Image
In [533]:
sns.boxplot(lead['Asymmetrique Activity Score'])
Out[533]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faf011a05f8>
Notebook Image
In [534]:
sns.distplot(lead['Asymmetrique Profile Score'].dropna())
Out[534]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faf0116fc88>
Notebook Image
In [535]:
#Replacing the missing values with the column mean for 'Asymmetrique Profile Score'
lead['Asymmetrique Profile Score'].replace(np.nan,lead['Asymmetrique Profile Score'].mean(),inplace=True)
In [536]:
sns.distplot(lead['Asymmetrique Profile Score'])
Out[536]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faf011097f0>
Notebook Image
In [537]:
sns.boxplot(lead['Asymmetrique Profile Score'])
Out[537]:
<matplotlib.axes._subplots.AxesSubplot at 0x7faf01179cc0>
Notebook Image

We can see that imputing the Profile Score nulls with the mean does introduce some apparent outliers (an artifact of replacing a large number of nulls with a single value), but that is preferable to imputing with 0, which would change the distribution of the profile scores.
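When a score column is skewed or carries outliers, the median is often a more robust fill value than the mean; a hedged sketch of this alternative on a toy series (not the actual score columns):

```python
import pandas as pd
import numpy as np

# Toy score series with missing values
scores = pd.Series([14.0, 15.0, 15.0, 16.0, np.nan, np.nan])

# The median is insensitive to extreme values, so it distorts the
# distribution less than the mean when outliers are present
filled = scores.fillna(scores.median())
print(filled.isnull().sum())
```

For the roughly symmetric score distributions seen in the plots above, mean and median imputation give very similar results, so the notebook's choice of the mean is reasonable.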

In [538]:
Counter(lead['Tags'])
Out[538]:
Counter({'Interested in other courses': 513,
         'Ringing': 1203,
         'Will revert after reading the email': 2072,
         nan: 3353,
         'Lost to EINS': 175,
         'In confusion whether part time or DLP': 5,
         'Busy': 186,
         'switched off': 240,
         'in touch with EINS': 12,
         'Already a student': 465,
         'Diploma holder (Not Eligible)': 63,
         'Graduation in progress': 111,
         'Closed by Horizzon': 358,
         'number not provided': 27,
         'opp hangup': 33,
         'Not doing further education': 145,
         'invalid number': 83,
         'wrong number given': 47,
         'Interested  in full time MBA': 117,
         'Still Thinking': 6,
         'Lost to Others': 7,
         'Shall take in the next coming month': 2,
         'Lateral student': 3,
         'Interested in Next batch': 5,
         'Recognition issue (DEC approval)': 1,
         'Want to take admission but has financial problems': 6,
         'University not recognized': 2})
In [539]:
#Replacing the missing values with a new 'Unknown_tags' category for 'Tags'
lead['Tags'].replace(np.nan,'Unknown_tags',inplace=True)
In [540]:
round(100*(lead.isnull().sum()/len(lead.index)),2)
Out[540]:
Prospect ID                                       0.00
Lead Number                                       0.00
Lead Origin                                       0.00
Lead Source                                       0.39
Do Not Email                                      0.00
Do Not Call                                       0.00
Converted                                         0.00
TotalVisits                                       1.48
Total Time Spent on Website                       0.00
Page Views Per Visit                              1.48
Last Activity                                     1.11
Country                                          26.63
Specialization                                   15.56
How did you hear about X Education               23.89
What is your current occupation                  29.11
What matters most to you in choosing a course    29.32
Search                                            0.00
Magazine                                          0.00
Newspaper Article                                 0.00
X Education Forums                                0.00
Newspaper                                         0.00
Digital Advertisement                             0.00
Through Recommendations                           0.00
Receive More Updates About Our Courses            0.00
Tags                                              0.00
Lead Quality                                      0.00
Update me on Supply Chain Content                 0.00
Get updates on DM Content                         0.00
Lead Profile                                     29.32
City                                             15.37
Asymmetrique Activity Index                       0.00
Asymmetrique Profile Index                        0.00
Asymmetrique Activity Score                       0.00
Asymmetrique Profile Score                        0.00
I agree to pay the amount through cheque          0.00
A free copy of Mastering The Interview            0.00
Last Notable Activity                             0.00
dtype: float64
In [541]:
Counter(lead['What matters most to you in choosing a course'])
Out[541]:
Counter({'Better Career Prospects': 6528,
         nan: 2709,
         'Flexibility & Convenience': 2,
         'Other': 1})

We find that the motive for most customers choosing this course is Better Career Prospects; hence we can impute the missing values with the mode.

In [542]:
lead['What matters most to you in choosing a course'].fillna(lead['What matters most to you in choosing a course'].mode()[0],inplace=True) #imputing with mode
In [543]:
Counter(lead['Lead Profile'])
Out[543]:
Counter({'Select': 4146,
         'Potential Lead': 1613,
         nan: 2709,
         'Other Leads': 487,
         'Lateral Student': 24,
         'Dual Specialization Student': 20,
         'Student of SomeSchool': 241})

Here a customer could be either a potential lead or a lateral student, which we cannot know. Also, the level 'Select' is effectively the same as a null value. Hence we replace both with a new level, 'Unknown_lead_profile', instead of imputing actual values.

In [544]:
#Replacing the missing values as well for 'Select' with 'Unknown' category for 'Lead Profile'
lead['Lead Profile'].replace(np.nan,'Unknown_lead_profile',inplace=True)
lead['Lead Profile'].replace('Select','Unknown_lead_profile',inplace=True)

In [545]:
lead['Lead Profile'].replace('Unknown_lead_profile','Other Leads',inplace=True)  #merging Unknown_lead_profile into 'Other Leads'
In [546]:
Counter(lead['Lead Profile'])
Out[546]:
Counter({'Other Leads': 7342,
         'Potential Lead': 1613,
         'Lateral Student': 24,
         'Dual Specialization Student': 20,
         'Student of SomeSchool': 241})
In [547]:
Counter(lead['What is your current occupation'])
Out[547]:
Counter({'Unemployed': 5600,
         'Student': 210,
         nan: 2690,
         'Working Professional': 706,
         'Businessman': 8,
         'Other': 16,
         'Housewife': 10})

Again, imputing the mode ('Unemployed') would not be a good approach here, since a customer may be rated differently depending on whether they are a student, a working professional or unemployed. There is already an 'Other' level, which we can use for all the missing values.

In [548]:
lead['What is your current occupation'].replace(np.nan,'Other',inplace=True)
In [549]:
Counter(lead['What is your current occupation'])
Out[549]:
Counter({'Unemployed': 5600,
         'Student': 210,
         'Other': 2706,
         'Working Professional': 706,
         'Businessman': 8,
         'Housewife': 10})
In [550]:
Counter(lead['Country'])
Out[550]:
Counter({nan: 2461,
         'India': 6492,
         'Russia': 1,
         'Kuwait': 4,
         'Oman': 6,
         'United Arab Emirates': 53,
         'United States': 69,
         'Australia': 13,
         'United Kingdom': 15,
         'Bahrain': 7,
         'Ghana': 2,
         'Singapore': 24,
         'Qatar': 10,
         'Saudi Arabia': 21,
         'Belgium': 2,
         'France': 6,
         'Sri Lanka': 1,
         'China': 2,
         'Canada': 4,
         'Netherlands': 2,
         'Sweden': 3,
         'Nigeria': 4,
         'Hong Kong': 7,
         'Germany': 4,
         'Asia/Pacific Region': 2,
         'Uganda': 2,
         'Kenya': 1,
         'Italy': 2,
         'South Africa': 4,
         'Tanzania': 1,
         'unknown': 5,
         'Malaysia': 1,
         'Liberia': 1,
         'Switzerland': 1,
         'Denmark': 1,
         'Philippines': 2,
         'Bangladesh': 2,
         'Vietnam': 1,
         'Indonesia': 1})

Although the largest number of customers are from India, as many as 2461 values are missing, and targeted branding and promotional activities depend on the number of customers per country, so imputing the mode would not be a good option. A level 'unknown' also already appears in the list of countries. We therefore replace the null values with a new category, 'unknown_country'.

In [551]:
lead['Country'].replace(np.nan,'unknown_country',inplace=True)
In [552]:
Counter(lead['Country'])
Out[552]:
Counter({'unknown_country': 2461,
         'India': 6492,
         'Russia': 1,
         'Kuwait': 4,
         'Oman': 6,
         'United Arab Emirates': 53,
         'United States': 69,
         'Australia': 13,
         'United Kingdom': 15,
         'Bahrain': 7,
         'Ghana': 2,
         'Singapore': 24,
         'Qatar': 10,
         'Saudi Arabia': 21,
         'Belgium': 2,
         'France': 6,
         'Sri Lanka': 1,
         'China': 2,
         'Canada': 4,
         'Netherlands': 2,
         'Sweden': 3,
         'Nigeria': 4,
         'Hong Kong': 7,
         'Germany': 4,
         'Asia/Pacific Region': 2,
         'Uganda': 2,
         'Kenya': 1,
         'Italy': 2,
         'South Africa': 4,
         'Tanzania': 1,
         'unknown': 5,
         'Malaysia': 1,
         'Liberia': 1,
         'Switzerland': 1,
         'Denmark': 1,
         'Philippines': 2,
         'Bangladesh': 2,
         'Vietnam': 1,
         'Indonesia': 1})
In [553]:
Counter(lead['How did you hear about X Education'])
Out[553]:
Counter({'Select': 5043,
         'Word Of Mouth': 348,
         'Other': 186,
         nan: 2207,
         'Online Search': 808,
         'Multiple Sources': 152,
         'Advertisements': 70,
         'Student of SomeSchool': 310,
         'Email': 26,
         'Social Media': 67,
         'SMS': 23})

Here, alongside the missing values, we see the categories 'Select' and 'Other'. We can merge all of these into the single category 'Other'.

In [554]:
lead['How did you hear about X Education'].replace(np.nan,'Other',inplace=True)
lead['How did you hear about X Education'].replace('Select','Other',inplace=True)
In [555]:
Counter(lead['How did you hear about X Education'])
Out[555]:
Counter({'Other': 7436,
         'Word Of Mouth': 348,
         'Online Search': 808,
         'Multiple Sources': 152,
         'Advertisements': 70,
         'Student of SomeSchool': 310,
         'Email': 26,
         'Social Media': 67,
         'SMS': 23})
In [556]:
Counter(lead['City'])
Out[556]:
Counter({'Select': 2249,
         'Mumbai': 3222,
         nan: 1420,
         'Thane & Outskirts': 752,
         'Other Metro Cities': 380,
         'Other Cities': 686,
         'Other Cities of Maharashtra': 457,
         'Tier II Cities': 74})

We replace the null and 'Select' cities with the mode.

In [557]:
lead['City'].replace(np.nan,lead['City'].mode()[0],inplace=True)
lead['City'].replace('Select',lead['City'].mode()[0],inplace=True)
In [558]:
Counter(lead['City'])
Out[558]:
Counter({'Mumbai': 6891,
         'Thane & Outskirts': 752,
         'Other Metro Cities': 380,
         'Other Cities': 686,
         'Other Cities of Maharashtra': 457,
         'Tier II Cities': 74})
In [559]:
Counter(lead['Specialization'])
Out[559]:
Counter({'Select': 1942,
         'Business Administration': 403,
         'Media and Advertising': 203,
         nan: 1438,
         'Supply Chain Management': 349,
         'IT Projects Management': 366,
         'Finance Management': 976,
         'Travel and Tourism': 203,
         'Human Resource Management': 848,
         'Marketing Management': 838,
         'Banking, Investment And Insurance': 338,
         'International Business': 178,
         'E-COMMERCE': 112,
         'Operations Management': 503,
         'Retail Management': 100,
         'Services Excellence': 40,
         'Hospitality Management': 114,
         'Rural and Agribusiness': 73,
         'Healthcare Management': 159,
         'E-Business': 57})
In [560]:
lead['Specialization'].replace(np.nan,'Unknown_Specialization',inplace=True)
lead['Specialization'].replace('Select','Unknown_Specialization',inplace=True)
In [561]:
Counter(lead['Specialization'])
Out[561]:
Counter({'Unknown_Specialization': 3380,
         'Business Administration': 403,
         'Media and Advertising': 203,
         'Supply Chain Management': 349,
         'IT Projects Management': 366,
         'Finance Management': 976,
         'Travel and Tourism': 203,
         'Human Resource Management': 848,
         'Marketing Management': 838,
         'Banking, Investment And Insurance': 338,
         'International Business': 178,
         'E-COMMERCE': 112,
         'Operations Management': 503,
         'Retail Management': 100,
         'Services Excellence': 40,
         'Hospitality Management': 114,
         'Rural and Agribusiness': 73,
         'Healthcare Management': 159,
         'E-Business': 57})
In [562]:
Counter(lead['What is your current occupation'])
Out[562]:
Counter({'Unemployed': 5600,
         'Student': 210,
         'Other': 2706,
         'Working Professional': 706,
         'Businessman': 8,
         'Housewife': 10})

The null values of current occupation were already replaced with 'Other' above, so re-imputing with the mode here has no effect.

In [563]:
lead['What is your current occupation'].replace(np.nan,lead['What is your current occupation'].mode()[0],inplace=True)  #no-op: nulls were already replaced with 'Other' above
In [564]:
lead.isnull().sum()
Out[564]:
Prospect ID                                        0
Lead Number                                        0
Lead Origin                                        0
Lead Source                                       36
Do Not Email                                       0
Do Not Call                                        0
Converted                                          0
TotalVisits                                      137
Total Time Spent on Website                        0
Page Views Per Visit                             137
Last Activity                                    103
Country                                            0
Specialization                                     0
How did you hear about X Education                 0
What is your current occupation                    0
What matters most to you in choosing a course      0
Search                                             0
Magazine                                           0
Newspaper Article                                  0
X Education Forums                                 0
Newspaper                                          0
Digital Advertisement                              0
Through Recommendations                            0
Receive More Updates About Our Courses             0
Tags                                               0
Lead Quality                                       0
Update me on Supply Chain Content                  0
Get updates on DM Content                          0
Lead Profile                                       0
City                                               0
Asymmetrique Activity Index                        0
Asymmetrique Profile Index                         0
Asymmetrique Activity Score                        0
Asymmetrique Profile Score                         0
I agree to pay the amount through cheque           0
A free copy of Mastering The Interview             0
Last Notable Activity                              0
dtype: int64

We have now treated the major portion of the null values in the dataset. Only a small number of null values remain, and the corresponding records can be removed without any major loss of information.

In [565]:
#Drop the few remaining rows with nulls in these columns
lead = lead.dropna(subset=['TotalVisits','Page Views Per Visit','Last Activity','Lead Source'])
In [566]:
lead.isnull().sum()
Out[566]:
Prospect ID                                      0
Lead Number                                      0
Lead Origin                                      0
Lead Source                                      0
Do Not Email                                     0
Do Not Call                                      0
Converted                                        0
TotalVisits                                      0
Total Time Spent on Website                      0
Page Views Per Visit                             0
Last Activity                                    0
Country                                          0
Specialization                                   0
How did you hear about X Education               0
What is your current occupation                  0
What matters most to you in choosing a course    0
Search                                           0
Magazine                                         0
Newspaper Article                                0
X Education Forums                               0
Newspaper                                        0
Digital Advertisement                            0
Through Recommendations                          0
Receive More Updates About Our Courses           0
Tags                                             0
Lead Quality                                     0
Update me on Supply Chain Content                0
Get updates on DM Content                        0
Lead Profile                                     0
City                                             0
Asymmetrique Activity Index                      0
Asymmetrique Profile Index                       0
Asymmetrique Activity Score                      0
Asymmetrique Profile Score                       0
I agree to pay the amount through cheque         0
A free copy of Mastering The Interview           0
Last Notable Activity                            0
dtype: int64

Hence the data is now clean and contains no null values.

We can observe that the column Prospect ID, which helps identify the customer, can be dropped, since Lead Number is a similar column whose unique values identify each customer.
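Before dropping one identifier in favour of another, it is worth confirming that the surviving column really is unique per record; a quick sketch on hypothetical data (the IDs and lead numbers below are illustrative only):

```python
import pandas as pd

# Hypothetical identifiers mirroring the two ID columns in the dataset
df = pd.DataFrame({'Prospect ID': ['a1', 'b2', 'c3'],
                   'Lead Number': [660737, 660728, 660727]})

# If Lead Number is unique per row, it can serve as the key
# and the redundant Prospect ID column can be dropped safely
assert df['Lead Number'].is_unique
df = df.drop(['Prospect ID'], axis=1)
print(df.shape)
```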

In [567]:
lead.drop(['Prospect ID'],axis=1,inplace=True)
In [568]:
#Dimensions of final dataset after data cleaning
lead.shape
Out[568]:
(9074, 36)
In [569]:
lead.head()
Out[569]:
In [570]:
#Data imbalance check for the Converted column
yes=lead[lead['Converted']==1]['Converted'].value_counts()
no=lead[lead['Converted']==0]['Converted'].value_counts()

converted=np.array((yes/len(lead))*100) 
not_converted=np.array((no/len(lead))*100) 
stat_summ=pd.DataFrame({'Percentage':[converted[0],not_converted[0]]},index=['converted','not_converted'])
sns.barplot(x=stat_summ.index,y=stat_summ['Percentage'],palette='RdYlGn')
plt.title('Percent of Customers who gets converted ')
plt.show()
Notebook Image

We can see a slight data imbalance between the converted and not-converted categories: only around 38% of the leads get converted.
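The conversion split computed above can also be obtained more directly with `value_counts(normalize=True)`; a sketch on a toy target column (not the actual Converted data):

```python
import pandas as pd

# Toy binary target: 3 converted, 7 not converted
converted = pd.Series([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])

# normalize=True returns proportions instead of raw counts
pct = converted.value_counts(normalize=True) * 100
print(pct.to_dict())
```

This one-liner replaces the separate yes/no filtering and division by `len(lead)` used in the notebook cell.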

In [ ]:
jovian.commit()
[jovian] Saving notebook..
In [ ]: