import jovian
#Importing the libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#Suppress warnings
import warnings
warnings.filterwarnings('ignore')
#To view all columns of the dataframe
pd.set_option('display.max_columns',500)
#Importing the dataset
lead = pd.read_csv('Leads.csv')
#To view the head of the dataset
lead.head()
#Dimensions of the dataset
lead.shape
(9240, 37)
#Info of the dataframe, to spot missing values initially and identify each column's type
lead.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9240 entries, 0 to 9239
Data columns (total 37 columns):
Prospect ID 9240 non-null object
Lead Number 9240 non-null int64
Lead Origin 9240 non-null object
Lead Source 9204 non-null object
Do Not Email 9240 non-null object
Do Not Call 9240 non-null object
Converted 9240 non-null int64
TotalVisits 9103 non-null float64
Total Time Spent on Website 9240 non-null int64
Page Views Per Visit 9103 non-null float64
Last Activity 9137 non-null object
Country 6779 non-null object
Specialization 7802 non-null object
How did you hear about X Education 7033 non-null object
What is your current occupation 6550 non-null object
What matters most to you in choosing a course 6531 non-null object
Search 9240 non-null object
Magazine 9240 non-null object
Newspaper Article 9240 non-null object
X Education Forums 9240 non-null object
Newspaper 9240 non-null object
Digital Advertisement 9240 non-null object
Through Recommendations 9240 non-null object
Receive More Updates About Our Courses 9240 non-null object
Tags 5887 non-null object
Lead Quality 4473 non-null object
Update me on Supply Chain Content 9240 non-null object
Get updates on DM Content 9240 non-null object
Lead Profile 6531 non-null object
City 7820 non-null object
Asymmetrique Activity Index 5022 non-null object
Asymmetrique Profile Index 5022 non-null object
Asymmetrique Activity Score 5022 non-null float64
Asymmetrique Profile Score 5022 non-null float64
I agree to pay the amount through cheque 9240 non-null object
A free copy of Mastering The Interview 9240 non-null object
Last Notable Activity 9240 non-null object
dtypes: float64(4), int64(3), object(30)
memory usage: 2.6+ MB
We can see that many columns in the dataset have missing values. We will observe and treat them accordingly later. We can also see that most columns are object/categorical, and these may need to be converted by suitable techniques such as one-hot encoding or a simple replace, which will also be taken care of.
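As a quick illustration of the encoding step mentioned above (a minimal sketch on toy data, not the actual Leads columns): binary Yes/No columns can be mapped with a simple replace, while multi-level categoricals can be one-hot encoded with `pd.get_dummies`.

```python
import pandas as pd

# Toy frame mimicking the kind of object columns in the lead data
toy = pd.DataFrame({
    'Do Not Email': ['Yes', 'No', 'No'],
    'Lead Origin': ['API', 'Landing Page Submission', 'API'],
})

# Binary Yes/No column -> 1/0 via a simple map
toy['Do Not Email'] = toy['Do Not Email'].map({'Yes': 1, 'No': 0})

# Multi-level categorical -> one-hot dummies, dropping the first level
dummies = pd.get_dummies(toy['Lead Origin'], prefix='Lead Origin', drop_first=True)
toy = pd.concat([toy.drop(columns='Lead Origin'), dummies], axis=1)
print(toy)
```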
#Initial Stats of the dataframe
lead.describe()
Among the numerical columns, the initial statistics indicate the presence of outliers. As we observe above, for features such as TotalVisits, Total Time Spent on Website and Page Views Per Visit the values do not increase gradually across the upper percentiles, which points to outliers in the dataset.
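One way to make the "not gradually increasing" observation concrete (a sketch on synthetic data, since the real columns come from Leads.csv) is to ask `describe()` for extra upper percentiles and compare them with the maximum:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic visit counts with a few extreme values appended
visits = pd.Series(np.concatenate([rng.integers(0, 10, 995),
                                   [120, 150, 200, 250, 300]]))

stats = visits.describe(percentiles=[.25, .5, .75, .9, .99])
print(stats)
# A 99th percentile far below the max signals outliers in the tail
print('outlier gap:', stats['max'] - stats['99%'])
```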
#Checking for the percentage of null values for each column.
round(100*(lead.isnull().sum()/len(lead.index)),2)
Prospect ID 0.00
Lead Number 0.00
Lead Origin 0.00
Lead Source 0.39
Do Not Email 0.00
Do Not Call 0.00
Converted 0.00
TotalVisits 1.48
Total Time Spent on Website 0.00
Page Views Per Visit 1.48
Last Activity 1.11
Country 26.63
Specialization 15.56
How did you hear about X Education 23.89
What is your current occupation 29.11
What matters most to you in choosing a course 29.32
Search 0.00
Magazine 0.00
Newspaper Article 0.00
X Education Forums 0.00
Newspaper 0.00
Digital Advertisement 0.00
Through Recommendations 0.00
Receive More Updates About Our Courses 0.00
Tags 36.29
Lead Quality 51.59
Update me on Supply Chain Content 0.00
Get updates on DM Content 0.00
Lead Profile 29.32
City 15.37
Asymmetrique Activity Index 45.65
Asymmetrique Profile Index 45.65
Asymmetrique Activity Score 45.65
Asymmetrique Profile Score 45.65
I agree to pay the amount through cheque 0.00
A free copy of Mastering The Interview 0.00
Last Notable Activity 0.00
dtype: float64
As seen above, many columns have a large number of missing values. Columns such as Lead Quality, Asymmetrique Activity Index, Asymmetrique Profile Index, Asymmetrique Activity Score, Asymmetrique Profile Score and Tags have a high percentage of missing values. We need to choose an approach: either drop these columns or impute the missing values with appropriate values. For this, we can understand each feature and its importance by diving deeper.
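A common rule of thumb for the drop-vs-impute decision (shown here on a toy frame; the 40% cutoff is an assumption, not something fixed by this dataset) is to flag columns whose null percentage exceeds a threshold and then decide per column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'mostly_full': [1, 2, 3, 4, np.nan],
    'half_empty': [1, np.nan, np.nan, np.nan, 5],
})

null_pct = round(100 * toy.isnull().sum() / len(toy), 2)
THRESHOLD = 40  # assumed cutoff; tune per problem
high_null_cols = null_pct[null_pct > THRESHOLD].index.tolist()
print(null_pct)
print('candidates for drop/special treatment:', high_null_cols)
```

Here we keep the high-null columns and impute them with meaningful levels instead of dropping them, since (as discussed below) they still carry signal.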
#Observing the existing different levels for Lead Quality
lead['Lead Quality'].unique()
array(['Low in Relevance', nan, 'Might be', 'Not Sure', 'Worst',
'High in Relevance'], dtype=object)
#Assessing the number of entries per level of 'Lead Quality'
from collections import Counter
Counter(lead['Lead Quality'])
Counter({'Low in Relevance': 583,
nan: 4767,
'Might be': 1560,
'Not Sure': 1092,
'Worst': 601,
'High in Relevance': 637})
51.6% of the values for Lead Quality are missing. We could replace the null values with a new level, say 'Unknown', but among the existing levels for Lead Quality, 'Might be' and 'Not Sure' already convey almost the same meaning as 'Unknown'. Hence we impute the missing values with 'Not Sure'.
#Replacing the missing values with 'Not Sure' category for 'Lead Quality'
lead['Lead Quality'].replace(np.nan,'Not Sure',inplace=True)
round(100*(lead.isnull().sum()/len(lead.index)),2)
Prospect ID 0.00
Lead Number 0.00
Lead Origin 0.00
Lead Source 0.39
Do Not Email 0.00
Do Not Call 0.00
Converted 0.00
TotalVisits 1.48
Total Time Spent on Website 0.00
Page Views Per Visit 1.48
Last Activity 1.11
Country 26.63
Specialization 15.56
How did you hear about X Education 23.89
What is your current occupation 29.11
What matters most to you in choosing a course 29.32
Search 0.00
Magazine 0.00
Newspaper Article 0.00
X Education Forums 0.00
Newspaper 0.00
Digital Advertisement 0.00
Through Recommendations 0.00
Receive More Updates About Our Courses 0.00
Tags 36.29
Lead Quality 0.00
Update me on Supply Chain Content 0.00
Get updates on DM Content 0.00
Lead Profile 29.32
City 15.37
Asymmetrique Activity Index 45.65
Asymmetrique Profile Index 45.65
Asymmetrique Activity Score 45.65
Asymmetrique Profile Score 45.65
I agree to pay the amount through cheque 0.00
A free copy of Mastering The Interview 0.00
Last Notable Activity 0.00
dtype: float64
Counter(lead['Asymmetrique Activity Index'])
Counter({'02.Medium': 3839, '01.High': 821, '03.Low': 362, nan: 4218})
Counter(lead['Asymmetrique Profile Index'])
Counter({'02.Medium': 2788, '01.High': 2203, '03.Low': 31, nan: 4218})
We can see the same number of null values for the columns Asymmetrique Activity Index and Asymmetrique Profile Index, which are closely related; both columns are mostly empty for the same records. Around 45.7% of their values are missing, so imputing the mode could introduce bias. Instead, we replace the null values with a new level, 'Unknown'.
#Replacing the missing values with 'Unknown' category for 'Asymmetrique Activity Index'
lead['Asymmetrique Activity Index'].replace(np.nan,'Unknown',inplace=True)
#Replacing the missing values with 'Unknown' category for 'Asymmetrique Profile Index'
lead['Asymmetrique Profile Index'].replace(np.nan,'Unknown',inplace=True)
A similar approach can be taken for the Asymmetrique Activity Score and Asymmetrique Profile Score columns, which have the same number of null values. The interesting point here is that these columns are continuous features.
lead['Asymmetrique Activity Score'].value_counts()
14.0 1771
15.0 1293
13.0 775
16.0 467
17.0 349
12.0 196
11.0 95
10.0 57
9.0 9
18.0 5
8.0 4
7.0 1
Name: Asymmetrique Activity Score, dtype: int64
sns.boxplot(lead['Asymmetrique Activity Score'])
<matplotlib.axes._subplots.AxesSubplot at 0x7faf01675748>
lead['Asymmetrique Profile Score'].value_counts()
15.0 1759
18.0 1071
16.0 599
17.0 579
20.0 308
19.0 245
14.0 226
13.0 204
12.0 22
11.0 9
Name: Asymmetrique Profile Score, dtype: int64
sns.boxplot(lead['Asymmetrique Profile Score'])
<matplotlib.axes._subplots.AxesSubplot at 0x7faf013a3438>
We can see the different unique values and their counts in the dataset. Again, Asymmetrique Activity Score and Asymmetrique Profile Score have a high number of missing values. We impute these with the column mean; imputing 0 instead would heavily shift the distribution, although the mean imputation itself still introduces some bias.
sns.distplot(lead['Asymmetrique Activity Score'].dropna())
<matplotlib.axes._subplots.AxesSubplot at 0x7faf0137e2b0>
#Replacing the missing values with the mean value for 'Asymmetrique Activity Score'
lead['Asymmetrique Activity Score'].replace(np.nan,lead['Asymmetrique Activity Score'].mean(),inplace=True)
sns.distplot(lead['Asymmetrique Activity Score'])
<matplotlib.axes._subplots.AxesSubplot at 0x7faf012577f0>
sns.boxplot(lead['Asymmetrique Activity Score'])
<matplotlib.axes._subplots.AxesSubplot at 0x7faf011a05f8>
sns.distplot(lead['Asymmetrique Profile Score'].dropna())
<matplotlib.axes._subplots.AxesSubplot at 0x7faf0116fc88>
#Replacing the missing values with the mean value for 'Asymmetrique Profile Score'
lead['Asymmetrique Profile Score'].replace(np.nan,lead['Asymmetrique Profile Score'].mean(),inplace=True)
sns.distplot(lead['Asymmetrique Profile Score'])
<matplotlib.axes._subplots.AxesSubplot at 0x7faf011097f0>
sns.boxplot(lead['Asymmetrique Profile Score'])
<matplotlib.axes._subplots.AxesSubplot at 0x7faf01179cc0>
We can see that, for the Profile Score, imputing with the mean does introduce some outliers (because a large number of nulls were all replaced with the same mean value), but this is better than imputing 0, which would change the distribution of the profile scores.
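The trade-off described above can be seen on synthetic data (a sketch, not the actual score columns): filling nulls with the mean preserves the mean but shrinks the spread, which is what pushes previously ordinary values out past the boxplot whiskers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
scores = pd.Series(rng.normal(15, 2, 1000))
# Blank out ~45% of the values, similar to the score columns
scores[rng.choice(1000, 450, replace=False)] = np.nan

filled = scores.fillna(scores.mean())
print('mean before/after:', round(scores.mean(), 2), round(filled.mean(), 2))
print('std  before/after:', round(scores.std(), 2), round(filled.std(), 2))
```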
Counter(lead['Tags'])
Counter({'Interested in other courses': 513,
'Ringing': 1203,
'Will revert after reading the email': 2072,
nan: 3353,
'Lost to EINS': 175,
'In confusion whether part time or DLP': 5,
'Busy': 186,
'switched off': 240,
'in touch with EINS': 12,
'Already a student': 465,
'Diploma holder (Not Eligible)': 63,
'Graduation in progress': 111,
'Closed by Horizzon': 358,
'number not provided': 27,
'opp hangup': 33,
'Not doing further education': 145,
'invalid number': 83,
'wrong number given': 47,
'Interested in full time MBA': 117,
'Still Thinking': 6,
'Lost to Others': 7,
'Shall take in the next coming month': 2,
'Lateral student': 3,
'Interested in Next batch': 5,
'Recognition issue (DEC approval)': 1,
'Want to take admission but has financial problems': 6,
'University not recognized': 2})
#Replacing the missing values with 'Unknown' category for 'Tags'
lead['Tags'].replace(np.nan,'Unknown_tags',inplace=True)
round(100*(lead.isnull().sum()/len(lead.index)),2)
Prospect ID 0.00
Lead Number 0.00
Lead Origin 0.00
Lead Source 0.39
Do Not Email 0.00
Do Not Call 0.00
Converted 0.00
TotalVisits 1.48
Total Time Spent on Website 0.00
Page Views Per Visit 1.48
Last Activity 1.11
Country 26.63
Specialization 15.56
How did you hear about X Education 23.89
What is your current occupation 29.11
What matters most to you in choosing a course 29.32
Search 0.00
Magazine 0.00
Newspaper Article 0.00
X Education Forums 0.00
Newspaper 0.00
Digital Advertisement 0.00
Through Recommendations 0.00
Receive More Updates About Our Courses 0.00
Tags 0.00
Lead Quality 0.00
Update me on Supply Chain Content 0.00
Get updates on DM Content 0.00
Lead Profile 29.32
City 15.37
Asymmetrique Activity Index 0.00
Asymmetrique Profile Index 0.00
Asymmetrique Activity Score 0.00
Asymmetrique Profile Score 0.00
I agree to pay the amount through cheque 0.00
A free copy of Mastering The Interview 0.00
Last Notable Activity 0.00
dtype: float64
Counter(lead['What matters most to you in choosing a course'])
Counter({'Better Career Prospects': 6528,
nan: 2709,
'Flexibility & Convenience': 2,
'Other': 1})
We can see that the motive of most customers choosing this course is Better Career Prospects. Hence we impute the missing values with the mode.
lead['What matters most to you in choosing a course'].fillna(lead['What matters most to you in choosing a course'].mode()[0],inplace=True) #imputing with mode
Counter(lead['Lead Profile'])
Counter({'Select': 4146,
'Potential Lead': 1613,
nan: 2709,
'Other Leads': 487,
'Lateral Student': 24,
'Dual Specialization Student': 20,
'Student of SomeSchool': 241})
Here a customer could be either a potential lead or a lateral student, which we cannot know. Also, the level 'Select' carries the same information as a null. Hence we replace both with a new level 'Unknown_lead_profile' instead of imputing real values, and then merge that level into the existing 'Other Leads' category.
#Replacing the missing values as well for 'Select' with 'Unknown' category for 'Lead Profile'
lead['Lead Profile'].replace(np.nan,'Unknown_lead_profile',inplace=True)
lead['Lead Profile'].replace('Select','Unknown_lead_profile',inplace=True)
lead['Lead Profile'].replace('Unknown_lead_profile','Other Leads',inplace=True) #merging 'Unknown_lead_profile' into 'Other Leads'
Counter(lead['Lead Profile'])
Counter({'Other Leads': 7342,
'Potential Lead': 1613,
'Lateral Student': 24,
'Dual Specialization Student': 20,
'Student of SomeSchool': 241})
Counter(lead['What is your current occupation'])
Counter({'Unemployed': 5600,
'Student': 210,
nan: 2690,
'Working Professional': 706,
'Businessman': 8,
'Other': 16,
'Housewife': 10})
Imputing the mode ('Unemployed') is not a good approach here either, since a customer may be rated based on whether they are a student, a working professional or unemployed. There is already an 'Other' category, into which all the missing values can be added.
lead['What is your current occupation'].replace(np.nan,'Other',inplace=True)
Counter(lead['What is your current occupation'])
Counter({'Unemployed': 5600,
'Student': 210,
'Other': 2706,
'Working Professional': 706,
'Businessman': 8,
'Housewife': 10})
Counter(lead['Country'])
Counter({nan: 2461,
'India': 6492,
'Russia': 1,
'Kuwait': 4,
'Oman': 6,
'United Arab Emirates': 53,
'United States': 69,
'Australia': 13,
'United Kingdom': 15,
'Bahrain': 7,
'Ghana': 2,
'Singapore': 24,
'Qatar': 10,
'Saudi Arabia': 21,
'Belgium': 2,
'France': 6,
'Sri Lanka': 1,
'China': 2,
'Canada': 4,
'Netherlands': 2,
'Sweden': 3,
'Nigeria': 4,
'Hong Kong': 7,
'Germany': 4,
'Asia/Pacific Region': 2,
'Uganda': 2,
'Kenya': 1,
'Italy': 2,
'South Africa': 4,
'Tanzania': 1,
'unknown': 5,
'Malaysia': 1,
'Liberia': 1,
'Switzerland': 1,
'Denmark': 1,
'Philippines': 2,
'Bangladesh': 2,
'Vietnam': 1,
'Indonesia': 1})
Although the highest number of customers are from India, there are as many as 2461 missing values, and targeted branding and promotional activities will be weighted by the number of customers per country; hence imputing the mode is not a good option. Also, a level 'unknown' already appears in the list of countries. We replace the null values with a new category, 'unknown_country'.
lead['Country'].replace(np.nan,'unknown_country',inplace=True)
Counter(lead['Country'])
Counter({'unknown_country': 2461,
'India': 6492,
'Russia': 1,
'Kuwait': 4,
'Oman': 6,
'United Arab Emirates': 53,
'United States': 69,
'Australia': 13,
'United Kingdom': 15,
'Bahrain': 7,
'Ghana': 2,
'Singapore': 24,
'Qatar': 10,
'Saudi Arabia': 21,
'Belgium': 2,
'France': 6,
'Sri Lanka': 1,
'China': 2,
'Canada': 4,
'Netherlands': 2,
'Sweden': 3,
'Nigeria': 4,
'Hong Kong': 7,
'Germany': 4,
'Asia/Pacific Region': 2,
'Uganda': 2,
'Kenya': 1,
'Italy': 2,
'South Africa': 4,
'Tanzania': 1,
'unknown': 5,
'Malaysia': 1,
'Liberia': 1,
'Switzerland': 1,
'Denmark': 1,
'Philippines': 2,
'Bangladesh': 2,
'Vietnam': 1,
'Indonesia': 1})
Counter(lead['How did you hear about X Education'])
Counter({'Select': 5043,
'Word Of Mouth': 348,
'Other': 186,
nan: 2207,
'Online Search': 808,
'Multiple Sources': 152,
'Advertisements': 70,
'Student of SomeSchool': 310,
'Email': 26,
'Social Media': 67,
'SMS': 23})
Here, along with missing values, we can see the categories 'Select' and 'Other'. We can merge all of these values into the single category 'Other'.
lead['How did you hear about X Education'].replace(np.nan,'Other',inplace=True)
lead['How did you hear about X Education'].replace('Select','Other',inplace=True)
Counter(lead['How did you hear about X Education'])
Counter({'Other': 7436,
'Word Of Mouth': 348,
'Online Search': 808,
'Multiple Sources': 152,
'Advertisements': 70,
'Student of SomeSchool': 310,
'Email': 26,
'Social Media': 67,
'SMS': 23})
Counter(lead['City'])
Counter({'Select': 2249,
'Mumbai': 3222,
nan: 1420,
'Thane & Outskirts': 752,
'Other Metro Cities': 380,
'Other Cities': 686,
'Other Cities of Maharashtra': 457,
'Tier II Cities': 74})
We replace the null and 'Select' values of City with the mode.
lead['City'].replace(np.nan,lead['City'].mode()[0],inplace=True)
lead['City'].replace('Select',lead['City'].mode()[0],inplace=True)
Counter(lead['City'])
Counter({'Mumbai': 6891,
'Thane & Outskirts': 752,
'Other Metro Cities': 380,
'Other Cities': 686,
'Other Cities of Maharashtra': 457,
'Tier II Cities': 74})
Counter(lead['Specialization'])
Counter({'Select': 1942,
'Business Administration': 403,
'Media and Advertising': 203,
nan: 1438,
'Supply Chain Management': 349,
'IT Projects Management': 366,
'Finance Management': 976,
'Travel and Tourism': 203,
'Human Resource Management': 848,
'Marketing Management': 838,
'Banking, Investment And Insurance': 338,
'International Business': 178,
'E-COMMERCE': 112,
'Operations Management': 503,
'Retail Management': 100,
'Services Excellence': 40,
'Hospitality Management': 114,
'Rural and Agribusiness': 73,
'Healthcare Management': 159,
'E-Business': 57})
lead['Specialization'].replace(np.nan,'Unknown_Specialization',inplace=True)
lead['Specialization'].replace('Select','Unknown_Specialization',inplace=True)
Counter(lead['Specialization'])
Counter({'Unknown_Specialization': 3380,
'Business Administration': 403,
'Media and Advertising': 203,
'Supply Chain Management': 349,
'IT Projects Management': 366,
'Finance Management': 976,
'Travel and Tourism': 203,
'Human Resource Management': 848,
'Marketing Management': 838,
'Banking, Investment And Insurance': 338,
'International Business': 178,
'E-COMMERCE': 112,
'Operations Management': 503,
'Retail Management': 100,
'Services Excellence': 40,
'Hospitality Management': 114,
'Rural and Agribusiness': 73,
'Healthcare Management': 159,
'E-Business': 57})
Counter(lead['What is your current occupation'])
Counter({'Unemployed': 5600,
'Student': 210,
'Other': 2706,
'Working Professional': 706,
'Businessman': 8,
'Housewife': 10})
Replacing any remaining null values of current occupation with the mode.
lead['What is your current occupation'].replace(np.nan,lead['What is your current occupation'].mode()[0],inplace=True)
lead.isnull().sum()
Prospect ID 0
Lead Number 0
Lead Origin 0
Lead Source 36
Do Not Email 0
Do Not Call 0
Converted 0
TotalVisits 137
Total Time Spent on Website 0
Page Views Per Visit 137
Last Activity 103
Country 0
Specialization 0
How did you hear about X Education 0
What is your current occupation 0
What matters most to you in choosing a course 0
Search 0
Magazine 0
Newspaper Article 0
X Education Forums 0
Newspaper 0
Digital Advertisement 0
Through Recommendations 0
Receive More Updates About Our Courses 0
Tags 0
Lead Quality 0
Update me on Supply Chain Content 0
Get updates on DM Content 0
Lead Profile 0
City 0
Asymmetrique Activity Index 0
Asymmetrique Profile Index 0
Asymmetrique Activity Score 0
Asymmetrique Profile Score 0
I agree to pay the amount through cheque 0
A free copy of Mastering The Interview 0
Last Notable Activity 0
dtype: int64
We have now treated the major portion of the null values in the dataset. Only a small number of nulls remain, and the corresponding records can be removed without a major loss of information.
#Dropping the few remaining records with null values
lead = lead.dropna(subset=['TotalVisits','Page Views Per Visit','Last Activity','Lead Source'])
lead.isnull().sum()
Prospect ID 0
Lead Number 0
Lead Origin 0
Lead Source 0
Do Not Email 0
Do Not Call 0
Converted 0
TotalVisits 0
Total Time Spent on Website 0
Page Views Per Visit 0
Last Activity 0
Country 0
Specialization 0
How did you hear about X Education 0
What is your current occupation 0
What matters most to you in choosing a course 0
Search 0
Magazine 0
Newspaper Article 0
X Education Forums 0
Newspaper 0
Digital Advertisement 0
Through Recommendations 0
Receive More Updates About Our Courses 0
Tags 0
Lead Quality 0
Update me on Supply Chain Content 0
Get updates on DM Content 0
Lead Profile 0
City 0
Asymmetrique Activity Index 0
Asymmetrique Profile Index 0
Asymmetrique Activity Score 0
Asymmetrique Profile Score 0
I agree to pay the amount through cheque 0
A free copy of Mastering The Interview 0
Last Notable Activity 0
dtype: int64
Hence the data is now clean and doesn't contain any NULL values.
We can observe that the column Prospect ID, which identifies the customer, can be dropped, since Lead Number is a similar column with unique values that identifies each customer.
lead.drop(['Prospect ID'],axis=1,inplace=True)
#Dimensions of final dataset after data cleaning
lead.shape
(9074, 36)
lead.head()
#Data imbalance check for the Converted column.
yes=lead[lead['Converted']==1]['Converted'].value_counts()
no=lead[lead['Converted']==0]['Converted'].value_counts()
converted=np.array((yes/len(lead))*100)
not_converted=np.array((no/len(lead))*100)
stat_summ=pd.DataFrame({'Percentage':[converted[0],not_converted[0]]},index=['converted','not_converted'])
sns.barplot(x=stat_summ.index,y=stat_summ['Percentage'],palette='RdYlGn')
plt.title('Percentage of customers who get converted')
plt.show()
We can see some data imbalance between the two categories, converted and not converted: out of the total, only around 38-40% of the leads get converted.
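The same imbalance check can be done more directly (a sketch on a toy Converted column; the real split comes from the full dataset) with `value_counts(normalize=True)`:

```python
import pandas as pd

# Toy Converted column with a 40/60 split
converted = pd.Series([1, 0, 0, 1, 0, 0, 1, 0, 1, 0])

rate = converted.value_counts(normalize=True) * 100
print(rate)  # percentage of converted (1) vs not converted (0)
```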
jovian.commit()
[jovian] Saving notebook..