This kernel covers some of the commonly used categorical encoding techniques:
1. One-Hot Encoding
2. Label Encoding
3. Ordinal Encoding
4. Binary Encoding
5. Frequency Encoding
6. Mean Encoding
# importing libraries
import numpy as np
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
import category_encoders as ce
%matplotlib inline
# loading dataset
df = pd.read_csv("../input/widsdatathon2020/training_v2.csv")
# printing the categorical (non-numeric) variables that have more than one unique value
print([c for c in df.columns if df[c].nunique() > 1 and not pd.api.types.is_numeric_dtype(df[c])])
['ethnicity', 'gender', 'hospital_admit_source', 'icu_admit_source', 'icu_stay_type', 'icu_type', 'apache_3j_bodysystem', 'apache_2_bodysystem']
categorical_cols = ['hospital_id','ethnicity', 'gender', 'hospital_admit_source', 'icu_admit_source', 'icu_stay_type', 'icu_type', 'apache_3j_bodysystem', 'apache_2_bodysystem',"hospital_death",'age']
Categorical_df= df[categorical_cols]
Categorical_df.head(5)
One-hot encoding is one of the most widely used encoding schemes. It creates one column for each category present in the feature and assigns 1 or 0 to indicate whether a row belongs to that category. The pandas get_dummies function can be applied to a whole data frame; by default it converts only the string (object or category dtype) columns and leaves numeric columns unchanged.
X = pd.DataFrame(Categorical_df['ethnicity'])
Categorical_df['ethnicity'].value_counts()
Caucasian 70684
African American 9547
Other/Unknown 4374
Hispanic 3796
Asian 1129
Native American 788
Name: ethnicity, dtype: int64
one_hot_encoded_pandas = pd.get_dummies(X)
one_hot_encoded_pandas.head()
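To make the behaviour of get_dummies easier to see, here is a minimal sketch on a tiny made-up frame. The toy values and the name `toy` are assumptions for illustration only and are not taken from the dataset; `dummy_na=True` is an optional setting that adds an indicator column for missing values.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    "ethnicity": ["Asian", "Caucasian", None, "Asian"],
    "age": [70, 55, 61, 48],
})

# Only the string column is expanded; the numeric 'age' column passes through.
# dummy_na=True adds one extra indicator column for missing values.
encoded = pd.get_dummies(toy, columns=["ethnicity"], dummy_na=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one 1 among the indicator columns, which is what makes the encoding "one-hot".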
Label encoders transform non-numerical labels into numerical labels. Each category is assigned a unique integer, starting from 0 and going up to n_categories - 1 per feature. scikit-learn's LabelEncoder assigns these integers in sorted order of the labels, so label encoding is suitable for variables where the alphabetical or numerical order of the labels is meaningful.
# label encoding the data
# astype('str') turns missing values into the string 'nan', which then gets its own label
le = LabelEncoder()
gender = pd.DataFrame(le.fit_transform(Categorical_df['gender'].astype('str')))
gender.columns = ['Gender']
gender.head()
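The mapping that LabelEncoder learned can be inspected through its `classes_` attribute. The following is a small sketch on made-up values (the list `values` and the name `le_demo` are hypothetical), showing that codes follow the sorted order of the labels and that `inverse_transform` recovers the originals.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy values, for illustration only.
values = ["M", "F", "F", "M", "nan"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(values)

# classes_ holds the sorted unique labels; a label's position is its code.
print(list(le_demo.classes_))
print(codes.tolist())

# inverse_transform maps the integer codes back to the original labels.
restored = le_demo.inverse_transform(codes).tolist()
print(restored)
```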
Ordinal encoding uses a single column of integers to represent the classes. An optional mapping dict can be passed in; in this case, we supply one ourselves, using the knowledge that there is some true order to the classes.
icu_st = pd.DataFrame(Categorical_df['icu_stay_type'])
icu_st.columns = ['icu_stay_type']
icu_st_dict={'admit':0,'readmit':1,'transfer':2}
icu_st['icu_st_ordinal'] = icu_st.icu_stay_type.map(icu_st_dict)
icu_st.head(5)
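An alternative to a hand-written dict is an ordered pandas categorical, which stores the order with the column itself. This is a sketch under assumptions: the toy series and the extra value "unknown" are made up, and the order used is simply the one from the mapping above.

```python
import pandas as pd

# Hypothetical toy column; "unknown" is included to show how unseen values behave.
stay = pd.Series(["admit", "transfer", "readmit", "admit", "unknown"])
order = ["admit", "readmit", "transfer"]

# An ordered categorical dtype records the order explicitly.
# .cat.codes gives each value's position in `order`; -1 marks values not listed.
stay_cat = stay.astype(pd.CategoricalDtype(categories=order, ordered=True))
codes = stay_cat.cat.codes

# With a plain dict and .map, values missing from the dict become NaN instead.
mapped = stay.map({name: i for i, name in enumerate(order)})

print(codes.tolist())
print(mapped.tolist())
```

Either way, values outside the mapping need an explicit decision (a sentinel code, NaN, or a separate category) before modelling.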
Binary encoding converts each category into binary digits, and each binary digit becomes one feature column. If there are n unique categories, binary encoding needs only about log2(n) columns (rounded up), compared with n columns for one-hot encoding. In this example the column has 11 distinct non-null categories, so about 4 binary columns are expected, since 2^4 = 16 is the smallest power of two above 11.
Categorical_df['apache_3j_bodysystem'].unique()
array(['Sepsis', 'Respiratory', 'Metabolic', 'Cardiovascular', 'Trauma',
'Neurological', 'Gastrointestinal', 'Genitourinary', nan,
'Hematological', 'Musculoskeletal/Skin', 'Gynecological'],
dtype=object)
apache_3 = pd.DataFrame(Categorical_df['apache_3j_bodysystem'])
apache_3 = apache_3.dropna()
apache_3.columns =['apache_3j_bodysystem']
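# binary-encode the column; a one-column data frame is passed so that the name in `cols` can be matched
encoder = ce.BinaryEncoder(cols = ['apache_3j_bodysystem'])
apache_bin = encoder.fit_transform(apache_3[['apache_3j_bodysystem']])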
apache_3 = pd.concat([apache_3,apache_bin],axis = 1)
apache_3.head(5)
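To show what happens conceptually, the idea can be sketched by hand with pandas and NumPy. This is not how category_encoders is implemented, only an illustration of the technique; the toy series, the `body_` column prefix, and the choice to start integer codes at 1 are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, for illustration only.
body = pd.Series(["Sepsis", "Trauma", "Metabolic", "Sepsis", "Respiratory", "Trauma"])

# Step 1: give each category an integer code in order of first appearance, starting at 1.
codes = pd.factorize(body)[0] + 1

# Step 2: count how many bits are needed to write the largest code in binary.
n_bits = int(codes.max()).bit_length()

# Step 3: write each code in binary, one column per bit (most significant bit first).
shifts = np.arange(n_bits - 1, -1, -1)
bits = (codes[:, None] >> shifts) & 1
binary_df = pd.DataFrame(
    bits, columns=[f"body_{i}" for i in range(n_bits)], index=body.index
)

print(pd.concat([body.rename("body"), binary_df], axis=1))
```

Rows with the same category always get the same bit pattern, and the number of columns grows with the logarithm of the number of categories rather than linearly.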
Frequency encoding uses the frequency of each category as its label. In cases where the frequency is somewhat related to the target variable, it helps the model to understand and assign weight in direct or inverse proportion, depending on the nature of the data.
Categorical_df['hospital_admit_source'].unique()
array(['Floor', 'Emergency Department', 'Operating Room', nan,
'Direct Admit', 'Other Hospital', 'Other ICU', 'ICU to SDU',
'Recovery Room', 'Chest Pain Center', 'Step-Down Unit (SDU)',
'Acute Care/Floor', 'PACU', 'Observation', 'ICU', 'Other'],
dtype=object)
hosp_asource = pd.DataFrame(Categorical_df['hospital_admit_source'])
hosp_asource = hosp_asource.dropna()
hosp_asource.columns =['hospital_admit_source']
fe = hosp_asource.groupby('hospital_admit_source').size()/len(hosp_asource)
hosp_asource.loc[:,'hospital_admit_source_fe'] = hosp_asource['hospital_admit_source'].map(fe)
hosp_asource.head()
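The same relative frequencies can also be obtained with `value_counts(normalize=True)`, which skips missing values by default. A minimal sketch on a made-up frame (the name `toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame(
    {"hospital_admit_source": ["Floor", "Floor", "Operating Room", "Floor", None]}
)

# Relative frequency of each category among the non-missing rows.
freq = toy["hospital_admit_source"].value_counts(normalize=True)

# Replace each category with its frequency; missing values stay NaN.
toy["hospital_admit_source_fe"] = toy["hospital_admit_source"].map(freq)

print(toy)
```

One thing to keep in mind is that two categories with the same frequency receive the same encoded value, so the model can no longer tell them apart.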
Mean encoding is similar to label encoding, except that here the labels are tied directly to the target. In mean target encoding, each category of the feature is replaced with the mean value of the target variable for that category, computed on the training data.
Categorical_df['icu_type'].unique()
array(['CTICU', 'Med-Surg ICU', 'CCU-CTICU', 'Neuro ICU', 'MICU', 'SICU',
'Cardiac ICU', 'CSICU'], dtype=object)
icu_type = pd.DataFrame(Categorical_df[['icu_type','hospital_death']])
icu_type = icu_type.dropna()
icu_type.columns =['icu_type','hospital_death']
mean_encode = icu_type.groupby('icu_type')['hospital_death'].mean()
icu_type.loc[:,'icu_type_mean_en'] = icu_type['icu_type'].map(mean_encode)
icu_type.head()
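Because the encoding is derived from the target, it is usually computed on the training rows only and then applied to other rows. The sketch below illustrates that, with an optional smoothing step that pulls rare categories toward the global mean. The toy frames, the smoothing weight `m`, and the fallback to the global mean for unseen categories are all assumptions for illustration, not part of the original kernel.

```python
import pandas as pd

# Hypothetical toy train/test split, for illustration only.
train = pd.DataFrame({
    "icu_type": ["MICU", "MICU", "SICU", "SICU", "SICU", "CCU"],
    "hospital_death": [1, 0, 0, 0, 1, 1],
})
test = pd.DataFrame({"icu_type": ["MICU", "CCU", "Neuro ICU"]})

# Per-category mean and count of the target, from the training rows only.
global_mean = train["hospital_death"].mean()
stats = train.groupby("icu_type")["hospital_death"].agg(["mean", "count"])

# Smoothed mean: a weighted blend of the category mean and the global mean.
m = 2  # hypothetical smoothing weight; larger values shrink harder toward the global mean
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

# Apply to the test rows; categories not seen in training fall back to the global mean.
test["icu_type_mean_en"] = test["icu_type"].map(smoothed).fillna(global_mean)

print(smoothed)
print(test)
```

Setting `m = 0` reduces this to the plain per-category mean used above.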