
Categorical Encoding

This kernel covers some of the commonly used categorical encoding techniques:

1. OneHot Encoding
2. Label Encoding
3. Ordinal Encoding
4. Binary Encoding
5. Frequency Encoding
6. Mean Encoding

In [1]:
# importing libraries
import numpy as np 
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
import category_encoders as ce
%matplotlib inline
In [2]:
# loading dataset 
df = pd.read_csv("../input/widsdatathon2020/training_v2.csv")

In [3]:
# printing the categorical (non-numeric) variables that take more than one value
print([c for c in df.columns if df[c].nunique() > 1 and not pd.api.types.is_numeric_dtype(df[c])])

['ethnicity', 'gender', 'hospital_admit_source', 'icu_admit_source', 'icu_stay_type', 'icu_type', 'apache_3j_bodysystem', 'apache_2_bodysystem']
In [4]:
categorical_cols =  ['hospital_id','ethnicity', 'gender', 'hospital_admit_source', 'icu_admit_source', 'icu_stay_type', 'icu_type', 'apache_3j_bodysystem', 'apache_2_bodysystem',"hospital_death",'age']
In [5]:
Categorical_df= df[categorical_cols]
Categorical_df.head(5)
Out[5]:

OneHot Encoding :

One-hot encoding is the most widely used encoding scheme. It creates one column per category present in the feature and assigns a 1 or 0 to indicate whether a row belongs to that category. The pandas get_dummies function can be applied to a whole data frame; by default it converts only string (object) and categorical columns and leaves numeric columns unchanged.
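
As a small self-contained illustration, the sketch below applies get_dummies to a made-up frame (the values are invented for the example and are not taken from the WiDS data). It shows the string column being expanded into one indicator column per category while the numeric column passes through untouched.

```python
import pandas as pd

# Toy frame with invented values: one string column, one numeric column.
toy = pd.DataFrame({
    "ethnicity": ["Asian", "Hispanic", "Asian", "Caucasian"],
    "age": [68, 77, 25, 81],
})

# Expand only the string column; dtype=int yields 0/1 instead of booleans.
one_hot = pd.get_dummies(toy, columns=["ethnicity"], dtype=int)

print(one_hot.columns.tolist())
print(one_hot)
```

Each row has exactly one 1 across the ethnicity_* columns, which is the defining property of a one-hot code.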

In [6]:
X = pd.DataFrame(Categorical_df['ethnicity'])
Categorical_df['ethnicity'].value_counts()
Out[6]:
Caucasian           70684
African American     9547
Other/Unknown        4374
Hispanic             3796
Asian                1129
Native American       788
Name: ethnicity, dtype: int64
In [7]:
one_hot_encoded_pandas = pd.get_dummies(X)
one_hot_encoded_pandas.head()
Out[7]:

Label Encoding :

Label encoders transform non-numerical labels into numerical labels. Each category is assigned a unique integer, starting from 0 and going up to n_categories - 1 per feature. scikit-learn's LabelEncoder assigns these integers in sorted (alphabetical) order, so it is suitable for variables where the alphabetical or numerical order of the labels is meaningful.
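
To make the assignment rule concrete, here is a plain-Python sketch of sorted-order label assignment, which is how scikit-learn's LabelEncoder picks its integers. The values are invented; the string "nan" stands in for what a missing value becomes after astype('str').

```python
# Toy values (invented). "nan" mimics a missing value cast to string.
values = ["M", "F", "F", "M", "nan"]

# Sort the distinct labels, then number them 0..n_categories-1.
classes = sorted(set(values))
to_int = {label: i for i, label in enumerate(classes)}

encoded = [to_int[v] for v in values]

print(classes)
print(encoded)
```

Note that the missing-value placeholder receives its own integer, so it is treated as just another category.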

In [8]:
# label encoding the data
le = LabelEncoder()
# astype('str') turns missing values into the string 'nan', which gets its own label
gender = pd.DataFrame(le.fit_transform(Categorical_df['gender'].astype('str')))
gender.columns = ['Gender']
gender.head()
Out[8]:

Ordinal Encoding :

Ordinal encoding uses a single column of integers to represent the classes. The integers can be assigned through an explicit mapping dict; here we supply one, using the knowledge that there is some true order to the classes themselves.
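
The following sketch shows the mapping approach on invented values. The order admit < readmit < transfer is taken as given for the illustration, and the extra "unknown" value is hypothetical; it is included to show that any category missing from the dict maps to NaN and is worth checking for.

```python
import pandas as pd

# Toy series (invented); "unknown" is deliberately absent from the mapping.
stay = pd.Series(["admit", "transfer", "readmit", "admit", "unknown"])

# Explicit order supplied by the analyst.
order = {"admit": 0, "readmit": 1, "transfer": 2}

encoded = stay.map(order)

# Categories that silently became NaN because they are not in the dict.
unmapped = stay[encoded.isna()].unique().tolist()

print(encoded.tolist())
print(unmapped)
```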

In [9]:
icu_st = pd.DataFrame(Categorical_df['icu_stay_type'])
icu_st.columns = ['icu_stay_type']
icu_st_dict={'admit':0,'readmit':1,'transfer':2}
icu_st['icu_st_ordinal'] = icu_st.icu_stay_type.map(icu_st_dict)
icu_st.head(5)
Out[9]:

Binary Encoding :

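Binary encoding converts each category into an integer code and then into binary digits. Each binary digit becomes one feature column. If there are n unique categories, binary encoding needs only about ⌈log₂ n⌉ columns, compared with n columns for one-hot encoding (some implementations reserve an extra code for unknown or missing values and so produce one more column). For example, a feature with 4 categories fits in 2 to 3 binary columns; the apache_3j_bodysystem column used here has 11 non-null categories, which need at least 4.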
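
The idea can be sketched in plain Python without any encoder library. The sketch assumes codes are assigned from 1 in order of first appearance; that numbering is an assumption of this example, and a library such as category_encoders may number categories or size its output differently.

```python
# Toy category list (a subset of body-system names, used for illustration only).
categories = ["Sepsis", "Respiratory", "Metabolic", "Cardiovascular", "Trauma"]

# Step 1: integer codes 1..n in order of first appearance (assumed convention).
codes = {c: i for i, c in enumerate(categories, start=1)}

# Step 2: number of binary digits needed for the largest code.
n_bits = max(codes.values()).bit_length()

def to_bits(code, width):
    """Return the zero-padded binary digits of code as a list of ints."""
    return [int(b) for b in format(code, f"0{width}b")]

# Step 3: one column per binary digit.
encoded = {c: to_bits(code, n_bits) for c, code in codes.items()}

for name, bits in encoded.items():
    print(name, bits)
```

Five categories fit in three columns here, whereas one-hot encoding would need five.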

In [10]:
Categorical_df['apache_3j_bodysystem'].unique()
Out[10]:
array(['Sepsis', 'Respiratory', 'Metabolic', 'Cardiovascular', 'Trauma',
       'Neurological', 'Gastrointestinal', 'Genitourinary', nan,
       'Hematological', 'Musculoskeletal/Skin', 'Gynecological'],
      dtype=object)
In [11]:
apache_3 = pd.DataFrame(Categorical_df['apache_3j_bodysystem'])
apache_3 = apache_3.dropna()
apache_3.columns =['apache_3j_bodysystem']
encoder = ce.BinaryEncoder(cols = ['apache_3j_bodysystem'])
apache_bin = encoder.fit_transform(apache_3['apache_3j_bodysystem'])
apache_3 = pd.concat([apache_3,apache_bin],axis = 1)
apache_3.head(5)
Out[11]:

Frequency Encoding :

Frequency encoding replaces each category with its frequency (count or proportion) in the data. When the frequency is related to the target variable, this lets the model weight categories in direct or inverse proportion to how common they are, depending on the nature of the data.
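
A minimal sketch on invented values, using value_counts(normalize=True) as one way to obtain the proportions:

```python
import pandas as pd

# Toy series (invented values).
source = pd.Series([
    "Floor", "Emergency Department", "Floor",
    "Operating Room", "Floor", "Emergency Department",
])

# Proportion of rows falling in each category.
freq = source.value_counts(normalize=True)

# Replace each category by its proportion.
encoded = source.map(freq)

print(freq)
print(encoded.tolist())
```

One limitation to keep in mind: two categories that happen to have the same frequency receive the same value and can no longer be told apart.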

In [12]:
Categorical_df['hospital_admit_source'].unique()
Out[12]:
array(['Floor', 'Emergency Department', 'Operating Room', nan,
       'Direct Admit', 'Other Hospital', 'Other ICU', 'ICU to SDU',
       'Recovery Room', 'Chest Pain Center', 'Step-Down Unit (SDU)',
       'Acute Care/Floor', 'PACU', 'Observation', 'ICU', 'Other'],
      dtype=object)
In [13]:
hosp_asource = pd.DataFrame(Categorical_df['hospital_admit_source'])
hosp_asource = hosp_asource.dropna()
hosp_asource.columns =['hospital_admit_source']
fe = hosp_asource.groupby('hospital_admit_source').size()/len(hosp_asource)
hosp_asource.loc[:,'hospital_admit_source_fe'] = hosp_asource['hospital_admit_source'].map(fe)
hosp_asource.head()
Out[13]:

Mean Encoding :

Mean encoding (also called target encoding) is similar to label encoding, except that the labels are derived directly from the target. Each category in the feature is replaced by the mean value of the target variable for that category, computed on the training data.
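
The sketch below uses two invented frames to show the "computed on training data" part: the per-category means are learned from a toy training set and then applied to a separate toy test set. Falling back to the global training mean for a category never seen in training is an assumption of this sketch, not something the notebook prescribes.

```python
import pandas as pd

# Toy training and test frames (invented values).
train = pd.DataFrame({
    "icu_type": ["MICU", "MICU", "SICU", "SICU", "SICU", "CCU"],
    "hospital_death": [1, 0, 0, 0, 1, 0],
})
test = pd.DataFrame({"icu_type": ["MICU", "SICU", "Neuro ICU"]})

# Learn the per-category target mean from the training data only.
means = train.groupby("icu_type")["hospital_death"].mean()
global_mean = train["hospital_death"].mean()

train["icu_type_mean_en"] = train["icu_type"].map(means)

# Unseen categories ("Neuro ICU") get the global mean (assumed fallback).
test["icu_type_mean_en"] = test["icu_type"].map(means).fillna(global_mean)

print(means)
print(test)
```

Because the encoded value is built from the target itself, computing the means on the same rows that are later used for validation leaks target information, which is why the fit and apply steps are kept separate here.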

In [14]:
Categorical_df['icu_type'].unique()
Out[14]:
array(['CTICU', 'Med-Surg ICU', 'CCU-CTICU', 'Neuro ICU', 'MICU', 'SICU',
       'Cardiac ICU', 'CSICU'], dtype=object)
In [15]:
icu_type = pd.DataFrame(Categorical_df[['icu_type','hospital_death']])
icu_type = icu_type.dropna()
icu_type.columns =['icu_type','hospital_death']
mean_encode = icu_type.groupby('icu_type')['hospital_death'].mean()
icu_type.loc[:,'icu_type_mean_en'] = icu_type['icu_type'].map(mean_encode)
icu_type.head()

Out[15]: