Handling Missing Data

Missing values occur for various reasons, such as corrupt data, failure to load the information, or incomplete extraction. Handling missing values is one of the greatest challenges analysts face, because making the right decision on how to handle them produces robust data models. Missing values cause many problems when a dataset is used with a machine learning algorithm, and they hinder data analysis and data visualization.

In this kernel, let's look at the following methods to handle missing data:

1. Drop all missing values
2. Drop the values above a certain threshold
3. Imputation using mean, median and mode
4. Imputation using forward fill and backward fill
5. sklearn Imputer for numerical variables
6. sklearn Imputer for categorical variables

In [1]:
# importing libraries
import numpy as np 
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# loading dataset 
df = pd.read_csv("../input/widsdatathon2020/training_v2.csv")

Drop all missing values

In this method, we delete a row if it has a null value in any feature. This approach should be used only when there are enough samples in the dataset, and it has to be ensured that no bias is introduced by the deletion. Removing data leads to loss of information, which may hurt the results when predicting the output. All observations that have null values can be deleted using the pandas dropna() method.

In [3]:
# Keep only the first 20 columns of the dataset for this demonstration
df = df[df.columns[0:20]]
# Number of missing values in each column
df.isna().sum()
Out[3]:
encounter_id                 0
patient_id                   0
hospital_id                  0
hospital_death               0
age                       4228
bmi                       3429
elective_surgery             0
ethnicity                 1395
gender                      25
height                    1334
hospital_admit_source    21409
icu_admit_source           112
icu_id                       0
icu_stay_type                0
icu_type                     0
pre_icu_los_days             0
readmission_status           0
weight                    2720
albumin_apache           54379
apache_2_diagnosis        1662
dtype: int64
In [4]:
print("Orginal shape before dropna()" ,df.shape)
drop = df.dropna()
print("Sshape after dropna()" ,drop.shape)
Orginal shape before dropna() (91713, 20) Sshape after dropna() (25349, 20)
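If only certain columns matter for the analysis, dropna() also accepts a subset argument so that a row is removed only when those specific columns are null, which discards fewer samples. A minimal sketch using the age and bmi columns of this dataset:

# Drop rows only when 'age' or 'bmi' is missing; nulls elsewhere are kept
drop_subset = df.dropna(subset=['age', 'bmi'])
print("Shape after dropna(subset=['age', 'bmi']):", drop_subset.shape)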

Drop the values above a certain threshold

If the information contained in a variable is not that valuable, you can drop the variable when it has a high proportion of missing values, for example more than 50%. In this method we drop columns whose null values exceed a certain threshold.

In [5]:
# Drop columns that have fewer than 60% non-null values
threshold = len(df) * 0.60
df_thresh = df.dropna(axis=1, thresh=threshold)
# Shape of the dataset after dropping columns
df_thresh.shape
Out[5]:
(91713, 19)
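Note that thresh in dropna() is the minimum number of non-null values a column must have to be kept, so len(df) * 0.60 keeps columns that are at least 60% populated. To apply a 50%-missing rule directly, the per-column missing fraction can be computed explicitly; a minimal sketch:

# Fraction of missing values in each column
missing_frac = df.isna().mean()
# Keep only columns with at most 50% missing values
df_half = df.loc[:, missing_frac <= 0.50]
print(df_half.shape)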

Imputation using mean, median and mode

This imputation method treats each variable individually, ignoring any interrelationships with other variables, and is beneficial for simple linear models and neural networks. It may not be suitable for tree-based algorithms, and since it is only an approximation it can add variance to the dataset. On the other hand, it avoids losing data, which generally yields better results than removing rows and columns.

In this method, null values are replaced with the mean, median or mode; it is the statistical approach to handling null values. The mean of a numerical column is used to replace nulls when the data is normally distributed. The median is used if the data contains outliers. The mode is used when one value occurs much more frequently than the others.

In [6]:
impute_df = df[['age','bmi','height']].copy()  # copy to avoid SettingWithCopyWarning
mean_age = impute_df['age'].mean()
median_bmi = impute_df['bmi'].median()
mean_height = impute_df['height'].mean()

In [7]:
impute_df.isna().sum()
Out[7]:
age       4228
bmi       3429
height    1334
dtype: int64

The null values are replaced with the mean/median/mode values using the fillna() method.

In [8]:
# Replace null values (np.nan) with the mean
impute_df['age'].fillna(mean_age, inplace=True)
impute_df['height'].fillna(mean_height, inplace=True)

# Replace null values in bmi with the median
impute_df['bmi'].fillna(median_bmi, inplace=True)

impute_df.isna().sum()
Out[8]:
age       0
bmi       0
height    0
dtype: int64
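The cells above demonstrate mean and median imputation; mode imputation works the same way via the pandas mode() method, which returns a Series of the most frequent values. A minimal sketch using the hospital_admit_source column from this dataset:

mode_df = df[['hospital_admit_source']].copy()
# mode() returns a Series; take its first entry as the most frequent value
mode_value = mode_df['hospital_admit_source'].mode()[0]
mode_df['hospital_admit_source'].fillna(mode_value, inplace=True)
print(mode_df.isna().sum())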

Imputation using forward fill and backward fill

Null values can also be replaced by the previous value in the column, which is called forward fill, or by the next occurring value in the column, which is called backward fill. A NaN value will remain even after forward filling or backward filling if a previous or next value isn't available, or if that value is itself NaN.

In [9]:
fill_df = df[['gender','ethnicity']].copy()  # copy to avoid SettingWithCopyWarning
fill_df.isna().sum()
Out[9]:
gender         25
ethnicity    1395
dtype: int64
In [10]:
# Backward fill
fill_df['gender'].fillna(method='bfill', inplace=True)
# Forward fill
fill_df['ethnicity'].fillna(method='ffill', inplace=True)
fill_df.isna().sum()
Out[10]:
gender       0
ethnicity    0
dtype: int64
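Because a forward fill cannot fill a NaN in the first row (there is no previous value) and a backward fill cannot fill one in the last row, the two methods are often chained so that every null gets covered. A minimal sketch on a fresh copy of the ethnicity column:

chain_df = df[['ethnicity']].copy()
# Forward fill first, then backward fill any leading NaNs that remain
chain_df['ethnicity'] = chain_df['ethnicity'].fillna(method='ffill').fillna(method='bfill')
print(chain_df.isna().sum())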

sklearn Imputer for Numerical Variables

In [11]:
# Select numerical columns to impute
imputer_skdf = df[['readmission_status','weight']]
In [12]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fit the imputer and transform the data
imputer_skdf = pd.DataFrame(imputer.fit_transform(imputer_skdf))
imputer_skdf.columns = ['readmission_status','weight']
# Checking for any null values
imputer_skdf.isna().sum()
Out[12]:
readmission_status    0
weight                0
dtype: int64
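SimpleImputer is not limited to the mean; for skewed numerical columns, strategy='median' is often the safer choice. A minimal sketch on the weight column:

median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
weight_imputed = pd.DataFrame(median_imputer.fit_transform(df[['weight']]), columns=['weight'])
print(weight_imputed.isna().sum())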

sklearn Imputer for Categorical Variables

In [13]:
# Select categorical columns to impute
categ_df = df[['ethnicity','gender']]
print("ICU Stay Type - value Counts : \n " ,categ_df['ethnicity'].value_counts())
print("ICU Type - value Counts : \n " , categ_df['gender'].value_counts())

ICU Stay Type - value Counts : Caucasian 70684 African American 9547 Other/Unknown 4374 Hispanic 3796 Asian 1129 Native American 788 Name: ethnicity, dtype: int64 ICU Type - value Counts : M 49469 F 42219 Name: gender, dtype: int64
In [14]:
# Replacing null values in ethnicity and gender with the most frequent value
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
categ_df = pd.DataFrame(imputer.fit_transform(categ_df))
categ_df.columns = ['ethnicity','gender']
# Value counts after imputation
print("Ethnicity - value counts : \n", categ_df['ethnicity'].value_counts())
print("Gender - value counts : \n", categ_df['gender'].value_counts())
Ethnicity - value counts :
Caucasian           72079
African American     9547
Other/Unknown        4374
Hispanic             3796
Asian                1129
Native American       788
Name: ethnicity, dtype: int64
Gender - value counts :
M    49494
F    42219
Name: gender, dtype: int64
In [15]:
categ_df.isna().sum()
Out[15]:
ethnicity    0
gender       0
dtype: int64
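As an alternative to most-frequent imputation, which inflates the majority category (note how the Caucasian and M counts grew above), strategy='constant' marks missing entries with an explicit placeholder instead. A minimal sketch, using 'Missing' as an assumed placeholder label:

const_imputer = SimpleImputer(strategy='constant', fill_value='Missing')  # 'Missing' is an arbitrary label
categ_const = pd.DataFrame(const_imputer.fit_transform(df[['ethnicity','gender']]), columns=['ethnicity','gender'])
print(categ_const['ethnicity'].value_counts())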