
This kernel is adapted from @willkoehrsen's Kaggle kernel.

Feature Selection

Feature selection, the process of finding and selecting the most useful features in a dataset, is a crucial step of the machine learning pipeline. Unnecessary features slow down training, reduce model interpretability, and, most importantly, degrade generalization performance on the test set.

The techniques discussed in this kernel are:

  1. Features with a high percentage of missing values (a sketch of this check follows the list)
  2. Collinear (highly correlated) features
  3. Features with zero importance in a tree-based model
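
Before walking through each technique, here is a minimal sketch of technique 1; the helper name drop_mostly_missing and the 0.75 threshold are illustrative assumptions, not part of the original kernel:

import pandas as pd

def drop_mostly_missing(df: pd.DataFrame, threshold: float = 0.75):
    """Drop columns whose fraction of missing values exceeds `threshold`."""
    missing_frac = df.isnull().mean()  # per-column fraction of NaN values
    to_drop = missing_frac[missing_frac > threshold].index.tolist()
    return df.drop(columns=to_drop), to_drop

Applied to the feature matrix X built below, this would remove the mostly-empty columns before any modeling.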
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import lightgbm as lgb
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

%matplotlib inline
# loading the dataset
training_v2 = pd.read_csv("../input/widsdatathon2020/training_v2.csv")
# creating the independent features X and the dependent feature y
y = training_v2['hospital_death']
X = training_v2.drop('hospital_death', axis=1)
X1 = training_v2.drop('hospital_death', axis=1)  # untouched copy of the features for later use

Collinear (highly correlated) features

Collinear features are features that are highly correlated with one another. In machine learning, they lead to decreased generalization performance on the test set due to high variance, and they reduce model interpretability.
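
As a minimal sketch of this removal step (the 0.95 threshold and the helper name drop_collinear are illustrative assumptions, not from the original kernel):

import numpy as np
import pandas as pd

def drop_collinear(df: pd.DataFrame, threshold: float = 0.95):
    """Drop one feature from each pair whose absolute correlation exceeds `threshold`."""
    # absolute correlation matrix over the numeric columns only
    corr = df.select_dtypes(include=[np.number]).corr().abs()
    # keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    # drop a column if it correlates above the threshold with any earlier column
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

X_uncorrelated, dropped = drop_collinear(X, threshold=0.95)

Dropping only one column from each correlated pair keeps the shared information in the dataset while removing the redundancy.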