
Implementing PCA with Scikit-Learn

How do you deal with a large number of independent features?

  1. Don't do anything. Train on all the features, which may take days or weeks.
  1. Reduce the number of variables by merging correlated variables.
  1. Use feature-selection techniques such as RFE or VIF.
  1. Extract the most important features from the dataset, i.e. those responsible for the maximum variance in the data. Techniques available are PCA, linear discriminant analysis, factor analysis, etc.
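The last option is what this article covers. As a minimal sketch (on synthetic data, not the dataset used below), PCA can collapse many correlated features into a handful of components while keeping most of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples, 10 correlated features generated from only 3 latent factors
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

# Ask PCA to keep just enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # far fewer than the original 10 columns
```

Because the 10 features were built from 3 latent factors, PCA needs only a few components to reach the 95% threshold.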

Advantages of PCA:

  1. The training time of the algorithm reduces significantly with fewer features.
  1. It is not always possible to analyze data in high dimensions. For instance, if there are 100 features in a dataset, the total number of pairwise scatterplots required to visualize the data would be (100 × (100 − 1))/2 = 4,950. Practically, it is not possible to analyze data this way.
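The scatterplot count above is just the number of unordered feature pairs, which is quick to check:

```python
# Number of pairwise scatterplots for n features: n choose 2
n_features = 100
n_scatterplots = n_features * (n_features - 1) // 2
print(n_scatterplots)  # 4950
```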

Key Notes:

  1. It is important to mention that the feature set must be standardized (scaled to zero mean and unit variance) before applying PCA, since PCA is sensitive to the scale of the features.
  1. PCA is a statistical technique and can only be applied to numeric data. Therefore, it is very important to convert all categorical features to numeric features before applying PCA.
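Both notes can be sketched on a tiny, hypothetical toy frame (the column names here are made up, not from the dataset below): encode categoricals first, standardize, and only then apply PCA.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical toy frame: one categorical and two numeric columns
df = pd.DataFrame({
    'day': ['Mon', 'Tue', 'Mon', 'Wed'],
    'temp': [20.1, 22.4, 19.8, 21.0],
    'humidity': [30, 45, 35, 50],
})

# 1) Convert the categorical column to numeric features (one-hot encoding)
X = pd.get_dummies(df, columns=['day'])

# 2) Standardize so every feature has zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# 3) Only now apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(X_pca.shape)  # (4, 2)
```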

Crowdedness at the Campus Gym

```python
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the campus-gym crowdedness dataset
df = pd.read_csv('https://raw.githubusercontent.com/ingledarshan/upGrad_Darshan/main/data.csv')
df.head()

df.shape
# (62184, 11)
```
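With the data loaded, a natural next step is to standardize the numeric columns and fit PCA to inspect how the variance is spread across components. This is a sketch only; since the real column names aren't shown above, the `df` below is a small stand-in frame:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Stand-in for the gym dataset (real column names not shown here)
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(200, 5)),
                  columns=[f'feature_{i}' for i in range(5)])

# Standardize the numeric features, then fit PCA with all components
X_scaled = StandardScaler().fit_transform(df.select_dtypes(include='number'))
pca = PCA()
pca.fit(X_scaled)

# The explained-variance ratios sum to 1; their cumulative sum tells you
# how many components are needed to retain a given share of the variance
print(pca.explained_variance_ratio_.cumsum())
```

In practice you would pick the smallest number of components whose cumulative ratio crosses your target (e.g. 0.95) and pass it as `n_components`.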