
Homework 2, Question 2: KNN Classification in sklearn

In [15]:
# The following line imports the KNeighborsClassifier class.
# KNeighborsClassifier is the sklearn class that performs KNN classification.

from sklearn.neighbors import KNeighborsClassifier

A) Read the iris dataset from the following URL:

https://raw.githubusercontent.com/mpourhoma/CS4661/master/iris.csv

In [16]:
# Importing the required packages and libraries
import pandas as pd

# creating an empty DataFrame:
df = pd.DataFrame()

# reading a CSV file directly from the web and storing it in a pandas DataFrame:
# "read_csv" is a pandas function that reads CSV files from the web or a local device:
df = pd.read_csv('https://raw.githubusercontent.com/mpourhoma/CS4661/master/iris.csv')

# displaying the DataFrame:
df
Out[16]:
In [17]:
# Creating the Feature Matrix for iris dataset:

# create a Python list of the feature names that we would like to pick from the dataset:
feature_cols = ['sepal_length','sepal_width','petal_length','petal_width']

# use the above list to select the features from the original DataFrame
X = df[feature_cols]  

# select the label (target) column as the label vector:
y = df['species']

B) Split the dataset into testing and training sets with the following parameters:

test_size=0.4, random_state=6

In [18]:
# Randomly splitting the original dataset into a training set and a testing set.
# The function "train_test_split" from the "sklearn.model_selection" module performs the random splitting.
# "test_size=0.4" means that 40% of the samples go to the testing set and the rest (60%) to the training set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=6)
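
As a quick sanity check (not part of the original assignment), the shapes of the split arrays should confirm the 60/40 split of the 150 iris samples: 90 rows for training and 60 for testing.

In [ ]:
# Optional check: iris has 150 samples, so test_size=0.4 gives 90 training / 60 testing rows.
print(X_train.shape, X_test.shape)   # expected: (90, 4) (60, 4)
print(y_train.shape, y_test.shape)   # expected: (90,) (60,)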

C) Instantiate a KNN object with K=3, train it on the training set and test it on the testing set.

In [19]:
# In the following line, "knn" is instantiated as an object of the KNeighborsClassifier class.

k = 3
knn = KNeighborsClassifier(n_neighbors=k) 
In [20]:
# We use the "fit" method of the object with the training data and labels to train the model.

knn.fit(X_train, y_train)
Out[20]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')
In [21]:
# Testing on the testing set:

y_predict = knn.predict(X_test)

print(y_predict)
['setosa' 'virginica' 'setosa' 'setosa' 'virginica' 'versicolor' 'virginica' 'setosa' 'virginica' 'versicolor' 'virginica' 'versicolor' 'virginica' 'virginica' 'versicolor' 'versicolor' 'virginica' 'versicolor' 'versicolor' 'setosa' 'setosa' 'virginica' 'setosa' 'setosa' 'versicolor' 'virginica' 'versicolor' 'virginica' 'setosa' 'versicolor' 'setosa' 'versicolor' 'setosa' 'setosa' 'versicolor' 'virginica' 'versicolor' 'virginica' 'versicolor' 'setosa' 'setosa' 'virginica' 'versicolor' 'versicolor' 'setosa' 'setosa' 'versicolor' 'setosa' 'setosa' 'versicolor' 'virginica' 'virginica' 'virginica' 'setosa' 'virginica' 'setosa' 'setosa' 'setosa' 'versicolor' 'virginica']
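
For a quick side-by-side look at the predictions against the true labels (an optional sketch using the variables above), the two can be placed in a small DataFrame:

In [ ]:
# Optional: line up the actual and predicted labels for the testing set.
# y_test is a pandas Series, so .values gives its underlying array.
comparison = pd.DataFrame({'actual': y_test.values, 'predicted': y_predict})
print(comparison.head(10))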
In [22]:
# We can now compare the "predicted labels" for the testing set with its "actual labels" to evaluate the accuracy.
# The function "accuracy_score" from "sklearn.metrics" performs the element-by-element comparison and returns the
# fraction of correct predictions:

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_predict)

print(accuracy)
0.95
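
Equivalently, the classifier's own "score" method reports the same mean accuracy on the testing set (a one-line sketch using the fitted knn object above):

In [ ]:
# score(X, y) predicts labels for X internally and returns the mean accuracy against y.
print(knn.score(X_test, y_test))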

D) Repeat part (c) for K=1, K=5, K=7, K=11, K=15, K=27, K=59

(you can simply use a “for loop,” and save the final accuracy results in a list).

Does the accuracy always get better by increasing the value of K? Ans: No. As the results below show, the accuracy peaks around K=5 and then drops as K grows larger.

In [23]:
# Define a list of K values (knn_list) and an empty dictionary (knn_result_d) to store the accuracy for each K

knn_list=[1,5,7,11,15,27,59]

knn_result_d={}

# using a for loop to run KNeighborsClassifier for each value of K and storing each accuracy in the knn_result_d dictionary

for k in knn_list:

    knn = KNeighborsClassifier(n_neighbors=k)

    knn.fit(X_train, y_train)

    y_predict_d = knn.predict(X_test)

    accuracy = accuracy_score(y_test, y_predict_d)

    knn_result_d[k] = accuracy

print(knn_result_d)
{1: 0.95, 5: 0.9833333333333333, 7: 0.9666666666666667, 11: 0.9666666666666667, 15: 0.9333333333333333, 27: 0.9166666666666666, 59: 0.8166666666666667}
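
Because the accuracies are stored in a dictionary keyed by K, the best K can also be read off programmatically (a short sketch over the knn_result_d computed above):

In [ ]:
# Pick the K with the highest accuracy from the results dictionary.
best_k = max(knn_result_d, key=knn_result_d.get)
print(best_k, knn_result_d[best_k])   # K=5 gives the highest accuracy here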

E) Now, suppose that we would like to make prediction based on only ONE single feature.

To find the best single feature, we have to try every feature individually. In other words, we have to repeat part (c) with K=3 four times (each time using only one of the 4 features) and compute the accuracy each time. Then, check the accuracies. Which individual feature provides the best accuracy (the best feature)?

Which one is the second best feature?
Ans: petal_width is the best single feature (accuracy ≈ 0.95), and petal_length is the second best feature (accuracy ≈ 0.93).
In [24]:
# we run KNeighborsClassifier on each individual feature and store each accuracy in the knn_result_dict dictionary

k=3

knn= KNeighborsClassifier(n_neighbors=k)

knn_result_dict={}

for f in feature_cols:

    # a single column is a 1D Series, so reshape it into a 2D array (n_samples, 1) for sklearn
    X_feature = df[f]

    X_ftrain, X_ftest, y_ftrain, y_ftest = train_test_split(X_feature.values.reshape(len(X_feature), 1), y, test_size=0.4, random_state=6)

    knn.fit(X_ftrain, y_ftrain)

    y_fpredict = knn.predict(X_ftest)

    accuracy = accuracy_score(y_ftest, y_fpredict)

    knn_result_dict[f] = accuracy

print(knn_result_dict)
{'sepal_length': 0.7166666666666667, 'sepal_width': 0.5666666666666667, 'petal_length': 0.9333333333333333, 'petal_width': 0.95}
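
Sorting the single-feature results makes the ranking explicit (a short sketch over the knn_result_dict computed above):

In [ ]:
# Sort the features by accuracy, best first.
ranked = sorted(knn_result_dict.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranked:
    print(name, ":", acc)
# expected order: petal_width, petal_length, sepal_length, sepal_width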

F) Now, we want to repeat part (e), this time using two features.

You have to train, test, and evaluate your model for 6 different cases: using (1st and 2nd features), (1st and 3rd features), (1st and 4th features), (2nd and 3rd features), (2nd and 4th features), (3rd and 4th features)!

Which “feature pair” provides the best accuracy?

Ans: The feature pair (sepal_length, petal_length) provides the best accuracy (≈ 0.98).
In [25]:
# we run KNeighborsClassifier on each feature pair and store each accuracy in the knn_result dictionary

k=3

knn= KNeighborsClassifier(n_neighbors=k)

features = [['sepal_length','sepal_width'], ['sepal_length','petal_length'], ['sepal_length','petal_width'],
            ['sepal_width','petal_length'], ['sepal_width','petal_width'], ['petal_length','petal_width']]

knn_result={}

for f in features:

    X_feature = df[f]

    X_ftrain, X_ftest, y_ftrain, y_ftest = train_test_split(X_feature, y, test_size=0.4, random_state=6)

    knn.fit(X_ftrain, y_ftrain)

    y_fpredict = knn.predict(X_ftest)

    accuracy = accuracy_score(y_ftest, y_fpredict)

    knn_result[','.join(f)] = accuracy

print(knn_result)

# printing each feature pair and its accuracy on a separate line
for pair in knn_result.keys():

    print(pair, ":", knn_result[pair])

{'sepal_length,sepal_width': 0.8166666666666667, 'sepal_length,petal_length': 0.9833333333333333, 'sepal_length,petal_width': 0.95, 'sepal_width,petal_length': 0.95, 'sepal_width,petal_width': 0.95, 'petal_length,petal_width': 0.9666666666666667}
sepal_length,sepal_width : 0.8166666666666667
sepal_length,petal_length : 0.9833333333333333
sepal_length,petal_width : 0.95
sepal_width,petal_length : 0.95
sepal_width,petal_width : 0.95
petal_length,petal_width : 0.9666666666666667
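
The same six pairs can also be generated with itertools.combinations instead of being listed by hand, and the best pair read off with max (a sketch reusing the variables defined above):

In [ ]:
from itertools import combinations

# Build every 2-feature combination of the 4 columns (6 pairs) and
# repeat the train/test/evaluate steps from the cell above for each.
pair_results = {}
for pair in combinations(feature_cols, 2):
    Xp = df[list(pair)]
    Xp_train, Xp_test, yp_train, yp_test = train_test_split(Xp, y, test_size=0.4, random_state=6)
    knn.fit(Xp_train, yp_train)
    pair_results[pair] = accuracy_score(yp_test, knn.predict(Xp_test))

best_pair = max(pair_results, key=pair_results.get)
print(best_pair, pair_results[best_pair])   # ('sepal_length', 'petal_length') here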

G) Does the “best feature pair” from part (f) contain both the “first best feature” and the “second best feature” from part (e)? In other words, can we conclude that the “best two features” for classification are the first best feature along with the second best feature together?

Ans: No. The first best single feature is petal_width (accuracy ≈ 0.95) and the second best is petal_length (≈ 0.93), but the best feature pair is (sepal_length, petal_length) with accuracy ≈ 0.98.

H) Optional Question: Justify your answer for part (g)! If yes, why? If no, why not?

Ans: When we use one feature at a time, the nearest-neighbor distances are computed along that single feature only. When we use a pair of features, the distances are computed in a two-dimensional space, so the two features interact: two features that are individually strong may carry largely redundant information, while a weaker feature can add complementary information to a strong one. Therefore, the first and second best individual features do not necessarily give the best accuracy together.
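
One way to check this intuition (an optional sketch, not part of the original homework) is to look at how correlated the individually best features are. In the iris data, petal_length and petal_width are known to be very strongly correlated, so using them together adds little information beyond either one alone, which is why a different pairing can come out ahead.

In [ ]:
# Pairwise correlation between the numeric features.
# Highly correlated features tend to be redundant when used together.
print(df[feature_cols].corr())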