
Model evaluation

1. Problem Statement

Given a dataset containing personal details of citizens, build a model that uses the random forest algorithm to predict whether a person will commit a crime in the future. Evaluate the model using the available model evaluation techniques.

2. Data Loading and Description

Courts are based on the principle that "it is better 100 guilty Persons should escape than that one innocent Person should suffer". You are given a dataset containing answers to various questions about the professional and private lives of several people. A few of them have been arrested for various small and large crimes in the past. Use the given data to build a model for the court to predict whether the accused is guilty or not.

  • False positive - the model predicts that the accused is guilty, but the person is actually innocent.
  • False negative - the model predicts that the accused is innocent, but the person is actually guilty.

So, based on the principle of the court, minimising false positives is the main aim. The dataset consists of 45718 rows.
Below is a table with a brief description of the features present in the dataset.

Feature Description
PERID Person ID
IFATHER FATHER IN HOUSEHOLD
NRCH17_2 RECODED # R's CHILDREN < 18 IN HOUSEHOLD
IRHHSIZ2 RECODE - IMPUTATION-REVISED # PERSONS IN HH
IIHHSIZ2 IMPUTATION INDICATOR
IRKI17_2 IMPUTATION-REVISED # KIDS AGED<18 IN HH
IIKI17_2 IRKI17_2-IMPUTATION INDICATOR
IRHH65_2 REC - IMPUTATION-REVISED # OF PER IN HH AGED>=65
IIHH65_2 IRHH65_2-IMPUTATION INDICATOR
PRXRETRY SELECTED PROXY UNAVAILABLE, OTHER PROXY AVAILABLE?
PRXYDATA IS PROXY ANSWERING INSURANCE/INCOME QS
MEDICARE COVERED BY MEDICARE
CAIDCHIP COVERED BY MEDICAID/CHIP
CHAMPUS COV BY TRICARE, CHAMPUS, CHAMPVA, VA, MILITARY
PRVHLTIN COVERED BY PRIVATE INSURANCE
GRPHLTIN PRIVATE PLAN OFFERED THROUGH EMPLOYER OR UNION
HLTINNOS COVERED BY HEALTH INSUR
HLCNOTYR ANYTIME DID NOT HAVE HEALTH INS/COVER PAST 12 MOS
HLCNOTMO PAST 12 MOS, HOW MANY MOS W/O COVERAGE
HLCLAST TIME SINCE LAST HAD HEALTH CARE COVERAGE
HLLOSRSN MAIN REASON STOPPED COVERED BY HEALTH INSURANCE
HLNVCOST COST TOO HIGH
HLNVOFFR EMPLOYER DOESN'T OFFER
HLNVREF INSURANCE COMPANY REFUSED COVERAGE
HLNVNEED DON'T NEED IT
HLNVSOR NEVER HAD HLTH INS SOME OTHER REASON
IRMCDCHP IMPUTATION REVISED CAIDCHIP
IIMCDCHP MEDICAID/CHIP - IMPUTATION INDICATOR
IRMEDICR MEDICARE - IMPUTATION REVISED
IIMEDICR MEDICARE - IMPUTATION INDICATOR
IRCHMPUS CHAMPUS - IMPUTATION REVISED
IICHMPUS CHAMPUS - IMPUTATION INDICATOR
IRPRVHLT PRIVATE HEALTH INSURANCE - IMPUTATION REVISED
IIPRVHLT PRIVATE HEALTH INSURANCE - IMPUTATION INDICATOR
IROTHHLT OTHER HEALTH INSURANCE - IMPUTATION REVISED
IIOTHHLT OTHER HEALTH INSURANCE - IMPUTATION INDICATOR
HLCALLFG FLAG IF EVERY FORM OF HEALTH INS REPORTED
HLCALL99 YES TO MEDICARE/MEDICAID/CHAMPUS/PRVHLTIN
ANYHLTI2 COVERED BY ANY HEALTH INSURANCE - RECODE
IRINSUR4 RC-OVERALL HEALTH INSURANCE - IMPUTATION REVISED
IIINSUR4 RC-OVERALL HEALTH INSURANCE - IMPUTATION INDICATOR
OTHINS RC-OTHER HEALTH INSURANCE
CELLNOTCL NOT A CELL PHONE
CELLWRKNG WORKING CELL PHONE
IRFAMSOC FAM RECEIVE SS OR RR PAYMENTS - IMPUTATION REVISED
IIFAMSOC FAM RECEIVE SS OR RR PAYMENTS - IMPUTATION INDICATOR
IRFAMSSI FAM RECEIVE SSI - IMPUTATION REVISED
IIFAMSSI FAM RECEIVE SSI - IMPUTATION INDICATOR
IRFSTAMP RESP/OTH FAM MEM REC FOOD STAMPS - IMPUTATION REVISED
IIFSTAMP RESP/OTH FAM MEM REC FOOD STAMPS - IMPUTATION INDICATOR
IRFAMPMT FAM RECEIVE PUBLIC ASSIST - IMPUTATION REVISED
IIFAMPMT FAM RECEIVE PUBLIC ASSIST - IMPUTATION INDICATOR
IRFAMSVC FAM REC WELFARE/JOB PL/CHILDCARE - IMPUTATION REVISED
IIFAMSVC FAM REC WELFARE/JOB PL/CHILDCARE - IMPUTATION INDICATOR
IRWELMOS IMP. REVISED - NO.OF MONTHS ON WELFARE
IIWELMOS NO OF MONTHS ON WELFARE - IMPUTATION INDICATOR
IRPINC3 RESP TOT INCOME (FINER CAT) - IMP REV
IRFAMIN3 RECODE - IMP.REVISED - TOT FAM INCOME
IIPINC3 RESP TOT INCOME (FINER CAT) - IMP INDIC
IIFAMIN3 IRFAMIN3 - IMPUTATION INDICATOR
GOVTPROG RC-PARTICIPATED IN ONE OR MORE GOVT ASSIST PROGRAMS
POVERTY3 RC-POVERTY LEVEL
TOOLONG RESP SAID INTERVIEW WAS TOO LONG
TROUBUND DID RESP HAVE TROUBLE UNDERSTANDING INTERVIEW
PDEN10 POPULATION DENSITY 2010
COUTYP2 COUNTY METRO/NONMETRO STATUS
MAIIN102 MAJORITY AMER INDIAN AREA INDICATOR FOR SEGMENT
AIIND102 AMER INDIAN AREA INDICATOR
ANALWT_C FIN PRSN-LEVEL SIMPLE WGHT
VESTR ANALYSIS STRATUM
VEREP ANALYSIS REPLICATE
Criminal Target Variable

Importing Packages

In [1]:
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt 
plt.rc("font", size=14)
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)
Importing the Dataset
In [2]:
crime = pd.read_csv('https://raw.githubusercontent.com/insaid2018/Term-2/master/Data/criminal_train.csv')
crime.head()
Out[2]:

3. Exploratory Data Analysis

Check the shape of the dataset
In [3]:
crime.shape
Out[3]:
(45718, 72)
Check the columns present in the dataset
In [4]:
crime.columns
Out[4]:
Index(['PERID', 'IFATHER', 'NRCH17_2', 'IRHHSIZ2', 'IIHHSIZ2', 'IRKI17_2',
       'IIKI17_2', 'IRHH65_2', 'IIHH65_2', 'PRXRETRY', 'PRXYDATA', 'MEDICARE',
       'CAIDCHIP', 'CHAMPUS', 'PRVHLTIN', 'GRPHLTIN', 'HLTINNOS', 'HLCNOTYR',
       'HLCNOTMO', 'HLCLAST', 'HLLOSRSN', 'HLNVCOST', 'HLNVOFFR', 'HLNVREF',
       'HLNVNEED', 'HLNVSOR', 'IRMCDCHP', 'IIMCDCHP', 'IRMEDICR', 'IIMEDICR',
       'IRCHMPUS', 'IICHMPUS', 'IRPRVHLT', 'IIPRVHLT', 'IROTHHLT', 'IIOTHHLT',
       'HLCALLFG', 'HLCALL99', 'ANYHLTI2', 'IRINSUR4', 'IIINSUR4', 'OTHINS',
       'CELLNOTCL', 'CELLWRKNG', 'IRFAMSOC', 'IIFAMSOC', 'IRFAMSSI',
       'IIFAMSSI', 'IRFSTAMP', 'IIFSTAMP', 'IRFAMPMT', 'IIFAMPMT', 'IRFAMSVC',
       'IIFAMSVC', 'IRWELMOS', 'IIWELMOS', 'IRPINC3', 'IRFAMIN3', 'IIPINC3',
       'IIFAMIN3', 'GOVTPROG', 'POVERTY3', 'TOOLONG', 'TROUBUND', 'PDEN10',
       'COUTYP2', 'MAIIN102', 'AIIND102', 'ANALWT_C', 'VESTR', 'VEREP',
       'Criminal'],
      dtype='object')
Check the descriptive statistics of the dataset
In [5]:
crime.describe()
Out[5]:
Check the info of the dataset
In [6]:
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45718 entries, 0 to 45717
Data columns (total 72 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   PERID      45718 non-null  int64
 1   IFATHER    45718 non-null  int64
 2   NRCH17_2   45718 non-null  int64
 3   IRHHSIZ2   45718 non-null  int64
 4   IIHHSIZ2   45718 non-null  int64
 5   IRKI17_2   45718 non-null  int64
 6   IIKI17_2   45718 non-null  int64
 7   IRHH65_2   45718 non-null  int64
 8   IIHH65_2   45718 non-null  int64
 9   PRXRETRY   45718 non-null  int64
 10  PRXYDATA   45718 non-null  int64
 11  MEDICARE   45718 non-null  int64
 12  CAIDCHIP   45718 non-null  int64
 13  CHAMPUS    45718 non-null  int64
 14  PRVHLTIN   45718 non-null  int64
 15  GRPHLTIN   45718 non-null  int64
 16  HLTINNOS   45718 non-null  int64
 17  HLCNOTYR   45718 non-null  int64
 18  HLCNOTMO   45718 non-null  int64
 19  HLCLAST    45718 non-null  int64
 20  HLLOSRSN   45718 non-null  int64
 21  HLNVCOST   45718 non-null  int64
 22  HLNVOFFR   45718 non-null  int64
 23  HLNVREF    45718 non-null  int64
 24  HLNVNEED   45718 non-null  int64
 25  HLNVSOR    45718 non-null  int64
 26  IRMCDCHP   45718 non-null  int64
 27  IIMCDCHP   45718 non-null  int64
 28  IRMEDICR   45718 non-null  int64
 29  IIMEDICR   45718 non-null  int64
 30  IRCHMPUS   45718 non-null  int64
 31  IICHMPUS   45718 non-null  int64
 32  IRPRVHLT   45718 non-null  int64
 33  IIPRVHLT   45718 non-null  int64
 34  IROTHHLT   45718 non-null  int64
 35  IIOTHHLT   45718 non-null  int64
 36  HLCALLFG   45718 non-null  int64
 37  HLCALL99   45718 non-null  int64
 38  ANYHLTI2   45718 non-null  int64
 39  IRINSUR4   45718 non-null  int64
 40  IIINSUR4   45718 non-null  int64
 41  OTHINS     45718 non-null  int64
 42  CELLNOTCL  45718 non-null  int64
 43  CELLWRKNG  45718 non-null  int64
 44  IRFAMSOC   45718 non-null  int64
 45  IIFAMSOC   45718 non-null  int64
 46  IRFAMSSI   45718 non-null  int64
 47  IIFAMSSI   45718 non-null  int64
 48  IRFSTAMP   45718 non-null  int64
 49  IIFSTAMP   45718 non-null  int64
 50  IRFAMPMT   45718 non-null  int64
 51  IIFAMPMT   45718 non-null  int64
 52  IRFAMSVC   45718 non-null  int64
 53  IIFAMSVC   45718 non-null  int64
 54  IRWELMOS   45718 non-null  int64
 55  IIWELMOS   45718 non-null  int64
 56  IRPINC3    45718 non-null  int64
 57  IRFAMIN3   45718 non-null  int64
 58  IIPINC3    45718 non-null  int64
 59  IIFAMIN3   45718 non-null  int64
 60  GOVTPROG   45718 non-null  int64
 61  POVERTY3   45718 non-null  int64
 62  TOOLONG    45718 non-null  int64
 63  TROUBUND   45718 non-null  int64
 64  PDEN10     45718 non-null  int64
 65  COUTYP2    45718 non-null  int64
 66  MAIIN102   45718 non-null  int64
 67  AIIND102   45718 non-null  int64
 68  ANALWT_C   45718 non-null  float64
 69  VESTR      45718 non-null  int64
 70  VEREP      45718 non-null  int64
 71  Criminal   45718 non-null  int64
dtypes: float64(1), int64(71)
memory usage: 25.1 MB
Check the missing values present in the dataset.
In [7]:
crime.isnull().sum()
Out[7]:
PERID       0
IFATHER     0
NRCH17_2    0
IRHHSIZ2    0
IIHHSIZ2    0
           ..
AIIND102    0
ANALWT_C    0
VESTR       0
VEREP       0
Criminal    0
Length: 72, dtype: int64

4. Random Forest Classifier

Preparing X and y using pandas
In [ ]:
X = crime.drop(['Criminal'], axis=1)
X.head()
In [ ]:
y = crime["Criminal"]
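Splitting X and y into train and test datasets.
In [ ]:
# train_test_split lives in sklearn.model_selection; the old
# sklearn.cross_validation module was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)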
Checking the shape of X and y of train dataset
In [ ]:
print(X_train.shape)
print(y_train.shape)
Checking the shape of X and y of test dataset
In [ ]:
print(X_test.shape)
print(y_test.shape)

Now we are going to build two models: one without any parameter specification, and another in which we specify some parameter values. In later sections we will compare the performance of these models using various model evaluation techniques.

Instantiating Random Forest Classifier using scikit-learn with default parameters.
In [ ]:
from sklearn.ensemble import RandomForestClassifier
model1 = RandomForestClassifier(random_state = 0)
Instantiating Random Forest Classifier using scikit learn with:
  • random_state = 0,
  • max_depth = 5,
  • min_samples_leaf = 5,
  • min_samples_split = 7,
  • min_weight_fraction_leaf = 0.0,
  • n_estimators = 12,
  • n_jobs = -1
In [ ]:
model2 = RandomForestClassifier(
                                random_state = 0,
                                max_depth = 5, 
                                min_samples_leaf = 5,
                                min_samples_split = 7,
                                min_weight_fraction_leaf = 0.0,
                                n_estimators = 12, 
                                n_jobs = -1,
                                ) 
Fitting the model on X_train and y_train
In [ ]:
model1.fit(X_train,y_train)
In [ ]:
model2.fit(X_train,y_train)
Using the model for prediction
In [ ]:
# predict() returns a NumPy array of class labels for the test set
prediction1 = model1.predict(X_test)
In [ ]:
prediction2 = model2.predict(X_test)

5. Model evaluation

5.1 Model evaluation using accuracy score

In [ ]:
from sklearn.metrics import accuracy_score
print('Accuracy score for test data with model 1 is:',accuracy_score(y_test, prediction1))
print('Accuracy score for test data with model 2 is:',accuracy_score(y_test, prediction2))

The accuracy score of model1 is slightly greater than that of model2.
Let's look at some other evaluation techniques to compare the two models.
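One reason to look beyond accuracy is that it can be high even for a model that never flags anyone. The following is a minimal sketch on made-up labels (the 93/7 class split is an assumption for illustration, not the distribution of the Criminal column) showing the accuracy of a baseline that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 93 negatives (0) and 7 positives (1).
# These proportions are invented for illustration only.
y_true = np.array([0] * 93 + [1] * 7)

# A trivial "model" that always predicts the majority class (0).
y_majority = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
print("Majority-class baseline accuracy:", baseline_accuracy)
```

On data like this, a model would need to beat the baseline accuracy before its accuracy score says anything about how well it identifies the positive class.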

5.2 Model evaluation using confusion matrix

In [ ]:
from sklearn.metrics import confusion_matrix
print('Confusion matrix for test data with model 1 is:\n',confusion_matrix(y_test, prediction1))
print('Confusion matrix for test data with model 2 is:\n',confusion_matrix(y_test, prediction2))

Comparing confusion matrix for the two models:

  • The number of false negatives is higher in model2.
  • The number of false positives is 0 in model2.
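To read false positives and false negatives off a confusion matrix, it helps to unpack its four cells by name. Here is a small sketch with hypothetical labels and predictions (not the notebook's test set), relying on the row and column ordering that scikit-learn uses for binary labels 0 and 1.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (1 = guilty, 0 = innocent).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels {0, 1}, scikit-learn lays the matrix out as
# [[TN, FP],
#  [FN, TP]], so ravel() yields (tn, fp, fn, tp) in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Applied to `confusion_matrix(y_test, prediction1)` and `confusion_matrix(y_test, prediction2)`, the same unpacking would give the counts compared in the list above.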

Let's calculate the precision and recall scores for a clearer picture of the scenario.

5.3 Model evaluation using precision score

In [ ]:
from sklearn.metrics import precision_score
precision1 = precision_score(y_test,prediction1)
print('Precision score for test data using model1 is:', precision1)
precision2 = precision_score(y_test,prediction2)
print('Precision score for test data using model2 is:', precision2)

The precision score for model2 is 1. This means that no innocent person in the test set is predicted as guilty, which is exactly what the founding principle of the court requires.
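Precision is the fraction of predicted positives that are truly positive, TP / (TP + FP), so it reaches 1 exactly when there are no false positives. The sketch below checks this by hand against `precision_score` on hypothetical labels (not the notebook's data).

```python
from sklearn.metrics import precision_score

# Hypothetical labels and predictions; the predictions contain
# no false positives (no 0 is predicted as 1).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

# Count true positives and false positives by hand.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

manual_precision = tp / (tp + fp)
library_precision = precision_score(y_true, y_pred)
print("Manual precision:", manual_precision)
print("precision_score :", library_precision)
```

Note that precision says nothing about the positives the model missed: in this toy example two of the four guilty cases are predicted innocent, and precision is still 1.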

5.4 Model evaluation using recall score

In [ ]:
from sklearn.metrics import recall_score
print('Recall score for test data using model1 is:',recall_score(y_test,prediction1))   
print('Recall score for test data using model2 is:',recall_score(y_test,prediction2))

Recall score of model1 is higher than that of model2.

5.5 Model evaluation using F1_score

In [ ]:
from sklearn.metrics import f1_score
print('F1_score for test data using model1 is:',f1_score(y_test, prediction1))
print('F1_score for test data using model2 is:',f1_score(y_test, prediction2))

The F1 score for model1 is much higher than that for model2, but we need to make our decision on the basis of the precision score.
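The F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R), which is why a model with perfect precision but low recall can still have a low F1 score. A minimal sketch on hypothetical labels (not the notebook's data):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels and predictions: one true positive,
# no false positives, three false negatives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

p = precision_score(y_true, y_pred)  # TP / (TP + FP)
r = recall_score(y_true, y_pred)     # TP / (TP + FN)

# Harmonic mean of precision and recall.
manual_f1 = 2 * p * r / (p + r)
library_f1 = f1_score(y_true, y_pred)
print("precision:", p, "recall:", r)
print("manual F1:", manual_f1, "f1_score:", library_f1)
```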

5.6 Model evaluation using ROC_AUC curve

  • For model1
In [ ]:
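from sklearn import metrics
# Predicted probabilities; column 1 corresponds to model1.classes_[1]
probs = model1.predict_proba(X_test)
preds = probs[:,1]
# roc_curve returns the false and true positive rates at each threshold
fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
roc_auc = metrics.auc(fpr, tpr)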

import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
  • For model2
In [ ]:
from sklearn import metrics
probs = model2.predict_proba(X_test)
pred = probs[:,1]
fpr1, tpr1, threshold = metrics.roc_curve(y_test, pred)
roc_auc = metrics.auc(fpr1, tpr1)

# Plot the ROC curve for model2
import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr1, tpr1, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Comparing the ROC curves of the two models, the AUC of model2 is higher than that of model1.
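When only the AUC values are needed for a comparison, they can be computed directly with `roc_auc_score` instead of being read from the plot legends. The sketch below uses two hypothetical sets of scores (stand-ins for the positive-class probabilities of two models) and confirms that `roc_auc_score` agrees with `auc(fpr, tpr)`.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical true labels and positive-class scores from two models.
y_true = [0, 0, 1, 1]
scores_a = [0.1, 0.4, 0.35, 0.8]  # one positive ranked below a negative
scores_b = [0.1, 0.2, 0.6, 0.9]   # every positive ranked above every negative

# Direct computation of the area under the ROC curve.
auc_a = roc_auc_score(y_true, scores_a)
auc_b = roc_auc_score(y_true, scores_b)

# Same value obtained via the curve, as in the cells above.
fpr_a, tpr_a, _ = roc_curve(y_true, scores_a)
auc_a_from_curve = auc(fpr_a, tpr_a)

print("AUC model A:", auc_a)
print("AUC model B:", auc_b)
```

In the notebook, the equivalent calls would take `y_test` and the second column of each model's `predict_proba(X_test)` output.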

5.7 Choosing better model using precision score

We have compared the performance of the two models using various model evaluation techniques.
Our objective is to minimise false positives so that no innocent person is predicted as guilty. Therefore, between the recall and precision scores, we give more importance to the precision score.

  • Precision score for model1 is: 0.62
  • Precision score for model2 is: 1

As the precision score of model2 is greater than that of model1, model2 is preferable.
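A selection rule like this can be made explicit by collecting the metrics of each candidate in one table and picking the row with the highest precision. The following is a sketch only: the names `model_a` and `model_b` and their predictions are hypothetical placeholders, not the notebook's results.

```python
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical true labels and predictions from two candidate models.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predictions = {
    "model_a": [0, 0, 0, 0, 1, 1, 1, 1, 1, 0],  # 3 TP, 2 FP, 1 FN
    "model_b": [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],  # 1 TP, 0 FP, 3 FN
}

# One row per model, one column per metric.
summary = pd.DataFrame(
    {
        name: {
            "precision": precision_score(y_true, pred),
            "recall": recall_score(y_true, pred),
            "f1": f1_score(y_true, pred),
        }
        for name, pred in predictions.items()
    }
).T

# Select the model with the highest precision, as the objective demands.
best = summary["precision"].idxmax()
print(summary)
print("Preferred model by precision:", best)
```

The table also keeps recall visible, so the cost of the choice (how many guilty cases the preferred model misses) stays in view.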