Jovian
⭐️
Sign In

Crime Hotspot Finder

Idea

The idea of the research is to create a prototype of an application that can help police in a town/city to be able to view a crime map of the area once they give the inputs of "day", "month", "hour". At any particular timeframe determined by a day, a month and a year; the prototype will output a map of Chicago with all the 25 districts in the state of Chicago displayed and shaded according to the intensity of crime in that particular timeframe.

Below is the basic UI of our prototype

Chicago%20Crime%20Hotspot%20UI.png

In [109]:
# Loading the libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime
import seaborn as sns

Creating the Dataset

  • Append all datasets from 2015-2019 [5 years]
  • The final dataset should have 22 columns exactly
  • Missing data has to be handled
In [110]:
'''Let's write code to automate the creating of our dataset'''

file_names = ['crimes_2015.csv','crimes_2016.csv','crimes_2017.csv','crimes_2018.csv','crimes_2019.csv']

def create_df(filenames):
    main_df = pd.read_csv(filenames[0])
    print("Finished Loading Chicago Crime Dataset File for the year "+filenames[0][7:11]+".")
    main_df = main_df[list(main_df.columns[:22])]
    for file in filenames[1:]:
        print("Finished Loading Chicago Crime Dataset File for the year "+file[7:11]+".")
        df_temp = pd.read_csv(file)
        df_temp = df_temp[list(df_temp.columns[:22])]
        main_df = main_df.append(df_temp, ignore_index=True)
    print("All data files loaded onto the Main Dataframe.\nYOU ARE READY TO GO!\n")
    return main_df

main_df = create_df(file_names)
orig_shape = main_df.shape
print("The Number of Crimes: "+ str(main_df.shape[0]))
print("\nThe Columns: "+ str(main_df.shape[1]))
Finished Loading Chicago Crime Dataset File for the year 2015. Finished Loading Chicago Crime Dataset File for the year 2016. Finished Loading Chicago Crime Dataset File for the year 2017. Finished Loading Chicago Crime Dataset File for the year 2018. Finished Loading Chicago Crime Dataset File for the year 2019. All data files loaded onto the Main Dataframe. YOU ARE READY TO GO! The Number of Crimes: 1146382 The Columns: 22
In [111]:
main_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1146382 entries, 0 to 1146381 Data columns (total 22 columns): ID 1146382 non-null int64 Case Number 1146382 non-null object Date 1146382 non-null object Block 1146382 non-null object IUCR 1146382 non-null object Primary Type 1146382 non-null object Description 1146382 non-null object Location Description 1142740 non-null object Arrest 1146382 non-null bool Domestic 1146382 non-null bool Beat 1146382 non-null int64 District 1146381 non-null float64 Ward 1146373 non-null float64 Community Area 1146380 non-null float64 FBI Code 1146382 non-null object X Coordinate 1132399 non-null float64 Y Coordinate 1132399 non-null float64 Year 1146382 non-null int64 Updated On 1146382 non-null object Latitude 1132399 non-null float64 Longitude 1132399 non-null float64 Location 1132399 non-null object dtypes: bool(2), float64(7), int64(3), object(10) memory usage: 177.1+ MB
In [112]:
# Missing values in the dataset
main_df.isna().sum()
Out[112]:
ID                          0
Case Number                 0
Date                        0
Block                       0
IUCR                        0
Primary Type                0
Description                 0
Location Description     3642
Arrest                      0
Domestic                    0
Beat                        0
District                    1
Ward                        9
Community Area              2
FBI Code                    0
X Coordinate            13983
Y Coordinate            13983
Year                        0
Updated On                  0
Latitude                13983
Longitude               13983
Location                13983
dtype: int64
In [113]:
sns.heatmap(data = main_df.isna(), yticklabels=False, cbar=False, cmap='inferno')
Out[113]:
<matplotlib.axes._subplots.AxesSubplot at 0x161ad3f0048>
Notebook Image
In [114]:
# To drop the rows with missing data
main_df = main_df.dropna()
main_df.isna().sum()
Out[114]:
ID                      0
Case Number             0
Date                    0
Block                   0
IUCR                    0
Primary Type            0
Description             0
Location Description    0
Arrest                  0
Domestic                0
Beat                    0
District                0
Ward                    0
Community Area          0
FBI Code                0
X Coordinate            0
Y Coordinate            0
Year                    0
Updated On              0
Latitude                0
Longitude               0
Location                0
dtype: int64

The above dropping of rows does not cause too much of data loss as shown below :

In [115]:
# Inspecting the loss of data after such cleaning
print("Data Retained after Cleaning:",round(((main_df.shape[0]/orig_shape[0]) * 100),2),"%")
Data Retained after Cleaning: 98.55 %
In [116]:
# First 10 rows (instances) of our dataset
main_df.head(10)
Out[116]:
In [117]:
# What are the features of our dataset?
print(main_df.columns)
Index(['ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Primary Type', 'Description', 'Location Description', 'Arrest', 'Domestic', 'Beat', 'District', 'Ward', 'Community Area', 'FBI Code', 'X Coordinate', 'Y Coordinate', 'Year', 'Updated On', 'Latitude', 'Longitude', 'Location'], dtype='object')
In [118]:
# Time Conversion Function
def time_convert(date_time):
    s1 = date_time[:11]
    s2 = date_time[11:]
    
    month = s1[:2]
    date = s1[3:5]
    year = s1[6:10]
    
    hr = s2[:2]
    mins = s2[3:5]
    sec = s2[6:8]
    time_frame = s2[9:]
    if(time_frame == 'PM'):
        if (int(hr) != 12):
            hr = str(int(hr) + 12)
    else:
        if(int(hr) == 12):
            hr = '00'
    
    final_date = datetime(int(year), int(month), int(date), int(hr), int(mins), int(sec))
    return final_date
In [119]:
# Using apply() of pandas to apply time_convert on every row of the Date column
main_df['Date'] = main_df['Date'].apply(time_convert)
In [120]:
main_df['Date'].head()
Out[120]:
1   2015-12-31 23:59:00
2   2015-12-31 23:55:00
3   2015-12-31 23:50:00
4   2015-12-31 23:50:00
5   2015-12-31 23:45:00
Name: Date, dtype: datetime64[ns]

Patterns => Stories => Important Decisions => More Business/Social Value

In [121]:
'''Feature Engineering with Numerical Data'''

# Feature Engineering 1 : Month
def month_col(x):
    return int(x.strftime("%m"))
main_df['Month'] = main_df['Date'].apply(month_col)

# Feature Engineering 2 : Day
def day_col(x):
    return int(x.strftime("%w"))
main_df['Day'] = main_df['Date'].apply(day_col)  

# Feature Engineering 3 : Hour
def hour_col(x):
    return int(x.strftime("%H"))
main_df['Hour'] = main_df['Date'].apply(hour_col)
In [122]:
print(main_df.head())
ID Case Number Date Block IUCR \ 1 10365064 HZ100370 2015-12-31 23:59:00 075XX S EMERALD AVE 1320 2 10364662 HZ100006 2015-12-31 23:55:00 079XX S STONY ISLAND AVE 0430 3 10364740 HZ100010 2015-12-31 23:50:00 024XX W FARGO AVE 0820 4 10364683 HZ100002 2015-12-31 23:50:00 037XX N CLARK ST 0460 5 10365142 HZ100722 2015-12-31 23:45:00 001XX E WACKER DR 0880 Primary Type Description Location Description \ 1 CRIMINAL DAMAGE TO VEHICLE STREET 2 BATTERY AGGRAVATED: OTHER DANG WEAPON STREET 3 THEFT $500 AND UNDER APARTMENT 4 BATTERY SIMPLE SIDEWALK 5 THEFT PURSE-SNATCHING SIDEWALK Arrest Domestic ... X Coordinate Y Coordinate Year \ 1 False False ... 1172605.0 1854931.0 2015 2 False False ... 1188223.0 1852840.0 2015 3 False False ... 1158878.0 1949369.0 2015 4 True False ... 1167786.0 1925033.0 2015 5 False False ... 1177683.0 1902638.0 2015 Updated On Latitude Longitude \ 1 02/10/2018 03:50:01 PM 41.757367 -87.642993 2 02/10/2018 03:50:01 PM 41.751270 -87.585822 3 02/10/2018 03:50:01 PM 42.016804 -87.690709 4 02/10/2018 03:50:01 PM 41.949837 -87.658635 5 02/10/2018 03:50:01 PM 41.888165 -87.622937 Location Month Day Hour 1 (41.757366519, -87.642992854) 12 4 23 2 (41.751270452, -87.585822373) 12 4 23 3 (42.016804165, -87.690708662) 12 4 23 4 (41.949837364, -87.658635101) 12 4 23 5 (41.888165132, -87.622937212) 12 4 23 [5 rows x 25 columns]
In [123]:
# Top 10 crimes in Chicago
top_10 = list(main_df['Primary Type'].value_counts().head(10).index)
top_10
Out[123]:
['THEFT',
 'BATTERY',
 'CRIMINAL DAMAGE',
 'ASSAULT',
 'OTHER OFFENSE',
 'DECEPTIVE PRACTICE',
 'NARCOTICS',
 'BURGLARY',
 'MOTOR VEHICLE THEFT',
 'ROBBERY']
In [124]:
# Next, filter all the crimes that lie in top_10

# Let's make it a function 
'''
1. Take in each crime and make a dataset of it
2. Append the sub datasets to each other
'''
def filter_top_10(df):
    df2=df[df['Primary Type']=='THEFT']
    for crime in top_10[1:]:
        temp=df[df['Primary Type']==crime]
        df2 = df2.append(temp, ignore_index=True)
    return df2
    
df2=filter_top_10(main_df) # the dataframe with all the data of only the top 10 crimes
df2.shape
Out[124]:
(1036588, 25)
In [125]:
df2.head()
Out[125]:
In [126]:
df2[['Domestic', 'Beat', 'District', 'Ward', 'Community Area', 'FBI Code', 'Location', 'X Coordinate', 'Y Coordinate']].head()
Out[126]:

What do each of the above features mean?

  • Domestic : Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.
  • Beat : Indicates the beat where the incident occurred. A beat is the smallest police geographic area – each beat has a dedicated police beat car. Three to five beats make up a police sector, and three sectors make up a police district. The Chicago Police Department has 22 police districts.
  • District : Indicates the police district where the incident occurred
  • Ward : The ward(City Council District) where the incident occurred
  • Community Are : Indicates the community area where the incident occurred. Chicago has 77 community areas.
  • FBI Code : Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS).
  1. First Attempt was to use all these location-type attributes given above to understand where a crime would happen
  2. Didn't work! Because of the vast number of wards and beats
  3. So, we use only Month, Day, District and Hour.
In [127]:
'''Classifying Crime Hotspots based on Crime Intensity'''
# Creating our explicit dataset
cri5 = df2.groupby(['Month','Day','District','Hour'], as_index=False).agg({"Primary Type":"count"})
cri5 = cri5.sort_values(by=['District'], ascending=False)
cri5.head()
Out[127]:

We are not including Year because it is not of prime importance in predicting a future crime.

In [128]:
# Renaming our feature
cri6=cri5.rename(index=str, columns={"Primary Type":"Crime_Count"})
cri6.head()
Out[128]:

NOTE

cri6 is our main dataset for all further operations.
In [129]:
cri6 = cri6[['Month','Day','District','Hour','Crime_Count']]
cri6.head()
print("The shape of our final dataset is:", cri6.shape)
The shape of our final dataset is: (44361, 5)
In [130]:
# Viewing the maximum and minmum crime counts
print("Highest Crime Count :", cri6["Crime_Count"].max())
print("Lowest Crime Count :", cri6["Crime_Count"].min())
Highest Crime Count : 93 Lowest Crime Count : 1
In [131]:
print("Average no. of crimes per month per day per district per hour :",round(cri6['Crime_Count'].sum()/cri6.shape[0], 2),".")
Average no. of crimes per month per day per district per hour : 23.37 .
In [158]:
lower = np.mean(cri6['Crime_Count'])-0.75*np.std(cri6['Crime_Count'])
higher = np.mean(cri6['Crime_Count'])+0.75*np.std(cri6['Crime_Count'])
print(lower, higher)
13.96855215504785 32.765651311961456
In [159]:
# Crime Count Distribution plot (We need to be using this plot in order to dividing our alarm rate)
plt.hist(x='Crime_Count', data=cri6,bins=90,linewidth=1,edgecolor='black', color='#163ca9')
#plt.title("Distribution of Crimes in Chicago", fontfamily="Agency FB", fontsize=25)
plt.xlabel("Crimes per month per district per hour per day")
plt.ylabel("Number of Occurences")
plt.savefig("Distribution of crimes.png")
Notebook Image
In [160]:
# 0-14 : Low Crime Rate
# 15-33 : Medium Crime Rate
# 34 and above : High Crime Rate

# Feature Engineer the above dataset
def crime_rate_assign(x):
    if(x<=14):
        return 0
    elif(x>14 and x<=33):
        return 1
    else:
        return 2
cri6['Alarm'] = cri6['Crime_Count'].apply(crime_rate_assign)
cri6 = cri6[['Month','Day','Hour','District','Crime_Count','Alarm']]    
cri6.head()
Out[160]:
In [161]:
# To store the above dataset as a csv file
cri6.to_csv(r'Crime_Compress.csv')
In [162]:
temp = cri6[['Month', 'Day', 'Hour', 'District', 'Alarm']]
sns.heatmap(temp.corr(), annot=True)
#plt.title("Checking!", fontsize=17)
plt.savefig("Correlation.png")
Notebook Image

NO CORRELATION!

In [163]:
# Let's check how good our data is for classification
cri6['Alarm'].value_counts()
Out[163]:
1    22640
0    12449
2     9272
Name: Alarm, dtype: int64
In [164]:
cri6[cri6['Alarm']==2].count()
Out[164]:
Month          9272
Day            9272
Hour           9272
District       9272
Crime_Count    9272
Alarm          9272
dtype: int64
In [165]:
print("Low Crime Rate Percentage:", round(cri6['Alarm'].value_counts()[0]/cri6['Alarm'].value_counts().sum()*100,2))
print("Medium Crime Rate Percentage:", round(cri6['Alarm'].value_counts()[1]/cri6['Alarm'].value_counts().sum()*100,2))
print("High Crime Rate Percentage:", round(cri6['Alarm'].value_counts()[2]/cri6['Alarm'].value_counts().sum()*100.2))
Low Crime Rate Percentage: 28.06 Medium Crime Rate Percentage: 51.04 High Crime Rate Percentage: 21.0

Oops!
Our classification dataset here is IMBALANCED!
Ways to deal with imbalanced datasets:

  • Oversampling
  • Undersampling
In [166]:
''' Plotting distributions for the features - Do we need to scale these or not? '''
print(cri6.head())
sns.kdeplot(cri6["Crime_Count"], shade=True)
plt.savefig("kdeDist")
Month Day Hour District Crime_Count Alarm 16369 5 2 9 31.0 1 0 2113 1 3 10 31.0 1 0 12673 4 2 10 31.0 1 0 1584 1 2 16 31.0 1 0 1583 1 2 13 31.0 1 0
C:\Users\Ramshankar Yadhunath\Anaconda3\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result. return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Notebook Image

Making our Final Test Dataset for the God Tests

In [167]:
'''Building our completely unseen final test dataset for the "GOD TEST 1"'''

# Load the Dataset
test_files = ['crimes_2013.csv', 'crimes_2012.csv', 'crimes_2014.csv']
test_df = create_df(test_files)
# Drop missing values
test_df = test_df.dropna()
# Using apply() of pandas to apply time_convert on every row of the Date column
test_df['Date'] = test_df['Date'].apply(time_convert)
# Feature Engineering our columns
test_df['Month'] = test_df['Date'].apply(month_col)
test_df['Day'] = test_df['Date'].apply(day_col)
test_df['Hour'] = test_df['Date'].apply(hour_col)
# Compressing
df7 = filter_top_10(test_df)
cri7 = df7.groupby(["Month", "Day", "District", "Hour"], as_index=False).agg({"Primary Type" : "count"})
cri7 = cri7.sort_values(by=["District"], ascending=False)
cri8 = cri7.rename(index=str, columns={"Primary Type" : "Crime_Count"})
cri8 = cri8[["Month", "Day", "District", "Hour", "Crime_Count"]]
cri8['Alarm'] = cri8['Crime_Count'].apply(crime_rate_assign)
cri8 = cri8[['Month','Day','Hour','District','Crime_Count','Alarm']]    
print(cri8.head())
print("Class Imbalance\n")
print(cri8['Alarm'].value_counts())
Finished Loading Chicago Crime Dataset File for the year 2013. Finished Loading Chicago Crime Dataset File for the year 2012. Finished Loading Chicago Crime Dataset File for the year 2014. All data files loaded onto the Main Dataframe. YOU ARE READY TO GO! Month Day Hour District Crime_Count Alarm 28465 8 4 20 31 1 0 24243 7 3 10 31 1 0 25299 7 5 17 31 1 0 32680 9 5 23 31 1 0 28464 8 4 4 31 1 0 Class Imbalance 1 22933 0 16979 2 4353 Name: Alarm, dtype: int64
In [175]:
'''Creating the Oversampled balanced dataset'''

from sklearn.utils import resample # for upsampling

# Set individual classes
cri6_low = cri6[cri6['Alarm']==0]
cri6_medium = cri6[cri6['Alarm']==1]
cri6_high = cri6[cri6['Alarm']==2]

# Upsample the minority classes to size of class 1 (medium)
cri6_low_upsampled = resample(cri6_low, 
                                 replace=True,     # sample with replacement
                                 n_samples=22640,    # to match majority class
                                 random_state=101) 

cri6_high_upsampled = resample(cri6_high, 
                                 replace=True,     # sample with replacement
                                 n_samples=22640,    # to match majority class
                                 random_state=101)

# Combine majority class with upsampled minority class
cri6_upsampled = pd.concat([cri6_medium, cri6_low_upsampled, cri6_high_upsampled])

The Modelling (God Test Included)

Algorithm 1 : Decision Trees

In [169]:
# Using Decision Trees for classification (Imbalanced Dataset)

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics 
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.utils.multiclass import unique_labels

X = cri6[['Month', 'Day', 'Hour', 'District']] # independent
y = cri6['Alarm'] # dependent

# Let's split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101) # 75:25 split

# print(X_train)
# print('hi')
# print(X_test)
# Creating tree
d_tree = DecisionTreeClassifier(random_state=101)
# Fitting tree
d_tree = d_tree.fit(X_train, y_train)
# Predicting !
y_pred = d_tree.predict(X_test)

# Model Evaluation
# print(y_test)
# print(y_pred)
print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
Accuracy: 71.73383824722748 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 2358 692 6 1 739 4086 876 2 1 821 1512 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.76 0.77 0.77 3056 1 0.73 0.72 0.72 5701 2 0.63 0.65 0.64 2334 micro avg 0.72 0.72 0.72 11091 macro avg 0.71 0.71 0.71 11091 weighted avg 0.72 0.72 0.72 11091 UAR -> 0.7120427114047846
In [176]:
# Using Decision Trees for classification (Balanced Dataset)

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics 
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.utils.multiclass import unique_labels

X = cri6_upsampled[['Month', 'Day', 'Hour', 'District']] # independent
y = cri6_upsampled['Alarm'] # dependent

# Let's split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101) # 75:25 split

# print(X_train)
# print('hi')
# print(X_test)
# Creating tree
d_tree = DecisionTreeClassifier(random_state=101)
# Fitting tree
d_tree = d_tree.fit(X_train, y_train)
# Predicting !
y_pred = d_tree.predict(X_test)

# Model Evaluation
# print(y_test)
# print(y_pred)
print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
Accuracy: 86.40164899882214 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 5333 300 2 1 716 3964 975 2 0 316 5374 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.88 0.95 0.91 5635 1 0.87 0.70 0.77 5655 2 0.85 0.94 0.89 5690 micro avg 0.86 0.86 0.86 16980 macro avg 0.86 0.86 0.86 16980 weighted avg 0.86 0.86 0.86 16980 UAR -> 0.8639476503835563
In [177]:
'''God Test 1 : Decision Trees'''

X = cri8.iloc[:,0:4].values
y = cri8.iloc[:,5].values

# Testing directly
y_pred = d_tree.predict(X)

print("Accuracy:",(metrics.accuracy_score(y, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
Accuracy: 65.3857449452163 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 10913 5829 237 1 1447 15006 6480 2 7 1322 3024 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.88 0.64 0.74 16979 1 0.68 0.65 0.67 22933 2 0.31 0.69 0.43 4353 micro avg 0.65 0.65 0.65 44265 macro avg 0.62 0.66 0.61 44265 weighted avg 0.72 0.65 0.67 44265 UAR -> 0.6639231214951584
In [178]:
# Let's try with KFold cross validation
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=100, shuffle=False)

X = cri6.iloc[:,0:4].values
y = cri6.iloc[:,5].values

i=1
scores = []
for train_index, test_index in skf.split(X, y):
    #print('{} of KFold {}'.format(i,skf.n_splits))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    d_tree = DecisionTreeClassifier(random_state=101)
    # Fitting tree
    d_tree = d_tree.fit(X_train, y_train)
    # Predicting !
    y_pred = d_tree.predict(X_test)
    
    # Model Evaluation
    # print(y_test)
    # print(y_pred)
    scores.append(metrics.accuracy_score(y_test, y_pred)*100)
    #print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

# Accuracy
print("Accuracy:",np.mean(scores),"\n")   

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
Accuracy: 59.34723875040715 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 110 13 1 1 53 81 92 2 0 5 87 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.67 0.89 0.77 124 1 0.82 0.36 0.50 226 2 0.48 0.95 0.64 92 micro avg 0.63 0.63 0.63 442 macro avg 0.66 0.73 0.63 442 weighted avg 0.71 0.63 0.60 442 UAR -> 0.7303853425842032

Algorithm 2 : Random Forest

In [179]:
# Using Random Forest for classification (Imbalanced Dataset)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.externals import joblib

X = cri6.iloc[:,0:4].values
y = cri6.iloc[:,5].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 101)

#scaler = StandardScaler()
#X_train = scaler.fit_transform(X_train)
#X_test = scaler.transform(X_test)

classifier = RandomForestClassifier(n_estimators = 1000, criterion = 'entropy', random_state = 101)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
Accuracy: 78.35181678838698 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 2484 570 2 1 525 4627 549 2 0 755 1579 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.83 0.81 0.82 3056 1 0.78 0.81 0.79 5701 2 0.74 0.68 0.71 2334 micro avg 0.78 0.78 0.78 11091 macro avg 0.78 0.77 0.77 11091 weighted avg 0.78 0.78 0.78 11091 UAR -> 0.7669867390092366
In [180]:
# Using Random Forest for classification (Balanced Dataset)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.externals import joblib

X = cri6_upsampled.iloc[:,0:4].values
y = cri6_upsampled.iloc[:,5].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 101)

#scaler = StandardScaler()
#X_train = scaler.fit_transform(X_train)
#X_test = scaler.transform(X_test)

classifier = RandomForestClassifier(n_estimators = 1000, criterion = 'entropy', random_state = 101)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
Accuracy: 88.65135453474676 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 5403 232 0 1 590 4206 859 2 0 246 5444 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.90 0.96 0.93 5635 1 0.90 0.74 0.81 5655 2 0.86 0.96 0.91 5690 micro avg 0.89 0.89 0.89 16980 macro avg 0.89 0.89 0.88 16980 weighted avg 0.89 0.89 0.88 16980 UAR -> 0.8864538612435692
In [181]:
'''God Test 1 : Random Forest'''

X = cri8.iloc[:,0:4].values
y = cri8.iloc[:,5].values

# Testing directly
y_pred = classifier.predict(X)

print("Accuracy:",(metrics.accuracy_score(y, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
Accuracy: 66.24647012312211 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 11073 5683 223 1 1336 15168 6429 2 5 1265 3083 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.89 0.65 0.75 16979 1 0.69 0.66 0.67 22933 2 0.32 0.71 0.44 4353 micro avg 0.66 0.66 0.66 44265 macro avg 0.63 0.67 0.62 44265 weighted avg 0.73 0.66 0.68 44265 UAR -> 0.6739368989752799
In [52]:
# Using Random Forest for classification (Imbalanced Dataset) (using k-fold)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.externals import joblib

X = cri6.iloc[:,0:4].values
y = cri6.iloc[:,5].values

scores = []
for train_index, test_index in skf.split(X, y):
    #print('{} of KFold {}'.format(i,skf.n_splits))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    classifier = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 101)
    classifier.fit(X_train, y_train)

    y_pred = classifier.predict(X_test)
    
    # Model Evaluation
    # print(y_test)
    # print(y_pred)
    scores.append(metrics.accuracy_score(y_test, y_pred)*100)
    #print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

#scaler = StandardScaler()
#X_train = scaler.fit_transform(X_train)
#X_test = scaler.transform(X_test)

# Accuracy
print("Accuracy:",np.mean(scores),"\n") 

cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
Accuracy: 66.53434303377871 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 125 10 1 1 41 88 103 2 0 2 72 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.75 0.92 0.83 136 1 0.88 0.38 0.53 232 2 0.41 0.97 0.58 74 micro avg 0.64 0.64 0.64 442 macro avg 0.68 0.76 0.64 442 weighted avg 0.76 0.64 0.63 442 UAR -> 0.7571336549531275
In [53]:
#### plt.style.use('ggplot')
x=['Low (0)','Medium (1)','High (2)']
y=[13600, 23273, 7488]
fig, ax = plt.subplots(figsize=(3, 4))
plt.bar(x,y, color=['green', 'blue', 'red'], width=0.5)
# plt.title('THE IMBALANCE IN THE DATASET')
plt.xlabel('Alarm Rate Classification')
plt.ylabel('Count of Crimes')
plt.savefig("imbal.png")
Notebook Image

Algorithm 3 : Naive Bayes

In [182]:
# Using the Naive Bayes' Classifier[GaussianNB] (Imbalanced Dataset)
'''NOTE : The imbalanced dataset  when used caused NB classifier to not predict class 2 (high crime rate)'''

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn import metrics 
from sklearn.metrics import confusion_matrix, classification_report

X = cri6.iloc[:,0:4].values
y = cri6.iloc[:,5].values

# Let's split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101) # 75:25 split

#Create a Gaussian Classifier
gnb = GaussianNB()

#Train the model using the training sets
gnb.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = gnb.predict(X_test)


# Model Evaluation
# print(y_test)
# print(y_pred)
print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
Accuracy: 58.50689748444685 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 1903 1147 6 1 1114 4425 162 2 142 2031 161 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.60 0.62 0.61 3056 1 0.58 0.78 0.67 5701 2 0.49 0.07 0.12 2334 micro avg 0.59 0.59 0.59 11091 macro avg 0.56 0.49 0.47 11091 weighted avg 0.57 0.59 0.54 11091 UAR -> 0.48928977768001497
In [183]:
# Using the Naive Bayes' Classifier[GaussianNB] (Balanced Dataset)
'''NOTE : The imbalanced dataset  when used caused NB classifier to not predict class 2 (high crime rate)'''

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn import metrics 
from sklearn.metrics import confusion_matrix, classification_report

X = cri6_upsampled.iloc[:,0:4].values
y = cri6_upsampled.iloc[:,5].values

# Let's split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101) # 75:25 split

#Create a Gaussian Classifier
gnb = GaussianNB()

#Train the model using the training sets
gnb.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = gnb.predict(X_test)


# Model Evaluation
# print(y_test)
# print(y_pred)
print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
Accuracy: 59.840989399293285 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 4844 550 241 1 1976 1132 2547 2 647 858 4185 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.65 0.86 0.74 5635 1 0.45 0.20 0.28 5655 2 0.60 0.74 0.66 5690 micro avg 0.60 0.60 0.60 16980 macro avg 0.56 0.60 0.56 16980 weighted avg 0.56 0.60 0.56 16980 UAR -> 0.5984350141955873
In [184]:
'''God Test 1 : Naive Bayes'''

X = cri8.iloc[:,0:4].values
y = cri8.iloc[:,5].values

# Testing directly
y_pred = gnb.predict(X)

print("Accuracy:",(metrics.accuracy_score(y, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
Accuracy: 47.014571331751945 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 12979 2118 1882 1 6131 4589 12213 2 504 606 3243 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.66 0.76 0.71 16979 1 0.63 0.20 0.30 22933 2 0.19 0.75 0.30 4353 micro avg 0.47 0.47 0.47 44265 macro avg 0.49 0.57 0.44 44265 weighted avg 0.60 0.47 0.46 44265 UAR -> 0.569840988001759
In [185]:
# Using the Naive Bayes' Classifier[GaussianNB] (Imbalanced Dataset) (k fold)
'''NOTE : The imbalanced dataset  when used caused NB classifier to not predict class 2 (high crime rate)'''

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn import metrics 
from sklearn.metrics import confusion_matrix, classification_report

X = cri6.iloc[:,0:4].values
y = cri6.iloc[:,5].values

# Let's split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101) # 75:25 split

scores = []
for train_index, test_index in skf.split(X, y):
    #print('{} of KFold {}'.format(i,skf.n_splits))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    #Create a Gaussian Classifier
    gnb = GaussianNB()

    #Train the model using the training sets
    gnb.fit(X_train, y_train)

    #Predict the response for test dataset
    y_pred = gnb.predict(X_test)


    # Model Evaluation
    # print(y_test)
    # print(y_pred)
    scores.append(metrics.accuracy_score(y_test, y_pred)*100)
    #print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

#scaler = StandardScaler()
#X_train = scaler.fit_transform(X_train)
#X_test = scaler.transform(X_test)

# Accuracy
print("Accuracy:",np.mean(scores),"\n") 

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
Accuracy: 58.05841127795969 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 26 98 0 1 32 189 5 2 0 92 0 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.45 0.21 0.29 124 1 0.50 0.84 0.62 226 2 0.00 0.00 0.00 92 micro avg 0.49 0.49 0.49 442 macro avg 0.32 0.35 0.30 442 weighted avg 0.38 0.49 0.40 442 UAR -> 0.3486535350651822

Algorithm 4 : KNN Classifier

In [186]:
# KNN Classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split

X = cri6.iloc[:,0:4].values
y = cri6.iloc[:,5].values

# Let's split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1) # 75:25 split

'''We need to decide the optimal value for k. So, let us do that.'''
k_vals = range(1,30)
acc = []
for k in k_vals:
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    acc.append(metrics.accuracy_score(y_test, y_pred))
# plot the graph
plt.plot(k_vals,acc)
plt.xlabel('Value of k')
plt.ylabel('Accuracy')
plt.title('Choosing k value for KNN Algorithm')
Out[186]:
Text(0.5, 1.0, 'Choosing k value for KNN Algorithm')
Notebook Image
In [187]:
'''KNN Classifier on the imbalanced dataset itself'''
# KNN Classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split

X = cri6.iloc[:,0:4].values
y = cri6.iloc[:,5].values

# Let's split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101) # 75:25 split

# Choosing k as 5 (Seems to be the best value)
knn1 = KNeighborsClassifier(n_neighbors = 5)
knn1.fit(X_train, y_train)
y_pred = knn1.predict(X_test)

# Model Evaluation
# print(y_test)
# print(y_pred)
print('KNN Classifier on the imbalanced dataset itself')
print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
KNN Classifier on the imbalanced dataset itself Accuracy: 75.89036155441349 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 2451 604 1 1 630 4525 546 2 13 880 1441 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.79 0.80 0.80 3056 1 0.75 0.79 0.77 5701 2 0.72 0.62 0.67 2334 micro avg 0.76 0.76 0.76 11091 macro avg 0.76 0.74 0.75 11091 weighted avg 0.76 0.76 0.76 11091 UAR -> 0.7377147419109287
In [188]:
'''KNN Classifier on the balanced dataset'''
X = cri6_upsampled.iloc[:,0:4].values
y = cri6_upsampled.iloc[:,5].values

# Let's split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101) # 75:25 split

# Choosing k as 1 (Seems to be the best value)
knn2 = KNeighborsClassifier(n_neighbors = 5)
knn2.fit(X_train, y_train)
y_pred = knn2.predict(X_test)

# Model Evaluation
# print(y_test)
# print(y_pred)
print('\n\nKNN Classifier on the upsampled dataset')
print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
KNN Classifier on the upsampled dataset Accuracy: 80.10011778563015 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 5180 434 21 1 968 3372 1315 2 44 597 5049 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.84 0.92 0.88 5635 1 0.77 0.60 0.67 5655 2 0.79 0.89 0.84 5690 micro avg 0.80 0.80 0.80 16980 macro avg 0.80 0.80 0.79 16980 weighted avg 0.80 0.80 0.79 16980 UAR -> 0.8009624506582531
In [189]:
'''God Test 1 : KNN'''

X = cri8.iloc[:,0:4].values
y = cri8.iloc[:,5].values

# Testing directly
y_pred = knn1.predict(X)

print("Accuracy:",(metrics.accuracy_score(y, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
Accuracy: 69.27821077600814 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 11114 5724 141 1 1150 16714 5069 2 10 1505 2838 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.91 0.65 0.76 16979 1 0.70 0.73 0.71 22933 2 0.35 0.65 0.46 4353 micro avg 0.69 0.69 0.69 44265 macro avg 0.65 0.68 0.64 44265 weighted avg 0.74 0.69 0.71 44265 UAR -> 0.6784520639672884
In [71]:
'''KNN Classifier on the imbalanced dataset itself - Using k-fold validation'''
# KNN Classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split

X = cri6.iloc[:,0:4].values
y = cri6.iloc[:,5].values

scores = []
for train_index, test_index in skf.split(X, y):
    #print('{} of KFold {}'.format(i,skf.n_splits))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Choosing k as 5 (Seems to be the best value)
    knn = KNeighborsClassifier(n_neighbors = 5)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    # Model Evaluation
    # print(y_test)
    # print(y_pred)
    scores.append(metrics.accuracy_score(y_test, y_pred)*100)
    #print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")


# Model Evaluation
# print(y_test)
# print(y_pred)
print('KNN Classifier on the imbalanced dataset itself')
print("Accuracy:",(np.mean(scores)),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)

##################################################################################################

'''KNN Classifier on the upsampled dataset'''
X = cri6_upsampled.iloc[:,0:4].values
y = cri6_upsampled.iloc[:,5].values

scores = []
for train_index, test_index in skf.split(X, y):
    #print('{} of KFold {}'.format(i,skf.n_splits))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Choosing k as 5 (Seems to be the best value)
    knn = KNeighborsClassifier(n_neighbors = 5)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    # Model Evaluation
    # print(y_test)
    # print(y_pred)
    scores.append(metrics.accuracy_score(y_test, y_pred)*100)
    #print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

# Model Evaluation
# print(y_test)
# print(y_pred)
print('\n\nKNN Classifier on the upsampled dataset')
print("Accuracy:",(np.mean(scores)),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
KNN Classifier on the imbalanced dataset itself Accuracy: 66.49466296986547 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 118 18 0 1 31 151 50 2 0 27 47 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.79 0.87 0.83 136 1 0.77 0.65 0.71 232 2 0.48 0.64 0.55 74 micro avg 0.71 0.71 0.71 442 macro avg 0.68 0.72 0.69 442 weighted avg 0.73 0.71 0.72 442 UAR -> 0.7178814209747273 KNN Classifier on the upsampled dataset Accuracy: 77.45164274086133 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 211 20 1 1 70 54 108 2 0 9 223 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.75 0.91 0.82 232 1 0.65 0.23 0.34 232 2 0.67 0.96 0.79 232 micro avg 0.70 0.70 0.70 696 macro avg 0.69 0.70 0.65 696 weighted avg 0.69 0.70 0.65 696 UAR -> 0.7011494252873564

Algorithm 5 : SVM

In [190]:
# Support Vector Machines (Imbalanced dataset)
from sklearn import svm
from sklearn.model_selection import train_test_split

'''Balanced dataset with oversampling'''
X = cri6.iloc[:,0:4].values
y = cri6.iloc[:,5].values

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=101) # 75:25 split

#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel

#Train the model using the training sets
clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

# Model Evaluation
# print(y_test)
# print(y_pred)
print("SVM with oversampled balanced dataset")
print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
SVM with oversampled balanced dataset Accuracy: 58.3175547741412 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 Actual Alarm 0 1949 1107 1 1182 4519 2 145 2189 ----------Classification Report------------------------------------
C:\Users\Ramshankar Yadhunath\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1143: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. 'precision', 'predicted', average, warn_for) C:\Users\Ramshankar Yadhunath\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1143: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. 'precision', 'predicted', average, warn_for) C:\Users\Ramshankar Yadhunath\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1143: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. 'precision', 'predicted', average, warn_for)
precision recall f1-score support 0 0.59 0.64 0.62 3056 1 0.58 0.79 0.67 5701 2 0.00 0.00 0.00 2334 micro avg 0.58 0.58 0.58 11091 macro avg 0.39 0.48 0.43 11091 weighted avg 0.46 0.58 0.51 11091
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) ~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance) 3077 try: -> 3078 return self._engine.get_loc(key) 3079 except KeyError: pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item() pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item() KeyError: 2 During handling of the above exception, another exception occurred: KeyError Traceback (most recent call last) <ipython-input-190-f854c5f1d065> in <module>() 35 36 # Unweighted Average Recall ---> 37 print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3) ~\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key) 2686 return self._getitem_multilevel(key) 2687 else: -> 2688 return self._getitem_column(key) 2689 2690 def _getitem_column(self, key): ~\Anaconda3\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key) 2693 # get column 2694 if self.columns.is_unique: -> 2695 return self._get_item_cache(key) 2696 2697 # duplicate columns & possible reduce dimensionality ~\Anaconda3\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item) 2487 res = cache.get(item) 2488 if res is None: -> 2489 values = self._data.get(item) 2490 res = self._box_item_values(item, values) 2491 cache[item] = res ~\Anaconda3\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath) 4113 4114 if not isna(item): -> 4115 loc = self.items.get_loc(item) 4116 else: 4117 indexer = np.arange(len(self.items))[isna(self.items)] ~\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance) 3078 return self._engine.get_loc(key) 3079 except KeyError: -> 3080 return self._engine.get_loc(self._maybe_cast_indexer(key)) 3081 3082 indexer = self.get_indexer([key], method=method, tolerance=tolerance) pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item() pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item() KeyError: 2
In [191]:
# Support Vector Machines (Balanced dataset)
from sklearn import svm
from sklearn.model_selection import train_test_split

X = cri6_upsampled.iloc[:,0:4].values
y = cri6_upsampled.iloc[:,5].values

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=101) # 75:25 split

#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel

#Train the model using the training sets
clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

# Model Evaluation
# print(y_test)
# print(y_pred)
print("SVM with oversampled balanced dataset")
print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
SVM with oversampled balanced dataset Accuracy: 59.252061248527674 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 4779 560 296 1 1889 1455 2311 2 597 1266 3827 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.66 0.85 0.74 5635 1 0.44 0.26 0.33 5655 2 0.59 0.67 0.63 5690 micro avg 0.59 0.59 0.59 16980 macro avg 0.57 0.59 0.57 16980 weighted avg 0.57 0.59 0.57 16980 UAR -> 0.5926567299625812
In [192]:
'''God Test 1 : SVM'''

X = cri8.iloc[:,0:4].values
y = cri8.iloc[:,5].values

# Testing directly
y_pred = clf.predict(X)

print("Accuracy:",(metrics.accuracy_score(y, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
Accuracy: 49.22851010956738 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 12779 2267 1933 1 5807 6043 11083 2 468 916 2969 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.67 0.75 0.71 16979 1 0.65 0.26 0.38 22933 2 0.19 0.68 0.29 4353 micro avg 0.49 0.49 0.49 44265 macro avg 0.50 0.57 0.46 44265 weighted avg 0.61 0.49 0.50 44265 UAR -> 0.5660668987574827

Algorithm 6 : Logistic Regression

In [193]:
# Logistic Regression for imbalanced dataset
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

X = cri6.iloc[:,0:4].values
y = cri6.iloc[:,5].values

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=101) # 75:25 split

logreg = LogisticRegression(C=1e5, solver='lbfgs', multi_class='multinomial')

# Create an instance of Logistic Regression Classifier and fit the data.
logreg.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = logreg.predict(X_test)

# Model Evaluation
# print(y_test)
# print(y_pred)
print("Logistic Regression with imbalanced dataset")
print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)


Logistic Regression with imbalanced dataset Accuracy: 56.126589126318635 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 1663 1374 19 1 1023 4414 264 2 138 2048 148 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.59 0.54 0.57 3056 1 0.56 0.77 0.65 5701 2 0.34 0.06 0.11 2334 micro avg 0.56 0.56 0.56 11091 macro avg 0.50 0.46 0.44 11091 weighted avg 0.52 0.56 0.51 11091 UAR -> 0.4606119927939933
In [194]:
# Logistic Regression for balanced dataset
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

X = cri6_upsampled.iloc[:,0:4].values
y = cri6_upsampled.iloc[:,5].values

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=101) # 75:25 split

logreg = LogisticRegression(C=1e5, solver='lbfgs', multi_class='multinomial')

# Create an instance of Logistic Regression Classifier and fit the data.
logreg.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = logreg.predict(X_test)

# Model Evaluation
# print(y_test)
# print(y_pred)
print("Logistic Regression with balanced dataset")
print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)


Logistic Regression with imbalanced dataset Accuracy: 58.8751472320377 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 4581 648 406 1 1700 1719 2236 2 542 1451 3697 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.67 0.81 0.74 5635 1 0.45 0.30 0.36 5655 2 0.58 0.65 0.61 5690 micro avg 0.59 0.59 0.59 16980 macro avg 0.57 0.59 0.57 16980 weighted avg 0.57 0.59 0.57 16980 UAR -> 0.5888899688568144
In [195]:
'''God Test 1 : Logistic Regression'''

X = cri8.iloc[:,0:4].values
y = cri8.iloc[:,5].values

# Testing directly
y_pred = logreg.predict(X)

print("Accuracy:",(metrics.accuracy_score(y, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
Accuracy: 49.632892804698976 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 12095 2679 2205 1 5241 7025 10667 2 422 1081 2850 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.68 0.71 0.70 16979 1 0.65 0.31 0.42 22933 2 0.18 0.65 0.28 4353 micro avg 0.50 0.50 0.50 44265 macro avg 0.50 0.56 0.47 44265 weighted avg 0.62 0.50 0.51 44265 UAR -> 0.5577995198927558
In [196]:
# Logistic Regression for imbalanced dataset (k fold)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

X = cri6.iloc[:,0:4].values
y = cri6.iloc[:,5].values

scores = []
for train_index, test_index in skf.split(X, y):
    #print('{} of KFold {}'.format(i,skf.n_splits))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Choosing k as 5 (Seems to be the best value)
    logreg = LogisticRegression(C=1e5, solver='lbfgs', multi_class='multinomial')

    # Create an instance of Logistic Regression Classifier and fit the data.
    logreg.fit(X_train, y_train)

    #Predict the response for test dataset
    y_pred = logreg.predict(X_test)
    # Model Evaluation
    # print(y_test)
    # print(y_pred)
    scores.append(metrics.accuracy_score(y_test, y_pred)*100)
    #print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

# Model Evaluation
# print(y_test)
# print(y_pred)
print("Logistic Regression with imbalanced dataset")
print("Accuracy:",(np.mean(scores)),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)


Logistic Regression with imbalanced dataset Accuracy: 55.89737291023121 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 14 107 3 1 33 153 40 2 0 86 6 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.30 0.11 0.16 124 1 0.44 0.68 0.53 226 2 0.12 0.07 0.09 92 micro avg 0.39 0.39 0.39 442 macro avg 0.29 0.29 0.26 442 weighted avg 0.34 0.39 0.34 442 UAR -> 0.28503725585109246

Algorithm 7 : Linear Discriminant Analysis

In [ ]:
# Linear Discriminant Analysis for imbalanced dataset
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn import metrics

X = cri6.iloc[:,0:4].values
y = cri6.iloc[:,5].values

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=1) # 75:25 split

clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

# Model Evaluation
# print(y_test)
# print(y_pred)
print("LDA with imbalanced dataset")
print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
In [ ]:
'''God Test 8 : LDA'''

X = cri8.iloc[:,0:4].values
y = cri8.iloc[:,5].values

# Testing directly
y_pred = logreg.predict(X)

print("Accuracy:",(metrics.accuracy_score(y, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
In [43]:
# Linear Discriminant Analysis for balanced dataset
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn import metrics

X = cri6_upsampled.iloc[:,0:4].values
y = cri6_upsampled.iloc[:,5].values

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=1) # 75:25 split

clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

# Model Evaluation
# print(y_test)
# print(y_pred)
print("LDA with balanced oversampled dataset")
print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
LDA with balanced oversampled dataset Accuracy: 58.52191349183615 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 4730 627 438 1 1826 1733 2372 2 538 1439 3752 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.67 0.82 0.73 5795 1 0.46 0.29 0.36 5931 2 0.57 0.65 0.61 5729 micro avg 0.59 0.59 0.59 17455 macro avg 0.56 0.59 0.57 17455 weighted avg 0.56 0.59 0.57 17455 UAR -> 0.587776012273459

Algorithm 8 : Qudratic Discriminant Analysis

In [44]:
# Quadratic Discriminant Analysis for imbalanced dataset
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn import metrics

X = cri6.iloc[:,0:4].values
y = cri6.iloc[:,5].values

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=1) # 75:25 split

clf = QuadraticDiscriminantAnalysis()
clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

# Model Evaluation
# print(y_test)
# print(y_pred)
print("QDA with imbalanced dataset")
print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
QDA with imbalanced dataset Accuracy: 61.626544044720944 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 2250 1124 1 1 1233 4552 64 2 126 1708 33 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.62 0.67 0.64 3375 1 0.62 0.78 0.69 5849 2 0.34 0.02 0.03 1867 micro avg 0.62 0.62 0.62 11091 macro avg 0.53 0.49 0.46 11091 weighted avg 0.57 0.62 0.56 11091 UAR -> 0.4875315915130356
In [45]:
# Quadratic Discriminant Analysis for imbalanced dataset
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn import metrics

X = cri6_upsampled.iloc[:,0:4].values
y = cri6_upsampled.iloc[:,5].values

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=1) # 75:25 split

clf = QuadraticDiscriminantAnalysis()
clf.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

# Model Evaluation
# print(y_test)
# print(y_pred)
print("QDA with oversampled balanced dataset")
print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
QDA with oversampled balanced dataset Accuracy: 59.3067888857061 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 4899 633 263 1 2053 1301 2577 2 676 901 4152 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.64 0.85 0.73 5795 1 0.46 0.22 0.30 5931 2 0.59 0.72 0.65 5729 micro avg 0.59 0.59 0.59 17455 macro avg 0.56 0.60 0.56 17455 weighted avg 0.56 0.59 0.56 17455 UAR -> 0.5964912295361838

Algorithm 9 : Gradient Boosting Tree

In [197]:
# Gradient Boosting with imbalanced dataset
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

X = cri6.iloc[:,0:4].values
y = cri6.iloc[:,5].values

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=1) # 75:25 split

gbc = GradientBoostingClassifier(n_estimators=1000)
gbc.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = gbc.predict(X_test)

# Model Evaluation
# print(y_test)
# print(y_pred)
print("Gradient Boosting with imbalanced dataset")
print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
Gradient Boosting with imbalanced dataset Accuracy: 80.49770083851772 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 2605 516 0 1 428 4714 475 2 1 743 1609 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.86 0.83 0.85 3121 1 0.79 0.84 0.81 5617 2 0.77 0.68 0.73 2353 micro avg 0.80 0.80 0.80 11091 macro avg 0.81 0.79 0.80 11091 weighted avg 0.81 0.80 0.80 11091 UAR -> 0.7859047692466056
In [198]:
# Gradient Boosting with balanced dataset
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

X = cri6_upsampled.iloc[:,0:4].values
y = cri6_upsampled.iloc[:,5].values

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=1) # 75:25 split

gbc = GradientBoostingClassifier(n_estimators=1000)
gbc.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = gbc.predict(X_test)

# Model Evaluation
# print(y_test)
# print(y_pred)
print("Gradient Boosting with imbalanced dataset")
print("Accuracy:",(metrics.accuracy_score(y_test, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y_test, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y_test,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
Gradient Boosting with imbalanced dataset Accuracy: 81.32508833922262 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 4887 644 1 1 667 4078 1054 2 1 804 4844 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.88 0.88 0.88 5532 1 0.74 0.70 0.72 5799 2 0.82 0.86 0.84 5649 micro avg 0.81 0.81 0.81 16980 macro avg 0.81 0.81 0.81 16980 weighted avg 0.81 0.81 0.81 16980 UAR -> 0.8147090786441811
In [199]:
'''God Test : GBT'''

X = cri8.iloc[:,0:4].values
y = cri8.iloc[:,5].values

# Testing directly
y_pred = gbc.predict(X)

print("Accuracy:",(metrics.accuracy_score(y, y_pred)*100),"\n")

# Confusion Matrix for evaluating the model
cm = pd.crosstab(y, y_pred, rownames=['Actual Alarm'], colnames=['Predicted Alarm'])
print("\n----------Confusion Matrix------------------------------------")
print(cm)

# Classification Report
print("\n----------Classification Report------------------------------------")
print(classification_report(y,y_pred))

# Unweighted Average Recall
print("\nUAR ->",((cm[0][0])/(cm[0][0]+cm[1][0]+cm[2][0])+(cm[1][1])/(cm[0][1]+cm[1][1]+cm[2][1])+(cm[2][2])/(cm[2][2]+cm[0][2]+cm[1][2]))/3)
Accuracy: 67.48446854173726 ----------Confusion Matrix------------------------------------ Predicted Alarm 0 1 2 Actual Alarm 0 12311 4447 221 1 1201 13778 7954 2 0 570 3783 ----------Classification Report------------------------------------ precision recall f1-score support 0 0.91 0.73 0.81 16979 1 0.73 0.60 0.66 22933 2 0.32 0.87 0.46 4353 micro avg 0.67 0.67 0.67 44265 macro avg 0.65 0.73 0.64 44265 weighted avg 0.76 0.67 0.70 44265 UAR -> 0.731640529234566

Using Neural Networks to find a better result

We are trying to beat the 90% accuracy barrier!
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
import numpy as np

# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

X = cri6.iloc[:,0:4].values
y = cri6.iloc[:,5].values

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=1) # 75:25 split

'''
Our Neural Network Model

1. Let's try a one hidden layer neural network
2. Hidden layer has 10 nodes
3. The number of features we have as our input dimension is 4 (not taking the "Crime_Count" feature)
4. The number of classes to be classified into are 3

'''

# define baseline model
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(10, input_dim=4, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

estimator = KerasClassifier(build_fn=baseline_model, epochs=200, batch_size=5, verbose=1)

# Evaluating our model
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
Deep Learning Architecture : Do you even need it?

A basic model with 2 layers and 8 nodes in the hidden layer gives only about 68% accuracy after almost 200 epochs.
So, we can either try building a new complex one or just drop the idea of a DL approach to this problem.

In [48]:
print(5)
5
In [56]:
'''Neural Network Architecture 2'''

# Importing the required headers
from keras.wrappers.scikit_learn import KerasClassifier
from keras.utils import np_utils
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D, AveragePooling1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

# fix random seed for reproducibility
seed = 7
np.random.seed(seed)

def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(10, input_dim=4, activation='relu'))
    model.add(Dense(3, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    return model

estimator = KerasClassifier(build_fn=baseline_model, epochs=3, batch_size=100, verbose=1)

# Evaluating our model
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
C:\Users\Ramshankar Yadhunath\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py:542: FutureWarning: From version 0.22, errors during fit will result in a cross validation score of NaN by default. Use error_score='raise' if you want an exception raised or error_score=np.nan to adopt the behavior from version 0.22. FutureWarning)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-56-39d9e96c1611> in <module>() 33 # Evaluating our model 34 kfold = KFold(n_splits=10, shuffle=True, random_state=seed) ---> 35 results = cross_val_score(estimator, X, y, cv=kfold) 36 print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100)) ~\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, error_score) 400 fit_params=fit_params, 401 pre_dispatch=pre_dispatch, --> 402 error_score=error_score) 403 return cv_results['test_score'] 404 ~\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score) 238 return_times=True, return_estimator=return_estimator, 239 error_score=error_score) --> 240 for train, test in cv.split(X, y, groups)) 241 242 zipped_scores = list(zip(*scores)) ~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self, iterable) 915 # remaining jobs. 916 self._iterating = False --> 917 if self.dispatch_one_batch(iterator): 918 self._iterating = self._original_iterator is not None 919 ~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in dispatch_one_batch(self, iterator) 757 return False 758 else: --> 759 self._dispatch(tasks) 760 return True 761 ~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in _dispatch(self, batch) 714 with self._lock: 715 job_idx = len(self._jobs) --> 716 job = self._backend.apply_async(batch, callback=cb) 717 # A job can complete so quickly than its callback is 718 # called before we get here, causing self._jobs to ~\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in apply_async(self, func, callback) 180 def apply_async(self, func, callback=None): 181 """Schedule a func to be run""" --> 182 result = ImmediateResult(func) 183 if callback: 184 callback(result) ~\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in __init__(self, batch) 547 # Don't delay the application, to avoid keeping the input 548 # arguments in memory --> 549 self.results = batch() 550 551 def get(self): ~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self) 223 with parallel_backend(self._backend, n_jobs=self._n_jobs): 224 return [func(*args, **kwargs) --> 225 for func, args, kwargs in self.items] 226 227 def __len__(self): ~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in <listcomp>(.0) 223 with parallel_backend(self._backend, n_jobs=self._n_jobs): 224 return [func(*args, **kwargs) --> 225 for func, args, kwargs in self.items] 226 227 def __len__(self): ~\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score) 526 estimator.fit(X_train, **fit_params) 527 else: --> 528 estimator.fit(X_train, y_train, **fit_params) 529 530 except Exception as e: ~\Anaconda3\lib\site-packages\keras\wrappers\scikit_learn.py in fit(self, x, y, sample_weight, **kwargs) 208 if sample_weight is not None: 209 kwargs['sample_weight'] = sample_weight --> 210 return super(KerasClassifier, self).fit(x, y, **kwargs) 211 212 def predict(self, x, **kwargs): ~\Anaconda3\lib\site-packages\keras\wrappers\scikit_learn.py in fit(self, x, y, **kwargs) 150 fit_args.update(kwargs) 151 --> 152 history = self.model.fit(x, y, **fit_args) 153 154 return history ~\Anaconda3\lib\site-packages\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs) 950 sample_weight=sample_weight, 951 class_weight=class_weight, --> 952 batch_size=batch_size) 953 # Prepare validation data. 954 do_validation = False ~\Anaconda3\lib\site-packages\keras\engine\training.py in _standardize_user_data(self, x, y, sample_weight, class_weight, check_array_lengths, batch_size) 787 feed_output_shapes, 788 check_batch_axis=False, # Don't enforce the batch size. --> 789 exception_prefix='target') 790 791 # Generate sample-wise weight values given the `sample_weight` and ~\Anaconda3\lib\site-packages\keras\engine\training_utils.py in standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix) 136 ': expected ' + names[i] + ' to have shape ' + 137 str(shape) + ' but got array with shape ' + --> 138 str(data_shape)) 139 return data 140 ValueError: Error when checking target: expected dense_4 to have shape (3,) but got array with shape (1,)
In [ ]: