
## Deep Learning for tabular data using PyTorch

Problem Statement: Given certain features about a shelter animal (like age, sex, color, breed), predict its outcome.

There are 5 possible outcomes: Return_to_owner, Euthanasia, Adoption, Transfer, and Died. We are expected to predict the probability of an animal's outcome belonging to each of these 5 categories.
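Competitions with this setup are typically scored with multi-class log loss, i.e. the average negative log-probability assigned to the true outcome. As a reference, here is a minimal sketch of that metric (the `probs` and class indices below are made-up placeholder values):

``````import numpy as np

def multiclass_log_loss(probs, true_idx, eps=1e-15):
    # probs: (n_rows, 5) predicted probabilities; true_idx: (n_rows,) true class indices
    probs = np.clip(probs, eps, 1 - eps)  # guard against log(0)
    return -np.mean(np.log(probs[np.arange(len(true_idx)), true_idx]))

# toy example: two animals, five outcome classes
probs = np.array([[0.7, 0.05, 0.05, 0.1, 0.1],
                  [0.2, 0.1, 0.1, 0.3, 0.3]])
print(multiclass_log_loss(probs, np.array([0, 4])))  # lower is better``````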

### Library imports

In :
``````import pandas as pd
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import torch
import torch.optim as torch_optim
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from datetime import datetime``````

##### Training set
In :
``````train = pd.read_csv('train.csv')
print("Shape:", train.shape)
``````
```Shape: (26729, 10) ```
##### Test set
In :
``````test = pd.read_csv('test.csv')
print("Shape:", test.shape)
``````
```Shape: (11456, 8) ```
##### Sample submission file

For each row, the predicted probability of each outcome needs to be filled into the corresponding column

In :
``````sample = pd.read_csv('sample_submission.csv')
``````
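To make the expected layout concrete, a uniform-probability baseline would simply fill 0.2 into each outcome column. This is a hypothetical sketch; the column names match the ones used at the end of this notebook:

``````baseline = sample.copy()
for col in ['Adoption', 'Died', 'Euthanasia', 'Return_to_owner', 'Transfer']:
    baseline[col] = 0.2  # equal probability for each of the 5 outcomes``````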

### Very basic data exploration

##### How balanced is the dataset?

Adoption and Transfer seem to occur a lot more than the rest

In :
``Counter(train['OutcomeType'])``
Out:
``````Counter({'Adoption': 10769,
         'Transfer': 9422,
         'Return_to_owner': 4786,
         'Euthanasia': 1555,
         'Died': 197})``````
##### What are the most common names and how many times do they occur?

There seem to be too many NaN values, and Name is unlikely to be a very important feature anyway

In :
``Counter(train['Name']).most_common(5)``
Out:
``[(nan, 7691), ('Max', 136), ('Bella', 135), ('Charlie', 107), ('Daisy', 106)]``

### Data preprocessing

The OutcomeSubtype column seems to be of no use, so we drop it. AnimalID is unique for every row, so it doesn't help in training either

In :
``````train_X = train.drop(columns= ['OutcomeType', 'OutcomeSubtype', 'AnimalID'])
Y = train['OutcomeType']
test_X = test
``````
##### Stacking train and test set so that they undergo the same preprocessing
In :
``````# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
stacked_df = pd.concat([train_X, test_X.drop(columns=['ID'])], ignore_index=True)``````
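Stacking matters because `LabelEncoder` can only transform values it saw during `fit`, so a category that appears only in the test set (a rare breed, say) would crash the encoder if it were fit on the training rows alone. A small sketch of the failure mode, with made-up values:

``````enc = LabelEncoder().fit(['Beagle', 'Poodle'])               # categories seen in "train" only
# enc.transform(['Husky'])                                   # ValueError: previously unseen labels
enc_all = LabelEncoder().fit(['Beagle', 'Husky', 'Poodle'])  # fit on the stacked data instead
print(enc_all.transform(['Husky']))                          # [1] -- works``````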
##### dropping the DateTime column (month/year extraction left commented out)
In :
``````# stacked_df['DateTime'] = pd.to_datetime(stacked_df['DateTime'])
# stacked_df['year'] = stacked_df['DateTime'].dt.year
# stacked_df['month'] = stacked_df['DateTime'].dt.month
stacked_df = stacked_df.drop(columns=['DateTime'])
``````
##### dropping columns with too many nulls
In :
``````for col in stacked_df.columns:
    if stacked_df[col].isnull().sum() > 10000:
        print("dropping", col, stacked_df[col].isnull().sum())
        stacked_df = stacked_df.drop(columns=[col])``````
```dropping Name 10916 ```
In :
``stacked_df.head()``
##### label encoding
In :
``````for col in stacked_df.columns:
    if stacked_df.dtypes[col] == "object":
        stacked_df[col] = stacked_df[col].fillna("NA")
    else:
        stacked_df[col] = stacked_df[col].fillna(0)
    stacked_df[col] = LabelEncoder().fit_transform(stacked_df[col])``````
In :
``stacked_df.head()``
In :
``````# making all variables categorical
for col in stacked_df.columns:
    stacked_df[col] = stacked_df[col].astype('category')``````
##### splitting back train and test
In :
``````X = stacked_df[0:26729]
test_processed = stacked_df[26729:]

# check if the number of rows matches the original
print("train shape: ", X.shape, "original: ", train.shape)
print("test shape: ", test_processed.shape, "original: ", test.shape)``````
```train shape: (26729, 5) original: (26729, 10)
test shape: (11456, 5) original: (11456, 8)```
##### Encoding target
In :
``````Y = LabelEncoder().fit_transform(Y)

# sanity check: the encoded counts match the earlier counter,
# which gives us the target dictionary
print(Counter(train['OutcomeType']))
print(Counter(Y))
target_dict = {
    'Adoption': 0,
    'Died': 1,
    'Euthanasia': 2,
    'Return_to_owner': 3,
    'Transfer': 4
}``````
```Counter({'Adoption': 10769, 'Transfer': 9422, 'Return_to_owner': 4786, 'Euthanasia': 1555, 'Died': 197})
Counter({0: 10769, 4: 9422, 3: 4786, 2: 1555, 1: 197})```
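The mapping above is not arbitrary: `LabelEncoder` sorts class labels alphabetically before assigning indices, which is why Adoption ends up as 0. A quick way to confirm it directly:

``````print(LabelEncoder().fit(train['OutcomeType']).classes_)
# ['Adoption' 'Died' 'Euthanasia' 'Return_to_owner' 'Transfer'] -> indices 0..4``````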
##### train-valid split
In :
``````X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.10, random_state=0)
``````
##### Choosing columns for embedding
In :
``````#categorical embedding for columns having more than two values
embedded_cols = {n: len(col.cat.categories) for n,col in X.items() if len(col.cat.categories) > 2}
embedded_cols
``````
Out:
``{'SexuponOutcome': 6, 'AgeuponOutcome': 46, 'Breed': 1678, 'Color': 411}``
In :
``````embedded_col_names = embedded_cols.keys()
len(X.columns) - len(embedded_cols) #number of numerical columns``````
Out:
``1``
##### Determining size of embedding

(borrowed from https://www.usfca.edu/data-institute/certificates/fundamentals-deep-learning lesson 2)

In :
``````embedding_sizes = [(n_categories, min(50, (n_categories+1)//2)) for _,n_categories in embedded_cols.items()]
embedding_sizes
``````
Out:
``[(6, 3), (46, 23), (1678, 50), (411, 50)]``
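To make the lookup concrete: the `Color` column, for example, will get an `nn.Embedding(411, 50)`, a learnable 411 x 50 table, and each encoded color index simply selects one 50-dimensional row. A minimal sketch (the indices are made up):

``````color_emb = nn.Embedding(411, 50)  # 411 color categories -> 50-dim vectors
idx = torch.tensor([3, 17, 3])     # a mini-batch of encoded color values
vecs = color_emb(idx)              # plain row lookup
print(vecs.shape)                  # torch.Size([3, 50]); rows 0 and 2 are identical``````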

### Pytorch Dataset

In :
``````class ShelterOutcomeDataset(Dataset):
    def __init__(self, X, Y, embedded_col_names):
        X = X.copy()
        self.X1 = X.loc[:, embedded_col_names].copy().values.astype(np.int64)  # categorical columns
        self.X2 = X.drop(columns=embedded_col_names).copy().values.astype(np.float32)  # numerical columns
        self.y = Y

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X1[idx], self.X2[idx], self.y[idx]``````
In :
``````#creating train and valid datasets
train_ds = ShelterOutcomeDataset(X_train, y_train, embedded_col_names)
valid_ds = ShelterOutcomeDataset(X_val, y_val, embedded_col_names)
``````

### Making device (GPU/CPU) compatible

(borrowed from https://jovian.ml/aakashns/04-feedforward-nn)

In order to make use of a GPU if available, we'll have to move our data and model to it.

In :
``````def get_default_device():
    """Pick GPU if available, else CPU"""
    if torch.cuda.is_available():
        return torch.device('cuda')
    else:
        return torch.device('cpu')``````
In :
``````def to_device(data, device):
    """Move tensor(s) to chosen device"""
    if isinstance(data, (list, tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)``````
In :
``````class DeviceDataLoader():
    """Wrap a dataloader to move data to a device"""
    def __init__(self, dl, device):
        self.dl = dl
        self.device = device

    def __iter__(self):
        """Yield a batch of data after moving it to device"""
        for b in self.dl:
            yield to_device(b, self.device)

    def __len__(self):
        """Number of batches"""
        return len(self.dl)``````
In :
``````device = get_default_device()
device
``````
Out:
``device(type='cpu')``

### Model

(modified from https://www.usfca.edu/data-institute/certificates/fundamentals-deep-learning lesson 2)

In :
``````class ShelterOutcomeModel(nn.Module):
    def __init__(self, embedding_sizes, n_cont):
        super().__init__()
        self.embeddings = nn.ModuleList([nn.Embedding(categories, size) for categories, size in embedding_sizes])
        n_emb = sum(e.embedding_dim for e in self.embeddings)  # length of all embeddings combined
        self.n_emb, self.n_cont = n_emb, n_cont
        self.lin1 = nn.Linear(self.n_emb + self.n_cont, 200)
        self.lin2 = nn.Linear(200, 70)
        self.lin3 = nn.Linear(70, 5)
        self.bn1 = nn.BatchNorm1d(self.n_cont)
        self.bn2 = nn.BatchNorm1d(200)
        self.bn3 = nn.BatchNorm1d(70)
        self.emb_drop = nn.Dropout(0.6)
        self.drops = nn.Dropout(0.3)

    def forward(self, x_cat, x_cont):
        x = [e(x_cat[:, i]) for i, e in enumerate(self.embeddings)]
        x = torch.cat(x, 1)
        x = self.emb_drop(x)
        x2 = self.bn1(x_cont)
        x = torch.cat([x, x2], 1)
        x = F.relu(self.lin1(x))
        x = self.drops(x)
        x = self.bn2(x)
        x = F.relu(self.lin2(x))
        x = self.drops(x)
        x = self.bn3(x)
        x = self.lin3(x)
        return x``````
In :
``````model = ShelterOutcomeModel(embedding_sizes, 1)
to_device(model, device)
``````
Out:
``````ShelterOutcomeModel(
  (embeddings): ModuleList(
    (0): Embedding(6, 3)
    (1): Embedding(46, 23)
    (2): Embedding(1678, 50)
    (3): Embedding(411, 50)
  )
  (lin1): Linear(in_features=127, out_features=200, bias=True)
  (lin2): Linear(in_features=200, out_features=70, bias=True)
  (lin3): Linear(in_features=70, out_features=5, bias=True)
  (bn1): BatchNorm1d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (bn2): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (bn3): BatchNorm1d(70, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (emb_drop): Dropout(p=0.6, inplace=False)
  (drops): Dropout(p=0.3, inplace=False)
)``````
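Before training, a quick smoke test of the forward pass can catch shape bugs early. This is a hedged sketch with a made-up batch: 3 rows, the 4 embedded columns as int64 indices and the 1 continuous column as float32, matching what `ShelterOutcomeDataset` yields:

``````x_cat = torch.zeros(3, 4, dtype=torch.int64)    # dummy category indices (0 is valid for every embedding)
x_cont = torch.zeros(3, 1, dtype=torch.float32) # dummy continuous feature
model.eval()                                    # use BatchNorm running stats for the tiny batch
print(model(x_cat, x_cont).shape)               # expect torch.Size([3, 5]): one logit per outcome
model.train()``````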
##### Optimizer
In :
``````def get_optimizer(model, lr=0.001, wd=0.0):
    parameters = filter(lambda p: p.requires_grad, model.parameters())
    optim = torch_optim.Adam(parameters, lr=lr, weight_decay=wd)
    return optim``````
##### Training function
In :
``````def train_model(model, optim, train_dl):
    model.train()
    total = 0
    sum_loss = 0
    for x1, x2, y in train_dl:
        batch = y.shape[0]
        output = model(x1, x2)
        loss = F.cross_entropy(output, y)
        optim.zero_grad()  # clear gradients from the previous batch
        loss.backward()
        optim.step()
        total += batch
        sum_loss += batch * loss.item()
    return sum_loss / total``````
##### Evaluation function
In :
``````def val_loss(model, valid_dl):
    model.eval()
    total = 0
    sum_loss = 0
    correct = 0
    for x1, x2, y in valid_dl:
        current_batch_size = y.shape[0]
        out = model(x1, x2)
        loss = F.cross_entropy(out, y)
        sum_loss += current_batch_size * loss.item()
        total += current_batch_size
        pred = torch.max(out, 1)[1]  # index of the highest logit per row
        correct += (pred == y).float().sum().item()
    print("valid loss %.3f and accuracy %.3f" % (sum_loss / total, correct / total))
    return sum_loss / total, correct / total``````
In :
``````def train_loop(model, epochs, lr=0.01, wd=0.0):
    optim = get_optimizer(model, lr=lr, wd=wd)
    for i in range(epochs):
        loss = train_model(model, optim, train_dl)
        print("training loss: ", loss)
        val_loss(model, valid_dl)``````

### Training

In :
``````batch_size = 1000
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=batch_size)``````
In :
``````# wrap the loaders so every batch is moved to the chosen device
train_dl = DeviceDataLoader(train_dl, device)
valid_dl = DeviceDataLoader(valid_dl, device)``````
In :
``train_loop(model, epochs=8, lr=0.05, wd=0.00001)``
```training loss: 1.2027205801462073
valid loss 0.942 and accuracy 0.609
training loss: 1.012246810026172
valid loss 0.904 and accuracy 0.608
training loss: 0.982185461055729
valid loss 0.909 and accuracy 0.626
training loss: 0.970147291746574
valid loss 0.881 and accuracy 0.636
training loss: 0.9620310133146533
valid loss 0.919 and accuracy 0.615
training loss: 0.9558281915941863
valid loss 0.869 and accuracy 0.639
training loss: 0.9427465338385062
valid loss 0.869 and accuracy 0.636
training loss: 0.9387064109341742
valid loss 0.870 and accuracy 0.632```

### Test Output

In :
``````test_ds = ShelterOutcomeDataset(test_processed, np.zeros(len(test_processed)), embedded_col_names)  # dummy labels
test_dl = DataLoader(test_ds, batch_size=batch_size)
test_dl = DeviceDataLoader(test_dl, device)``````
In :
``````preds = []
model.eval()  # inference mode for dropout/batchnorm
with torch.no_grad():  # no gradients needed for prediction
    for x1, x2, y in test_dl:
        out = model(x1, x2)
        prob = F.softmax(out, dim=1)
        preds.append(prob)``````
In :
``final_probs = [item for sublist in preds for item in sublist]``
In :
``len(final_probs)``
Out:
``11456``
In :
``target_dict``
Out:
``````{'Adoption': 0,
 'Died': 1,
 'Euthanasia': 2,
 'Return_to_owner': 3,
 'Transfer': 4}``````
In :
``sample.head()``
In :
``````# each row of final_probs holds 5 probabilities, indexed as in target_dict
sample['Adoption'] = [float(t[0]) for t in final_probs]
sample['Died'] = [float(t[1]) for t in final_probs]
sample['Euthanasia'] = [float(t[2]) for t in final_probs]
sample['Return_to_owner'] = [float(t[3]) for t in final_probs]
sample['Transfer'] = [float(t[4]) for t in final_probs]``````
In :
``sample.to_csv('samp.csv', index=False)``
In :
``````import jovian
jovian.commit()``````
```[jovian] Saving notebook.. ```