
Objective

The challenge is to create a model that uses data from the first 24 hours of intensive care to predict patient survival.

Data Description

MIT's GOSSIS community initiative, with privacy certification from the Harvard Privacy Lab, has provided a dataset of more than 130,000 hospital Intensive Care Unit (ICU) visits from patients, spanning a one-year timeframe. This data is part of a growing global effort and consortium spanning Argentina, Australia, New Zealand, Sri Lanka, Brazil, and more than 200 hospitals in the United States.

The data includes:

Training data for 91,713 encounters.
Unlabeled test data for 39,308 encounters, which includes all the information in the training data except for the values for hospital_death.
WiDS Datathon 2020 Dictionary with supplemental information about the data, including the category (e.g., identifier, demographic, vitals), unit of measure, data type (e.g., numeric, binary), description, and examples.
Sample submission files

H2O:

H2O is ‘the open source in-memory, prediction engine for Big Data science’. H2O is a feature-rich, open source machine learning platform known for its R and Spark integration and its ease of use. It is a Java virtual machine that is optimised for doing in-memory processing of distributed, parallel machine learning algorithms on clusters.

The aim of H2O is to provide a platform that makes it easy for non-experts to experiment with machine learning. The H2O architecture can be divided into layers: the top layer consists of the different client APIs, and the bottom layer is the H2O JVM.
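To make that layering concrete, here is a minimal sketch (not part of the original notebook; the remote host and port below are placeholders): the Python client is only a thin wrapper around the cluster's REST API, so the same h2o.init() call can either launch a local JVM or attach to an already-running cluster.

import h2o

# Start a local H2O JVM (the backend that actually holds the data and trains models)
h2o.init(max_mem_size="4G")

# ...or, alternatively, attach to an existing (possibly remote) cluster
# h2o.init(ip="10.0.0.5", port=54321)  # placeholder host/port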

In [1]:
# importing libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier ,AdaBoostClassifier
import lightgbm as lgb
from sklearn import preprocessing

Starting H2O and Inspecting the Cluster

There are many tools for directly interacting with user-visible objects in the H2O cluster. Every new Python session begins by initializing a connection between the Python client and the H2O cluster; the h2o.init() function initializes H2O and starts a local server if one is not already running.

In [2]:
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_232"; OpenJDK Runtime Environment (build 1.8.0_232-8u232-b09-1~deb9u1-b09); OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
  Starting server from /opt/conda/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmpdwh2s6nf
  JVM stdout: /tmp/tmpdwh2s6nf/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmpdwh2s6nf/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
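The heading above also mentions inspecting the cluster. A short sketch of how to do that from the Python client (assuming a reasonably recent h2o package):

# Inspect the cluster the client just connected to: version, nodes, memory, health
h2o.cluster().show_status()

# The version string alone
print(h2o.cluster().version)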
In [3]:
# loading dataset 
training_v2 = pd.read_csv("../input/widsdatathon2020/training_v2.csv")
In [4]:
# creating independent features X and dependent feature y
y = pd.DataFrame(training_v2['hospital_death'])
X = training_v2.drop('hospital_death', axis=1)
In [5]:
# Remove features with more than 60 percent missing values
train_missing = (X.isnull().sum() / len(X)).sort_values(ascending = False)
train_missing = train_missing.index[train_missing > 0.60]
X = X.drop(columns = train_missing)
In [6]:
#Convert categorical variable into dummy/indicator variables.
X = pd.get_dummies(X)
In [7]:
# Imputation transformer for completing missing values.
my_imputer = SimpleImputer()
new_data = pd.DataFrame(my_imputer.fit_transform(X))
new_data.columns = X.columns
X= new_data
In [8]:
# Threshold for removing correlated variables
threshold = 0.9

# Absolute value correlation matrix
corr_matrix = X.corr().abs()
corr_matrix.head()
# Upper triangle of correlations
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
upper.head()
# Select columns with correlations above threshold
to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
print('There are %d columns to remove.' % (len(to_drop)))
#Drop the columns with high correlations
X = X.drop(columns = to_drop)
There are 36 columns to remove.
In [9]:
# Initialize an empty array to hold feature importances
feature_importances = np.zeros(X.shape[1])

# Create the model with several hyperparameters
model = lgb.LGBMClassifier(objective='binary', boosting_type = 'goss', n_estimators = 10000, class_weight = 'balanced')
for i in range(2):
    
    # Split into training and validation set
    train_features, valid_features, train_y, valid_y = train_test_split(X, y, test_size = 0.25, random_state = i)
    
    # Train using early stopping
    model.fit(train_features, train_y, early_stopping_rounds=100, eval_set = [(valid_features, valid_y)],eval_metric = 'auc', verbose = 200)
    
    # Record the feature importances
    feature_importances += model.feature_importances_

/opt/conda/lib/python3.6/site-packages/sklearn/preprocessing/label.py:219: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel(). y = column_or_1d(y, warn=True)
/opt/conda/lib/python3.6/site-packages/sklearn/preprocessing/label.py:252: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel(). y = column_or_1d(y, warn=True)
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is: [90]    valid_0's auc: 0.895395    valid_0's binary_logloss: 0.356349
Training until validation scores don't improve for 100 rounds
[200]    valid_0's auc: 0.891263    valid_0's binary_logloss: 0.313702
Early stopping, best iteration is: [162]    valid_0's auc: 0.892908    valid_0's binary_logloss: 0.326333
In [10]:
# Make sure to average feature importances! 
feature_importances = feature_importances / 2
feature_importances = pd.DataFrame({'feature': list(X.columns), 'importance': feature_importances}).sort_values('importance', ascending = False)
# Find the features with zero importance
zero_features = list(feature_importances[feature_importances['importance'] == 0.0]['feature'])
print('There are %d features with 0.0 importance' % len(zero_features))
# Drop features with zero importance
X = X.drop(columns = zero_features)
There are 17 features with 0.0 importance
In [11]:
# Re-attach the target as the first column so the target and features can be converted to an H2OFrame together
X = y.join(X)

H2OFrame:

H2OFrame is the primary data store for H2O and is similar to a pandas DataFrame. One critical distinction is that the data is generally not held in local memory; instead it lives on a (possibly remote) H2O cluster, so an H2OFrame is merely a handle to that data.

In [12]:
X = h2o.H2OFrame(X)
Parse progress: |█████████████████████████████████████████████████████████| 100%
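To make the "handle" distinction concrete, here is a small sketch (not part of the original notebook): operations on an H2OFrame are executed on the cluster, and as_data_frame() is what pulls data back into local pandas memory.

# Dimensions and summaries are computed server-side; only the results come back to Python
print(X.dim)        # [number of rows, number of columns]
X.describe()        # per-column summary statistics

# Materialise a small sample locally as a pandas DataFrame
local_sample = X.head(5).as_data_frame()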

split_frame():

split_frame() splits a frame into distinct subsets whose sizes are determined by the given ratios. The number of subsets is always one more than the number of ratios given. The split is not exact: to remain efficient on big data, H2O uses a probabilistic splitting method rather than an exact split.

In [13]:
# split into train and validation sets
train, valid = X.split_frame(ratios = [.8], seed = 1234)
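Because the split is probabilistic, the realised sizes only approximate the 80/20 ratio. A quick sketch (not in the original notebook) to verify this:

# The actual row counts will be close to, but usually not exactly, an 80/20 split
print(train.nrow, valid.nrow)
print(round(train.nrow / X.nrow, 3))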

asfactor() converts a column of the frame to a categorical (factor) type. Converting the target column tells H2O to treat the problem as classification rather than regression.

In [14]:
train[0] = train[0].asfactor()
valid[0] = valid[0].asfactor()
In [15]:
param = {
      "ntrees" : 100
    , "max_depth" : 10
    , "learn_rate" : 0.02
    , "sample_rate" : 0.7
    , "col_sample_rate_per_tree" : 0.9
    , "min_rows" : 5
    , "seed": 4241
    , "score_tree_interval": 100
}
from h2o.estimators import H2OXGBoostEstimator
model = H2OXGBoostEstimator(**param)
model.train(x = list(range(1, train.shape[1])), y = 0, training_frame = train,validation_frame = valid)
xgboost Model Build progress: |███████████████████████████████████████████| 100%
In [16]:
model.model_performance(valid)
ModelMetricsBinomial: xgboost
** Reported on test data. **

MSE: 0.060203122517298036
RMSE: 0.24536324606040333
LogLoss: 0.23250324076518067
Mean Per-Class Error: 0.201726318219548
AUC: 0.88128607601079
AUCPR: 0.5405850313795718
Gini: 0.76257215202158

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.3183365629778968:
Maximum Metrics: Maximum metrics at their respective thresholds
Gains/Lift Table: Avg response rate: 8.86 %, avg score: 14.35 %
Out[16]:
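As a short follow-up sketch (not part of the original notebook), the AUC can be pulled directly from the performance object, and the trained model can score the validation frame:

# Extract the validation AUC from the metrics object
perf = model.model_performance(valid)
print(perf.auc())

# Score the validation frame: returns an H2OFrame with the predicted class and class probabilities
preds = model.predict(valid)
preds.head()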

References:

Lee, M., Raffa, J., Ghassemi, M., Pollard, T., Kalanidhi, S., Badawi, O., Matthys, K., Celi, L. A. (2020). WiDS (Women in Data Science) Datathon 2020: ICU Mortality Prediction. PhysioNet. doi:10.13026/vc0e-th79

Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals (2000). Circulation. 101(23):e215-e220.

Official H2O documentation