Objective

The challenge is to create a model that uses data from the first 24 hours of intensive care to predict patient survival.

Data Description

MIT's GOSSIS community initiative, with privacy certification from the Harvard Privacy Lab, has provided a dataset of more than 130,000 hospital Intensive Care Unit (ICU) visits from patients, spanning a one-year timeframe. This data is part of a growing global effort and consortium spanning Argentina, Australia, New Zealand, Sri Lanka, Brazil, and more than 200 hospitals in the United States.

The data includes:

Training data for 91,713 encounters.
Unlabeled test data for 39,308 encounters, which includes all the information in the training data except for the values for hospital_death.
WiDS Datathon 2020 Dictionary with supplemental information about the data, including the category (e.g., identifier, demographic, vitals), unit of measure, data type (e.g., numeric, binary), description, and examples.
Sample submission files.

H2O:

H2O is ‘the open source in-memory, prediction engine for Big Data science’. H2O is a feature-rich, open source machine learning platform known for its R and Spark integration and its ease of use. It is a Java virtual machine that is optimised for doing in-memory processing of distributed, parallel machine learning algorithms on clusters.

The goal of H2O is to provide a platform that makes it easy for non-experts to experiment with machine learning. The H2O architecture can be divided into layers: the top layer consists of the different APIs, and the bottom layer is the H2O JVM.

(H2O architecture diagram; Pic Credit: Datacamp.com)

H2O AutoML

H2O’s AutoML can also be a helpful tool for the advanced user: it provides a simple wrapper function that performs a large number of modeling-related tasks which would typically require many lines of code, freeing up time to focus on other aspects of the data science pipeline, such as data preprocessing, feature engineering, and model deployment.

H2O’s AutoML can be used to automate the machine learning workflow, including automatic training and tuning of many models within a user-specified time limit. Two Stacked Ensembles (one based on all previously trained models, another on the best model of each family) are automatically trained on the collections of individual models to produce highly predictive ensemble models which, in most cases, will be the top performing models on the AutoML Leaderboard.

AutoML Interface

The H2O AutoML interface is designed to have as few parameters as possible so that all the user needs to do is point to their dataset, identify the response column and optionally specify a time constraint or limit on the number of total models trained.

(AutoML interface diagram; Pic Credit: Towards Data Science)
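To make that minimal surface concrete, here is a sketch of such a call (not a cell from this notebook; it assumes an H2OFrame named train whose response column is hospital_death, as set up further below):

# A minimal AutoML run: point at the data, name the response, set a budget
from h2o.automl import H2OAutoML

aml = H2OAutoML(max_models=20,         # cap on the number of base models trained
                max_runtime_secs=600,  # and/or a wall-clock limit in seconds
                seed=1)                # for reproducibility
aml.train(y="hospital_death", training_frame=train)  # x defaults to all other columns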

Starting H2O and Inspecting the Cluster

There are many tools for directly interacting with user-visible objects in the H2O cluster. Every new Python session begins by initializing a connection between the Python client and the H2O cluster. Call the h2o.init() function to initialize H2O.

In [1]:
import h2o
from h2o.automl import H2OAutoML
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
Java Version: openjdk version "1.8.0_232"; OpenJDK Runtime Environment (build 1.8.0_232-8u232-b09-1~deb9u1-b09); OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
Starting server from /opt/conda/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
Ice root: /tmp/tmpmbj3ix5x
JVM stdout: /tmp/tmpmbj3ix5x/h2o_unknownUser_started_from_python.out
JVM stderr: /tmp/tmpmbj3ix5x/h2o_unknownUser_started_from_python.err
Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
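With the connection established, the cluster itself can be inspected; a small sketch (not part of the original run):

# Inspect the running cluster: version, uptime, node count, free memory, ...
h2o.cluster().show_status()
# When finished, the backend can be stopped with h2o.cluster().shutdown()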
In [2]:
# importing libraries
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
import lightgbm as lgb
from sklearn import preprocessing

In [3]:
# loading dataset 
training_v2 = pd.read_csv("../input/widsdatathon2020/training_v2.csv")
In [4]:
# creating independent features X and dependent feature y
y = pd.DataFrame(training_v2['hospital_death'])
X = training_v2.drop('hospital_death', axis=1)
In [5]:
# Remove features with more than 60 percent missing values
train_missing = (X.isnull().sum() / len(X)).sort_values(ascending=False)
train_missing = train_missing.index[train_missing > 0.60]
X = X.drop(columns=train_missing)
In [6]:
#Convert categorical variable into dummy/indicator variables.
X = pd.get_dummies(X)
In [7]:
# Imputation transformer for completing missing values (default strategy: column mean)
my_imputer = SimpleImputer()
new_data = pd.DataFrame(my_imputer.fit_transform(X))
new_data.columns = X.columns
X = new_data
In [8]:
# Threshold for removing correlated variables
threshold = 0.9

# Absolute value correlation matrix
corr_matrix = X.corr().abs()

# Upper triangle of correlations
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Select columns with correlations above threshold
to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
print('There are %d columns to remove.' % (len(to_drop)))

# Drop the columns with high correlations
X = X.drop(columns=to_drop)
There are 36 columns to remove.
In [9]:
# Initialize an empty array to hold feature importances
feature_importances = np.zeros(X.shape[1])

# Create the model with several hyperparameters
model = lgb.LGBMClassifier(objective='binary', boosting_type = 'goss', n_estimators = 10000, class_weight = 'balanced')
for i in range(2):
    
    # Split into training and validation set
    train_features, valid_features, train_y, valid_y = train_test_split(X, y, test_size = 0.25, random_state = i)
    
    # Train using early stopping; ravel() flattens the column-vector y that
    # otherwise triggers sklearn's DataConversionWarning
    model.fit(train_features, train_y.values.ravel(), early_stopping_rounds=100,
              eval_set=[(valid_features, valid_y.values.ravel())], eval_metric='auc', verbose=200)

    # Record the feature importances
    feature_importances += model.feature_importances_

Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is: [90]  valid_0's auc: 0.895395  valid_0's binary_logloss: 0.356349
Training until validation scores don't improve for 100 rounds
[200]  valid_0's auc: 0.891263  valid_0's binary_logloss: 0.313702
Early stopping, best iteration is: [162]  valid_0's auc: 0.892908  valid_0's binary_logloss: 0.326333
In [10]:
# Make sure to average feature importances! 
feature_importances = feature_importances / 2
feature_importances = pd.DataFrame({'feature': list(X.columns), 'importance': feature_importances}).sort_values('importance', ascending = False)
# Find the features with zero importance
zero_features = list(feature_importances[feature_importances['importance'] == 0.0]['feature'])
print('There are %d features with 0.0 importance' % len(zero_features))
# Drop features with zero importance
X = X.drop(columns = zero_features)
There are 17 features with 0.0 importance

H2OFrame:

H2OFrame is the primary data store for H2O. An H2OFrame is similar to a pandas DataFrame. One critical distinction is that the data is generally not held in local memory; instead, it is located on a (possibly remote) H2O cluster, so an H2OFrame represents a mere handle to that data.

In [11]:
# Rejoin the target column and convert the pandas DataFrame to an H2OFrame
features = list(X)
X = y.join(X)
X = h2o.H2OFrame(X)
# Encode the response as a factor so H2O treats the task as binary classification
X['hospital_death'] = X['hospital_death'].asfactor()
Parse progress: |█████████████████████████████████████████████████████████| 100%
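Because the frame now lives on the cluster, inspection goes through this handle; a brief sketch (not a cell from the original notebook):

# These calls fetch summaries from the cluster rather than local memory
print(X.dim)  # [number of rows, number of columns]
X.describe()  # per-column types, ranges, and missing-value counts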

split_frame():

split_frame() splits a frame into distinct subsets whose sizes are determined by the given ratios. The number of subsets is always one more than the number of ratios given. This does not give an exact split: H2O is designed to be efficient on big data, so it uses a probabilistic splitting method rather than an exact split.

In [12]:
train,test = X.split_frame(ratios=[.7])
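Because the subset count is one more than the ratio count, passing two ratios yields three frames; a quick sketch (not from the original run; the seed argument is added here for reproducibility):

# Hypothetical three-way split: roughly 70% / 15% / 15%
train_part, valid_part, test_part = X.split_frame(ratios=[.7, .15], seed=1)
print(train_part.nrows, valid_part.nrows, test_part.nrows)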

H2OAutoML:

The Automatic Machine Learning (AutoML) function automates the supervised machine learning model training process. The current version of AutoML trains and cross-validates a Random Forest (DRF), an Extremely Randomized Forest (DRF/XRT), a random grid of Generalized Linear Models (GLM), a random grid of XGBoost models (XGBoost), a random grid of Gradient Boosting Machines (GBM), a random grid of Deep Neural Nets (DeepLearning), and two Stacked Ensembles: one of all the models, and one of only the best models of each kind.

In [13]:
# 5-minute training budget; y = 0 selects the first column, 'hospital_death'
model = H2OAutoML(max_runtime_secs=300, seed=1)
model.train(x=features, y=0, training_frame=train)
AutoML progress: |████████████████████████████████████████████████████████| 100%

leaderboard:

The leaderboard property retrieves the leaderboard from an H2OAutoML object and returns an H2OFrame with the model ids in the first column and the evaluation metrics in the following columns, sorted by the default evaluation metric.

In [14]:
lb = model.leaderboard
lb.head()

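The full leaderboard and the winning model are also directly accessible; a small sketch (assumed usage, not part of the original run):

# Show every leaderboard row instead of only the first ten
lb.head(rows=lb.nrows)

# The top-ranked model is exposed as the leader
best = model.leader
print(best.model_id)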
In [15]:
preds = model.predict(test)

xgboost prediction progress: |████████████████████████████████████████████| 100%
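The predictions come back as an H2OFrame; for a binomial leader it contains a predict column plus the class probabilities p0 and p1. A sketch of pulling the results back into pandas (the column names assume a classification model):

# Convert the H2O predictions into a pandas DataFrame
preds_df = preds.as_data_frame()
# p1 would be the predicted probability of hospital_death = 1
print(preds_df.head())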

References:

Lee, M., Raffa, J., Ghassemi, M., Pollard, T., Kalanidhi, S., Badawi, O., Matthys, K., Celi, L. A. (2020). WiDS (Women in Data Science) Datathon 2020: ICU Mortality Prediction. PhysioNet. doi:10.13026/vc0e-th79

Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals (2000). Circulation. 101(23):e215-e220.

Official H2O documentation

https://opensourceforu.com/2017/01/introduction-h2o-relation-deep-learning/

https://www.datacamp.com/community/tutorials/h2o-automl