The challenge is to create a model that uses data from the first 24 hours of intensive care to predict patient survival. MIT's GOSSIS community initiative, with privacy certification from the Harvard Privacy Lab, has provided a dataset of more than 130,000 hospital Intensive Care Unit (ICU) visits from patients, spanning a one-year timeframe. This data is part of a growing global effort and consortium spanning Argentina, Australia, New Zealand, Sri Lanka, Brazil, and more than 200 hospitals in the United States.
The data includes the following files (a quick loading check appears after this list):
Training data for 91,713 encounters.
Unlabeled test data for 39,308 encounters, which includes all the information in the training data except for the values for hospital_death.
WiDS Datathon 2020 Dictionary with supplemental information about the data, including the category (e.g., identifier, demographic, vitals), unit of measure, data type (e.g., numeric, binary), description, and examples.
Sample submission files
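A quick way to sanity-check these files is to load them with pandas and inspect their shapes and the dictionary columns. This is a minimal sketch; apart from training_v2.csv (loaded later in this notebook), the file names under the widsdatathon2020 input directory are assumptions:
import pandas as pd
# Labeled training encounters and unlabeled test encounters (file names assumed)
train = pd.read_csv("../input/widsdatathon2020/training_v2.csv")
test = pd.read_csv("../input/widsdatathon2020/unlabeled.csv")
# Data dictionary: category, unit of measure, data type, description, examples
data_dict = pd.read_csv("../input/widsdatathon2020/WiDS Datathon 2020 Dictionary.csv")
print(train.shape)       # 91,713 rows, one per encounter
print(test.shape)        # 39,308 rows, without hospital_death values
print(data_dict.head())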
H2O is ‘the open source in-memory prediction engine for Big Data science’. It is a feature-rich, open source machine learning platform known for its R and Spark integration and its ease of use. At its core is a Java virtual machine optimised for in-memory processing of distributed, parallel machine learning algorithms on clusters.
H2O aims to provide a platform that makes it easy for non-experts to experiment with machine learning. Its architecture can be divided into layers: the top layer consists of the different client APIs (Python, R, REST), and the bottom layer is the H2O JVM itself.
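Because the Python client is just a thin API layer over the H2O JVM, the cluster's resources are configured when the JVM is launched. The following is a small illustrative sketch (the nthreads and max_mem_size values are assumptions, not settings used in this run):
import h2o
# Start a local H2O JVM (or attach to one already running); the Python client
# communicates with it over REST.
h2o.init(nthreads=-1, max_mem_size="4G")   # -1 = use all cores; cap the Java heap
# Inspect the cluster the client is now connected to
h2o.cluster().show_status()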
# importing libraries
import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn import preprocessing
import lightgbm as lgb
There are many tools for directly interacting with user-visible objects in the H2O cluster. Every new Python session begins by initializing a connection between the Python client and the H2O cluster; the h2o.init() function does this, starting a local H2O server if none is already running.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
Java Version: openjdk version "1.8.0_232"; OpenJDK Runtime Environment (build 1.8.0_232-8u232-b09-1~deb9u1-b09); OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
Starting server from /opt/conda/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
Ice root: /tmp/tmpk52nfdng
JVM stdout: /tmp/tmpk52nfdng/h2o_unknownUser_started_from_python.out
JVM stderr: /tmp/tmpk52nfdng/h2o_unknownUser_started_from_python.err
Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
# loading dataset
training_v2 = pd.read_csv("../input/widsdatathon2020/training_v2.csv")
# creating the independent features X and the dependent feature y
y = pd.DataFrame(training_v2['hospital_death'])
X = training_v2.drop('hospital_death', axis=1)
# Remove features with more than 60 percent missing values
train_missing = (X.isnull().sum() / len(X)).sort_values(ascending=False)
train_missing = train_missing.index[train_missing > 0.60]
X = X.drop(columns=train_missing)
# Convert categorical variables into dummy/indicator variables.
X = pd.get_dummies(X)
# Imputation transformer for completing missing values (mean imputation by default).
my_imputer = SimpleImputer()
new_data = pd.DataFrame(my_imputer.fit_transform(X))
new_data.columns = X.columns
X = new_data
# Threshold for removing correlated variables
threshold = 0.9
# Absolute value correlation matrix
corr_matrix = X.corr().abs()
corr_matrix.head()
# Upper triangle of correlations
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
upper.head()
# Select columns with correlations above threshold
to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
print('There are %d columns to remove.' % (len(to_drop)))
#Drop the columns with high correlations
X = X.drop(columns = to_drop)
There are 36 columns to remove.
# Initialize an empty array to hold feature importances
feature_importances = np.zeros(X.shape[1])
# Create the model with several hyperparameters
model = lgb.LGBMClassifier(objective='binary', boosting_type='goss',
                           n_estimators=10000, class_weight='balanced')
for i in range(2):
    # Split into training and validation sets
    train_features, valid_features, train_y, valid_y = train_test_split(X, y, test_size=0.25, random_state=i)
    # Train using early stopping
    model.fit(train_features, train_y, early_stopping_rounds=100,
              eval_set=[(valid_features, valid_y)], eval_metric='auc', verbose=200)
    # Record the feature importances
    feature_importances += model.feature_importances_
/opt/conda/lib/python3.6/site-packages/sklearn/preprocessing/label.py:219: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
/opt/conda/lib/python3.6/site-packages/sklearn/preprocessing/label.py:252: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[90] valid_0's auc: 0.895395 valid_0's binary_logloss: 0.356349
Training until validation scores don't improve for 100 rounds
[200] valid_0's auc: 0.891263 valid_0's binary_logloss: 0.313702
Early stopping, best iteration is:
[162] valid_0's auc: 0.892908 valid_0's binary_logloss: 0.326333
# Make sure to average feature importances!
feature_importances = feature_importances / 2
feature_importances = pd.DataFrame({'feature': list(X.columns), 'importance': feature_importances}).sort_values('importance', ascending = False)
# Find the features with zero importance
zero_features = list(feature_importances[feature_importances['importance'] == 0.0]['feature'])
print('There are %d features with 0.0 importance' % len(zero_features))
# Drop features with zero importance
X = X.drop(columns = zero_features)
There are 17 features with 0.0 importance
# Re-attach the target so that hospital_death becomes the first column
X = y.join(X)
H2OFrame is the primary data store for H2O and is similar to a pandas DataFrame. One critical distinction is that the data is generally not held in the client's memory; instead it lives on a (possibly remote) H2O cluster, so an H2OFrame is merely a handle to that data.
X = h2o.H2OFrame(X)
Parse progress: |█████████████████████████████████████████████████████████| 100%
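To see the handle semantics in practice, operations such as summarisation run inside the cluster, and pulling the data back into the Python process is an explicit (and potentially expensive) step. A small illustration, not part of the original run:
# Summary statistics are computed in the H2O cluster, not in the Python process
X.describe()
# Materialise the frame locally as a pandas DataFrame (costly for large data)
local_df = X.as_data_frame()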
split_frame() splits a frame into distinct subsets whose sizes are determined by the given ratios; the number of subsets is always one more than the number of ratios given. This does not give an exact split: H2O is designed to be efficient on big data, so it uses a probabilistic splitting method rather than an exact one.
# split into train and validation sets
train, valid = X.split_frame(ratios = [.8], seed = 1234)
asfactor() converts columns in the current frame to categoricals; converting column 0 (hospital_death) to a factor tells H2O to treat the problem as binary classification rather than regression.
train[0] = train[0].asfactor()
valid[0] = valid[0].asfactor()
param = {
    "ntrees": 100,
    "max_depth": 10,
    "learn_rate": 0.02,
    "sample_rate": 0.7,
    "col_sample_rate_per_tree": 0.9,
    "min_rows": 5,
    "seed": 4241,
    "score_tree_interval": 100
}
from h2o.estimators import H2OXGBoostEstimator
model = H2OXGBoostEstimator(**param)
# Column 0 (hospital_death) is the response; all remaining columns are predictors
model.train(x = list(range(1, train.shape[1])), y = 0, training_frame = train, validation_frame = valid)
xgboost Model Build progress: |███████████████████████████████████████████| 100%
model.model_performance(valid)
ModelMetricsBinomial: xgboost
** Reported on test data. **
MSE: 0.060203122517298036
RMSE: 0.24536324606040333
LogLoss: 0.23250324076518067
Mean Per-Class Error: 0.201726318219548
AUC: 0.88128607601079
AUCPR: 0.5405850313795718
Gini: 0.76257215202158
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.3183365629778968:
Maximum Metrics: Maximum metrics at their respective thresholds
Gains/Lift Table: Avg response rate: 8.86 %, avg score: 14.35 %
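With a validation AUC of roughly 0.88, a natural next step is to score the unlabeled test encounters and write a submission file. The sketch below is illustrative only: the file name unlabeled.csv, the encounter_id column, and the submission format are assumptions, and a faithful pipeline would reapply exactly the same preprocessing (dropped columns, dummies, imputation) used on the training data:
# Load the unlabeled encounters (file name assumed)
test = pd.read_csv("../input/widsdatathon2020/unlabeled.csv")
encounter_ids = test['encounter_id']
# Mirror the training preprocessing: one-hot encode, then align to the
# training feature columns (everything except the target). Missing columns
# are filled with 0 here; imputation is skipped since H2O XGBoost handles NaNs.
test = pd.get_dummies(test)
feature_cols = [c for c in X.columns if c != 'hospital_death']
test = test.reindex(columns=feature_cols, fill_value=0)
# predict() on a binomial H2O model returns 'predict', 'p0' and 'p1' columns;
# p1 is the predicted probability of hospital_death = 1.
preds = model.predict(h2o.H2OFrame(test)).as_data_frame()
submission = pd.DataFrame({'encounter_id': encounter_ids,
                           'hospital_death': preds['p1']})
submission.to_csv('submission.csv', index=False)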
References:
Lee, M., Raffa, J., Ghassemi, M., Pollard, T., Kalanidhi, S., Badawi, O., Matthys, K., Celi, L. A. (2020). WiDS (Women in Data Science) Datathon 2020: ICU Mortality Prediction. PhysioNet. doi:10.13026/vc0e-th79
Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff, J. M., Ivanov, P. Ch., Mark, R. G., Mietus, J. E., Moody, G. B., Peng, C.-K., Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation. 101(23):e215-e220.
Official H2O documentation