The challenge is to create a model that uses data from the first 24 hours of intensive care to predict patient survival. MIT's GOSSIS community initiative, with privacy certification from the Harvard Privacy Lab, has provided a dataset of more than 130,000 hospital Intensive Care Unit (ICU) visits from patients, spanning a one-year timeframe. This data is part of a growing global effort and consortium spanning Argentina, Australia, New Zealand, Sri Lanka, Brazil, and more than 200 hospitals in the United States.
The data includes:
Training data for 91,713 encounters.
Unlabeled test data for 39,308 encounters, which includes all the information in the training data except for the values for hospital_death.
WiDS Datathon 2020 Dictionary with supplemental information about the data, including the category (e.g., identifier, demographic, vitals), unit of measure, data type (e.g., numeric, binary), description, and examples.
Sample submission files
A collection of several models trained on the same dataset and working together is called an ensemble, and the approach is called ensemble learning.
Ensemble methods combine several base learners, often decision trees, to achieve better predictive performance than any single base learner. The main principle is that a group of weak learners can come together to form a strong learner, which improves the accuracy of the model. When we predict a target variable with any machine learning technique, the main sources of the difference between actual and predicted values are noise, variance, and bias.
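For squared-error loss, the split into noise, variance, and bias can be written out explicitly. The decomposition below is the standard textbook form, not something derived in this kernel. It assumes the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$; the symbols $f$, $\hat{f}$ (the fitted model), and $\sigma^2$ are introduced here for illustration only.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```

Roughly speaking, averaging many models mainly shrinks the variance term, boosting mainly targets the bias term, and no model can remove the noise term.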
This kernel gives a baseline script that implements one type of ensemble technique, called stacking.
Ensemble learning is a machine learning paradigm where multiple models (often called “weak learners”) are trained to solve the same problem and combined to get better results. The main hypothesis is that when weak models are correctly combined we can obtain more accurate and/or robust models.
When a single base learning algorithm is used to train all the weak learners in different ways, the ensemble is referred to as “homogeneous”. When different types of base learning algorithms are used, the weak learners are heterogeneous and the ensemble is referred to as “heterogeneous”.
Stacking is an ensemble technique that takes heterogeneous weak learners, trains them in parallel, and combines them by training a meta-model that outputs a prediction based on the weak models' predictions.
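As a point of comparison, scikit-learn ships a built-in `StackingClassifier` that wires base learners to a meta-model. The sketch below is not the approach this kernel takes (the kernel stacks by hand on a hold-out set, while `StackingClassifier` uses cross-validated predictions). It runs on synthetic data, and all `*_demo` names and the choice of logistic regression as meta-model are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the WiDS dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Two base learners, combined by a logistic-regression meta-model
stack_demo = StackingClassifier(
    estimators=[
        ("ada", AdaBoostClassifier(n_estimators=25, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=25, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=3,  # base-model predictions for the meta-model come from 3-fold CV
)
stack_demo.fit(Xd_train, yd_train)

demo_auc = roc_auc_score(yd_test, stack_demo.predict_proba(Xd_test)[:, 1])
print("demo AUC:", demo_auc)
```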
Scikit-learn is one of the most widely used machine learning libraries in Python. It contains many efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction.
# importing libraries
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
import lightgbm as lgb
import matplotlib.pyplot as plt
import seaborn as sns
# roc curve and auc score
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
# loading dataset
training_v2 = pd.read_csv("../input/widsdatathon2020/training_v2.csv")
test = pd.read_csv("../input/widsdatathon2020/unlabeled.csv")
# creating independent features X and dependent feature y
y = training_v2['hospital_death']
X = training_v2.drop('hospital_death', axis=1)
test = test.drop('hospital_death', axis=1)
# Remove features with more than 75 percent missing values
train_missing = (X.isnull().sum() / len(X)).sort_values(ascending=False)
train_missing = train_missing.index[train_missing > 0.75]
X = X.drop(columns=train_missing)
test = test.drop(columns=train_missing)
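The missing-value filter is easier to follow on a tiny frame. The sketch below applies the same 75 percent threshold to made-up data; the frame `toy` and its column names are illustrative and not part of the WiDS dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only): column "c" is 80% missing
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.0, np.nan, 3.0, np.nan, 5.0],
    "c": [np.nan, np.nan, np.nan, np.nan, 5.0],
})

# Fraction of missing values per column (same as isnull().sum() / len(toy))
missing_fraction = toy.isnull().mean()

# Columns above the threshold are dropped; the rest are kept for imputation
too_sparse = missing_fraction.index[missing_fraction > 0.75]
toy_reduced = toy.drop(columns=too_sparse)

print(missing_fraction.to_dict())
print(list(toy_reduced.columns))
```

Only the column list computed on the training frame should be reused for the test frame, so that both keep identical columns.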
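# Drop the categorical features: this baseline keeps numeric columns only,
# which is also what SimpleImputer's default mean strategy requires
categoricals_features = ['hospital_id', 'ethnicity', 'gender', 'hospital_admit_source', 'icu_admit_source', 'icu_stay_type', 'icu_type', 'apache_3j_bodysystem', 'apache_2_bodysystem']
X = X.drop(columns=categoricals_features)
test = test.drop(columns=categoricals_features)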
# Imputation transformer for completing missing values (default strategy: column mean)
my_imputer = SimpleImputer()
new_data = pd.DataFrame(my_imputer.fit_transform(X))
# Reuse the means learned from the training data instead of refitting on the test data
test_data = pd.DataFrame(my_imputer.transform(test))
new_data.columns = X.columns
test_data.columns = test.columns
X = new_data
test = test_data
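One detail about imputation is worth spelling out: `fit_transform` learns fill values from whatever frame it is given, whereas `transform` reuses values learned earlier. The minimal sketch below shows the difference on made-up numbers; the column name `heart_rate` and both toy frames are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames (illustrative values only)
toy_train = pd.DataFrame({"heart_rate": [60.0, 80.0, np.nan, 100.0]})
toy_test = pd.DataFrame({"heart_rate": [np.nan, 150.0]})

imputer = SimpleImputer(strategy="mean")

# fit_transform learns the column mean from the training frame and fills its gaps
train_filled = pd.DataFrame(imputer.fit_transform(toy_train), columns=toy_train.columns)

# transform fills the test frame with that same training mean
test_filled = pd.DataFrame(imputer.transform(toy_test), columns=toy_test.columns)

print(imputer.statistics_)
print(test_filled)
```

Calling `fit_transform` on the test frame instead would fill its gap with 150.0, a value the training data never saw.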
The data is split into training, validation, and test sets. The training set contains known outputs, and the model learns from it so that it can generalize to other data later on. The held-out test subset is used to evaluate the model's predictions. In scikit-learn this is done with the train_test_split function, applied twice below: first to hold out 20% of the data as a test set, then to hold out 20% of the remainder as a validation set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
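Two successive 20% splits do not leave 60% for training. The sketch below checks the resulting proportions on 1,000 dummy rows; the dummy arrays and the shortened variable names are stand-ins for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1,000 dummy rows stand in for the real feature matrix (an assumption)
X_dummy = np.arange(1000).reshape(-1, 1)
y_dummy = np.arange(1000) % 2

# Same two-step split as in the kernel
X_tr, X_te, y_tr, y_te = train_test_split(X_dummy, y_dummy, test_size=0.2, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X_tr, y_tr, test_size=0.2, random_state=1)

print(len(X_tr), len(X_va), len(X_te))  # 640 160 200
```

So the base models train on 64% of the labeled data, the meta-model trains on 16%, and 20% is held out for evaluation.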
AdaBoost combines weak classifiers to form a strong classifier. A single weak classifier may classify the objects poorly, but good accuracy can be achieved by combining multiple classifiers, reweighting the training set at every iteration, and giving each classifier the right amount of weight in the final vote.
AdaBoost retrains the algorithm iteratively, choosing the training set based on the accuracy of the previous round. The weight of each trained classifier at any iteration depends on the accuracy it achieved.
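model1 = AdaBoostClassifier(random_state=1)
model1.fit(X_train, y_train)
# Validation-set predictions become the training features for the meta-model
train_pred1 = pd.DataFrame(model1.predict(X_val))
# Test-set predictions become the meta-model's evaluation features
test_pred1 = pd.DataFrame(model1.predict(X_test))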
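To make the reweighting idea concrete, here is a minimal sketch of a single round of classic discrete AdaBoost on a ten-point toy problem. It is a textbook illustration under stated assumptions, not the exact update inside scikit-learn's `AdaBoostClassifier`, whose variant may differ in detail. The toy data and all variable names are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Ten toy points on a line; no single threshold separates the classes perfectly
X_toy = np.arange(1, 11, dtype=float).reshape(-1, 1)
y_toy = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])

# Round 1: every sample starts with the same weight
weights = np.full(len(y_toy), 1.0 / len(y_toy))
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X_toy, y_toy, sample_weight=weights)
missed = stump.predict(X_toy) != y_toy

# Weighted error of the stump, and its say (alpha) in the final vote
error = weights[missed].sum()
alpha = 0.5 * np.log((1.0 - error) / error)

# Misclassified samples gain weight, correct ones lose weight, then renormalise
new_weights = weights * np.exp(np.where(missed, alpha, -alpha))
new_weights /= new_weights.sum()

print("error:", error, "alpha:", alpha)
print("weight on misclassified samples after update:", new_weights[missed].sum())
```

After the update, the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.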
Gradient boosting is a machine learning technique for regression and classification problems that produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. The intuition is to repeatedly fit new weak models to the residuals left by the current ensemble, so that each round corrects part of the remaining error.
model2 = GradientBoostingClassifier(learning_rate=0.01, random_state=1)
model2.fit(X_train, y_train)
train_pred2 = pd.DataFrame(model2.predict(X_val))
test_pred2 = pd.DataFrame(model2.predict(X_test))
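The residual-fitting loop behind gradient boosting can be sketched in a few lines. The example below uses squared-error regression, where the residual is exactly the negative gradient of the loss; `GradientBoostingClassifier` applies the same idea to a classification loss. The sine-wave data, the learning rate of 0.1, and the tree depth are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy sine wave as a stand-in regression problem (an assumption)
rng = np.random.default_rng(0)
x_demo = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y_demo = np.sin(x_demo).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from a constant model
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    residuals = y_demo - prediction  # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(x_demo, residuals)      # each weak learner models the residuals
    prediction += learning_rate * tree.predict(x_demo)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print("training MSE before:", mse_history[0], "after:", mse_history[-1])
```

A smaller learning rate, such as the 0.01 used in this kernel, takes smaller corrective steps per round and usually needs more trees to reach the same training error.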
# Stack the base models' predictions side by side: one column per base model
df = pd.concat([train_pred1, train_pred2], axis=1)
df_test = pd.concat([test_pred1, test_pred2], axis=1)
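The meta-features built above are hard 0/1 class labels, so with two base models the meta-model can see at most four distinct input rows. A common variation, not used in this kernel, is to pass predicted probabilities instead. The sketch below contrasts the two on synthetic data; the data and all names in it are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the WiDS dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=1)
Xd_train, Xd_val, yd_train, yd_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

base_models = [
    AdaBoostClassifier(n_estimators=20, random_state=1),
    GradientBoostingClassifier(n_estimators=20, random_state=1),
]

hard_cols, soft_cols = [], []
for model in base_models:
    model.fit(Xd_train, yd_train)
    hard_cols.append(pd.Series(model.predict(Xd_val)))              # 0/1 labels
    soft_cols.append(pd.Series(model.predict_proba(Xd_val)[:, 1]))  # P(class 1)

hard_features = pd.concat(hard_cols, axis=1)
soft_features = pd.concat(soft_cols, axis=1)

print("distinct hard rows:", len(hard_features.drop_duplicates()))
print("distinct soft rows:", len(soft_features.drop_duplicates()))
```

Probabilities keep the base models' confidence, which gives the meta-model more to rank on when the metric is AUC.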
A random forest consists of a large number of individual decision trees that operate as an ensemble. Each tree in the forest outputs a class prediction, and the class with the most votes becomes the model's prediction. The fundamental concept is that a large number of relatively uncorrelated models operating as a committee will outperform any of the individual constituent models. Here the random forest serves as the meta-model: it is trained on the base models' validation-set predictions.
# Meta-model: trained on the validation-set predictions, scored on the test-set predictions
stackmodel = RandomForestClassifier(n_estimators=100)
stackmodel.fit(df, y_val)
stackmodel.score(df_test, y_test)
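The committee idea can be inspected directly. In scikit-learn's implementation, the forest combines its trees by averaging their predicted class probabilities rather than by counting hard votes. The sketch below reproduces that average by hand on synthetic data; the data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the WiDS dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Average the per-tree class probabilities by hand
tree_probs = np.mean([tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0)
forest_probs = forest.predict_proba(X_demo)

# The committee's class is the one with the highest averaged probability
manual_labels = forest.classes_[np.argmax(tree_probs, axis=1)]

print("matches forest.predict_proba:", np.allclose(tree_probs, forest_probs))
print("matches forest.predict:", bool((manual_labels == forest.predict(X_demo)).all()))
```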
The AUC-ROC curve is a performance measure for classification problems across threshold settings. ROC is a probability curve, and AUC represents the degree of separability: it tells how well the model can distinguish between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s; here, that means distinguishing patients who died in hospital from those who survived. The ROC curve plots the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis.
def plot_roc_curve(fpr, tpr):
    plt.plot(fpr, tpr, color='orange', label='ROC')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()
probs = stackmodel.predict_proba(df_test)
probs = probs[:, 1]
auc = roc_auc_score(y_test, probs)
fpr, tpr, thresholds = roc_curve(y_test, probs)
plot_roc_curve(fpr, tpr)
print("AUC-ROC :", auc)
AUC-ROC : 0.649842697414035
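One way to read an AUC value is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes that pairwise fraction by hand on six made-up scores and compares it with `roc_auc_score`; the labels and scores are illustrative only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores (illustrative values only)
y_true = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Fraction of (positive, negative) pairs where the positive is ranked higher;
# ties count as half
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(pairwise_auc, roc_auc_score(y_true, scores))
```

Here 6 of the 9 pairs are ordered correctly, so both computations give an AUC of about 0.667.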
Submissions will be evaluated on the area under the Receiver Operating Characteristic (ROC) curve between the predicted mortality and the observed target (hospital_death).
Lee, M., Raffa, J., Ghassemi, M., Pollard, T., Kalanidhi, S., Badawi, O., Matthys, K., Celi, L. A. (2020). WiDS (Women in Data Science) Datathon 2020: ICU Mortality Prediction. PhysioNet. doi:10.13026/vc0e-th79
Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff, J. M., Ivanov, P. Ch., Mark, R. G., Mietus, J. E., Moody, G. B., Peng, C.-K., Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation, 101(23), e215-e220.