
Heart Failure - Analysis


Introduction

About - Dataset:

Cardiovascular diseases kill approximately 17 million people globally every year, and they mainly manifest as myocardial infarctions and heart failure. Heart failure (HF) occurs when the heart cannot pump enough blood to meet the needs of the body. In this project, we analyze a dataset containing the medical records of 299 heart failure patients collected at the Faisalabad Institute of Cardiology and at the Allied Hospital in Faisalabad (Punjab, Pakistan) from April to December 2015. The patients are 105 women and 194 men, aged between 40 and 95 years. All 299 patients had left ventricular systolic dysfunction and had previous heart failures. The dataset contains 13 features that report clinical, body and lifestyle information about each patient: Age, Anaemia, High Blood Pressure, Creatinine Phosphokinase (CPK), Diabetes, Ejection Fraction, Sex, Platelets, Serum Creatinine, Serum Sodium, Smoking Habit, follow-up time and whether the patient died (death event).

About - Project:

This Exploratory Data Analysis project is part of the "Data Analysis with Python: Zero to Pandas" course structured and provided by Jovian. In this project, we'll analyse the relationships between the features recorded for the heart failure patients in this dataset, such as the distribution of age among the patients, the death rate, the percentage of male and female patients, and the variation in platelet count, creatinine level and sodium level in the blood. Visualising the data with the matplotlib and seaborn libraries in Python makes the dataset much easier to understand.

Dataset - Source:

The dataset is obtained from Kaggle.

Please click here to know more about the dataset.

The column names (attributes) of the dataset don't provide complete information about the recorded data, so we have to refer to another table or website for the full details of each attribute, including measurement units and normal levels where required.

Please click the below link to view the table containing information regarding column names.

Attributes Information Table

Download the Dataset:

There are several options for getting the dataset into Jupyter:

  • Download the CSV manually and upload it via Jupyter's GUI

  • Use the urlretrieve function from the urllib.request to download CSV files from a raw URL

  • Use a helper library, e.g., opendatasets, which contains a collection of curated datasets and provides a helper function for direct download.

Initially, I used the opendatasets helper library to download the files from Kaggle using my username and API key. Later, for convenience, I uploaded the same dataset to my GitHub profile so that it can be fetched directly with just a few lines of code (using the urllib.request.urlretrieve function) without any username or API key.

Let's assign the GitHub raw URL of the dataset (originally retrieved with the opendatasets helper function) to a variable named 'url'.

#assign the dataset (.csv) file url to a variable 
url = "https://raw.githubusercontent.com/lafirm/datasets/main/heart_failure_clinical_records_dataset.csv"
#import urlretrieve function to download the dataset 
from urllib.request import urlretrieve 
urlretrieve(url, 'heart_failure_dataset.csv')
('heart_failure_dataset.csv', <http.client.HTTPMessage at 0x7f7f5059f760>)

We downloaded the dataset (.csv file) using the urlretrieve function from the urllib.request module and saved it as 'heart_failure_dataset.csv'.
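When a notebook is re-run several times, it can be handy to skip the download if the file is already present. The following is only a minimal sketch of that idea: the helper name `download_if_missing` and its `fetch` parameter are hypothetical and not part of the original notebook.

```python
import os
from urllib.request import urlretrieve


def download_if_missing(url, filename, fetch=urlretrieve):
    """Download `url` to `filename` unless a file with that name already exists.

    `fetch` is a hypothetical hook that defaults to urlretrieve; it makes the
    helper easy to try out without network access.
    Returns True if a download was triggered, False if it was skipped.
    """
    if os.path.exists(filename):
        return False
    fetch(url, filename)
    return True


# Possible usage with the names defined earlier in this notebook:
# download_if_missing(url, 'heart_failure_dataset.csv')
```

The function only checks that the path exists; it does not verify that the existing file is complete or up to date.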

Let's check whether the dataset was downloaded into the current working directory using the listdir() function from the os module.

import os #import os module to work with files and directory 
os.listdir() #to view list of files 
['.bash_logout',
 '.profile',
 '.bashrc',
 '.ipynb_checkpoints',
 '.ipython',
 '.local',
 '.cache',
 '.jupyter',
 'heart_failure_dataset.csv',
 '.jovian',
 '.config',
 '.conda',
 'course-project-exploratory-data-analysis.ipynb',
 '.wget-hsts',
 '.jovianrc',
 '.git',
 'work',
 '.npm']

The os.listdir() function returns the list of entries in the directory passed as an argument. If no argument is passed, it lists the current working directory.
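Since the full listing also contains many hidden files, it may be easier to filter it. The sketch below assumes we only care about CSV files; the helper name `list_csv_files` is hypothetical.

```python
import os


def list_csv_files(directory="."):
    """Return the sorted names of the .csv files found in `directory`."""
    return sorted(name for name in os.listdir(directory) if name.endswith(".csv"))


print(list_csv_files())  # CSV files in the current working directory
# os.path.isfile checks one specific file instead of scanning the listing
print(os.path.isfile("heart_failure_dataset.csv"))
```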

Save and upload our notebook

Whether we are running this Jupyter notebook online or on our computer, it's essential to save our work from time to time. We can continue working on a saved notebook later or share it with friends and colleagues to let them execute our code. Jovian offers an easy way of saving and sharing our Jupyter notebooks online.

#to install jovian module
!pip install jovian --upgrade --quiet
import jovian
#assign the name for our project notebook 
project_name = 'heart-failure-analysis'
#let's save our notebook
jovian.commit(project = project_name)
[jovian] Updating notebook "lafirm/heart-failure-analysis" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/lafirm/heart-failure-analysis

Data Preparation and Cleaning

Let's load the CSV file using the Pandas library. We'll use the name "heart_failure_raw_df" for the data frame to indicate that this is unprocessed data. Since we will clean, filter and modify the data to prepare it for analysis, we make a copy of the raw data frame and name it "heart_failure_df". We'll perform all data preparation and cleaning operations on "heart_failure_df" and leave the raw data frame unmodified.

import pandas as pd
#convert the csv file into pandas data frame 
heart_failure_raw_df = pd.read_csv('heart_failure_dataset.csv')
#let's extract a copy of raw df to keep the raw df unaffected / untouched 
heart_failure_df = heart_failure_raw_df.copy()
heart_failure_df 

We can easily identify from the above pandas data frame that the responses have been anonymized to remove personally identifiable information like name, address etc.

#importing numpy module as np to change value in sex column 
import numpy as np 

Let's modify our "heart_failure_df" data frame by dropping unnecessary columns / rows, renaming the column names, changing the data type of column and so on.

#renaming DEATH_EVENT as patient_dead for my convenience 
heart_failure_df.rename(columns = {'DEATH_EVENT':'patient_dead'}, inplace = True)
#drop the time column which is not necessary for our analysis 
heart_failure_df.drop(['time'],axis=1, inplace =True, errors = 'ignore')
#changing the data type of age column from float to int
heart_failure_df.age = heart_failure_df.age.astype(int)
#changing the data type of following columns to bool type for our convenience 
heart_failure_df[['anaemia','diabetes','high_blood_pressure', 'smoking', 'patient_dead']] = heart_failure_df[['anaemia','diabetes', 'high_blood_pressure', 'smoking', 'patient_dead']].astype(bool)
#changing the value of sex column to male or female
heart_failure_df['sex'] = np.where(heart_failure_df['sex'] == 1, 'Male','Female')
#to convert the platelet-count into kilo-platelets/mcL 
heart_failure_df.platelets = (heart_failure_df.platelets / 1000).astype(int)
heart_failure_df 
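To see what these conversions do to individual values, here is a small sketch on a made-up three-row frame. The name `toy_df` and its values are hypothetical and only mimic the layout of the raw columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample that mimics a few raw columns of the dataset
toy_df = pd.DataFrame({
    "age": [60.0, 60.667, 75.0],
    "sex": [1, 0, 1],
    "smoking": [0, 1, 0],
    "platelets": [265000.0, 263358.03, 162000.0],
})

toy_df["age"] = toy_df["age"].astype(int)            # truncates: 60.667 becomes 60
toy_df["smoking"] = toy_df["smoking"].astype(bool)   # 0/1 becomes False/True
toy_df["sex"] = np.where(toy_df["sex"] == 1, "Male", "Female")
toy_df["platelets"] = (toy_df["platelets"] / 1000).astype(int)  # 263358.03 becomes 263
print(toy_df)
```

Note that astype(int) drops the fractional part rather than rounding, so a value such as 60.667 ends up as 60.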

Check for null values in data frame

isnull().any() returns True for each column that contains at least one null (NaN) value and False for columns without null values. (isnull() is an alias of isna().)

#checking for NaN values 
heart_failure_df.isnull().any()
age                         False
anaemia                     False
creatinine_phosphokinase    False
diabetes                    False
ejection_fraction           False
high_blood_pressure         False
platelets                   False
serum_creatinine            False
serum_sodium                False
sex                         False
smoking                     False
patient_dead                False
dtype: bool

From the above result, it's clear that there are no null values in our data frame. If the number of columns is limited, we can also check whether a column has null values using the info() function.
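Because our data frame has no missing values, the check above only ever shows False. The sketch below uses a hypothetical frame (`toy_df`, with one NaN inserted on purpose) to show what a positive result looks like and how to count missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing sodium value
toy_df = pd.DataFrame({
    "age": [60, 75, 50],
    "serum_sodium": [137.0, np.nan, 140.0],
})

has_null = toy_df.isnull().any()     # one boolean per column
null_counts = toy_df.isnull().sum()  # number of missing values per column
print(has_null)
print(null_counts)
```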

heart_failure_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   age                       299 non-null    int64
 1   anaemia                   299 non-null    bool
 2   creatinine_phosphokinase  299 non-null    int64
 3   diabetes                  299 non-null    bool
 4   ejection_fraction         299 non-null    int64
 5   high_blood_pressure       299 non-null    bool
 6   platelets                 299 non-null    int64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64
 9   sex                       299 non-null    object
 10  smoking                   299 non-null    bool
 11  patient_dead              299 non-null    bool
dtypes: bool(5), float64(1), int64(5), object(1)
memory usage: 17.9+ KB

The info() method in pandas is used to view some basic information about a data frame. From the above output, we can clearly see that there are 299 rows and 12 columns and that there are no null values. We can also see the data type of each column in the data frame.

heart_failure_df.describe()

The describe() method is used to compute some basic statistics for the numeric columns of a data frame in pandas. The maximum recorded age of 95 years and the minimum age of 40 years are plausible. All other numeric values look reasonable, except that the maximum value of Creatinine Phosphokinase (CPK) is 7861, which is far too high for a healthy person. For comparison, the maximum CPK value we found reported for a heart failure patient was around 600 micrograms per liter.

heart_failure_df[heart_failure_df.creatinine_phosphokinase > 1000].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 36 entries, 1 to 297
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   age                       36 non-null     int64
 1   anaemia                   36 non-null     bool
 2   creatinine_phosphokinase  36 non-null     int64
 3   diabetes                  36 non-null     bool
 4   ejection_fraction         36 non-null     int64
 5   high_blood_pressure       36 non-null     bool
 6   platelets                 36 non-null     int64
 7   serum_creatinine          36 non-null     float64
 8   serum_sodium              36 non-null     int64
 9   sex                       36 non-null     object
 10  smoking                   36 non-null     bool
 11  patient_dead              36 non-null     bool
dtypes: bool(5), float64(1), int64(5), object(1)
memory usage: 2.4+ KB

The normal value of CPK ranges from 10 to 120 micrograms per liter, but 36 patients in our data have a CPK level above 1000 micrograms per liter. We cannot tell whether the recorded data is correct or what the source of the error is. The huge difference in values might be due to different measurement units. So let's drop that column and exclude it from our analysis.

#to drop creatinine_phosphokinase column
heart_failure_df.drop('creatinine_phosphokinase', inplace = True, axis = 1, errors ='ignore')
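The number of rows above the CPK threshold was read off the info() output. As a sketch of a more direct way to get such a count, a boolean mask can be summed. The series `cpk` below holds made-up values and is only an illustration.

```python
import pandas as pd

# Hypothetical CPK readings; the real column has 299 values
cpk = pd.Series([582, 7861, 146, 111, 1380], name="creatinine_phosphokinase")

above_limit = cpk > 1000            # boolean mask, True where the value exceeds 1000
n_above = int(above_limit.sum())    # True counts as 1, so the sum is the row count
share_above = above_limit.mean()    # fraction of rows above the threshold
print(n_above, share_above)
```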

Let's check the column names of our data frame using the columns attribute in pandas.

#to display column names
heart_failure_df.columns 
Index(['age', 'anaemia', 'diabetes', 'ejection_fraction',
       'high_blood_pressure', 'platelets', 'serum_creatinine', 'serum_sodium',
       'sex', 'smoking', 'patient_dead'],
      dtype='object')

There are 11 columns in our data frame, where each column represents an attribute of the 299 heart failure patients.

The column names (attributes) of the data frame don't provide complete information about the recorded data, so we have to refer to another table or website for the explanation and measurement unit of each attribute. We'll also add the normal value range for the attributes where applicable.

The Pandas library in Python provides functions to read various file formats. Here we'll use the read_html function to read a table from the source website. Kindly check the reference links at the bottom of this page.

#to save the html table as pandas data frame 
column_details_df = pd.read_html("https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5/tables/1")[0]
#to drop the unnecessary column
column_details_df.drop('Range', axis =1,inplace = True, errors='ignore' )
#to drop the unnecessary rows
column_details_df.drop([3,4,7,13], axis = 0, inplace =True, errors ='ignore')
#to rename the column names
column_details_df.columns = ['feature', 'explanation', 'measurement_unit']
column_details_df 
#to rearrange and rename the rows to match the heart_failure_df 
column_details_df = column_details_df.reindex([0,1,5,6,2,9,10,11,8,12,14])
column_details_df.feature = heart_failure_df.columns

#to set the feature column as index for our convenience 
column_details_df.set_index(['feature'], inplace =True)

Let's correct the entries in the explanation and measurement unit columns, and add another column with the normal value for each attribute.

#to change the details in explanation column (.loc selects rows by label and avoids chained assignment)
column_details_df.loc[['anaemia', 'diabetes', 'ejection_fraction', 'high_blood_pressure', 'platelets', 'sex', 'smoking', 'patient_dead'], 'explanation'] = [
    'True, if the patient has Anaemia',
    'True, if the patient has Diabetes',
    '% of blood leaving the heart at each contraction',
    'True, if the patient has High blood pressure',
    'Amount of platelets in the blood',
    'Male or Female',
    'True, if the patient smokes',
    'True, if the patient died during the follow-up period'
]

#to change the details in measurement unit column 
column_details_df.loc[['sex', 'platelets', 'serum_creatinine', 'serum_sodium'], 'measurement_unit'] = [
    'Boolean',
    'kilo-platelets / mcL (microliter)',
    'mg/dL (milligrams per deciliter)',
    'mEq/L (milliequivalents per litre)'
]
#let's add another column to mention normal values of the attributes 
column_details_df["normal_value"] = ['None', 
                                     'None', 
                                     'None',
                                     '55% - 70%',
                                     'None', 
                                     '150 - 400 kilo-platelets / mcL', 
                                     '0.6 - 1.2 mg/dL', 
                                     '135 - 145 mEq /L', 
                                     'None', 'None', 'None'
                                    ] 
column_details_df

The normal values of the ejection fraction, creatinine level, sodium level and platelet count were taken from various resources; please check the reference links at the bottom. These values vary with the patient's age, body and gender, but we took the most suitable approximate values to simplify our analysis. Please be aware that our analysis might therefore differ slightly from real-world clinical practice.
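Updating several cells of a lookup table by label can be sketched with .loc, which takes the row labels and the column name in a single indexing step. The frame `details` and its texts below are hypothetical stand-ins for column_details_df.

```python
import pandas as pd

# Hypothetical miniature version of the attribute table
details = pd.DataFrame(
    {
        "explanation": ["Age of the patient", "placeholder text", "placeholder text"],
        "measurement_unit": ["Years", "Boolean", "Boolean"],
    },
    index=["age", "anaemia", "sex"],
)

# Assign new values to two rows of one column; the list length must match the row count
details.loc[["anaemia", "sex"], "explanation"] = [
    "True, if the patient has Anaemia",
    "Male or Female",
]
print(details)
```

A single .loc call writes to the frame itself, whereas chained indexing such as `df[col][rows] = ...` may operate on a temporary copy in recent pandas versions.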

Datasets - Cleaned & Prepared

Now both of our datasets, heart_failure_df and column_details_df, are ready for analysis. Let's check some basic information about our cleaned datasets before proceeding to visualisation.

column_details_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 11 entries, age to patient_dead
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   explanation       11 non-null     object
 1   measurement_unit  11 non-null     object
 2   normal_value      11 non-null     object
dtypes: object(3)
memory usage: 652.0+ bytes

There are 11 rows in column_details_df, which describe the information (explanation, measurement unit and normal value) for the 11 attributes in the heart_failure_df data frame.

heart_failure_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   age                  299 non-null    int64
 1   anaemia              299 non-null    bool
 2   diabetes             299 non-null    bool
 3   ejection_fraction    299 non-null    int64
 4   high_blood_pressure  299 non-null    bool
 5   platelets            299 non-null    int64
 6   serum_creatinine     299 non-null    float64
 7   serum_sodium         299 non-null    int64
 8   sex                  299 non-null    object
 9   smoking              299 non-null    bool
 10  patient_dead         299 non-null    bool
dtypes: bool(5), float64(1), int64(4), object(1)
memory usage: 15.6+ KB

There are 299 rows and 11 columns in our heart_failure_df with no null values. The raw dataset had 299 rows and 13 columns; we removed the columns that are unnecessary for our analysis.

heart_failure_df.describe()

From the above result, we can find the average age of patients, maximum and minimum value of clinical records stored in our data frame.

Number of Male and Female patients

heart_failure_df.sex.value_counts()
Male      194
Female    105
Name: sex, dtype: int64

Number of Patients with Anaemia

heart_failure_df.anaemia.value_counts()
False    170
True     129
Name: anaemia, dtype: int64

Number of Patients with Diabetes

heart_failure_df.diabetes.value_counts()
False    174
True     125
Name: diabetes, dtype: int64

Number of Patients with High Blood Pressure

heart_failure_df.high_blood_pressure.value_counts()
False    194
True     105
Name: high_blood_pressure, dtype: int64

Number of Patients with Smoking Habit

heart_failure_df.smoking.value_counts()
False    203
True      96
Name: smoking, dtype: int64

Number of Patients died during follow-up period

heart_failure_df.groupby(['patient_dead', 'sex']).size().reset_index().pivot(columns= 'patient_dead',index = 'sex', values=0) 

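The groupby/pivot chain above builds a two-way count table. A shorter way to sketch the same kind of table is pd.crosstab; the frame `toy_df` and the "Alive"/"Dead" labels below are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical five-patient sample
toy_df = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Male", "Female"],
    "patient_dead": [True, False, False, True, True],
})

# Map the booleans to readable labels, then count each (sex, outcome) combination
outcome = toy_df["patient_dead"].map({True: "Dead", False: "Alive"})
death_by_sex = pd.crosstab(toy_df["sex"], outcome)
print(death_by_sex)
```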
Number of Patients with Abnormal Ejection Fraction

#to find the normal value of Ejection Fraction 
column_details_df.normal_value['ejection_fraction']
'55% - 70%'
abn_ef = heart_failure_df[(heart_failure_df.ejection_fraction < 55) | (heart_failure_df.ejection_fraction >70)]
abn_ef

The normal value of Ejection Fraction ranges from 55% to 70%. Here, we found 261 patients with an abnormal Ejection Fraction.

Number of Patients with Abnormal Platelets Count

#to find the normal value of Platelets Count 
column_details_df.normal_value['platelets']
'150 - 400 kilo-platelets / mcL'
abn_platelets = heart_failure_df[(heart_failure_df.platelets < 150) | (heart_failure_df.platelets >400)]
abn_platelets 

The normal value of Platelets Count ranges from 150 to 400 kilo-platelets / mcL. Here, we found 47 patients with an abnormal Platelets Count.

Number of Patients with Abnormal Creatinine level in the blood

#to find the normal value of Creatinine level in the blood 
column_details_df.normal_value['serum_creatinine']
'0.6 - 1.2 mg/dL'
abn_creatinine = heart_failure_df[(heart_failure_df.serum_creatinine < 0.6) | (heart_failure_df.serum_creatinine >1.2)]
abn_creatinine 

The normal value of Creatinine level in the blood ranges from 0.6 to 1.2 mg/dL. Here, we found 102 patients with an abnormal Creatinine level in the blood.

Number of Patients with Abnormal Sodium level in the blood

#to find the normal value of Sodium level in the blood 
column_details_df.normal_value['serum_sodium']
'135 - 145 mEq /L'
abn_sodium = heart_failure_df[(heart_failure_df.serum_sodium < 135) | (heart_failure_df.serum_sodium >145)]
abn_sodium 

The normal value of Sodium level in the blood ranges from 135 to 145 mEq/L. Here, we found 85 patients with an abnormal Sodium level in the blood.
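The four range checks above follow the same pattern: a value is abnormal if it lies below the lower bound or above the upper bound. As a sketch, this pattern could be wrapped in one small helper. The name `count_outside_range` and the sample values are hypothetical.

```python
import pandas as pd


def count_outside_range(values, low, high):
    """Count entries below `low` or above `high`; the bounds themselves count as normal."""
    inside = values.between(low, high)  # inclusive on both ends by default
    return int((~inside).sum())


# Hypothetical sodium readings in mEq/L
sodium = pd.Series([130, 136, 145, 148, 135])
n_abnormal = count_outside_range(sodium, 135, 145)
print(n_abnormal)
```

One caveat of this sketch: between() returns False for missing values, so NaN entries would be counted as abnormal. That does not matter here because the data frame has no null values.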

Sample Data

We've now cleaned up and prepared the dataset for our analysis. Let's take a look at a sample of rows from the data frame.

heart_failure_df.sample(5)

Exploratory Data Analysis & Visualization

Data visualization is the representation of data through use of common graphics, such as charts, plots, infographics, and even animations. These visual displays of information communicate complex data relationships and data-driven insights in a way that is easy to understand.

Let's install and import "seaborn" library as "sns" and "matplotlib.pyplot" module as "plt" to perform some visualization operations on our data frame to understand the distribution and relationships of attributes.

#install matplotlib and seaborn 
!pip install matplotlib seaborn --upgrade --quiet 
#import seaborn and matplotlib.pyplot 
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%matplotlib inline 
#set some default style for our graphs
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = 'white'

%matplotlib inline is used to display our plots embedded within the Jupyter notebook itself. Without this command, sometimes plots may be displayed as pop-ups.

Age

Let's have a look at the distribution of age of the heart failure patients recorded in our dataset by using "hist" function from "matplotlib.pyplot" which is used to create histograms.

"A histogram represents the distribution of a variable by creating bins (interval) along the range of values and showing vertical bars to indicate the number of observations in each bin".

plt.hist(heart_failure_df.age, bins=np.arange(40,100, 5), color ='mediumpurple')
plt.xlabel("Age of Patients (Years)")
plt.ylabel("Number of Patients")
plt.title("Distribution of Age");
[Figure: histogram "Distribution of Age", number of patients by age in years]

From the above histogram, it's clear that the 60-65 age group has the largest number of patients, followed by the 50-55 age group.

The life expectancy of a person in Pakistan in 2015 was approximately 67 years. This is consistent with the number of recorded patients declining from the 65-70 age group onwards. You can see the complete list of life expectancy of people in Pakistan (1950 - 2021) using the link given in the reference section.
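The binning that a histogram performs can also be reproduced numerically, which helps when reading exact counts off a chart is difficult. The sketch below uses np.histogram with the same bin edges as the plot; the array `ages` contains made-up values.

```python
import numpy as np

# Hypothetical ages; the real column has 299 values
ages = np.array([42, 45, 50, 53, 60, 60, 61, 63, 65, 70, 95])

bin_edges = np.arange(40, 100, 5)  # 40, 45, ..., 95: twelve edges, eleven bins
counts, edges = np.histogram(ages, bins=bin_edges)

# Print each bin with its count
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {count}")
```

Every bin includes its left edge and excludes its right edge, except the last one, which includes both. With these edges an age of exactly 95 is therefore counted in the 90-95 bin.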

Gender

The distribution of gender of the Heart Failure patients is another crucial factor to look at. Let's visualize the gender distribution using pie chart.

#to store the gender counts into a variable
gender_counts = heart_failure_df.sex.value_counts()
plt.figure(figsize=(12, 6))
plt.pie(gender_counts, labels = gender_counts.index, autopct ='%.1f%%', startangle = 90, explode = [0.1, 0], colors = ['lightskyblue', 'plum'])
plt.title("Gender Distribution (Male or Female)");
[Figure: pie chart "Gender Distribution (Male or Female)"]

As we can clearly see, only 35.1% of the heart failure patients in our data were female, so male patients outnumber female patients. We can therefore say that this dataset is slightly imbalanced with respect to gender.

Death

Let's check the relationship between death of the patients and their age group and gender.

Death Rate

An important factor to look at is the percentage of patients who died during the follow-up period. Let's use a pie chart to visualize it.

#count number of patients dead
dead_counts = heart_failure_df.patient_dead.value_counts()
dead_counts 
False    203
True      96
Name: patient_dead, dtype: int64
plt.figure(figsize=(12, 6))
plt.pie(dead_counts, labels = ['Alive', 'Dead'] , autopct ='%.1f%%', startangle = 90, explode=[0.1, 0], colors =['aquamarine', 'lightcoral'])
plt.title("% of Patients Dead & Alive");
[Figure: pie chart "% of Patients Dead & Alive"]

As we can clearly see, 32.1% of the patients died during the follow-up period. Note that the data frame covers only a limited period (April 2015 - December 2015); patients who died after this period are not recorded, so the eventual death rate may be higher.
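The percentages shown in the pie chart follow directly from the counts printed above (203 alive, 96 dead). As a quick check, the arithmetic can be sketched as follows; the string labels "Alive" and "Dead" are chosen here for readability.

```python
import pandas as pd

# Counts taken from the value_counts() output above
dead_counts = pd.Series({"Alive": 203, "Dead": 96})

# Share of each outcome in percent, rounded to one decimal like the pie chart labels
percentages = (dead_counts / dead_counts.sum() * 100).round(1)
print(percentages)
```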

Death and Age Group

Let's define a helper function to create another column in our data frame that describes the age group of each heart failure patient. Since the age of the patients ranges from 40 to 95 years, let's group them into categories like '40-45', '45-50', '50-55', '55-60' and so on. Age groups help us to understand and visualize the relationship of various attributes with age.

def create_range_series(number_series):
    """Creates a series with range(group) for the numeric values 
    provided in another series which is passed as an argument.
    This function takes only one argument which is (Pandas) series object, 
    returns another (Pandas) series object. 
    
    Argument:
        number_series - A column in pandas data frame with numeric values. 
    """
    condition = [
    (number_series >= 0) & (number_series < 5), 
    (number_series >= 5) & (number_series < 10), 
    (number_series >= 10) & (number_series < 15), 
    (number_series >= 15) & (number_series < 20), 
    (number_series >= 20) & (number_series < 25), 
    (number_series >= 25) & (number_series < 30), 
    (number_series >= 30) & (number_series < 35), 
    (number_series >= 35) & (number_series < 40), 
    (number_series >= 40) & (number_series < 45), 
    (number_series >= 45) & (number_series < 50), 
    (number_series >= 50) & (number_series < 55), 
    (number_series >= 55) & (number_series < 60), 
    (number_series >= 60) & (number_series < 65), 
    (number_series >= 65) & (number_series < 70), 
    (number_series >= 70) & (number_series < 75), 
    (number_series >= 75) & (number_series < 80), 
    (number_series >= 80) & (number_series < 85), 
    (number_series >= 85) & (number_series < 90), 
    (number_series >= 90) & (number_series < 95), 
    (number_series >= 95) & (number_series < 100)
    ]
    
    output = ['0-5', 
              '5-10', 
              '10-15', 
              '15-20', 
              '20-25', 
              '25-30', 
              '30-35', 
              '35-40', 
              '40-45', 
              '45-50', 
              '50-55', 
              '55-60', 
              '60-65', 
              '65-70', 
              '70-75', 
              '75-80', 
              '80-85', 
              '85-90', 
              '90-95', 
              '95-100'
              ] 
    result = np.select(condition, output, '>100')
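    #keep the index of the input so the result aligns row by row when assigned to a column
    return pd.Series(result, index=number_series.index)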
#to create a column with age group 
heart_failure_df['age_group'] = create_range_series(heart_failure_df.age)
heart_failure_df
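As a more compact alternative to the long list of conditions, the same grouping can be sketched with pd.cut. This is not the helper used in this notebook; the function name `age_group_with_cut` and its parameters are hypothetical.

```python
import pandas as pd


def age_group_with_cut(ages, width=5, upper=100):
    """Label each value with a range such as '40-45' (left edge included, right edge excluded)."""
    edges = list(range(0, upper + width, width))          # 0, 5, ..., 100
    labels = [f"{low}-{low + width}" for low in edges[:-1]]  # '0-5', ..., '95-100'
    groups = pd.cut(ages, bins=edges, labels=labels, right=False)
    return groups.astype(str)


sample_ages = pd.Series([40, 44, 45, 60, 95])
print(age_group_with_cut(sample_ages).tolist())
```

One difference from the helper above: values outside the 0-100 range become the string 'nan' here instead of '>100'.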

Now that we have created another column in our 'heart_failure_df' data frame for the age groups using our helper function 'create_range_series', let's find the number of patients dead corresponding to the particular age group.

dead_patients = heart_failure_df[heart_failure_df.patient_dead == True].groupby('age_group').count()
alive_patients = heart_failure_df[heart_failure_df.patient_dead == False].groupby('age_group').count()
plt.figure(figsize=(12,10))
#a single bar colour is passed with color; palette is meant for use together with hue
sns.barplot(x = dead_patients.index , y = dead_patients.patient_dead, alpha=1, color = 'red')
sns.barplot(x = alive_patients.index, y = alive_patients.patient_dead, alpha = 0.5, color = 'aquamarine', estimator=sum, ci=None)
plt.title('Dead or Alive')
plt.xlabel('Age group of Patients')
dead = mpatches.Patch(color= 'red', label='Dead')
alive = mpatches.Patch(color='aquamarine', label='Alive')
plt.legend(handles=[dead, alive])
plt.ylabel('Number of Patients');
[Figure: bar chart "Dead or Alive", number of dead and alive patients per age group]

As we can clearly see, the largest number of deaths during the follow-up period occurred in the 60-65 age group (15 patients). This is expected, since that age group also has the most patients. The proportion of patients who died starts increasing from the 65-70 age group, and deaths outnumber survivors in the '80-85', '85-90' and '90-95' age groups. One likely reason is that, over time, the body's immune system naturally becomes less capable of handling new threats such as viruses, which increases the risk of complications from various illnesses.
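Since absolute death counts depend on how many patients each age group contains, the proportion of deaths per group can be a useful complement to the bar chart. The following sketch computes it with groupby on a made-up frame; `toy_df` and the column names of `summary` are hypothetical.

```python
import pandas as pd

# Hypothetical sample with two age groups
toy_df = pd.DataFrame({
    "age_group": ["60-65", "60-65", "60-65", "80-85", "80-85"],
    "patient_dead": [True, False, False, True, True],
})

# For each age group: number of patients, number of deaths, and share of deaths
summary = toy_df.groupby("age_group")["patient_dead"].agg(
    patients="count",
    deaths="sum",
    death_rate="mean",  # mean of booleans is the fraction of True values
)
print(summary)
```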