#Importing the libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#load the dataset
df = pd.read_csv('haberman.csv')
print(df)
     age  year  nodes  status
0     30    64      1       1
1     30    62      3       1
2     30    65      0       1
3     31    59      2       1
4     31    65      4       1
..   ...   ...    ...     ...
301   75    62      1       1
302   76    67      0       1
303   77    65      3       1
304   78    65      1       2
305   83    58      2       2

[306 rows x 4 columns]
#Summary about the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   age     306 non-null    int64
 1   year    306 non-null    int64
 2   nodes   306 non-null    int64
 3   status  306 non-null    int64
dtypes: int64(4)
memory usage: 9.7 KB

Observation

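  • All four columns have 306 non-null values, so there are no missing values and no imputation is needed.
  • The status column stores integer codes (1 and 2) that should be converted into categorical labels, i.e. "yes" or "no". In the Haberman dataset, 1 means the patient survived five years or longer and 2 means the patient died within five years.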
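The two observations above can also be checked directly instead of being read off the `df.info()` output. The following is a minimal sketch, assuming a small stand-in frame called `sample` built from the ten rows printed earlier (the name `sample` is hypothetical; with the real data you would run the same calls on `df`).

```python
import pandas as pd

# Stand-in for df: the ten rows shown in the printed output above.
# The full haberman.csv has 306 rows.
sample = pd.DataFrame({
    "age":    [30, 30, 30, 31, 31, 75, 76, 77, 78, 83],
    "year":   [64, 62, 65, 59, 65, 62, 67, 65, 65, 58],
    "nodes":  [1, 3, 0, 2, 4, 1, 0, 3, 1, 2],
    "status": [1, 1, 1, 1, 1, 1, 1, 1, 2, 2],
})

# Count missing values per column; all zeros means no imputation is needed.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# List the distinct codes in the status column before relabelling them.
status_codes = sorted(sample["status"].unique().tolist())
print(status_codes)
```

If any count in `missing_per_column` were non-zero, that column would need imputation or row removal before the analysis continues.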
#Replacing the status codes with meaningful labels (1 -> 'yes', 2 -> 'no')
df['status'] = df['status'].map({1: 'yes', 2: 'no'})
print(df)
     age  year  nodes status
0     30    64      1    yes
1     30    62      3    yes
2     30    65      0    yes
3     31    59      2    yes
4     31    65      4    yes
..   ...   ...    ...    ...
301   75    62      1    yes
302   76    67      0    yes
303   77    65      3    yes
304   78    65      1     no
305   83    58      2     no

[306 rows x 4 columns]
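After relabelling, the status column holds plain strings. If a true categorical column is wanted, one option is to cast it to the pandas `category` dtype and then count the labels as a sanity check. This is a rough sketch under assumptions: `status` is a hypothetical stand-in series holding the codes of the ten rows printed above, not the full 306-row column.

```python
import pandas as pd

# Stand-in for df["status"]: codes of the ten rows shown in the printed output.
status = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 2, 2], name="status")

# Map the integer codes to labels, then store them as a pandas categorical.
labels = status.map({1: "yes", 2: "no"}).astype("category")
print(labels.dtype)

# Count how often each label occurs to confirm that every code was mapped.
counts = labels.value_counts()
print(counts)

# Any code missing from the mapping would show up as NaN here.
unmapped = int(labels.isnull().sum())
print(unmapped)
```

On the full dataset the same `value_counts()` call also shows how unbalanced the two classes are, which matters for the plots that follow.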