Eda Haberman - Notebook by Abdul Azim (abdulazim0402)

Updated 3 years ago

#Importing the libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

#load the dataset
df = pd.read_csv('haberman.csv')
print(df)

     age  year  nodes  status
0     30    64      1       1
1     30    62      3       1
2     30    65      0       1
3     31    59      2       1
4     31    65      4       1
..   ...   ...    ...     ...
301   75    62      1       1
302   76    67      0       1
303   77    65      3       1
304   78    65      1       2
305   83    58      2       2

[306 rows x 4 columns]
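
Some copies of haberman.csv ship without a header row, in which case pd.read_csv would treat the first patient record as column names. The sketch below shows one way to handle that, assuming a header-less file; the in-memory text is a stand-in for the real CSV and holds only the first five rows printed above.

```python
import io
import pandas as pd

# Stand-in for a header-less haberman.csv: the first five rows shown above.
raw = io.StringIO("30,64,1,1\n30,62,3,1\n30,65,0,1\n31,59,2,1\n31,65,4,1\n")

# header=None tells pandas the first line is data, not column labels;
# names= supplies the labels used throughout this notebook.
sample = pd.read_csv(raw, header=None, names=["age", "year", "nodes", "status"])
print(sample.shape)
```

If the file already has a header row, as the output above suggests, the plain pd.read_csv('haberman.csv') call is enough.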

#Summary about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   age     306 non-null    int64
 1   year    306 non-null    int64
 2   nodes   306 non-null    int64
 3   status  306 non-null    int64
dtypes: int64(4)
memory usage: 9.7 KB

Observation

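There are no missing values (every column has 306 non-null entries), so no imputation is needed.
The status column is stored as integer codes (1 and 2) and should be converted to categorical labels, i.e. "yes" or "no". In the Haberman survival dataset, 1 means the patient survived five years or longer and 2 means the patient died within five years.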
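
Both observations can also be checked directly rather than read off df.info(). The following is a minimal sketch on a ten-row sample copied from the printed DataFrame above (the first and last five rows); the name sample is only for illustration, and the same calls should work on the full df.

```python
import pandas as pd

# Ten rows copied from the printed DataFrame (first five and last five).
sample = pd.DataFrame({
    "age": [30, 30, 30, 31, 31, 75, 76, 77, 78, 83],
    "year": [64, 62, 65, 59, 65, 62, 67, 65, 65, 58],
    "nodes": [1, 3, 0, 2, 4, 1, 0, 3, 1, 2],
    "status": [1, 1, 1, 1, 1, 1, 1, 1, 2, 2],
})

# Count missing values per column; all zeros means no imputation is needed.
missing = sample.isna().sum()
print(missing)

# List the distinct integer codes in status to see what needs relabelling.
codes = sorted(sample["status"].unique().tolist())
print(codes)
```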

#Replacing the integer codes in the status column with meaningful labels
df['status'] = df['status'].map({1: 'yes', 2: 'no'})
print(df)

     age  year  nodes status
0     30    64      1    yes
1     30    62      3    yes
2     30    65      0    yes
3     31    59      2    yes
4     31    65      4    yes
..   ...   ...    ...    ...
301   75    62      1    yes
302   76    67      0    yes
303   77    65      3    yes
304   78    65      1     no
305   83    58      2     no

[306 rows x 4 columns]
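
The relabelled column holds plain Python strings (object dtype). If a true categorical column is wanted, one option is pandas' category dtype. The sketch below is an illustration, not the notebook's own method; it uses only the ten status values visible in the printed rows, so the counts describe that sample and not the full dataset.

```python
import pandas as pd

# Status codes of the ten rows visible in the printed output above.
status = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 2, 2], name="status")

# Map the integer codes to labels, then store them as a categorical column.
labels = status.map({1: "yes", 2: "no"}).astype("category")
print(labels.dtype)

# Class counts for this sample; a quick way to spot class imbalance.
counts = labels.value_counts()
print(counts)
```

On the full DataFrame the same idea would read df['status'].astype('category') after the relabelling step.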