Updated 3 years ago
#Importing the libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#load the dataset
df = pd.read_csv('haberman.csv')
print(df)
     age  year  nodes  status
0     30    64      1       1
1     30    62      3       1
2     30    65      0       1
3     31    59      2       1
4     31    65      4       1
..   ...   ...    ...     ...
301   75    62      1       1
302   76    67      0       1
303   77    65      3       1
304   78    65      1       2
305   83    58      2       2
[306 rows x 4 columns]
#Summary about the dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   age     306 non-null    int64
 1   year    306 non-null    int64
 2   nodes   306 non-null    int64
 3   status  306 non-null    int64
dtypes: int64(4)
memory usage: 9.7 KB
Observation
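- Every column has 306 non-null entries, so there are no missing values and no imputation is needed.
- The status column holds integer codes (1 and 2) that should be converted into categorical labels, i.e. "yes" for 1 (in the Haberman dataset, the patient survived five years or longer) and "no" for 2 (the patient died within five years).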
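Both observations can be checked directly rather than read off the `df.info()` output. The sketch below is only an illustration: `sample` is a hypothetical stand-in frame built from five rows of the printed output above so that it runs without haberman.csv; in the notebook the same calls would be made on `df`.

```python
import pandas as pd

# Stand-in sample (assumption): five rows copied from the printed output above,
# used only so this sketch runs without haberman.csv.
sample = pd.DataFrame({
    "age": [30, 30, 31, 78, 83],
    "year": [64, 62, 59, 65, 58],
    "nodes": [1, 3, 2, 1, 2],
    "status": [1, 1, 1, 2, 2],
})

# Count missing values per column; all zeros means no imputation is needed.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# List the distinct integer codes in the status column before relabelling them.
status_codes = sorted(sample["status"].unique().tolist())
print(status_codes)
```

If `status_codes` contained anything other than 1 and 2 on the full dataset, the relabelling step would need an extra rule for those values.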
#Replace the integer status codes with meaningful labels
#(map avoids writing strings into an int64 column, which newer pandas versions warn about)
df['status'] = df['status'].map({1: 'yes', 2: 'no'})
print(df)
     age  year  nodes status
0     30    64      1    yes
1     30    62      3    yes
2     30    65      0    yes
3     31    59      2    yes
4     31    65      4    yes
..   ...   ...    ...    ...
301   75    62      1    yes
302   76    67      0    yes
303   77    65      3    yes
304   78    65      1     no
305   83    58      2     no
[306 rows x 4 columns]
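Since the observation asks for a categorical value, one optional follow-up is to store the labels with pandas' category dtype and count the rows in each class. This is a minimal sketch, not part of the original notebook: `status` is a hypothetical stand-in for the relabelled column, holding five values taken from the printed output.

```python
import pandas as pd

# Stand-in (assumption) for the relabelled status column: five values from the printed output.
status = pd.Series(["yes", "yes", "yes", "no", "no"], name="status")

# Store the labels as a pandas categorical with an explicit category order.
status_cat = status.astype(pd.CategoricalDtype(categories=["yes", "no"]))

# Count rows per class; on the full dataset this would show the class balance.
class_counts = status_cat.value_counts()
print(class_counts)
```

In the notebook the same two calls would be applied to `df['status']`; the category dtype also makes plotting libraries such as seaborn treat the column as discrete groups.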