# Exploratory Data Analysis I

1. Problem Statement

This notebook explores the basic use of Pandas and covers the basic commands of Exploratory Data Analysis (EDA), which include cleaning, munging, combining, reshaping, slicing, dicing, and transforming data for analysis.

  • Exploratory Data Analysis
    Understand the data through EDA and derive simple models with Pandas as a baseline. EDA is a critical first step in analyzing the data, and we do it for the reasons below:
    • Finding patterns in Data
    • Determining relationships in Data
    • Checking of assumptions
    • Preliminary selection of appropriate models
    • Detection of mistakes

2. Data Loading and Description

  • The dataset contains information about the people who boarded the famous RMS Titanic. Its variables include age, sex, fare, ticket and so on.
  • The dataset comprises 891 observations of 12 columns. Below is a table showing the names of the columns and their descriptions.

| Column Name | Description |
| ------------- |:-------------:|
| PassengerId | Passenger Identity |
| Survived | Whether passenger survived or not |
| Pclass | Class of ticket |
| Name | Name of passenger |
| Sex | Sex of passenger |
| Age | Age of passenger |
| SibSp | Number of sibling and/or spouse travelling with passenger |
| Parch | Number of parent and/or children travelling with passenger |
| Ticket | Ticket number |
| Fare | Price of ticket |
| Cabin | Cabin number |

Some Background Information

The sinking of the RMS Titanic in the early morning of 15 April 1912, four days into the ship's maiden voyage from Southampton to New York City, was one of the deadliest peacetime maritime disasters in history, killing more than 1,500 people. The largest passenger liner in service at the time, Titanic had an estimated 2,224 people on board when she struck an iceberg in the North Atlantic. The ship had received six warnings of sea ice but was travelling at near maximum speed when the lookouts sighted the iceberg. Unable to turn quickly enough, the ship suffered a glancing blow that buckled the starboard (right) side and opened five of sixteen compartments to the sea. The disaster caused widespread outrage over the lack of lifeboats, lax regulations, and the unequal treatment of the three passenger classes during the evacuation. Inquiries recommended sweeping changes to maritime regulations, leading to the International Convention for the Safety of Life at Sea (1914), which continues to govern maritime safety.

Importing packages
In [1]:
import numpy as np                                                 # Implements multi-dimensional arrays and matrices
import pandas as pd                                                # For data manipulation and analysis
import pandas_profiling
import matplotlib.pyplot as plt                                    # Plotting library for Python and its numerical mathematics extension NumPy
import seaborn as sns                                              # Provides a high-level interface for drawing attractive and informative statistical graphics
%matplotlib inline
sns.set()

from subprocess import check_output


Importing the Dataset
In [2]:
titanic_data = pd.read_csv("https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Casestudy/titanic_train.csv")     # Importing training dataset using pd.read_csv

3. Data Profiling

  • In the upcoming sections we will first understand our dataset using various pandas functionalities.
  • Then with the help of pandas profiling we will find which columns of our dataset need preprocessing.
  • In preprocessing we will deal with erroneous and missing values in the columns.
  • We will then run pandas profiling again to see how preprocessing has transformed our dataset.

3.1 Understanding the Dataset

To gain insights from the data we must look into each aspect of it carefully. We will start by observing a few rows and columns from both the start and the end of the data.

In [3]:
titanic_data.shape                                                    # This will print the number of rows and columns of the Data Frame

titanic_data has 891 rows and 12 columns.

In [4]:
titanic_data.columns                                            # This will print the names of all columns.
In [5]:
titanic_data.head()

In [6]:
titanic_data.tail()                                                   # This will print the last n rows of the Data Frame
In [7]:
titanic_data.info()                                                   # This will give Index, Datatype and Memory information
In [8]:
titanic_data.describe()
In [9]:
titanic_data.isnull().sum()

From the above output we can see that the Age and Cabin columns contain the most null values. We will see how to deal with them.
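To judge how serious the gaps are, it can help to express the null counts as a percentage of the rows. The sketch below uses a small made-up frame (`toy` is hypothetical, not the Titanic data) so that it runs without a download; the same expression should apply to `titanic_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for titanic_data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the share of missing values per column in percent
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Columns that are mostly empty are candidates for dropping, while columns with a moderate share of gaps are usually better imputed.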

3.2 Pre Profiling

  • Pandas profiling generates an interactive HTML report that contains information about the columns of the dataset, such as the counts and type of each column, detailed information about each column, the correlation between different columns, and a sample of the dataset.
  • It gives us a visual interpretation of each column in the data.
  • The spread of the data can be better understood from the distribution plot.
  • It allows granular analysis of each column.
In [10]:
profile = pandas_profiling.ProfileReport(titanic_data)
profile.to_file(outputfile="titanic_before_preprocessing.html")

Here, we have done Pandas Profiling before preprocessing our dataset, so we have named the html file as titanic_before_preprocessing.html. Take a look at the file and see what useful insight you can develop from it.
Now we will process our data to better understand it.

3.3 Preprocessing

  • Dealing with missing values
    • Dropping/Replacing missing entries of Embarked.
    • Replacing missing values of Age with median values.
    • Dropping the column 'Cabin' as it has too many null values.
    • Replacing 0 values of fare with median values.
In [11]:
titanic_data.Embarked = titanic_data.Embarked.fillna(titanic_data['Embarked'].mode()[0])
In [12]:
median_age = titanic_data.Age.median()
titanic_data['Age'] = titanic_data['Age'].fillna(median_age)          # Assign the result back; a chained inplace fill may not update the frame in newer pandas
In [13]:
titanic_data.drop('Cabin', axis = 1,inplace = True)
In [14]:
titanic_data['Fare']=titanic_data['Fare'].replace(0,titanic_data['Fare'].median())
In [15]:
titanic_data['FamilySize'] = titanic_data['SibSp'] + titanic_data['Parch']+1
  • Segmenting the Sex column by Age: passengers younger than 15 are labelled child, and everyone aged 15 or above keeps their gender (male or female).
In [16]:
titanic_data['GenderClass'] = titanic_data.apply(lambda x: 'child' if x['Age'] < 15 else x['Sex'],axis=1)
In [17]:
titanic_data[titanic_data.Age<15].head(2)
In [18]:
titanic_data[titanic_data.Age>15].head(2)
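The preprocessing steps above can be collected into a single function, which makes them easier to re-run on a fresh copy of the data. This is a minimal sketch under stated assumptions: `preprocess` and `toy` are hypothetical names, the four rows are invented for illustration, and `np.where` is used as a vectorised stand-in for the row-wise `apply` in the notebook.

```python
import numpy as np
import pandas as pd

def preprocess(df):
    """Apply the cleaning steps from this section to a copy of df."""
    out = df.copy()
    # Fill missing ports with the most frequent port
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    # Fill missing ages with the median age
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # Drop the sparsely populated Cabin column
    out = out.drop(columns="Cabin")
    # Replace zero fares with the median fare
    out["Fare"] = out["Fare"].replace(0, out["Fare"].median())
    # The passenger plus siblings/spouses plus parents/children
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    # Under 15 is a child, everyone else keeps their Sex value
    out["GenderClass"] = np.where(out["Age"] < 15, "child", out["Sex"])
    return out

# Hypothetical four-row sample, not real passenger records
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [10.0, 30.0, np.nan, 40.0],
    "SibSp": [1, 0, 0, 2],
    "Parch": [2, 0, 0, 0],
    "Fare": [0.0, 20.0, 30.0, 10.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", np.nan, "S", "C"],
})
clean = preprocess(toy)
print(clean)
```

Working on a copy leaves the raw frame untouched, so the before and after profiling reports can be generated from the same session.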

3.4 Post Pandas Profiling

In [19]:
import pandas_profiling
profile = pandas_profiling.ProfileReport(titanic_data)
profile.to_file(outputfile="titanic_after_preprocessing.html")

We have now preprocessed the data: the dataset no longer contains missing values, and we have introduced a new feature named FamilySize. The pandas profiling report generated after preprocessing will therefore give us more useful insights. You can compare the two reports, i.e. titanic_after_preprocessing.html and titanic_before_preprocessing.html.
Observations from the titanic_after_preprocessing.html report:

  • In the Dataset info, Total Missing(%) = 0.0%
  • Number of variables = 13
  • Observe the newly created variable FamilySize. Click on Toggle details to get more detailed information about it.

4. Questions

4.1 Of all the passengers, how many survived and how many died ?

  • Using Countplot
In [20]:
sns.countplot(x='Survived', data=titanic_data).set_title('Count plot for survived.')

You can see that more people died than survived. To know the exact count:

  • Using groupby
In [21]:
titanic_data.groupby(['Survived'])['Survived'].count()

Notice that 549 people died and only 342 survived.
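Counts are easier to compare as proportions. The sketch below rebuilds a stand-in `Survived` column, assuming 549 deaths among the 891 rows (which leaves 342 survivors), and uses `value_counts(normalize=True)` to get the shares; on the real data the same call would be made on `titanic_data['Survived']`.

```python
import pandas as pd

# Stand-in column: 549 zeros (died) and 342 ones (survived), 891 rows in total
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

counts = survived.value_counts()                 # absolute counts per class
shares = survived.value_counts(normalize=True)   # fractions that sum to 1
print(counts)
print(shares.round(3))
```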

4.2 Who is more likely to survive, Male or Female?

First, let us look at how Age varies with gender.

In [22]:
as_fig = sns.FacetGrid(titanic_data,hue='GenderClass',aspect=5)

as_fig.map(sns.kdeplot,'Age',shade=True)

oldest = titanic_data['Age'].max()

as_fig.set(xlim=(0,oldest))

as_fig.add_legend()
plt.title('Age distribution using FacetGrid')
  • Among children on the RMS Titanic, those aged 3-8 yrs are in the majority.
  • Most males and females are aged 25-35 yrs.

Using groupby

In [23]:
titanic_data.groupby(['Survived','GenderClass'])['Survived'].count()

From the above you can see that it is difficult to absorb information quickly by looking at numbers. Therefore we will make a variety of plots to get a clearer view of the scenario.

  • Using factorplot
In [24]:
sns.factorplot('GenderClass', hue='Survived', kind='count', data=titanic_data);
plt.title('Factor plot for male female and child')
  • The majority of males died.
  • Females have a high probability of surviving.

To know the exact %

In [25]:
print("% of women survived: " , titanic_data[titanic_data.GenderClass == 'female']['Survived'].sum()/titanic_data[titanic_data.GenderClass == 'female']['Survived'].count())
print("% of men survived:   " , titanic_data[titanic_data.GenderClass == 'male']['Survived'].sum()/titanic_data[titanic_data.GenderClass == 'male']['Survived'].count())
print("% of child survived:   " , titanic_data[titanic_data.GenderClass == 'child']['Survived'].sum()/titanic_data[titanic_data.GenderClass == 'child']['Survived'].count())
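Because `Survived` is coded as 0/1, its mean within a group is that group's survival rate, so the three print statements can be condensed into one `groupby`. A minimal sketch on invented rows (`toy` is hypothetical; the notebook would use `titanic_data`):

```python
import pandas as pd

# Hypothetical sample, not real passenger records
toy = pd.DataFrame({
    "GenderClass": ["male", "male", "male", "male",
                    "female", "female", "child", "child"],
    "Survived":    [0, 0, 0, 1, 1, 1, 1, 0],
})

# Mean of a 0/1 column = fraction of ones = survival rate per group
rate = toy.groupby("GenderClass")["Survived"].mean()
print((rate * 100).round(1))
```

Multiplying by 100 gives percentages, which matches the "%" wording of the labels more closely than the raw fractions do.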
  • Using pie plot
In [26]:
f,ax = plt.subplots(1,3,figsize=(20,7))
titanic_data['Survived'][titanic_data['GenderClass'] == 'male'].value_counts().plot.pie(explode=[0,0.2],autopct='%1.1f%%',ax=ax[0],shadow=True)
titanic_data['Survived'][titanic_data['GenderClass'] == 'female'].value_counts().plot.pie(explode=[0,0.2],autopct='%1.1f%%',ax=ax[1],shadow=True)
titanic_data['Survived'][titanic_data['GenderClass'] == 'child'].value_counts().plot.pie(explode=[0,0.2],autopct='%1.1f%%',ax=ax[2],shadow=True)
ax[0].set_title('Survived (male)')
ax[1].set_title('Survived (female)')
ax[2].set_title('Survived (child)')

From the above pie plot you can see how survival depends on whether the passenger is a child, male or female.

  • 76% of females survived.
  • 57% of children also survived.
  • Only 16% of males survived.
In [27]:
titanic_data['Survived'][titanic_data['GenderClass'] == 'male'].value_counts()
In [28]:
titanic_data['Survived'][titanic_data['GenderClass'] == 'female'].value_counts()
In [29]:
(titanic_data.Survived==0).sum()

Using donut pie chart to see the relationship between survival and gender

In [30]:
def func(pct, allvals):
    absolute = int(pct/100.*np.sum(allvals))
    return "{:.1f}%\n({:d})".format(pct, absolute)
In [31]:
import matplotlib.pyplot as plt

# Make data:
group_names=['Survived', 'Not Survived']
group_size=[342,549]
# Child counts are the remainders of each outer group (342-88-209 and 549-450-66),
# added so that the inner ring lines up with the outer ring
subgroup_names=['Survived.Male','Survived.Female','Survived.Child','Not Survived.Male','Not Survived.Female','Not Survived.Child']
subgroup_size=[88,209,45,450,66,33]

# Create colors
a, b, c=[plt.cm.Blues, plt.cm.Reds, plt.cm.Greens]

# First Ring (outside)
fig, ax = plt.subplots()
ax.axis('equal')
mypie, _ = ax.pie(group_size, radius=1.3, labels=group_names, colors = ['yellowgreen', 'gold'])
plt.setp( mypie, width=0.3, edgecolor='white')

# wedges, texts, autotexts = ax.pie(group_size, autopct=lambda pct: func(pct, data),
                                  #textprops=dict(color="w"))

# Second Ring (Inside)
mypie2, _ = ax.pie(subgroup_size, radius=1.3-0.3, labels=subgroup_names, labeldistance=0.7, colors=[a(0.5), b(0.4), c(0.4), a(0.5), b(0.4), c(0.4)])
plt.setp( mypie2, width=0.4, edgecolor='white')
plt.margins(0,0)
plt.title('Donut plot')
# show it
plt.show()

4.3 What is the rate of survival of males, females and children on the basis of Passenger Class?

  • Using mathematical function
In [32]:
print("% of survivals in") 
print("Pclass=1 : ", titanic_data.Survived[titanic_data.Pclass == 1].sum()/titanic_data[titanic_data.Pclass == 1].Survived.count())
print("Pclass=2 : ", titanic_data.Survived[titanic_data.Pclass == 2].sum()/titanic_data[titanic_data.Pclass == 2].Survived.count())
print("Pclass=3 : ", titanic_data.Survived[titanic_data.Pclass == 3].sum()/titanic_data[titanic_data.Pclass == 3].Survived.count())
  • Using crosstab function
In [33]:
pd.crosstab([titanic_data.GenderClass, titanic_data.Survived], titanic_data.Pclass, margins=True).apply(lambda r: 100*r/len(titanic_data), axis=1).style.background_gradient(cmap='autumn_r')

You can see how the percentages of males, females and children who survived vary with the passenger class they are in. However, it is quite difficult to develop quick insights by looking only at numbers. Therefore we will explore the same relationship with the help of plots.
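The crosstab above divides every cell by the total number of passengers, so each value is a share of the whole ship. If the survival rate within each GenderClass and Pclass combination is wanted instead, one option is to pass `values` and `aggfunc` to `pd.crosstab`. This is a sketch on invented rows (`toy` is hypothetical), not the notebook's own method:

```python
import pandas as pd

# Hypothetical sample, not real passenger records
toy = pd.DataFrame({
    "GenderClass": ["male", "male", "female", "female", "male", "female"],
    "Pclass":      [1, 1, 1, 3, 3, 3],
    "Survived":    [1, 0, 1, 0, 0, 1],
})

# Each cell holds the mean of Survived for that (GenderClass, Pclass) pair,
# i.e. the survival rate inside the cell rather than a share of all rows
rate_table = pd.crosstab(
    toy["GenderClass"], toy["Pclass"],
    values=toy["Survived"], aggfunc="mean",
)
print(rate_table)
```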

  • Using violin plot to see the relationship between Pclass and Survived
In [34]:
sns.violinplot(x='Pclass', y='Survived', data=titanic_data)          # violinplot has no 'kind' argument, so it is dropped here
plt.title('Violinplot Pclass Vs Survived')
plt.show()

Above is another way to see how the survival rate varies with passenger class.

  • Pclass 3 has more people who died, and the survival rate is higher for Pclass 1.

Drawing a factorplot to look at the distribution of the population by Pclass and GenderClass.

In [35]:
sns.factorplot('Pclass', data=titanic_data, hue='GenderClass', kind='count')
plt.title('Factorplot with kind = "count" for Pclass and GenderClass')
  1. Pclass 3 has the maximum number of males.
  2. Pclass 1 has the minimum number of children.
  • Using factorplot to see the variation of survival rate with Pclass and GenderClass.
In [36]:
sns.factorplot('Pclass','Survived', data=titanic_data, hue='GenderClass')
plt.title('Factorplot for Survival rate variation with Pclass and GenderClass')

The above graph shows:

  1. The survival rate for males is very low irrespective of the class they belong to.
  2. The survival rate is lower for all 3rd class passengers.
  3. Almost all women in Pclass 1 and 2 survived, and nearly all men in Pclass 2 and 3 died.

4.4 What is the survival rate considering the Embarked variable?

  • Using countplot
In [37]:
sns.countplot('Embarked',data=titanic_data, hue='Survived')
  1. The maximum number of people have Southampton as their port of embarkation.
  2. Also observe that among people who boarded at Cherbourg, more survived than died, and that this is reversed for Queenstown.
  • Using factorplot and kind = 'point'
In [38]:
sns.factorplot('Embarked','Survived', kind='point', data = titanic_data)
plt.title('Factorplot for Embarked and Survived')
plt.show()

4.5. Survival rate - Comparing Embarked and Sex.

  • Distribution of GenderClass with respect to Port of Embarkment using Countplot.
In [39]:
sns.countplot('Embarked',data=titanic_data, hue='GenderClass')

Most of the people boarded from S. Also, among all who boarded, males constitute the majority.

  • Using Factorplot to see variation of survival rate with port of embarkment and GenderClass
In [40]:
sns.factorplot('Embarked','Survived', hue= 'GenderClass', kind='point', data= titanic_data)
plt.title('Factor plot showing survival rate variation with Embarked and GenderClass ')
plt.show()
  • The chance of survival is highest for females who boarded from C.
  • The chance of survival is lowest for males who boarded from Q.

4.6 How does the survival rate vary with Embarked, Sex and Pclass?

Seeing the relation between Pclass and Embarked.

In [41]:
relation = pd.crosstab( titanic_data.Embarked, titanic_data.Pclass )
relation.plot.barh(figsize=(15,5))
plt.xticks(size = 10)
plt.yticks(size = 10)
plt.title('Relation Between Pclass and Embarked',size=20)

Most people who boarded from S belong to Pclass 3.
Most of the passengers belonging to Pclass 1 boarded from C and S.

In [42]:
dummy = relation.div(relation.sum(1).astype(float), axis=0).plot(kind="bar", stacked=True)
dummy = plt.xlabel('Embarked')
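The stacked bar chart relies on dividing each row of the crosstab by its row total, so that every port sums to 1. The sketch below shows that step on invented rows (`toy` is hypothetical) and compares it with the `normalize="index"` option of `pd.crosstab`, which is intended to produce the same row-wise shares:

```python
import pandas as pd

# Hypothetical sample, not real passenger records
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "Q"],
    "Pclass":   [3, 3, 3, 1, 1, 2, 3],
})

relation = pd.crosstab(toy["Embarked"], toy["Pclass"])

# Manual version, as in the notebook: divide each row by its row sum
manual = relation.div(relation.sum(axis=1).astype(float), axis=0)

# Built-in version: let crosstab normalise over each row
built_in = pd.crosstab(toy["Embarked"], toy["Pclass"], normalize="index")

print(manual)
```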
  • Using Swarmplot
In [43]:
sns.set(style='whitegrid', palette='muted')
sns.swarmplot(x="Embarked", y="Age", hue="GenderClass", palette="gnuplot", data=titanic_data)
  • Using factorplot with kind = 'point'
In [44]:
sns.factorplot('Embarked','Survived', col='Pclass', hue= 'GenderClass', kind='point', data = titanic_data)
plt.show()
  • Practically all women of Pclass 2 who embarked in C and Q survived, and nearly all women of Pclass 1 survived.
  • All men of Pclass 1 and 2 who embarked in Q died, and the survival rate for men in Pclass 2 and 3 is always below 0.2.
  • For the remaining men in Pclass 1, who embarked in S and C, the survival rate is approx. 0.4.

4.7 Segment age in bins with size 10.

In [45]:
for i in range(8,0,-1):
        titanic_data.loc[ titanic_data['Age'] <= i*10, 'Age_bin'] = i
In [46]:
print(titanic_data[['Age' , 'Age_bin']].head(10))
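The loop above assigns bin i to every age in the right-closed interval ((i-1)*10, i*10]. As a cross-check, the same binning can be expressed with `pd.cut`. The sketch below runs both on a few invented ages (`ages` is a hypothetical sample) so the results can be compared side by side:

```python
import pandas as pd

ages = pd.Series([0.42, 10, 10.5, 28, 35, 80], name="Age")  # hypothetical sample ages

# Loop from the notebook, applied to the sample:
# later (smaller) thresholds overwrite earlier ones
loop_bin = pd.Series(index=ages.index, dtype=float)
for i in range(8, 0, -1):
    loop_bin[ages <= i * 10] = i

# Same bins with pd.cut: intervals (0, 10], (10, 20], ..., (70, 80]
cut_bin = pd.cut(
    ages,
    bins=list(range(0, 90, 10)),
    labels=list(range(1, 9)),
).astype(float)

print(pd.DataFrame({"Age": ages, "loop": loop_bin, "cut": cut_bin}))
```

`pd.cut` states the bin edges explicitly, which makes boundary cases such as an age of exactly 10 easier to reason about.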
In [47]:
titanic_data.plot.hexbin(x='Age_bin', y='Survived', gridsize=12)

Comparing the counts of those who survived and died with respect to the Age_bin they are in.

  • Age_bin 1: As you can see, the hexagon for Survived (1.0) is darker than the one for Died (0.0), which means more children survived than died.
  • Age_bin 3: More died than survived. The count of survivors is also the highest among all age bins (look horizontally along Survived = 1.0), which means that the largest number of people who boarded the Titanic were from this age group.
  • Age_bin >4: More people died than survived.
In [48]:
sns.barplot(x = "Age_bin", y = "Survived", hue = "Pclass", data = titanic_data)
plt.show()
  • Calculating number of people of Age_bin = 1 and 8 from each Pclass.
In [49]:
titanic_data[(titanic_data.Age_bin == 1)]['Pclass'].value_counts()
In [50]:
titanic_data[(titanic_data.Age_bin == 1)&(titanic_data.Pclass == 1)]['Survived']
In [51]:
titanic_data[(titanic_data.Age_bin == 8)]['Pclass'].value_counts()
  • Among children of age 0-10 yrs we don't have enough data points (3) in Pclass 1, therefore we discard it (blue line of Age_bin 1).
  • The number of passengers in the age group 70-80 yrs is also very small, therefore we ignore them.
  • In each Pclass, the probability of survival for small children (Age = 0-10 yrs) is higher than for the other age groups.
  • In every Age_bin (ignoring Pclass 1 of the first Age_bin, and the last Age_bin), survival probability is highest for Pclass 1 and lowest for Pclass 3.
In [52]:
sns.factorplot('Age_bin','Survived',hue='Sex',kind='point',data=titanic_data)
plt.show()

It is clear from the above graph that, across all ages, females in general have a higher probability of survival than males.

In [53]:
sns.factorplot('Age_bin','Survived', col='Pclass' , row = 'Sex', kind='point', data=titanic_data)
plt.show()

Calculating number of females from each Pclass in age group 1.

In [54]:
titanic_data[(titanic_data.Age_bin == 1) & (titanic_data.Sex =='female')]['Pclass'].value_counts()

From the factor plot:

  • Among males, the probability of survival is higher for children than for the other age groups.
  • In general for males, as Pclass increases, survival probability decreases.
  • Among female children (Age_bin == 1) in Pclass 1 there is only 1 girl, therefore we discard this point.
  • For the rest of the females, as Pclass increases, survival probability decreases.
  • You can also see the survival rate within each Pclass for males and females.

4.8 Analysing survival rate with FamilySize.

  • Using factorplot to know the survival rate on the basis of FamilySize.
In [55]:
ax = sns.factorplot(x='FamilySize', y='Survived', data=titanic_data, kind='violin', aspect=1.5, size=6, palette="Greens")
ax.set(ylabel='Percent of Passengers')
plt.title('Survival by Total Family Size')

As the size of a family increases, its chances of survival also appear to increase.
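To check this trend numerically, the survival rate can be tabulated next to the number of passengers for each FamilySize, since a rate based on very few rows is unreliable. A minimal sketch on invented rows (`toy` and the column names `rate` and `n` are hypothetical):

```python
import pandas as pd

# Hypothetical sample, not real passenger records
toy = pd.DataFrame({
    "FamilySize": [1, 1, 1, 2, 2, 4, 4, 7],
    "Survived":   [0, 0, 1, 1, 0, 1, 1, 0],
})

# rate = share of survivors in the group, n = number of passengers in the group
summary = toy.groupby("FamilySize")["Survived"].agg(rate="mean", n="count")
print(summary)
```

Running the same aggregation on `titanic_data` would show whether the trend holds across all family sizes or only over part of the range.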

4.9 Segment fare into 12 bins.

  • Using Distplot to see the distribution of Fare.
In [56]:
sns.distplot(titanic_data['Fare'],color ='g')
plt.title('Distribution of Fare')
plt.show()

We have seen that 'Fare' mostly varies between 10 and 90. We will use this information to create bins.

  • Creating a new column named 'Fare_bin' that splits 'Fare' into 12 bins, each of width 10, with all fares above 110 placed in bin 12.
In [57]:
for i in range(12,0,-1):
    titanic_data.loc[titanic_data['Fare'] <= i*10, 'Fare_bin'] = i
titanic_data.loc[titanic_data['Fare'] >110, 'Fare_bin']= 12
In [58]:
print(titanic_data[['Fare' , 'Fare_bin']].groupby('Fare_bin')['Fare'].count())
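The fare loop amounts to rounding each fare up to the next multiple of 10 and capping the result at 12. The sketch below checks that reading on a few invented fares (`fares` is a hypothetical sample) by comparing the notebook's loop with an arithmetic version built from `np.ceil` and `clip`:

```python
import numpy as np
import pandas as pd

fares = pd.Series([0.0, 7.25, 10.0, 10.5, 71.28, 115.0, 512.33], name="Fare")  # hypothetical

# Notebook logic: bin i covers ((i-1)*10, i*10]; anything above 110 goes to bin 12
loop_bin = pd.Series(index=fares.index, dtype=float)
for i in range(12, 0, -1):
    loop_bin[fares <= i * 10] = i
loop_bin[fares > 110] = 12

# Arithmetic version: round up to the next multiple of 10, then clamp to 1..12
# (the lower clamp sends a fare of 0 to bin 1, as the loop does)
ceil_bin = np.ceil(fares / 10).clip(lower=1, upper=12)

print(pd.DataFrame({"Fare": fares, "loop": loop_bin, "ceil": ceil_bin}))
```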
  • Using barplot to plot the relationship between survival rate, Fare_bin and Pclass.
In [59]:
sns.barplot(x = "Fare_bin", y = "Survived", hue = "Pclass", data = titanic_data)
plt.show()
  • As fare increases, the chances of survival also increase.
  • Also, Pclass 1 (blue color) has a higher chance of survival than the other classes.

4.10 Draw pair plot to know the joint relationship between 'Fare','Age','Pclass' and 'Survived'

In [60]:
sns.pairplot(titanic_data[["Fare","Age","Pclass","Survived"]],vars = ["Fare","Age","Pclass"],hue="Survived", dropna=True,markers=["o", "s"])
plt.title('Pair Plot')

Observing the diagonal elements,

  • More people of Pclass 1 survived than died (First peak of red is higher than blue)
  • More people of Pclass 3 died than survived (Third peak of blue is higher than red)
  • More people of age group 20-40 died than survived.
  • Most of the people paying less fare died.

4.11 Establish correlation between all the features using heatmap.

In [61]:
corr = titanic_data.select_dtypes(include='number').corr()           # Correlation is only defined for the numeric columns
plt.figure(figsize=(10,10))
sns.heatmap(corr,vmax=.8,linewidth=.01, square = True, annot = True,cmap='YlGnBu',linecolor ='black')
plt.title('Correlation between features')
  • Age and Pclass are negatively correlated with Survived.
  • FamilySize is built from Parch and SibSp only, hence the high positive correlation among them.
  • Fare and FamilySize are positively correlated with Survived.
  • Highly correlated features carry redundant information.
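When the question is which features move with survival, it can be easier to read one sorted column of the correlation matrix than the full heatmap. The sketch below uses invented rows (`toy` is hypothetical) and restricts the frame to numeric columns first, since text columns such as Sex have no Pearson correlation:

```python
import pandas as pd

# Hypothetical sample, not real passenger records
toy = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 1, 0],
    "Pclass":   [1, 1, 3, 3, 2, 3],
    "Fare":     [80.0, 60.0, 8.0, 7.0, 20.0, 9.0],
    "Sex":      ["female", "female", "male", "male", "female", "male"],
})

# Keep numeric columns only, then take the Survived column of the matrix
numeric = toy.select_dtypes(include="number")
corr_with_target = numeric.corr()["Survived"].drop("Survived").sort_values()
print(corr_with_target)
```

Sorting puts the most negatively correlated features at the top and the most positively correlated at the bottom.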

4.12 Hypothesis: Women and children are more likely to survive

On studying questions 4.1, 4.2 and 4.3 we observed that a much larger percentage of women and children than of men survived the Titanic disaster.

  • 76% of females survived.
  • 57% of children also survived.
  • Only 16% of males survived.
    Also, the survival rate for males is very low irrespective of class, and the survival rate is lower for all 3rd class passengers. Almost all women in Pclass 1 and 2 survived and nearly all men in Pclass 2 and 3 died.

5. Conclusion

  • With the help of this notebook we learnt how exploratory data analysis can be carried out using Pandas plotting.
  • We have also seen how packages like matplotlib and seaborn help to develop better insights about the data.
  • We have also seen how preprocessing helps in dealing with missing values and irregularities present in the data. We also learnt how to create new features, which will in turn help us to better predict survival.
  • We also made use of pandas profiling to generate an HTML report containing information about the various features present in the dataset.
  • We have seen the impact of columns like Age, Embarked, Fare, SibSp and Parch on the rate of survival.
  • The most important inference drawn from this analysis is that we now know which features are strongly positively or negatively correlated with survival.
  • This analysis will help us choose which machine learning model we can apply to predict survival on the test dataset.