
Logistic Regression on the Titanic Dataset

1. Problem Statement

The goal is to predict the survival of passengers aboard the RMS Titanic using logistic regression.

2. Data Loading and Description


  • The dataset contains information about the passengers aboard the famous RMS Titanic. The variables include age, sex, fare, ticket number, and so on.
  • The dataset comprises 891 observations of 12 columns. Below is a table showing the names of all the columns and their descriptions.

| Column Name | Description |
| ----------- | ----------- |
| PassengerId | Passenger identity |
| Survived | Whether the passenger survived or not |
| Pclass | Class of ticket |
| Name | Name of passenger |
| Sex | Sex of passenger |
| Age | Age of passenger |
| SibSp | Number of siblings and/or spouses travelling with the passenger |
| Parch | Number of parents and/or children travelling with the passenger |
| Ticket | Ticket number |
| Fare | Price of ticket |
| Cabin | Cabin number |
| Embarked | Port of embarkation |

Importing packages
In [1]:
import numpy as np                                                 # Implements multi-dimensional arrays and matrices
import pandas as pd                                                # For data manipulation and analysis
# import pandas_profiling
import matplotlib.pyplot as plt                                    # Plotting library for the Python programming language and its numerical mathematics extension NumPy
import seaborn as sns                                              # Provides a high level interface for drawing attractive and informative statistical graphics
%matplotlib inline
sns.set()

# from subprocess import check_output
Importing the Dataset
In [2]:
titanic_data = pd.read_csv("https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Casestudy/titanic_train.csv")     # Importing training dataset using pd.read_csv

In [3]:
titanic_data.head()
Out[3]:
In [4]:
titanic_data['Embarked'].isna().sum()
Out[4]:
2
In [5]:
titanic_data['Age'].isna().sum()
Out[5]:
177
In [6]:
titanic_data['Fare'].isna().sum()
Out[6]:
0
In [7]:
titanic_data['Cabin'].isna().sum()
Out[7]:
687
In [8]:
titanic_data.shape
Out[8]:
(891, 12)

3. Preprocessing the data

  • Dealing with missing values
    • Replacing missing entries of Embarked with the most frequent value (mode).
    • Replacing missing values of Age and Fare with median values.
    • Dropping the column 'Cabin' as it has too many null values.
In [9]:
titanic_data.Embarked = titanic_data.Embarked.fillna(titanic_data['Embarked'].mode()[0])
In [10]:
median_age = titanic_data.Age.median()
median_fare = titanic_data.Fare.median()
titanic_data.Age.fillna(median_age, inplace = True)
titanic_data.Fare.fillna(median_fare, inplace = True)
In [11]:
titanic_data.drop('Cabin', axis = 1,inplace = True)
In [12]:
titanic_data['SibSp'].value_counts()
Out[12]:
0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: SibSp, dtype: int64
In [13]:
titanic_data['Parch'].value_counts()
Out[13]:
0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: Parch, dtype: int64
  • Creating a new feature named FamilySize.
In [14]:
titanic_data['FamilySize'] = titanic_data['SibSp'] + titanic_data['Parch']+1
  • Segmenting the Sex column by Age to create GenderClass: passengers younger than 15 are labelled 'child', while everyone aged 15 or older is labelled 'male' or 'female' according to their sex.
In [15]:
titanic_data.head()
Out[15]:
In [16]:
titanic_data['GenderClass'] = titanic_data.apply(lambda x: 'child' if x['Age'] < 15 else x['Sex'],axis=1)
In [17]:
titanic_data[titanic_data.Age<15].head(2)
Out[17]:
In [18]:
titanic_data[titanic_data.Age>15].head(2)
Out[18]:
In [19]:
titanic_data['GenderClass'].value_counts()
Out[19]:
male      538
female    275
child      78
Name: GenderClass, dtype: int64
In [20]:
titanic_data['Embarked'].value_counts()
Out[20]:
S    646
C    168
Q     77
Name: Embarked, dtype: int64
In [21]:
titanic_data.head()
Out[21]:
  • One-hot encoding (dummification) of GenderClass & Embarked, dropping the first level of each.
In [22]:
titanic_data = pd.get_dummies(titanic_data, columns=['GenderClass','Embarked'], drop_first=True)
In [23]:
titanic_data.head()
Out[23]:
  • Dropping columns 'Name' , 'Ticket' , 'Sex' , 'SibSp' and 'Parch'
In [24]:
titanic = titanic_data.drop(['Name','Ticket','Sex','SibSp','Parch'], axis = 1)
titanic.head()
Out[24]:

Drawing a pair plot to examine the joint relationships between 'Fare', 'Age', 'Pclass' & 'Survived'

In [25]:
sns.pairplot(titanic_data[["Fare","Age","Pclass","Survived"]],vars = ["Fare","Age","Pclass"],hue="Survived", dropna=True,markers=["o", "s"])
plt.title('Pair Plot')
Out[25]:
Text(0.5, 1.0, 'Pair Plot')
Notebook Image

Observing the diagonal elements,

  • More people of Pclass 1 survived than died (First peak of red is higher than blue)
  • More people of Pclass 3 died than survived (Third peak of blue is higher than red)
  • More people of age group 20-40 died than survived.
  • Most of the people who paid lower fares died.

Establishing the correlation between all the features using a heatmap.

In [26]:
corr = titanic_data.corr()
plt.figure(figsize=(10,10))
sns.heatmap(corr,vmax=.8,linewidth=.01, square = True, annot = True,cmap='YlGnBu',linecolor ='black')
plt.title('Correlation between features')
Out[26]:
Text(0.5, 1.0, 'Correlation between features')
Notebook Image
  • Age and Pclass are negatively correlated with Survived.
  • FamilySize is derived from Parch and SibSp, hence the high positive correlation among those three features.
  • Fare and FamilySize are positively correlated with Survived.
  • Highly correlated features introduce redundancy; a rough sketch for flagging such pairs follows below.
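As a rough sketch (not part of the original notebook) of how such redundancy could be flagged programmatically, the cell below lists feature pairs whose absolute correlation exceeds an arbitrary 0.7 cutoff, reusing the corr matrix computed above:

    import numpy as np

    corr_abs = corr.abs()

    # Keep only the upper triangle so every feature pair appears exactly once
    upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))

    redundant_pairs = (
        upper.stack()                     # (feature_1, feature_2) -> |correlation|
             .loc[lambda s: s > 0.7]      # keep only strongly correlated pairs
             .sort_values(ascending=False)
    )
    print(redundant_pairs)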

4. Logistic Regression

4.1 Introduction to Logistic Regression

Logistic regression is a technique used for solving classification problems.
Classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training dataset containing observations (or instances) whose category membership is known.
For example to predict:
Whether an email is spam (1) or not (0) or,
Whether the tumor is malignant (1) or not (0)
Below is a pictorial representation (figure omitted) of a basic logistic regression model that classifies a set of images as happy or sad.

Both linear regression and logistic regression are supervised learning techniques, but in a regression problem the output is continuous, unlike a classification problem, where the output is discrete.

  • Logistic regression is used when the dependent variable (target) is categorical.
  • The sigmoid function, or logistic function, is used as the hypothesis function for logistic regression. Below is a figure (omitted here) showing the difference between linear regression and logistic regression; notice that logistic regression produces a logistic curve, which is limited to values between 0 and 1.

4.2 Mathematics behind Logistic Regression

The odds of an event are the ratio of the probability that the event occurs to the probability that it does not occur:

    odds = p / (1 - p)

For linear regression, the continuous response is modeled as a linear combination of the features:

    y = β0 + β1x

For logistic regression, the log-odds of a categorical response being "true" (1) are modeled as a linear combination of the features:

    log(p / (1 - p)) = β0 + β1x

This is called the logit function. Solving for the probability p gives:

    p = e^(β0 + β1x) / (1 + e^(β0 + β1x)) = 1 / (1 + e^-(β0 + β1x))

Shown below (figure omitted) is a plot comparing the linear model and the logistic model.
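To make these formulas concrete, here is a minimal sketch (not part of the original notebook) that evaluates the logistic function with NumPy; the coefficients β0 and β1 below are arbitrary illustrative values, not fitted ones.

    import numpy as np

    def sigmoid(z):
        # Logistic (sigmoid) function: maps any real number into the interval (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    beta0, beta1 = -1.5, 0.8                 # arbitrary illustrative coefficients
    x = np.linspace(-10, 10, 9)

    p = sigmoid(beta0 + beta1 * x)           # modelled probability of the positive class
    print(p.min(), p.max())                  # both values lie strictly between 0 and 1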

In other words:

  • Logistic regression outputs the probabilities of a specific class.
  • Those probabilities can be converted into class predictions.

The logistic function has some nice properties:

  • Takes on an "s" shape
  • Output is bounded by 0 and 1

We have covered how this works for binary classification problems (two response classes). But what about multi-class classification problems (more than two response classes)?

  • The most common solution is "one-vs-all" (also known as "one-vs-rest"): decompose the problem into multiple binary classification problems.
  • Multinomial logistic regression can solve this as a single problem; a short sketch of both approaches is shown below.
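As a hedged illustration (not from the original notebook), the sketch below shows both approaches in scikit-learn, using the iris dataset as a stand-in multi-class problem; max_iter=1000 is simply a safe choice to ensure convergence.

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    X_iris, y_iris = load_iris(return_X_y=True)      # three response classes

    # One-vs-rest: one binary logistic regression per class
    ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_iris, y_iris)

    # Multinomial (softmax) logistic regression: a single joint model
    multi = LogisticRegression(multi_class='multinomial', max_iter=1000).fit(X_iris, y_iris)

    print(ovr.predict(X_iris[:5]))
    print(multi.predict(X_iris[:5]))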

4.3 Applications of Logistic Regression

Logistic regression was used in the biological sciences in the early twentieth century. It was then used in many social science applications. For instance,

  • The Trauma and Injury Severity Score (TRISS), which is widely used to predict mortality in injured patients, was originally developed by Boyd et al. using logistic regression.
  • Many other medical scales used to assess severity of a patient have been developed using logistic regression.
  • Logistic regression may be used to predict the risk of developing a given disease (e.g. diabetes; coronary heart disease), based on observed characteristics of the patient (age, sex, body mass index, results of various blood tests, etc.).

Nowadays, logistic regression has the following applications:

  1. Image segmentation and categorization
  2. Geographic image processing
  3. Handwriting recognition
  4. Detection of myocardial infarction
  5. Predicting whether a person is depressed or not based on a bag of words from a corpus

The reason why logistic regression is still widely used, despite state-of-the-art deep neural networks, is that it is very efficient and does not require much computational power, which makes it affordable to run in production.

In [27]:
titanic.head()
Out[27]:
In [28]:
# titanic.drop('PassengerId',axis=1, inplace=True)
In [29]:
# titanic.head()

4.4 Preparing X and y using pandas

In [30]:
X = titanic.loc[:,titanic.columns != 'Survived']
X.head()
Out[30]:
In [31]:
y = titanic.Survived 
# y = titanic['Survived']
In [32]:
X.shape
Out[32]:
(891, 11)

4.5 Splitting X and y into training and test datasets.

In [33]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
In [34]:
print(X_train.shape)
print(y_train.shape)
(712, 11) (712,)
In [35]:
print(X_test.shape)
print(y_test.shape)
(179, 11) (179,)
In [36]:
X_train.head(10)
Out[36]:
In [37]:
X_test.head(10)
Out[37]:
In [38]:
y_train.shape
Out[38]:
(712,)

4.6 Logistic regression in scikit-learn

To apply a machine learning algorithm to your dataset, there are basically 4 steps:

  1. Load the algorithm
  2. Instantiate and Fit the model to the training dataset
  3. Prediction on the test set
  4. Calculating the accuracy of the model

The code block given below shows how these steps are carried out:

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    logreg = LogisticRegression()              # instantiate the model
    logreg.fit(X_train, y_train)               # fit it to the training data
    y_pred_test = logreg.predict(X_test)       # predict on the test set
    accuracy_score(y_test, y_pred_test)        # calculate the accuracy
In [39]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
The call to logreg.fit raised an error inside scikit-learn's L-BFGS solver (full traceback shortened here):

    AttributeError: 'str' object has no attribute 'decode'

This is not a problem with the notebook code itself: it is a known incompatibility between older scikit-learn releases and newer SciPy versions, in which the optimizer's result message is already a string and no longer needs decoding. Upgrading scikit-learn (or pinning a compatible SciPy version) typically resolves it; the cells below assume the model was fitted successfully.

4.7 Using the Model for Prediction

In [ ]:
y_pred_train = logreg.predict(X_train)  
In [ ]:
y_pred_test = logreg.predict(X_test)                                                           # make predictions on the testing set
  • We need an evaluation metric in order to compare our predictions with the actual values.

5. Model evaluation

Error is the deviation of the values predicted by the model from the true values.
We will use the accuracy score and the confusion matrix for evaluation.

5.1 Model Evaluation using accuracy classification score

In [ ]:
from sklearn.metrics import accuracy_score
print('Accuracy score for test data is:', accuracy_score(y_test,y_pred_test))

5.2 Model Evaluation using confusion matrix

A confusion matrix is a summary of prediction results on a classification problem.

The number of correct and incorrect predictions are summarized with count values and broken down by each class.
Below is a diagram (omitted here) showing a general confusion matrix.

In [ ]:
from sklearn.metrics import confusion_matrix

# Store the matrix as a DataFrame under a different name, so the imported
# confusion_matrix function is not shadowed
cm = pd.DataFrame(confusion_matrix(y_test, y_pred_test))

print(cm)
In [ ]:
cm.index = ['Actual Died','Actual Survived']
cm.columns = ['Predicted Died','Predicted Survived']
print(cm)

This means 93 + 48 = 141 correct predictions & 25 + 13 = 38 false predictions.
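As a quick sanity check (an added note, not from the original notebook), the overall accuracy can be recovered directly from these counts:

    correct = 93 + 48                # diagonal entries: correctly predicted Died and Survived
    total = 93 + 48 + 25 + 13        # all 179 passengers in the test set
    print(correct / total)           # ≈ 0.788, matching the ~0.79 accuracy quoted below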

Adjusting the threshold for predicting Died or Survived.

  • In section 4.7 we used the .predict method for classification. This method uses 0.5 as the default threshold for prediction.
  • Now we are going to see the impact of changing the threshold on the accuracy of our logistic regression model.
  • To do this, we will use the .predict_proba method instead of the .predict method.

Setting the threshold to 0.75

In [ ]:
logreg.predict_proba(X_test)
In [ ]:
logreg.predict_proba(X_test)[:,1]
In [ ]:
logreg.predict_proba(X_test)[:,1]> 0.75
In [ ]:
preds1 = np.where(logreg.predict_proba(X_test)[:,1]> 0.75,1,0)
print('Accuracy score for test data is:', accuracy_score(y_test,preds1))
In [ ]:
preds1
In [ ]:
np.array(y_test)

The accuracy has dropped significantly, from 0.79 to 0.73. Hence, 0.75 is not a good threshold for our model.

Setting the threshold to 0.25

In [ ]:
preds2 = np.where(logreg.predict_proba(X_test)[:,1]> 0.25,1,0)
print('Accuracy score for test data is:', accuracy_score(y_test,preds2))

The accuracy has been reduced from 0.79 to 0.75. Hence, 0.25 is also not a good threshold for our model.
Later on we will see methods to identify the best threshold; a small preview sketch is given below.
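As a preview, here is a rough sketch (not part of the original notebook) that scans a grid of candidate thresholds and reports the one with the highest accuracy; in practice this tuning should be done on a validation set or via ROC analysis rather than on the test set.

    import numpy as np
    from sklearn.metrics import accuracy_score

    probs = logreg.predict_proba(X_test)[:, 1]              # predicted probability of Survived
    thresholds = np.arange(0.05, 0.96, 0.05)

    scores = [accuracy_score(y_test, (probs > t).astype(int)) for t in thresholds]
    best_t = thresholds[int(np.argmax(scores))]
    print('Best threshold:', round(best_t, 2), 'with accuracy:', round(max(scores), 3))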