
The goal is to **predict survival** of passengers aboard the RMS **Titanic** using **logistic regression**.

- The dataset consists of information about people aboard the famous RMS Titanic. The variables present in the dataset include age, sex, fare, ticket, etc.
- The dataset comprises **891 observations of 12 columns**. Below is a table showing the names of all the columns and their descriptions.

| Column Name | Description |
| ------------- | ------------- |
| PassengerId | Passenger identity |
| Survived | Whether the passenger survived or not |
| Pclass | Class of ticket |
| Name | Name of passenger |
| Sex | Sex of passenger |
| Age | Age of passenger |
| SibSp | Number of siblings and/or spouses travelling with the passenger |
| Parch | Number of parents and/or children travelling with the passenger |
| Ticket | Ticket number |
| Fare | Price of ticket |
| Cabin | Cabin number |
| Embarked | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |

In [1]:

```
import numpy as np               # Implements multi-dimensional arrays and matrices
import pandas as pd              # For data manipulation and analysis
# import pandas_profiling
import matplotlib.pyplot as plt  # Plotting library for Python and its numerical mathematics extension NumPy
import seaborn as sns            # Provides a high-level interface for drawing attractive and informative statistical graphics
%matplotlib inline
sns.set()
# from subprocess import check_output
```

In [2]:

`titanic_data = pd.read_csv("https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Casestudy/titanic_train.csv") # Importing training dataset using pd.read_csv`

In [3]:

`titanic_data.head()`

Out[3]:

In [4]:

`titanic_data['Embarked'].isna().sum()`

Out[4]:

`2`

In [5]:

`titanic_data['Age'].isna().sum()`

Out[5]:

`177`

In [6]:

`titanic_data['Fare'].isna().sum()`

Out[6]:

`0`

In [7]:

`titanic_data['Cabin'].isna().sum()`

Out[7]:

`687`

In [8]:

`titanic_data.shape`

Out[8]:

`(891, 12)`

- Dealing with missing values:
  - Replacing missing entries of **Embarked** with the most frequent value (mode).
  - Replacing missing values of **Age** and **Fare** with median values.
  - Dropping the column **Cabin**, as it has too many *null* values.
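
The same per-column checks can be done in a single call; a minimal equivalent of the individual `isna().sum()` cells above:

```
# Missing-value counts for every column at once
titanic_data.isnull().sum()
```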

In [9]:

`titanic_data.Embarked = titanic_data.Embarked.fillna(titanic_data['Embarked'].mode()[0])`

In [10]:

```
median_age = titanic_data.Age.median()
median_fare = titanic_data.Fare.median()
titanic_data.Age.fillna(median_age, inplace = True)
titanic_data.Fare.fillna(median_fare, inplace = True)
```

In [11]:

`titanic_data.drop('Cabin', axis = 1,inplace = True)`

In [12]:

`titanic_data['SibSp'].value_counts()`

Out[12]:

```
0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: SibSp, dtype: int64
```

In [13]:

`titanic_data['Parch'].value_counts()`

Out[13]:

```
0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: Parch, dtype: int64
```

- Creating a new feature named **FamilySize** (SibSp + Parch + 1, counting the passenger themselves).

In [14]:

`titanic_data['FamilySize'] = titanic_data['SibSp'] + titanic_data['Parch'] + 1`

- Segmenting the **Sex** column by **Age**: passengers with age less than 15 are labelled **child**; the rest are labelled **male** or **female** as per their gender.

In [15]:

`titanic_data.head()`

Out[15]:

In [16]:

`titanic_data['GenderClass'] = titanic_data.apply(lambda x: 'child' if x['Age'] < 15 else x['Sex'],axis=1)`

In [17]:

`titanic_data[titanic_data.Age<15].head(2)`

Out[17]:

In [18]:

`titanic_data[titanic_data.Age>15].head(2)`

Out[18]:

In [19]:

`titanic_data['GenderClass'].value_counts()`

Out[19]:

```
male      538
female    275
child      78
Name: GenderClass, dtype: int64
```

In [20]:

`titanic_data['Embarked'].value_counts()`

Out[20]:

```
S    646
C    168
Q     77
Name: Embarked, dtype: int64
```

In [21]:

`titanic_data.head()`

Out[21]:

**Dummification** (one-hot encoding) of **GenderClass** & **Embarked**.

In [22]:

`titanic_data = pd.get_dummies(titanic_data, columns=['GenderClass','Embarked'], drop_first=True)`

In [23]:

`titanic_data.head()`

Out[23]:

**Dropping** columns **'Name', 'Ticket', 'Sex', 'SibSp' and 'Parch'**.

In [24]:

```
titanic = titanic_data.drop(['Name','Ticket','Sex','SibSp','Parch'], axis = 1)
titanic.head()
```

Out[24]:

Drawing a **pair plot** to show the joint relationship between **'Fare', 'Age', 'Pclass' & 'Survived'**.

In [25]:

```
sns.pairplot(titanic_data[["Fare", "Age", "Pclass", "Survived"]], vars=["Fare", "Age", "Pclass"],
             hue="Survived", dropna=True, markers=["o", "s"])
plt.title('Pair Plot')
```

Out[25]:

`Text(0.5, 1.0, 'Pair Plot')`

Observing the diagonal elements:

- More people of **Pclass 1** *survived* than died (the first peak of red is higher than blue).
- More people of **Pclass 3** *died* than survived (the third peak of blue is higher than red).
- More people of age group **20-40 died** than survived.
- Most of the people paying **less fare died**.

Establishing the **correlation** between all the features using a **heatmap**.

In [26]:

```
corr = titanic_data.corr()
plt.figure(figsize=(10,10))
sns.heatmap(corr,vmax=.8,linewidth=.01, square = True, annot = True,cmap='YlGnBu',linecolor ='black')
plt.title('Correlation between features')
```

Out[26]:

`Text(0.5, 1.0, 'Correlation between features')`

- **Age and Pclass are negatively correlated with Survived.**
- **Fare and FamilySize** are **positively correlated with Survived.**
- FamilySize is derived from Parch and SibSp only, hence the high positive correlation among the three.
- With highly correlated features we face **redundancy** issues.
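
To read these relationships off directly, the correlations with `Survived` can be sorted; a one-line sketch reusing the `corr` frame computed above:

```
# Rank features by their correlation with the target
corr['Survived'].sort_values(ascending=False)
```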

Logistic regression is a technique used for solving the **classification problem**.

Classification is the problem of **identifying** to which of a set of **categories** a new observation belongs, on the basis of a *training dataset* containing observations (or instances) whose categorical membership is known.

For example, to predict:

**Whether an email is spam (1) or not (0)**, or

**Whether a tumor is malignant (1) or not (0)**.

Below is a pictorial representation of a basic logistic regression model that classifies a set of images into two categories.

Both linear regression and logistic regression are **supervised learning techniques**. But in a *regression* problem the output is **continuous**, unlike a *classification* problem where the output is **discrete**.

- Logistic regression is used when the **dependent variable (target) is categorical**.
- The **sigmoid function**, or logistic function, is used as the *hypothesis function* for logistic regression.

Below is a figure showing the difference between linear regression and logistic regression. Notice that logistic regression produces a logistic curve, which is limited to values between 0 and 1.

The **odds** for an event is the **(probability of the event occurring) / (probability of the event not occurring)**:

**odds = p / (1 - p)**

For **linear regression**, a continuous response is modeled as a linear combination of the features: **y = β0 + β1x**

For **logistic regression**, the log-odds of a categorical response being "**true**" (1) is modeled as a linear combination of the features:

**log(p / (1 - p)) = β0 + β1x**

This is called the **logit function**.

On solving for the probability (p) you get:

**p = e^(β0 + β1x) / (1 + e^(β0 + β1x)) = 1 / (1 + e^-(β0 + β1x))**

Shown below is a plot comparing the **linear model** and the **logistic model**.

In other words:

- Logistic regression outputs the **probabilities of a specific class**.
- Those probabilities can be converted into **class predictions**.

The logistic function has some nice properties:

- It takes on an **"s"** shape.
- Its output is bounded by **0 and 1**.
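
As a quick illustration, a minimal sketch (using only NumPy and Matplotlib, both imported earlier) that plots the sigmoid and shows both properties:

```
import numpy as np
import matplotlib.pyplot as plt

# Plot the sigmoid to see its "s" shape and [0, 1] bounds
z = np.linspace(-10, 10, 200)
p = 1 / (1 + np.exp(-z))   # sigmoid: 1 / (1 + e^(-z))

plt.plot(z, p)
plt.xlabel('z')
plt.ylabel('sigmoid(z)')
plt.title('The logistic (sigmoid) function')
plt.show()
```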

We have covered how this works for binary classification problems (two response classes). But what about **multi-class classification problems** (more than two response classes)?

- The most common solution is **"one-vs-all"** (also known as **"one-vs-rest"**): decompose the problem into multiple binary classification problems.
- **Multinomial logistic regression** can solve this as a single problem (see the sketch below).

Logistic regression was used in the **biological sciences** in the early twentieth century. It was then used in many social science applications. For instance:

- The Trauma and Injury Severity Score (TRISS), which is widely used to **predict mortality in injured patients**, was originally developed by Boyd et al. using logistic regression.
- Many other medical scales used to **assess the severity** of a patient's condition have been developed using logistic regression.
- Logistic regression may be used to **predict the risk of developing a given disease** (e.g. diabetes or coronary heart disease), based on observed characteristics of the patient (age, sex, body mass index, results of various blood tests, etc.).

Nowadays, logistic regression has the following applications:

- Image segmentation and categorization
- Geographic image processing
- Handwriting recognition
- Detection of myocardial infarction
- Predicting whether a person is depressed or not based on a bag of words from a corpus

The reason logistic regression is widely used, despite the state of the art in deep neural networks, is that it is very **efficient** and does **not** require many **computational resources**, which makes it **affordable** to run in production.

In [27]:

`titanic.head()`

Out[27]:

In [28]:

`# titanic.drop('PassengerId',axis=1, inplace=True)`

In [29]:

`# titanic.head()`

In [30]:

```
X = titanic.loc[:,titanic.columns != 'Survived']
X.head()
```

Out[30]:

In [31]:

```
y = titanic.Survived
# y = titanic['Survived']
```

In [32]:

`X.shape`

Out[32]:

`(891, 11)`

In [33]:

```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
```

In [34]:

```
print(X_train.shape)
print(y_train.shape)
```

```
(712, 11)
(712,)
```

In [35]:

```
print(X_test.shape)
print(y_test.shape)
```

```
(179, 11)
(179,)
```

In [36]:

`X_train.head(10)`

Out[36]:

In [37]:

`X_test.head(10)`

Out[37]:

In [38]:

`y_train.shape`

Out[38]:

`(712,)`

To apply any machine learning algorithm to your dataset, there are basically 4 steps:

- Load the algorithm
- Instantiate and fit the model on the training dataset
- Predict on the test set
- Calculate the accuracy of the model

The code block given below shows how these steps are carried out:

```
from sklearn.linear_model import LogisticRegression   # 1. load the algorithm
from sklearn.metrics import accuracy_score

logreg = LogisticRegression()                         # 2. instantiate and fit the model
logreg.fit(X_train, y_train)                          #    on the training dataset
y_pred_test = logreg.predict(X_test)                  # 3. predict on the test set
accuracy_score(y_test, y_pred_test)                   # 4. calculate the accuracy of the model
```
In [39]:

```
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
```

```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-39-cf3d9f23466f> in <module>
      1 from sklearn.linear_model import LogisticRegression
      2 logreg = LogisticRegression()
----> 3 logreg.fit(X_train,y_train)

... (internal scikit-learn / joblib frames omitted) ...

~\anaconda3\lib\site-packages\sklearn\utils\optimize.py in _check_optimize_result(solver, result, max_iter, extra_warning_msg)
--> 243 ).format(solver, result.status, result.message.decode("latin1"))

AttributeError: 'str' object has no attribute 'decode'
```

Note: this error is an environment problem (a version incompatibility between the installed scikit-learn and SciPy), not a problem with the data or the code above; upgrading scikit-learn typically resolves it. The cells below assume this cell has been re-run successfully in a fixed environment, so that `logreg` is a fitted model.

In [ ]:

`y_pred_train = logreg.predict(X_train) # make predictions on the training set`

In [ ]:

`y_pred_test = logreg.predict(X_test) # make predictions on the testing set`

- We need an evaluation metric in order to compare our predictions with the actual values.

**Error** is the *deviation* of the values *predicted* by the model from the *true* values.

We will use the **accuracy score** and the **confusion matrix** for evaluation.

In [ ]:

```
from sklearn.metrics import accuracy_score
print('Accuracy score for test data is:', accuracy_score(y_test,y_pred_test))
```

A **confusion matrix** is a **summary** of prediction results on a classification problem.

The number of correct and incorrect predictions are summarized with count values and broken down by each class.

A general confusion matrix for a binary problem has the layout shown below.

| | Predicted Negative | Predicted Positive |
| --- | --- | --- |
| **Actual Negative** | True Negative (TN) | False Positive (FP) |
| **Actual Positive** | False Negative (FN) | True Positive (TP) |

In [ ]:

```
from sklearn.metrics import confusion_matrix

# Use a different name to avoid shadowing the imported confusion_matrix function
cm = pd.DataFrame(confusion_matrix(y_test, y_pred_test))
print(cm)
```

In [ ]:

```
cm.index = ['Actual Died', 'Actual Survived']
cm.columns = ['Predicted Died', 'Predicted Survived']
print(cm)
```

This means 93 + 48 = **141 correct predictions** & 25 + 13 = **38 false predictions**.
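
As a sanity check, the reported accuracy can be recomputed from these counts with simple arithmetic:

```
# Accuracy recomputed from the confusion-matrix counts
(93 + 48) / (93 + 48 + 25 + 13)   # = 141 / 179 ≈ 0.79
```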

**Adjusting the threshold** for predicting Died or Survived.

- In section 4.7 we used the **.predict** method for classification. This method takes 0.5 as the default threshold for prediction.
- Now we are going to see the impact of changing the threshold on the accuracy of our logistic regression model.
- For this we are going to use the **.predict_proba** method instead of the **.predict** method.

Setting the threshold to **0.75**

In [ ]:

`logreg.predict_proba(X_test)`

In [ ]:

`logreg.predict_proba(X_test)[:,1]`

In [ ]:

`logreg.predict_proba(X_test)[:,1]> 0.75`

In [ ]:

```
preds1 = np.where(logreg.predict_proba(X_test)[:,1]> 0.75,1,0)
print('Accuracy score for test data is:', accuracy_score(y_test,preds1))
```

In [ ]:

`preds1`

In [ ]:

`np.array(y_test)`

The accuracy has been **reduced** significantly, changing from **0.79 to 0.73**. Hence, 0.75 is **not a good threshold** for our model.

Setting the threshold to **0.25**

In [ ]:

```
preds2 = np.where(logreg.predict_proba(X_test)[:,1]> 0.25,1,0)
print('Accuracy score for test data is:', accuracy_score(y_test,preds2))
```

The accuracy has been **reduced**, changing from **0.79 to 0.75**. Hence, 0.25 is also **not a good threshold** for our model.

Later on we will see methods to identify the best threshold.
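
As a preview, a minimal brute-force sketch (assuming the fitted `logreg` from above) that scans candidate thresholds and keeps the one with the best test accuracy; note that tuning the threshold on the test set is optimistic, so in practice a held-out validation set should be used:

```
import numpy as np
from sklearn.metrics import accuracy_score

# Scan candidate thresholds and report the one with the best test accuracy
probs = logreg.predict_proba(X_test)[:, 1]
thresholds = np.linspace(0.1, 0.9, 81)          # 0.10, 0.11, ..., 0.90
scores = [accuracy_score(y_test, (probs > t).astype(int)) for t in thresholds]
best = thresholds[int(np.argmax(scores))]
print('Best threshold:', round(best, 2), 'with accuracy:', round(max(scores), 4))
```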