Objective

Forest fires play a role in the natural cycle of forest growth and replenishment. They clear dead trees, leaves, and competing vegetation from the forest floor so new plants can grow, and they remove weak or disease-ridden trees, leaving more space and nutrients for stronger trees.

But when fires burn too hot and uncontrollably, or when they occur in the “wildland-urban interface” (the places where woodlands meet homes or other developed areas), they can be damaging and life-threatening.

In this kernel, our aim is to predict the burned area (area) of forest fires in the northeast region of Portugal, based on the spatial, temporal, and weather variables recorded where the fire is spotted.

This prediction can be used to decide how many firefighting resources to send to an incident and to judge the urgency of the situation.

Further read:

  1. Mylandplan
  2. KNIME
In [ ]:
target = 'area'

Define the metrics

RMSE

RMSE is the most popular evaluation metric used in regression problems. It assumes that the errors are unbiased and follow a normal distribution.

Further read: https://www.analyticsvidhya.com/blog/2019/08/11-important-model-evaluation-error-metrics/
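
As a quick illustration (with made-up numbers, not from this dataset), RMSE is simply the square root of the mean squared error:
In [ ]:
# Minimal RMSE sketch on made-up values (illustration only)
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([2.0, 0.5, 7.1, 0.0])
y_pred = np.array([1.8, 0.7, 6.4, 0.3])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE: {rmse:.3f}")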

Dependencies

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

import statsmodels.api as sm
from scipy.stats import zscore
from statsmodels.stats.stattools import durbin_watson
from sklearn.model_selection import train_test_split,KFold
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression,Ridge,Lasso,ElasticNet
In [1]:
import jovian

Load and describe data

In [ ]:
#path = 'forestfires.csv'
path = "../input/forest-fires-data-set/forestfires.csv"
df = pd.read_csv(path)

df.shape
In [ ]:
df.dtypes
In [ ]:
df.describe().T

Missing value treatment

In [ ]:
df.isna().sum()

Exploratory Data Analysis

We will try out the following analyses on our dataset:

  • Univariate
  • Bivariate
  • Multivariate
In [ ]:
plt.rcParams["figure.figsize"] = 9,5

Univariate analysis

Let's begin with the target variable, area.
In [ ]:
plt.figure(figsize=(16,5))
print("Skew: {}".format(df[target].skew()))
print("Kurtosis: {}".format(df[target].kurtosis()))
ax = sns.kdeplot(df[target],shade=True,color='g')
plt.xticks([i for i in range(0,1200,50)])
plt.show()
In [ ]:
ax = sns.boxplot(df[target])

Few observations:

  • The data is highly skewed, with a skewness of +12.84 and a huge kurtosis of 194.

  • This tells us that the majority of forest fires do not cover a large area; most of the damaged areas are under 50 hectares.

  • We can apply a transformation to fix the skewness and kurtosis; however, we will have to inverse-transform the predictions before reporting the output (see the sketch after this list).

  • Outlier check: there are 4 outlier instances in our area column, but the question is whether we should drop them or not. (We will get back to this in the outlier treatment step.)

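A minimal sketch of the transformation idea from the list above: a log1p transform of the target and the expm1 inverse that would be needed before reporting predictions (the actual transformation is applied later, in the outlier treatment step):
In [ ]:
# Sketch only: log1p handles the zero-area rows, expm1 inverts the transform.
area_log = np.log1p(df[target])
print("Skew after log1p:", area_log.skew())

area_back = np.expm1(area_log)   # back to the original hectare scale
print("Max round-trip error:", (area_back - df[target]).abs().max())
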
In [ ]:
# Outlier points
y_outliers = df[abs(zscore(df[target])) >= 3 ]
y_outliers
Independent columns
In [ ]:
dfa = df.drop(columns=target)
cat_columns = dfa.select_dtypes(include='object').columns.tolist()
num_columns = dfa.select_dtypes(exclude='object').columns.tolist()

cat_columns,num_columns
Categorical columns
In [ ]:
# analyzing categorical columns
plt.figure(figsize=(16,10))
for i,col in enumerate(cat_columns,1):
    plt.subplot(2,2,i)
    sns.countplot(data=dfa,y=col)
    plt.subplot(2,2,i+2)
    df[col].value_counts(normalize=True).plot.bar()
    plt.ylabel(col)
    plt.xlabel('% distribution per category')
plt.tight_layout()
plt.show()    
  1. It is interesting to see that an abnormally high number of forest fires occur in the months of August and September.

  2. In the case of day, Friday through Monday have a higher proportion of cases (however, this is not a strong indicator).

Numerical Columns
In [ ]:
plt.figure(figsize=(18,40))
for i,col in enumerate(num_columns,1):
    plt.subplot(8,4,i)
    sns.kdeplot(df[col],color='g',shade=True)
    plt.subplot(8,4,i+10)
    sns.boxplot(df[col])
plt.tight_layout() 
plt.show()
num_data = df[num_columns]
pd.DataFrame(data=[num_data.skew(),num_data.kurtosis()],index=['skewness','kurtosis'])

Outliers and high (positive or negative) skewness and kurtosis were observed in the following columns:

  1. FFMC
  2. ISI
  3. rain

Bivariate analysis with our target variable

In [ ]:
print(df['area'].describe(),'\n')
print(y_outliers)
In [ ]:
# a categorical variable based on forest fire area damage
# No damage, low, moderate, high, very high
def area_cat(area):
    if area == 0.0:
        return "No damage"
    elif area <= 1:
        return "low"
    elif area <= 25:
        return "moderate"
    elif area <= 100:
        return "high"
    else:
        return "very high"

df['damage_category'] = df['area'].apply(area_cat)
df.head()
Categorical columns
In [ ]:
cat_columns
In [ ]:
for col in cat_columns:
    cross = pd.crosstab(index=df['damage_category'],columns=df[col],normalize='index')
    cross.plot.barh(stacked=True,rot=40,cmap='hot')
    plt.xlabel('% distribution per category')
    plt.xticks(np.arange(0,1.1,0.1))
    plt.title("Forestfire damage each {}".format(col))
#     print(cross)

plt.show()
  • Previously we observed that August and September had the highest number of forest fires. From the month plot above, we can see a few things:

    • Most of the fires in August were low damage (< 1 hectare).
    • The very high damage fires (> 100 hectares) happened in only 3 months: August, July, and September.
  • Regarding fire damage per day, not much can be observed, except that there were no very-high-damage fires on Friday, and Saturday reported the most.

Numerical columns
In [ ]:
plt.figure(figsize=(20,40))
for i,col in enumerate(num_columns,1):
    plt.subplot(10,1,i)
    if col in ['X','Y']:
        sns.swarmplot(data=df,x=col,y=target,hue='damage_category')
    else:
        sns.scatterplot(data=df,x=col,y=target,hue='damage_category')
plt.show()

Multivariate analysis

In [ ]:
selected_features = df.drop(columns=['damage_category','day','month']).columns
selected_features

In [ ]:
sns.pairplot(df,hue='damage_category',vars=selected_features)
plt.show()

Heatmap

In [ ]:

plt.figure(figsize =(16,8))

sns.heatmap(df.corr(numeric_only=True),annot=True,cmap='YlGnBu')  # numeric_only skips the object columns (month, day, damage_category)
plt.show()

Outlier treatment

We had observed outliers in the following columns:

  1. area
  2. FFMC
  3. ISI
  4. rain
In [ ]:
out_columns = ['area','FFMC','ISI','rain']

However, the above outliers are not erroneous values, so we cannot simply remove them.

In order to minimize the effect of outliers in our model we will transform the above features.

Ref: https://humansofdata.atlan.com/2018/03/when-delete-outliers-dataset/

Preparing the data for modelling

Things we will cover here:

  • Encoding the categorical columns
In [ ]:
df = pd.get_dummies(df,columns=['day','month'],drop_first=True)
  • Data transformations like log, inverse, exponential, etc.
In [ ]:
print(df[out_columns].describe())
np.log1p(df[out_columns]).skew(), np.log1p(df[out_columns]).kurtosis()
In [ ]:
# FFMC and rain still have high skew and kurtosis values;
# since we will be using a linear regression model, we cannot work with such high values,
# so for FFMC we remove the outliers using the z-score method
mask = df.loc[:,['FFMC']].apply(zscore).abs() < 3

# Since most of the values in rain are 0.0, we can convert it as a categorical column
df['rain'] = df['rain'].apply(lambda x: int(x > 0.0))

df = df[mask.values]
df.shape
In [ ]:
out_columns.remove('rain')
df[out_columns] = np.log1p(df[out_columns])
In [ ]:
df[out_columns].skew(), df[out_columns].kurtosis()

Baseline model - Linear Regression

Difference between statistical and machine learning approach

  • Machine learning produces predictions. As far as I can tell, it is not very good at drawing conclusions about general principles based on a set of observations.
  • Statistical estimation lets the practitioner make inferences (conclusions about a larger set of phenomena based on the observation of a smaller set of phenomena.) For example, in a regression model the practitioner can estimate the effect of a one unit change in an independent variable X on a dependent variable y.

Further read: https://www.quora.com/When-do-you-use-machine-learning-vs-statistical-regression

In [ ]:
X = df.drop(columns=['area','damage_category'])
y = df['area']

Statistical approach

Assumptions of linear regression in statistics
  1. Linearity of model
  2. Normality of the residuals
  3. Homoscedasticity
  4. Autocorrelation in Residuals
  5. Multicollinearity
In [ ]:
X_constant = sm.add_constant(X)
In [ ]:
# Build OLS model
lin_reg = sm.OLS(y,X_constant).fit()
lin_reg.summary()

1. Linearity of model

In [ ]:
def linearity_test(model, y):
    '''
    Function for visually inspecting the assumption of linearity in a linear regression model.
    It plots observed vs. predicted values and residuals vs. predicted values.
    
    Args:
    * model - fitted OLS model from statsmodels
    * y - observed values
    '''
    fitted_vals = model.predict()
    resids = model.resid

    fig, ax = plt.subplots(1,2,figsize=(15,5))

    sns.regplot(x=fitted_vals, y=y, lowess=True, ax=ax[0], line_kws={'color': 'red'})
    ax[0].set_title('Observed vs. Predicted Values', fontsize=16)
    ax[0].set(xlabel='Predicted', ylabel='Observed')

    sns.regplot(x=fitted_vals, y=resids, lowess=True, ax=ax[1], line_kws={'color': 'red'})
    ax[1].set_title('Residuals vs. Predicted Values', fontsize=16)
    ax[1].set(xlabel='Predicted', ylabel='Residuals')
    
linearity_test(lin_reg, y) 
plt.tight_layout()
  • By observing the plots, we can see that the linearity assumption does not hold.

  • Adding new features might improve the linearity of the model.

2. Normality of the residuals

In [ ]:
sm.qqplot(lin_reg.resid,line ='r')
plt.show()
  • We can improve the normality of the residuals by removing the outliers,
    • but in our problem we keep the outliers because of their importance.
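
As an optional complement to the Q-Q plot (not part of the original notebook), a formal normality test such as Shapiro-Wilk can be run on the residuals; a small p-value suggests departure from normality:
In [ ]:
# Optional check: Shapiro-Wilk test for normality of the OLS residuals.
from scipy.stats import shapiro

stat, p_value = shapiro(lin_reg.resid)
print(f"Shapiro-Wilk statistic: {stat:.3f}, p-value: {p_value:.4f}")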

Expectation: the mean of the residuals is zero

In [ ]:
lin_reg.resid.mean()
  • The value is very close to zero, so we can assume the mean of the residuals is zero.

3. Homoscedasticity

In [ ]:
from statsmodels.compat import lzip
import statsmodels.stats.api as sms
%matplotlib inline
%config InlineBackend.figure_format ='retina'
sns.set_style('darkgrid')
sns.mpl.rcParams['figure.figsize'] = (15.0, 9.0)

model = lin_reg
fitted_vals = model.predict()
resids = model.resid
resids_standardized = model.get_influence().resid_studentized_internal
fig, ax = plt.subplots(1,2)

sns.regplot(x=fitted_vals, y=resids, lowess=True, ax=ax[0], line_kws={'color': 'red'})
ax[0].set_title('Residuals vs Fitted', fontsize=16)
ax[0].set(xlabel='Fitted Values', ylabel='Residuals')
sns.regplot(x=fitted_vals, y=np.sqrt(np.abs(resids_standardized)), lowess=True, ax=ax[1], line_kws={'color': 'red'})
ax[1].set_title('Scale-Location', fontsize=16)
ax[1].set(xlabel='Fitted Values', ylabel='sqrt(abs(Residuals))')

name = ['F statistic', 'p-value']
test = sms.het_goldfeldquandt(model.resid, model.model.exog)
lzip(name, test)
  • To check homoscedasticity in the plots, the placement of the points should be random, with no visible pattern (no systematic increase or decrease in the spread of the residuals). In the Goldfeld-Quandt test the null hypothesis is homoscedasticity, so a p-value below 0.05 would indicate heteroscedasticity.
  • In the plots we can see particular patterns, so the homoscedasticity assumption is questionable, even though the p-value is greater than 0.05 and the test does not formally reject homoscedasticity.

4. Autocorrelation in Residuals

In [ ]:
import statsmodels.tsa.api as smt

acf = smt.graphics.plot_acf(lin_reg.resid, lags=40 , alpha=0.05)
acf.show()
  • From the plot above we can say that positive autocorrelation is present; we can reduce it by fine-tuning the model.
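
The durbin_watson function imported in the Dependencies cell can quantify this: the statistic is close to 2 when there is no first-order autocorrelation and drops toward 0 under positive autocorrelation. A minimal sketch:
In [ ]:
# Durbin-Watson statistic on the OLS residuals (~2 means no first-order autocorrelation).
dw = durbin_watson(lin_reg.resid)
print(f"Durbin-Watson statistic: {dw:.3f}")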

5. Multicollinearity

In [ ]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = [variance_inflation_factor(X_constant.values, i) for i in range(X_constant.shape[1])]
pd.DataFrame({'vif': vif[1:]}, index=X.columns)
  • There is multicollinearity between some of the features, where VIF > 5.
  • To deal with multicollinearity we should iteratively remove the features with high VIF values (a minimal sketch follows below).
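
A minimal sketch of that iterative removal (illustrative only; the notebook itself drops a hand-picked set of columns in the next section, and the threshold of 5 is an assumed cut-off):
In [ ]:
# Sketch (not applied to the modelling below): iteratively drop the feature with
# the highest VIF until every remaining VIF is at or below the chosen threshold.
def drop_high_vif(features, threshold=5.0):
    reduced = features.copy()
    while True:
        Xc = sm.add_constant(reduced).astype(float)
        vifs = pd.Series(
            [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
            index=reduced.columns,
        )
        if vifs.max() <= threshold:
            return reduced
        reduced = reduced.drop(columns=vifs.idxmax())

# Example usage (commented out so it does not change the original flow):
# X_low_vif = drop_high_vif(X)
# X_low_vif.columns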

Machine learning approach

In [ ]:
# Basic model
In [ ]:
lin_reg = LinearRegression()
lin_reg.fit(X, y)

In [ ]:
print(f'Coefficients: {lin_reg.coef_}')
print(f'Intercept: {lin_reg.intercept_}')
print(f'R^2 score: {lin_reg.score(X, y)}')
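
Since RMSE was chosen as the evaluation metric, a cross-validated RMSE (using the KFold and mean_squared_error imports from the Dependencies cell) gives a less optimistic estimate than the training R^2 above. A minimal sketch; note that area is still on the log1p scale at this point, so the RMSE is on that scale too:
In [ ]:
# Sketch (beyond the original cell): 5-fold cross-validated RMSE for the linear model.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
rmse_scores = []
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    fold_model = LinearRegression().fit(X_train, y_train)
    preds = fold_model.predict(X_test)
    rmse_scores.append(np.sqrt(mean_squared_error(y_test, preds)))
print(f"Cross-validated RMSE (log1p scale): {np.mean(rmse_scores):.3f} +/- {np.std(rmse_scores):.3f}")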

Improving Stats model

In [ ]:
# Using p-value and variable inflation factor
In [ ]:
df.drop(['DC','FFMC','ISI','month_aug','month_sep','month_jul','month_oct'],axis=1,inplace =True)

In [ ]:
X = df.drop(columns=['area','damage_category'])
y = df['area']
In [ ]:
X_constant =sm.add_constant(X)
lin_reg = sm.OLS(y,X_constant).fit()
lin_reg.summary()

Improving ML model

In [ ]:
# Feature Selection techniques - RFE, forward or backward selection
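A minimal sketch of the RFE idea mentioned in the comment above (illustrative only; n_features_to_select=10 is an assumed choice, and the selection is not fed into the model below):
In [ ]:
# Sketch: recursive feature elimination with a linear regression estimator.
from sklearn.feature_selection import RFE

rfe = RFE(estimator=LinearRegression(), n_features_to_select=10)
rfe.fit(X, y)
print("Features kept by RFE:", list(X.columns[rfe.support_]))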
In [ ]:
lin_reg = LinearRegression()
lin_reg.fit(X, y)
In [ ]:
print(f'Coefficients: {lin_reg.coef_}')
print(f'Intercept: {lin_reg.intercept_}')
print(f'R^2 score: {lin_reg.score(X, y)}')
In [ ]:
jovian.commit()
[jovian] Saving notebook..