
Multiseasonal time series analysis: decomposition and forecasting with Python

Author: Daniel J. Toth

https://tothjd.medium.com

https://www.linkedin.com/in/tothjd/

Scope

This notebook processes the Hourly Energy Consumption dataset from www.kaggle.com. The dataset contains hourly energy demand data from several power supply companies with diverse service areas. One series, American Electric Power (AEP), was selected for demonstration purposes.

The primary goals of my analysis:

  • to gain insights into power demand dynamics and convey them clearly to a lay audience
  • to model the series so that values can be forecast in a meaningful way
  • to forecast on a longer, one-year time scale

Description of data

Data is available at https://www.kaggle.com/robikscube/hourly-energy-consumption. From the available files, AEP_hourly.csv is used. Index values are strings instead of datetime format. Some dates are missing and some are duplicated; in the latter case, the corresponding values differ. These issues must be addressed, since time series model classes take time indices in datetime format with a specified frequency. NaN values are not present.
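These anomalies can be surfaced up front with a few lines of pandas; a minimal sketch, assuming AEP_hourly.csv sits in the working directory (the Datetime column name comes from the Kaggle file):

import pandas as pd

#read the raw file and parse the timestamp column
raw = pd.read_csv('AEP_hourly.csv')
stamps = pd.DatetimeIndex(pd.to_datetime(raw['Datetime']))

#count timestamps occurring more than once
print(stamps.duplicated().sum())

#list the hours absent from a complete hourly range
full = pd.date_range(stamps.min(), stamps.max(), freq='H')
print(full.difference(stamps))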

Methods

Eventually, I will show that the UnobservedComponents (UC) class of statsmodels provides an efficient algorithm for coping with complex multiseasonal time series in relatively few lines of code. Before applying UC, I perform a multiseasonal decomposition with the seasonal_decompose function, in several lines. As it turns out, UC is essentially the same; however, it can also take arrays of exogenous variables as arguments to regress the residual and refine the model (a minimal sketch of the UC interface follows the list below). The code below is extensively commented and hopefully offers some useful snippets and tips for the reader, such as:

  • dealing with datetime indices (model classes need formatting)
  • dealing with missing data and duplicates
  • plotting time series at random time intervals (applying random choice)
  • drawing dashed vertical line for inspecting seasonal effects
  • placing annotations with text boxes and arrows on subplots
  • extracting data from seasonal_decompose results object
  • approximating time series components with polynomial or trigonometric functions using NumPy and optimizing with the scipy.optimize module
  • evaluating models by mean absolute error (MAE) and root mean squared error (RMSE) using the sklearn.metrics module
  • performing model residual diagnostics
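For orientation, the UC interface amounts to a single constructor call plus a fit. A minimal sketch with illustrative settings (the component specification, the harmonics counts and the AEP_MW column name are assumptions for this sketch, not the configuration tuned later in the notebook):

import statsmodels.api as sm

#structural model: stochastic trend plus two trigonometric seasonalities
model = sm.tsa.UnobservedComponents(
    df_aep['AEP_MW'],                                  #endogenous hourly series
    level='local linear trend',                        #stochastic level and slope
    freq_seasonal=[{'period': 24, 'harmonics': 2},     #daily cycle
                   {'period': 24*7, 'harmonics': 2}],  #weekly cycle
    exog=None)                                         #optional exogenous regressors

result = model.fit(disp=False)
print(result.summary())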

Table of Contents

1. Loading and cleaning dataframe

2. Exploratory data analysis

3. Decomposition of time series to individual components

4. In-sample prediction of model after decomposition

5. Approximation functions of model components

6. Component analysis of optimized model

7. Unobserved Components Model (UCM)

8. UCM residual diagnostics

9. UCM supplemented by exogenous variables

10. Second UCM residual diagnostics

11. UCM as pure regression analysis

12. Third UCM residual analysis

13. UCM supplemented by additional exogenous variables

14. Fourth UCM residual analysis

15. Evaluation of models

In [1]:
#mathematical operations
import math
import scipy as sp
import numpy as np

#data handling
import pandas as pd

#plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
import seaborn as sns
sns.set()

#machine learning and statistical methods
import statsmodels.api as sm

#dataframe index manipulations
import datetime

#selected preprocessing and evaluation methods
from sklearn.preprocessing import StandardScaler
from statsmodels.tsa.stattools import kpss
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

#muting unnecessary warnings if needed
import warnings
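#warnings is imported so that noisy, non-critical messages can be muted
#later if needed; a usage sketch (an assumption, not applied in this cell):
#warnings.filterwarnings('ignore', category=UserWarning)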

1. Loading and cleaning dataframe

In [2]:
#loading raw data
df_aep = pd.read_csv("AEP_hourly.csv", index_col=0)
df_aep
Out[2]:
In [3]:
#sorting unordered indices
df_aep.sort_index(inplace = True)
df_aep
Out[3]:
In [4]:
#visual checking of data. Plotting by Pandas method, drawing axes by Matplotlib
f, ax = plt.subplots(figsize=(18,6),dpi=200);
plt.suptitle('American Electric Power (AEP) estimated energy consumption in MegaWatts (MW)', fontsize=24);
df_aep.plot(ax=ax,rot=90,ylabel='MW');
[Plot: American Electric Power (AEP) estimated energy consumption in MW, full series]
In [5]:
#visual check of data at different scale
f, ax = plt.subplots(figsize=(18,6),dpi=200);
plt.suptitle('American Electric Power estimated energy consumption in MegaWatts (MW)', fontsize=36);
df_aep.iloc[-3*8766:,:].plot(ax=ax,rot=90,fontsize=12);
[Plot: AEP estimated energy consumption in MW, last three years of the series]
In [6]:
#creating datetime list with boundaries of raw data series, hourly frequency
datelist = pd.date_range(datetime.datetime(2004,10,1,1,0,0), datetime.datetime(2018,8,3,0,0,0), freq='H').tolist()

#extracting raw data series indices
idx_list = df_aep.index.to_list()

#checking for anomalies by comparing the two (False is expected here:
#the raw indices are still strings, and some hours are missing or duplicated)
idx_list == datelist
Out[6]:
False
In [7]:
#searching for anomalies
len(datelist), len(idx_list), len(set(idx_list))
Out[7]:
(121296, 121273, 121269)
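So the complete hourly range contains 121296 timestamps, while the raw index holds 121273 entries, of which 121269 are unique: 121273 − 121269 = 4 timestamps are duplicated, and 121296 − 121269 = 27 hours are missing, giving 31 anomalies in total to locate below.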
In [8]:
#converting dataframe indices to datetime
dt_idc = pd.to_datetime(df_aep.index, format='%Y-%m-%d %H:%M:%S')
In [9]:
#troubleshooting indices by inspecting irregularities in order; header is printed first.
#note: most anomalies fall around 02:00-04:00 on March and late-October/early-November
#mornings, i.e. daylight saving time transitions: spring-forward leaves a one-hour gap,
#while fall-back produces duplicate timestamps; a few isolated gaps occur elsewhere
print('Index,   current datetime,   current value,   last datetime,   last value,   timedelta,   value delta')

#collecting important values of irregular indices (index value and timedelta) for later use
idc = []

#iterating through datetime indices
for idx in range(1,len(dt_idc)):
    
    #the condition is True if the difference between consecutive datetime indices is not one hour
    if dt_idc[idx] - dt_idc[idx-1] != datetime.timedelta(hours=1):
        
        #appending collection of important values
        idc.append([idx,dt_idc[idx] - dt_idc[idx-1]])
        
        #printing values according to header
        print('{},   {},   {},   {},   {},   {},   {}'.format(idx, dt_idc[idx], df_aep.iloc[idx,0], dt_idc[idx-1], df_aep.iloc[idx-1,0], dt_idc[idx]-dt_idc[idx-1], df_aep.iloc[idx,0]-df_aep.iloc[idx-1,0]))
Index,   current datetime,   current value,   last datetime,   last value,   timedelta,   value delta
721,   2004-10-31 03:00:00,   10318.0,   2004-10-31 01:00:00,   11433.0,   0 days 02:00:00,   -1115.0
4417,   2005-04-03 04:00:00,   13271.0,   2005-04-03 02:00:00,   13426.0,   0 days 02:00:00,   -155.0
9455,   2005-10-30 03:00:00,   13154.0,   2005-10-30 01:00:00,   13348.0,   0 days 02:00:00,   -194.0
13151,   2006-04-02 04:00:00,   11255.0,   2006-04-02 02:00:00,   11315.0,   0 days 02:00:00,   -60.0
18189,   2006-10-29 03:00:00,   12781.0,   2006-10-29 01:00:00,   13629.0,   0 days 02:00:00,   -848.0
21381,   2007-03-11 04:00:00,   13119.0,   2007-03-11 02:00:00,   13090.0,   0 days 02:00:00,   29.0
27091,   2007-11-04 03:00:00,   13269.0,   2007-11-04 01:00:00,   13590.0,   0 days 02:00:00,   -321.0
30115,   2008-03-09 04:00:00,   17180.0,   2008-03-09 02:00:00,   17206.0,   0 days 02:00:00,   -26.0
35825,   2008-11-02 03:00:00,   12121.0,   2008-11-02 01:00:00,   12639.0,   0 days 02:00:00,   -518.0
38849,   2009-03-08 04:00:00,   10932.0,   2009-03-08 02:00:00,   11123.0,   0 days 02:00:00,   -191.0
44559,   2009-11-01 03:00:00,   11369.0,   2009-11-01 01:00:00,   11871.0,   0 days 02:00:00,   -502.0
47751,   2010-03-14 04:00:00,   12546.0,   2010-03-14 02:00:00,   12658.0,   0 days 02:00:00,   -112.0
53461,   2010-11-07 03:00:00,   14263.0,   2010-11-07 01:00:00,   14739.0,   0 days 02:00:00,   -476.0
54250,   2010-12-10 01:00:00,   17373.0,   2010-12-09 23:00:00,   19078.0,   0 days 02:00:00,   -1705.0
56484,   2011-03-13 04:00:00,   12969.0,   2011-03-13 02:00:00,   12755.0,   0 days 02:00:00,   214.0
62194,   2011-11-06 03:00:00,   13220.0,   2011-11-06 01:00:00,   13506.0,   0 days 02:00:00,   -286.0
65218,   2012-03-11 04:00:00,   13510.0,   2012-03-11 02:00:00,   13407.0,   0 days 02:00:00,   103.0
70928,   2012-11-04 03:00:00,   12171.0,   2012-11-04 01:00:00,   12873.0,   0 days 02:00:00,   -702.0
71697,   2012-12-06 05:00:00,   15192.0,   2012-12-06 03:00:00,   14711.0,   0 days 02:00:00,   481.0
73951,   2013-03-10 04:00:00,   12436.0,   2013-03-10 02:00:00,   12537.0,   0 days 02:00:00,   -101.0
79661,   2013-11-03 03:00:00,   12190.0,   2013-11-03 01:00:00,   11478.0,   0 days 02:00:00,   712.0
82685,   2014-03-09 04:00:00,   13008.0,   2014-03-09 02:00:00,   13140.0,   0 days 02:00:00,   -132.0
82743,   2014-03-11 15:00:00,   14405.0,   2014-03-11 13:00:00,   14839.0,   0 days 02:00:00,   -434.0
88395,   2014-11-02 02:00:00,   13190.0,   2014-11-02 02:00:00,   12994.0,   0 days 00:00:00,   196.0
91420,   2015-03-08 04:00:00,   14062.0,   2015-03-08 02:00:00,   14111.0,   0 days 02:00:00,   -49.0
97131,   2015-11-01 02:00:00,   10542.0,   2015-11-01 02:00:00,   10785.0,   0 days 00:00:00,   -243.0
100324,   2016-03-13 04:00:00,   10236.0,   2016-03-13 02:00:00,   10314.0,   0 days 02:00:00,   -78.0
106035,   2016-11-06 02:00:00,   11008.0,   2016-11-06 02:00:00,   10964.0,   0 days 00:00:00,   44.0
109060,   2017-03-12 04:00:00,   14320.0,   2017-03-12 02:00:00,   14361.0,   0 days 02:00:00,   -41.0
114771,   2017-11-05 02:00:00,   10446.0,   2017-11-05 02:00:00,   10596.0,   0 days 00:00:00,   -150.0
117796,   2018-03-11 04:00:00,   13704.0,   2018-03-11 02:00:00,   13797.0,   0 days 02:00:00,   -93.0
In [10]:
#changing string indices to datetime converted indices
df_aep.set_index(dt_idc, inplace=True)
df_aep.index
Out[10]:
DatetimeIndex(['2004-10-01 01:00:00', '2004-10-01 02:00:00',
               '2004-10-01 03:00:00', '2004-10-01 04:00:00',
               '2004-10-01 05:00:00', '2004-10-01 06:00:00',
               '2004-10-01 07:00:00', '2004-10-01 08:00:00',
               '2004-10-01 09:00:00', '2004-10-01 10:00:00',
               ...
               '2018-08-02 15:00:00', '2018-08-02 16:00:00',
               '2018-08-02 17:00:00', '2018-08-02 18:00:00',
               '2018-08-02 19:00:00', '2018-08-02 20:00:00',
               '2018-08-02 21:00:00', '2018-08-02 22:00:00',
               '2018-08-02 23:00:00', '2018-08-03 00:00:00'],
              dtype='datetime64[ns]', name='Datetime', length=121273, freq=None)
In [11]:
#correction of anomalies in reversed order avoiding index shift
for idx in reversed(idc):
    
    #appending mean values in case of datetime gaps
    if idx[1] == datetime.timedelta(hours=2):
        idx_old = df_aep.iloc[idx[0]].name
        idx_new = idx_old-datetime.timedelta(hours=1)
        df_aep.loc[idx_new] = np.mean(df_aep.iloc[idx[0]-1:idx[0]+1].values)
    
    #dropping duplicates, appending mean of dropped values
    elif idx[1] == datetime.timedelta(hours=0):
        idx_old = df_aep.iloc[idx[0]].name
        value = np.mean(df_aep.iloc[idx[0]-1:idx[0]+1].values)
        df_aep.drop(df_aep.iloc[idx[0]-1:idx[0]+1].index, inplace=True)
        df_aep.loc[idx_old] = value

df_aep
Out[11]:
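As an aside, essentially the same cleaning can be sketched with pandas built-ins; an alternative idea, not the method used above, applied to the datetime-indexed frame before the corrections (duplicates are averaged, and each single-hour gap is filled by time interpolation, which for one missing point equals the mean of its neighbours):

#collapse duplicated timestamps to their mean, expand to a complete
#hourly index, then fill the resulting NaN gaps by interpolation
df_alt = (df_aep.groupby(df_aep.index).mean()
                .asfreq('H')
                .interpolate(method='time'))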
In [12]:
df_aep.sort_index(inplace=True)
df_aep
Out[12]:
In [13]:
#checking for datetime series again
idx_list = df_aep.index.to_list()
idx_list == datelist
Out[13]:
True
In [14]:
#setting frequency attribute of datetime index
df_aep.index.freq = 'H'
df_aep.index
Out[14]:
DatetimeIndex(['2004-10-01 01:00:00', '2004-10-01 02:00:00',
               '2004-10-01 03:00:00', '2004-10-01 04:00:00',
               '2004-10-01 05:00:00', '2004-10-01 06:00:00',
               '2004-10-01 07:00:00', '2004-10-01 08:00:00',
               '2004-10-01 09:00:00', '2004-10-01 10:00:00',
               ...
               '2018-08-02 15:00:00', '2018-08-02 16:00:00',
               '2018-08-02 17:00:00', '2018-08-02 18:00:00',
               '2018-08-02 19:00:00', '2018-08-02 20:00:00',
               '2018-08-02 21:00:00', '2018-08-02 22:00:00',
               '2018-08-02 23:00:00', '2018-08-03 00:00:00'],
              dtype='datetime64[ns]', name='Datetime', length=121296, freq='H')

2. Exploratory data analysis

In [15]:
#drawing a random sample of 5 indices without repetition
sample = sorted(np.random.choice(range(len(df_aep)), 5, replace=False))

#zoom x scales for plotting
periods = [9000,3000,720,240]

#plotting random time intervals at different scales to observe any seasonality present
f, axes = plt.subplots(len(sample),4,dpi=400,figsize=(12,12));
plt.suptitle('{} random time windows plotted at {} different scales'.format(len(sample),len(periods)), fontsize=24, x=0.5, y=0.95);
f.tight_layout(pad=3.0);

#s for sample datetime start
for si,s in enumerate(sample):
    
    #p for period length
    for pi,p in enumerate(periods):
        df_aep.iloc[s:(s+p+1),:].plot(ax=axes[si][pi], legend=False, rot=90);
        #annotating datetime start
        axes[si][pi].annotate("Start at: " + df_aep.iloc[s:s+1,:].index[0].strftime("%d-%b-%Y %H:%M"), (0,1), xycoords='axes fraction')