Multiseasonal time series analysis: decomposition and forecasting with Python
Author: Daniel J., TOTH
https://tothjd.medium.com
https://www.linkedin.com/in/tothjd/
Scope
This notebook processes the Hourly Energy Consumption dataset from www.kaggle.com. The time series contains energy demand data from several power supplier companies with diverse service fields. One series was selected (American Electric Power - AEP) for demonstration purposes.
The primary goals of my analysis are to:
- gain insights into power demand dynamics and present them clearly to a lay audience
- perform modelling to forecast values in a meaningful way
- forecast on a longer, one-year time scale
Description of data
Data is available at https://www.kaggle.com/robikscube/hourly-energy-consumption. From the available files, AEP_hourly.csv is used.
Index values are stored as strings rather than in datetime format. Some timestamps are missing and some are duplicated; where a timestamp is duplicated, the corresponding values differ. Both issues have to be addressed, because time series model classes expect a datetime index with a specified frequency. NaN values are not present.
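A minimal sketch of how these index issues could be handled, shown on a small made-up frame rather than the real file. The column name demand_mw, the toy values, averaging the duplicated rows and filling the gap by time interpolation are all assumptions made for illustration, not necessarily the choices made later in this notebook.

```python
import pandas as pd

# Hypothetical toy data mimicking the issues described above:
# string index, one duplicated timestamp, one missing hour.
raw = pd.DataFrame(
    {"demand_mw": [100.0, 110.0, 130.0, 120.0, 150.0]},
    index=[
        "2020-01-01 01:00:00",
        "2020-01-01 02:00:00",
        "2020-01-01 02:00:00",  # duplicate timestamp with a different value
        "2020-01-01 03:00:00",
        "2020-01-01 05:00:00",  # 04:00 is missing
    ],
)

# 1. Convert the string labels to a DatetimeIndex.
raw.index = pd.to_datetime(raw.index)

# 2. Collapse duplicated timestamps (assumption: take their mean).
dedup = raw.groupby(level=0).mean()

# 3. Put the series on a regular hourly grid; missing hours become NaN
#    and the index now carries an explicit frequency.
clean = dedup.asfreq(pd.offsets.Hour())

# 4. Fill the gaps (assumption: linear interpolation in time).
clean = clean.interpolate(method="time")

print(clean)
print(clean.index.freq)
```

After these steps the index is a DatetimeIndex with an hourly frequency, which is the form the model classes expect.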
Methods
Eventually, I will show that the UnobservedComponents (UC) class of Statsmodels provides an efficient algorithm to cope with complex multiseasonal time series in relatively few lines of code. Before applying UC, I perform a multiseasonal decomposition with the seasonal_decompose function, which takes several lines of code. As it turns out, UC does essentially the same thing, but it can also take arrays of exogenous variables as arguments to regress the residual and refine the model. The code below is extensively commented and hopefully shows some useful snippets or tips for the reader, such as:
- dealing with datetime indices (model classes need formatting)
- dealing with missing data and duplicates
- plotting time series at random time intervals (applying random choice)
- drawing dashed vertical line for inspecting seasonal effects
- placing annotations with text boxes and arrows on subplots
- extracting data from the seasonal_decompose results object
- approximating time series components with polynomial or trigonometric functions using NumPy and optimizing with the scipy.optimize module
- evaluating models by mean absolute error (MAE) and root mean squared error (RMSE) using the sklearn.metrics module
- performing model residual diagnostics
Table of Contents
1. Loading and cleaning dataframe
2. Exploratory data analysis
3. Decomposition of time series to individual components
4. In-sample prediction of model after decomposition
5. Approximation functions of model components
6. Component analysis of optimized model
7. Unobserved Components Model (UCM)
8. UCM residual diagnostics
9. UCM supplemented by exogenous variables
10. Second UCM residual diagnostics
11. UCM as pure regression analysis
12. Third UCM residual analysis
13. UCM supplemented by additional exogenous variables
14. Fourth UCM residual analysis
15. Evaluation of models
#mathematical operations
import math
import scipy as sp
import numpy as np
#data handling
import pandas as pd
#plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
import seaborn as sns
sns.set()
#machine learning and statistical methods
import statsmodels.api as sm
#dataframe index manipulations
import datetime
#selected preprocessing and evaluation methods
from sklearn.preprocessing import StandardScaler
from statsmodels.tsa.stattools import kpss
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
#muting unnecessary warnings if needed
import warnings
#loading raw data
df_aep = pd.read_csv("AEP_hourly.csv", index_col=0)
df_aep
#sorting unordered indices
df_aep.sort_index(inplace = True)
df_aep