Forecasting the future has always been one of humankind's oldest desires, and many approaches have been tried over the centuries. In this post, we will look at a simple statistical method for time series analysis called AR, short for autoregressive model. We will use this method to predict future sales data and rebuild it from scratch to gain a deeper understanding of how it works, so read on!
Let us dive directly into the matter and build an AR model out of the box. We will use the built-in BJsales dataset, which contains 150 observations of sales data (for more information, consult the R documentation). Conveniently enough, AR models can be built directly in base R with the ar.ols() function (OLS stands for Ordinary Least Squares, the method used to fit the actual model). Have a look at the following code:
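The original code listing did not survive, so here is a minimal sketch of what it could look like; the concrete values of `N` and `n_ahead` are illustrative assumptions, not taken from the original post:

```r
# fit an AR model on the built-in BJsales data with ar.ols()
data <- BJsales      # 150 observations of sales data, included in base R
N <- 3               # number of lookback periods (order of the AR model) - assumed value
n_ahead <- 10        # number of periods to forecast - assumed value

# aic = FALSE forces the model to use exactly order.max lags;
# demean = TRUE subtracts the mean of the series before fitting
model_ar <- ar.ols(data, aic = FALSE, order.max = N,
                   demean = TRUE, intercept = FALSE)

model_ar$ar                              # the fitted AR coefficients
predict(model_ar, n.ahead = n_ahead)     # forecast the next n_ahead periods
```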
Well, this seems to be good news for the sales team: rising sales! Yet how does the model arrive at those numbers? To understand what is going on, we will now rebuild it. Basically, everything is in the name already: auto-regressive, i.e. a (linear) regression on (a delayed copy of) itself (auto is Ancient Greek for self)!
So, what we are going to do is create a delayed copy of the time series and run a linear regression on it. We will use the lm() function from base R for that (see also Learning Data Science: Modelling Basics). Have a look at the following code:
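The original listing was garbled, so the following is a reconstruction of the idea under the same assumptions as before (`N` lookback periods, demeaned data); the column names and helper steps are illustrative:

```r
# reproduce the AR model with lm(): regress the series on delayed copies of itself
data <- BJsales
N <- 3                                   # assumed number of lookback periods

data_demeaned <- data - mean(data)       # the regression works on demeaned data

# embed() builds a matrix of the series and its N delayed copies;
# reversing the columns puts the dependent variable in the LAST column
df_data <- as.data.frame(embed(data_demeaned, N + 1)[ , (N + 1):1])

# the formula is created dynamically because the number of the
# dependent variable's column (here V4) depends on N
form <- as.formula(paste0("V", N + 1, " ~ . - 1"))  # - 1: no intercept, data are demeaned
model_lm <- lm(form, data = df_data)

coef(model_lm)   # compare with model_ar$ar (coefficients listed from largest to smallest lag)
```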
As you can see, the coefficients and predicted values are the same (except for some negligible rounding errors)!
A few things warrant further attention: when building the linear model, the formula is created dynamically on the fly because the dependent variable sits in the last column, whose number depends on N (the number of lookback periods). To be more precise, it is not just a simple linear regression but a multiple regression, because each column (representing a different time delay) goes into the model as a separate (independent) variable. Additionally, the regression is performed on the demeaned data, i.e. the mean of the series is subtracted before fitting.
So, under the hood, what sounds so impressive ("autoregressive model"... wow!) is nothing else but good ol' linear regression. For this method to work, there must be some autocorrelation in the data, i.e. some repeating linear pattern.
As you can imagine, there are instances where this will not work. For example, in financial time series there is next to no autocorrelation (otherwise it would be too easy, right? See also my question and answers on Quant.SE: Why aren't econometric models used more in Quant Finance?).
In order to use this model to predict n_ahead periods ahead, the predict function first uses the last N periods and then feeds each new predicted value back in as input for the next prediction, and so forth, n_ahead times. After that, the mean is added back. Obviously, the farther we predict into the future, the more uncertain the forecast becomes, because the basis of the prediction comprises more and more values that were themselves predicted. The values for both parameters were chosen here for demonstration purposes only. A realistic scenario would be to take more lookback periods than predicted periods, and you would, of course, take domain knowledge into account, e.g. when you have monthly data, take at least twelve periods as your lookback.
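The recursive prediction step described above can be sketched by hand; this is an illustrative re-implementation (with assumed values for `N` and `n_ahead`), not the actual source of predict():

```r
# manual sketch of the recursive forecasting loop of an AR model
data <- BJsales
N <- 3               # assumed number of lookback periods
n_ahead <- 10        # assumed number of periods to forecast

model_ar <- ar.ols(data, aic = FALSE, order.max = N,
                   demean = TRUE, intercept = FALSE)
coefs <- as.numeric(model_ar$ar)   # coefficients for lags 1..N
mu <- mean(data)

preds <- numeric(n_ahead)
history <- rev(as.numeric(tail(data, N))) - mu  # last N observations, newest first, demeaned
for (i in seq_len(n_ahead)) {
  preds[i] <- sum(coefs * history)     # one-step-ahead prediction from the current window
  history <- c(preds[i], history[-N])  # slide window: prepend prediction, drop oldest value
}
preds + mu   # add the mean back

# compare with the built-in function:
# predict(model_ar, n.ahead = n_ahead)$pred
```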
This post only barely scratched the surface of forecasting time series data. Basically, many of the standard approaches of statistics and machine learning can be modified so that they can be used on time series data. Yet, even the most sophisticated method is not able to foresee external shocks (like the current COVID-19 pandemic) and feedback loops when the very forecasts change the actual behaviour of people.
So, all methods have to be taken with a grain of salt because of those systematic challenges. You should always keep that in mind when you get the latest sales forecast!