
How Do You Create A Linear Regression Forecast On Time Series Data In Python

I need to be able to create a Python function for forecasting on time-series data, based on a linear regression model with confidence bands. The function needs to take an argument spe

Solution 1:

In the text of your question, you clearly state that you would like upper and lower bounds on your regression output, as well as the output prediction. You also mention using Holt-Winters algorithms for forecasting in particular.

The packages suggested by other answerers are useful, but note that sklearn's LinearRegression does not give you error bounds "out of the box", and statsmodels does not provide Holt-Winters at the moment.

Therefore, I suggest trying this implementation of Holt-Winters. Unfortunately its license is unclear, so I can't reproduce it here in full. Now, I'm not sure whether you actually want Holt-Winters (seasonal) prediction or Holt's linear exponential smoothing algorithm. I'm guessing the latter, given the title of the post. In that case, you can use the linear() function of the linked library. The technique is described in detail here for interested readers.

In the interest of not providing a link-only answer, I'll describe the main features here. A function is defined that takes the data, i.e.

def linear(x, fc, alpha=None, beta=None):

x is the data to be fit, fc is the number of timesteps you want to forecast, and alpha and beta take their usual Holt-Winters meanings: roughly, parameters controlling the amount of smoothing applied to the "level" and the "trend" respectively. If alpha or beta is not specified, it is estimated using scipy.optimize.fmin_l_bfgs_b so as to minimise the RMSE.

The function simply applies the Holt algorithm by looping through the existing data points and then returns the forecast as follows:

return Y[-fc:], alpha, beta, rmse

where Y[-fc:] are the forecast points, alpha and beta are the values actually used and rmse is the root mean squared error. Unfortunately, as you can see, there are no upper or lower confidence intervals. By the way - we should probably refer to them as prediction intervals.
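
Since I can't reproduce the gist, here is a minimal sketch of Holt's linear method with the same interface, so the rest of this answer is self-contained. This is my own illustrative reimplementation under the signature described above (the helper _holt_rmse is mine), not the gist's actual code:

import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def _holt_rmse(params, x):
    # RMSE of the one-step-ahead forecasts for a given (alpha, beta)
    alpha, beta = params
    level, trend = x[0], x[1] - x[0]
    sq_err = 0.0
    for t in range(1, len(x)):
        sq_err += (x[t] - (level + trend)) ** 2
        new_level = alpha * x[t] + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return np.sqrt(sq_err / (len(x) - 1))

def linear(x, fc, alpha=None, beta=None):
    x = np.asarray(x, dtype=float)
    if alpha is None or beta is None:
        # estimate the smoothing constants by minimising the RMSE
        (alpha, beta), _, _ = fmin_l_bfgs_b(
            _holt_rmse, x0=np.array([0.3, 0.1]), args=(x,),
            bounds=[(0.0, 1.0), (0.0, 1.0)], approx_grad=True)
    rmse = _holt_rmse((alpha, beta), x)
    # run the smoother over the data once more to get the final state
    level, trend = x[0], x[1] - x[0]
    for t in range(1, len(x)):
        new_level = alpha * x[t] + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    # the forecasts extrapolate the last level and trend
    return level + trend * np.arange(1, fc + 1), alpha, beta, rmse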

Prediction intervals maths

Holt's algorithm and the Holt-Winters algorithm are exponential smoothing techniques, and finding confidence intervals for predictions generated from them is a tricky subject. They have been referred to as "rule of thumb" methods and, in the case of the Holt-Winters multiplicative algorithm, as having no "underlying statistical model". However, the final footnote to this page asserts that:

It is possible to calculate confidence intervals around long-term forecasts produced by exponential smoothing models, by considering them as special cases of ARIMA models. (Beware: not all software calculates confidence intervals for these models correctly.) The width of the confidence intervals depends on (i) the RMS error of the model, (ii) the type of smoothing (simple or linear); (iii) the value(s) of the smoothing constant(s); and (iv) the number of periods ahead you are forecasting. In general, the intervals spread out faster as α gets larger in the SES model and they spread out much faster when linear rather than simple smoothing is used.

We see here that an ARIMA(0,2,2) model is equivalent to a Holt linear model with additive errors.
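
If you are willing to lean on that equivalence, a pragmatic shortcut (my suggestion, assuming a reasonably recent statsmodels version, which this answer predates) is to fit the ARIMA(0,2,2) model directly and let statsmodels produce the prediction intervals. A minimal sketch, with a stand-in series where you would use your own data:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

y = np.random.randn(200).cumsum()        # stand-in series - use your own data
res = ARIMA(y, order=(0, 2, 2)).fit()    # ARIMA(0,2,2) ~ Holt's linear method
fc = res.get_forecast(steps=30)
mean = fc.predicted_mean                 # point forecasts
bands = fc.conf_int(alpha=0.05)          # 95% prediction intervals (lower, upper)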

Prediction intervals code (i.e. how to proceed)

You indicate in comments that you "can easily do this in R". I guess you may be used to the holt() function provided by the forecast package in R and are therefore expecting such intervals. In that case, you can adapt the Python library to give them to you on the same basis.

Looking at the R holt code, we can see that it returns an object based on forecast(ets(...)). Under the hood, this eventually calls this function, class1, which returns a mean mu and a variance var (as well as cj, which I have to confess I do not understand). The variance is used to calculate the upper and lower bounds here.

To do something similar in Python, we would need to produce something similar to the class1 R function, i.e. estimate the variance of each prediction. That function takes the residuals found in model fitting and multiplies them by a factor at each time step to get the variance at that timestep. In the particular case of the linear Holt's algorithm, the factor is the cumulative sum of alpha + k*beta, where k is the number of timesteps into the forecast. Once you have the variance at each prediction point, treat the errors as normally distributed and take the X% value from the normal distribution.

Here's an idea of how to do this in Python (using the code I linked as your linear function):

import numpy as np
import scipy.stats

# Copy, import or reimplement the RMSE and linear functions from
# https://gist.github.com/andrequeiroz/5888967

# Factor in case there is not 1 timestep per day - in your case,
# assuming the timestamps are UTC epoch, I think they're 5 min
# spaced, i.e. 288 per day
timesteps_per_day = 288

# Note - big assumption here - your known data must be regular in time,
# i.e. timesteps_per_day observations per day. From the timestamps this seems valid.
# If you can't guarantee that, you'll need to interpolate the data.
def holt_predict(data, timestamps, forecast_days, pred_error_level=0.95):
    forecast_timesteps = forecast_days * timesteps_per_day
    middle_predictions, alpha, beta, rmse = linear(data, int(forecast_timesteps))

    cum_error = [beta + alpha]
    for k in range(1, forecast_timesteps):
        cum_error.append(cum_error[k-1] + k*beta + alpha)

    cum_error = np.array(cum_error)
    #Use some numpy multiplication to get the intervals
    var = cum_error * rmse**2
    # find the correct ppf on the normal distribution (two-sided)
    p = abs(scipy.stats.norm.ppf((1-pred_error_level)/2))
    interval = np.sqrt(var) * p
    upper = middle_predictions + interval
    lower = middle_predictions - interval
    fcast_timestamps = [timestamps[-1] + i * (86400 // timesteps_per_day) for i in range(forecast_timesteps)]

    ret_value = []

    ret_value.append({'target': 'Forecast', 'datapoints': list(zip(middle_predictions, fcast_timestamps))})
    ret_value.append({'target': 'Upper', 'datapoints': list(zip(upper, fcast_timestamps))})
    ret_value.append({'target': 'Lower', 'datapoints': list(zip(lower, fcast_timestamps))})
    return ret_value

if __name__ == '__main__':
    null = None
    data_in = [{"target": "average", "datapoints": [[null, 1435688679],
    [34.870499801635745, 1435688694], [null, 1435688709], [null,
    1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769],
    [null, 1435688784], [null, 1435688799], [null, 1435688814], [null,
    1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874],
    [null, 1435688889], [null, 1435688904], [null, 1435688919], [null,
    1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979],
    [38.180000209808348, 1435688994], [null, 1435689009], [null,
    1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069],
    [null, 1435689084], [null, 1435689099], [null, 1435689114], [null,
    1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174],
    [null, 1435689189], [null, 1435689204], [null, 1435689219], [null,
    1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279],
    [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324],
    [null, 1435689339], [null, 1435689354], [null, 1435689369], [null,
    1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429],
    [null, 1435689444], [null, 1435689459], [null, 1435689474], [null,
    1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534],
    [null, 1435689549], [null, 1435689564]]}]

    # Translate the data. There may be better ways if you're
    # prepared to use pandas / the input data is proper JSON.
    time_series = data_in[0]["datapoints"]

    epoch_in = []
    Y_observed = []

    for (y, x) in time_series:
        if y is not None and x is not None:
            epoch_in.append(x)
            Y_observed.append(y)

    # Pass in the number of days to forecast
    fcast_days = 30
    res = holt_predict(Y_observed, epoch_in, fcast_days)
    data_out = data_in + res
    # data_out now holds the data as you wanted it.

    # Optional plot of results
    import matplotlib.pyplot as plt
    plt.plot(epoch_in, Y_observed)
    m, tstamps = zip(*res[0]['datapoints'])
    u, tstamps = zip(*res[1]['datapoints'])
    l, tstamps = zip(*res[2]['datapoints'])
    plt.plot(tstamps, u, label='upper')
    plt.plot(tstamps, l, label='lower')
    plt.plot(tstamps, m, label='mean')
    plt.legend()
    plt.show()

N.B. The output I've given adds points as tuples into your object. If you really need lists, then replace list(zip(upper, fcast_timestamps)) with [list(p) for p in zip(upper, fcast_timestamps)] where the code adds the Upper, Lower and Forecast dicts to the result.

This code is for the particular case of Holt's linear algorithm; it is not a generic way to calculate correct prediction intervals.

Important note

Your sample input data seem to have a lot of nulls and only 3 genuine data points. This is highly unlikely to be a good basis for timeseries prediction - especially as the points all seem to fall within 15 minutes of each other and you're trying to forecast up to 3 months! Indeed, if you feed that data into the R holt(), it will say:

You've got to be joking. I need more data!

I'm assuming you have a larger dataset to test on. I tried the code above on the stock market opening prices for 2015 and it seemed to give reasonable results (see below).

[Figure: the code above run on 2015 stock market opening prices]

You may think the prediction intervals look rather wide. This blog post from the author of the R forecast package implies that is intentional, though:

"con­fi­dence inter­vals for the mean are much nar­rower than pre­dic­tion inter­vals"

Solution 2:

scikit-learn is a great module for Python.

The X and Y variables must be separated into two arrays whose indices line up, so that if you were to plot (X, Y), each X[i] would pair with Y[i].

So for your time-series data you would separate the X and Y values as follows:

null = None
time_series = [{"target": "average", "datapoints": [[null, 1435688679], [34.870499801635745, 1435688694], [null, 1435688709], [null, 1435688724], [null, 1435688739], [null, 1435688754], [null, 1435688769], [null, 1435688784], [null, 1435688799], [null, 1435688814], [null, 1435688829], [null, 1435688844], [null, 1435688859], [null, 1435688874], [null, 1435688889], [null, 1435688904], [null, 1435688919], [null, 1435688934], [null, 1435688949], [null, 1435688964], [null, 1435688979], [38.180000209808348, 1435688994], [null, 1435689009], [null, 1435689024], [null, 1435689039], [null, 1435689054], [null, 1435689069], [null, 1435689084], [null, 1435689099], [null, 1435689114], [null, 1435689129], [null, 1435689144], [null, 1435689159], [null, 1435689174], [null, 1435689189], [null, 1435689204], [null, 1435689219], [null, 1435689234], [null, 1435689249], [null, 1435689264], [null, 1435689279], [30.79849989414215, 1435689294], [null, 1435689309], [null, 1435689324], [null, 1435689339], [null, 1435689354], [null, 1435689369], [null, 1435689384], [null, 1435689399], [null, 1435689414], [null, 1435689429], [null, 1435689444], [null, 1435689459], [null, 1435689474], [null, 1435689489], [null, 1435689504], [null, 1435689519], [null, 1435689534], [null, 1435689549], [null, 1435689564]]}]

 # assuming the time series is the X axis value

input_X_vars = []
input_Y_vars = []

for pair in time_series[0]["datapoints"]:
    if (pair[0] != None and pair[1] != None):
        input_X_vars.append(pair[1])
        input_Y_vars.append(pair[0])



import numpy as np
from sklearn import linear_model

regr = linear_model.LinearRegression()

# scikit-learn expects X with shape (n_samples, n_features), hence the reshape
X = np.array(input_X_vars).reshape(-1, 1)
regr.fit(X, input_Y_vars)

test_X_vars = np.array([1435688681, 1435688685]).reshape(-1, 1)

results = regr.predict(test_X_vars)

forecast_append = {"target": "Lower", "datapoints": results}

time_series.append(forecast_append)

I set null to None, because null is not a keyword in Python - it would be parsed as an ordinary variable name.

Solution 3:

Note: this is not a detailed canonical answer, just references to available libraries that apply to your domain (statistical models).

For Python, you could use:

There are some good articles:

Solution 4:

I've made a linear regression using gradient descent in Python; perhaps that can help you. After the data are initialized, the loss function is created, which is ordinary least squares (OLS). Then the gradient function is created, and this is used in an iterative procedure to find the optimal parameters. These can then be used for forecasting. Here's a working example that predicts the grade obtained, given the number of books read and the number of hours in class.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import *
from matplotlib import cm

# Initialize datapoints [# books, hours in class], on the basis of which we want to make a prediction:
#x = np.array([[2104,3],[1600,3],[2400,3],[1416,2],[300,4],[2001,3]])
x = np.array([[0, 9],
x = np.array([[0, 9],
[1, 15],
[0, 10],
[2, 16],
[4, 10],
[4, 20],
[1, 11],
[4, 20],
[3, 15],
[0, 15],
[2, 8],
[1, 13],
[4, 18],
[1, 10],
[0, 8],
[1, 10],
[3, 16],
[0, 11],
[1, 19],
[4, 12],
[4, 11],
[0, 19],
[2, 15],
[3, 15],
[1, 20],
[0, 6],
[3, 15],
[3, 19],
[2, 14],
[2, 13],
[3, 17],
[2, 20],
[2, 11],
[3, 20],
[4, 20],
[4, 20],
[3, 9],
[1, 8],
[2, 16],
[0, 10]])

# Initialize y, the grade obtained
y = np.array([45,
57,
45,
51,
65,
88,
44,
87,
89,
59,
66,
65,
56,
47,
66,
41,
56,
37,
45,
58,
47,
64,
97,
55,
51,
61,
69,
79,
71,
62,
87,
54,
43,
92,
83,
94,
60,
56,
88,
62])

# Initialize starting point for theta, the parameter we want to optimize:
#theta = np.array([90,0.1,-8])
theta = np.array([1, 1, 1])

# Initialize the hypothesis h_theta(x):
def h(x, theta):
    return theta[0] + theta[1]*x[:,0] + theta[2]*x[:,1]

# Initialize the loss function (ordinary least squares), J(theta):
def lossf(theta):
    loss = np.sum((h(x,theta)-y)*(h(x,theta)-y))
    return loss
    '''
    # equivalent loop-based version:
    for i in range(x.shape[0]):
        lossIncrement = (h(x[[i]],theta) - y[i])*(h(x[[i]],theta) - y[i])
        loss = loss + lossIncrement

    return 0.5*loss
    '''

# Define the gradient of the loss function for any parameters theta;
# use vector calculations here instead of for loops:
def gradlossf(theta):
    # dJ/dtheta_0, dJ/dtheta_1 and dJ/dtheta_2:
    d1 = np.sum(h(x,theta)-y)
    d2 = np.sum((h(x,theta)-y)*x[:,0])
    d3 = np.sum((h(x,theta)-y)*x[:,1])
    return np.array([d1,d2,d3])

epsilon = 1e-2
max_iter = 1000
norm = np.linalg.norm(gradlossf(theta))
alpha0 = 1
n_iter = 0
beta = 1e-4          # Armijo constant (not used in the simple test below)
tolerance = 1e-6
tau = 0.5

# Start iterative procedure:
while n_iter < max_iter and norm > epsilon:
    print(theta)

    # backtracking line search: shrink alpha until the loss stops increasing
    alpha = alpha0
    alpha_found = False
    while not alpha_found:
        xpoint = theta - alpha*gradlossf(theta)
        val1 = lossf(xpoint)
        val2 = lossf(theta)

        if val1 < val2 + tolerance:
            alpha_found = True
            print('We choose alpha =', alpha)
        else:
            alpha = tau*alpha

    n_iter = n_iter + 1
    theta = xpoint
    norm = np.linalg.norm(gradlossf(theta))   # update the stopping criterion
    print('New value of loss function is: ', val1)


# THE CORRECT VALUES: [37.379, 4.037, 1.283] (from SPSS). Residual: 7306.230

# Plot datapoints
fig = plt.figure()
x1_surf=np.arange(0, 5, 0.1)                # generate a mesh
x2_surf=np.arange(5, 25, 0.1)
x1_surf, x2_surf = np.meshgrid(x1_surf, x2_surf)
z_surf = theta[0] + theta[1]*x1_surf + theta[2]*x2_surf             # ex. function, which depends on x and y

ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(x1_surf, x2_surf, z_surf, cmap=cm.hot);    # plot a 3d surface plot
ax.scatter(x[:,0],x[:,1],y)

ax.set_xlabel('# Books')
ax.set_ylabel('Hours in class')
ax.set_zlabel('Grade')

plt.show()
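
As a quick sanity check (my addition, not part of the original answer), you can compare the gradient-descent result with the closed-form least-squares solution, reusing the x and y arrays defined above; X_design and theta_ols are illustrative names:

# Closed-form OLS via least squares, for comparison with gradient descent
X_design = np.column_stack([np.ones(len(x)), x])   # design matrix with intercept
theta_ols, residual, rank, sv = np.linalg.lstsq(X_design, y, rcond=None)
print(theta_ols)    # should be close to [37.379, 4.037, 1.283]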
