
Linear Regression Fails In Python With Large Values In Dependent Variables

I'm trying to rewrite a forecasting model (originally in Stata) in Python (with pandas.stats.api.ols), and ran into an issue with linear regression: the coefficients and intercept all come back as 0.

Solution 1:

I think this is because of the relative-precision issue in Python (and not just Python: most other programming languages, such as C++, behave the same way). np.finfo(float).eps gives 2.2204460492503131e-16, so anything smaller than eps * max_value_of_your_data is treated essentially as 0 in primitive operations like + - * /. For example, 1e117 + 1e100 == 1e117 returns True, because 1e100/1e117 = 1e-17 < eps. Now look at your data.
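A quick way to see this precision limit in action (the constants below are illustrative, not from the question's data):

```python
import numpy as np

# Machine epsilon for 64-bit floats: ~2.22e-16
eps = np.finfo(float).eps

# The smaller addend is relatively below eps, so it is lost entirely
print(1e117 + 1e100 == 1e117)  # True: 1e100/1e117 = 1e-17 < eps

# The same effect at ordinary magnitudes
print(1.0 + 1e-16 == 1.0)      # True:  1e-16 is below eps
print(1.0 + 1e-15 == 1.0)      # False: 1e-15 is above eps
```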

# your data
# =======================
print(df)

    A           B          C           D           E
0  10  1.0000e+30     100000  1.0000e+15  1.0000e+25
1  20  1.0737e+39    3200000  3.2768e+19  3.3554e+32
2  30  2.0589e+44   24300000  1.4349e+22  8.4729e+36
3  40  1.1529e+48  102400000  1.0737e+24  1.1259e+40
4  50  9.3132e+50  312500000  3.0518e+25  2.9802e+42
5  21  4.6407e+39    4084101  6.8122e+19  1.1363e+33

When relative precision is taken into account,

# ===================================================
import numpy as np

np.finfo(float).eps # 2.2204460492503131e-16

df[df < df.max().max()*np.finfo(float).eps] = 0
df

   A           B  C  D           E
0  0  0.0000e+00  0  0  0.0000e+00
1  0  1.0737e+39  0  0  0.0000e+00
2  0  2.0589e+44  0  0  8.4729e+36
3  0  1.1529e+48  0  0  1.1259e+40
4  0  9.3132e+50  0  0  2.9802e+42
5  0  4.6407e+39  0  0  0.0000e+00

So, at the level of precision the solver works at, there is no variation at all in y (column A), and that's why statsmodels returns all-zero coefficients. As a reminder, it's always good practice to normalize your data before running a regression.
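As a sketch of one normalization approach: the columns in the question grow geometrically, so a log transform (valid here because every value is strictly positive) compresses the ~21 orders of magnitude into a range where OLS arithmetic is well behaved. The array below copies column B from the question:

```python
import numpy as np

# Column B from the question, spanning ~21 orders of magnitude
b = np.array([1.0000e+30, 1.0737e+39, 2.0589e+44,
              1.1529e+48, 9.3132e+50, 4.6407e+39])

# All values are strictly positive, so log10 is well defined;
# the transformed column now lives in roughly [30, 51]
b_log = np.log10(b)
print(b_log.min(), b_log.max())
```

After transforming every column this way (or standardizing to z-scores, if some columns can be non-positive), refit the regression on the transformed data.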
