Linear Regression Fails In Python With Large Values In Dependent Variables
I'm trying to rewrite a forecasting model (originally in Stata) in Python (with pandas.stats.api.ols), and ran into an issue with linear regression: the computed coefficients and intercept all come back as 0.
Solution 1:
I think this is a relative-precision issue, and not one specific to Python: most other languages (C++ included) use the same 64-bit floating-point arithmetic. np.finfo(float).eps gives 2.2204460492503131e-16, so anything smaller than eps * max_value_of_your_data is effectively treated as 0 in primitive operations like + - * /. For example, 1e117 + 1e100 == 1e117 returns True, because 1e100/1e117 = 1e-17 < eps.
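This swamping behavior is easy to verify directly (a quick check; the second comparison, with 1e102, is my own added contrast and not from the original post):

```python
import numpy as np

eps = np.finfo(float).eps        # ~2.22e-16 for 64-bit floats
print(1e117 + 1e100 == 1e117)    # True: 1e100 is below eps * 1e117, so it vanishes
print(1e117 + 1e102 > 1e117)     # True: 1e102 is above eps * 1e117, so it survives
```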
Now look at your data.
# your data
# ==========
print(df)

    A           B          C           D           E
0  10  1.0000e+30     100000  1.0000e+15  1.0000e+25
1  20  1.0737e+39    3200000  3.2768e+19  3.3554e+32
2  30  2.0589e+44   24300000  1.4349e+22  8.4729e+36
3  40  1.1529e+48  102400000  1.0737e+24  1.1259e+40
4  50  9.3132e+50  312500000  3.0518e+25  2.9802e+42
5  21  4.6407e+39    4084101  6.8122e+19  1.1363e+33
When relative precision is taken into account,
# ==================================================
import numpy as np

np.finfo(float).eps  # 2.2204460492503131e-16
df[df < df.max().max() * np.finfo(float).eps] = 0
df

   A           B  C  D           E
0  0  0.0000e+00  0  0  0.0000e+00
1  0  1.0737e+39  0  0  0.0000e+00
2  0  2.0589e+44  0  0  8.4729e+36
3  0  1.1529e+48  0  0  1.1259e+40
4  0  9.3132e+50  0  0  2.9802e+42
5  0  4.6407e+39  0  0  0.0000e+00
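The masking step above can be reproduced end-to-end with the question's data (a self-contained sketch; it assumes pandas is installed, and uses df.mask rather than boolean assignment so the original frame is left untouched):

```python
import numpy as np
import pandas as pd

# Rebuild the question's DataFrame.
df = pd.DataFrame({
    "A": [10, 20, 30, 40, 50, 21],
    "B": [1.0000e+30, 1.0737e+39, 2.0589e+44, 1.1529e+48, 9.3132e+50, 4.6407e+39],
    "C": [100000, 3200000, 24300000, 102400000, 312500000, 4084101],
    "D": [1.0000e+15, 3.2768e+19, 1.4349e+22, 1.0737e+24, 3.0518e+25, 6.8122e+19],
    "E": [1.0000e+25, 3.3554e+32, 8.4729e+36, 1.1259e+40, 2.9802e+42, 1.1363e+33],
})

# Anything below eps * (largest value anywhere in df) is numerically invisible.
threshold = df.max().max() * np.finfo(float).eps  # about 2.07e+35 for this data
masked = df.mask(df < threshold, 0)
print(masked)
```

Columns A, C, and D are wiped out entirely, which is exactly why the regression sees a constant dependent variable.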
So there is no variation at all left in y (column A), and that's why statsmodels returns all-zero coefficients. As a reminder, it is always good practice to normalize your data before running a regression.
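Following that advice, one way the normalization step might look is sketched below (my own choice of log10-then-standardize, not from the original post; it assumes only numpy). Because these columns span ~20 orders of magnitude, a log transform brings them onto a comparable scale before standardizing:

```python
import numpy as np

# Column A is the dependent variable; B, C, D, E are the regressors.
y = np.array([10., 20., 30., 40., 50., 21.])
X = np.array([
    [1.0000e+30, 1.0000e+05, 1.0000e+15, 1.0000e+25],
    [1.0737e+39, 3.2000e+06, 3.2768e+19, 3.3554e+32],
    [2.0589e+44, 2.4300e+07, 1.4349e+22, 8.4729e+36],
    [1.1529e+48, 1.0240e+08, 1.0737e+24, 1.1259e+40],
    [9.3132e+50, 3.1250e+08, 3.0518e+25, 2.9802e+42],
    [4.6407e+39, 4.0841e+06, 6.8122e+19, 1.1363e+33],
])

# Log-transform, then standardize each column to zero mean / unit variance.
logX = np.log10(X)
Z = (logX - logX.mean(axis=0)) / logX.std(axis=0)

# Ordinary least squares with an explicit intercept column. The transformed
# columns here are collinear, so lstsq returns the minimum-norm solution.
design = np.column_stack([np.ones(len(y)), Z])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(coef)  # intercept followed by four slopes, no longer all zero
```

With the values on a sane scale, the fit produces non-degenerate coefficients instead of the all-zero result from the raw data.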