Some of my predictors have very different scales: do I need to transform them before fitting a linear regression model?

9

I would like to run a linear regression on a multivariate dataset. The dimensions differ in their order of magnitude. For example, dimension 1 typically takes values in [0, 1], while dimension 2 takes values in [0, 1000].

Do I need to perform any transformations to ensure that the ranges of the different dimensions are on the same scale? If so, is there any guidance for this kind of transformation?


Answers:

15

Shifting or rescaling your variables does not affect their correlation with the response

To see why this is true, suppose that the correlation between $Y$ and $X$ is $\rho$. Then the correlation between $Y$ and $(X-a)/b$, for $b > 0$, is

$$\frac{\mathrm{cov}\big(Y,(X-a)/b\big)}{\mathrm{SD}\big((X-a)/b\big)\,\mathrm{SD}(Y)} = \frac{\mathrm{cov}(Y,X/b)}{\mathrm{SD}(X/b)\,\mathrm{SD}(Y)} = \frac{\tfrac{1}{b}\,\mathrm{cov}(Y,X)}{\tfrac{1}{b}\,\mathrm{SD}(X)\,\mathrm{SD}(Y)} = \rho$$

which follows from the definition of correlation and three facts:

  • $\mathrm{cov}(Y, X+a) = \mathrm{cov}(Y,X) + \mathrm{cov}(Y,a) = \mathrm{cov}(Y,X) + 0 = \mathrm{cov}(Y,X)$

  • $\mathrm{cov}(Y, aX) = a\,\mathrm{cov}(Y,X)$

  • $\mathrm{SD}(aX) = |a|\,\mathrm{SD}(X)$

Therefore, in terms of model fit (e.g. $R^2$ or the fitted values), shifting or scaling your variables (e.g. putting them on the same scale) will not change the model, since linear regression coefficients are related to the correlations between variables. It will only change the scale of your regression coefficients, which should be kept in mind when you're interpreting the output if you choose to transform your predictors.
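A minimal numpy sketch of this point (the data and variable names are made up for illustration): with an intercept in the model, standardizing the predictors leaves $R^2$ unchanged because the transformed columns span the same space.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 1, n)       # predictor on [0, 1]
x2 = rng.uniform(0, 1000, n)    # predictor on [0, 1000]
y = 2.0 * x1 + 0.005 * x2 + rng.normal(0, 0.5, n)

def r_squared(y, X):
    """R^2 from an OLS fit that includes an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

raw = r_squared(y, np.column_stack([x1, x2]))
scaled = r_squared(y, np.column_stack([(x1 - x1.mean()) / x1.std(),
                                       (x2 - x2.mean()) / x2.std()]))
print(np.isclose(raw, scaled))  # True: same fit either way
```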

Edit: The above has assumed that you're talking about ordinary regression with the intercept. A couple more points related to this (thanks @cardinal):

  • The intercept can change when you transform your variables and, as @cardinal points out in the comments, the coefficients will change when you shift your variables if you omit the intercept from the model, although I assume you're not doing that unless you have a good reason (see e.g. this answer).

  • If you're regularizing your coefficients in some way (e.g. the lasso, ridge regression), then centering/scaling will impact the fit. For example, if you're penalizing $\sum_i \beta_i^2$ (the ridge regression penalty), then you cannot recover an equivalent fit after standardizing unless all of the variables were on the same scale in the first place, i.e. there is no constant multiple that will recover the same penalty.
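To illustrate the ridge case, here is a rough sketch using the closed-form ridge solution (no intercept, purely for brevity; the data are invented): rescaling the columns while keeping a common penalty weight changes the fitted values, unlike in ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([rng.uniform(0, 1, n), rng.uniform(0, 1000, n)])
y = X @ np.array([2.0, 0.005]) + rng.normal(0, 0.5, n)

def ridge_fit(X, y, lam):
    """Closed-form ridge: beta = (X'X + lam*I)^{-1} X'y (intercept omitted for brevity)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

lam = 10.0
fitted_raw = X @ ridge_fit(X, y, lam)
Xs = X / X.std(axis=0)          # rescale only, to isolate the effect of the penalty
fitted_std = Xs @ ridge_fit(Xs, y, lam)
# A common penalty applied to differently-scaled columns shrinks them differently,
# so the two fits are not equivalent.
print(np.allclose(fitted_raw, fitted_std))  # False
```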

Regarding when/why a researcher may want to transform predictors

A common circumstance (discussed in the subsequent answer by @Paul) is that researchers will standardize their predictors so that all of the coefficients will be on the same scale. In that case, the size of the point estimates can give a rough idea of which predictors have the largest effect once the numerical magnitude of the predictor has been standardized.

Another reason a researcher may like to scale very large variables is so that the regression coefficients are not on an extremely tiny scale. For example, if you wanted to look at the influence of population size of a country on crime rate (couldn't think of a better example), you might want to measure population size in millions rather than in its original units, since the coefficient may be something like .00000001.
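A quick numerical check of that unit change (the population/crime numbers here are simulated, not real data): measuring the predictor in millions simply multiplies the slope by $10^6$, leaving the fit untouched.

```python
import numpy as np

rng = np.random.default_rng(2)
pop = rng.uniform(1e6, 3e8, 50)              # population in raw units (simulated)
crime = 1e-8 * pop + rng.normal(0, 0.5, 50)  # hypothetical crime rate

def slope(x, y):
    """OLS slope from a fit with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

b_raw = slope(pop, crime)             # tiny coefficient, on the order of 1e-8
b_millions = slope(pop / 1e6, crime)  # same effect, on a readable scale
print(np.isclose(b_millions, b_raw * 1e6))  # True
```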

Macro
Two quick remarks: While the beginning of the post is correct, it misses the fact that centering will have an effect if an intercept is absent. :) Second, centering and rescaling has important effects if regularization is used. While the OP may not be considering this, it is still probably a useful point to keep in mind.
cardinal
The invariance to rescaling is also easily seen if one is comfortable with matrix notation. With $X$ full rank (for simplicity), $\hat{y} = X(X^\top X)^{-1}X^\top y$. Now if we replace $X$ by $XD$, where $D$ is diagonal and invertible, we get
$$\tilde{y} = (XD)\big((XD)^\top XD\big)^{-1}(XD)^\top y = XD\big(DX^\top XD\big)^{-1}DX^\top y = X(X^\top X)^{-1}X^\top y = \hat{y}.$$
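This identity can be verified numerically in a few lines (random data, arbitrary diagonal rescaling):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3))      # full-rank design matrix
y = rng.normal(size=30)
D = np.diag([0.5, 10.0, 1000.0])  # arbitrary invertible diagonal rescaling

hat_raw = X @ np.linalg.solve(X.T @ X, X.T @ y)
XD = X @ D
hat_scaled = XD @ np.linalg.solve(XD.T @ XD, XD.T @ y)
print(np.allclose(hat_raw, hat_scaled))  # True: y-hat is unchanged by rescaling
```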
cardinal
@cardinal, I've decided to mention the fact that, if your estimates are regularized then centering/scaling can have an impact. I resisted at first because I thought it would begin a long digression that may confuse those who are not familiar with regularizing but I found I could address it with relatively little space. Thanks--
Macro
Not all my comments are necessarily meant to suggest that the answer should be updated. Many times I just like to slip in ancillary remarks under nice answers to give a couple thoughts on related ideas that might be of interest to a passer-by. (+1)
cardinal
Something funky is going on with the vote counting. Once again, I upvoted this when making my earlier comment and it didn't "take". Hmm.
cardinal
2

The so-called "normalization" is a common routine for most regression methods. There are two ways:

  1. Map each variable into the [-1, 1] interval (mapminmax in MATLAB).
  2. Subtract the mean from each variable and divide by its standard deviation (mapstd in MATLAB), i.e. actually "normalize" (standardize). If the true mean and deviation are unknown, just use the sample characteristics:
    $$\tilde{X}_{ij} = \frac{X_{ij} - \mu_i}{\sigma_i}$$
    or
    $$\tilde{X}_{ij} = \frac{X_{ij} - \bar{X}_i}{\mathrm{std}(X_i)}$$
    where $\mu_i = E[X_i]$, $\sigma_i^2 = E\big[(X_i - E[X_i])^2\big]$, $\bar{X}_i = \frac{1}{N}\sum_{j=1}^{N} X_{ij}$ and $\mathrm{std}(X_i) = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\big(X_{ij} - \bar{X}_i\big)^2}$
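The second recipe (sample standardization) can be sketched in numpy as follows; the data here are synthetic:

```python
import numpy as np

def standardize(X):
    """Column-wise: subtract the sample mean, divide by the sample std."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

rng = np.random.default_rng(4)
X = np.column_stack([rng.uniform(0, 1, 500), rng.uniform(0, 1000, 500)])
Z = standardize(X)
print(np.allclose(Z.mean(axis=0), 0))  # each column now has mean 0
print(np.allclose(Z.std(axis=0), 1))   # and standard deviation 1
```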

As linear regression is very sensitive to the variables' ranges, I would generally suggest normalizing all the variables if you do not have any prior knowledge about the dependence and expect all the variables to be relatively important.

The same goes for the response variables, although it is less important for them.

Why do normalization or standardization? Mostly in order to determine the relative impact of the different variables in the model, which can be achieved if all variables are in the same units.

Hope this helps!

Paul
What do you mean when you say linear regression is very sensitive to the variables' ranges? For any x1, x2, y, the two commands summary(lm(y~x1+x2))$r.sq and summary(lm(y~scale(x1)+scale(x2)))$r.sq (the $R^2$ values when you don't standardize the predictors and when you do) give the same value, indicating an equivalent fit.
Macro
I was not completely correct in the formulation. I meant the following. The regression would always be the same (in the sense of $R^2$) if you perform only linear transformations of the data. But if you want to determine which variables are crucial and which are almost noise, the scale matters. It is just convenient to standardize the variables and forget about their original scales. So regression is "sensitive" in terms of understanding relative impacts.
Paul
Thanks for clarifying, but determining which variables are crucial and which are almost noise is often decided by the p-value, which also won't change when you standardize (except for the intercept, of course). I agree with your point that it does provide a nicer interpretation of the raw coefficient estimates.
Macro