Shifting/scaling your variables will not affect their correlation with the response
To see why this is true, suppose that the correlation between $Y$ and $X$ is $\rho$. Then the correlation between $Y$ and $(X-a)/b$ is
$$\frac{\operatorname{cov}(Y, (X-a)/b)}{\operatorname{SD}((X-a)/b) \cdot \operatorname{SD}(Y)} = \frac{\operatorname{cov}(Y, X/b)}{\operatorname{SD}(X/b) \cdot \operatorname{SD}(Y)} = \frac{\tfrac{1}{b} \cdot \operatorname{cov}(Y, X)}{\tfrac{1}{b} \cdot \operatorname{SD}(X) \cdot \operatorname{SD}(Y)} = \rho,$$
which follows from the definition of correlation and three facts:
$$\operatorname{cov}(Y, X + a) = \operatorname{cov}(Y, X) + \operatorname{cov}(Y, a) = \operatorname{cov}(Y, X) + 0 = \operatorname{cov}(Y, X)$$
$$\operatorname{cov}(Y, aX) = a \cdot \operatorname{cov}(Y, X)$$
$$\operatorname{SD}(aX) = a \cdot \operatorname{SD}(X) \quad \text{(for } a > 0\text{)}$$
Therefore, in terms of model fit (e.g. R2 or the fitted values), shifting or scaling your variables (e.g. putting them on the same scale) will not change the model, since linear regression coefficients are related to the correlations between variables. It will only change the scale of your regression coefficients, which should be kept in mind when you're interpreting the output if you choose to transform your predictors.
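As a quick numerical check, here is a minimal R sketch (the simulated data below is purely illustrative): both the correlation and the fitted values are unchanged when the predictor is shifted and scaled.

```r
set.seed(1)
x <- rnorm(100)
y <- 3 + 2 * x + rnorm(100)
a <- 5; b <- 10                        # arbitrary shift and (positive) scale

c(cor(y, x), cor(y, (x - a) / b))      # identical correlations

f1 <- fitted(lm(y ~ x))
f2 <- fitted(lm(y ~ I((x - a) / b)))
max(abs(f1 - f2))                      # ~0, up to floating-point error
```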
Edit: The above assumes ordinary regression with an intercept. A couple more points related to this (thanks @cardinal):
The intercept can change when you transform your variables and, as @cardinal points out in the comments, the coefficients will change when you shift your variables if you omit the intercept from the model, although I assume you're not doing that unless you have a good reason (see e.g. this answer).
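For example (a toy sketch with made-up data, where the shift of +10 is arbitrary):

```r
set.seed(2)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

coef(lm(y ~ x))[2]             # slope, with intercept
coef(lm(y ~ I(x + 10)))[2]     # same slope after shifting x
coef(lm(y ~ x - 1))            # slope, intercept omitted
coef(lm(y ~ I(x + 10) - 1))    # intercept omitted: the slope changes
```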
If you're regularizing your coefficients in some way (e.g. lasso, ridge regression), then centering/scaling will impact the fit. For example, if you're penalizing $\sum_i \beta_i^2$ (the ridge regression penalty) then you cannot recover an equivalent fit after standardizing unless all of the variables were on the same scale in the first place, i.e. there is no constant multiple that will recover the same penalty.
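As a rough illustration, here is a hand-rolled ridge fit on made-up data (I center $y$ and drop the intercept for simplicity; $\lambda = 1$ is arbitrary). The penalized fits from centered vs. standardized predictors differ, which would not happen with plain OLS:

```r
# Ridge estimate by hand: beta = (X'X + lambda * I)^{-1} X'y
ridge_fitted <- function(X, y, lambda = 1) {
  beta <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
  X %*% beta
}

set.seed(3)
n  <- 100
x1 <- rnorm(n)                    # unit-scale predictor
x2 <- rnorm(n, sd = 100)          # predictor on a much larger scale
y  <- x1 + 0.01 * x2 + rnorm(n)
yc <- y - mean(y)                 # center y so the intercept can be dropped

Xc <- scale(cbind(x1, x2), scale = FALSE)  # centered only
Xs <- scale(cbind(x1, x2))                 # centered and standardized

# Nonzero: scaling changes the penalized fit (for OLS this would be ~0)
max(abs(ridge_fitted(Xc, yc) - ridge_fitted(Xs, yc)))
```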
Regarding when/why a researcher may want to transform predictors
A common circumstance (discussed in the subsequent answer by @Paul) is that researchers will standardize their predictors so that all of the coefficients will be on the same scale. In that case, the size of the point estimates can give a rough idea of which predictors have the largest effect once the numerical magnitude of the predictor has been standardized.
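For instance (hypothetical data, with the two predictors put on deliberately different scales):

```r
set.seed(4)
x1 <- rnorm(200)                       # e.g. a unit-scale predictor
x2 <- rnorm(200, sd = 1000)            # e.g. a predictor in the thousands
y  <- 0.5 * x1 + 0.001 * x2 + rnorm(200)

coef(lm(y ~ x1 + x2))                  # raw: magnitudes not comparable
coef(lm(y ~ scale(x1) + scale(x2)))    # standardized: roughly comparable
```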
Another reason a researcher may like to scale very large variables is so that the regression coefficients are not on an extremely tiny scale. For example, if you wanted to look at the influence of population size of a country on crime rate (couldn't think of a better example), you might want to measure population size in millions rather than in its original units, since the coefficient may be something like .00000001.
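A toy version of that example, with invented numbers:

```r
set.seed(5)
pop   <- runif(50, 1e6, 3e8)       # invented population figures
crime <- 5 + 2e-8 * pop + rnorm(50)

coef(lm(crime ~ pop))              # coefficient around 2e-08, awkward to report
coef(lm(crime ~ I(pop / 1e6)))     # same fit, coefficient around 0.02
```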
The so-called "normalization" is a common routine for most regression methods. There are two common ways: min-max normalization (rescaling each variable to a fixed range) and standardization (subtracting the mean and dividing by the standard deviation).
Since linear regression is very sensitive to the ranges of the variables, I would generally suggest normalizing all of the variables if you have no prior knowledge about the dependence and expect all of the variables to be relatively important.
The same goes for response variables, although it is much less important for them.
Why do normalization or standardization? Mostly to determine the relative impact of the different variables in the model, which can only be judged if all of the variables are in the same units.
Hope this helps!
You can verify this empirically: generate some data $x_1, x_2, y$ and compare these two commands: `summary(lm(y~x1+x2))$r.sq` and `summary(lm(y~scale(x1)+scale(x2)))$r.sq`. They return the same $R^2$: the fit is unchanged by standardizing the predictors.
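For completeness, a runnable version of that check with simulated (entirely hypothetical) data:

```r
set.seed(6)
x1 <- rnorm(100)
x2 <- rnorm(100, sd = 50)
y  <- x1 + 0.02 * x2 + rnorm(100)

summary(lm(y ~ x1 + x2))$r.sq                 # the two values agree exactly
summary(lm(y ~ scale(x1) + scale(x2)))$r.sq
```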