Linear model heteroscedasticity

I have the following linear model:

[Figures: linear model residuals; distribution of observations]

$\log(Y + 1)$

> summary(Y)
Min.   :-0.0005647  
1st Qu.: 0.0001066  
Median : 0.0003060  
Mean   : 0.0004617  
3rd Qu.: 0.0006333  
Max.   : 0.0105730  
NA's   :30.0000000

How can I transform the variables to improve the prediction error and variance, particularly for the far-right fitted values?

regression data-transformation linear-model heteroscedasticity Robert Kubrick

Answers:

What is your goal? We know that heteroskedasticity does not bias our coefficient estimates; it only makes our standard errors incorrect. Hence, if you only care about the fit of the model, heteroskedasticity doesn't matter.

You can get a more efficient model (i.e., one with smaller standard errors) if you use weighted least squares. In this case, you need to estimate the variance for each observation and weight each observation by the inverse of that observation-specific variance (this is what the weights argument to lm expects). This estimation procedure changes your estimates.
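A minimal sketch of inverse-variance weighting for a one-predictor model, in plain Python rather than R. The data are simulated, the helper name `wls_line` is hypothetical, and the error standard deviation is assumed to be proportional to x purely for illustration; with real data the variances would have to be estimated first.

```python
import random

def wls_line(x, y, w):
    """Weighted least squares fit of y = a + b*x; w holds inverse-variance weights."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

random.seed(0)
x = [1 + 9 * i / 199 for i in range(200)]
# Simulated responses: error standard deviation grows with x (an assumption)
y = [2.0 + 0.5 * xi + random.gauss(0, 0.3 * xi) for xi in x]

ols = wls_line(x, y, [1.0] * len(x))                # equal weights reproduce OLS
wls = wls_line(x, y, [1.0 / xi ** 2 for xi in x])   # weights proportional to 1 / variance
print("OLS (intercept, slope):", ols)
print("WLS (intercept, slope):", wls)
```

Only the relative size of the weights matters, so they need to be known only up to a constant factor.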

Alternatively, to correct the standard errors for heteroskedasticity without changing your estimates, you can use robust standard errors. For an R implementation, see the sandwich package.
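To illustrate what robust standard errors do, here is a sketch of the basic Eicker-White (HC0) estimator for the slope of a one-predictor model, written in plain Python on simulated data. The function name `ols_with_se` and the data-generating process are assumptions made for the example; R's sandwich package provides this kind of estimator, along with small-sample refinements, for general models.

```python
import math
import random

def ols_with_se(x, y):
    """OLS fit of y = a + b*x; returns the slope, its classical SE and its HC0 SE."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    e = [yi - a - b * xi for xi, yi in zip(x, y)]   # residuals
    # Classical SE: assumes one common error variance
    se_classical = math.sqrt(sum(ei ** 2 for ei in e) / (n - 2) / sxx)
    # HC0 SE: each squared residual stands in for its own error variance
    se_hc0 = math.sqrt(sum((xi - xbar) ** 2 * ei ** 2 for xi, ei in zip(x, e))) / sxx
    return b, se_classical, se_hc0

random.seed(1)
x = [1 + 9 * i / 199 for i in range(200)]
y = [2.0 + 0.5 * xi + random.gauss(0, 0.3 * xi) for xi in x]   # simulated, fan-shaped errors
slope, se_c, se_r = ols_with_se(x, y)
print("slope:", slope, "classical SE:", se_c, "HC0 SE:", se_r)
```

The slope is the same in both cases; only the uncertainty attached to it changes.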

Using the log transformation can be a good approach to correcting heteroskedasticity, but only if all your values are positive and the new model provides a reasonable interpretation relative to the question that you are asking.
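As a quick check on the values in the question: the minimum of Y is negative, so a plain log is undefined there, and for values this close to zero log(1 + y) is almost equal to y, so that transform leaves the data nearly unchanged. The sketch below uses the quartiles and maximum copied from the summary(Y) output; the helper name `relative_change` is hypothetical.

```python
import math

# Quartiles and maximum copied from the summary(Y) output in the question
values = {"1st Qu.": 0.0001066, "Median": 0.0003060,
          "3rd Qu.": 0.0006333, "Max.": 0.0105730}

def relative_change(y):
    """Relative difference between log(1 + y) and y itself."""
    return abs(math.log1p(y) - y) / abs(y)

for label, y in values.items():
    print(f"{label:8s} y={y:.7f}  log(1+y)={math.log1p(y):.7f}  "
          f"relative difference={relative_change(y):.3%}")
```

Even at the maximum the relative difference stays well under 1%, which suggests that log(Y + 1) cannot change the shape of the residual plot much for data on this scale.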

Charlie

My main goal is to reduce the errors. I will have to look into weighted least squares, but I was under the impression that transforming the DV was the right step, given how regularly the residual variance increases for higher fitted values.

Robert Kubrick

What do you mean by "reduce the errors"? The average error is 0. Even looking at your plot, the average is 0 in any window you pick.

Charlie

I mean improving the model's predictions, i.e. reducing the overall absolute error and the error variance, particularly for the higher fitted values.

Robert Kubrick

You might want to try the Box-Cox transformation. It is a version of a power transformation:

$y \mapsto \begin{cases} \dfrac{y^\lambda - 1}{\lambda \dot y^{\lambda - 1}}, & \lambda \neq 0 \\ \dot y \ln y, & \lambda = 0 \end{cases}$

where $\dot y$ is the geometric mean of the data. When used as a transformation of the response variable, its nominal role is to make the data closer to the normal distribution, and skewness is the leading reason why the data may look non-normal. My gut feeling with your scatterplot is that it needs to be applied to (some of) the explanatory and the response variables.
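The transformation and its inverse can be sketched in a few lines of plain Python. The function names and the sample values are made up for illustration, and the data are assumed to be strictly positive; a response with negative values, like the Y in the question, would need to be shifted first, which is a separate modelling choice.

```python
import math

def geometric_mean(ys):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(y) for y in ys) / len(ys))

def box_cox(y, lam, gm):
    """Scaled Box-Cox transform of a positive y; gm is the geometric mean of the data."""
    if lam == 0:
        return gm * math.log(y)
    return (y ** lam - 1) / (lam * gm ** (lam - 1))

def inverse_box_cox(z, lam, gm):
    """Map a transformed value z back to the original scale."""
    if lam == 0:
        return math.exp(z / gm)
    return (lam * gm ** (lam - 1) * z + 1) ** (1 / lam)

ys = [0.5, 1.2, 3.4, 8.0]          # made-up positive values
gm = geometric_mean(ys)
for lam in (0, 0.5, 1):
    zs = [box_cox(y, lam, gm) for y in ys]
    back = [inverse_box_cox(z, lam, gm) for z in zs]
    print(lam, [round(v, 6) for v in back])   # round trip through the transform
```

The inverse follows from solving the formula for $y$: $y = (\lambda \dot y^{\lambda - 1} z + 1)^{1/\lambda}$ for $\lambda \neq 0$ and $y = \exp(z / \dot y)$ for $\lambda = 0$. A prediction back-transformed this way is generally not the conditional mean on the original scale.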

Some earlier discussions include "What other normalizing transformations are commonly used beyond the common ones like square root, log, etc.?" and "How should I transform non-negative data including zeros?" You can find R code by following "How to search for a statistical procedure in R?"

Econometricians stopped bothering about heteroskedasticity after the seminal work of Halbert White (1980) on setting up inferential procedures robust to heteroskedasticity (which in fact just retold the earlier story by a statistician, F. Eicker (1967)). See the Wikipedia page that I just rewrote.

StasK

Thanks, at this point I'm debating whether to apply a power transform or use robust regression to reduce the errors and improve the prediction intervals. I wonder how the two techniques compare. Also if I use the transformation I would need to back-transform the predicted values. It doesn't look like an obvious formula, does it?

Robert Kubrick

If by robust regression you mean robust standard errors as @StasK describes, that doesn't change the residuals/errors at all. The coefficients are exactly the same as OLS, giving exactly the same residuals. The standard errors of the coefficients change and are usually larger than the OLS SEs. Prediction intervals are improved in that you are now using the correct standard errors for your coefficients (though they are likely larger relative to those from OLS). If your goal is to predict $y$, you really should stick with the linear model and use the techniques that I mention in my answer.

Charlie

@Charlie I mean en.wikipedia.org/wiki/Robust_regression. I am new to this, but I understand robust regression changes the estimation technique, therefore the residuals must be different.

Robert Kubrick

Right, that is a different method and does change your estimates. I think that robust regression is better suited to cases with outliers. Depending upon which version of robust regression you decide to use and your particular data set, you can get wider confidence intervals relative to OLS.

Charlie

There is a very simple solution to the heteroskedasticity issue associated with dependent variables in time series data. I don't know if this is applicable to your dependent variable. Assuming it is, instead of using nominal Y, change it to the % change in Y from the prior period to the current period. For instance, let's say your nominal Y is GDP of $14 trillion in the most recent period. Instead, compute the change in GDP over the most recent period (let's say 2.5%).

A nominal time series always grows and is always heteroskedastic (the variance of the error grows over time because the values grow). A % change series is typically homoskedastic because the dependent variable is pretty much stationary.
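Converting levels to period-over-period percentage changes is a one-line computation. The sketch below is plain Python; the function name `pct_change` and the GDP-like figures are hypothetical and only serve as an illustration.

```python
def pct_change(levels):
    """Period-over-period percentage change; one element shorter than the input."""
    return [100.0 * (cur - prev) / prev for prev, cur in zip(levels, levels[1:])]

gdp = [13.0, 13.4, 13.66, 14.0]    # hypothetical levels, in trillions
print(pct_change(gdp))
```

The first observation is lost because it has no prior period to compare against.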

Sympa

The $Y$ values I am using are time series % changes from the previous period.

Robert Kubrick

This is surprising. Usually, % change variables are not heteroskedastic. I wonder whether the residuals are less heteroskedastic than we think, and whether the underlying issue is really one of outliers. I see 4 or 5 observations in the 0.15% range that, if removed, would make the whole graph look less heteroskedastic. Also, as others have mentioned, heteroskedasticity will not corrupt your regression coefficients, only your confidence intervals and the related standard errors. However, looking at your graph, it seems that the CIs may not be too affected and could still be useful.

Sympa