Расчет доверительных интервалов для логистической регрессии

Я использую биномиальную логистическую регрессию , чтобы определить , если воздействие has_xили has_yвоздействий на вероятность того , что пользователь нажмет на что - то. Моя модель следующая:

fit = glm(formula = has_clicked ~ has_x + has_y, 
          data=df, 
          family = binomial())

Это вывод из моей модели:

Call:
glm(formula = has_clicked ~ has_x + has_y, 
    family = binomial(), data = active_domains)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9869  -0.9719  -0.9500   1.3979   1.4233  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -0.504737   0.008847 -57.050  < 2e-16 ***
has_xTRUE -0.056986   0.010201  -5.586 2.32e-08 ***
has_yTRUE  0.038579   0.010202   3.781 0.000156 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 217119  on 164182  degrees of freedom
Residual deviance: 217074  on 164180  degrees of freedom
AIC: 217080

Number of Fisher Scoring iterations: 4

Поскольку каждый коэффициент является значимым, используя эту модель, я могу сказать, какое значение имеет любая из этих комбинаций, используя следующий подход:

predict(fit, data.frame(has_x = T, has_y=T), type = "response")

Я не понимаю, как я могу сообщить о Std. Ошибка прогноза.

Мне просто нужно использовать $1.96*SE$ ? Или мне нужно конвертировать $SE$ используя подход, описанный здесь ?
Если я хочу понять стандартную ошибку для обеих переменных, как бы я это учел?

В отличие от этого вопроса , мне интересно понять, каковы верхние и нижние границы ошибки в процентах. Например, мои предсказания показывает значение 37% для True,Trueможно рассчитать , что это для ? (0,3% выбрали, чтобы проиллюстрировать мою точку зрения) $+/- 0.3%$ $95\% CI$

regression logistic standard-error logit celenius
источник

Дубликат: stats.stackexchange.com/questions/5304/…

kjetil b halvorsen

Возможный дубликат Почему существует разница между ручным вычислением 95-процентного доверительного интервала и использованием функции confint () в R?

kjetil b halvorsen

@kjetilbhalvorsen Вы уверены, что это дубликат, так как ОП, кажется, хочет интервал прогнозирования, но, похоже, работает в масштабе ИЛИ, а не в логарифмическом масштабе, что может быть причиной проблемы?

mdewey

Если вы хотите оценить, насколько хорошо предсказывает логистическая регрессия, обычно используются другие меры, чем прогноз + SE. Один популярный критерий оценки - кривая ROC с соответствующим AUC

adibender

Может ли это помочь? stackoverflow.com/questions/47414842/…

Ксавье Бурре Сикот

Ответы:

Ваш вопрос может возникнуть из-за того, что вы имеете дело с коэффициентами и вероятностями, что поначалу сбивает с толку. Поскольку логистическая модель представляет собой нелинейное преобразование вычисления доверительные интервалы не так просты. $\beta^Tx$

Фон

Напомним, что для модели логистической регрессии

Вероятность из : $(Y = 1)$ $p = \frac{e^{\alpha + \beta_1x_1 + \beta_2 x_2}}{1 + e^{ \alpha + \beta_1x_1 + \beta_2 x_2}}$
Коэффициенты из : $(Y = 1)$ $\left( \frac{p}{1-p}\right) = e^{\alpha + \beta_1x_1 + \beta_2 x_2}$
Вход Коэффициенты из : $(Y = 1)$ $\log \left( \frac{p}{1-p}\right) = \alpha + \beta_1x_1 + \beta_2 x_2$

Рассмотрим случай, когда у вас есть увеличение на единицу переменной , т.е. $x_1$ , тогда новые шансы $x_1 + 1$

Odds (Y знак равно 1) знак равно е^{α + β_{1} ({Икс}_{1} + 1) + β_{2} {Икс}_{2}} знак равно е^{α + β_{1} {Икс}_{1} + β_{1} + β_{2} {Икс}_{2}}

$\text{Odds}(Y = 1) = e^{\alpha + \beta_1(x_1 + 1) + \beta_2x_2} = e^{\alpha + \beta_1 x_1 + \beta_1 + \beta_2x_2 }$

Следовательно, коэффициент шансов (ИЛИ)

\frac{Odds (x_{1} + 1)}{Odds (x_{1})} = \frac{e^{α + β_{1} (x_{1} + 1) + β_{2} x_{2}}}{e^{α + β_{1} x_{1} + β_{2} x_{2}}} = e^{β_{1}}

$\frac{\text{Odds}(x_1 + 1)}{\text{Odds}(x_1)} = \frac{e^{\alpha + \beta_1(x_1 + 1) + \beta_2x_2} }{e^{\alpha + \beta_1 x_1 + \beta_2x_2}} = e^{\beta_1}$

Коэффициент логарифма = $\beta_1$
Относительный риск или (отношение вероятностей) = $\frac{ \frac{e^{\alpha + \beta_1x_1 + \beta_1 + \beta_2 x_2}}{1 + e^{ \alpha + \beta_1x_1 + \beta_1 + \beta_2 x_2}}}{ \frac{e^{\alpha + \beta_1x_1 + \beta_2 x_2}}{1 + e^{ \alpha + \beta_1x_1 + \beta_2 x_2}}}$

Интерпретация коэффициентов

Как бы вы интерпретировали значение коэффициента ? Предполагая, что все остальное остается неизменным: $\beta_j$

Для каждой единицы увеличения отношение логарифмов увеличивается на . $x_j$ $\beta_j$
Для каждой единицы увеличения отношение шансов увеличивается на . $x_j$ $e^{\beta_j}$
При каждом увеличении от до отношение шансов увеличивается на $x_j$ $k$ $k + \Delta$ $e^{\beta_j \Delta}$
Если коэффициент отрицательный, то увеличение приводит к уменьшению отношения шансов. $x_j$

Доверительные интервалы для одного параметра $\beta_j$

Мне просто нужно использовать ? Или мне нужно конвертировать SE, используя подход, описанный здесь? $1.96∗SE$

$\beta_j$ оценивается с помощью Maxiumum правдоподобия, теория MLE говорит нам, что он асимптотически нормален, и поэтому мы можем использовать большой выборочный доверительный интервал Вальда, чтобы получить обычный

β_{j} \pm z^{*} S E (β_{j})

$\beta_j \pm z^* SE(\beta_j)$

e^{β_{j} \pm z^{*} S E (β_{j})}

$e^{\beta_j \pm z^* SE(\beta_j)}$

which is a confidence interval on the odds ratio. Note that these intervals are for a single parameter only.

If I want to understand the standard-error for both variables how would I consider that?

If you include several parameters you can use the Bonferroni procedure, otherwise for all parameters you can use the confidence interval for probability estimates

Bonferroni procedure for several parameters

If $g$ parameters are to be estimated with family confidence coefficient of approximately $1 - \alpha$ , the joint Bonferroni confidence limits are

β_{g} \pm z_{(1 - \frac{α}{2 g})} S E (β_{g})

$\beta_g \pm z_{(1 - \frac{\alpha}{2g})}SE(\beta_g)$

Confidence intervals for probability estimates

The logistic model outputs an estimation of the probability of observing a one and we aim to construct a frequentist interval around the true probability $p$ such that $Pr(p_{L} \leq p \leq p_{U}) = .95$

One approach called endpoint transformation does the following:

Compute the upper and lower bounds of the confidence interval for the linear combination $x^T\beta$ (using the Wald CI)
Apply a monotonic transformation to the endpoints $F(x^T\beta)$ to obtain the probabilities.

Since $Pr(x^T\beta) = F(x^T\beta)$ is a monotonic transformation of $x^T\beta$

[P r (x^{T} β)_{L} \leq P r (x^{T} β) \leq P r (x^{T} β)_{U}] = [F (x^{T} β)_{L} \leq F (x^{T} β) \leq F (x^{T} β)_{U}]

$[Pr(x^T\beta)_L \leq Pr(x^T\beta) \leq Pr(x^T\beta)_U] = [F(x^T\beta)_L \leq F(x^T\beta) \leq F(x^T\beta)_U]$

Concretely this means computing $\beta^Tx \pm z^* SE(\beta^Tx)$ and then applying the logit transform to the result to get the lower and upper bounds:

[\frac{e^{x^{T} β - z^{*} S E (x^{T} β)}}{1 + e^{x^{T} β - z^{*} S E (x^{T} β)}}, \frac{e^{x^{T} β + z^{*} S E (x^{T} β)}}{1 + e^{x^{T} β + z^{*} S E (x^{T} β)}},]

$[\frac{e^{x^T\beta - z^* SE(x^T\beta)}}{1 + e^{x^T\beta - z^* SE(x^T\beta)}}, \frac{e^{x^T\beta + z^* SE(x^T\beta)}}{1 + e^{x^T\beta + z^* SE(x^T\beta)}},]$

The estimated approximate variance of $x^T\beta$ can be calculated using the covariance matrix of the regression coefficients using

V a r (x^{T} β) = x^{T} Σ x

$Var(x^T\beta) = x^T \Sigma x$

The advantage of this method is that the bounds cannot be outside the range $(0,1)$

There are several other approaches as well, using the delta method, bootstrapping etc.. which each have their own assumptions, advantages and limits.

Sources and info

My favorite book on this topic is "Applied Linear Statistical Models" by Kutner, Neter, Li, Chapter 14

Otherwise here are a few online sources:

Xavier Bourret Sicotte
источник

Much of this is about CI for the coefficients which is a fine thing for the OP to know about but are we sure that is what he needs? You later section seems to me more relevant but perhaps the distinctions may be missed if read too quickly?

mdewey

Yes you are probably right - but understanding odds, log odds and probabilities for log regression is something I struggled with in the past - I hope this post summarises the topic well enough to such that it might help someone in the future. Perhaps I could answer the question more explicitly by providing a CI but we would need the covariance matrix

Xavier Bourret Sicotte

To get the 95% confidence interval of the prediction you can calculate on the logit scale and then convert those back to the probability scale 0-1. Here is an example using the titanic dataset.

library(titanic)
data("titanic_train")

titanic_train$Pclass = factor(titanic_train$Pclass, levels = c(1,2,3), labels = c('First','Second','Third'))

fit = glm(Survived ~ Sex + Pclass, data=titanic_train, family = binomial())

inverse_logit = function(x){
  exp(x)/(1+exp(x))
}

predicted = predict(fit, data.frame(Sex='male', Pclass='First'), type='link', se.fit=TRUE)

se_high = inverse_logit(predicted$fit + (predicted$se.fit*1.96))
se_low = inverse_logit(predicted$fit - (predicted$se.fit*1.96))
expected = inverse_logit(predicted$fit)

The mean and low/high 95% CI.

> expected
        1 
0.4146556 
> se_high
        1 
0.4960988 
> se_low
        1 
0.3376243

And the output from just using type='response', which only gives the mean

predict(fit, data.frame(Sex='male', Pclass='First'), type='response')
        1 
0.4146556

Shawn
источник

predict(fit, data.frame(Sex='male', Pclass='First'), type='response', se.fit=TRUE) will work.

Tony416