Including the interaction but not the main effects in a model

85

Is it ever valid to include a two-way interaction in a model without including the main effects? What if your hypothesis is only about the interaction? Do you still need to include the main effects?

Glen
3
My philosophy is to run lots of models, check their predictions, compare them, explain, run more models.
Michael Bishop
11
If the interactions are only significant when the main effects are in the model, it is possible that the main effects are significant and the interactions are not. Consider one highly significant main effect with variance on the order of 100 and another insignificant main effect for which all values are approximately equal to one, with very low variance. Their interaction hardly matters, but the interaction term will be significant if the main effects are removed from the model.
Thomas Levine
4
@Thomas, should your first line read "if the interactions are only significant when the main effects are NOT in the model, ..."?
Glen
2
Oh yes, it should!
Thomas Levine

Answers:

55

In my experience, not only is it necessary to have all lower-order effects in the model when they are connected to higher-order effects, it is also important to model properly (e.g., allowing nonlinearity) the main effects that are seemingly unrelated to the factors in the interactions of interest. That is because interactions between $x_1$ and $x_2$ can stand in for main effects of $x_3$ and $x_4$. Interactions sometimes seem to be needed because they are collinear with omitted variables or with omitted nonlinear (e.g., spline) terms.
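
Below is a minimal R sketch (my own, with simulated data and hypothetical variables x1, x2, x3, x4, not Frank Harrell's code) of the idea: keep the interaction of interest while modelling the other main effects flexibly, e.g. with natural splines, and test groups of terms as a chunk rather than one at a time.

```r
library(splines)

set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
y  <- x1 * x2 + sin(x3) + x4^2 + rnorm(n)            # simulated outcome

# Interaction of interest plus flexibly modelled "unrelated" main effects
fit  <- lm(y ~ x1 * x2 + ns(x3, df = 3) + ns(x4, df = 3))

# Chunk test: all x3/x4 terms at once, rather than term-by-term deletion
fit0 <- lm(y ~ x1 * x2)
anova(fit0, fit)
```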

Frank Harrell
1
Does this mean we should start removing terms from y ~ x1 * x2 * x3 * x4 by deleting the highest-order terms first, i.e. the usual way of removing terms, right?
Curious
9
Removing terms is not advisable unless you can test whole classes of terms as a "chunk". For example, it may be reasonable to keep or remove all interaction terms, or to keep or remove all 3rd- and 4th-order interactions.
Фрэнк Харрелл
What is wrong with removing only some of the interactions, in a particular order?
user1205901
3
If you have a fully pre-specified order that was not chosen by looking at the data, then you can do that. In general, when you make multiple decisions using multiple P-values, you will run into problems with collinearity and multiplicity.
Frank Harrell
2
I feel this answer is unclear and only partially answers the question. Indeed, it says that the main effects must be modeled, but it does not say whether it is valid to regress them out in order to focus only on the interaction, which is what is done in some models such as gPPI (see my answer below).
gaborous
37

You ask whether it is ever valid. Let me offer a common example, whose elucidation may suggest some additional analytical approaches to you.

The simplest example of an interaction is a model with one dependent variable $Z$ and two independent variables $X$ and $Y$, of the form

$$Z = \alpha + \beta X + \gamma Y + \delta XY + \varepsilon,$$

with $\varepsilon$ a random variable having zero expectation, and parameters $\alpha$, $\beta$, $\gamma$, and $\delta$. It is often worth checking whether $\delta$ approximates $\beta\gamma/\alpha$, because an algebraically equivalent expression of the same model is

$$Z = \alpha\left(1 + \beta' X + \gamma' Y + \delta' XY\right) + \varepsilon$$

$$\;\; = \alpha(1 + \beta' X)(1 + \gamma' Y) + \alpha(\delta' - \beta'\gamma')XY + \varepsilon$$

(where $\beta = \alpha\beta'$, etc.).

Whence, if there's a reason to suppose $(\delta' - \beta'\gamma') \approx 0$, we can absorb it in the error term $\varepsilon$. Not only does this give a "pure interaction", it does so without a constant term. This in turn strongly suggests taking logarithms. Some heteroscedasticity in the residuals--that is, a tendency for residuals associated with larger values of $Z$ to be larger in absolute value than average--would also point in this direction. We would then want to explore an alternative formulation

$$\log(Z) = \log(\alpha) + \log(1 + \beta' X) + \log(1 + \gamma' Y) + \tau$$

with iid random error $\tau$. Furthermore, if we expect $\beta' X$ and $\gamma' Y$ to be large compared to $1$, we would instead just propose the model

$$\log(Z) = \bigl(\log(\alpha) + \log(\beta') + \log(\gamma')\bigr) + \log(X) + \log(Y) + \tau$$

$$\;\; = \eta + \log(X) + \log(Y) + \tau.$$

This new model has just a single parameter $\eta$ instead of four parameters ($\alpha$, $\beta'$, etc.) subject to a quadratic relation ($\delta' = \beta'\gamma'$), a considerable simplification.

I am not saying that this is a necessary or even the only step to take, but I am suggesting that this kind of algebraic rearrangement of the model is usually worth considering whenever interactions alone appear to be significant.
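
As a concrete illustration of this rearrangement, here is a small R sketch (my own simulation, not part of whuber's answer): when $Z$ is essentially proportional to $XY$, the "pure interaction" fit on the original scale and the log-log fit tell the same story, and the residuals of the former show the heteroscedasticity that motivates taking logarithms.

```r
set.seed(42)
n <- 200
X <- runif(n, 1, 10)
Y <- runif(n, 1, 10)
Z <- 2 * X * Y * exp(rnorm(n, sd = 0.1))      # multiplicative error

fit_int <- lm(Z ~ X:Y + 0)                    # "pure interaction", no constant term
fit_log <- lm(log(Z) ~ log(X) + log(Y))       # single-intercept log formulation

summary(fit_int)
summary(fit_log)             # slopes near 1, intercept near log(2)
plot(fit_int, which = 1)     # residuals fan out with fitted values
```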

Some excellent ways to explore models with interaction, especially with just two and three independent variables, appear in chapters 10 - 13 of Tukey's EDA.

whuber
Can you provide an example of when you would be able to assume $\delta' - \beta'\gamma'$ would approximate zero? It's difficult for me to think of those terms in relation to the original terms and what they would mean.
djhocking
@djhocking Any situation in which the alternative formulation is a good model will necessarily imply $\alpha(\delta' - \beta'\gamma') \approx 0$ in the first model. A special case is the final model, which is a simple linear relationship between $\log(Z)$ and the logs of $X$ and $Y$, tantamount to a multiplicative relationship $Z \propto XY$ on the original scale. Such relationships abound in nature--it simply says $Z$ is directly and separately proportional to both $X$ and $Y$.
whuber
30

While it is often stated in textbooks that one should never include an interaction in a model without the corresponding main effects, there are certainly examples where this would make perfect sense. I'll give you the simplest example I can imagine.

Suppose subjects randomly assigned to two groups are measured twice, once at baseline (i.e., right after the randomization) and once after group T received some kind of treatment, while group C did not. Then a repeated-measures model for these data would include a main effect for measurement occasion (a dummy variable that is 0 for baseline and 1 for the follow-up) and an interaction term between the group dummy (0 for C, 1 for T) and the time dummy.

The model intercept then estimates the average score of the subjects at baseline (regardless of the group they are in). The coefficient for the measurement occasion dummy indicates the change in the control group between baseline and the follow-up. And the coefficient for the interaction term indicates how much bigger/smaller the change was in the treatment group compared to the control group.

Here, it is not necessary to include the main effect for group, because at baseline, the groups are equivalent by definition due to the randomization.

One could of course argue that the main effect for group should still be included, so that, in case the randomization failed, this will be revealed by the analysis. However, that is equivalent to testing the baseline means of the two groups against each other. And there are plenty of people who frown upon testing for baseline differences in randomized studies (of course, there are also plenty who find it useful, but this is another issue).
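
A minimal R sketch of this parameterization (simulated data; the variable names are mine): a time main effect and a group-by-time interaction, but no group main effect, because the groups are equivalent at baseline by randomization.

```r
set.seed(7)
n     <- 100                                  # subjects per group
group <- rep(c(0, 1), each = 2 * n)           # 0 = control, 1 = treatment
time  <- rep(c(0, 1), times = 2 * n)          # 0 = baseline, 1 = follow-up
y     <- 50 + 2 * time + 3 * time * group + rnorm(4 * n, sd = 5)

fit <- lm(y ~ time + time:group)              # no main effect for group
summary(fit)
# (Intercept): baseline mean (both groups);  time: change in the control group;
# time:group: additional change in the treatment group.
# A real repeated-measures analysis would also model the within-subject
# correlation (e.g., a random intercept per subject).
```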

Wolfgang
4
Problems arise when the time zero (baseline) measurement is used as a first response variable. The baseline is often used as an entry criterion for the study. For example, a study might enroll patients with systolic blood pressure (bp) > 140, then randomize to 2 bp treatments and follow the bps. Initially, bp has a truncated distribution and the later measurements will be more symmetric. It is messy to model 2 distributional shapes in the same model. There are many more reasons to treat the baseline as a baseline covariate.
Frank Harrell
3
That's a good point, but recent studies suggest that this is not an issue. In fact, it seems that there are more disadvantages to using baseline scores as a covariate. See: Liu, G. F., et al. (2009). Should baseline be a covariate or dependent variable in analyses of change from baseline in clinical trials? Statistics in Medicine, 28, 2509-2530.
Wolfgang
3
I have read that paper. It is not convincing, and Liu has not studied a variety of the kinds of clinical trial situations I described. More arguments are at biostat.mc.vanderbilt.edu/wiki/pub/Main/RmS/course2.pdf in the chapter about analysis of serial (longitudinal) data.
Frank Harrell
1
Thanks for the link. I assume you are referring to the discussion under 8.2.3. Those are some interesting points, but I don't think this gives a definite answer. I am sure that the paper by Liu et al. isn't the ultimate answer either, but it does suggest for example that non-normality of the baseline values is not a crucial issue. Maybe this is something for a separate discussion item, as it does not directly relate to the OP's question.
Wolfgang
2
Yes, it depends on the amount of non-normality. Why depend on good fortune when formulating a model? There are also many purely philosophical reasons to treat time zero measurements as baseline measurements (see quotes from Senn and Rochon in my notes).
Frank Harrell
19

The reason to keep the main effects in the model is for identifiability. Hence, if the purpose is statistical inference about each of the effects, you should keep the main effects in the model. However, if your modeling purpose is solely to predict new values, then it is perfectly legitimate to include only the interaction if that improves predictive accuracy.
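
For the prediction-only case, here is a small sketch (simulated data, hypothetical variables) of what "include only the interaction if that improves predictive accuracy" might look like in practice: compare out-of-sample error and keep whichever specification predicts better.

```r
set.seed(3)
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1.5 * x1 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2)

train <- 1:200; test <- 201:300
full <- lm(y ~ x1 * x2, data = dat[train, ])   # main effects + interaction
only <- lm(y ~ x1:x2,   data = dat[train, ])   # interaction only

rmse <- function(fit) sqrt(mean((dat$y[test] - predict(fit, dat[test, ]))^2))
c(full = rmse(full), interaction_only = rmse(only))
```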

Galit Shmueli
5
Can you please be a litte bit more explicit about the identifiability problem?
ocram
6
I don't believe that a model omitting main effects is necessarily unidentified. Perhaps you mean "interpretability" rather than "identifiability" (which is a technical term with a precise definition)
JMS
6
@JMS: Yes, it kills interpretability. However, the term "identifiability" is used differently by statisticians and by social scientists. I meant the latter, where (loosely speaking) you want to identify each statistical parameter with a particular construct. By dropping the main effect you no longer can match construct to parameter.
Galit Shmueli
13

This is implicit in many of the answers others have given, but the simple point is that models w/ a product term but w/ & w/o the moderator & predictor are just different models. Figure out what each means given the process you are modeling, and whether a model w/o the moderator & predictor makes more sense given your theory or hypothesis. The observation that the product term is significant only when the moderator & predictor are not included doesn't tell you anything (except maybe that you are fishing around for "significance") w/o a cogent explanation of why it makes sense to leave them out.

dmk38
I came here to investigate interpretation of main effects in the presence of a significant interaction term and this answer really helped a lot. Thanks!
Patrick Williams
9

Arguably, it depends on what you're using your model for. But I've never seen a reason not to run and describe models with main effects, even in cases where the hypothesis is only about the interaction.

Michael Bishop
What if the interaction is only significant when the main effects are not in the model?
Glen
3
@Glen - There are many things to think about other than statistical significance. See this. Better to examine your overall model fit (plot your residuals against your predictions for each model you fit), your theory, and your motivations for modeling.
Michael Bishop
7

I will borrow a paragraph from the book An Introduction to Survival Analysis Using Stata by M. Cleves, R. Gutierrez, W. Gould, and Y. Marchenko, published by Stata Press, to answer your question.

It is common to read that interaction effects should be included in the model only when the corresponding main effects are also included, but there is nothing wrong with including interaction effects by themselves. [...] The goal of a researcher is to parametrize what is reasonably likely to be true for the data considering the problem at hand and not merely following a prescription.

andrea
3
Absolutely terrible advice.
Frank Harrell
3
@Frank, would you mind expanding on your comment? On the face of it, "parameterize what is reasonably likely to be true for the data" makes a lot of sense.
whuber
6
See stats.stackexchange.com/questions/11009/…. The data are incapable of telling you what is true, and such an approach is heavily dependent on the measurement origin of the variables being multiplied. Assessing isolated interaction effects of temperature in Fahrenheit will give a different picture than using Celsius.
Frank Harrell
@Frank: Thanks, I found it :-). It is now part of this thread.
whuber
7

Both x and y will be correlated with xy (unless you have taken a specific measure to prevent this by using centering). Thus if you obtain a substantial interaction effect with your approach, it will likely amount to one or more main effects masquerading as an interaction. This is not going to produce clear, interpretable results. What is desirable is instead to see how much the interaction can explain over and above what the main effects do, by including x, y, and (preferably in a subsequent step) xy.
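
A short R sketch (simulated data) of this hierarchical approach: center x and y, enter the main effects first, and then ask how much xy explains over and above them.

```r
set.seed(11)
n <- 200
x <- rnorm(n, mean = 5); y <- rnorm(n, mean = 5)
z <- 2 * x + 3 * y + 0.5 * x * y + rnorm(n)

xc <- as.numeric(scale(x, scale = FALSE))   # centering reduces the x, xy correlation
yc <- as.numeric(scale(y, scale = FALSE))

step1 <- lm(z ~ xc + yc)                    # main effects only
step2 <- lm(z ~ xc + yc + xc:yc)            # add the interaction in a second step
anova(step1, step2)                         # incremental contribution of xy
```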

As to terminology: yes, $\beta_0$ is called the "constant." On the other hand, "partial" has specific meanings in regression and so I wouldn't use that term to describe your strategy here.

Some interesting examples that will arise once in a blue moon are described at this thread.

rolando2
7

I would suggest it is simply a special case of model uncertainty. From a Bayesian perspective, you simply treat this in exactly the same way you would treat any other kind of uncertainty, by either:

  1. Calculating its probability, if it is the object of interest
  2. Integrating or averaging it out, if it is not of interest, but may still affect your conclusions

This is exactly what people do when testing for "significant effects" by using t-quantiles instead of normal quantiles. Because you have uncertainty about the "true noise level" you take this into account by using a more spread out distribution in testing. So from your perspective the "main effect" is actually a "nuisance parameter" in relation to the question that you are asking. So you simply average out the two cases (or more generally, over the models you are considering). So I would have the (vague) hypothesis:

$H_{int}$: The interaction between A and B is significant.

I would say that although not precisely defined, this is the question you want to answer here. And note that it is not the verbal statements such as above which "define" the hypothesis, but the mathematical equations as well. We have some data $D$, and prior information $I$, then we simply calculate:
$$P(H_{int}\mid DI) = \frac{P(H_{int}\mid I)\,P(D\mid H_{int}I)}{P(D\mid I)}$$
(small note: no matter how many times I write out this equation, it always helps me understand the problem better. weird). The main quantity to calculate is the likelihood $P(D\mid H_{int}I)$; this makes no reference to the model, so the model must have been removed using the law of total probability:
$$P(D\mid H_{int}I) = \sum_{m=1}^{N_M} P(DM_m\mid H_{int}I) = \sum_{m=1}^{N_M} P(M_m\mid H_{int}I)\,P(D\mid M_m H_{int}I)$$
where $M_m$ indexes the $m$th model, and $N_M$ is the number of models being considered. The first term is the "model weight", which says how much the data and prior information support the $m$th model. The second term indicates how much the $m$th model supports the hypothesis. Plugging this equation back into the original Bayes theorem gives:
$$P(H_{int}\mid DI) = \frac{P(H_{int}\mid I)}{P(D\mid I)} \sum_{m=1}^{N_M} P(M_m\mid H_{int}I)\,P(D\mid M_m H_{int}I)$$
$$= \frac{1}{P(D\mid I)} \sum_{m=1}^{N_M} P(DM_m\mid I)\,\frac{P(M_m H_{int} D\mid I)}{P(DM_m\mid I)} = \sum_{m=1}^{N_M} P(M_m\mid DI)\,P(H_{int}\mid DM_m I)$$

And you can see from this that $P(H_{int}\mid DM_m I)$ is the "conditional conclusion" of the hypothesis under the $m$th model (this is usually all that is considered, for a chosen "best" model). Note that this standard analysis is justified whenever $P(M_m\mid DI) \approx 1$ - an "obviously best" model - or whenever $P(H_{int}\mid DM_j I) \approx P(H_{int}\mid DM_k I)$ - all models give the same/similar conclusions. However if neither are met, then Bayes' Theorem says the best procedure is to average out the results, placing higher weights on the models which are most supported by the data and prior information.
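
As a rough numerical sketch of this averaging (my own, not from the answer above, and only an approximation): use BIC-based weights as stand-ins for $P(M_m\mid DI)$ over the "with" and "without main effects" models, and average each model's conditional conclusion about the interaction.

```r
set.seed(5)
n <- 150
A <- rnorm(n); B <- rnorm(n)
y <- 0.8 * A * B + rnorm(n)

m1 <- lm(y ~ A + B + A:B)            # interaction with main effects
m2 <- lm(y ~ A:B)                    # interaction only
bic <- c(BIC(m1), BIC(m2))
w   <- exp(-0.5 * (bic - min(bic)))
w   <- w / sum(w)                    # approximate model weights P(M_m | D I)

# Per-model conclusion about the interaction (here just a p-value, standing in
# for P(H_int | D M_m I)), averaged over the models with the weights above.
p <- c(summary(m1)$coefficients["A:B", 4],
       summary(m2)$coefficients["A:B", 4])
sum(w * p)
```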

probabilityislogic
5

It is very rarely a good idea to include an interaction term without the main effects involved in it. David Rindskopf of CCNY has written some papers about those rare instances.

Peter Flom - Reinstate Monica
5

There are various processes in nature that involve only an interaction effect, and laws that describe them. For instance, Ohm's law. In psychology you have, for instance, the performance model of Vroom (1964): Performance = Ability x Motivation. Now, you might expect to find a significant interaction effect when this law is true. Regretfully, this is not the case. You might easily end up finding two main effects and an insignificant interaction effect (for a demonstration and further explanation see Landsheer, van den Wittenboer and Maassen (2006), Social Science Research 35, 274-294). The linear model is not very well suited for detecting interaction effects; Ohm might never have found his law if he had used linear models.
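
A small simulation (my own, not from the answer) makes the point concrete: data generated by a purely multiplicative mechanism, when fitted with a linear model, often hand most of the product signal to the main effects, so the interaction term can look unimpressive even though the true mechanism is "only an interaction".

```r
set.seed(2)
n <- 200
ability     <- runif(n, 1, 5)
motivation  <- runif(n, 1, 5)
performance <- ability * motivation + rnorm(n, sd = 8)   # noisy measurements

summary(lm(performance ~ ability * motivation))
# The linear parts of the product are absorbed by the main effects; the
# leftover (centered) interaction signal is comparatively weak.
```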

As a result, interpreting interaction effects in linear models is difficult. If you have a theory that predicts an interaction effect, you should include it even when insignificant. You may want to ignore main effects if your theory excludes those, but you will find that difficult, as significant main effects are often found in the case of a true data generating mechanism that has only a multiplicative effect.

My answer is: yes, it can be valid to include a two-way interaction in a model without including the main effects. Linear models are excellent tools to approximate the outcomes of a large variety of data generating mechanisms, but their formulas cannot easily be interpreted as a valid description of the data generating mechanism.

Hans Landsheer
4

This one is tricky and happened to me in my last project. I would explain it this way: let's say you had variables A and B which came out significant independently, and from a business sense you thought that an interaction of A and B seemed sensible. You included the interaction, which came out significant, but B lost its significance. You would explain your model by showing both sets of results. The results would show that initially B was significant, but when seen in light of A it lost its sheen. So B is a good variable, but only when seen in light of various levels of A (if A is a categorical variable). It's like saying Obama is a good leader when seen in the light of his SEAL forces. So Obama*SEAL will be a significant variable. But Obama when seen alone might not be as important. (No offense to Obama, just an example.)

ayush biyani
1
Here it is kind of the opposite. The interaction (of interest) is only significant when the main effects are not in the model.
Glen
3

F = m*a, force equals mass times acceleration.

It is not represented as F = m + a + ma, or some other linear combination of those parameters. Indeed, only the interaction between mass and acceleration would make sense physically.

nick michalak
2
What applies to an incontrovertible physics equation with no room for variability is not necessarily true, accurate, or productive when modeling data characterized by variability.
rolando2
2

[Figure: Interaction with and without a main effect. Blue is one condition, red another. Their respective effects are tested over three consecutive measurements.]

Is it ever valid to include a two-way interaction without main effect?

Yes, it can be valid and even necessary. If, for example, in panel 2 of the figure you included a factor for the main effect (the average difference between the blue and red conditions), this would make the model worse.

What if your hypothesis is only about the interaction, do you still need to include the main effects?

Your hypothesis might be true independent of there being a main effect. But the model might need it to best describe the underlying process. So yes, you should try with and without.

Note: you need to center the coding of the "continuous" independent variable (the measurement occasion in the example). Otherwise the interaction coefficients in the model will not be symmetrically distributed (there is no coefficient for the first measurement in the example).

Sol Hator
1

If the variables in question are categorical, then including interactions without the main effects is just a reparameterization of the model, and the choice of parameterization depends on what you are trying to accomplish with your model. Interacting continuous variables with other continuous variables, or with categorical variables, is a whole different story. See this FAQ from UCLA's Institute for Digital Research and Education.
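
A tiny R check (simulated data) of the reparameterization point for categorical factors: the "interaction only" and "main effects plus interaction" specifications span the same space of cell means, so they produce identical fitted values; only the meaning of the individual coefficients changes.

```r
set.seed(9)
a <- factor(sample(c("a1", "a2"), 120, replace = TRUE))
b <- factor(sample(c("b1", "b2", "b3"), 120, replace = TRUE))
y <- rnorm(120, mean = as.integer(a) * as.integer(b))

fit_full <- lm(y ~ a * b)    # main effects plus interaction
fit_int  <- lm(y ~ a:b)      # interaction only: a different parameterization

all.equal(fitted(fit_full), fitted(fit_int))   # TRUE: same model, relabelled
```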

David Beede
1

Yes this can be valid, although it is rare. But in this case you still need to model the main effects, which you will afterward regress out.

Indeed, in some models only the interaction is interesting, such as drug-testing/clinical models. This is, for example, the basis of the Generalized PsychoPhysiological Interactions (gPPI) model: y = ax + bxh + ch, where x and y are voxels/regions of interest and h is the block/event design.

In this model, both a and c will be regressed out; only b will be kept for inference (the beta coefficients). Indeed, both a and c represent spurious activity in our case, and only b represents what cannot be explained by spurious activity: the interaction with the task.
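
A rough sketch (simulated time series; not the actual gPPI implementation) of that setup: fit the main effects and the interaction together, then base inference only on the interaction coefficient b.

```r
set.seed(6)
t <- 200
h <- rep(c(0, 1), each = 20, length.out = t)   # hypothetical block design
x <- rnorm(t)                                  # seed-region signal
y <- 0.5 * x + 0.8 * x * h + 0.3 * h + rnorm(t)

fit <- lm(y ~ x + x:h + h)                     # a (x), b (x:h), c (h)
summary(fit)$coefficients["x:h", ]             # only b is carried to inference
```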

gaborous
1

The short answer: if you include the interaction in the fixed effects, then the main effects are automatically included whether or not you explicitly specify them in your code. The only difference is your parametrization, i.e., what the parameters in your model mean (e.g., whether they are group means or differences from reference levels).

Assumptions: I assume we are working in the general linear model and are asking when we can use the fixed-effects specification with only the interaction, AB, instead of A + B + AB, where A and B are (categorical) factors.

Mathematical clarification: We assume that the response vector $Y \sim N(\xi, \sigma^2 I_n)$. If $X_A$, $X_B$ and $X_{AB}$ are the design matrices for the three factors, then a model with "main effects and interaction" corresponds to the restriction $\xi \in \operatorname{span}\{X_A, X_B, X_{AB}\}$. A model with "only interaction" corresponds to the restriction $\xi \in \operatorname{span}\{X_{AB}\}$. However, $\operatorname{span}\{X_{AB}\} = \operatorname{span}\{X_A, X_B, X_{AB}\}$. So, it's two different parametrizations of the same model (or the same family of distributions, if you are more comfortable with that terminology).

I just saw that David Beede provided a very similar answer (apologies), but I thought I would leave this up for those who respond well to a linear algebra perspective.
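
A quick numerical check of the span equality (my own, with simulated factors): adding the main-effect columns to the interaction-only design matrix does not increase its rank, so the two specifications describe the same linear space.

```r
set.seed(4)
A <- factor(sample(letters[1:3], 60, replace = TRUE))
B <- factor(sample(letters[1:2], 60, replace = TRUE))

X_AB  <- model.matrix(~ A:B)            # "interaction only" design
X_all <- model.matrix(~ A + B + A:B)    # main effects and interaction

qr(X_AB)$rank == qr(cbind(X_AB, X_all))$rank   # TRUE: same column span
```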

Ketil B T