Why is skewed data not preferred for modeling?

In most cases, when people talk about variable transformations (for both predictor and response variables), they discuss ways of treating skewness in the data (e.g., the log transformation, the Box-Cox transformation, etc.). What I cannot understand is why removing skewness is considered such a common best practice. How does skewness affect the performance of various kinds of models, such as tree-based models, linear models, and non-linear models? Which kinds of models are more affected by skewness, and why?

saurav shekhar
In order to give a reasonable answer, please clarify what you mean by: a) data, b) modeling, and c) models. The key question, as usual, is what you want to do with it. But what is it?
cherub
I updated my answer to add some relevant citations and expand on the claims.
Tavrock

Answers:

When removing skewness, transformations attempt to make the dataset follow a Gaussian distribution. The reason is simply that if the dataset can be transformed to be statistically close enough to Gaussian, the largest possible set of tools becomes available to use. Tests such as ANOVA, the t-test, the F-test, and many others depend on the data having constant variance (σ²) or following a Gaussian distribution.1

There are methods that are more robust1 (such as using Levene's test instead of Bartlett's test), but most tests and models that work well with other distributions require that you know which distribution you are working with, and are typically appropriate for only a single distribution as well.
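As an illustration, here is a minimal sketch (assuming numpy and scipy are installed; the lognormal data are synthetic) of how a Box-Cox transform pulls a skewed sample toward symmetry:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # strongly right-skewed

# Box-Cox estimates the power lambda that best normalizes the data;
# it requires strictly positive inputs.
x_bc, lam = stats.boxcox(x)

print(f"skewness before: {stats.skew(x):.2f}")     # large and positive
print(f"skewness after:  {stats.skew(x_bc):.2f}")  # close to 0
print(f"estimated lambda: {lam:.2f}")              # near 0, i.e. ~log transform
```

For lognormal data the estimated λ comes out near zero, which is why a plain log transform is so often all that is needed.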

To quote the NIST Engineering Statistics Handbook:

In regression modeling, we often apply transformations to achieve the following two goals:

  1. to satisfy the homogeneity of variances assumption for the errors.
  2. to linearize the fit as much as possible.

Some care and judgment is required in that these two goals can conflict. We generally try to achieve homogeneous variances first and then address the issue of trying to linearize the fit.

and in another location

A model involving a response variable and a single independent variable has the form:

Y_i = f(X_i) + E_i

where Y is the response variable, X is the independent variable, f is the linear or non-linear fit function, and E is the random component. For a good model, the error component should behave like:

  1. random drawings (i.e., independent);
  2. from a fixed distribution;
  3. with fixed location; and
  4. with fixed variation.

In addition, for fitting models it is usually further assumed that the fixed distribution is normal and the fixed location is zero. For a good model the fixed variation should be as small as possible. A necessary component of fitting models is to verify these assumptions for the error component and to assess whether the variation for the error component is sufficiently small. The histogram, lag plot, and normal probability plot are used to verify the fixed distribution, location, and variation assumptions on the error component. The plot of the response variable and the predicted values versus the independent variable is used to assess whether the variation is sufficiently small. The plots of the residuals versus the independent variable and the predicted values are used to assess the independence assumption.

Assessing the validity and quality of the fit in terms of the above assumptions is an absolutely vital part of the model-fitting process. No fit should be considered complete without an adequate model validation step.
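Those checks are easy to run in practice. Below is a minimal sketch (assuming numpy, scipy, and matplotlib are installed; the straight-line data are synthetic) of the four residual plots the handbook describes:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3, size=x.size)

# Ordinary least-squares line and its residuals.
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

fig, ax = plt.subplots(2, 2, figsize=(8, 6))
ax[0, 0].hist(resid, bins=15)                  # fixed distribution?
ax[0, 0].set_title("Histogram of residuals")
ax[0, 1].scatter(resid[:-1], resid[1:], s=10)  # lag plot: independence?
ax[0, 1].set_title("Lag plot of residuals")
stats.probplot(resid, plot=ax[1, 0])           # normality of the errors?
ax[1, 1].scatter(x, resid, s=10)               # constant variance in x?
ax[1, 1].set_title("Residuals vs X")
plt.tight_layout()
plt.show()
```

If the histogram is skewed or the residuals fan out with x, a transformation of the response is the usual first remedy.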


  1. (abbreviated) citations for claims:
    • Breyfogle III, Forrest W. Implementing Six Sigma
    • Pyzdek, Thomas. The Six Sigma Handbook
    • Montgomery, Douglas C. Introduction to Statistical Quality Control
    • Ed. Cubberly, William H. and Bakerjan, Ramon. Tool and Manufacturing Engineers Handbook: Desktop Edition
Tavrock
Thanks for your response Tavrock. But as far as I know, ANOVA, the t-test, and the F-test are not used in decision trees (at least not to perform splits). Also, in linear regression most of the assumptions regarding the shape of the distribution relate to the errors. If the errors are skewed then these tests fail. So this would mean that skewness of the predictor variables should not affect the quality of prediction for these models. Please correct me if I am wrong. Thanks again!!
saurav shekhar
Can you clarify your question - do you want to know about transforming the response variable, or about transforming the predictor variables, or both?
Groovy_Worm
@Groovy_Worm thanks for pointing that out. In this question I am concerned about both predictor and response variables.
saurav shekhar
You might be looking for generalized linear modeling (GLM). In linear regression, you typically assume that your dependent variable follows a Gaussian distribution conditional on the random variables X and e. With GLM, you can expand your universe to allow for (almost) any type of distribution for your dependent variable conditional on your independent variables, via a link function that you specify.
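As a minimal sketch of that suggestion (assuming statsmodels is installed; the Gamma-distributed data below are synthetic), a skewed response can be modeled directly with a Gamma GLM and a log link instead of being transformed:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 2, size=500)
mu = np.exp(1.0 + 0.8 * x)                # true mean on the log-link scale
y = rng.gamma(shape=2.0, scale=mu / 2.0)  # skewed response with E[y] = mu

X = sm.add_constant(x)
model = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log()))
result = model.fit()
print(result.params)  # should be close to [1.0, 0.8]
```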
Chris K

This is mostly true for parametric models. As Tavrock said, having a response variable that is not skewed makes the Gaussian approximation of parameter estimation work better, because symmetric distributions converge to the Gaussian much faster than skewed ones. This means that, if you have skewed data, transforming it will let you use confidence intervals and tests on the parameters appropriately with a smaller dataset (prediction intervals still won't be valid, because even if your data is now symmetric you cannot say it is normal; only the parameter estimates will converge to a Gaussian).

This whole discussion is about the conditional distribution of the response variable; you could say, about the errors. Nonetheless, if a variable looks skewed in its unconditional distribution, that likely means it also has a skewed conditional distribution. Fitting a model on your data will clear this up.

On decision trees, I'll first point out one thing: there is no point in transforming skewed explanatory variables, since monotonic functions won't change a thing; this can be useful with linear models, but not with decision trees. That said, CART models use analysis of variance to perform splits, and variance is very sensitive to outliers and skewed data, which is why transforming your response variable can considerably improve your model's accuracy. A quick check of the first claim is sketched below.
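Here is a minimal sketch (assuming scikit-learn is installed; the data are synthetic) showing that a monotonic transform of a predictor leaves a regression tree's predictions unchanged, because the splits depend only on the ordering of the feature values:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.lognormal(size=(200, 1))  # skewed predictor
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Fit one tree on the raw feature and one on its logarithm.
tree_raw = DecisionTreeRegressor(random_state=0).fit(X, y)
tree_log = DecisionTreeRegressor(random_state=0).fit(np.log(X), y)

# Same partitions, same leaf values, hence identical predictions.
print(np.allclose(tree_raw.predict(X), tree_log.predict(np.log(X))))  # True
```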

carlo

I believe this is very much an artifact of the tradition of reverting to Gaussians because of their nice properties.

But there are nice distributional alternatives, e.g. the generalized gamma distribution, which encompasses a host of different skewed distributional shapes and forms.

salient

As other readers have said, some more background on what you are planning to achieve with your data would be helpful.

That being said, there are two important results in statistics known as the central limit theorem and the law of large numbers. The central limit theorem says that the more observations one has, the closer the sampling distribution of the mean comes to a normal distribution, one with an equal mean, median, and mode. Under the law of large numbers, the deviation between the sample mean and the expected value is expected to drop to zero given sufficiently many observations.

Therefore, a normal distribution allows the researcher to make more accurate predictions about a population if the underlying distribution is known.

Skewness is when a distribution deviates from this, i.e. a distribution can be positively or negatively skewed. However, the central limit theorem tells us that, given a large enough set of observations, the distribution of the sample means will be approximately normal. So, if the distribution is not normal, it is always recommended to gather more data first before attempting to change the underlying structure of the distribution via the transformation procedures you mentioned. A small simulation after this paragraph illustrates the point.
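Here is a minimal sketch of that behaviour (assuming numpy and scipy are installed; the exponential population is synthetic): individual draws stay skewed, but the distribution of sample means becomes nearly symmetric as the sample size grows:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(4)
population = rng.exponential(scale=1.0, size=100_000)  # skewness ~ 2

for n in (2, 10, 100):
    # 5,000 sample means, each computed from n draws.
    means = rng.choice(population, size=(5_000, n)).mean(axis=1)
    print(f"n={n:4d}  skewness of the sample means: {skew(means):+.2f}")
# The skewness of the means shrinks roughly like 1/sqrt(n).
```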

Michael Grogan

When is skewness a bad thing to have? Symmetric distributions (generally, but not always: e.g., not the Cauchy distribution) have their median, mode, and mean very close to each other. So, if we want to measure the location of a population, it is useful to have the median, mode, and mean close to each other.

For example, if we take the logarithm of the distribution of income, we reduce the skewness enough that we can get useful models of the location of income. However, we will still have a heavier right tail than we really want. To reduce that further, we might use a Pareto distribution. The Pareto distribution is similar to a log-log transformation of the data. Now both the Pareto and log-normal distributions have difficulty on the low end of the income scale; for example, both suffer from the fact that ln(0) = −∞. The treatment of this problem is covered in power transforms.

Example from 25 incomes in kilo dollars purloined from the www.

k$      ln(k$)
28  3.33220451
29  3.36729583
35  3.555348061
42  3.737669618
42  3.737669618
44  3.784189634
50  3.912023005
52  3.951243719
54  3.988984047
56  4.025351691
59  4.077537444
78  4.356708827
84  4.430816799
90  4.49980967
95  4.553876892
101 4.615120517
108 4.682131227
116 4.753590191
121 4.795790546
122 4.804021045
133 4.890349128
150 5.010635294
158 5.062595033
167 5.117993812
235 5.459585514

The skewness of the first column is 0.99, and that of the second is -0.05. The first column is not likely normal (Shapiro-Wilk p=0.04), while the second is not significantly non-normal (p=0.57).

First column:   Mean 90.0 (95% CI, 68.6 to 111.3)       Median 84.0 (95.7% CI, 52.0 to 116.0)
Second column:  Exp(Mean) 76.7 (95% CI, 60.2 to 97.7)   Exp(Median) 84.0 (95.7% CI, 52.0 to 116.0)

So, the question is: if you are a random person with one of the incomes listed, what are you likely to earn? Is it reasonable to conclude that you would earn the mean of 90k, which is more than the median of 84k? Or is it more likely that even the median is biased as a measure of location, and that the exp[mean(ln k$)] of 76.7k, which is less than the median, is more reasonable as an estimate?
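The figures above are easy to reproduce; here is a minimal sketch (assuming numpy and scipy are installed; scipy's default skewness estimator may differ slightly from the values quoted):

```python
import numpy as np
from scipy import stats

income = np.array([28, 29, 35, 42, 42, 44, 50, 52, 54, 56, 59, 78, 84, 90,
                   95, 101, 108, 116, 121, 122, 133, 150, 158, 167, 235])
log_income = np.log(income)

print(stats.skew(income), stats.skew(log_income))  # ~0.99 and ~-0.05
print(stats.shapiro(income).pvalue)                # ~0.04: likely not normal
print(stats.shapiro(log_income).pvalue)            # ~0.57: plausibly normal
print(np.exp(log_income.mean()))                   # geometric mean, ~76.7
```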

Obviously, the log-normal here is a better model and the mean logarithm gives us a better measure of location. That this is well known, if not entirely understood, is illustrated by the phrase "I anticipate getting a 5-figure salary."

Carl

Most of the results are based on Gaussian assumptions. If you have a skewed distribution, you don't have a Gaussian distribution, so maybe you should try to transform it into one.

But of course, you can also try a GLM.

Red Noise

I think it's not just modeling: our brains are not used to working with highly skewed data. For instance, it's well known in behavioral finance that we're not good at estimating very low or very high probabilities.

Aksakal