In most cases, when people talk about transforming variables (both predictors and response variables), they are discussing ways of handling skewness in the data (e.g., the log transform, the Box-Cox transform, etc.). What I cannot understand is why removing skewness is considered such a common best practice. How does skewness affect the performance of different kinds of models, such as tree-based models, linear models, and nonlinear models? Which kinds of models are more affected by skewness, and why?
Answers:
When removing skewness, transformations attempt to make the dataset follow a Gaussian distribution. The reason is simply that if the dataset can be transformed to be statistically close enough to Gaussian, then the largest possible set of tools is available for use. Tests such as the ANOVA, t-test, F-test, and many others depend on the data having constant variance (σ²) or following a Gaussian distribution.

There are models that are more robust (such as using Levene's test instead of Bartlett's test), but most tests and models that work well with other distributions require that you know which distribution you are working with, and they are typically appropriate for only a single distribution as well.
The NIST Engineering Statistics Handbook makes this same point in more than one place.
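The effect described above can be sketched with SciPy's Box-Cox implementation. This is a minimal illustration on synthetic data (the log-normal sample and all settings are assumptions for the sake of the example, not part of the original answer):

```python
# Sketch: a Box-Cox transformation pulling a right-skewed sample toward Gaussian.
# The data here are synthetic (log-normal), chosen only to be strongly skewed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # strongly right-skewed, positive

xt, lam = stats.boxcox(x)  # lam is the maximum-likelihood Box-Cox parameter

print(f"skewness before: {stats.skew(x):.2f}")
print(f"skewness after:  {stats.skew(xt):.2f}")
```

After the transform, Gaussian-based tools (t-tests, ANOVA, and so on) become far more defensible on `xt` than they would be on the raw `x`.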
источник
This is mostly true for parametric models. As Tavrock said, having a response variable that is not skewed makes the Gaussian approximation used in parameter estimation work better, because symmetric distributions converge to the Gaussian much faster than skewed ones. This means that, if you have skewed data, transforming it will let you use confidence intervals and tests on the parameters appropriately with a smaller dataset (prediction intervals still won't be valid, because even if your data is now symmetric you cannot say it is normal; only the parameter estimates will converge to Gaussian).
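The faster convergence of symmetric parents can be seen directly by simulating sampling distributions of the mean. A minimal sketch on synthetic data (the exponential/uniform choice and the sample size are illustrative assumptions):

```python
# Sketch: for the same small sample size, the sampling distribution of the mean
# is still noticeably skewed for a skewed parent (exponential) but already
# nearly symmetric for a symmetric parent (uniform). Synthetic data throughout.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps = 10, 5000  # small samples, many replications

means_skewed = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
means_symm = rng.uniform(-1.0, 1.0, size=(reps, n)).mean(axis=1)

print(f"skew of sample means, exponential parent: {stats.skew(means_skewed):.2f}")
print(f"skew of sample means, uniform parent:     {stats.skew(means_symm):.2f}")
```

At n = 10 the exponential parent's sample means are still clearly right-skewed, while the uniform parent's are essentially symmetric, which is exactly why symmetrizing the response lets Gaussian-based intervals work at smaller sample sizes.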
This whole discussion is about the conditional distribution of the response variable; you could say, about the errors. Nonetheless, if a variable looks skewed when you examine its unconditional distribution, that likely means it has a skewed conditional distribution too. Fitting a model on your data will clear your mind on it.
On decision trees, I'll first point out one thing: there is no point in transforming skewed explanatory variables, since monotonic functions won't change a thing. This can be useful for linear models, but not for decision trees. That said, CART models use analysis of variance to perform splits, and variance is very sensitive to outliers and skewed data; this is the reason why transforming your response variable can considerably improve your model's accuracy.
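The claim that monotonic transforms of predictors are irrelevant to trees can be checked directly. A sketch using scikit-learn on synthetic data (the data-generating process and tree settings are assumptions for illustration):

```python
# Sketch: a decision tree fit is unchanged by a monotonic transform (log) of a
# predictor, because splits depend only on the ordering of feature values.
# Synthetic data; settings are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
x = rng.lognormal(size=200)                      # skewed, positive predictor
y = np.sin(x) + rng.normal(scale=0.1, size=200)

tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(x[:, None], y)
tree_log = DecisionTreeRegressor(max_depth=3, random_state=0).fit(np.log(x)[:, None], y)

# log() preserves the ordering of x, so both trees find the same partitions
# and make identical predictions (only the split thresholds differ).
same = np.allclose(tree_raw.predict(x[:, None]), tree_log.predict(np.log(x)[:, None]))
print(same)
```

Transforming the *response* is a different matter, for the variance-sensitivity reason given above.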
source
I believe this is very much an artifact of the tradition of reverting to Gaussians because of their nice properties.

But there are nice distributional alternatives, e.g. the generalized gamma, which encompasses a host of different skewed distributional shapes and forms.
source
Like other readers have said, some more background on what you are planning to achieve with your data would be helpful.
That being said, there are two important doctrines in the realm of statistics known as the central limit theorem and the law of large numbers. Under the central limit theorem, the more observations one has, the closer the distribution of the sample mean is expected to come to a normal distribution, one with an equal mean, median, and mode. Under the law of large numbers, it is expected that the deviation between the expected value and the actual sample mean will eventually drop to zero given sufficient observations.
Therefore, a normal distribution allows the researcher to make more accurate predictions about a population if the underlying distribution is known.
Skewness is when a distribution deviates from this symmetry; a distribution can be positively or negatively skewed. However, the central limit theorem implies that, given a large enough set of observations, averages will be approximately normally distributed. So, if the distribution is not normal, it is always recommended to gather more data first before attempting to change the underlying structure of the distribution via the transformation procedures you mentioned.
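The law-of-large-numbers part of this argument can be sketched in a few lines of Python. Everything here (the exponential distribution, its scale, the sample sizes) is an illustrative assumption:

```python
# Sketch: law of large numbers — the sample mean of a skewed (exponential)
# variable approaches its expectation as the sample grows. Synthetic data.
import numpy as np

rng = np.random.default_rng(3)
draws = rng.exponential(scale=2.0, size=100_000)  # true mean = 2.0

for n in (10, 1_000, 100_000):
    print(f"n = {n:>7}: sample mean = {draws[:n].mean():.3f}")
```

Even though each draw comes from a heavily right-skewed distribution, the running mean settles on the true value of 2.0, which is the sense in which "gather more data first" can substitute for transforming it.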
source
When is skewness a bad thing to have? Symmetric distributions (generally, but not always: e.g., not the Cauchy distribution) have the median, mode, and mean very close to each other. So, if we want to measure the location of a population, it is useful to have the median, mode, and mean close to each other.
For example, if we take the logarithm of the distribution of income, we reduce the skewness enough that we can get useful models of the location of income. However, we will still have a heavier right tail than we really want. To reduce that further, we might use a Pareto distribution. The Pareto distribution is similar to a log-log transformation of the data. Now both the Pareto and log-normal distributions have difficulty at the low end of the income scale. For example, both suffer from ln 0 = −∞. The treatment of this problem is covered in power transforms.
Example from 25 incomes in kilo dollars purloined from the www.
The skewness of the first column is 0.99, and of the second is -0.05. The first column is not likely normal (Shapiro-Wilk p=0.04) and the second not significantly not normal (p=0.57).
So, the question is: if you are a random person having one of the earnings listed, what are you likely to earn? Is it reasonable to conclude that you would earn the mean of 90k, rather than the median of 84k? Or is it more likely that even the median is biased as a measure of location, and that the exp[mean(ln(k$))] of 76.7k, which is less than the median, is also more reasonable as an estimate?
Obviously, the log-normal here is a better model and the mean logarithm gives us a better measure of location. That this is well known, if not entirely understood, is illustrated by the phrase "I anticipate getting a 5-figure salary."
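The geometric-mean-of-incomes idea can be sketched as follows. Since the answer's original 25 data points are not reproduced here, this uses synthetic log-normal incomes, so the particular figures are illustrative assumptions rather than the answer's values:

```python
# Sketch: for right-skewed income-like data, the geometric mean
# exp(mean(ln(x))) sits below the arithmetic mean and is a more robust
# location summary. Synthetic log-normal incomes in k$; figures illustrative.
import numpy as np

rng = np.random.default_rng(4)
incomes = rng.lognormal(mean=np.log(80), sigma=0.5, size=25)

arith_mean = incomes.mean()
median = np.median(incomes)
geo_mean = np.exp(np.log(incomes).mean())  # the exp[mean(ln(k$))] of the answer

print(f"arithmetic mean: {arith_mean:.1f}  median: {median:.1f}  geometric mean: {geo_mean:.1f}")
```

By the AM-GM inequality the geometric mean is always below the arithmetic mean for non-constant positive data, which is why it discounts the heavy right tail of incomes.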
source
Most of the results are based on Gaussian assumptions. If you have a skewed distribution, you don't have a Gaussian distribution, so maybe you should try desperately to turn it into one.

BUT, of course, you can try a GLM.
source
I think it's not just about modeling: our brains are not used to working with highly skewed data. For instance, it is well known in behavioral finance that we are not good at estimating very low or very high probabilities.
source