On correlation, causality, and related issues
The previous entry touched upon the question of fallacies. Recently, I have been involved in a Swedish discussion of poverty, which has put emphasis on some of my concerns.
Notably, there seems to be a considerable lack of understanding of questions such as causality vs. correlation, how scientific studies work, and similar. Annoyingly, this problem is very common, even among journalists and politicians (who should know better as a professional requirement)—and, horrifyingly, even the odd scientist.
Let us first look at the concepts of causality and correlation:
Correlation implies that there is a connection of some kind between two phenomena, personal characteristics, or similar, but it says nothing about how the connection works. In particular, it does not say that the one is the cause of the other, or the reverse; and it is quite common that a third something is the cause of both, or that they are partially mutually causing each other. (Technical use of “predicts” is similar: It too is not a causality, but unlike a correlation it can be a one-way street. If I pick out a random person on the street here in Cologne, there is a fair chance that he is German: the first predicts the second, with a high probability of correctness. On the other hand, picking a random German, the chance that he is on a street in Cologne is comparatively small: the second does not predict the first.)
Causality, on the other hand, captures exactly this causing: that the one actually brings about the other.
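The one-way nature of “predicts” can be made concrete with two conditional probabilities. The following is only a sketch: all counts are invented for illustration and are not real demographic data.

```python
# Toy counts, invented purely for illustration (not real demographic data).
on_cologne_street = 1_000          # people currently on a street in Cologne
germans_on_cologne_street = 800    # of those, how many are German
germans_total = 80_000_000         # all Germans, wherever they happen to be

# P(German | on a Cologne street): high, so the first predicts the second.
p_german_given_street = germans_on_cologne_street / on_cologne_street

# P(on a Cologne street | German): tiny, so the second does not predict the first.
p_street_given_german = germans_on_cologne_street / germans_total

print(f"P(German | Cologne street) = {p_german_given_street:.2f}")
print(f"P(Cologne street | German) = {p_street_given_german:.6f}")
```

The same 800 people appear in both calculations; only the group they are compared against changes, which is what makes the prediction work in one direction and not in the other.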
To take a few examples of how these concepts can work (and easily be misunderstood):
Height and weight are reasonably strongly correlated. However, which is the cause of the other? An increase in height does (on average—a qualifier that I will leave out below) cause an increase in weight, because there is more body present. However, an increase in weight can also often cause an increase in height: Lack of nutrition can stunt growth and those who eat more are likely to both gain weight through fat/muscle gain and to gain height through a lesser risk/degree of malnutrition. In addition, entirely other factors can cause both weight and height gains (e.g., strictly hypothetically, that those genetically predisposed to tallness are also predisposed to obesity).
Here we see a complex interaction of factors. We can further note that, although height and weight are correlated, the correlation is imperfect: An obese 5-footer can be heavier than a skinny 7-footer. Correlations only rarely allow for predictions about individuals, and instead find their use where aggregates are concerned.
Assume that we consider a large sample of men and women, with and without bikes (and that sex and the possession of a bike are independent of each other). Looking specifically at the three subsets women (X), bike-owners (Y), and female bike-owners (Z), we find that membership of X and membership of Z correlate: Being a woman increases the chance of being a female bike-owner and being a female bike-owner necessitates being a woman. In the same way, membership in Y and membership in Z correlate.
It would now seem plausible to assume that since both X–Z and Z–Y correlate, then we would also have a correlation X–Y. That, however, is not true! There is (in this model) no connection whatsoever between X (being a woman) and Y (owning a bike).
Here we see the risk of “chaining” correlations.
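The bike example can be checked numerically. The sketch below assumes a hypothetical sample with 100 people in each combination of sex and bike ownership (which makes the two independent by construction) and uses a hand-written Pearson coefficient on 0/1 membership indicators; the helper name `pearson` is my own choice.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical sample: 100 people in each combination of sex and bike
# ownership, so the two characteristics are independent by construction.
people = [(woman, bike) for woman in (0, 1) for bike in (0, 1)] * 100

x = [woman for woman, bike in people]          # X: is a woman
y = [bike for woman, bike in people]           # Y: owns a bike
z = [woman * bike for woman, bike in people]   # Z: is a female bike-owner

r_xz = pearson(x, z)
r_yz = pearson(y, z)
r_xy = pearson(x, y)
print(f"X-Z: {r_xz:.2f}  Y-Z: {r_yz:.2f}  X-Y: {r_xy:.2f}")
```

In this toy sample, X–Z and Y–Z both come out clearly positive (roughly 0.58), while X–Y is zero: the two correlations do not chain.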
Consider the set X of all Finns and the set Y of all people with Finnish as their native language. Clearly, X and Y have a strong correlation. It would now, on a too casual glance, seem plausible that the same applies to any subset of X. However, there are specific subsets which have no or even a negative correlation, notably the large minority of Swedish descent.
What is true for a set is not (necessarily) true for all subsets. (Including, obviously, individual cases, which can be mapped to sets with one member.)
Consider a school class with blond and brown-haired children. The teacher (for reasons of his own) gives the blond children an apple and a chocolate bar, while the brown-haired are given an orange and a bag of wine-gummy.
Assuming that no other edibles are present (and that the children are not extremely voracious…), there are perfect correlations among the children between owning an apple and owning a chocolate bar, an orange and wine-gummy, being blond and owning an apple, and so on. There are also perfect negative correlations between e.g. apple-owning and orange-owning (not all correlations need indicate a connection of X -> Y, but they can also be of the X -> not-Y kind).
However, there is no causation between apples and chocolate bars or between oranges and wine-gummy. (One of the main rules of science: Correlation does not imply causation.)
Looking at e.g. being blond and owning an apple, we land in a complicated situation: On the one hand, we could argue that the blond hair did cause possession of the apple; on the other, this could be seen as a spurious thought because the actual cause behind the correlations is the teacher. (What is a causation and what not is often a far from clear decision, and care must be taken when basing decisions on ambiguous causations. In a similar vein, there are often causes and underlying causes.)
Assume the same setting as the previous example, when a second teacher rushes in, confiscates all candies and replaces them with fruit (the bastard!), so that all children have exactly one apple and one orange.
Here we see an oddity: Causation does not need to imply correlation.
The first teacher's actions did cause the students to be given candies, but the actions of the second nullified that effect. Similarly, the first teacher did cause the children to be given fruit according to a certain pattern, but this pattern (in the sample at hand) disappeared with the actions of the second teacher (without nullifying the actions of the first teacher).
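The two-teacher scenario can be sketched in a few lines. The class size and hair-color split are hypothetical, and I assume that the second teacher swaps each chocolate bar for an orange and each bag of wine-gummy for an apple (the only swap that leaves every child with one of each fruit). Covariance is used instead of a correlation coefficient, because the latter is undefined once a variable no longer varies.

```python
def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# Hypothetical class: 1 = blond, 0 = brown-haired.
blond = [1, 1, 1, 0, 0, 0, 0]

# After the first teacher: blond children hold an apple (and a chocolate bar),
# brown-haired children an orange (and a bag of wine-gummy).
apples_before = list(blond)
oranges_before = [1 - b for b in blond]

# Assumed swap by the second teacher: chocolate bar -> orange,
# wine-gummy -> apple, so every child ends up with one of each fruit.
apples_after = [a + (1 - b) for a, b in zip(apples_before, blond)]
oranges_after = [o + b for o, b in zip(oranges_before, blond)]

cov_before = covariance(blond, apples_before)
cov_after = covariance(blond, apples_after)
print(f"blond vs. apples, before: {cov_before:.3f}, after: {cov_after:.3f}")
```

Before the second teacher arrives, hair color and apple ownership vary together; afterwards every child holds exactly one apple, the covariance is zero, and the data alone no longer show any trace of the first teacher's pattern.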
A (hypothetical) study of the NBA is made, with the result that the correlation between height and prowess (by some measures) is low, zero, or even negative.
Does this imply that height has no effect on prowess? No: here we have the classic issue of a pre-filtered sample. Studying NBA players reduces the variation of ability to a very high degree (compared to the overall population) and the variation of height (to some degree). This makes the sample flawed (for many purposes) and the conclusion invalid.
Repeating the same study on the overall population, without this pre-filtering, will show a large positive correlation.
A correlation is only as good as the samples used (in general) and using samples which are “top heavy” (in particular) can hide correlations that actually are present.
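The effect of a “top heavy” sample can be illustrated with a small simulation. This is a sketch under invented assumptions, not NBA data: prowess is modeled as height plus an equally large unrelated component (both in arbitrary standardized units), and the “league” is simply the top 1 % of the population by prowess.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so that the run is repeatable

# Hypothetical model: prowess = height + unrelated factors (invented numbers).
population = []
for _ in range(20_000):
    height = rng.gauss(0, 1)
    prowess = height + rng.gauss(0, 1)
    population.append((height, prowess))

# Pre-filtering: keep only the top 1 % by prowess as a stand-in for a league.
league = sorted(population, key=lambda person: person[1], reverse=True)[:200]

r_population = pearson(*zip(*population))
r_league = pearson(*zip(*league))
print(f"whole population: r = {r_population:.2f}")
print(f"pre-filtered:     r = {r_league:.2f}")
```

Under these assumptions the correlation within the pre-filtered group comes out considerably weaker than in the whole population, although height contributes to prowess in exactly the same way for everyone.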
Similar to the above, other variations of highly flawed conclusions based on flawed samples can be constructed, e.g. by creating statistics on car accidents for the overall population based on a sample of hospital visitors; by using a conclusion which is true for one population, but not for another; or by making comparisons between samples that may be inherently unequal, e.g. by trying to measure a difference in hockey-ability between Swedes and Canadians by comparing random samples of NHL players. (The entry barrier to the NHL will be lower for a Canadian, which means that the Swedish sample will have undergone a stronger pre-filtering.)
An important conclusion from the above is that if a scientific study claims that “X and Y correlate” (or “X predicts Y”), great care should be taken before assuming a causation or suggesting new policy. In fact, even if the study actually does make claims about causality, great care should be taken: The scientist(s) may be sloppy, driven by ideological motivation or research grants, or seeing what the result “should” be (rather than what it is)—scientists are only human.
The last point is one of importance: Many non-scientists have a somewhat superstitious take on scientists, and assume that they master all complexities they encounter, take all aspects of a problem into consideration, and so on. This is simply not a correct estimate: Even when a scientist is aware of all aspects (unlikely, bordering on a tautological impossibility), he will still be forced to make simplifications. A social-science study, e.g., may pick out a handful of variables of relevance, try to catch any remaining issues in a generic error term, and then proceed to test these on a sample that is too small, picked with imperfect randomness, or otherwise deviating from the ideals. (This is not to mention the many other complications that can occur with flawed measurements, leading questionnaires, whatnot.)
As has subsequently occurred to me, the above examples can be somewhat misleading in that they are mostly “binary” (someone has/is something, or not). This was a deliberate choice to have simple and easy to understand examples; however, it is important to bear in mind that the typical practical case will be of a different character. The first item, dealing with height and weight, is a good example: There is no binary “tall implies heavy”, “short implies light”, but a gradual increase of expected/average weight as height increases (and vice versa).
This is particularly important when I speak of “negative correlation” above: This should not really be seen as the presence of X implying the absence of Y, but as a decrease of Y as X increases. A good example is speed and travel time: If a vehicle goes faster (all other factors equal) the time taken for it to reach its destination decreases.
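The speed and travel-time example can be put into numbers. The distance and speeds below are hypothetical, and the Pearson helper is the usual textbook formula written out by hand.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

distance_km = 120                        # hypothetical fixed trip length
speeds_kmh = [40, 60, 80, 100, 120]      # hypothetical speeds
times_h = [distance_km / v for v in speeds_kmh]

r = pearson(speeds_kmh, times_h)
print(f"speed vs. travel time: r = {r:.2f}")
```

The coefficient is strongly negative: Travel time falls steadily as speed rises. It does not quite reach -1, because the Pearson coefficient measures linear relationships, and travel time falls off as one over the speed rather than along a straight line.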