One of the most critical tasks in research is finding the relationship between two variables and examining whether they are correlated.
However, does correlation always mean causation?
Let us first define the two terms.
Correlation refers to any of a broad class of statistical relationships involving dependence.
Familiar examples of dependent phenomena include the correlation between the physical stature of parents and that of their offspring, and the correlation between the demand for a product and its price.
On the other hand, two events are said to have a causal relationship when the second event occurs as a consequence of the first. That is, there is a cause and effect phenomenon.
For any two correlated events, A and B, the following relationships are possible:
- A causes B;
- B causes A;
- A and B are consequences of a common cause, but do not cause each other;
- There is no connection between A and B; the correlation is coincidental.
And this is where researchers have to be careful. Seeing that event B follows event A does not, by itself, imply that A causes B.
For example, a survey of accident rates and ice cream consumption in a city found a high statistical correlation between ice cream consumption and drownings. But that does not mean that eating ice cream causes a person to drown. A more plausible explanation is a common cause: in hot weather, more people buy ice cream and more people go swimming.
The above example demonstrates the concept of the lurking variable, that is, a variable that has an important effect and yet is not included among the predictor variables under consideration.
One way to identify the effects of lurking variables on a data set is to examine how the data change over time. This can reveal seasonal trends, as in the ice cream example, that are obscured when all the data are lumped together.
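The idea of splitting the data by time can be sketched with a small simulation. Everything below is made up for illustration: the monthly temperatures, the coefficients and the noise levels are assumptions, not figures from any survey. Temperature drives both quantities, and neither affects the other. Pooled over the whole year the two look strongly correlated, but within any single month the correlation should largely disappear.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical mean temperature (deg C) for each month of the year.
monthly_temp = [2, 4, 8, 13, 18, 22, 25, 24, 20, 14, 8, 3]

# Simulate 50 years of monthly records: (month, ice_cream_sales, drownings).
# Both depend on temperature plus independent noise; neither depends on the other.
records = []
for _year in range(50):
    for month, temp in enumerate(monthly_temp):
        ice_cream = 100 + 10 * temp + rng.gauss(0, 20)
        drownings = 5 + 0.5 * temp + rng.gauss(0, 1)
        records.append((month, ice_cream, drownings))

# Correlation with all months lumped together.
pooled_r = pearson_r([r[1] for r in records], [r[2] for r in records])

# Correlation computed separately within each calendar month, then averaged.
within_month_r = []
for month in range(12):
    subset = [r for r in records if r[0] == month]
    within_month_r.append(
        pearson_r([r[1] for r in subset], [r[2] for r in subset])
    )
mean_within_r = sum(within_month_r) / len(within_month_r)

print(f"pooled correlation:            {pooled_r:.2f}")
print(f"mean within-month correlation: {mean_within_r:.2f}")
```

Grouping by month works here only because the simulated lurking variable is purely seasonal. With real data the lurking variable may not line up with the calendar, so this is one diagnostic among several rather than a general test.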
Of course, the best course of action for a researcher is to be proactive, question all assumptions and design experiments carefully.
Statistics vs. Logic
In regression, the strength of a correlation is commonly described through the square of the correlation coefficient, or R². The closer R² is to one, the stronger the linear relationship.
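As a minimal sketch of what this number measures, the snippet below fits a least-squares line to a small made-up data set and computes the fraction of variance the line explains. For a straight-line fit with a single predictor, that fraction equals the square of the Pearson correlation coefficient. The data values and function names are illustrative assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def r_squared(xs, ys):
    """Fraction of the variance in ys explained by a least-squares line on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical measurements, roughly on a straight line.
xs = [1, 2, 3, 4, 5]
ys_up = [2.1, 3.9, 6.2, 7.8, 10.1]
ys_down = list(reversed(ys_up))  # same values, opposite direction

for label, ys in (("increasing", ys_up), ("decreasing", ys_down)):
    r = pearson_r(xs, ys)
    print(f"{label}: r = {r:+.4f}, r^2 = {r * r:.4f}, R^2 = {r_squared(xs, ys):.4f}")
```

Because R² is a square, it is the same for the increasing and the decreasing series. It reports how tightly the points follow a line, not the direction of the relationship, and nothing about cause.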
For example, a study compared the total US highway fatality rate to the metric tons of fresh lemons imported from Mexico. The value of R² came out at 0.97.
What we have in this example is an excellent correlation between two variables.
That is, as the amount of imported lemons increases, so do the traffic fatalities.
However, simple logic suggests that there is almost certainly no causal relationship between the two. That is, importing lemons does not cause traffic fatalities. Equally, if we stopped importing lemons, we would not expect the number of traffic fatalities to decline.
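One common way such a high R² can arise between unrelated quantities is that both simply drift in one direction over time. The sketch below illustrates that mechanism with two simulated series; it is not a reconstruction of the lemon study, and the slopes, noise levels and number of periods are assumptions chosen for illustration. The two series are generated independently, yet their levels correlate strongly. Comparing period-to-period changes instead of levels removes the shared trend.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def first_differences(series):
    """Change from each observation to the next."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


rng = random.Random(1)  # fixed seed so the sketch is repeatable
periods = range(200)

# Two hypothetical series that both trend upward over time.
# They share nothing except the passage of time.
series_a = [5 * t + rng.gauss(0, 40) for t in periods]
series_b = [2 * t + rng.gauss(0, 16) for t in periods]

# R^2 between the raw levels: dominated by the shared trend.
r2_levels = pearson_r(series_a, series_b) ** 2

# R^2 between period-to-period changes: the trend becomes a constant offset.
r2_changes = pearson_r(
    first_differences(series_a), first_differences(series_b)
) ** 2

print(f"R^2 of levels:  {r2_levels:.2f}")
print(f"R^2 of changes: {r2_changes:.2f}")
```

A low R² on the changes does not prove the absence of a causal link, but a correlation that vanishes once the trend is removed is a strong hint that time, not either variable, is doing the work.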
Therefore, whenever the results of a statistical study conflict with a researcher’s common sense, the researcher should follow common sense. One should remember that statistics (or any branch of science, for that matter) is just a tool and cannot be a substitute for human thought and insight.
Conversely, a researcher who feels that the statistical results back a hunch should probably dig deeper.
Arguably the best-known and most important example of a correlation being clear but causation being in doubt concerned smoking and lung cancer in the 1950s. There had been a sixfold increase in the rate of lung cancer in the preceding two decades. Nobody disputed that there was a correlation between lung cancer and smoking, but proving that one caused the other would be no mean feat.
There might be a confounder that was responsible for the correlation between smoking and lung cancer. The increased rate could have been the result of better diagnosis, more industrial pollution or more cars on the roads belching noxious fumes. Perhaps people who were more genetically predisposed to want to smoke were also more susceptible to getting cancer?
It took a study involving more than 40,000 doctors in the UK to show conclusively that smoking really does cause cancer.