Correlation Isn't Causation, But Sure Is a Hint

 

By Christo Lute, Director of Advanced Analytics,
Analytics Guild

One of the first lessons people learn in statistics is that “correlation does not imply causation.” It is also one of the most abused lessons in statistics. It prematurely ends discussions about scientific findings, gets used as a fallacious but enduring criticism of statistical reasoning, and ultimately serves as a means to push away inconvenient data.

In one sense, it is true that correlation does not imply causation, and a student who fails to recognize this will make many statistical errors and reason poorly. For example: did you know that there is a positive correlation between ice cream consumption and deaths by swimming pool drowning? Did you know that there is a correlation between whole milk consumption in the US and high-fructose corn syrup consumption?

The lesson here is not that ice cream consumption causes drowning, nor that milk contains corn syrup or something else nonsensical. The lesson is that variables can be correlated, but finding a correlation is not sufficient to declare a causal relationship. In the ice cream case, the more likely explanation is a common cause, warmer weather: people eat more ice cream and swim more often when it is warm. In the corn syrup and whole milk case, the relationship is likely spurious, with no strong causal link between the two.
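To make the ice cream case concrete, here is a minimal sketch in Python. All of the numbers are made up for illustration: it assumes a hypothetical daily temperature that drives both ice cream sales and drownings, with no direct link between the two. The raw correlation comes out strong, yet it nearly vanishes once the effect of temperature is removed from each variable.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after fitting a least-squares line on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]


# Hypothetical data: temperature is the common cause of both variables.
# The coefficients and noise levels are assumptions, not real measurements.
rng = random.Random(0)
temperature = [rng.uniform(0, 35) for _ in range(2000)]
ice_cream = [2.0 * t + rng.gauss(0, 10) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 3) for t in temperature]

# Correlation between ice cream and drownings, ignoring temperature.
raw_r = pearson(ice_cream, drownings)

# Same correlation after removing what temperature explains in each.
partial_r = pearson(residuals(ice_cream, temperature),
                    residuals(drownings, temperature))

print(f"raw correlation:              {raw_r:.2f}")
print(f"controlling for temperature:  {partial_r:.2f}")
```

In this toy setup the correlation is real and it does point at something causal, just not at ice cream: following the hint leads to the weather.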

The abuse of this notion occurs when it is taken to an extreme, when it’s translated from the subtle claim that “correlation describes not a causal relationship but a measure of interdependence between variables” into a proclamation that correlation has nothing to say about causation. This interpretation is patently false; it fails to appreciate one of the main reasons why we use correlations: to discern causal relationships between events.

Take a measuring cup. As you pour water into the cup, the water reaches a higher marker on the cup. This means that there is a correlation between the amount of water poured into the cup and the measuring line it reaches. But the water reaches that line because of the water poured in: the water causes the reading. Stated this way, the problem with claiming that correlation has nothing to say about causation becomes obvious. We can all see that adding water to a measuring cup raises the line it reaches, and we understand that the water is what causes that reading. What else would cause it?

Correlation is not the final word on a subject, but a starting position in pursuit of causal relationships. We use correlation as a stepping stone to unearth and understand relationships. 

Balancing these notions is critical. The next time you are working with correlations in your data, hold firm in the knowledge that correlation does not imply causation, but do not stop there. Be curious about the relationship, dig into the data, and see if you can discern a deeper understanding of how A might influence B. This search for causation is where the real magic in data analytics happens. In the words of master data visualist Edward Tufte, “Correlation isn’t causation, but sure is a hint.”