Correlation and Causality
Observing a correlation between two variables can tempt one to assume a causal relationship between them. Often, however, no such relationship exists. In general, a correlation between two variables a and b can arise in the following ways:
- a controls b
- b controls a
- a and b are determined by a third variable
- a and b exert a mutual influence on each other
- a and b do not control each other at all; the correlation is spurious (it occurs by chance)
The third situation is the most common in everyday practice. The last situation is important for small samples, because the correlation coefficient shows a broad distribution when the number of observations is low.
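The small-sample effect can be made tangible with a short simulation. The following Python sketch draws many pairs of independent samples and measures how widely the correlation coefficient scatters around zero. The helper names (`pearson`, `spread_of_r`), the normal distribution, the sample sizes and the seed are assumptions chosen for illustration, not part of the original text.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spread_of_r(n, trials=2000, seed=1):
    """Standard deviation of r over many pairs of independent samples of size n."""
    rng = random.Random(seed)
    rs = []
    for _ in range(trials):
        # x and y are generated independently, so the true correlation is zero
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]
        rs.append(pearson(x, y))
    mean_r = sum(rs) / trials
    return math.sqrt(sum((r - mean_r) ** 2 for r in rs) / trials)


for n in (5, 20, 100):
    print(n, round(spread_of_r(n), 3))
```

For independent, normally distributed data the spread should come out close to 1/sqrt(n - 1), so with only five observations coefficients of 0.5 or more in magnitude are not unusual even though the variables are unrelated.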
The following summary presents the most important aspects of correlation and causality:
Correlation by formal means
If two independent variables X and Y are divided by a variable Z which
is correlated with either X or Y, the resulting variables X' and Y' are correlated.
The same is true for variables which are normalized to a sum of 100
percent (as is often the case with tables of nutritive values). Such
variables tend to show a negative correlation.
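Both formal effects can be reproduced with synthetic data. The sketch below is a minimal illustration under stated assumptions: X, Y and Z are drawn independently from a uniform distribution on [1, 2] (a simpler setting than the one described above, in which the common divisor alone is enough to produce the effect), and the sample size and seed are arbitrary. None of these choices come from the original text.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)
n = 5000
# Three mutually independent variables (illustrative assumption)
x = [rng.uniform(1, 2) for _ in range(n)]
y = [rng.uniform(1, 2) for _ in range(n)]
z = [rng.uniform(1, 2) for _ in range(n)]

# Raw variables: no correlation expected
r_raw = pearson(x, y)

# Dividing both variables by the common variable Z induces a positive correlation
x_ratio = [a / c for a, c in zip(x, z)]
y_ratio = [b / c for b, c in zip(y, z)]
r_ratio = pearson(x_ratio, y_ratio)

# Normalizing to a sum of 100 percent induces a negative correlation
totals = [a + b + c for a, b, c in zip(x, y, z)]
x_pct = [100 * a / t for a, t in zip(x, totals)]
y_pct = [100 * b / t for b, t in zip(y, totals)]
r_percent = pearson(x_pct, y_pct)

print("raw:", round(r_raw, 2))
print("divided by Z:", round(r_ratio, 2))
print("percent of total:", round(r_percent, 2))
```

The raw variables should be essentially uncorrelated, while the ratios come out clearly positively correlated and the percentages clearly negatively correlated, although nothing links X and Y.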
Correlation by Inhomogeneity
If the data are inhomogeneous, for example because they consist of several
groups with different means, a correlation is likely to occur when the groups
are pooled. It is therefore advisable to plot the variables against
each other (scatter plot of X vs. Y).
Examples:
Shoe size is correlated with income: the larger
the shoe size, the higher the income. (Explanation: women earn less money
than men and have smaller feet on average. Neither group shows an internal
correlation, but if both groups are pooled a "correlation" occurs.)
The longer students need to finish their studies,
the higher is their income afterwards. (Explanation: the time required to get
a degree depends on the subject, e.g. the average time to graduate in philosophy
is shorter than the time to get a degree in chemistry. Within the group of chemists,
income increases with decreasing time of study, but again, pooling
the data creates inhomogeneity and leads to the described correlation.)
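The shoe size example can be imitated with two synthetic groups. In the following sketch all numbers (group means, standard deviations, group sizes, seed) are hypothetical values picked for illustration. Within each group shoe size and income are generated independently, so any correlation in the pooled data is due to the difference between the group means alone.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(7)


def make_group(n, shoe_mean, income_mean):
    """Shoe size and income drawn independently around the given group means."""
    shoes = [rng.gauss(shoe_mean, 1.5) for _ in range(n)]
    income = [rng.gauss(income_mean, 5.0) for _ in range(n)]
    return shoes, income


# Hypothetical group means: group A has smaller shoes and lower income
shoes_a, income_a = make_group(500, shoe_mean=38, income_mean=30)
shoes_b, income_b = make_group(500, shoe_mean=43, income_mean=40)

r_a = pearson(shoes_a, income_a)  # within group A
r_b = pearson(shoes_b, income_b)  # within group B
r_pooled = pearson(shoes_a + shoes_b, income_a + income_b)  # both groups pooled

print("group A:", round(r_a, 2))
print("group B:", round(r_b, 2))
print("pooled:", round(r_pooled, 2))
```

A scatter plot of the pooled data would show two separate clouds, which is why plotting the variables against each other is a quick way to detect this situation.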
Additional (hidden) variables
Variables X and Y are correlated, but in fact a third parameter Z,
which is not included in the data set, is correlated with both X and Y. This
is particularly hard to discover, since the parameter Z may well be unknown.
An important subclass of this type of correlation is time series, where
time is the common variable. If both X and Y show a trend in time, a correlation
will be observed.
Example:
Shoe size is correlated with the calcium content
of bones. (Explanation: children have less calcium in their bones than adults,
and the shoe size of children is naturally smaller than that of adults.)
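The effect of a hidden variable can be sketched in the same way. In the following example X and Y both depend on Z (which could stand for age, or for time in a time series) but not on each other. For the purpose of illustration Z is assumed to be known, so that its influence can be removed with the first-order partial correlation coefficient; in practice Z is often unavailable. The model, noise levels and seed are assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear influence of z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(3)
n = 2000
# Hidden variable Z drives both X and Y; X and Y have no direct link
z = [rng.gauss(0, 1) for _ in range(n)]
x = [v + rng.gauss(0, 0.5) for v in z]
y = [v + rng.gauss(0, 0.5) for v in z]

r_xy = pearson(x, y)
r_xy_given_z = partial_corr(x, y, z)

print("correlation of X and Y:", round(r_xy, 2))
print("after controlling for Z:", round(r_xy_given_z, 2))
```

The plain correlation of X and Y should be high, whereas the partial correlation should be close to zero once Z is taken into account.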
Outliers in the data
A single outlier can cause a high correlation if it lies far enough away
from the rest of the data.
Example:
A common spike in the signals of an analytical
instrument may result in a high correlation between these signals (note:
spikes are a common problem in laboratories; they are caused, for example,
by switching refrigerators).
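The influence of a common spike can be checked with two simulated noise signals. This is a minimal sketch; the signal length, the spike height, its position and the seed are arbitrary assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(11)
n = 200
# Two independent noise signals with unit standard deviation
signal_1 = [rng.gauss(0, 1) for _ in range(n)]
signal_2 = [rng.gauss(0, 1) for _ in range(n)]
r_clean = pearson(signal_1, signal_2)

# Add the same spike to both signals at a single point in time
spiked_1 = list(signal_1)
spiked_2 = list(signal_2)
spiked_1[100] += 50.0
spiked_2[100] += 50.0
r_spiked = pearson(spiked_1, spiked_2)

print("without spike:", round(r_clean, 2))
print("with common spike:", round(r_spiked, 2))
```

A single shared data point out of 200 is enough to turn two unrelated signals into apparently strongly correlated ones, which again argues for inspecting a scatter plot before interpreting the coefficient.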
As an important consequence, we have to state that mathematical correlation is no proof of causality. Correlations must not be interpreted in a causal way unless there is evidence of a causal relationship beyond the correlation.