Fundamentals of Statistics contains material from various lectures and courses by H. Lohninger on statistics, data analysis and chemometrics.


Scatter, Covariance, and Correlation Matrix

These three types of matrices often form the basis of a multivariate method. The correlation and the covariance matrix are also often used for a first inspection of relationships among the variables of a multivariate data set. It is therefore crucial to understand the principles behind them and the pitfalls that arise when a data set does not behave as expected.

How are these matrices related to each other?

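Basically, all of these matrices are calculated using the same procedure: A^T A, the transpose of the data matrix A multiplied by A itself. Apart from a constant normalization factor, the only difference between them is how the data are scaled before the matrix multiplication is executed: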
 

  • scatter matrix: no scaling
  • covariance: the mean of each variable is subtracted before multiplication
  • cross correlation: each variable is standardized (mean subtracted, then divided by standard deviation)
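The three scaling options listed above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names and the small data matrix are hypothetical, rows are taken to be objects and columns variables, and the division by n - 1 is the conventional normalization for sample covariance and correlation rather than something stated in the text.

```python
import numpy as np

def scatter_matrix(A):
    """A^T A on the raw data (no scaling)."""
    A = np.asarray(A, dtype=float)
    return A.T @ A

def covariance_matrix(A):
    """A^T A on mean-centered data, divided by n - 1 (assumed convention)."""
    A = np.asarray(A, dtype=float)
    centered = A - A.mean(axis=0)
    return centered.T @ centered / (A.shape[0] - 1)

def correlation_matrix(A):
    """A^T A on standardized data, divided by n - 1 (assumed convention)."""
    A = np.asarray(A, dtype=float)
    # ddof=1 gives the sample standard deviation, matching the n - 1 divisor.
    standardized = (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)
    return standardized.T @ standardized / (A.shape[0] - 1)

# Hypothetical data: 5 objects (rows) measured on 3 variables (columns).
A = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 1.5, 1.0],
    [3.0, 3.5, 0.0],
    [4.0, 3.0, 2.0],
    [5.0, 5.0, 1.5],
])

S = scatter_matrix(A)
C = covariance_matrix(A)
R = correlation_matrix(A)

print(S)
print(C)
print(R)
```

With this normalization, C and R should agree with NumPy's own `np.cov(A, rowvar=False)` and `np.corrcoef(A, rowvar=False)`, which is a convenient cross-check.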


What is the effect of a single outlier on these matrices?

Suppose you have a data matrix which contains one object which is an outlier compared to the rest of the data. This single outlier will completely "corrupt" the matrices (especially the cross correlation matrix), showing a spurious correlation. This spurious correlation can mislead any unprepared operator. You may try this effect yourself by running the interactive example.
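The effect can also be reproduced with a minimal sketch. The data here are hypothetical: nine objects arranged on a symmetric 3 x 3 grid, so that the two variables are uncorrelated by construction, plus a single outlying object at (20, 20). The helper name `correlation` is likewise only illustrative.

```python
import numpy as np

# Hypothetical data: a 3 x 3 grid of points, so x and y are uncorrelated.
grid = np.array([(x, y) for x in (-1.0, 0.0, 1.0) for y in (-1.0, 0.0, 1.0)])

# The same data plus one outlying object far away from the rest.
with_outlier = np.vstack([grid, [20.0, 20.0]])

def correlation(data):
    """Pearson correlation between the two columns of `data`."""
    return np.corrcoef(data, rowvar=False)[0, 1]

r_clean = correlation(grid)
r_outlier = correlation(with_outlier)

print(f"without outlier: r = {r_clean:.3f}")
print(f"with outlier:    r = {r_outlier:.3f}")
```

For this particular construction the correlation rises from zero to 360/366, roughly 0.98, although nine of the ten objects show no relationship at all between the two variables.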

Be extremely careful when selecting variables by looking at the cross correlation table. A high correlation may be due to a single outlier in the data matrix.