Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. The transformation is defined in such a way that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors form an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
In particular, the Focused Principal Component Analysis (fPCA) conveys the structure of a correlation matrix into a low-dimensional diagram but, unlike PCA, it makes it possible to represent accurately the correlations of a given variable with the other variables (and even to test graphically the hypothesis that one of these correlations is equal to zero).
In sum, fPCA focuses on a chosen outcome, so that the distances between the predictors and the outcome can be read as a representation of their correlations. The relative positions of the predictors give an idea of their mutual correlations and can be interpreted as in a classic PCA.
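The properties of classic PCA described above (uncorrelated components, decreasing variance, scale sensitivity) can be illustrated with base R's `prcomp()`. The data below are synthetic and the variable names are arbitrary; this is only a sketch of the behaviour, not part of the iris analysis.

```r
# Minimal PCA sketch on synthetic correlated data (illustrative only).
set.seed(42)
x <- rnorm(200)
d <- data.frame(a = x,
                b = x + rnorm(200, sd = 0.3),  # strongly correlated with a
                c = rnorm(200))                # independent noise
pca <- prcomp(d, scale. = TRUE)  # scale. = TRUE because PCA is scale-sensitive

round(cor(pca$x), 3)  # component scores are uncorrelated
pca$sdev              # standard deviations come out in decreasing order
```

Note the `scale. = TRUE` argument: without it, a variable measured on a larger scale would dominate the first component, which is exactly the scaling sensitivity mentioned above.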
The iris dataset is used to exemplify the fPCA method. The dataset contains four measurements for each of 150 flowers representing three species of iris (setosa, versicolor and virginica).
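The dataset ships with base R, so a quick structural look is one line:

```r
# The iris dataset is bundled with base R.
data(iris)
str(iris)            # 150 obs.: 4 numeric measurements plus the Species factor
table(iris$Species)  # 50 flowers per species
```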
The first step usually taken when analysing data is to look quickly for linear correlations. However, when the data contain several factors, the correlations may be unclear, and the next step consists of finding clusters in the data to complement the correlation analysis.
The following figure shows that there are clear clusters separating versicolor and virginica (green and blue), and a linear correlation between Petal.Length and Petal.Width.
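A figure of this kind can be produced with a base-R scatterplot matrix; the colour assignment below (setosa red, versicolor green, virginica blue) is an assumption chosen to match the description in the text.

```r
data(iris)
# Scatterplot matrix of the four measurements, coloured by species
# (colour order is assumed: setosa red, versicolor green, virginica blue).
pairs(iris[, 1:4],
      col = c("red", "green3", "blue")[iris$Species],
      pch = 19)

# The Petal.Length / Petal.Width panel shows the strong linear correlation:
cor(iris$Petal.Length, iris$Petal.Width)
</imports>
```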
> summary(mod)

Call:
lm(formula = iris$Species ~ iris$Sepal.Length + iris$Sepal.Width +
    iris$Petal.Length + iris$Petal.Width)

Residuals:
     Min       1Q   Median       3Q      Max
-0.59215 -0.15368  0.01268  0.11089  0.55077

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        1.18650    0.20484   5.792 4.15e-08 ***
iris$Sepal.Length -0.11191    0.05765  -1.941   0.0542 .
iris$Sepal.Width  -0.04008    0.05969  -0.671   0.5030
iris$Petal.Length  0.22865    0.05685   4.022 9.26e-05 ***
iris$Petal.Width   0.60925    0.09446   6.450 1.56e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2191 on 145 degrees of freedom
Multiple R-squared:  0.9304,    Adjusted R-squared:  0.9285
F-statistic: 484.5 on 4 and 145 DF,  p-value: < 2.2e-16
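The output above can be reproduced along the following lines. One caveat: `lm()` needs a numeric response, so the sketch below coerces the Species factor to 1/2/3 with `as.numeric()`; this coercion is an assumption about how the model was fitted, since the call shown uses `iris$Species` directly.

```r
data(iris)
# lm() requires a numeric response; coding the three species as 1/2/3
# (an assumed preprocessing step) yields the summary shown above.
mod <- lm(as.numeric(iris$Species) ~ iris$Sepal.Length + iris$Sepal.Width +
            iris$Petal.Length + iris$Petal.Width)
summary(mod)
```

Treating an unordered categorical outcome as numeric is itself questionable, which reinforces the caution about over-interpreting R-squared discussed next.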
As seen, the linear model suggests these four variables as predictors, with an R-squared above 90%. This value indicates how well the regression model reproduces the observed responses, but it must always be used with caution. It might be the case, and this is a very common error in descriptive statistics, that the fitted correlation looks good enough to generate a predictive equation, while in reality the model is unable to explain the true relationship between the variables.
If the purpose of the preliminary analysis is to provide an overview of the dataset, then fPCA is a good candidate, since it reveals both correlations and clusters quickly.
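One readily available implementation of focused PCA is `fpca()` in the `psy` package; the call below is a sketch assuming that package, with the outcome column passed as `y` and the predictors as `x`, and with Species coerced to numeric since the method works on correlations.

```r
# install.packages("psy")  # focused PCA implementation (assumed dependency)
library(psy)

data(iris)
iris_num <- transform(iris, Species = as.numeric(Species))  # fpca needs numeric columns

# Focused PCA with Species as the dependent (focused) variable and the
# four flower measurements as explanatory variables.
fpca(y = "Species",
     x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
     data = iris_num)
```

The resulting diagram places Species at the centre; the distance of each predictor to the centre reflects its correlation with the outcome, as described below.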
In brief, the variables inside the dashed line are significantly associated with the focused outcome (Species). Green dots correspond to positive associations and yellow dots to negative ones. Dots in opposite quadrants are negatively correlated with each other. Finally, the red line represents a significant correlation with the outcome; the strength of each correlation is also given by the r values on the axis.
As seen in the figure, the correlation between Petal.Length and Petal.Width is preserved by this method, and the lm model supports the further interpretation of that correlation. It can also be seen that Sepal.Width is the least accurate predictor according to both the lm and fPCA methods; however, in the former the variable is negligible for the model, while in the latter it is retained because of its relation with the other variables. This is a relevant attribute of the fPCA method.