
# Focused Principal Component Analysis

**Principal Component Analysis (PCA)** is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of *principal components* is less than or equal to the number of original variables. The transformation is defined so that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors form an uncorrelated orthogonal basis set. **PCA** is sensitive to the relative scaling of the original variables.
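The mechanics above can be sketched directly. On a small synthetic dataset (the variables and values below are invented for illustration), the principal components are the eigenvectors of the covariance matrix; the projected scores come out mutually uncorrelated, with variances in decreasing order:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 100 observations of 3 variables, two of them correlated.
X = rng.normal(size=(100, 3))
X[:, 1] += 0.8 * X[:, 0]           # introduce correlation
X = X - X.mean(axis=0)             # centre the data

# Principal components are the eigenvectors of the covariance matrix.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]  # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = X @ eigvecs               # projections onto the components

# The components are uncorrelated and ordered by variance.
print(np.round(np.corrcoef(scores, rowvar=False), 2))
print(np.round(eigvals, 3))
```

Note that scaling matters: because the eigendecomposition acts on the covariance matrix, variables measured on large scales dominate the first component unless the data are standardised first.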

In particular, **Focused Principal Component Analysis (fPCA)** conveys the structure of a correlation matrix in a low-dimensional diagram but, unlike *PCA*, it accurately represents the correlations of a given variable with the other variables (and even makes it possible to test graphically the hypothesis that one of these correlations is equal to zero).

In sum, **fPCA** is *focused* on a chosen outcome, so that the distances between the predictors and the outcome can be interpreted as a representation of their correlations. The relative positions of the predictors give an idea of their mutual correlations and can be interpreted as in a classic **PCA**.

### Example:

The *iris* dataset is used to exemplify the *fPCA* method. It contains four measurements for 150 flowers representing three species of iris (setosa, versicolor and virginica).

*Figure: example flowers of the three species (setosa, versicolor, virginica).*

The first step usually taken when analysing data is to look for a linear correlation. However, when the data contains several factors, the correlation may be unclear, and the next step consists of finding clusters in the data to complement the correlation.

The following figure shows that there are clear clusters for *versicolor* and *virginica* (green and blue), and a linear correlation between *Petal.Length* and *Petal.Width*.

```
> summary(mod)

Call:
lm(formula = iris$Species ~ iris$Sepal.Length + iris$Sepal.Width +
    iris$Petal.Length + iris$Petal.Width)

Residuals:
     Min       1Q   Median       3Q      Max
-0.59215 -0.15368  0.01268  0.11089  0.55077

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        1.18650    0.20484   5.792 4.15e-08 ***
iris$Sepal.Length -0.11191    0.05765  -1.941   0.0542 .
iris$Sepal.Width  -0.04008    0.05969  -0.671   0.5030
iris$Petal.Length  0.22865    0.05685   4.022 9.26e-05 ***
iris$Petal.Width   0.60925    0.09446   6.450 1.56e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2191 on 145 degrees of freedom
Multiple R-squared:  0.9304,    Adjusted R-squared:  0.9285
F-statistic: 484.5 on 4 and 145 DF,  p-value: < 2.2e-16
```

As seen, the linear model supports these four variables as predictors, with R-squared > 90%. This value indicates how well a regression model predicts responses, but it needs to be used with caution at all times. It might be the case, and this is a very common ~~error~~ case in descriptive statistics, that the correlation found is considered a well-fitting method to generate an equation when, in reality, the method is unable to explain the true relationship between the variables.
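For reference, R-squared is simply the share of the response's variance captured by the fitted values. A quick sketch with ordinary least squares on made-up data (the coefficients and noise level below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 150
# Hypothetical predictors and a response they largely explain.
X = rng.normal(size=(n, 2))
y = 1.0 + 0.8 * X[:, 0] + 0.6 * X[:, 1] + 0.2 * rng.normal(size=n)

# Ordinary least squares (design matrix with an intercept column).
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ beta

# R-squared: 1 minus residual sum of squares over total sum of squares.
ss_res = ((y - fitted) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r2 = 1.0 - ss_res / ss_tot
print(f"R-squared = {r2:.3f}")
```

A high value here only says the fitted values track the response; it says nothing about whether the linear form is the right description of the relationship, which is exactly the caution raised above.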

If the purpose of the preliminary analysis is to provide an overview of the dataset, then the *fPCA* method is a good candidate due to its ability to provide both correlations and clusters quickly.

In brief, the variables inside the dashed line are significantly associated with the focused outcome (Species). Green dots correspond to positive associations and yellow dots to negative ones. Dots in opposite quadrants are negatively correlated with each other. Finally, the red line represents a significant correlation with the outcome; this is also given by the *r* values on the axis.
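The graphical test that a correlation equals zero has a familiar numerical counterpart: under the null hypothesis, r * sqrt((n - 2) / (1 - r^2)) follows a t distribution with n - 2 degrees of freedom. A sketch on synthetic data (the variables and effect size are invented):

```python
import math
import numpy as np

rng = np.random.default_rng(4)
n = 150
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(size=n)   # hypothetical correlated pair
z = rng.normal(size=n)             # unrelated to y

def corr_t_stat(a, b):
    """Pearson r and the t statistic for H0: correlation(a, b) = 0."""
    r = np.corrcoef(a, b)[0, 1]
    m = len(a)
    return r, r * math.sqrt((m - 2) / (1.0 - r * r))

r_xy, t_xy = corr_t_stat(x, y)
r_zy, t_zy = corr_t_stat(z, y)
# |t| above ~1.98 rejects r = 0 at the 5% level for n = 150.
print(f"x vs y: r = {r_xy:+.2f}, t = {t_xy:+.2f}")
print(f"z vs y: r = {r_zy:+.2f}, t = {t_zy:+.2f}")
```

Variables whose correlation with the outcome passes this test are the ones drawn inside the dashed significance line of the focused diagram.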

As seen in the figure, the correlation between *Petal.Length* and *Petal.Width* is preserved by this method, and the *lm* model supports the further explanation of that correlation. It can also be seen that *Sepal.Width* is the least accurate predictor according to both the *lm* and *fPCA* methods; however, while in the former the variable is negligible (for the model), in the latter it is retained because of its relation with the other variables. This is a relevant attribute of the *fPCA* method.