Abstract
Principal component analysis (PCA) is a standard dimensionality reduction technique used in various research and applied fields. From an algorithmic point of view, classical PCA can be formulated in terms of operations on a multivariate Gaussian likelihood. As a consequence of the implied Gaussian formulation, the principal components are not robust to outliers. In this paper, we propose a modified formulation, based on the use of a multivariate Cauchy likelihood instead of the Gaussian likelihood, which has the effect of robustifying the principal components. We present an algorithm to compute these robustified principal components. We additionally derive the relevant influence function of the first component and examine its theoretical properties. Simulation experiments on high-dimensional datasets demonstrate that the estimated principal components based on the Cauchy likelihood typically outperform, or are on a par with, existing robust PCA techniques. Moreover, the Cauchy PCA algorithm we have used has much lower computational cost in very high dimensional settings than the other public domain robust PCA methods we consider.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
1 Introduction
In the analysis of multivariate data, it is frequently desirable to employ statistical methods which are insensitive to the presence of outliers in the sample. To address the problem of outliers, it is important to develop robust statistical procedures. Most statistical procedures include explicit or implicit prior assumptions about the distribution of the observations, but often without taking into account the effect of outliers. The purpose of this paper is to present a novel robust version of PCA which has some attractive features.
Principal component analysis (PCA) is considered to be one of the most important techniques in statistics. However, the classical version of PCA depends on either a covariance or a correlation matrix, both of which are very sensitive to outliers. We develop an alternative method to classical PCA, which is far more robust, by using a multivariate Cauchy likelihood to construct a robust principal components (PC) procedure. It is an adaptation of the classic method of PCA obtained by replacing the Gaussian log-likelihood function by the Cauchy log-likelihood function, in a sense that will be explained in Sect. 2.2. Although we do not claim that the interpretation of standard PCA in terms of operations on a Gaussian likelihood is new, see Bolton and Krzanowski (1999), this fact does not appear to have been exploited in the development of a robust PCA procedure, as we do in this paper. An important reason for using the multivariate Cauchy likelihood is that this likelihood has only one maximum point, but the single most important motivation is that it leads to a robust procedure.
In the next section we review briefly some of the techniques employed for estimating PCA parameters and for directing a PCA in ways which are robust against the presence of outliers. We also present robustness preliminaries that include some important techniques which are necessary to assess whether the method used is robust or not. In Sect. 3 we develop the Cauchy-PCA and theoretically explore its robustness properties. Finally, in Sect. 4 we present the numerical algorithms for computing the Cauchy PCs, and also give the results of a number of very high-dimensional real-data and simulated examples. Our approach is seen to be competitive with, and often gives superior results to, that of the projection pursuit (PP) algorithm of Croux et al. (2007), Croux et al. (2013) and other competitors discussed later. Finally we conclude the paper in Sect. 5.
1.1 Literature review on robust PCA
It is well known that PCA is an important technique for high-dimensional data reduction. PCA is based on the sample covariance matrix \(\hat{{\varvec{\Sigma }}}\) and it involves searching for a linear combination \(y_{j}= \textbf{u}^{T}{} \textbf{x}_{j}\) of the \(\textbf{x}\) components of the vector that maximize the sample variance of the components of \(\textbf{y}\). According to Mardia et al. (1979), for example, the solution will be given by the equation
where \({\varvec{\Lambda }}= \hbox {diag}\{\lambda _{1}, \ldots , \lambda _{p}\}\) and its diagonal elements \(\lambda _{i}\) are the sample variances, while \(\textbf{U}\) is an orthogonal matrix, i.e. \(\textbf{U U}^{T} =\textbf{U}^{T}{} \textbf{U}=\textbf{I}_{p}\), whose columns \(\textbf{u}_{i}\) are the corresponding eigenvectors which represent the linear combinations. The PCs are efficiently estimated in practice via Singular Value Decomposition (SVD).
Classical PCA, unfortunately, is non-robust, since it based on the sample covariance or sample correlation matrix, both of which are very sensitive to outlying observations; see Sect. 2. However, this problem has been handled by two different methods which result in robust versions of PCA by:
-
i.
replacing the standard covariance or correlation matrix with a robust estimator; or
-
ii.
maximising (or minimising) a different objective function to obtain a robust PCA.
Many different proposals had been developed to carry out robust PCA, such as using PP, \(M-\)estimators and so on.
Despite maximum likelihood estimation, perhaps, being considered as the most important statistical inference method, sometimes this approach can lead to incorrect results when the underlying assumptions are not satisfied, for instance, when data contain outliers or deviate slightly from the supposed model. A generalization of maximum likelihood estimation proposed by Huber (1964), which is called M-estimation, aims to produce a robust statistic by constructing approaches that are resistant to deviations from the underline assumptions. M-estimators were also defined for the multivariate case by Maronna (1976).
Campbell (1980) provided a procedure for robust PCA by examining the estimates of means and covariances which are less affected by outlier observations, and by exploring the observations which have a large effect on the estimates. He replaced the sample covariance sample by an \(M-\)estimator. Hubert and Verboven (2003) introduced a new approach to create robust PCA. It combines the advantages of two methods, the first one is based on replacing the covariance or correlation matrix by its robust estimator, while the second one is based on maximizing the objective function for this robust estimator.
A robust PCA based on the PP method was developed by Li and Chen (1985), using Huber’s M-estimator of dispersion as the projection index. The objective of PP is to seek projections, of the high-dimensional data set onto low-dimensional subspaces, that optimise a function of "interestingness". The function that should be optimised is called an index or objective function and its choice depends on a feature that the researcher is concerned about. This property gives the PP technique a flexibility to handle many different statistical problems range from clustering to identifying outliers in a multivariate data set.
Bolton and Krzanowski (1999) characterized the PCs for PP in terms of maximum likelihood under the assumption of normality. PCA can be considered as a special case of PP as well as many other methods of multivariate analysis. Li and Chen (1985) used Huber’s M-estimator of dispersion as projective index to develop a robust PCA based on the PP approach. The sample median was used as a projective index to develop a robust PCA by Xie et al. (1993). In their simulation studies, Xie et al. (1993) observed a PCA resistant to outliers and deviations from the normal distribution.
More recently, there has been further work on developing approaches to robust PCA. Croux et al. (2007) and Croux et al. (2013) also suggested a robust PCA using a version of PP. Candès et al. (2011) proposed a robust penalized PCA, termed Principal Component Pursuit (PCP). Another approach, developed by Zhang et al. (2013), is based on a Robust Regularized Singular Value Decomposition (RRSVD). We will contrast our methodology against the three aforementioned algorithms in numerical examples, focusing especially on the ultra high-dimensional case where \(p> > n\).
2 Preliminaries on standard PCA
PCA is an orthogonal linear transformation that projects the data to a new coordinate system according to the variance of each direction. Given a data matrix \(\textbf{X}\in {\mathbb {R}}^{n\times p}\) with each row correspond to a sample, the first direction \(\textbf{u}_1\) that maximizes the variance is defined through
where \(\textbf{1}_n\) is an n-dimensional vector whose elements are all set to 1 while \({\bar{\textbf{x}}}=\frac{1}{n}\sum _{i=1}^n \textbf{x}_i\) is the empirical mean. The process is repeated k times and at each iteration the to-be-estimated principal direction has to be orthogonal to all previously-computed principal directions. Thus, the k-th direction which has to be orthogonal to the previous ones is defined by
2.1 Non-robustness of standard PCA
We will show that the influence function for the largest eigenvalue of the covariance matrix and the respective eigenvector are unbounded with respect to the norm of an outlier sample. Suppose that \(\varvec{\Sigma }\) is the covariance matrix of a population with distribution function F, i.e.,
where \(\varvec{\mu }=\int _{{\mathbb {R}}^p} \textbf{x}dF(\textbf{x})\) corresponds to the mean vector. Assume that the leading eigenvalue of \(\varvec{\Sigma }\) has multiplicity 1, then we denote it by \(\lambda \) and the leading eigenvector by \({\hat{\textbf{u}}}\) (i.e., \(\textbf{u}_{1}={\hat{\textbf{u}}}\)).
Let T be an arbitrary functional, F a distribution and \(\textbf{z}\in {\mathbb {R}}^p\) an arbitrary point in the relevant sample space. The influence function is defined as
where \(\Delta _{\textbf{z}}\) is a unit point mass located at \(\textbf{z}\).
A robust estimator for T means that the influence function is bounded with respect to the norm of the outlier \(\textbf{z}\).
Proposition 1
The influence function for the leading eigenvector of \(\varvec{\Sigma }\) is given byFootnote 1
Similarly, the IF for the largest eigenvalue of \({\varvec{\Sigma }}\) is
The detailed calculations are presented in Appendix 1. The following result shows that outliers with unbounded influence function do exist.
Corollary 1
Let \(\textbf{z}=\varvec{\mu }+ \gamma {\hat{\textbf{u}}} + \eta \textbf{v}\) where \(\textbf{v}\) is orthogonal to \({\hat{\textbf{u}}}\) and does not belong to the null space of \(\varvec{\Sigma }\) and \(\gamma ,\eta \ne 0\) then
and similarly for \(IF_\lambda (\textbf{z}, F)\).
Proof
Direct substitution of \(\textbf{z}\) into the influence function gives:
Since \(\textbf{v}\) does not belong to the null space of \(\varvec{\Sigma }\), it holds that \((\varvec{\Sigma }-\lambda \textbf{I}_p)^{+} \textbf{v}\ne \textbf{0}\) thus \(||(\varvec{\Sigma }-\lambda \textbf{I}_p)^{+} \textbf{v}||_2=c\ne 0\). Hence,
Given that \(||\textbf{z}||_2^2 = \gamma ^2+\eta ^2+||\varvec{\mu }||_2^2+\gamma \varvec{\mu }^T{\hat{\textbf{u}}}+\eta \varvec{\mu }^T\textbf{v}\), either sending \(|\gamma |\rightarrow \infty \) or \(|\eta |\rightarrow \infty \) completes the proof.
Similarly,
as \(|\gamma |\rightarrow \infty \). \(\square \)
2.2 Generalizations of standard PCA
Standard PCA can be viewed as a special case of a more general optimization problem. We present two such generalization: the first one leads to PP algorithms while the second leads to a maximum likelihood formulation. Let \(\textbf{u}\) be a unit vector and define the projection values
and a function \(\Phi :{\mathbb {R}}^n \rightarrow {\mathbb {R}}\) acting on the projected values. The first generalization of PCA is defined as the maximization of \(\Phi \):
As in the standard PCA, the following principal directions are obtained after removing the contribution of the current PC from the data. When \(\Phi \) is the sample variance then we recover the standard PCA.
The second generalization interprets the computation of the principal component as a maximum likelihood estimation problem. By letting,
be the Gaussian log-likelihood, the first principal direction can be obtained by solving the minimax problem:
Indeed, the inner maximization can be solved analytically which leads to the optimal solution
and
Unsurprisingly, the optimal values are the sample mean and the sample variance. Using the above formulas it is straightforward to show that
Variations of PCA can be derived by changing the likelihood function and in the next section we analyze the case of Cauchy distribution.
3 Cauchy PCA
The Cauchy log-likelihood function is given by
where \(\mu \) and \(\sigma \) are the two parameters of the Cauchy distribution. The first Cauchy principal direction is also obtained by solving the minimax optimization problem:
In contrast to the Gaussian case, the inner maximization cannot be performed analytically. Therefore an iterative approach needs to be utilized. Here, we apply the Newton-Raphson method with initial values the median and half the interquartile range for the location and scale parameters, respectively. According to Copas (1975), although the mean of the Cauchy distribution does not exist and it has infinite variance, the Cauchy log-likelihood function \(l_{C}(\mu , \sigma )\) has a unique maximum likelihood estimate, \(({\hat{\mu }},{\hat{\sigma }})\).
Fixing \(\mu \) and \(\sigma \), the outer minimization is also non-analytic and a fixed point iteration is applied to calculate \(\textbf{u}\). The iteration is given by
where \({\hat{\textbf{u}}}_{un}\) is the unnormalized direction which is obtained from the differentiation of the Lagrangian function with respect to \(\textbf{u}\) and it is given by
Once the first principal direction has been computed, its contribution from the dataset \(\textbf{X}\) is removed and the same procedure to estimate the next principal direction is repeated. This iterative process is repeated k times to estimate the first k PCs. The removal of the contribution makes the principal directions orthogonal to each other. We summarize the estimation of k Cauchy PCs in the following pseudo-code (Algorithm 1).
3.1 Robustness of the leading Cauchy principal direction
Let \(\varvec{\theta }= \left( \mu ,\sigma \right) ^T\) be the parameter vector of the Cauchy distribution and consider the infinite-sample normalized Cauchy log-likelihood function
where \(g(c,\varvec{\theta }) = \log (\sigma /\pi ) - \log ( \sigma ^2+(c-\mu )^2)\) and \(c(\textbf{u})=\textbf{x}^T\textbf{u}\). We will estimate the influence function for the leading Cauchy principal direction
where
are the optimal Cauchy parameters for a given direction \(\textbf{u}\).
Since \({\hat{\textbf{u}}}\) is restricted to be a unit vector, the standard condition for the minimum, i.e., \(\left. \frac{\partial }{\partial \textbf{u}} l(\textbf{u}|\varvec{\theta }_F(\textbf{u}))\right| _{\textbf{u}={\hat{\textbf{u}}}} = \textbf{0}\) is not valid. The appropriate condition is defined by
where \(\textbf{P}_{\textbf{u}}\) is the projection matrix given by \(\textbf{P}_{\textbf{u}}=\textbf{I}_p-\textbf{u}\textbf{u}^T\).
Remark 1
An equivalent condition is to satisfy \(\textbf{h}^T \)\( \left. \frac{\partial }{\partial \textbf{u}} l(\textbf{u}|\varvec{\theta }_F(\textbf{u}))\right| _{\textbf{u}={\hat{\textbf{u}}}} = \textbf{0}\) for all \(\textbf{h}\) such that \(\textbf{h}^T{\hat{\textbf{u}}}=0\) and \(||\textbf{h}||_2=1\). Both derived conditions are essentially a consequence of the Lagrangian formulation of the constraint optimization problem. Indeed, the Lagrange condition implies that at the minimum the direction of the objective function’s derivative should be parallel to the direction of the constraint’s derivative which translates to \(\left. \frac{\partial }{\partial \textbf{u}} l(\textbf{u}|\varvec{\theta }_F(\textbf{u}))\right| _{\textbf{u}={\hat{\textbf{u}}}} = \lambda {\hat{\textbf{u}}}\) where \(\lambda \ne 0\) is the Lagrange multiplier.
Let \({\bar{g}}(\textbf{x};\textbf{u}) = \left. g(\textbf{x}^T\textbf{u}|\theta )\right| _{\theta =\theta _F(\textbf{u})}\) be the likelihood function computed at \(\theta =\theta _F(\textbf{u})\) and let denote its partial derivatives as
and
Similarly, \({\bar{g}}_{cc}\), \({\bar{g}}_{c\theta }\) and \({\bar{g}}_{\theta \theta }\) denote the second order derivatives. The following proposition establishes the expression for the influence function of the leading Cauchy principal direction, \({\hat{\textbf{u}}}\).
Proposition 2
Under the assumption of \({\textbf{I}}_F({\hat{\textbf{u}}})\) and \(\textbf{A}\) being invertible matrices, the influence function of \({\hat{\textbf{u}}}\) is
where
and
while
is the expected Fisher information matrix under F for the parameters of the Cauchy distribution computed at \({\hat{\textbf{u}}}\).
Proof
The proof consists of several straightforward series expansions and implicit function calculations. The complete proof is given in Appendix 1. \(\square \)
The following boundedness result for the influence function states the conditions under which Cauchy PCA is robust.
Corollary 2
Let the assumptions of the proposition hold. If \(\textbf{z}\not \perp {\hat{\textbf{u}}}\) or if \(\textbf{z}\perp {\hat{\textbf{u}}}=0\) but \(\mu _F({\hat{\textbf{u}}})=0\) then the influence function for \({\hat{\textbf{u}}}\) is bounded.
Proof
First, observe that matrix \(\textbf{A}\) does not depend on \(\textbf{z}\). It is only \(\textbf{b}\) that depends on \(\textbf{z}\) and our goal is to prove that \(\textbf{b}\) is bounded with respect to \(\textbf{z}\). Second, we have to compute the partial derivatives \({\bar{g}}_c(\textbf{z}; {\hat{\textbf{u}}})\) and \({\bar{g}}_{\varvec{\theta }}(\textbf{z}; {\hat{\textbf{u}}})\). Straightforward calculations lead to
and
Let us now define an arbitrary scaling of the outlier \(\textbf{z}\rightarrow \alpha \textbf{z}\) and prove boundedness of \(\textbf{b}\) as we send \(\alpha \rightarrow \infty \). We consider the first case where \(\textbf{z}\not \perp {\hat{\textbf{u}}}\). It holds that \(\lim _{\alpha \rightarrow \infty } {\bar{g}}_c(\alpha \textbf{z}; {\hat{\textbf{u}}})\alpha \textbf{z}= -(\textbf{z}^T{\hat{\textbf{u}}})^{-1}\textbf{z}\), \(\lim _{\alpha \rightarrow \infty } {\bar{g}}_\mu (\alpha \textbf{z}; {\hat{\textbf{u}}}) = 0\) and \(\lim _{\alpha \rightarrow \infty } {\bar{g}}_\sigma (\alpha \textbf{z}; {\hat{\textbf{u}}}) = \frac{1}{\sigma _F({\hat{\textbf{u}}})}\) therefore \(\textbf{b}\) is bounded with respect to \(\alpha \).
For the second case, we have
and
since \(\mu _F({\hat{\textbf{u}}})=0\) by assumption. Thus \(\textbf{b}\) is bounded with respect to \(\alpha \) for the second case, too. \(\square \)
The only case not covered by the corollary is when \(\textbf{z}^T{\hat{\textbf{u}}}=0\) and \(\mu ({\hat{\textbf{u}}})\ne 0\). Our experiments presented in the following section show that outliers that are orthogonal to the Cauchy principal direction do sometimes influence the estimation of the Cauchy principal direction yet not significantly.
3.2 Several Cauchy principal components
We briefly mention possibilities for estimating several Cauchy principal components. There are two obvious approaches: one approach, the sequential approach, is to repeat the algorithm described above on the subspace orthogonal to \({\hat{\textbf{u}}}={\hat{\textbf{u}}}_1\) to obtain \({\hat{\textbf{u}}}_2\), the second Cauchy principal component, where \({\hat{\textbf{u}}}_1\) is the first Cauchy principal component; then repeat the procedure on the subspace orthogonal to \({\hat{\textbf{u}}}_1\) and \({\hat{\textbf{u}}}_2\) to obtain \({\hat{\textbf{u}}}_3\); and so on. A second approach, the simultaneous approach, is to decide in advance how many principal components we wish to determine, p say, and then use a p-dimensional multivariate Cauchy likelihood, which has \(p+ p(p+1)/2\) free parameters, to obtain \({\hat{\textbf{u}}}_1, \ldots , {\hat{\textbf{u}}}_p\). It turns out that these two approaches lead to equivalent results in classical (Gaussian) PCA but when a Cauchy likelihood is used the two approaches produce different sets of principal components. Our current thinking is this: the sequential approach is easier to implement (essentially the same computations can be used at each step) and it is faster. However, the simultaneous approach could potentially be preferable if we know in advance how many principal components we wish to estimate. Further investigation is required.
3.3 Cauchy measures of variability explained
Suppose we obtain mutually orthogonal PC unit vectors \(\hat{\textbf{u}}_1, \ldots , \hat{\textbf{u}}_m\) from the robust outputs of a Cauchy PCA. We may wish either to perform a type of scree plot or to assess the "variability explained" by different PCs, but it is not clear at the outset how best to do this. One possibility is to use the connection between maximising the variance along different choices of \(\textbf{u}\) and minimising the likelihood in the Gaussian case, and motivate by analogy. For example, suppose that \({\hat{\ell }}_{C,1}, \ldots , {\hat{\ell }}_{C,m}\) are the minimised Cauchy log-likelihoods corresponding to the orthogonal PC unit vectors \(\hat{\textbf{u}}_1, \ldots , \hat{\textbf{u}}_m\). We may then choose a suitable function \(g(\cdot )\) and define \({\hat{\lambda }}_j = g({\hat{\ell }}_{C,j})\), \(j=1, \ldots , m\). Then we could proceed as though \({\hat{\lambda }}_1, \ldots , {\hat{\lambda }}_m\) had been obtained in a standard (Gaussian) PCA analysis. How might we choose the function g? One potentially interesting possibility would be to choose g to be the inverse function of \(h({\hat{\sigma }}^2)=-\frac{n}{2}\{\log ({\hat{\sigma }}^2)+1\}\), where h maps the maximised variance to the minimised log-likelihood in the Gaussian case. Specifically, we define
This approach would automatically inherit the robustness properties of the Cauchy PCA, but the resulting \({\hat{\lambda }}_j\) would be on a scale roughly comparable with that of the eigenvalues produced in a classical PCA. It would be of interest to explore this and other possibilities in future work.
4 Numerical results
4.1 Simulation studies
In this section we will empirically validate our proposed methodology, via simulation studies. We searched for R packages that offer robust PCA in the \(n<<p\) case and came up with FastHCS (Vakili 2018), rrcovHD (Todorov 2016), rpca (Sykulski 2017), RobRSVD (Zhang and Pan 2013) and pcaPP (Filzmoser et al. 2018). Out of them, pcaPP (Projection Pursuit PCA) is the only one that does not require hyper-parameter tuning, e.g. selection of the LASSO penalty \(\lambda \) or choice of the percentage of observations used to estimate a robust covariance matrix. However, we will also consider the PCP method Candès et al. (2011) and the RRSVD method (Zhang et al. 2013), implemented in the packages rpca and RobSRVD, respectively, using the default arguments as for the selection of the regularization parameters. Our proposed method has been implemented in the R package cauchypca (Tsagris et al. 2023).
4.1.1 Setup of the simulations
Initially, we created a \(p \times p\) (orthonormal) basis \(\textbf{B}\) by using QR decomposition on some randomly generated data. We then generated eigenvalues \(\lambda _i \sim Exp(0.4)\), where \(i=1,\ldots ,p\) and hence we obtained the covariance matrix \(\pmb {\Sigma } = \textbf{B}\pmb {\Lambda }{} \textbf{B}^T\), where \(\pmb {\Lambda } =\text {diag}(\lambda _i)\). The first column of \(\textbf{B}\) served as the first eigenvector, computed using the uncontaminated data, and was the benchmark in our comparative evaluations. Following this step, we simulated n random vectors \(\textbf{X} \sim N_p\left( \textbf{0}, \pmb {\Sigma } \right) \) and in order to check the robustness of the results to the center of the data, all observations were shifted right by adding 50 everywhere. A number of outliers equal to 2\(\%\) of the sample size were introduced. These outliers were \(\bar{\textbf{x}}+e^{\kappa }{} \textbf{z} \in {{\mathbb {R}}}^{p}\), where \(\bar{\textbf{x}}\) is the sample mean vector, \(\textbf{z}\) are unit vector(s) and \(e^{\kappa }\) a real number denoting their norm, where \(\kappa \) varied from 3 up to 8 increasing with a step size equal to 1 and the angle between the outliers \(\textbf{z}\) and the first “clean” eigenvector spanned from \(0^{\circ }\) up to \(90^{\circ }\). In all cases, we subtracted the spatial median or the variable-wise medianFootnote 2 and scaled them by the mean absolute deviation.
At each case, we computed the first eigenvector for the Cauchy-PCA and for each of the three competitors (PP, PCP and RSVD). The performance metric is the angle (in degrees) between the first robust eigenvector and the first eigenvector computed with the uncontaminated data using classical PCA. All experiments were repeated 100 times and the results were averaged.
4.1.2 Comparative results
Tables 1, 2, 3, 4, 5 and 6 present the performance of the first Cauchy-PCA eigenvector and of the first eigenvector of the PCP, RRSVD and PP methods for a variety of norms of the outlier, with different angles (\(\phi \)) between the outlier and the leading true eigenvector, for the \(n<p\) case. The case of \(n<p\) was selected as statistical inference in this case is more challenging than the \(p<n\) caseFootnote 3. Additionally, this case is also ordinarily met in the field of bioinformatics were the -omics data count tens of thousands of variables (genes, single nucleotide polymorphisms, etc.) but only tens or at most hundreds of observations.
As observed in Tables 1, 2, 3, 4, 5 and 6 the average angular difference between the Cauchy and the PP PCA ranges from \(20^{\circ }\) up to more than \(50^{\circ }\), which is evidently quite substantial, providing evidence that Cauchy PCA has performed in a superior manner to the projection pursuit method of Croux et al. (2007, 2013). In particular, the tables demonstrate that Cauchy PCA is less error prone than PP PCA but, as is seen in Tables 5 and 6, the error decreases for both methods with increasing sample size. Further, the mean angular difference between the two methods increases as the angle \(\phi \) increases. For instance, in Tables 1 and 2, when \(k=8\) and \(\phi =0^{\circ }\) the difference between the two methods is \(20^{\circ }\), whereas when \(\phi =90^{\circ }\) the difference increases to \(48^{\circ }\).
With regard to the other two competitors, Cauchy PCA outperforms both of them in the \( n=100 \ \& p=500\) case scenario as observed in Table 1, which presents the departure from the first (sample) eigenvector calculated using the uncontaminated data. Examination of the same case scenario, but measuring the angular departure from the population eigenvector (Table 2) shows that Cauchy PCA performs worse than the other two competitors when the angle \(\phi \) is equal to \(0^{\circ }\) or \(30^{\circ }\). For the larger values of \(\phi \), though, Cauchy PCA outperforms them. The same conclusion is drawn for all other case scenarios.
Further, the error is not highly affected by the angle \(\phi \), or the norm of the outliers. It can be seen that in Tables 3 and 5 in the special case of \(\phi =90^{\circ }\), the error increases for the Cauchy PCA by by between 2 and 3 degrees, thus corroborating the result of Corollary 2. However, this effect, as in Table 1, is rather small, though noticeable.
4.2 High dimensional real datasets
Two real gene expression datasets, GSE13159 and GSE31161Footnote 4, downloaded from the Biodataome platform (Lakiotaki et al. 2018), were used in the experiments. The dimensions of the datasets were equal to \(2096 \times 54,630\) and \(1,035 \times 54,675\), respectively. We randomly selected 2000 variables and computed the outliers using the high dimensional Minimum Diagonal Product (MDP) Ro et al. (2015). In accordance with the simulations studies, we removed the \(2\%\) of the most extreme outliers detected by MDP and computed the first classical PC (benchmark eigenvector), the first Cauchy-PCA eigenvector and the first PP-PCA eigenvector computed with the uncontaminated data. We then added those outliers and increased their norm by \(e^k\), where \(k=(0, 3, 4, \ldots , 8)\) and computed the first Cauchy-PCA eigenvector and the first PP-PCA eigenvector. In all cases, we subtracted the spatial median or the variable-wise median and scaled them by the mean absolute deviation. The performance metric is the angle (in degrees) between the first robust (based on Cauchy or PP-PCA) eigenvector and the first population eigenvector, computed with the uncontaminated data, and the time required by each method. This procedure was repeated 200 times and the average results are graphically displayed in Fig. 1a–d.
Broadly speaking the effect of the PP PCA does not seem to have been affected substantially by the centering method, i.e. subtraction of the spatial or the variable-wise median. On the contrary, the Cauchy PCA is affected by the type of median employed to this end. Centering with the spatial median yields high error levels for all norms of the outliers, for both datasets, whereas centering with the variable-wise median produces much lower error levels. On average, the difference in the error between Cauchy PCA and PP PCA is about \(30^{\circ }\) for the GSE31159 dataset (Fig. 1a) and about \(14^{\circ }\) for the GSE3161 dataset (Fig. 1b). However, the error of the Cauchy PCA increases and the stabilizes in the GSE31159 dataset whereas the error of the PP PCA is stable regardless of the norm of the outliers. A different conclusion is extracted in the GSE31161 where the error of either method decreases as the norm of the outliers increases, until it reaches a plateau.
With regards to computational efficiency, the PP PCA is not affected by either centering method, whereas Cauchy PCA seems to be affected in the GSE31159 dataset but not in the GSE31161 dataset as seen in Fig. 1c, d. Cauchy PCA centered with the variable-wise median is, on average, 5 times faster than PP PCA.
The reason why PCP and RSVD were excluded from examination with the real datasets is their high computational cost. Table 7 contains the duration (in seconds) of each algorithm required for one run with an increasing number of variables, from 100 up to 5000 variables, for each dataset. For the dataset GSE13159 and 2000 variables, the PCP method requires more than 3.5 h and the RRSVD requires 1.3 h. The figures were created using 200 replications. If we add these two other competing methods and perform only 50 iterations, for the GSE13159, the PCP method would require 8 * 3.5 * 50 = 1400 h which means 58 days and the RRSVD would require an extra 8 * 1.29 * 50 = 516 h or 21.5 days.
5 Conclusion
The starting point for this paper is the observation that classical PCA can be formulated purely in terms of operations on a Gaussian likelihood. Although this observation is not new, the specifics of this formulation of classical PCA do not appear to be as widely known as might be expected. The novel idea underlying this paper is to formulate a version of PCA in which a Cauchy likelihood is used instead of a Gaussian likelihood, leading to what we call Cauchy PCA. Study of the resulting influence function shows that Cauchy PCA has very good robustness properties. Moreover, we have provided an implementation of Cauchy PCA which runs quickly and reliably. Numerous simulation and real-data examples, mainly in high-dimensional settings, show that, in terms of quality of performance, Cauchy PCA typically outperforms, or is on a par with, alternative robust versions of PCA whose implementations are in the public domain; and Cauchy PCA is typically far superior to all of these other methods in terms of computational cost in very high dimensional settings.
Notes
We use \(\textbf{A}^+\) to denote the Moore-Penrose inverse of a matrix \(\textbf{A}\).
The results are pretty similar for either type of median and we here show the results of he variable-wise median.
In this paper we focus on high-dimensional simulations and real-date examples (\(p>n)\) but in results not presented in the paper we found that Cauchy PCA is also very competitive and performs strongly in low dimensional settings (\(p<n\)).
From a biological standpoint, the data have already been uniformly pre-processed, curated and automatically annotated.
References
Bolton, R.J., Krzanowski, W.J.: A characterization of principal components for projection pursuit. Am. Stat. 53(2), 108–109 (1999)
Campbell, N.A.: Robust procedures in multivariate analysis I: robust covariance estimation. Appl. Stat. 29(3), 231–237 (1980)
Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM (JACM) 58(3), 1–37 (2011)
Copas, J.B.: On the unimodality of the likelihood for the Cauchy distribution. Biometrika 62(3), 701–704 (1975)
Croux, C., Filzmoser, P., Oliveira, M.R.: Algorithms for projection-pursuit robust principal component analysis. Chemom. Intell. Lab. Syst. 87(2), 218–225 (2007)
Croux, C., Filzmoser, P., Fritz, H.: Robust sparse principal component analysis. Technometrics 55(2), 202–214 (2013)
Filzmoser, P., Fritz, H., Kalcher, K.: pcaPP: robust PCA by projection pursuit (2018). R package version 1.9-73. https://CRAN.R-project.org/package=pcaPP
Huber, P.J.: Robust estimation of a location parameter. Ann. Math. Stat. 35(1), 73–101 (1964)
Hubert, M., Verboven, S.: A robust PCR method for high-dimensional regressors. J. Chemom. 17, 438–452 (2003)
Lakiotaki, K., Vorniotakis, N., Tsagris, M., Georgakopoulos, G., Tsamardinos, I.: Biodataome: a collection of uniformly preprocessed and automatically annotated datasets for data-driven biology. Database 2018, bay011 (2018)
Li, G.Y., Chen, Z.L.: Projection-pursuit approach to robust dispersion matrices and principal components: primary theory and Monte Carlo. J. Am. Stat. Assoc. 80, 759–766 (1985)
Mardia, K.F., Kent, J.T., Bibby, J.M.: Multivariate Analysis. Academic Press (1979)
Maronna, R.A.: Robust M-estimators of multivariate location and scatter. Ann. Stat. 4(1), 51–67 (1976)
Ro, K., Zou, C., Wang, Z., Yin, G.: Outlier detection for high-dimensional data. Biometrika 102(3), 589–599 (2015)
Sykulski, M.: Rpca: RobustPCA: decompose a matrix into low-rank and sparse components (2017). R package version 0.2.3. https://CRAN.R-project.org/package=rpca
Todorov, V.: rrcovHD: robust Multivariate Methods for High Dimensional Data (2016). R package version 0.2-5. https://CRAN.R-project.org/package=rrcovHD
Tsagris, M., Fayomi, A., Pantazis, Y., Wood, A.T.A.: Cauchypca: robust principal component analysis using the cauchy distribution (2023). R package version 1.1. https://CRAN.R-project.org/package=cauchypca
Vakili, K.: FastHCS: robust algorithm for principal component analysis (2018). R package version 0.0.6. https://CRAN.R-project.org/package=FastHCS
Xie, Y.L., Wang, J.H., Liang, L.X., Sun, X.H.S., Yu, R.Q.: Robust principal component analysis by projection-pursuit. J. Chemom. 7(6), 527–541 (1993)
Zhang, L., Pan, C.: RobRSVD: robust regularized singular value decomposition (2013). R package version 1.0. https://rdrr.io/cran/RobRSVD/
Zhang, L., Shen, H., Huang, J.Z.: Robust regularized singular value decomposition with application to mortality data. Ann. Appl. Stat. 7(3), 1540–1561 (2013)
Funding
Open access funding provided by HEAL-Link Greece.
Author information
Authors and Affiliations
Contributions
A.T.A.W. conceived the idea and A.F. developed it under A.T.A.W.’s supervision. M.T. implemented the computational work. Y.P. made suggestions and reviewed some of the theoretical work. All authors contributed to the writing and read the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Appendices
Appendix A Proof of Proposition 1
Proof
The perturbed distribution \((1-\epsilon )F(\textbf{x}) + \epsilon \Delta _\textbf{z}(\textbf{x})\) has perturbed mean value
and perturbed covariance matrix
Denoting by \(\lambda _\epsilon \) the leading eigenvalue of \(\varvec{\Sigma }_\epsilon \) and by \(\textbf{u}_\epsilon \) the corresponding eigenvector, it holds that
Next, we expand the perturbed eigenvector and eigenvalue around the unperturbed ones as follows:
and
with
Substituting the formulas into (A1), and equating the zero-th and first order we get
and
and
Multiplying (A2) from the left with \(\textbf{u}_0^T\), we get
For \(\textbf{u}_1\), we rearrange (A2) to
and then multiply from the left with the pseudo-inverse of \(\varvec{\Sigma }-\lambda _0\textbf{I}\) to obtain
Using the properties ( Mardia et al. (1979)): \((\varvec{\Sigma }-\lambda _0\textbf{I})^+(\varvec{\Sigma }-\lambda _0\textbf{I}) = \textbf{I}- \textbf{u}_0\textbf{u}_0^T\) and \((\varvec{\Sigma }-\lambda _0\textbf{I})^+\textbf{u}_0 = \textbf{0}\), we obtain
and the proof is completed. \(\square \)
Appendix B Proof of Proposition 2
Let us first make the symbolism more explicit and denote \(l_F(\textbf{u}|\varvec{\theta })\) the Cauchy log-likelihood function with respect to the distribution F and \({\hat{\textbf{u}}}_F\) the respective leading Cauchy principal direction. Then, our goal is to calculate the limit of
as \(\epsilon \rightarrow 0\) where \({\hat{\textbf{u}}}_{F_{\epsilon ,\textbf{z}}}\) is the leading Cauchy principal direction for the distribution \(F_{\epsilon ,\textbf{z}}=(1-\epsilon )F+\epsilon \Delta _\textbf{z}\). The optimality condition for the leading Cauchy principal direction reads
and
Moreover, \({\hat{\textbf{u}}}_{F_{\epsilon ,\textbf{z}}}\) is a unit vector which can be represented as
where \(\textbf{h}\) is a unit vector perpendicular to \({\hat{\textbf{u}}}_F\) and \(\rho \) is a (small) real number. Under these assumptions, \({\hat{\textbf{u}}}_{F_{\epsilon ,\textbf{z}}}\) is a unit vector since
Obviously, \(\rho \) depends on \(\epsilon \) and \(\textbf{z}\) (i.e., \(\rho =\rho (\epsilon ,\textbf{z})\)) and \(\lim _{\epsilon \rightarrow 0} \rho = 0\) but we choose to avoid denoting their explicit relationship because it is not required in our proof. Moreover, a Taylor expansion for the representation leads to
thus we obtain that
Next, we compute the partial derivative using the chain rule
Therefore,
The second summand equals to zero because \({\hat{\textbf{u}}}_{F}\) maximizes the Cauchy log-likelihood function thus it holds that \(\int _{{\mathbb {R}}^p} {\bar{g}}_{\varvec{\theta }}(\textbf{x};{\hat{\textbf{u}}}_{F}) dF(\textbf{x}) = \textbf{0}\). Similarly,
Next, we further Taylor expand \({\bar{g}}_{c}(\textbf{x}; \textbf{u}_{F_{\epsilon ,\textbf{z}}})\) using \({\hat{\textbf{u}}}_{F_{\epsilon ,\textbf{z}}} = {\hat{\textbf{u}}}_F + \rho \textbf{h}+ O(\rho ^2)\)
Using again the chain rule, we obtain that
The computation of the partial derivative \(\frac{\partial }{\partial \textbf{u}} \varvec{\theta }_{F}(\textbf{u})\) follows. Formula \(\varvec{\theta }_{F}(\textbf{u}) = {{\,\mathrm{\arg \max }\,}}_{\varvec{\theta }} l_F(\textbf{x}^T\textbf{u}|\varvec{\theta })\) implies that
Differentiating with respect to \(\textbf{u}\) and using the implicit function theorem, we get
Thus,
Overall, (B1) becomes
where we use the facts that
and
Thus, the influence function is
where
and
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Fayomi, A., Pantazis, Y., Tsagris, M. et al. Cauchy robust principal component analysis with applications to high-dimensional data sets. Stat Comput 34, 26 (2024). https://doi.org/10.1007/s11222-023-10328-x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11222-023-10328-x