Sparse PCA
Sparse principal component analysis (SPCA or sparse PCA) is a specialised technique used in statistical analysis and, in particular, in the analysis of multivariate data sets. It extends the classic method of principal component analysis (PCA) for the reduction of dimensionality of data by introducing sparsity structures to the input variables.
A particular disadvantage of ordinary PCA is that the principal components are usually linear combinations of all input variables. SPCA overcomes this disadvantage by finding sparse components (SPCs) that are linear combinations of just a few input variables. This means that some of the coefficients of the linear combinations defining the SPCs, called loadings,[note 1] are equal to zero. The number of nonzero loadings is called the cardinality of the SPC.
There exist two different approaches to SPCA: conventional SPCA, based on Harold Hotelling's definition[1] of PCA, in which the PCs are the linear combinations of the variables with unit-norm and orthogonal loadings that sequentially have the largest variance (the norm of the SPCs); and Least Squares SPCA[2] (LS SPCA), based on Karl Pearson's definition[3] of PCA, in which the SPCs are orthogonal and sequentially best approximate the data matrix in a least squares sense. While the two definitions lead to the same PCA solution, this is no longer true when sparsity constraints are added.
Contemporary datasets often have the number of input variables ($p$) comparable with or even much larger than the number of samples ($n$). It has been shown that if $p/n$ does not converge to zero, classical PCA is not consistent, but conventional SPCA can retain consistency even if $p \gg n$.[4]
LS SPCA Variants
In optimal LS SPCA (USPCA, uncorrelated SPCA) the orthogonality constraints require that the cardinality of the solutions is not smaller than the order of the component. These constraints may also create numerical problems when computing components of order larger than two.
Correlated SPCA (CSPCA) is a variant of LS SPCA in which the orthogonality constraints are relaxed and the solutions are obtained iteratively by minimizing the norm of the approximation error of the residuals orthogonal to the previously computed SPCs. Even though the resulting components are correlated (usually very mildly), they have lower cardinality and in many cases explain more variance than the corresponding USPCA solutions.
The computation of the USPCA and CSPCA solutions is demanding when the data matrix is large. With Projection SPCA[5] (PSPCA), approximate CSPCA solutions can be computed much more efficiently by simply projecting the current first PC onto subsets of the variables. This means that the solutions can be computed with efficient linear regression routines. PSPCA is fast and can be shown to explain a proportion of the variance of the dataset comparable with that explained by CSPCA.
Mathematical formulation
Consider a data matrix, $X$, where each of the $p$ columns represents an input variable and each of the $n$ rows represents an independent sample from the data population. One assumes each column of $X$ has mean zero; otherwise one can subtract the column-wise mean from each element of $X$. Let $\Sigma$ be the empirical covariance matrix of $X$, which has dimension $p \times p$.
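As a concrete illustration of this setup, the following NumPy sketch (with an arbitrary random matrix standing in for real data) centres the columns and forms the empirical covariance matrix; the $1/(n-1)$ scaling is one common convention.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10                   # number of samples and of input variables (arbitrary)
X = rng.normal(size=(n, p))      # stand-in data matrix

X = X - X.mean(axis=0)           # subtract the column-wise mean from each column
Sigma = X.T @ X / (n - 1)        # empirical covariance matrix, of dimension p x p
assert Sigma.shape == (p, p)
```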
Conventional SPCA
Given an integer $k$ with $1 \le k \le p$, the sparse PCA problem can be formulated as maximizing the variance along a direction represented by a vector $v \in \mathbb{R}^p$ while constraining its cardinality:
$\max_{v}\; v^\top \Sigma v \quad \text{subject to} \quad \|v\|_2 = 1,\ \|v\|_0 \le k$ (Eq. 1)
The first constraint specifies that $v$ is a unit vector. In the second constraint, $\|v\|_0$ represents the $\ell_0$ pseudo-norm of $v$, defined as the number of its non-zero components. So the second constraint specifies that the number of non-zero components of $v$ is less than or equal to $k$, which is typically an integer much smaller than the dimension $p$. The optimal value of Eq. 1 is known as the $k$-sparse largest eigenvalue.
If one takes k=p, the problem reduces to the ordinary PCA, and the optimal value becomes the largest eigenvalue of covariance matrix Σ.
After finding the optimal solution $v$, one deflates $\Sigma$ to obtain a new matrix $\Sigma_1 = \Sigma - (v^\top \Sigma v)\, v v^\top$ and iterates this process to obtain further principal components. However, unlike PCA, sparse PCA cannot guarantee that different principal components are orthogonal. In order to achieve orthogonality, additional constraints must be enforced.
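Since Eq. 1 only restricts the size of the support of $v$, for toy problems it can be solved exactly by enumerating all supports of size $k$ and taking the leading eigenvector of the corresponding principal submatrix of $\Sigma$. The sketch below (an illustrative helper, not a library routine, and exponential in $p$) does this and applies the deflation step described above; it is not one of the practical algorithms listed later.

```python
import numpy as np
from itertools import combinations

def k_sparse_leading_eig(Sigma, k):
    """Exact solution of Eq. 1 by enumerating all size-k supports (toy sizes only)."""
    p = Sigma.shape[0]
    best_val, best_v = -np.inf, None
    for support in combinations(range(p), k):
        idx = list(support)
        vals, vecs = np.linalg.eigh(Sigma[np.ix_(idx, idx)])
        if vals[-1] > best_val:          # leading eigenvalue of the principal submatrix
            best_val = vals[-1]
            best_v = np.zeros(p)
            best_v[idx] = vecs[:, -1]    # embed the sub-eigenvector into R^p
    return best_val, best_v

def deflate(Sigma, v):
    """Deflation step: remove the variance captured by the unit vector v."""
    return Sigma - (v @ Sigma @ v) * np.outer(v, v)

# toy usage: the first two 2-sparse components
rng = np.random.default_rng(1)
A = rng.normal(size=(30, 5))
Sigma = np.cov(A, rowvar=False)
val1, v1 = k_sparse_leading_eig(Sigma, k=2)
val2, v2 = k_sparse_leading_eig(deflate(Sigma, v1), k=2)
```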
The following equivalent definition is in matrix form. Let $V$ be a $p \times p$ symmetric matrix; one can rewrite the sparse PCA problem as
$\max_{V}\; \operatorname{Tr}(\Sigma V) \quad \text{subject to} \quad \operatorname{Tr}(V) = 1,\ \|V\|_0 \le k^2,\ \operatorname{Rank}(V) = 1,\ V \succeq 0$ (Eq. 2)
Here $\operatorname{Tr}$ is the matrix trace, and $\|V\|_0$ represents the number of non-zero elements of the matrix $V$. The last two constraints specify that $V$ has matrix rank one and is positive semidefinite; together they mean that one can write $V = v v^\top$ for a unit vector $v$, so Eq. 2 is equivalent to Eq. 1.
Moreover, the rank constraint in this formulation is actually redundant, and therefore sparse PCA can be cast as the following mixed-integer semidefinite program[6]
$\max_{V,\, z}\; \operatorname{Tr}(\Sigma V) \quad \text{subject to} \quad \operatorname{Tr}(V) = 1,\ V_{ii} \le z_i \ \text{for all } i,\ V \succeq 0,\ z \in \{0,1\}^p,\ \textstyle\sum_{i=1}^{p} z_i \le k$
where the binary variable $z_i$ indicates whether the $i$th variable may enter the support of the solution.
Because of the cardinality constraint, the maximization problem is hard to solve exactly, especially when dimension p is high. In fact, the sparse PCA problem in Eq. 1 is NP-hard in the strong sense.[7]
Least Squares SPCA
In this section the generic SPCs will be defined as $t_j = X a_j$, where $a_j$ denotes the $j$th $p$-vector of loadings.
Denoting by $T = [t_1, \ldots, t_d]$ the matrix with columns $t_1, \ldots, t_d$ and by $A = [a_1, \ldots, a_d]$ the corresponding matrix of loadings, the USPCs are derived as the solutions to the minimization problem:
$\min_{A}\; \| X - P_{T} X \|^2 \quad \text{subject to} \quad T = XA,\ T^\top T \ \text{diagonal},\ \|a_j\|_0 \le c_j,\ j = 1, \ldots, d$ (Eq. 4)
where $P_T$ is the orthogonal projector onto the column space of $T$.
Because of the orthogonality constraints, the problem in Eq. 4 can be solved sequentially for each $a_j$ as:
$a_j = \operatorname*{arg\,min}_{a}\; \| X - P_{Xa} X \|^2 \quad \text{subject to} \quad T_{j-1}^\top X a = 0,\ \|a\|_0 \le c_j$ (Eq. 5)
where $T_{j-1}$ denotes the matrix formed by the first $j-1$ columns of $T$ and $P_{Xa}$ the orthogonal projector onto $Xa$. Closed-form solutions to Eq. 5 for a given sparsity pattern are given in Merola (2015).[2]
In CSPCA the loadings are obtained iteratively by minimizing the norm of the approximation error of the residuals orthogonal to the previously computed SPCs. That is, if we let $X_j = X - P_{T_{j-1}} X$ denote the residual matrix, the vector of loadings $a_j$ is obtained as the solution to:
$a_j = \operatorname*{arg\,min}_{a}\; \| X_j - P_{Xa} X_j \|^2 \quad \text{subject to} \quad \|a\|_0 \le c_j$ (Eq. 6)
Without the sparsity constraints, the problems in Eq. 4–Eq. 6 are equivalent to Pearson's definition of PCA.
The PSPCA sparse loadings are obtained iteratively by projecting the first PC of the residual matrix onto subsets of the variables, that is, by solving the problems:
$a_j = \operatorname*{arg\,min}_{a}\; \| u_j - X a \|^2 \quad \text{subject to} \quad \|a\|_0 \le c_j$ (Eq. 7)
where $u_j$ is the first principal component of the residual matrix $X_j$.
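As a minimal illustration of the PSPCA iteration described above (a simplified sketch with an illustrative helper, assuming the subset of variables for each component is given in advance rather than chosen by a selection algorithm), one can compute the first PC of the current residual matrix, regress it on the chosen columns, and deflate the residuals with the resulting sparse component:

```python
import numpy as np

def pspca_step(R, subset):
    """One PSPCA-style step on the residual matrix R (n x p) for a prescribed subset."""
    # first principal component (scores) of the residual matrix
    U, s, _ = np.linalg.svd(R, full_matrices=False)
    u = U[:, 0] * s[0]
    # project u onto the selected columns by least squares (one possible choice)
    coef, *_ = np.linalg.lstsq(R[:, subset], u, rcond=None)
    a = np.zeros(R.shape[1])
    a[subset] = coef                       # sparse vector of loadings
    t = R @ a                              # sparse principal component
    # residuals orthogonal to the computed component
    R_next = R - np.outer(t, t @ R) / (t @ t)
    return a, t, R_next

# toy usage: two components with prescribed supports
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 8))
X = X - X.mean(axis=0)
a1, t1, R1 = pspca_step(X, subset=[0, 3])
a2, t2, R2 = pspca_step(R1, subset=[1, 5, 6])
```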
Computational considerations
As with most sparse problems, variable selection in SPCA is a computationally intractable, nonconvex NP-hard problem;[8] therefore greedy sub-optimal algorithms are required to find solutions.
Algorithms for conventional SPCA
Several alternative approaches to solving Eq. 1 have been proposed, including:
- a regression framework,[9]
- a penalized matrix decomposition framework,[10]
- a convex relaxation/semidefinite programming framework,[11]
- a generalized power method framework,[12]
- an alternating maximization framework,[13]
- forward-backward greedy search and exact methods using branch-and-bound techniques,[14]
- a certifiably optimal branch-and-bound approach,[15]
- a Bayesian formulation framework,[16]
- a certifiably optimal mixed-integer semidefinite branch-and-cut approach.[6]
The methodological and theoretical developments of sparse PCA, as well as its applications in scientific studies, are reviewed in a recent survey paper.[17]
Notes on Semidefinite Programming Relaxation
It has been proposed that sparse PCA can be approximated by semidefinite programming (SDP).[11] If one drops the rank constraint and relaxes the cardinality constraint by a 1-norm convex constraint, one gets a semidefinite programming relaxation, which can be solved efficiently in polynomial time:
$\max_{V}\; \operatorname{Tr}(\Sigma V) \quad \text{subject to} \quad \operatorname{Tr}(V) = 1,\ \mathbf{1}^\top |V| \mathbf{1} \le k,\ V \succeq 0$ (Eq. 3)
In the second constraint, $\mathbf{1}$ is a $p \times 1$ vector of ones, and $|V|$ is the matrix whose elements are the absolute values of the elements of $V$.
The optimal solution $V^\ast$ to the relaxed problem Eq. 3 is not guaranteed to have rank one. In that case, $V^\ast$ can be truncated to retain only its dominant eigenvector.
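For moderate $p$, the relaxation in Eq. 3 can be stated directly in a convex modelling tool. The following sketch (a small illustrative helper) uses CVXPY, assuming it and a semidefinite-capable solver such as SCS are installed, and then keeps the dominant eigenvector of the solution, as described above:

```python
import numpy as np
import cvxpy as cp

def sparse_pca_sdp(Sigma, k):
    """Solve the SDP relaxation Eq. 3 and return its value and the dominant eigenvector."""
    p = Sigma.shape[0]
    V = cp.Variable((p, p), symmetric=True)
    constraints = [
        V >> 0,                         # positive semidefinite
        cp.trace(V) == 1,
        cp.sum(cp.abs(V)) <= k,         # the 1-norm relaxation of the cardinality constraint
    ]
    problem = cp.Problem(cp.Maximize(cp.trace(Sigma @ V)), constraints)
    problem.solve()
    # the optimal V need not have rank one: truncate to its dominant eigenvector
    _, vecs = np.linalg.eigh(V.value)
    return problem.value, vecs[:, -1]

# toy usage
rng = np.random.default_rng(3)
A = rng.normal(size=(40, 6))
value, v_hat = sparse_pca_sdp(np.cov(A, rowvar=False), k=2)
```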
While the semidefinite program does not scale beyond $p = 300$ covariates, it has been shown that a second-order cone relaxation of the semidefinite relaxation is almost as tight and successfully solves problems with thousands of covariates.[18]
Computation of the LS SPCA solutions
In LS SPCA the optimal cardinality and the subset of variables forming each SPC must be determined by some optimality criterion. A reasonable approach is to choose the lowest cardinality for which an SPC explains a given percentage of the cumulative variance explained by the corresponding standard PC.
Merola (2015)[19] proposed a backward elimination algorithm for variable selection in USPCA and CSPCA, which can be slow for large datasets. With PSPCA the optimal subsets can be computed efficiently using standard algorithms for regression feature selection. One possible strategy for computing the LS SPCs is to use PSPCA for variable selection and then compute the SPCs with USPCA or CSPCA.
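One way to carry out the feature-selection step mentioned above (a sketch of one possible strategy, with an illustrative helper, not the backward elimination algorithm of the cited papers) is to regress the first PC of the current residual matrix on the variables with a greedy sparse regression method such as orthogonal matching pursuit, which fixes the cardinality directly:

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def select_subset(R, cardinality):
    """Greedily pick `cardinality` variables by regressing the first PC of R on its columns."""
    U, s, _ = np.linalg.svd(R, full_matrices=False)
    u = U[:, 0] * s[0]                                   # first PC (scores) of the residual matrix
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=cardinality)
    omp.fit(R, u)
    return np.flatnonzero(omp.coef_)                     # indices of the selected variables

# toy usage: the selected subset could then be passed to a USPCA or CSPCA routine
rng = np.random.default_rng(4)
X = rng.normal(size=(60, 12))
X = X - X.mean(axis=0)
subset = select_subset(X, cardinality=3)
```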
Applications
Financial Data Analysis
Suppose ordinary PCA is applied to a dataset where each input variable represents a different asset; it may generate principal components that are weighted combinations of all the assets. In contrast, sparse PCA would produce principal components that are weighted combinations of only a few input assets, so their meaning is easy to interpret. Furthermore, a trading strategy based on these principal components involves fewer assets and therefore lower transaction costs.
Biology
Consider a dataset where each input variable corresponds to a specific gene. Sparse PCA can produce a principal component that involves only a few genes, so researchers can focus on these specific genes for further analysis.
High-dimensional Hypothesis Testing
Contemporary datasets often have the number of input variables ($p$) comparable with or even much larger than the number of samples ($n$). It has been shown that if $p/n$ does not converge to zero, classical PCA is not consistent. In other words, if we let $k = p$ in Eq. 1, then the optimal value does not converge to the largest eigenvalue of the data population when the sample size $n \to \infty$, and the optimal solution does not converge to the direction of maximum variance. But sparse PCA can retain consistency even if $p \gg n$.[4]
The $k$-sparse largest eigenvalue (the optimal value of Eq. 1) can be used to discriminate an isometric model, where every direction has the same variance, from a spiked covariance model in the high-dimensional setting.[20] Consider a hypothesis test where the null hypothesis specifies that the data are generated from a multivariate normal distribution with mean 0 and covariance equal to the identity matrix, and the alternative hypothesis specifies that the data are generated from a spiked model with signal strength $\theta$:
$H_0\colon x \sim \mathcal{N}(0, I_p) \qquad \text{vs.} \qquad H_1\colon x \sim \mathcal{N}(0, I_p + \theta v v^\top),$
where $v$ has only $k$ non-zero coordinates. The largest $k$-sparse eigenvalue can discriminate the two hypotheses if and only if $\theta \ge \Theta\!\left(\sqrt{k \log p / n}\right)$.
Since computing the $k$-sparse eigenvalue is NP-hard, one can approximate it by the optimal value of the semidefinite programming relaxation (Eq. 3). In that case, the two hypotheses can be discriminated if $\theta \ge \Theta\!\left(\sqrt{k^2 \log p / n}\right)$. The additional $\sqrt{k}$ factor cannot be improved by any other polynomial-time algorithm if the planted clique conjecture holds.
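The detection problem can be illustrated with a small simulation (a hedged sketch with arbitrary toy parameters and an illustrative brute-force helper): data are drawn under the isometric null and under a spiked alternative, and the largest $k$-sparse eigenvalue of the sample covariance is compared between the two.

```python
import numpy as np
from itertools import combinations

def k_sparse_lambda_max(S, k):
    """Largest k-sparse eigenvalue by enumerating supports (toy sizes only)."""
    p = S.shape[0]
    return max(np.linalg.eigvalsh(S[np.ix_(list(c), list(c))])[-1]
               for c in combinations(range(p), k))

rng = np.random.default_rng(5)
n, p, k, theta = 200, 8, 2, 1.5            # arbitrary toy values
v = np.zeros(p)
v[:k] = 1.0 / np.sqrt(k)                   # k-sparse unit spike direction

X_null = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)
X_spiked = rng.multivariate_normal(np.zeros(p), np.eye(p) + theta * np.outer(v, v), size=n)

for name, X in [("null", X_null), ("spiked", X_spiked)]:
    S = np.cov(X, rowvar=False)
    print(name, round(k_sparse_lambda_max(S, k), 3))
```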
Software/source code
Conventional SPCA
- amanpg – R package for Sparse PCA using the Alternating Manifold Proximal Gradient Method[21]
- elasticnet – R package for Sparse Estimation and Sparse PCA using Elastic-Nets[22]
- nsprcomp – R package for sparse and/or non-negative PCA based on thresholded power iterations[23]
- scikit-learn – Python library for machine learning which contains Sparse PCA and other techniques in the decomposition module.[24]
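As an example of the software listed above, a minimal scikit-learn call might look as follows; note that SparsePCA controls sparsity through an l1 penalty (alpha) rather than by fixing a cardinality k.

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 20))            # stand-in data matrix

spca = SparsePCA(n_components=3, alpha=1.0, random_state=0)
T = spca.fit_transform(X)                 # 100 x 3 matrix of sparse components
loadings = spca.components_               # 3 x 20 matrix with many zero entries
print((loadings != 0).sum(axis=1))        # cardinality of each component
```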
Shortcoming of conventional SPCA computed on rank-deficient matrices
The different objective functions used in the two approaches lead to very different solutions.
Conventional SPCs are the PCs of subsets of highly correlated variables.[8] Instead, LS SPCs are combinations of variables that are unlikely to be highly correlated. Furthermore, conventional methods fail to identify the lowest-cardinality representation of the PCs of column-rank-deficient matrices, a task for which they were created.
An example of this shortcoming of conventional SPCA is given in Merola and Chen (2019).[5] Consider a matrix $X$ with 100 observations on five perfectly collinear variables defined as $x_j = \sqrt{j}\, z$, $j = 1, \ldots, 5$, for a single underlying variable $z$ with sample variance close to one.
The covariance matrix of these variables has rank one, and its only nonzero eigenvalue is equal to 15.2. The first PC explains all the variance and can be written in terms of any one of the variables as $t = \sqrt{15/j}\, x_j$, that is, as a cardinality-one component with a loading larger than one. Instead, conventional SPCA, which requires unit-norm loadings, deems the variables with larger variance to explain more variance than the other collinear variables, and only the full-cardinality PC to explain the maximum variance. This is illustrated in Table 1, which shows the optimal conventional SPCA results for each cardinality.
| Variable | Card. 1 | Card. 2 | Card. 3 | Card. 4 | Card. 5 |
|---|---|---|---|---|---|
| x1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.26 |
| x2 | 0.00 | 0.00 | 0.00 | 0.38 | 0.37 |
| x3 | 0.00 | 0.00 | 0.50 | 0.46 | 0.45 |
| x4 | 0.00 | 0.67 | 0.58 | 0.53 | 0.52 |
| x5 | 1.00 | 0.75 | 0.65 | 0.60 | 0.58 |
| norm2 | 5.1 | 9.1 | 12.1 | 14.1 | 15.2 |
| rel. norm2 | 0.33 | 0.60 | 0.80 | 0.93 | 1.00 |
| Var. exp. | 100% | 100% | 100% | 100% | 100% |
The results of conventional SPCA are even more bizarre when applied to the correlation matrix of the above dataset. The correlation matrix is a 5 x 5 matrix of ones, and the unit-norm scaled variables are identical. However, in conventional SPCA, linear combinations of larger numbers of these identical variables are deemed to explain more variance than combinations of fewer of them, as shown in Table 2.
| Cardinality | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| norm2 | 1 | 2 | 3 | 4 | 5 |
| rel. norm2 | 0.2 | 0.4 | 0.6 | 0.8 | 1.0 |
| Var. exp. | 100% | 100% | 100% | 100% | 100% |
Comparison of LS and conventional SPCA results
With due differences, the preference of conventional SPCA for correlated variables can also be observed on datasets in which several variables are highly correlated. As a general rule, conventional SPCA seeks subsets of highly correlated variables. This is not the case for LS SPCA, because the data-reconstruction objective discourages the inclusion of correlated variables in the subsets forming the SPCs.
As an example, Table 3[5] shows the results of USPCA, CSPCA and PSPCA applied to four column-rank-deficient matrices. For each of the first three SPCs, the table shows the cardinality and the cumulative variance that they explain, as a percentage of the cumulative variance explained by the corresponding PCs. The reduction in cardinality is striking, considering that over 96% of the variance, and in some cases 100% of it, is explained. The results for the three variants of LS SPCA are quite similar.
| Dataset | Rows x Cols | Method | Comp. 1 Card | Comp. 1 % CVexp | Comp. 2 Card | Comp. 2 % CVexp | Comp. 3 Card | Comp. 3 % CVexp |
|---|---|---|---|---|---|---|---|---|
| Prot. | 11 x 7466 | USPCA | 1 | 99.3% | 2 | 99.2% | 3 | 99.2% |
| Prot. | 11 x 7466 | CSPCA | 1 | 99.3% | 2 | 99.4% | 1 | 99.3% |
| Prot. | 11 x 7466 | PSPCA | 1 | 99.3% | 2 | 99.3% | 1 | 99.3% |
| Khanh | 88 x 2308 | USPCA | 6 | 96.5% | 7 | 96.5% | 4 | 96.4% |
| Khanh | 88 x 2308 | CSPCA | 6 | 96.5% | 7 | 96.5% | 4 | 96.4% |
| Khanh | 88 x 2308 | PSPCA | 6 | 96.4% | 6 | 96.2% | 4 | 96.4% |
| Rama. | 144 x 16063 | USPCA | 3 | 96.0% | 4 | 96.3% | 6 | 95.8% |
| Rama. | 144 x 16063 | CSPCA | 3 | 96.0% | 4 | 96.3% | 6 | 96.3% |
| Rama. | 144 x 16063 | PSPCA | 3 | 95.9% | 4 | 96.3% | 6 | 96.2% |
| Phon. | 257 x 4509 | USPCA | 1 | 100.0% | 4 | 99.9% | 13 | 99.9% |
| Phon. | 257 x 4509 | CSPCA | 1 | 100.0% | 4 | 99.9% | 13 | 99.9% |
| Phon. | 257 x 4509 | PSPCA | 1 | 100.0% | 4 | 99.9% | 13 | 99.9% |
Figure 1[5] shows the norm, the relative cumulative variance explained, and the correlation with the first PC for the first SPCs obtained with conventional SPCA and PSPCA on four datasets. The first two datasets are full column-rank and the last two are not. More details on these datasets can be found in Merola and Chen (2019).[5] It is easy to see how the PSPCs explain more variance and reach a correlation close to one with the PC at a much smaller cardinality, despite the much larger norm of the conventional SPCA components.
As shown in Table 3, the datasets Khanh and Rama have far fewer rows than columns. For this reason, the PSPCs are identical to the PCs when their cardinality is equal to the rank of the matrices. Instead, the conventional SPCs are computed with a cardinality much larger than the rank of the matrices (hence including linearly dependent variables) without reaching the performance of the LS SPCs.
Notes
- The term loadings is inappropriately used for these vectors, which should instead be called coefficients. The name comes from factor analysis, where the loadings designate the values generating the common covariance matrix. Since in standard PCA the loadings are equal to the coefficients, the term loadings has also been used for the coefficients. This is unfortunate, because in SPCA the coefficients are not equal to the loadings.
References
- Hotelling, H. (1933). "Analysis of a complex of statistical variables into principal components". Journal of Educational Psychology. 24 (6): 417–441. doi:10.1037/h0071325. ISSN 1939-2176.
- Merola, Giovanni Maria (2015). "Least Squares Sparse Principal Component Analysis: A Backward Elimination Approach to Attain Large Loadings". Australian & New Zealand Journal of Statistics. 57 (3): 391–429. doi:10.1111/anzs.12128.
- Pearson, Karl (1901). "On lines and planes of closest fit to systems of points in space". The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science. 2 (11): 559–572. doi:10.1080/14786440109462720. ISSN 1941-5982.
- Iain M Johnstone; Arthur Yu Lu (2009). "On Consistency and Sparsity for Principal Components Analysis in High Dimensions". Journal of the American Statistical Association. 104 (486): 682–693. doi:10.1198/jasa.2009.0121. PMC 2898454. PMID 20617121.
- Merola, Giovanni Maria; Chen, Gemai (2019-09-01). "Projection sparse principal component analysis: An efficient least squares method". Journal of Multivariate Analysis. 173: 366–382. arXiv:1612.00939. doi:10.1016/j.jmva.2019.04.001. ISSN 0047-259X.
- Dimitris Bertsimas; Ryan Cory-Wright; Jean Pauphilet (2020). "Solving Large-Scale Sparse PCA to Certifiable (Near) Optimality". arXiv:2005.05195 [math.OC].
- Andreas M. Tillmann; Marc E. Pfetsch (2013). "The Computational Complexity of the Restricted Isometry Property, the Nullspace Property, and Related Concepts in Compressed Sensing". IEEE Transactions on Information Theory. 60 (2): 1248–1259. arXiv:1205.2081. CiteSeerX 10.1.1.760.2559. doi:10.1109/TIT.2013.2290112. S2CID 2788088.
- Moghaddam, Baback; Weiss, Yair; Avidan, Shai (2007-09-07). Schölkopf, Bernhard; Platt, John; Hofmann, Thomas (eds.). Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Conference. The MIT Press. pp. 915–922. doi:10.7551/mitpress/7503.001.0001. ISBN 978-0-262-25691-9.
- Hui Zou; Trevor Hastie; Robert Tibshirani (2006). "Sparse principal component analysis" (PDF). Journal of Computational and Graphical Statistics. 15 (2): 262–286. CiteSeerX 10.1.1.62.580. doi:10.1198/106186006x113430. S2CID 5730904.
- Fan Chen; Karl Rohe (2021). "A New Basis for Sparse Principal Component Analysis". arXiv:2007.00596 [stat.ML].
- Alexandre d’Aspremont; Laurent El Ghaoui; Michael I. Jordan; Gert R. G. Lanckriet (2007). "A Direct Formulation for Sparse PCA Using Semidefinite Programming" (PDF). SIAM Review. 49 (3): 434–448. arXiv:cs/0406021. doi:10.1137/050645506. S2CID 5490061.
- Michel Journee; Yurii Nesterov; Peter Richtarik; Rodolphe Sepulchre (2010). "Generalized Power Method for Sparse Principal Component Analysis" (PDF). Journal of Machine Learning Research. 11: 517–553. arXiv:0811.4724. Bibcode:2008arXiv0811.4724J. CORE Discussion Paper 2008/70.
- Peter Richtarik; Majid Jahani; S. Damla Ahipasaoglu; Martin Takac (2021). "Alternating Maximization: Unifying Framework for 8 Sparse PCA Formulations and Efficient Parallel Codes". Optimization and Engineering. 22 (3): 1493–1519. arXiv:1212.4137. doi:10.1007/s11081-020-09562-3. S2CID 2549610.
- Baback Moghaddam; Yair Weiss; Shai Avidan (2005). "Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms" (PDF). Advances in Neural Information Processing Systems. Vol. 18. MIT Press.
- Lauren Berk; Dimitris Bertsimas (2019). "Certifiably optimal sparse principal component analysis". Mathematical Programming Computation. Springer. 11 (3): 381–420. doi:10.1007/s12532-018-0153-6. hdl:1721.1/131566. S2CID 126998398.
- Yue Guan; Jennifer Dy (2009). "Sparse Probabilistic Principal Component Analysis" (PDF). Journal of Machine Learning Research Workshop and Conference Proceedings. 5: 185.
- Hui Zou; Lingzhou Xue (2018). "A Selective Overview of Sparse Principal Component Analysis". Proceedings of the IEEE. 106 (8): 1311–1320. doi:10.1109/jproc.2018.2846588.
- Dimitris Bertsimas; Ryan Cory-Wright (2020). "On polyhedral and second-order cone decompositions of semidefinite optimization problems". Operations Research Letters. Elsevier. 48 (1): 78–85. arXiv:1910.03143. doi:10.1016/j.orl.2019.12.003.
- Merola, Giovanni Maria (2015). "Least Squares Sparse Principal Component Analysis: A Backward Elimination Approach to Attain Large Loadings". Australian & New Zealand Journal of Statistics. 57 (3): 391–429. doi:10.1111/anzs.12128.
- Quentin Berthet; Philippe Rigollet (2013). "Optimal Detection of Sparse Principal Components in High Dimension". Annals of Statistics. 41 (1): 1780–1815. arXiv:1202.5070. doi:10.1214/13-aos1127. S2CID 7162068.
- https://cran.r-project.org/web/packages/amanpg/index.html
- https://cran.r-project.org/web/packages/elasticnet/index.html
- https://cran.r-project.org/web/packages/nsprcomp/index.html
- http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.SparsePCA.html