Authors
Yinsheng Zhang, Lu Jin, Fei Guo, Xiaofeng Ni, Yaju Zhao, Ying Cheng, Haiyan Wang
Abstract
Spectroscopic profiling data used in analytical chemistry can be very high-dimensional. Dimensionality reduction (DR) is an effective way to handle the potential "curse of dimensionality" problem. Many existing DR algorithms can be formulated as matrix factorization (MF) problems, which decompose the original data matrix X into the product of a low-dimensional matrix W and a dictionary matrix H. First, this paper provides a theoretical reformulation of relevant DR algorithms under a unified MF perspective, including PCA (principal component analysis), NMF (non-negative matrix factorization), LAE (linear autoencoder), RP (random projection), SRP (sparse random projection), VQ (vector quantization), AA (archetypal analysis), and ICA (independent component analysis). Based on this perspective, an open-source toolkit has been developed that integrates all of the above algorithms behind a unified API. Second, we conducted a comparative study of MF-based DR algorithms. In a case study of TOF (time-of-flight) mass spectra, each of the eight algorithms extracted three components from the original 27,619 features. The results are compared using a set of DR quality metrics, including reconstruction error, pairwise distance/ranking property, computational cost, and local and global structure preservation. Finally, based on the case study results, we summarize guidelines for DR algorithm selection. (1) For reconstruction quality, choose ICA. In the case study, ICA, PCA, and NMF all have high reconstruction quality (reconstruction error < 2%), with ICA being the best. (2) To keep the pairwise topological structure, choose PCA. PCA best preserves the pairwise distance/ranking property. (3) For edge computing and IoT scenarios, choose RP or SRP if reconstruction is not required and the JL-lemma (Johnson-Lindenstrauss lemma) condition is met. The RP family has the best computational performance in the experiment, about 10 to 100 times faster than the other algorithms.
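To make the MF view concrete, the following is a minimal sketch (not the authors' toolkit or its API) of two of the listed algorithms written as X ≈ WH with NumPy. The function names `pca_factorize`, `rp_factorize`, and `relative_error`, and the synthetic low-rank data matrix, are assumptions introduced only for illustration. The error measure here is a relative Frobenius norm, which may differ from the paper's reconstruction-error definition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a spectral data matrix (assumption, not the TOF data):
# m samples x n features, approximately rank k plus small noise.
m, n, k = 20, 500, 3
X = rng.random((m, k)) @ rng.random((k, n)) + 0.01 * rng.standard_normal((m, n))


def pca_factorize(X, k):
    """PCA as MF: (X - mean) ~= W @ H, H holding the top-k principal axes."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = U[:, :k] * s[:k]   # low-dimensional scores, shape (m, k)
    H = Vt[:k]             # dictionary (principal axes), shape (k, n)
    return W, H, mean


def rp_factorize(X, k, rng):
    """Gaussian random projection: W = X @ R, with no reconstruction step."""
    R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    return X @ R, R


def relative_error(X, X_hat):
    """Relative Frobenius-norm reconstruction error."""
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)


W, H, mean = pca_factorize(X, k)
pca_err = relative_error(X, W @ H + mean)

W_rp, R = rp_factorize(X, k, rng)

print(f"PCA: W {W.shape}, H {H.shape}, relative error {pca_err:.4f}")
print(f"RP:  W {W_rp.shape} (projection only)")
```

PCA needs an SVD of the data, whereas random projection draws R without looking at X and costs only one matrix product. That difference is consistent with the speed advantage reported for the RP family, at the price of having no reconstruction step.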