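Computer science
Outlier
Anomaly detection
Data mining
Hyperparameter
False positive paradox
Artificial intelligence
Statistical hypothesis testing
False positives and false negatives
Ensemble learning
Randomness
Bayesian probability
Machine learning
Pattern recognition (psychology)
Statistics
Mathematics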
Authors
Jinze Huang,Yang Zhao,Bo Meng,Lu Ao,Yaoguang Wei,Lianhua Dong,Xiang Fang,Dong An,Xinhua Dai
Abstract
Quality control in quantitative proteomics is a persistent challenge, particularly the identification and handling of outliers. Unsupervised learning models, which rely on data structure rather than predefined labels, offer a potential solution, but without clear labels their effectiveness can be compromised. Single models are sensitive to the randomness of parameters and initialization, which can produce a high rate of false positives. Ensemble models, in contrast, can mitigate the impact of this randomness and help detect true outliers accurately. We therefore introduce SEAOP, a Python toolbox that applies an ensemble mechanism by integrating multi-round data management and a statistics-based decision pipeline with multiple models. Specifically, SEAOP uses multi-round resampling to create diverse sub-data spaces and applies outlier detection methods to identify candidate outliers in each space. Candidates are then aggregated into confirmed outliers via a chi-square test at a 95% confidence level, to ensure the precision of the unsupervised approaches. SEAOP also provides a visualization strategy designed to display the distribution of outlier and non-outlier samples intuitively. Optimal hyperparameter models of SEAOP for outlier detection were identified using a gradient-simulated standard dataset and the Mann–Kendall trend test. The performance of the SEAOP toolbox was evaluated on three experimental datasets, confirming its reliability and accuracy in handling quantitative proteomics data.
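The resample, detect, and aggregate idea described in the abstract can be illustrated with a minimal Python sketch. This is not SEAOP's code or API. Everything here is an assumption made for illustration: the function names (`robust_z_flags`, `chi2_sf_1df`, `ensemble_outliers`), the one-dimensional input, the median/MAD detector standing in for the toolbox's actual models, and the particular form of the chi-square test (a 1-degree-of-freedom goodness-of-fit test of each sample's flag count against the pooled flag rate).

```python
import math
import random
import statistics


def robust_z_flags(values, threshold=3.5):
    """Stand-in base detector (assumption): flag values whose modified
    z-score, based on the median and MAD, exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


def chi2_sf_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def ensemble_outliers(values, rounds=200, fraction=0.7, alpha=0.05, seed=0):
    """Return indices confirmed as outliers across many resampling rounds."""
    rng = random.Random(seed)
    n = len(values)
    size = min(n, max(3, int(fraction * n)))
    sampled = [0] * n  # how often each sample was drawn
    flagged = [0] * n  # how often it was a candidate outlier when drawn

    # Multi-round resampling: each round is one "sub-data space".
    for _ in range(rounds):
        idx = rng.sample(range(n), size)
        flags = robust_z_flags([values[i] for i in idx])
        for i, is_candidate in zip(idx, flags):
            sampled[i] += 1
            flagged[i] += int(is_candidate)

    # Null hypothesis (assumption): every sample is flagged at the pooled rate.
    pooled_rate = sum(flagged) / sum(sampled)
    if pooled_rate in (0.0, 1.0):
        return []

    confirmed = []
    for i in range(n):
        if sampled[i] == 0:
            continue
        exp_flag = sampled[i] * pooled_rate
        exp_clean = sampled[i] * (1.0 - pooled_rate)
        obs_flag = flagged[i]
        obs_clean = sampled[i] - flagged[i]
        stat = ((obs_flag - exp_flag) ** 2 / exp_flag
                + (obs_clean - exp_clean) ** 2 / exp_clean)
        # Confirm only samples flagged significantly MORE often than expected.
        if obs_flag > exp_flag and chi2_sf_1df(stat) < alpha:
            confirmed.append(i)
    return confirmed
```

Calling `ensemble_outliers(values)` on a list of floats returns the indices whose candidate counts are significantly above the pooled rate at `alpha=0.05`, which corresponds to the 95% confidence level mentioned above. The point of the aggregation step is that a sample flagged in only a few rounds by chance is not confirmed, which is how an ensemble can reduce the false positives of a single run.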