Keywords: Interpretability; Computer science; Machine learning; Artificial intelligence; Scalability; Feature engineering; Gaussian process; Regression; Bayesian probability; Python (programming language); Linear regression; Deep learning; Data mining; Gaussian distribution; Mathematics; Statistics; Physics; Quantum mechanics; Database; Operating system
Authors
Jonathan Parkinson, Wei Wang
Identifier
DOI: 10.1021/acs.jcim.3c00601
Abstract
A Gaussian process (GP) is a Bayesian model that offers several advantages for regression tasks in machine learning, such as reliable quantitation of uncertainty and improved interpretability. The adoption of GPs has been hindered by their excessive computational cost and by the difficulty of adapting them to analyze sequences (e.g., amino acid sequences) and graphs (e.g., small molecules). In this study, we introduce a group of random-feature-approximated kernels for sequences and graphs that scale linearly with both the size of the training set and the size of the sequences or graphs. We incorporate these new kernels into our new Python library for GP regression, xGPR, and develop an efficient, scalable algorithm for fitting GPs equipped with these kernels to large datasets. We compare the performance of xGPR on 17 different benchmarks against both standard and state-of-the-art deep learning models and find that GP regression achieves highly competitive accuracy on these tasks while providing well-calibrated uncertainty quantitation and improved interpretability. Finally, in a simple experiment, we illustrate how xGPR may be used as part of an active learning strategy to engineer a protein with a desired property in an automated way, without human intervention.