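Index terms
Context (archaeology)
Artificial intelligence
Representation (politics)
Computer science
Genetics
Computational biology
Sequence (biology)
Pathogenicity
Machine learning
Biology
Natural language processing
Gene
Single-nucleotide polymorphism
Paleontology
Microbiology
Politics
Genotype
Law
Political science
Authors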
Xiao Fan,Hongbing Pan,Alan Tian,Wendy K. Chung,Yufeng Shen
Identifier
DOI:10.1101/2022.08.30.505840
Abstract
Inframe insertion and deletion variants (indels) alter protein sequence and length. Accurate pathogenicity predictions are important in genetic studies of human diseases. Indel interpretation is challenging because few known pathogenic variants are available for training. Existing methods largely use manually encoded features, including conservation, protein structure and function, and allele frequency. Recent advances in deep learning modeling of protein sequences and structures provide an opportunity to improve the representation of salient features based on large numbers of protein sequences. We developed a new pathogenicity predictor for SHort Inframe iNsertion and dEletion (SHINE). SHINE uses pre-trained protein language models to construct a latent representation of an indel and its protein context from protein sequences and multiple protein sequence alignments, and feeds the latent representation into supervised machine learning models for pathogenicity prediction. We curated training data from ClinVar and gnomAD, and created two test datasets from different sources. SHINE achieved better prediction performance than existing methods for both deletion and insertion variants in these two test datasets. Our work suggests that unsupervised protein language models can provide valuable information about proteins, and new methods based on these models can improve variant interpretation in genetic analyses.
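The flow described in the abstract (apply an indel to a protein sequence, represent the indel and its local context as a vector, and pass that vector to a supervised model) can be illustrated with a minimal Python sketch. This is not the SHINE implementation: `toy_embed` is a hypothetical stand-in that uses amino-acid composition where SHINE uses a pre-trained protein language model, and the function names, the window size `flank`, and the example sequence are all assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def apply_deletion(seq: str, start: int, length: int) -> str:
    """Remove `length` residues starting at 0-based index `start`."""
    if start < 0 or length <= 0 or start + length > len(seq):
        raise ValueError("deletion out of range")
    return seq[:start] + seq[start + length:]


def apply_insertion(seq: str, pos: int, inserted: str) -> str:
    """Insert `inserted` residues before 0-based index `pos`."""
    if not 0 <= pos <= len(seq) or not inserted:
        raise ValueError("insertion out of range or empty")
    return seq[:pos] + inserted + seq[pos:]


def context_window(seq: str, center: int, flank: int) -> str:
    """Return up to `flank` residues on each side of `center`."""
    lo = max(0, center - flank)
    hi = min(len(seq), center + flank)
    return seq[lo:hi]


def toy_embed(seq: str) -> list[float]:
    """ASSUMPTION: stand-in for a protein language model embedding.

    Returns the 20-dimensional amino-acid composition of `seq`.
    """
    counts = Counter(seq)
    n = max(len(seq), 1)
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]


def indel_features(ref: str, alt: str, pos: int, flank: int = 5) -> list[float]:
    """Concatenate reference and variant context embeddings (40 values)."""
    ref_vec = toy_embed(context_window(ref, pos, flank))
    alt_vec = toy_embed(context_window(alt, pos, flank))
    return ref_vec + alt_vec


if __name__ == "__main__":
    ref = "MKTAYIAKQR"                 # hypothetical reference sequence
    alt = apply_deletion(ref, 3, 2)    # delete residues "AY"
    features = indel_features(ref, alt, pos=3)
    print(alt, len(features))
```

The resulting feature vector is the kind of input a supervised classifier would be trained on, with labels drawn from curated pathogenic and benign variants; the choice of classifier is left open here because the abstract does not specify it.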