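Computer science
Function (biology)
Artificial intelligence
Sequence (biology)
Natural language processing
Machine learning
Biology
Genetics
Evolutionary biology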
Authors
Mai Ha Vu, Rahmad Akbar, Philippe A. Robert, Bartłomiej Świątczak, Geir Kjetil Sandve, Victor Greiff, Dag Haug
Identifier
DOI: 10.1038/s42256-023-00637-1
Abstract
Deep neural-network-based language models (LMs) are increasingly applied to large-scale protein sequence data to predict protein function. However, being largely black-box models and thus challenging to interpret, current protein LM approaches do not contribute to a fundamental understanding of sequence–function mappings, hindering rule-based biotherapeutic drug development. We argue that guidance drawn from linguistics, a field specialized in analytical rule extraction from natural language data, can aid with building more interpretable protein LMs that are more likely to learn relevant domain-specific rules. Differences between protein sequence data and linguistic sequence data require the integration of more domain-specific knowledge in protein LMs compared with natural language LMs. Here, we provide a linguistics-based roadmap for protein LM pipeline choices with regard to training data, tokenization, token embedding, sequence embedding and model interpretation. Incorporating linguistic ideas into protein LMs enables the development of next-generation interpretable machine learning models with the potential of uncovering the biological mechanisms underlying sequence–function relationships.
Language models trained on proteins can help to predict functions from sequences but provide little insight into the underlying mechanisms. Vu and colleagues explain how extracting the underlying rules from a protein language model can make such models interpretable and help explain biological mechanisms.