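Notation
Computer science
Natural language processing
Artificial intelligence
Python (programming language)
Sentence
Embedding
Programming language
Linguistics
Philosophy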
Authors
Rahul Sharma, Ehsan Saghapour, Jake Y. Chen
Source
Journal: iScience
[Elsevier]
Date: 2024-03-01
Volume (issue): 27 (3): 109127
Citations: 1
Identifier
DOI: 10.1016/j.isci.2024.109127
Abstract
Summary
NLP is a well-established field in ML for developing language models that capture the sequence of words in a sentence. Similarly, drug molecule structures can also be represented as sequences using the SMILES notation. However, unlike natural language texts, special characters in drug SMILES have specific meanings and cannot be ignored. We introduce a novel NLP-based method that extracts interpretable sequences and essential features from drug SMILES notation using N-grams. Our method compares these features to Morgan fingerprint bit-vectors using UMAP-based embedding, and we validate its effectiveness through two personalized drug screening (PSD) case studies. Our NLP-based features are sparse and, when combined with gene expressions and disease phenotype features, produce better ML models for PSD. This approach provides a new way to analyze drug molecule structures represented as SMILES notation, which can help accelerate drug discovery efforts. We have also made our method accessible through a Python library.
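To make the N-gram idea concrete, here is a minimal sketch in Python of how overlapping N-grams might be pulled from a SMILES string while keeping special characters such as `(`, `=`, and ring-closure digits. It is not the authors' library or its API: the function names (`tokenize_smiles`, `smiles_ngrams`, `ngram_counts`) and the simplified tokenizer regex are assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical, simplified SMILES tokenizer (an assumption, not the paper's):
# bracket atoms, the two-letter halogens Br and Cl, two-digit ring closures
# written as %nn, and otherwise any single character. Special characters
# such as '(', ')', '=', '#' are kept as tokens rather than discarded.
_TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into a list of tokens."""
    return _TOKEN_RE.findall(smiles)


def smiles_ngrams(smiles, n):
    """Return the overlapping n-grams of tokens, each joined into one string."""
    tokens = tokenize_smiles(smiles)
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(smiles, n_values=(1, 2, 3)):
    """Count n-grams of several sizes; most counts are zero across a corpus,
    which is why such feature vectors tend to be sparse."""
    counts = Counter()
    for n in n_values:
        counts.update(smiles_ngrams(smiles, n))
    return counts


if __name__ == "__main__":
    aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
    for gram, count in ngram_counts(aspirin, n_values=(2,)).most_common(5):
        print(gram, count)
```

Counts like these could be stacked into a sparse drug-by-N-gram matrix and joined with other features, in the spirit of the abstract; how the published method selects and weights N-grams is not specified here.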