词汇分析
计算机科学
人工智能
算法
子串
化学信息学
机器学习
自然语言处理
集合(抽象数据类型)
生物信息学
程序设计语言
生物
作者
Xinhao Li,Denis Fourches
标识
DOI:10.1021/acs.jcim.0c01127
摘要
Simplified molecular input line entry system (SMILES)-based deep learning models are slowly emerging as an important research topic in cheminformatics. In this study, we introduce SMILES pair encoding (SPE), a data-driven tokenization algorithm. SPE first learns a vocabulary of high-frequency SMILES substrings from a large chemical dataset (e.g., ChEMBL) and then tokenizes SMILES based on the learned vocabulary for the actual training of deep learning models. SPE augments the widely used atom-level tokenization by adding human-readable and chemically explainable SMILES substrings as tokens. Case studies show that SPE can achieve superior performances on both molecular generation and quantitative structure–activity relationship (QSAR) prediction tasks. In particular, the SPE-based generative models outperformed the atom-level tokenization model in the aspects of novelty, diversity, and ability to resemble the training set distribution. The performance of SPE-based QSAR prediction models were evaluated using 24 benchmark datasets where SPE consistently either did match or outperform atom-level and k-mer tokenization. Therefore, SPE could be a promising tokenization method for SMILES-based deep learning models. An open-source Python package SmilesPE was developed to implement this algorithm and is now freely available at https://github.com/XinhaoLi74/SmilesPE.
科研通智能强力驱动
Strongly Powered by AbleSci AI