Keywords
Vocabulary
Computer science
Context
Graph
Node
Artificial neural network
Natural language processing
Theoretical computer science
Artificial intelligence
Linguistics
Physics
Biology
Paleontology
Philosophy
Quantum mechanics
Authors
Jun Xia, Chengshuai Zhao, Bozhen Hu, Zhangyang Gao, Cheng Tan, Yue Liu, Siyuan Li, Stan Z. Li
Identifier
DOI: 10.26434/chemrxiv-2023-dngg4
Abstract
Recent years have witnessed the prosperity of pre-training graph neural networks (GNNs) for molecules. Typically, atom types, used as node attributes, are randomly masked and GNNs are then trained to predict the masked types, as in AttrMask (Hu et al., 2020), following the Masked Language Modeling (MLM) task of BERT (Devlin et al., 2019). However, unlike MLM, where the vocabulary is large, AttrMask pre-training does not learn informative molecular representations because the atom 'vocabulary' is small and unbalanced. To address this problem, we propose a variant of VQ-VAE (van den Oord et al., 2017) as a context-aware tokenizer that encodes atom attributes into chemically meaningful discrete codes. This enlarges the atom vocabulary and mitigates the quantitative divergence between dominant atoms (e.g., carbon) and rare atoms (e.g., phosphorus). With the enlarged atom 'vocabulary', we propose a novel node-level pre-training task, dubbed Masked Atoms Modeling (MAM), which randomly masks some of the discrete codes and pre-trains GNNs to predict them. MAM also mitigates another issue of AttrMask, namely negative transfer, and can be easily combined with various pre-training tasks to improve their performance. Furthermore, we propose triplet masked contrastive learning (TMCL) for graph-level pre-training to model the heterogeneous semantic similarity between molecules for effective molecule retrieval. MAM and TMCL constitute a novel pre-training framework, Mole-BERT, which can match or outperform state-of-the-art methods in a fully data-driven manner. We release the code at https://github.com/junxia97/Mole-BERT.
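To make the masked-code idea in the abstract concrete, below is a minimal, self-contained PyTorch sketch of the two moving parts it describes: a VQ-VAE-style tokenizer that assigns each atom a discrete code from an enlarged codebook, and a masked-atoms objective that hides some atoms and asks a model to predict their codes. Everything here is an illustrative assumption rather than the released Mole-BERT implementation: the MLP stand-in for the GNN encoder, the codebook size of 512, the 15% mask ratio, and the names AtomTokenizer, MaskedAtomModel, and mam_loss are all hypothetical; see the linked repository for the authors' code.

```python
# Illustrative sketch only (not the authors' code): a VQ-VAE-style atom tokenizer
# plus a masked-code prediction objective in the spirit of Masked Atoms Modeling.
# The GNN encoder is replaced by a plain MLP placeholder; codebook size, mask
# ratio, and feature dimensions are arbitrary assumptions for the example.
import torch
import torch.nn as nn
import torch.nn.functional as F

CODEBOOK_SIZE = 512   # assumed enlarged atom "vocabulary" (much larger than ~100 atom types)
HIDDEN_DIM = 64
MASK_RATIO = 0.15

class AtomTokenizer(nn.Module):
    """Maps per-atom features to discrete codes via nearest-codebook lookup."""
    def __init__(self, in_dim, hidden_dim=HIDDEN_DIM, codebook_size=CODEBOOK_SIZE):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                     nn.Linear(hidden_dim, hidden_dim))
        self.codebook = nn.Embedding(codebook_size, hidden_dim)

    def forward(self, atom_feats):
        z = self.encoder(atom_feats)                  # (num_atoms, hidden)
        dist = torch.cdist(z, self.codebook.weight)   # distance to every codebook entry
        return dist.argmin(dim=-1)                    # discrete atom "tokens"

class MaskedAtomModel(nn.Module):
    """Placeholder encoder + head that predicts the discrete code of each atom."""
    def __init__(self, in_dim, hidden_dim=HIDDEN_DIM, codebook_size=CODEBOOK_SIZE):
        super().__init__()
        self.gnn = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim))  # stand-in for a GNN
        self.head = nn.Linear(hidden_dim, codebook_size)

    def forward(self, atom_feats):
        return self.head(self.gnn(atom_feats))        # (num_atoms, codebook_size) logits

def mam_loss(model, tokenizer, atom_feats, mask_ratio=MASK_RATIO):
    """Mask a fraction of atoms and predict their discrete codes (cross-entropy)."""
    with torch.no_grad():
        codes = tokenizer(atom_feats)                 # targets from the (assumed fixed) tokenizer
    mask = torch.rand(atom_feats.size(0)) < mask_ratio
    if not mask.any():
        mask[0] = True                                # ensure at least one atom is masked
    corrupted = atom_feats.clone()
    corrupted[mask] = 0.0                             # crude "masking" of input features
    logits = model(corrupted)
    return F.cross_entropy(logits[mask], codes[mask])

if __name__ == "__main__":
    feats = torch.randn(30, 16)                       # 30 atoms with 16 input features each
    tok, mam = AtomTokenizer(16), MaskedAtomModel(16)
    print("MAM loss:", mam_loss(mam, tok, feats).item())
```

The sketch only illustrates the training signal: a context-aware codebook supplies per-atom targets drawn from a vocabulary far larger than the raw set of atom types, and the prediction head is trained with cross-entropy on the masked positions only, analogous to BERT-style MLM.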