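Computer science
Property (philosophy)
Identification (biology)
Sociology of scientific knowledge
Word (group theory)
Data science
Knowledge extraction
Point (geometry)
Table (database)
Scientific literature
Information retrieval
Natural language processing
Artificial intelligence
Data mining
Linguistics
Epistemology
Biology
Botany
Philosophy
Paleontology
Mathematics
Geometry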
Authors
Vahe Tshitoyan, John Dagdelen, Leigh Weston, Alexander Dunn, Ziqin Rong, Olga Kononova, Kristin A. Persson, Gerbrand Ceder, Anubhav Jain
Source
Journal: Nature
[Springer Nature]
Date: 2019-07-01
Volume (issue): 571 (7763): 95-98
Citations: 784
Identifiers
DOI:10.1038/s41586-019-1335-8
Abstract
The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases [1,2], which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing [3–10], which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings [11–13] (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
Natural language processing algorithms applied to three million materials science abstracts uncover relationships between words, material compositions and properties, and predict potential new thermoelectric materials.
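As a rough illustration of how word embeddings could be used to recommend materials for an application, the sketch below ranks candidate material names by cosine similarity to an application word. The abstract does not specify the similarity measure or the ranking procedure, so cosine similarity is an assumption here, and the vectors, the `toy_embeddings` dictionary, and the `rank_materials` helper are all hypothetical values and names invented for this example; they are not taken from the paper's trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings, invented for illustration only.
# Real embeddings would be learned from text and have far more dimensions.
toy_embeddings = {
    "thermoelectric": [0.9, 0.1, 0.2],
    "Bi2Te3": [0.8, 0.2, 0.1],
    "PbTe": [0.7, 0.1, 0.3],
    "NaCl": [0.1, 0.9, 0.2],
    "SiO2": [0.2, 0.3, 0.9],
}


def rank_materials(target, candidates, embeddings):
    """Sort candidate words by similarity to the target word, best first."""
    target_vec = embeddings[target]
    scored = [(name, cosine(embeddings[name], target_vec)) for name in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_materials(
    "thermoelectric", ["NaCl", "Bi2Te3", "SiO2", "PbTe"], toy_embeddings
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these made-up vectors, the two tellurides rank above the other candidates only because their vectors were chosen to point in nearly the same direction as the application word; in a trained model, that closeness would instead have to emerge from how the words co-occur in the literature.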