Computer science
Transformer
Matching (statistics)
Encoder
Relevance (law)
Focus (optics)
Artificial intelligence
Language model
Task (project management)
Context (archaeology)
Natural language processing
Machine learning
Data mining
Operating system
Authors
Hongzhi Wang, Shihao Jiang, Hua Zhang, Yifan Wu, Jiankai Shi, Hongzhi Wang, Guojun Dai
Identifier
DOI:10.1016/j.knosys.2022.109033
Abstract
Entity matching (EM) aims to identify whether two records refer to the same underlying real-world entity. Traditional entity matching methods mainly focus on structured data, where attribute values are short and atomic. Recently, there has been an increasing demand for matching textual records, such as product descriptions that correspond to long spans of text, which challenges these methods. Although a few deep learning (DL) solutions have been proposed, they tend to apply DL techniques directly and treat EM as a generic NLP task without addressing the unique demands of the EM task. Thus, the performance of these DL-based solutions is still far from satisfactory. In this paper, we present JointMatcher, a novel EM method built on pre-trained Transformer-based language models, so that the generated features of the textual records capture contextual information. We observe that paying more attention to the similar segments and the number-containing segments of a record pair is crucial for accurate matching. To integrate these highly contextualized features while concentrating attention on the similar segments and the number-containing segments, JointMatcher is equipped with a relevance-aware encoder and a numerically-aware encoder. Extensive experiments on structured and real-world textual datasets demonstrate that JointMatcher outperforms the previous state-of-the-art (SOTA) results without injecting any domain knowledge when small or medium-sized training sets are used.
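The core intuition, that segments shared by both records and segments containing numbers deserve extra attention, can be sketched with a toy, dependency-free example. This is a minimal illustration of the weighting idea only, not the paper's method: the function names (`attention_weights`, `toy_match_score`), the constant boosts, and the weighted-overlap score are all invented here as stand-ins for JointMatcher's learned relevance-aware and numerically-aware encoders and its classifier head.

```python
import re

def segment(record):
    # Toy tokenizer: split a record into lowercase word-level segments.
    return record.lower().split()

def attention_weights(left, right):
    # Illustrative weighting: upweight segments of `left` that
    # (a) also appear in `right` (relevance-aware idea), or
    # (b) contain a digit (numerically-aware idea).
    # The +1.0 boosts are arbitrary constants, not learned attention.
    l_toks, r_toks = segment(left), segment(right)
    shared = set(l_toks) & set(r_toks)
    weights = {}
    for tok in l_toks:
        w = 1.0
        if tok in shared:
            w += 1.0            # relevance boost
        if re.search(r"\d", tok):
            w += 1.0            # numeric boost
        weights[tok] = w
    return weights

def toy_match_score(left, right):
    # Weighted Jaccard-style overlap as a stand-in for the classifier.
    lw = attention_weights(left, right)
    rw = attention_weights(right, left)
    inter = sum(min(lw[t], rw[t]) for t in lw.keys() & rw.keys())
    union = sum(lw.values()) + sum(rw.values()) - inter
    return inter / union if union else 0.0
```

Under this scheme a matching product pair such as `"iphone 13 128gb"` vs. `"apple iphone 13 128gb"` scores far higher than a non-matching pair, because the shared and number-containing segments (`13`, `128gb`) carry double weight in the overlap.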