Computer science
Benchmark (surveying)
Generalization
Machine learning
Artificial intelligence
Ensemble learning
Matching (statistics)
Task (project management)
Feature (linguistics)
Test set
Training set
Set (abstract data type)
Binary classification
Ensemble forecasting
Test data
Pattern recognition (psychology)
Statistics
Mathematics
Support vector machine
Mathematical analysis
Linguistics
Philosophy
Management
Geodesy
Economy
Programming language
Geography
Authors
Huahua Ding, Chaofan Dai, Yahui Wu, Ma Wei, Haohao Zhou
Identifier
DOI:10.1016/j.knosys.2024.111708
Abstract
Entity Matching (EM) aims to determine whether records in two datasets refer to the same real-world entity. Existing work often uses Pre-trained Language Models (PLMs) for feature representation, converting EM to a binary classification task. However, due to the dependence of PLMs on large labeled datasets and the overlap between train and test sets in current EM benchmarks, these methods often underperform in real-world scenarios (e.g., small data size, hard negative samples, and unseen entities). To address the limitations of existing methods, we propose SETEM, a self-ensemble training method leveraging the stability and strong generalization of ensemble models to tackle these challenges in real-world scenarios. Additionally, we develop a faster training method for low-resource applications. Experiments on benchmark datasets show that SETEM outperforms Ditto and HierGAT on the F1 score. In particular, SETEM shows the greatest advantage with small datasets and a high proportion of unseen entities in the test set, achieving up to a 9.61% F1 score increment over baselines on the WDC dataset.
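The abstract describes converting EM to a binary classification task over record pairs. The following toy sketch illustrates only that framing: each record is serialized into a text sequence (the `[COL]`/`[VAL]` serialization popularized by Ditto), and a pair of sequences receives a same-entity / different-entity decision. A simple Jaccard-similarity threshold stands in for the learned PLM classifier here; it is not the SETEM method, which uses pre-trained language models and self-ensemble training.

```python
# Toy illustration of EM as binary classification over record pairs.
# A real system (Ditto, HierGAT, SETEM) would feed the serialized pair
# to a pre-trained language model; a Jaccard threshold stands in here.

def serialize(record: dict) -> str:
    """Flatten an attribute/value record into a single text sequence."""
    return " ".join(f"[COL] {k} [VAL] {v}" for k, v in record.items())

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two serialized records."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def match(left: dict, right: dict, threshold: float = 0.5) -> bool:
    """Binary decision: do the two records refer to the same entity?"""
    return jaccard(serialize(left), serialize(right)) >= threshold

a = {"title": "iPhone 13 Pro 128GB", "brand": "Apple"}
b = {"title": "Apple iPhone 13 Pro (128GB)", "brand": "Apple"}
c = {"title": "Galaxy S22 Ultra", "brand": "Samsung"}

print(match(a, b))  # True  (same product, different formatting)
print(match(a, c))  # False (different entities)
```

The serialization step is what lets heterogeneous, schema-mismatched records share one text-pair classifier; the "hard negative samples" mentioned in the abstract are pairs like near-identical product titles from different entities, which a similarity threshold handles poorly and which motivate learned models.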