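Similarity (geometry)
String metric
String (physics)
Transformation (genetics)
Matching (statistics)
String-searching algorithm
Computer science
Pattern recognition (psychology)
Artificial intelligence
Task (project management)
Data mining
Information retrieval
Pattern matching
Mathematics
Statistics
Image (mathematics)
Gene
Biochemistry
Economics
Mathematical physics
Chemistry
Management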
Authors
Kazunori Sakai, Yuyang Dong, Masafumi Oyamada, Kunihiro Takeoka, Takeshi Okadome
Source
Journal: Communications in Computer and Information Science
Date: 2022-01-01
Volume/Issue: : 76-87
Identifier
DOI: 10.1007/978-3-030-93849-9_5
Abstract
Entity matching, the task of determining whether two records refer to the same real-world entity, is important in common data cleaning and data integration problems. Many studies use string similarity as a feature to infer entity matches, but the power of similarity may be weakened by hard-to-classify pairs: pairs that are actually different entities but have high similarity, or pairs that are the same entity but have low similarity. String transformation is a good way to reconcile different representations between two dataset domains, such as abbreviations, misspellings, and other alternative expressions. In this paper, we propose two powerful features, similarity gain and dissimilarity gain, that enable us to discriminate whether two records refer to the same entity after string transformation. The similarity gain is defined as the maximum increase in similarity among the changes in similarity observed before and after applying string transformations. The dissimilarity gain is defined as the maximum decrease in similarity. Moreover, the similarity gain and dissimilarity gain can also be used to select valuable samples under a limited labeling budget. Extensive experiments are conducted, and our method with the proposed features improves the best accuracy in most cases.
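
The two definitions in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything concrete in it is an assumption made for illustration: the similarity function (the `difflib.SequenceMatcher` ratio from the standard library), the hypothetical transformations `expand_abbreviations` and `str.lower`, the hypothetical `ABBREVIATIONS` table, applying each transformation to both strings, and clamping both gains at zero when no transformation moves the similarity in that direction.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, Tuple

Transform = Callable[[str], str]

# Hypothetical abbreviation table, used only for this illustration.
ABBREVIATIONS = {"intl": "international", "corp": "corporation"}


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; SequenceMatcher is an assumed stand-in."""
    return SequenceMatcher(None, a, b).ratio()


def expand_abbreviations(s: str) -> str:
    """Hypothetical transformation: lowercase, trim '.'/',' and expand known abbreviations."""
    tokens = [tok.strip(".,").lower() for tok in s.split()]
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def gains(a: str, b: str, transforms: Iterable[Transform]) -> Tuple[float, float]:
    """Return (similarity_gain, dissimilarity_gain) for one record pair.

    Each transformation is applied to both strings (an assumption), and the
    change in similarity relative to the untransformed pair is recorded.
    """
    base = similarity(a, b)
    deltas = [similarity(t(a), t(b)) - base for t in transforms]
    sim_gain = max([0.0] + deltas)                # largest increase, clamped at 0
    dis_gain = max([0.0] + [-d for d in deltas])  # largest decrease, clamped at 0
    return sim_gain, dis_gain


if __name__ == "__main__":
    a = "Intl. Business Machines Corp."
    b = "International Business Machines Corporation"
    sim_gain, dis_gain = gains(a, b, [expand_abbreviations, str.lower])
    print(f"similarity gain: {sim_gain:.3f}, dissimilarity gain: {dis_gain:.3f}")
```

In this sketch, a pair whose abbreviations expand to the same string gets a large similarity gain, while a pair that is already identical gets zero for both features. Under the sample-selection use mentioned in the abstract, one plausible reading is that pairs with large gains are the ones worth labeling first, but the selection rule itself is not specified here.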