Performance evaluation of seven multi-label classification methods on real-world patent and publication datasets

Real-world data, Computer science, Multi-label classification, Data mining, Information retrieval, Data science, Artificial intelligence
Authors
Shuo Xu, Yuefu Zhang, Xin An, Sainan Pi
Source
Journal: Journal of Data and Information Science [Chinese Academy of Sciences]
Volume (issue): 9 (2): 81-103 Cited by: 1
Identifier
DOI:10.2478/jdis-2024-0014
Abstract

Purpose: Many science, technology and innovation (STI) resources carry several different labels. To assign such labels automatically to an instance of interest, many multi-label classification approaches that perform well on benchmark datasets have been proposed in the literature, and several open-source tools implementing these approaches have been developed. However, the characteristics of real-world multi-label patent and publication datasets are not fully in line with those of benchmark datasets. The main purpose of this paper is therefore to evaluate seven multi-label classification methods comprehensively on real-world datasets.

Design/methodology/approach: Three real-world datasets (Biological-Sciences, Health-Sciences, and USPTO) are constructed from SciGraph and the USPTO database. Seven multi-label classification methods with tuned parameters (dependency-LDA, MLkNN, LabelPowerset, RAkEL, TextCNN, TextRNN, and TextRCNN) are compared on these three datasets. Performance is evaluated with three classification-based metrics: Macro-F1, Micro-F1, and Hamming Loss.

Findings: In terms of Macro-F1, Micro-F1, and Hamming Loss, the TextCNN and TextRCNN models are clearly superior on small-scale datasets with a more complex hierarchical label structure and a more balanced document-label distribution. The MLkNN method works better on the larger-scale dataset with a more unbalanced document-label distribution.

Research limitations: The three real-world datasets differ in the following aspects: statement, data quality, and purposes. Additionally, open-source tools designed for multi-label classification differ intrinsically in their approaches to data processing and feature selection, which in turn affects the performance of a multi-label classification approach. In the near future, we will enhance experimental precision and reinforce the validity of the conclusions by controlling variables more rigorously through expanded parameter settings.

Practical implications: The Macro-F1 and Micro-F1 scores observed on real-world datasets typically fall short of those achieved on benchmark datasets, underscoring the complexity of real-world multi-label classification tasks. Approaches based on deep learning offer promising solutions because they accommodate the hierarchical relationships and interdependencies among labels. With ongoing improvements in deep learning algorithms and large-scale models, the efficacy of multi-label classification is expected to improve significantly and to reach a level of practical utility in the foreseeable future.

Originality/value: (1) Seven multi-label classification methods are comprehensively compared on three real-world datasets. (2) The TextCNN and TextRCNN models perform better on small-scale datasets with a more complex hierarchical label structure and a more balanced document-label distribution. (3) The MLkNN method works better on the larger-scale dataset with a more unbalanced document-label distribution.
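
The abstract names Macro-F1, Micro-F1, and Hamming Loss without defining them. As a rough illustration only, the sketch below computes the three metrics from binary label-indicator matrices using their standard definitions. It is not the authors' evaluation code: the function names, the toy data, and the convention that a label with no true or predicted positives gets an F1 of 0 are all assumptions made for this example.

```python
from typing import List, Tuple

Matrix = List[List[int]]  # rows = documents, columns = labels, cells in {0, 1}


def label_counts(y_true: Matrix, y_pred: Matrix) -> List[Tuple[int, int, int]]:
    """Return (true positives, false positives, false negatives) per label."""
    counts = []
    for j in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    return counts


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); assumed to be 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Average the per-label F1 scores, so every label weighs the same."""
    scores = [f1(*c) for c in label_counts(y_true, y_pred)]
    return sum(scores) / len(scores)


def micro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Pool the counts over all labels first, so frequent labels dominate."""
    counts = label_counts(y_true, y_pred)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1(tp, fp, fn)


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of (document, label) cells that are predicted incorrectly."""
    wrong = sum(t != p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    return wrong / (len(y_true) * len(y_true[0]))


if __name__ == "__main__":
    # Toy data (hypothetical): 4 documents, 3 labels.
    y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
    y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
    print(f"Macro-F1:     {macro_f1(y_true, y_pred):.4f}")
    print(f"Micro-F1:     {micro_f1(y_true, y_pred):.4f}")
    print(f"Hamming Loss: {hamming_loss(y_true, y_pred):.4f}")
```

Higher is better for the two F1 scores and lower is better for Hamming Loss. Because Macro-F1 weighs rare and frequent labels equally while Micro-F1 is driven by frequent labels, the gap between the two tends to widen on datasets with an unbalanced document-label distribution, which is one reason to report both.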
