Performance evaluation of seven multi-label classification methods on real-world patent and publication datasets

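Real-world data; Computer science; Multi-label classification; Data mining; Information retrieval; Data science; Artificial intelligence
Authors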
Shuo Xu,Yuefu Zhang,Xin An,Sainan Pi
Source
Journal: Journal of Data and Information Science [Chinese Academy of Sciences]
Volume (Issue): 9 (2): 81-103 Citations: 1
Identifier
DOI:10.2478/jdis-2024-0014
Abstract

Purpose: Many science, technology and innovation (STI) resources carry several different labels. To assign such labels automatically to an instance of interest, many approaches that perform well on benchmark datasets have been proposed in the literature for the multi-label classification task, and several open-source tools implementing these approaches have been developed. However, the characteristics of real-world multi-label patent and publication datasets are not fully in line with those of the benchmark ones. The main purpose of this paper is therefore to evaluate seven multi-label classification methods comprehensively on real-world datasets.

Design/methodology/approach: Three real-world datasets (Biological-Sciences, Health-Sciences, and USPTO) are constructed from the SciGraph and USPTO databases. Seven multi-label classification methods with tuned parameters (dependency-LDA, MLkNN, LabelPowerset, RAkEL, TextCNN, TextRNN, and TextRCNN) are compared on these three datasets. To evaluate performance, the study adopts three classification-based metrics: Macro-F1, Micro-F1, and Hamming Loss.

Findings: In terms of Macro-F1, Micro-F1 and Hamming Loss, the TextCNN and TextRCNN models show clear superiority on the small-scale datasets, which have a more complex hierarchical label structure and a more balanced document-label distribution. The MLkNN method works better on the larger-scale dataset, which has a more unbalanced document-label distribution.

Research limitations: The three real-world datasets differ in the following aspects: statement, data quality, and purposes. Additionally, open-source tools designed for multi-label classification have intrinsic differences in how they handle data processing and feature selection, which in turn affects the performance of a multi-label classification approach. In the near future, we will enhance experimental precision and reinforce the validity of the conclusions by controlling variables more rigorously through expanded parameter settings.

Practical implications: The Macro-F1 and Micro-F1 scores observed on real-world datasets typically fall short of those achieved on benchmark datasets, underscoring the complexity of real-world multi-label classification tasks. Approaches leveraging deep learning techniques offer promising solutions by accommodating the hierarchical relationships and interdependencies among labels. With ongoing improvements in deep learning algorithms and large-scale models, the efficacy of multi-label classification is expected to improve significantly and to reach a level of practical utility in the foreseeable future.

Originality/value: (1) Seven multi-label classification methods are comprehensively compared on three real-world datasets. (2) The TextCNN and TextRCNN models perform better on small-scale datasets with a more complex hierarchical label structure and a more balanced document-label distribution. (3) The MLkNN method works better on the larger-scale dataset with a more unbalanced document-label distribution.
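The three evaluation metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function name `multilabel_metrics`, the toy label matrices, and the convention that F1 is 0 when a label has no true or predicted positives are all assumptions made for illustration. Macro-F1 averages the per-label F1 scores, Micro-F1 pools the counts over all labels before computing F1, and Hamming Loss is the fraction of document-label pairs that are predicted wrongly.

```python
from typing import Dict, List


def _f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when undefined (an assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_metrics(y_true: List[List[int]],
                       y_pred: List[List[int]]) -> Dict[str, float]:
    """Compute Macro-F1, Micro-F1 and Hamming Loss for binary indicator matrices.

    Each row is one document, each column one label (1 = label assigned).
    """
    n_samples = len(y_true)
    n_labels = len(y_true[0])
    tp = [0] * n_labels  # true positives per label
    fp = [0] * n_labels  # false positives per label
    fn = [0] * n_labels  # false negatives per label
    mismatches = 0       # wrongly predicted document-label pairs

    for t_row, p_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            if t and p:
                tp[j] += 1
            elif p and not t:
                fp[j] += 1
            elif t and not p:
                fn[j] += 1
            if t != p:
                mismatches += 1

    # Macro-F1: unweighted mean of per-label F1, so rare labels count equally.
    macro_f1 = sum(_f1(tp[j], fp[j], fn[j]) for j in range(n_labels)) / n_labels
    # Micro-F1: F1 over pooled counts, so frequent labels dominate.
    micro_f1 = _f1(sum(tp), sum(fp), sum(fn))
    # Hamming Loss: share of label decisions that are wrong (lower is better).
    hamming = mismatches / (n_samples * n_labels)
    return {"macro_f1": macro_f1, "micro_f1": micro_f1, "hamming_loss": hamming}


# Toy data (hypothetical): 4 documents, 3 labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
print(multilabel_metrics(y_true, y_pred))
```

Because Macro-F1 weights every label equally while Micro-F1 weights every document-label decision equally, the two can diverge noticeably on datasets with an unbalanced document-label distribution, which is one reason to report both alongside Hamming Loss.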
