Performance evaluation of seven multi-label classification methods on real-world patent and publication datasets

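Real-world data, Computer science, Multi-label classification, Data mining, Information retrieval, Data science, Artificial intelligence
Authors
Shuo Xu, Yuefu Zhang, Xin An, Sainan Pi
Source
Journal: Journal of Data and Information Science
Volume (issue): 9 (2): 81-103 Citations: 1
Identifier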
DOI:10.2478/jdis-2024-0014
Abstract

Purpose: Many science, technology and innovation (STI) resources carry several different labels. To assign such labels automatically to an instance of interest, many approaches that perform well on benchmark datasets have been proposed in the literature for the multi-label classification task, and several open-source tools implementing these approaches have been developed. However, the characteristics of real-world multi-label patent and publication datasets are not completely in line with those of benchmark datasets. The main purpose of this paper is therefore to evaluate seven multi-label classification methods comprehensively on real-world datasets.

Design/methodology/approach: Three real-world datasets (Biological-Sciences, Health-Sciences, and USPTO) are constructed from SciGraph and the USPTO database. Seven multi-label classification methods with tuned parameters (dependency-LDA, MLkNN, LabelPowerset, RAkEL, TextCNN, TextRNN, and TextRCNN) are compared on these three datasets. To evaluate performance, the study adopts three classification-based metrics: Macro-F1, Micro-F1, and Hamming Loss.

Findings: The TextCNN and TextRCNN models show clear superiority, in terms of Macro-F1, Micro-F1, and Hamming Loss, on small-scale datasets with a more complex hierarchical label structure and a more balanced document-label distribution. The MLkNN method works better on the larger-scale dataset with a more unbalanced document-label distribution.

Research limitations: The three real-world datasets differ in the following aspects: statement, data quality, and purposes. Additionally, open-source tools designed for multi-label classification have intrinsic differences in their approaches to data processing and feature selection, which in turn affect the performance of a multi-label classification approach. In the near future, we will enhance experimental precision and reinforce the validity of the conclusions by controlling variables more rigorously through expanded parameter settings.

Practical implications: The Macro-F1 and Micro-F1 scores observed on real-world datasets typically fall short of those achieved on benchmark datasets, underscoring the complexity of real-world multi-label classification tasks. Approaches that leverage deep learning techniques offer promising solutions by accommodating the hierarchical relationships and interdependencies among labels. With ongoing enhancements in deep learning algorithms and large-scale models, the efficacy of multi-label classification is expected to improve significantly and to reach a level of practical utility in the foreseeable future.

Originality/value: (1) Seven multi-label classification methods are comprehensively compared on three real-world datasets. (2) The TextCNN and TextRCNN models perform better on small-scale datasets with a more complex hierarchical label structure and a more balanced document-label distribution. (3) The MLkNN method works better on the larger-scale dataset with a more unbalanced document-label distribution.
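The three metrics named in the abstract can be illustrated with a minimal sketch. It uses the standard textbook definitions, not the authors' evaluation code, which the abstract does not show. The function names, the toy label matrices, and the convention of scoring a label with no true or predicted positives as 0.0 are all assumptions made for illustration.

```python
from typing import List, Tuple

Matrix = List[List[int]]  # rows = documents, columns = labels, entries 0/1


def label_counts(y_true: Matrix, y_pred: Matrix, j: int) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) for label j."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        t, p = t_row[j], p_row[j]
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
    return tp, fp, fn


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); 0.0 when undefined (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Unweighted mean of per-label F1, so rare labels count as much as common ones."""
    n_labels = len(y_true[0])
    scores = [f1_from_counts(*label_counts(y_true, y_pred, j)) for j in range(n_labels)]
    return sum(scores) / n_labels


def micro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """F1 over counts pooled across all labels, so frequent labels dominate."""
    n_labels = len(y_true[0])
    tp = fp = fn = 0
    for j in range(n_labels):
        a, b, c = label_counts(y_true, y_pred, j)
        tp, fp, fn = tp + a, fp + b, fn + c
    return f1_from_counts(tp, fp, fn)


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of (document, label) cells that are predicted incorrectly."""
    cells = sum(len(row) for row in y_true)
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / cells


# Hypothetical toy data: 4 documents, 3 labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

print(round(macro_f1(y_true, y_pred), 4))   # 0.7222
print(round(micro_f1(y_true, y_pred), 4))   # 0.7273
print(hamming_loss(y_true, y_pred))         # 0.25
```

Macro-F1 and Micro-F1 diverge most when the document-label distribution is unbalanced, which is one reason the abstract reports both alongside Hamming Loss.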