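Computer science
Document classification
tf–idf
Latent Dirichlet allocation
Feature (linguistics)
Feature vector
Benchmark (surveying)
Representation (politics)
Pattern recognition (psychology)
Artificial intelligence
Information retrieval
Data mining
Term (time)
Machine learning
Topic model
Geodesy
Philosophy
Physics
Politics
Law
Geography
Quantum mechanics
Linguistics
Political science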
Authors
Dong‐Hwa Kim,Deokseong Seo,Suhyoun Cho,Pilsung Kang
Identifier
DOI:10.1016/j.ins.2018.10.006
Abstract
The purpose of document classification is to assign the most appropriate label to a specified document. The main challenges in document classification are insufficient label information and unstructured sparse format. A semi-supervised learning (SSL) approach could be an effective solution to the former problem, whereas the consideration of multiple document representation schemes can resolve the latter problem. Co-training is a popular SSL method that attempts to exploit various perspectives in terms of feature subsets for the same example. In this paper, we propose multi-co-training (MCT) for improving the performance of document classification. In order to increase the variety of feature sets for classification, we transform a document using three document representation methods: term frequency–inverse document frequency (TF–IDF) based on the bag-of-words scheme, topic distribution based on latent Dirichlet allocation (LDA), and neural-network-based document embedding known as document to vector (Doc2Vec). The experimental results demonstrate that the proposed MCT is robust to parameter changes and outperforms benchmark methods under various conditions.
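The co-training idea summarized in the abstract can be illustrated with a minimal sketch. It assumes several row-aligned feature views of the same documents; in the paper these would be TF–IDF, LDA topic distributions, and Doc2Vec embeddings, but here small hand-made vectors stand in for them. The nearest-centroid classifier, the confidence score, the pseudo-labeling rule, and the names `centroid_fit`, `centroid_predict`, and `multi_co_training` are all assumptions made for illustration and are not taken from the paper.

```python
import math


def centroid_fit(X, y):
    """Fit a nearest-centroid model: return {label: mean vector} for one view."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def centroid_predict(model, x):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    weights = {lab: math.exp(-math.dist(x, c)) for lab, c in model.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


def multi_co_training(views, labels, rounds=5, per_round=1):
    """Illustrative multi-view co-training loop (not the paper's exact procedure).

    views:  list of feature matrices, one per representation, row-aligned.
    labels: list holding a class label, or None for unlabeled documents.
    """
    labels = list(labels)
    for _ in range(rounds):
        labeled = [i for i, y in enumerate(labels) if y is not None]
        unlabeled = [i for i, y in enumerate(labels) if y is None]
        if not unlabeled:
            break
        y_lab = [labels[i] for i in labeled]
        # Train one classifier per view on the currently labeled documents.
        models = [centroid_fit([X[i] for i in labeled], y_lab) for X in views]
        new_labels = {}
        for X, model in zip(views, models):
            # Each view nominates its most confident unlabeled documents.
            scored = sorted(
                ((centroid_predict(model, X[i]), i) for i in unlabeled),
                key=lambda t: t[0][1],
                reverse=True,
            )
            for (label, conf), i in scored[:per_round]:
                # If views nominate the same document, keep the most confident label.
                if i not in new_labels or conf > new_labels[i][1]:
                    new_labels[i] = (label, conf)
        # Pseudo-labels are shared, so every view trains on them next round.
        for i, (label, _) in new_labels.items():
            labels[i] = label
    return labels


# Toy data: six documents, three views (stand-ins for TF-IDF, LDA, Doc2Vec).
views = [
    [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]],
    [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.2, 0.8], [0.3, 0.7]],
    [[2.0, -1.0], [1.8, -0.9], [1.5, -0.7], [-2.0, 1.0], [-1.7, 0.8], [-1.6, 0.9]],
]
labels = ["sports", None, None, "politics", None, None]
result = multi_co_training(views, labels)
print(result)
```

Sharing pseudo-labels across views is what lets a document that is ambiguous in one representation be labeled by another; a real implementation would replace the toy vectors with actual document representations and the centroid model with a stronger classifier.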