Latent Dirichlet Allocation
Computer Science
Artificial Intelligence
Cluster Analysis
Topic Model
Natural Language Processing
Document Clustering
Autoencoder
Deep Learning
Authors
Pintu Chandra Paul, Md Shihab Uddin, Md Tofael Ahmed, Mohammed Moshiul Hoque, Maqsudur Rahman
Identifier
DOI:10.1109/iccit57492.2022.10055173
Abstract
Topic modeling techniques are extensively employed in Natural Language Processing to infer topics from unstructured text data. Latent Dirichlet Allocation (LDA), a popular topic modeling technique, can automatically identify topics in large collections of textual documents. LDA-based topic models, however, do not always yield good results on their own. Clustering, one of the most effective unsupervised machine learning methods, is often applied in tasks such as topic modeling and information extraction from unstructured text. In this study, a hybrid clustering-based approach that combines Bidirectional Encoder Representations from Transformers (BERT) with LDA is thoroughly investigated on a large Bangla textual dataset: BERT supplies contextual embeddings that are combined with LDA topic representations. Experiments on this hybrid model demonstrate its effectiveness at clustering similar topics in a novel dataset of Bangla news articles, showing that clustering with the BERT-LDA model aids the inference of more coherent topics. The maximum coherence value obtained on our novel dataset is 0.63 with LDA alone and 0.66 with the BERT-LDA model.
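The abstract describes combining LDA document-topic distributions with BERT contextual embeddings into hybrid vectors that are then clustered. A minimal sketch of that pipeline is below, with assumptions labeled: the paper uses BERT embeddings of Bangla news articles, but to keep the sketch dependency-free, a small toy English corpus stands in for the dataset and a hashed bag-of-words vector stands in for the BERT embedding. The corpus, feature sizes, and cluster count are illustrative, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

# Toy corpus standing in for the Bangla news-article dataset.
docs = [
    "the election results were announced by the commission",
    "the cricket team won the final match of the series",
    "stock prices rose after the central bank announcement",
    "voters went to the polls in the national election",
    "the batsman scored a century in the cricket match",
    "the market reacted to new economic policy news",
]

# Step 1: LDA document-topic distributions (one probability vector per document).
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
topic_vecs = lda.fit_transform(counts)          # shape: (n_docs, 3)

# Step 2: dense document embeddings. The paper uses BERT; a hashed
# bag-of-words vector is used here purely as a lightweight stand-in.
emb = HashingVectorizer(n_features=16, alternate_sign=False) \
    .fit_transform(docs).toarray()              # shape: (n_docs, 16)

# Step 3: concatenate into hybrid vectors and cluster them.
hybrid = np.hstack([topic_vecs, emb])           # shape: (n_docs, 19)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(hybrid)
```

In the paper's actual setup, `emb` would come from a BERT sentence encoder for Bangla text, and topic coherence (e.g. the 0.63 vs. 0.66 values reported) would be computed over the resulting clusters rather than inspected directly from KMeans labels.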