Latent Dirichlet allocation
Computer science
Topic model
Cluster analysis
Gibbs sampling
Artificial intelligence
Probabilistic latent semantic analysis
Document clustering
Multinomial distribution
Similarity (geometry)
Latent semantic analysis
Natural language processing
Class (philosophy)
Information retrieval
Mathematics
Bayesian probability
Statistics
Image (mathematics)
Authors
Abhinandan Udupa, K N Adarsh, Anvitha Aravinda, Neelam H Godihal, N Kayarvizhy
Identifier
DOI: 10.1109/ccip57447.2022.10058687
Abstract
Topic models are a useful tool for locating latent subjects in collections of documents. Short text clustering has become an increasingly important task as social networking sites like Twitter have gained popularity. Short text is characterised by high sparsity, high dimensionality, and large volume, and these characteristics are challenging to overcome. Two of the most well-known short text modelling algorithms are BERTopic and the Gibbs Sampling Dirichlet Multinomial Mixture model (GSDMM). GSDMM is a topic model that can infer the number of topic clusters automatically, strikes a good compromise between the completeness and homogeneity of the clustering results, and is fast to converge. BERTopic is a neural topic model that extracts coherent topic representations based on the semantic similarity of words and phrases in the documents and on clustering with the help of a class-based form of TF-IDF. We compare these two algorithms in this paper to determine which model is more effective for short text topic modelling.
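To make the comparison concrete, here is a minimal sketch of how the two models described in the abstract are typically applied to short texts. It assumes the open-source `bertopic` package and the community `gsdmm` package (the MovieGroupProcess implementation, usually installed from GitHub); the toy corpus, hyperparameters, and variable names are illustrative and not taken from the paper.

```python
# A minimal sketch, not the authors' implementation: it applies the two
# models from the abstract to a toy corpus of tweet-like short texts.
from bertopic import BERTopic
from gsdmm import MovieGroupProcess

docs = [
    "phone battery dies so fast", "love the camera on this phone",
    "screen cracked after one drop", "best phone upgrade in years",
    "train delayed again this morning", "commute ruined by train delay",
    "missed the bus and the train", "station platform totally packed",
    "rain all week in the forecast", "sunny weekend at last",
    "storm knocked the power out", "heatwave expected by friday",
]

# BERTopic: embeds the documents, clusters the embeddings, then describes
# each cluster with a class-based TF-IDF over the merged documents of that
# cluster. (Its UMAP step generally wants more documents than this toy set.)
bertopic_model = BERTopic(min_topic_size=2)
bertopic_labels, _ = bertopic_model.fit_transform(docs)
print("BERTopic clusters:", bertopic_labels)

# GSDMM: collapsed Gibbs sampling over a Dirichlet multinomial mixture.
# K is only an upper bound; unused clusters empty out during sampling, so
# the effective number of topic clusters is inferred automatically.
tokenized = [doc.split() for doc in docs]
vocab_size = len({word for doc in tokenized for word in doc})
mgp = MovieGroupProcess(K=10, alpha=0.1, beta=0.1, n_iters=30)
gsdmm_labels = mgp.fit(tokenized, vocab_size)
print("GSDMM clusters:", gsdmm_labels)
```

The design contrast the paper evaluates is visible even in this sketch: GSDMM works directly on token counts and treats K as a ceiling, while BERTopic relies on pretrained sentence embeddings and therefore benefits from a larger corpus.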