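Cosine similarity
Computer science
Similarity (geometry)
Natural language processing
Encoder
Artificial intelligence
Sentence
Trigonometric functions
Binary number
Transformer
Semantic similarity
tf–idf
Information retrieval
Pattern recognition (psychology)
Arithmetic
Mathematics
Image (mathematics)
Physics
Operating system
Voltage
Quantum mechanics
Term (time)
Geometry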
Authors
Kanav Goyal, Megha Sharma
Identifier
DOI:10.1109/icatiece56365.2022.10046766
Abstract
This paper compares several methods for vectorizing documents and computes cosine similarities between the resulting document vectors. Some of the vectorization methods also capture the semantic meaning of the text. The methods compared, each paired with cosine similarity, are Bag of Words, Binary Bag of Words, Tf-Idf, Bidirectional Encoder Representations from Transformers (BERT), and Universal Sentence Encoder. Two libraries, NLTK and Gensim, were used to preprocess the text. Binary Bag of Words with Gensim gave the best results of all the methods tested. The dataset consisted of around 2000 short news articles belonging to 5 categories.
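To make the comparison concrete, the following is a minimal sketch of cosine similarity over count-based and binary bag-of-words vectors. It is not the authors' code: the regex tokenizer stands in for the NLTK or Gensim preprocessing mentioned in the abstract, and the function names (`tokenize`, `bow_vector`, `cosine_similarity`) and the two sample sentences are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs (an assumed stand-in for NLTK/Gensim preprocessing)."""
    return re.findall(r"[a-z]+", text.lower())


def bow_vector(tokens, vocabulary, binary=False):
    """Return term counts over `vocabulary`; with binary=True, record only presence (0 or 1)."""
    counts = Counter(tokens)
    if binary:
        return [1 if counts[term] > 0 else 0 for term in vocabulary]
    return [counts[term] for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Illustrative documents (not taken from the paper's dataset).
doc_a = "Stocks rise as markets rally, markets cheer"
doc_b = "Markets rally after stocks slump"

tokens_a, tokens_b = tokenize(doc_a), tokenize(doc_b)
vocabulary = sorted(set(tokens_a) | set(tokens_b))

# Count-based bag of words: repeated terms ("markets") weigh more.
sim_counts = cosine_similarity(
    bow_vector(tokens_a, vocabulary),
    bow_vector(tokens_b, vocabulary),
)

# Binary bag of words: each term contributes at most once per document.
sim_binary = cosine_similarity(
    bow_vector(tokens_a, vocabulary, binary=True),
    bow_vector(tokens_b, vocabulary, binary=True),
)

print(f"count BoW:  {sim_counts:.4f}")
print(f"binary BoW: {sim_binary:.4f}")
```

The two scores differ only because "markets" appears twice in the first document. The binary variant ignores term frequency, which is the distinction between the Bag of Words and Binary Bag of Words methods named in the abstract.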