Keywords
Pascal (unit)
Convolutional neural network
Artificial intelligence
Computer science
Transformer
Pattern recognition (psychology)
Image (mathematics)
Cognitive neuroscience of visual object recognition
Action recognition
Computer vision
Machine learning
Object (grammar)
Engineering
Class (philosophy)
Voltage
Electrical engineering
Programming language
Authors
Seyed Rohollah Hosseyni, Hasan Taheri, Sanaz Seyedin, Ali Ahmad Rahmani
Source
Journal: Cornell University - arXiv
Date: 2023-01-01
Identifier
DOI: 10.48550/arxiv.2307.08994
Abstract
Understanding the relationship between different parts of an image is crucial in a variety of applications, including object recognition, scene understanding, and image classification. Although Convolutional Neural Networks (CNNs) have demonstrated impressive results in classifying and detecting objects, they lack the capability to extract the relationship between different parts of an image, which is a crucial factor in Human Action Recognition (HAR). To address this problem, this paper proposes a new module that functions like a convolutional layer built on a Vision Transformer (ViT). In the proposed model, the Vision Transformer can complement a convolutional neural network in a variety of tasks by helping it to effectively extract the relationships among various parts of an image. It is shown that the proposed model, compared to a simple CNN, can extract meaningful parts of an image and suppress the misleading parts. The proposed model has been evaluated on the Stanford40 and PASCAL VOC 2012 action datasets, achieving 95.5% and 91.5% mean Average Precision (mAP), respectively, which is promising compared to other state-of-the-art methods.
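The abstract's core idea is a ViT-based module that behaves like a convolutional layer: the image is split into patches, and self-attention lets every patch aggregate information from every other patch, capturing the part-to-part relationships a plain convolution misses. The following is a minimal NumPy sketch of that general mechanism, not the authors' implementation; the function name, patch size, embedding dimension, and the untrained random projection matrices are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vit_like_layer(image, patch=8, dim=32, seed=0):
    """Self-attention over image patches, returning a spatial feature map
    shaped like a convolutional layer's output: (H/patch, W/patch, dim).

    Hypothetical sketch: the projections below are random stand-ins for
    the learned weights a real ViT module would train.
    """
    rng = np.random.default_rng(seed)
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    # Split the image into non-overlapping patches and flatten each one
    patches = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    tokens = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)
    # Random (untrained) projections in place of learned embeddings
    W_embed = rng.normal(0, 0.02, (tokens.shape[1], dim))
    Wq, Wk, Wv = (rng.normal(0, 0.02, (dim, dim)) for _ in range(3))
    x = tokens @ W_embed
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Attention matrix holds pairwise relations between all patches
    attn = softmax(q @ k.T / np.sqrt(dim))
    out = attn @ v  # each patch aggregates information from every other patch
    return out.reshape(gh, gw, dim)

feat = vit_like_layer(np.ones((64, 64, 3)))
print(feat.shape)  # (8, 8, 32)
```

Because the output keeps a spatial grid layout, such a module could in principle be dropped into a CNN in place of (or alongside) a convolutional layer, which is the complementary role the abstract describes.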