Computer science
Artificial intelligence
Feature (linguistics)
Audiovisual
Convolutional neural network
Fuse (electrical)
Image (mathematics)
Modality (human-computer interaction)
Coding (set theory)
Pattern recognition (psychology)
Visualization
Action (physics)
Audio mining
Speech recognition
Computer vision
Multimedia
Acoustic model
Speech processing
Set (abstract data type)
Philosophy
Linguistics
Electrical engineering
Programming language
Engineering
Physics
Quantum mechanics
Authors
Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam, Naveed Akhtar
Identifier
DOI: 10.1007/s00521-023-09186-5
Abstract
Multimodal Human Action Recognition (MHAR) is an important research topic in the fields of computer vision and event recognition. In this work, we address MHAR by developing a novel audio-image and video fusion-based deep learning framework that we call the Multimodal Audio-Image and Video Action Recognizer (MAiVAR). We extract temporal information from image representations of audio signals and spatial information from the video modality using Convolutional Neural Network (CNN)-based feature extractors, and fuse these features to recognize the corresponding action classes. We apply a high-level weight-assignment algorithm to improve audio-visual interaction and convergence. The proposed fusion-based framework exploits the influence of the audio and video feature maps and uses them to classify an action. Compared with state-of-the-art audio-visual MHAR techniques, the proposed approach features a simpler yet more accurate and more generalizable architecture, one that performs better across different audio-image representations. The system achieves accuracies of 87.9% and 79.0% on the UCF51 and Kinetics Sounds datasets, respectively. All code and models for this paper will be available at https://tinyurl.com/4ps2ux6n .
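The fusion idea in the abstract — combining scores derived from audio-image and video feature maps under assigned weights — can be sketched as a weighted late fusion of per-modality class scores. This is a minimal illustration only: the weight values, the three-class setup, and the function names are assumptions for demonstration, not the paper's actual architecture or learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax over class scores."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weighted_late_fusion(audio_logits, video_logits, w_audio=0.4, w_video=0.6):
    """Fuse per-modality class probabilities with scalar weights.

    Weights are illustrative; in MAiVAR-style systems they would be
    set by a higher-level weight-assignment step, not fixed by hand.
    """
    return w_audio * softmax(audio_logits) + w_video * softmax(video_logits)

# Toy example with 3 hypothetical action classes:
audio_logits = np.array([2.0, 0.5, 0.1])  # e.g. scores from an audio-image CNN
video_logits = np.array([0.2, 3.0, 0.1])  # e.g. scores from a video CNN
fused = weighted_late_fusion(audio_logits, video_logits)
pred = int(np.argmax(fused))  # index of the fused predicted class
```

Because the weights sum to 1 and each softmax output sums to 1, the fused vector is itself a valid probability distribution over classes.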