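Computer science
Convolutional neural network
Transformer
Artificial intelligence
Action recognition
Computer vision
Pattern recognition (psychology)
Feature extraction
Test set
Engineering
Electrical engineering
Voltage
Class (philosophy)
Identifier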
DOI:10.1109/prai55851.2022.9904115
Abstract
Human action recognition is a widely investigated field in computer vision. Automatic violence detection is a subset of action recognition that deserves special attention because of its wide applicability in unmanned security monitoring systems. This paper presents an end-to-end model that uses a Transformer for human pose estimation and a 3D convolutional neural network to capture motion across the spatial and temporal dimensions. We train the 3D convolutional neural network to learn spatio-temporal features from the human keypoint sequences output by the Transformer block. The proposed model achieves an accuracy of 89% on the test set of the large-scale video database RWF-2000 and an accuracy of 93% on our own school violence video database. Our experimental results show that a Transformer-based approach can be used for violence detection in video.
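To make the idea of learning spatio-temporal features from keypoint sequences more concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It assumes keypoints are already available (they are hard-coded here in place of the Transformer pose estimator), rasterizes each frame onto a small grid, and applies a single hand-set 3D kernel in place of a trained 3D CNN. The helper names `rasterize` and `conv3d_valid`, the grid size, and the kernel are all hypothetical choices made for illustration.

```python
# Illustrative sketch only: every name, size, and kernel below is an assumption,
# not taken from the paper.

def rasterize(keypoints, height, width):
    """Turn one frame's (row, col) keypoints into a binary height x width grid."""
    grid = [[0.0] * width for _ in range(height)]
    for r, c in keypoints:
        if 0 <= r < height and 0 <= c < width:
            grid[r][c] = 1.0
    return grid


def conv3d_valid(volume, kernel):
    """Naive 3D cross-correlation (no padding, stride 1) over a T x H x W volume."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for i in range(H - kh + 1):
            row = []
            for j in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for di in range(kh):
                        for dj in range(kw):
                            acc += volume[t + dt][i + di][j + dj] * kernel[dt][di][dj]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def total_abs(response):
    """Sum of absolute responses, used here as a crude motion score."""
    return sum(abs(v) for plane in response for row in plane for v in row)


# Hand-set 2 x 1 x 1 temporal-difference kernel (a trained network would learn
# its kernels): it responds where grid occupancy changes between frames.
kernel = [[[-1.0]], [[1.0]]]

# Hypothetical keypoint sequences: one keypoint moving right, one standing still.
moving = [[(1, 0)], [(1, 1)], [(1, 2)]]
still = [[(1, 1)], [(1, 1)], [(1, 1)]]

moving_volume = [rasterize(f, 3, 4) for f in moving]
still_volume = [rasterize(f, 3, 4) for f in still]

response = conv3d_valid(moving_volume, kernel)
motion_energy = total_abs(response)
static_energy = total_abs(conv3d_valid(still_volume, kernel))
```

Because the kernel spans two consecutive frames, the moving keypoint yields a nonzero response while the stationary one yields none. A 3D CNN stacks many learned kernels of this kind, typically with spatial extent as well, followed by a classifier.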