Computer science
Transformer
Encoder
Arousal
Artificial intelligence
Correlation
Facial expression
Valence (chemistry)
Affect (linguistics)
Speech recognition
Computer vision
Mathematics
Engineering
Psychology
Communication
Neuroscience
Physics
Geometry
Quantum mechanics
Voltage
Electrical engineering
Operating system
Authors
Shuo Yang, Yongtang Bao, Yue Qi
Identifier
DOI:10.1109/icip49359.2023.10222702
Abstract
In recent years, video-based continuous affect estimation has received increasing attention in computer vision. Robustly and accurately modeling the temporal information in facial expression changes is therefore crucial. We propose a transformer network that incorporates both local context and dimensional correlation to model visual information efficiently. Because the transformer's self-attention layer is insensitive to local context, noise such as instantaneous head poses and lighting changes may degrade the model's performance; we therefore adopt a local-wise transformer encoder to strengthen the transformer's ability to capture local contextual information. In addition, drawing on prior knowledge of the correlation between valence and arousal, we design a va-relevance bootstrap module and the corresponding valence-arousal relevance loss (va loss). Experiments on the Aff-Wild2 and AFEW-VA datasets show the superior performance of our method for continuous affect estimation.
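The paper's implementation is not reproduced here. The following is a minimal PyTorch sketch of the two ideas the abstract names: self-attention restricted to a local temporal window, and a loss that encourages the predicted valence-arousal correlation to match the ground-truth one. Every name and detail below (LocalTransformerEncoderLayer, va_relevance_loss, the window size, and the exact correlation form) is an illustrative assumption, not taken from the paper.

import torch
import torch.nn as nn

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    # Boolean mask restricting self-attention to a local temporal window.
    # True entries are *blocked*, following the convention of
    # torch.nn.MultiheadAttention's boolean attn_mask.
    idx = torch.arange(seq_len)
    dist = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs()
    return dist > window  # frames farther apart than `window` cannot attend

class LocalTransformerEncoderLayer(nn.Module):
    # One encoder layer whose self-attention only sees nearby frames;
    # a stand-in for the "local-wise transformer encoder" idea, with
    # d_model, nhead, and window chosen arbitrarily for illustration.
    def __init__(self, d_model: int = 256, nhead: int = 4, window: int = 8):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model) sequence of per-frame features
        mask = local_attention_mask(x.size(1), self.window).to(x.device)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ffn(x))

def va_relevance_loss(pred_v, pred_a, true_v, true_a):
    # Hypothetical form of a valence-arousal relevance loss: penalize the
    # gap between the predicted and ground-truth valence-arousal correlation
    # within a batch. The paper's exact definition may differ.
    def corr(a, b):
        a = a - a.mean()
        b = b - b.mean()
        return (a * b).sum() / (a.norm() * b.norm() + 1e-8)
    return (corr(pred_v, pred_a) - corr(true_v, true_a)).abs()

Under these assumptions, the windowed mask keeps each frame's attention focused on its temporal neighborhood, which is one plausible way to reduce sensitivity to transient noise such as brief head-pose or lighting changes, while the relevance term injects the valence-arousal correlation prior into training.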