Authors
Wenjie Yin, Yonghong Hou, Zihui Guo, Kailin Liu
Source
Journal: IEEE Transactions on Circuits and Systems for Video Technology [Institute of Electrical and Electronics Engineers]
Date: 2023-07-18
Volume/Issue: 34 (3): 1684-1695
Cited by: 6
Identifier
DOI: 10.1109/tcsvt.2023.3296668
Abstract
Continuous Sign Language Recognition (CSLR) aims to generate gloss sequences from untrimmed sign videos. Since discriminative visual features are essential for CSLR, current efforts mainly focus on strengthening the feature extractor. The feature extractor can be disassembled into a spatial representation module and a short-term temporal module for spatial and temporal feature modeling. However, existing methods usually regard it as a monoblock and rarely apply refinements specific to these two distinct modules, which makes it difficult to model spatial appearance information and temporal motion information effectively. To address these issues, we propose a spatial-temporal enhanced network that contains a spatial-visual alignment (SVA) module and a temporal feature difference (TFD) module. Specifically, the SVA module conducts an auxiliary task between the spatial features and the target gloss sequences to enhance the extraction of hand and facial expressions. Meanwhile, the TFD module exploits the underlying dynamics between consecutive frames and injects the aggregated motion information into the spatial features to assist short-term temporal modeling. Extensive experimental results demonstrate the effectiveness of the proposed modules, and our network achieves state-of-the-art or competitive performance on four public CSLR datasets.
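The core of the TFD module, as described in the abstract, is to compute differences between consecutive frame features and inject the aggregated motion signal back into the spatial features. A minimal NumPy sketch of that idea follows; the feature shape, the mean-of-adjacent-differences aggregation, the boundary padding, and the `alpha` injection weight are all illustrative assumptions, not the authors' exact design.

```python
import numpy as np

def temporal_feature_difference(feats: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Sketch of a temporal-feature-difference step (assumed design).

    feats: (T, C) array of per-frame spatial features.
    Returns motion-enhanced features of the same shape.
    """
    # Frame-to-frame differences capture short-term dynamics: shape (T-1, C).
    diffs = feats[1:] - feats[:-1]
    # Aggregate motion for each frame as the mean of its adjacent differences,
    # padding the two boundary frames with the nearest difference (an assumption).
    motion = np.empty_like(feats)
    motion[0] = diffs[0]
    motion[-1] = diffs[-1]
    motion[1:-1] = 0.5 * (diffs[:-1] + diffs[1:])
    # Inject the aggregated motion back into the spatial features.
    return feats + alpha * motion
```

For a uniformly drifting feature sequence (each frame equal to the previous plus a constant step), every difference equals that step, so the output is simply the input shifted by `alpha` times the step, which makes the injection easy to sanity-check.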