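Computer science
Robustness (evolution)
Frequency domain
Transformer
Artificial intelligence
Exploit
Benchmark (surveying)
Joint (building)
Pattern recognition (psychology)
Computer vision
Engineering
Architectural engineering
Biochemistry
Chemistry
Computer security
Geodesy
Voltage
Geography
Electrical engineering
Gene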
Authors
Qitao Zhao, Chuansheng Zheng, Mengyuan Liu, Pichao Wang, Chen Chen
Identifier
DOI:10.1109/cvpr52729.2023.00857
Abstract
Transformer-based methods have recently achieved significant success in sequential 2D-to-3D lifting for human pose estimation. As a pioneering work, PoseFormer captures the spatial relations of human joints in each video frame and the human dynamics across frames with cascaded transformer layers, and it has achieved impressive performance. However, in real scenarios, the performance of PoseFormer and its follow-ups is limited by two factors: (a) the length of the input joint sequence; (b) the quality of 2D joint detection. Existing methods typically apply self-attention to all frames of the input sequence, which causes a heavy computational burden when the number of frames is increased to improve estimation accuracy, and they are not robust to the noise that inevitably arises from the limited capability of 2D joint detectors. In this paper, we propose PoseFormerV2, which exploits a compact representation of lengthy skeleton sequences in the frequency domain to efficiently scale up the receptive field and boost robustness to noisy 2D joint detection. With minimal modifications to PoseFormer, the proposed method effectively fuses features in both the time domain and the frequency domain, achieving a better speed-accuracy trade-off than its precursor. Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that the proposed approach significantly outperforms the original PoseFormer and other transformer-based variants. Code is released at https://github.com/QitaoZhao/PoseFormerV2.
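
The idea of a compact frequency-domain representation of a skeleton sequence can be illustrated with a small sketch. The abstract does not say which transform is used, so the code below assumes a discrete cosine transform (DCT) along the time axis and keeps only the lowest-frequency coefficients. The function names, the 81-frame length, the 17-joint layout, and the number of retained coefficients are all illustrative assumptions, not details taken from the released code.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix of shape (n, n); row k is the k-th basis vector."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    m[0] /= np.sqrt(2.0)  # rescale the constant row so that m @ m.T == I
    return m


def compress_sequence(seq, keep):
    """Keep the first `keep` temporal DCT coefficients of a (frames, joints, 2) array."""
    d = dct_matrix(seq.shape[0])
    coeffs = np.tensordot(d, seq, axes=(1, 0))  # transform along the time axis
    return coeffs[:keep]


def reconstruct_sequence(coeffs, frames):
    """Zero-pad the truncated coefficients and invert the DCT."""
    padded = np.zeros((frames,) + coeffs.shape[1:])
    padded[: coeffs.shape[0]] = coeffs
    return np.tensordot(dct_matrix(frames).T, padded, axes=(1, 0))


# Synthetic, deliberately low-frequency "motion" (hypothetical data, not a real dataset).
frames, joints, keep = 81, 17, 9
rng = np.random.default_rng(0)
t = (np.arange(frames) + 0.5) / frames
clean = np.stack(
    [np.cos(np.pi * t)[:, None] * np.ones(joints),
     np.cos(2 * np.pi * t)[:, None] * np.ones(joints)],
    axis=-1,
)  # shape (frames, joints, 2)
noisy = clean + 0.05 * rng.standard_normal(clean.shape)  # simulated detector noise

coeffs = compress_sequence(noisy, keep)        # 9 tokens instead of 81
denoised = reconstruct_sequence(coeffs, frames)

noisy_err = np.sqrt(np.mean((noisy - clean) ** 2))
denoised_err = np.sqrt(np.mean((denoised - clean) ** 2))
print(coeffs.shape, noisy_err, denoised_err)
```

In this toy setting, a model would process 9 coefficient vectors instead of 81 frames, which shortens the sequence passed to self-attention, and discarding high-frequency coefficients removes much of the simulated jitter. Real motion is not as cleanly band-limited as this synthetic signal, so the number of coefficients to keep is a trade-off rather than a fixed choice.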