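Computer science
Closed captioning
Grid
Transformer
Image resolution
Pyramid (geometry)
Artificial intelligence
Semantics (computer science)
Pattern recognition (psychology)
Data mining
Computer vision
Image (mathematics)
Programming language
Voltage
Physics
Geometry
Mathematics
Quantum mechanics
Optics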
Authors
Haonan Zhang,Pengpeng Zeng,Lianli Gao,Xinyu Lyu,Jingkuan Song,Heng Tao Shen
Source
Journal: IEEE Transactions on Circuits and Systems for Video Technology
[Institute of Electrical and Electronics Engineers]
Date: 2023-11-23
Volume (issue): 34 (6): 4829-4842
Citations: 6
Identifier
DOI:10.1109/tcsvt.2023.3336371
Abstract
Existing state-of-the-art approaches to image captioning tend to adopt Transformer-based architectures with grid features. However, these strategies typically process the grid features at a fixed resolution, which often hampers the perception of entities at various scales. In addition, applying them directly may result in the loss of spatial and fine-grained semantic information. To this end, we propose a simple yet effective method, named Spatial Pyramid Transformer (SPT). Specifically, it adopts several parameter-shared pyramid structures to perform semantic interactions across different grid resolutions. In each layer, we design a Spatial-aware Pseudo-supervised (SP) module, which aims to adaptively exploit the spatial information that is disrupted when grid features are flattened. Moreover, to maintain the model size and enhance semantics, we build a simple weighted residual connection, termed the Scale-wise Reinforcement (SR) module, to simultaneously explore both low- and high-level encoded features. Extensive experiments on the MS-COCO benchmark demonstrate that our method achieves new state-of-the-art performance without introducing excessive parameters compared with the vanilla Transformer. In addition, we extend our method to the video captioning task, which further demonstrates its practicability. Code is available at https://github.com/zchoi/SPT.
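As a rough illustration of two ideas mentioned in the abstract (one set of encoder parameters reused across several grid resolutions, and a weighted residual connection that mixes the resulting features), a toy sketch in plain Python might look like the following. This is not the authors' implementation, which is available at the repository linked above. The names `avg_pool`, `upsample`, `shared_encoder`, and `pyramid_fuse`, the scalar "encoder", the pooling factors, and the fixed fusion weights are all assumptions made for illustration only.

```python
from typing import List, Sequence

Grid = List[List[float]]


def avg_pool(grid: Grid, factor: int) -> Grid:
    """Downsample a square grid by averaging non-overlapping factor x factor blocks."""
    n = len(grid) // factor
    return [
        [
            sum(
                grid[i * factor + di][j * factor + dj]
                for di in range(factor)
                for dj in range(factor)
            ) / factor ** 2
            for j in range(n)
        ]
        for i in range(n)
    ]


def upsample(grid: Grid, factor: int) -> Grid:
    """Nearest-neighbour upsampling by an integer factor."""
    size = len(grid) * factor
    return [[grid[i // factor][j // factor] for j in range(size)] for i in range(size)]


def shared_encoder(grid: Grid, scale: float = 2.0) -> Grid:
    """Hypothetical stand-in for an encoder layer: the same parameter
    (a single scalar here) is applied at every resolution."""
    return [[scale * v for v in row] for row in grid]


def pyramid_fuse(
    grid: Grid,
    factors: Sequence[int] = (1, 2),
    weights: Sequence[float] = (0.5, 0.5),
) -> Grid:
    """Encode the grid at several resolutions with shared parameters, then
    add the weighted results back onto the input (a weighted residual)."""
    size = len(grid)
    fused = [row[:] for row in grid]  # residual path starts from the input
    for factor, weight in zip(factors, weights):
        encoded = upsample(shared_encoder(avg_pool(grid, factor)), factor)
        for i in range(size):
            for j in range(size):
                fused[i][j] += weight * encoded[i][j]
    return fused
```

In this toy version the coarse level contributes the same averaged value to every cell it covers, while the fine level keeps per-cell detail. In a real model the fusion weights would be learned and the shared encoder would be a Transformer layer rather than a scalar multiplication.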