Computer science
Artificial intelligence
Segmentation
Computer vision
Image segmentation
Pyramid (geometry)
Natural language processing
Scale-space segmentation
Pattern recognition (psychology)
Physics
Optics
Authors
Zipeng Qin, Jianbo Liu, Xiaolin Zhang, Maoqing Tian, Aojun Zhou, Shuai Yi, Hongsheng Li
Identifier
DOI: 10.1109/tmm.2024.3396281
Abstract
The recently proposed MaskFormer [1] offers a refreshed perspective on semantic segmentation: it shifts from the popular pixel-level classification paradigm to mask-level classification. In essence, it generates paired class probabilities and masks corresponding to category segments and combines them during inference to produce the segmentation maps. In our study, we find that a per-mask classification decoder built on top of a single-scale feature is not effective enough to extract reliable probabilities or masks. To mine rich semantic information across the feature pyramid, we propose the transformer-based Pyramid Fusion Transformer (PFT) for per-mask semantic segmentation with multi-scale features. The proposed transformer decoder performs cross-attention between learnable queries and each spatial feature of the feature pyramid in parallel, and uses cross-scale inter-query attention to exchange complementary information. We achieve competitive performance on three widely used semantic segmentation datasets. In particular, on the ADE20K validation set, our result with a Swin-B backbone surpasses that of MaskFormer with a much larger Swin-L backbone in both single-scale and multi-scale inference, achieving 54.1 mIoU and 55.7 mIoU, respectively. With a Swin-L backbone, we achieve 56.1 mIoU single-scale and 57.4 mIoU multi-scale, obtaining state-of-the-art performance on the dataset. Extensive experiments further verify the effectiveness of the proposed method.
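To make the mask-level classification paradigm concrete, below is a minimal sketch of the inference step the abstract describes: combining N paired class probabilities and binary masks into a per-pixel segmentation map. The function name, tensor shapes, and the handling of a "no object" class follow the published MaskFormer formulation and are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch (assumed shapes, MaskFormer-style semantic inference).
import torch

def semantic_inference(class_logits: torch.Tensor,
                       mask_logits: torch.Tensor) -> torch.Tensor:
    """Combine per-mask class probabilities with mask predictions.

    class_logits: (N, K + 1) scores for N queries over K classes plus an
                  assumed "no object" class in the last column.
    mask_logits:  (N, H, W) binary-mask logits, one per query.
    Returns:      (H, W) predicted class index per pixel.
    """
    # Drop the "no object" column and convert scores to probabilities.
    class_probs = class_logits.softmax(dim=-1)[:, :-1]   # (N, K)
    mask_probs = mask_logits.sigmoid()                   # (N, H, W)
    # Marginalize over the N mask proposals: each pixel's score for class k
    # is the sum over queries of (class prob) * (mask prob at that pixel).
    seg_probs = torch.einsum("nk,nhw->khw", class_probs, mask_probs)
    return seg_probs.argmax(dim=0)                       # (H, W)

# Example with random predictions: 100 queries, 150 classes (as in ADE20K),
# and a 64x64 output map.
if __name__ == "__main__":
    seg = semantic_inference(torch.randn(100, 151), torch.randn(100, 64, 64))
    print(seg.shape)  # torch.Size([64, 64])
```

The single einsum followed by an argmax is what lets a fixed set of queries cover a variable number of category segments: mask proposals that the classifier assigns to "no object" contribute nothing after the last column is dropped.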