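Computer science
Upsampling
Segmentation
Feature (linguistics)
Artificial intelligence
Coding (set theory)
Frame (networking)
Similarity (geometry)
Constraint (computer-aided design)
Image resolution
Decoding methods
Pattern recognition (psychology)
Computer vision
Image (mathematics)
Algorithm
Mechanical engineering
Telecommunications
Philosophy
Linguistics
Set (abstract data type)
Engineering
Programming language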
Authors
Yubin Hu, Yuze He, Yanghao Li, Jisheng Li, Yuxing Han, Jiangtao Wen, Yong-Jin Liu
Identifier
DOI:10.1109/cvpr52729.2023.02167
Abstract
Video semantic segmentation (VSS) is a computationally expensive task due to the per-frame prediction for videos of high frame rates. In recent work, compact models or adaptive network strategies have been proposed for efficient VSS. However, they did not consider a crucial factor that affects the computational cost from the input side: the input resolution. In this paper, we propose an altering resolution framework called AR-Seg for compressed videos to achieve efficient VSS. AR-Seg aims to reduce the computational cost by using low resolution for non-keyframes. To prevent the performance degradation caused by downsampling, we design a Cross Resolution Feature Fusion (CReFF) module, and supervise it with a novel Feature Similarity Training (FST) strategy. Specifically, CReFF first makes use of motion vectors stored in a compressed video to warp features from high-resolution keyframes to low-resolution non-keyframes for better spatial alignment, and then selectively aggregates the warped features with a local attention mechanism. Furthermore, the proposed FST supervises the aggregated features with high-resolution features through an explicit similarity loss and an implicit constraint from the shared decoding layer. Extensive experiments on CamVid and Cityscapes show that AR-Seg achieves state-of-the-art performance and is compatible with different segmentation backbones. On CamVid, AR-Seg saves 67% of the computational cost (measured in GFLOPs) with the PSPNet18 backbone while maintaining high segmentation accuracy. Code: https://github.com/THU-LYJ-Lab/AR-Seg.
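As an illustration only, the two ideas the abstract names (warping keyframe features with motion vectors, and an explicit feature similarity loss) can be sketched in NumPy. This is not the authors' implementation; their code is at the repository linked above. The function names `warp_features` and `similarity_loss`, the motion-vector convention, nearest-neighbor sampling, and the use of mean squared error as the similarity measure are all assumptions made for this sketch. It also assumes both feature maps have already been brought to the same spatial size.

```python
import numpy as np


def warp_features(key_feat, motion_vectors):
    """Backward-warp keyframe features using per-pixel motion vectors.

    key_feat:        array of shape (C, H, W), features from the keyframe.
    motion_vectors:  array of shape (2, H, W); channel 0 is the x offset and
                     channel 1 the y offset (in feature-map pixels) from each
                     non-keyframe location to its source in the keyframe.
                     This convention is an assumption of the sketch.
    """
    _, h, w = key_feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Nearest-neighbor source coordinates, clamped to the feature-map borders.
    src_x = np.clip(np.rint(xs + motion_vectors[0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + motion_vectors[1]).astype(int), 0, h - 1)
    # Integer-array indexing gathers one source vector per target location.
    return key_feat[:, src_y, src_x]


def similarity_loss(aggregated_feat, high_res_feat):
    """Mean squared error between two feature maps of identical shape.

    MSE is a stand-in chosen for this sketch; the abstract only says the
    loss is an explicit similarity loss.
    """
    diff = np.asarray(aggregated_feat) - np.asarray(high_res_feat)
    return float(np.mean(diff ** 2))
```

In a full pipeline, the warped features would then be fused with the low-resolution non-keyframe features (the abstract describes a local attention mechanism for this step), and the loss above would compare the fused result against features computed from the high-resolution version of the same frame.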