Computer Science
Artificial Intelligence
Segmentation
Video Tracking
Image Segmentation
Computer Vision
Multiview Video Coding
Video Compression Picture Types
Video Post-Processing
Video Processing
Authors
Hao Fang,Tong Zhang,Xiaofei Zhou,Xinxin Zhang
Identifier
DOI:10.1109/tcsvt.2024.3361076
Abstract
Recently, Transformer-based offline video instance segmentation (VIS) solutions have made significant progress by decomposing the whole task into global segmentation map generation and instance discrimination. We argue that the quality of the video queries that represent all instances in a video clip is crucial for offline VIS methods. Existing methods typically interact video queries with dense spatio-temporal features, resulting in significant computational complexity and redundant information. Thus, we propose a novel video instance segmentation framework, LBVQ, dedicated to learning better video queries. Specifically, we first obtain the frame queries for each frame independently, without any complex inter-frame spatio-temporal association operations. Second, we propose an adaptive query initialization module (AQI), which adaptively integrates frame queries to initialize video queries instead of using the traditional random initialization strategy. This initialization preserves rich instance clues and accelerates the optimization of the whole model. Finally, to enhance the quality of video queries, we propose a query propagation module (QPM) that captures relevant instance information from frame queries frame by frame, greatly improving the model's understanding of long videos. By learning higher-quality video queries, LBVQ achieves state-of-the-art results on VIS benchmarks with a ResNet-50 backbone: 52.2 AP and 44.8 AP on YouTube-VIS 2019 and 2021, respectively. Moreover, LBVQ achieves 39.7 AP on YouTube-VIS 2022 and 22.2 AP on OVIS, demonstrating superior potential for long videos. To further improve the quality of segmentation masks, a large-scale pretrained SAM is employed to refine the segmentation results. Code is available at https://github.com/fanghaook/LBVQ.
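The abstract describes two components: an adaptive query initialization module (AQI) that pools per-frame queries into initial video queries, and a query propagation module (QPM) that refines them frame by frame. The sketch below is not the authors' released code (see the GitHub link above); the tensor shapes, the score-weighted averaging in AQI, and the cross-attention update in QPM are illustrative assumptions about how such modules could be structured.

```python
# Minimal sketch of AQI-style initialization and QPM-style frame-by-frame
# propagation. All design details here are assumptions for illustration only.
import torch
import torch.nn as nn


class AQI(nn.Module):
    """Initialize video queries by adaptively pooling frame queries over time."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-frame relevance score (assumed design)

    def forward(self, frame_queries: torch.Tensor) -> torch.Tensor:
        # frame_queries: (T, N, C) -- T frames, N queries per frame, C channels
        w = self.score(frame_queries).softmax(dim=0)   # weights over the time axis
        return (w * frame_queries).sum(dim=0)          # video queries: (N, C)


class QPM(nn.Module):
    """Refine video queries frame by frame via cross-attention to frame queries."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_queries: torch.Tensor,
                frame_queries: torch.Tensor) -> torch.Tensor:
        # video_queries: (N, C), frame_queries: (T, N, C)
        q = video_queries.unsqueeze(0)                 # (1, N, C)
        for t in range(frame_queries.size(0)):         # propagate frame by frame
            kv = frame_queries[t].unsqueeze(0)         # queries of frame t: (1, N, C)
            out, _ = self.attn(q, kv, kv)
            q = self.norm(q + out)                     # residual update per frame
        return q.squeeze(0)                            # refined video queries: (N, C)


# Toy usage: 36 frames, 100 queries per frame, 256 channels.
T, N, C = 36, 100, 256
frame_q = torch.randn(T, N, C)      # produced independently per frame by an image decoder
video_q = AQI(C)(frame_q)           # adaptive initialization instead of random init
video_q = QPM(C)(video_q, frame_q)  # frame-by-frame refinement for long videos
```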