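Computer science
Artificial intelligence
Modal verb
Segmentation
Computer vision
Fusion
Image segmentation
Pattern recognition (psychology)
Natural language processing
Linguistics
Philosophy
Chemistry
Polymer chemistry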
Authors
Bochen Xie,Yongjian Deng,Zhanpeng Shao,Youfu Li
Identifier
DOI:10.1109/tmm.2024.3380255
Abstract
Bio-inspired event cameras record a scene as sparse, asynchronous "events" by detecting per-pixel brightness changes. Owing to their high dynamic range and high temporal resolution, such cameras show great potential for challenging scene-understanding tasks. Considering the complementarity between event and standard cameras, we propose a multi-modal fusion network (EISNet) to improve semantic segmentation performance. The key challenges lie in (i) how to encode event data so that it represents accurate scene information and (ii) how to fuse complementary multi-modal features while accounting for the characteristics of the two modalities. To address the first challenge, we propose an Activity-Aware Event Integration Module (AEIM) that converts event data into frame-based representations with high-confidence details via scene activity modeling. To address the second, we introduce the Modality Recalibration and Fusion Module (MRFM), which recalibrates modality-specific representations and then aggregates multi-modal features at multiple stages. MRFM learns to generate modality-oriented masks that guide the merging of complementary features, achieving adaptive fusion. Built on these two core designs, EISNet adopts an encoder-decoder transformer architecture for accurate semantic segmentation from events and images. Experimental results show that our model outperforms state-of-the-art methods by a large margin on event-based semantic segmentation datasets. The code is publicly available at https://github.com/bochenxie/EISNet.
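The abstract does not spell out the internals of AEIM or MRFM, so the sketch below is only a minimal, hypothetical illustration of the two generic ideas they build on. It is not the authors' implementation. First, it accumulates events into a frame-like tensor using a plain per-polarity count histogram, which is a common baseline and not AEIM's activity-aware integration. Second, it fuses image and event feature maps with learned per-modality sigmoid masks, a simplified stand-in for MRFM's mask-guided fusion. The function names, the (x, y, t, p) event layout, and the 1x1 weight matrices `w_img` / `w_evt` are all assumptions made for illustration.

```python
import numpy as np


def events_to_frame(events, height, width):
    """Hypothetical baseline: accumulate events into a 2-channel count image.

    Each event is assumed to be (x, y, t, p) with polarity p > 0 (brighter)
    or p <= 0 (darker). Channel 0 counts positive events, channel 1 negative.
    The timestamp t is ignored in this simple histogram.
    """
    arr = np.asarray(events, dtype=np.float64).reshape(-1, 4)
    frame = np.zeros((2, height, width), dtype=np.float32)
    xs = arr[:, 0].astype(int)
    ys = arr[:, 1].astype(int)
    channel = np.where(arr[:, 3] > 0, 0, 1)
    # np.add.at accumulates correctly even when several events hit one pixel.
    np.add.at(frame, (channel, ys, xs), 1.0)
    return frame


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def mask_guided_fusion(f_img, f_evt, w_img, w_evt):
    """Hypothetical mask-guided fusion of two (C, H, W) feature maps.

    w_img and w_evt are assumed (C, 2C) weights acting like 1x1 convolutions
    on the concatenated features; each yields a per-pixel, per-channel mask
    in (0, 1) that scales its modality before the two are summed.
    """
    concat = np.concatenate([f_img, f_evt], axis=0)  # (2C, H, W)
    m_img = sigmoid(np.einsum("oc,chw->ohw", w_img, concat))
    m_evt = sigmoid(np.einsum("oc,chw->ohw", w_evt, concat))
    return m_img * f_img + m_evt * f_evt
```

With all-zero weights both masks equal 0.5, so the fusion reduces to the plain average of the two feature maps; trained weights would instead let the network lean on event features where the image is unreliable (for example in over-exposed or motion-blurred regions) and on image features elsewhere.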