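Computer science
Artificial intelligence
Feature (linguistics)
Object detection
Distillation
Feature extraction
Object (grammar)
Pattern recognition (psychology)
Layer (electronics)
RGB color model
Computer vision
Set (abstract data type)
Philosophy
Linguistics
Chemistry
Organic chemistry
Programming language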
Identifier
DOI:10.1109/iccc59590.2023.10507606
Abstract
Sound, as an inherent attribute of objects, can provide valuable information for object detection. At present, methods that localize objects solely by monitoring ambient sound are not robust. To address this problem, a multimodal self-supervised knowledge distillation object detection network with cross-level feature alignment is proposed. With RGB and depth images as the inputs to the teacher networks and audio as the input to the student network, a multi-teacher cross-level feature alignment loss based on attention fusion is designed. It integrates the student's deep and shallow features to learn the teachers' corresponding middle-layer features, so that comprehensive knowledge is extracted more efficiently. A localization distillation loss is also added to obtain more localization information. On the Multimodal Audio-Visual Detection (MAVD) dataset, the mAP of the network increased by 11.6% compared with the baseline network, demonstrating the superiority of the proposed detection network.
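The abstract does not give the form of the alignment loss, so the following is only a minimal sketch of one way an attention-weighted multi-teacher feature alignment loss could look. Everything in it is an assumption made for illustration: the function names, the averaging used to fuse the student's shallow and deep features, the scaled dot-product attention over teachers, and the use of mean squared error are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fuse_student_features(shallow, deep):
    # Assumption: fuse by element-wise averaging. The paper's actual
    # fusion of deep and shallow student features is not specified here.
    return [(s + d) / 2.0 for s, d in zip(shallow, deep)]


def multi_teacher_alignment_loss(shallow, deep, teacher_feats):
    """Hypothetical attention-weighted alignment loss.

    shallow, deep: student (audio) feature vectors of equal length.
    teacher_feats: list of teacher middle-layer feature vectors,
                   e.g. one for the RGB teacher and one for the depth teacher.
    Returns (loss, attention_weights).
    """
    fused = fuse_student_features(shallow, deep)
    scale = math.sqrt(len(fused))
    # Assumption: scaled dot-product similarity decides how much
    # each teacher contributes to the loss.
    scores = [dot(fused, t) / scale for t in teacher_feats]
    weights = softmax(scores)
    loss = sum(w * mse(fused, t) for w, t in zip(weights, teacher_feats))
    return loss, weights


if __name__ == "__main__":
    rgb_teacher = [1.0, 1.0]
    depth_teacher = [0.0, 0.0]
    loss, weights = multi_teacher_alignment_loss(
        [1.0, 0.0], [0.0, 1.0], [rgb_teacher, depth_teacher]
    )
    print(loss, weights)
```

In a real network the vectors would be feature maps from intermediate layers, and this term would be combined with the detection loss and the localization distillation loss mentioned in the abstract; the weighting between those terms is not stated in the source.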