Computer science
Embedding
Artificial intelligence
Encoder
Computer vision
Object detection
Unmanned aerial vehicle (UAV)
Detector
Autoencoder
Discriminator
Domain (mathematical analysis)
Key (lock)
Feature (linguistics)
Invariant (physics)
Feature learning
Deep learning
Pattern recognition (psychology)
Telecommunications
Mathematical analysis
Linguistics
Philosophy
Physics
Mathematics
Computer security
Biology
Mathematical physics
Genetics
Operating system
Authors
Jingfan Liu,Jingjing Cui,Mao Ye,Xiatian Zhu,Song Tang
Identifier
DOI:10.1016/j.eswa.2024.123221
Abstract
The increasing use of unmanned aerial vehicle (UAV) devices in diverse fields such as agriculture, surveillance, and aerial photography has created a significant demand for intelligent object detection. The key challenge lies in handling unconstrained shooting-condition variations (e.g., weather, view, altitude). Previous methods based on data augmentation or adversarial learning try to extract shooting-condition-invariant features, but they are constrained by the large number of combinations of different shooting conditions. To address this limitation, we introduce a novel Language Guided UAV Detection Network Training Method (LGNet), capable of leveraging pre-trained multi-modal representations (e.g., CLIP) as a learning structure reference, and designed as a model-agnostic strategy applicable to various detection models. The key idea is to remove language-described domain-specific features from the visual-language feature space, enhancing tolerance to variations in shooting conditions. Concretely, we fine-tune a text prompt embedding describing the shooting condition and feed it into the CLIP text encoder to obtain more accurate domain-specific features. By aligning the features from the detector backbone with those of the CLIP image encoder, we situate the features within a visual-language space, while pushing them away from language-encoded domain-specific features so that they become domain-invariant. Extensive experiments demonstrate that LGNet, as a generic training plug-in, boosts the state-of-the-art performance of various base detectors, achieving gains of 0.9–1.7% in Average Precision (AP) on the UAVDT dataset and 1.0–2.4% on the VisDrone dataset.
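The abstract's core idea (pull detector-backbone features toward the CLIP image-feature space while pushing them away from language-encoded domain-specific features) can be sketched as a simple cosine-similarity objective. This is an illustrative sketch only; the loss form, margin, and feature names here are assumptions, not the exact objective used in LGNet.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def language_guided_loss(f_backbone, f_clip_image, f_domain_text, margin=0.0):
    """Hypothetical LGNet-style objective (names and form are assumptions):
    - 'align' pulls the detector feature toward the CLIP image feature,
      situating it in the visual-language space;
    - 'repel' penalizes similarity with the language-encoded
      domain-specific feature, encouraging domain invariance."""
    align = 1.0 - cosine(f_backbone, f_clip_image)
    repel = max(0.0, cosine(f_backbone, f_domain_text) - margin)
    return align + repel

# Toy check with random stand-in features (no real CLIP model needed).
rng = np.random.default_rng(0)
f = rng.normal(size=512)
v = f + 0.1 * rng.normal(size=512)   # near-aligned CLIP image feature
d = rng.normal(size=512)             # unrelated domain text feature
print(language_guided_loss(f, v, d))
```

In a real training loop the same term would be added to the detector's standard detection losses, with `f_clip_image` and `f_domain_text` produced by frozen CLIP encoders and the prompt embedding fine-tuned as the abstract describes.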