Computer science
Embedding
Artificial intelligence
Encoder
Computer vision
Object detection
Unmanned aerial vehicle (UAV)
Detector
Autoencoder
Discriminator
Domain (mathematical analysis)
Key (lock)
Feature (linguistics)
Invariant (physics)
Feature learning
Deep learning
Pattern recognition (psychology)
Telecommunications
Mathematical analysis
Linguistics
Philosophy
Physics
Mathematics
Computer security
Biology
Mathematical physics
Genetics
Operating system
Authors
Jingfan Liu,Jingjing Cui,Mao Ye,Xiatian Zhu,Song Tang
Identifier
DOI:10.1016/j.eswa.2024.123221
Abstract
The increasing use of unmanned aerial vehicle (UAV) devices in diverse fields such as agriculture, surveillance, and aerial photography has created a significant demand for intelligent object detection. The key challenge lies in handling unconstrained shooting-condition variations (e.g., weather, view, altitude). Previous methods based on data augmentation or adversarial learning try to extract shooting-condition-invariant features, but they are constrained by the large number of combinations of different shooting conditions. To address this limitation, we introduce a novel Language Guided UAV Detection Network Training Method (LGNet), capable of leveraging pre-trained multi-modal representations (e.g., CLIP) as a learning structure reference, and designed as a model-agnostic strategy applicable to various detection models. The key idea is to remove language-described domain-specific features from the visual-language feature space, enhancing tolerance to variations in shooting conditions. Concretely, we fine-tune a text prompt embedding describing the shooting condition and feed it into the CLIP text encoder to obtain more accurate domain-specific features. By aligning the features from the detector backbone with those of the CLIP image encoder, we situate the features within a visual-language space, while pushing them away from language-encoded domain-specific features so that they become domain-invariant. Extensive experiments demonstrate that LGNet, as a generic training plug-in, boosts the state-of-the-art performance of various base detectors, achieving gains of 0.9–1.7% in Average Precision (AP) on the UAVDT dataset and 1.0–2.4% on the VisDrone dataset.
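The abstract's core idea (pull detector-backbone features toward the CLIP image-feature space while pushing them away from language-encoded domain-specific features) can be sketched as a simple cosine-similarity objective. This is an illustrative sketch only; the loss form, margin, and feature names here are assumptions, not the exact objective used in LGNet.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def language_guided_loss(f_backbone, f_clip_image, f_domain_text, margin=0.0):
    """Hypothetical LGNet-style objective (names and form are assumptions):
    - 'align' pulls the detector feature toward the CLIP image feature,
      situating it in the visual-language space;
    - 'repel' penalizes similarity with the language-encoded
      domain-specific feature, encouraging domain invariance."""
    align = 1.0 - cosine(f_backbone, f_clip_image)
    repel = max(0.0, cosine(f_backbone, f_domain_text) - margin)
    return align + repel

# Toy check with random stand-in features (no real CLIP model needed).
rng = np.random.default_rng(0)
f = rng.normal(size=512)
v = f + 0.1 * rng.normal(size=512)   # near-aligned CLIP image feature
d = rng.normal(size=512)             # unrelated domain text feature
print(language_guided_loss(f, v, d))
```

In a real training loop the same term would be added to the detector's standard detection losses, with `f_clip_image` and `f_domain_text` produced by frozen CLIP encoders and the prompt embedding fine-tuned as the abstract describes.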