Keywords
Modality (human-computer interaction)
Artificial intelligence
Computer science
Tracking
Computer vision
Pattern recognition
Machine learning
Authors
Tianlu Zhang, Qiang Zhang, Kurt Debattista, Jungong Han
Identifier
DOI: 10.1109/tpami.2025.3555485
Abstract
Contemporary multi-modal trackers achieve strong performance by leveraging complex backbones and fusion strategies, but this comes at the cost of computational efficiency, limiting their deployment in resource-constrained settings. In contrast, compact multi-modal trackers are more efficient but often suffer from reduced performance due to limited feature representation. To mitigate the performance gap between compact and more complex trackers, we introduce a cross-modality distillation framework. This framework includes a complementarity-aware masked autoencoder designed to enhance cross-modal interactions by selectively masking patches within a modality, thereby forcing the model to learn more robust multi-modal representations. Additionally, we present a specific-common feature distillation module that transfers both modality-specific and shared information from a more powerful model's backbone to the compact model. Moreover, we develop a multi-path selection distillation module that guides a simple fusion module to learn more accurate multi-modal information from a sophisticated fusion mechanism via multiple paths. Extensive experiments on six multi-modal tracking benchmarks demonstrate that the proposed tracker, despite being lightweight, outperforms most state-of-the-art methods, highlighting its effectiveness. Notably, our tiny variant achieves PR scores of 67.5% on LasHeR, 58.5% on DepthTrack, and 73.1% on VisEvent with only 6.5M parameters, while operating at 126 FPS on an NVIDIA 2080Ti GPU.