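Shot (pellet)
Function (biology)
Inference
Zero (linguistics)
Computer science
Geography
Artificial intelligence
Computer vision
Cartography
Linguistics
Chemistry
Philosophy
Organic chemistry
Evolutionary biology
Biology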
Authors
Weiming Huang,Jing Wang,Gao Cong
Identifier
DOI:10.1080/13658816.2024.2347322
Abstract
Inferring urban functions using street view images (SVIs) has gained tremendous momentum. The recent prosperity of large-scale vision-language pretrained models sheds light on addressing some long-standing challenges in this regard, for example, heavy reliance on labeled samples and computing resources. In this paper, we present a novel prompting framework that enables the pretrained vision-language model CLIP to effectively infer fine-grained urban functions from SVIs in a zero-shot manner, that is, without labeled samples or model training. The prompting framework, UrbanCLIP, comprises an urban taxonomy and several urban function prompt templates, in order to (1) bridge the abstract urban function categories and concrete urban object types that can be readily understood by CLIP, and (2) mitigate the interference in SVIs, for example, street-side trees and vehicles. We conduct extensive experiments to verify the effectiveness of UrbanCLIP. The results indicate that the zero-shot UrbanCLIP largely surpasses several competitive supervised baselines, such as a fine-tuned ResNet, and that its advantages become more prominent in cross-city transfer tests. In addition, UrbanCLIP's zero-shot performance is considerably better than that of vanilla CLIP. Overall, UrbanCLIP is a simple yet effective framework for urban function inference, and showcases the potential of foundation models for geospatial applications.
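The abstract does not give implementation details, so the following is only a minimal sketch of the general idea it describes: expand each abstract urban function into concrete object types through a taxonomy, wrap those in prompt templates, and pick the function whose best prompt is most similar to the image embedding. The taxonomy entries, the template wording, the max-over-prompts aggregation, and the bag-of-words `toy_embed` encoder are all illustrative assumptions, not the authors' actual choices. In a real setting the toy encoder would be replaced by CLIP's text and image encoders.

```python
import math

# Hypothetical taxonomy: urban function -> concrete object types (illustrative only).
TAXONOMY = {
    "commercial": ["restaurant", "shop"],
    "residential": ["apartment building", "house"],
    "education": ["school", "university campus"],
}

# Hypothetical prompt templates; the second mentions typical street-side clutter.
TEMPLATES = [
    "a street view photo of a {}",
    "a {} behind trees and vehicles",
]

# Toy vocabulary built from the object types, used only by the stand-in encoder.
VOCAB = sorted({w for objs in TAXONOMY.values() for o in objs for w in o.split()})


def toy_embed(text):
    """Stand-in for a real text/image encoder: bag-of-words counts over VOCAB."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def build_prompts(taxonomy, templates):
    """Return (function, prompt) pairs for every object type and template."""
    return [
        (func, template.format(obj))
        for func, objs in taxonomy.items()
        for obj in objs
        for template in templates
    ]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(image_vec, embed_text, prompts):
    """Score each function by its best-matching prompt (an assumed aggregation)."""
    best = {}
    for func, prompt in prompts:
        score = cosine(image_vec, embed_text(prompt))
        best[func] = max(best.get(func, -1.0), score)
    return max(best, key=best.get), best


if __name__ == "__main__":
    prompts = build_prompts(TAXONOMY, TEMPLATES)
    # Pretend the image embedding resembles the word "restaurant".
    label, scores = classify(toy_embed("restaurant"), toy_embed, prompts)
    print(label, scores)
```

No labeled samples or training step appear anywhere in the sketch; all task knowledge lives in the taxonomy and templates, which is what makes the approach zero-shot.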