Spatio-temporal prediction aims to forecast and gain insights into the ever-changing dynamics of urban environments across both time and space. Its purpose is to anticipate future patterns, trends, and events in diverse facets of urban life, including transportation, population movement, and crime rates. Although numerous efforts have been dedicated to developing neural network techniques for accurate predictions on spatio-temporal data, it is important to note that many of these methods heavily depend on having sufficient labeled data to generate precise spatio-temporal representations. Unfortunately, the issue of data scarcity is pervasive in practical urban sensing scenarios. In certain cases, it becomes challenging to collect any labeled data from downstream scenarios, intensifying the problem further. Consequently, it becomes necessary to build a spatio-temporal model that can exhibit strong generalization capabilities across diverse spatio-temporal learning scenarios.