Citywide spatio-temporal (ST) forecasting is a fundamental task for many urban applications, including traffic accident prediction, taxi demand planning, and crowd flow forecasting. The goal is to generate accurate predictions for all regions of a city simultaneously. Prior work has devoted considerable effort to modeling ST correlations. However, it often overlooks the intrinsic correlations among regions and the inherent data distribution across the city, both of which are shaped by urban zoning and functionality, and this leads to inferior performance on citywide ST forecasting. In this paper, we introduce CityCAN, a novel causal attention network that jointly generates predictions for every region of a city. We first present a causal framework that uses an intervention strategy to identify useful correlations among regions and filter out useless ones. Within this framework, we design a Global Local-Attention Encoder that leverages attention mechanisms to jointly learn local and global ST correlations among the correlated regions. We then design a citywide loss that constrains the distribution of predictions by incorporating the citywide data distribution. Extensive experiments on three real-world applications demonstrate the effectiveness of CityCAN.