Rapid global urbanization is a double-edged sword, heralding promises of economical prosperity and public health while also posing unique environmental and humanitarian challenges. Smart and connected communities (S&CCs) apply data-centric solutions to these problems by integrating artificial intelligence (AI) and the Internet of Things (IoT). This coupling of intelligent technologies also poses interesting system design challenges regarding heterogeneous data fusion and task diversity. Transformers are of particular interest to address these problems, given their success across diverse fields of natural language processing (NLP), computer vision, time-series regression, and multi-modal data fusion. This begs the question whether Transformers can be further diversified to leverage fusions of IoT data sources for heterogeneous multi-task learning in S&CC trade spaces. In this paper, a Transformer-based AI system for emerging smart cities is proposed. Designed using a pure encoder backbone, and further customized through interchangeable input embedding and output task heads, the system supports virtually any input data and output task types present S&CCs. This generalizability is demonstrated through learning diverse task sets representative of S&CC environments, including multivariate time-series regression, visual plant disease classification, and image-time-series fusion tasks using a combination of Beijing PM2.5 and Plant Village datasets. Simulation results show that the proposed Transformer-based system can handle various input data types via custom sequence embedding techniques, and are naturally suited to learning a diverse set of tasks. The results also show that multi-task learners increase both memory and computational efficiency while maintaining comparable performance to both single-task variants, and non-Transformer baselines.