With the tremendous success of large language models such as ChatGPT, artificial intelligence has entered a new era of large models. Multimodal data, which can comprehensively perceive and recognize the physical world, has become an essential path towards general artificial intelligence. However, multimodal large models trained on public datasets often underperform in specific industrial domains. In this paper, we tackle the problem of building large vision-language intelligent models for specific industrial domains by leveraging the general large models and federated learning. We compare the challenges faced by federated learning in the era of small models and large models from different dimensions, and propose a technical framework for federated learning in the era of large models.Specifically, our framework mainly considers three aspects: heterogeneous model fusion, flexible aggregation methods, and data quality improvement. Based on this framework, we conduct a case study of leading enterprises contributing vision-language data and expert knowledge to city safety operation management. The preliminary experiments show that enterprises can enhance and accumulate their intelligence capabilities through federated learning, and jointly create an intelligent city model that provides high-quality intelligent services covering energy infrastructure security, residential community security and urban operation management.