Inferring fine-grained urban flows from coarse-grained flow observations is practically important to many smart-city applications. However, collected human and vehicle trajectory flows are often unreliable: they may contain various kinds of noise and are sometimes incomplete, which poses great challenges to existing approaches. In this paper, we present a pioneering study on robust fine-grained urban flow inference from noisy and incomplete urban flow observations, and propose a denoising diffusion model named DiffUFlow to address it. Specifically, we propose an improved reverse diffusion strategy, together with a spatial-temporal feature extraction network called STFormer and a semantic feature extraction network called ELFetcher. We overlay the spatial-temporal feature map extracted by STFormer onto the coarse-grained flow map, which serves as conditional guidance for the reverse diffusion process. We further integrate the semantic features extracted by ELFetcher into cross-attention layers, so that fine-grained inference can take into account semantic information drawn from the full range of urban data. Extensive experiments on two large real-world datasets validate the effectiveness of our method compared with state-of-the-art baselines.
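
To make the conditioning step concrete, the following is a minimal sketch of how a coarse-grained flow map and a spatial-temporal feature map might be combined into a conditional input for a denoiser. It is not the DiffUFlow implementation. The function names (`upsample_nearest`, `build_condition`, `denoiser_input`), the nearest-neighbor upsampling, the reading of "overlay" as element-wise addition, and the toy values are all assumptions made for illustration.

```python
def upsample_nearest(grid, scale):
    """Repeat each cell of a 2D grid `scale` times along both axes."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(scale)]
        out.extend(list(wide) for _ in range(scale))
    return out


def build_condition(coarse, st_features, scale):
    """Overlay a feature map on the upsampled coarse map (element-wise sum).

    Assumption: 'overlay' is read as addition; concatenation along a
    channel axis would be an equally plausible reading.
    """
    up = upsample_nearest(coarse, scale)
    if len(up) != len(st_features) or len(up[0]) != len(st_features[0]):
        raise ValueError("feature map must match the upsampled resolution")
    return [[u + f for u, f in zip(ur, fr)] for ur, fr in zip(up, st_features)]


def denoiser_input(noisy_fine, condition):
    """Stack the noisy fine-grained sample and the condition as two channels."""
    return [noisy_fine, condition]


# Toy data: a 1x2 coarse-grained flow map and a hypothetical 2x4 feature map.
coarse = [[4.0, 8.0]]
features = [[0.1, 0.2, 0.3, 0.4],
            [0.5, 0.6, 0.7, 0.8]]
cond = build_condition(coarse, features, scale=2)
```

In a diffusion setting, a two-channel input like the one returned by `denoiser_input` would be passed to the denoising network at every reverse step, so that each step is guided by the same coarse-grained evidence.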