Urban traffic accidents often cause property losses, environmental pollution, casualties, and congestion. Predicting the spatio-temporal range of accident-induced congestion allows appropriate response measures to be taken in a timely manner, which mitigates these negative effects. Unlike most existing spatio-temporal prediction strategies for traffic accidents, which depend on established traffic models, this paper proposes a model-free method based on macroscopic road network images, which removes the need for precise modeling of traffic dynamics and for detailed traffic data. Specifically, we first design a digital twin road network to observe traffic operation from a macroscopic perspective. Then, after designing the structure of the Convolutional LSTM (Conv-LSTM) cell, we stack multiple Conv-LSTM layers into an encoder-decoder structure that predicts the spatio-temporal congestion caused by accidents in urban road networks. Finally, simulation results indicate that the proposed method achieves higher prediction accuracy than both the model-based method and the LSTM network model. The proposed strategy thus provides a new approach to predicting accident-induced spatio-temporal congestion from a macroscopic perspective.
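To make the Conv-LSTM cell mentioned above more concrete, the following is a minimal sketch of one time step of a standard Conv-LSTM cell (without peephole connections), written in pure Python. It is not the cell design of this paper, whose details are not given here. The names `conv2d_same`, `conv_lstm_step`, and the layout of `params` are assumptions introduced only for illustration, and the input is a single-channel grid standing in for a road network image.

```python
import math

def conv2d_same(img, kernel):
    """Zero-padded 'same' 2D cross-correlation on a single-channel grid."""
    h, w = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + i - ph, c + j - pw
                    # Pixels outside the grid are treated as zero.
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[i][j] * img[rr][cc]
            out[r][c] = acc
    return out

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conv_lstm_step(x, h_prev, c_prev, params):
    """One Conv-LSTM time step (hypothetical helper, no peephole terms).

    params maps each gate name ("i", "f", "o", "g") to a tuple
    (input kernel, hidden-state kernel, scalar bias).
    Returns the new hidden state and the new cell state.
    """
    rows, cols = len(x), len(x[0])

    def gate(name, act):
        w_x, w_h, b = params[name]
        zx = conv2d_same(x, w_x)        # convolution over the input frame
        zh = conv2d_same(h_prev, w_h)   # convolution over the previous hidden state
        return [[act(zx[r][c] + zh[r][c] + b) for c in range(cols)]
                for r in range(rows)]

    i = gate("i", sigmoid)    # input gate
    f = gate("f", sigmoid)    # forget gate
    o = gate("o", sigmoid)    # output gate
    g = gate("g", math.tanh)  # candidate cell update

    c_new = [[f[r][c] * c_prev[r][c] + i[r][c] * g[r][c] for c in range(cols)]
             for r in range(rows)]
    h_new = [[o[r][c] * math.tanh(c_new[r][c]) for c in range(cols)]
             for r in range(rows)]
    return h_new, c_new
```

In a stacked encoder-decoder arrangement, the hidden state returned by one layer would serve as the input frame of the next. That wiring is omitted here. The sketch only shows how the matrix products of an ordinary LSTM are replaced by convolutions, so that the spatial layout of the image is preserved in the hidden and cell states.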