Electrocardiography (ECG) is a crucial tool for diagnosing cardiovascular diseases, and computer-aided automatic ECG analysis can improve diagnostic accuracy. In this study, to address the limitations of existing 2D classification models (CNN and Transformer) for ECG heartbeat images, we propose SRT, a 2D deep learning classification network that combines CNN and Transformer-encoder modules. To capture the global and local differences between types of ECG heartbeats, we introduce a novel attention module (the SC-RGA module), which assists the multi-head attention mechanism in extracting features from ECG images, and a Dilated Stem structure, which expands the receptive field at the network input. We also apply a series of strategies to handle the class imbalance in public ECG datasets. Finally, we evaluate SRT on the MIT-BIH arrhythmia database, where it achieves a classification accuracy of 0.957, a sensitivity of 0.881, and an average F1 score of 0.826. Comparative experiments show that the proposed model outperforms several advanced methods.
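
For readers less familiar with the three reported metrics, the following is a minimal sketch of how accuracy, sensitivity, and an average F1 can be derived from a multi-class confusion matrix. It assumes macro-averaging (an unweighted mean over classes) for sensitivity and F1; the abstract does not state which averaging scheme was used, so this is an assumption. The function name `classification_metrics` and the toy two-class matrix are hypothetical and are not taken from the study.

```python
def classification_metrics(confusion):
    """Compute accuracy, macro sensitivity, and macro F1.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    Macro-averaging is an assumption, not the paper's stated method.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))

    recalls, f1_scores = [], []
    for i in range(n):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                        # true class i, predicted otherwise
        fp = sum(confusion[r][i] for r in range(n)) - tp   # predicted i, true class otherwise
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        recalls.append(recall)
        f1_scores.append(f1)

    return {
        "accuracy": correct / total,
        "sensitivity": sum(recalls) / n,   # macro-averaged recall
        "f1": sum(f1_scores) / n,          # macro-averaged F1
    }


# Hypothetical, imbalanced two-class example (not data from the study).
toy_confusion = [
    [90, 10],  # true class 0
    [5, 15],   # true class 1
]
metrics = classification_metrics(toy_confusion)
print(metrics)
```

On imbalanced data such as this toy matrix, accuracy is dominated by the majority class, while the macro-averaged sensitivity and F1 weight every class equally. The same effect may explain why the reported accuracy is higher than the reported sensitivity and F1.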