Researchers have proposed numerous features and models for ECG heartbeat classification under the intra-patient paradigm, but their performance degrades under the inter-patient paradigm. Although some state-of-the-art results have been reported in recent years under the inter-patient paradigm, many of them deviate from the standard test protocol, and under strict test protocols the performance on minority classes remains unsatisfactory for practical applications. This paper presents a novel framework based on a lightweight Transformer combined with a CNN and a denoising autoencoder, which improves minority-class performance under the standard test protocol. The proposed model includes a new seq2seq network that extracts local features from a single heartbeat using a CNN or a denoising encoder, and attends to global features from neighboring heartbeats using a lightweight Transformer encoder. In particular, we pretrained the autoencoder on the MIT-BIH dataset and an additional dataset, and considered several transfer modes for feature representation. We organized multiple consecutive heartbeats into a vector sequence, so that each heartbeat incorporates information from its neighbors to improve its feature representation. The models were evaluated on the MIT-BIH inter-patient dataset, following the AAMI standard. The Transformer with CNN embedding achieved an overall accuracy of 97.66% on the test set, while the Transformer with the pretrained denoising autoencoder achieved an overall accuracy of 97.93%. These results demonstrate the promising performance of our models for imbalanced inter-patient ECG classification under the standard test protocol.
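To make the idea of organizing consecutive heartbeats into a vector sequence concrete, the following is a minimal sketch, not the authors' implementation. It assumes the heartbeats of one record have already been segmented into fixed-length vectors in temporal order. The function name `group_heartbeats` and the parameters `seq_len` and `stride` are hypothetical, and their default values are illustrative only, since the sequence length actually used is not stated here.

```python
import numpy as np


def group_heartbeats(beats, labels, seq_len=8, stride=8):
    """Group consecutive heartbeats of one record into sequences.

    beats  : array of shape (n_beats, beat_len), in temporal order
    labels : array of shape (n_beats,), one class label per heartbeat
    seq_len, stride : hypothetical settings, not taken from the paper

    Returns (X, y) with shapes (n_seq, seq_len, beat_len) and
    (n_seq, seq_len), so a seq2seq model can predict one label per beat
    while attending to the neighboring beats in the same sequence.
    """
    beats = np.asarray(beats)
    labels = np.asarray(labels)
    if beats.ndim != 2:
        raise ValueError("beats must have shape (n_beats, beat_len)")
    if beats.shape[0] != labels.shape[0]:
        raise ValueError("beats and labels must have the same length")
    if seq_len < 1 or stride < 1:
        raise ValueError("seq_len and stride must be positive")

    # Start index of every full window; trailing beats that do not
    # fill a complete window are dropped in this sketch.
    starts = list(range(0, beats.shape[0] - seq_len + 1, stride))
    if not starts:
        empty_x = np.empty((0, seq_len, beats.shape[1]), dtype=beats.dtype)
        empty_y = np.empty((0, seq_len), dtype=labels.dtype)
        return empty_x, empty_y

    X = np.stack([beats[s:s + seq_len] for s in starts])
    y = np.stack([labels[s:s + seq_len] for s in starts])
    return X, y
```

With `stride` equal to `seq_len` the windows do not overlap; a smaller `stride` yields overlapping windows, so each heartbeat appears with several different sets of neighbors. Each resulting sequence could then be passed through a per-beat embedding (a CNN or a pretrained encoder) before a Transformer encoder, which is the arrangement described above.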