Abstract The vibration signal of a bearing is closely related to its fault condition, and the quality of the features extracted from that signal strongly affects the accuracy of fault diagnosis. This paper proposes a new method that combines a multi-scale autoencoder (AE) with a generative adversarial network to extract deep, sensitive features from the signal, and couples it with a classifier for fault diagnosis. The AE serves as the generator (i.e., the generator consists of an encoder and a decoder) and is trained with both an adversarial objective and a reconstruction objective. The better the generator is trained, the better the encoder is trained, and hence the better the features it extracts (the encoder output). These features are then fed as new inputs to the classifier, which outputs the fault type. The method addresses two weaknesses of traditional bearing fault diagnosis: weak feature representation and over-reliance on expert knowledge. Compared with most existing neural network models for fault diagnosis, it also achieves higher accuracy, especially on difficult diagnosis tasks. To further verify the effectiveness of the proposed model, a bearing test rig was built, and the data collected from it are used for fault diagnosis to demonstrate the superiority of the proposed method.
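The data flow described above (an encoder and decoder acting together as the generator, a discriminator judging the reconstruction, and a classifier operating on the encoder output) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the layer sizes, the single-layer untrained networks, the names `encode`, `decode`, `discriminate`, `generator_loss` and `classify`, and the weighting factor `lam`. The multi-scale structure and the training loop are omitted; only the forward pass and one plausible form of the combined generator objective are shown.

```python
import math
import random

random.seed(0)

# Hypothetical sizes, chosen only to keep the example small.
SIGNAL_LEN, FEATURE_LEN, NUM_CLASSES = 8, 3, 4

def init(rows, cols):
    """Random weight matrix stored as a list of rows (untrained)."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def linear(weights, x):
    """Dense layer without bias: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

W_enc = init(FEATURE_LEN, SIGNAL_LEN)   # encoder: signal -> features
W_dec = init(SIGNAL_LEN, FEATURE_LEN)   # decoder: features -> reconstruction
W_disc = init(1, SIGNAL_LEN)            # discriminator: signal -> realness score
W_cls = init(NUM_CLASSES, FEATURE_LEN)  # classifier: features -> class scores

def encode(x):
    """Encoder half of the generator; its output is the extracted feature."""
    return [math.tanh(v) for v in linear(W_enc, x)]

def decode(z):
    """Decoder half of the generator; reconstructs the signal from features."""
    return linear(W_dec, z)

def discriminate(x):
    """Probability, in (0, 1), that x is a real rather than reconstructed signal."""
    return sigmoid(linear(W_disc, x)[0])

def generator_loss(x, lam=1.0):
    """Adversarial term plus lam times the mean squared reconstruction error."""
    x_hat = decode(encode(x))
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    adversarial = -math.log(discriminate(x_hat) + 1e-12)  # reward fooling the discriminator
    return adversarial + lam * reconstruction

def classify(x):
    """Predict a fault class index from the encoder features, not the raw signal."""
    scores = linear(W_cls, encode(x))
    return max(range(NUM_CLASSES), key=lambda k: scores[k])

# Toy input standing in for a vibration segment.
signal = [math.sin(0.5 * i) for i in range(SIGNAL_LEN)]
print(len(encode(signal)), generator_loss(signal) > 0, classify(signal) in range(NUM_CLASSES))
```

In this sketch, lowering `generator_loss` requires the reconstruction to be both close to the input and convincing to the discriminator, which is the sense in which a better-trained generator implies a better-trained encoder. In practice each component would be a deep network trained with an automatic differentiation framework.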