With the help of modern generative techniques, it is possible to generate or manipulate image and video content without introducing obvious visual artifacts. If such manipulated images or videos are misused, they may have a severe negative impact on society and individuals. Deepfake detection has therefore attracted considerable attention in recent years. Although existing methods achieve good detection performance on high-quality datasets, their results on low-quality datasets and in cross-dataset evaluation remain far from satisfactory. In this paper, we therefore propose a new CNN-based method that combines multi-scale convolution with a vision transformer for deepfake detection. In the proposed model, we design a multi-scale module with dilated convolution and depthwise separable convolution to capture face details and tampering artifacts at different scales. Furthermore, unlike traditional classification modules, we employ a vision transformer to learn the global information of face features for classification. Extensive experiments demonstrate that, in most cases, the proposed method achieves better detection results than related state-of-the-art methods on both high-quality and low-quality datasets, and that it generalizes well across datasets. In addition, extensive ablation experiments are provided to verify the rationality of the proposed network design.
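To make the described architecture concrete, the following is a minimal PyTorch sketch of the overall idea: a multi-scale module built from dilated, depthwise separable convolutions whose features are flattened into tokens and classified by a small transformer encoder. All module names, dilation rates, channel widths, and transformer hyperparameters are illustrative assumptions, not the paper's exact configuration, and positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution followed by a pointwise (1x1) convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2  # keep spatial size
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, padding=padding,
                                   dilation=dilation, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))


class MultiScaleBlock(nn.Module):
    """Parallel depthwise separable branches with different dilation rates,
    concatenated and fused to capture artifacts at multiple scales."""
    def __init__(self, in_ch, branch_ch=64, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            DepthwiseSeparableConv(in_ch, branch_ch, dilation=d) for d in dilations
        )
        self.fuse = nn.Conv2d(branch_ch * len(dilations), in_ch, 1, bias=False)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))


class ConvViTDetector(nn.Module):
    """CNN stem + multi-scale module, followed by a transformer encoder over
    the flattened feature map and a binary (real/fake) classification head."""
    def __init__(self, in_ch=3, embed_dim=128, num_heads=4, depth=2, num_classes=2):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, embed_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.multi_scale = MultiScaleBlock(embed_dim)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=embed_dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        feat = self.multi_scale(self.stem(x))          # (B, C, H, W)
        tokens = feat.flatten(2).transpose(1, 2)       # (B, H*W, C)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        tokens = self.encoder(torch.cat([cls, tokens], dim=1))
        return self.head(tokens[:, 0])                 # logits from class token


if __name__ == "__main__":
    model = ConvViTDetector()
    logits = model(torch.randn(2, 3, 64, 64))
    print(logits.shape)  # torch.Size([2, 2])
```

Using several dilation rates in parallel widens the effective receptive field without extra downsampling, while the transformer encoder over the flattened feature map is one plausible way to model global dependencies among local face features before classification.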