Most existing object detection systems adopt 3×3 convolution kernels for feature extraction, so the receptive field of the feature extraction network is always 3×3: the learned features are not rich enough, and structures whose pixel extent differs from 3×3 are not learned accurately. To address this problem, convolutions with different kernel sizes can be introduced for feature extraction; however, large convolution kernels cause a rapid increase in parameters and FLOPs. In this paper, we propose an object detection network based on depth-wise convolution and multi-scale feature fusion (YOLODM-Net). Specifically, we construct a feature extraction module, the multi-scale feature fusion (MSFF) block, which extracts features with depth-wise convolutions of different kernel sizes and mixes them to enrich the learned representations. In addition, we propose a multi-scale spatial attention module based on the Efficient Channel Attention (ECA) module, which adds multi-scale information to make the extracted features more fine-grained. The proposed method was evaluated on the VOC2007 dataset and compared with previous methods. The model achieves higher mAP than YOLOv7, YOLOX, etc., while also improving on their parameter counts and FLOPs.
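To make the MSFF idea concrete, the following is a minimal PyTorch sketch of a block that applies parallel depth-wise convolutions with different kernel sizes and fuses their outputs with a point-wise convolution. The branch kernel sizes, the concatenation-based fusion, and the SiLU activation are illustrative assumptions; the exact topology of the MSFF block in YOLODM-Net is not specified in the abstract.

```python
import torch
import torch.nn as nn

class MSFFBlock(nn.Module):
    """Sketch of a multi-scale feature fusion block (assumed design):
    parallel depth-wise convolutions of different kernel sizes,
    concatenated and mixed by a 1x1 point-wise convolution."""

    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One depth-wise branch per kernel size (groups=channels makes the
        # convolution depth-wise); 'same' padding preserves spatial size
        # so the branch outputs can be concatenated.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        # Point-wise convolution mixes the multi-scale features across channels.
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, 1)
        self.act = nn.SiLU()

    def forward(self, x):
        feats = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.act(self.fuse(feats))

# Usage: fuse multi-scale depth-wise features on a 64-channel map.
y = MSFFBlock(64)(torch.randn(1, 64, 32, 32))  # -> (1, 64, 32, 32)
```

Because each branch is depth-wise, enlarging a kernel from 3×3 to 7×7 grows that branch's weights only linearly in the channel count, which is consistent with the abstract's goal of widening the receptive field without a rapid increase in parameters and FLOPs.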
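For reference, below is a sketch of the standard ECA module that the proposed attention builds on: global average pooling followed by a 1-D convolution across the channel dimension and a sigmoid gate. How YOLODM-Net injects multi-scale spatial information into this baseline is not described in the abstract, so only the ECA baseline is shown; the kernel size of 3 is the common default, not a value taken from the paper.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Standard Efficient Channel Attention (baseline only; the paper's
    multi-scale spatial extension is not specified in the abstract)."""

    def __init__(self, k_size=3):
        super().__init__()
        # 1-D convolution over the channel descriptor captures local
        # cross-channel interaction without dimensionality reduction.
        self.conv = nn.Conv1d(1, 1, k_size, padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (B, C, H, W) -> per-channel descriptor (B, C, 1, 1)
        y = x.mean(dim=(2, 3), keepdim=True)
        # Treat channels as a 1-D sequence: (B, C, 1, 1) -> (B, 1, C)
        y = self.conv(y.squeeze(-1).transpose(1, 2))
        # Back to (B, C, 1, 1) and gate the input feature map.
        y = self.sigmoid(y.transpose(1, 2).unsqueeze(-1))
        return x * y

# Usage: reweight channels of a 64-channel feature map.
out = ECA()(torch.randn(1, 64, 32, 32))  # -> (1, 64, 32, 32)
```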