Engineering vehicles play a vital role in supporting construction projects. However, due to their substantial size, heavy tonnage, and significant blind spots while in motion, they present a potential threat to road maintenance, pedestrian safety, and the well-being of other vehicles. Hence, monitoring engineering vehicles holds considerable importance. This paper introduces an engineering vehicle detection model based on improved YOLOv6. First, a Swin Transformer is employed for feature extraction, capturing comprehensive image features to improve the detection capability of incomplete objects. Subsequently, the SimMIM self-supervised training paradigm is implemented to address challenges related to insufficient data and high labeling costs. Experimental results demonstrate the model’s superior performance, with a mAP50:95 value of 88.5% and mAP50 value of 95.9% on the dataset of four types of engineering vehicles, surpassing existing mainstream models and proving its effectiveness in engineering vehicle detection.