A Multi-scale Feature Adaptive Fusion model (MFAF-YOLO) for real-time citrus detection by harvesting robots in complex field environments is proposed in this study. The proposed model improves detection accuracy while meeting the lightweight requirements of consumer-grade cameras. In this study, images of different citrus varieties, captured by two distinct devices, were classified into 'First priority', 'Second priority', and unannotated citrus. This classification strategy guides robots in sequential picking in real-field scenarios, reducing detection redundancy and lowering the damage rate at the robot's end-effector. Additionally, multiple clustering algorithms were employed to adjust the anchor box sizes of the model, and the impact of dual and triple detection heads on model accuracy was explored across these clustering algorithms. An innovative multi-scale feature adaptive fusion module was embedded in the model's neck to improve accuracy and reduce model size. On a dataset processed with multiple augmentation techniques, the MFAF-YOLO model achieved a mean Average Precision (mAP) of 90.2 %, an improvement of 3.8 % over the original YOLOv5s model, demonstrating superior generalization capability. Compared with seven other mainstream models, including YOLOv4 and MobileNet-YOLOv5s, MFAF-YOLO achieved the highest Average Precision (AP) for the 'First priority' and 'Second priority' classes, at 93.2 % and 87.3 % respectively, while maintaining competitive detection speed. The model size of MFAF-YOLO was reduced to 10.4 MB, a 26.2 % decrease relative to the lightweight YOLOv5s model. Experimental results highlight the model's strong robustness and its effective balance between detection accuracy and lightweight design. The proposed model provides theoretical support for real-time citrus picking and harvesting decision-making.
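The abstract does not specify which clustering algorithms were used to refit the anchor boxes; k-means with a 1 − IoU distance (the approach popularized by YOLOv2) is one common choice for this step. The sketch below is a minimal illustration under that assumption, not the paper's implementation; the `kmeans_anchors` and `iou_wh` names are hypothetical. Choosing `k = 6` would supply three anchors to each of two detection heads, and `k = 9` would suit a triple-head configuration, matching the dual/triple head comparison described above.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between label boxes and anchors, comparing width/height only
    (both sets are treated as aligned at a common origin)."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            anchors[None, :, 0] * anchors[None, :, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster (width, height) pairs with 1 - IoU as the distance."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)  # nearest anchor
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else anchors[i] for i in range(k)])
        if np.allclose(new, anchors):  # converged
            break
        anchors = new
    return anchors[np.argsort(anchors.prod(axis=1))]  # sort by area

# Example with synthetic label sizes; in practice, boxes is an (N, 2)
# array of annotated widths/heights taken from the training set.
boxes = np.abs(np.random.default_rng(1).normal(80, 30, size=(500, 2)))
print(kmeans_anchors(boxes, k=6))  # 6 anchors for a dual-head setup
```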
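Likewise, the internals of the multi-scale feature adaptive fusion module are not described here. One widely used form of adaptive fusion (e.g., ASFF-style fusion) learns input-dependent weights that blend neck feature maps from several scales; the PyTorch sketch below assumes that form and is purely illustrative, with the `AdaptiveFusion` class and its parameters being hypothetical rather than the paper's MFAF design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    """Fuse three neck feature maps with learned, input-dependent weights
    (an ASFF-style assumption, not the paper's exact MFAF module)."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 convs produce one scalar weight map per input level
        self.weight_convs = nn.ModuleList(
            nn.Conv2d(channels, 1, kernel_size=1) for _ in range(3))
        self.out_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feats):
        # feats: three tensors with the same channel count; resize all
        # to the spatial size of the first (target) level
        target = feats[0].shape[-2:]
        feats = [f if f.shape[-2:] == target
                 else F.interpolate(f, size=target, mode="nearest")
                 for f in feats]
        # Softmax across levels so the weights sum to 1 at each pixel
        w = torch.softmax(
            torch.cat([conv(f) for conv, f in zip(self.weight_convs, feats)],
                      dim=1), dim=1)
        fused = sum(w[:, i:i + 1] * f for i, f in enumerate(feats))
        return self.out_conv(fused)

# Example: P3/P4/P5-style feature maps with 128 channels (illustrative sizes)
p3, p4, p5 = (torch.randn(1, 128, s, s) for s in (80, 40, 20))
print(AdaptiveFusion(128)([p3, p4, p5]).shape)  # torch.Size([1, 128, 80, 80])
```

Because the fusion weights are predicted from the features themselves and normalized per pixel, a module of this kind adds only a few 1×1 convolutions, which is consistent with the abstract's stated goal of improving accuracy while keeping the model small.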