Few-shot learning methods require only a small number of samples to train a good model. However, most of these methods consider a single modality, ignoring the correlations among multi-modal data. Using multi-modal methods to address the small-sample problem has therefore become a development trend in artificial intelligence. In recent years, a multi-modal approach called Vision-Language Pre-training (VLP) has emerged: the semantic relations between modalities are learned during pre-training, yielding better performance on downstream tasks. Accordingly, this paper took cucumber disease recognition with small samples as an example and proposed a recognition method based on a multi-modal language model that exploits image-text-label information. First, image-text multi-modal contrastive learning, image self-supervised contrastive learning, and label information were combined to measure the distance between samples in a common image-text-label space. Second, classification strategies and the optimization of large-scale vision-language pre-training on small-sample cucumber datasets were studied. The proposed model achieved a recognition accuracy of 94.84% on a small multi-modal cucumber disease dataset. Finally, experiments on a public dataset demonstrated that our method generalizes well.
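To make the contrastive component above concrete, the following is a minimal PyTorch-style sketch of an image-text-label contrastive loss in which pairs sharing a class label count as positives; the function name, tensor shapes, and temperature value are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def image_text_label_contrastive_loss(img_emb, txt_emb, labels, temperature=0.07):
    """Bidirectional contrastive loss in a common image-text-label space (a sketch).

    img_emb, txt_emb: (N, D) L2-normalized image and text embeddings.
    labels: (N,) integer class labels; samples with equal labels are positives,
    so label information supervises the image-text alignment.
    """
    logits = img_emb @ txt_emb.t() / temperature            # (N, N) similarities
    # targets[i, j] = 1 if sample i and caption j share a label, else 0
    targets = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    targets = targets / targets.sum(dim=1, keepdim=True)    # normalize per row
    # cross-entropy against the soft targets, in both retrieval directions
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

With all labels distinct this reduces to a standard CLIP-style bidirectional contrastive loss; shared labels enlarge the positive set, which is what lets label information reshape the common embedding space.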