To reduce the cost of experimentally characterizing potential enzyme substrates, machine learning prediction models offer an alternative solution. Pretrained language models, which provide powerful representations of proteins and molecules, have been employed to develop enzyme-substrate prediction models with promising performance. Beyond continued improvements to the language models themselves, effectively fusing encoders for multimodal prediction tasks is critical to making the best use of available representation methods. Here, we present FusionESP, a multimodal architecture that integrates protein and chemistry language models with two independent projection heads and a contrastive learning strategy for predicting enzyme-substrate pairs. Our best model achieved state-of-the-art performance with an accuracy of 94.77% on independent test data and exhibited better generalization capacity, while requiring fewer computational resources and less training data than previous approaches that fine-tuned an encoder or employed additional encoders. It also confirmed our hypothesis that embeddings of positive pairs lie closer to each other in a high-dimensional space, while those of negative pairs lie farther apart. Our ablation studies showed that the projection heads played a crucial role in performance enhancement, and that the contrastive learning strategy further improved their capacity in the classification task. The proposed architecture is expected to be applicable to additional multimodal prediction tasks in biology. A user-friendly web server for FusionESP is freely accessible at https://rqkjkgpsyu.us-east-1.awsapprunner.com/.
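For illustration, the core idea of pairing two independent projection heads with a contrastive objective can be sketched in PyTorch as below. This is a minimal sketch, not the paper's exact implementation: the embedding dimensions, head widths, and the margin-based contrastive loss are assumptions chosen for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """MLP mapping a pretrained embedding into a shared space (illustrative sizes)."""
    def __init__(self, in_dim: int, out_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        # Unit-normalize so the dot product below is a cosine similarity.
        return F.normalize(self.net(x), dim=-1)

class FusionSketch(nn.Module):
    """Two independent projection heads over frozen language-model embeddings."""
    def __init__(self, protein_dim: int = 1280, molecule_dim: int = 768):
        super().__init__()
        self.protein_head = ProjectionHead(protein_dim)
        self.molecule_head = ProjectionHead(molecule_dim)

    def forward(self, protein_emb, molecule_emb):
        zp = self.protein_head(protein_emb)
        zm = self.molecule_head(molecule_emb)
        # High cosine similarity => predicted enzyme-substrate pair.
        return (zp * zm).sum(dim=-1)

def contrastive_loss(similarity, labels, margin: float = 0.5):
    """Pull positive pairs together; push negatives below a margin (assumed form)."""
    pos = labels * (1.0 - similarity)
    neg = (1.0 - labels) * F.relu(similarity - margin)
    return (pos + neg).mean()

# Toy usage with random stand-ins for precomputed embeddings.
protein_emb = torch.randn(8, 1280)   # e.g., protein language-model embeddings
molecule_emb = torch.randn(8, 768)   # e.g., chemistry language-model embeddings
labels = torch.randint(0, 2, (8,)).float()

model = FusionSketch()
sim = model(protein_emb, molecule_emb)
loss = contrastive_loss(sim, labels)
loss.backward()
print(sim.shape, loss.item())
```

Training only the lightweight projection heads on top of frozen pretrained encoders is what keeps the computational cost low relative to fine-tuning the encoders themselves, and the contrastive objective directly encodes the hypothesis that positive pairs should be close in the shared embedding space.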