In this paper, we present our solution to the MER-SEMI subchallenge of the Multimodal Emotion Recognition Challenge (MER 2023). This subchallenge focuses on predicting discrete emotions for a small subset of unlabeled videos in a semi-supervised setting, where participants are provided with a limited number of labeled videos together with a large amount of unlabeled ones. Our preliminary experiments on the labeled videos show that this task is driven primarily by the video and audio modalities, while the text modality plays a comparatively minor role in emotion prediction. To address this challenge, we propose the Video-Audio Transformer (VAT), which takes raw signals as input and extracts multimodal representations. VAT comprises a video encoder, an audio encoder, and a cross-modal encoder. To exploit the vast amount of unlabeled data, we introduce a contrastive loss that aligns the image and audio representations before fusing them through cross-modal attention. To further improve learning from noisy video data, we apply momentum distillation, a self-training method that learns from pseudo-targets generated by a momentum model. Finally, we fine-tune VAT on the annotated videos for emotion recognition. Experimental results on the MER-SEMI task demonstrate the effectiveness of the proposed VAT model; notably, it ranks first on the leaderboard with a score of 0.891. Our project is publicly available at https://github.com/dingchaoyue/Multimodal-Emotion-Recognition-MER-and-MuSe-2023-Challenges.
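To make the training objectives described above concrete, the following is a minimal sketch of video-audio contrastive alignment, cross-modal attention fusion, and a momentum (EMA) teacher for distillation. The module names, feature dimensions, and six-class emotion head are illustrative assumptions, not the released VAT implementation.

```python
# Minimal sketch (assumed names/sizes) of: (1) contrastive video-audio alignment,
# (2) cross-modal attention fusion, (3) an EMA momentum teacher for distillation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVAT(nn.Module):
    def __init__(self, video_dim=512, audio_dim=256, embed_dim=256, num_heads=4):
        super().__init__()
        # stand-ins for the paper's video and audio encoders
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        # cross-modal encoder approximated here by a single attention layer
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(embed_dim, 6)  # hypothetical six discrete emotions
        self.temperature = 0.07

    def forward(self, video_feats, audio_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)  # (B, D)
        a = F.normalize(self.audio_proj(audio_feats), dim=-1)  # (B, D)

        # InfoNCE-style contrastive loss aligning paired video/audio clips
        logits = v @ a.t() / self.temperature                   # (B, B)
        targets = torch.arange(v.size(0), device=v.device)
        loss_contrast = (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets)) / 2

        # fuse modalities with cross-modal attention (video queries audio)
        fused, _ = self.cross_attn(v.unsqueeze(1), a.unsqueeze(1), a.unsqueeze(1))
        emotion_logits = self.classifier(fused.squeeze(1))
        return emotion_logits, loss_contrast

@torch.no_grad()
def ema_update(momentum_model, model, m=0.995):
    # Momentum-distillation teacher: exponential moving average of the online
    # model's parameters; its soft predictions serve as pseudo-targets.
    for p_m, p in zip(momentum_model.parameters(), model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1 - m)
```

In this sketch the contrastive term is computed on unlabeled pairs, the classifier head is trained (and later fine-tuned) on the labeled subset, and the EMA teacher is updated once per optimization step to supply pseudo-targets.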