Weakly-supervised fine-grained recognition aims to detect subtle differences between subcategories at a finer scale without using any manual annotations. While most recent works focus on classical image-based fine-grained recognition, which recognizes subcategories at the image level, video-based fine-grained recognition is much more challenging and particularly needed. In this paper, we propose a Multimodal Audio-Visual Transformer for Weakly-supervised Fine-Grained Recognition (MAVT-FG), a model that incorporates both audio and visual modalities. Specifically, MAVT-FG consists of an Audio-Visual Dual-Encoder for feature extraction, a Cross-Decoder for Audio-Visual Fusion (DAVF) that exploits inherent cues and correspondences between the two modalities, and a Search-and-Select Fine-grained Branch (SSFG) that captures the most discriminative regions. Furthermore, we construct a new benchmark, Fine-grained Birds of Audio-Visual (FGB-AV), for audio-visual weakly-supervised fine-grained recognition at the video level. Experimental results show that our method outperforms other state-of-the-art methods.
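
To make the fuse-then-select idea concrete, the following is a minimal illustrative sketch and not the MAVT-FG implementation. It assumes single-head scaled dot-product cross-attention with no learned projections as a stand-in for audio-visual fusion, and it assumes that the L2 norm of a fused token can serve as a saliency score for selecting discriminative tokens. The function names `cross_attend` and `select_top_k`, the toy token values, and the scoring rule are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head scaled dot-product cross-attention (no learned projections).

    Each query token (here: a visual token) is replaced by a weighted average
    of the other modality's tokens (here: audio tokens). This is an assumed
    simplification of cross-modal fusion, not the paper's decoder.
    """
    dim = len(queries[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        fused.append([sum(w * kv[i] for w, kv in zip(weights, keys_values))
                      for i in range(dim)])
    return fused


def select_top_k(tokens, k):
    """Return the indices (in ascending order) of the k tokens with the
    largest L2 norm. The norm is a hypothetical saliency score."""
    norms = [math.sqrt(sum(x * x for x in t)) for t in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:k])


# Toy inputs: 3 visual tokens and 2 audio tokens, each of dimension 2.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_tokens = [[2.0, 0.0], [0.0, 0.5]]

fused_tokens = cross_attend(visual_tokens, audio_tokens)
selected = select_top_k(fused_tokens, k=2)
```

In this sketch the visual tokens act as queries over the audio tokens, so each fused token is a convex combination of audio tokens. A full model would add learned projections, multiple heads, and a trained selection criterion, none of which are specified in this abstract.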