Multimodal rumor detection aims to detect rumors using information from both the textual and visual modalities. The most critical difficulty in multimodal rumor detection lies in capturing both the intra-modal and inter-modal relationships in multimodal data. However, existing methods mainly focus on the multimodal fusion process while paying little attention to intra-modal relationships. To address this limitation, we propose a multimodal rumor detection method with deep metric learning (MRML) that effectively extracts the multimodal relationships of news for detecting rumors. Specifically, we design metric-based triplet learning to extract the intra-modal relationships between rumors and non-rumors in each modality, and contrastive pairwise learning to capture the inter-modal relationships across modalities. Extensive experiments on two real-world multimodal datasets show the superior performance of our rumor detection method.
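As a rough illustration only (not the paper's exact formulation), the two learning objectives named above can be sketched as a standard margin-based triplet loss within a single modality and a contrastive pairwise loss across the text and image features of a news item; the function names, margin values, and PyTorch-style implementation below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Metric-based triplet learning (sketch): within one modality, pull
    features of the same class (e.g., rumor) together and push features
    of the opposite class apart by at least the margin."""
    pos_dist = F.pairwise_distance(anchor, positive)
    neg_dist = F.pairwise_distance(anchor, negative)
    return F.relu(pos_dist - neg_dist + margin).mean()

def contrastive_pairwise_loss(text_feat, image_feat, same_news, margin=1.0):
    """Contrastive pairwise learning (sketch): align the text and image
    features of the same news item (same_news = 1) and separate features
    of mismatched text-image pairs (same_news = 0)."""
    dist = F.pairwise_distance(text_feat, image_feat)
    pos = same_news * dist.pow(2)
    neg = (1 - same_news) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()
```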