Abstract People resonate more with music when it is paired with visual information, and music in turn enhances their perception of video content. Cross-modal recommendation techniques can suggest appropriate background music for a given video. However, the correspondence between data of different modalities is not straightforward. To explore the association between the video and music modalities, we propose MFF-VBMR, a video background music recommendation model based on multi-level fusion features. The model exploits cross-modal information from the static, dynamic, and emotional content of video and music to match and recommend suitable background music for a given video. We further propose a feature-normalized convolutional similarity network (FNC), which accounts for the pairwise similarity of visual and acoustic regions without losing region-level detail. Experimental results show that the proposed model outperforms existing models and achieves satisfactory results for video background music recommendation.
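To make the pairwise region-similarity idea concrete, the following is a minimal sketch, not the paper's actual FNC implementation: it assumes visual and acoustic regions are already embedded in a shared feature space, L2-normalizes each region feature (the "feature normalization" step), and computes a full region-by-region cosine-similarity matrix rather than pooling regions first, so per-region detail is preserved. The function name and shapes are illustrative assumptions.

```python
import numpy as np

def pairwise_region_similarity(visual_regions: np.ndarray,
                               audio_regions: np.ndarray) -> np.ndarray:
    """Hypothetical sketch of feature-normalized pairwise region similarity.

    visual_regions: (n_v, d) array of visual region features.
    audio_regions:  (n_a, d) array of acoustic region features.
    Returns an (n_v, n_a) matrix whose (i, j) entry is the cosine
    similarity between visual region i and acoustic region j.
    """
    # L2-normalize each region feature so dot products become cosine similarities
    v = visual_regions / np.linalg.norm(visual_regions, axis=1, keepdims=True)
    a = audio_regions / np.linalg.norm(audio_regions, axis=1, keepdims=True)
    # Keep the full pairwise matrix instead of pooling, preserving region detail
    return v @ a.T

# Example: 4 visual regions and 3 acoustic regions with 8-dim features
rng = np.random.default_rng(0)
sim = pairwise_region_similarity(rng.normal(size=(4, 8)),
                                 rng.normal(size=(3, 8)))
print(sim.shape)  # (4, 3); every entry lies in [-1, 1]
```

A downstream matching score could then aggregate this matrix (e.g., max over acoustic regions followed by a mean over visual regions), but the aggregation used by MFF-VBMR is not specified in the abstract.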