Most existing multi-source remote sensing data classification methods are based on convolutional neural networks (CNNs). Recently, the emergence of the Vision Transformer has strongly challenged the dominance of CNN-based methods. The self-attention mechanism in Transformers and in other dynamic networks implies that high-order feature interactions help improve feature representation and fusion. To explore high-order feature interactions in multi-source image fusion, we propose a novel recursive feature interactive fusion network. It is composed of a cross-shaped window self-attention encoder and a recursive feature interactive fusion module. We apply gated convolution recursively to mix multi-modal features and exploit their spatial relations. Experimental results on two datasets show that the proposed method achieves better performance than closely related methods.
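To make the idea of recursive gated mixing concrete, the following is a minimal sketch and not the proposed network itself. It assumes two single-channel, one-dimensional feature sequences standing in for two modalities, and it alternates which modality supplies the spatial gate at each step. The function names (`conv1d_same`, `gated_conv`, `recursive_gated_fusion`), the alternation scheme, and the `order` parameter are hypothetical choices made only for illustration.

```python
from typing import List, Sequence


def conv1d_same(x: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Zero-padded 1D cross-correlation whose output has the same length as x."""
    pad = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - pad
            if 0 <= idx < len(x):  # positions outside x count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def gated_conv(carrier: Sequence[float],
               gate_source: Sequence[float],
               kernel: Sequence[float]) -> List[float]:
    """Multiply the carrier element-wise by a spatially filtered gate."""
    gate = conv1d_same(gate_source, kernel)
    return [c * g for c, g in zip(carrier, gate)]


def recursive_gated_fusion(feat_a: Sequence[float],
                           feat_b: Sequence[float],
                           kernel: Sequence[float],
                           order: int = 3) -> List[float]:
    """Apply gated convolution `order` times (illustrative assumption).

    Each step multiplies the running result by one more filtered feature,
    so the interaction order between the two inputs grows by one per step.
    """
    if len(feat_a) != len(feat_b):
        raise ValueError("both modalities must have the same length")
    fused = list(feat_a)
    for step in range(order):
        # Assumed scheme: even steps gate with modality B, odd steps with A.
        source = feat_b if step % 2 == 0 else feat_a
        fused = gated_conv(fused, source, kernel)
    return fused


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0]        # toy features from modality A
    b = [1.0, 1.0, 1.0]        # toy features from modality B
    smooth = [1.0, 1.0, 1.0]   # toy spatial kernel
    print(recursive_gated_fusion(a, b, smooth, order=2))
```

In a real model the gates would presumably come from learned, multi-channel 2D convolutions with projection layers around them; the sketch only shows how repeating the gating step raises the order of interaction between the two feature sets.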