Land use classification using optical and Synthetic Aperture Radar (SAR) images is a crucial task in remote sensing image interpretation. Recently, deep multi-modal fusion models have significantly improved land use classification by integrating multi-source data. However, existing approaches rely solely on simple fusion methods to exploit the complementary information of each modality and disregard inter-modal correlation during feature extraction, which leads to inadequate integration of that information. In this paper, we propose FASONet, a novel multi-modal fusion network consisting of two key modules that tackle this challenge from different perspectives. First, the feature alignment module (FAM) facilitates cross-modal learning by aligning high-level features from both modalities, thereby enhancing the feature representation of each modality. Second, the multi-modal squeeze-and-excitation fusion module (MSEM) adaptively fuses discriminative features by weighting each modality and suppressing irrelevant parts. Experimental results on the WHU-OPT-SAR dataset demonstrate the superiority of FASONet over other fusion-based methods, with a 5.1% improvement in MIoU over the state-of-the-art MCANet.
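
To make the fusion idea concrete, the sketch below shows a generic squeeze-and-excitation style fusion of optical and SAR feature maps in PyTorch: both modalities are globally pooled, a shared bottleneck predicts per-channel gates for each modality, and the reweighted features are summed. The class name `MultiModalSEFusion`, the reduction ratio, and the sum-based fusion are illustrative assumptions and are not taken from the paper's actual MSEM definition.

```python
import torch
import torch.nn as nn


class MultiModalSEFusion(nn.Module):
    """Hypothetical squeeze-and-excitation style fusion of two modalities.

    This is a minimal sketch, not the paper's MSEM: it squeezes both feature
    maps into channel descriptors, predicts per-channel weights for each
    modality, and fuses the reweighted features by summation.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # "squeeze": global spatial context
        self.fc = nn.Sequential(             # "excitation": per-modality channel gates
            nn.Linear(2 * channels, (2 * channels) // reduction),
            nn.ReLU(inplace=True),
            nn.Linear((2 * channels) // reduction, 2 * channels),
            nn.Sigmoid(),
        )

    def forward(self, opt_feat: torch.Tensor, sar_feat: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = opt_feat.shape
        # Squeeze both modalities into channel descriptors and concatenate.
        z = torch.cat(
            [self.pool(opt_feat).view(b, c), self.pool(sar_feat).view(b, c)], dim=1
        )
        # Predict channel gates and split them back per modality.
        w = self.fc(z).view(b, 2 * c, 1, 1)
        w_opt, w_sar = w[:, :c], w[:, c:]
        # Reweight each modality and fuse, down-weighting less informative channels.
        return opt_feat * w_opt + sar_feat * w_sar


if __name__ == "__main__":
    # Dummy high-level features from an optical branch and a SAR branch.
    fuse = MultiModalSEFusion(channels=256)
    opt = torch.randn(2, 256, 32, 32)
    sar = torch.randn(2, 256, 32, 32)
    print(fuse(opt, sar).shape)  # torch.Size([2, 256, 32, 32])
```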