Loop closure detection (LCD) is a key component of visual simultaneous localization and mapping (VSLAM) systems, where it is used to eliminate cumulative errors. We propose Res2Net-SE-NetVLAD, an end-to-end LCD network that extracts and aggregates image features into discriminative multi-scale fused features for VSLAM. Res2Net is chosen as the backbone and is combined with the squeeze-and-excitation (SE) channel attention block to obtain receptive fields at multiple granularities. On this basis, the channel attention module weights the feature maps at the channel level, and a NetVLAD layer is appended to aggregate them, yielding the multi-scale feature extraction network Res2Net-SE-NetVLAD, which can be trained end to end for LCD. Experimental results show that the proposed model outperforms other deep learning-based LCD methods on scenes that contain loop closures.
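To make the aggregation stage concrete, the following is a minimal pure-Python sketch of SE channel weighting followed by NetVLAD-style aggregation and a cosine-similarity loop check. It is an illustration under stated assumptions, not the implementation evaluated here: the Res2Net backbone is omitted and its output is assumed to be a feature map given as C channels of N spatial activations; the weights `w1` and `w2`, the cluster `centers`, the sharpness `alpha`, and the decision `threshold` are hypothetical toy values; and the soft assignment uses the distance-based form rather than a learned convolution.

```python
import math

def se_reweight(fmap, w1, w2):
    """Squeeze-and-excitation on a feature map given as C lists of N activations."""
    # Squeeze: global average pooling yields one scalar per channel.
    z = [sum(ch) / len(ch) for ch in fmap]
    # Excitation: bottleneck layer with ReLU, then a layer with sigmoid gates.
    h = [max(0.0, sum(w * x for w, x in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(row, h)))) for row in w2]
    # Scale: multiply every activation of a channel by that channel's gate.
    return [[g * v for v in ch] for g, ch in zip(s, fmap)]

def l2_normalize(v, eps=1e-12):
    n = math.sqrt(sum(x * x for x in v))
    return [x / (n + eps) for x in v]

def netvlad(fmap, centers, alpha=10.0):
    """NetVLAD-style aggregation of local descriptors into one global descriptor."""
    C, N = len(fmap), len(fmap[0])
    # One C-dimensional local descriptor per spatial position.
    descs = [l2_normalize([fmap[c][i] for c in range(C)]) for i in range(N)]
    vlad = [[0.0] * C for _ in centers]
    for d in descs:
        # Soft assignment: softmax over negative scaled squared distances.
        logits = [-alpha * sum((a - b) ** 2 for a, b in zip(d, ck)) for ck in centers]
        m = max(logits)
        e = [math.exp(x - m) for x in logits]
        total = sum(e)
        for k, ck in enumerate(centers):
            a = e[k] / total
            for c in range(C):
                vlad[k][c] += a * (d[c] - ck[c])  # weighted residual to center k
    # Intra-normalize each cluster's residual, flatten, then L2-normalize globally.
    flat = [x for row in vlad for x in l2_normalize(row)]
    return l2_normalize(flat)

def cosine(u, v):
    # Both inputs are assumed to be unit length, so the dot product suffices.
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy values: C = 4 channels, N = 3 positions, K = 2 centers.
w1 = [[0.5, -0.5, 0.25, 0.1], [0.2, 0.3, -0.1, 0.4]]     # 4 -> 2 bottleneck
w2 = [[1.0, -1.0], [0.5, 0.5], [-0.3, 0.8], [0.2, 0.1]]  # 2 -> 4 gates
centers = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
frame_a = [[1.0, 0.9, 0.1], [0.1, 0.2, 1.0], [0.5, 0.4, 0.3], [0.0, 0.3, 0.6]]
frame_b = [[0.9, 1.0, 0.2], [0.2, 0.1, 0.9], [0.4, 0.5, 0.3], [0.1, 0.3, 0.5]]

desc_a = netvlad(se_reweight(frame_a, w1, w2), centers)
desc_b = netvlad(se_reweight(frame_b, w1, w2), centers)
threshold = 0.9  # hypothetical decision threshold
is_loop = cosine(desc_a, desc_b) > threshold
```

In this sketch a loop closure candidate is simply a pair of frames whose global descriptors exceed the similarity threshold; in a trained network the gates, cluster centers, and assignment weights would all be learned jointly rather than fixed by hand.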