This paper addresses two tasks in the visual semantic understanding of tennis images, semantic segmentation and object detection, and optimizes the network structure to recover more complete location and contour information for each target. Specifically, we study a weakly supervised image semantic segmentation method based on dilated (atrous) convolution and pixel relations. To address incomplete pixel-level pseudo-labels, we add a dilated convolution unit with multiple dilation rates and a self-attention mechanism to the classification model. This enlarges the receptive field while adaptively enhancing target regions and suppressing irrelevant ones, which yields high-quality pixel-level pseudo-labels that are then used to train the semantic segmentation model. Experimental results show that the hierarchical fusion algorithm proposed in this paper significantly outperforms the compared algorithms, and that the tandem dilated-convolution neural network algorithm reaches an overall classification accuracy of 81%. Recognition accuracy is higher for static movements than for dynamic movements.
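
To illustrate the idea of a convolution unit with multiple dilation rates (dilated or atrous convolution, sometimes rendered as "cavity" or "hole" convolution), the following is a minimal sketch and not the network used in this paper. It works on a 1D signal instead of a 2D image, fuses the branches by summation, and leaves out the self-attention step; the function names, the kernel values, and the rates (1, 2, 4) are all assumptions chosen for illustration. It shows that the same 3-tap kernel covers a wider receptive field as the dilation rate grows, without adding parameters.

```python
def receptive_field(kernel_size, dilation):
    """Number of input positions spanned by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv1d(signal, kernel, dilation=1):
    """Zero-padded ('same') 1D dilated convolution with a centered kernel."""
    center = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - center) * dilation  # taps are spaced `dilation` apart
            if 0 <= idx < n:                   # positions outside the signal count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out


def multi_rate_unit(signal, kernel, rates=(1, 2, 4)):
    """Apply the same kernel at several dilation rates and sum the branches.

    Summation is an assumed fusion rule; concatenation is another common choice.
    """
    branches = [dilated_conv1d(signal, kernel, r) for r in rates]
    return [sum(values) for values in zip(*branches)]


if __name__ == "__main__":
    impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]  # single active position in the middle
    kernel = [1, 1, 1]                      # hypothetical 3-tap kernel
    for rate in (1, 2, 4):
        print(rate, receptive_field(len(kernel), rate), dilated_conv1d(impulse, kernel, rate))
    print(multi_rate_unit(impulse, kernel))
```

With an impulse input, each branch responds at positions spaced by its dilation rate, so the fused output reacts to context both near to and far from the active position. This is the property the multi-rate unit relies on to cover more of a target region.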