Over the past few years, supervised deep learning algorithms based on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have achieved remarkable progress in clinical-assisted diagnosis. However, the practical application of these algorithms is greatly limited by the high cost of medical image annotation, particularly for data-hungry models such as ViT. To address this issue, this paper proposes an effective semi-supervised medical image segmentation framework that combines two structurally different models, a CNN and a Transformer, and integrates their complementary abilities to extract local and global information through a mutual supervision strategy. Building on this heterogeneous dual-network design, we employ multi-level image augmentation to expand the training data, reducing the framework's reliance on labeled data. In addition, we introduce an uncertainty minimization constraint to further improve robustness, and incorporate an equivariance regularization module that encourages the model to capture the semantic information of different categories in the images. On public benchmarks, the proposed method outperforms recently developed semi-supervised medical image segmentation methods in terms of the Dice coefficient and the 95% Hausdorff Distance. The code will be released at https://github.com/swaggypg/MLABHCTM.
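The mutual supervision between the two networks can be sketched roughly as follows. This is a minimal NumPy illustration, assuming hard pseudo-labels and pixel-wise cross-entropy; the function name, shapes, and loss weighting are illustrative and not taken from the paper:

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_supervision_loss(logits_cnn, logits_vit):
    """Mutual supervision on unlabeled pixels (illustrative sketch).

    Each network's hard pseudo-labels (argmax) supervise the other
    network's softmax predictions via pixel-wise cross-entropy.
    logits_cnn, logits_vit: arrays of shape (N, C) -- N pixels, C classes.
    """
    p_cnn = softmax(logits_cnn)
    p_vit = softmax(logits_vit)
    y_cnn = logits_cnn.argmax(axis=-1)  # pseudo-labels from the CNN branch
    y_vit = logits_vit.argmax(axis=-1)  # pseudo-labels from the Transformer branch
    idx = np.arange(logits_cnn.shape[0])
    eps = 1e-12  # avoid log(0)
    # The CNN learns from the Transformer's pseudo-labels, and vice versa.
    loss_cnn = -np.log(p_cnn[idx, y_vit] + eps).mean()
    loss_vit = -np.log(p_vit[idx, y_cnn] + eps).mean()
    return loss_cnn + loss_vit
```

When the two branches agree confidently, the loss is near zero; disagreement produces a large gradient signal on both branches, which is what drives the consistency training on unlabeled images.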