ABSTRACT Accurately segmenting gastrointestinal (GI) disease regions in Wireless Capsule Endoscopy images is essential for clinical diagnosis and survival prediction. The task is challenging because of similar intensity distributions, variable lesion shapes, and fuzzy boundaries. In this paper, we propose MLFE‐UNet, a UNet that fuses CNN and Transformer components. Both the encoder and decoder use a multi‐level feature extraction (MLFA) CNN‐Transformer‐based module, which extracts features from the input while accounting for both global dependencies and local information. We further introduce a multi‐level spatial attention (MLSA) block as the bottleneck, which improves the network's ability to handle complex structures and overlapping regions in feature maps. The MLSA block captures multiscale dependencies of tokens from the channel perspective and transmits them to the decoding path. A contextual feature stabilization block follows each transition to emulate lesion zones and guide segmentation at each phase. To handle high‐level semantic information, we place a computationally efficient spatial channel attention block, followed by a stabilization block, in the skip connections; this ensures global interaction and highlights important semantic features passed from the encoder to the decoder. We evaluate MLFE‐UNet on two common GI diseases, bleeding and polyps. MLFE‐UNet achieves Dice coefficient scores of 92.34% and 88.37% on the MICCAI 2017 (Red lesion) and CVC‐ClinicDB data sets, respectively.
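For readers unfamiliar with the reported metric, the Dice coefficient between a predicted binary mask P and a ground-truth mask T is 2|P ∩ T| / (|P| + |T|). The following is a minimal sketch of that standard definition, not the evaluation code used in this work; the function name `dice_coefficient`, the smoothing term `eps`, and the toy masks are assumptions made for illustration.

```python
def dice_coefficient(pred, target, eps=1e-7):
    """Dice = 2|P ∩ T| / (|P| + |T|) for two binary masks of equal size.

    `pred` and `target` are 2-D masks given as lists of rows; any truthy
    value counts as foreground. `eps` is an assumed smoothing constant.
    """
    # Flatten each mask into a 1-D list of 0/1 values.
    p = [int(bool(v)) for row in pred for v in row]
    t = [int(bool(v)) for row in target for v in row]
    if len(p) != len(t):
        raise ValueError("masks must contain the same number of pixels")

    # Pixels that are foreground in both masks.
    intersection = sum(a & b for a, b in zip(p, t))
    total = sum(p) + sum(t)

    # eps avoids division by zero; two empty masks score 1.0.
    return (2.0 * intersection + eps) / (total + eps)


# Toy example (hypothetical masks): 2 overlapping pixels, 3 foreground each.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]
score = dice_coefficient(pred_mask, true_mask)
```

A score of 1.0 means the two masks coincide exactly and 0.0 means they share no foreground pixels, so the percentages quoted above correspond to this value multiplied by 100.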