Forest fires, influenced by climatic and ecological changes, pose significant risks to global ecosystems and human communities. To address this challenge, our research introduces a segmentation method that integrates a vision transformer (ViT) with conventional convolutional neural networks (CNNs). Within this framework, MobileViT serves as the backbone, while CNN branches enhance spatial resolution. The model also incorporates the CBAM attention mechanism, the Dense ASPP module, and SP pooling to improve segmentation performance. Designed to be lightweight and efficient, the model achieved an F1-score of 87.2% and an mIoU of 81.44% on our custom, data-augmented dataset. In addition, ablation experiments validate the contribution of each module to the performance of the composite model. Collectively, this research aims to advance real-time wildland fire monitoring, and its potential extends to a broader range of agricultural and forestry applications.
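
To make the described architecture more concrete, the following is a minimal PyTorch sketch of how the named modules (CBAM attention, a densely connected ASPP block, and SP pooling, interpreted here as strip pooling) could be attached to backbone feature maps from a MobileViT/CNN encoder. The `FireSegHead` class, channel counts, dilation rates, and the stand-in backbone features are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CBAM(nn.Module):
    """Simplified convolutional block attention: channel then spatial gating."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1).flatten(1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1).flatten(1))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)                 # channel attention
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))                        # spatial attention


class DenseASPP(nn.Module):
    """Dilated convolution branches with dense connections between them."""
    def __init__(self, channels, dilations=(3, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList()
        in_ch = channels
        for d in dilations:
            self.branches.append(nn.Sequential(
                nn.Conv2d(in_ch, channels // 4, 3, padding=d, dilation=d),
                nn.BatchNorm2d(channels // 4), nn.ReLU(inplace=True)))
            in_ch += channels // 4
        self.project = nn.Conv2d(in_ch, channels, 1)

    def forward(self, x):
        feats = [x]
        for branch in self.branches:
            feats.append(branch(torch.cat(feats, dim=1)))  # each branch sees all earlier outputs
        return self.project(torch.cat(feats, dim=1))


class StripPooling(nn.Module):
    """Strip pooling: captures long horizontal/vertical context with 1xW and Hx1 pools."""
    def __init__(self, channels):
        super().__init__()
        self.conv_h = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))
        self.conv_v = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        horiz = F.interpolate(self.conv_h(F.adaptive_avg_pool2d(x, (1, w))), (h, w))
        vert = F.interpolate(self.conv_v(F.adaptive_avg_pool2d(x, (h, 1))), (h, w))
        return x * torch.sigmoid(self.fuse(horiz + vert))


class FireSegHead(nn.Module):
    """Illustrative segmentation head stacking the three modules on backbone features."""
    def __init__(self, in_channels=256, num_classes=2):
        super().__init__()
        self.cbam = CBAM(in_channels)
        self.aspp = DenseASPP(in_channels)
        self.sp = StripPooling(in_channels)
        self.classifier = nn.Conv2d(in_channels, num_classes, 1)

    def forward(self, feats, out_size):
        x = self.sp(self.aspp(self.cbam(feats)))
        return F.interpolate(self.classifier(x), out_size,
                             mode="bilinear", align_corners=False)


if __name__ == "__main__":
    # Stand-in for MobileViT/CNN encoder output: B x 256 x 32 x 32 feature maps.
    feats = torch.randn(1, 256, 32, 32)
    head = FireSegHead(in_channels=256, num_classes=2)
    print(head(feats, out_size=(512, 512)).shape)  # torch.Size([1, 2, 512, 512])
```

In this sketch each module is applied sequentially to the same feature map; in practice the paper's composite model may place them at different stages of the encoder-decoder, so the ordering above should be read only as one plausible arrangement.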