Transformers have emerged as a transformative tool in various computer vision tasks, excelling at capturing long-range dependencies. Their applicability and scalability to the interpretation of high-resolution remote sensing images (HRRSIs) have thus garnered substantial interest. However, unlike natural images, HRRSIs present intricate scenes characterized by scale variations and diverse appearances. These challenges underscore the importance of enabling networks to effectively assimilate both local details and global context. In this letter, we introduce LETFormer, a semantic segmentation transformer. LETFormer balances capturing long-range dependencies with preserving local details through its LETFormer block, which features an anchor token. This token aggregates localized contextual information within a designated window and promotes meaningful interactions among anchor tokens. With a mask transformer decoder, LETFormer gains ample contextual cues for precise semantic mask prediction. Experiments on the ISPRS Potsdam and LoveDA benchmarks establish LETFormer's superiority over state-of-the-art models. Additionally, we analyze the parameter count and floating-point operations (FLOPs) of LETFormer.
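
The abstract does not specify how the anchor token is implemented; the PyTorch sketch below illustrates one plausible reading of the described mechanism, in which a learnable anchor token summarizes each non-overlapping window, anchors then interact globally, and the enriched anchors are broadcast back to their windows. The module name `AnchorTokenBlock`, the window size, the head count, and the use of standard multi-head attention are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class AnchorTokenBlock(nn.Module):
    """Hypothetical sketch of a windowed block with per-window anchor tokens.

    Each window gets an anchor token that (1) attends to the tokens inside its
    window to aggregate local context, (2) interacts with the anchor tokens of
    all other windows, and (3) is read back by the window tokens.
    """

    def __init__(self, dim: int = 64, window: int = 8, heads: int = 4):
        super().__init__()
        self.window = window
        self.anchor = nn.Parameter(torch.zeros(1, 1, dim))  # shared anchor initialization
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.broadcast_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) feature map with H and W divisible by the window size
        B, H, W, C = x.shape
        w = self.window
        # Partition into non-overlapping windows: (B * num_windows, w*w, C)
        xw = (x.view(B, H // w, w, W // w, w, C)
                .permute(0, 1, 3, 2, 4, 5)
                .reshape(-1, w * w, C))

        # 1) Each anchor summarizes its window (anchor is the query).
        anchors = self.anchor.expand(xw.size(0), 1, C)
        anchors, _ = self.local_attn(anchors, xw, xw)

        # 2) Anchors interact globally across windows.
        n_win = (H // w) * (W // w)
        anchors = anchors.view(B, n_win, C)
        anchors, _ = self.global_attn(anchors, anchors, anchors)
        anchors = anchors.reshape(-1, 1, C)

        # 3) Window tokens read the globally enriched anchor back.
        out, _ = self.broadcast_attn(xw, anchors, anchors)
        xw = self.norm(xw + out)

        # Undo the window partition back to (B, H, W, C).
        return (xw.view(B, H // w, W // w, w, w, C)
                  .permute(0, 1, 3, 2, 4, 5)
                  .reshape(B, H, W, C))


if __name__ == "__main__":
    block = AnchorTokenBlock(dim=64, window=8)
    feats = torch.randn(2, 32, 32, 64)   # small feature map for a smoke test
    print(block(feats).shape)            # torch.Size([2, 32, 32, 64])
```

Under this reading, global context flows only through the anchors, so the per-window attention stays local and cheap while each window still receives scene-level information; the paper's actual block may differ in how the anchor is formed and fused.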