Self-supervised Monocular Depth Estimation (MDE) models trained solely on single-camera video have gained significant popularity. Recent studies have shown that Vision Transformers (ViTs) can improve depth estimation quality, despite their high computational demands. In the context of Extended Reality (XR), lightweight and fast models are crucial for seamless operation on mobile devices. This paper proposes SwiftDepth, a hybrid MDE framework that meets these requirements. The model combines the speed and shift invariance of Convolutional Neural Networks (CNNs) with the global receptive field of ViTs. We employ SwiftFormer, a low-latency feature extraction network with efficient additive attention. We also introduce a novel two-level decoder that improves depth estimation quality without increasing the number of parameters. Our model achieves results comparable to the state-of-the-art lightweight Lite-Mono on the KITTI outdoor dataset while demonstrating better generalization on the NYUv2 indoor dataset.
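
As a rough illustration of the efficient additive attention that SwiftFormer uses (a minimal PyTorch sketch, not the paper's implementation; the module name, tensor shapes, and the final projection are our assumptions), the key idea is that a single learned vector scores the query tokens, and the softmax-weighted sum of queries forms one global query that modulates the keys element-wise, giving linear rather than quadratic complexity in sequence length:

```python
import torch
import torch.nn as nn


class EfficientAdditiveAttention(nn.Module):
    """Sketch of SwiftFormer-style efficient additive attention.

    A learned vector scores each query token; the softmax-weighted sum
    of queries forms a single global query that modulates the keys
    element-wise, avoiding the O(n^2) query-key matrix.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.w_a = nn.Parameter(torch.randn(dim, 1))  # learned scoring vector
        self.proj = nn.Linear(dim, dim)               # output projection (assumed)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        q = self.to_q(x)
        k = self.to_k(x)
        # Per-token scores from one learned vector: O(n), not O(n^2).
        scores = torch.softmax((q @ self.w_a) * self.scale, dim=1)  # (b, n, 1)
        global_q = (scores * q).sum(dim=1, keepdim=True)            # (b, 1, dim)
        # Broadcast the global query over all keys element-wise.
        return self.proj(global_q * k) + q


if __name__ == "__main__":
    attn = EfficientAdditiveAttention(dim=64)
    out = attn(torch.randn(2, 196, 64))
    print(out.shape)  # torch.Size([2, 196, 64])
```

Because all interactions reduce to element-wise products and a single weighted sum, the cost grows linearly with the number of tokens, which is what makes this attention variant attractive for mobile XR latency budgets.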