Journal: IEEE Transactions on Instrumentation and Measurement [Institute of Electrical and Electronics Engineers] Date: 2023-01-01 Volume 72: 1-17 Citations: 4
Identifiers
DOI:10.1109/tim.2023.3329106
Abstract
Diabetic retinopathy (DR) is a common complication of diabetes and one of the main causes of blindness in humans, and it can be prevented by early detection and treatment. Clinically, ophthalmologists use optical coherence tomography (OCT) image analysis as a basis for diagnosing DR. Existing medical resources can no longer meet the needs of the growing patient population, so deep learning has become a mainstream solution for medical image analysis. The Vision Transformer (ViT), a new neural network structure, has demonstrated strong performance in image analysis. However, because ViT lacks inductive bias and requires fixed input image sizes, it is prone to over-fitting on small datasets and is limited in capturing biological tissue characteristics. We therefore propose an OCT multi-head self-attention (OMHSA) block that computes OCT image information using a hybrid CNN-Transformer strategy. Compared to traditional MHSA, OMHSA integrates local information extraction into the self-attention computation, adding local information to the transformer model without relying on a multi-branch network. We built a neural network architecture (OCTFormer) by repeatedly stacking convolutional layers and OMHSA blocks in each stage. Like a CNN, OCTFormer allows the feature-map size to change at each stage, yielding a hierarchical structure. The model's diagnostic effectiveness was evaluated on a collected retinal OCT dataset, where its accuracy reached 98.60%, surpassing state-of-the-art (SOTA) models. We also demonstrated deployment of OCTFormer to mobile terminals through knowledge distillation, providing a reference for deploying transformer models in real clinical environments.
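The core idea the abstract describes, folding a local information path into the self-attention computation rather than running a separate branch network, can be sketched in plain numpy. This is only an illustrative toy, not the paper's OMHSA formulation: the function name `local_mhsa`, the depthwise 1-D kernel over the token sequence standing in for the convolutional local path, and all shapes are assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_mhsa(x, w_q, w_k, w_v, local_kernel, num_heads=2):
    """Multi-head self-attention with an added local term (toy sketch).

    x: (n_tokens, dim) token features; w_q/w_k/w_v: (dim, dim) projections.
    local_kernel: (k, dim) depthwise kernel mixing each token with its
    neighbors -- a hypothetical stand-in for OMHSA's local-information path.
    """
    n, d = x.shape
    dh = d // num_heads
    q, k_, v = x @ w_q, x @ w_k, x @ w_v

    # Global branch: standard scaled dot-product attention per head.
    out = np.zeros_like(x)
    for h in range(num_heads):
        s = slice(h * dh, (h + 1) * dh)
        attn = softmax(q[:, s] @ k_[:, s].T / np.sqrt(dh))
        out[:, s] = attn @ v[:, s]

    # Local branch: depthwise 1-D convolution over the token sequence,
    # added directly to the attention output (no separate branch network).
    ksz = local_kernel.shape[0]
    pad = ksz // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    local = np.stack([(xp[i:i + ksz] * local_kernel).sum(0) for i in range(n)])
    return out + local
```

The design point mirrored here is that the local term is summed into the attention output inside one block, so local context is injected without a parallel CNN branch.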