Keywords
Computer science
Scalability
Convolutional neural network
Artificial intelligence
Transformer
Deep learning
Contextual image classification
Pixel
Medical imaging
Artificial neural network
Machine learning
Pattern recognition
Transfer learning
Image
Database
Authors
Junfei Xiao, Yang Bai, Alan Yuille, Zongwei Zhou
Identifier
DOI: 10.1109/wacv56688.2023.00358
Abstract
Vision Transformer (ViT) has become one of the most popular neural architectures due to its great scalability, computational efficiency, and compelling performance in many vision tasks. However, ViT has shown inferior performance to Convolutional Neural Networks (CNNs) on medical tasks due to its data-hungry nature and the scarcity of annotated medical data. In this paper, we pre-train ViTs on 266,340 chest X-rays using Masked Autoencoders (MAE), which reconstruct missing pixels from a small visible portion of each image. For comparison, CNNs are also pre-trained on the same 266,340 X-rays using advanced self-supervised methods (e.g., MoCo v2). The results show that our pre-trained ViT performs comparably to (and sometimes better than) the state-of-the-art CNN (DenseNet-121) on multi-label thorax disease classification. This performance is attributed to the strong recipes extracted from our empirical studies for pre-training and fine-tuning ViT. The pre-training recipe indicates that medical image reconstruction requires a much smaller visible proportion of each image (10% vs. 25%) and a more moderate random resized crop range (0.5~1.0 vs. 0.2~1.0) than natural imaging. Furthermore, we remark that in-domain transfer learning is preferred whenever possible. The fine-tuning recipe discloses that layer-wise LR decay, RandAug magnitude, and DropPath rate are significant factors to consider. We hope that this study can direct future research on the application of Transformers to a larger variety of medical imaging tasks.
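The two recipes in the abstract map onto concrete training settings. Below is a minimal PyTorch sketch of both, assuming a timm ViT backbone and torchvision transforms. Only the mask ratio (90% of patches hidden, i.e. 10% visible) and the crop scale (0.5~1.0) come from the abstract; every other value (base LR, decay factor 0.75, RandAug magnitude 6, DropPath rate 0.2, 14 disease labels) is an illustrative assumption, not the authors' exact configuration.

import torch
import timm
from torchvision import transforms

# --- Pre-training recipe (MAE on chest X-rays) ---
# Only ~10% of each image stays visible (mask ratio 0.90, vs. 0.75 for
# natural images), and the crop uses the milder 0.5~1.0 scale range.
MASK_RATIO = 0.90  # would be fed to an MAE implementation's mask-ratio setting
pretrain_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),  # moderate crop range
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# --- Fine-tuning recipe ---
# RandAug magnitude and DropPath rate are the knobs the abstract flags;
# the values below are placeholders.
finetune_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    transforms.RandAugment(num_ops=2, magnitude=6),  # magnitude is a key factor
    transforms.ToTensor(),
])

model = timm.create_model(
    "vit_base_patch16_224",
    num_classes=14,       # e.g. 14 thorax disease labels (assumption)
    drop_path_rate=0.2,   # DropPath rate, another key factor
)

def layerwise_lr_groups(model, base_lr=1e-3, decay=0.75):
    """Layer-wise LR decay for a timm ViT: the head trains at base_lr and each
    earlier block's LR shrinks geometrically toward the patch embedding."""
    num_blocks = len(model.blocks)
    lr_at = lambda depth: base_lr * decay ** (num_blocks + 1 - depth)
    groups = [{  # depth 0: patch embedding and learned tokens
        "params": [model.cls_token, model.pos_embed]
                  + list(model.patch_embed.parameters()),
        "lr": lr_at(0),
    }]
    for i, block in enumerate(model.blocks):  # depths 1..num_blocks
        groups.append({"params": list(block.parameters()), "lr": lr_at(i + 1)})
    groups.append({  # final norm and classification head get the full base LR
        "params": list(model.norm.parameters()) + list(model.head.parameters()),
        "lr": lr_at(num_blocks + 1),
    })
    return groups

optimizer = torch.optim.AdamW(layerwise_lr_groups(model), weight_decay=0.05)

The geometric decay lets the randomly initialized head learn quickly while protecting the pre-trained early layers, which is why layer-wise LR decay is singled out as a significant fine-tuning factor.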