Computer science
Upsampling
Generalization
Artificial intelligence
Feature (linguistics)
Pooling
Machine learning
Simplicity (philosophy)
Perceptron
Backbone network
Question answering
Coding (set theory)
Feature extraction
Data mining
Pattern recognition (psychology)
Artificial neural network
Image (mathematics)
Programming language
Philosophy
Mathematical analysis
Computer network
Set (abstract data type)
Epistemology
Linguistics
Mathematics
Identifier
DOI:10.1109/bibm55620.2022.9995347
Abstract
Although current methods have significantly improved the performance of medical visual question answering (Med-VQA), two aspects remain worth exploring, namely the simplification of model structure and effective model training on small-scale data. Different from previous Med-VQA models, this paper employs only multi-layer perceptrons (MLPs) as the backbone network for feature extraction and modal fusion, and designs a Med-VQA model on this basis, which achieves superior performance with a simple backbone network. To enhance model generalization, we design multimodal mixup (M-Mixup) to augment images and questions separately, which effectively alleviates the problem of insufficient training samples in the Med-VQA task. To prevent the destruction of feature relationships when tokenizing the medical image, we design pooling tokens (PTs), a simple downsampling structure that captures fine-grained visual features without affecting the parameters and FLOPs of the entire model. Experimental results demonstrate that our model achieves state-of-the-art performance on the SLAKE dataset and obtains remarkably competitive performance on the VQA-RAD dataset. The source code and models are available at https://github.com/Alivelei/M-Mixup.
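To make the two techniques named in the abstract more concrete, the following is a minimal PyTorch-style sketch, not the authors' released implementation (which is available at the GitHub link above). The function m_mixup, the Beta mixing coefficient alpha, and the PoolingTokens module with its stride are illustrative assumptions about how multimodal mixup of image/question features and pooling-token downsampling of an image token grid could look.

# Hypothetical sketch only; names, shapes, and hyperparameters are assumptions,
# not the paper's exact design.
import torch
import torch.nn as nn


def m_mixup(img_feats, txt_feats, labels, alpha=0.3):
    """Multimodal mixup (assumed form): convexly mix image features,
    question features, and one-hot labels within a batch using one
    shared Beta-distributed coefficient."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(img_feats.size(0))
    mixed_img = lam * img_feats + (1.0 - lam) * img_feats[perm]
    mixed_txt = lam * txt_feats + (1.0 - lam) * txt_feats[perm]
    mixed_lbl = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_img, mixed_txt, mixed_lbl


class PoolingTokens(nn.Module):
    """Parameter-free downsampling of an image token grid: average-pool
    neighbouring patch tokens so local feature relationships are kept
    without adding learnable parameters."""

    def __init__(self, stride=2):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=stride, stride=stride)

    def forward(self, tokens, grid_size):
        # tokens: (B, N, C) with N == grid_size * grid_size
        b, n, c = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, c, grid_size, grid_size)
        x = self.pool(x)                      # (B, C, grid/2, grid/2)
        return x.flatten(2).transpose(1, 2)   # (B, N/4, C)

In this sketch a training loop would call m_mixup on already-encoded image and question features (with soft, one-hot-style answer labels), and insert PoolingTokens between MLP stages to shrink the visual token sequence; both operations add no trainable weights.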