Authors
Suheer Al-Hadhrami, Mohamed El Bachir Menaï, Saad Al-Ahmadi, Ahmad Alnafessah
Abstract
This paper comprehensively reviews medical visual question answering (VQA) models, structures, and datasets, focusing on how vision and language are combined. More than 75 models are compared through statistical and SWOT (Strengths, Weaknesses, Opportunities, Threats) analyses. The study also examines whether researchers in the general VQA field influence those in the medical field. An analysis of text encoding techniques shows that LSTM is the most widely used approach (42%), followed by non-text methods (14%) and BiLSTM (12%). Among vision methods, VGGNet (40%) and ResNet (22%) are the most common, followed by ensemble approaches (16%). Regarding fusion techniques, 14% of the models employed non-specific methods, while SAN (13%) and concatenation (10%) were frequently used. The study identifies the LSTM-VGGNet and LSTM-ResNet combinations as the primary approaches in medical VQA, with usage rates of 18% and 15%, respectively. Statistical analysis of medical VQA from 2018 to 2023, both overall and year by year, reveals a consistent preference for LSTM and VGGNet, except in 2018, when ResNet was more commonly used. The SWOT analysis provides insight into the strengths and weaknesses of medical VQA research and highlights areas for future exploration: addressing limited dataset sizes, enhancing question diversity, mitigating unimodal bias, exploring multi-modal datasets, leveraging external knowledge, incorporating multiple images, ensuring practical medical application integrity, improving model interpretation, and refining evaluation methods. These findings contribute to the understanding of medical VQA and offer guidance for future researchers aiming to advance the field.