Keywords
Computer science; Closed captioning; Artificial intelligence; Task (project management); Modal verb; Natural language processing; Contextual image classification; Generalization; Medical imaging; Representation (politics); Set (abstract data type); Machine learning; Computer vision; Image (mathematics); Programming language; Mathematical analysis; Chemistry; Mathematics; Management; Polymer chemistry; Economics; Politics; Political science; Law
Authors
Jong Hak Moon, Hyungyung Lee, Woncheol Shin, Young-Hak Kim, Edward Choi
Source
Journal: IEEE Journal of Biomedical and Health Informatics (Institute of Electrical and Electronics Engineers)
Date: 2022-12-01
Volume/Issue: 26 (12): 6070-6080
Cited by: 59
Identifier
DOI: 10.1109/jbhi.2022.3207502
Abstract
Recently, a number of studies have demonstrated impressive performance on diverse vision-language multi-modal tasks, such as image captioning and visual question answering, by extending the BERT architecture with multi-modal pre-training objectives. In this work, we explore a broad set of multi-modal representation learning tasks in the medical domain, specifically using radiology images and their unstructured reports. We propose Medical Vision Language Learner (MedViLL), which adopts a BERT-based architecture combined with a novel multi-modal attention masking scheme to maximize generalization performance for both vision-language understanding tasks (diagnosis classification, medical image-report retrieval, medical visual question answering) and a vision-language generation task (radiology report generation). By statistically and rigorously evaluating the proposed model on four downstream tasks with three radiographic image-report datasets (MIMIC-CXR, Open-I, and VQA-RAD), we empirically demonstrate the superior downstream task performance of MedViLL against various baselines, including task-specific architectures.
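The abstract's central technical idea is an attention mask over the concatenated image-and-report token sequence that lets one BERT-style backbone serve both understanding and generation. The sketch below is a minimal, hypothetical illustration of such a seq2seq-style mask, not the authors' released code: the function name build_multimodal_mask and the segment layout (all image tokens first, then report tokens, with special tokens omitted) are assumptions made for clarity.

# Hypothetical sketch of a seq2seq-style multi-modal attention mask in the
# spirit of the scheme described in the abstract (not the authors' exact
# implementation). Image tokens attend bidirectionally to one another; report
# tokens attend to all image tokens and causally to earlier report tokens,
# which supports autoregressive report generation.
import torch

def build_multimodal_mask(num_img_tokens: int, num_txt_tokens: int) -> torch.Tensor:
    """Return an (L, L) boolean mask; True means attention is allowed."""
    L = num_img_tokens + num_txt_tokens
    mask = torch.zeros(L, L, dtype=torch.bool)

    # Image segment: full bidirectional attention among image tokens.
    mask[:num_img_tokens, :num_img_tokens] = True

    # Report segment: every text token may attend to all image tokens...
    mask[num_img_tokens:, :num_img_tokens] = True
    # ...and only to text tokens at or before its own position (causal).
    causal = torch.ones(num_txt_tokens, num_txt_tokens).tril().bool()
    mask[num_img_tokens:, num_img_tokens:] = causal

    return mask

if __name__ == "__main__":
    # Small example: 3 image tokens followed by 4 report tokens.
    print(build_multimodal_mask(num_img_tokens=3, num_txt_tokens=4).int())

For understanding tasks (classification, retrieval, VQA), the same backbone would instead use a fully bidirectional mask (all True), which is presumably how a single pre-trained model can be tuned toward both task families.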