摘要
Machines don't have eyes, but you wouldn't know that if you followed the progression of deep learning models for accurate interpretation of medical images, such as x-rays, computed tomography (CT) and magnetic resonance imaging (MRI) scans, pathology slides, and retinal photos. Over the past several years, there has been a torrent of studies that have consistently demonstrated how powerful "machine eyes" can be, not only compared with medical experts but also for detecting features in medical images that are not readily discernable by humans. For example, a retinal scan is rich with information that people can't see, but machines can, providing a gateway to multiple aspects of human physiology, including blood pressure; glucose control; risk of Parkinson's, Alzheimer's, kidney, and hepatobiliary diseases; and the likelihood of heart attacks and strokes. As a cardiologist, I would not have envisioned that machine interpretation of an electrocardiogram would provide information about the individual's age, sex, anemia, risk of developing diabetes or arrhythmias, heart function and valve disease, kidney, or thyroid conditions. Likewise, applying deep learning to a pathology slide of tumor tissue can also provide insight about the site of origin, driver mutations, structural genomic variants, and prognosis. Although these machine vision capabilities for medical image interpretation may seem impressive, they foreshadow what is potentially far more expansive terrain for artificial intelligence (AI) to transform medicine. The big shift ahead is the ability to transcend narrow, unimodal tasks, confined to images, and broaden machine capabilities to include text and speech, encompassing all input modes, setting the foundation for multimodal AI.