A recent study by Suissa and colleagues explored the clinical relevance of the Dice metric, a medical image segmentation metric commonly used in artificial intelligence (AI). They showed that pixel-wise agreement between physicians identifying structures on ultrasound images is variable, and that a relatively low Dice metric (0.34) corresponded to substantial agreement on subjective clinical assessment. We highlight the need to bring structure and clinical perspective to the evaluation of medical AI, an effort that clinicians are best placed to direct.
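
For readers less familiar with the metric, the Dice coefficient between two binary masks A and B is 2|A ∩ B| / (|A| + |B|), ranging from 0 (no overlap) to 1 (identical masks). The sketch below is a minimal illustration of that standard formula, not the code used in the study; the function name `dice_coefficient` and the two toy masks are assumptions made for this example. It shows how two annotations that mark the same general area can still produce a low pixel-wise score.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks (lists of rows)."""
    # Flatten each 2D mask into a list of booleans.
    flat_a = [bool(v) for row in mask_a for v in row]
    flat_b = [bool(v) for row in mask_b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("masks must contain the same number of pixels")

    # Count pixels marked in both masks, and the total marked in each.
    intersection = sum(a and b for a, b in zip(flat_a, flat_b))
    total = sum(flat_a) + sum(flat_b)

    if total == 0:
        # Assumed convention: two empty masks are treated as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


# Hypothetical toy annotations (not data from the study): two raters outline
# the same structure, but the second outline is shifted two pixels right.
rater_1 = [
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
]
rater_2 = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

score = dice_coefficient(rater_1, rater_2)
print(f"Dice = {score:.2f}")  # 2 shared pixels, 6 + 6 marked: 4 / 12
```

In this toy case both raters mark regions of the same size in adjacent, overlapping locations, yet the score is only about one third, because Dice penalises every non-shared pixel regardless of whether the disagreement would matter clinically.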