Early screening is an important way to reduce the mortality of hepatocellular carcinoma (HCC) and improve its prognosis. As a noninvasive, economical, and safe procedure, B-mode ultrasound is currently the most common imaging modality for diagnosing and monitoring HCC. However, because of the difficulty of extracting effective image features and of modeling longitudinal data, few studies have focused on early prediction of HCC from longitudinal ultrasound images. In this paper, to address these challenges, we propose a spatiotemporal attention network (STA-HCC) that adopts a convolutional neural network (CNN)-transformer framework. The CNN comprises a feature-extraction backbone and a proposed regions-of-interest attention block, which learns to localize regions of interest automatically and to extract features that are effective for HCC prediction. The transformer captures long-range dependencies and nonlinear dynamics across the longitudinal ultrasound images through a multihead self-attention mechanism. In addition, an age-based position embedding is proposed for the transformer to encode a more appropriate positional relationship among the longitudinal ultrasound images. Experiments conducted on our dataset of 6170 samples collected from 619 cirrhotic subjects show that STA-HCC achieves strong performance, with an area under the receiver-operating-characteristic curve of 77.5%, an accuracy of 70.5%, a sensitivity of 69.9%, and a specificity of 70.5%. The results show that our method achieves state-of-the-art performance compared with other popular sequence models.
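To make the described pipeline concrete, the following is a minimal PyTorch sketch of one plausible reading of the CNN-transformer framework with an age-based position embedding: per-exam features are extracted by a CNN, offset by a sinusoidal encoding evaluated at the subject's age at each exam, and aggregated by a transformer encoder. The backbone, the form of the age embedding, the pooling, and all layer sizes here are illustrative assumptions, not the authors' actual design; the proposed regions-of-interest attention block is omitted.

```python
import math
import torch
import torch.nn as nn


class AgePositionEmbedding(nn.Module):
    """Assumed age-based position embedding: a sinusoidal encoding evaluated at the
    (possibly fractional) age at each exam instead of an integer sequence index."""

    def __init__(self, dim: int, max_period: float = 10000.0):
        super().__init__()
        self.dim = dim
        self.max_period = max_period

    def forward(self, ages: torch.Tensor) -> torch.Tensor:
        # ages: (batch, seq_len) in years
        half = self.dim // 2
        freqs = torch.exp(
            -math.log(self.max_period)
            * torch.arange(half, dtype=torch.float32, device=ages.device) / half
        )
        args = ages.unsqueeze(-1) * freqs                       # (batch, seq_len, half)
        return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)


class STAHCCSketch(nn.Module):
    """Sketch: per-exam CNN features + age-based positions + transformer encoder."""

    def __init__(self, feat_dim: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        # Stand-in backbone; the paper's backbone and ROI-attention block are not
        # specified in the abstract, so a small conv stack is used here.
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.age_pos = AgePositionEmbedding(feat_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(feat_dim, 1)  # HCC vs. no-HCC logit

    def forward(self, images: torch.Tensor, ages: torch.Tensor) -> torch.Tensor:
        # images: (batch, seq_len, 1, H, W); ages: (batch, seq_len)
        b, t = images.shape[:2]
        feats = self.backbone(images.flatten(0, 1)).view(b, t, -1)
        feats = feats + self.age_pos(ages)          # inject exam-age positions
        encoded = self.encoder(feats)               # (batch, seq_len, feat_dim)
        return self.head(encoded.mean(dim=1)).squeeze(-1)  # one logit per subject


if __name__ == "__main__":
    model = STAHCCSketch()
    imgs = torch.randn(2, 5, 1, 64, 64)             # 2 subjects, 5 exams each
    ages = torch.tensor([[50.0, 50.5, 51.2, 52.0, 53.1],
                         [61.0, 61.8, 62.4, 63.0, 64.2]])
    print(model(imgs, ages).shape)                  # torch.Size([2])
```

Using the exam age as the positional signal, rather than the visit index, lets the model reflect uneven intervals between follow-up scans, which is the motivation the abstract gives for the age-based embedding.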