Abstract
The present study explores how text-image cohesion is achieved in a dataset of thirty-seven political ads from the playlist “Why we are voting Biden-Harris,” which is part of the 2020 Biden for President Campaign. A further objective is to analyze how multimodal cohesion contributes to persuasion in political campaign ads. Using methods from Systemic Functional Linguistics (Halliday and Hasan 1976) and multimodal views of cohesion (Tseng 2013; Bateman 2014), the study obtains both quantitative and qualitative results. The quantitative results reveal that lexical cohesion is the most frequent type, followed by reference, conjunction, ellipsis, and substitution. Visual and verbal chains also outnumber audio chains and are thus responsible for cohesion in the ads. The qualitative results show that multimodal cohesion is a powerful tool for supporting persuasion in political campaigns by appealing to emotions, a hypothetical future, rationality, voices of expertise, and altruism (Reyes 2011).