Abstract
Sentiment analysis is a computational technique that analyses the subjective information conveyed in an expression, including appraisals, opinions, attitudes, or emotions towards a particular subject, individual, or entity. Conventional sentiment analysis considers only the text modality and derives sentiment by identifying the semantic relationships between words within a sentence. However, certain expressions, such as exaggeration, sarcasm, and humor, are difficult to detect automatically when conveyed through text alone. Multimodal sentiment analysis incorporates other forms of data, such as visual and acoustic cues, in addition to text. By fusing these modalities, this approach can determine the implied sentiment polarity (positive, neutral, or negative) more precisely. Recent advances in deep learning have substantially advanced the field of multimodal sentiment analysis, and the research community has shown significant interest in the topic because of its potential for both practical application and educational research. This paper therefore presents a thorough analysis of recent research in multimodal sentiment analysis that employs deep learning models across modalities such as text, audio, image, and video. It also discusses the categories of multimodal data, the domains in which multimodal sentiment analysis can be applied, the operations that are integral to it, deep learning architectures, fusion methods, the associated challenges, and the benchmark datasets together with the state-of-the-art approaches. The ultimate goal of this survey is to show how successfully deep learning architectures tackle the complexities of multimodal sentiment analysis.