Lu Rong, Yijie Ding, Mengyao Wang, Abdulmotaleb El Saddik, M. Shamim Hossain
Source
Journal: IEEE Transactions on Consumer Electronics [Institute of Electrical and Electronics Engineers]. Date: 2024-01-31. Volume/Issue: 70(1): 3697-3708
Identifier
DOI: 10.1109/TCE.2024.3357543
Abstract
Recent advancements in consumer electronics and imaging technology have generated abundant multimodal data for consumer-centric AI applications. Effective analysis and utilization of such heterogeneous data hold great potential for informing consumption decisions, making the analysis of multimodal consumer-generated content a prominent research topic in consumer-centric artificial intelligence (AI). However, two key challenges arise in this task: multimodal representation and multimodal fusion. To address these issues, we propose a multimodal embedding from the language model (MELMo) enhanced decision-making model. The main idea is to extend ELMo to a multimodal scenario by designing a deep contextualized visual embedding from the language model (VELMo) and to model multimodal fusion at the decision level using a cross-modal attention mechanism. In addition, we design a novel multitask decoder to learn shared knowledge from related tasks. We evaluate our approach on two benchmark datasets, CMU-MOSI and CMU-MOSEI, and show that MELMo outperforms state-of-the-art approaches. The F1 scores on CMU-MOSI and CMU-MOSEI reach 86.1% and 85.2%, respectively, improvements of approximately 1.0% and 1.3% over the previous state of the art, providing an effective technique for multimodal consumer analytics in electronics and beyond.
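To make the decision-level, cross-modal attention fusion described in the abstract concrete, the sketch below shows one plausible way such a fusion block could be wired up. It is a minimal illustration, not the authors' implementation: the module name, dimensions, two-modality setup (text plus VELMo-style visual embeddings), and mean-pooling fusion are all assumptions.

```python
# Minimal sketch of decision-level cross-modal attention fusion.
# All names, dimensions, and the pooling/fusion choices are illustrative
# assumptions; the paper's actual MELMo/VELMo architecture may differ.
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    """Fuses text and visual sequence embeddings with cross-modal attention."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 2):
        super().__init__()
        # Each modality attends over the other modality's sequence.
        self.text_to_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Decision-level head over the concatenated, pooled attended streams.
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, text_emb: torch.Tensor, visual_emb: torch.Tensor) -> torch.Tensor:
        # text_emb:   (batch, seq_t, dim) contextual language embeddings
        # visual_emb: (batch, seq_v, dim) VELMo-style visual embeddings (assumed)
        t2v, _ = self.text_to_visual(text_emb, visual_emb, visual_emb)
        v2t, _ = self.visual_to_text(visual_emb, text_emb, text_emb)
        # Pool each attended stream and fuse at the decision level.
        fused = torch.cat([t2v.mean(dim=1), v2t.mean(dim=1)], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = CrossModalAttentionFusion()
    text = torch.randn(8, 20, 256)    # dummy text sequence embeddings
    visual = torch.randn(8, 30, 256)  # dummy visual sequence embeddings
    print(model(text, visual).shape)  # torch.Size([8, 2])
```

In this reading, fusion happens only after each modality has produced its own contextualized representation, which matches the abstract's description of modeling fusion at the decision level rather than at the feature-extraction stage.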