Authors
Litian Zhang, Xiaoming Zhang, L.H. Han, Zelong Yu, Yun Liu, Zhoujun Li
Abstract
With the rise of multimedia content on the internet, multimodal summarization has become a challenging task that helps individuals grasp vital information quickly. However, previous methods mainly learn the different modalities indistinguishably, which is ineffective in capturing the fine-grained content and hierarchical correlations in multimodal articles. To address this problem, this paper proposes a Multi-task Hierarchical Heterogeneous Fusion Framework (MHHF) to learn the hierarchical structure and heterogeneous correlations in multimodal data. Specifically, we propose a Hierarchical Cross-modality Feature Fusion module that progressively explores different levels of interaction, from object-word features to sentence-scene features. In addition, a Multi-task Cross-modality Decoder is constructed to coalesce the different levels of features across three sub-tasks, i.e., Abstractive Summary Generation, Relevant Image Selection, and Extractive Summary Generation. We conduct extensive experiments on three datasets, i.e., the MHHF-dataset, CNN, and Daily Mail, which consist of 62,880, 1,970, and 203 multimodal articles, respectively. Our method achieves state-of-the-art performance on most metrics. Moreover, MHHF consistently outperforms the baseline model on the MHHF-dataset by 5.88%, 4.41%, and 0.4% in Rouge-1, Rouge-2, and Rouge-L for the abstractive summarization task. Ablation studies show that both the Hierarchical Cross-modality Feature Fusion module and the Multi-task Cross-modality Decoder improve the quality of the multimodal summarization output. Further diversity analysis and human evaluation also demonstrate that MHHF generates more informative and fluent summaries.
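The abstract names two components, a hierarchical cross-modality fusion module and a multi-task decoder with three heads, without specifying their internals. The following is a minimal PyTorch sketch of how such a two-level fusion and three-task decoding setup could be wired together; all module choices (multi-head attention for each fusion level, linear heads per sub-task), dimensions, and names are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class HierarchicalCrossModalFusion(nn.Module):
    """Two fusion levels: fine-grained object-word and coarse sentence-scene (assumed design)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Low level: word tokens attend to detected object features.
        self.word_obj_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # High level: sentence representations attend to scene features.
        self.sent_scene_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, words, objects, sents, scenes):
        low, _ = self.word_obj_attn(words, objects, objects)    # object-word fusion
        high, _ = self.sent_scene_attn(sents, scenes, scenes)   # sentence-scene fusion
        return low, high


class MultiTaskDecoder(nn.Module):
    """Three hypothetical heads: abstractive generation, image selection, extractive selection."""

    def __init__(self, dim: int = 256, vocab: int = 30000):
        super().__init__()
        self.abstractive_head = nn.Linear(dim, vocab)  # token logits per decoding step
        self.image_head = nn.Linear(dim, 1)            # relevance score per candidate image
        self.extractive_head = nn.Linear(dim, 1)       # keep/drop score per sentence

    def forward(self, low, high, image_feats):
        return (
            self.abstractive_head(low),                 # abstractive summary logits
            self.image_head(image_feats).squeeze(-1),   # relevant image scores
            self.extractive_head(high).squeeze(-1),     # extractive sentence scores
        )


if __name__ == "__main__":
    B, dim = 2, 256
    fusion, decoder = HierarchicalCrossModalFusion(dim), MultiTaskDecoder(dim)
    words, objects = torch.randn(B, 50, dim), torch.randn(B, 10, dim)
    sents, scenes = torch.randn(B, 8, dim), torch.randn(B, 3, dim)
    low, high = fusion(words, objects, sents, scenes)
    abs_logits, img_scores, ext_scores = decoder(low, high, scenes)
    print(abs_logits.shape, img_scores.shape, ext_scores.shape)
```

In this sketch the three heads would be trained jointly with task-specific losses, which is one plausible reading of the multi-task setup described in the abstract; the paper's actual training objective is not reproduced here.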