Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment

Computer science, Artificial intelligence, Natural language processing, Computer vision, Pattern recognition (psychology)
Authors
Hao Fei, Shengqiong Wu, Meishan Zhang, Min Zhang, Tat‐Seng Chua, Shuicheng Yan
Source
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence [IEEE Computer Society]
Volume (issue): 46 (12): 7701-7719 Cited by: 14
Identifier
DOI:10.1109/tpami.2024.3393452
Abstract

While pre-training large-scale video-language models (VLMs) has shown remarkable potential for various downstream video-language tasks, existing VLMs can still suffer from certain commonly seen limitations, e.g., coarse-grained cross-modal alignment, under-modeling of temporal dynamics, and a detached video-language view. In this work, we target enhancing VLMs with a fine-grained structural spatio-temporal alignment learning method (namely Finsta). First, we represent the input texts and videos with fine-grained scene graph (SG) structures, both of which are further unified into a holistic SG (HSG) for bridging the two modalities. Then, an SG-based framework is built, where the textual SG (TSG) is encoded with a graph Transformer, while the video dynamic SG (DSG) and the HSG are modeled with a novel recurrent graph Transformer for spatial and temporal feature propagation. A spatial-temporal Gaussian differential graph Transformer is further devised to strengthen the sense of the changes in objects across spatial and temporal dimensions. Next, based on the fine-grained structural features of TSG and DSG, we perform object-centered spatial alignment and predicate-centered temporal alignment, respectively, enhancing the video-language grounding in both spatiality and temporality. We design our method as a plug-and-play system, which can be integrated into existing well-trained VLMs for further representation augmentation, without training from scratch or relying on SG annotations in downstream applications. On 6 representative VL modeling tasks over 12 datasets in both standard and long-form video scenarios, Finsta consistently improves 13 existing strong-performing VLMs and significantly advances the current state-of-the-art end-task performance in both the fine-tuning and zero-shot settings.
