DaViT: Dual Attention Vision Transformers

Authors
Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, Yuan Liu
Source
Journal: Lecture Notes in Computer Science, pp. 74-92. Cited by: 56
Identifier
DOI: 10.1007/978-3-031-20053-3_5
Abstract

In this work, we introduce Dual Attention Vision Transformers (DaViT), a simple yet effective vision transformer architecture that is able to capture global context while maintaining computational efficiency. We propose approaching the problem from an orthogonal angle: exploiting self-attention mechanisms with both “spatial tokens” and “channel tokens”. With spatial tokens, the spatial dimension defines the token scope, and the channel dimension defines the token feature dimension. With channel tokens, we have the inverse: the channel dimension defines the token scope, and the spatial dimension defines the token feature dimension. We further group tokens along the sequence direction for both spatial and channel tokens to maintain the linear complexity of the entire model. We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention. Extensive experiments show DaViT backbones achieve state-of-the-art performance on four different tasks. Specifically, DaViT-Tiny, DaViT-Small, and DaViT-Base achieve 82.8%, 84.2%, and 84.6% top-1 accuracy on ImageNet-1K without extra training data, using 28.3M, 49.7M, and 87.9M parameters, respectively. When we further scale up DaViT with 1.5B weakly supervised image and text pairs, DaViT-Giant reaches 90.4% top-1 accuracy on ImageNet-1K. Code is available at https://github.com/microsoft/DaViT.
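The channel-attention idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that); it assumes identity Q/K/V projections, a single head, no normalization layers, and a simplified scaling choice, purely to show the role reversal: channels act as tokens, and the spatial dimension of length N serves as each channel token's feature vector, so every attention score already aggregates all spatial positions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(x, num_groups=2):
    """Channel group self-attention sketch.

    x: (N, C) array of N spatial tokens with C channels.
    Channels are grouped along the channel axis so complexity
    stays linear in N, mirroring the grouping described above.
    """
    N, C = x.shape
    assert C % num_groups == 0
    cg = C // num_groups
    xt = x.T.reshape(num_groups, cg, N)          # (G, Cg, N): channels as tokens
    scale = cg ** -0.5                           # simplified scaling (assumption)
    out = np.empty_like(xt)
    for g in range(num_groups):
        q = k = v = xt[g]                        # identity projections for brevity
        attn = softmax((q * scale) @ k.T)        # (Cg, Cg) channel-channel scores,
                                                 # each computed over all N positions
        out[g] = attn @ v                        # mix channels with global context
    return out.reshape(C, N).T                   # back to (N, C)

tokens = np.random.default_rng(0).normal(size=(16, 8))
mixed = channel_attention(tokens, num_groups=2)
print(mixed.shape)                               # (16, 8): shape is preserved
```

Contrast this with the spatial attention branch, which would instead compute an (N × N)-style attention within local windows over the same (N, C) tokens; DaViT alternates the two so local refinement and global channel mixing reinforce each other.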