Keywords
Perplexity, Memory footprint, Transformer, Language model, Computer science, Scaling, Quadratic equation, Inductive bias, Artificial intelligence, Natural language processing, Mathematics, Voltage, Programming language, Electrical engineering, Engineering, Task (project management), Geometry, Systems engineering, Multi-task learning
Authors
Sandeep Subramanian, Ronan Collobert, Marc’Aurelio Ranzato, Y-Lan Boureau
Source
Journal: Cornell University - arXiv
Date: 2020-01-01
Citations: 3
Identifiers
DOI: 10.48550/arxiv.2005.00581
Abstract
We investigate multi-scale transformer language models that learn representations of text at multiple scales, and present three different architectures that have an inductive bias to handle the hierarchical nature of language. Experiments on large-scale language modeling benchmarks empirically demonstrate favorable likelihood vs. memory footprint trade-offs, e.g., we show that it is possible to train a hierarchical variant with 30 layers that has a 23% smaller memory footprint and better perplexity, compared to a vanilla transformer with less than half the number of layers, on the Toronto BookCorpus. We analyze the advantages of learned representations at multiple scales in terms of memory footprint, compute time, and perplexity, which are particularly appealing given the quadratic scaling of transformers' run time and memory usage with respect to sequence length.
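The trade-off described in the abstract stems from self-attention building a full sequence-length-by-sequence-length score matrix. The sketch below is a minimal NumPy illustration, not code from the paper: the function names (`attention`, `pooled_attention`) and the pooling factor are illustrative assumptions, showing only how attending at a coarser (pooled) scale shrinks that quadratic score matrix, which is the kind of saving multi-scale representations can exploit.

```python
import numpy as np

def attention(q, k, v):
    # Vanilla scaled dot-product attention: the score matrix is
    # (seq_len x seq_len), so memory grows quadratically with length.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def pooled_attention(x, pool=4):
    # Coarse-scale attention (illustrative, not the paper's architecture):
    # average-pool the sequence by `pool` before attending, so the score
    # matrix shrinks to (seq_len/pool x seq_len/pool), a pool^2 reduction.
    L, d = x.shape
    coarse = x[: L - L % pool].reshape(-1, pool, d).mean(axis=1)
    return attention(coarse, coarse, coarse)

if __name__ == "__main__":
    L, d = 1024, 64
    x = np.random.randn(L, d)
    fine = attention(x, x, x)        # score matrix: 1024 x 1024
    coarse = pooled_attention(x, 4)  # score matrix:  256 x  256
    print(fine.shape, coarse.shape)  # (1024, 64) (256, 64)
```

Under these assumptions, halving the attention resolution quarters the score-matrix memory, which is why hierarchical variants can afford more layers for the same footprint.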