Transformer
Computer science
Artificial intelligence
Architecture
Machine learning
Pattern recognition (psychology)
Data mining
Engineering
Voltage
Electrical engineering
Art
Visual arts
Authors
Qinqin Zhou, Kekai Sheng, Xiawu Zheng, Ke Li, Yonghong Tian, Jie Chen, Rongrong Ji
Identifiers
DOI: 10.1109/TPAMI.2024.3378781
Abstract
Transformers have shown remarkable performance; however, their architecture design is a time-consuming process that demands expertise and trial-and-error. It is therefore worthwhile to investigate efficient methods for automatically searching high-performance Transformers via Transformer Architecture Search (TAS). To improve search efficiency, training-free proxy-based methods have been widely adopted in Neural Architecture Search (NAS). However, these proxies have been found to generalize poorly to Transformer search spaces, as confirmed by several studies and our own experiments. This paper presents an effective scheme for TAS called Transformer Architecture search with Zero-cost pRoxy guided evolution (T-Razor) that achieves exceptional efficiency. First, through theoretical analysis, we discover that the synaptic diversity of multi-head self-attention (MSA) and the saliency of the multi-layer perceptron (MLP) are correlated with the performance of the corresponding Transformers. The properties of synaptic diversity and synaptic saliency motivate us to introduce their ranks, denoted as DSS++, for evaluating and ranking Transformers. DSS++ incorporates correlation information among sampled Transformers to provide unified scores for both synaptic diversity and synaptic saliency. We then propose a block-wise evolution search guided by DSS++ to find optimal Transformers. DSS++ determines the positions for mutation and crossover, enhancing the exploration ability. Experimental results demonstrate that T-Razor performs competitively against state-of-the-art manually or automatically designed Transformer architectures across four popular Transformer search spaces. Notably, T-Razor improves search efficiency across different Transformer search spaces, e.g., reducing the required GPU days from more than 24 to less than 0.4 while outperforming existing zero-cost approaches. We also apply T-Razor to the BERT search space and find that the searched Transformers achieve competitive GLUE results on several Natural Language Processing (NLP) datasets. This work provides insights into training-free TAS, revealing the usefulness of evaluating Transformers based on the properties of their different blocks.
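The abstract rests on two training-free proxies, the synaptic diversity of MSA blocks and the synaptic saliency of MLP blocks, which DSS++ turns into ranks over sampled Transformers. The sketch below is a minimal illustration of that idea under stated assumptions, not the paper's exact formulation: it assumes a SNIP/SynFlow-style saliency (sum of |weight x gradient| under a surrogate loss) for an MLP, a nuclear-norm diversity measure for an MSA projection matrix, and a simple rank-sum aggregation; all function names and the toy candidates are hypothetical.

import torch
import torch.nn as nn

def mlp_saliency(mlp: nn.Module, x: torch.Tensor) -> float:
    # Saliency proxy (assumed form): sum of |w * dL/dw| with a surrogate loss = sum of outputs.
    mlp.zero_grad()
    mlp(x).sum().backward()
    return sum((p * p.grad).abs().sum().item()
               for p in mlp.parameters() if p.grad is not None)

def msa_diversity(weight: torch.Tensor) -> float:
    # Diversity proxy (assumed form): nuclear norm (sum of singular values) of a projection matrix.
    return torch.linalg.svdvals(weight).sum().item()

def rank_sum(scores_per_proxy):
    # Rank-based aggregation sketch: rank candidates under each proxy, then sum the ranks.
    n = len(scores_per_proxy[0])
    totals = [0] * n
    for scores in scores_per_proxy:
        order = sorted(range(n), key=lambda i: scores[i])
        for rank, idx in enumerate(order):
            totals[idx] += rank
    return totals  # higher total = better under both proxies

if __name__ == "__main__":
    torch.manual_seed(0)
    # Two toy "candidate architectures", each reduced to one MLP and one MSA projection matrix.
    candidates = [
        (nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 64)), torch.randn(64, 192)),
        (nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)), torch.randn(64, 192)),
    ]
    x = torch.randn(8, 64)
    sal = [mlp_saliency(m, x) for m, _ in candidates]
    div = [msa_diversity(w) for _, w in candidates]
    print("rank-sum scores:", rank_sum([sal, div]))

The rank-sum step only mimics the ranking spirit of DSS++; the paper's score additionally exploits correlation information among sampled Transformers and drives the positions of mutation and crossover in the block-wise evolution search.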