Computer science
Interpretability
Feature learning
Feature (machine learning)
Representation learning
Transformer (machine learning model)
Artificial intelligence
Linguistics
Authors
Guosheng Feng,Danqing Huang,Chin-Yew Lin,Damjan Dakic,Milos Milunovic,Tamara Stankovic,Igor Ilic
Identifier
DOI:10.1109/ictai56018.2022.00046
Abstract
There is increasing interest in document layout representation learning and understanding. The Transformer, with its strong modeling capacity, has become the mainstream architecture and has achieved promising results in this area. Since the elements in a document layout carry multi-modal, multi-dimensional features such as position, size, and text content, prior works represent each element by summing all feature embeddings into one unified vector in the input layer, which is then fed into self-attention for element-wise interaction. However, this simple summation can introduce mixed correlations among heterogeneous features and add noise to the representation learning. In this paper, we propose a novel two-step disentangled attention mechanism that allows more flexible feature interactions in self-attention. Furthermore, inspired by principles of document design (e.g., contrast, proximity), we propose an unsupervised learning objective that constrains the layout representations. We verify our approach on two layout understanding tasks, namely element role labeling and image captioning. Experimental results show that our approach achieves state-of-the-art performance. Moreover, extensive studies show that our approach yields better interpretability.
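The contrast the abstract draws — summing heterogeneous feature embeddings before self-attention versus scoring each feature type separately — can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual mechanism (the abstract does not specify the two-step formulation); all function names, the per-feature projection scheme, and the additive score combination are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def summed_input_attention(feats, Wq, Wk):
    """Prior approach: sum all feature embeddings into one unified
    vector per element, then compute a single attention score matrix.
    feats: dict of feature name -> (n_elements, d) embedding matrix."""
    x = sum(feats.values())  # heterogeneous features mixed into one vector
    scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(Wq.shape[1])
    return softmax(scores)

def disentangled_attention(feats, proj):
    """Hypothetical disentangled variant: each feature type gets its own
    query/key projections, per-type scores are computed independently and
    then combined, so e.g. position-text cross terms are not entangled
    inside a single dot product.
    proj: dict of feature name -> (Wq, Wk) projection pair."""
    n = next(iter(feats.values())).shape[0]
    scores = np.zeros((n, n))
    for name, x in feats.items():
        Wq, Wk = proj[name]
        scores += (x @ Wq) @ (x @ Wk).T / np.sqrt(Wq.shape[1])
    return softmax(scores)
```

The key design point is that in the summed-input form, the attention logit expands into all pairwise cross-correlations between feature types, whereas the disentangled form keeps only the within-type terms, which is one way to avoid the "mixed correlations" the abstract identifies.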