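Computer science
Cloud computing
Transformer
Distributed computing
Embedded system
Operating system
Engineering
Electrical engineering
Voltage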
Authors
Lingfei Deng, Yunong Wang, H. H. Wang, Xuhua Ma, Xiaoming Du, Xudong Zheng, Dongrui Wu
Identifier
DOI:10.1145/3637528.3671547
Abstract
Log-based failure prediction helps identify and mitigate system failures ahead of time, increasing the reliability of cloud elastic computing systems. However, most existing log-based failure prediction approaches focus only on semantic information and do not make full use of the information contained in the timestamps of log messages. This paper proposes the time-aware attention-based transformer (TAAT), a failure prediction approach that extracts semantic and temporal information simultaneously from log messages and their timestamps. TAAT first tokenizes raw log messages into specific exceptions, and then performs: 1) exception sequence embedding, which reorganizes the exceptions of each node as an ordered sequence and converts them to vectors; 2) time relation estimation, which computes time relation matrices from the timestamps; and 3) time-aware attention, which computes semantic correlation matrices from the exception sequences and then combines them with the time relation matrices. Experiments on Alibaba Cloud demonstrated that TAAT achieves an approximately 10% performance improvement over state-of-the-art approaches. TAAT is now used in the daily operation of Alibaba Cloud. Moreover, this paper also releases the real-world cloud computing failure prediction dataset used in the study, which consists of about 2.7 billion syslogs from about 300,000 node controllers over a 4-month period. To the authors' knowledge, this is the largest dataset of its kind, and it is expected to be very useful to the community.
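To make the idea of combining a semantic correlation matrix with a time relation matrix concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not say how the time relation matrix is computed or how the two matrices are combined. This sketch assumes, purely for illustration, that the time relation is the negative absolute time gap divided by a hypothetical `scale` parameter, and that it is added to scaled dot-product scores before the softmax. The function names `time_relation_matrix` and `time_aware_attention` are likewise hypothetical.

```python
import math


def time_relation_matrix(timestamps, scale=60.0):
    # ASSUMPTION: time relation = negative absolute gap, in units of `scale` seconds.
    # The abstract does not specify the actual estimator used by TAAT.
    return [[-abs(ti - tj) / scale for tj in timestamps] for ti in timestamps]


def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def time_aware_attention(embeddings, timestamps, scale=60.0):
    d = len(embeddings[0])
    # Semantic correlation: scaled dot product between exception embeddings.
    semantic = [
        [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in embeddings]
        for q in embeddings
    ]
    temporal = time_relation_matrix(timestamps, scale)
    # ASSUMPTION: the two matrices are combined by addition before the softmax.
    weights = [
        softmax([s + t for s, t in zip(s_row, t_row)])
        for s_row, t_row in zip(semantic, temporal)
    ]
    # Each output vector is the attention-weighted sum of the embeddings.
    outputs = [
        [sum(w * v[c] for w, v in zip(w_row, embeddings)) for c in range(d)]
        for w_row in weights
    ]
    return weights, outputs


if __name__ == "__main__":
    # Three exceptions with identical embeddings; only the timestamps differ.
    emb = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
    ts = [0.0, 10.0, 3600.0]  # seconds
    w, _ = time_aware_attention(emb, ts)
    print([round(x, 4) for x in w[0]])
```

Under these assumptions, two exceptions with the same semantic score receive different attention weights depending on how far apart they occurred: in the usage example, the exception logged 10 seconds after the first gets more weight than the one logged an hour later.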