Zhan Zhang, Tianming Liu, Yanjun Shu, Siyuan Chen, Xian Liu
Identifier
DOI:10.1109/icpads56603.2022.00076
Abstract
For a stream processing system that uses checkpointing for fault tolerance, choosing an appropriate checkpoint interval is key to running streaming applications efficiently. State-of-the-art stream processing systems currently support only fixed-interval checkpoints, which makes it difficult to strike a good trade-off between the overhead of fault-tolerant processing and the cost of failure recovery in dynamically changing streaming application scenarios. Moreover, in a complex distributed streaming environment, dynamic environmental indicators (e.g., workload and failure rate) often do not match the assumptions of existing models; for example, Twitter data on hot events changes quickly. In this paper, we account for the dynamic changes of environmental indicators and propose DACM, a method that dynamically adjusts the checkpoint interval through reinforcement learning. DACM adaptively optimizes processing delay and fault recovery time while avoiding the need to model the overall streaming application environment. Experiments conducted on the Flink platform show that DACM reduces processing delay by 10% and failure recovery time by 37% compared with existing checkpoint interval optimization models.
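The idea of learning a checkpoint interval from feedback, rather than fixing it in advance, can be illustrated with a small sketch. This is not DACM: the abstract does not specify its state, action, or reward design, so everything below is an assumption made for illustration. The candidate intervals, the two failure-rate states, the toy cost function `simulated_cost`, and the helper names are all hypothetical. The learner uses an epsilon-greedy, one-step (bandit-style) value update instead of a full reinforcement learning formulation.

```python
import random

# Hypothetical candidate checkpoint intervals, in seconds (not from the paper).
INTERVALS = [5, 10, 30, 60, 120]
# Hypothetical failure rates (failures per second) for two coarse states:
# index 0 = stable environment, index 1 = failure-prone environment.
FAILURE_RATES = [0.0005, 0.02]


def simulated_cost(interval_s, failure_rate, checkpoint_cost_s=1.0):
    """Toy cost per second of processing (an assumption, not the paper's model).

    Frequent checkpoints raise overhead; sparse checkpoints raise the
    expected amount of work that must be redone after a failure.
    """
    checkpoint_overhead = checkpoint_cost_s / interval_s
    expected_rework = failure_rate * interval_s / 2.0
    return checkpoint_overhead + expected_rework


def best_action(q, state):
    """Index of the interval with the highest learned value in this state."""
    return max(range(len(INTERVALS)), key=lambda a: q[state][a])


def train(steps=20000, alpha=0.1, epsilon=0.1, seed=0):
    """Learn a value per (state, interval) pair with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [[0.0] * len(INTERVALS) for _ in FAILURE_RATES]
    for _ in range(steps):
        # The environment state changes over time; here it is drawn at random.
        state = rng.randrange(len(FAILURE_RATES))
        if rng.random() < epsilon:
            action = rng.randrange(len(INTERVALS))  # explore
        else:
            action = best_action(q, state)  # exploit
        # Reward is the negative cost, so lower cost means higher value.
        reward = -simulated_cost(INTERVALS[action], FAILURE_RATES[state])
        q[state][action] += alpha * (reward - q[state][action])
    return q


if __name__ == "__main__":
    q_table = train()
    for s, rate in enumerate(FAILURE_RATES):
        print(f"failure rate {rate}: interval {INTERVALS[best_action(q_table, s)]} s")
```

Under this toy cost, the learner settles on a longer interval when failures are rare and a shorter one when they are frequent, which is the kind of adaptation a fixed interval cannot provide. A real system would replace `simulated_cost` with measured processing delay and recovery time.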