Remote direct memory access
Cloud computing
Computer science
Overhead (engineering)
Database
Operating system
Algorithm
Authors
Zhuo Song,Jiejian Wu,Teng Ma,Zhe Wang,Linghe Kong,Zhenzao Wen,Jingxuan Li,Yang Lu,Yong Yang,Tao Ma,Zheng Liu,Guihai Chen
Identifier
DOI:10.1109/tnet.2024.3394514
Abstract
Cloud services have shifted from monolithic designs to microservices running on cloud-native infrastructure, with monitoring systems in place to ensure service level agreements (SLAs). However, traditional monitoring systems no longer meet the demands of cloud-native monitoring. During Alibaba's "Double Eleven" shopping festival, the monitor was observed to occupy resources of the monitored infrastructure and even to disrupt services. In this paper, we propose a novel system for cloud-native monitoring. The system achieves zero overhead in collecting raw metrics by using one-sided remote direct memory access (RDMA), and it remedies network congestion by adopting a receiver-driven flow control scheme. It also features a priority queue mechanism to meet different quality of service requirements and an efficient batch processing design to relieve CPU occupation. The system has been deployed and evaluated in four different clusters with heterogeneous RDMA NIC devices and architectures in Alibaba Cloud. Results show that it achieves no CPU occupation at the monitored host and supports $1\sim10k$ hosts with a $0.1\sim1s$ sampling interval using a single thread for network I/O. It significantly relieves the incast issue and maintains $80\sim95\%$ of bandwidth utilization in several clusters when monitoring $1k$ hosts. It also ensures that high-priority services finish collecting metrics earlier than low-priority ones by at least $400 \mu s$ when monitoring $1k$ hosts.
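As a rough illustration of how a priority queue and a receiver-driven limit on outstanding reads can interact, the following Python sketch orders metric reads by priority and releases them in bounded batches. It is not the paper's implementation: the names `ReadRequest`, `schedule_reads`, and `window` are hypothetical, no RDMA call is made, and flow control is modelled simply as a fixed cap on the number of reads issued at once.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class ReadRequest:
    """One pending metric read (hypothetical structure, for illustration only)."""
    priority: int                     # lower value = higher priority
    seq: int                          # tie-breaker: keeps FIFO order within a priority
    host: str = field(compare=False)  # not used when ordering requests


def schedule_reads(hosts, window):
    """Group hosts into batches of at most `window` reads, highest priority first.

    hosts:  iterable of (host_name, priority) pairs.
    window: assumed cap on outstanding reads chosen by the receiver (the monitor).
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    heap = [ReadRequest(prio, seq, name) for seq, (name, prio) in enumerate(hosts)]
    heapq.heapify(heap)
    batches = []
    while heap:
        # The receiver decides how many reads to issue, so senders cannot
        # all answer at once and overload the receiver's link (incast).
        size = min(window, len(heap))
        batches.append([heapq.heappop(heap).host for _ in range(size)])
    return batches


if __name__ == "__main__":
    demo = [("h1", 1), ("h2", 0), ("h3", 1), ("h4", 0), ("h5", 2)]
    for batch in schedule_reads(demo, window=2):
        print(batch)
```

In this toy model the two priority-0 hosts are read in the first batch, so their metrics arrive before those of lower-priority hosts; a real system would issue the batched reads through the RDMA NIC and adapt the window to observed congestion.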