Authors
Z.J. Lu, Heng Pan, Yueyue Dai, Xueming Si, Yan Zhang
Abstract
Federated learning (FL) is an efficient decentralized machine learning approach for handling data that is non-independent and identically distributed (non-IID) because it is collected at different locations and times. Non-IID data generally means that data distributions and features differ substantially among clients. This contrasts with the conventional assumption of independent and identically distributed (IID) data, in which all clients' data originates from the same distribution. Many factors shape the features of non-IID data, such as user preferences, data collection methods, and client characteristics. Its statistical properties are further affected by data distribution, category proportions, and feature representation. This paper explores FL in depth, taking into account the diverse features and statistical properties of non-IID data. Specifically, we first discuss the impact of non-IID data on communication efficiency, model convergence, and FL accuracy. Non-IID data leads to increased communication overhead, imbalanced class distributions, and uneven local model updates, all of which affect FL convergence and performance. Then, we present recent techniques, such as data partitioning/sharing, client selection, differential privacy, and secure aggregation [1], that address the challenges posed by non-IID data in terms of communication efficiency and privacy protection. Furthermore, we review emerging applications and use cases of FL with non-IID data in domains such as healthcare, IoT, and edge computing. Overall, this survey provides a comprehensive understanding of FL with non-IID data, covering its challenges, advancements, and practical applications.