Efficient task-specific data valuation for nearest neighbor algorithms

近似算法 算法 计算机科学 有界函数 次线性函数 局部敏感散列 k-最近邻算法 数据点 数学 离散数学 散列函数 人工智能 哈希表 计算机安全 数学分析
作者
Ruoxi Jia,David Dao,Boxin Wang,Frances Ann Hubis,Nezihe Merve Gürel,Bo Li,Ce Zhang,Costas J. Spanos,Dawn Song
出处
期刊:Proceedings of the VLDB Endowment [Association for Computing Machinery]
卷期号:12 (11): 1610-1623 被引量:70
标识
DOI:10.14778/3342263.3342637
摘要

Given a data set D containing millions of data points and a data consumer who is willing to pay for $ X to train a machine learning (ML) model over D , how should we distribute this $X to each data point to reflect its "value"? In this paper, we define the "relative value of data" via the Shapley value, as it uniquely possesses properties with appealing real-world interpretations, such as fairness, rationality and decentralizability. For general, bounded utility functions, the Shapley value is known to be challenging to compute: to get Shapley values for all N data points, it requires O (2 N ) model evaluations for exact computation and O ( N log N ) for ( ϵ , δ)-approximation. In this paper, we focus on one popular family of ML models relying on K -nearest neighbors ( K NN). The most surprising result is that for unweighted K NN classifiers and regressors, the Shapley value of all N data points can be computed, exactly , in O ( N log N ) time - an exponential improvement on computational complexity! Moreover, for ( ϵ , δ)-approximation, we are able to develop an algorithm based on Locality Sensitive Hashing (LSH) with only sublinear complexity O ( N h ( ϵ , K ) log N ) when ϵ is not too small and K is not too large. We empirically evaluate our algorithms on up to 10 million data points and even our exact algorithm is up to three orders of magnitude faster than the baseline approximation algorithm. The LSH-based approximation algorithm can accelerate the value calculation process even further. We then extend our algorithm to other scenarios such as (1) weighed K NN classifiers, (2) different data points are clustered by different data curators , and (3) there are data analysts providing computation who also requires proper valuation. Some of these extensions, although also being improved exponentially, are less practical for exact computation (e.g., O ( N K ) complexity for weigthed K NN). We thus propose an Monte Carlo approximation algorithm, which is O ( N (log N ) 2 /(log K ) 2 ) times more efficient than the baseline approximation algorithm.
最长约 10秒,即可获得该文献文件

科研通智能强力驱动
Strongly Powered by AbleSci AI
科研通是完全免费的文献互助平台,具备全网最快的应助速度,最高的求助完成率。 对每一个文献求助,科研通都将尽心尽力,给求助人一个满意的交代。
实时播报
玛卡完成签到,获得积分10
刚刚
123完成签到 ,获得积分10
1秒前
zz发布了新的文献求助10
1秒前
2秒前
迅速的念芹完成签到 ,获得积分10
2秒前
2秒前
春天在这李完成签到,获得积分10
3秒前
Tim发布了新的文献求助10
3秒前
Alara完成签到,获得积分10
4秒前
Jasper应助顾天理采纳,获得10
4秒前
xujjjjj完成签到,获得积分10
5秒前
7秒前
玛卡发布了新的文献求助10
7秒前
阳光梦易完成签到 ,获得积分10
9秒前
10秒前
小博小博完成签到 ,获得积分10
10秒前
紫荆发布了新的文献求助10
11秒前
Lucas应助HL采纳,获得50
12秒前
13秒前
17秒前
18秒前
乐乐应助lxy采纳,获得10
20秒前
22秒前
粒子发布了新的文献求助10
22秒前
22秒前
23秒前
秋半梦发布了新的文献求助10
24秒前
LULU完成签到,获得积分10
25秒前
rl完成签到,获得积分10
25秒前
怡然帅完成签到 ,获得积分10
25秒前
26秒前
26秒前
852应助嘿嘿江采纳,获得10
26秒前
27秒前
27秒前
LULU发布了新的文献求助10
27秒前
自强不息发布了新的文献求助10
27秒前
koitoyu发布了新的文献求助20
27秒前
nunu发布了新的文献求助10
28秒前
阿奎罗完成签到,获得积分10
30秒前
高分求助中
Continuum Thermodynamics and Material Modelling 2000
Neuromuscular and Electrodiagnostic Medicine Board Review 1000
こんなに痛いのにどうして「なんでもない」と医者にいわれてしまうのでしょうか 510
いちばんやさしい生化学 500
Genre and Graduate-Level Research Writing 500
The First Nuclear Era: The Life and Times of a Technological Fixer 500
岡本唐貴自伝的回想画集 500
热门求助领域 (近24小时)
化学 材料科学 医学 生物 工程类 有机化学 物理 生物化学 纳米技术 计算机科学 化学工程 内科学 复合材料 物理化学 电极 遗传学 量子力学 基因 冶金 催化作用
热门帖子
关注 科研通微信公众号,转发送积分 3676016
求助须知:如何正确求助?哪些是违规求助? 3230416
关于积分的说明 9791024
捐赠科研通 2941498
什么是DOI,文献DOI怎么找? 1612630
邀请新用户注册赠送积分活动 761174
科研通“疑难数据库(出版商)”最低求助积分说明 736702