Correcting Measurement Error in Regression Models with Variables Constructed from Aggregated Output of Data Mining Models

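Keywords: Regression analysis; Regression; Statistics; Data mining; Computer science; Econometrics; Errors-in-variables models; Observational error; Mathematics
Authors
Mengke Qiao, Ke‐Wei Huang
Source
Journal: Management Information Systems Quarterly [MIS Quarterly]
Identifier
DOI: 10.25300/misq/2024/18026
Abstract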

The burgeoning interest in data mining has catalyzed a proliferation of innovative techniques for extracting useful information from unstructured data sources, such as text and images, in the social sciences. One typical research design involves a two-stage process. In the first stage, researchers apply a classification algorithm to predict an individual-level categorical variable. In the second stage, they aggregate the predicted values to construct a group-level variable for further regression analysis. For example, text classification has been applied to classify whether a review is positive or negative; the predicted review sentiment is then aggregated at the product level and used as a focal independent variable in a regression model that examines the impact of average review sentiment on product sales. Because the first-stage classification inevitably contains errors, the aggregated variable may suffer from measurement error in the regression analysis. Our study attempts to systematically investigate the theoretical properties of the estimation bias and to introduce theory-based solutions that mitigate the measurement error. We propose one exact solution and two approximate solutions based on the Central Limit Theorem (CLT) and the Law of Large Numbers (LLN), respectively. Our theoretical analysis and experiments confirm that the consistency of the regression estimators can be recovered across all examined scenarios and that the approximate solutions have significantly lower computational complexity than the exact solution. We also provide heuristic guidelines for choosing among the three solutions.
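
The two-stage design described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' method: the abstract does not spell out the exact, CLT-based, or LLN-based solutions, so none of them is reproduced here. It only shows how a regression slope can be distorted when a group-level regressor is built by averaging error-prone classifier predictions. All names (`simulate`, `ols_slope`, `sens`, `spec`) and all parameter values are hypothetical. The "rescaled" variant applies a standard textbook adjustment for misclassified proportions and assumes the classifier's sensitivity and specificity are known, which is a strong assumption in practice.

```python
import random


def ols_slope(x, y):
    """Slope of a simple OLS regression of y on x (with intercept)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(n_groups=1000, n_items=300, beta=2.0,
             sens=0.8, spec=0.8, seed=0):
    """Simulate products (groups) with reviews (items).

    sens / spec are the assumed-known sensitivity and specificity
    of a hypothetical first-stage sentiment classifier.
    """
    rng = random.Random(seed)
    x_true, x_naive, y = [], [], []
    for _ in range(n_groups):
        p = rng.uniform(0.2, 0.8)  # group's true positive rate
        true_pos = 0
        pred_pos = 0
        for _ in range(n_items):
            is_pos = rng.random() < p
            if is_pos:
                # true positive detected with probability sens
                predicted = rng.random() < sens
            else:
                # false positive with probability 1 - spec
                predicted = rng.random() >= spec
            true_pos += is_pos
            pred_pos += predicted
        share_true = true_pos / n_items
        x_true.append(share_true)
        x_naive.append(pred_pos / n_items)
        # Second-stage outcome depends on the TRUE group share.
        y.append(1.0 + beta * share_true + rng.gauss(0.0, 0.3))

    # Textbook adjustment: E[x_naive | x_true] = (1 - spec) + (sens + spec - 1) * x_true
    x_rescaled = [(v - (1.0 - spec)) / (sens + spec - 1.0) for v in x_naive]
    return {
        "true": ols_slope(x_true, y),
        "naive": ols_slope(x_naive, y),
        "rescaled": ols_slope(x_rescaled, y),
    }


if __name__ == "__main__":
    for name, slope in simulate().items():
        print(f"{name:>8}: {slope:.3f}")
```

In this setup the naive slope is not simply attenuated: because misclassification compresses the range of the aggregated share, the slope can be inflated instead. The rescaled regressor removes that systematic distortion but still carries classification noise from the finite number of items per group, so some residual bias remains; handling that remainder is the kind of problem the paper's solutions address.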
