Statistics
Bayesian probability
Data mining
Computer science
Econometrics
Data science
Information retrieval
Mathematics
Identifier
DOI:10.1080/00031305.1999.10474456
Abstract
A common data mining task is the search for associations in large databases. Here we consider the search for "interestingly large" counts in a large frequency table, having millions of cells, most of which have an observed frequency of 0 or 1. We first construct a baseline or null hypothesis expected frequency for each cell, and then suggest and compare screening criteria for ranking the cell deviations of observed from expected count. A criterion based on the results of fitting an empirical Bayes model to the cell counts is recommended. An example compares these criteria for searching the FDA Spontaneous Reporting System database maintained by the Division of Pharmacovigilance and Epidemiology. In the example, each cell count is the number of reports combining one of 1,398 drugs with one of 952 adverse events (total of cell counts = 4.9 million), and the problem is to screen the drug-event combinations for possible further investigation.
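The screening idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the baseline is built or what form the empirical Bayes model takes, so everything specific here is an assumption: the baseline is taken to be the row-by-column independence expectation, each count is modelled as Poisson with mean lambda times the expected count, and lambda is given a single gamma prior with fixed, hypothetical hyperparameters `alpha` and `beta`. In a real empirical Bayes fit those hyperparameters would be estimated from the data rather than fixed. The function names are illustrative only.

```python
from collections import defaultdict


def expected_counts(counts):
    """Expected count per cell under independence of rows and columns.

    counts maps (drug, event) -> observed number of reports.
    ASSUMPTION: independence baseline; the abstract does not specify one.
    """
    row_total = defaultdict(float)
    col_total = defaultdict(float)
    grand_total = 0.0
    for (drug, event), n in counts.items():
        row_total[drug] += n
        col_total[event] += n
        grand_total += n
    return {
        (drug, event): row_total[drug] * col_total[event] / grand_total
        for (drug, event) in counts
    }


def shrunk_ratio(n, e, alpha=1.0, beta=1.0):
    """Posterior mean of lambda when n ~ Poisson(lambda * e) and
    lambda ~ Gamma(shape=alpha, rate=beta).

    The posterior is Gamma(alpha + n, beta + e), so its mean is
    (alpha + n) / (beta + e). ASSUMPTION: single gamma prior with
    fixed hyperparameters, chosen only for illustration.
    """
    return (alpha + n) / (beta + e)


def rank_cells(counts, alpha=1.0, beta=1.0):
    """Return (score, cell) pairs sorted from largest to smallest score."""
    expected = expected_counts(counts)
    scored = [
        (shrunk_ratio(n, expected[cell], alpha, beta), cell)
        for cell, n in counts.items()
    ]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Toy table with made-up counts, not data from the FDA database.
    toy = {("A", "x"): 8, ("A", "y"): 2, ("B", "x"): 2, ("B", "y"): 8}
    for score, cell in rank_cells(toy):
        print(cell, round(score, 3))
```

The point of the shrinkage is visible in sparse cells: a cell with one report and an expected count of 0.01 has a raw observed-to-expected ratio of 100, while `shrunk_ratio(1, 0.01)` stays close to 2, so a single report cannot dominate the ranking.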