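Spurious relationship
Biology
Sample size determination
Selection (genetic algorithm)
Statistics
Statistical power
Population
Allele frequency
Allele
Genetics
Computational biology
Evolutionary biology
Computer science
Mathematics
Artificial intelligence
Gene
Sociology
Demography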
Authors
Deborah M. Leigh,Heidi E L Lischer,Christine Grossen,Lukas F. Keller
Identifier
DOI:10.1111/1755-0998.12779
Abstract
High‐throughput sequencing is a powerful tool, but suffers biases and errors that must be accounted for to prevent false biological conclusions. Such errors include batch effects; technical errors only present in subsets of data due to procedural changes within a study. If overlooked and multiple batches of data are combined, spurious biological signals can arise, particularly if batches of data are correlated with biological variables. Batch effects can be minimized through randomization of sample groups across batches. However, in long‐term or multiyear studies where data are added incrementally, full randomization is impossible, and batch effects may be a common feature. Here, we present a case study where false signals of selection were detected due to a batch effect in a multiyear study of Alpine ibex (Capra ibex). The batch effect arose because sequencing read length changed over the course of the project and populations were added incrementally to the study, resulting in nonrandom distributions of populations across read lengths. The differences in read length caused small misalignments in a subset of the data, leading to false variant alleles and thus false SNPs. Pronounced allele frequency differences between populations arose at these SNPs because of the correlation between read length and population. This created highly statistically significant, but biologically spurious, signals of selection and false associations between allele frequencies and the environment. We highlight the risk of batch effects and discuss strategies to reduce the impacts of batch effects in multiyear high‐throughput sequencing studies.
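
Illustrative note (not from the paper): the abstract attributes the false signals to a nonrandom distribution of populations across read lengths. A minimal sketch of how such a design could be screened before analysis is given below. It is not the authors' method. The population labels, read lengths, and sample counts are made up, and the function names `cross_tabulate` and `fully_confounded` are hypothetical.

```python
from collections import Counter, defaultdict


def cross_tabulate(samples):
    """Count samples per (population, batch) pair.

    `samples` is an iterable of (population, batch) tuples. Both labels
    are hypothetical placeholders; here batch stands for read length.
    """
    table = defaultdict(Counter)
    for population, batch in samples:
        table[population][batch] += 1
    return table


def fully_confounded(table):
    """Return populations whose samples all fall in a single batch."""
    return sorted(pop for pop, counts in table.items() if len(counts) == 1)


# Made-up design: populations added incrementally while read length changed.
samples = (
    [("pop_A", 100)] * 8
    + [("pop_B", 100)] * 5 + [("pop_B", 125)] * 3
    + [("pop_C", 125)] * 8
)

table = cross_tabulate(samples)
print(fully_confounded(table))  # ['pop_A', 'pop_C']
```

In this toy design, `pop_A` and `pop_C` are each sequenced at a single read length, so any read-length artifact would be indistinguishable from a population difference between them. Only `pop_B` spans both batches and could help separate the two effects.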