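Spurious relationship
Jaccard index
Simplicity (philosophy)
De Bruijn sequence
Sequence (biology)
k-mer
Mathematics
Statistics
Combinatorics
Interval (graph theory)
Algorithm
Computer science
Computational biology
Genetics
Biology
Genome
Gene
Cluster analysis
Epistemology
Philosophy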
Authors
Antonio Blanca, Robert S. Harris, David Koslicki, Paul Medvedev
Identifier
DOI:10.1089/cmb.2021.0431
Abstract
k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
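The mutation model in the abstract can be illustrated with a short simulation. The sketch below is not the authors' code; it assumes that, with no spurious matches, a k-mer counts as mutated exactly when at least one of its k positions was mutated, so each k-mer stays unmutated with probability (1 - r)^k and the expected number of mutated k-mers among L is L(1 - (1 - r)^k). All function names are hypothetical and chosen for illustration only.

```python
import random


def mutation_mask(n, r, rng):
    """Return n booleans; True marks a nucleotide position that was mutated.

    Each position is mutated independently with probability r.
    """
    return [rng.random() < r for _ in range(n)]


def kmer_mutated(mask, k):
    """Flag each k-mer as mutated if any of its k positions is mutated.

    Assumption: no spurious k-mer matches, so a mutated k-mer never
    coincides with an original one by chance.
    """
    num_kmers = len(mask) - k + 1
    window = sum(mask[:k])  # number of mutated positions in the current window
    flags = [window > 0]
    for i in range(1, num_kmers):
        window += mask[i + k - 1] - mask[i - 1]
        flags.append(window > 0)
    return flags


def count_runs(flags, value):
    """Count maximal runs of `value` (True: islands, False: oceans)."""
    runs = 0
    prev = None
    for f in flags:
        if f == value and prev != value:
            runs += 1
        prev = f
    return runs


def expected_mutated(num_kmers, k, r):
    """Expected number of mutated k-mers under the independence model."""
    return num_kmers * (1 - (1 - r) ** k)


if __name__ == "__main__":
    rng = random.Random(0)
    k, r, num_kmers = 21, 0.01, 100_000
    mask = mutation_mask(num_kmers + k - 1, r, rng)
    flags = kmer_mutated(mask, k)
    print("observed mutated k-mers:", sum(flags))
    print("expected mutated k-mers:", expected_mutated(num_kmers, k, r))
    print("islands:", count_runs(flags, True))
    print("oceans:", count_runs(flags, False))
```

Because neighboring k-mers overlap, their mutation states are correlated; this is why the variance needs a separate derivation even though the expectation follows directly from linearity.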