Computer science
Vector (molecular biology)
Variable (mathematics)
Mathematics
Genetics
Biology
Mathematical analysis
Recombinant DNA
Gene
Source
Journal: Cornell University - arXiv
Date: 2017-01-01
Citations: 59
Identifier
DOI: 10.48550/arxiv.1701.06279
Abstract
One ubiquitous representation of a long DNA sequence is to divide it into shorter k-mer components. Unfortunately, the straightforward encoding of a k-mer as a one-hot vector is vulnerable to the curse of dimensionality. Worse yet, every pair of distinct one-hot vectors is equidistant. This is particularly problematic when applying the latest machine learning algorithms to problems in biological sequence analysis. In this paper, we propose a novel method to train distributed representations of variable-length k-mers. Our method is based on the popular word embedding model word2vec, which is trained on a shallow two-layer neural network. Our experiments provide evidence that summing dna2vec vectors is akin to concatenating nucleotides. We also demonstrate that there is a correlation between the Needleman-Wunsch similarity score and the cosine similarity of dna2vec vectors.
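The two problems the abstract attributes to one-hot encoding can be illustrated with a minimal Python sketch. It is not the paper's code: the helper names (`kmers`, `one_hot`, `euclidean`), the fixed k = 3, and the overlapping stride-1 windows are assumptions made for illustration only.

```python
from math import sqrt

ALPHABET = "ACGT"  # assumed nucleotide alphabet


def kmers(seq, k):
    """Return all overlapping k-mers of seq (sliding window, stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def one_hot(kmer):
    """Return a vector of length 4**k with a single 1 at the k-mer's index."""
    index = 0
    for base in kmer:
        index = index * 4 + ALPHABET.index(base)
    vec = [0] * (4 ** len(kmer))
    vec[index] = 1
    return vec


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


fragments = kmers("ACGTAC", 3)

# Dimensionality grows as 4**k: 64 for k = 3, 65536 for k = 8.
dim = len(one_hot("ACG"))

# A one-nucleotide mismatch and a three-nucleotide mismatch are
# equally far apart under one-hot encoding.
d_near = euclidean(one_hot("ACG"), one_hot("ACT"))
d_far = euclidean(one_hot("ACG"), one_hot("TTA"))

print(fragments, dim, d_near, d_far)
```

Both distances equal the square root of 2, so the encoding carries no information about how similar two k-mers are. A learned dense embedding is meant to close that gap.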