Abstract
Traditionally, scientists believed that, with a few key exceptions, RNAs played a secondary role in the cell. Recent discoveries have sharply revised this simple picture, revealing widespread and surprisingly sophisticated functional roles for RNAs.
Discovery of new functional RNA elements remains a very challenging task, both computationally and experimentally. It is computationally difficult largely because of the importance of an RNA molecule's 3-D structure, and the fact that molecules with very different nucleotide sequences can fold into the same shape.
In this thesis, we describe a computational tool called CMfinder that addresses the RNA motif discovery problem. It is one of the most effective tools for constructing multiple local structural alignments. It can extract an RNA motif from unaligned sequences with long extraneous flanking regions, even when the motif is present in only a subset of the sequences. Building on the original CMfinder, we propose several speedup techniques that make the tool scalable to large datasets.
Another important problem in noncoding RNA (ncRNA) discovery is evaluating the significance of a predicted RNA motif, which is critical for sifting high-quality ncRNA candidates from the enormous number of predictions produced in a genome-scale scan. We have designed two ranking schemes to address this problem in different application settings. The first is a heuristic method that is generally applicable, and the second is a probabilistic method based on evolutionary theory. While we have effectively rediscovered known ncRNAs and obtained promising candidates using the first method, we found that the second behaves more robustly and has better statistical properties. The second scheme, however, requires a phylogeny of the input sequences, which can be difficult to obtain in some applications.
We have had great success applying CMfinder to the genome-scale discovery of noncoding RNAs. In particular, we applied a CMfinder-centered computational pipeline to all bacteria and found 22 novel putative RNA motifs. Six are high-quality riboswitch candidates, and five have been confirmed as novel riboswitches in separate studies. We have also tested CMfinder on vertebrate ENCODE regions. This study produced thousands of candidates, most of which are not covered by any previous study. Closer examination of these candidates suggests that CMfinder significantly revised the alignments relative to the multiple alignments based on sequence alone, which strongly argues for taking RNA structure directly into account in any search for such structural elements. We have experimentally tested eleven top-ranking candidates and detected transcriptional activity and tissue specificity for most of them. We are now in the process of applying CMfinder to search the whole human genome.
Our experience demonstrates that CMfinder can significantly accelerate the discovery of novel ncRNAs, with the promise of many more discoveries to come.