Abstract
By mapping the positions of millions of translating ribosomes in the cell, ribosome profiling (Ribo-seq) has established its role as a powerful tool to study gene expression. Several laboratories have introduced modifications to the experimental protocol and expanded the repertoire of biochemical methods to study translation transcriptome-wide. However, the diversity of protocols highlights a need for standardization. At the same time, different computational analysis strategies have used Ribo-seq data to identify the set of translated sequences with high confidence. In this review, we present an overview of such methodologies, outlining their assumptions, data requirements, and availability. At the interface between RNA and proteins, Ribo-seq can complement data from multiple omics approaches, zooming in on the central role of translation in the molecular cell.

Ribo-seq has become an established protocol to identify translated transcript regions via deep sequencing, closing the gap between RNA sequencing and proteomics.
Recently developed Ribo-seq data analysis strategies use different features as hallmarks of translation. Specifically, the ability to monitor the positions of translating ribosomes with single-nucleotide precision has driven the development of computational tools that rely on ‘subcodon resolution’. Knowing the concrete assumptions and precise goals of different approaches is crucial. In addition to addressing translation-focused questions, from defining open reading frames to identifying alternative translation initiation sites and estimating differential translation rates, Ribo-seq data show great promise for integrative efforts combining additional omics approaches.

Classifier: a machine-learning approach whose objective is to assign datapoints to different classes (two in the case of binary classifiers). In supervised learning, the classifier is trained on known examples, while unsupervised classification methods are used in the absence of known (or labeled) data.
Coding sequence (CDS): a sequence that is translated using one (or more) of the three possible reading frames.
Hidden Markov model (HMM): a probabilistic method in which a signal (e.g., a coverage track or a nucleotide sequence) is emitted from a finite succession of unknown (hidden) states. The hidden states can represent different biological concepts (e.g., 5′-UTRs or ORFs in genomic sequence classification); transitions between them specify possible sequences of the states, and can be defined and trained on available data (e.g., read coverage or nucleotide sequences in annotated genomic regions). Once the model is trained, it can be used to parse a new signal and label it with the optimal sequence of states.
Long non-coding RNAs (lncRNAs): long transcripts (>200 nt) which do not exhibit clear coding potential.
Multitaper: a signal processing method that aims to provide reliable estimates of the spectrum of frequencies present in a signal. In the multitaper method, multiple filters are applied as windows over the same signal, and coefficients for all frequency components are retrieved from each filtered sample (using the Fourier transform). Different types of filters have been proposed; specifically, the use of the so-called Slepian sequences enables the application of a statistical test to each frequency component.
Non-negative least squares (NNLS): a modified version of ordinary least squares in which the regression coefficients cannot be negative.
Nonsense-mediated decay (NMD): an mRNA surveillance pathway that degrades aberrant transcripts, thus preventing the production of non-functional proteins. One of the proposed mechanisms for NMD involves the recognition of a premature termination codon (PTC), aided by the action of proteins that are part of the exon junction complex (EJC).
Open reading frame (ORF): a section of a transcript which contains a start and a stop codon in frame. In eukaryotes, most mRNA transcripts contain one main ORF that is translated into a polypeptide.
PUNCH-P: a technique that isolates nascent protein chains. Ribosome–nascent chain complexes are first isolated, and biotinylated puromycin is incorporated into the complexes. Streptavidin pulldown allows the nascent protein chains to be extracted, and these can be analyzed by LC-MS/MS.
Quantitative proteomics: proteomics techniques aimed at quantifying protein expression. Label-free quantification methods can be used, but techniques such as SILAC that label amino acids can represent superior alternatives for protein quantification.
Random forest: a classification algorithm that combines the classification output of multiple classifiers, called decision trees. Each tree splits the data into different groups (‘leaves’) and assigns a label to each datapoint in each leaf. Each tree is applied to a subset of the data and features to avoid overfitting. Usually used as a supervised learning method, random forests can also be used for unsupervised learning and for regression tasks.
Regression: an approach that aims to quantify the relationship between a target variable and one (or more) features. To this end, regression approaches fit a function that minimizes the distance between the predicted and the observed values of the target variable (e.g., by using the least squares method). The regression coefficient quantifies the relationship between the target variable and the predictor.
Shotgun proteomics: a set of techniques that enable the identification and quantification of protein expression from a mixture of digested peptides, using peptide isolation (usually with liquid chromatography, LC) and tandem mass spectrometry (MS/MS). When they are eluted in the LC step, peptides are ionized, and ions are selected in the first MS step according to their mass-to-charge (m/z) ratio. Ions are then fragmented, and in the second MS step fragment ions are again isolated according to their m/z ratio and quantified. Using a reference protein database, m/z values can be mapped to expected values matching peptides from known proteins.
Spectral coherence: a measure of correlation between two frequency spectra. Signals exhibiting a similar set of frequency components will have high coherence.
pSILAC: a variant of SILAC in which labeled amino acids are added to the cell culture for short periods of time, thus allowing the kinetics of de novo protein synthesis to be monitored.
Support vector machine (SVM): a binary classification algorithm. SVMs are supervised learning methods and therefore need to be trained on known examples. In the training stage, SVMs aim to define a separating line that maximizes the distance between the two sets of data. When a linear separation of the two sets is not effective, SVMs can use different kernel functions to compute the distance between datapoints in a higher-dimensional space in which a linear separation between the samples is possible. This strategy (the ‘kernel trick’) enables non-linear classification, and has contributed to the popularity of SVMs in the machine-learning community.
Untranslated region (UTR): the section of a mature coding mRNA that does not code for protein. The 5′-UTR is located upstream of the start codon, while the 3′-UTR is downstream of the stop codon.
Upstream ORF (uORF): a small (usually <100 aa) ORF whose start codon is located in the 5′-UTR, upstream of the main ORF of a transcript. Many uORFs have been shown to regulate the translation of the main ORF. It is generally assumed that uORFs do not encode stable polypeptides.
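
To make the idea of ‘subcodon resolution’ more concrete, the following minimal Python sketch shows how a 3-nt periodicity could be detected in a per-nucleotide ribosome position track. It is an illustration under stated assumptions and not the method of any tool covered in the review: the track is a toy signal invented here (a fixed per-codon pattern plus small random noise), the function names `periodogram` and `frame_fractions` are hypothetical, and the spectrum is a plain single-window periodogram rather than a multitaper estimate.

```python
import cmath
import random


def periodogram(signal):
    """Return (frequency, power) pairs for non-zero frequencies up to 0.5 cycles/nt.

    This is a naive discrete Fourier transform of the mean-centered signal,
    i.e. a single-window periodogram, NOT the multitaper method.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        spectrum.append((k / n, abs(coeff) ** 2 / n))
    return spectrum


def frame_fractions(signal):
    """Fraction of the total signal falling in each of the three reading frames."""
    totals = [sum(signal[frame::3]) for frame in range(3)]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


# Toy track (assumption): 100 codons, with most counts on the first
# nucleotide of each codon, plus 0 or 1 extra counts of noise per position.
rng = random.Random(0)
codon_pattern = [8, 2, 1]
track = [codon_pattern[i % 3] + rng.randint(0, 1) for i in range(300)]

spectrum = periodogram(track)
peak_freq, peak_power = max(spectrum, key=lambda pair: pair[1])
fractions = frame_fractions(track)

print(f"peak frequency: {peak_freq:.3f} cycles/nt")
print("frame fractions:", [round(f, 2) for f in fractions])
```

On this toy input the strongest frequency component sits at one cycle every three nucleotides, and most of the signal falls in a single frame. Real footprint data would first require mapping reads and assigning each read to a single nucleotide position, steps that are outside the scope of this sketch.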