Abstract
Knockout experiments are critical for the evaluation of gene function. Researchers have increasingly relied on genome editing technologies for precise mutagenesis at loci of interest, using engineered nucleases such as zinc finger nucleases, transcription activator-like effector nucleases (TALENs), and CRISPR (clustered regularly interspaced short palindromic repeats)-associated proteins. Sequence-specific targeting and cleavage by these systems generate double-stranded breaks and trigger endogenous repair machineries, resulting in small indels that can disrupt reading frames and gene function. These methods have been successfully applied to plants; the CRISPR system is particularly powerful for non-model species (Belhaj et al., 2013; Lozano-Juste and Cutler, 2014). Several tools, such as TALENT (Cermak et al., 2011) and CRISPR-P (Lei et al., 2014), have been developed to facilitate the design of genome editing experiments. However, few tools are available to evaluate the outcome of genome editing.
Amplicon sequencing is commonly employed for genome editing analysis: genomic sequences that span the target loci are amplified, sometimes cloned, and sequenced. A number of programs have been developed to decode heterozygous chromatograms from direct sequencing of PCR products for identification of sequence polymorphisms (Crowe, 2005; Dmitriev and Rakitov, 2008; Ma et al., 2015). However, the throughput of Sanger sequencing, even without cloning, is not amenable to screening large numbers of transgenic lines, especially with increasingly sophisticated multiplex targeting (Xie et al., 2015). No open-source programs are currently available for analysis of amplicon-sequencing data from high-throughput sequencers. After quality-control filtering and demultiplexing, amplicon sequence analysis usually involves alignment with target/reference sequences and detection of editing events, such as indels or single nucleotide polymorphisms (SNPs). Much bioinformatic effort is required, unless commercial software is available.
A web-based tool for amplicon-sequencing data analysis was recently reported (Guell et al., 2014). However, only one reference sequence is accepted at a time, which makes application to large datasets cumbersome. Here, we report a versatile and user-friendly tool, Analysis of Genome Editing by Sequencing (AGEseq), to address this limitation. AGEseq is available from AspenDB (http://aspendb.uga.edu) as a standalone program or a Galaxy-based (Goecks et al., 2010) web tool. AGEseq supports both Sanger and deep-sequencing reads. For deep sequencing, degenerate primers can be designed to amplify both alleles of the target gene as well as closely related gene(s). Amplicons from unrelated genes or across samples are then barcoded and pooled for sequencing (Figure 1A). For data analysis, AGEseq requires a design file and a directory of read files as inputs. The design file describes the reference sequences, usually containing 30–40 bp flanking regions of the target editing site(s) (Figure 1B). The read files are stored in a directory named “reads” by default, and multiple file formats are accepted (Figure 1A). AGEseq uses BLAT to align reference and read sequences. Aligned reads are assigned to the best hit among the reference sequences provided in the design file, and matching regions are extracted for indel or SNP calling. The output file reports the aligned (target and read) sequences and detection frequency for each editing event (Figure 1C and 1D).
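The assignment of each read to its best-matching reference, followed by event tallying, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: AGEseq itself is a Perl script that aligns with BLAT, whereas the sketch substitutes the standard-library `difflib.SequenceMatcher` as a crude similarity score, assumes each read spans the whole reference window, and classifies reads by length alone. The sequences and the names `geneA`, `geneB`, `best_reference`, `classify`, and `tally` are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical reference windows for two paralogs that differ at three positions.
REFERENCES = {
    "geneA": "ATGGCTACCGTTGACCTGAAAGGCTTCACCGGTA",
    "geneB": "ATGGCTACGGTTGATCTGAAAGGTTTCACCGGTA",
}


def similarity(ref: str, read: str) -> float:
    """Crude stand-in for an alignment score (AGEseq itself uses BLAT)."""
    return SequenceMatcher(None, ref, read, autojunk=False).ratio()


def best_reference(read: str, references: dict[str, str]) -> str:
    """Return the name of the reference most similar to the read."""
    return max(references, key=lambda name: similarity(references[name], read))


def classify(ref: str, read: str) -> str:
    """Simplified event call: identical = WT, length change = indel, else SNP."""
    if read == ref:
        return "WT"
    return "indel" if len(read) != len(ref) else "SNP"


def tally(reads: list[str], references: dict[str, str]) -> Counter:
    """Count reads per (reference, event class)."""
    counts: Counter = Counter()
    for read in reads:
        name = best_reference(read, references)
        counts[(name, classify(references[name], read))] += 1
    return counts


wt_a = REFERENCES["geneA"]
del_a = wt_a[:18] + wt_a[19:]  # geneA read carrying a 1-nt deletion
reads = [wt_a] * 3 + [del_a] * 2 + [REFERENCES["geneB"]] * 4
counts = tally(reads, REFERENCES)
for (name, event), n in sorted(counts.items()):
    print(name, event, n)
```

Because the edited read is still far closer to `geneA` than to `geneB`, it is counted as an indel of the target rather than as a variant of the paralog, which is the behavior needed to separate on-target from off-target editing.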
Our laboratory has recently applied CRISPR-based genome editing to lignin biosynthesis perturbations in Populus. A gene-specific guide RNA (gRNA) was designed to target 4-coumarate:CoA ligase 1 (4CL1), but not the paralogous 4CL5 (Zhou et al., 2015). Degenerate primers were designed to amplify both 4CL1 (target) and 4CL5 (off-target) sequences from independent transgenic lines to assess editing specificity. AGEseq successfully distinguished the duplicates as well as their alleles (Figure 1B), and confirmed biallelic mutations in all transgenic lines examined, with no off-target cleavage of 4CL5 (Figure 1E). In support of a null 4CL1, all primary transformants exhibited a reddish-brown wood discoloration (Figure 1F) known to be associated with lignin modification (Zhou et al., 2015). As a further test, AGEseq was applied to amplicon data of soybean with DDM1 (Decrease in DNA Methylation) editing in one or two homoeologous loci as described in Jacobs et al. (2015). The editing patterns detected by AGEseq were consistent with those obtained by Geneious R7 (Biomatters Ltd.) used in that study, ranging from small indels (<5 nt) to large deletions (>10 nt), with varying (1–98%) editing efficiencies (Supplemental Table 1) (Jacobs et al., 2015). AGEseq flags events with a long stretch of indels and/or mismatches as “strange events” that require manual examination, and three such cases were identified. Manual inspection confirmed a large (44 nt) deletion in one case, while the other two were found by Jacobs et al. (2015), after additional cloning and sequencing, to harbor unusual insertions from the Agrobacterium rhizogenes root-inducing plasmid. These results demonstrate the versatility of AGEseq in detecting or flagging genome editing patterns across a wide range of data scenarios.
Detailed instructions on AGEseq are provided for all operating systems (Supplemental Text). The analysis sensitivity can be adjusted by two user-configurable parameters: mismatch allowance (default at 10%) and minimum read coverage (default at 0). Systematic errors introduced by PCR during amplicon library preparation and sequencing, or by base-calling algorithms, are common in deep-sequencing data and will appear as “SNPs” in the AGEseq report (Figure 1C and Supplemental Text). For this reason, AGEseq considers only indels as potential genome editing events by default, although SNPs are also reported. If SNPs are of interest, setting a minimum read coverage is recommended to reduce random errors.
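The effect of the two parameters can be illustrated with a small Python sketch. It is one plausible reading of the description above, not the AGEseq implementation: the mismatch allowance is interpreted as the maximum fraction of mismatched bases in an aligned region, the minimum read coverage as the smallest read count an event needs in order to be kept, and editing efficiency as the share of retained reads that carry an indel. The function names, the `count_snps` switch, and the example counts are all hypothetical.

```python
def passes_mismatch_allowance(n_mismatch: int, aligned_len: int,
                              allowance: float = 0.10) -> bool:
    """Keep an alignment if its mismatch fraction is within the allowance."""
    return aligned_len > 0 and n_mismatch / aligned_len <= allowance


def editing_efficiency(events: dict[tuple[str, str], int],
                       min_reads: int = 0,
                       count_snps: bool = False) -> float:
    """Fraction of retained reads that count as edited.

    `events` maps (kind, label) to a read count, with kind in
    {"WT", "indel", "SNP"}. Events supported by fewer than `min_reads`
    reads are dropped. Indels always count as edits; SNPs only on request.
    """
    kept = {key: n for key, n in events.items() if n >= min_reads}
    total = sum(kept.values())
    if total == 0:
        return 0.0
    edited_kinds = {"indel", "SNP"} if count_snps else {"indel"}
    edited = sum(n for (kind, _), n in kept.items() if kind in edited_kinds)
    return edited / total


# Hypothetical event counts for one sample.
events = {
    ("WT", "-"): 40,
    ("indel", "1-nt deletion"): 55,
    ("SNP", "A>G"): 1,   # singleton, likely a sequencing or PCR error
    ("SNP", "C>T"): 4,
}

default_eff = editing_efficiency(events)                  # indels only, no coverage filter
filtered_eff = editing_efficiency(events, min_reads=2)    # singleton SNP removed
with_snps = editing_efficiency(events, min_reads=2, count_snps=True)
print(default_eff, filtered_eff, with_snps)
```

Raising the coverage threshold removes the singleton SNP from the denominator, which is the kind of random error the recommendation above is meant to suppress when SNPs are of interest.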
A known limitation of BLAT and similar aligners is their inconsistent gap handling in the presence of homo-nucleotides, as shown for both 4CL1 alleles in Figure 1C (red boxes, 1-nt deletion at position 56 or 57). AGEseq does not consider these differences and reports, by default, the sum of all indel reads as well as wild-type (WT)-like (non-edited) reads from each sample in the summary (Figure 1D). User inspection is therefore recommended. As mentioned, AGEseq also facilitates identification of unusual events that require manual inspection, and sometimes follow-up experiments to confirm the editing patterns. The ability of AGEseq to effectively discriminate allelic sequences of duplicated genes suggests that it can support analysis with polyploid genomes. When only one reference sequence is provided, the AGEseq output can be mined for allelic variations, if any, in the target region. As standalone software, AGEseq is (1) easy to use; no command line or programming skill is required for Windows or Mac users; (2) versatile; multiple sequencing platforms and file types are supported for assessing genome editing, allelic variation and/or off-target cleavage; and (3) extensible; the Perl script can easily be exported to other bioinformatics pipelines. As an example, we adapted AGEseq as a utility in the Galaxy platform (Goecks et al., 2010) to support web-based analysis. It is accessible at AspenDB (http://aspendb.uga.edu/ageseq) or through the Galaxy Tool Shed (https://toolshed.g2.bx.psu.edu) for installation in local instances. A limitation of the web tool is that only one sequence read file can be processed at a time.
For a multiplexed dataset with a large number of samples, the use of the standalone AGEseq program is recommended. Although developed for genome editing analysis, AGEseq can be adapted for SNP genotyping, metagenomic analysis, or other amplicon-sequencing applications.
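The gap-placement ambiguity in homo-nucleotide runs noted earlier is easy to reproduce: deleting any one base of a run yields the same read, so an aligner may report the deletion at different positions for what is a single event. The Python sketch below demonstrates this on a toy sequence and shows one common remedy, shifting each 1-nt deletion to the leftmost equivalent position before counting. This normalization is an illustrative assumption and is not described as an AGEseq feature; AGEseq instead sums all indel reads in its summary. The sequence and function names are hypothetical.

```python
from collections import Counter


def delete_at(seq: str, pos: int) -> str:
    """Return seq with the base at 0-based index pos removed."""
    return seq[:pos] + seq[pos + 1:]


def left_align_deletion(ref: str, pos: int) -> int:
    """Shift a 1-nt deletion to the leftmost position giving the same read."""
    while pos > 0 and ref[pos - 1] == ref[pos]:
        pos -= 1
    return pos


def merge_deletion_calls(ref: str, calls: dict[int, int]) -> Counter:
    """Pool read counts of 1-nt deletions that are equivalent after left-alignment."""
    merged: Counter = Counter()
    for pos, n_reads in calls.items():
        merged[left_align_deletion(ref, pos)] += n_reads
    return merged


ref = "GATTTCA"  # toy reference with a TTT run at indices 2-4

# Deleting any T of the run gives an identical read.
variants = {delete_at(ref, pos) for pos in (2, 3, 4)}

# Hypothetical aligner output: the same deletion reported at two positions.
calls = {3: 10, 4: 7}
merged = merge_deletion_calls(ref, calls)
print(variants, dict(merged))
```

After normalization the two reported positions collapse into one event with the pooled read count, which matches the intent of summing indel reads per sample while keeping a single, reproducible coordinate for the event.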