Recent single-molecule sequencing technologies generate very long reads, enabling the capture of multiple variant types including structural and copy number variations (SVs/CNVs). However, precise alignment around SVs is a challenge, because of large gaps in alignment (Abyzov et al., 2015; Lam et al., 2010; Sedlazeck et al., 2018). Previously, Alignment with Gap Excision (AGE) was described as a precise method that uses dynamic programming to solve the problem (Abyzov and Gerstein, 2011). While not designed to be used for aligning against a reference genome, its primary purpose is realigning reads around the sites of suspected SVs/CNVs. Thus, its application is limited to the alignment of short reads/contigs to relatively small genomic regions. However, it requires vast memory usage, because its implementation uses matrices.
Here, we introduce LongAGE, a memory-efficient implementation of AGE. LongAGE leverages linear space alignment algorithms based on the idea first presented to solve the longest common subsequence problem (Hirschberg, 1975) and several other such algorithms for sequence alignments (Chao et al., 1994). LongAGE vastly improves memory usage compared to AGE; that allows users to realign long reads (PacBio/Oxford Nanopore) or contigs on a regular compute node, desktop or laptop.
Given two sequences to be aligned of length N, M: and , with . Let denote the optimal score for the left flank () and denote the optimal score for the right flank (), where is defined to be the maximum sum of values (aligning x to y, or either x or y to a gap ‘-’) of up-to the aligned pairs. The AGE algorithm is summarized as follows:
Recall that the AGE algorithm uses matrices to compute the best score (BS) of aligning n and m nucleotides at the -ends and N − n and M − m nucleotides at the -ends is , where ML is the maximum in the leading submatrix and MR is the maximum in the trailing submatrix :
We reckon that ML and MR are values of Pi and Qj, respectively. To reduce memory usage, we can use a single array (α, β) for each matrix:
Our main implementation is summarized in two steps:
It is well known that CNVs and SVs can have homologous and identical sequences around their breakpoints (Kidd et al., 2010). Several optimal alignments exist with the same maxima scores because of identical sequences at SV breakpoints (Tran et al., 2016), differences in alignments result from shifting along the identical sequences. By common convention LongAGE returns the left-shifted solution. LongAGE reduces the space usage from to , while increasing computation time by at most four times.
The steps were as follows:
Identify SVs of interest (Fig. 1A): Aligned Illumina HiSeq short-reads (Zook et al., 2016) in BAM format are available for three trios from the Genome in a Bottle (GIAB) Consortium. The coverage was for the parents and for the child. CNVs were discovered in children using CNVnator (Abyzov et al., 2011) with default options and 1 kb bins. We then genotyped CNVs in corresponding parents using the same bin size. CNVnator returned estimated copy number (CN) for each member of the trio. Applying the condition: [ CN (in one parent) ] and [ CN (in the other parent) ] and [ CN (in child) ] for each GIAB trio, we obtained two candidate mCNVs. The candidate mCNV in the Ashkenazim trio was likely a false positive as no PacBio reads supported deletion and duplication in that region. The other mCNV in the Chinese trio was around 20 kb in length and contained a deletion in the father (HG006) and a duplication in the mother (HG007) (Fig. 1B).
Analyze long-reads containing SVs (Fig. 1C): NGMLR (Sedlazeck et al., 2018) was used to map the GIAB Mt Sinai PacBio reads of the Chinese son (HG005) (Zook et al., 2016) to the Human Reference GRCh38, where the option was “−x pacbio ”. Using SAMtools (Li et al., 2009), we extracted reads from regions of interest, which are chromosomal coordinates where coordinate intervals [L−40 kb, R+40 kb], where L and R refer to the left and right breakpoint coordinates from read depth analysis. Extracted reads were realigned to the reference genome around the breakpoints using LongAGE with either “−indel” or “−tdup ” which specify alignment that is expected to have indels or duplications in the read sequence, respectively. However, it should be noted that until recently long-reads have had high error rates (Lau et al., 2016), hence our use of a lower gap opening penalty “−go=−1”.
Rectify SV breakpoints (Fig. 1C): Realigned reads were grouped based on which haplotype (deletion or duplication) had better support. For the best alignment, we required that: (i) the breakpoints from LongAGE’s alignment are within 1 kb of the estimated breakpoints of mCNV; (ii) every flank of an aligned read should have a minimum length of 1.5 kb or at least a fifth of the read length; (iii) its score is at least 500 more than for the alignment in the alternative mode (“−indel” for “−tdup ” and vice versa). We assembled the above-selected reads into two contigs using a long-read assembler wtdbg2 (Ruan and Li, 2020) and then aligned those contigs with the same parameters to precisely resolve the breakpoints.
More descriptions of best practice of using the tools can be found in the Supplementary Material.
To study the trade-off between memory usage and running time, we created a synthetic dataset of SVs with lengths varying from 1 to 32 kb, and one of 1 Mbp length. Inspired by (Abyzov et al., 2015; Lam et al., 2010), we randomly generated coordinates of a synthetic deletion of a certain length, then created the pseudocontig of each deletion allele by joining left and right flanks of 10 kb in length total. We then aligned the created pseudocontig against the regions in the reference from the -end of the left flank to -end of the right flank. We perform alignment with AGE and LongAGE on each pair of such pseudocontigs for all lengths of synthetic SVs.
Table 1 summarizes run time and memory usage of AGE and LongAGE by Valgrind (Seward et al., 2008) on all pairs of synthesized sequences. In LongAGE, memory usage grows linearly, while computation time is 2.6 to longer than AGE, which is expected under Hirschberg’s method. Given 192 GB of memory on a Gold 6148 Processor workstation, AGE failed to align sequences of 1 Mbp due to the lack of memory allocation. LongAGE completed in less than 20 min and only needed a maximum of 114 megabytes for the task.
|Tools||1 kb||2 kb||4 kb||8 kb||16 kb||32 kb||1 Mbp|
|Memory usage (megabytes)|
|Running time (s)|
Thousands of deletion and duplication polymorphisms larger than 1 kb in human genomes, called copy number variations (CNVs), can impact phenotypes by causing gene dosage and structure to vary among individuals (Usher and McCarroll, 2015). Many CNVs are multiallelic (mCNVs) where their structural alleles have been rearranged multiple times in their ancestors. The origin of such events is not fully understood due to difficulties in resolving their breakpoints with short reads, as the breakpoints are often embedded in segmental duplications. To demonstrate the applicability of LongAGE, we resolved breakpoints of reciprocal deletion and duplication with long homologies around breakpoints in the Chinese Trio sequenced by the GIAB Consortium. Such events have been previously described by (Abyzov et al., 2011) and were hypothesized to occur from a single non-allelic homologous recombination (NAHR) mentioned by (Abyzov et al., 2015; Lam et al., 2010).
First, we identified a copy number neutral region on the Human Genome GRCh38 of mCNV (:–) with possible deletion and duplication haplotypes in a child using Illumina HiSeq short-read data (Zook et al., 2016) (Fig. 1A). Then, assuming the two (deletion and duplication) haplotypes are present in the child (Fig. 1B), we locally realigned PacBio long-reads with LongAGE using both INDEL (for alignment with deletion) and TDUP (for alignment with tandem duplication) modes. Next, by comparing alignments in each mode, we selected reads likely to be supported by deletion and tandem duplication. Breakpoints can be imprecise due to sequencing errors/homologies, yet roughly match those identified from read depth analysis. We obtained 26 deletion-supporting reads, and 20 duplication-supporting reads (Supplementary Table S1). We then assembled these reads into two contigs, and we aligned them to the reference (by LongAGE in appropriate mode) with a high percent identity of over 98%. We observed that deletion breakpoints are left-shifted compared to duplication breakpoints for 1538 and 1541 bp for the left breakpoint and the right breakpoint, respectively (Fig. 1C). Such a shift suggests that the deletion and duplication occurred ancestrally from two different events.
We have presented LongAGE, a memory-efficient implementation of AGE. Even when aligning megabase-long sequences, LongAGE’s memory footprint is less than hundreds of megabytes, while it is at most four times slower than AGE in terms of running time. The tool facilitates the resolution and standardization of SV breakpoints in highly repetitive regions at a single base pair. It is capable of refining read alignment once a read has been heuristically mapped to a particular genomic location that is expected to contain an SV.
The authors are grateful for the fruitful discussions with Arijit Panda and Milovan Suvakov (Mayo Clinic). The authors thank Justin Zook (NIST) for sharing long-read data and Eric Spangler (University of Memphis) for research high-performance computing support. This work was partially performed when the first author was an intern at Mayo Clinic.
The study was supported by the National Institute of Health grant U24CA220242.
Financial Support: none declared.
Conflict of Interest: The authors declare no conflicting interests exist.