Recent single-molecule sequencing technologies generate very long reads, enabling the capture of multiple variant types including structural and copy number variations (SVs/CNVs). However, precise alignment around SVs is a challenge, because of large gaps in alignment (Abyzov et al., 2015; Lam et al., 2010; Sedlazeck et al., 2018). Previously, Alignment with Gap Excision (AGE) was described as a precise method that uses dynamic programming to solve the problem (Abyzov and Gerstein, 2011). While not designed to be used for aligning against a reference genome, its primary purpose is realigning reads around the sites of suspected SVs/CNVs. Thus, its application is limited to the alignment of short reads/contigs to relatively small genomic regions. However, it requires vast memory usage, because its implementation uses matrices.
Here, we introduce LongAGE, a memory-efficient implementation of AGE. LongAGE leverages linear space alignment algorithms based on the idea first presented to solve the longest common subsequence problem (Hirschberg, 1975) and several other such algorithms for sequence alignments (Chao et al., 1994). LongAGE vastly improves memory usage compared to AGE; that allows users to realign long reads (PacBio/Oxford Nanopore) or contigs on a regular compute node, desktop or laptop.
Given two sequences to be aligned of length N, M:
Recall that the AGE algorithm uses matrices to compute the best score (BS) of aligning n and m nucleotides at the
We reckon that ML and MR are values of Pi and Qj, respectively. To reduce memory usage, we can use a single array (α, β) for each matrix:
Our main implementation is summarized in two steps:
It is well known that CNVs and SVs can have homologous and identical sequences around their breakpoints (Kidd et al., 2010). Several optimal alignments exist with the same maxima scores because of identical sequences at SV breakpoints (Tran et al., 2016), differences in alignments result from shifting along the identical sequences. By common convention LongAGE returns the left-shifted solution. LongAGE reduces the space usage from
The steps were as follows:
Identify SVs of interest (Fig. 1A): Aligned Illumina HiSeq short-reads (Zook et al., 2016) in BAM format are available for three trios from the Genome in a Bottle (GIAB) Consortium. The coverage was
Analyze long-reads containing SVs (Fig. 1C): NGMLR (Sedlazeck et al., 2018) was used to map the GIAB Mt Sinai PacBio reads of the Chinese son (HG005) (Zook et al., 2016) to the Human Reference GRCh38, where the option was “−x pacbio ”. Using SAMtools (Li et al., 2009), we extracted reads from regions of interest, which are chromosomal coordinates where coordinate intervals [L−40 kb, R+40 kb], where L and R refer to the left and right breakpoint coordinates from read depth analysis. Extracted reads were realigned to the reference genome around the breakpoints using LongAGE with either “−indel” or “−tdup ” which specify alignment that is expected to have indels or duplications in the read sequence, respectively. However, it should be noted that until recently long-reads have had high error rates (Lau et al., 2016), hence our use of a lower gap opening penalty “−go=−1”.
Rectify SV breakpoints (Fig. 1C): Realigned reads were grouped based on which haplotype (deletion or duplication) had better support. For the best alignment, we required that: (i) the breakpoints from LongAGE’s alignment are within 1 kb of the estimated breakpoints of mCNV; (ii) every flank of an aligned read should have a minimum length of 1.5 kb or at least a fifth of the read length; (iii) its score is at least 500 more than for the alignment in the alternative mode (“−indel” for “−tdup ” and vice versa). We assembled the above-selected reads into two contigs using a long-read assembler wtdbg2 (Ruan and Li, 2020) and then aligned those contigs with the same parameters to precisely resolve the breakpoints.
More descriptions of best practice of using the tools can be found in the Supplementary Material.
To study the trade-off between memory usage and running time, we created a synthetic dataset of SVs with lengths varying from 1 to 32 kb, and one of 1 Mbp length. Inspired by (Abyzov et al., 2015; Lam et al., 2010), we randomly generated coordinates of a synthetic deletion of a certain length, then created the pseudocontig of each deletion allele by joining left and right flanks of 10 kb in length total. We then aligned the created pseudocontig against the regions in the reference from the
Table 1 summarizes run time and memory usage of AGE and LongAGE by Valgrind (Seward et al., 2008) on all pairs of synthesized sequences. In LongAGE, memory usage grows linearly, while computation time is 2.6 to
Tools | 1 kb | 2 kb | 4 kb | 8 kb | 16 kb | 32 kb | 1 Mbp |
---|---|---|---|---|---|---|---|
Memory usage (megabytes) | |||||||
AGE | 550.83 | 600.85 | 700.90 | 901.04 | 1301.21 | 2101.68 |
|
LongAGE | 2.71 | 2.92 | 3.13 | 3.55 | 3.62 | 5.55 | 113.29 |
Running time (s) | |||||||
AGE | 5.05 | 5.55 | 6.57 | 8.37 | 12.03 | 19.27 |
|
LongAGE | 18.92 | 20.72 | 22.80 | 23.77 | 32.06 | 50.63 | 1159.61 |
Thousands of deletion and duplication polymorphisms larger than 1 kb in human genomes, called copy number variations (CNVs), can impact phenotypes by causing gene dosage and structure to vary among individuals (Usher and McCarroll, 2015). Many CNVs are multiallelic (mCNVs) where their structural alleles have been rearranged multiple times in their ancestors. The origin of such events is not fully understood due to difficulties in resolving their breakpoints with short reads, as the breakpoints are often embedded in segmental duplications. To demonstrate the applicability of LongAGE, we resolved breakpoints of reciprocal deletion and duplication with long homologies around breakpoints in the Chinese Trio sequenced by the GIAB Consortium. Such events have been previously described by (Abyzov et al., 2011) and were hypothesized to occur from a single non-allelic homologous recombination (NAHR) mentioned by (Abyzov et al., 2015; Lam et al., 2010).
First, we identified a copy number neutral region on the Human Genome GRCh38 of mCNV (
We have presented LongAGE, a memory-efficient implementation of AGE. Even when aligning megabase-long sequences, LongAGE’s memory footprint is less than hundreds of megabytes, while it is at most four times slower than AGE in terms of running time. The tool facilitates the resolution and standardization of SV breakpoints in highly repetitive regions at a single base pair. It is capable of refining read alignment once a read has been heuristically mapped to a particular genomic location that is expected to contain an SV.
The authors are grateful for the fruitful discussions with Arijit Panda and Milovan Suvakov (Mayo Clinic). The authors thank Justin Zook (NIST) for sharing long-read data and Eric Spangler (University of Memphis) for research high-performance computing support. This work was partially performed when the first author was an intern at Mayo Clinic.
The study was supported by the National Institute of Health grant U24CA220242.
Financial Support: none declared.
Conflict of Interest: The authors declare no conflicting interests exist.