In the early days of next-generation sequencing (NGS), read length was a critical factor. Compared to the previous standard, Sanger sequencing, whose reads were typically >700 bp long, NGS reads were quite short. This was especially true for the sequencing-by-synthesis technology developed by Solexa, which was acquired by Illumina in 2007 and became the basis for their market-dominating technology. The first genomes sequenced with that technology had 32-bp or 35-bp reads, and they required the development of a new generation of bioinformatics tools for sequence analysis. Fifteen years ago, I was part of the informatics team at WashU that analyzed NGS data, and I remember thinking that if we could figure out how to get Illumina read lengths to 50 base pairs, we’d be golden.
Now, of course, most Illumina instruments generate 150-bp paired-end reads and are capable of going out to 250 bp or 300 bp. At that length, with the many informatics tools now at our disposal, we can properly interrogate most of the genome and transcriptome. Yet these are still relatively short reads, and they have limitations. Fortunately, long-read, single-molecule sequencing platforms have also been evolving and improving over time. At the recent American Society of Human Genetics meeting, I heard more talks about long-read sequencing than at any other time in memory. Another novelty: most of them described work with the Pacific Biosciences (PacBio) platform; I heard far less about Oxford Nanopore.
One of the many ways we use PacBio long-read sequencing in house is to map and resolve complex structural variants (SVs). Often, previous testing with standard NGS (or microarray, or cytogenetics) will reveal the presence of a structural variant, but its precise structure remains unclear. As the cost of PacBio sequencing continues to drop, it becomes more and more feasible to perform whole-genome long-read sequencing of such patients. We describe some of our work in this area in “Long-read genome sequencing resolves complex genomic rearrangements in rare genetic syndromes,” which was published last week in Nature’s npj Genomic Medicine.
Patient #2 in that study harbored one of the most complex germline rearrangements I’ve encountered. She was a female with a neurodevelopmental disorder, 2 years old at the time of enrollment into our research study. Microarray testing had reported three deletions at chr18q11.2-q12.2 (deletion 1), chr18q12.3 (deletion 2), and chr18q21.1 (deletion 3).
When Multiple CNVs Indicate a Structural Variant
Multiple copy number variants (CNVs) of the same type in relatively close proximity often suggest the presence of a single underlying structural variant. Knowing that, we enrolled the family trio for whole-genome long-read sequencing. PacBio HiFi sequencing achieved an average coverage depth of 20–27×, with mean read lengths between 12,335 bp and 13,613 bp, and maximum read lengths ranging from 32,282 bp to 50,563 bp. The three reported deletions could even be appreciated on the CCS coverage map of chr18 from PacBio, which is not something we often use to estimate copy number (plot at right). The deletions were not present in either parent, which told us that whatever this was, it had arisen de novo.
When we visualized the PacBio long reads aligned to the reference sequence, there were obvious breakpoint-spanning long reads at each expected deletion breakpoint, such as these:
Those rainbow-colored portions of the reads are soft-clipped because they don’t match the reference sequence (the read spans the structural breakpoint). These reads are long enough that both halves can be reliably mapped to their locations in the human genome, thereby providing the precise breakpoints of the underlying variant. That’s where things got complicated, because some of these reads revealed the presence of a fourth CNV that was too small to be picked up by array testing, but was involved in the rearrangement. Four CNVs meant eight breakpoints, and together they suggested that the underlying rearrangement was a complex one, laid out here:
In addition to the three deleted segments we knew about, one segment (D) was moved to a new location, and another one (C) was inverted. Talk about a complicated rearrangement! It would have been nearly impossible to decipher this with short reads alone. Anecdotally, we seem to be identifying these complex structural variants rather often, especially in patients with syndromic presentations. Sometimes Illumina WGS with paired-end reads can inform the nature of these, but we usually feel more comfortable resolving them with long reads that can map ~7–8 kbp in each direction at a breakpoint.
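To make the soft-clipping idea concrete, here is a minimal sketch of how a candidate breakpoint can be read off a single alignment record. It is not the method used in the paper; it only assumes the standard SAM conventions (a 1-based leftmost position and a CIGAR string), and the function names and the 500-bp minimum clip length are illustrative choices of mine.

```python
import re

# One CIGAR operation: a length followed by a single-letter op code (SAM spec).
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = {"M", "D", "N", "=", "X"}  # ops that advance along the reference


def parse_cigar(cigar):
    """Split a CIGAR string into (length, op) tuples."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def clip_breakpoints(pos, cigar, min_clip=500):
    """Return candidate breakpoints implied by long soft clips.

    pos      -- 1-based leftmost aligned reference position (SAM POS)
    min_clip -- hypothetical cutoff; short clips are ignored as noise
    """
    # Hard clips carry no sequence, so drop them before looking at the ends.
    ops = [(n, op) for n, op in parse_cigar(cigar) if op != "H"]
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    found = []
    if ops and ops[0][1] == "S" and ops[0][0] >= min_clip:
        # Clip on the left: the aligned part starts at the breakpoint.
        found.append(("left", pos))
    if len(ops) > 1 and ops[-1][1] == "S" and ops[-1][0] >= min_clip:
        # Clip on the right: the aligned part ends at the breakpoint.
        found.append(("right", pos + ref_len - 1))
    return found


if __name__ == "__main__":
    # A made-up 13-kb read whose last 7 kb does not match the reference here.
    print(clip_breakpoints(1000, "6000M7000S"))
```

In practice the clipped portion usually has its own supplementary alignment elsewhere, and pairing the two placements is what reveals which segments were joined.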
Long-Read Sequencing as a First-Tier Test
As long-read sequencing gets cheaper, the notion of implementing it as a first-tier test becomes much more realistic. If I’m not mistaken, Children’s Mercy in Kansas City has already done so with a clinically validated long-read WGS assay for rare disease patients. PacBio should, in theory, capture SNVs/indels with similar performance to Illumina, and these can be analyzed with existing pipelines. Long reads also enable more comprehensive detection of structural variants. PacBio’s native SV caller, PBSV, does a reasonable job of calling these and providing an initial categorization (variant type, size, and certain classes of repeat-associated elements such as Alu elements). Yet this new detection power comes with a cost: the analytical challenge of sifting through thousands of SVs in addition to the 4–6 million SNVs and indels present in any individual’s genome.
In our paper, we describe a strategy for prioritizing those SVs using SvAnna, a phenotype-aware annotation and prioritization pipeline. Reviewing the top hits from SvAnna is a reasonable strategy for evaluating variants that could explain a patient’s phenotype. However, structural variants of unclassified types, which the PBSV pipeline calls BND (breakend), still require some manual intervention. Many of them turn out to be ordinary SVs that couldn’t be correctly classified based on the evidence, or have breakpoints that are too distant (on the same chromosome, or on another chromosome) to be believed.
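As an illustration of what a first pass over BND calls might look like, the sketch below sorts VCF records into rough review bins before anyone opens a genome browser. It is an assumption-laden example rather than our actual workflow: it relies only on the VCF specification’s `SVTYPE` INFO key and bracketed breakend ALT notation, and the `triage` function, its labels, and the 10-Mb `max_span` cutoff are hypothetical.

```python
import re

# Breakend ALT alleles embed the mate position in brackets, e.g. N]chr18:23500000]
MATE_RE = re.compile(r"[\[\]]([^\[\]:]+):(\d+)[\[\]]")


def info_dict(info):
    """Turn a VCF INFO string into a dict (flag-only keys map to '')."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out


def triage(vcf_line, max_span=10_000_000):
    """Assign one VCF record to a rough review bin.

    max_span is a hypothetical cutoff separating 'local' from 'distant'
    same-chromosome breakends; tune it for your own data.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, alt, info = fields[0], int(fields[1]), fields[4], fields[7]
    svtype = info_dict(info).get("SVTYPE", "NA")
    if svtype != "BND":
        return "classified:" + svtype
    mate = MATE_RE.search(alt)
    if mate is None:
        return "bnd:unparsed"
    mate_chrom, mate_pos = mate.group(1), int(mate.group(2))
    if mate_chrom != chrom:
        return "bnd:interchromosomal"
    if abs(mate_pos - pos) > max_span:
        return "bnd:distant"
    return "bnd:local"


if __name__ == "__main__":
    # Coordinates below are invented for the example.
    record = "chr18\t23000000\tbnd1\tN\tN]chr18:23500000]\t.\tPASS\tSVTYPE=BND"
    print(triage(record))
```

Local breakends are the ones most likely to be ordinary deletions, inversions, or duplications in disguise, so those are the ones worth reviewing first.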
Detection of copy number variants with PacBio HiFi reads is theoretically possible: one obtains read depths from CCS reads, and these are correlated with genomic copy number. At the edges of the deletions shown in the IGV screenshots earlier in this post, you can visually appreciate the depth change accompanying copy number changes. This implies that CNVs are callable as changes in CCS depth. This has not been an area of significant bioinformatics development, at least to my knowledge, but I plan some testing with VarScan 2 to see how easily it can be adapted to this data type.
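The depth-to-copy-number reasoning can be sketched in a few lines. This is a toy example, not VarScan 2’s algorithm: it assumes you already have mean CCS depth per fixed-size bin along one chromosome, normalizes each bin to the chromosome median, and reports runs of bins whose log2 ratio sits near the −1 expected for a heterozygous deletion. The function names, the −0.6 threshold, and the three-bin minimum are all illustrative assumptions.

```python
import math
from statistics import median


def log2_ratios(bin_depths):
    """log2(depth / median depth) per bin; a one-copy loss is expected near -1."""
    base = median(bin_depths)
    if base <= 0:
        raise ValueError("median depth must be positive")
    # Floor depths at a small value so empty bins do not hit log2(0).
    return [math.log2(max(d, 0.01) / base) for d in bin_depths]


def call_losses(bin_depths, bin_size, threshold=-0.6, min_bins=3):
    """Return (start, end) intervals where depth stays at or below the threshold.

    threshold and min_bins are hypothetical defaults, not validated values.
    """
    ratios = log2_ratios(bin_depths)
    calls, start = [], None
    # The trailing 0.0 is a sentinel that closes a run reaching the last bin.
    for i, ratio in enumerate(ratios + [0.0]):
        if ratio <= threshold and start is None:
            start = i
        elif ratio > threshold and start is not None:
            if i - start >= min_bins:
                calls.append((start * bin_size, i * bin_size))
            start = None
    return calls


if __name__ == "__main__":
    # Invented per-bin depths: ~24x baseline with a half-depth stretch.
    depths = [24, 25, 23, 24, 12, 13, 11, 12, 24, 26, 25, 24]
    print(call_losses(depths, bin_size=100_000))
```

A real caller would also need to correct for GC content and mappability, and at 20–27× coverage the bins would have to be fairly large to keep the noise manageable.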
The Future of Long-Read Sequencing
It’s sometimes hard to imagine anything replacing Illumina paired-end sequencing as the main workhorse genomic assay. However, long reads do have some distinct advantages. As the bioinformatics state of the art for PacBio catches up to the ever-improving platform technology, it will only get better. One might expect an increase in diagnostic yield, which would immediately be transformative for the care of rare disease patients. The other exciting aspect of more widespread long-read sequencing is what it will reveal about the frequency and distribution of hard-to-detect structural variants in the human population.