This year marks the 20th anniversary of the publication of the human genome reference sequence. As I enjoy recounting to people outside of the genomics field, the investment required to complete that initial assembly is staggering: ten years, dozens of laboratories, hundreds of sequencing instruments, and a billion dollars. Today, using the latest next-generation sequencing, we can sequence a human genome in about two days for a few thousand dollars (yes, a “$1000 genome” is feasible, but only in terms of reagent costs and at centers that can sequence at factory scale).
The human genome reference, advances in sequencing technology, and many years of prolific disease gene discovery have facilitated the widespread adoption of genetic testing as a frontline diagnostic tool. Most single-gene, gene panel, exome, and whole-genome tests now use next-generation sequencing. They have another commonality as well: most rely on alignment of those sequencing reads to “build 37” of the human genome sequence, which dates back to 2009.
There is a newer, better assembly of the human genome: build 38, which has been available for… (checks watch)… about seven years. Build 38 offers several key advantages over its predecessor, as highlighted in 2017 by the Genome Reference Consortium:
- Resolution of assembly errors and gaps associated with complex haplotypes and segmental duplications
- Base-pair–level updates for sequencing errors
- Addition of “missing” sequences, with an emphasis on paralogous sequences and population variation
- Better sequence representation for certain difficult genomic structures, such as centromeres and telomeres.
However, build 38 also comes with a significant cost: it changes the coordinates of genomic loci. In other words, a SNP’s position on build 37 is different (most of the time) on build 38. The same is true of genes and other annotations of the genome: everything has to be re-mapped on build 38.
LiftOver versus Remapping to Build 38
The UCSC Genome Browser has a useful tool, called liftOver, which allows one to convert coordinates between different versions of the reference assembly. You provide a BED file, select the genome assemblies to convert to/from, and it will produce an output BED file with coordinates based on the desired assembly. liftOver works *pretty* well when one is attempting to obtain the coordinates for regions that are accurately represented in both genome assemblies, i.e., when the only thing that has changed is the position number. That’s the case for ~95% of things one might need to convert. The problem is addressing the other 5% of loci that don’t have 1-to-1 unique map locations between assemblies.
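To make the 1-to-1 case and the failure case concrete, here is a toy sketch of the idea behind coordinate conversion. It is not the UCSC liftOver implementation: the `Block` records, the `BLOCKS` table, and both function names are hypothetical, the coordinates are made up, and the rule that an interval must fall entirely within one aligned block is a simplification chosen for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class Block(NamedTuple):
    """One ungapped stretch that is identical in both assemblies (hypothetical)."""
    chrom: str
    old_start: int  # 0-based, inclusive, old assembly
    old_end: int    # 0-based, exclusive, old assembly
    new_start: int  # 0-based start of the same stretch in the new assembly


# Made-up blocks for illustration only; real mappings come from chain files.
BLOCKS = [
    Block("chr1", 0, 1000, 0),        # unchanged region
    Block("chr1", 1500, 3000, 1200),  # same sequence, shifted by -300
    # old positions 1000-1499 have no counterpart in the new assembly
]


def lift_position(chrom: str, pos: int, blocks: List[Block] = BLOCKS) -> Optional[int]:
    """Return the new-assembly position, or None if the position is unmapped."""
    for b in blocks:
        if b.chrom == chrom and b.old_start <= pos < b.old_end:
            return b.new_start + (pos - b.old_start)
    return None


def lift_bed(records: List[Tuple[str, int, int]]):
    """Split BED-style (chrom, start, end) records into lifted and unmapped lists."""
    lifted, unmapped = [], []
    for chrom, start, end in records:
        new_start = lift_position(chrom, start)
        new_last = lift_position(chrom, end - 1)
        # Require both ends to map and the interval length to be preserved.
        if (new_start is None or new_last is None
                or new_last - new_start != (end - 1) - start):
            unmapped.append((chrom, start, end))
        else:
            lifted.append((chrom, new_start, new_last + 1))
    return lifted, unmapped
```

With these made-up blocks, an interval inside the shifted region converts cleanly, while one that overlaps the missing stretch lands in the unmapped list. The unmapped list is the part that deserves attention, since those are the loci that silently drop out of any downstream comparison.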
For a possible liftOver use case, consider the gnomAD database, which contains variant allele frequencies from large human study cohorts (~120,000 exome sequences and ~15,000 genome sequences). gnomAD and its exon-focused predecessor (ExAC) have proven exceptionally valuable when interpreting genetic variants because they offer a reasonably accurate estimate of worldwide prevalence [in the populations that are represented, at least]. Not all of the individuals in gnomAD are perfectly healthy (after all, many of them were part of large cohort studies of common complex disease), but the curators excluded individuals with severe congenital disorders. Thus, an X-linked variant that’s hemizygous in dozens of males in gnomAD is unlikely to cause a rare X-linked recessive disorder.
As noted in the flagship gnomAD paper, the initial release yielded 14.9 million high-quality variants from the WES dataset and 229.9 million variants from the WGS dataset, all on build 37. A liftOver to build 38 is available for download. However, even if 95% of variants were successfully converted to GRCh38 coordinates using liftOver, we’re still missing roughly 750,000 exome variants and 11.5 million genome variants. These variants, when they’re identified in patients and cohorts with sequence aligned directly to build 38, will [incorrectly] appear novel to gnomAD and thus be expected to be quite rare in human populations. The result is a much-diminished signal-to-noise ratio for extremely rare variants in analyses that rely on liftOver data.
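The arithmetic behind those estimates is easy to check. A minimal sketch, assuming the 95% conversion rate is a round illustrative figure rather than a measured liftOver success rate for gnomAD (the function name is hypothetical):

```python
# Variant counts quoted from the flagship gnomAD paper (build 37).
EXOME_VARIANTS = 14_900_000
GENOME_VARIANTS = 229_900_000

# Assumption: a round 95% of variants convert cleanly to GRCh38.
LIFTED_PERCENT = 95


def variants_lost(total: int, lifted_percent: int) -> int:
    """Number of variants that fail to convert, using integer arithmetic."""
    return total * (100 - lifted_percent) // 100


print(f"exome variants lost:  {variants_lost(EXOME_VARIANTS, LIFTED_PERCENT):,}")
print(f"genome variants lost: {variants_lost(GENOME_VARIANTS, LIFTED_PERCENT):,}")
# exome variants lost:  745,000
# genome variants lost: 11,495,000
```

Those figures round to the 750,000 and 11.5 million cited in the text.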
Key Genomic Resources Moving to Build 38
Many of the key resources for genome analysis and variant interpretation have fully embraced GRCh38. The ClinVar database and the UCSC Genome Browser, for example, now default to build 38 coordinate systems while providing backward compatibility with build 37. NCBI’s dbSNP database moved to build GRCh38/hg38 with release 143 in March 2015. Many commercial tools are fully compatible with both genome assemblies, including VarSome, which we use in-house.
Other resources increasingly support build 38 analysis but are still in a transition period. Although the UCSC Genome Browser database defaults to GRCh38, anyone familiar with its annotation tracks will notice that many of them are still missing from the newer assembly. This is an unfortunate reality of having infrastructure entrenched in a certain genome version. The best option would be to ask the groups who contributed key annotation tracks to re-generate their datasets on the new genome assembly. That’s a big ask, especially for groups like the ENCODE Project who have massive datasets on build 37 coordinates.
The vital gnomAD database, unfortunately, falls into the partial-transition category. The liftOver versions of gnomAD have been available for some time, and the latest release of gnomAD included a larger WGS dataset all mapped to GRCh38 (it has other advantages too, like better representation of minority populations). However, the critical WES dataset of gnomAD has yet to be re-mapped, and the gnomAD team recommends sticking with the old version for analysis of coding regions. This is a problem, since coding regions are where most clinical variant interpretation happens, and it likely contributes to why many clinical labs are reluctant to make the switch.
The good news is that gnomAD plans to make another release, with WES data fully remapped to GRCh38, sometime in 2021. My guess is that it will arrive in October, since the gnomAD team likes to make their big announcements around the time of the annual ASHG meeting.
Let’s Move to Build 38. Now, and Together
I firmly believe it’s time for the human genetics community as a whole to make a concerted effort to move to build 38 in 2021. In fact, I challenge everyone to do so. The more stakeholders (research groups, clinical laboratories, databases, and even journals) that embrace build 38 and consider it standard, the better this will be. Yes, there are bound to be hiccups. Having gone through this transition before, I’d like to offer a few bits of advice:
- Before, during, and after the transition, train yourself to indicate the assembly version whenever providing chromosomal positions or coordinates (in e-mails, Excel files, presentations, etc.). Make it a habit.
- Establish shared resources for converting between coordinate systems — such as a local installation of the UCSC liftOver tool — and use a common nomenclature for files/folders that makes the genome version obvious.
- Plan to conduct analyses on both build 37 and build 38 during the transition period, and systematically compare results to make sure that everything is working as it should be.
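One way to turn the labeling habit into something enforceable is to make the assembly part of every coordinate and file name in code. The sketch below is only an illustration: the `Locus` class, the `tagged_filename` and `same_site` helpers, and the naming pattern are all hypothetical conventions, not an established standard, and the example position is made up.

```python
from dataclasses import dataclass

KNOWN_ASSEMBLIES = {"GRCh37", "GRCh38"}


@dataclass(frozen=True)
class Locus:
    """A genomic position that always carries its assembly version."""
    assembly: str
    chrom: str
    pos: int  # 1-based position

    def __post_init__(self):
        if self.assembly not in KNOWN_ASSEMBLIES:
            raise ValueError(f"unknown assembly: {self.assembly!r}")

    def __str__(self) -> str:
        return f"{self.assembly}:{self.chrom}:{self.pos}"


def tagged_filename(stem: str, assembly: str, ext: str) -> str:
    """Build a file name that makes the genome version obvious."""
    if assembly not in KNOWN_ASSEMBLIES:
        raise ValueError(f"unknown assembly: {assembly!r}")
    return f"{stem}.{assembly}.{ext}"


def same_site(a: Locus, b: Locus) -> bool:
    """Compare two loci, refusing to mix coordinate systems."""
    if a.assembly != b.assembly:
        raise ValueError(f"cannot compare {a} with {b}: convert coordinates first")
    return (a.chrom, a.pos) == (b.chrom, b.pos)


print(Locus("GRCh38", "chr1", 12345))                      # GRCh38:chr1:12345
print(tagged_filename("cohort_calls", "GRCh38", "vcf.gz"))  # cohort_calls.GRCh38.vcf.gz
```

Raising an error on a cross-assembly comparison, rather than quietly returning False, is a deliberate choice here: during a transition period, a mixed-build comparison is more likely a pipeline mistake than a real mismatch.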
There are other more subtle differences that we’re likely to encounter as we move analyses to the new genome assembly — such as differences in read mapping performance given the better representation of duplicated sequences and alternate haplotypes. GRCh38 will have its quirks, and we’ll be better off if we muddle through them together.