High-throughput sequencing has accelerated human genetics research in countless ways, from rapid newborn sequencing to Mendelian disorders to the many discoveries being made in common, complex disease cohorts. Yet one of the most fundamental benefits of this technology is our ability to catalogue human genetic variation in large cohorts of individuals. For a long while, such catalogues necessarily focused on variation in human protein-coding genes. Coding regions harbor the vast majority of known medically relevant sequence variants, in part because their biological effects are easier to predict, but they occupy only about 1-2% of the human genome.
Whole-genome sequencing offers a far more comprehensive interrogation of genetic variation. The gnomAD database and, more recently, the Trans-Omics for Precision Medicine (TOPMed) program have catalogued the genomes of tens of thousands of individuals. Unlike the 1000 Genomes Project, which surveyed the genomes of numerous diverse, representative world populations without collecting any phenotypic data, most of the individuals in gnomAD and TOPMed are part of disease studies. Even so, the sheer size of these cohorts has made them vital tools for human genetics in both clinical and research settings.
Origins and Composition of the gnomAD Database
The Genome Aggregation Database (gnomAD), owing to its size and accessibility, has served as a vital resource for human genetics over the past several years. Like its exome-centric predecessor (ExAC), gnomAD’s catalogue was produced by compiling, harmonizing, and generating summary data across numerous large-scale sequencing projects. More than 60 projects contributed data to the gnomAD database. Most of these are disease cohorts such as TCGA (cancer), ADSP (Alzheimer’s), and T2D-GENES (diabetes). There are two major releases of gnomAD available, with some key differences between them:
| Release | gnomAD v2.1.1 | gnomAD v3.1.2 |
| --- | --- | --- |
| Assembly | build37/hg19 | GRCh38 |
| Genomes | 15,708 | 76,156 |
| Exomes | 125,748 | 0 |
As you’ll notice, the first key difference is the genome assembly: gnomAD v2.1.1, which has data from more individuals but is exome-centric, is on build 37. The more recent release, v3.1.2, is on the newer assembly (GRCh38) but contains only genomes. This dichotomy is understandable; it’s a lot more work to harmonize exome sequencing data because it was generated using multiple versions of target enrichment kits from several different manufacturers, each with a slightly different definition of the exome. However, it’s also unfortunate because neither release contains the maximum available information. In fact, labs often cite it as a reason for not moving to the newer* genome assembly (GRCh38).
*Newer is a relative term. GRCh38 was released in December 2013, more than 8 years ago.
The Caveats of gnomAD
The gnomAD database is a wonderful resource. I use it every day, as do many of my colleagues. Yet anyone who relies on gnomAD data for analysis and interpretation of genetic variants should be fully aware of its caveats.
1. Not everyone in gnomAD is healthy.
As I mentioned earlier, more than half of the contributing projects are disease studies (especially cardiovascular disease studies). Although individuals with severe pediatric disease and their first-degree relatives are excluded, “some individuals with severe disease may still be included in the data sets, albeit likely at a frequency equivalent to or lower than that seen in the general population.”
2. Not everyone in gnomAD is young
In fact, very few people are. According to a talk by the gnomAD team at ASHG 2017, the average age of a gnomAD individual is 54 years old. Maybe that should not have been surprising given the types of disease cohorts that went into it, but it’s rather advanced, and it brings up another important caveat.
3. Some variants are somatic clonal mutations
This came to attention a few years ago when researchers observed that gnomAD contained loss-of-function variants in key developmental genes that also happen to be frequently mutated in myelodysplastic syndrome.
A great example of this phenomenon is ASXL1, a gene that encodes a member of the Polycomb group of proteins, which are necessary for the maintenance of stable repression of homeotic and other loci. De novo loss-of-function mutations in ASXL1 cause Bohring-Opitz syndrome (BOS), a severe congenital malformation disorder, so it was initially puzzling to observe nonsense, frameshift, and splice site variants in some individuals in the gnomAD database, especially severe truncating mutations that would be expected to cause BOS if they were present in an embryo.
Closer inspection of the aligned sequencing data for individuals carrying such variants indicates that they’re probably mosaic alterations, i.e. present in only a subpopulation of cells. For example, two gnomAD individuals carry a variant at chr20-31021118-C-T (GRCh37) that encodes a nonsense change (p.Gln373Ter) and is classified as Pathogenic in ClinVar. Scroll down on the variant page and you can see the aligned sequence data.
Genome sequencing is generally done on DNA extracted from blood, and the older a person is, the more likely they are to have mosaic hematopoietic cell populations. The proliferation advantage conferred by ASXL1 mutations allows them to reach appreciable allele fractions in blood cells. As a result, we observe a fair number of recurrent LOF variants, such as p.Arg417Ter (9 heterozygotes), p.Arg404Ter (7 heterozygotes), and p.Arg965Ter (3 heterozygotes), all of which appear to be mosaic clonal events. The gnomAD team even added a note to the page for ASXL1:
Analysis of allele balance and age data indicates that this gene shows evidence of clonal hematopoiesis of indeterminate potential (CHIP). The potential presence of somatic variants should be taken into account when interpreting the penetrance, pathogenicity, and frequency of assumed germline variants. For more information, see pages 37-40 of supplementary information for The mutational constraint spectrum quantified from variation in 141,456 humans and Pathogenic ASXL1 somatic variants in reference databases complicate germline variant interpretation for Bohring-Opitz Syndrome.
This is one reason why the presence of a variant in gnomAD, even in a few individuals, should be weighed carefully rather than taken as automatic evidence against pathogenicity.
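To make the allele-balance reasoning concrete, here is a minimal sketch in Python. It is not the gnomAD team’s method; it simply assumes that a germline heterozygous variant should show up in roughly half of the reads at a site, and asks how surprising a lower alternate-read count would be under that assumption using a one-sided binomial test. The function names, the read counts, and the 0.001 threshold are all hypothetical choices for illustration.

```python
from math import comb


def het_lower_tail_p(alt_reads: int, total_reads: int) -> float:
    """P(X <= alt_reads) for X ~ Binomial(total_reads, 0.5).

    0.5 is the expected alternate-read fraction for a germline
    heterozygous variant (an idealized assumption that ignores
    reference bias and other technical effects).
    """
    if total_reads <= 0 or not 0 <= alt_reads <= total_reads:
        raise ValueError("need 0 <= alt_reads <= total_reads and total_reads > 0")
    # Sum the binomial coefficients for 0..alt_reads, then divide by 2^n.
    favorable = sum(comb(total_reads, k) for k in range(alt_reads + 1))
    return favorable / 2 ** total_reads


def looks_mosaic(alt_reads: int, total_reads: int, alpha: float = 0.001) -> bool:
    """Flag a call whose alt fraction is improbably low for a germline het.

    alpha is an arbitrary illustrative threshold, not a recommended cutoff.
    """
    return het_lower_tail_p(alt_reads, total_reads) < alpha


# Hypothetical read counts, not taken from any real gnomAD sample.
examples = {
    "balanced call": (14, 30),  # about 47% alternate reads
    "skewed call": (4, 30),     # about 13% alternate reads
}

for label, (alt, total) in examples.items():
    p = het_lower_tail_p(alt, total)
    print(f"{label}: {alt}/{total} alt reads, p={p:.2e}, mosaic-like={looks_mosaic(alt, total)}")
```

A skewed allele balance on its own is only a hint, since low coverage or mapping artifacts can also pull the alternate fraction down. That is presumably why the gnomAD note quoted above pairs allele balance with age data.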
4. Many populations are underrepresented.
The individuals in gnomAD were not selected as representatives of world populations; they are simply the groups that happened to be chosen for large-scale sequencing projects. Here’s the breakdown by population for the two major versions of the database.
Unsurprisingly, the majority are of Western European ancestry. There are also a large number of Finnish individuals, whose unique population history makes them especially valuable for genetic studies. Yet many other significant world populations are underrepresented.
Summary: gnomAD is great, use with appropriate caution
I’d intended to also talk about the TOPMed database, but since I’m already a thousand words in, I’ll have to save that for another time. In summary, the gnomAD database is a spectacular resource for human genetics. Like any resource, it is not without flaws. Anyone using it should be aware of these caveats and account for them in their analyses.