Two months have passed since the announcement of two powerful human genetics resources — the RGC Million Exome Variant Browser (RGC-ME) and version 4 of the gnomAD database (gnomAD). Both of them, fundamentally, offer the same thing: a browsable, comprehensive catalogue of human genetic variants and their allele frequencies in several ancestral populations. Importantly, these are aggregate databases: they provide summary statistics grouped by population, rather than individual data. This protects the privacy of the sequenced individuals and obviates the need for extremely broad informed consent. No demographic or phenotypic data are made public, which further protects research participants and safeguards the scientific interests of contributing investigators. That’s what makes it possible to build such large cohorts.
And they are large: RGC-ME, in its current release, contains data from 824,159 unrelated individuals, while gnomADv4 comprises 807,162 individuals. That makes them powerful tools to study genetic architecture, population differences, natural selection, etc. They also inform variant prioritization and interpretation in genetic testing. For rare genetic disorders, knowing how common (or rare) an observed variant is across human populations is crucial to determining its pathogenicity. As these databases grow, they become ever more powerful for making such inferences. We have already begun systematically using gnomADv4 and RGC-ME in our analyses. They are incredibly useful, which was part of my motivation for writing about them. Yet, it’s important to be aware of the composition of their datasets to fully understand their strengths and weaknesses.
Exome Versus Genome Data
The predecessor to the gnomAD database was the Exome Aggregation Consortium (ExAC) database, which contained ~60,000 exomes and was published in the highly cited Lek et al. 2016 study. These were uniformly processed, but they were not uniformly sequenced: various targeted enrichment strategies/kits were developed and these evolved over time. However, exome sequencing was (until recently) far less costly than genome sequencing, and exomes target the regions most people care about. The ExAC dataset grew to include more than 100,000 exomes and was widely adopted by the biomedical/research community.
When gnomAD was introduced, it included most of ExAC plus around 15,500 sequenced genomes, hence the name. This exome-versus-genome distinction is important to keep in mind for several reasons:
- The “exome” — i.e., the full set of bases which code for proteins — includes only a fraction (~1.5%) of the genome.
- Due to evolutionary constraint, there are fewer variants per base in the exome than in noncoding regions and their allele frequencies are lower.
- Most GWAS hits (variants statistically associated with traits) are in noncoding regions: that’s where most variants are, and importantly, where most common variants with statistical power are.
- Genome sequencing offers more uniform coverage of coding regions and also interrogates the other 98% of the genome.
- However, genomes are more expensive, both to sequence and to process and analyze.
What’s In gnomADv4 Compared to Previous Releases
There are important differences between major releases of the gnomAD database. One aspect which I won’t go into here is the genome assembly version used for the release. Early versions were build 37, i.e. the “old” genome assembly, whereas newer ones are on the “new” genome assembly called GRCh38, or simply build 38. This matters because the location of a variant is not the same between genome assemblies.
Another key difference between gnomAD releases is the content. Versions 1 and 2 had only around 15,000 genomes, but almost ten times that number in exomes. Version 3, the first on the new assembly, had around 72,000 genomes (no exomes). They were, unfortunately, mostly from individuals of European ancestry, which is why v3.1 added ~3,000 genomes specifically chosen to increase diversity. Version 3.1 contains five times as many genomes as v2.1, but since it didn’t have the exome data, the total sample size was lower. Thus, guidance for variant interpretation often recommended use of gnomAD v2.1.1 data, even when v3 was available, because the earlier release represents more people in coding regions where most variant interpretation happens anyway.
The gnomAD Release Cycle
You will note that after the first release of gnomAD, subsequent major releases all seem to happen in October/November. That happens to be when the annual meeting of the American Society of Human Genetics takes place. I truly enjoy giving my friends from the Broad Institute trouble about how they only seem to work on gnomAD when it means a victory lap among their peers. In their defense, it’s useful to have a deadline for ambitious team projects. Also, the minor releases that happen in between often contain desirable content (SVs, CNVs, mitochondrial variants) and functionality (e.g. cloud access, variant co-occurrence queries, etc.).
As highlighted in the gnomAD v4 blog post, this new release is the largest yet by a significant margin. It has 5x as many people as any single previous version and twice as many as v1/v2/v3 combined. Where did these come from? Well, some are the v2 exomes that had not yet been mapped to the new reference. However, the main source of growth was incorporation of exome data from the UK Biobank. The bad news is that it’s exome data (coding regions only) and, like the UK Biobank itself, diversity is low: 95% of participants are white Europeans. The good news is that it’s high-quality exome data from a modern and robust 39-Mbp exome kit made by Integrated DNA Technologies (IDT). I wasn’t paid to say it, but we like their kits, too.
The Regeneron Genetics Million Exome Variant Browser
Also announced at ASHG in November was a completely new resource, the Million Exome variant browser from Regeneron Genetics Center (RGC). Of note, Regeneron is not an academic consortium, but a pharmaceutical company. The RGC enjoys a fascinating — and largely positive — reputation in the genetics community. Part of that reputation stems from their visible investment in collaborative research, illustrated by their creation of the RGC-ME resource. Another part is their successful recruitment of major scientific talent from academia. Many scientists I know (personally or by reputation), especially from large-scale genetics consortia, now work for RGC.
It is thus not very surprising that the scientific output from RGC is extremely high quality. They have led or supported many of the UK Biobank studies, which have yielded a lot of high-impact research. The RGC-ME cohort is also described in an impressive preprint suggesting that, once peer-reviewed, it will yield many insights into evolution, constraint, disease gene architecture, splice-altering variants, etc.
The interface has some differences from gnomAD but many similarities. One can browse and search for variants by their coordinates alone, and then see the summarized allele frequency of that variant in the population. This new resource is exciting for a few reasons:
- First, it is a new resource and it came as a surprise (at least to me; seriously, how many biotech companies put their data in public-facing browsers?).
- Its cohort of 824,159 unrelated individuals is considerably larger, in terms of individuals, than anything out there (even gnomADv4).
- These are also new individuals, i.e. not individuals already in ExAC or gnomAD versions 1-3.
- The dataset comprises more non-European individuals, including some key under-represented groups (e.g. South Asian).
There are some limitations of this new resource. First, it’s exome data, so only informative for the coding regions. I will say that RGC uses a modern exome kit from IDT targeting ~39 Mbp, so it’s almost certainly more uniform than other exome datasets. Second, the RGC-ME browser is a little bit fragile at times, though this seems to be improving. Third, while it’s more diverse, it’s still predominantly (75%) European. Finally, RGC goes out of its way to provide as little information as possible about the origins, demographics, or health statuses of its cohort.
The Overlap of RGC-ME and gnomADv4
One of the first questions to address, since there are two resources offering similar information, is whether or not RGC-ME and gnomADv4 overlap. Is it fair to say that one can now access genetic summary data from 1.6 million individuals? Not so fast. Remember, RGC did the exome sequencing for the UK Biobank, which was the main source of gnomAD v4 additions. The major release of UKBB genomic data comprised exomes for 455,000 individuals, and gnomAD’s v4 release statistics report 416,555 UKBB participants are included. If we assume that these individuals are also in RGC-ME, which seems likely, then gnomADv4 and RGC-ME together contain around 1.21 million unique individuals.
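The overlap arithmetic is simple inclusion-exclusion, and a minimal sketch makes it easy to redo if the assumptions change. The cohort sizes are the ones quoted in this post. The function name `unique_individuals` is just an illustrative helper, and the key assumption (stated above) is that every UKBB participant in gnomADv4 is also in RGC-ME.

```python
# Cohort sizes quoted in this post.
RGC_ME_INDIVIDUALS = 824_159
GNOMAD_V4_INDIVIDUALS = 807_162
# UKBB participants reported in gnomAD v4. ASSUMPTION: all of them are
# also present in RGC-ME, since RGC sequenced the UKBB exomes.
UKBB_IN_GNOMAD_V4 = 416_555


def unique_individuals(cohort_a: int, cohort_b: int, shared: int) -> int:
    """Inclusion-exclusion: count individuals present in both cohorts once."""
    if shared > min(cohort_a, cohort_b):
        raise ValueError("overlap cannot exceed the smaller cohort")
    return cohort_a + cohort_b - shared


naive_total = RGC_ME_INDIVIDUALS + GNOMAD_V4_INDIVIDUALS
combined = unique_individuals(
    RGC_ME_INDIVIDUALS, GNOMAD_V4_INDIVIDUALS, UKBB_IN_GNOMAD_V4
)

print(f"Naive sum of both databases: {naive_total:,}")
print(f"Unique individuals (full UKBB overlap assumed): {combined:,}")
```

If the true overlap is smaller or larger than the UKBB subset (for example, if other cohorts are shared), only the `shared` argument needs to change.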
Despite all of the caveats, I think that’s an important denominator.
Using gnomADv4/RGC-ME Data in Variant Interpretation
As I mentioned, population allele frequency information is critical when evaluating the pathogenicity of a sequence variant — especially one identified in genetic testing of a patient with a suspected genetic disorder. Both formal ACMG-AMP interpretation guidelines and the in-house filtering strategies of many research laboratories generally expect that disease-causing variants will be at very low frequency in the general population and not often observed in healthy individuals. In the gnomAD v2/v3 era, that meant:
- For autosomal dominant disorders, <3 heterozygous individuals in gnomAD, which equates to an allele frequency (AF) < 0.00001
- For autosomal recessive disorders, 0 or 1 homozygotes, with maximum population AF < 0.001
- For X-linked disorders, 0-1 hemizygotes/homozygotes
We don’t usually have a hard AF cutoff for X-linked disorders because of the dynamics of sex chromosomes in population genetics. We also tend to count 1-2 hets as 0, or 1 homozygote as zero, to account for possible artifacts (e.g. a mis-called genotype or a person who is affected). Those thresholds must all be recalibrated when you have access to data from 1.2 million individuals, of course. If we stipulate that a disease affecting only 1 in 10,000 individuals is rare, then there could be as many as 120 patients in these large datasets. Even though individuals with severe pediatric/congenital disorders are, in theory, intentionally excluded, it’s still a worrisome thought.
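To see how carrier counts and allele frequencies relate as the denominator grows, here is a rough sketch of the arithmetic. It makes several assumptions that are mine rather than anything the databases publish as guidance: the site is autosomal, every individual has a genotype call there (real allele numbers are usually lower), the v2-era denominator is gnomAD v2.1's 141,456 individuals (taken from gnomAD's release figures, not from this post), and the combined figure is the ~1.21 million estimated above. The function names are hypothetical helpers.

```python
import math


def allele_frequency(het_carriers: int, n_individuals: int, hom_carriers: int = 0) -> float:
    """AF at an autosomal site; each individual contributes two alleles."""
    allele_count = het_carriers + 2 * hom_carriers
    return allele_count / (2 * n_individuals)


def max_het_carriers(af_cutoff: float, n_individuals: int) -> int:
    """Largest heterozygote count whose AF stays strictly below the cutoff."""
    limit = af_cutoff * 2 * n_individuals
    return math.ceil(limit) - 1


def expected_patients(one_in: int, n_individuals: int) -> float:
    """Expected affected individuals if the cohort were a random population sample."""
    return n_individuals / one_in


GNOMAD_V2_INDIVIDUALS = 141_456   # ASSUMPTION: gnomAD v2.1 exomes + genomes
COMBINED_INDIVIDUALS = 1_214_766  # gnomADv4 + RGC-ME, full UKBB overlap assumed

for label, n in [("gnomAD v2", GNOMAD_V2_INDIVIDUALS),
                 ("gnomADv4 + RGC-ME", COMBINED_INDIVIDUALS)]:
    print(f"{label}: up to {max_het_carriers(1e-5, n)} hets keep AF < 0.00001")

print(f"Expected patients for a 1-in-10,000 disease among 1.2M: "
      f"{expected_patients(10_000, 1_200_000):.0f}")
```

Under these assumptions, the same AF < 0.00001 cutoff that allowed only 2 heterozygotes in the v2 era allows roughly two dozen across the combined datasets, so raw carrier counts from the old rules of thumb should not be reused as-is.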
On the bright side, access to significantly more individuals has clarified many variants with ambiguous allele frequency information. We have seen this in practice. Variants absent from gnomADv3 often have 5-8 carriers in each of gnomADv4 and RGC-ME, which is more suggestive of a very rare population variant. The new datasets are especially useful when they contain new homozygous/hemizygous individuals for variants in recessive/X-linked genes, which lets us eliminate those variants from further consideration. Lastly, when a coding variant is absent from both gnomADv4 and RGC-ME, we can assume it is extremely rare indeed.
Words of Caution on Age and Disease Status
Lastly, a word of caution: it is incorrect to state that the individuals in gnomAD and RGC-ME are healthy controls. A quick perusal of the RGC-ME contributors page or the About gnomAD page should remind you that most of the studies contributing data to these databases are disease studies. It’s reasonable to assume that at least half of the people in cohorts for heart disease, diabetes, asthma, and other diseases will have that disease. We don’t know who they are, or precisely how many are in the aggregate databases. However, previous releases of gnomAD provided subsets, e.g. non-cancer, non-neuro, etc. Basic arithmetic tells us that, for example, gnomADv3 contained 2,133 cancer patients and 8,714 patients with neuropsychiatric disease.
Biobanks are another major source of aggregate samples. The largest of these is UKBB, which constitutes a significant proportion of both databases as noted above. According to the first UK Biobank paper, the resource then included 502,543 participants. Some key demographics:
- 95% were of European ancestry based on the first two principal components
- The average age was 58 years old
- 43.1% were current or former smokers (yikes)
- 13.5% had asthma
- 7.5% had cancer
- 7.1% had coronary artery disease
- 3.4% had type 2 diabetes
Of note, they’re also not young. From previous statements, we know the average age of a gnomAD participant was 62 years old, and if UKBB represents a large piece of the RGC dataset, their average age is close to 60 as well. An individual present in a population cohort thus is not necessarily healthy. If anything, they’re more likely to have a common disease of interest to the research or biotech community.
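To put the UKBB percentages above into absolute terms, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that the 416,555 UKBB participants in gnomADv4 have the same rates as the full cohort described in the first UK Biobank paper. That may not hold exactly, and the actual counts are not published in the aggregate databases.

```python
# UKBB participants reported in gnomAD v4 (from the overlap discussion above).
UKBB_IN_GNOMAD_V4 = 416_555

# Fractions quoted from the first UK Biobank paper. ASSUMPTION: they apply
# unchanged to the subset of participants included in gnomAD v4.
UKBB_FRACTIONS = {
    "current or former smokers": 0.431,
    "asthma": 0.135,
    "cancer": 0.075,
    "coronary artery disease": 0.071,
    "type 2 diabetes": 0.034,
}


def expected_counts(n_individuals: int, fractions: dict[str, float]) -> dict[str, int]:
    """Scale each fraction by the cohort size and round to whole people."""
    return {trait: round(n_individuals * frac) for trait, frac in fractions.items()}


expected = expected_counts(UKBB_IN_GNOMAD_V4, UKBB_FRACTIONS)
for trait, count in expected.items():
    print(f"{trait}: ~{count:,} individuals")
```

Even as a rough estimate, these are tens of thousands of people per condition, which is why treating either database as a set of healthy controls is risky.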