The vast majority of the human genome (98%) does not encode proteins. Most of the ~4-6 million sequence variants in any individual’s genome thus lie within this ‘noncoding’ portion. The deep, dark secret of current genetic testing is that it primarily interrogates and reports the relatively tiny fraction of variants (~50,000 or so per individual) that are in or near protein-coding regions. Although there are several scientifically valid reasons for this, the main one is that the structure of genes and the triplet protein code make it possible to predict the effect of a DNA sequence change with reasonable accuracy. Yet, we know that noncoding variants are one of the reasons that >50% of tested patients fail to receive a diagnosis from comprehensive genetic testing.
After all, there are many types of regulatory elements in or near genes (promoters, enhancers, untranslated regions, splice regions) that affect their protein production.
It should be noted that even as clinical genome sequencing replaces exome sequencing, the scope of reported variants remains focused on coding regions. The widely used 2015 ACMG Variant Interpretation Guidelines were developed around protein-altering variants. The same holds true for the updated guidelines, which groups such as ClinGen already use pervasively even though they had not been fully published at the time of writing. Variant interpretations for noncoding variants are woefully under-represented in databases such as ClinVar. All of this needs to change.
I volunteer for a ClinGen Variant Curation Expert Panel (VCEP) which develops the gene-specific variant interpretation protocols for sets of genes and applies them to the ClinVar-submitted variants for those genes. My VCEP is for Leber Congenital Amaurosis / early-onset retinal disease, which are among the major causes of inherited blindness. Our first curated gene, RPE65, is primarily associated with recessive disease that now has a gene therapy treatment (Luxturna). Interestingly, we have incorporated “response to gene therapy” into our interpretation protocols, but that’s a story for another time. The development of interest here is that at least a couple of patients who received and responded to Luxturna are compound-heterozygous for a coding variant and a noncoding variant in RPE65. Thus, we have begun discussing how to adapt our rigorous, coding-centric variant interpretation protocol for noncoding variants.
The good news is that an expert panel has drafted, tested, and published recommendations for clinical interpretation of noncoding variants.
Their study, published toward the end of 2022 in Genome Medicine, provides guiding principles and numerous specific adaptations of ACMG evidence codes for noncoding variants. Here, I’ll review and summarize some of the key findings.
Variant Interpretation Requires A Gene-Disease Relationship
First, as noted by the authors, to avoid a substantial burden of analysis and reporting, clinical interpretation should only be applied to variants that:
- Map to cis-regulatory elements that have well-established, functionally-validated links to target genes, and
- Affect genes that have documented association with a phenotype (disease) relevant for the patient harboring the variant
In other words, as is true with coding variants, clinical variant interpretation should only be applied to variants in genes that have an established gene-disease association. The authors specifically suggest “definitive, strong, or moderate level using the ClinGen classification approach or green for the phenotype of interest in PanelApp” though in practice, many labs still rely on OMIM as the arbiter of gene-disease relationships.
Defining Cis-Regulatory Elements for Genes
The expert panel defined several types of noncoding regions in which variants can be interpreted, including:
- Introns and UTRs as defined by well-validated transcripts, e.g. the canonical transcripts designated by the MANE database.
- Promoters, defined as the region of open chromatin surrounding the canonical transcription start site (TSS) as delineated by functional epigenetic data (e.g. ATAC-seq, DNase-seq) in a disease-relevant tissue or cell type. If such data are not available for a relevant tissue or cell type, consensus open chromatin regions from ENCODE can be used. If even that is not possible, one can use the TSS +/- 250 bp.
- Other cis-regulatory elements (e.g. transcription factor binding sites from ChIP-seq or H3K27Ac/H3K4Me1-marked active enhancers) provided that they have experimental evidence linking them to the gene of interest.
The most obvious evidence of a link between a CRE and a target gene would be something like Hi-C, which captures regions that physically interact with gene promoters. However, the authors also would accept regions whose functional perturbation has a demonstrated effect on expression, or which harbor one or more established expression quantitative trait loci (eQTLs) for the target gene.
Naturally, most of these strategies for CRE definition require access to rich epigenetic data, which not everyone currently enjoys. The authors recognize this and express their hope that the research community will make such datasets publicly available.
ACMG Evidence for Noncoding Variants
The authors next lay out some specific ACMG evidence codes along with guidance on whether and how they can be adapted to apply to noncoding variants.
Evidence Codes That Already Apply
Of course, as noted by the authors, some types of evidence are agnostic to variant location and can be applied as-is, including:
- Allele frequency information (evidence codes PM2, BS1, BS2, and BA1).
- De novo variant evidence (codes PS2 and PM6)
- Co-segregation evidence (codes PP1 and BS4)
These types of evidence are the “gimmes” for most variant interpretation, but I should point out that under the currently modified guidance, most of them are either benign evidence or weak-to-moderate evidence of pathogenicity (e.g. PM2 should now only be applied at supporting level). So these do help, but only a little. The exception to that is a de novo variant, which in many cases could qualify for PS2 (possibly downgraded). De novo mutations, unlike inherited variants, are exceptionally rare — on average, 60-70 per individual genome-wide compared to 4-6 million — and I think that when tackling noncoding variants, these are a good place to start.
Adaptation of Evidence Codes for Noncoding Variants
Here are the main ACMG evidence codes discussed, their original definitions under 2015 guidelines, and the expert recommendation on if/when they can be applied to noncoding variants:
Evidence Code | Original Definition (2015) | Adapted Definition for Noncoding Variants |
--- | --- | --- |
PVS1 | Predicted null variant (nonsense, frameshift, start loss, essential splice site) in a gene where loss-of-function is a known mechanism of disease | (You wish). The authors state that PVS1 should not be used for noncoding variants. |
PM1 | Located in a mutational hotspot and/or well-established functional domain without benign variation | Disrupts a TF binding motif whose disruption is shown to be pathogenic, or maps to a cluster of pathogenic variants within a well-defined CRE |
PS1 | Same amino acid change as a well-established pathogenic variant, but caused by a different nucleotide change | At supporting level, for splicing variants where a different substitution has been classified as LP, provided this one has a similar predicted effect |
PM5 | Disrupts same residue as a well-established pathogenic variant, but changes it to a different amino acid | Predicted to have the same impact on the same gene as established pathogenic variants but not at the same base. |
PM3 | For recessive disorders, detected in trans with a known pathogenic variant | Same usage, provided it’s rare enough, but at supporting level for genes where this is more likely to occur by chance (large size / high mutation rate). |
PS3/BS3 | Well-established functional studies show deleterious effect (PS3) or demonstrate no effect (BS3) | RNA-seq: PS3 if aberrant splicing isoforms are detected, BS3 if not detected, in both cases provided that the expression profile is similar to controls and sufficient depth is achieved. Reporter gene assays: PS3 if significant differences between variant and wild-type constructs, provided that the system is validated. MAVE assays should follow existing guidance (PMID: 31862013). |
PP3/BP4 | Computational tools predict a deleterious effect (PP3) or a benign impact (BP4), often based on REVEL score for missense variants. | Similar usage but computational tools designed for noncoding variants should be used. Examples: CADD, DANN, ReMM (Genomiser), FATHMM_MKL, GREEN-DB |
The authors offer numerous caveats and cautions for many of the above codes for the obvious reason that the study of regulatory regions (and the impact of variants therein) is a rapidly evolving field of research. Also, I find some of this guidance a bit confusing. For example, the authors note that PS1 should be downgraded to supporting level for a different nucleotide change at the same position as an established variant, but they do not mention downgrading PM5, which can be applied for variants at different positions expected to have the same regulatory impact. Of course, some areas of guidance are intentionally left vague, such as the choice of computational tools/thresholds to use for PP3/BP4.
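To make that last point concrete, here is a minimal Python sketch of how a lab might operationalize PP3/BP4 for a noncoding variant using a genome-wide score such as CADD. The cutoffs below are illustrative placeholders of my own, not values endorsed by the authors, who deliberately leave tool and threshold selection to the interpreting laboratory.

```python
def computational_evidence(cadd_phred, pp3_cutoff=20.0, bp4_cutoff=10.0):
    """Return the applicable computational evidence code (or None) for a noncoding variant."""
    if cadd_phred >= pp3_cutoff:
        return "PP3"  # predicted deleterious (cutoff is an illustrative placeholder)
    if cadd_phred <= bp4_cutoff:
        return "BP4"  # predicted benign (cutoff is an illustrative placeholder)
    return None       # intermediate scores: apply neither code

print(computational_evidence(27.3))  # PP3
print(computational_evidence(3.1))   # BP4
```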
Noncoding Variants: The Time Is Now
We know noncoding variants are important, and these guidelines provide a reasonable starting point for selecting ones to report and interpreting them with clinical rigor. It’s time to get to work.
The Challenges of Variants of Uncertain Significance (VUS)
As genetic testing continues to expand in both clinical and research settings, variants of uncertain significance (VUS) present a persistent challenge. For the uninitiated, VUS is one of five classifications assigned to genetic variants under ACMG guidelines which indicate the likelihood that a variant causes disease.
Generally, uncertain significance is the default classification for variants that cannot otherwise be classified as pathogenic/likely pathogenic (i.e. disease causing) or benign/likely benign (not disease causing).
If you’re familiar with genetic testing trends over the last decade, you probably know that VUS are increasingly prevalent on genetic testing reports. In part, that’s due to technological advances, e.g. high-throughput DNA sequencing, that make it possible to interrogate more of the patient’s genome in a timely and cost-effective manner. Gene panels for specific conditions now often encompass thousands of genes, and comprehensive testing — genome or exome sequencing — is increasingly available as a first-tier or second-tier test.
Increasing knowledge — specifically, the number of genes associated with disease — is another important contributor to the VUS explosion. The pace of gene discovery accelerated in the NGS era and continues to grow, as illustrated by the statistics provided by the Online Mendelian Inheritance in Man (OMIM) database.
Long story short, more variants detected in every patient (by comprehensive sequencing) combined with more genes that are possibly reportable (due to association with disease) means more variants on genetic testing reports. And, as I’m about to tell you, for reportable variants, VUS will often be the expected classification.
ACMG Variant Interpretation 101
First, a very brief introduction to the types of evidence that are used when interpreting variants and how they are represented. ACMG evidence codes are letter/number combinations.
- The first letter indicates the type of evidence (Pathogenic or Benign).
- The second 1-2 letters indicate the strength of evidence (Very Strong, Strong, Moderate, or SuPporting).
- The number is a category we use to keep them all straight.
So for example, PS2 is the evidence code applied when a variant occurs de novo in a patient with confirmed maternity/paternity. This is the second (2) type of strong (S) evidence of pathogenicity (P), hence the code PS2. For another example, when a variant’s population allele frequency is greater than expected for the disorder, it gets the code BS1 (the first type of benign strong evidence). The weakest level of evidence, supporting, is given the strength-designation P. For example, BP1 applies when you have a missense variant in a gene in which almost all disease-causing variants are truncating/loss-of-function.
When a variant is assessed, each type of evidence is evaluated to see if it applies. The final set of evidence is combined into a formula to determine the final classification. The guidelines spell out the rules for combining evidence to reach a pathogenic or likely pathogenic classification. So for example, for a variant with very strong (VS) evidence of pathogenicity, only one additional strong evidence code is required to classify it as pathogenic. If that second piece of evidence is moderate strength, the variant would be classified likely pathogenic. There's a similar formula for benign/likely benign evidence.
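To make the combining logic concrete, here's a minimal Python sketch of the pathogenic-side combining rules from the 2015 guidelines. It ignores benign criteria and the conflicting-evidence rule, so treat it as a toy illustration rather than a working classifier:

```python
from collections import Counter

def combine_pathogenic_evidence(codes):
    """Apply the 2015 ACMG combining rules for the pathogenic side only (toy illustration)."""
    n = Counter(c[:3] if c.startswith("PVS") else c[:2] for c in codes)
    vs, s, m, p = n["PVS"], n["PS"], n["PM"], n["PP"]

    pathogenic = (
        (vs >= 1 and (s >= 1 or m >= 2 or (m == 1 and p == 1) or p >= 2))
        or s >= 2
        or (s == 1 and (m >= 3 or (m == 2 and p >= 2) or (m == 1 and p >= 4)))
    )
    likely_pathogenic = (
        (vs >= 1 and m == 1)
        or (s == 1 and 1 <= m <= 2)
        or (s == 1 and p >= 2)
        or m >= 3
        or (m == 2 and p >= 2)
        or (m == 1 and p >= 4)
    )
    if pathogenic:
        return "Pathogenic"
    if likely_pathogenic:
        return "Likely pathogenic"
    return "VUS (insufficient pathogenic evidence)"

print(combine_pathogenic_evidence(["PVS1", "PS2"]))  # Pathogenic
print(combine_pathogenic_evidence(["PVS1", "PM2"]))  # Likely pathogenic
print(combine_pathogenic_evidence(["PM2", "PP3"]))   # VUS (insufficient pathogenic evidence)
```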
How We Get To VUS
What if we have some evidence that a variant is pathogenic, but not enough to meet this threshold? Or worse, what if we have a lot of benign evidence but one pathogenic code? The ACMG guidelines lay it out:
VUS Due to Conflicting Evidence
Under the ACMG framework, every variant is assessed both for pathogenic and benign evidence criteria. It is thus quite possible — and does happen on a regular basis — that a variant has both types of evidence, i.e. conflicting evidence. For example, a variant that does not segregate with disease in a family (BS4) and has no predicted effect on the encoded protein (BP4) might still be rare in the general population (PM2).
Another example we often encounter is a missense variant that is rare (PM2), segregates with disease (PP1), and is computationally predicted to be damaging (PP3), but in a gene in which most known disease-causing variants are null variants (BP1). As written above, under ACMG rules, any variant with both types of evidence, no matter the tipping of the scale, defaults to VUS.
VUS Due to Insufficient Evidence
This is the more common pathway to classifying a variant as VUS: there is not enough evidence of pathogenicity to meet the threshold of likely pathogenic, or there’s benign evidence but not enough for likely benign. For example, a novel missense variant (PM2) in a dominant disease gene which is computationally predicted to damage the encoded protein (PP3), without additional evidence, is a VUS (PM2, PP3). Missense variants in general struggle to garner enough evidence to reach pathogenicity due to the strength of evidence codes that can be applied to them; more on that in the next section.
Variants in new or emerging disease genes are especially prone to the “Insufficient Evidence” VUS classification because the etiology of disease is still being established. If only a handful of disease-causing variants have been reported, it’s often difficult to ascertain:
- Whether null variants or missense variants are the predominant type of causal variants
- The presence of mutation hotspots or critical functional domains in which variants almost always cause disease
- The maximum population frequency of established disease-causing variants
Also, relatively new disease genes rarely have robust functional studies that can be used to enhance variant classification. These problems are all exacerbated for missense variants.
Some Variants Have It Easy: Null Variants and De Novo Mutations
You will note that having Very Strong evidence gets you a long way toward classifying a variant as pathogenic. Unfortunately, there is only one type of evidence that carries the weight Very Strong: PVS1. This is reserved for null variants, i.e. the types of variants (e.g., nonsense, frameshift, canonical ±1 or 2 splice sites, initiation codon, single exon or multiexon deletion) that are “assumed to disrupt gene function by leading to a complete absence of the gene product by lack of transcription or nonsense-mediated decay of an altered transcript.” (Richards et al 2015).
As I mentioned earlier, null variants that qualify for PVS1 only need one more piece of moderate-strength evidence to reach likely pathogenic. That’s great for null variants, but such variants represent a tiny fraction of the variants encountered in most genes in most patients. Missense variants are far more prevalent but face an uphill battle toward pathogenicity.
It’s a similar story for de novo mutations: a variant in a dominant disease gene that occurred de novo is rewarded with the strong evidence code PS2. That’s a long way toward a pathogenic classification. However, applying PS2 requires that you test both parents *and* that the parental relationships, especially paternity, have been confirmed. Again, it’s great when the stars align and you can do this. It’s also one of the reasons most labs prefer having family trios (proband and both parents) whenever possible. Yet we live in the real world where:
- Children are sometimes adopted or in foster care
- Families cannot afford all available testing
- Parents may no longer be alive
- Parents may be incarcerated
- Parents may be unwilling to participate in genetic studies
In these situations, testing both parents is not an option and that usually prevents some of the strongest evidence from being applied.
Consequences of Null and De Novo Variant Bias
The biases that favor null variants and de novo mutations may have scientific underpinnings, but they also exert real-world consequences that often skew the perceptions of emerging disease genes. Let’s be honest: it is far easier to publish a cohort of patients who all have de novo loss-of-function mutations in the same gene. I have been a part of multiple GeneMatcher collaborations in which the study leaders either gave preference to patients with null / de novo variants or were forced to do so to get the work published.
This often means the first few papers linking a gene to a disease describe only de novo / null variants, and that becomes the expected etiology of disease. Even established disease genes can be affected by the null-variant bias: because missense variants are harder to classify as likely pathogenic, they often remain VUS. Anyone who glances at the landscape of disease-causing (P/LP) variants for a gene might (incorrectly) assume that only null variants cause disease. I strongly encourage researchers to push back when they are told that the first paper will only include the “easy button” LOF/de novo variants.
Effects of Changing Variant Interpretation Guidelines
The ACMG 2015 guidelines for interpretation of sequence variants were published 8 years ago this month. It was an important milestone in our field, the members of which increasingly recognized that many variants reported as disease-causing were (in retrospect) probably not. The methods for classifying variants were inconsistent, and there was no universal set of rules that someone could apply. The 2015 guidelines provided such a framework.
However, a lot can change in eight years, and although the ClinGen Sequence Variant Interpretation (SVI) working group has released subsequent recommendations on the use of computational evidence for missense variants and refining classification of splicing variants, these are interim guidance.
The long-awaited revised framework for variant interpretation, which implements a points system to improve accuracy/consistency, is not yet published. In theory, it will help us resolve some VUS. That remains to be seen. Just as some types of evidence can be assigned higher strength (e.g. computational predictions of variant impact), other types of evidence may be blunted (e.g. rarity in the population). We won't know until the revised guidelines are published, which probably will not be in 2024.
What About Variants in Candidate Genes?
I should take this moment to remind you — as I sometimes have to remind myself — that ACMG variant interpretation should only be applied to sequence variants that affect established disease genes. It should not really be used for variants in unknown genes or candidate genes not yet associated with human disease. That’s because we can’t assess pathogenicity of a variant without a definitive link between the gene and disease.
For clarity, we try to avoid the use of VUS when discussing candidate genes. Occasionally I hear the term GUS — for Gene of Uncertain Significance — and I really like it, but it does not seem to have gained much momentum.
More VUS, More Problems
The increasing number of VUS on genetic testing reports — and our inability to definitively classify them — present significant challenges for clinicians, laboratories, and patient families.
- For the lab, a VUS is a non-diagnostic outcome. They can be reported, but generally in the dreaded “Section 2” of the test report.
- Patients with VUS thus may not qualify for gene therapy or clinical trials if those are available.
- Clinicians must decide whether or not to pursue further testing, either to clarify the VUS or to keep searching.
Which VUS Merit Further Scrutiny?
It’s important to emphasize here that not all VUS are created equal. Because of the conflicting-evidence-means-VUS rules described above, plenty of variants receive this classification but are extremely unlikely to be disease-causing. On the other hand, sometimes VUS offer a promising potential diagnosis in a patient who otherwise has no significant findings. Perhaps the most important question to be answered is the phenotypic overlap, i.e. whether the gene’s associated condition matches the patient clinical presentation. This is why good clinical phenotyping is critical for genetic testing, especially when interpreting uncertain results.
The number, strength, and types of ACMG evidence codes that accompany a VUS classification are also relevant considerations. Some of the proposed revisions of variant classification guidelines allow for tiering of VUS into subsets representing the amount of pathogenic evidence behind them. If these come to pass, they'll offer a useful way for laboratories to communicate how much pathogenic evidence supports a given VUS. In the meantime, I tend to refer to them as weak or strong VUS, with the latter category possibly warranting follow-up. Examples of strong VUS include:
- A VUS that is compound-heterozygous with a pathogenic variant in a recessive disease gene.
- A VUS with multiple pathogenic criteria that segregates with disease in a gene that fits the phenotype. For example, VUS (PM2, PP3, PP5) would indicate a rare variant that’s predicted to be deleterious and has been reported as disease-causing by another laboratory.
- A VUS with a predicted effect that could be evaluated by additional testing, such as metabolic/biochemical testing or even RNA-seq for potential splice variants.
This leads to the last section of my post, the million-dollar question.
How Can We Resolve a VUS?
I get asked this question all the time. Honestly, if you’re reading this post and have some ideas, I’d love if you shared them in the comments section below. Note, resolution can go either way: building a case for pathogenicity for a suspicious VUS, or ruling out a VUS that might otherwise be a concern. Here are some strategies we and other groups have tried.
- Segregation testing. Determining the segregation pattern and disease status in family members informs, at the very least, the plausibility of the variant as a cause, and there are ACMG evidence codes for both segregation (PP1) and non-segregation (BS4).
- Clinical evaluations. The clinician can review patient/family medical records, bring them in for another clinical visit, or refer them to a relevant specialty to determine the presence (or absence) of clinical features associated with the disorder.
- Checking the latest population allele frequency databases to determine the variant’s prevalence in presumed-healthy individuals.
- Reaching out to other laboratories who have reported the variant according to ClinVar or the literature can sometimes yield useful information.
- Identifying additional patients with the variant can provide or strengthen certain categories of pathogenicity evidence. This is something we use in my ClinGen Variant Curation Expert Panel to resolve VUS in the RPE65 gene.
- Additional patient testing, such as biochemical/metabolic testing, methylation profiling, etc. that would support or exclude the diagnosis
- Variant functional studies in cells, organoids, or animal models. Obviously we’d love this to clarify any variant, but it can be expensive and time-consuming.
Sometimes strategies like the ones above can push a variant to a more definitive classification, and sometimes not. The hard truth is that some VUS cannot be resolved at the present time. Formal classifications aside, the clinicians can make their own judgements about uncertain findings, and counsel and treat the patients accordingly.
The Importance of Patient Phenotype in Genetic Testing
The tools and resources we have for human genomic analysis continue to grow in scale and quality. Computational tools like REVEL and SpliceAI leverage machine learning to provide increasingly accurate predictions of the effects of variants. Public databases of sequence variation like the newly expanded gnomAD tell us how common they are in populations. Community-supported resources like ClinVar continue to curate disease-gene associations and interpretations of those variants.
It follows that genetic testing should continue to improve, especially in the setting of rare disorders. Some of the earliest exome sequencing studies of Mendelian disorders showed that with a fairly straightforward filtering approach, it was possible to winnow the set of coding variants identified in a patient (usually in the tens of thousands) to just a handful of compelling candidate variants. And that was ten years ago.
Predictive Genomics for Rare Disorders
Recently, I began to wonder if we are approaching a GATTACA-like future of predictive genomics, at least for rare genetic conditions. If you obtained the genome sequence of a family trio and all you knew was that the proband has a genetic disorder, would it be possible to identify the most likely causal variant(s) and thus predict the rare disorder the patient has? The answer to this question is probably obvious from the theme of this post, but let’s consider it as a thought experiment. So you have trio WGS data from a family and all you know is that the proband is affected. Maybe it’s a severely ill baby in the NICU, or an as-yet-undiagnosed patient coming to Genetics clinic. In this scenario, you might:
- Run the trio WGS data through your existing pipeline to identify genomic variants (SNVs, indels, and CNVs).
- Annotate all variants with population frequency, in silico predictions, gene / disease associations, ClinVar status, etc.
- Identify variants that fit a Mendelian inheritance model (de novo, recessive, or X-linked)
- Remove variants that are too common in populations to cause the disease associated with their gene
- Apply automated variant interpretation to determine which variants reach pathogenicity
- Retain pathogenic variants in disease-associated genes that fit the inheritance for those genes
With these fairly intuitive steps, you’ll likely get a rather short list of candidates and it would be straightforward to rank them so that the most probable diagnostic findings are at the top. This process would be very amenable to automation, so it could be done at speed and scale. Will that become the new paradigm for genetic testing in rare disorders?
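For illustration, here's a minimal sketch of what the filtering core of such an automated pipeline might look like. The data structure, field names, and frequency cutoff are my own illustrative assumptions, not a description of any production system:

```python
from dataclasses import dataclass

@dataclass
class Variant:
    gene: str
    popmax_af: float        # highest allele frequency in any reference population
    inheritance_fit: str    # "de_novo", "recessive", "x_linked", or "none"
    acmg_class: str         # e.g. "Pathogenic", "Likely pathogenic", "VUS"
    gene_has_disease: bool  # gene has an established disease association
    gene_inheritance: str   # inheritance mode reported for the gene-disease pair

AF_CUTOFF = 0.001  # illustrative; real cutoffs should be disease- and gene-specific

def candidate_diagnoses(variants):
    """Apply the filtering steps above and rank the survivors."""
    keep = []
    for v in variants:
        if v.inheritance_fit == "none":
            continue  # no Mendelian inheritance model fits
        if v.popmax_af >= AF_CUTOFF:
            continue  # too common in populations to cause a rare disorder
        if not v.gene_has_disease:
            continue  # no established gene-disease association
        if v.inheritance_fit != v.gene_inheritance:
            continue  # inheritance pattern doesn't match the gene's disease
        if v.acmg_class not in ("Pathogenic", "Likely pathogenic"):
            continue  # did not reach pathogenicity under automated interpretation
        keep.append(v)
    # Rank de novo findings first, then by rarity
    return sorted(keep, key=lambda v: (v.inheritance_fit != "de_novo", v.popmax_af))
```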
The Missing Component of Genome-Driven Analysis
The predictive genetics approach described above has a rational basis but does not account for some crucial information: the clinical phenotype and family history of the patient being tested. Clinical correlation — the overlap between patient symptoms and disease features — has an outsized influence on whether or not a result can be considered diagnostic. In our work, which is research, we encounter (at a surprising frequency) genetic variants that:
- Are in a known disease gene
- Segregate with the inheritance mode associated with that gene
- Reach pathogenicity under ACMG guidelines, but
- Are associated with a disease that is not clinically apparent in the proband.
In a world where the majority of tests are non-diagnostic and variants of uncertain significance (VUS) are increasingly prevalent, it is hard to ignore these variants. Naturally, we go back to the clinicians and/or medical records to verify that the patient does not have the disease. If there’s no clinical correlation, these are not considered diagnostic findings. No matter how compelling the variants are.
The Power of Phenotype
Admittedly, my perspective is biased: I work on translational research studies that primarily enroll undiagnosed patients. Often they have already undergone extensive genetic/molecular testing as part of their standard of care. When a clinician orders such testing, they provide patient clinical information. On the laboratory side, especially for comprehensive tests like exome/genome sequencing, patient clinical features are critical. The order forms collect extensive details about patient symptoms, which are converted into standardized disease terms (e.g. HPO terms) and used to identify/prioritize variants for interpretation.
Most rare diseases have genetic origins, and many of the genes responsible give rise to highly specific patterns of patient symptoms. Individually, a single patient symptom may not have significant diagnostic value, but the collective picture of patient clinical features can be very powerful. Especially when some of those features are specific and/or unusual. Even a rudimentary system that ranks a patient’s genetic variants based on clinical feature overlap (the number of features shared between the patient and the disease) helps put the most plausible genetic findings at the top.
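As a toy illustration, such a ranker can be as simple as counting shared HPO terms. The gene-to-term mappings below are made up for the example; real implementations draw on curated HPO annotations and typically use semantic similarity rather than raw counts:

```python
def rank_by_phenotype_overlap(candidate_genes, patient_hpo_terms, gene_to_hpo):
    """Sort candidate genes by the number of HPO terms shared with the patient."""
    def overlap(gene):
        return len(set(gene_to_hpo.get(gene, [])) & set(patient_hpo_terms))
    return sorted(candidate_genes, key=overlap, reverse=True)

# Global developmental delay, seizures, microcephaly
patient_terms = ["HP:0001263", "HP:0001250", "HP:0000252"]
gene_terms = {
    "GENE_A": ["HP:0001263", "HP:0001250", "HP:0000252"],  # hypothetical annotations
    "GENE_B": ["HP:0001263"],
}
print(rank_by_phenotype_overlap(["GENE_B", "GENE_A"], patient_terms, gene_terms))
# ['GENE_A', 'GENE_B']
```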
Good clinical phenotyping also provides a powerful tool to exclude candidate findings. This is useful because some medical conditions that warrant genetic testing are associated with a wide range of disorders. In the pediatric setting, for example, global developmental delay is associated with thousands of genetic disorders and thus casts a very wide net. However, for many such disorders, global delays occur alongside a number of other distinctive clinical features. If these are not present in the proband, they can often be ruled out. This reduces the search space and interpretation burden for the laboratory.
Limitations of Phenotype-Driven Analysis
Despite these advantages, a phenotype-centric approach to genetic testing has some important limitations.
- Variable expressivity. Many genetic disorders have clinically significant features that can vary from one patient to the next, even within families.
- Phenocopies. I love this word, which refers to disorders that resemble one another clinically but have different underlying causes.
- Pleiotropy. On the other hand, some genes give rise to multiple disorders which can be clinically very distinct.
- Phenotype expansion. For many genetic disorders, our understanding of the full phenotypic spectrum changes over time. This is especially true for new/emerging rare disorders for which the clinical description is based on a small number of patients.
- Patient evolution. For many patients, the clinical picture changes over time. In the pediatric setting this is a major consideration, as lots of key diagnostic features take time to manifest or be clinically apparent.
- Blended phenotypes. At least 5-10% of patients suspected of having a monogenic disorder have multiple genetic conditions, and their presentation can thus be a confounding combination of the associated features.
The OMIM Curation Bottleneck
The Online Mendelian Inheritance in Man (OMIM) database is one of the most vital resources in human genetics. For many/most laboratories, OMIM is the primary and definitive source for the genes, inheritance patterns, and clinical manifestations associated with genetic disorders. The information in OMIM is curated from the peer-reviewed biomedical literature by trained experts at Johns Hopkins University. This manual curation is why the resource is so widely trusted by the community. However, it’s a double-edged sword because curation takes expertise, time, and funding. The latter two have been a challenge for OMIM, especially since the pace of genetic discovery has accelerated in the past decade. Simply put, there’s way too much literature for OMIM to curate it all.
This bottleneck has real consequences. We look to OMIM as our trusted source of information about disease genes, but that information is increasingly outdated or incomplete. Given the powerful influence that clinical correlation has over genetic testing results… well, it’s a problem. And not one that the OMIM curators will be able to solve on their own. The good news is that there are more sustainable efforts under way. ClinGen, for example, is both standardizing the way information is collected/curated and leveraging expert volunteers (i.e. crowdsourcing) from the community to manage the workload. We still have a long way to go because ClinGen is a relatively new endeavor. However, it’s a more sustainable model that we should continue to support with funding and volunteerism.
In other words, if you’re not part of a ClinGen working group or panel, please think about joining one.
Clinical Genome Sequencing Replaces Exome Sequencing
This month our clinical laboratory began offering genome sequencing as an orderable in-house test. It’s a milestone achievement made possible by a talented multidisciplinary team and 3+ years of pre-clinical work under a translational research study. Yes, clinical genome sequencing was already available to our clinical geneticists — as a sendout test to commercial laboratories — but there are distinct advantages to providing this state-of-the-art test in-house. Especially the rapid genome sequencing (rGS) test, for which results are called out in just a few days. We have years of data showing that genomic testing results can inform patient care in acute cases. Not to over-hype it, but sometimes it saves lives.
Still, that is not my story to tell, so this post is more about the transition from exome to genome sequencing in a (pediatric) hospital setting. It seems likely that many institutions (not just ours) will make the leap this year. There are several factors driving this change, but one of them is simply the ever-increasing speed/throughput of next-generation sequencing instruments. For a long period, approximately 2014-2020, exome sequencing was a more practical choice as the mainstay comprehensive genetic test.
The Exome Advantage
Often patients who qualified for genetic testing would first get cytogenetic and microarray testing for chromosomal abnormalities and CNVs, respectively. Depending on the patient’s clinical features, the next step would often be a gene panel, followed by exome sequencing. As a clinical test, exome sequencing was attractive as a comprehensive test because:
- Exome capture kits had matured significantly, achieving consistent coverage and enabling fairly reliable deletion/duplication calling.
- In terms of laboratory costs, generating ~40-50 Gbp (gigabase-pairs) of data per sample was far less than the ~120 Gbp required for genome.
- Turnaround times were pretty good.
- The variant interpretation was likely to be gene- and exon-centric anyway.
Simply put, exome sequencing interrogated virtually all genes with a reasonable turnaround time and cost, so it made sense as the comprehensive test. If it was working so well, the natural question might be:
Why move to genome sequencing?
Speed, for one thing. The hybridization process (where probes capture target regions) adds about 1-1.5 days to the laboratory prep time between library creation and when things get loaded on the sequencer. The instruments are now so fast that this increases the lab time by about 50% compared to going straight to genome. The throughput is also so high that exome libraries need to be increasingly multiplexed (i.e. run lots of things at once) to be sequenced. Believe it or not, that can also introduce a delay because one has to wait until enough samples have accumulated to pool and sequence them.
“We don’t have enough samples to sequence” is a phrase I never thought I would hear. Man, how a decade can change things.
Reagent costs are a factor, too, since exome kits cost money. As the per-base cost of sequencing goes down, the savings you get from exome capture instead of genome decrease as well. The capture also requires more input DNA, which can be an issue when dealing with precious clinical samples. So genome sequencing is faster, requires less DNA, and ends up costing about the same for reagents. That’s on top of the obvious advantages GS offers in terms of variant detection.
Does genome sequencing have a higher diagnostic yield than exome sequencing?
In most cases, it should. That’s the theoretical answer. GS interrogates both coding and noncoding regions, and it’s better suited to detecting copy number variants (CNVs) and structural variants (SVs) because the breakpoints of such variants often lie in noncoding regions. Plus, exome capture introduces some hybridization biases which, while somewhat addressable during analysis, make it harder to detect changes in sequence depth that signal the presence of a copy number variant.
However, in my opinion, a major diagnostic advantage of genome sequencing comes from its ability to cover genes and exons that don’t play nicely with exome capture. Immune system genes, for example, are notorious for their poor coverage by exome sequencing. We have numerous examples of diagnostic variants uncovered by genome sequencing which were missed by exome testing due to coverage. From the clinician’s point of view, genetic test results from genome sequencing (even when nondiagnostic) come with more confidence that all of the relevant exons and genes have been interrogated.
A second advantage of genome sequencing is the ability to find deep intronic “second hits” in patients who have a single pathogenic variant in a recessive disease gene. Under exome sequencing, you generally have to do another test. With genome data, labs can at least screen nearby noncoding regions (introns, etc) to see if a second variant is present. Computational tools to predict splicing effects of variants have improved substantially in the past few years to the point where SpliceAI scores have been incorporated into ACMG/AMP guidelines. With clinical GS, upon the identification of a single variant in a promising recessive gene, labs can thus screen the data for rare variants in trans that are predicted to disrupt splicing. We have done this in a translational research setting and I think it will be a major source of improved diagnostic rates.
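As a rough sketch, that screen might look something like the following. The field names and both cutoffs (population frequency and SpliceAI delta score) are illustrative assumptions; each lab sets its own thresholds:

```python
def second_hit_candidates(variants, gene, af_cutoff=0.001, spliceai_cutoff=0.5):
    """Screen annotated variants for possible splice-disrupting second hits in one gene."""
    hits = []
    for v in variants:
        if v["gene"] != gene:
            continue
        if v["popmax_af"] >= af_cutoff:
            continue  # must be rare in the general population
        if v["spliceai_delta_max"] < spliceai_cutoff:
            continue  # require a high predicted splicing impact
        if not v["in_trans_with_known_pathogenic"]:
            continue  # must be on the opposite haplotype from the known variant
        hits.append(v)
    return hits
```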
When should a clinical genome be ordered for an exome-negative patient?
This is an important question as clinical GS becomes more widely available. We know that 50-70% of exome tests are nondiagnostic, and it’s reasonable to assume that most patients who have undergone comprehensive testing in the last decade had an exome, not a genome. As I wrote in my recent post on post-exome strategies for Mendelian disorders, a negative exome result means that genome should be considered as the next step. If the clinical test already was genome, this changes the calculus.
I think it will be difficult to establish a perfect set of rules because every patient is different. However, I’d suggest that clinical GS should be considered when:
- The WES testing was done more than two years ago. This seems to be the sweet spot for exome reanalysis anyway, because enough new genes and disease-causing variants have been discovered to significantly boost diagnostic rates.
- New and relevant phenotypic information has emerged. Clinical exome testing is almost always guided/driven by the phenotypic data provided to the laboratory. If that changes, so too could the result. In particular, new phenotypes or features with significant genetic associations (dysmorphism, seizures, metabolic changes, neurological/neuromuscular changes, etc.) can significantly impact how variants are considered.
- There is medical urgency. A patient who continues to decline, or whose care is limited due to the lack of diagnosis, stands to benefit.
- Previous testing or new knowledge hints at a possible diagnosis. So-called “Section 2” variants and newly identified genes/pathways relevant for a patient’s phenotype may justify a harder look at certain loci.
What are the limitations of genome sequencing as a first-tier test?
No test is perfect, and despite the many advantages of genome as a first-line test, it comes with some limitations. GS may have similar experimental costs, but it comes with higher analysis costs (especially computational processing and data storage) because it’s 3-4x more data per sample. Processing the data is an occasional cost, but storing it is like the Netflix subscription that never ends. The human staffing costs of interpretation are also higher because there are more variants (and detected variant classes) to evaluate. Balancing workload among staff also becomes more challenging, especially for rapid turnaround tests. And on the technical side, there is sequence depth to consider: typical depths for exome sequencing (150-200x) have more power to detect somatic/mosaic variation. Patients undergoing testing for conditions associated with somatic mutations — the obvious example being tumor sequencing — are likely to benefit more from exome or panel testing.
RGC, gnomADv4, and the Power of Large Sequenced Cohorts
Two months have passed since the announcement of two powerful human genetics resources — the RGC Million Exome Variant Browser (RGC-ME) and version 4 of the Genome Aggregation Database (gnomAD). Both of them, fundamentally, offer the same thing: a browsable, comprehensive catalogue of human genetic variants and their allele frequencies in several ancestral populations. Importantly, these are aggregate databases: they provide summary statistics grouped by population, rather than individual data. This protects the privacy of the sequenced individuals and obviates the need for extremely broad informed consent. No demographic or phenotypic data are made public, which further protects research participants and safeguards the scientific interests of contributing investigators. That’s what makes it possible to build such large cohorts.
And they are large: RGC-ME, in its current release, contains data from 824,159 unrelated individuals, while gnomADv4 comprises 807,162 individuals. That makes them powerful tools to study genetic architecture, population differences, natural selection, etc. They also inform variant prioritization and interpretation in genetic testing. For rare genetic disorders, knowing how common (or rare) an observed variant is across human populations is crucial to determining its pathogenicity. As these databases grow, they become ever more powerful for making such inferences. We have already begun systematically using gnomADv4 and RGC-ME in our analyses. They are incredibly useful, which was part of my motivation for writing about them. Yet, it’s important to be aware of the composition of their datasets to fully understand their strengths and weaknesses.
Exome Versus Genome Data
The predecessor to the gnomAD database was the Exome Aggregation Consortium (ExAC) database, which contained ~60,000 exomes and was published in the highly cited Lek et al 2016 study. These were uniformly processed, but they were not uniformly sequenced — various targeted enrichment strategies/kits were developed and these evolved over time. However, exomes were (until recently) far less costly than genome sequencing, and they target the regions most people care about. The ExAC dataset grew to include more than 100,000 exomes and was widely adopted by the biomedical/research community.
When gnomAD was introduced, it included most of ExAC plus around 15,500 sequenced genomes, hence the name. This exome-versus-genome distinction is important to keep in mind for several reasons:
- The “exome” — i.e., the full set of bases which code for proteins — includes only a fraction (~1.5%) of the genome.
- Due to evolutionary constraint, there are fewer variants per base in the exome than in noncoding regions and their allele frequencies are lower.
- Most GWAS hits (variants statistically associated with traits) are in noncoding regions: that’s where most variants are, and importantly, where most common variants with statistical power are.
- Genome sequencing offers more uniform coverage of coding regions and also interrogates the other 98% of the genome.
- However, genomes are more expensive, both in sequencing cost and in data processing/analysis costs
What’s In gnomADv4 Compared to Previous Releases
There are important differences between major releases of the gnomAD database. One aspect which I won’t go into here is the genome assembly version used for the release. Early versions were build 37, i.e. the “old” genome assembly, whereas newer ones are on the “new” genome assembly called GRCh38, or simply build 38. This matters because the location of a variant is not the same between genome assemblies.
Another key difference between gnomAD releases is the content. Versions 1 and 2 had only around 15,000 genomes, but almost ten times that number in exomes. Version 3, the first on the new assembly, had around 72,000 genomes (no exomes). They were, unfortunately, mostly from European-ancestry individuals, which is why v3.1 added ~3,000 genomes specifically chosen to increase diversity. Version 3.1 contains five times as many genomes as v2.1, but since it didn’t have the exome data, the total sample size was lower. Thus, guidance for variant interpretation often recommended use of gnomAD v2.1.1 data, even when v3 was available, because the earlier release represents more people in coding regions, where most variant interpretation happens anyway.
The gnomAD Release Cycle
You will note that after the first release of gnomAD, subsequent major releases all seem to happen in October/November. That happens to be when the annual meeting of the American Society of Human Genetics takes place. I truly enjoy giving my friends from the Broad Institute trouble about how they only seem to work on gnomAD when it means a victory lap among their peers. In their defense, it’s useful to have a deadline for ambitious team projects. Also, minor releases that happen in between often contain desirable content (SVs, CNVs, mitochondrial variants) and functionality (e.g. cloud access, variant co-occurrence queries, etc).
As highlighted in the gnomAD v4 blog post, this new release is the largest yet by a significant margin. It has 5x as many people as any previous version and twice as many as v1/v2/v3 combined. Where did these come from? Well, some are the v2 exomes that had not yet been mapped to the new reference. However, the main source of growth was incorporation of exome data from the UK Biobank. The bad news is that it’s exome data (coding regions only) and, like the UK Biobank itself, diversity is low: 95% of participants are white Europeans. The good news is that it’s high-quality exome data from a modern and robust 39-Mbp exome kit made by Integrated DNA Technologies (IDT). I wasn’t paid to say it, but we like their kits, too.
The Regeneron Genetics Million Exome Variant Browser
Also announced at ASHG in November was a completely new resource, the Million Exome variant browser from Regeneron Genetics Center (RGC). Of note, Regeneron is not an academic consortium, but a pharmaceutical company. The RGC enjoys a fascinating — and largely positive — reputation in the genetics community. Part of that reputation stems from their visible investment in collaborative research, illustrated by their creation of the RGC-ME resource. Another part is their successful recruitment of major scientific talent from academia. Many scientists I know (personally or by reputation), especially from large-scale genetics consortia, now work for RGC.
It is thus not very surprising that the scientific output from RGC is extremely high quality. They have led or supported many of the studies of the UK Biobank, which have yielded a lot of high-impact research. The RGC-ME cohort is also described in an impressive preprint suggesting that, once peer-reviewed, it will yield many insights into evolution, constraint, disease gene architecture, splice-altering variants, etc.
The interface has some differences from gnomAD but many similarities. One can browse and search for variants by their coordinates alone, and then see the summarized allele frequency of that variant in the population. This new resource is exciting for a few reasons:
- First, it was a new resource and a surprise (At least to me. Seriously, how many biotech companies put their data in public-facing browsers?)
- Its cohort of 824,159 unrelated individuals is considerably larger, in terms of individuals, than anything out there (even gnomADv4).
- These are also new individuals, i.e. not individuals already in ExAC or gnomAD versions 1-3.
- The dataset comprises more non-European individuals, including some key under-represented groups (e.g. South Asian)
There are some limitations of this new resource. First, it’s exome data, so only informative for the coding regions. I will say that RGC uses a modern exome kit from IDT targeting ~39 Mbp, so it’s almost certainly more uniform than other exome datasets. Second, the RGC-ME browser is a little bit fragile at times, though this seems to be improving. Third, while it’s more diverse, it’s still predominantly (75%) European. Finally, RGC goes out of its way to provide as little information as possible about the origins, demographics, or health statuses of their cohort.
The Overlap of RGC-ME and gnomADv4
One of the first questions to address, since there are two resources offering similar information, is whether or not RGC-ME and gnomADv4 overlap. Is it fair to say that one can now access genetic summary data from 1.6 million individuals? Not so fast. Remember, RGC did the exome sequencing for the UK Biobank, which was the main source of gnomAD v4 additions. The major release of UKBB genomic data comprised exomes for 455,000 individuals, and gnomAD’s v4 release statistics report 416,555 UKBB participants are included. If we assume that these individuals are also in RGC-ME, which seems likely, then gnomADv4 and RGC-ME together contain around 1.21 million unique individuals.
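The back-of-the-envelope math, under the assumption that the 416,555 UKBB exomes in gnomADv4 are all also in RGC-ME:

```python
rgc_me = 824_159
gnomad_v4 = 807_162
ukbb_in_gnomad_v4 = 416_555  # assumed to also be in RGC-ME

unique_individuals = rgc_me + gnomad_v4 - ukbb_in_gnomad_v4
print(f"{unique_individuals:,}")  # 1,214,766 -> roughly 1.21 million
```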
Despite all of the caveats, I think that’s an important denominator.
Using gnomADv4/RGC-ME Data in Variant Interpretation
As I mentioned, population allele frequency information is critical when evaluating the pathogenicity of a sequence variant — especially one identified in genetic testing of a patient with a suspected genetic disorder. Both formal ACMG-AMP interpretation guidelines and the in-house filtering strategies of many research laboratories generally expect that disease-causing variants will be at very low frequency in the general population and not often observed in healthy individuals. In the gnomAD v2/v3 era, that meant:
- For autosomal dominant disorders, <3 heterozygous individuals in gnomAD, which equates to an allele frequency (AF) < 0.00001
- For autosomal recessive disorders, 0 or 1 homozygotes, with maximum population AF < 0.001
- For X-linked disorders, 0-1 hemizygotes/homozygotes
We don’t usually have a hard AF cutoff for X-linked disorders because of the dynamics of sex chromosomes in population genetics. We also tend to count 1-2 hets as 0, or 1 homozygote as zero, to account for possible artifacts (e.g. a mis-called genotype or a person who is affected). Those must all be re-calibrated when you have access to data from 1.2 million individuals, of course. If we stipulate that a disease affecting only 1 in 10,000 individuals is rare, then there could be as many as 120 patients in these large datasets. Even though theoretically individuals with severe pediatric/congenital disorders are intentionally excluded, it’s still a worrisome thought.
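For illustration, here's how those rules of thumb might be encoded; the field names are assumptions and, as noted above, the thresholds themselves need recalibration for million-scale datasets:

```python
def passes_frequency_filter(variant, inheritance):
    """Return True if the variant is rare enough to remain a candidate (gnomAD v2/v3-era rules of thumb)."""
    if inheritance == "autosomal_dominant":
        return variant["het_count"] < 3 and variant["af"] < 1e-5
    if inheritance == "autosomal_recessive":
        return variant["hom_count"] <= 1 and variant["popmax_af"] < 1e-3
    if inheritance == "x_linked":
        # No hard AF cutoff; rely on hemizygote/homozygote counts
        return (variant["hemi_count"] + variant["hom_count"]) <= 1
    return False
```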
On the bright side, access to significantly more individuals has clarified many variants with ambiguous allele frequency information. We have seen this in practice. Variants absent from gnomADv3 often have 5-8 carriers in each of gnomADv4 and RGC-ME, which is more suggestive of a very rare population variant. The new datasets are especially useful when they reveal new homozygous/hemizygous individuals for variants in recessive/X-linked disease genes, allowing us to eliminate those variants from further consideration. Lastly, when a coding variant is absent from both gnomADv4 and RGC-ME, we can assume it is extremely rare indeed.
Words of Caution on Age and Disease Status
Lastly, a word of caution: it is incorrect to state that the individuals in gnomAD and RGC-ME are healthy controls. A quick perusal of the RGC-ME contributors page or the About gnomAD page should remind you that most of the studies contributing data to these databases are disease studies. It’s reasonable to assume that at least half of the people in cohorts for heart disease, diabetes, asthma, and other diseases will have that disease. We don’t know who they are, or precisely how many are in the aggregate databases. However, previous releases of gnomAD provided subsets, e.g. non-cancer, non-neuro, etc. Basic arithmetic tells us that, for example, gnomADv3 contained 2,133 cancer patients and 8,714 patients with neuropsychiatric disease.
Biobanks are another major source of aggregate samples. The largest of these is UKBB, which constitutes a significant proportion of both databases as noted above. According to the first UK Biobank paper, the resource then included 502,543 participants. Some key demographics:
- 95% were of European ancestry based on the first two principal components
- The average age was 58 years old
- 43.1% were current or former smokers (yikes)
- 13.5% had asthma
- 7.5% had cancer
- 7.1% had coronary artery disease
- 3.4% had type 2 diabetes
Of note, they’re also not young. From previous statements, we know the average age of a gnomAD participant was 62 years old, and if UKBB represents a large piece of the RGC dataset, their average age is close to 60 as well. An individual present in a population cohort thus is not necessarily healthy. If anything, they’re more likely to have a common disease of interest to the research or biotech community.