Rare Diseases

The Challenges of Variants of Uncertain Significance (VUS)

April 30, 2024 by dkoboldt

As genetic testing continues to expand in both clinical and research settings, variants of uncertain significance (VUS) present a persistent challenge. For the uninitiated, VUS is one of five classifications assigned to genetic variants under ACMG guidelines which indicate the likelihood that a variant causes disease.

Variant interpretation scale — Variant interpretation categories (NHGRI, Guide to Interpreting Genomic Reports: A Genomics Toolkit)

Generally, uncertain significance is the default classification for variants that cannot otherwise be classified as pathogenic/likely pathogenic (i.e. disease causing) or benign/likely benign (not disease causing).

If you’re familiar with genetic testing trends over the last decade, you probably know that VUS are increasingly prevalent on genetic testing reports. In part, that’s due to technological advances, e.g. high-throughput DNA sequencing, that make it possible to interrogate more of the patient’s genome in a timely and cost-effective manner. Gene panels for specific conditions now often encompass thousands of genes, and comprehensive testing — genome or exome sequencing — is increasingly available as a first-tier or second-tier test.

Increasing knowledge — specifically, the number of genes associated with disease — is another important contributor to the VUS explosion. The pace of gene discovery accelerated in the NGS era and continues to grow, as illustrated by the statistics provided by the Online Mendelian Inheritance in Man (OMIM) database:

OMIM pace of gene discovery — The Pace of Gene Discovery (Credit: OMIM)

Long story short, more variants detected in every patient (by comprehensive sequencing) combined with more genes that are possibly reportable (due to association with disease) means more variants on genetic testing reports. And, as I’m about to tell you, for reportable variants, VUS will often be the expected classification.

ACMG Variant Interpretation 101

First, a very brief introduction to the types of evidence that are used when interpreting variants and how they are represented. ACMG evidence codes are letter/number combinations.

The first letter indicates the type of evidence (Pathogenic or Benign).
The second 1-2 letters indicate the strength of evidence (Very Strong, Strong, Moderate, or SuPporting).
The number is a category we use to keep them all straight.

Combining ACMG Rules for pathogenic variants — Evidence required for P/LP variants (Richards et al, Genetics in Medicine, 2015)

So for example, PS2 is the evidence code applied when a variant occurs de novo in a patient with confirmed maternity/paternity. This is the second (2) type of strong (S) evidence of pathogenicity (P), hence the code PS2. For another example, when a variant’s population allele frequency is greater than expected for the disorder, it gets the code BS1 (the first type of benign strong evidence). The weakest level of evidence, supporting, is given the strength-designation P. For example, BP1 applies when you have a missense variant in a gene in which almost all disease-causing variants are truncating/loss-of-function.
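To make the letter/number structure concrete, here is a minimal Python sketch that splits a code into its direction, strength, and category number. The names `parse_code` and `EvidenceCode` are hypothetical helpers invented for this illustration. The "stand-alone" strength (used by BA1) comes from the 2015 guidelines rather than from the list above, and the sketch does not check that a given category number actually exists.

```python
import re
from typing import NamedTuple

# Letter-to-meaning lookups for the two parts of a code's prefix.
DIRECTIONS = {"P": "pathogenic", "B": "benign"}
STRENGTHS = {
    "VS": "very strong",
    "S": "strong",
    "M": "moderate",
    "P": "supporting",
    "A": "stand-alone",  # benign only (BA1); from the 2015 guidelines, not listed above
}
# Strength letters that are used with each direction in the 2015 guidelines.
ALLOWED = {"P": {"VS", "S", "M", "P"}, "B": {"A", "S", "P"}}

CODE_PATTERN = re.compile(r"^([PB])(VS|S|M|P|A)(\d+)$")


class EvidenceCode(NamedTuple):
    code: str
    direction: str
    strength: str
    number: int


def parse_code(code: str) -> EvidenceCode:
    """Split an ACMG evidence code such as 'PS2' into its three parts.

    Only the shape of the code is validated; whether the category number
    exists in the guidelines is not checked.
    """
    text = code.strip().upper()
    match = CODE_PATTERN.match(text)
    if match is None or match.group(2) not in ALLOWED[match.group(1)]:
        raise ValueError(f"not a recognized ACMG evidence code: {code!r}")
    direction, strength, number = match.groups()
    return EvidenceCode(text, DIRECTIONS[direction], STRENGTHS[strength], int(number))


for example in ("PS2", "BS1", "BP1", "PVS1"):
    print(parse_code(example))
```

Run on the examples from the paragraph above, PS2 comes back as pathogenic/strong/2, BS1 as benign/strong/1, and BP1 as benign/supporting/1.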

When a variant is assessed, each type of evidence is evaluated to see if it applies. The final set of evidence is combined into a formula to determine the final classification. The rules for combination to get a pathogenic or likely pathogenic variant are shown to the right. So for example, for a variant with very strong (VS) evidence of pathogenicity, only one additional strong evidence code is required to classify it as pathogenic. If that second piece of evidence is moderate strength, the variant would be classified likely pathogenic. There’s a similar formula for benign/likely benign evidence.

How We Get To VUS

What if we have some evidence that a variant is pathogenic, but not enough to meet this threshold? Or worse, what if we have a lot of benign evidence but one pathogenic code? The ACMG guidelines lay it out:

Conflicting ACMG interpretation evidence guidelines: Guideline for conflicting evidence (Richards et al, GiM 2015)

VUS Due to Conflicting Evidence

Under the ACMG framework, every variant is assessed both for pathogenic and benign evidence criteria. It is thus quite possible — and does happen on a regular basis — that a variant has both types of evidence, i.e. conflicting evidence. For example, a variant that does not segregate with disease in a family (BS4) and has no predicted effect on the encoded protein (BP4) might still be rare in the general population (PM2).

Another example we often encounter is a missense variant that is rare (PM2), segregates with disease (PP1), and is computationally predicted to be damaging (PP3), but in a gene in which most known disease-causing variants are null variants (BP1). As written above, under ACMG rules, any variant with both types of evidence, no matter the tipping of the scale, defaults to VUS.

VUS Due to Insufficient Evidence

This is the more common pathway to classifying a variant as VUS: there is not enough evidence of pathogenicity to meet the threshold of likely pathogenic, or there’s benign evidence but not enough for likely benign. For example, a novel missense variant (PM2) in a dominant disease gene which is computationally predicted to damage the encoded protein (PP3), without additional evidence, is a VUS (PM2, PP3). Missense variants in general struggle to garner enough evidence to reach pathogenicity due to the strength of evidence codes that can be applied to them; more on that in the next section.
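The two routes to VUS can be sketched in a few lines of Python. This is a rough illustration, not a validated classifier. The thresholds follow the combining rules in Richards et al. 2015 as I read them, with "1 Moderate and 1 Supporting" treated as "at least one of each". The conflict rule follows this post's reading, in which any mix of pathogenic and benign codes defaults to VUS. The function names are hypothetical.

```python
from collections import Counter


def count_evidence(codes):
    """Tally codes by prefix, e.g. ['PM2', 'PP3'] -> {'PM': 1, 'PP': 1}."""
    return Counter(code.rstrip("0123456789") for code in codes)


def pathogenic_call(c):
    """Return 'Pathogenic', 'Likely pathogenic', or None (Richards et al. 2015)."""
    vs, s, m, p = c["PVS"], c["PS"], c["PM"], c["PP"]
    if vs >= 1 and (s >= 1 or m >= 2 or (m >= 1 and p >= 1) or p >= 2):
        return "Pathogenic"
    if s >= 2:
        return "Pathogenic"
    if s >= 1 and (m >= 3 or (m >= 2 and p >= 2) or (m >= 1 and p >= 4)):
        return "Pathogenic"
    if (vs >= 1 and m >= 1) or (s >= 1 and m >= 1) or (s >= 1 and p >= 2):
        return "Likely pathogenic"
    if m >= 3 or (m >= 2 and p >= 2) or (m >= 1 and p >= 4):
        return "Likely pathogenic"
    return None


def benign_call(c):
    """Return 'Benign', 'Likely benign', or None (Richards et al. 2015)."""
    if c["BA"] >= 1 or c["BS"] >= 2:
        return "Benign"
    if (c["BS"] >= 1 and c["BP"] >= 1) or c["BP"] >= 2:
        return "Likely benign"
    return None


def classify(codes):
    """Return (classification, reason) for a list of evidence codes."""
    c = count_evidence(codes)
    has_pathogenic = any(c[k] for k in ("PVS", "PS", "PM", "PP"))
    has_benign = any(c[k] for k in ("BA", "BS", "BP"))
    # Assumption: any mix of the two evidence types is treated as a conflict.
    if has_pathogenic and has_benign:
        return "VUS", "conflicting evidence"
    call = pathogenic_call(c) or benign_call(c)
    if call is not None:
        return call, "threshold met"
    return "VUS", "insufficient evidence"


print(classify(["BS4", "BP4", "PM2"]))         # conflicting-evidence example
print(classify(["PM2", "PP1", "PP3", "BP1"]))  # missense variant with BP1
print(classify(["PM2", "PP3"]))                # novel missense, nothing else
print(classify(["PVS1", "PM2"]))               # null variant plus one moderate
```

The first two calls reproduce the conflicting-evidence examples above and the third reproduces the insufficient-evidence example. The last shows how little a PVS1 variant needs to reach likely pathogenic.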

Variants in new or emerging disease genes are especially prone to the “Insufficient Evidence” VUS classification because the etiology of disease is still being established. If only a handful of disease-causing variants have been reported, it’s often difficult to ascertain:

Whether null variants or missense variants are the predominant type of causal variants
The presence of mutation hotspots or critical functional domains in which variants almost always cause disease
The maximum population frequency of established disease-causing variants

Also, relatively new disease genes rarely have robust functional studies that can be used to enhance variant classification. These problems are all exacerbated for missense variants.

Some Variants Have It Easy: Null Variants and De Novo Mutations

You will note that having Very Strong evidence gets you a long way toward classifying a variant as pathogenic. Unfortunately, there is only one type of evidence that carries the weight Very Strong: PVS1. This is reserved for null variants, i.e. the types of variants (e.g., nonsense, frameshift, canonical ±1 or 2 splice sites, initiation codon, single exon or multiexon deletion) that are “assumed to disrupt gene function by leading to a complete absence of the gene product by lack of transcription or nonsense-mediated decay of an altered transcript.” (Richards et al 2015).

As I mentioned earlier, null variants that qualify for PVS1 only need one more piece of moderate-strength evidence to reach likely pathogenic. That’s great for null variants, but such variants represent a tiny fraction of the variants encountered in most genes in most patients. Missense variants are far more prevalent but face an uphill battle toward pathogenicity.

It’s a similar story for de novo mutations: a variant in a dominant disease gene that occurred de novo is rewarded with the strong evidence code PS2. That’s a long way toward a pathogenic classification. However, applying PS2 requires that you test both parents *and* that the parental relationships, especially paternity, have been confirmed. Again, it’s great when the stars align and you can do this. It’s also one of the reasons most labs prefer having family trios (proband and both parents) whenever possible. Yet we live in the real world where:

Children are sometimes adopted or in foster care
Families cannot afford all available testing
Parents may no longer be alive
Parents may be incarcerated
Parents may be unwilling to participate in genetic studies

In these situations, testing both parents is not an option and that usually prevents some of the strongest evidence from being applied.

Consequences of Null and De Novo Variant Bias

The biases that favor null variants and de novo mutations may have scientific underpinnings, but they also exert real-world consequences that often skew the perceptions of emerging disease genes. Let’s be honest: it is far easier to publish a cohort of patients who all have de novo loss-of-function mutations in the same gene. I have been a part of multiple GeneMatcher collaborations in which the study leaders either gave preference to patients with null / de novo variants or were forced to do so to get the work published.

This often means the first few papers linking a gene to a disease describe only de novo / null variants, and that becomes the expected etiology of disease. Even established disease genes can be affected by the null-variant bias: because missense variants are harder to classify as likely pathogenic, they often remain VUS. Anyone who glances at the landscape of disease-causing (P/LP) variants for a gene might (incorrectly) assume that only null variants cause disease. I strongly encourage researchers to push back when they are told that the first paper will only include the “easy button” LOF/de novo variants.

Effects of Changing Variant Interpretation Guidelines

The ACMG 2015 guidelines for interpretation of sequence variants were published 8 years ago this month. It was an important milestone in our field, the members of which increasingly recognized that many variants reported as disease-causing were (in retrospect) probably not. The methods for classifying variants were inconsistent, and there was no universal set of rules that someone could apply. The 2015 guidelines provided such a framework.

However, a lot can change in eight years, and although the ClinGen Sequence Variant Interpretation (SVI) working group has released subsequent recommendations on the use of computational evidence for missense variants and refining classification of splicing variants, these are interim guidance.

The long-awaited revised framework for variant interpretation, which implements a points system to improve accuracy/consistency, is not yet published. In theory, it will help us resolve some VUS. That remains to be seen. Just as some types of evidence can be assigned higher strength (e.g. computational predictions of variant impact), other types of evidence may be blunted (e.g. rareness in the population). We won’t know until the revised guidelines are published, which probably will not be in 2024.

What About Variants in Candidate Genes?

I should take this moment to remind you — as I sometimes have to remind myself — that ACMG variant interpretation should only be applied to sequence variants that affect established disease genes. It should not really be used for variants in unknown genes or candidate genes not yet associated with human disease. That’s because we can’t assess pathogenicity of a variant without a definitive link between the gene and disease.

For clarity, we try to avoid the use of VUS when discussing candidate genes. Occasionally I hear the term GUS — for Gene of Uncertain Significance — and I really like it, but it does not seem to have gained much momentum.

More VUS, More Problems

The increasing number of VUS on genetic testing reports — and our inability to definitively classify them — present significant challenges for clinicians, laboratories, and patient families.

For the lab, a VUS is a non-diagnostic outcome. VUS can be reported, but generally in the dreaded “Section 2” of the test report.
Patients with a VUS may not qualify for gene therapy or clinical trials if those are available.
Clinicians must decide whether or not to pursue further testing, either to clarify the VUS or to keep searching.

Which VUS Merit Further Scrutiny?

It’s important to emphasize here that not all VUS are created equal. Because of the conflicting-evidence-means-VUS rules described above, plenty of variants receive this classification but are extremely unlikely to be disease-causing. On the other hand, sometimes VUS offer a promising potential diagnosis in a patient who otherwise has no significant findings. Perhaps the most important question to be answered is the phenotypic overlap, i.e. whether the gene’s associated condition matches the patient’s clinical presentation. This is why good clinical phenotyping is critical for genetic testing, especially when interpreting uncertain results.

The number, strength, and types of ACMG evidence codes that accompany a VUS classification are also relevant considerations. Some of the proposed revisions of variant classification guidelines allow for tiering of VUS into subsets representing the amount of pathogenic evidence behind them. If these come to pass, they’ll offer laboratories a useful tool for communicating how much evidence supports a given VUS. In the meantime, I tend to refer to them as weak or strong VUS, with the latter category possibly warranting follow-up. Examples of strong VUS include:

A VUS that is compound-heterozygous with a pathogenic variant in a recessive disease gene.
A VUS with multiple pathogenic criteria that segregates with disease in a gene that fits the phenotype. For example, VUS (PM2, PP3, PP5) would indicate a rare variant that’s predicted to be deleterious and has been reported as disease-causing by another laboratory.
A VUS with a predicted effect that could be evaluated by additional testing, such as metabolic/biochemical testing or even RNA-seq for potential splice variants.

This leads to the last section of my post, the million-dollar question.

How Can We Resolve a VUS?

I get asked this question all the time. Honestly, if you’re reading this post and have some ideas, I’d love if you shared them in the comments section below. Note, resolution can go either way: building a case for pathogenicity for a suspicious VUS, or ruling out a VUS that might otherwise be a concern. Here are some strategies we and other groups have tried.

Segregation testing. Determining the segregation pattern and disease status in family members informs, at the very least, the plausibility of a variant fit and there are ACMG evidence codes for both segregation (PP1) and non-segregation (BS4).
Clinical evaluations. The clinician can review patient/family medical records, bring them in for another clinical visit, or refer them to a relevant specialty to determine the presence (or absence) of clinical features associated with the disorder.
Checking the latest population allele frequency databases to determine the variant’s prevalence in presumed-healthy individuals.
Reaching out to other laboratories who have reported the variant according to ClinVar or the literature can sometimes yield useful information.
Identifying additional patients with the variant can provide or strengthen certain categories of pathogenicity evidence. This is something we use in my ClinGen Variant Curation Expert Panel to resolve VUS in the RPE65 gene.
Additional patient testing, such as biochemical/metabolic testing, methylation profiling, etc. that would support or exclude the diagnosis
Variant functional studies in cells, organoids, or animal models. Obviously we’d love this to clarify any variant, but it can be expensive and time-consuming.

Sometimes strategies like the ones above can push a variant to a more definitive classification, and sometimes not. The hard truth is that some VUS cannot be resolved at the present time. Formal classifications aside, the clinicians can make their own judgements about uncertain findings, and counsel and treat the patients accordingly.

The Importance of Patient Phenotype in Genetic Testing

February 23, 2024 by dkoboldt

The tools and resources we have for human genomic analysis continue to grow in scale and quality. Computational tools like REVEL and SpliceAI leverage machine learning to provide increasingly accurate predictions of the effects of variants. Public databases of sequence variation like the newly expanded gnomAD tell us how common they are in populations. Community-supported resources like ClinVar continue to curate disease-gene associations and interpretations of those variants.

It follows that genetic testing should continue to improve, especially in the setting of rare disorders. Ten years ago, some of the earliest exome sequencing studies of Mendelian disorders showed that with a fairly straightforward filtering approach, it was possible to winnow the set of coding variants identified in a patient (usually in the tens of thousands) to just a handful of compelling candidate variants. And that was ten years ago.

Predictive Genomics for Rare Disorders

Recently, I began to wonder if we are approaching a GATTACA-like future of predictive genomics, at least for rare genetic conditions. If you obtained the genome sequence of a family trio and all you knew was that the proband has a genetic disorder, would it be possible to identify the most likely causal variant(s) and thus predict the rare disorder the patient has? The answer to this question is probably obvious from the theme of this post, but let’s consider it as a thought experiment. So you have trio WGS data from a family and all you know is that the proband is affected. Maybe it’s a severely ill baby in the NICU, or an as-yet-undiagnosed patient coming to Genetics clinic. In this scenario, you might:

Run the trio WGS data through your existing pipeline to identify genomic variants (SNVs, indels, and CNVs).
Annotate all variants with population frequency, in silico predictions, gene / disease associations, ClinVar status, etc.
Identify variants that fit a Mendelian inheritance model (de novo, recessive, or X-linked)
Remove variants that are too common in populations to cause the disease associated with their gene
Apply automated variant interpretation to determine which variants reach pathogenicity
Retain pathogenic variants in disease-associated genes that fit the inheritance for those genes
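The filtering steps in this thought experiment might look something like the Python sketch below. It assumes variant calling, annotation, and automated classification have already happened upstream. The `Variant` and `GeneInfo` records, the `COMPATIBLE` inheritance table, the gene names, and every number are hypothetical placeholders, not the output of any real pipeline.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    inheritance_fit: str   # "de novo", "recessive", "x-linked", or "none"
    popmax_af: float       # highest allele frequency across populations
    classification: str    # result of an upstream automated classifier


@dataclass
class GeneInfo:
    modes: set             # inheritance modes of the gene's disorder(s)
    max_af: float          # highest frequency plausible for a causal variant


# Assumed mapping from how a variant segregates in the trio to the
# gene-level inheritance modes it is compatible with.
COMPATIBLE = {"de novo": {"AD", "XL"}, "recessive": {"AR"}, "x-linked": {"XL"}}
RANK = {"Pathogenic": 0, "Likely pathogenic": 1}


def shortlist(variants, disease_genes):
    """Apply the filters in order and rank what survives."""
    kept = []
    for v in variants:
        gene = disease_genes.get(v.gene)
        if gene is None:                      # not a known disease gene
            continue
        if v.popmax_af > gene.max_af:         # too common for this disorder
            continue
        if v.classification not in RANK:      # does not reach P/LP
            continue
        if not (COMPATIBLE.get(v.inheritance_fit, set()) & gene.modes):
            continue                          # wrong inheritance for the gene
        kept.append(v)
    # Pathogenic before likely pathogenic, then rarest first.
    return sorted(kept, key=lambda v: (RANK[v.classification], v.popmax_af))


disease_genes = {
    "GENE_A": GeneInfo(modes={"AD"}, max_af=0.0001),
    "GENE_B": GeneInfo(modes={"AR"}, max_af=0.01),
}
variants = [
    Variant("GENE_B", "recessive", 0.002, "Likely pathogenic"),
    Variant("GENE_A", "de novo", 0.0, "Pathogenic"),
    Variant("GENE_C", "de novo", 0.0, "Pathogenic"),           # unknown gene
    Variant("GENE_A", "recessive", 0.0, "Likely pathogenic"),  # wrong mode
    Variant("GENE_B", "recessive", 0.05, "Pathogenic"),        # too common
    Variant("GENE_D", "x-linked", 0.0, "Uncertain significance"),
]

for v in shortlist(variants, disease_genes):
    print(v.gene, v.inheritance_fit, v.classification)
```

Note that nothing in this sketch looks at the patient: the only inputs are the genome and gene-level knowledge.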

With these fairly intuitive steps, you’ll likely get a rather short list of candidates and it would be straightforward to rank them so that the most probable diagnostic findings are at the top. This process would be very amenable to automation, so it could be done at speed and scale. Will that become the new paradigm for genetic testing in rare disorders?

The Missing Component of Genome-Driven Analysis

The predictive genetics approach described above has a rational basis but does not account for some crucial information: the clinical phenotype and family history of the patient being tested. Clinical correlation — the overlap between patient symptoms and disease features — has an outsized influence on whether or not a result can be considered diagnostic. In our work, which is research, we encounter (at a surprising frequency) genetic variants that:

Are in a known disease gene
Segregate with the inheritance mode associated with that gene
Reach pathogenicity under ACMG guidelines, but
Are associated with a disease that is not clinically apparent in the proband.

In a world where the majority of tests are non-diagnostic and variants of uncertain significance (VUS) are increasingly prevalent, it is hard to ignore these variants. Naturally, we go back to the clinicians and/or medical records to verify that the patient does not have the disease. If there’s no clinical correlation, these are not considered diagnostic findings. No matter how compelling the variants are.

The Power of Phenotype

Admittedly, my perspective is biased: I work on translational research studies that primarily enroll undiagnosed patients. Often they have already undergone extensive genetic/molecular testing as part of their standard of care. When a clinician orders such testing, they provide patient clinical information. On the laboratory side, especially for comprehensive tests like exome/genome sequencing, patient clinical features are critical. The order forms collect extensive details about patient symptoms, which are converted into standardized disease terms (e.g. HPO terms) and used to identify/prioritize variants for interpretation.

Most rare diseases have genetic origins, and many of the genes responsible give rise to highly specific patterns of patient symptoms. Individually, a single patient symptom may not have significant diagnostic value, but the collective picture of patient clinical features can be very powerful. Especially when some of those features are specific and/or unusual. Even a rudimentary system that ranks a patient’s genetic variants based on clinical feature overlap (the number of features shared between the patient and the disease) helps put the most plausible genetic findings at the top.
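As a rough illustration of the rudimentary ranking just described, the Python sketch below counts the clinical features shared between a patient and each candidate disorder and sorts by that count. The feature labels and disorder names are made up. A real system would use standardized HPO term identifiers and would likely weight specific or unusual features more heavily, which this sketch does not attempt.

```python
def rank_by_feature_overlap(patient_features, disease_features):
    """Rank disorders by how many clinical features they share with the patient.

    Returns a list of (disorder, shared_count, shared_features) tuples,
    highest overlap first; ties are broken alphabetically by disorder name.
    """
    patient = set(patient_features)
    ranked = []
    for disorder, features in disease_features.items():
        shared = patient & set(features)
        ranked.append((disorder, len(shared), sorted(shared)))
    ranked.sort(key=lambda row: (-row[1], row[0]))
    return ranked


# Plain-language stand-ins for HPO terms (hypothetical example data).
patient = ["global developmental delay", "seizures", "microcephaly"]
candidates = {
    "Disorder Y": ["global developmental delay", "hearing loss"],
    "Disorder X": ["global developmental delay", "seizures",
                   "microcephaly", "hypotonia"],
    "Disorder Z": ["cardiomyopathy"],
}

for disorder, count, shared in rank_by_feature_overlap(patient, candidates):
    print(f"{disorder}: {count} shared feature(s) {shared}")
```

Global developmental delay alone matches two of the three disorders here. It is the combination with seizures and microcephaly that puts Disorder X on top.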

Good clinical phenotyping also provides a powerful tool to exclude candidate findings. This is useful because some medical conditions that warrant genetic testing are associated with a wide range of disorders. In the pediatric setting, for example, global developmental delay is associated with thousands of genetic disorders and thus casts a very wide net. However, for many such disorders, global delays occur alongside a number of other distinctive clinical features. If these are not present in the proband, they can often be ruled out. This reduces the search space and interpretation burden for the laboratory.

Limitations of Phenotype-Driven Analysis

Despite these advantages, a phenotype-centric approach to genetic testing has some important limitations.

Variable expressivity. Many genetic disorders have clinically significant features that can vary from one patient to the next, even within families.
Phenocopies. I love this word, which refers to disorders that resemble one another clinically but have different underlying causes.
Pleiotropy. On the other hand, some genes give rise to multiple disorders which can be clinically very distinct.
Phenotype expansion. For many genetic disorders, our understanding of the full phenotypic spectrum changes over time. This is especially true for new/emerging rare disorders for which the clinical description is based on a small number of patients.
Patient evolution. For many patients, the clinical picture changes over time. In the pediatric setting this is a major consideration, as lots of key diagnostic features take time to manifest or be clinically apparent.
Blended phenotypes. At least 5-10% of patients suspected of having a monogenic disorder have multiple genetic conditions and their presentation can thus be a confounding combination of the associated features.

The OMIM Curation Bottleneck

The Online Mendelian Inheritance in Man (OMIM) database is one of the most vital resources in human genetics. For many/most laboratories, OMIM is the primary and definitive source for the genes, inheritance patterns, and clinical manifestations associated with genetic disorders. The information in OMIM is curated from the peer-reviewed biomedical literature by trained experts at Johns Hopkins University. This manual curation is why the resource is so widely trusted by the community. However, it’s a double-edged sword because curation takes expertise, time, and funding. The latter two have been a challenge for OMIM, especially since the pace of genetic discovery has accelerated in the past decade. Simply put, there’s way too much literature for OMIM to curate it all.

This bottleneck has real consequences. We look to OMIM as our trusted source of information about disease genes, but that information is increasingly outdated or incomplete. Given the powerful influence that clinical correlation has over genetic testing results… well, it’s a problem. And not one that the OMIM curators will be able to solve on their own. The good news is that there are more sustainable efforts under way. ClinGen, for example, is both standardizing the way information is collected/curated and leveraging expert volunteers (i.e. crowdsourcing) from the community to manage the workload. We still have a long way to go because ClinGen is a relatively new endeavor. However, it’s a more sustainable model that we should continue to support with funding and volunteerism.

In other words, if you’re not part of a ClinGen working group or panel, please think about joining one.

Clinical Genome Sequencing Replaces Exome Sequencing

January 19, 2024 by dkoboldt

This month our clinical laboratory began offering genome sequencing as an orderable in-house test. It’s a milestone achievement made possible by a talented multidisciplinary team and 3+ years of pre-clinical work under a translational research study. Yes, clinical genome sequencing was already available to our clinical geneticists (as a sendout test to commercial laboratories), but there are distinct advantages to providing this state-of-the-art test in-house. Especially the rapid genome sequencing (rGS) test, for which results are called out in just a few days. We have years of data showing that genomic testing results can inform patient care in acute cases. Not to over-hype it, but sometimes it saves lives.

Still, that is not my story to tell, so this post is more about the transition from exome to genome sequencing in a (pediatric) hospital setting. It seems likely that many institutions (not just ours) will make the leap this year. There are several factors driving this change, but one of them is simply the ever-increasing speed/throughput of next-generation sequencing instruments. For a long period, approximately 2014-2020, exome sequencing was a more practical choice as the mainstay comprehensive genetic test.

The Exome Advantage

Often patients who qualified for genetic testing would first get cytogenetic and microarray testing for chromosomal abnormalities and CNVs, respectively. Depending on the patient’s clinical features, the next step would often be a gene panel, followed by exome sequencing. As a clinical test, exome sequencing was attractive as a comprehensive test because:

Exome capture kits had matured significantly, achieving consistent coverage and enabling fairly reliable deletion/duplication calling.
In terms of laboratory costs, generating ~40-50 Gbp (gigabase-pairs) of data per sample cost far less than generating the ~120 Gbp required for a genome.
Turnaround times were pretty good.
The variant interpretation was likely to be gene- and exon-centric anyway.

Simply put, exome sequencing interrogated virtually all genes with a reasonable turnaround time and cost, so it made sense as the comprehensive test. If it was working so well, the natural question might be:

Why move to genome sequencing?

Speed, for one thing. The hybridization process (where probes capture target regions) adds about 1-1.5 days to the laboratory prep time between library creation and when things get loaded on the sequencer. The instruments are now so fast that this increases the lab time by about 50% compared to going straight to genome. The throughput is also so high that exome libraries need to be increasingly multiplexed (i.e. run lots of things at once) to be sequenced. Believe it or not, that can also introduce a delay because one has to wait until enough samples have accumulated to pool and sequence them.

“We don’t have enough samples to sequence” is a phrase I never thought I would hear. Man, how a decade can change things.

Reagent costs are a factor, too, since exome kits cost money. As the per-base cost of sequencing goes down, the savings you get from exome capture instead of genome decrease as well. The capture also requires more input DNA, which can be an issue when dealing with precious clinical samples. So genome sequencing is faster, requires less DNA, and ends up costing about the same for reagents. That’s on top of the obvious advantages GS offers in terms of variant detection.

Does genome sequencing have a higher diagnostic yield than exome sequencing?

In most cases, it should. That’s the theoretical answer. GS interrogates both coding and noncoding regions, and it’s better suited to detecting copy number variants (CNVs) and structural variants (SVs) because the breakpoints of such variants often lie in noncoding regions. Plus, exome capture introduces some hybridization biases which, while somewhat addressable during analysis, make it harder to detect changes in sequence depth that signal the presence of a copy number variant.

However, in my opinion, a major diagnostic advantage of genome sequencing comes from its ability to cover genes and exons that don’t play nicely with exome capture. Immune system genes, for example, are notorious for their poor coverage by exome sequencing. We have numerous examples of diagnostic variants uncovered by genome sequencing which were missed by exome testing due to coverage. From the clinician’s point of view, genetic test results from genome sequencing (even when nondiagnostic) come with more confidence that all of the relevant exons and genes have been interrogated.

A second advantage of genome sequencing is the ability to find deep intronic “second hits” in patients who have a single pathogenic variant in a recessive disease gene. Under exome sequencing, you generally have to do another test. With genome data, labs can at least screen nearby noncoding regions (introns, etc) to see if a second variant is present. Computational tools to predict splicing effects of variants have improved substantially in the past few years to the point where SpliceAI scores have been incorporated into ACMG/AMP guidelines. With clinical GS, upon the identification of a single variant in a promising recessive gene, labs can thus screen the data for rare variants in trans that are predicted to disrupt splicing. We have done this in a translational research setting and I think it will be a major source of improved diagnostic rates.
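The second-hit screen described above could be sketched in Python roughly as follows. This is an illustration only: the `Variant` fields, the use of parent of origin (from trio data) to decide whether two variants are in trans, and both cutoffs are assumptions made for the example, not values taken from this post or from any guideline.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    name: str
    popmax_af: float           # highest allele frequency across populations
    spliceai_max_delta: float  # largest SpliceAI delta score (0 to 1)
    parent_of_origin: str      # "mother", "father", or "unknown"


def second_hit_candidates(first_hit, variants, max_af=0.001, min_spliceai=0.2):
    """Find rare, splice-predicted variants in trans with a known first hit.

    max_af and min_spliceai are illustrative defaults only.
    """
    hits = []
    for v in variants:
        if v is first_hit or v.gene != first_hit.gene:
            continue
        if v.popmax_af > max_af:
            continue
        if v.spliceai_max_delta < min_spliceai:
            continue
        # In trans = inherited from the other parent; unknown phase is skipped.
        known = "unknown" not in (v.parent_of_origin, first_hit.parent_of_origin)
        if not (known and v.parent_of_origin != first_hit.parent_of_origin):
            continue
        hits.append(v)
    # Strongest predicted splicing effect first.
    return sorted(hits, key=lambda v: -v.spliceai_max_delta)


first = Variant("GENE_R", "coding_hit", 0.0001, 0.0, "mother")
others = [
    Variant("GENE_R", "deep_intronic_1", 0.00001, 0.65, "father"),
    Variant("GENE_R", "deep_intronic_2", 0.00002, 0.85, "mother"),  # in cis
    Variant("GENE_R", "common_intronic", 0.12, 0.40, "father"),     # too common
    Variant("GENE_R", "intronic_no_effect", 0.0, 0.01, "father"),   # low score
    Variant("GENE_Q", "other_gene", 0.0, 0.90, "father"),           # wrong gene
]

for v in second_hit_candidates(first, others):
    print(v.name, v.spliceai_max_delta)
```

Anything this kind of screen surfaces would still be a candidate needing follow-up, such as RNA testing, rather than a finding in its own right.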

When should a clinical genome be ordered for an exome-negative patient?

This is an important question as clinical GS becomes more widely available. We know that 50-70% of exome tests are nondiagnostic, and it’s reasonable to assume that most patients who have undergone comprehensive testing in the last decade had an exome, not a genome. As I wrote in my recent post on post-exome strategies for Mendelian disorders, a negative exome result means that genome should be considered as the next step. If the clinical test already was genome, this changes the calculus.

I think it will be difficult to establish a perfect set of rules because every patient is different. However, I’d suggest that clinical GS should be considered when:

The WES testing was done more than two years ago. This seems to be the sweet spot for exome reanalysis anyway, because enough new genes and disease-causing variants have been discovered to significantly boost diagnostic rates.
New and relevant phenotypic information has emerged. Clinical exome testing is almost always guided/driven by the phenotypic data provided to the laboratory. If that changes, so too could the result. In particular, new phenotypes or features with significant genetic associations (dysmorphism, seizures, metabolic changes, neurological/neuromuscular changes, etc.) can significantly impact how variants are considered.
There is medical urgency. A patient who continues to decline, or whose care is limited due to the lack of diagnosis, stands to benefit.
Previous testing or new knowledge hints at a possible diagnosis. So-called “Section 2” variants and newly identified genes/pathways relevant for a patient’s phenotype may justify a harder look at certain loci.

What are the limitations of genome sequencing as a first-tier test?

No test is perfect, and despite the many advantages of genome as a first-line test, it comes with some limitations. GS may have similar experimental costs, but it comes with higher analysis costs (especially computational processing and data storage) because it’s 3-4x more data per sample. Processing that is an occasional cost, but storing data is like the Netflix subscription that never ends. The human staffing costs of interpretation are also higher because there are more variants (and detected variant classes) to evaluate. Balancing workload among staff also becomes more challenging, especially for rapid turnaround tests. And on the technical side, there is sequence depth to consider: Typical depths for exome sequencing (150-200x) have more power to detect somatic/mosaic variation. Patients undergoing testing for conditions associated with somatic mutations — the obvious example being tumor sequencing — are likely to benefit more from exome or panel testing.

RGC, gnomADv4, and the Power of Large Sequenced Cohorts

January 11, 2024 by dkoboldt

Two months have passed since the announcement of two powerful human genetics resources — the RGC Million Exome Variant Browser (RGC-ME) and version 4 of the gnomAD database (gnomAD). Both of them, fundamentally, offer the same thing: a browsable, comprehensive catalogue of human genetic variants and their allele frequencies in several ancestral populations. Importantly, these are aggregate databases: they provide summary statistics grouped by population, rather than individual data. This protects the privacy of the sequenced individuals and obviates the need for extremely broad informed consent. No demographic or phenotypic data are made public, which further protects research participants and safeguards the scientific interests of contributing investigators. That’s what makes it possible to build such large cohorts.

And they are large: RGC-ME, in its current release, contains data from 824,159 unrelated individuals, while gnomADv4 comprises 807,162 individuals. That makes them powerful tools to study genetic architecture, population differences, natural selection, etc. They also inform variant prioritization and interpretation in genetic testing. For rare genetic disorders, knowing how common (or rare) an observed variant is across human populations is crucial to determining its pathogenicity. As these databases grow, they become ever more powerful for making such inferences. We have already begun systematically using gnomADv4 and RGC-ME in our analyses. They are incredibly useful, which was part of my motivation for writing about them. Yet it’s important to be aware of the composition of their datasets to fully understand their strengths and weaknesses.

Exome Versus Genome Data

The predecessor to the gnomAD database was the Exome Aggregation Consortium (ExAC) database, which contained ~60,000 exomes and was published in the highly cited Lek et al 2016 study. These were uniformly processed, but they were not uniformly sequenced — various targeted enrichment strategies/kits were developed and these evolved over time. However, exomes were (until recently) far less costly than genome sequencing, and they target the regions most people care about. The ExAC dataset grew to include more than 100,000 exomes and was widely adopted by the biomedical/research community.

When gnomAD was introduced, it included most of ExAC plus around 15,500 sequenced genomes, hence the name. This exome-versus-genome distinction is important to keep in mind for several reasons:

The “exome” — i.e., the full set of bases which code for proteins — includes only a fraction (~1.5%) of the genome.
Due to evolutionary constraint, there are fewer variants per base in the exome than in noncoding regions and their allele frequencies are lower.
Most GWAS hits (variants statistically associated with traits) are in noncoding regions: that’s where most variants are and, importantly, where most common variants with statistical power are.
Genome sequencing offers more uniform coverage of coding regions and also interrogates the other 98% of the genome.
However, genomes are more expensive, both to sequence and to process and analyze.

What’s In gnomADv4 Compared to Previous Releases

Major gnomAD Releases — Major gnomAD releases

There are important differences between major releases of the gnomAD database. One aspect which I won’t go into here is the genome assembly version used for the release. Early versions were build 37, i.e. the “old” genome assembly, whereas newer ones are on the “new” genome assembly called GRCh38, or simply build 38. This matters because the location of a variant is not the same between genome assemblies.

Another key difference between gnomAD releases is the content. Versions 1 and 2 had only around 15,000 genomes, but almost ten times that number in exomes. Version 3, the first on the new assembly, had around 72,000 genomes (no exomes). They were, unfortunately, mostly from individuals of European ancestry, which is why v3.1 added ~3,000 genomes specifically chosen to increase diversity. Version 3.1 contains five times as many genomes as v2.1, but since it didn’t have the exome data, the total sample size was lower. Thus, guidance for variant interpretation often recommended use of gnomAD v2.1.1 data, even when v3 was available, because the earlier release represents more people in coding regions, where most variant interpretation happens anyway.

The gnomAD Release Cycle

You will note that after the first release of gnomAD, subsequent major releases all seem to happen in October/November. That happens to be when the annual meeting of the American Society of Human Genetics takes place. I truly enjoy giving my friends from the Broad Institute trouble about how they only seem to work on gnomAD when it means a victory lap among their peers. In their defense, it’s useful to have a deadline for ambitious team projects. Also, minor releases that happen in between often contain desirable content (SVs, CNVs, mitochondrial variants) and functionality (e.g. cloud access, variant co-occurrence queries, etc).

As highlighted in the gnomAD v4 blog post, this new release is the largest yet by a significant margin. It has 5x as many people as all previous versions and twice as many as v1/v2/v3 combined. Where did these come from? Well, some are the v2 exomes that had not yet been mapped to the new reference. However, the main source of growth was incorporation of exome data from the UK Biobank. The bad news is that it’s exome data (coding regions only) and, like the UK Biobank itself, diversity is low: 95% of participants are white Europeans. The good news is that it’s high-quality exome data from a modern and robust 39-Mbp exome kit made by Integrated DNA Technologies (IDT). I wasn’t paid to say it, but we like their kits, too.

The Regeneron Genetics Million Exome Variant Browser

Also announced at ASHG in November was a completely new resource, the Million Exome variant browser from Regeneron Genetics Center (RGC). Of note, Regeneron is not an academic consortium, but a pharmaceutical company. The RGC enjoys a fascinating — and largely positive — reputation in the genetics community. Part of that reputation stems from their visible investment in collaborative research, illustrated by their creation of the RGC-ME resource. Another part is their successful recruitment of major scientific talent from academia. Many scientists I know (personally or by reputation), especially from large-scale genetics consortia, now work for RGC.

It is thus not very surprising that the scientific output from RGC is extremely high quality. They have led or supported many of the studies of the UK Biobank, which have yielded a lot of high-impact research. The RGC-ME cohort is also described in an impressive preprint suggesting that, once peer-reviewed, it will yield many insights into evolution, constraint, disease gene architecture, splice-altering variants, etc.

Breakdown of ancestry in the RGC-ME cohort — RGC Cohort Ancestry (source: preprint)

The interface has some differences from gnomAD but many similarities. One can browse and search for variants by their coordinates alone, and then see the summarized allele frequency of that variant in the population. This new resource is exciting for a few reasons:

First, it was a new resource and a surprise (at least to me; seriously, how many biotech companies put their data in public-facing browsers?).
Its cohort of 824,159 unrelated individuals is considerably larger, in terms of individuals, than anything out there (even gnomADv4).
These are also new individuals, i.e. not individuals already in ExAC or gnomAD versions 1-3.
The dataset comprises more non-European individuals, including some key under-represented groups (e.g. South Asian).

There are some limitations of this new resource. First, it’s exome data, so only informative for the coding regions. I will say that RGC uses a modern exome kit from IDT targeting around 39 Mbp, so it’s almost certainly more uniform than other exome datasets. Second, the RGC-ME browser is a little bit fragile at times, though this seems to be improving. Third, while it’s more diverse, it’s still predominantly (75%) European. Finally, RGC goes out of its way to provide as little information as possible about the origins, demographics, or health statuses of their cohort.

The Overlap of RGC-ME and gnomADv4

One of the first questions to address, since there are two resources offering similar information, is whether or not RGC-ME and gnomADv4 overlap. Is it fair to say that one can now access genetic summary data from 1.6 million individuals? Not so fast. Remember, RGC did the exome sequencing for the UK Biobank, which was the main source of gnomAD v4 additions. The major release of UKBB genomic data comprised exomes for 455,000 individuals, and gnomAD’s v4 release statistics report 416,555 UKBB participants are included. If we assume that these individuals are also in RGC-ME, which seems likely, then gnomADv4 and RGC-ME together contain around 1.21 million unique individuals.
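The overlap arithmetic above is simple inclusion-exclusion, and it can be sketched in a few lines. This is only an illustration using the cohort sizes quoted in this post; the function name is made up for the example, and it leans on the same assumption stated above, namely that every UKBB participant in gnomADv4 is also in RGC-ME.

```python
# Cohort sizes as quoted in this post.
RGC_ME = 824_159
GNOMAD_V4 = 807_162
UKBB_IN_GNOMAD_V4 = 416_555  # UKBB participants reported in gnomAD v4


def unique_individuals(cohort_a, cohort_b, shared):
    """Inclusion-exclusion: count people in either cohort exactly once."""
    return cohort_a + cohort_b - shared


# Assumption (from the text): all UKBB participants in gnomADv4 are also in RGC-ME.
total = unique_individuals(RGC_ME, GNOMAD_V4, UKBB_IN_GNOMAD_V4)
print(f"{total:,} unique individuals (~{total / 1e6:.2f} million)")
```

If the true overlap is smaller or larger than the UKBB count, the denominator shifts accordingly, so treat the result as an estimate rather than an exact census.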

Despite all of the caveats, I think that’s an important denominator.

Using gnomADv4/RGC-ME Data in Variant Interpretation

As I mentioned, population allele frequency information is critical when evaluating the pathogenicity of a sequence variant — especially one identified in genetic testing of a patient with a suspected genetic disorder. Both formal ACMG-AMP interpretation guidelines and the in-house filtering strategies of many research laboratories generally expect that disease-causing variants will be at very low frequency in the general population and not often observed in healthy individuals. In the gnomAD v2/v3 era, that meant:

For autosomal dominant disorders, <3 heterozygous individuals in gnomAD, which equates to an allele frequency (AF) < 0.00001
For autosomal recessive disorders, 0 or 1 homozygotes, with maximum population AF < 0.001
For X-linked disorders, 0-1 hemizygotes/homozygotes

We don’t usually have a hard AF cutoff for X-linked disorders because of the dynamics of sex chromosomes in population genetics. We also tend to count 1-2 hets as 0, or 1 homozygote as zero, to account for possible artifacts (e.g. a mis-called genotype or a person who is affected). Those must all be re-calibrated when you have access to data from 1.2 million individuals, of course. If we stipulate that a disease affecting only 1 in 10,000 individuals is rare, then there could be as many as 120 patients in these large datasets. Even though theoretically individuals with severe pediatric/congenital disorders are intentionally excluded, it’s still a worrisome thought.
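To make the thresholds and the re-calibration problem concrete, here is a minimal sketch. The function names, the inheritance labels ("AD", "AR", "XL"), and the 150,000-person cohort are all assumptions chosen for illustration (150,000 is a round number, not an exact release count); the cutoffs themselves are the ones listed above.

```python
def allele_frequency(het_count, hom_count, n_individuals):
    """Autosomal AF: each individual contributes two alleles."""
    return (het_count + 2 * hom_count) / (2 * n_individuals)


def passes_frequency_filter(inheritance, het_count=0, hom_count=0,
                            hemi_count=0, max_pop_af=0.0):
    """Apply the gnomAD v2/v3-era rules of thumb described in the text."""
    if inheritance == "AD":   # dominant: fewer than 3 heterozygotes
        return het_count < 3
    if inheritance == "AR":   # recessive: 0-1 homozygotes, max population AF < 0.001
        return hom_count <= 1 and max_pop_af < 0.001
    if inheritance == "XL":   # X-linked: 0-1 hemizygotes/homozygotes, no hard AF cutoff
        return hemi_count + hom_count <= 1
    raise ValueError(f"unknown inheritance model: {inheritance}")


def expected_affected(n_individuals, one_in):
    """Expected number of affected people for a prevalence of 1 in `one_in`."""
    return n_individuals / one_in


ILLUSTRATIVE_COHORT = 150_000  # assumption: round number for illustration only
COMBINED_COHORT = 1_200_000    # approximate combined gnomADv4 + RGC-ME denominator

af_three_hets = allele_frequency(3, 0, ILLUSTRATIVE_COHORT)
alleles_at_same_af = round(af_three_hets * 2 * COMBINED_COHORT)
patients = expected_affected(COMBINED_COHORT, 10_000)

print(f"3 hets in {ILLUSTRATIVE_COHORT:,} people: AF = {af_three_hets:.0e}")
print(f"Same AF in {COMBINED_COHORT:,} people: ~{alleles_at_same_af} alleles")
print(f"Expected patients at 1 in 10,000: {patients:.0f}")
print("2 hets passes dominant filter:", passes_frequency_filter("AD", het_count=2))
```

The point of the sketch is that a fixed carrier count means something very different as the denominator grows: the allele frequency that corresponded to three heterozygotes in the smaller cohort corresponds to roughly eight times as many alleles in a cohort eight times the size, so count-based cutoffs have to scale with it.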

On the bright side, access to significantly more individuals has clarified many variants with ambiguous allele frequency information. We have seen this in practice. Variants absent from gnomADv3 often have 5-8 carriers in each of gnomADv4 and RGC, which is more suggestive of a very rare population variant. The new datasets are especially useful when they contain new homozygous/hemizygous individuals for variants in recessive/X-linked genes to eliminate them from further consideration. Lastly, when a coding variant is absent from both gnomADv4 and RGC-ME, we can assume it is extremely rare indeed.

Words of Caution on Age and Disease Status

Lastly, a word of caution: it is incorrect to state that the individuals in gnomAD and RGC-ME are healthy controls. A quick perusal of the RGC-ME contributors page or the About gnomAD page should remind you that most of the studies contributing data to these databases are disease studies. It’s reasonable to assume that at least half of the people in cohorts for heart disease, diabetes, asthma, and other diseases will have that disease. We don’t know who they are, or precisely how many are in the aggregate databases. However, previous releases of gnomAD provided subsets, e.g. non-cancer, non-neuro, etc. Basic arithmetic tells us that, for example, gnomADv3 contained 2,133 cancer patients and 8,714 patients with neuropsychiatric disease.

Biobanks are another major source of aggregate samples. The largest of these is UKBB, which constitutes a significant proportion of both databases as noted above. According to the first UK Biobank paper, the resource then included 502,543 participants. Some key demographics:

95% were of European ancestry based on the first two principal components
The average age was 58 years old
43.1% were current or former smokers (yikes)
13.5% had asthma
7.5% had cancer
7.1% had coronary artery disease
3.4% had type 2 diabetes

Of note, they’re also not young. From previous statements, we know the average age of a gnomAD participant was 62 years old, and if UKBB represents a large piece of the RGC dataset, their average age is close to 60 as well. An individual present in a population cohort thus is not necessarily healthy. If anything, they’re more likely to have a common disease of interest to the research or biotech community.

Post-Exome Strategies for Mendelian Disorders

August 28, 2023 by dkoboldt

Today I’m delivering the research genomics lecture at NCH’s Myology Training Course, an annual, week-long, in-person training program that covers numerous clinical, research, and laboratory topics relevant to the field. In a stroke of excellent timing, Monica H. Wojcik and colleagues from the Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Consortium have just published a review on diagnostic testing beyond the exome. In other words, they review the currently available tests and diagnostic procedures that may elucidate a molecular diagnosis for a patient with a Mendelian disorder when exome sequencing (ES) has failed to do so.

A flowchart for choosing assays for a Mendelian patient after exome testing is negative — Post-exome strategies for Mendelian patients (Credit: Wojcik et al, AJHG 2023)

This is a subject I know something about, having spent more than a decade studying rare genetic diseases in a large genome center. A negative ES report is no longer the end of the road, as there are numerous other possible strategies to uncover the genetic basis of a rare disorder. This review covers them well, so I thought it would make a useful blog post.

First, it needs to be said that we are living in a golden era of genetic testing. Technological advances have enabled comprehensive assays for genomic interrogation – microarrays, exome sequencing, and genome sequencing – while improvements in bioinformatics and community-created resources like gnomAD and ClinVar have improved our ability to identify and classify disease-causing variants. Best practice guidelines from organizations like the American College of Medical Genetics and Genomics (ACMG) have been modified to leverage the strengths of these new technologies. Under those guidelines, for patients with congenital anomalies, developmental delays, or intellectual disability, exome/genome sequencing should be the first- or second-line test. One of the reasons for this is the fact that an increasing number of genetic conditions are clinically similar but genetically heterogeneous. Another is the observation that ~5% of diagnosed patients have multiple genetic conditions.

When Comprehensive Genetic Testing Fails

Despite all of this progress, some 50-60% of individuals with a suspected Mendelian condition remain undiagnosed after comprehensive molecular testing. Why does testing fail to produce a diagnosis? It’s a question I spend a lot of time thinking about, discussing with colleagues, and posing to candidates during interviews. Generally, the reasons fall into two categories:

Category 1: The causal variant was detected, but:

The genetic basis of the disorder is not yet known. New disease genes are being discovered every day, but we have a long way to go.
The gene is associated with disease, but the patient represents a new phenotypic manifestation or severity, i.e. variable expressivity.
The variant is inherited from an unaffected parent, i.e. incomplete penetrance.
There is not enough evidence to call the variant pathogenic. Variant interpretation is challenging, especially in the scenario where information is limited concerning the pattern of disease-causing variants, the origin of the variant in the patient, or both.

Category 2: The causal variant was not detected, because:

The gene (or exon) is not interrogated by the sequencing assay, i.e. poorly captured for ES or poorly covered for GS.
The variant is difficult to detect by short-read sequencing, e.g. structural variants and trinucleotide repeat expansions.
The variant lies in a noncoding region (note, it might well be detected by GS, but could be challenging to interpret).
The variant is epigenetic, not genetic.
The disorder is not genetic but has an infectious or other acquired origin.

This is a partial list of reasons that the most comprehensive test available (currently exome sequencing in most situations) fails to make a diagnostic finding.

Post-Exome Testing Options

The central focus of “Beyond the Exome” is the set of options for further diagnostic testing, many of which fall under research. They include:

Exome Sequencing Reanalysis

A key advantage of ES as a genetic test is the ability to re-analyze data when new phenotypic information emerges and/or when some time (usually 2-3 years or more) has passed. The yield of ES reanalysis can vary widely, but a systematic review estimated the increased diagnostic yield at 15% and recommended that reanalysis is warranted 18 months after the initial test. Generally speaking, diagnoses made by ES reanalysis are the result of:

New gene discovery for Mendelian conditions, i.e. identification of a variant in a gene now associated with disease. Consistently the major contributor to new diagnoses.
Resolution of previously known variants of uncertain significance (VUS) as pathogenic.
Improvements in bioinformatics pipelines for variant calling and annotation.

As the authors of this review highlight, diagnoses found on exome reanalysis may also be in known disease genes not previously thought to explain the phenotype, where the clinical interpretation of a variant has changed due to novel data such as additional clinical information, new variant inheritance information, segregation data from other affected family members, newly published case reports, or an expansion of the phenotype associated with the gene. Clinical re-evaluation and clinician input are also essential in these scenarios.

Short-read Genome Sequencing

Genome sequencing (GS) in most contexts means “short-read” genome sequencing — paired-end sequencing of 150-bp to 250-bp reads from the ends of ~350-500 bp fragments by whole genome shotgun approaches. Illumina platforms continue to dominate this market. Compared to ES, GS has some key advantages:

More uniform coverage of genes and exons, including certain genes which are notoriously difficult to capture (especially immune genes) for exome sequencing.
Identification of copy number variants (CNVs) and structural variants (SVs), usually with better sensitivity and resolution than SNP microarrays.
Comprehensive interrogation of noncoding regions that may harbor pathogenic variants, such as introns, promoters, regulatory elements, etc.

It’s important to note that while GS is increasingly being offered on a clinical basis, it is fairly exome-centric in terms of variants reported. In other words, although millions of noncoding variants are identified by GS, our ability to interpret them remains limited. The incremental diagnostic yield of GS in exome-negative patients varies but is probably in the 5-15% range. As expected, some diagnoses made by GS involve types or sizes of SVs that are difficult to detect by other assays. Some are splice-region variants. Some are variants in poorly-captured genes. Yet a significant proportion of findings afforded by GS are made not because the detection was superior, but rather because GS was performed later and on a research basis. This can enable the identification of candidate variants and the exclusion of other genetic causes.

Long-read Targeted/Genome Sequencing

Single molecule long-read sequencing is commercially available on two platforms — Pacific Biosciences and Oxford Nanopore Technologies — and produces reads that are significantly longer than standard GS approaches: 10,000 to 15,000 bp on average, compared to 150 bp. On a per-base level, these platforms have a higher rate of sequencing error than sequencing-by-synthesis (Illumina) approaches, especially in certain sequence contexts (e.g. homopolymers). However, even with slightly diminished accuracy, long reads in this size range are extremely useful for resolving structural variants / complex rearrangements and for interrogating otherwise hard-to-sequence regions of the genome. We have used PacBio long-read sequencing to:

Identify causal variants in syndromic rare disease patients that were poorly covered / not detected by GS. Example: Polyalanine repeat expansions in HOXD13.
Resolve the genomic breakpoints of translocations, inversions, and other complex rearrangements.
Determine the phase of two variants in the same gene, e.g. somatic variants in PTEN in hemimegalencephaly patients.

Naturally, there are disadvantages to long-read sequencing compared to traditional ES and GS approaches. The first and most obvious disadvantage is the cost, which can be 3-4x higher than standard GS. The “DNA cost” is also high, as long read technologies require a large amount of high-molecular-weight DNA. Informatics pipelines are not as mature for long-read technologies, so the analysis cost is higher as well.

RNA Sequencing

Transcriptome profiling by RNA sequencing is, in my opinion, one of the most powerful research genomics tools for undiagnosed patients. RNA can be co-extracted from blood along with DNA, and RNA sequencing is relatively inexpensive. RNA-seq provides a lot of useful information, including:

Comprehensive gene expression measurements with higher precision than microarray testing.
Isoform expression, i.e. the expression level of each exon and splicing of adjacent exons.
Quantification of allele-specific expression, i.e. the balance of alleles of a variant in expressed transcripts.
Splicing patterns, including both canonical and disruptive splicing.
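As a rough illustration of what allele-specific expression analysis looks like at a single heterozygous site, the sketch below runs a two-sided exact binomial test on reference versus alternate read counts. It is a simplification under stated assumptions: the function name is hypothetical, the null hypothesis is a 50/50 allelic ratio, and real pipelines also have to deal with reference mapping bias and overdispersion, which are ignored here.

```python
from math import comb


def allelic_imbalance_pvalue(ref_reads, alt_reads):
    """Two-sided exact binomial test against a 50/50 allelic ratio."""
    n = ref_reads + alt_reads
    if n == 0:
        raise ValueError("no reads cover the variant")
    # Under the null, each read is equally likely to carry either allele.
    k = min(ref_reads, alt_reads)
    # At p = 0.5 the distribution is symmetric, so the two-sided p-value
    # is twice the lower tail P(X <= k), capped at 1.
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * lower_tail)


# Hypothetical read counts at a heterozygous site in RNA-seq data.
balanced = allelic_imbalance_pvalue(ref_reads=10, alt_reads=10)
skewed = allelic_imbalance_pvalue(ref_reads=18, alt_reads=2)
print(f"10 vs 10 reads: p = {balanced:.3g}")
print(f"18 vs 2 reads:  p = {skewed:.3g}")
```

A strongly skewed ratio at a site that is heterozygous in the DNA is the kind of signal that can point toward a hidden regulatory or splice-disrupting variant on the under-expressed allele, though a single site with few reads is weak evidence on its own.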

RNA-seq data is most useful when paired with genomic data. In our hands, it has been most useful in identifying “missing” variants in known disease genes (e.g. a second hit in a recessive gene in patients who have a single pathogenic variant by standard testing but otherwise fit the condition well). Many of those missing second hits are deep intronic variants with occult disruptions to canonical splicing, but we also see splice-disrupting variants in coding regions and intronic splice regions (outside the canonical splice site, but close to the intron-exon junction). RNA-seq can also resolve VUS by showing the variant’s impact on mRNA transcripts. We have used it both to prove and disprove effects on splicing.

The main disadvantage to RNA-seq is that it is only informative if the gene is expressed in available tissue. Otherwise you get no reads, or too few reads to infer splicing patterns. Many genes are expressed in blood, which is why RNA-seq from blood can be useful even for disorders that affect other systems / developmental timepoints. However, we should be cautious about over-promising the number of genes expressed at high enough levels to analyze: in my experience, it’s only around 50%. RNA-seq has made the most gains in disorders for which disease-relevant tissue is available, e.g. muscle diseases (due to muscle biopsy). Many genes have tissue-specific expression and that tissue often is not available for research testing.

Optical Genome Mapping (OGM) and Epigenetic Methylation Profiling

Both of these are relatively new/emerging technologies with a lot of promise that have already begun to be implemented in some clinical areas. Optical Genome Mapping (OGM) is not sequencing per se, but high-resolution imaging of long labeled DNA molecules coupled with sophisticated informatics to map the physical structure of chromosomes. OGM is therefore very useful for identifying CNVs, SVs, and complex rearrangements with higher resolution (to ~500 bp) than standard of care approaches. Like long-read sequencing, it is costly in terms of reagents and input DNA requirements. OGM in fact requires a specialized sample prep, so you need access to fresh patient material (blood or fresh/frozen tissue) to do the library prep, and that library is only useful for OGM.

Epigenetic profiling, which in the current field usually refers to DNA methylation profiling by microarray, is another diagnostic tool available on clinical and/or research basis depending on the phenotype. Among Mendelian disorders, it has found the most success in diagnosing neurodevelopmental/ID disorders caused by mutations with altered global methylation profiles, i.e. mutations in methylation pathway genes and transcription factors. Methylation profiles for the test patient are generated and clustered alongside reference cohorts of individuals with known diagnoses, assigning the patient a cluster and a confidence score. Doing this type of analysis thus requires access to a large reference cohort of profiles from the same tissue type from many patients with known disorders. If diagnostic, it does not provide sequence-level information, i.e. the mutation responsible for the aberrant methylation pattern. But it can tell you where to look, and in some cases can resolve VUS in genes with abnormal methylation profiles.

Summary

In summary, a nondiagnostic ES report is no longer the end of the road for patients with Mendelian disorders. A growing number of other assays, many of which are only available on a research basis, can provide answers to a considerable proportion of patients and families.
