The vast majority of the human genome (98%) does not encode proteins. Most of the ~4-6 million sequence variants in any individual’s genome thus lie within this ‘noncoding’ portion. The deep, dark secret of current genetic testing is that it primarily interrogates and reports the relatively tiny fraction of variants (~50,000 or so per individual) that are in or near protein-coding regions. Although there are several scientifically valid reasons for this, the main one is that the structure of genes and the triplet protein code make it possible to predict the effect of a DNA sequence change with reasonable accuracy. Yet, we know that noncoding variants are one of the reasons that >50% of tested patients fail to receive a diagnosis from comprehensive genetic testing.
After all, there are many types of regulatory elements in or near genes that affect their protein production.
It should be noted that even as clinical genome sequencing replaces exome sequencing, the scope of reported variants remains focused on coding regions. The widely used 2015 ACMG Variant Interpretation Guidelines were developed around protein-altering variants, and the same holds true for the updated guidelines that groups such as ClinGen already use pervasively despite their not being fully published at the time of writing. Variant interpretations for noncoding variants are woefully under-represented in databases such as ClinVar. All of this needs to change.
I volunteer for a ClinGen Variant Curation Expert Panel (VCEP) which develops the gene-specific variant interpretation protocols for sets of genes and applies them to the ClinVar-submitted variants for those genes. My VCEP is for Leber Congenital Amaurosis / early-onset retinal disease, which are among the major causes of inherited blindness. Our first curated gene, RPE65, is primarily associated with recessive disease that now has a gene therapy treatment (Luxturna). Interestingly, we have incorporated “response to gene therapy” into our interpretation protocols, but that’s a story for another time. The development of interest here is that at least a couple of patients who received and responded to Luxturna are compound-heterozygous for a coding variant and a noncoding variant in RPE65. Thus, we have begun discussing how to adapt our rigorous, coding-centric variant interpretation protocol for noncoding variants.
The good news is that an expert panel has drafted, tested, and published recommendations for clinical interpretation of noncoding variants.
Their study, published toward the end of 2022 in Genome Medicine, provides guiding principles and numerous specific adaptations of ACMG evidence codes for noncoding variants. Here, I’ll review and summarize some of the key findings.
Variant Interpretation Requires A Gene-Disease Relationship
First, as noted by the authors, to avoid a substantial burden of analysis and reporting, clinical interpretation should only be applied to variants that:
- Map to cis-regulatory elements that have well-established, functionally-validated links to target genes, and
- Affect genes that have documented association with a phenotype (disease) relevant for the patient harboring the variant
In other words, as is true with coding variants, clinical variant interpretation should only be applied to variants in genes that have an established gene-disease association. The authors specifically suggest “definitive, strong, or moderate level using the ClinGen classification approach or green for the phenotype of interest in PanelApp” though in practice, many labs still rely on OMIM as the arbiter of gene-disease relationships.
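In computational terms, this is a gene-level gate applied before any variant-level evidence is weighed. Here's a minimal sketch of that filter, assuming you've already pulled gene-disease validity classifications from ClinGen and (optionally) PanelApp ratings; the gene names and labels in the example are placeholders, not real curations.

```python
# Minimal sketch: restrict clinical variant interpretation to genes with an
# established gene-disease relationship. Classification labels follow the
# ClinGen validity framework; the example genes/labels below are hypothetical.

ACCEPTED_CLINGEN = {"Definitive", "Strong", "Moderate"}

def interpretable_genes(clingen_validity, panelapp_ratings=None):
    """Return the set of genes eligible for clinical variant interpretation.

    clingen_validity: dict of gene symbol -> ClinGen validity classification
    panelapp_ratings: optional dict of gene symbol -> PanelApp rating
    """
    panelapp_ratings = panelapp_ratings or {}
    eligible = {g for g, c in clingen_validity.items() if c in ACCEPTED_CLINGEN}
    eligible |= {g for g, r in panelapp_ratings.items() if r == "Green"}
    return eligible

# Example with made-up classifications
validity = {"RPE65": "Definitive", "GENE_X": "Limited"}
panels = {"GENE_Y": "Green", "GENE_Z": "Amber"}
print(interpretable_genes(validity, panels))  # {'RPE65', 'GENE_Y'}
```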
Defining Cis-Regulatory Elements for Genes
The expert panel defined several types of noncoding regions in which variants can be interpreted, including:
- Introns and UTRs as defined by well-validated transcripts, e.g. the canonical transcripts designated by the MANE database.
- Promoters, defined as the region of open chromatin surrounding the canonical transcription start site (TSS) as delineated by functional epigenetic data (e.g. ATAC-seq, DNase-seq) in a disease-relevant tissue or cell type. If such data are not available for a relevant tissue or cell type, consensus open chromatin regions from ENCODE can be used. If even that is not possible, one can use the TSS +/- 250 bp (see the brief sketch at the end of this section).
- Other cis-regulatory elements (e.g. transcription factor binding sites from ChIP-seq or H3K27Ac/H3K4Me1-marked active enhancers), provided that they have experimental evidence linking them to the gene of interest.
The most obvious evidence of a link between a CRE and a target gene would be something like Hi-C, which captures regions that physically interact with gene promoters. However, the authors would also accept regions whose functional perturbation has a demonstrated effect on expression, or which harbor one or more established expression quantitative trait loci (eQTLs) for the target gene.
Naturally, most of these strategies for CRE definition require access to rich epigenetic data, which not everyone currently enjoys. The authors recognize this and express their hope that the research community will make such datasets publicly available.
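To make the fallback promoter definition concrete, here's a tiny sketch of the TSS +/- 250 bp window check mentioned in the list above. The coordinates in the example are invented, and a real implementation would of course prefer tissue-relevant open chromatin data when available.

```python
# Minimal sketch of the fallback promoter definition (TSS +/- 250 bp), used only
# when no tissue-relevant open-chromatin or ENCODE consensus regions are available.

def fallback_promoter(tss, flank=250):
    """Return the (start, end) window centered on the transcription start site."""
    return (max(0, tss - flank), tss + flank)

def variant_in_fallback_promoter(variant_pos, tss, flank=250):
    start, end = fallback_promoter(tss, flank)
    return start <= variant_pos <= end

# Example with an invented TSS coordinate
print(variant_in_fallback_promoter(1_000_180, tss=1_000_000))  # True
print(variant_in_fallback_promoter(1_005_000, tss=1_000_000))  # False
```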
ACMG Evidence for Noncoding Variants
The authors next lay out some specific ACMG evidence codes along with guidance on whether and how they can be adapted to apply to noncoding variants.
Evidence Codes That Already Apply
Of course, as noted by the authors, some types of evidence are agnostic to variant location and can be applied as-is, including:
- Allele frequency information (evidence codes PM2, BS1, BS2, and BA1).
- De novo variant evidence (codes PS2 and PM6)
- Co-segregation evidence (codes PP1 and BS4)
These types of evidence are the “gimmes” for most variant interpretation, but I should point out that under the current modified guidance, most of them provide either benign evidence or weak-to-moderate evidence of pathogenicity (e.g. PM2 should now only be applied at supporting level). So these do help, but only a little. The exception is a de novo variant, which in many cases could qualify for PS2 (possibly downgraded). De novo mutations, unlike inherited variants, are exceptionally rare — on average, 60-70 per individual genome-wide compared to 4-6 million inherited variants — and I think that when tackling noncoding variants, these are a good place to start.
Adaptation of Evidence Codes for Noncoding Variants
Here are the main ACMG evidence codes discussed, their original definitions under 2015 guidelines, and the expert recommendation on if/when they can be applied to noncoding variants:
| Evidence Code | Original Definition (ACMG 2015) | Adapted Definition for Noncoding Variants |
| --- | --- | --- |
| PVS1 | Predicted null variant (nonsense, frameshift, start loss, essential splice site) in a gene where loss-of-function is a known mechanism of disease | (You wish.) The authors state that PVS1 should not be used for noncoding variants. |
| PM1 | Located in a mutational hotspot and/or well-established functional domain without benign variation | Disrupts a TF binding motif whose disruption is shown to be pathogenic, or maps to a cluster of pathogenic variants within a well-defined CRE |
| PS1 | Same amino acid change as a well-established pathogenic variant, but caused by a different nucleotide change | At supporting level, for splicing variants where a different substitution at the same position has been classified as LP, provided this one has a similar predicted effect |
| PM5 | Disrupts the same residue as a well-established pathogenic variant, but changes it to a different amino acid | Predicted to have the same impact on the same gene as established pathogenic variants, but not at the same base |
| PM3 | For recessive disorders, detected in trans with a known pathogenic variant | Same usage, provided the variant is rare enough, but at supporting level for genes where this is more likely to occur by chance (large size / high mutation rate) |
| PS3/BS3 | Well-established functional studies show a deleterious effect (PS3) or demonstrate no effect (BS3) | RNA-seq: PS3 if aberrant splicing isoforms are detected, BS3 if not, in both cases provided that the expression profile is similar to controls and sufficient depth is achieved. Reporter gene assays: PS3 if there are significant differences between variant and wild-type constructs, provided that the system is validated. MAVE assays should follow existing guidance (PMID: 31862013). |
| PP3/BP4 | Computational tools predict a deleterious effect (PP3) or a benign impact (BP4), often based on REVEL score for missense variants | Similar usage, but computational tools designed for noncoding variants should be used. Examples: CADD, DANN, ReMM (Genomiser), FATHMM-MKL, GREEN-DB |
The authors offer numerous caveats and cautions for many of the above codes for the obvious reason that the study of regulatory regions (and the impact of variants therein) is a rapidly evolving field of research. Also, I find some of this guidance a bit confusing. For example, the authors note that PS1 should be downgraded to supporting level for a different nucleotide change at the same position as an established variant, but they do not mention downgrading PM5, which can be applied for variants at different positions expected to have the same regulatory impact. Of course, some areas of guidance are intentionally left vague, such as the choice of computational tools/thresholds to use for PP3/BP4.
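Because the tool and threshold choices for PP3/BP4 are deliberately left to the laboratory, any implementation is a local policy decision. Here's a hedged sketch of what such a policy might look like using a CADD-style PHRED-scaled score; the cutoffs are placeholders, not recommended values.

```python
# Sketch of applying a noncoding-aware computational score for PP3/BP4.
# The thresholds below are illustrative placeholders; the published
# recommendations intentionally leave tool and cutoff choices to the lab.

def pp3_bp4_from_score(score, deleterious_cutoff=20.0, benign_cutoff=10.0):
    """Map a PHRED-scaled score (e.g. CADD) to a computational evidence code.

    Returns 'PP3' (predicted deleterious), 'BP4' (predicted benign), or None
    when the score is missing or indeterminate.
    """
    if score is None:
        return None
    if score >= deleterious_cutoff:
        return "PP3"
    if score <= benign_cutoff:
        return "BP4"
    return None

# Example with made-up scores for three noncoding variants
scores = {"chr1:12345A>G": 24.1, "chr1:22222C>T": 3.5, "chr1:33333G>A": 14.0}
for variant, score in scores.items():
    print(variant, pp3_bp4_from_score(score))
```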
Noncoding Variants: The Time Is Now
We know noncoding variants are important, and these guidelines provide a reasonable starting point for selecting ones to report and interpreting them with clinical rigor. It’s time to get to work.
The Challenges of Variants of Uncertain Significance (VUS)
As genetic testing continues to expand in both clinical and research settings, variants of uncertain significance (VUS) present a persistent challenge. For the uninitiated, VUS is one of the five classifications assigned to genetic variants under ACMG guidelines, each reflecting the likelihood that a variant causes disease.
Generally, uncertain significance is the default classification for variants that cannot otherwise be classified as pathogenic/likely pathogenic (i.e. disease causing) or benign/likely benign (not disease causing).
If you’re familiar with genetic testing trends over the last decade, you probably know that VUS are increasingly prevalent on genetic testing reports. In part, that’s due to technological advances, e.g. high-throughput DNA sequencing, that make it possible to interrogate more of the patient’s genome in a timely and cost-effective manner. Gene panels for specific conditions now often encompass thousands of genes, and comprehensive testing — genome or exome sequencing — is increasingly available as a first-tier or second-tier test.
Increasing knowledge — specifically, the number of genes associated with disease — is another important contributor to the VUS explosion. The pace of gene discovery accelerated in the NGS era and continues to grow, as illustrated by the gene-discovery statistics provided by the Online Mendelian Inheritance in Man (OMIM) database.
Long story short, more variants detected in every patient (by comprehensive sequencing) combined with more genes that are possibly reportable (due to association with disease) means more variants on genetic testing reports. And, as I’m about to tell you, for reportable variants, VUS will often be the expected classification.
ACMG Variant Interpretation 101
First, a very brief introduction to the types of evidence that are used when interpreting variants and how they are represented. ACMG evidence codes are letter/number combinations.
- The first letter indicates the type of evidence (Pathogenic or Benign).
- The second 1-2 letters indicate the strength of evidence (Very Strong, Strong, Moderate, or SuPporting).
- The number is a category we use to keep them all straight.
So for example, PS2 is the evidence code applied when a variant occurs de novo in a patient with confirmed maternity/paternity. This is the second (2) type of strong (S) evidence of pathogenicity (P), hence the code PS2. For another example, when a variant’s population allele frequency is greater than expected for the disorder, it gets the code BS1 (the first type of benign strong evidence). The weakest level of evidence, supporting, is given the strength-designation P. For example, BP1 applies when you have a missense variant in a gene in which almost all disease-causing variants are truncating/loss-of-function.
When a variant is assessed, each type of evidence is evaluated to see if it applies. The final set of evidence codes is then combined, using rules specified in the guidelines, to determine the classification. For example, a variant with very strong (VS) evidence of pathogenicity requires only one additional strong evidence code to be classified as pathogenic. If that second piece of evidence is of moderate strength, the variant would be classified as likely pathogenic. There's a similar formula for benign/likely benign evidence.
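To make the code structure and combining logic tangible, here's a simplified sketch that parses evidence codes and applies only the combinations described above (very strong plus strong yields pathogenic, very strong plus moderate yields likely pathogenic, conflicting evidence defaults to VUS). The full 2015 rule set has many more branches, so treat this as illustration rather than an implementation of the guidelines.

```python
# Simplified sketch of ACMG evidence-code parsing and a small subset of the
# combining rules discussed in this post. The published guidelines define many
# more combinations (and the benign-side rules), which are omitted here.

STRENGTHS = {"VS": "very_strong", "S": "strong", "M": "moderate",
             "P": "supporting", "A": "stand_alone"}

def parse_code(code):
    """Split a code like 'PS2' or 'BP4' into (direction, strength)."""
    direction = "pathogenic" if code[0] == "P" else "benign"
    return direction, STRENGTHS[code[1:-1]]   # 'S' from 'PS2', 'VS' from 'PVS1'

def classify(codes):
    tally = {}
    for code in codes:
        key = parse_code(code)
        tally[key] = tally.get(key, 0) + 1

    directions = {d for d, _ in tally}
    if directions == {"pathogenic", "benign"}:
        return "VUS"                           # conflicting evidence defaults to VUS

    very_strong = tally.get(("pathogenic", "very_strong"), 0)
    strong = tally.get(("pathogenic", "strong"), 0)
    moderate = tally.get(("pathogenic", "moderate"), 0)
    if very_strong and strong:
        return "Pathogenic"                    # e.g. PVS1 + PS2
    if very_strong and moderate:
        return "Likely pathogenic"             # e.g. PVS1 + PM2
    return "VUS"                               # insufficient evidence (in this sketch)

print(classify(["PVS1", "PS2"]))               # Pathogenic
print(classify(["PVS1", "PM2"]))               # Likely pathogenic
print(classify(["PM2", "PP3", "BP1"]))         # VUS (conflicting evidence)
```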
How We Get To VUS
What if we have some evidence that a variant is pathogenic, but not enough to meet this threshold? Or worse, what if we have a lot of benign evidence but one pathogenic code? The ACMG guidelines address both scenarios.
VUS Due to Conflicting Evidence
Under the ACMG framework, every variant is assessed both for pathogenic and benign evidence criteria. It is thus quite possible — and does happen on a regular basis — that a variant has both types of evidence, i.e. conflicting evidence. For example, a variant that does not segregate with disease in a family (BS4) and has no predicted effect on the encoded protein (BP4) might still be rare in the general population (PM2).
Another example we often encounter is a missense variant that is rare (PM2), segregates with disease (PP1), and is computationally predicted to be damaging (PP3), but in a gene in which most known disease-causing variants are null variants (BP1). As written above, under ACMG rules, any variant with both types of evidence, no matter the tipping of the scale, defaults to VUS.
VUS Due to Insufficient Evidence
This is the more common pathway to classifying a variant as VUS: there is not enough evidence of pathogenicity to meet the threshold of likely pathogenic, or there’s benign evidence but not enough for likely benign. For example, a novel missense variant (PM2) in a dominant disease gene which is computationally predicted to damage the encoded protein (PP3), without additional evidence, is a VUS (PM2, PP3). Missense variants in general struggle to garner enough evidence to reach pathogenicity due to the strength of evidence codes that can be applied to them; more on that in the next section.
Variants in new or emerging disease genes are especially prone to the “Insufficient Evidence” VUS classification because the etiology of disease is still being established. If only a handful of disease-causing variants have been reported, it’s often difficult to ascertain:
- Whether null variants or missense variants are the predominant type of causal variants
- The presence of mutation hotspots or critical functional domains in which variants almost always cause disease
- The maximum population frequency of established disease-causing variants
Also, relatively new disease genes rarely have robust functional studies that can be used to enhance variant classification. These problems are all exacerbated for missense variants.
Some Variants Have It Easy: Null Variants and De Novo Mutations
You will note that having Very Strong evidence gets you a long way toward classifying a variant as pathogenic. Unfortunately, there is only one type of evidence that carries the weight Very Strong: PVS1. This is reserved for null variants, i.e. the types of variants (e.g., nonsense, frameshift, canonical ±1 or 2 splice sites, initiation codon, single exon or multiexon deletion) that are “assumed to disrupt gene function by leading to a complete absence of the gene product by lack of transcription or nonsense-mediated decay of an altered transcript.” (Richards et al 2015).
As I mentioned earlier, null variants that qualify for PVS1 only need one more piece of moderate-strength evidence to reach likely pathogenic. That’s great for null variants, but such variants represent a tiny fraction of the variants encountered in most genes in most patients. Missense variants are far more prevalent but face an uphill battle toward pathogenicity.
It’s a similar story for de novo mutations: a variant in a dominant disease gene that occurred de novo is rewarded with the strong evidence code PS2. That’s a long way toward a pathogenic classification. However, applying PS2 requires that you test both parents *and* that the parental relationships, especially paternity, have been confirmed. Again, it’s great when the stars align and you can do this. It’s also one of the reasons most labs prefer having family trios (proband and both parents) whenever possible. Yet we live in the real world where:
- Children are sometimes adopted or in foster care
- Families cannot afford all available testing
- Parents may no longer be alive
- Parents may be incarcerated
- Parents may be unwilling to participate in genetic studies
In these situations, testing both parents is not an option and that usually prevents some of the strongest evidence from being applied.
Consequences of Null and De Novo Variant Bias
The biases that favor null variants and de novo mutations may have scientific underpinnings, but they also exert real-world consequences that often skew the perceptions of emerging disease genes. Let’s be honest: it is far easier to publish a cohort of patients who all have de novo loss-of-function mutations in the same gene. I have been a part of multiple GeneMatcher collaborations in which the study leaders either gave preference to patients with null / de novo variants or were forced to do so to get the work published.
This often means the first few papers linking a gene to a disease describe only de novo / null variants, and that becomes the expected etiology of disease. Even established disease genes can be affected by the null-variant bias: because missense variants are harder to classify as likely pathogenic, they often remain VUS. Anyone who glances at the landscape of disease-causing (P/LP) variants for a gene might (incorrectly) assume that only null variants cause disease. I strongly encourage researchers to push back when they are told that the first paper will only include the “easy button” LOF/de novo variants.
Effects of Changing Variant Interpretation Guidelines
The ACMG 2015 guidelines for interpretation of sequence variants were published 8 years ago this month. It was an important milestone in our field, the members of which increasingly recognized that many variants reported as disease-causing were (in retrospect) probably not. The methods for classifying variants were inconsistent, and there was no universal set of rules that someone could apply. The 2015 guidelines provided such a framework.
However, a lot can change in eight years, and although the ClinGen Sequence Variant Interpretation (SVI) working group has released subsequent recommendations on the use of computational evidence for missense variants and on refining the classification of splicing variants, these remain interim guidance.
The long-awaited revised framework for variant interpretation, which implements a points system to improve accuracy and consistency, is not yet published. In theory, it will help us resolve some VUS; that remains to be seen. Just as some types of evidence may be assigned higher strength (e.g. computational predictions of variant impact), other types may be blunted (e.g. rarity in the population). We won't know until the revised guidelines are published, which probably will not be in 2024.
What About Variants in Candidate Genes?
I should take this moment to remind you — as I sometimes have to remind myself — that ACMG variant interpretation should only be applied to sequence variants that affect established disease genes. It should not really be used for variants in unknown genes or candidate genes not yet associated with human disease. That’s because we can’t assess pathogenicity of a variant without a definitive link between the gene and disease.
For clarity, we try to avoid the use of VUS when discussing candidate genes. Occasionally I hear the term GUS — for Gene of Uncertain Significance — and I really like it, but it does not seem to have gained much momentum.
More VUS, More Problems
The increasing number of VUS on genetic testing reports — and our inability to definitively classify them — present significant challenges for clinicians, laboratories, and patient families.
- For the lab, a VUS is a non-diagnostic outcome. They can be reported, but generally in the dreaded “Section 2” of the test report.
- Patients with VUS thus may not qualify for gene therapy or clinical trials if those are available.
- Clinicians must decide whether or not to pursue further testing, either to clarify the VUS or to keep searching.
Which VUS Merit Further Scrutiny?
It’s important to emphasize here that not all VUS are created equal. Because of the conflicting-evidence-means-VUS rules described above, plenty of variants receive this classification but are extremely unlikely to be disease-causing. On the other hand, sometimes VUS offer a promising potential diagnosis in a patient who otherwise has no significant findings. Perhaps the most important question to be answered is the phenotypic overlap, i.e. whether the gene’s associated condition matches the patient clinical presentation. This is why good clinical phenotyping is critical for genetic testing, especially when interpreting uncertain results.
The number, strength, and types of ACMG evidence codes that accompany a VUS classification are also relevant considerations. Some of the proposed revisions of variant classification guidelines allow for tiering of VUS into subsets representing the amount of pathogenic evidence behind them. If these come to pass, they'll offer laboratories a useful communication tool. In the meantime, I tend to refer to VUS as weak or strong, with the latter category possibly warranting follow-up. Examples of strong VUS include:
- A VUS that is compound-heterozygous with a pathogenic variant in a recessive disease gene.
- A VUS with multiple pathogenic criteria that segregates with disease in a gene that fits the phenotype. For example, VUS (PM2, PP3, PP5) would indicate a rare variant that’s predicted to be deleterious and has been reported as disease-causing by another laboratory.
- A VUS with a predicted effect that could be evaluated by additional testing, such as metabolic/biochemical testing or even RNA-seq for potential splice variants.
This leads to the last section of my post, the million-dollar question.
How Can We Resolve a VUS?
I get asked this question all the time. Honestly, if you're reading this post and have some ideas, I'd love it if you shared them in the comments section below. Note that resolution can go either way: building a case for pathogenicity for a suspicious VUS, or ruling out a VUS that might otherwise be a concern. Here are some strategies we and other groups have tried.
- Segregation testing. Determining the segregation pattern and disease status in family members informs, at the very least, the plausibility of a variant's fit, and there are ACMG evidence codes for both segregation (PP1) and non-segregation (BS4).
- Clinical evaluations. The clinician can review patient/family medical records, bring them in for another clinical visit, or refer them to a relevant specialty to determine the presence (or absence) of clinical features associated with the disorder.
- Checking the latest population allele frequency databases to determine the variant’s prevalence in presumed-healthy individuals.
- Reaching out to other laboratories who have reported the variant according to ClinVar or the literature can sometimes yield useful information.
- Identifying additional patients with the variant can provide or strengthen certain categories of pathogenicity evidence. This is something we use in my ClinGen Variant Curation Expert Panel to resolve VUS in the RPE65 gene.
- Additional patient testing, such as biochemical/metabolic testing, methylation profiling, etc. that would support or exclude the diagnosis
- Variant functional studies in cells, organoids, or animal models. Obviously we’d love this to clarify any variant, but it can be expensive and time-consuming.
Sometimes strategies like the ones above can push a variant to a more definitive classification, and sometimes not. The hard truth is that some VUS cannot be resolved at the present time. Formal classifications aside, the clinicians can make their own judgements about uncertain findings, and counsel and treat the patients accordingly.
The Importance of Patient Phenotype in Genetic Testing
The tools and resources we have for human genomic analysis continue to grow in scale and quality. Computational tools like REVEL and SpliceAI leverage machine learning to provide increasingly accurate predictions of the effects of variants. Public databases of sequence variation like the newly expanded gnomAD tell us how common they are in populations. Community-supported resources like ClinVar continue to curate disease-gene associations and interpretations of those variants.
It follows that genetic testing should continue to improve, especially in the setting of rare disorders. Ten years ago, some of the earliest exome sequencing studies of Mendelian disorders showed that with a fairly straightforward filtering approach, it was possible to winnow the set of coding variants identified in a patient (usually in the tens of thousands) to just a handful of compelling candidate variants. And that was ten years ago.
Predictive Genomics for Rare Disorders
Recently, I began to wonder if we are approaching a GATTACA-like future of predictive genomics, at least for rare genetic conditions. If you obtained the genome sequence of a family trio and all you knew was that the proband has a genetic disorder, would it be possible to identify the most likely causal variant(s) and thus predict the rare disorder the patient has? The answer to this question is probably obvious from the theme of this post, but let’s consider it as a thought experiment. So you have trio WGS data from a family and all you know is that the proband is affected. Maybe it’s a severely ill baby in the NICU, or an as-yet-undiagnosed patient coming to Genetics clinic. In this scenario, you might:
- Run the trio WGS data through your existing pipeline to identify genomic variants (SNVs, indels, and CNVs).
- Annotate all variants with population frequency, in silico predictions, gene / disease associations, ClinVar status, etc.
- Identify variants that fit a Mendelian inheritance model (de novo, recessive, or X-linked)
- Remove variants that are too common in populations to cause the disease associated with their gene
- Apply automated variant interpretation to determine which variants reach pathogenicity
- Retain pathogenic variants in disease-associated genes that fit the inheritance for those genes
With these fairly intuitive steps, you’ll likely get a rather short list of candidates and it would be straightforward to rank them so that the most probable diagnostic findings are at the top. This process would be very amenable to automation, so it could be done at speed and scale. Will that become the new paradigm for genetic testing in rare disorders?
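As a thought experiment in code, a minimal sketch of that filtering-and-ranking logic might look like the following. The variant records, frequency cutoff, and ranking heuristic are all hypothetical stand-ins for what a real annotation pipeline and interpretation engine would produce.

```python
# Hypothetical sketch of the candidate-ranking thought experiment above.
# Variant records and thresholds are placeholders, not a production pipeline.

MAX_POPULATION_AF = 0.001   # illustrative cutoff for a rare disorder

MENDELIAN_MODELS = {"de_novo", "homozygous", "compound_het", "x_linked"}

def candidate_variants(annotated_variants, known_disease_genes):
    """Filter annotated trio variants down to plausible diagnostic candidates."""
    candidates = []
    for v in annotated_variants:
        if v["gene"] not in known_disease_genes:
            continue                             # must hit an established disease gene
        if v["population_af"] > MAX_POPULATION_AF:
            continue                             # too common to cause a rare disorder
        if v["inheritance"] not in MENDELIAN_MODELS:
            continue                             # must fit a Mendelian inheritance model
        if v["classification"] not in {"Pathogenic", "Likely pathogenic"}:
            continue                             # keep P/LP calls from automated interpretation
        candidates.append(v)
    # Crude prioritization: de novo variants first, then by rarity
    return sorted(candidates,
                  key=lambda v: (v["inheritance"] != "de_novo", v["population_af"]))

# Example with two invented, pre-annotated variants
variants = [
    {"gene": "GENE_A", "population_af": 0.0, "inheritance": "de_novo",
     "classification": "Likely pathogenic"},
    {"gene": "GENE_B", "population_af": 0.0005, "inheritance": "homozygous",
     "classification": "Pathogenic"},
]
print(candidate_variants(variants, known_disease_genes={"GENE_A", "GENE_B"}))
```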
The Missing Component of Genome-Driven Analysis
The predictive genetics approach described above has a rational basis but does not account for some crucial information: the clinical phenotype and family history of the patient being tested. Clinical correlation — the overlap between patient symptoms and disease features — has an outsized influence on whether or not a result can be considered diagnostic. In our work, which is research, we encounter (at a surprising frequency) genetic variants that:
- Are in a known disease gene
- Segregate with the inheritance mode associated with that gene
- Reach pathogenicity under ACMG guidelines, but
- Are associated with a disease that is not clinically apparent in the proband.
In a world where the majority of tests are non-diagnostic and variants of uncertain significance (VUS) are increasingly prevalent, it is hard to ignore these variants. Naturally, we go back to the clinicians and/or medical records to verify that the patient does not have the disease. If there’s no clinical correlation, these are not considered diagnostic findings. No matter how compelling the variants are.
The Power of Phenotype
Admittedly, my perspective is biased: I work on translational research studies that primarily enroll undiagnosed patients. Often they have already undergone extensive genetic/molecular testing as part of their standard of care. When a clinician orders such testing, they provide patient clinical information. On the laboratory side, especially for comprehensive tests like exome/genome sequencing, patient clinical features are critical. The order forms collect extensive details about patient symptoms, which are converted into standardized disease terms (e.g. HPO terms) and used to identify/prioritize variants for interpretation.
Most rare diseases have genetic origins, and many of the genes responsible give rise to highly specific patterns of patient symptoms. Individually, a single patient symptom may not have significant diagnostic value, but the collective picture of patient clinical features can be very powerful. Especially when some of those features are specific and/or unusual. Even a rudimentary system that ranks a patient’s genetic variants based on clinical feature overlap (the number of features shared between the patient and the disease) helps put the most plausible genetic findings at the top.
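Here's what such a rudimentary ranking could look like, sketched with a handful of HPO-style terms. The disease-to-term mappings in the example are invented; a real system would draw them from curated resources and weight terms by specificity rather than simply counting overlaps.

```python
# Rudimentary phenotype-overlap ranking: count HPO terms shared between the
# patient and each candidate gene's associated disorder. Term sets are invented
# for illustration; real systems weight terms rather than just counting them.

def rank_by_phenotype_overlap(patient_terms, disease_terms_by_gene):
    """Return (gene, overlap_count) pairs sorted by descending overlap."""
    overlaps = {gene: len(patient_terms & terms)
                for gene, terms in disease_terms_by_gene.items()}
    return sorted(overlaps.items(), key=lambda item: item[1], reverse=True)

patient = {"HP:0001263", "HP:0001250", "HP:0000252"}   # delay, seizures, microcephaly
diseases = {
    "GENE_A": {"HP:0001263", "HP:0001250", "HP:0000252", "HP:0000486"},
    "GENE_B": {"HP:0001263"},
}
print(rank_by_phenotype_overlap(patient, diseases))    # GENE_A ranks first
```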
Good clinical phenotyping also provides a powerful tool to exclude candidate findings. This is useful because some medical conditions that warrant genetic testing are associated with a wide range of disorders. In the pediatric setting, for example, global developmental delay is associated with thousands of genetic disorders and thus casts a very wide net. However, for many such disorders, global delays occur alongside a number of other distinctive clinical features. If these are not present in the proband, they can often be ruled out. This reduces the search space and interpretation burden for the laboratory.
Limitations of Phenotype-Driven Analysis
Despite these advantages, a phenotype-centric approach to genetic testing has some important limitations.
- Variable expressivity. Many genetic disorders have clinically significant features that can vary from one patient to the next, even within families.
- Phenocopies. I love this word, which refers to disorders that resemble one another clinically but have different underlying causes.
- Pleiotropy. On the other hand, some genes give rise to multiple disorders which can be clinically very distinct.
- Phenotype expansion. For many genetic disorders, our understanding of the full phenotypic spectrum changes over time. This is especially true for new/emerging rare disorders for which the clinical description is based on a small number of patients.
- Patient evolution. For many patients, the clinical picture changes over time. In the pediatric setting this is a major consideration, as lots of key diagnostic features take time to manifest or be clinically apparent.
- Blended phenotypes. At least 5-10% of patients suspected of having a monogenic disorder have multiple genetic conditions, and their presentation can thus be a confounding combination of the associated features.
The OMIM Curation Bottleneck
The Online Mendelian Inheritance in Man (OMIM) database is one of the most vital resources in human genetics. For many/most laboratories, OMIM is the primary and definitive source for the genes, inheritance patterns, and clinical manifestations associated with genetic disorders. The information in OMIM is curated from the peer-reviewed biomedical literature by trained experts at Johns Hopkins University. This manual curation is why the resource is so widely trusted by the community. However, it’s a double-edged sword because curation takes expertise, time, and funding. The latter two have been a challenge for OMIM, especially since the pace of genetic discovery has accelerated in the past decade. Simply put, there’s way too much literature for OMIM to curate it all.
This bottleneck has real consequences. We look to OMIM as our trusted source of information about disease genes, but that information is increasingly outdated or incomplete. Given the powerful influence that clinical correlation has over genetic testing results… well, it’s a problem. And not one that the OMIM curators will be able to solve on their own. The good news is that there are more sustainable efforts under way. ClinGen, for example, is both standardizing the way information is collected/curated and leveraging expert volunteers (i.e. crowdsourcing) from the community to manage the workload. We still have a long way to go because ClinGen is a relatively new endeavor. However, it’s a more sustainable model that we should continue to support with funding and volunteerism.
In other words, if you’re not part of a ClinGen working group or panel, please think about joining one.
Clinical Genome Sequencing Replaces Exome Sequencing
This month our clinical laboratory began offering genome sequencing as an orderable in-house test. It’s a milestone achievement made possible by a talented multidisciplinary team and 3+ years of pre-clinical work under a translational research study. Yes, clinical genome sequencing was already available to our clinical geneticists — as a sendout test to commercial laboratories — but there are distinct advantages to providing this state-of-the-art test in-house. Especially the rapid genome sequencing (rGS) test, for which results are called out in just a few days. We have years of data showing that genomic testing results can inform patient care in acute cases. Not to over-hype it, but sometimes it saves lives.
Still, that is not my story to tell, so this post is more about the transition from exome to genome sequencing in a (pediatric) hospital setting. It seems likely that many institutions (not just ours) will make the leap this year. There are several factors driving this change, but one of them is simply the ever-increasing speed and throughput of next-generation sequencing instruments. For a long period, approximately 2014-2020, exome sequencing was the more practical choice as the mainstay comprehensive genetic test.
The Exome Advantage
Often patients who qualified for genetic testing would first get cytogenetic and microarray testing for chromosomal abnormalities and CNVs, respectively. Depending on the patient’s clinical features, the next step would often be a gene panel, followed by exome sequencing. As a clinical test, exome sequencing was attractive as a comprehensive test because:
- Exome capture kits had matured significantly, achieving consistent coverage and enabling fairly reliable deletion/duplication calling.
- In terms of laboratory costs, generating ~40-50 Gbp (gigabase pairs) of data per sample was far cheaper than the ~120 Gbp required for a genome.
- Turnaround times were pretty good.
- The variant interpretation was likely to be gene- and exon-centric anyway.
Simply put, exome sequencing interrogated virtually all genes with a reasonable turnaround time and cost, so it made sense as the comprehensive test. If it was working so well, the natural question might be:
Why move to genome sequencing?
Speed, for one thing. The hybridization process (where probes capture target regions) adds about 1-1.5 days to the laboratory prep time between library creation and when things get loaded on the sequencer. The instruments are now so fast that this increases the lab time by about 50% compared to going straight to genome. The throughput is also so high that exome libraries need to be increasingly multiplexed (i.e. run lots of things at once) to be sequenced. Believe it or not, that can also introduce a delay because one has to wait until enough samples have accumulated to pool and sequence them.
“We don’t have enough samples to sequence” is a phrase I never thought I would hear. Man, how a decade can change things.
Reagent costs are a factor, too, since exome kits cost money. As the per-base cost of sequencing goes down, the savings you get from exome capture instead of genome decrease as well. The capture also requires more input DNA, which can be an issue when dealing with precious clinical samples. So genome sequencing is faster, requires less DNA, and ends up costing about the same for reagents. That’s on top of the obvious advantages GS offers in terms of variant detection.
Does genome sequencing have a higher diagnostic yield than exome sequencing?
In most cases, it should. That’s the theoretical answer. GS interrogates both coding and noncoding regions, and it’s better suited to detecting copy number variants (CNVs) and structural variants (SVs) because the breakpoints of such variants often lie in noncoding regions. Plus, exome capture introduces some hybridization biases which, while somewhat addressable during analysis, make it harder to detect changes in sequence depth that signal the presence of a copy number variant.
However, in my opinion, a major diagnostic advantage of genome sequencing comes from its ability to cover genes and exons that don’t play nicely with exome capture. Immune system genes, for example, are notorious for their poor coverage by exome sequencing. We have numerous examples of diagnostic variants uncovered by genome sequencing which were missed by exome testing due to coverage. From the clinician’s point of view, genetic test results from genome sequencing (even when nondiagnostic) come with more confidence that all of the relevant exons and genes have been interrogated.
A second advantage of genome sequencing is the ability to find deep intronic “second hits” in patients who have a single pathogenic variant in a recessive disease gene. Under exome sequencing, you generally have to do another test. With genome data, labs can at least screen nearby noncoding regions (introns, etc) to see if a second variant is present. Computational tools to predict splicing effects of variants have improved substantially in the past few years to the point where SpliceAI scores have been incorporated into ACMG/AMP guidelines. With clinical GS, upon the identification of a single variant in a promising recessive gene, labs can thus screen the data for rare variants in trans that are predicted to disrupt splicing. We have done this in a translational research setting and I think it will be a major source of improved diagnostic rates.
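A screen like that can be expressed in just a few lines once the genome data have been annotated. Below is a hedged sketch assuming each variant record already carries a population frequency and a maximum SpliceAI delta score; the gene coordinates, frequency cutoff, and SpliceAI threshold are placeholders, and a real workflow would also confirm that the candidate is in trans with the first hit.

```python
# Sketch of screening for a splice-disrupting "second hit" in a recessive gene.
# Variant records, coordinates, and thresholds are illustrative placeholders;
# phase (in trans vs in cis with the first hit) still needs to be confirmed.

SPLICEAI_CUTOFF = 0.5      # illustrative; labs set their own thresholds
MAX_AF = 0.001             # illustrative rarity cutoff

def splice_disrupting_second_hits(variants, gene, first_hit_pos):
    """Return rare noncoding variants in `gene` predicted to disrupt splicing."""
    hits = []
    for v in variants:
        if v["gene"] != gene or v["pos"] == first_hit_pos:
            continue
        if v["population_af"] > MAX_AF:
            continue
        if v["spliceai_max_delta"] >= SPLICEAI_CUTOFF:
            hits.append(v)
    return hits

# Example with invented RPE65 variant records
variants = [
    {"gene": "RPE65", "pos": 68906000, "population_af": 0.0, "spliceai_max_delta": 0.82},
    {"gene": "RPE65", "pos": 68907500, "population_af": 0.004, "spliceai_max_delta": 0.61},
]
print(splice_disrupting_second_hits(variants, "RPE65", first_hit_pos=68895000))
```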
When should a clinical genome be ordered for an exome-negative patient?
This is an important question as clinical GS becomes more widely available. We know that 50-70% of exome tests are nondiagnostic, and it’s reasonable to assume that most patients who have undergone comprehensive testing in the last decade had an exome, not a genome. As I wrote in my recent post on post-exome strategies for Mendelian disorders, a negative exome result means that genome should be considered as the next step. If the clinical test already was genome, this changes the calculus.
I think it will be difficult to establish a perfect set of rules because every patient is different. However, I’d suggest that clinical GS should be considered when:
- The WES testing was done more than two years ago. This seems to be the sweet spot for exome reanalysis anyway, because enough new genes and disease-causing variants have been discovered to significantly boost diagnostic rates.
- New and relevant phenotypic information has emerged. Clinical exome testing is almost always guided/driven by the phenotypic data provided to the laboratory. If that changes, so too could the result. In particular, new phenotypes or features with significant genetic associations (dysmorphism, seizures, metabolic changes, neurological/neuromuscular changes, etc.) can significantly impact how variants are considered.
- There is medical urgency. A patient who continues to decline, or whose care is limited due to the lack of diagnosis, stands to benefit.
- Previous testing or new knowledge hints at a possible diagnosis. So-called “Section 2” variants and newly identified genes/pathways relevant for a patient’s phenotype may justify a harder look at certain loci.
What are the limitations of genome sequencing as a first-tier test?
No test is perfect, and despite the many advantages of genome as a first-line test, it comes with some limitations. GS may have similar experimental costs, but it carries higher analysis costs (especially computational processing and data storage) because it generates 3-4x more data per sample. Processing the data is a one-time cost, but storing it is like the Netflix subscription that never ends. The human staffing costs of interpretation are also higher because there are more variants (and detected variant classes) to evaluate. Balancing workload among staff also becomes more challenging, especially for rapid turnaround tests. And on the technical side, there is sequence depth to consider: typical depths for exome sequencing (150-200x) have more power to detect somatic/mosaic variation. Patients undergoing testing for conditions associated with somatic mutations — the obvious example being tumor sequencing — are likely to benefit more from exome or panel testing.
Post-Exome Strategies for Mendelian Disorders
Today I’m delivering the research genomics lecture at NCH’s Myology Training Course, an annual, week-long, in-person training program that covers numerous aspects of clinical, research, and laboratory topics relevant to the field. In a stroke of excellent timing, Monica H. Wojcik and colleagues from the Genomics Research to Elucidate the Genetics of Rare Diseases (GREGoR) Consortium have just published a review on diagnostic testing beyond the exome. In other words, they review the currently available tests and diagnostic procedures that may elucidate a molecular diagnosis for a patient with a Mendelian disorder when exome sequencing (ES) has failed to do so.
This is a subject I know something about, having spent more than a decade studying rare genetic diseases in a large genome center. A negative ES report is no longer the end of the road, as there are numerous other possible strategies to uncover the genetic basis of a rare disorder. This review covers them well, so I thought it would make a useful blog post.
First, it needs to be said that we are living in a golden era of genetic testing. Technological advances have enabled comprehensive assays for genomic interrogation – microarrays, exome sequencing, and genome sequencing – while improvements in bioinformatics and community-created resources like gnomAD and ClinVar have improved our ability to identify and classify disease-causing variants. Best practice guidelines from organizations like the American College of Medical Genetics have been modified to leverage the strengths of these new technologies. Under those guidelines, for patients with congenital anomalies, developmental delays, or intellectual disability, exome/genome sequencing should be the first- or second-line test. One of the reasons for this is the fact that an increasing number of genetic conditions are clinically similar but genetically heterogeneous. Another is the observation that ~5% of diagnosed patients have multiple genetic conditions.
When Comprehensive Genetic Testing Fails
Despite all of this progress, some 50-60% of individuals with a suspected Mendelian condition remain undiagnosed after comprehensive molecular testing. Why does testing fail to produce a diagnosis? It’s a question I spend a lot of time thinking about, discussing with colleagues, and posing to candidates during interviews. Generally, the reasons fall into two categories:
Category 1: The causal variant was detected, but:
- The genetic basis of the disorder is not yet known. New disease genes are being discovered every day, but we have a long way to go.
- The gene is associated with disease, but the patient represents a new phenotypic manifestation or severity, i.e. variable expressivity.
- The variant is inherited from an unaffected parent, i.e. incomplete penetrance.
- There is not enough evidence to call the variant pathogenic. Variant interpretation is challenging, especially in the scenario where information is limited concerning the pattern of disease-causing variants, the origin of the variant in the patient, or both.
Category 2: The causal variant was not detected, because:
- The gene (or exon) is not interrogated by the sequencing assay, i.e. poorly captured for ES or poorly covered for GS.
- The variant is difficult to detect by short-read sequencing, e.g. structural variants and trinucleotide repeat expansions.
- The variant lies in a noncoding region (note, it might well be detected by GS, but could be challenging to interpret).
- The variant is epigenetic, not genetic.
- The disorder is not genetic but has an infectious or other acquired origin.
This is a partial list of reasons that the most comprehensive test available (currently exome sequencing in most situations) fails to make a diagnostic finding.
Post-Exome Testing Options
The central focus of “Beyond the Exome” is the set of options for further diagnostic testing, many of which fall under research. They include:
Exome Sequencing Reanalysis
A key advantage of ES as a genetic test is the ability to re-analyze data when new phenotypic information emerges and/or when some time (usually 2-3 years or more) has passed. The yield of ES reanalysis can vary widely, but a systematic review estimated the increased diagnostic yield at 15% and recommended that reanalysis is warranted 18 months after the initial test. Generally speaking, diagnoses made by ES reanalysis are the result of:
- New gene discovery for Mendelian conditions, i.e. identification of a variant in a gene now associated with disease. Consistently the major contributor to new diagnoses.
- Resolution of previously known variants of uncertain significance (VUS) as pathogenic.
- Improvements in bioinformatics pipelines for variant calling and annotation.
As the authors of this review highlight, diagnoses found on exome reanalysis may also involve known disease genes not previously thought to explain the phenotype, where the clinical interpretation of a variant has changed due to new data: additional clinical information, new variant inheritance information, segregation data from other affected family members, newly published case reports, or an expansion of the phenotype associated with the gene. Clinical re-evaluation and clinician input are also essential in these scenarios.
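One of those reanalysis steps, re-intersecting stored variants with an updated disease-gene list, is simple enough to sketch. The gene names and variant records below are hypothetical; in practice the "new" list would come from updated OMIM/ClinGen curations and the variants from the original ES data.

```python
# Sketch of one exome-reanalysis step: surface stored variants in genes that
# have become disease-associated since the original analysis. Gene names and
# variant records are hypothetical.

def newly_reportable_variants(stored_variants, old_disease_genes, new_disease_genes):
    """Variants whose gene became disease-associated after the initial analysis."""
    newly_added = new_disease_genes - old_disease_genes
    return [v for v in stored_variants if v["gene"] in newly_added]

stored = [{"gene": "GENE_NEW", "hgvs": "c.100C>T"},
          {"gene": "GENE_OLD", "hgvs": "c.5A>G"}]
print(newly_reportable_variants(stored, {"GENE_OLD"}, {"GENE_OLD", "GENE_NEW"}))
# [{'gene': 'GENE_NEW', 'hgvs': 'c.100C>T'}]
```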
Short-read Genome Sequencing
Genome sequencing (GS) in most contexts means “short-read” genome sequencing — paired-end sequencing of 150-bp to 250-bp reads from the ends of ~350-500 bp fragments by whole genome shotgun approaches. Illumina platforms continue to dominate this market. Compared to ES, GS has some key advantages:
- More uniform coverage of genes and exons, including certain genes which are notoriously difficult to capture (especially immune genes) for exome sequencing.
- Identification of copy number variants (CNVs) and structural variants (SVs), usually with better sensitivity and resolution than SNP microarrays.
- Comprehensive interrogation of noncoding regions that may harbor pathogenic variants, such as introns, promoters, regulatory elements, etc.
It’s important to note that while GS is increasingly offered on a clinical basis, it remains fairly exome-centric in terms of the variants reported. In other words, although millions of noncoding variants are identified by GS, our ability to interpret them remains limited. The incremental diagnostic yield of GS in exome-negative patients varies but is probably in the 5-15% range. As expected, some diagnoses made by GS involve types or sizes of SVs that are difficult to detect by other assays. Some are splice-region variants. Some are variants in poorly-captured genes. Yet a significant proportion of findings afforded by GS are made not because the detection was superior, but rather because GS was performed later and on a research basis. This can enable the identification of candidate variants and the exclusion of other genetic causes.
Long-read Targeted/Genome Sequencing
Single molecule long-read sequencing is commercially available on two platforms — Pacific Biosciences and Oxford Nanopore Technologies — and produces reads that are significantly longer than standard GS approaches: 10,000 to 15,000 bp on average, compared to 150 bp. On a per-base level, these platforms have a higher rate of sequencing error than sequencing-by-synthesis (Illumina) approaches, especially in certain sequence contexts (e.g. homopolymers). However, even with slightly diminished accuracy, long reads in this size range are extremely useful for resolving structural variants / complex rearrangements and for interrogating otherwise hard-to-sequence regions of the genome. We have used PacBio long-read sequencing to:
- Identify causal variants in syndromic rare disease patients that were poorly covered / not detected by GS. Example: Polyalanine repeat expansions in HOXD13.
- Resolve the genomic breakpoints of translocations, inversions, and other complex rearrangements.
- Determine the phase of two variants in the same gene, e.g. somatic variants in PTEN in hemimegalencephaly patients.
Naturally, there are disadvantages to long-read sequencing compared to traditional ES and GS approaches. The first and most obvious disadvantage is the cost, which can be 3-4x higher than standard GS. The “DNA cost” is also high, as long read technologies require a large amount of high-molecular-weight DNA. Informatics pipelines are not as mature for long-read technologies, so the analysis cost is higher as well.
RNA Sequencing
Transcriptome profiling by RNA sequencing is, in my opinion, one of the most powerful research genomics tools for undiagnosed patients. RNA can be co-extracted from blood along with DNA, and RNA sequencing is relatively inexpensive. RNA-seq provides a lot of useful information, including:
- Comprehensive gene expression measurements with higher precision than microarray testing.
- Isoform expression, i.e. the expression level of each exon and splicing of adjacent exons.
- Quantification of allele-specific expression, i.e. the balance of alleles of a variant in expressed transcript
- Splicing patterns, including both canonical and disruptive splicing.
RNA-Seq data is most useful when paired with genomic data. In our hands, it has been most useful in identifying “missing” variants in known disease genes (e.g. a second hit in a recessive gene in patients who have a single pathogenic variant by standard testing but otherwise fit the condition well). Many of those missing second hits are deep intronic variants with occult disruptions to canonical splicing, but we also see splice-disrupting variants in coding regions and intronic splice regions (outside canonical splice site, but close to the intron-exon junction). RNA-seq can also resolve VUS by showing the variant’s impact on mRNA transcripts. We have used it both to prove and disprove effects on splicing.
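For readers curious what "showing the variant's impact on mRNA transcripts" looks like at the data level, here's a bare-bones sketch that tallies splice junctions (split reads) over a region of an RNA-seq BAM using pysam. The file path and coordinates are placeholders; dedicated tools and proper control comparisons are what analyses actually rely on, but the underlying signal is junction-supporting reads like these.

```python
# Bare-bones sketch: count splice junctions supported by split reads (CIGAR 'N'
# operations) in an RNA-seq BAM over a region of interest. File and coordinates
# are placeholders; real analyses use dedicated tools and control comparisons.
from collections import Counter

import pysam

REF_CONSUMING_OPS = {0, 2, 3, 7, 8}   # CIGAR ops that advance the reference position

def junction_counts(bam_path, chrom, start, end):
    """Return a Counter of (junction_start, junction_end) -> supporting read count."""
    counts = Counter()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(chrom, start, end):
            if read.cigartuples is None:
                continue
            pos = read.reference_start
            for op, length in read.cigartuples:
                if op == 3:                    # 'N' = skipped reference (intron)
                    counts[(pos, pos + length)] += 1
                if op in REF_CONSUMING_OPS:
                    pos += length
    return counts

# Example usage (hypothetical file and region):
# print(junction_counts("patient_rnaseq.bam", "chr1", 68_894_000, 68_915_000))
```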
The main disadvantage of RNA-seq is that it is only informative if the gene is expressed in an available tissue. Otherwise you get no reads, or too few reads to infer splicing patterns. Many genes are expressed in accessible samples like blood and fibroblasts, which is why RNA-seq from these sources can be useful even for disorders that affect other systems or developmental timepoints. However, we should be cautious about over-promising the number of genes expressed at high enough levels to analyze: in my experience, it’s only around 50%. RNA-seq has made the most gains in disorders for which disease-relevant tissue is available, e.g. muscle diseases (due to muscle biopsy). Many genes have tissue-specific expression, and that tissue often is not available for research testing.
Optical Genome Mapping (OGM) and Epigenetic Methylation Profiling
Both of these are relatively new/emerging technologies with a lot of promise that have already begun to be implemented in some clinical areas. Optical Genome Mapping (OGM) is not sequencing per se, but high-resolution imaging of long labeled DNA molecules coupled with sophisticated informatics to map the physical structure of chromosomes. OGM is therefore very useful for identifying CNVs, SVs, and complex rearrangements with higher resolution (to ~500 bp) than standard of care approaches. Like long-read sequencing, it is costly in terms of reagents and input DNA requirements. OGM in fact requires a specialized sample prep, so you need access to fresh patient material (blood or fresh/frozen tissue) to do the library prep, and that library is only useful for OGM.
Epigenetic profiling, which in the current field usually refers to DNA methylation profiling by microarray, is another diagnostic tool available on a clinical and/or research basis depending on the phenotype. Among Mendelian disorders, it has found the most success in diagnosing neurodevelopmental/ID disorders caused by mutations that alter global methylation profiles, i.e. mutations in methylation pathway genes and transcription factors. Methylation profiles for the test patient are generated and clustered alongside reference cohorts of individuals with known diagnoses, assigning the patient a cluster and a confidence score. Doing this type of analysis thus requires access to a large reference cohort of profiles from the same tissue type from many patients with known disorders. If diagnostic, the result does not provide sequence-level information, i.e. the mutation responsible for the aberrant methylation pattern. But it can tell you where to look, and in some cases it can resolve VUS in genes with abnormal methylation profiles.
Summary
In summary, a nondiagnostic ES report is no longer the end of the road for patients with Mendelian disorders. A growing number of other assays, many of which are only available on a research basis, can provide answers to a considerable proportion of patients and families.