In my last post on clinical interpretation of noncoding variants, I focused mainly on which noncoding variants were appropriate for clinical interpretation and the guidelines for modifying evidence codes for variants with regulatory rather than coding effects. However, the Ellingford et al study in Genome Medicine which proposed those modifications also included some fascinating analyses into what we know about noncoding variants and their role in disease.
Noncoding Variants Are Under-Represented in ClinVar
The first insight is not terribly surprising: variants in noncoding regions known to regulate established disease genes are dramatically under-reported to ClinVar, the central repository for disease-causing variation. If one considers the “genomic footprint” encompassed by all MANE transcripts, broken down by the number of bases in each gene region (promoter, exon, splice site, intron, and UTR), here are some useful reminders:
- Essential splice site regions (i.e. the first two and last two bases of each intron) occupy the smallest slice of the footprint (<<1%) but have an outsized representation among (likely) pathogenic variants (>5%).
- Untranslated regions (UTRs) and upstream promoters account for only 0.34% of high-confidence pathogenic variants, despite occupying approximately the same amount of genomic space as protein-coding exons and having well-established regulatory roles.
- All three of the above are dwarfed by introns, which account for >90% of the genomic footprint of MANE transcripts.
Admittedly, high-confidence pathogenic variants in ClinVar include far more intronic variants than UTR or promoter variants, but more than half of these are in splice sites, which represent only 4 bases of each intron. It should also be noted that the proportion of “variants of uncertain significance” or VUS is much lower among the intronic variant group (around 22%, versus 44% for coding and 63% for UTR variants). This suggests that most intronic variants are submitted because they’re known to cause disease by disrupting splice sites. The non-splice-relevant majority of most introns is not really represented here.
Intronic Second Hits in Undiagnosed Probands
The authors of the Ellingford et al study performed another analysis that I thought was useful but didn’t have room to discuss in my last post. They were discussing guidance on applying the PM3 rule, which is used for recessive disorders when a variant is known to be in trans with an established pathogenic variant. In the original 2015 guidelines, the logical argument behind the rule is that when a variant in a recessive disease gene is demonstrably in trans (on the opposite allele) from a known disease-causing variant, this can be taken as evidence of pathogenicity. Essentially, it helps identify and classify novel pathogenic variants when they occur as compound heterozygotes with a known pathogenic variant.
The natural question, when considering this rule for noncoding variants, is whether it is reasonable to apply the same argument since there could be many, many noncoding variants inherited from the other parent. So the authors went to the Genomics England (GEL) database and obtained 2,016 rare disease trios which:
- Had whole-genome sequencing data for proband, mother, and father
- Were currently nondiagnostic according to GEL (i.e. no Tier 1 or Tier 2 variants)
- Harbored a single rare heterozygous LOF variant in a recessive disease gene
They restricted their search space to 794 genes in which biallelic variants cause disease (i.e. autosomal recessive). There were 2,714 such single variants in 2,016 unsolved probands. The number of probands without a starting single hit is unfortunately not provided. In any case, the authors next queried the WGS data for those trios to identify possible noncoding second hits, i.e. variants in the UTR, intron, or 2kb promoter of the same gene that were in trans (inherited from the other parent) and rare enough to cause a recessive disorder. In total, 37.8% of the single hits (1,027 of 2,714) had at least one noncoding variant in trans. As you might expect from the analysis above, nearly all of these (94%) were in intronic sequences.
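To make the search logic concrete, here is a minimal sketch of how one might look for noncoding second hits in a single proband. It is not the authors' pipeline: the `Variant` record, the `noncoding_second_hits` helper, the gene names, and the `max_af` rarity cutoff are all hypothetical, and it assumes parent of origin has already been worked out from the trio.

```python
from dataclasses import dataclass

# Regions treated as noncoding in the analysis described above.
NONCODING_REGIONS = {"intron", "UTR", "promoter"}


@dataclass(frozen=True)
class Variant:
    """Hypothetical, simplified variant record for one proband."""
    variant_id: str
    gene: str
    region: str          # "exon", "intron", "UTR", or "promoter"
    inherited_from: str  # "mother" or "father"
    allele_freq: float   # population allele frequency
    is_plof: bool = False


def noncoding_second_hits(variants, max_af=0.001):
    """Map each pLOF single hit to rare noncoding variants in trans.

    "In trans" is approximated here as: same gene, inherited from the
    other parent. max_af is an assumed rarity cutoff, not a published one.
    """
    second_hits = {}
    for first in variants:
        if not first.is_plof or first.inherited_from not in ("mother", "father"):
            continue
        other_parent = "father" if first.inherited_from == "mother" else "mother"
        second_hits[first.variant_id] = [
            v.variant_id
            for v in variants
            if v.gene == first.gene
            and v.region in NONCODING_REGIONS
            and v.inherited_from == other_parent
            and v.allele_freq <= max_af
        ]
    return second_hits


# Made-up example data: one maternal pLOF in GENE_A plus four other variants.
proband = [
    Variant("v1", "GENE_A", "exon", "mother", 0.0001, is_plof=True),
    Variant("v2", "GENE_A", "intron", "father", 0.0002),    # in trans, rare
    Variant("v3", "GENE_A", "intron", "mother", 0.0002),    # in cis with v1
    Variant("v4", "GENE_A", "UTR", "father", 0.05),         # in trans, too common
    Variant("v5", "GENE_B", "promoter", "father", 0.0001),  # different gene
]
hits = noncoding_second_hits(proband)
```

In this toy example only `v2` survives as a candidate second hit for `v1`; the others are excluded for being in cis, too common, or in a different gene.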
In a practical application, evaluating those second hits would require a detailed interpretation of possible regulatory elements and variant effects. As a proxy for these, however, the authors simply asked what proportion of intronic variants had a SpliceAI score of >0.20, indicating a good probability of splice disruption. This relatively low bar eliminates most variants. Only 55 of 2,418 (2.2%) in trans variants have a SpliceAI score of >0.20. Based on this relatively small proportion, the authors feel that the use of PM3 is justified as long as other common-sense requirements are met (e.g. rareness, gene-disease validity), and I agree with that conclusion.
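The splice-prediction proxy is easy to sketch as a filter. The snippet below assumes SpliceAI scores have already been computed and collected into a dictionary keyed by variant ID; the IDs, the example scores, and the `possibly_splice_disrupting` helper are made up for illustration, and only the 0.20 cutoff and the 55-of-2,418 counts come from the study as described above.

```python
# Cutoff used in the analysis described above.
SPLICEAI_THRESHOLD = 0.20


def possibly_splice_disrupting(scores, threshold=SPLICEAI_THRESHOLD):
    """Return IDs of variants whose SpliceAI score exceeds the threshold.

    Variants without a score (None) are skipped rather than guessed at.
    """
    return sorted(
        variant_id
        for variant_id, score in scores.items()
        if score is not None and score > threshold
    )


# Hypothetical scores for four in-trans intronic variants.
example_scores = {"v2": 0.03, "v6": 0.41, "v7": 0.20, "v8": None}
flagged = possibly_splice_disrupting(example_scores)

# Fraction of in-trans variants passing the filter, from the reported counts.
fraction_reported = 55 / 2418
```

Note that the comparison is strictly greater than, so a variant scoring exactly 0.20 (like `v7` here) is not flagged.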
Beware Second Hits in Large Genes
One note of caution: the probability of observing a possible second hit in the same gene (coding or noncoding) is proportional to the size of the gene and (to some extent) the total number of variants present in the individual due to ancestry. In this analysis, for example, the authors identified 22 variants in trans with one pLOF variant in the WWOX gene (very large at >1.1 Mbp in size) in an individual who self-identified as Black or Black British: African. Because of these factors, it may eventually be necessary to use a probability-driven approach to decide whether or not observation of a second hit is significant enough to warrant additional evidence of pathogenicity.
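As a rough illustration of what a probability-driven approach might look like, the sketch below treats rare in-trans variants as Poisson events whose expected count scales with gene length. This is my own simplification, not something proposed in the study: the `prob_at_least_one_second_hit` helper is hypothetical, and the per-base rate is a placeholder rather than an estimate. A real model would need ancestry-aware rates at the very least.

```python
import math


def prob_at_least_one_second_hit(gene_length_bp, rare_variants_per_bp):
    """Chance of seeing one or more rare in-trans variants by coincidence.

    Assumes a Poisson model with expected count = length * per-base rate.
    """
    expected_count = gene_length_bp * rare_variants_per_bp
    return 1.0 - math.exp(-expected_count)


# Placeholder rate, for illustration only (not derived from any dataset).
PLACEHOLDER_RATE = 1e-5

p_small = prob_at_least_one_second_hit(20_000, PLACEHOLDER_RATE)     # a 20 kb gene
p_large = prob_at_least_one_second_hit(1_100_000, PLACEHOLDER_RATE)  # a WWOX-sized gene
```

Under any fixed rate, the larger gene is far more likely to carry a coincidental second hit, which is the reason a second hit in a gene like WWOX deserves more skepticism than one in a compact gene.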
Guidance for Unsolved Rare Disease Patients
Another compelling facet of this analysis is the insight into possible explanations for undiagnosed rare disease patients. GEL’s pilot publication, other large studies, and our own internal data report a fairly consistent diagnostic rate of ~25-30% for patients who get exome or genome sequencing. This means 70-75% of rare disease patients who undergo comprehensive testing fail to achieve a diagnosis. How many of those can be explained by a missing second hit in a noncoding regulatory region?
Well, 53 of the starting 2,016 undiagnosed trios had one pLOF variant which is compound-heterozygous with an intronic, possibly-splice-disrupting variant in an established autosomal recessive disease gene. That alone is 2.6% of the undiagnosed cohort who now have a promising lead. If one applied other types of functional annotation to the noncoding variants (beyond splice prediction), even more could have a reasonable candidate that merits clinical correlation.
Another Starting Point for Noncoding Variants
As illustrated above, when looking for innovative ways to solve undiagnosed probands, starting with ones who have a single hit in an autosomal recessive gene — especially one that seems to fit the clinical phenotype — is a great first strategy. What about the 2/3 of patients who don’t have a second noncoding variant in trans? There’s another logical set of noncoding variants to pursue in such cases: those that occur de novo in a proband whose parents are unaffected. There are several reasons this makes sense:
- The number of variants is manageable. Based on the baseline mutation rate and our own experience, every child has about 60-70 detectable high quality de novo variants genome-wide that pass manual review.
- De novo variants account for a large proportion of diagnosed cases. Although it can vary, often as many as half of genetic diagnoses from rare disease cohorts involve de novo mutations.
- Starting with strong evidence. The recent guidance from Ellingford et al notes that de novo evidence codes can be applied to noncoding variants similar to how they’re applied to coding variants. When parental relationships are confirmed, the stronger code for a de novo variant is “Strong” (PS2). That gets you a long way toward (likely) pathogenic.
To me, de novo variants are always interesting regardless of whether they provide a missing genetic diagnosis, which is more than enough reason to delve into these first when tackling noncoding regions.