Understanding the impact of genetic variants on observable traits is a fundamental goal of human genetics. Yet for the >98% of known sequence variants that reside outside of protein-coding sequences, this remains a significant challenge.
There is considerable evidence that noncoding variation can and does impact observable phenotypes. Genome-wide association studies, for example, have pinpointed thousands of loci that are associated with complex disease. Many of them lie in noncoding sequences, including gene deserts. RNA sequencing studies of undiagnosed patients with Mendelian disorders, too, have uncovered causal variants well beyond the coding regions. Our recent study of an intronic variant that disrupts ATP7B splicing in Wilson disease is just one example.
Most geneticists recognize that noncoding variants are important. Even so, interpreting them remains difficult because we have not yet deciphered the “regulatory code.” A bevy of papers from two large-scale international consortia — the ENCODE Project and the GTEx Consortium — has shed new light on regulatory sequences in the genome and their impact on gene expression. In this post, I’ll explore two papers from the latter project that offer fascinating insights into the human genetic regulatory code.
The GTEx Dataset
The GTEx project was launched in 2010 with the goal of cataloguing gene expression across a variety of human tissues using the emerging technology of massively parallel RNA sequencing (RNA-seq). Many gene hunters like myself have benefited from its patterns of gene expression across different tissues when considering potential new disease genes. Last month, the consortium published its latest atlas of gene expression variation in Science, which comprises 15,201 RNA-seq datasets representing 49 tissues from 838 postmortem donors. This approximately doubles the catalogue since the interim publication in 2017 (42 tissues from 449 donors).
Whole-genome sequencing was performed on all 838 donors as well, enabling the authors to search for relationships between sequence variation and gene expression differences between individuals. They identified a total of 43.1 million SNPs after quality control and phasing. That’s an impressive number, especially when one considers that there were only ~30.4 million human variants catalogued just a decade ago (dbSNP build 132, September 2010).
cis-eQTL Discovery in GTEx
The authors searched for variants associated with the activity of nearby genes (cis-eQTLs) and found that 4.23 million variants were associated with the expression level of at least one gene in at least one tissue. This is nearly half (43%) of the common population variants (minor allele frequency, MAF>0.01) in the cohort. Some interesting findings about cis-eQTLs:
- Most genes have at least one eQTL. Some 18,262 protein-coding genes (94.7%) and 5,006 long noncoding RNA genes (57.3%) had at least one significantly associated cis-regulatory variant.
- Most cis-eQTLs had small effect sizes. However, about one in five (22%) had a greater-than-twofold effect on gene expression.
- Discovery of cis-eQTLs saturates at ~1,500 genes in tissues with >200 samples. In other words, this study is extremely well-powered; roughly 200 samples per tissue are enough to discover all of the large-effect cis-eQTLs (which should be around 1,500).
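To make the idea of a cis-eQTL "effect size" concrete, here is a minimal sketch of the simplest possible association test: regress a gene's expression on the number of alternate alleles each donor carries (0, 1, or 2). This is an illustration under my own assumptions, not the consortium's pipeline, which also handles normalization, covariates, and multiple-testing correction. The function name `eqtl_slope` and the toy numbers are hypothetical.

```python
def eqtl_slope(dosages, expression):
    """Least-squares slope of expression on alternate-allele dosage (0, 1, 2)."""
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need matched lists with at least three samples")
    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n
    var_g = sum((g - mean_g) ** 2 for g in dosages)
    if var_g == 0:
        raise ValueError("variant is monomorphic in this sample")
    cov = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))
    return cov / var_g


# Invented toy data: six donors, expression on a log2 scale.
dosages = [0, 0, 1, 1, 2, 2]
log2_expr = [5.0, 5.2, 6.0, 6.2, 7.0, 7.2]

slope = eqtl_slope(dosages, log2_expr)  # log2 expression change per allele
fold_change = 2 ** abs(slope)           # convert the log2 slope to a fold change
print(f"per-allele slope: {slope:.2f} log2 units, fold change: {fold_change:.2f}")
```

On a log2 scale, a per-allele slope of 1 corresponds to a twofold change, which is one simple way to read the "greater-than-twofold" figure above.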
cis-sQTL (Splicing) Discovery in GTEx
The authors mapped variants associated with exon-intron splicing patterns of nearby genes (cis-sQTLs) using intron excision ratios from LeafCutter.
- Splice-QTLs are pervasive. 12,828 protein-coding genes (66.5%) and 1,600 lincRNA genes (21.5%) had at least one sQTL in at least one tissue.
- Cis-sQTLs are enriched almost entirely in transcribed regions. In other words, variants that affect splicing are located within the transcript (i.e. UTR, exon, splice region, or intron). This is somewhat intuitive when you think about it, but reassuring to see.
- Variants in expected and unexpected places affect splicing. Splice acceptor, splice donor, splice region, and loss-of-function variants were most enriched for sQTLs. This again is somewhat expected. Yet there was also >2x enrichment for sQTLs among missense, synonymous, UTR, and intronic variants. See the figure at right.
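For readers unfamiliar with intron excision ratios, the core idea is simple: within a cluster of introns that share splice sites, each intron's ratio is its share of the split reads supporting that cluster. The sketch below shows only that arithmetic; it is not LeafCutter's code, and the function name, intron labels, and read counts are invented for illustration.

```python
def intron_excision_ratios(cluster_counts):
    """Map each intron in one cluster to its share of the cluster's split reads."""
    total = sum(cluster_counts.values())
    if total == 0:
        # No reads support this cluster in this sample.
        return {intron: 0.0 for intron in cluster_counts}
    return {intron: count / total for intron, count in cluster_counts.items()}


# Invented split-read counts for one two-intron cluster in two donors.
donor_a = intron_excision_ratios({"intron_1": 90, "intron_2": 10})
donor_b = intron_excision_ratios({"intron_1": 50, "intron_2": 50})

# A shift in usage like this between genotype groups is what an sQTL captures.
shift = donor_a["intron_1"] - donor_b["intron_1"]
print(donor_a, donor_b, round(shift, 2))
```

Because the ratios are computed per cluster, a variant can register as an sQTL without changing the gene's overall expression level at all.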
Rare Variants That Regulate Genes
Another article from the GTEx Consortium in the same issue of Science explores in depth the role of rare variation in driving transcriptomic signatures across tissues. Using the WGS data for 838 individuals, the authors evaluated how rare genetic variants contribute to:
- Differences in gene expression (eOutliers)
- Differences in allele expression (aseOutliers)
- Differences in splicing (sOutliers)
I chose to discuss this paper along with the main GTEx article because it’s arguably most relevant to those of us working on the genomic basis of rare and pediatric diseases. Whereas the flagship findings from above pertain largely to common (MAF>0.01) variants that offer the statistical power to detect associations in a cohort of ~800 individuals, this study took a complementary approach. The authors identified individual outliers with respect to gene expression, allelic expression, or splicing in the RNA-seq datasets, and then interrogated the genomes of those individuals to look for nearby rare variation that might explain the aberration. In other words, this is a study of extreme outliers whose unique patterns of gene activity could be due to rare large-effect variants.
The authors prioritized outlier observations that were consistent across multiple tissues from the same individual, eventually identifying, in each individual, a median of:
- 4 genes that were outliers for expression (eOutliers)
- 4 genes that were outliers for allelic expression (aseOutliers)
- 5 genes that were outliers for splicing (sOutliers)
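One common way to call an individual a multi-tissue outlier for a gene is to convert expression to Z-scores within each tissue and then take the median across that person's tissues. The sketch below illustrates that idea only. The threshold of 2.0, the function names, and the data are my own assumptions; the paper's exact procedure and cutoffs are not reproduced here.

```python
from statistics import mean, median, stdev


def z_scores(values_by_donor):
    """Standardize one gene's expression across donors within a single tissue."""
    mu = mean(values_by_donor.values())
    sd = stdev(values_by_donor.values())
    return {donor: (value - mu) / sd for donor, value in values_by_donor.items()}


def multi_tissue_outliers(expr_by_tissue, threshold=2.0):
    """Return donors whose median Z-score across tissues exceeds the threshold."""
    per_donor = {}
    for values in expr_by_tissue.values():
        for donor, z in z_scores(values).items():
            per_donor.setdefault(donor, []).append(z)
    medians = {donor: median(zs) for donor, zs in per_donor.items()}
    return {donor: m for donor, m in medians.items() if abs(m) > threshold}


# Invented data: ten donors, three tissues, donor D10 extreme in two of them.
baseline = [10.0, 10.5, 9.5, 10.0, 10.2, 9.8, 10.1, 9.9, 10.0]
toy = {}
for tissue, extreme in [("liver", 20.0), ("lung", 19.0), ("skin", 10.1)]:
    values = {f"D{i + 1}": v for i, v in enumerate(baseline)}
    values["D10"] = extreme
    toy[tissue] = values

outliers = multi_tissue_outliers(toy)
print(outliers)
```

Using the median rather than the maximum means a single noisy tissue cannot make someone an outlier, which matches the emphasis on observations that are consistent across tissues.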
Gene Outliers and “Suspect” Rare Variants
When the most stringent thresholds were applied to identify these outliers, most (82-94%) individuals harbored at least one rare variant (RV) in the gene body or within 2kb. I should point out here that the co-occurrence of outlier status and a rare variant is not direct evidence of causality. The authors use language such as “variants leading to any outlier status,” which in my opinion presumes a directly causal relationship that has not been proven. For example, it’s very possible that the nearest RV to an outlier gene in an individual has nothing to do with the gene’s outlier status. Even so, most of human genetics is about probabilities, so I think it is reasonable to say that there’s a good probability that a rare variant observed in or near a gene that’s an outlier for that particular individual could be contributory.
Interestingly, despite the observation of RVs near most outlier genes, the converse was not true. That is, a large proportion of genes with rare variants did not appear to be outliers, even for the most severe predicted classes such as loss-of-function variants. This is perhaps an unexpected and important finding. Even the most predictive categories, splice donor and splice acceptor variants, were associated with a splice outlier only 7.2% and 6.8% of the time, respectively. I hope nobody tells ACMG about this.
The Impact of Splice Region Variants
The relevance of variation in and around splice sites remains an area of vigorous debate, even within our lab. Most agree that variants in the canonical splice donor (first two bases of intron after an exon) and splice acceptor (last two bases of intron before the next exon) are the most likely to disrupt splicing, and that holds true in the GTEx dataset. To their credit, the authors explored the “relative risk” of splice disruption based on enrichment of implicated rare variants in the wider splice region:
On average, the relative risk for a rare variant in the canonical splice site was 195, and most of that signal comes from the -2 splice acceptor position (the “A” in “GT-AG”). This matches evolutionary conservation evidence. Interestingly, we do see strong enrichment for rare variants throughout the rest of the splice region, though not nearly at the same level. There’s an interesting spike at +6 bp into the intron that might be worth evaluating further. I’ve seen plenty of splice region variants that don’t impact splicing at all (according to the RNA-seq data), but this suggests that there are some out there which probably do.
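If "relative risk" feels abstract, here is a minimal sketch of how such an enrichment can be computed. I am assuming the definition is the proportion of outliers carrying a rare variant at a given position divided by the same proportion among non-outliers; the paper's exact formulation may differ. The function name and all counts are invented for illustration and are not taken from GTEx.

```python
def relative_risk(outliers_with_rv, outliers_total, controls_with_rv, controls_total):
    """Ratio of rare-variant carrier proportions: outliers vs. non-outliers."""
    if outliers_total <= 0 or controls_total <= 0:
        raise ValueError("group sizes must be positive")
    p_outlier = outliers_with_rv / outliers_total
    p_control = controls_with_rv / controls_total
    if p_control == 0:
        raise ValueError("no carriers among non-outliers; ratio is undefined")
    return p_outlier / p_control


# Invented counts: 30 of 1,000 splicing outliers carry a rare variant at some
# splice-region position, versus 3 of 10,000 non-outliers.
rr = relative_risk(30, 1000, 3, 10000)
print(f"relative risk: {rr:.1f}")  # about 100 with these invented counts
```

Note that a very high relative risk is compatible with a low hit rate: carriers can be hugely over-represented among outliers even when most carriers show no splicing change at all.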
All told, it’s an interesting study and the highlights I’ve shared here barely scratch the surface of the findings in the massive GTEx v8 dataset. It represents a huge body of work, and an important step forward in our understanding of the human genetic regulatory code.