Single nucleotide polymorphisms are single nucleotides in a genome that vary between subsets of a population. They are known to cause many diseases, but the ways in which they cause disease are often unknown. We show that one way in which single nucleotide polymorphisms may cause disease is by altering how RNA molecules fold. The nucleotides in an RNA molecule can base pair to each other to form complicated structures. This structure dictates where some proteins can bind, since many proteins can only bind to single stranded RNA. If the change from one base to another base caused by a single nucleotide polymorphism causes a protein binding motif to become preferentially base paired, then it will be harder for a protein to bind there. By modelling the folding of RNAs in conjunction with protein binding in silico, we show that single nucleotide polymorphisms can affect the ability of specific proteins to bind to RNAs, usually making it harder for proteins to bind.
Single Nucleotide Polymorphisms (SNPs) and Single Nucleotide Variations (SNVs) are nucleotide changes at single genomic positions that differ between significant subsets of a population, or general mutations that often arise due to diseases such as cancer, respectively . While very common and known to cause many diseases, their effects on gene expression, protein binding, and ways in which they cause disease are not completely understood . Missense mutations in coding regions are easily linked to disease, since they cause translation of a defective protein , but most SNPs (∼93% of disease and trait associated SNPs in genome-wide association studies) occur in non-coding regions . Non-coding SNPs can appear in non-coding RNAs, introns, or in 5’ and 3’ untranslated regions (UTRs). Because these non-coding SNPs do not produce an altered protein, the pathways through which they cause disease are less well known, but they are still regularly associated with disease . Understanding the effect of these non-coding or same-sense SNPs has wide-ranging implications for understanding disease, as well as evolutionary genetics [6, 7].
A possible explanation of the effect on phenotype of SNPs in 5’ and 3’ UTRs or non-coding RNAs is that they affect crucial interactions between an RNA and other biomolecules. Indeed, RNAs naturally interact with RNA-binding proteins (RBPs), RNA-protein complexes like the ribosome and the spliceosome, as well as with other RNAs [8–10]. These interactions control every step in an RNA’s life cycle, such as the life time of an RNA molecule, its subcellular localization, and the recruitment of ribosomes to mRNA molecules and ultimately the amount of protein expressed per transcribed mRNA [11, 12]. Thus, it is not surprising that interrupting these interactions is known to cause disease . In line with their importance, there are over 1500 RNA binding proteins and thousands of microRNAs annotated in the human genome alone [14, 15].
It is clear that a SNP will affect protein or microRNA binding if it occurs directly on a binding site [16, 17]. However, as we will show, SNPs are also able to affect protein (or microRNA) binding “at a distance” through the involvement of RNA secondary structure. RNA secondary structures form due to the propensity of the nucleotides of an RNA to base pair . For structural RNAs these base pairings are a significant determinant of the functionally relevant physical shape of the RNA, but messenger and non-coding RNAs that are not necessarily designed for specific structures will also form base pairs and thus secondary structure . As microRNAs and a large fraction of RNA binding proteins bind to unpaired bases only, RNA secondary structure competes with binding of microRNAs or single-stranded RNA binding proteins and thus affects the binding affinity of the RNA for these molecules. For example, we have previously shown the existence of secondary structure mediated cooperativity between RNA binding proteins: binding of one protein to an RNA changes the ensemble of possible secondary structures by excluding the bases in its footprint from base-pairing [20, 21]. This change in secondary structures modifies the accessibility of the footprint for a second protein and thus the affinity of the RNA for this second protein. Depending on the specific sequence one binding event can make the other binding event easier or harder.
It has also been shown experimentally that specific SNPs can affect the secondary structures of mRNAs , and that SNPs can cause disease through changes in RNA secondary structure [23–25]. Here, we show how single nucleotide changes in an RNA molecule can, by making different conformations energetically more or less favorable, also change secondary structure drastically enough to change the affinity of an RNA for an RNA binding protein or a microRNA, and that there is some evidence that this effect might be under selective pressure in the human transcriptome. For simplicity, in the rest of the paper we will refer to the molecules binding to RNAs as “proteins”, even though these binding events could equally occur with mircoRNAs, as shown in , or any other molecule that binds single-stranded RNA. Likewise, we will be referring to the effect of “SNPs” on RNA-protein binding, but these effects should occur equally with any point mutation including SNVs. By computationally folding RNAs using a modified version of the Vienna RNA Package, we are able to quantitatively measure the effect of SNPs on protein binding. Using known human SNPs and PAR-CLIP data, we investigate the genome wide effect of SNPs on HuR (ELAVL1) binding. HuR is an extensively studied RNA binding protein with nearly 500 articles on PubMed. It is a member of the ELAVL family of RNA-binding proteins that selectively bind AU rich sequences, and HuR binds with a 7 nucleotide footprint mostly in the UTRs of many mRNAs . HuR has diverse functions, including stabilizing mRNAs against degradation as a means of regulating gene expression and controlling nuclear export of mRNAs, and has been implicated in several diseases including cancer [28, 29]. We find that SNPs can have a many-fold effect on the binding affinity of HuR binding to RNA transcripts from tens of bases away, simply through changes in secondary structure, and propose this as a general mechanism through which SNPs can affect protein binding.
As a first step in the investigation into secondary structure mediated effects of SNPs on RNA-protein binding, we wanted to find out if effects of sequence alterations outside the binding site on RNA-protein binding are generically possible and if so, at which distances between the sequence alteration and the protein binding site. To this end, we selected random sequences and computationally quantified the effect of a sequence alteration at the central nucleotide on the affinity of the randomly chosen RNA molecule to a hypothetical protein binding at variable locations along the molecule. We selected random sequences of lengths 101, 201, and 401 nucleotides, and saw no significant differences between the lengths. Specifically, we used the Vienna package to calculate ΔΔG , or the difference in the effective free energies of binding for a protein binding to the altered and unaltered sequences taking into account the entire ensemble of RNA secondary structures (see Methods). A positive ΔΔG means the alteration makes it easier for a protein to bind, while a negative ΔΔG means it is harder for a protein to bind. We find that, in an ensemble of random sequences, the average of ΔΔG is near zero unless a sequence alteration is directly inside a protein binding motif (see Fig 1(A) and 1(C) for 201 nucleotide data and S1(A), S1(C), S2(A) and S2(C) Figs for 101 nucleotide and 401 nucleotide data, respectively). This means that the average effect of a sequence alteration on RNA-protein binding is low, unless the sequence alteration directly impacts the protein binding motif. However, an average of zero does not necessarily imply that the effect of each individual sequence alteration is low, just that the effect of sequence alterations is symmetric. We thus next calculated the standard deviation of ΔΔG and found it to be on the order of a kcal/mol even when the sequence alteration is 30-50 bp away from the protein binding site (see Fig 1(B) and 1(D) for 201 nucleotide data and S1(B), S1(D), S2(B) and S2(D) Figs for 101 nucleotide and 401 nucleotide data, respectively). This indicates that individual sequence alterations have the potential for biologically relevant effects of several kcal/mol on the binding of a protein, simply through changes in secondary structure, but that this effect is mostly symmetric for random sequences. Directly comparing standard deviations of ΔΔG for the two different footprint sizes, averaging over all possible sequence changes and over 10 base pair sliding windows to reduce the noise, we also see that proteins with larger footprints are affected at slightly farther distances from an associated sequence alteration (see Fig 2 for 201 nucleotide data and S3 and S4 Figs for 101 nucleotide and 401 nucleotide data, respectively).
Given our finding that single nucleotide sequence alterations can have an effect on proteins binding to RNA through changes in secondary structure, we wished to investigate this effect in an actual genome using known SNPs. Using in vivo PAR-CLIP HuR binding data from three different studies [30–32], we were able to locate documented SNPs near HuR binding sites in transcript coordinates (see Methods). By utilizing a previously modified version of the Vienna Package  we are able to quantitatively predict the effective dissociation constant KD for a single stranded RNA binding protein such as HuR as a function of RNA sequence fully taking the effects of RNA secondary structure into account. By folding genomic sequences we were thus able to determine the change in dissociation constant KD for HuR binding to an RNA transcript, associated with changing the SNP from its reference to its alternate allele. We emphasize that the approach we use does not attempt to determine the dissociation constants of the protein RNA interaction from protein structure as was done in , but rather uses complete sets of experimentally determined protein dissociation constants  for unstructured RNA to predict the effective dissociation constant in the presence of RNA secondary structures. While this is still computationally challenging, it allows a transcriptome wide analysis, which more first principle based approaches would not.
We again folded sequences of length 101, 201, and 401 nucleotides. Taking the ratio of KD ’s for HuR binding to an RNA transcript with the reference and the alternate allele of the SNP allows us to measure the effect of SNPs on HuR binding to RNA. Histograms of these affinity ratios for both data sets are shown in Fig 3. We find that while a majority of ratios are close to one, for some HuR binding sites near known SNPs in the tail of the distribution the predicted affinity of HuR to the transcript changes by tenfold or higher (see the minimal and maximal affinity ratios in Table 1) depending on the allele of the SNP. While this tail of highly impactful SNPs is a small fraction of the whole, it is still on the order of tens of thousands of SNPs.
|Data Set||Ratio Count >1||Ratio Count <1||Ratio Min||Ratio Max|
|Binomial p-value||Ratios >1 Mean||Ratios <1 Mean||All Ratios Mean|
|Kishore||6.944 ⋅ 10−41||1.277||0.8551||1.072|
|Lebedeva||3.423 ⋅ 10−28||1.267||0.8530||1.064|
|Mukherjee||1.487 ⋅ 10−52||1.259||0.8572||1.062|
Next, we wanted to know how distance between the SNP and the HuR binding site affects the binding affinity ratio associated with the SNP. Fig 4 shows histograms of SNP positions relative to the nearest HuR binding site for different ranges of KD ratios. The most obvious observation from these histograms is that in all data sets SNPs occur less frequently than expected on the HuR binding motifs themselves, with the first nucleotide in the motif occurring slightly more often than the others, and the first nucleotide upstream of the motif enriched in SNPs. This is unsurprising, since we would expect such an important binding motif to be evolutionarily conserved. In addition to this general trend, we find that although the distributions for SNPs with different affinity fold-changes become narrower with higher strength (standard deviations for the blue, green, and red curves of Fig 4(A) being 33.25±.06, 28.23±.26, and 25.28±.36, of Fig 4(B) being 34.43±.05, 29.76±.23, and 27.94±.35, and of Fig 4(C) being 34.19±.04, 29.50±.18, and 27.89±.28, respectively) even for the highly impactful SNPs (with a fold change of three-fold or larger) a majority fall outside of the HuR binding motifs. This further supports the idea that a SNP does not need to be directly on a protein binding motif to impact RNA-protein binding, and can affect RNA-protein binding at a distance through changes in secondary structure.
In the previous two sections we demonstrated that SNPs have the ability to affect the interactions between RNA and regulatory proteins. This leaves the question of whether or not the effect of SNP alleles on RNA protein interactions has any functional relevance in vivo . In order to address this question, we asked if there is any evidence for selection for or against protein binding. We reasoned that if the effects of sequence on protein RNA interactions do not play a functional role, protein RNA affinity should increase as often as decrease when changing the sequence from the reference to the alternate allele of a SNP. In contrast to this expectation under the null assumption of no functional relevance, we in fact find a significant asymmetry in the direction of SNPs’ effect on protein binding, which we thus take as an indication of functional relevance. It is clear from the cumulative distributions in Fig 5 (where ratios below 1 are reciprocated) that SNPs with affinity ratios above 1 are more prevalent, and that ratios above 1 have a larger maximum effect for sequences of this length. Since we take affinity ratios to be the KD for binding with the alternate allele over the KD for binding with the reference allele, this indicates that changing the SNP from its reference to its alternate allele is more likely to make it harder for a protein to bind. This effect is quantified in Table 1, which notes that if we assume a binomial distribution for SNPs with affinity ratios above and below 1, we can reject the null hypothesis of a 50/50 split with p-values of 6.9 ⋅ 10−41, 3.4 ⋅ 10−28, and 1.5 ⋅ 10−52 for the Kishore, Lebedeva, and Mukherjee data set, respectively.
SNPs have long been associated with disease, but the role of non-coding and synonymous SNPs in altering phenotypes is still unclear. Using a modified version of the Vienna Package we have verified that SNPs can affect RNA protein binding affinity by modifying RNA structures from tens of nucleotides away, and performed a genome wide analysis of the effect of SNPs on the binding affinity of the RNA binding protein HuR. Our analysis shows that SNPs can affect the affinity of a protein binding to RNA by many fold and that a significant number of HuR binding sites in the human genome, mostly in UTRs and introns, are strongly affected in their binding affinity by nearby SNPs. We also identify an asymmetry in the effect of SNPs on HuR binding, implying that the effect of SNPs on RNA structure might be under selective pressure in the human genome, at least in the case of HuR binding sites.
SNPs have been known to affect the structure of RNAs, in particular many RiboSNitches, or SNPs with a large impact on RNA secondary structure, have been identified , but characterizing the effect of these SNPs and revealing the ways in which they cause disease remains a challenge. Several studies and web servers use various minimum free energy (MFE) or partition function distance measures to characterize which SNPs have a high impact on global or local RNA structure [36–38], but our analysis is to our knowledge the first to identify the genome-wide effects of the structural change caused by a SNP on a given RNA binding protein and give the change in binding affinity associated with the different alleles of a SNP. To encourage individual experimental validation of our findings, the full data for the Lebedeva data set, including sequences of length 201 and calculated binding affinities, is provided (see S1 Table).
Interference with RNA protein binding is a clear link between SNPs and causes of disease. The disruptive nature of SNPs on the human genome is evident both from SNPs occurring less frequently on HuR binding sites and from the asymmetry in the effect of SNPs on HuR binding affinities, which suggests that a SNP is more likely to disrupt protein binding than enhance it. This trend is similar to a trend observed in previous studies of the effect of SNPs on the MFE of RNA secondary structures, which found that the effects of SNPs skew RNAs towards higher free energy structures . This trend suggests that existing RNA secondary structure is optimized to leave important single stranded RNA binding protein motifs unpaired, and SNPs have the ability to disrupt these naturally optimized configurations.
While we interpret this asymmetry to be the result of evolutionary selection preferring uninhibited HuR binding sites, biased HuR binding data could provide another possible explanation for the observed asymmetry. If an experimental binding site is higher affinity in the presence of the reference allele, it has a higher chance of appearing in PAR-CLIP data, and it will appear as impeded in the presence of the SNP. In the opposite case, if a binding site is lower affinity in the presence of the reference allele, it is less likely to appear in the PAR-CLIP data, which could contribute to the asymmetry. To disprove this alternate explanation, an analysis of only the heterozygous and homozygous alternate SNPs in HEK 293 cells (which should be immune to this selection bias or be biased in the opposite direction, respectively) was performed, but the number of these SNPs near HuR sites was not high enough to draw statistically significant conclusions. We also find that with increasing length of the sequence fragment studied there is a reduction in effect asymmetry and the number of highly impactful SNPs (see S2 Table), which we attribute to non-specific binding. Indeed, when we perform a hard constraint analysis (as we did on random sequences above) which only measures the effect of a SNP on the central binding site, we find no reduction of asymmetry or number of impactful SNPs with increased sequence length.
We have shown that individual SNPs can disrupt specific RNA protein binding sites, but many SNPs in a genome could all contribute to lower RNA protein binding efficiency and cause disease on a genome-wide or many gene scale. SNPs are typically not independent of each other but appear together in haplotypes. The combined effect of SNPs in a haplotype is not considered in this analysis, but it is reasonable to expect in general that sequence changes at multiple positions lead to even larger structural changes and thus stronger effects on protein binding. While this is the general expectation, it is also possible that multiple SNPs could have compensatory effects, but a systematic search for compensatory effects on protein binding is beyond the scope of this work. It is also clear from our analysis that SNPs need not be directly on a protein binding motif, or even within 50 nucleotides of a motif, to disrupt binding. This wide range of effect suggests that future studies on the structural effect of SNPs examine SNPs in a wide radius of their target feature.
While we have focused here on changes of individual nucleotides, other genomic variations, i.e., short insertions or deletions, might have even stronger effects on protein affinity of mRNAs and will be the subject of future investigations. Post-transcriptional modifications to mRNAs could also cause structural changes analogously to an allele change in a SNP, and once energy parameters for post-transcriptional modifications are available the analysis performed here for SNPs could be applied to them as well. Similarly, it will be interesting to investigate if similar effects apply to proteins with preferences for double stranded RNAs.
Although RNA is synthesized as a single stranded molecule, its constituent bases can pair with each other, ultimately leading to formation of complicated 3D structures. To perform our analysis of the effect of SNPs on RNA-protein binding we must model these structures in silico . In principle, a complete 3D model, or tertiary structure, is required to fully describe an RNA. However, many properties of RNA structure can already be understood at the level of secondary structure, i.e., the list of base pairs in the molecule . The secondary structure is modeled by the Vienna Package , state-of-the-art software which takes into account base pairing and nearest-neighbor stacking energies when modeling secondary structure. We take a similarly simplified approach to RNA-protein binding, modelling a bound protein by simply forcing any bound bases to remain unpaired and adding a protein binding energy for those bound configurations . The Vienna RNA Package uses recursion relations to efficiently fold RNAs in O(N3) time for RNAs of length N and allows the exclusion of individual bases from the folding through its constraint folding capabilities. We also make use of a previously published altered version of the Vienna Package that takes single stranded RNA binding proteins and their experimentally determined sequence-dependent binding energies  into account in the recursion relations themselves, and calculates the dissociation constant KD of a known protein to an RNA of a given sequence fully taking into account RNA secondary structures [33, 43]. We note that while the quality of computational secondary structure prediction via determination of the minimum free energy structure can be questionable, all our calculations evaluate partition functions over the entire Boltzmann ensemble of all RNA secondary structures, which are much more reliable . Also, it is important to note that while we do not explicitly allow non-canonical base pairs in our secondary structure predictions, their effect on the secondary structure is at least partially taken into account by the Vienna package in the measured free energy parameters for short interior loops exhibiting such base pairs.
To investigate the role of SNPs on protein binding we used all human (GRCh38.p7) SNPs from dbSNP (build 151) downloaded in VCF format . SNPs were formatted using vcftools  and transcript coordinates were obtained using Variant Effect Predictor (VEP) . We analyzed the effect of SNPs on three different HuR PAR-CLIP data sets, one from Kishore et al . , one from Lebedeva et al . , and one from Mukherjee et al . , which were all downloaded from the doRiNA database . HuR binding sites were matched to transcripts using the ensembldb bioconductor package . HuR binding affinities to different 7-mers were obtained from RNAcompete data . All reference transcript sequences were obtained from Ensembl BioMart .
To investigate the effect of single nucleotide sequence changes on proteins binding in the presence of RNA secondary structure we computed the changes in ensemble Gibbs free energy for proteins binding to random sequences at different positions using the Vienna Package. We selected 100 random sequences each for varying lengths (101, 201, and 401 nucleotides) with equal probabilities for all four nucleotides. Results for sequences of length 201 nucleotides are shown in the text while results for sequences of length 101 and 401 are shown in the supplementary material (see S1, S2, S3 and S4 Figs). Results for the 201 nucleotide sequence length were also replicated using RNAstructure, and we found very close agreement between the two software packages (see S5 and S6 Figs) . For each of these “wild type” sequences we considered “mutated” sequences that differ from wild type only in the identity of the central nucleotide. Then we used the constrained folding feature of the Vienna Package as described above to calculate free energies for four different configurations: the wild type sequence without a protein, the wild type sequence with a protein bound, the mutant sequence without a protein, and the mutant sequence with a protein bound. We used protein footprints of 7 nucleotides (the same as HuR) and 10 nucleotides to interrogate how the effect depends on footprint size. We then calculated the difference in free energy to bind a protein, ΔGWT and ΔGMut for the wild type and mutant sequence, respectively (see Fig 6). This difference between the free energy of the unconstrained ensemble of all RNA secondary structures and the ensemble of all RNA secondary structures in which the binding site of the protein remains unpaired can also be interpreted as ΔGWT = −kBT log(pWT), where pWT is the probability that the entire binding site of the protein is unpaired. As a quantitative measure of the effect of the sequence alteration on protein binding, we then calculated the difference ΔΔG = ΔGMut—ΔGWT. For each sequence we computed this quantity for each of the six possible combinations of wild type and mutant nucleotide at the central position and for every possible position of the protein binding site along the molecule. Finally, we calculated the average and standard deviation of ΔΔG over the 100 random sequences considered.
We identified SNPs near HuR binding sites by first matching the genomic coordinates of SNPs to transcript coordinates. Given a list of SNPs in VCF format, VEP provides each SNP’s associated Ensemble transcript ID (ENST) and cDNA position (if the SNP is transcribed), as well as its allele. The resulting list of SNPs with transcript coordinates was further filtered for only transcripts annotated by Ensembl as canonical, which are generally chosen as the transcript with the longest coding sequence when given a choice between isoforms. HuR binding sites were also mapped to ENSTs and the list of SNP transcript coordinates and HuR transcript coordinates was compared. If a SNP was found to be within 40 bases on either side of the middle of an HuR binding site it was considered a candidate for RNA secondary structure folding analysis. From this data we were also able to examine the positional distribution of SNPs around HuR binding sites.
To determine the effect of SNPs on binding affinity as mediated by RNA secondary structure we used a modified version of the Vienna Package that incorporates the effect of single stranded protein binding on RNA secondary structure calculations . We first determined the 7 bp motif within the PAR-CLIP binding site (usually ∼40 bp) that HuR has the highest affinity for using RNAcompete data , and then folded a stretch of the RNA transcript centered on this motif for the reference sequence and the SNP-altered sequence. We fold sequences of length 101, 201, and 401 nucleotides for each motif, and results for sequences of length 201 are again shown in the main text while results for sequences of length 101 and 401 are shown in the supplementary material (see S7, S8, S9, S10, S11 and S12 Figs). Although folding longer sequences could improve the accuracy of our calculated structures, longer sequences quickly become computationally intractable. Our modified version of the Vienna Package is able to determine the dissociation constant KD for HuR binding to any sequence (taking into account altered secondary structure), and by taking a ratio of these dissociation constants we are able to quantify the effect of SNPs on HuR binding due to changes in secondary structure. We take the ratio of dissociation constants to be the dissociation constant of the alternate allele over the dissociation constant of the reference allele.