Upstream open reading frames (uORFs) are short sequences within 5′ UTRs that regulate the rate at which the downstream coding sequence is translated into protein. Variants that create or disrupt uORFs (uORF-perturbing variants) have been shown to cause rare disease (Calvo et al., 2009; Whiffin et al., 2020). We recently used data from the Genome Aggregation Database (gnomAD) to systematically characterize the deleteriousness of different categories of uORF-perturbing variants and to prioritize those that are more likely to be disease causing (Whiffin et al., 2020). Current variant annotation approaches focus on the impact of protein-coding variants, with only limited annotation of predicted consequences for non-coding variants. For example, the Ensembl Variant Effect Predictor (VEP) (McLaren et al., 2016) annotates variants within UTRs only as 3′ or 5′ to the coding sequence, without any further information about their predicted effect.
To aid the assessment of high-impact uORF-perturbing variants, we have developed a plugin for VEP to identify 5′UTR variants that create upstream start sites (uAUGs), disrupt the start or stop codon of existing uORFs, create a new stop codon within existing uORFs, or shift the frame of an existing uORF. In each case, the tool outputs detailed annotations that allow the user to predict the likely impact of the variant on protein translation.
The MORFEE tool was recently described (Aïssi et al., 2020); however, it is limited to annotating single nucleotide variants (SNVs) that create uAUGs. The UTRannotator is, to our knowledge, the first comprehensive annotation tool for 5′UTR uORF-creating and uORF-disrupting variants. Our tool was initially created to characterize the impact of uORF-perturbing variants; however, it will be updated to annotate additional UTR variants as we learn how to interpret their role in human disease.
For any SNV, 1–5 bp small insertion/deletion (indel) or multi-nucleotide variant (MNV) in a 5′UTR, we first summarize the number of uORFs in the 5′UTR in the reference sequence. Then, for each variant within the 5′UTR, we evaluate whether it would have any of the following consequences on any annotated transcript: (i) creating a new AUG start codon that introduces a new uORF; (ii) removing an existing AUG start codon; (iii) removing the stop codon of an existing uORF; (iv) creating a new stop codon that shortens an existing uORF; (v) disrupting an existing uORF with a frameshift, i.e. an insertion or deletion whose length is not a multiple of three. Where a variant has multiple annotation consequences, it is evaluated for each separately.
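The five consequence types above can be illustrated with a minimal Python sketch. This is not the plugin's implementation: it works on a single 5′UTR given as a DNA string (ATG rather than AUG), uses 0-based positions, handles only SNVs for consequences (i) to (iv), and reduces (v) to a length check. The function names, the consequence labels and the decision to apply uSTOP-gained only to uORFs whose stop lies inside the UTR are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_uorfs(utr):
    """Map each ATG start index to the index of its first in-frame stop codon
    inside the UTR (None if the ORF runs past the end of the UTR)."""
    utr = utr.upper()
    uorfs = {}
    for start in range(len(utr) - 2):
        if utr[start:start + 3] != "ATG":
            continue
        uorfs[start] = None
        # Walk codon by codon in the frame set by this ATG.
        for pos in range(start + 3, len(utr) - 2, 3):
            if utr[pos:pos + 3] in STOP_CODONS:
                uorfs[start] = pos
                break
    return uorfs


def is_frameshift(ref, alt):
    """An indel shifts the frame when its net length change is not a multiple of 3."""
    return (len(alt) - len(ref)) % 3 != 0


def classify_snv(utr, pos, alt):
    """Return the set of consequences (i)-(iv) for a single-base substitution."""
    ref_seq = utr.upper()
    alt_seq = ref_seq[:pos] + alt.upper() + ref_seq[pos + 1:]
    ref_orfs, alt_orfs = find_uorfs(ref_seq), find_uorfs(alt_seq)
    consequences = set()
    if set(alt_orfs) - set(ref_orfs):
        consequences.add("uAUG_gained")
    if set(ref_orfs) - set(alt_orfs):
        consequences.add("uAUG_lost")
    for start, stop in ref_orfs.items():
        # Skip uORFs whose start was lost, and ORFs with no stop in the UTR.
        if start not in alt_orfs or stop is None:
            continue
        new_stop = alt_orfs[start]
        if alt_seq[stop:stop + 3] not in STOP_CODONS:
            consequences.add("uSTOP_lost")
        elif new_stop is not None and new_stop < stop:
            consequences.add("uSTOP_gained")
    return consequences
```

For the toy sequence `GGATGAAATAGCC`, which contains one uORF (ATG at index 2, stop at index 8), substituting G at index 9 both creates a new ATG and destroys the existing stop codon, so `classify_snv` reports two consequences for the same variant, matching the statement that each consequence is evaluated separately.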
To enable evaluation of the effect of each variant, the UTRannotator outputs detailed annotations for each type of uORF-perturbing variant (Table 1). These include the subtype of uORF created and/or disrupted (i.e. whether it is a distinct uORF with a stop codon in the 5′ UTR, or an ORF that overlaps the coding sequence either in- or out-of-frame), and the strength of the match between the created and/or disrupted uORF start site and the Kozak consensus sequence (Kozak, 1989). For a variant disrupting a uORF, we also evaluate whether the uORF has any experimental evidence of translation, by checking it against a curated list of uORFs previously identified with ribosome profiling from the online repository of small ORFs (www.sorfs.org) (Olexiouk et al., 2018). Users can also supply their own customized list of translated uORFs. Given that ribosome profiling datasets are currently limited in the cell types/tissues and conditions analysed, we output results for all possible uORF-disrupting variants and include experimental evidence as an annotation.
|Number of existing uORFs||√||√||√||√||√|
|KozakContext: sequence and strength||√||√||√||√||√|
|Start distance to CDS||√||√||√||√|
|Start distance to STOP||√||√|
|With translated evidence||√||√||√||√|
|uORF subtype||√||√||√||√ (ref and alt)|
|Other annotations||Start distance from cap||Whether there is an alternative STOP, alternative stop distance to CDS, frame of disrupted uORF with CDS||New stop distance to CDS|
Since a 5′UTR can have multiple existing uORFs, for each 5′UTR variant we output the annotations for all disrupted uORFs.
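As an illustration of two of the annotations listed above, the following Python sketch assigns a Kozak strength and a uORF subtype to a start site. It assumes the commonly used convention that the decisive Kozak positions are −3 (a purine, A or G) and +4 (G), with both matching called Strong, one Moderate and neither Weak; whether the plugin applies exactly this rule is an assumption here. The function names and subtype labels are illustrative, and the sketch only sees the UTR, so a +4 base that falls in the coding sequence is treated as missing.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def kozak_strength(utr, atg_pos):
    """Classify the context of the ATG at atg_pos using positions -3 and +4
    (the A of ATG is +1, so +4 is the base right after the G)."""
    utr = utr.upper()
    minus3 = utr[atg_pos - 3] if atg_pos >= 3 else None
    plus4 = utr[atg_pos + 3] if atg_pos + 3 < len(utr) else None
    matches = int(minus3 in ("A", "G")) + int(plus4 == "G")
    return ("Weak", "Moderate", "Strong")[matches]


def uorf_subtype(utr, atg_pos):
    """Return 'uORF' if an in-frame stop lies within the UTR; otherwise the
    ORF overlaps the coding sequence, either in or out of frame with it."""
    utr = utr.upper()
    for pos in range(atg_pos + 3, len(utr) - 2, 3):
        if utr[pos:pos + 3] in STOP_CODONS:
            return "uORF"
    # The CDS starts right after the UTR, so the distance from the ATG to the
    # end of the UTR decides whether the two reading frames coincide.
    if (len(utr) - atg_pos) % 3 == 0:
        return "inFrame_oORF"
    return "OutOfFrame_oORF"
```

Applied to every disrupted uORF in a 5′UTR, helpers of this kind would yield one set of annotations per uORF, in line with reporting all disrupted uORFs for a variant.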
Detailed information on installing and running the UTRannotator can be found in Supplementary Information. The time complexity of our implementation is linear in the number of input variants. The ratio of running time with the plugin to that without the plugin, tested on 1000 random variants (60% annotated as 5′UTR variants), is 1.02–1.07 (5 replications).
To show the utility of the UTRannotator, we annotated all 5′ UTR variants classified as pathogenic/likely pathogenic or of uncertain significance in ClinVar (version 202005) (Landrum et al., 2018). These variants do not have a coding annotation on any transcript. We note, however, that 5′UTR variants are under-represented in ClinVar, as they are rarely sequenced and/or reported.
There are 97 Pathogenic/Likely pathogenic 5′ UTR variants in ClinVar (97/113 969 = 0.085% of all ClinVar Pathogenic/Likely pathogenic variants). Of these, 91 are 1–5 bp small variants, 29 of which (31.9%) are annotated as creating or disrupting uORFs by our plugin (Fig. 1; Supplementary Table S1). We examined the evidence behind the reported clinical significance for each of these variants, and found that 15 of the 29 (51.7%) have previously been attributed to a uORF-perturbing mechanism.
There are 5128 5′ UTR variants of uncertain significance (VUS) reported in ClinVar (5128/255 691 = 2% of all VUS), 4966 of which are 1–5 bp small variations. Our plugin annotated 377 of these (7.6%) as creating or disrupting uORFs, on at least one annotated transcript (Supplementary Table S2).
We used the detailed annotations from the UTRannotator to illustrate how to prioritize 5′ UTR VUS that are most promising for further follow-up. We first restricted to variants that form new overlapping ORFs (oORFs) with start sites that are Strong or Moderate matches to the Kozak consensus sequence, or that are uORFs with documented evidence of translation, as we previously showed that variants with these consequences are under strongest negative selection (Whiffin et al., 2020). Finally, we took variants in 3191 genes previously identified as having a ‘High’ likelihood that uORF-perturbation could be an important disease mechanism (Whiffin et al., 2020). Through this approach, we identified 31 potential ‘high-impact’ ClinVar 5′ UTR VUS (Supplementary Table S3).
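The prioritization described above amounts to a simple filter over annotated variants. The Python sketch below shows one reading of it: a variant is kept if it lies in a gene from the 'High' likelihood list and either creates an oORF with a Strong or Moderate Kozak match or affects a uORF with documented evidence of translation. The dictionary keys (`gene`, `creates_oorf`, `kozak_strength`, `translation_evidence`) are hypothetical and do not correspond to the plugin's actual output field names.

```python
def is_high_impact(variant, high_genes):
    """Apply the two-step prioritization to one annotated variant (a dict)."""
    strong_oorf = (
        bool(variant.get("creates_oorf", False))
        and variant.get("kozak_strength") in ("Strong", "Moderate")
    )
    translated = bool(variant.get("translation_evidence", False))
    in_high_gene = variant.get("gene") in high_genes
    return in_high_gene and (strong_oorf or translated)


def prioritize(variants, high_genes):
    """Keep only the variants that pass the filter, preserving input order."""
    return [v for v in variants if is_high_impact(v, high_genes)]
```

A variant with a Weak Kozak match and no translation evidence is dropped even in a 'High' gene, and any variant outside the gene list is dropped regardless of its other annotations.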
We have created a freely available tool, as a plugin to the Ensembl VEP, that annotates variants that create or disrupt uORFs. The output from the tool can be used to predict the possible impact of variants identified in patients for a role in disease. It is also directly applicable to annotate 5′ UTR variants from other eukaryotes (see Supplementary Information).
We initially referenced our development of this tool in prior work (Whiffin et al., 2020); since then, we have greatly expanded the variant types evaluated (including small indels and MNVs) and the consequences annotated (including uAUG-lost, uSTOP-gained and uORF frameshift variants).
We note several limitations to our tool. First, the UTRannotator has been configured to annotate only variants up to 5 bp in length. We included this length restriction for two reasons: (i) the annotation of longer indels is difficult, as the chance of a variant having multiple possible annotations is increased, and (ii) the impact of larger indels that add or remove large stretches of UTR is currently unknown. Second, we currently consider only uORFs with canonical AUG start sites, although many translated uORFs are known to use non-canonical start sites (McGillivray et al., 2018). More research is needed into the impact of variants that create or disrupt these non-canonical uORFs in human disease.
For the initial tool release, we have included five variant types that create or disrupt uORFs; however, we will continue to develop the UTRannotator to include additional types of UTR variants.
N.W. was supported by a Rosetrees and Stoneygate Imperial College Research Fellowship. This work was supported by the Wellcome Trust [107469/Z/15/Z; 200990/A/16/Z], Medical Research Council (UK), British Heart Foundation [RE/18/4/34215], National Institute for Health Research (NIHR) Royal Brompton Cardiovascular Biomedical Research Unit, and the NIHR Imperial College Biomedical Research Centre.
Conflict of Interest: none declared.