Molecular Vision 2021; 27:233-242
Received 03 June 2020 | Accepted 06 May 2021 | Published 08 May 2021
Michelle E. McClements, Anum Butt, Elena Piotter, Caroline F. Peddle, Robert E. MacLaren
Nuffield Laboratory of Ophthalmology, University of Oxford, Oxford, UK
Correspondence to: Robert E MacLaren, Nuffield Laboratory of Ophthalmology, Department of Clinical Neurosciences, University of Oxford, Oxford, UK; Phone: 01865 234796; FAX: 01865 228974; email: email@example.com
Purpose: The classic Kozak consensus is a critical genetic element included in gene therapy transgenes to encourage the translation of the therapeutic coding sequence. Despite optimizations of other transgene elements, the Kozak consensus has not yet been considered for potential tissue-specific sequence refinement. We screened the −9 to −1 region relative to the AUG start codon of retina-specific genes to identify whether a Kozak consensus that is different from the classic sequence may be more appropriate for inclusion in gene therapy transgenes that treat inherited retinal disease.
Methods: Sequences for 135 genes known to cause nonsyndromic inherited retinal disease were extracted from the NCBI database, and the −9 to −1 nucleotides were compared. This panel was then refined to 75 genes with specific retinal functions, for which the −9 to −1 nucleotides were placed in front of a GFP transcript sequence and RNAfold predictions performed. These were compared with a GFP sequence with the classic Kozak consensus (GCCGCCACC), and sequences from retinal genes with minimum free energy (MFE) predictions greater than the reference sequence were selected to generate an optimized Kozak consensus sequence. The original Kozak consensus and the refined retina Kozak consensus were placed upstream of the Renilla luciferase coding sequence, which were used to transfect retinoblastoma cell lines Y-79 and WERI-RB-1 and HEK 293T/17 cells.
Results: The nucleotide frequencies of the original panel of genes were determined to be comparable to the classic Kozak consensus. RNAfold analysis of a GFP transcript with the classic Kozak sequence in the 5′ untranslated region (UTR) generated an MFE prediction of −503.3 kcal/mol. RNAfold analysis was then performed with a GFP transcript containing each −9 to −1 Kozak sequence of 75 retinal genes. Thirty-eight of the 75 genes provided a greater MFE value than −503.3 kcal/mol and exhibited an absence of stable secondary structures before the AUG codon. The −9 to −1 nucleotide frequencies of these genes identified a Kozak consensus of ACCGAGACC, differing from the classic Kozak consensus at positions −9, −5, and −4. Applying this sequence to the GFP transcript increased the MFE prediction to −500.1 kcal/mol. The newly identified retina Kozak sequence was also applied to Renilla luciferase plus the REP1 and RPGR transcripts used in current clinical trials. In all examples, the predicted transcript MFE score increased when compared with the current transcript sequences containing classic Kozak consensus sequences. In vitro transfections identified a 7%–9% increase in Renilla activity when incorporating the optimized Kozak sequence.
Conclusions: The Kozak consensus is a critical element of eukaryotic genes; therefore, it is a required feature of gene therapy transgenes. To date, the classic sequence of GCCRCC (−6 to −1) has typically been incorporated in gene therapy transgenes, but the analysis described here suggests that, for vectors targeting the retina, using a Kozak consensus derived from retinal genes can provide increased expression of the target product.
The Kozak consensus is named for Marilyn Kozak, whose extensive investigations first identified the significance of the nucleotides immediately upstream of the start codon. These nucleotides are critical for guiding ribosomes to the site where translation begins on eukaryotic transcripts, and in 1987 a consensus sequence common to all vertebrate genes at positions −9 to −1 was determined to be GCCGCCRCC. Despite being called the Kozak consensus, the extent of conservation in vertebrate genes was considered to be low, at about 0.2%. Nevertheless, it was clear that the identified Kozak consensus provided good translation efficiency compared to other sequence versions. The −6 to −1 sequence of GCCRCC was defined as a strong Kozak sequence, providing efficient translation, with other variants, such as UAAACC, considered to be weak and to provide less efficient translation.
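Consensus strings such as GCCRCC use IUPAC ambiguity codes, in which R denotes a purine (A or G). As a minimal illustration that is not part of the original study, the following Python sketch checks whether a hexamer matches such a consensus. The function name `matches_consensus` is hypothetical, and a match or mismatch says nothing quantitative about translation efficiency.

```python
# IUPAC nucleotide ambiguity codes (a subset sufficient for Kozak-style patterns).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",  # purine
    "Y": "CT",  # pyrimidine
    "M": "AC",
    "K": "GT",
    "S": "CG",
    "W": "AT",
    "N": "ACGT",
}


def matches_consensus(sequence: str, consensus: str) -> bool:
    """Return True if every base of `sequence` is allowed by `consensus`.

    RNA input is accepted: U is treated as T before comparison.
    """
    seq = sequence.upper().replace("U", "T")
    pattern = consensus.upper().replace("U", "T")
    if len(seq) != len(pattern):
        return False
    return all(base in IUPAC[code] for base, code in zip(seq, pattern))


print(matches_consensus("GCCACC", "GCCRCC"))  # True
print(matches_consensus("GCCGCC", "GCCRCC"))  # True
print(matches_consensus("UAAACC", "GCCRCC"))  # False
```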
A more recent, larger analysis of 10,012 human genes confirmed the preferred human Kozak sequence as GCCGCCRMC. This was later supported by an analysis of 32,526 human genes that determined a consensus of GCCGCCACC. That study also showed that vertebrate species-specific variations in the Kozak consensus influenced expression efficiencies within a zebrafish model. The researchers identified a zebrafish-specific Kozak consensus that showed a twofold increase in translation efficiency over the classic Kozak sequence, despite differing by just two nucleotides. If such small changes are important between species, it is plausible that there may also be tissue-specific biases within species.
Detailed investigations have revealed that single-nucleotide variations can alter the strength of a given Kozak sequence. It has been proposed that weak Kozak sequences could be important for the regulation of translation, and indeed, it has been suggested that mRNAs with weak Kozak sequences are enriched for genes involved in neurobiology in Drosophila. However, to date, we are unaware of any tissue-specific analysis of Kozak consensus sequences in human genes.
Mutations in the human Kozak sequence have been identified and associated with disease [9-13]. Of 275,716 current mutation entries in the Human Gene Mutation Database (HGMD, accessed March 2020), 4,575 were in regulatory regions, and of these, 2,695 were upstream of the AUG initiation codon, with 84 presenting evidence of disease-associated influence at nucleotides −9 to −1.
It is clear that nucleotide variations in the Kozak sequence can influence the efficiency of translation. Transgenes for gene therapy have been optimized in various ways, including capsid and promoter selections and the addition of untranslated regions that improve transcript stability, such as a woodchuck hepatitis virus posttranscriptional regulatory element (WPRE). Inclusion of introns between the transcriptional start site and Kozak sequence has also proven to be beneficial in improving expression levels from transgenes [16,17]; therefore, it seems rational to consider the Kozak consensus for refinement in therapeutic transgenes. The more efficient a transgene is at generating the required therapeutic product in the target cell type, the lower the vector dose that should be required to achieve a therapeutic outcome. Researchers developing synthetic promoters have considered the potential for a Kozak sequence that is not necessarily the classic consensus. For example, in yeast, single point mutations to adenine at position −5 improved translational strength, whereas changes to guanine elsewhere reduced translational strength. It was also identified that the Kozak sequence that provided the highest predicted minimum free energy (MFE) for a given transcript also provided the most protein synthesis. In this study, we systematically screened genes with known retinal functions to identify an optimized variant of the Kozak consensus that might provide translational benefits, and therefore, could be implemented as an enhanced element in transgenes for retinal gene therapy.
Gene sequences were downloaded from the NCBI database using Geneious 10.2.6 in March 2020, and nucleotides −9 to −1 were extracted (relative to the “A” of the AUG start codon). Data were exported to Microsoft Excel, and nucleotide preferences at each position were determined for 135 genes linked to inherited retinal disease but not syndromic disorders (RetNet). Of this panel, 75 genes identified as having retina-specific isoform functions were extracted for nucleotide frequency comparisons and RNAfold analysis.
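The extraction and per-position tally described above were performed in Geneious and Microsoft Excel. Purely as an illustrative sketch of the same steps, and not the authors' workflow, the Python below pulls the −9 to −1 window from a transcript and computes nucleotide frequencies and a consensus. The function names and the toy transcript are hypothetical.

```python
from collections import Counter


def upstream_kozak(transcript: str, start_index: int, length: int = 9) -> str:
    """Return the `length` nucleotides immediately upstream of the start codon.

    `start_index` is the 0-based position of the "A" of the AUG/ATG start codon.
    """
    seq = transcript.upper().replace("U", "T")
    if seq[start_index:start_index + 3] != "ATG":
        raise ValueError("start_index does not point at an ATG start codon")
    if start_index < length:
        raise ValueError("5' UTR is shorter than the requested window")
    return seq[start_index - length:start_index]


def position_frequencies(kozaks):
    """Per-position nucleotide frequencies; index 0 corresponds to position -9."""
    n = len(kozaks)
    table = []
    for i in range(len(kozaks[0])):
        counts = Counter(k[i] for k in kozaks)
        table.append({nt: counts.get(nt, 0) / n for nt in "ACGT"})
    return table


def consensus(freq_table):
    """Most frequent nucleotide per position (ties resolved in A, C, G, T order)."""
    return "".join(max("ACGT", key=lambda nt: col[nt]) for col in freq_table)


# Toy transcript: 5 nt of leader, a 9 nt Kozak window, then the start codon.
toy = "GGGAA" + "GCCGCCACC" + "ATGGTG"
print(upstream_kozak(toy, start_index=14))
```

In practice the start codon position would come from the annotated coding sequence of each NCBI record rather than being supplied by hand.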
The RNAfold web server was used to analyze transcript sequences identical but for the Kozak −9 to −1 nucleotides. To directly compare the influence of different Kozak sequences on mRNA secondary structure predictions, a GFP transcript was used as the reference sequence. The −9 to −1 Kozak of 75 retinal genes were inserted in this GFP transcript, and the RNAfold predictions generated and compared.
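The comparison above depends on transcripts that are identical except for the −9 to −1 window. A minimal sketch of that substitution step follows, with the folding step left as a caller-supplied function because the study used the RNAfold web server. The names `swap_kozak`, `mfe_by_gene`, and `fold_mfe` are hypothetical, as is the gene label in the example. With the ViennaRNA Python bindings installed, a wrapper such as `lambda s: RNA.fold(s)[1]` could plausibly be supplied as `fold_mfe`, but that is an assumption and is not exercised here.

```python
from typing import Callable, Dict


def swap_kozak(reference: str, start_index: int, kozak: str) -> str:
    """Replace the nucleotides immediately upstream of the start codon with `kozak`.

    `start_index` is the 0-based position of the "A" of the start codon.
    The transcript length is unchanged.
    """
    k = len(kozak)
    if start_index < k:
        raise ValueError("not enough 5' UTR to hold the Kozak sequence")
    return reference[:start_index - k] + kozak + reference[start_index:]


def mfe_by_gene(reference: str, start_index: int, kozaks: Dict[str, str],
                fold_mfe: Callable[[str], float]) -> Dict[str, float]:
    """Predicted MFE (kcal/mol) for each gene's Kozak placed in the reference transcript.

    `fold_mfe` is any function mapping a sequence to a predicted MFE.
    """
    return {gene: fold_mfe(swap_kozak(reference, start_index, k))
            for gene, k in kozaks.items()}


# Toy reference: the classic Kozak in front of a start codon.
toy_reference = "GGGAA" + "GCCGCCACC" + "ATGGTG"
print(swap_kozak(toy_reference, 14, "ACCGAGACC"))
```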
Primers were designed to create the original Kozak consensus (K SDM FW AAT ACG ACT CAC TAT AGG
Short tandem repeat (STR) profiling for cell line authentication of HEK 293T/17, Y-79, and WERI-RB-1 cell lines was performed before the luciferase assays (Eurofins Genomics, Wolverhampton, UK). Corning white Costar 96-well plates (Corning Optical Communications, Flintshire, UK) were coated with 0.2 mg/ml of poly-D-lysine hydrobromide (Merck Life Science UK Limited, Gillingham, UK) for 5 min at 37 °C, and the wells were rinsed with water before applying 0.05 mg/ml of human fibronectin (Merck Life Science) and incubating for 30 min at room temperature. After removal of the human fibronectin solution, plates were air dried for 2 h. HEK 293T/17 cells were seeded at 2E+05 cells/ml in high-glucose DMEM without phenol red, supplemented with pyruvate, L-glutamine, 10% fetal bovine serum (FBS), and 1% penicillin and streptomycin (DMEM, pyruvate, and FBS were purchased from Life Technologies Ltd, Paisley, UK; L-glutamine and penicillin-streptomycin were purchased from Merck Life Science). Y-79 and WERI-RB-1 cells were seeded at 5E+05 cells/ml in RPMI 1640 with HEPES and NaHCO3 (Life Technologies Ltd, Paisley, UK) supplemented with 1% L-glutamine, 1% penicillin and streptomycin, and 10% FBS. A transfection mix of Opti-MEM (Life Technologies Ltd), ViaFect Transfection Reagent (Promega), and 400 ng of plasmid was prepared per well and applied immediately after cell seeding. Samples were incubated for 72 h at 37 °C and 5% CO2.
The Dual-Glo luciferase kit (Promega) was used following the manufacturer guidelines. Briefly, 50 µl of media was removed from each well, and 60 µl of Dual-Glo Solution was added. Samples were left to incubate at room temperature for 30 min, after which the luminescence of the control luciferase (firefly) was determined. Then, 60 µl of Dual Stop&Glo Solution was added to each well, and the samples were left to incubate at room temperature for 30 min. The luminescence of the test luciferase (Renilla) was then determined. All luminescence readings were taken with a FLUOstar Omega device (BMG Labtech, Aylesbury, UK). Data are presented as levels of Renilla activity relative to firefly activity.
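The normalization described above, Renilla activity relative to firefly activity, amounts to a per-well ratio. The sketch below shows that calculation and the percentage difference between two constructs. The luminescence readings are invented for illustration and the function names are hypothetical. It is not the authors' analysis script.

```python
from statistics import mean


def relative_activity(renilla, firefly):
    """Renilla luminescence divided by firefly luminescence, one ratio per well."""
    if len(renilla) != len(firefly):
        raise ValueError("each well needs one Renilla and one firefly reading")
    return [r / f for r, f in zip(renilla, firefly)]


def percent_change(test_mean, control_mean):
    """Percentage difference of the test construct relative to the control."""
    return 100.0 * (test_mean - control_mean) / control_mean


# Hypothetical readings (arbitrary luminescence units), three wells per construct.
classic = relative_activity([1000.0, 1100.0, 900.0], [2000.0, 2200.0, 1800.0])
retina = relative_activity([1080.0, 1188.0, 972.0], [2000.0, 2200.0, 1800.0])
print(round(percent_change(mean(retina), mean(classic)), 1))
```

Dividing by the firefly signal from the same well corrects for well-to-well differences in transfection efficiency and cell number, which is why the two Renilla constructs can be compared directly.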
Chi-square tests were performed to compare nucleotide frequencies using previously published human nucleotide frequency data from 32,526 human genes as the reference values. Normal distribution of all luciferase assay data was confirmed (by Anderson-Darling, D’Agostino & Pearson, Shapiro–Wilk, and Kolmogorov–Smirnov tests, Appendix 1); a two-way analysis of variance (ANOVA) with multiple comparisons was performed on the dataset.
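For a single nucleotide position, a comparison against reference frequencies can be sketched as a chi-square goodness-of-fit calculation. The exact test configuration used in the study is not stated, so this form is an assumption, and the counts and reference proportions below are invented for illustration. The critical value 7.815 corresponds to three degrees of freedom (four nucleotides) at a significance level of 0.05.

```python
def chi_square_statistic(observed_counts, reference_freqs):
    """Pearson chi-square statistic for observed counts against reference proportions."""
    total = sum(observed_counts.values())
    stat = 0.0
    for nt, p in reference_freqs.items():
        expected = total * p
        if expected <= 0:
            raise ValueError("reference proportions must be positive")
        stat += (observed_counts.get(nt, 0) - expected) ** 2 / expected
    return stat


# Chi-square critical value for 3 degrees of freedom at alpha = 0.05.
CRITICAL_DF3_ALPHA_0_05 = 7.815

# Hypothetical counts at one position across a 135-gene panel,
# and hypothetical reference proportions for the same position.
observed = {"A": 60, "C": 15, "G": 45, "T": 15}
reference = {"A": 0.4, "C": 0.2, "G": 0.3, "T": 0.1}

stat = chi_square_statistic(observed, reference)
print(round(stat, 3), stat > CRITICAL_DF3_ALPHA_0_05)
```

A statistic above the critical value would indicate that the panel's nucleotide usage at that position differs from the reference at the 0.05 level. A dedicated statistics package would normally be used to obtain the p value itself.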
Sequences for 135 genes with known retina-specific functions or causes of inherited retinal disease (but not syndromic disorders) as listed on RetNet were extracted for comparison of nucleotides at positions −9 to −1 (Appendix 2). The nucleotide frequencies at each position were determined (Table 1, Figure 1A) and compared with the preferences reported by analysis of 32,526 human gene sequences. This analysis revealed comparable nucleotide frequencies for the 135 genes associated with inherited retinal disease and the classic Kozak consensus (Table 1). Only position −3 showed a significant variation in nucleotide preference between our 135 gene panel and the reference dataset (p<0.011). This did not reflect a change in the dominant nucleotide or the order of nucleotide preference of A>G>C>T at this position; instead, it reflected a difference in the frequency of the pattern of these nucleotides.
As the original panel contained genes with functions in other cell types, and our interest was investigating an appropriate Kozak consensus for use in retinal gene therapy vectors, the panel was refined to include only genes for which roles in other cell types are not known. This was achieved by investigating each gene in the original panel using the human gene database GeneCards. Genes for which only retinal roles (in any cell type) are currently known were selected for subgroup analysis, providing a 75-gene panel. Nucleotide frequency analysis of this refined gene panel maintained the Kozak consensus identified from the original panel of 135 genes (Figure 1B, Appendix 3). The only difference was at position −9; however, the nucleotide preference at this position was marginal in both panels.
It has previously been shown that mRNA transcripts with more efficient Kozak sequences provide greater MFE values [18,19]. To directly compare the influence of each Kozak sequence of the 75-gene panel with retina-specific roles on MFE, a reference mRNA transcript was generated. This was derived from a GFP reporter transgene (Figure 2), with the reference sequence containing the classic Kozak (GCCGCCACC) determined to have an MFE of −503.3 kcal/mol (Figure 3A). Each Kozak sequence from the 75-gene panel with retina-specific roles was applied in the GFP transcript, and the MFE was determined (Appendix 3).
Of the 75 genes with retina-specific roles, 38 generated a transcript MFE greater than the reference sequence containing the classic Kozak consensus. Because transcripts with Kozak sequences that provide greater MFE values have been associated with more efficient translation rates, the −9 to −1 sequences for these 38 genes were extracted (Appendix 4). The nucleotide frequencies for this refined panel of genes were determined (Table 2, Figure 4). At all positions except −8 and −1, the nucleotide frequencies of the 38 genes with retina-specific roles were significantly different (p<0.05) from the classic Kozak consensus, and the nucleotide preference was changed at three of the nine positions. At the other six nucleotide positions, the strength of the nucleotide preference increased compared with the classic reference frequencies. For example, the differences in −6 nucleotide frequencies between our data set and the reference panel were highly significant (p<0.00001), yet the preference for guanine did not change; rather, the strength of the preference increased while the use of thymine at this position decreased in our refined panel.
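The selection step above keeps the Kozak sequences whose transcript MFE is greater (less negative) than the reference, and then takes the most frequent nucleotide at each position. A minimal sketch under those assumptions follows. The gene labels, MFE values, and sequences are placeholders rather than data from Appendix 3 or 4, and the function names are hypothetical.

```python
from collections import Counter


def select_above_reference(mfe_by_gene, reference_mfe):
    """Genes whose predicted transcript MFE is greater (less negative) than the reference."""
    return sorted(g for g, mfe in mfe_by_gene.items() if mfe > reference_mfe)


def consensus_from(kozaks):
    """Most common nucleotide per column; ties go to the nucleotide seen first."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*kozaks))


# Placeholder values for illustration only.
toy_mfe = {"geneA": -501.0, "geneB": -505.2, "geneC": -499.8}
toy_kozak = {"geneA": "ACCGAGACC", "geneB": "GCCGCCACC", "geneC": "ACCGCGACC"}

kept = select_above_reference(toy_mfe, reference_mfe=-503.3)
print(kept)
print(consensus_from([toy_kozak[g] for g in kept]))
```

A plain majority vote ignores how strong the preference is at each position, which is why the frequency tables (Table 2) remain the more informative summary.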
Our analysis identified a potential optimized retina Kozak consensus of ACCGAGACC (Figure 4). When this −9 to −1 consensus was applied to the GFP transcript sequence, the MFE increased from −503.3 kcal/mol containing the classic Kozak sequence, to −500.1 kcal/mol and was predicted to remove a stem-loop structure immediately upstream of the AUG start codon (Figure 3B).
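The positions at which the retina Kozak consensus departs from the classic sequence can be checked directly. The short sketch below compares the two nine-nucleotide strings given in the text and reports mismatches using the −9 to −1 numbering. The function name is hypothetical.

```python
def differing_positions(a: str, b: str):
    """Positions (numbered -len .. -1 relative to the start codon) where a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    n = len(a)
    return [i - n for i, (x, y) in enumerate(zip(a, b)) if x != y]


CLASSIC_KOZAK = "GCCGCCACC"
RETINA_KOZAK = "ACCGAGACC"

print(differing_positions(CLASSIC_KOZAK, RETINA_KOZAK))  # [-9, -5, -4]
```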
Our research group has reported on two ongoing clinical trials delivering coding sequences for Rab escort protein 1 (REP1) and retinitis pigmentosa GTPase regulator (RPGR), and the transgenes for these two vectors contain different versions of the classic Kozak consensus. The REP1 transgene uses the original REP1 cDNA with the −9 to −1 sequence GGCGGCACC, and the predicted transcript has an MFE of −819.1 kcal/mol. When the Kozak sequence was changed to the optimized retina Kozak of ACCGAGACC, the MFE for this transcript increased to −812.1 kcal/mol. In contrast, the RPGR clinical trial transgene contains the Kozak sequence GGGGCCACC, and the MFE for the predicted transcript was determined to be −1090.37 kcal/mol. When the retina Kozak identified above was applied in the RPGR transcript, the MFE was also increased to −1086.67 kcal/mol. Given the sizes of these clinical trial transcript sequences, it was interesting to observe that changing the sequence of −9 to −1 nucleotides in the 5′ untranslated region (UTR) was predicted to change the MFE in both cases, although by a limited amount. When considering the differences in Kozak consensus between these vectors, it should be noted that REP1 is expressed ubiquitously, whereas RPGR is expressed in rod and cone photoreceptors only.
As the identification of an optimized Kozak consensus from genes with specific roles in the retina was derived from human sequences, testing of this sequence would be most relevant in human cells. To this end, the retinoblastoma cell lines Y-79 and WERI-RB-1 were selected for an in vitro comparison of the original Kozak consensus with the optimized retina Kozak consensus, along with HEK 293T/17 cells.
STR profiling confirmed the identity of these cell lines, and to provide the most sensitive output, a dual-luciferase assay was designed using the Psi-CHECK2 construct. In this plasmid, firefly luciferase was driven by an identical expression cassette in both constructs, and the Renilla luciferase expression cassette differed only in the −9 to −1 Kozak consensus (Figure 5A). RNAfold predictions of the original Kozak-Renilla and retina Kozak-Renilla transcripts provided an MFE of −472.20 kcal/mol and −471.80 kcal/mol, respectively, an increase in MFE that aligned with previous transcript predictions described above.
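To put the predicted MFE changes side by side, the short sketch below tabulates the values reported in the text for the GFP, REP1, RPGR, and Renilla transcripts and computes each change in kcal/mol and as a percentage of the original magnitude. Only the arithmetic is added here. The variable and function names are hypothetical.

```python
# (transcript, MFE with the existing Kozak, MFE with the retina Kozak), kcal/mol,
# taken from the values reported in the text.
REPORTED_MFE = [
    ("GFP", -503.3, -500.1),
    ("REP1", -819.1, -812.1),
    ("RPGR", -1090.37, -1086.67),
    ("Renilla", -472.20, -471.80),
]


def mfe_changes(rows):
    """Return (name, change in kcal/mol, change as % of the original magnitude)."""
    out = []
    for name, before, after in rows:
        delta = round(after - before, 2)
        percent = round(100.0 * delta / abs(before), 2)
        out.append((name, delta, percent))
    return out


for name, delta, percent in mfe_changes(REPORTED_MFE):
    print(f"{name}: {delta:+.2f} kcal/mol ({percent:.2f}% of the original magnitude)")
```

In every case the change is below 1% of the total predicted folding energy, consistent with the description of the shifts as limited.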
Despite the small change in MFE prediction for the Renilla transcripts, the Kozak consensus change influenced the levels of Renilla activity in vitro (p<0.0001, Figure 5B). The optimized retina Kozak consensus provided approximately 7%–9% more Renilla activity relative to firefly in all three cell lines (9% in Y-79, p = 0.0289; 6.9% in WERI-RB-1, p = 0.0234; 7.6% in HEK 293T/17, p = 0.0032). It is worth noting that the transfection efficiencies were not saturated in any of the cell lines (Appendix 5). The retinoblastoma cell lines did not transfect as efficiently as HEK 293T/17 cells did, where the latter consistently achieved transfection rates of 80%–90%. This was compared with 50%–60% of Y-79 cells and 20%–30% of WERI-RB-1 cells. Furthermore, the transfection rates of each cell line were consistent between replicates, and luciferase expression levels were detectable with the Dual-Glo assay. Hence, when tested in vitro using a reliable plasmid assay, the slightly modified Kozak consensus derived from retina-specific genes led to small but consistent increases in translation in three independent human cell lines of retinal and neuronal lineage.
The Kozak sequence has been known as an important feature for achieving efficient translation for decades following the pioneering work of Marilyn Kozak. A Kozak consensus for vertebrate genes was identified in 1987 and has since been confirmed as being GCCGCCRCC for positions −9 to −1 relative to the AUG initiation codon in human genes [2,5,6]. Variations in this sequence have been shown to influence the efficiency of translation and cause human disease [9-13].
It has been identified in Drosophila that transcripts with weak Kozak sequences are enriched in neuron-related genes, indicating an important role for a neuron-related Kozak consensus that varies from other genes. That study indicated that the impact of the Kozak sequence on translation efficiency is related to the elongation and initiation rates of specific cell types, suggesting that, for a given cell or tissue type, the preferred Kozak arrangement may differ from the classic Kozak consensus. Given the inclusion of a Kozak sequence in all gene therapy transgenes, it seems unusual not to consider the potential optimization of this sequence when designing tissue-specific vectors. Critical features of retinal gene therapy vectors are that they transduce the cell type of interest and efficiently produce the desired therapeutic product. The more optimized and efficient a vector is, the lower the dose required to achieve a therapeutic outcome while avoiding the potential for toxic responses from an increased number of vector particles.
Many aspects of transgene design have already been considered and optimized, such as refinement of capsid choices; cell-specific promoters, including synthetic promoters; and refined regulatory elements that enhance transcript stability and translation efficiency [15,16]. Given that vectors for gene therapy have undergone many types of optimization to ensure they are as efficient as possible, in this study, we considered the Kozak sequence for the first time. To further investigate transgene design for retinal gene therapy, we screened retinal genes to identify whether a different Kozak consensus should be considered for incorporation into transgenes for the treatment of inherited retinal disease. Using in silico tools, the data presented here identified a prediction for an efficient Kozak consensus derived from genes with retina-specific roles. The sequence we propose is based on RNAfold predictions and the link previously shown between greater MFE values and translation efficiency [18,19]. It has previously been suggested that stable secondary structures in the 5′UTR impede translation initiation [26,27]. Our secondary structure predictions indicated that our refined retina Kozak sequence removed stable secondary structure formation in the 5′UTR, although it should be noted that the MFE may not always represent the native conformation, and this form of prediction is limited. Indeed, some studies have shown no significant relationship between predicted mRNA folding energy and translation efficiency. However, in this study, we showed that implementation of an optimized Kozak consensus could generate more luciferase activity compared with an identical construct containing the original Kozak consensus. The optimized Kozak consensus was derived from genes with retina-specific roles with the intention of enhancing transgenes for retinal gene therapy, and improvements from this consensus were evident in human retinoblastoma cell lines.
However, the RNAfold predictions identified an increase in MFE that could provide a translational benefit regardless of the cell type, which proved to be the case when we found that an improvement was also achieved in HEK 293T/17 cells. It may be that, were the assessment process described here applied to other genes with tissue-specific functions, a similar optimized Kozak consensus could also be achieved.
In this study, for the first time, we considered the potential for implementing a retina-derived Kozak consensus for gene therapy vectors used in the treatment of inherited retinal disease. We identified a new sequence that differs in three of the nine nucleotide positions, providing an enhancement over the classic Kozak consensus currently used in retinal gene therapy vectors.
Appendix 1. Normality assessments of the luciferase assay data.
Appendix 2. 135 genes associated with inherited retinal disease.
Appendix 3. −9 to −1 nucleotide sequences of 75 genes with retina-specific functions.
Appendix 4. −9 to −1 nucleotide sequences of 38 genes with retina-specific functions.
Appendix 5. Example HEK 293T/17, Y-79 and WERI-RB-1 cell images.
Funding was provided by the NIHR Oxford Biomedical Research Centre, Fight for Sight UK, Retina UK, Wellcome Trust Medical and Life Sciences Translational Fund and the Royal College of Surgeons of Edinburgh. Author contributions: conceptualisation, M.E.M, R.E.M; methodology, M.E.M; formal analysis, M.E.M; investigation, M.E.M, A.B, E.P, C.P; writing - original draft, M.E.M; writing - review & editing, M.E.M, A.B, E.P, C.P, R.E.M; supervision, R.E.M; funding acquisition, R.E.M.