Molecular Vision 2004;
Received 4 March 2004 | Accepted 30 August 2004 | Published 8 October 2004
Novel retinal genes discovered by mining the mouse embryonic RetinalExpress database
Shuguang Liang,1 Sheng Zhao,2 Xiuqian Mu,1
Terry Thomas,3,4 William H.
Departments of 1Biochemistry and Molecular Biology and 2Biostatistics, The University of Texas M. D. Anderson Cancer Center, Houston, TX; 3Department of Biology and 4The Laboratory for Functional Genomics, Texas A & M University, College Station, TX
Correspondence to: William H. Klein, Ph.D., Department of Biochemistry and Molecular Biology, The University of Texas M. D. Anderson Cancer Center, 1515 Holcombe Boulevard, Unit 117, Houston, TX, 77030; Phone: (713) 792-3646; FAX: (713) 563-2968; email: firstname.lastname@example.org
Purpose: Bioinformatics has emerged as a powerful tool for identifying novel genes and pathways associated with retinal biology and disease. The developing mouse retina expresses an exceedingly large and complex variety of genes. Many of these genes have not been characterized but nevertheless are likely to have important developmental or physiological functions. The purpose of this study was to use an in silico approach with a mouse embryonic retinal database of cDNAs/expressed sequence tags (ESTs) named RetinalExpress to identify previously uncharacterized genes that are represented in the developing retina.
Methods: cDNA clones unique to the RetinalExpress database were identified by comparing clones in the RetinalExpress database with those in other cDNA/EST databases. We used a hierarchical filtering procedure with high stringency criteria that included sequence quality, colinearity with hypothetical gene sequences, and absence of any substantial existing annotation to select clones that were likely to represent novel genes. Selected clones were located on mouse chromosomes using National Center for Biotechnology Information Map Viewer software and the database from the University of California at Santa Cruz Genome Bioinformatics Web browser. The expression of selected retinal transcripts was determined using reverse transcriptase (RT)-PCR. In situ hybridization of sectioned embryonic and postnatal retinas was performed to determine spatial expression patterns of selected transcripts.
Results: Of the 27,765 cDNA clones from RetinalExpress that we filtered through several public cDNA/EST databases, 26 cDNA/EST sequences were identified that, at the time of the analysis, were unique to RetinalExpress. Seventeen clones were selected for RT-PCR analysis, and retinal transcripts corresponding to previously uncharacterized genes were unambiguously detected for six clones. Three genes encoded open reading frames containing putative functional domains; one sequence contained an HMG DNA binding domain, another, an RFX DNA binding domain, and another, a phospholipase C catalytic domain X. Transcripts from the genes encoding DNA binding domains were expressed in embryonic and postnatal retinas with distinct spatial patterns.
Conclusions: The characterization of 26 mouse genes whose partial nucleotide sequences were uniquely represented in the RetinalExpress cDNA/EST database demonstrated the feasibility of retinal gene discovery using in silico analysis. Two of these genes had distinctive spatial expression patterns in the retina and one was likely to function as a DNA binding protein in embryonic and postnatal retinas. The gene identification approach described here demonstrates the usefulness of establishing large cDNA/EST databases from highly specialized neuronal tissues such as the retina to find novel genes.
Even a cursory inspection of the available retinal databases of cDNAs/expressed sequence tags (ESTs) reveals the existence of thousands of expressed sequences with unknown functions [1,2] (see NEIBank). Although many of these uncharacterized sequences are likely to represent genes with widespread expression patterns, it is just as likely that a substantial number have specialized expression patterns and functions in the retina. A complete understanding of retinogenesis and retinal disease will be realized only after the functions of most of these uncharacterized genes have been elucidated. Several general techniques are presently being applied to identify novel retinal genes. These include cloning genes with mutations that yield variant retinal phenotypes [3,4], searching for genes with domains or motifs already known to be functionally important in well characterized genes, and screening for genes with restricted expression patterns using high throughput differential hybridization. In silico approaches are also being used now that databases and software tools are available to perform large scale comparative analyses.
Several issues regarding retinal databases must be considered for optimal use in gene discovery. Retinal databases contain heterogeneous mixtures of nucleotide sequences that reflect the methods used to construct cDNA libraries and databases [1,2]. Sequences can correspond to those of previously characterized genes, uncharacterized ESTs from other databases, or novel sequences with no representation in available cDNA/EST databases. In addition to sequences with open reading frames (ORFs), some database member sequences contain mostly 3' untranslated regions (UTRs), particularly when the cDNA was synthesized using oligonucleotide dT primers, or regions partially composed of unprocessed transcripts. Although cDNA/EST database member composition can be complex, given the current tools for comparing sequences in different cDNA/EST databases and for accessing the mouse and human genomes, it should be possible to identify novel retinal genes solely by appropriately mining a large retinal cDNA/EST database.
Our laboratory has devoted considerable effort to establishing a mouse embryonic retinal cDNA/EST database called RetinalExpress. In the text, all clones derived from RetinalExpress are named with the prefix "Rx". This database was generated by sequencing the 5' ends of cDNA clones from an embryonic day 14.5 (E14.5) mouse arrayed retinal cDNA library. To date, RetinalExpress has more than 27,000 sequenced clones representing approximately 12,000 distinct genes, all of which have been annotated by functional class, chromosome position, and National Center for Biotechnology Information (NCBI) UniGene cluster group (see RetinalExpress). The 27,765 RetinalExpress sequences that were deposited in GenBank were filtered through RepeatMasker to produce high quality sequences, and random sampling shows that the vast majority of the sequences represent exons or portions of exons from either characterized or hypothetical genes. RetinalExpress has served as a platform for generating high density microarrays for use in transcription profiling of embryonic and postnatal retinas from brn3b and math5 knockout mice (unpublished data) [1,7].
Because the embryonic retina is a small, highly specialized neuronal tissue, many retinal expressed sequences are likely to be underrepresented in cDNA/EST databases derived from combinations of multiple tissues or from individual nonretinal tissues. RetinalExpress provides vision researchers with a source of embryonic retinal sequences that are not readily accessible from other databases. The goal of this study was to screen RetinalExpress for just a small number of uniquely expressed cDNAs in early to mid retinogenesis before the terminal differentiation of rod photoreceptors. Our thinking is that these cDNAs would be critical for retinal progenitor cell proliferation and commitment or for retinal ganglion cell (RGC) differentiation. We used an in silico approach to identify nucleotide sequences that were present in RetinalExpress but not in any other available cDNA/EST database. Our strategy was to filter the RetinalExpress cDNAs/ESTs through public EST databases and then apply a final high stringency filter to identify only those sequences that were greater than 550 bp and had virtually no simple repeat elements. By selectively filtering 27,765 cDNA sequences on the basis of these criteria, we identified a limited set of 26 cDNA clones whose sequences were, at the time of the analysis, unique to RetinalExpress and appeared to represent portions of uncharacterized genes. Three of the 26 characterized sequences were shown to represent novel retinal expressed genes; two encoding putative DNA binding proteins, and another encoding a novel protein containing a phospholipase C catalytic domain X.
Details on the construction, management, and properties of the RetinalExpress database were described previously. We emphasize that the sequences in RetinalExpress are high quality sequences that were filtered through RepeatMasker to remove long stretches of repetitive sequence elements and low quality sequences before being deposited in GenBank. RetinalExpress cDNAs/ESTs largely represent exons or portions of exons from known or hypothetical genes, although a small percentage represent introns. Virtually all RetinalExpress sequences, as determined by random sampling, are represented in the mouse genome. The strategy described below used a second high stringency sequence quality filter, with parameters adjusted so that cDNAs with >10 bp of repetitive sequence and/or <550 bp in length were removed, to identify just a limited set of unique RetinalExpress cDNAs/ESTs.
Database comparisons and sequence filtering
We performed database comparisons with the Riken full length enriched database (Y. Hayashizaki, The Institute of Physical and Chemical Research [Riken], Yokohama, Kanagawa 230-0045, Japan), the adult mouse retina National Institutes of Health (NIH)_MGC_94 database (dbEST Library ID.5390, R. L. Strausberg and C. K. Prange, Lawrence Livermore National Laboratory, Livermore, CA), the NCBI EST database, and the NCBI mouse UniGene database. The Riken and adult mouse retina databases were downloaded from the NCBI UniGene site to a local computer. The BLASTN search program, installed on a Compaq Tru64 UNIX V5.1B system running on a Compaq AlphaServer ES40 at our institution, was used to identify nucleotide matches among the databases in batch mode with default parameters. Best scores were saved to a file and post-processing of this file was done with an in-house computer program. Positive matches were defined as those with probability values of <1×10⁻²⁰ (e-20). The selection of e-20 was based on visual inspection of the nucleotide sequence alignment and e-values of test sequences with 90-95% identity. The RetinalExpress database of 27,765 cDNAs/ESTs, which was originally filtered through RepeatMasker, was compared to the above databases as unmasked sequences to ensure that the BLASTN searches encompassed the full length sequence. After the original filtering, we performed additional filtering to remove cDNAs/ESTs that had >10 bp of repetitive sequence and/or were <550 bp using RepeatMasker software obtained from Washington University. This final filter was inserted to remove any ambiguities and limit our search to just those sequences that were unique to RetinalExpress. The logic of our filtering was such that this stringent quality control step could have been inserted anywhere in the hierarchy.
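The post-processing step described above (keeping only clones with no positive match below the e-20 threshold) can be illustrated with a minimal Python sketch. This is not the in-house program used in the study. The tab-separated layout (query ID, subject ID, e-value), the function names, and the clone IDs are assumptions made purely for illustration.

```python
from typing import Iterable, Set

E_VALUE_CUTOFF = 1e-20  # threshold for a positive match, as stated in the text


def matched_queries(hit_lines: Iterable[str], cutoff: float = E_VALUE_CUTOFF) -> Set[str]:
    """Return query IDs that have at least one hit with an e-value below the cutoff.

    Assumed (hypothetical) input layout: query_id <TAB> subject_id <TAB> e_value
    """
    matched = set()
    for line in hit_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        query_id, _subject_id, e_value = line.split("\t")[:3]
        if float(e_value) < cutoff:
            matched.add(query_id)
    return matched


def unmatched_queries(all_query_ids: Iterable[str], hit_lines: Iterable[str],
                      cutoff: float = E_VALUE_CUTOFF) -> Set[str]:
    """Return query IDs with no positive match, i.e., candidates unique to the query set."""
    return set(all_query_ids) - matched_queries(hit_lines, cutoff)


# Hypothetical clone IDs and hits, for illustration only.
example_hits = [
    "clone_a\tdb|x\t1e-50",  # positive match: clone_a is removed
    "clone_b\tdb|y\t1e-5",   # too weak to count as a match
    "clone_b\tdb|z\t3e-12",  # still above the cutoff
]
print(sorted(unmatched_queries(["clone_a", "clone_b", "clone_c"], example_hits)))
```

In this sketch a clone with only weak hits (e-values at or above the cutoff) is treated the same as a clone with no hits at all, which is consistent with the definition of a positive match given above.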
Filtered sequences were positioned on the mouse genome using the database for Mus musculus from NCBI Map Viewer software and the database from the University of California at Santa Cruz (UCSC) Genomic Bioinformatics web site.
We used three prediction programs to identify full length transcripts associated with RetinalExpress cDNAs/ESTs: Genscan at the Massachusetts Institute of Technology, Twinscan at Washington University, which uses mouse-human homology comparisons, and the mouse genome in NCBI, already analyzed by Gnomon. Default parameters were used in all cases.
Retinal cDNA preparation and polymerase chain reaction
In all experiments using mice, the U. S. Public Health Service Policy on Humane Care and Use of Laboratory Animals was followed. Mice pregnant with embryos at E14.5, E16.5, and E17.5 were sacrificed, embryos removed, and total RNA, which would necessarily contain nuclear RNA with unspliced and partially spliced primary transcripts, was isolated from 30 manually dissected retinas using TRIZOL Reagent (Invitrogen Life Technologies, Carlsbad, CA). cDNA was synthesized by priming with oligo(dT) using the SuperScript first strand synthesis system for reverse transcriptase-polymerase chain reaction (RT-PCR, Invitrogen Life Technologies). Total RNA (1 μg) in 12 μl of diethylpyrocarbonate (DEPC) treated water was used as a template after denaturation at 80 °C for 5 min. The reaction mixtures were then incubated at room temperature for 5 min, 42 °C for 1 h, 50 °C for 10 min, and 75 °C for 20 min. RNase H (1 μl, 2 U/μl) was added, and the mixture was incubated at 37 °C for 20 min to digest the RNA. The PCR primers used are shown in Table 1.
Six of the 17 sequences yielded positive PCR products. The PCR conditions for Rx-0612-84, Rx-0691-17, Rx-2083-71, Rx-3193-74, Rx-3132-51, and Rx-3201-21 were 94 °C for 45 s, 52 °C for 45 s (35 cycles), and 72 °C for 1 min. PCR products were cloned using the pGEM-T easy vector system (Promega, Madison, WI) and were sequenced at The University of Texas M. D. Anderson Cancer Center DNA Analysis Core Facility. To obtain purified DNA suitable for sequencing, 5 μl of the PCR product (about 100 ng/μl) was added to 2 μl of ExoSAP-IT (USB Corp., Cleveland, OH) and incubated at 37 °C for 15 min, followed by additional incubation at 80 °C for 20 min. The primers were removed from the PCR product by ExoSAP-IT before sequencing. The sequences were analyzed by two-way comparisons using BLAST against the NCBI predicted cDNAs or Map Viewer at NCBI against the mouse genome.
In situ hybridization
Digoxigenin (DIG) labeled antisense RNA probes were made by in vitro transcription from pGEM-T templates with T7 RNA polymerase (Ambion, Austin, TX), followed by ethanol precipitation. E14.5 embryos and eyes from postnatal (P) 15-day mice were washed in phosphate buffered saline (PBS), fixed in 4% paraformaldehyde-PBS, washed in PBS containing 1% Tween-20, dehydrated through a graded series of methanol in PBS containing 1% Tween-20 at 4 °C, and wax embedded and sectioned at 6 μm onto glass slides. The sections were de-waxed in xylene and then treated with proteinase K for 7 min, postfixed in 4% paraformaldehyde-PBS for 20 min, and washed in PBS and 2X SSC for 5 min each. The sections were dehydrated through a graded series of ethanol and then air dried.
The sections were hybridized overnight at 65 °C in hybridization buffer (50% formaldehyde, 5X SSC (pH 4.5-5.0), 1% sodium dodecyl sulfate, 50 μg/ml yeast tRNA, and 50 μg/ml heparin) containing the DIG labeled probe. After the hybridization, the sections were washed three times for 30 min each in prewash solution (50% formaldehyde, 1X SSC, 0.1% Tween-20) at 65 °C and twice in 100 mM maleic acid, 150 mM NaCl, 0.1% Tween-20 (MABT) at 65 °C. The sections were then preincubated in 1X MABT, 2% blocking reagent (for nucleic acid hybridization and detection, Roche, Basel, Switzerland) and 10% sheep serum for 2 h, followed by incubation in 1X MABT, 2% blocking reagent, and 10% sheep serum with a 1:2000 dilution of anti-DIG (Roche) at room temperature overnight. After the sections were washed five times for 30 min each with 1X MABT and once with 100 mM Tris-HCl (pH 9.5), 100 mM NaCl, 50 mM MgCl2, 0.1% Tween-20, and 0.048% levamisole, the hybridization signals were visualized using a BM Purple substrate (Roche) at room temperature for the desired time. The images were photographed using a digital camera interfaced to the microscope.
Mouse tissue gene expression scanning
The mouse rapid-scan gene expression panel (Origene Technologies, Rockville, MD), containing cDNA pools that were reverse transcribed from RNA extracted from 24 different mouse tissues, was used for PCR with the primers shown in Table 1 to identify transcripts from nonretinal tissues. The PCR conditions were the same as described above. The cDNAs were normalized against a β-actin standard.
Properties of RetinalExpress
To date, the RetinalExpress cDNA/EST cluster set contains 27,765 members (i.e., individual cDNA clones). Cluster sizes range from one (12,793 members represented as singletons) to 291 (one member represented 291 times), with 75% of the clones represented in cluster sizes of one to five. Thus, the database represents a highly diverse expressed sequence population, with most members appearing at frequencies of less than 2×10⁻⁴. The high complexity and low abundance of expressed sequences that we found in the E14.5 retina suggested that many low frequency sequences are uniquely represented in RetinalExpress. However, several factors may complicate in silico searches for new retinal genes. For example, the cDNAs in RetinalExpress often do not represent full length transcripts, so two or more cDNA clones could derive from different regions of the same gene. In an earlier analysis, we used known genes to estimate the number of different cDNA clones belonging to the same gene. This approach reduced the total number of genes represented in RetinalExpress by a factor of 1.23. We also observed that some uncharacterized cDNAs in the database were in fact 3' UTRs or in a few cases unprocessed introns that mapped to known genes. Identifying these sequences as parts of known genes required positioning them individually on the mouse genome within known genes. In addition, although random sampling showed that the large majority of sequences in RetinalExpress were represented in the mouse genome as exons or portions of exons, a small fraction of cDNA entries had sequencing ambiguities that made comparisons with ESTs in other databases using BLAST searches problematic. These complications were taken into account when we searched for unique nucleotide sequences in RetinalExpress.
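The frequency statement above follows from simple arithmetic: a sequence found in a cluster of size k appears at a frequency of k/27,765. The short Python sketch below reproduces that calculation from the counts given in the text; the function name is introduced here only for illustration.

```python
TOTAL_CLONES = 27_765  # members in the RetinalExpress cluster set (from the text)
SINGLETONS = 12_793    # members represented as singletons (from the text)


def member_frequency(cluster_size: int, total: int = TOTAL_CLONES) -> float:
    """Frequency at which a sequence seen `cluster_size` times appears in the database."""
    return cluster_size / total


# Cluster sizes of one to five all fall below the 2e-4 frequency quoted in the text.
for size in (1, 5, 291):
    print(f"cluster size {size:>3}: frequency {member_frequency(size):.2e}")

print(f"fraction of clones that are singletons: {SINGLETONS / TOTAL_CLONES:.1%}")
```

A cluster size of five is the largest that stays below 2×10⁻⁴, which matches the observation that most clones fall in clusters of one to five.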
Our strategy was to compare the RetinalExpress database with available databases and then apply a final high stringency sequence quality filter to remove any sequences that might result in ambiguities with regard to their uniqueness in RetinalExpress. Our goal was to use RetinalExpress to identify genes that were involved either in retinal progenitor cell proliferation and commitment or RGC differentiation.
Comparison of RetinalExpress with other mouse cDNA/EST databases
To obtain a maximum estimate of unique sequences, we first compared the 27,765 cDNA entries in RetinalExpress with the Riken mouse full length enriched database containing 60,770 entries and an adult mouse retinal database (dbEST mouse retina) containing 28,486 entries. We found that 11,234 RetinalExpress sequences were common to all three databases, 8,930 were represented in Riken but were not in the adult mouse retinal database, 1,359 were represented in the adult mouse retinal database but were not in Riken, and 6,242 RetinalExpress sequences were not represented in the other two databases (Figure 1). The results suggested that a substantial number of cDNA clones were unique to RetinalExpress.
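The four categories reported above partition the RetinalExpress entries, so their counts should sum to 27,765. The following Python sketch checks that sum and shows how such a partition could be derived with set operations. The function name and the toy clone IDs are hypothetical and are not taken from the study.

```python
from typing import Dict, Set


def partition_by_membership(query_ids: Set[str], in_riken: Set[str],
                            in_adult_retina: Set[str]) -> Dict[str, Set[str]]:
    """Split query IDs into four disjoint groups by presence in two comparison databases."""
    return {
        "all_three": query_ids & in_riken & in_adult_retina,
        "riken_only": (query_ids & in_riken) - in_adult_retina,
        "adult_retina_only": (query_ids & in_adult_retina) - in_riken,
        "neither": query_ids - in_riken - in_adult_retina,
    }


# Counts reported in the text for the comparison summarized in Figure 1.
REPORTED_COUNTS = {
    "all_three": 11_234,
    "riken_only": 8_930,
    "adult_retina_only": 1_359,
    "neither": 6_242,
}
print(sum(REPORTED_COUNTS.values()))  # should equal the 27,765 RetinalExpress entries

# Toy example with hypothetical clone IDs.
toy = partition_by_membership({"c1", "c2", "c3", "c4"}, {"c1", "c2"}, {"c1", "c3"})
print({group: sorted(ids) for group, ids in toy.items()})
```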
We next undertook a more stringent filtering approach. After removing RetinalExpress sequences that were represented in the Riken and adult mouse retinal databases, we filtered the remaining 6,242 RetinalExpress sequences through the NCBI EST database containing 3,636,920 ESTs. This approach resulted in 3,111 sequences represented in RetinalExpress but not in the aforementioned databases. We then subjected the filtered sequences to RepeatMasker to select only a limited set of sequences that would be unique candidates. Although the original 27,765 sequences had already been filtered through RepeatMasker, our strategy was to eliminate sequences that contained short stretches of simple repeats or ambiguous sequences to optimize our chances that unique sequences were identified. A high stringency filter was applied in which only those stretches >550 bp with <10 ambiguous bases or simple repetitive elements were considered suitable for a more detailed analysis. We chose to insert this filter at this point in the hierarchy but of course it could have been placed at any point with the same outcome. As we anticipated with such a stringent step, filtering the sequences through RepeatMasker further reduced the number of unique clones from 3,111 to 785.
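The high stringency criterion described above can be written as a simple predicate. The Python sketch below assumes that masked or ambiguous positions are marked as N, X, or lowercase letters in the sequence string; that representation, the function names, and the exact handling of the boundary values (550 bp and 10 bases) are assumptions for illustration and may differ from the settings actually used.

```python
MIN_LENGTH_BP = 550     # sequences must be longer than this (from the text)
MAX_FLAGGED_BASES = 10  # fewer than this many ambiguous or repeat bases (from the text)


def flagged_bases(masked_seq: str) -> int:
    """Count positions flagged as ambiguous or repetitive.

    Assumption: flagged positions appear as N/X or as lowercase letters.
    """
    return sum(1 for base in masked_seq if base in "NnXx" or base.islower())


def passes_quality_filter(masked_seq: str) -> bool:
    """Apply the length and repeat/ambiguity criteria described in the text."""
    return len(masked_seq) > MIN_LENGTH_BP and flagged_bases(masked_seq) < MAX_FLAGGED_BASES


# Hypothetical sequences, for illustration only.
clean_long = "ACGT" * 150                   # 600 bp, nothing flagged
too_short = "ACGT" * 100                    # 400 bp
repeat_rich = "ACGT" * 150 + "N" * 12       # 612 bp, 12 flagged bases
for name, seq in [("clean_long", clean_long), ("too_short", too_short),
                  ("repeat_rich", repeat_rich)]:
    print(name, passes_quality_filter(seq))
```

Because the predicate depends only on the sequence itself, it gives the same result regardless of where in the filtering hierarchy it is applied, which is the point made in the paragraph above.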
Identification of unique RetinalExpress cDNAs
NCBI Map Viewer and UCSC Genomic Bioinformatics are Web based systems that locate known and hypothetical genes on the mouse genome. Gene annotation is provided through Web links that supply information on specific genes; annotations can range anywhere in detail from a preliminary report of a new gene sequence to an extensive body of information on a well characterized gene. Hypothetical genes are predicted by Map Viewer linked sequence algorithms that assign exons and introns, ORFs, and transcriptional start and stop sites to previously uncharacterized genomic regions. For our analysis, we reduced our set of 785 unique RetinalExpress sequences to cDNAs/ESTs that had no substantial annotation and were within or adjacent to hypothetical genes. Using NCBI Map Viewer, we first positioned all 785 sequences on the mouse genome and then sorted them into those representing portions of hypothetical genes and those representing previously uncharacterized portions of known genes. We discarded 587 cDNAs/ESTs that we found overlapped with known genes, thereby yielding 182 cDNAs unique to RetinalExpress. A small program in Perl (available upon request) was run on a local computer to identify the 182 sequences. We then repeated the procedure using UCSC Genomic Bioinformatics, discarding an additional 159 sequences that were found to be parts of previously characterized genes. The number of unique RetinalExpress cDNAs/ESTs was thus reduced to 26 potentially novel retinal genes not reported in any other databases at the time of our analysis.
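To put the successive filtering steps in perspective, the sketch below tabulates the number of sequences reported as retained after each stage and the fraction of the original 27,765 clones that each number represents. Only the retained counts stated in the text are used; the stage labels and the function name are introduced here for illustration.

```python
# Number of sequences reported as retained after each stage (from the text).
FUNNEL = [
    ("RetinalExpress clones", 27_765),
    ("absent from Riken and adult retina databases", 6_242),
    ("absent from NCBI EST database", 3_111),
    ("passed high stringency quality filter", 785),
    ("retained after NCBI Map Viewer step", 182),
    ("retained after UCSC browser step", 26),
]


def retention_table(funnel):
    """Return (label, count, fraction of previous stage, fraction of starting set) per stage."""
    start = funnel[0][1]
    previous = start
    rows = []
    for label, count in funnel:
        rows.append((label, count, count / previous, count / start))
        previous = count
    return rows


for label, count, step_fraction, overall_fraction in retention_table(FUNNEL):
    print(f"{label:<48} {count:>6}  step {step_fraction:6.1%}  overall {overall_fraction:8.3%}")
```

The final set of 26 sequences corresponds to less than 0.1% of the starting clones, which illustrates how stringent the combined hierarchy is.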
ESTs unique to RetinalExpress
The 26 selected cDNAs/ESTs were dispersed over 13 chromosomes and were not identical to any reported genes in the mouse or human genome, although some had minimal annotation on gene structure (Table 2). The sequences were located in regions in which hypothetical genes had been predicted; the placement of the RetinalExpress cDNA/EST within each gene is shown in Figure 2 (see also Figure 3A). The sequences were located mostly in regions predicted to be exons, but they were also found in introns and other noncoding portions of the hypothetical genes (Figure 2). As might be expected with cDNAs synthesized using oligonucleotide dT primers, many of the cDNAs/ESTs were located near or at the 3' ends of the hypothetical genes (Figure 2). Interestingly, most of the unique RetinalExpress cDNAs/ESTs represented stretches of predicted genes that encompassed both exons and introns, suggesting either that they were unprocessed transcripts or that the exon/intron boundary predictions were inaccurate (Figure 2).
The information that we learned regarding the hypothetical genes was limited, but we did identify notable features for several of these genes. For example, nine of the genes (Rx-0243-48, Rx-0691-17, Rx-2044-45, Rx-2052-73, Rx-3132-51, Rx-3193-74, Rx-3201-21, Rx-7103-74, and Rx-8023-08) appeared to have human homologs (Table 2). Rx-2044-45 had a human homolog that is related to the RAR orphan receptor A, and Rx-3193-74 had the human homolog SET binding protein 1. The gene represented by Rx-2071-55 had sequence similarity (but was not identical) to a gene encoding ODZ3/glycoprotein m6a. Rx-0691-17 and Rx-3201-21 belonged to hypothetical genes containing RFX and HMG DNA binding domains, respectively. At the time we were compiling our report, a Riken cDNA (9930116O05) was reported to correspond to the hypothetical gene containing the Rx-0691-17 sequence (Table 2 and Figure 2).
We selected seventeen cDNAs/ESTs to determine whether transcripts predicted from the hypothetical genes were in fact present in embryonic retina. We designed PCR primer sets to amplify transcripts that spanned predicted exons (Table 1), using cDNA synthesized from RNA templates isolated from E14.5, E16.5, and E17.5 embryos. In some cases, the prediction programs did not agree with each other, and in these cases we generated primers to test all predictions. Of the seventeen cDNAs/ESTs tested, six gave predicted PCR products (Rx-0612-84, Rx-0691-17, Rx-2083-71, Rx-3201-21, Rx-3132-51, and Rx-3193-74), whereas the eleven others did not. Rx-0691-17, Rx-3201-21, and Rx-3132-51 are discussed below. Primers used for the Rx-0612-84, Rx-2083-71, and Rx-3193-74 PCR analysis yielded 1.2 kb, 0.7 kb, and 0.34 kb products, respectively (Figure 2). The reason for the lack of predicted products for the other eleven sequences was not clear. It is possible that the primers were not adequate for PCR amplification or that the transcripts were very rare. Alternatively, although RetinalExpress cDNAs/ESTs were necessarily derived from E14.5 retinal RNA, some transcripts from predicted exons might not have been, because of either tissue specific splicing or false exon assignment.
A novel HMG-box retinal expressed gene
Rx-3201-21 represented a unique EST that spanned exon 8 of the hypothetical gene LOC269389, whose revised exon structure was predicted by Gnomon on October 11, 2003 (Table 2 and Figure 3A). According to the revised Gnomon prediction, LOC269389 had nine exons totaling 1.7 kb and spanned 125 kb of genomic DNA (Figure 3A, upper row). Forty-one base pairs of Rx-3201-21 mapped to exon 8 of LOC269389, but an additional 1.4 kb extended beyond exon 8 into another unpredicted exon closer to the 3' end (Figure 3A, middle row). RT-PCR and subsequent sequencing analysis showed that exons 2-8 of LOC269389 were expressed in the embryonic retina, represented by a 1.1 kb transcript containing exons 4-8 (Figure 3A, primers p1 and p2; Figure 3B, lane 1), and a 1.3 kb transcript containing exons 2-8 (Figure 3A, primers p2 and p3; Figure 3B, lane 2). In addition, primer p5 for exon 1 and primer p2 yielded the expected 1.4 kb product in the retina (Figure 3B, lane 3) and matched in sequence with predicted sequences indicating that the predicted exons 1-8 were accurate. However, we found no evidence to indicate that predicted exon 9 was expressed in the retina since a primer (p4) within that exon failed to produce a transcript (Figure 3A, middle row). The unpredicted exon closest to the 3' end that contained Rx-3201-21 may have represented the bona fide last exon because the Rx-3201-21 sequence contained exon 8 and the unpredicted exon but skipped over a small stretch of genomic DNA that probably represented the last intron of LOC269389 (Figure 3A, middle row). Within the unpredicted exon, the Rx-3201-21 sequence contained the C-terminal end of a putative ORF and a 3' UTR with a poly-A addition site 20 bp from the poly-A tail (Figure 3A, middle row, and Figure 4A).
Our RT-PCR results therefore led us to a new prediction for the LOC269389 gene in which exon 1 predicted by Gnomon represented the 5' UTR, exon 2 represented the exon containing the first potential initiation codon (AUG) of the putative ORF, exons 2-8 were ORF containing exons, and the Gnomon predicted exon 9 was replaced by a new exon closest to the 3' end (Figure 3A). To provide further evidence for our model, we searched for other RetinalExpress cDNAs/ESTs that contained LOC269389 sequences. We identified Rx-3132-64, which overlapped with Rx-3201-21 and corresponded to exons 7, 8, and 9 of LOC269389 (Figure 3A). For the LOC269389 gene structure predicted by Genscan (Figure 3A, lower row), several additional exons were predicted that were probably inaccurate since we could not identify PCR products in the retina containing their sequences and the Genscan prediction differed substantially from that of the revised model of Gnomon.
LOC269389 has a 515 codon ORF that encodes a putative protein of 62 kDa (Figure 4A). Our search for putative domains and motifs within LOC269389 led to our identification of an HMG box (amino acid residues 245-315) that matched the key signature features of an HMG motif critical for DNA binding (Figure 4B). Alignment of the LOC269389 HMG box indicated that the closest matches were to the human IRV6, mouse HMG1, and yeast ABF2 boxes (Figure 4B). LOC269389 also contained a proline rich stretch from amino acid residues 390-454 that is often indicative of a transcriptional activation/repression domain or a site for protein-protein interaction. We believe that LOC269389 is a previously undiscovered retinal DNA binding protein with possible transcription factor functions in retinal tissue. In the rest of this article, we will refer to LOC269389 as RxHMG1 to reflect its representation in the RetinalExpress database and its novel HMG domain.
To identify sequences belonging to cDNA/EST databases other than RetinalExpress that might correspond to LOC269389 sequences, we performed a BLAST search with LOC269389 exons 2-9. We identified 19 ESTs, several of which belonged to databases representing eye, head, or brain. This pattern was consistent with the hypothesis that the expression of LOC269389 was restricted to the retina. However, other ESTs were found in databases representing lung and skeletal muscle, suggesting that LOC269389 had a somewhat broader expression pattern (see below). With one notable exception, the ESTs representing LOC269389 sequences lacked substantial annotation. The exception, designated AB096685, was very recently characterized as a portion of LOC269389, and was reported as an HMG box termed GCX-1 that was strongly expressed in rat ovarian granulosa cells.
To determine where RxHMG1 was expressed in the embryonic and adult retinas, we generated a probe representing the Rx-3201-21 sequence and performed in situ hybridizations with retina sections from P15 mice and E14.5 embryos (Figure 5). At P15, the mouse retina was fully differentiated into its three nuclear layers and two plexiform layers. RxHMG1 transcripts were found in the cytoplasm but not the nucleus of cells in the outer nuclear layer (rod and cone photoreceptor cells), in the inner nuclear layer (bipolar, horizontal, and amacrine cells), and in the ganglion cell layer (ganglion and amacrine cells, Figure 5A, upper left). Expression levels were substantially above those seen with a sense control (Figure 5A, lower left). Expression above background levels was also observed in the outer plexiform layer and inner plexiform layer, both of which contain synapsing axons and dendrites (Figure 5A, left). These results suggest that RxHMG1 was expressed in all cells of the differentiated postnatal retina. The strong expression in the outer and inner plexiform layer was unexpected because the translated product of RxHMG1 was a putative nuclear protein, and it was therefore not clear why the RxHMG1 mRNA would be located in a region associated with synapsing axons and dendrites.
In E14.5 retinas, actively dividing progenitor cells in the ventricular zone continuously drop out of the cell cycle and commit to the various retinal cell types in a sequential fashion. Retinal ganglion cells are the first cell type to differentiate and are the main differentiated cell type at E14.5. We used an in situ hybridization probe encoding a neurofilament protein, NF66, as a marker for retinal ganglion cells to distinguish these differentiated cells from progenitor cells in the ventricular zone (Figure 5B, upper left). However, in contrast to NF66, RxHMG1 was expressed both in newly differentiated ganglion cells in the ganglion cell layer and in progenitor cells in the ventricular zone (Figure 5B, middle left). In addition, RxHMG1 appeared to be expressed in the lens epithelium (Figure 5B, middle left). The hybridization signals were much higher in these tissues than those observed with the sense strand control (Figure 5B, lower left). These results suggested that RxHMG1 was initially expressed in all cells of the embryonic retina and possibly in other cells of the embryonic eye.
Rx-0691-17, part of a gene containing an RFX DNA binding domain
Rx-0691-17 corresponded to an unprocessed transcript at the 3' end of hypothetical gene 9930116O05Rik (Figure 6). A Riken cDNA, 9930116O05, corresponding to 9930116O05Rik, was reported by Riken a short time after we completed our search for unique RetinalExpress cDNAs/ESTs. The gene contains 10 exons and has an ORF encoding a putative protein with a molecular weight of 153 kDa that includes an RFX DNA binding domain (Figure 6). RFX-containing proteins bind to X-box consensus sites and are widespread throughout the animal kingdom. They have been implicated as critical transcription factors in regulating the expression of class II major histocompatibility genes and genes expressed in the immune system and in sensory neurons [10,11]. In this study, RT-PCR analysis showed that the final two exons of 9930116O05Rik, which contained the Rx-0691-17 sequence, were represented in E14.5 retinas (Figure 6). The Rx-0691-17 sequence represented the C-terminal region of the 9930116O05 ORF but did not contain the RFX DNA binding domain (Figure 6). In fact, no PCR products representing the upstream exons that contained the RFX domain were detected in our analysis. It was not clear why the remainder of the 9930116O05Rik transcript was not detected. It was possible that the rest of the transcript was expressed in the retina and that the primer sets used for amplification were inadequate. Alternatively, it was also possible that only the final two exons were expressed in the retina and that no translation product containing the RFX domain was produced.
We used Rx-0691-17 as a probe for in situ hybridization to determine where in the retina 9930116O05 was expressed. In P15 retinas, expression was observed in the ganglion cell and inner nuclear layers, although weak expression above sense control levels was also observed in the outer nuclear layer (Figure 5A, right). In the embryonic retina, Rx-0691-17 was expressed almost exclusively in differentiated retinal ganglion cells in the ganglion cell layer (Figure 5B, middle right) and this signal was substantially above that of the sense control (Figure 5B, lower right). The expression pattern of Rx-0691-17 was clearly distinct from that of RxHMG1, where expression was observed in both retinal ganglion cells and progenitor cells (compare left to right in Figure 5B).
Rx-3132-51 is part of a hypothetical phospholipase C catalytic domain X containing gene
Rx-3132-51 was found within hypothetical gene LOC313850, which was predicted by Genscan to contain six exons (Table 2 and Figure 7). We used PCR primers complementary to sequences within the first and last exons to determine whether LOC313850 transcripts were expressed in E14.5 embryonic retina by RT-PCR (Figure 7, primers p1 and p4). A 1.0 kb product amplified from embryonic retinal cDNA was isolated, sequenced, and found to contain four exons rather than six (data not shown). Two cryptic exons between exons 2 and 3 were absent from the amplified sequence (dashed exons in Figure 7). Rx-3132-51 was located in a region that covered the downstream-most cryptic exon and the bordering intron sequences (Figure 7). It was therefore possible that Rx-3132-51 represented solely intronic sequences or, alternatively, that rare retinal transcripts not amplified with the p1 and p4 primers contained one or both of the cryptic exons. However, another gene prediction program (Ensembl, Sanger Institute) predicted a hypothetical gene (ENSMUST00000036306) with four exons as depicted in Figure 7 rather than the six predicted by Genscan, and the first and last exons were slightly smaller in the Ensembl prediction than in the Genscan prediction. At this point in the analysis, it cannot be stated with certainty whether mature mRNA forms exist containing one or both of the two cryptic exons predicted by Genscan.
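The comparison between a predicted exon structure and a sequenced RT-PCR product can be sketched as a simple interval-overlap check. The Python below is only an illustration under stated assumptions and is not the procedure used in this study: the coordinates are invented placeholders rather than the real LOC313850 positions, and the function name `classify_predicted_exons` is hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive coordinates


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True when two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_predicted_exons(
    predicted: List[Interval], observed: List[Interval]
) -> Tuple[List[Interval], List[Interval]]:
    """Split predicted exons into those overlapping an observed exon
    (from a sequenced RT-PCR product) and those with no support."""
    supported, unsupported = [], []
    for exon in predicted:
        if any(overlaps(exon, obs) for obs in observed):
            supported.append(exon)
        else:
            unsupported.append(exon)
    return supported, unsupported


# Placeholder coordinates only (assumption): six predicted exons,
# four exons observed in the amplified product.
predicted_exons = [(100, 250), (900, 1050), (1500, 1580),
                   (2100, 2190), (3000, 3200), (4100, 4400)]
observed_exons = [(110, 250), (900, 1050), (3000, 3200), (4100, 4380)]

supported, unsupported = classify_predicted_exons(predicted_exons, observed_exons)
print(len(supported), "supported;", len(unsupported), "unsupported")
```

In this toy case, the two unsupported intervals play the role of the cryptic exons. An exon that lacks support in one amplified product may still occur in rare transcripts, so such a check can flag candidates but cannot rule an exon out.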
NCBI protein prediction analysis showed that the putative protein encoded by LOC313850 was 340 amino acids in length and contained a phospholipase C catalytic domain X (PLCXc) between amino acids 84 and 215 (Figure 7). Phospholipase C isoforms contain two regions (X and Y), which together form a TIM barrel-like structure containing the active site residues. Several proteins contain only the PLCXc domain (for example, see PLCXc with the Conserved Domain Architecture Retrieval Tool (CDART), NCBI), but it is not clear what role PLCXc plays without the associated Y domain. No sequence homology to any other mouse protein was found outside of the PLCXc region, suggesting that LOC313850 was a novel protein containing a PLCXc domain. A human gene and hypothetical protein homolog has been predicted for LOC313850 (Table 2), and a putative protein sequence containing an additional C-terminal 300 amino acids has been reported in the rat. However, no further information beyond DNA sequence analysis exists for the mouse, rat, or human hypothetical protein. Although we detected LOC313850 transcripts in the embryonic retina by RT-PCR, we were unable to detect them in RNAs from other tissues, suggesting that LOC313850 had restricted tissue expression patterns (see below).
RetinalExpress ESTs expressed in other embryonic and adult tissues
Six RetinalExpress ESTs that yielded RT-PCR products in embryonic retina (Rx-0612-84, Rx-0691-17, Rx-2083-71, Rx-3132-51, Rx-3193-74, and RxHMG1) were used to determine the range of expression in other tissues. We used a commercial kit containing normalized cDNAs that were reverse transcribed from RNA from 24 different mouse sources, including all major adult organs and E9.5, E12.5, and E19 embryos. We used the same PCR primers that yielded positive results for the embryonic retina. However, only RxHMG1 yielded detectable products from tissues other than embryonic retina (Figure 8). With RxHMG1, we observed expression in adult brain, kidney, spleen, skeletal muscle, and testis (Figure 8, arrowheads). In brain, spleen, skeletal muscle, and testis, the PCR product was identical in size to that found in the retina. However, the kidney product was substantially smaller, suggesting that the RxHMG1 transcript in kidney tissue was altered. This smaller product was also found at minor levels in cDNA derived from adult brain (Figure 8, lane 1).
To control for our PCR conditions, the β-actin transcript was amplified using β-actin specific primers. We found much higher levels of products in the cDNAs from all 24 tissue sources than were found with RxHMG1 (Figure 8). These results suggested that transcripts from the other RetinalExpress ESTs were highly restricted in their expression since we could not detect PCR products at the levels observed with RxHMG1. The fact that we failed to detect expression in the adult brain indicated that transcript levels in the adult retina, which represents only a small fraction of the brain cDNAs, were very low. Although we cannot make definitive conclusions, our analysis suggested that the other RetinalExpress ESTs were either very rare in the cDNA population from nonretinal tissues or were not present.
Database gene discovery approaches are becoming increasingly important for addressing issues regarding gene expression and function in retinogenesis and retinal disease [12,13]. Using an in silico approach, we have shown that it is possible to identify and characterize previously unknown genes expressed in the retina using the RetinalExpress database. In our analysis, we identified 26 genes that, at the time of our survey, were essentially uncharacterized except for hypothetical gene placement on the mouse genome. Our selection process for unique cDNAs/ESTs involved filtering all 27,765 cDNAs through the major mouse EST databases. Two subsequent steps in our mining strategy then greatly reduced the number of clones under consideration. First, the unique sequences identified by database filtering were subjected to a highly stringent sequence quality filter. This selection step removed approximately 75% of the cDNAs. Many of the sequences that were removed by this procedure were likely to be unique to RetinalExpress and could very well represent novel retinal sequences. However, this filter was applied because our objective was to identify a limited number of unique sequences that would represent genes with potentially important functions in early retinogenesis and could be used for future analysis. Second, we positioned the remaining cDNAs/ESTs on the mouse genome and considered only those sequences that were within hypothetical gene regions with no substantial annotation beyond a predicted gene structure. Clearly, our criteria were subjective; in fact, by relaxing the stringency for either of the latter two steps, we would have substantially increased the number of cDNAs/ESTs defined as unique to RetinalExpress.
Our analysis selected only 0.1% of the cDNAs/ESTs in the database. Many of these sequences represented pieces of hypothetical genes that either were unprocessed transcripts containing both putative exons and introns or had putative exons that were larger than predicted. We also found that many cDNAs/ESTs that were initially selected as unique to RetinalExpress were in fact noncoding portions of previously identified genes. For example, Rx-2064-47 was part of the 3' UTR of the gene encoding the Brn3a POU domain transcription factor, and Rx-7103-74 was within the first intron of the exostoses gene, Ext1. The fact that many unique ESTs actually represented previously unreported portions of known genes demonstrated the complexity and heterogeneity of cDNA/EST databases and suggested that without a detailed analysis, identifying an individual nucleotide sequence as a previously unknown gene can often be misleading.
Gene prediction programs vary in their predictive capabilities; each has its own strengths and its own rates of false positive and false negative exon placement. For example, Genscan had substantially higher accuracy than other programs when tested on standardized sets of human and vertebrate genes, with 75 to 80% of exons identified exactly. We tested the validity of several gene prediction algorithms in seventeen cases by searching for the putative transcripts associated with the hypothetical genes where the RetinalExpress cDNAs/ESTs were positioned. In six cases, we detected retinal expression extending beyond the cDNA/EST sequence. For RxHMG1 and Rx-3132-51, where gene predictions were available from Gnomon, Genscan, and Ensembl, differences were found among the different prediction algorithms, and the gene predictions did not accurately reflect the retinal transcript sequences obtained from RT-PCR. In the eleven cases for which we failed to confirm the gene structure prediction, we could not say for certain that the predicted structure was incorrect, but it was likely that in some cases the absence of the predicted PCR products reflected the absence of bona fide exons.
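As a quick arithmetic check on the counts given above, the tally can be written out in a few lines of Python. The variable names are illustrative only, and the percentage is simply derived from the seventeen, six, and eleven cases stated in the text.

```python
# Counts taken from the text: seventeen cases tested, six with retinal
# expression detected beyond the cDNA/EST sequence.
cases_tested = 17
extended_expression_detected = 6

# The remaining cases are those where the prediction could not be confirmed.
not_confirmed = cases_tested - extended_expression_detected
detected_fraction = extended_expression_detected / cases_tested

print(f"{not_confirmed} not confirmed; {detected_fraction:.1%} with extended expression")
```

The remainder matches the eleven unconfirmed cases, and extended expression was detected in roughly one third of the predictions tested.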
We showed RxHMG1 to be a novel HMG gene with expression in ganglion cells and progenitor cells in the embryonic retina and in all differentiated cell types in the adult retina. HMG-box proteins are generally associated with binding DNA, either nonspecifically or specifically to particular target sites, and with altering DNA conformation by inducing DNA bending [16,17]. The HMG box of RxHMG1 aligned most closely with human IRV6, mouse HMG1, and yeast ABF2 HMG boxes. The yeast protein is a mitochondrial protein that binds to mitochondrial DNA, whereas mouse HMG1 has been implicated in cytokine-induced inflammation by a mechanism that is independent of DNA binding [18,19]. Although it is not clear at this point what role RxHMG1 plays in the developing or mature retina, the fact that LOC269389 transcripts from nonretinal sources (skeletal muscle, kidney, spleen, and testis) could be found suggests that RxHMG1 is not specific to the retina, although it appeared to be restricted in its expression to only a few adult organs.
The genes depicted in Table 2, Figure 2, Figure 3, Figure 6, and Figure 7 represent regions of the mouse genome where unique RetinalExpress cDNAs/ESTs are located. In a few cases, there are clues as to the functions of these genes, but for most, virtually no information is available on their expression or function in the retina. A future challenge will be to place these novel genes into the context of gene networks involved in retinal development and physiological function.
This work was supported by grants EY11930 and EY13523 of the National Eye Institute, and by the Robert A. Welch Foundation. The University of Texas M. D. Anderson Cancer Center DNA Analysis Core Facility is supported in part by National Cancer Institute Cancer Center Support Grant CA16672.
1. Mu X, Zhao S, Pershad R, Hsieh TF, Scarpa A, Wang SW, White RA, Beremand PD, Thomas TL, Gan L, Klein WH. Gene expression in the developing mouse retina by EST sequencing and microarray analysis. Nucleic Acids Res 2001; 29:4983-93.
2. Yu J, Farjo R, MacNee SP, Baehr W, Stambolian DE, Swaroop A. Annotation and analysis of 10,000 expressed sequence tags from developing mouse eye and adult retina. Genome Biol 2003; 4:R65.
3. Bowne SJ, Sullivan LS, Blanton SH, Cepko CL, Blackshaw S, Birch DG, Hughbanks-Wheaton D, Heckenlively JR, Daiger SP. Mutations in the inosine monophosphate dehydrogenase 1 gene (IMPDH1) cause the RP10 form of autosomal dominant retinitis pigmentosa. Hum Mol Genet 2002; 11:559-68.
4. Hayward C, Shu X, Cideciyan AV, Lennon A, Barran P, Zareparsi S, Sawyer L, Hendry G, Dhillon B, Milam AH, Luthert PJ, Swaroop A, Hastie ND, Jacobson SG, Wright AF. Mutation in a short-chain collagen gene, CTRP5, results in extracellular deposit formation in late-onset retinal degeneration: a genetic model for age-related macular degeneration. Hum Mol Genet 2003; 12:2657-67.
5. Brown NL, Kanekar S, Vetter ML, Tucker PK, Gemza DL, Glaser T. Math5 encodes a murine basic helix-loop-helix transcription factor expressed during early stages of retinal neurogenesis. Development 1998; 125:4821-33.
6. Hackam AS, Bradford RL, Bakhru RN, Shah RM, Farkas R, Zack DJ, Adler R. Gene discovery in the embryonic chick retina. Mol Vis 2003; 9:262-76 <http://www.molvis.org/molvis/v9/a38/>.
7. Mu X, Beremand PD, Zhao S, Pershad R, Sun H, Scarpa A, Liang S, Thomas TL, Klein WH. Discrete gene sets depend on POU domain transcription factor Brn3b/Brn-3.2/POU4f2 for their expression in the mouse embryonic retina. Development 2004; 131:1197-210.
8. Okazaki Y, Furuno M, Kasukawa T, Adachi J, Bono H, Kondo S, Nikaido I, Osato N, Saito R, Suzuki H, Yamanaka I, Kiyosawa H, Yagi K, Tomaru Y, Hasegawa Y, Nogami A, Schonbach C, Gojobori T, Baldarelli R, Hill DP, Bult C, Hume DA, Quackenbush J, Schriml LM, Kanapin A, Matsuda H, Batalov S, Beisel KW, Blake JA, Bradt D, Brusic V, Chothia C, Corbani LE, Cousins S, Dalla E, Dragani TA, Fletcher CF, Forrest A, Frazer KS, Gaasterland T, Gariboldi M, Gissi C, Godzik A, Gough J, Grimmond S, Gustincich S, Hirokawa N, Jackson IJ, Jarvis ED, Kanai A, Kawaji H, Kawasawa Y, Kedzierski RM, King BL, Konagaya A, Kurochkin IV, Lee Y, Lenhard B, Lyons PA, Maglott DR, Maltais L, Marchionni L, McKenzie L, Miki H, Nagashima T, Numata K, Okido T, Pavan WJ, Pertea G, Pesole G, Petrovsky N, Pillai R, Pontius JU, Qi D, Ramachandran S, Ravasi T, Reed JC, Reed DJ, Reid J, Ring BZ, Ringwald M, Sandelin A, Schneider C, Semple CA, Setou M, Shimada K, Sultana R, Takenaka Y, Taylor MS, Teasdale RD, Tomita M, Verardo R, Wagner L, Wahlestedt C, Wang Y, Watanabe Y, Wells C, Wilming LG, Wynshaw-Boris A, Yanagisawa M, Yang I, Yang L, Yuan Z, Zavolan M, Zhu Y, Zimmer A, Carninci P, Hayatsu N, Hirozane-Kishikawa T, Konno H, Nakamura M, Sakazume N, Sato K, Shiraki T, Waki K, Kawai J, Aizawa K, Arakawa T, Fukuda S, Hara A, Hashizume W, Imotani K, Ishii Y, Itoh M, Kagawa I, Miyazaki A, Sakai K, Sasaki D, Shibata K, Shinagawa A, Yasunishi A, Yoshino M, Waterston R, Lander ES, Rogers J, Birney E, Hayashizaki Y, FANTOM Consortium, RIKEN Genome Exploration Research Group Phase I & II Team. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 2002; 420:563-73.
9. Kajitani T, Mizutani T, Yamada K, Yazawa T, Sekiguchi T, Yoshino M, Kawata H, Miyamoto K. Cloning and characterization of granulosa cell high-mobility group (HMG)-box protein-1, a novel HMG-box transcriptional regulator strongly expressed in rat ovarian granulosa cells. Endocrinology 2004; 145:2307-18.
10. Emery P, Durand B, Mach B, Reith W. RFX proteins, a novel family of DNA binding proteins conserved in the eukaryotic kingdom. Nucleic Acids Res 1996; 24:803-7.
11. Dubruille R, Laurencon A, Vandaele C, Shishido E, Coulon-Bublex M, Swoboda P, Couble P, Kernan M, Durand B. Drosophila regulatory factor X is necessary for ciliated sensory neuron differentiation. Development 2002; 129:5487-98.
12. Cepko CL. Genomics approaches to photoreceptor development and disease. Harvey Lect 2001-2002; 97:85-110.
13. Mu X, Klein WH. A gene regulatory hierarchy for retinal ganglion cell specification and differentiation. Semin Cell Dev Biol 2004; 15:115-23.
14. Burge CB, Karlin S. Finding the genes in genomic DNA. Curr Opin Struct Biol 1998; 8:346-54.
15. Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J Mol Biol 1997; 268:78-94.
16. Thomas JO. HMG1 and 2: architectural DNA-binding proteins. Biochem Soc Trans 2001; 29:395-401.
17. Alexander-Bridges M, Ercolani L, Kong XF, Nasrin N. Identification of a core motif that is recognized by three members of the HMG class of transcriptional regulators: IRE-ABP, SRY, and TCF-1 alpha. J Cell Biochem 1992; 48:129-35.
18. Dequard-Chablat M, Allandt C. Two copies of mthmg1, encoding a novel mitochondrial HMG-like protein, delay accumulation of mitochondrial DNA deletions in Podospora anserina. Eukaryot Cell 2002; 1:503-13.
19. Li J, Kokkola R, Tabibzadeh S, Yang R, Ochani M, Qiang X, Harris HE, Czura CJ, Wang H, Ulloa L, Wang H, Warren HS, Moldawer LL, Fink MP, Andersson U, Tracey KJ, Yang H. Structural basis for the proinflammatory cytokine activity of high mobility group box 1. Mol Med 2003; 9:37-45.