Molecular Vision 1999;
Received 1 February 1999 | Accepted 28 April 1999 | Published 4 May 1999
Identifying and mapping novel retinal-expressed ESTs from humans
Melanie M. Sohocki,1
Lori S. Sullivan,1,2
Stephen P. Daiger1,2
1Human Genetics Center, School of Public Health and 2Department of Ophthalmology and Visual Science, The University of Texas Health Science Center, Houston, TX
Correspondence to: Stephen P. Daiger, Human Genetics Center, School of Public Health, PO Box 20334, Houston, TX, 77225-0334; Phone: (713) 500-9829; FAX: (713) 500-0900; email: firstname.lastname@example.org
Purpose: The goal of this study was to develop efficient methods to identify tissue-specific expressed sequence tags (ESTs) and to map their locations in the human genome. Through a combination of database analysis and laboratory investigation, unique retina-specific ESTs were identified and mapped as candidate genes for inherited retinal diseases.
Methods: DNA sequences from retina-specific EST clusters were obtained from the TIGR Human Gene Index Database. Further processing of the EST sequence data was necessary to ensure that each EST cluster represented a novel, non-redundant mapping candidate. Processing involved screening for homologies to known genes and proteins using BLAST, excluding known human gene sequences and repeat sequences, and developing primers for PCR amplification of the gene encoding each cDNA cluster from genomic DNA. The EST clusters were mapped using the GeneBridge 4.0 Radiation Hybrid Mapping Panel with standard PCR conditions.
Results: A total of 83 retinal-expressed EST clusters were examined as potential novel, non-redundant mapping candidates. Fifty-five clusters were mapped successfully and their locations compared to the locations of known retinal disease genes. Fourteen EST clusters localize to candidate regions for inherited retinal diseases.
Conclusions: This pilot study developed methodology for mapping uniquely expressed retinal ESTs and for identifying potential candidate genes for inherited retinal disorders. Despite the overall success, several complicating factors contributed to the high failure rate (33%) for mapping EST-clustered sequences. These include redundancy in the sequence data, widely dispersed sequences, ambiguous nucleotides within the sequences, the possibility of amplifying through introns and the presence of repetitive elements within the sequence. However, the combination of database analysis and laboratory mapping is a powerful method for identification of candidate genes for inherited diseases.
A rapidly growing area of genome research is the analysis of expressed sequence tags (ESTs). ESTs are generated when large numbers of randomly selected cDNA clones from specific tissues are sequenced partially. The resulting collection of ESTs reflects the level and complexity of gene expression in the sampled tissue. ESTs can be used to rapidly identify expressed genes because they are usually unique to the cDNA from which they are derived and they correspond to a specific gene in the genome.
The goal of this project was to identify potential candidate genes for inherited retinopathies using bioinformatic tools to ascertain retina-specific gene sequences and to map these sequences in a human radiation hybrid panel. The sequence data source was the TIGR Human Gene Index Database. Clusters of ESTs representing the same transcript are assembled by TIGR based on sequence overlap. ESTs that share one or more stretches of high sequence identity are grouped into a cluster. The set of sequences that forms the cluster is then reduced to a single tentative human consensus sequence (THC). These THCs were used to further analyze retina-specific clusters. It is important to note that some retinopathies are due to mutations in genes expressed in a number of tissues besides the retina. However, by choosing to analyze only abundant, retina-specific ESTs, we performed a survey of the most likely candidates for inherited retinal diseases.
To date, fewer than half of the more than 100 genes causing inherited retinal diseases have been cloned (RetNet), and it is highly likely that additional disease loci will be identified. Therefore, another goal of this project was to develop efficient methods for EST analysis and to uncover any inherent limitations in using ESTs for mapping purposes. We identified novel retina-specific ESTs using bioinformatic tools and subsequently, through laboratory analysis, localized these ESTs to specific chromosomes. Results using a similar approach to map retina/pineal-specific ESTs, distinct from those described here, are reported elsewhere.
The purpose of this project was to identify retina-specific gene sequences in public databases for subsequent mapping using a radiation hybrid panel. We used the TIGR database to ascertain candidate sequences and compared these with data from the UniGene and GenBank databases. ESTs (expressed sequence tags) are grouped into clusters of overlapping, highly similar sequences, called THCs or "tentative human consensus sequences" in TIGR. The ESTs within a THC are presumed to derive from a single gene. The tissue origin of each EST within a THC, and the number of ESTs per cluster, provide information on the tissue distribution and relative abundance of the gene transcript. This information was used to select candidate sequences for mapping.
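The grouping of ESTs into clusters by shared sequence overlap can be illustrated with a small sketch. This is not TIGR's assembly procedure; it is a minimal single-linkage grouping (union-find) that assumes the pairwise overlaps have already been detected by some alignment step. The function name `cluster_by_overlap` and the EST identifiers are hypothetical.

```python
def cluster_by_overlap(est_ids, overlapping_pairs):
    """Group ESTs so that any two ESTs linked by a chain of overlaps share a cluster.

    est_ids: list of EST identifiers (hypothetical names).
    overlapping_pairs: pairs of ESTs assumed to share a high-identity stretch.
    """
    parent = {est: est for est in est_ids}

    def find(est):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[est] != est:
            parent[est] = parent[parent[est]]
            est = parent[est]
        return est

    for a, b in overlapping_pairs:
        parent[find(a)] = find(b)  # merge the two clusters

    groups = {}
    for est in est_ids:
        groups.setdefault(find(est), []).append(est)
    return sorted(sorted(group) for group in groups.values())


# Toy data: est1-est2-est3 are chained by overlaps, est4-est5 form a second cluster.
ests = ["est1", "est2", "est3", "est4", "est5"]
pairs = [("est1", "est2"), ("est2", "est3"), ("est4", "est5")]
clusters = cluster_by_overlap(ests, pairs)
print(clusters)
```

In this toy case est1 and est3 end up in the same cluster even though they do not overlap directly, which is the sense in which all ESTs in a cluster are presumed to derive from one transcript.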
Retina-specific ESTs were identified in the TIGR database version 3.3 on July 1, 1998. Duplicate entries and identical clusters with different THC numbers were eliminated manually. Expression information and map locations (if known) of each cluster were acquired by entering the GenBank accession number of at least one STS (sequence tagged site) for each cluster into UniGene. Repeat sequences within THC sequences were masked using RepeatMasker. BLAST homology searches were performed using the NCBI server. Clusters identified by BLAST as representing known genes or containing additional non-retinal ESTs were excluded from the study. Only clusters that contained retinal ESTs exclusively (or retinal and tumor-derived ESTs) and that had not been mapped previously were considered for further analysis.
PCR primers for each cluster were designed using the Primer3 program. Primer pairs were optimized for PCR in human genomic DNA using AmpliTaq Gold polymerase (Perkin-Elmer) with a standard protocol of 35 cycles and an annealing temperature gradient within the Stratagene Robocycler thermocycler. The resulting DNA fragments were separated on standard 2% agarose gels. Fragments that were not of the expected size were sequenced by treating an aliquot of the genomic PCR product with shrimp alkaline phosphatase and exonuclease (Amersham), followed by manual sequencing with the AmpliCycle™ Sequencing Kit (Perkin-Elmer) using primers end-labeled with ³²P. The sequence fragments were separated on 6% Long Ranger™ (FMC Bioproducts) denaturing acrylamide gels. Each cluster that amplified in genomic DNA was localized in the genome using the GeneBridge 4.0 Radiation Hybrid Panel (Research Genetics). The screening results were submitted to the GeneBridge 4.0 mapping server at the Whitehead Institute using a minimum LOD score of 15 for placement. To obtain chromosomal band identification, the resulting mapping data were compared to the information in the Stanford Radiation Hybrid mapping database and databases at the Whitehead Institute. Primer pairs that successfully identified a specific cluster location were submitted to GenBank for STS accession numbers.
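The selection criteria in this paragraph can be summarized as a simple filter. The sketch below is illustrative only: the `Cluster` record, its field names, and the toy entries are hypothetical, and the flags are assumed to have been filled in beforehand from BLAST, UniGene, and RepeatMasker results. The `min_ests=3` threshold reflects the cutoff of three or more sequences per cluster reported in the results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    thc_id: str                      # hypothetical identifier
    est_tissues: tuple               # tissue of origin of each EST in the cluster
    known_gene: bool = False         # assumed to come from BLAST / GenBank screening
    previously_mapped: bool = False  # assumed to come from UniGene
    has_repeats: bool = False        # assumed to come from RepeatMasker


# Retinal ESTs, optionally accompanied by tumor-derived ESTs, are allowed.
ALLOWED_TISSUES = {"retina", "tumor"}


def is_mapping_candidate(cluster, min_ests=3):
    tissues = set(cluster.est_tissues)
    retina_specific = "retina" in tissues and tissues <= ALLOWED_TISSUES
    return (
        retina_specific
        and len(cluster.est_tissues) >= min_ests
        and not cluster.known_gene
        and not cluster.previously_mapped
        and not cluster.has_repeats
    )


clusters = [
    Cluster("THC_A", ("retina", "retina", "retina")),
    Cluster("THC_B", ("retina", "retina", "brain")),          # non-retinal EST
    Cluster("THC_C", ("retina", "retina", "tumor"), previously_mapped=True),
    Cluster("THC_D", ("retina", "tumor", "retina")),          # retina + tumor is kept
    Cluster("THC_E", ("retina", "retina")),                   # too few ESTs
]
candidates = [c.thc_id for c in clusters if is_mapping_candidate(c)]
print(candidates)
```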
In total, 1,315 EST clusters containing sequences from exclusively retinal cDNA clones were obtained from the TIGR Human Gene Index Database on July 1, 1998 (Table 1). In random primed libraries, two different ESTs can be derived from non-overlapping segments of the same gene, thus causing redundancy in the database. Due to redundancy and multiple entries of the same EST in the TIGR database, 348 ESTs were removed from further consideration. To select for highly expressed transcripts, we chose to evaluate only EST clusters containing three or more independent sequences. This reduced the number of potential candidates to 276 EST clusters. One hundred and forty-nine (54%) of the EST clusters were mapped previously according to the UniGene database; 84 (30%) of the ESTs showed identity to known genes in the GenBank database; and 67 (24%) contained Alu, SINEs, LINEs, or other repeat elements. (These categories overlap.) The remaining 83 ESTs represent novel retinal clusters with no significant match to any sequence in the databases.
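The counts in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the variable names are hypothetical, and treating the 348 removed entries as clusters is an assumption, since the text does not say how they relate to the 1,315 clusters.

```python
retina_only_clusters = 1315
removed_as_redundant = 348  # assumption: counted in the same units as the clusters
after_redundancy_filter = retina_only_clusters - removed_as_redundant

abundant_clusters = 276     # clusters with three or more independent sequences
novel_clusters = 83         # clusters left after all exclusions

# Exclusion categories reported in the text; a cluster may fall into more than one.
exclusions = {"previously mapped": 149, "known gene": 84, "repeat elements": 67}

percentages = {
    reason: round(100 * count / abundant_clusters)
    for reason, count in exclusions.items()
}

excluded_clusters = abundant_clusters - novel_clusters
# How far the category counts exceed the number of excluded clusters;
# a positive value confirms that the categories overlap.
overlap_surplus = sum(exclusions.values()) - excluded_clusters

print(after_redundancy_filter, percentages, excluded_clusters, overlap_surplus)
```

The three category counts sum to 300 while only 193 clusters were excluded, which is consistent with the statement that the categories overlap.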
Because ESTs are obtained by single-pass sequencing, some sequences contain errors and have ambiguous nucleotides. Due to this problem, we were unable to make suitable primer pairs for 7 clusters. The STS for each of the remaining clusters was optimized in genomic DNA prior to PCR assay in the radiation hybrid panel. Of the remaining 76 mapping candidates, 11 were composed of widely dispersed sequences and mapped to multiple chromosomes in the GeneBridge 4.0 Radiation Hybrid Panel (Table 2). An additional 17 of the mapping candidates failed to amplify in either genomic DNA or in the radiation hybrid panel. The amplified fragment for THC137122 was much larger than the expected size, but sequencing revealed that the fragment included an intron flanked by the coding sequence for this cDNA cluster. Fifty-five of the 83 potential candidate ESTs were successfully mapped and their locations compared to the locations of known retinal disease genes. The EST name, number of cDNAs per cluster, and chromosomal mapping location (including flanking markers) for each cluster are shown in Table 3. Table 3 also lists the GenBank accession number assigned to each unique primer pair used to map these ESTs.
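The success and failure rates implied by these counts can be computed directly. This is a minimal sketch using only the totals stated above (83 candidates, 55 mapped, 17 that failed to amplify); the variable names are hypothetical, and no breakdown of the unmapped clusters beyond those totals is assumed.

```python
candidates = 83
mapped = 55
failed_to_amplify = 17

unmapped = candidates - mapped

mapping_success_rate = mapped / candidates
mapping_failure_rate = unmapped / candidates
amplification_failure_rate = failed_to_amplify / candidates

print(f"mapped:              {mapping_success_rate:.1%}")
print(f"not mapped:          {mapping_failure_rate:.1%}")
print(f"failed to amplify:   {amplification_failure_rate:.1%}")
```

On these figures, 28 of 83 candidates were not mapped, and the clusters that failed to amplify account for about one fifth of all candidates.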
Each THC sequence examined in this study was found in UniGene; similarities to known genes, map locations, and information on tissue expression were obtained. The total number of ESTs per mapped cluster (3 or more) is shown in Table 3, giving a rough indication of abundance. Using TIGR, UniGene, and GenBank, we reduced the number of retina-specific EST clusters by excluding clusters that represented known genes, were mapped previously, or included transcripts derived from tissues other than the retina. (Clusters with ESTs from tumors or transformed cell lines were not excluded because these ESTs may derive from transcripts expressed subsequent to transformation.) Also, the human genome contains large segments of repetitive DNA sequences, such as Alu, SINEs, or LINEs, and these repeat elements can pose a problem for EST analysis. To reduce the probability of analyzing THCs containing repetitive elements, we screened each retina-specific THC for repeats using the analysis program RepeatMasker, thus eliminating several additional clusters.
Following these procedures, we identified 83 novel retinal EST clusters as potential mapping candidates. PCR primers were designed for each potential candidate and the primer product was mapped using the GeneBridge 4.0 Radiation Hybrid Panel. By this process, we localized 55 unique retina-specific genes, 14 of which map within the candidate region for an inherited retinopathy (Table 4).
One potential problem in mapping ESTs is the considerable degree of redundancy in the data and the overlap with more completely characterized, traditional GenBank entries that represent functionally cloned mRNAs and genes. In this study, 84 (30%) of the 276 retinal EST clusters could be identified as known genes in the UniGene database and 149 (54%) were mapped previously. After completion of this study, the International RH Mapping Consortium released GeneMap'98, the latest expression map of the human genome. Eleven ESTs localized in this study were confirmed by GeneMap'98.
Another problem with using ESTs for mapping is the presence of repetitive DNA sequences. Despite screening of each of the EST clusters for repetitive elements, 11 of the mapping candidates localized to multiple chromosomes in the radiation hybrid panel. Possible explanations are that the retinal sequence is a member of a dispersed gene family or that multiple pseudogenes exist.
ESTs are generated from single-pass sequencing of random cDNA clones and, as a consequence, they may contain inaccurate regions and ambiguous nucleotides. Because incorrect nucleotides may fall within the primer sequence, these errors can cause difficulties in primer design: the primers may not anneal to the DNA and therefore fail to amplify in a PCR reaction. This could be an explanation for the relatively high failure rate (20%) of EST mapping in this study. Other explanations could be the presence of primer-dimers or other amplification artifacts.
Despite the problems with EST mapping, the 55 EST clusters mapped in this study represent novel, retina-specific genes and potential candidates for inherited retinopathies. Fourteen of these genes fall within the candidate region for a mapped, but not cloned, form of retinal disease. The specific retinal expression of these novel genes distinguishes them from the retina/pineal-specific ESTs identified in a related study. This confirms the utility of using ESTs to identify and map novel inherited retinal genes in the human genome.
We thank Odessa L. June, Human Genetics Center, the University of Texas Health Science Center, Houston for expert technical assistance. Supported by grants from the Foundation Fighting Blindness and the George Gund Foundation, by grants from the William Stamps Farish Fund and the M. D. Anderson Foundation, by NIH grant EY07142 and NIH-NEI National Institutional Service Award EY07024.
1. Daiger SP, Rossiter BF, Greenberg J, Christoffels A, Hide W. Data services and software for identifying genes and mutations causing retinal degeneration. Invest Ophthalmol Vis Sci 1998; 39:S295.
2. Sohocki MM, Malone KA, Sullivan LS, Daiger SP. Localization of retina/pineal expressed sequences (ESTs): identification of novel candidate genes for inherited retinal disorders. Genomics. In press 1999.
3. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990; 215:403-10.
4. Eichler EE. Masquerading repeats: paralogous pitfalls of the human genome. Genome Res 1998; 8:758-62.
5. Boguski MS, Schuler GD. ESTablishing a human transcript map. Nat Genet 1995; 10:369-71.
6. Mitchell SJ, McHale DP, Campbell DA, Lench NJ, Mueller RF, Bundey SE, Markham AF. A syndrome of severe mental retardation, spasticity, and tapetoretinal degeneration linked to chromosome 15q24. Am J Hum Genet 1998; 62:1070-6.