Molecular Vision 2002;
Received 22 April 2002 | Accepted 5 June 2002 | Published 15 June 2002
A project for ocular bioinformatics: NEIBank
Section on Molecular Structure and Function, National Eye Institute, National Institutes of Health, Bethesda, MD
Correspondence to: Graeme Wistow, Ph.D., Chief, Section on Molecular Structure and Function, National Eye Institute, Building 6, Room 331, National Institutes of Health, Bethesda, MD, 20892-2740; Phone: (301) 402-3452; FAX: (301) 496-0078; email: email@example.com
NEIBank is a project aimed at helping to integrate different kinds of data for eye research into a molecular encyclopedia of the eye. This should eventually include a wide range of input from genomics, genetics, structure/function and expression studies. As a starting point, the project has begun with efforts to assemble a catalog of the genes expressed in different parts of the eye.
A large proportion of all known (but usually not understood) human genes have been identified through expressed sequence tag (EST) analyses and major efforts, such as the Cancer Genome Anatomy Project (CGAP), have been instituted to produce EST resources for different tissue and disease systems. This type of genome project uncovers new genes but, equally importantly, it also helps to define the repertoire of expressed genes, novel or known, for different tissues and cell types. Overall, public EST projects have contributed over 9 million partial cDNA sequences from many species and resources such as UniGene have been employed to group these sequences into clusters that (potentially) represent specific human genes. However, with the notable exception of retina, most human eye tissues have, until quite recently, been rather poorly represented in most of the analyses available in the public domain.
The eye is a complex system of highly differentiated tissues of various developmental origins. Many genes essential for eye function are tissue-specific or highly tissue-preferred and, not surprisingly, many of these have proved to be associated with genetically based eye diseases [1-4]. At the same time, many genes with more diverse patterns of expression are also essential for normal eye function, while for both novel and known genes there may be alternative transcripts that can have important consequences for the role of the gene in the eye. For example, the transcription factor Pax6 is expressed in different parts of the eye, brain and pancreas [5-8] but it seems to use different patterns of alternative splicing in different tissues to fine-tune its function and to select different families of target genes [9-11].
Strategies for NEIBank
For NEIBank a key strategy has been to produce high quality cDNA libraries from dissected tissues, to sequence the clones primarily from the 5' end to increase the access to full-length clones and functionally significant splice variants, and to develop informatics tools to analyze and display the results. The aim is to use cDNA libraries that represent as closely as possible the transcript profile of the specialized tissues of the eye. While cultured cells can be powerful experimental tools, it is clear that they differ significantly from the tissue from which they were derived. As just one of many known examples (both published and unpublished), MHC gene expression seems to be absent in intact lens but is present in cultured cells that are used as models for lens [12].
Various steps in library synthesis and analysis can also alter the abundance and quality of clones. For example, cDNA libraries are commonly amplified to increase the available resource. This can drastically reduce the frequency of some clones that grow poorly, perhaps because of size, G+C content or other factors. In many studies aimed primarily at gene discovery, cDNA libraries are normalized [13]. This is a powerful methodology that essentially employs a process of self-subtraction to enrich for tissue-preferred or rare transcripts and to remove the more abundant clones. However, it necessarily eliminates any information on transcript abundance and also tends to increase the abundance of cloning artifacts (since these are rare). So, as far as possible, NEIBank uses un-amplified, un-normalized libraries to obtain as close as possible a representation of normal transcript abundance, to maximize clone length and to allow the discovery of "difficult" clones that might disappear during library manipulation. As described below and in the accompanying papers, this seems to have been successful [14-17]. In some cases, libraries have later been amplified and normalized to reduce the content of highly abundant clones (such as crystallins in the lens) for "deeper" sequencing of the library, or in procedures to reduce the content of empty vector [14,15], but in these cases, low temperature, semi-solid methods have been used during library expansion in order to minimize any growth bias [18,19].
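The point about transcript abundance can be illustrated with a minimal sketch. It assumes that each sequenced clone from an un-normalized library has already been assigned to a gene or group, and it simply reports the fraction of clones per group. The function name and the example clone list are hypothetical and are not taken from the NEIBank data.

```python
from collections import Counter


def relative_abundance(clone_assignments):
    """Return the fraction of sequenced clones assigned to each group.

    clone_assignments: one group label per sequenced clone, e.g. a gene symbol.
    The result is only meaningful as an abundance estimate for an
    un-normalized, un-amplified library.
    """
    counts = Counter(clone_assignments)
    total = sum(counts.values())
    # most_common() orders groups from most to least abundant.
    return {group: n / total for group, n in counts.most_common()}


# Invented example: four clones, two of them from the same crystallin gene.
example = ["CRYAA", "CRYAA", "CRYBB2", "ACTB"]
print(relative_abundance(example))
```

After normalization the same calculation would still run, but the fractions would no longer reflect transcript levels in the tissue.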
Typical EST analyses employ 3' sequence reads in order to anchor clusters of clones to "unique" 3' ends. This has been useful, but as perusal of UniGene will reveal, it has not been completely successful. Many genes have multiple 3' ends and there may be mis-priming at internal A-rich regions. Furthermore, 3' sequencing often encounters problems in reading through the highly repetitive polyA tail. For these reasons 5' sequence reads have been emphasized for NEIBank, although some sets of clones have been sequenced in both directions to confirm that the libraries are generally complete at the 3' end and to gain more sequence for apparently novel clones. This strategy gives a high return in numbers of "quality" sequences, as judged by the program PHRED [20], and also increases the chances of observing novel protein coding regions and alternative splicing events. This is because 3' untranslated regions (UTR), which naturally contain no open reading frame (ORF), may be long and are rarely interrupted by introns in the genome.
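As a rough illustration of what judging "quality" by PHRED involves, the sketch below uses the standard relationship between a Phred score Q and the estimated base-call error probability, P = 10^(-Q/10). The cut-offs used here (Q20 and a minimum of 100 such bases) are assumptions chosen for the example; the text does not state which thresholds were applied for NEIBank, and the function names are hypothetical.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q into an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def count_high_quality_bases(quals, threshold=20):
    """Count bases whose Phred score is at or above the threshold."""
    return sum(1 for q in quals if q >= threshold)


def is_quality_read(quals, threshold=20, min_bases=100):
    """Accept a read if it has enough high-quality bases.

    Both threshold and min_bases are illustrative assumptions.
    """
    return count_high_quality_bases(quals, threshold) >= min_bases


# Invented read: 150 bases at Q30 followed by 50 low-quality bases at Q8.
read_quals = [30] * 150 + [8] * 50
print(phred_to_error_prob(30))      # estimated error rate per base at Q30
print(is_quality_read(read_quals))  # True under the assumed cut-offs
```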
The un-normalized libraries subjected to extensive sequencing so far show a high fraction of "full-length" cDNAs, inasmuch as over 50% of cDNAs corresponding to known sequences contain the initiator codon of the open reading frame. Indeed, the un-normalized lens library [14], which contains a high content of relatively short transcripts due to its crystallin content, contains approximately 75% "full-length" cDNAs. For the libraries representing the four tissues described in the accompanying papers, human lens, iris, retina and RPE/choroid [14-17], the content of bacterial, mitochondrial and other contaminant sequences is typically no more than 5% and ribosomal RNA content is low, ranging from 0.4% in the normalized iris library to 3% in the un-normalized lens library. Quality EST reads are typically about 500 bp long and are often longer.
A major effort has been made to use bioinformatics to assemble, organize and present the NEIBank EST data, to identify and group the high quality sequences and to remove the various classes of poor quality sequences, non-mRNA contaminants and chimeric clones. This has evolved into a rules-based procedure named GRIST (GRouping and Identification of Sequence Tags) [21] that uses sequence matches generated by BLAST programs [22] and extracts information from GenBank, UniGene, and other databases. The collated information on grouped and identified cDNAs from the EST analyses is displayed at the NEIBank web site. In addition to data derived from libraries made specifically for NEIBank, the same procedures are used to extract, organize and display EST data for eye tissues from parallel sequencing efforts. Many keywords for topics such as functional class and chromosomal location are incorporated as well as links to many related sites, including a direct link for each group or cluster of related sequences to the human and mouse genome builds at the Human Genome Project.
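The general idea of rules-based grouping from sequence matches can be sketched as follows. This is not the GRIST procedure itself, whose rules are described in the accompanying paper; it is a simplified stand-in that assigns each EST to its best-scoring database match once that match passes two filters. The record layout, the thresholds (95% identity, 100 aligned bases) and all names are assumptions made for the example.

```python
from collections import defaultdict


def group_ests(hits, min_identity=95.0, min_length=100):
    """Group ESTs by their best acceptable database match.

    hits: iterable of (est_id, subject_id, percent_identity,
          alignment_length, bit_score) tuples, e.g. parsed from
          tabular sequence-search output.
    Returns {subject_id: sorted list of est_ids}. ESTs with no
    acceptable match are left out and would need separate handling.
    """
    best = {}
    for est, subject, identity, length, score in hits:
        # Rule 1: discard weak or short matches (assumed thresholds).
        if identity < min_identity or length < min_length:
            continue
        # Rule 2: keep only the highest-scoring match for each EST.
        if est not in best or score > best[est][1]:
            best[est] = (subject, score)

    groups = defaultdict(list)
    for est, (subject, _score) in best.items():
        groups[subject].append(est)
    return {subject: sorted(ests) for subject, ests in groups.items()}


# Invented example records.
example_hits = [
    ("est1", "geneA", 99.0, 400, 700.0),
    ("est1", "geneB", 96.0, 150, 200.0),
    ("est2", "geneA", 98.0, 300, 500.0),
    ("est3", "geneC", 80.0, 300, 100.0),  # fails the identity rule
]
print(group_ests(example_hits))
```

A real pipeline would also need rules for chimeric clones, contaminants and ESTs with no database match, which this sketch does not attempt.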
However, the real purpose of all this work is to produce insight into the molecular mechanisms of the eye. The four accompanying papers, which describe cDNA libraries for human eye tissues, give examples of some of the classes of biologically interesting information that can be mined from the accumulating data. Here are some highlights.
In addition to a number of apparently novel genes whose transcripts are represented in the sequence datasets at single copy levels, several newly recognized genes were found amongst the most abundant cDNA clones in the libraries. These include lengsin, a novel glutamine-synthetase superfamily member in the lens [14]; oculoglycan/opticin, a novel member of the small leucine rich proteoglycans found in iris and in RPE [15,17,23]; and retbindin, an abundant transcript in the retina library that appears to encode a secreted binding protein [16]. Some of the novel genes seen in these libraries may have escaped detection previously because of their high G+C content. This is exemplified by oculospanin, which is found in the iris and RPE/choroid libraries, and whose cDNA is 70% G+C [15,17], and IEGF/PDGFD, a new member of the PDGF/VEGF family of growth factors that is expressed in human iris [15].
Detailed inspection of many of the groups of clones revealed novel splice variants with potential biological significance. These include a major new splice form of the lens protein MP19/Lim2 that encodes a larger version of the protein [14]; a variant of the retinal transcription factor Nrl that makes use of an exon in what would otherwise be an intron of the major transcript [16]; alternative versions of Bestrophin that are actually the dominant forms of transcript for this gene detected among cDNAs from the RPE/choroid library [17]; and a splice variant of oculoglycan/opticin that deletes a conserved motif without disrupting the rest of the open reading frame. Many other splice variants can be seen. Some are unlikely to produce functional proteins but could still have biomedical importance. In very long-lived cells such as those in lens, retina and RPE, the accumulation of mis-spliced transcripts and their protein products could contribute to declining cellular function, particularly with age. A possible example of this sort of splice "accident", involving γS-crystallin in the lens, has been described previously [24].
In addition to the discovery of new genes, the EST libraries also make a useful contribution to cataloging the genes that are normally expressed in eye tissues. These data provide a baseline for other studies, such as SAGE analyses [25] in which short tag sequences derived from the 3' ends of cDNAs are identified. Novel SAGE tags may represent new genes. However, they are typically not long enough to allow unambiguous identification of new genes and they could also arise through cloning or amplification artifacts. Thus it is useful to have a reference set of cDNA clones for comparison. A recent SAGE study has suggested that many of the most abundant genes expressed in RPE are tissue-specific and novel [26]. In contrast, cDNA EST sequencing, which allows for clearer identification, finds that the most abundant genes expressed in RPE/choroid are not tissue specific [17] and raises the possibility that novel SAGE tags may not necessarily represent novel genes.
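The use of a reference cDNA set for interpreting SAGE tags can be sketched in a few lines. The example assumes the common SAGE convention in which the tag is the 10 bases immediately downstream of the 3'-most NlaIII site (CATG) in the transcript; the protocol details used in any particular study may differ, and the function names and sequences below are invented for illustration.

```python
def sage_tag(cdna, anchor="CATG", tag_len=10):
    """Return the predicted SAGE tag for a cDNA, or None if there is none.

    The tag is taken as the tag_len bases that follow the 3'-most
    occurrence of the anchoring-enzyme site (assumed convention).
    """
    seq = cdna.upper()
    pos = seq.rfind(anchor)
    if pos == -1:
        return None
    start = pos + len(anchor)
    tag = seq[start:start + tag_len]
    # Require a full-length tag; shorter remnants are not usable.
    return tag if len(tag) == tag_len else None


def tag_index(cdnas):
    """Map each predicted tag to the names of all reference cDNAs that yield it."""
    index = {}
    for name, seq in cdnas.items():
        tag = sage_tag(seq)
        if tag is not None:
            index.setdefault(tag, []).append(name)
    return index


# Two invented cDNAs that share the same predicted tag.
reference = {
    "clone_x": "TTCATGACGTACGTAGCC",
    "clone_y": "GGGGCATGACGTACGTAGAA",
}
print(tag_index(reference))
```

When two distinct reference clones map to one tag, as in this invented case, the tag alone cannot identify the gene, which is the ambiguity the paragraph above describes. An observed tag that is absent from the index could reflect a new gene, an unsampled 3' end or an artifact.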
Information on the transcriptional repertoire of eye tissues is also of importance for micro-array studies. The EST expression data provide a view of what should be detectable in array hybridizations of RNA from particular tissues, helping to validate baseline studies. More directly, sequence-verified EST clones provide the critical resource for the construction of cDNA micro-arrays. Indeed, clones from the NEIBank collection are now being used to construct cDNA arrays of eye-expressed genes. The expression data also help in the evaluation of commercially produced arrays, giving some basis for judging how many of the major genes expressed in eye are actually represented.
Since the NEIBank web site has been available, several other research groups have joined the effort to expand the EST representation of eye tissues for humans and for other species. As a result of this, and of continuing sequencing of several NEIBank libraries, the database continues to expand. Collaborations are already underway to add resources for structural biology and proteomics to the NEIBank web site. Suggestions, additions and links to other relevant resources are welcomed.
I thank the colleagues and collaborators who have made this effort possible. Their contributions are recognized in the accompanying papers. I also particularly thank Dr. Robert Nussenblatt of NEI for his support and encouragement.
References
1. Hejtmancik JF. The genetics of cataract: our vision becomes clearer. Am J Hum Genet 1998; 62:520-5.
2. He W, Li S. Congenital cataracts: gene mapping. Hum Genet 2000; 106:1-13.
3. Phelan JK, Bok D. A brief review of retinitis pigmentosa and the identified retinitis pigmentosa genes. Mol Vis 2000; 6:116-24 <http://www.molvis.org/molvis/v6/a16/>.
4. Clarke G, Heon E, McInnes RR. Recent advances in the molecular basis of inherited photoreceptor degeneration. Clin Genet 2000; 57:313-29.
5. Li HS, Yang JM, Jacobson RD, Pasko D, Sundin O. Pax-6 is first expressed in a region of ectoderm anterior to the early neural plate: implications for stepwise determination of the lens. Dev Biol 1994; 162:181-94.
6. Walther C, Gruss P. Pax-6, a murine paired box gene, is expressed in the developing CNS. Development 1991; 113:1435-49.
7. St-Onge L, Sosa-Pineda B, Chowdhury K, Mansouri A, Gruss P. Pax6 is required for differentiation of glucagon-producing alpha-cells in mouse pancreas. Nature 1997; 387:406-9.
8. Turque N, Plaza S, Radvanyi F, Carriere C, Saule S. Pax-QNR/Pax-6, a paired box- and homeobox-containing gene expressed in neurons, is also expressed in pancreatic endocrine cells. Mol Endocrinol 1994; 8:929-38.
9. Epstein JA, Glaser T, Cai J, Jepeal L, Walton DS, Maas RL. Two independent and interactive DNA-binding subdomains of the Pax6 paired domain are regulated by alternative splicing. Genes Dev 1994; 8:2022-34.
10. Richardson J, Cvekl A, Wistow G. Pax-6 is essential for lens-specific expression of zeta-crystallin. Proc Natl Acad Sci U S A 1995; 92:4676-80.
11. Jaworski C, Sperbeck S, Graham C, Wistow G. Alternative splicing of Pax6 in bovine eye and evolutionary conservation of intron sequences. Biochem Biophys Res Commun 1997; 240:196-202.
12. Shaughnessy M, Wistow G. Absence of MHC gene expression in lens and cloning of dbpB/YB-1, a DNA-binding protein expressed in mouse lens. Curr Eye Res 1992; 11:175-81.
13. Bonaldo MF, Lennon G, Soares MB. Normalization and subtraction: two approaches to facilitate gene discovery. Genome Res 1996; 6:791-806.
14. Wistow G, Bernstein SL, Wyatt MK, Behal A, Touchman JW, Bouffard G, Smith D, Peterson K. Expressed sequence tag analysis of adult human lens for the NEIBank Project: Over 2000 non-redundant transcripts, novel genes and splice variants. Mol Vis 2002; 8:171-84 <http://www.molvis.org/molvis/v8/a24/>.
15. Wistow G, Bernstein SL, Ray S, Wyatt MK, Behal A, Touchman JW, Bouffard G, Smith D, Peterson K. Expressed sequence tag analysis of adult human iris for the NEIBank Project: Steroid-response factors and similarities with retinal pigment epithelium. Mol Vis 2002; 8:185-95 <http://www.molvis.org/molvis/v8/a25/>.
16. Wistow G, Bernstein SL, Wyatt MK, Ray S, Behal A, Touchman JW, Bouffard G, Smith D, Peterson K. Expressed sequence tag analysis of human retina for the NEIBank Project: Retbindin, an abundant, novel retinal cDNA and alternative splicing of other retina-preferred gene transcripts. Mol Vis 2002; 8:196-204 <http://www.molvis.org/molvis/v8/a26/>.
17. Wistow G, Bernstein SL, Wyatt MK, Fariss RN, Behal A, Touchman JW, Bouffard G, Smith D, Peterson K. Expressed sequence tag analysis of human RPE/choroid for the NEIBank Project: Over 6000 non-redundant transcripts, novel genes and splice variants. Mol Vis 2002; 8:205-20 <http://www.molvis.org/molvis/v8/a27/>.
18. Hanahan D, Jessee J, Bloom FR. Plasmid transformation of Escherichia coli and other bacteria. Methods Enzymol 1991; 204:63-113.
19. Kriegler M. Gene transfer and expression: a laboratory manual. New York: Stockton Press; 1990.
20. Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 1998; 8:186-94.
21. Wistow G, Bernstein SL, Touchman JW, Bouffard G, Wyatt MK, Peterson K, Gao J, Buchoff P, Smith D. Grouping and identification of sequence tags (GRIST): Bioinformatics tools for the NEIBank database. Mol Vis 2002; 8:164-70 <http://www.molvis.org/molvis/v8/a23/>.
22. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990; 215:403-10.
23. Hobby P, Ward FJ, Denbury AN, Williams DG, Staines NA, Sutton BJ. Molecular modeling of an anti-DNA autoantibody (V-88) and mapping of its V region epitopes recognized by heterologous and autoimmune antibodies. J Immunol 1998; 161:2944-52.
24. Wistow G, Sardarian L, Gan W, Wyatt MK. The human gene for gammaS-crystallin: alternative transcripts and expressed sequences from the first intron. Mol Vis 2000; 6:79-84 <http://www.molvis.org/molvis/v6/a11/>.
25. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis of gene expression. Science 1995; 270:484-7.
26. Sharon D, Blackshaw S, Cepko CL, Dryja TP. Profile of the genes expressed in the human peripheral retina, macula, and retinal pigment epithelium determined through serial analysis of gene expression (SAGE). Proc Natl Acad Sci U S A 2002; 99:315-20.