Molecular Vision 2002;
Received 31 August 2001 | Accepted 4 June 2002 | Published 15 June 2002
Grouping and identification of sequence tags (GRIST): Bioinformatics tools for the NEIBank database
Steven L. Bernstein,2
Jeffrey W. Touchman,3
M. Keith Wyatt,1
1Section on Molecular Structure and Function, National Eye Institute, National Institutes of Health, Bethesda, MD; 2Departments of Ophthalmology and Neurobiology & Genetics, University of Maryland School of Medicine, Baltimore, MD; 3NIH Intramural Sequencing Center, Gaithersburg, MD
Correspondence to: Graeme Wistow, Ph.D., Chief, Section on Molecular Structure and Function, National Eye Institute, Building 6, Room 331, National Institutes of Health, Bethesda, MD, 20892-2740; Phone: (301) 402-3452; FAX: (301) 496-0078; email: email@example.com
NEIBank is a project to develop and organize genomics and bioinformatics resources for the eye. As part of this effort, tools have been developed for bioinformatics analysis and web-based display of data from expressed sequence tag (EST) analyses. EST sequences are identified and formed into groups or clusters representing related transcripts from the same gene. This is carried out by a rules-based procedure called GRIST (GRouping and Identification of Sequence Tags) that uses sequence match parameters derived from BLAST programs. Linked procedures are used to eliminate non-mRNA contaminants. All data are stored in a relational database and assembled for display as web pages with annotations and links to other informatics resources. Genome projects generate huge amounts of data that need to be classified and organized to become easily accessible to the research community. GRIST provides a useful tool for assembling and displaying the results of EST analyses. The NEIBank web site contains a growing set of pages cataloging the known transcriptional repertoire of eye tissues, derived from new NEIBank cDNA libraries and from eye-related data deposited in the dbEST section of GenBank.
NEIBank is a project to develop genomics resources and bioinformatics for eye research. As a first step, the project has begun to assemble a catalog of genes that are expressed in different parts of the eye in humans and other biomedically important species. This employs the technique of expressed sequence tag (EST) analysis, which is essentially a process of random sequencing of clones from cDNA libraries. As a starting point, high quality cDNA libraries have been made from several human eye tissues. Some of the first libraries for this effort are described in detail in a series of accompanying papers [3-6]. Other NEIBank libraries are also under analysis and eye-related EST data from other public sources (see compilation at UniGene) are also being gathered.
Genomics projects, such as EST analyses, can generate extremely large amounts of data. To be useful and usable for a wide range of researchers, these data need to be analyzed, organized and displayed in an easily accessible way, using the computer-based techniques of bioinformatics. Here we describe some bioinformatics tools that have been developed for NEIBank to identify and group ESTs in a relational database and to display the results in webpage format with additional annotations, links and search functions. The main tool for group assembly and database organization is a hierarchical, rules-based procedure that uses matches from BLAST sequence comparisons to GRoup and Identify Sequence Tags (GRIST).
A major emphasis of GRIST is to provide reliable identification of ESTs and to group them into clusters that represent transcripts from the same gene. GRIST achieves this with a simple process that builds upon the well-characterized BLAST programs for sequence comparison. Particularly for the NEIBank libraries, there is also an emphasis on the use of bioinformatics to remove non-mRNA contaminant sequences, such as those derived from vector, mitochondrial genome and ribosomal RNA (rRNA). The reliability of the GRIST processing is linked closely to the quality of the input data, so that the relatively high quality, relatively long sequence reads that have been obtained for cDNA libraries in the NEIBank pipeline tend to give very good results. Some of the publicly available datasets for other EST libraries have shorter and lower quality sequence reads and higher levels of non-mRNA contamination (see for example a comparison in reference ). Naturally, this decreases the number of clones that can be reliably identified. Nevertheless, GRIST has been used on many large datasets of clones for eye (and some other tissues) from several species and has produced good results in terms of correct identification of clones and appropriate grouping of clones from the same gene.
The results of these efforts are displayed at the NEIBank web site that is being developed as a resource for organizing many kinds of biological and biomedical information for eye research. So, although this project has begun with analyses of EST collections, it is planned that it will grow to include data for other kinds of genomic analysis as well as protein structure and function studies. New bioinformatics and display tools will be needed to ensure that related information from widely different sources is integrated in a helpful way. This manuscript describes the basic bioinformatics tools that have been used for analysis and display of EST data for NEIBank.
ABI sequence data are analyzed using PHRED to identify and trim quality reads. Vector, E. coli genome and human mitochondrial sequences are trimmed or eliminated using the programs RepeatMasker (by Arian Smit and Phil Green) and CrossMatch (by Phil Green) as described previously. High quality cDNA reads are analyzed using BLAST (National Center for Biotechnology Information (NCBI), National Library of Medicine, Bethesda, MD) to compare with GenBank nucleotide sequences (NT), protein sequences (NR) and dbEST. Sequences are also BLASTed against the other clones in the library dataset to identify overlapping clones.
Linker sequences and cloning artifacts that survive prior processing are removed by a custom program (rmlinkseq), which uses an updateable list of observed artifacts and trims iteratively from the 5' end of each sequence. The same program removes sequences that have less than 50 bp of unmasked sequence and also, after initial BLAST runs, removes clones identified as non-mRNA. Custom software (GRIST) is used to group and identify the ESTs based on BLAST results as described in the following sections. For validation and examination of individual clusters, sequences are also aligned using the contig assembly program SeqMan II (DNASTAR, Madison, WI).
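The trimming and length filter described above can be illustrated with a minimal Python sketch. This is not the rmlinkseq program itself: the artifact strings, the function names, and the convention that masked bases appear as X or N are all assumptions made for illustration. Only the iterative 5' trimming and the 50 bp threshold come from the text.

```python
# Hypothetical artifact list; the actual rmlinkseq list is not given in the text.
KNOWN_ARTIFACTS = ["GAATTCGGCACGAG", "GGCACGAG"]
MIN_UNMASKED_BP = 50  # threshold stated in the text


def trim_5prime(seq, artifacts=KNOWN_ARTIFACTS):
    """Repeatedly strip any listed artifact found at the 5' end of seq."""
    seq = seq.upper()
    trimmed = True
    while trimmed:  # keep going until a full pass removes nothing
        trimmed = False
        for artifact in artifacts:
            if seq.startswith(artifact):
                seq = seq[len(artifact):]
                trimmed = True
    return seq


def unmasked_length(seq):
    """Count bases that are not masked (assumes masked bases are X or N)."""
    return sum(1 for base in seq.upper() if base not in "XN")


def keep_read(seq):
    """Keep a read only if at least MIN_UNMASKED_BP unmasked bases remain."""
    return unmasked_length(seq) >= MIN_UNMASKED_BP
```

Looping until no artifact matches lets stacked linkers be removed even when they occur in an order different from the order of the list.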
Figure 1 shows a simplified flow chart of the NEIBank EST pipeline, including a broad outline of GRIST processing. GRIST relies on sequence matches generated by the widely used BLAST programs. The BLAST results are processed in a relational database using a series of rules that have been developed to select for the most reliable and informative matches. The effectiveness of this process is judged by exhaustive inspection of the groups (or clusters) of clones that are produced and the names attached to each group. Each group should consist of cDNAs from a single gene and, as far as possible, all cDNAs from that gene should be in the same group. Curation of the databases is important and the processing rules are subject to adjustment as new issues in GenBank and other databases arise. As indicated in Figure 1, the process can be described in four steps, each based on examination of different BLAST searches.
Step 1: GenBank NT matches
A reliable way to identify a nucleotide sequence is through a BLASTN search of the non-redundant nucleotide (NT) section of GenBank. However, the most statistically significant match from such a search may not be the most biologically relevant. For example, a contiguous cDNA sequence from a non-human species can give a higher significance score than an interrupted human gene sequence that contains several short, separated exons. To focus on the most biologically useful matches, GRIST uses simple criteria of percent identity and sequence length. These criteria were developed empirically through repeated analysis of large (>1000 clones) sets of NEIBank EST data [3-6]. In Step 1 of GRIST, the information on these scores and sequence lengths is parsed from BLASTN alignment output. Hits with 97% identity or better over at least 50 bp against NT are taken as high quality (HQ) matches. These parameters have been derived through repeated testing of large datasets and have been found to be reliable identifiers of a cDNA sequence, allowing correct clustering of sequences from individual genes and accurate discrimination among all but the most closely related members of gene families (see many examples at the NEIBank web site). When co-clustering of very highly conserved sequences does occur it is flagged in a subsequent step in the analysis through the identification of multiple UniGene hits for one GRIST cluster (see Step 3 below).
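As a rough illustration of the HQ criterion, the following Python sketch filters already-parsed BLASTN hits by percent identity and alignment length. The `Hit` record and function names are hypothetical, and the parsing of BLASTN output is not shown; only the 97% and 50 bp thresholds are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One parsed BLASTN match (hypothetical record layout)."""
    query: str               # EST name
    subject: str             # identifier of the matched NT entry
    percent_identity: float
    align_length: int        # aligned length in bp


def is_hq_nt(hit, min_identity=97.0, min_length=50):
    """True if the hit meets the Step 1 thresholds stated in the text."""
    return hit.percent_identity >= min_identity and hit.align_length >= min_length


def select_hq(hits):
    """Keep only HQ hits, preserving the original BLASTN ranking order."""
    return [hit for hit in hits if is_hq_nt(hit)]
```

Because the filter uses identity and length rather than the BLAST significance score, a long but interrupted match is not automatically outranked by a contiguous cross-species match, which is the concern raised above.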
For many (perhaps most) genes, there are multiple entries in GenBank, usually consisting of cDNA or gene sequences for the same gene produced by different researchers. This means that different ESTs from the same gene may give different lists of matches with the multiple targets in GenBank. As the number of GenBank entries for each gene increases, the number of BLASTN matches considered by GRIST can be increased. Currently GRIST examines the top eight matches and stores the parsed BLASTN parameters for each EST in a relational database. The ESTs are then grouped into clusters according to their common HQ matches with NT database entries and a relational chain is established to pull together groups of clones that have HQ NT hits in common.
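The "relational chain" described above amounts to finding connected groups of ESTs linked by shared HQ hits. GRIST does this inside a relational database; the sketch below substitutes a union-find structure in Python purely to illustrate the idea. The function name and input layout are assumptions; the limit of eight matches per EST is from the text.

```python
from collections import defaultdict


def cluster_by_shared_hits(hq_hits, max_hits=8):
    """Group ESTs that share at least one HQ NT hit, directly or via a chain.

    hq_hits: dict mapping EST name -> ranked list of HQ NT accessions.
    Only the top max_hits accessions per EST are considered.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_est_for_accession = {}
    for est, accessions in hq_hits.items():
        find(est)  # register ESTs that have no hits as singletons
        for accession in accessions[:max_hits]:
            if accession in first_est_for_accession:
                union(first_est_for_accession[accession], est)
            else:
                first_est_for_accession[accession] = est

    clusters = defaultdict(set)
    for est in hq_hits:
        clusters[find(est)].add(est)
    return sorted(sorted(members) for members in clusters.values())
```

In this sketch two ESTs that share no accession directly still end up together if a third EST links them, which mirrors the chaining behavior described in the text.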
An important feature of GRIST is the application of several filters that are used to remove non-mRNA sequences and to prevent clustering of unrelated transcripts through BLASTN matches with large genomic clones that may contain several different genes. To do this, GRIST uses text interrogation of BLAST results to identify problem sequences. For example, rRNA-related sequences are identified by text word search of the BLASTN output and removed from the analysis. Other non-mRNA clones, such as a few containing bacterial plasmid sequences that are not included in the E. coli genome database, are similarly removed. After the first pass of BLAST analysis, all the identified non-mRNA clones are eliminated from the library files for each dataset and are excluded from future analyses and subsequent submissions to GenBank. A similar text-based procedure detects matches with large genomic clones that contain multiple genes and excludes these from the clustering procedure. Criteria for this filter include the size of the target clone (which must be less than 50 kbp) and keywords (such as BAC, PAC, P1, and chromosome) that signify large clones. These procedures prevent co-clustering of transcripts from neighboring genes.
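A minimal Python sketch of such text-based filters is given below. The keywords BAC, PAC, P1 and chromosome and the 50 kbp limit come from the text; the rRNA search terms, the case-insensitive matching, and the function names are assumptions, since the exact words GRIST searches for are not listed.

```python
import re

# Assumed search terms for rRNA-related hits (not specified in the text).
RRNA_PATTERN = re.compile(r"ribosomal RNA|rRNA", re.IGNORECASE)
# Keywords named in the text as signifying large genomic clones.
LARGE_CLONE_PATTERN = re.compile(r"\b(BAC|PAC|P1|chromosome)\b", re.IGNORECASE)
MAX_TARGET_BP = 50_000  # targets must be shorter than 50 kbp


def is_rrna_hit(description):
    """True if the hit description looks rRNA-related."""
    return bool(RRNA_PATTERN.search(description))


def usable_for_clustering(description, target_length_bp):
    """True if a hit may be used to group ESTs (not a large genomic clone)."""
    if target_length_bp >= MAX_TARGET_BP:
        return False
    return LARGE_CLONE_PATTERN.search(description) is None
```

Applying both a size limit and a keyword test is deliberately redundant: either signal alone is enough to exclude a target from clustering.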
Since there are multiple entries for most genes in GenBank, there are also several different possible names for each entry, names that can vary greatly in usefulness. For example, a rapidly growing category of entries in GenBank consists of sequences that are derived from "full-length" cDNA sequencing projects [10-13]. Many of these sequences have series identifiers rather than useful gene names, even when they correspond to known genes that were already represented in GenBank. GRIST contains rules to select among the many available GenBank clone names for those that are likely to be more informative. It selects against names containing characteristic identifiers for clones from the full-length projects. A series of choices are also made to give preference to names containing the word "mRNA" and to select against entries containing words such as "chromosome". This tends to select for names derived from individual cDNAs rather than from larger clones, which often have complex designations. From names that satisfy the selection rules the first in alphabetical order is chosen. One quirk is that, for human data, the few sequences derived from gorilla may rank higher than those for human, so another rule is included to select for the correct species.
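The name-selection rules can be sketched as a ranking over candidate names, as in the Python example below. The relative priority of the rules, the identifier prefixes used to recognize full-length project clones (FLJ, KIAA, DKFZ, MGC), and the function name are assumptions made for illustration; the text states only that such names are selected against, that "mRNA" is preferred, that "chromosome" is avoided, that the correct species is selected, and that the first name in alphabetical order is chosen among those remaining.

```python
# Assumed prefixes for clones from full-length cDNA projects.
FULL_LENGTH_TAGS = ("FLJ", "KIAA", "DKFZ", "MGC")


def choose_name(names, species="Homo sapiens"):
    """Pick the most informative name from a list of GenBank definitions.

    Lower tuples sort first; each element is False for a preferred name.
    """
    def penalty(name):
        lowered = name.lower()
        return (
            species.lower() not in lowered,                # wrong species
            any(tag in name for tag in FULL_LENGTH_TAGS),  # series identifier
            "chromosome" in lowered,                       # large-clone name
            "mrna" not in lowered,                         # not an mRNA entry
            name,                                          # alphabetical tie-break
        )

    return min(names, key=penalty) if names else None
```

Ranking rather than hard filtering means a name is still returned when every candidate violates some rule, which may or may not match how GRIST behaves in that case.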
Step 2: "Self" match
Step 1 is sufficient to identify and group the majority of ESTs that correspond to known genes. However, ESTs that correspond to novel genes or to genes whose representatives in GenBank are not full length may remain ungrouped. Step 2 of the GRIST process, the "self-library match", examines BLASTN matches between different ESTs in the same library (i.e. each lens clone compared with all the lens clones). Any match between two ESTs with a BLASTN value better than 1x10^-20 is taken as reflecting a significant overlap with another clone. Subject to this criterion, separate groups of clones from Step 1 are merged. This groups sequences that correspond to incomplete GenBank sequences and novel genes. The combined effects of Steps 1 and 2 are illustrated in Figure 2. This provides most of the identification and grouping of GRIST clusters, but further information and some additional grouping is provided in two subsequent steps.
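The merging of Step 1 groups by self-library matches might look like the following Python sketch. It assumes that the "BLASTN value" is the expectation (E) value, so that "better" means smaller, and that every EST already belongs to some Step 1 group (ungrouped clones being singleton groups). The function name and input layout are hypothetical.

```python
SELF_MATCH_CUTOFF = 1e-20  # threshold from the text, read here as an E-value


def merge_by_self_matches(groups, self_hits, cutoff=SELF_MATCH_CUTOFF):
    """Merge Step 1 groups linked by a significant EST-to-EST match.

    groups: list of sets of EST names (singletons for ungrouped clones).
    self_hits: iterable of (est_a, est_b, evalue) from the self-library BLASTN.
    """
    merged = [set(group) for group in groups]
    group_of = {est: idx for idx, group in enumerate(merged) for est in group}

    for est_a, est_b, evalue in self_hits:
        if est_a == est_b or evalue >= cutoff:
            continue  # ignore self-alignments and weak matches
        idx_a, idx_b = group_of[est_a], group_of[est_b]
        if idx_a == idx_b:
            continue  # already in the same group
        merged[idx_a] |= merged[idx_b]
        for est in merged[idx_b]:
            group_of[est] = idx_a
        merged[idx_b] = set()

    return [group for group in merged if group]
```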
Step 3: dbEST and UniGene
As another way to identify clones and to gain access to related information (such as chromosome location and functional keywords), UniGene identifiers are assigned for GRIST clusters. Although it is possible to search reference sequences for each UniGene cluster, such reference sequences are not always complete and indeed may not always be representative of all the clones in the cluster. For these reasons, GRIST takes an independent route to UniGene identification. Sequences are compared with the dbEST section of GenBank using BLASTN and a match of at least 96% identity over at least 100 nt is taken as an HQ match with another EST. Again, these parameters were derived by inspection of BLASTN results followed by cycles of adjustment and inspection. The accession numbers of the ESTs matched in the dbEST search are then located in lists of ESTs contained in each UniGene cluster. The corresponding UniGene identifiers are then attached to the groups of sequences assembled in the GRIST relational database.
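The indirect route to UniGene identifiers can be sketched in Python as a two-stage lookup: filter dbEST hits by the stated thresholds, then map the matched accessions to UniGene clusters. The data layout, function names, and identifiers in the usage example are hypothetical; the 96% and 100 nt thresholds are from the text. The second helper reflects the flagging of clusters with multiple UniGene hits mentioned under Step 1.

```python
def unigene_ids_for_cluster(cluster_ests, dbest_hits, unigene_of_est,
                            min_identity=96.0, min_length=100):
    """Collect UniGene identifiers reached through HQ dbEST matches.

    cluster_ests: EST names in one GRIST cluster.
    dbest_hits: dict EST name -> list of (accession, percent_identity, length).
    unigene_of_est: dict dbEST accession -> UniGene identifier.
    """
    identifiers = set()
    for est in cluster_ests:
        for accession, identity, length in dbest_hits.get(est, []):
            if identity >= min_identity and length >= min_length:
                unigene = unigene_of_est.get(accession)
                if unigene is not None:  # matched EST may be ungrouped in UniGene
                    identifiers.add(unigene)
    return identifiers


def needs_review(identifiers):
    """Flag a GRIST cluster that maps to more than one UniGene cluster."""
    return len(identifiers) > 1
```

A usage example with made-up identifiers:

```python
dbest_hits = {"est1": [("dbEST_1", 98.5, 250)], "est2": [("dbEST_2", 97.0, 120)]}
unigene_of_est = {"dbEST_1": "UG_1", "dbEST_2": "UG_2"}
ids = unigene_ids_for_cluster(["est1", "est2"], dbest_hits, unigene_of_est)
print(sorted(ids), needs_review(ids))
```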
Matches to UniGene are also used as a last resort to group otherwise unidentified clones. Those ungrouped clones that have no GenBank or self-library matches but do match a common UniGene cluster are grouped. Although this has the potential to associate clones from different genes that happen to be erroneously grouped in UniGene, it does provide a convenient link to related dbEST clones and gives some indication of possible chromosomal location. Such groups, however, need careful examination if they prove to be of interest.
Some clones have no significant NT or UniGene hit but do match individual, ungrouped dbEST entries. The accession numbers of these ESTs are attached to the GRIST clusters. Another text-based filter is applied to dbEST matches in order to eliminate "self-hits" as NEIBank clones are submitted to dbEST and consequently match themselves.
Following Steps 1-3, ESTs are grouped into GRIST clusters with identifiers corresponding both to NT and UniGene. The use of both sources of identification can help to resolve uninformative clone names and to give more complete designation for genes and proteins with multiple names. These GenBank-derived names and identifiers may change as GenBank and (particularly) UniGene are updated. However, the GRIST clusters themselves remain essentially constant, although ungrouped clones from unidentified genes will regroup in later reanalysis as GenBank becomes more complete in its cataloging of human genes. To provide a group identifier, separate from the GenBank name, each GRIST cluster is named internally using the clone name of the first clone as listed alphanumerically.
It is not uncommon for GRIST clustering to reveal discrepancies between GenBank and UniGene identifiers for clone clusters. Closer inspection usually reveals a problem in UniGene, often the result of erroneous grouping of multiple transcripts in one cluster because of chimeric sequences (usually fused cDNAs or sequencer tracking errors). Several examples of UniGene clusters containing sequences from more than one gene or of sequences from one gene dispersed among several UniGene clusters have been detected, reported and (largely) corrected in successive builds of UniGene. However, many examples persist of GRIST clusters that match ESTs contained in a UniGene corresponding to a known gene, but do not match the named gene itself in GenBank. In many cases inspection reveals that this is because the EST represents sequence from untranslated regions that are not included in the GenBank sequence.
There are also cases in which sequences in one GRIST cluster match more than one UniGene. We have examined many examples of this in detail. In the majority of cases the GRIST cluster proves to be completely self-consistent. Again, the problem often arises from the presence of chimeric clones in GenBank or dbEST. One striking example of this phenomenon is the gene designated as "alpha" (GenBank accession number AF203815), a gene (or at least a source of transcripts) of unknown function that gives rise to one of the most abundant groups of cDNAs represented in the human NEIBank RPE/choroid collection. At the time of writing, there is no separate UniGene cluster for the products of this gene. Instead these ESTs are distributed among UniGenes for several other genes because of the presence of several "alpha" chimeras.
Step 4: ORF detection and unidentified or "novel" clones
Even after steps 1-3, some sequences (typically 10-15% of the total) remain with no significant NT or dbEST matches. Caution is necessary before designating these as "novel" genes. While this class does include novel transcripts, some of which are described in accompanying manuscripts, it also includes some non-mRNA noise. This can range from cloned fragments of genomic DNA to poor sequence reads that nevertheless pass the quality criteria of PHRED (sequence "judder" in repetitive regions, such as polyA tail sequence, can give surprisingly "high quality" scores). Since not everything that fails to match known genes in GenBank is in fact "novel", in GRIST and the NEIBank web pages, these clones are simply called "unidentified".
One useful way of identifying unidentified cDNA clones that may actually have biological significance is to detect open reading frames (ORFs) that reveal significant similarity to known families of proteins. In Step 4 of GRIST, this is done using the results of BLASTX searches of the GenBank NR protein databases. BLASTX matches are excellent indicators of "real" gene transcripts. For example, some otherwise unidentified clones in the NEIBank collections for RPE/choroid and retina libraries revealed BLASTX similarity with cadherins and later proved to be clones for CDH23, a novel cadherin that was subsequently identified as the locus for Usher Syndrome 1D and Nonsyndromic Autosomal Recessive Deafness DFNB12 [14,15]. Similarly, it is also common to see otherwise unidentified clones that match protein motifs or domains in yeast, nematode, fly and even plant databases. While these proteins are generally of unknown function, the predicted protein similarity is a strong indication that the EST encodes a novel protein with a domain that has been well conserved during evolution. In other cases, Step 4 may reveal identity with predicted proteins derived from large genome sequence contigs. These may not be detected in Step 1 of GRIST if there is no individual gene or mRNA sequence in GenBank.
NEIBank Web Site
As a first step towards the creation of a molecular encyclopedia of the eye, the EST data from NEIBank and from other resources deposited in dbEST are being collected, assembled using GRIST, and displayed at the NEIBank web site.
For navigation, the homepage displays an illustration of the human eye. The illustration and equivalent text markers for each tissue are linked to resource pages that list GRIST-compiled cDNA library data and other relevant resources including links to other eye-related web sites. Data are available for individual libraries and for combined collections, such as un-normalized and normalized data for the same tissue, to produce a view of the transcriptional repertoire of eye tissues. In the complete listing of a library, the GRIST clusters of ESTs are arranged in order of abundance to give some insight into expression levels (Figure 3). For each cluster, identifying names from GenBank and UniGene are shown and these are linked back to their respective databases where there are further links to other resources.
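Ordering clusters by abundance for the library listing is a simple sort on clone count, as in the Python sketch below. The function name and the alphabetical tie-break are assumptions; the text states only that clusters are arranged in order of abundance.

```python
def rank_by_abundance(clusters):
    """Return (cluster name, clone count) pairs, most abundant first.

    clusters: dict mapping GRIST cluster name -> list of clone names.
    Ties are broken alphabetically by cluster name (an assumption).
    """
    counts = ((name, len(clones)) for name, clones in clusters.items())
    return sorted(counts, key=lambda item: (-item[1], item[0]))
```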
If the NEIBank cDNAs have no significant match to either NT or UniGene clusters, any significant hit against ungrouped sequences in the dbEST section of GenBank is shown. For clones with fewer than two identifiers, any protein similarity matches derived from BLASTX are reported as "% protein similarity". Hits against large BAC clones or genomic contigs are not currently shown, but a feature to address such locations will be added. Some clones match genes predicted from genome contigs but which do not have separate GenBank entries. As a result, GRIST does not list a nucleotide sequence hit but often shows a 100% BLASTX hit.
The basic web page display also includes information on chromosome location extracted from UniGene, where available. This is useful, but only as reliable as the data in UniGene. If the UniGene itself contains sequences from more than one gene, the reported location may be inaccurate or ambiguous. An alternative method to access location in the genome for a particular group of ESTs is provided through a link to a major human genome database. This is accessed through the final column of the basic library web page presentation which shows the number of clones in each NEIBank cluster. This number is hyperlinked to a full listing of the clone names and this in turn is linked to a FASTA library of the cluster sequences themselves. This can be downloaded for analysis. A link is included to the BLAT server of the UCSC Genome Browser that allows rapid search of the FASTA format sequences against the human genome. This provides a rapid validation of sequences and chromosomal location in addition to a visual aid for detection of variant transcripts.
As an alternative to the full library listing (which can create a large web page), subsets of the data can be selected using keywords that are derived either through manual annotation or through automatic retrieval from LocusLink and are associated with each group of clones in the database. Several keyword categories relating to the chromosomal location of the gene or to the function, cellular location and structural class of the protein product are included in the database. Keywords and other annotation are an area of continuing development. Library pages can also be interrogated using a text search tool that produces a web page listing of all clusters whose GenBank identifiers contain an exact match to a required text (such as "glutathione"). Specific EST clones (which may have been mentioned in a manuscript or derived from a search of dbEST) can also be located in their GRIST cluster. This is achieved with another search tool that locates a specific NEIBank EST identity code (such as "by01a01") in the GRIST clusters for a particular library and produces a web page listing that shows the appropriate cluster at the top of the screen. Finally, the NEIBank libraries and other eye-related EST collections can be searched using a BLAST server available through the web site. This allows investigators to search for DNA or protein ORF sequences matching their gene or protein of interest.
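The two lookup tools described above, text search over cluster names and location of a specific clone, can be sketched in Python as follows. The function names and data layout are hypothetical, and the case-insensitive substring match is an assumption about what "exact match to a required text" means in practice.

```python
def search_by_text(cluster_names, text):
    """Return cluster identifiers whose GenBank-derived name contains text.

    cluster_names: dict mapping cluster identifier -> name string.
    Matching is a case-insensitive substring test (an assumption).
    """
    needle = text.lower()
    return sorted(cid for cid, name in cluster_names.items()
                  if needle in name.lower())


def find_cluster_for_clone(clusters, clone_id):
    """Return the identifier of the cluster containing clone_id, or None.

    clusters: dict mapping cluster identifier -> collection of clone names.
    """
    for cluster_id, clones in clusters.items():
        if clone_id in clones:
            return cluster_id
    return None
```

For example, `search_by_text(names, "glutathione")` would list every cluster whose name mentions glutathione, and `find_cluster_for_clone(clusters, "by01a01")` would return the cluster holding that clone.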
Tools to collect and display data from eye-related SAGE and micro-array experiments are also being developed. As genome projects move beyond sequence to function studies it is also planned to add resources for display of protein structure and proteomics data. Suggestions and additional links are welcomed.
SLB is supported by the V. Kann Rasmussen Foundation (Denmark) and is a Career Development Awardee of Research to Prevent Blindness (RPB).
1. Wistow G. A project for ocular bioinformatics: NEIBank. Mol Vis 2002; 8:161-3 <http://www.molvis.org/molvis/v8/a22/>.
2. Fields C. Analysis of gene expression by tissue and developmental stage. Curr Opin Biotechnol 1994; 5:595-8.
3. Wistow G, Bernstein SL, Wyatt MK, Behal A, Touchman JW, Bouffard G, Smith D, Peterson K. Expressed sequence tag analysis of adult human lens for the NEIBank Project: Over 2000 non-redundant transcripts, novel genes and splice variants. Mol Vis 2002; 8:171-84 <http://www.molvis.org/molvis/v8/a24/>.
4. Wistow G, Bernstein SL, Ray S, Wyatt MK, Behal A, Touchman JW, Bouffard G, Smith D, Peterson K. Expressed sequence tag analysis of adult human iris for the NEIBank Project: Steroid-response factors and similarities with retinal pigment epithelium. Mol Vis 2002; 8:185-95 <http://www.molvis.org/molvis/v8/a25/>.
5. Wistow G, Bernstein SL, Wyatt MK, Ray S, Behal A, Touchman JW, Bouffard G, Smith D, Peterson K. Expressed sequence tag analysis of human retina for the NEIBank Project: Retbindin, an abundant, novel retinal cDNA and alternative splicing of other retina-preferred gene transcripts. Mol Vis 2002; 8:196-204 <http://www.molvis.org/molvis/v8/a26/>.
6. Wistow G, Bernstein SL, Wyatt MK, Fariss RN, Behal A, Touchman JW, Bouffard G, Smith D, Peterson K. Expressed sequence tag analysis of human RPE/choroid for the NEIBank Project: Over 6000 non-redundant transcripts, novel genes and splice variants. Mol Vis 2002; 8:205-20 <http://www.molvis.org/molvis/v8/a27/>.
7. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol 1990; 215:403-10.
8. Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 1998; 8:186-94.
9. Bouffard GG, Iyer LM, Idol JR, Braden VV, Cunningham AF, Weintraub LA, Mohr-Tidwell RM, Peluso DC, Fulton RS, Leckie MP, Green ED. A collection of 1814 human chromosome 7-specific STSs. Genome Res 1997; 7:59-64.
10. Strausberg RL, Feingold EA, Klausner RD, Collins FS. The mammalian gene collection. Science 1999; 286:455-7.
11. Yudate HT, Suwa M, Irie R, Matsui H, Nishikawa T, Nakamura Y, Yamaguchi D, Peng ZZ, Yamamoto T, Nagai K, Hayashi K, Otsuki T, Sugiyama T, Ota T, Suzuki Y, Sugano S, Isogai T, Masuho Y. HUNT: launch of a full-length cDNA database from the Helix Research Institute. Nucleic Acids Res 2001; 29:185-8.
12. Nagase T, Kikuno R, Ishikawa K, Hirosawa M, Ohara O. Prediction of the coding sequences of unidentified human genes. XVII. The complete sequences of 100 new cDNA clones from brain which code for large proteins in vitro. DNA Res 2000; 7:143-50.
13. Wiemann S, Weil B, Wellenreuther R, Gassenhuber J, Glassl S, Ansorge W, Bocher M, Blocker H, Bauersachs S, Blum H, Lauber J, Dusterhoft A, Beyer A, Kohrer K, Strack N, Mewes HW, Ottenwalder B, Obermaier B, Tampe J, Heubner D, Wambutt R, Korn B, Klein M, Poustka A. Toward a catalog of human genes and proteins: sequencing and analysis of 500 novel complete protein coding human cDNAs. Genome Res 2001; 11:422-35.
14. Bork JM, Peters LM, Riazuddin S, Bernstein SL, Ahmed ZM, Ness SL, Polomeno R, Ramesh A, Schloss M, Srisailpathy CR, Wayne S, Bellman S, Desmukh D, Ahmed Z, Khan SN, Kaloustian VM, Li XC, Lalwani A, Riazuddin S, Bitner-Glindzicz M, Nance WE, Liu XZ, Wistow G, Smith RJ, Griffith AJ, Wilcox ER, Friedman TB, Morell RJ. Usher syndrome 1D and nonsyndromic autosomal recessive deafness DFNB12 are caused by allelic mutations of the novel cadherin-like gene CDH23. Am J Hum Genet 2001; 68:26-37.
15. Bolz H, von Brederlow B, Ramirez A, Bryda EC, Kutsche K, Nothwang HG, Seeliger M, del C-Salcedo Cabrera M, Vila MC, Molina OP, Gal A, Kubisch C. Mutation of CDH23, encoding a new member of the cadherin gene family, causes Usher syndrome type 1D. Nat Genet 2001; 27:108-12.