Molecular Vision 2011; 17:3034-3054
Received 27 July 2011 | Accepted 18 November 2011 | Published 23 November 2011
Matthew J. Brooks, Harsha K. Rajasimha, Jerome E. Roger, Anand Swaroop
The first two authors contributed equally to this work
Neurobiology-Neurodegeneration and Repair Laboratory, National Eye Institute, National Institutes of Health, Bethesda, MD
Correspondence to: Anand Swaroop, Neurobiology-Neurodegeneration and Repair Laboratory, National Eye Institute, National Institutes of Health, MSC0610, 6 Center Drive, Bethesda, MD, 20892; Phone: (301) 435-5754; FAX: (301) 480-9917; email: firstname.lastname@example.org
Purpose: Next-generation sequencing (NGS) has revolutionized systems-based analysis of cellular pathways. The goals of this study are to compare NGS-derived retinal transcriptome profiling (RNA-seq) to microarray and quantitative reverse transcription polymerase chain reaction (qRT–PCR) methods and to evaluate protocols for optimal high-throughput data analysis.
Methods: Retinal mRNA profiles of 21-day-old wild-type (WT) and neural retina leucine zipper knockout (Nrl−/−) mice were generated by deep sequencing, in triplicate, using Illumina GAIIx. The sequence reads that passed quality filters were analyzed at the transcript isoform level with two methods: Burrows–Wheeler Aligner (BWA) followed by ANOVA (ANOVA) and TopHat followed by Cufflinks. qRT–PCR validation was performed using TaqMan and SYBR Green assays.
Results: Using an optimized data analysis workflow, we mapped about 30 million sequence reads per sample to the mouse genome (build mm9) and identified 16,014 transcripts in the retinas of WT and Nrl−/− mice with BWA workflow and 34,115 transcripts with TopHat workflow. RNA-seq data confirmed stable expression of 25 known housekeeping genes, and 12 of these were validated with qRT–PCR. RNA-seq data had a linear relationship with qRT–PCR for more than four orders of magnitude and a goodness of fit (R2) of 0.8798. Approximately 10% of the transcripts showed differential expression between the WT and Nrl−/− retina, with a fold change ≥1.5 and p value <0.05. Altered expression of 25 genes was confirmed with qRT–PCR, demonstrating the high degree of sensitivity of the RNA-seq method. Hierarchical clustering of differentially expressed genes uncovered several as yet uncharacterized genes that may contribute to retinal function. Data analysis with BWA and TopHat workflows revealed a significant overlap yet provided complementary insights in transcriptome profiling.
Conclusions: Our study represents the first detailed analysis of retinal transcriptomes, with biologic replicates, generated by RNA-seq technology. The optimized data analysis workflows reported here should provide a framework for comparative investigations of expression profiles. Our results show that NGS offers a comprehensive and more accurate quantitative and qualitative evaluation of mRNA content within a cell or tissue. We conclude that RNA-seq based transcriptome characterization would expedite genetic network analyses and permit the dissection of complex biologic functions.
Next-generation sequencing (NGS) technology has launched a new era of enormous potential and applications in genomic and transcriptomic analyses [1-3]. With continued cost reductions and improved analytical methods, NGS has begun to have a direct impact on biomedical discovery and clinical outcome [4-6]. NGS has enabled “meta-genomic” studies to survey the genomes of organisms in a particular ecosystem , and decode the entire genomes of species ranging from bacteria [8,9] and viruses  to humans . Whole-genome sequencing has made it possible to investigate genetic variations , global DNA methylation , and in vivo analysis of targets of DNA-binding proteins [14,15]. Deep sequencing of RNA with NGS (called “RNA-seq”) allows a comprehensive evaluation and quantification of all subtypes of RNA molecules expressed in a cell or tissue . RNA-seq technology can detect transcripts expressed at low levels  and permit the identification of unannotated transcripts and new spliced isoforms [16,18]. The issues related to cross-hybridization and detection levels that limit the accuracy of gene expression estimates by microarray technology are not relevant to the data obtained with RNA-seq . Visualization of mapped sequence reads spanning the splice junctions can also reveal novel splice forms of annotated genes in the mouse retina, which was not possible with earlier hybridization-based technologies. With a steady reduction in the costs of NGS, RNA-seq is now emerging as a method of choice for comprehensive transcriptome profiling.
The vertebrate retina exhibits a highly organized laminar structure that captures, integrates, and transmits visual information to the brain for further processing. Photoreceptors constitute more than 70% of the retinal cells and convert light into electrical signals . Rod photoreceptors mediate dim light vision and can detect a single photon of light, while cone photoreceptors are responsible for daylight vision, color perception, and visual acuity [21,22]. Impairment of photoreceptor function leads to retinal degeneration with a more common pattern of rod death preceding the death of cones [23-25]. The neural retina leucine zipper (Nrl) gene encodes a basic-motif leucine zipper transcription factor necessary for determining rod photoreceptor cell fate and functional maintenance . The Nrl−/− mouse, generated by creating a loss of function mutation in Nrl, has a cone-only retina that serves as a useful model for studies of cone biology [26-28].
Several previous investigations have elucidated the gene expression landscape specific to whole retina or retinal cell types and during development or aging. Serial analysis of gene expression [29-31] and cDNA eye gene arrays [32-36] were initially used to determine signatures of retinal gene expression. Oligonucleotide microarrays have since allowed a more comprehensive approach to expression profiling [37-41]. Microarray analyses of flow-sorted photoreceptors and single cells from dissociated retinas [42-44] have begun to reveal new insights into regulatory networks. Application of NGS technology greatly expands the power of expression profiling by identifying all transcripts and spliced isoforms in the tissue or cell type of interest.
Here, we have used the power of NGS-based RNA-seq analysis to investigate in depth the transcriptome of wild-type (WT) and Nrl−/− retinas and identified a set of differentially expressed genes and spliced isoforms. We have also taken advantage of the knowledge about Nrl−/− mice to optimize workflows for data analysis and compared our results with those obtained with microarray methods and quantitative reverse transcription polymerase chain reaction (qRT–PCR) analysis. Our studies illustrate that RNA-seq offers a more complete, accurate, and relatively faster approach for comparative and comprehensive analysis of retinal transcriptomes and for discovering novel transcribed sequences. Our validated data analysis workflow should also be beneficial for similar studies by other investigators. Raw data and workflow are available on the N-NRL/NEI website.
All investigations on mice were approved by the Animal Care and Use Committee of the National Eye Institute and followed the tenets of the Declaration of Helsinki. C57Bl/6J (referred to as wild type, WT) and Nrl−/− (on C57Bl/6J background ) mice were euthanized with CO2 inhalation. The retinas were excised rapidly, frozen on dry ice, and stored at −80 °C.
Fresh frozen mouse retinas were lysed with a mortar and pestle in TRIzol Reagent, and total RNA was isolated according to the manufacturer’s protocol (Invitrogen, Carlsbad, CA). RNA quality and quantity were assessed with the RNA 6000 Nano Kit (Agilent, Santa Clara, CA).
Whole retinal RNA samples were independently processed from three wild-type and three Nrl−/− mice at P21. Total RNA (1 μg) was used with the TruSeq mRNA-seq Sample Preparation Kit (Illumina, San Diego, CA) to construct cDNA libraries. The quality of the libraries was verified using the DNA-1000 Kit (Agilent) and quantitation performed with qRT–PCR using ABI 7900HT (Life Technologies, Carlsbad, CA), as suggested in the Sequencing Library qRT–PCR Quantification Guide (Illumina). Gene Expression Master Mix (Life Technologies) was used for the qRT–PCR reactions, and a titration of phiX control libraries was employed as the quantification standard.
Each cDNA library (10 pM) was independently loaded into one flow cell lane, and single-read cluster generation proceeded using the TruSeq SR Cluster Generation Kit v5 (Illumina). Sequencing-by-synthesis (SBS) of 70-nucleotide length was performed on a Genome Analyzer IIx running SCS2.8 software using SBS v4 reagents (Illumina). Base calling and chastity filtering were performed using RTA (real-time analysis with SCS2.8).
Burrows–Wheeler Transform Aligner (BWA)  was used to align RNA-seq reads against the mouse reference genome (build mm9), downloaded and indexed from the University of California Santa Cruz (UCSC) genome browser database . The resulting sequence alignment/map files were imported into Partek Genomics Suite (Partek Inc., St. Louis, MO) to compute raw and fragments per kilobase of exon model per million mapped (FPKM) reads normalized expression values of the transcript isoforms defined in the UCSC refFlat file. A stringent filtering criterion of FPKM value 1.0 (equivalent to one transcript per cell ) in at least one out of six samples was used to obtain expressed transcripts. The FPKM values of the filtered transcripts were log-transformed using log2 (FPKM+offset) with an offset=1.0. ANOVA (ANOVA) was then performed on the log-transformed data of the two groups (WT and Nrl−/−) to generate fold change and p values for each transcript. Y-chromosome transcripts were filtered out along with non-coding (nc) RNAs, mitochondrial DNA coded genes, pseudogenes, and predicted protein-coding genes. Differentially expressed mRNA isoforms were filtered for a fold change cutoff of 1.5 and p-value cutoff of 0.05. These criteria were implemented to enable a comparison with previous expression studies. Hierarchical clustering was performed using Cluster 3.0 software . We used uncentered correlation as the distance metric. Heatmaps and dendrograms were generated using JavaTreeView software . Aligned reads were visualized using the Integrated Genomics Viewer (IGV) .
Raw reads that passed the chastity filter threshold were mapped using TopHat  to identify known and novel splice junctions and to generate read alignments for each sample. Genomic annotations were obtained from Ensembl in gene transfer format (GTF). Splice junctions from the six samples were combined into a master junctions file that was used as an input file for the second iteration of TopHat mapping. The transcript isoform level and gene level counts were calculated and FPKM normalized using Cufflinks. An FPKM filtering cutoff of 1.0 in at least one of the six samples was used to determine expressed transcripts. Differential transcript expression was then computed using Cuffdiff. The resulting lists of differentially expressed isoforms were filtered and sorted into the following categories: protein coding mRNA transcripts and ncRNA transcripts.
Reverse transcription (RT) reactions were performed using oligo(dT)20 with SuperScript II reagents (Life Technologies) according to the manufacturer’s protocol. cDNA synthesized from 2 μg of total RNA (1 μg for minus RT controls) was diluted to 100 μl (fivefold dilution), and from this 1 μl was used for each qRT–PCR reaction. The qRT–PCR reactions were performed in triplicate for TaqMan assays or in duplicate for the SYBR assays, using three biologic replicates per genotype, on a 7900HT Genetic Analyzer (Life Technologies). TaqMan assays were performed using TaqMan Gene Expression Master Mix and TaqMan Gene Expression Assays (Life Technologies) for genes listed in Table 1. The SYBR Green assays (Table 2) were performed using Power SYBR Green Master Mix (Life Technologies) and oligonucleotides at a final concentration of 200 nM. Oligonucleotides were designed using the Primer3 PCR Primer Design Tool  and synthesized by Integrated DNA Technologies (Coralville, IA). To eliminate complications due to contaminating genomic DNA in the RNA samples, qRT–PCR reactions were also performed with minus-RT control, using hypoxanthine guanine phosphoribosyl transferase (Hprt) primer pairs that can differentiate between mRNA and genomic DNA (data not shown). Differential expression analysis was performed using the ddCt method , with the geometric average of actin, beta (ActB) and Hprt as the endogenous controls .
Six libraries of P21 retinal cDNA (three each from WT and Nrl−/−) were sequenced to obtain 35 to 49 million raw sequence reads per sample (Table 3). Of these, 75.8% to 82.7% reads passed the RTA chastity filter and were used for subsequent Burrows–Wheeler Aligner (BWA) and TopHat analysis workflows (Figure 1). Due to TopHat workflow’s power to map across splice junctions, the workflow consistently yielded 6 to 7 million more alignments per sample when compared to BWA.
Based on the BWA analysis workflow, 16,014 transcripts were detected with a normalized FPKM value greater than 1.0 in any of the six samples. Transcripts were filtered based on whether they were mRNAs or ncRNAs. Of the 15,142 mRNA transcripts, only 1,422 met our criteria of differential expression of having a fold change greater than 1.5 and a p-value less than 0.05 (Table 4). Of the 1,422 differentially expressed mRNA transcripts (DETs) representing 1,218 unique genes, 551 were downregulated in Nrl−/− (including rod-specific genes) retinas, and 871 were upregulated in Nrl−/− (including cone-enriched genes and those involved in retinal remodeling) retinas.
A total of 34,115 transcripts were detected with a normalized FPKM value of greater than 1.0 in any of the samples in either group. Transcripts were filtered based on whether they were protein-coding mRNAs or ncRNAs. Of the 32,001 mRNA transcripts, only 3,258 met our criteria of differential expression (Table 4). The DETs represented 1990 unique genes; 1,504 were downregulated in Nrl−/− (including rod-specific genes) retinas, and 1,754 were upregulated in the Nrl−/− (including cone-enriched genes and those involved in retinal remodeling) retinas.
The BWA/ANOVA and TopHat/Cufflinks analyses were compared to assess the consistency and quality of the results. Using the official Mouse Genome Informatics gene symbol as the linking term, Venn diagrams were constructed to summarize the overlap between the set of all (Figure 2A), the top 500 (Figure 2B), and the top 200 (Figure 2C) DETs from the BWA workflow and the DET list from the TopHat workflow. A comparison of the full list of BWA DETs to the TopHat list revealed only 51.7% overlap between the differentially expressed genes (DEGs) from BWA and TopHat. This overlap increased to 73.8% and 87.8% when only the top 500 and 200 DEGs from BWA, respectively, were considered. Subsequent analyses were performed using BWA data.
We first assessed the correlation between the FPKM values (obtained with RNA-seq) with their corresponding qRT–PCR crossing threshold (Ct) values from the TaqMan assays; the two values represent the quantitative levels of expression of a specific transcript in the RNA sample. For this purpose, we chose 24 differentially expressed genes (DEGs, reflecting a wide range of expression) and 12 housekeeping genes (HKGs). The Ct values from three biologic replicates (without normalization) were then compared to the corresponding log2 FPKM values (Figure 3). A least-squares regression analysis of DEGs provided an R2 value of 0.8798, with a corresponding slope of −1.056, suggesting a strong inverse relationship between Ct and log2 FPKM values over a dynamic range of 4–5 orders of magnitude. Only one out of 24 genes, cell adhesion molecule 3 (Cadm3), fell outside this correlation. Further investigation of the RNA-seq aligned reads showed that our qRT–PCR assay was specific for only one of the two retina-expressed spliced isoforms of Cadm3. The reanalysis using a SYBR assay designed to detect both Cadm3 transcripts confirmed the linear correlation between RNA-seq and qRT–PCR analysis. Interestingly, FPKM and Ct values for 6 of the 12 HKGs did not show the expected linear relationship; these included ubiquitin C (Ubc), ActB, ribosomal protein L13A (Rpl13a), ribosomal protein S26 (Rps26), phosphoglycerate kinase 1 (Pgkl), and most severely glyceraldehyde-3-phosphate dehydrogenase (Gapdh). With the exception of Ubc that was underestimated by qRT–PCR (in the same manner as Cadm3), the BWA workflow underestimated all others.
RNA-seq data were evaluated for the expression of 27 established HKGs (Table 5) included in the control qRT–PCR plates from the following vendors: Life Technologies (Mouse Endogenous Control Array), SA Biosciences, Frederick, MD (Mouse Housekeeping Genes RT2 Profiler PCR Array), and Qiagen, Valencia, CA (QuantiTect Housekeeping Genes). Comparison of qRT–PCR data for 12 genes (that were tested) showed almost complete concordance of expression with the RNA-seq results. Only one gene, Ubc, revealed a significant difference in expression between the WT and Nrl−/− retinas with qRT–PCR (−1.89 fold) compared to the RNA-seq (−1.19 fold) analyses. Gapdh showed a relatively high change in expression in qRT–PCR and RNA-seq (−1.49 and −1.37 fold, respectively). Hprt and Rpl13a revealed lower variation in qRT–PCR and RNA-seq, respectively. Actb, TATA box binding protein (Tbp), glucuronidase, beta (Gusb), and Pgk1 were among the best HKGs for qRT–PCR and RNA-seq normalization. For further qRT–PCR analyses, we employed ActB and Hprt in all normalization calculations.
Based on the RNA-seq data from the WT and Nrl−/− retinas, we selected 25 DEGs (12 downregulated and 13 upregulated) showing a wide range of differential expression for validation with qRT–PCR analysis. qRT–PCR data for all genes validated the RNA-seq results (Figure 4). The WNT1 inducible signaling pathway protein 1 (Wisp1) TaqMan assay did not produce an amplicon in any of the experiments performed; subsequent examination of the RNA-seq data revealed that this assay did not correspond to the splice variant expressed in the retina. Additional analysis using a SYBR assay with oligonucleotides specific to the retinal splice variant confirmed the differential expression of Wisp1 (−43.9 fold change) in the Nrl−/− retina compared to the WT.
The preceding analysis clearly demonstrates the high reliability and accuracy of the data obtained with RNA-seq methodology. We therefore used RNA-seq data to derive absolute expression levels of individual transcripts. The top 25 genes highly expressed in the WT or Nrl−/− retina are listed in Table 6 and Table 7. As predicted, most of these genes encode proteins involved in photoreceptor function/metabolism.
We then focused on DEGs between the Nrl−/− and WT retinas. A total of 1,422 transcripts, corresponding to 1,218 unique genes, showed a minimum fold change of 1.5 at p≤0.05. Hierarchical clustering of all differentially expressed transcripts resulted in two distinct clusters: one cluster of 477 genes downregulated in the Nrl−/− retina includes all known rod-specific genes such as rhodopsin (Rho; FC=-4,804), guanine nucleotide binding protein, alpha transducing 1 (Gnat1; FC=-2,034), and nuclear receptor subfamily 2, group E, member 3 (Nr2e3; FC=-227.5; Figure 5A and Table 8); and the other cluster of 741 upregulated genes had all cone-specific genes such as opsin 1, short-wave-sensitive (Opn1sw; FC=18.4), cyclic nucleotide gated channel beta 3 (Cngb3; FC=18.1), and Gnat2 (FC=12.2; Figure 5B and Table 9).
We then compared our DEG list with two published studies that examined WT and Nrl−/− retinas: a recent transcript-level RNA-seq analysis that included 6,123 DETs  and a gene-level microarray analysis showing 438 DEGs  (Figure 6). To obtain the list of DEGs from the Mustafi et al.  data set, we performed ANOVA on their FPKM data from GEO database. Interestingly, the DEGs lists from the three studies had only 203 common genes including many previously identified genes specifically expressed in cone (fatty acid binding protein 7, brain [Fabp7], Opn1sw, Cngb3, and Gnat2) or rod (Rho, Gnat1, cyclic nucleotide gated channel alpha 1 [Cnga1], and Nr2e3) photoreceptors. To assess the power of RNA-seq to more comprehensively identify DETs than microarray, we examined the list of 634 genes identified in common by the RNA-seq studies but not by the microarray study. This list included 18 retinal disease genes (ATP-binding cassette, sub-family A (ABC1), member 4 [Abca4], cadherin 23 (otocadherin) [Cdh23], ADP-ribosylation factor-like 6 [Arl6], Bardet-Biedl syndrome 9 (human) [Bbs9], calcium binding protein 4 [Cabp4], cyclic nucleotide gated channel alpha 3 [Cnga3], G protein-coupled receptor 98 [Gpr98], guanylate cyclase activator 1a (retina) [Guca1a], opsin 1 (cone pigments), medium-wave-sensitive (color blindness, deutan) [Opn1mw], orthodenticle homolog 2 (Drosophila) [Otx2], phosphodiesterase 6G, cGMP-specific, rod, gamma [Pde6g], peripherin 2 [Prph2], retinol binding protein 4, plasma [Rbp4], retinol dehydrogenase 1 (all trans) [Rdh1], regulator of G-protein signaling 9 binding protein [Rgs9bp], unc-119 homolog (C. elegans) [Unc119], Usher syndrome 2A (autosomal recessive, mild) homolog (human) [Ush2a], and whirlin [Whrn]) and several known genes involved in visual perception (guanylate cyclase 2e [Gucy2e], guanylate cyclase 2f [Gucy2f], recoverin [Rcvrn], RAR-related orphan receptor beta [Rorb], and sal-like 3 (Drosophila) [Sall3]). Several genes showing large differential expression values might participate in rod homeostasis (galactosidase, beta 1-like 2 [Glb1l2] FC=-14.02, GRAM domain containing 2 [Gramd2] FC=-14.0, carbohydrate (chondroitin 6/keratan) sulfotransferase 3 [Chst3] FC=-4.8, desert hedgehog [Dhh] FC=-4.1, and ADP-ribosylation factor-like 4D [Arl4d] FC=-3.6) and cone function (dual specificity phosphatase 23 [Dusp23] FC=6.3, cyclin-dependent kinase 11B [Cdkl1] FC=6.1, tryptophan hydroxylase 1 [Tph1] FC=4.7, muscle glycogen phosphorylase [Pygm] FC=4.6, cyclin-dependent kinase 6 [Cdk6] FC=4.0, Sall3 FC=3.9, and early growth response 1 [Egr1] FC=3.7).
Our RNA-seq data allowed us to identify 359 genes not identified in previous investigations. To further assess the quality of our analysis, we performed qRT–PCR validation of 15 genes identified by other studies (but not in our study) as differentially expressed and of 7 genes uniquely identified by our study (but not by other studies; Table 10). Of the 15 genes identified by other studies, only three (ATP-binding cassette, sub-family A (ABC1), member 13 [Abca13], CD8 antigen, alpha chain [Cd8a], and acyl-CoA oxidase-like [Acoxl]) were confirmed with qRT–PCR as being differentially expressed. We also detected these three as differentially expressed but had filtered them out because of FPKM values that were less than 1.0 in all samples. Interestingly, the Abca13 transcript detected in the retina had only sequence reads for exons 56 through 62. This finding was supported by qRT–PCR using two SYBR assays designed to exons 53/55 and exons 56/58. All seven genes uniquely identified by our study were validated as significantly differentially expressed.
The significantly lower number of DETs detected by our study compared to the Mustafi et al. study (2011; 1,422 versus 6,123, respectively) can be attributed to the following:
1. We used a stringent 1.0 FPKM cutoff that generated a list of genes with significant base level expression and fewer false positives than a lower expression level threshold. If we had decreased our threshold to 0.1 FPKM, we would have detected 975 more DETs; however, these genes are expressed at an extremely low level and their impact must be weighed against the increase in false positives. We chose a conservative criterion to identify significant and bona fide differentially expressed genes.
2. Mustafi et al.  pooled multiple RNA samples before generating the library and used the identical library on multiple lanes of the sequencer. Our experimental design consisted of libraries generated from individual biologic replicates that allowed us to eliminate the transcripts based on p-value.
Several DETs we identified might contribute to photoreceptor function but are not yet characterized; these include pleckstrin homology domain containing, family F (with FYVE domain) member 2 (Plekhf2; FC=-5.35), kelch-like 13 (Drosophila) [Klhl3] (FC=-3.3), NIPA-like domain containing 1 (Nipal1; FC=-2.8), and coiled-coil domain containing 24 (Ccdc24; FC=-2.6) in the WT retina, and kelch-like 33 (Drosophila) [Klhl33] (FC=14), WSC domain containing 2 (Wscd2; FC=4), hairless (Hr; FC=3.9) and regulator of G-protein signaling 22 (Rgs22; FC=3.8) in the Nrl−/− retina. We also identified Crx opposite strand transcript 1 (Crxos1; FC=4.1), which is exclusively expressed in the eye from the opposite strand of a key retinal transcription factor, cone-rod homeobox containing gene (Crx) . An interesting new finding is the retinal expression of multiple genes from the Kelch-like family (Klhl3, 4, 5, 18, 33, 36), solute carrier family (>30 members), and zinc-finger protein family (>10 members). Mutations in at least one gene from each family have previously been associated with retinal disease: Klhl7 with autosomal dominant RP , Slc24A1 with autosomal-recessive congenital stationary night blindness , and Znf513 with autosomal-recessive retinitis pigmentosa (RP) .
Specific patterns of gene expression define the morphology and function of distinct cell types and tissues. Changes in gene expression are associated with complex biologic processes, including development, aging, and disease pathogenesis. Until recently, such investigations focused on one or a few genes at a time. Advances in genomic technology have permitted simultaneous evaluation of most, if not all, genes that respond to an extrinsic microenvironment or intrinsic biologic program(s). Such studies are critical for delineating gene networks that can be targeted for treating specific diseases. RNA-seq allows comprehensive evaluation of transcriptomes, alternative transcripts, and coding polymorphisms. However, analyzing RNA-seq data has been challenging due to the complexity associated with quality control, sequence alignments, and handling of large data sets . Several algorithms [45,60] have been proposed for mapping sequence reads to the reference genome, and multiple workflows [16,50] suggested for RNA-seq data analysis. Here, we report a detailed RNA-seq methodology using WT and Nrl−/− retinas as a study paradigm and establish the high performance of NGS technology compared to microarray and qRT–PCR platforms for transcript identification and quantification studies. Consistent with recent studies , our RNA-seq data demonstrate high sensitivity, a wider dynamic range of coverage, and lower technical variability.
Quantitative RT–PCR has long been considered the “gold standard” for mRNA quantification [62,63], and hence routinely used to validate results from transcriptome analysis studies. We show that FPKM values from RNA-seq analysis have a strong linear correlation across at least four orders of magnitude with Ct values from qRT–PCR. Expression of several HKGs is underestimated by RNA-seq because of the algorithmic limitation associated with alignment of reads that map to multiple genomic locations (paralogous sequences or pseudogenes). All of the outlying HKGs inspected had a lower-than-projected FPKM value due to varying numbers of associated pseudogenes [64-67]. For example, Gapdh has 331 pseudogenes in the mouse genome . Our qRT–PCR data projected an FPKM value of approximately 4000 for Gapdh, yet the BWA workflow estimated an FPKM of only 6.6 in the WT retina (see Figure 3). This was also the case, but less severe, for Pgk1, Rps26, Rpl13a, and ActB. Current algorithms proportionally divide the number of reads aligning to multiple genes during FPKM calculation among those genes. In our study, unsuitable qRT–PCR assay design explains the remaining exceptions to the linear correlation between qRT–PCR and RNA-seq. After careful visual inspection of the aligned reads in IGV, we found that the assay designed for Wisp1 was not specific to the splice variant expressed in the retina. Similarly, the assays designed for Cadm3 and Ubc were specific to one of the two transcripts expressed in the retina. Hence, RNA-seq provides a better assessment of alternate isoforms, and transcript quantification is not limited by the design of qRT–PCR assays.
We took advantage of the RNA-seq data to inspect the expression of commonly used HKGs (see Table 5) for normalization in qRT–PCR assays. The choice generally depends on specific tissue and/or developmental time points being investigated. Our RNA-seq studies suggest that most of the HKGs can be used for normalization calculation in qRT–PCR assays; however, Gapdh, β-2 microglobulin (B2m), and Ubc do not appear to be good choices. Additional RNA-seq data would help in delineating relevant HKGs appropriate for qPCR validation in developing retina or cell types.
We compared two different strategies for analyzing WT and Nrl−/− RNA-seq data. The BWA workflow relies on fast and accurate gapped alignment of reads to the exonic regions of the genome. Since the gap between most adjacent exons is larger than a few bases, the cumulative gap extension penalty underestimates the quality of the alignment of reads spanning the splice junctions. Hence, the BWA workflow produced accurate quantitative estimation of gene and transcript isoform expression while losing valuable information about alternate splicing. The higher accuracy of quantitative gene expression estimates by the BWA workflow compared to those by TopHat is evident from the stronger correlation determined by linear regression analysis of the DETs. The regression line from BWA had a slope of −1.056 (compared to −0.905 for TopHat) and R2 of 0.8798 (compared to 0.7727 for TopHat).
The TopHat workflow maps the reads to exonic regions of the reference genome as well as across all known and putative splice junctions defined in the Ensembl GTF file. TopHat attempts to map reads across splice junctions defined in the Ensembl GTF file and across novel splice junctions detected during the first phase of alignment. Hence, the TopHat workflow maps significantly more reads starting with the same number of pass filter (PF) reads and detects additional transcript isoforms missed by the BWA workflow. The source of genomic annotations used by these methods is another important difference. UCSC refFlat annotation (used by the BWA workflow) for the mouse reference genome (build mm9) contained approximately 28,000 unique transcript isoforms, whereas the Ensembl GTF file (used by the TopHat workflow) for the same genome build listed three times more unique transcript isoforms. The problem is amplified because of the lack of one-to-one mapping for several transcripts defined in the UCSC refFlat file in Ensembl GTF. Hence, a non-trivial number of DETs detected by the BWA workflow could not be mapped to any DET from the TopHat workflow (see Figure 2, regions shaded in green).
The BWA workflow detects about 16,000 transcripts in the retina, with a minimum expression equivalent to one transcript per cell (i.e., 1 FPKM) . When the criteria were relaxed to cover transcripts expressed at low levels (0.1 FPKM), 20,707 transcripts were detected in the retina. This is not surprising as the whole retina includes more than 50 distinct neuronal cell types, and each cell would achieve protein diversity largely by alternative promoter usage and/or alternative splicing . The TopHat workflow yields thousands of known and putative transcript isoforms not previously described in the retina. However, validating these novel isoforms predicted from RNA-seq data remains a challenge.
Integrated analysis of RNA-seq data with miRNA-seq, transcription factor binding sites data (chromatin immunoprecipitation sequencing-Chip-Seq), genetic variations (expression Quantitative Trait Loci) , and methylation patterns would allow decoding of the complex regulatory networks associated with retinal development and function. Several technical improvements would however be necessary to overcome the bias introduced into the RNA-seq data due to GC content, mappability of reads, length of the gene, and regional differences due to local sequence structure . RNA-seq methods are more likely to identify longer differentially expressed transcripts than shorter transcripts with the same effect size . New statistical methods are being developed to correct for systematic biases inherent in NGS data [70-72]. In the coming years, we will witness an explosion in high throughput genomic methods that are expected to revolutionize biology and biomedical discovery.
This research was supported by Intramural Research Program of the National Eye Institute, National Institutes of Health, Bethesda, MD.