Molecular Vision 2014; 20:1281-1295
Received 15 March 2012 | Accepted 17 September 2014 | Published 19 September 2014
Marylyn D. Ritchie,1 Shefali S. Verma,1 Molly A. Hall,1 Robert J. Goodloe,2 Richard L. Berg,3 Dave S. Carrell,4 Christopher S. Carlson,5 Lin Chen,6 David R. Crosslin,7,8 Joshua C. Denny,9,10 Gail Jarvik,7,11 Rongling Li,12 James G. Linneman,13 Jyoti Pathak,14 Peggy Peissig,13 Luke V. Rasmussen,15 Andrea H. Ramirez,10 Xiaoming Wang,9 Russell A. Wilke,9,16 Wendy A. Wolf,17 Eric S. Torstenson,2 Stephen D. Turner,18 Catherine A. McCarty19
1Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA; 2Center for Human Genetics Research, Vanderbilt University, Nashville, TN; 3Biomedical Informatics Research Center, Biostatistics, Marshfield Clinic Research Foundation, Marshfield, WI; 4Group Health Research Institute, Seattle, WA; 5Fred Hutchinson Cancer Research Center, Seattle, WA; 6Ophthalmology, Marshfield Clinic Research Foundation, Marshfield, WI; 7Division of Medical Genetics, University of Washington, Seattle, WA; 8Department of Biostatistics, University of Washington, Seattle, WA; 9Departments of Biomedical Informatics, Vanderbilt University, Nashville, TN; 10Department of Medicine, Vanderbilt University, Nashville, TN; 11Departments of Medicine and Genome Sciences, University of Washington, Seattle, WA; 12Office of Population Genomics, National Human Genome Research Institute, Bethesda, MD; 13Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, Marshfield, WI; 14Department of Biomedical Informatics, Mayo Clinic College of Medicine, Rochester, MN; 15Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University, Chicago, IL; 16IMAGENETICS at Sanford Medical Center, Fargo, ND and Department of Internal Medicine, University of North Dakota, Fargo, ND; 17Division of Genetics and Genomics, Boston Children's Hospital and Department of Pediatrics, Harvard Medical School, Boston, MA; 18Public Health Sciences, University of Virginia, Charlottesville, VA; 19Essentia Institute of Rural Health, Duluth, MN
Correspondence to: Marylyn Ritchie, Pennsylvania State University, Center for Systems Genomics, The Huck Institutes for the Life Sciences, Department of Biochemistry and Molecular Biology, 512 Wartik Laboratory, University Park, PA 16802; Phone: (814) 863-5107; FAX: (814) 863-6699; email: firstname.lastname@example.org
Purpose: Cataract is the leading cause of blindness in the world, and in the United States accounts for approximately 60% of Medicare costs related to vision. The purpose of this study was to identify genetic markers for age-related cataract through a genome-wide association study (GWAS).
Methods: In the electronic medical records and genomics (eMERGE) network, we ran an electronic phenotyping algorithm on individuals in each of five sites with electronic medical records linked to DNA biobanks. We performed a GWAS using 530,101 SNPs from the Illumina 660W-Quad in a total of 7,397 individuals (5,503 cases and 1,894 controls). We also performed an age-at-diagnosis case-only analysis.
Results: We identified several statistically significant associations with age-related cataract (45 SNPs) as well as age at diagnosis (44 SNPs). The 45 SNPs associated with cataract at p<1×10−5 are in several interesting genes, including ALDOB, MAP3K1, and MEF2C. All have potential biologic relationships with cataracts.
Conclusions: This is the first genome-wide association study of age-related cataract, and several regions of interest have been identified. The eMERGE network has pioneered the exploration of genomic associations in biobanks linked to electronic health records, and this study is another example of the utility of such resources. Explorations of age-related cataract including validation and replication of the association results identified herein are needed in future studies.
Cataract is the leading cause of blindness in the world [1,2], is the leading cause of vision loss in the United States, and accounts for approximately 60% of Medicare costs related to vision. Summary prevalence estimates indicate that 17.2% of Americans aged 40 years and older have cataract in either eye and 5.1% have pseudophakia or aphakia (previous cataract surgery). In addition to the implications for healthcare delivery and healthcare costs, cataract has been shown to be associated with falls and increased mortality [5-12], possibly because of associated systemic conditions. Women have a slightly higher risk of having cataract than men. With increased life expectancy, the number of cataract cases and cataract surgeries is expected to increase dramatically unless primary prevention strategies can be developed and successfully implemented.
Several genetic loci have also been linked to cataract as an independent phenotypic trait. An extensive body of literature has addressed the role of genetics in childhood cataract, and it has been hypothesized that these same genes may be plausible candidates for age-related cataract. It has been suggested that as many as 40 genes may be involved in age-related cataract. Evidence for a major gene has been identified for cortical and nuclear [18,19] cataract, with heritability estimates of 58% and 48%, respectively. A whole genome STR scan conducted in families in Wisconsin revealed a major locus for age-related cortical cataract on chromosome 6p12-q12, and specific candidate genes that have been studied include galactokinase (Gene_ID: 2584; OMIM: 604313) [23,24], apolipoprotein E (Gene_ID: 348; OMIM: 107741), glutathione S-transferase (Gene_ID: 2944; OMIM: 138350), N-acetyltransferase 2 (Gene_ID: 10; OMIM: 612182) [27,28], and estrogen metabolism genes. Two recent studies found an association between the EPHA2 gene (Gene_ID: 1969; OMIM: 176946) and cataract [30,31].
Higher body mass index (BMI) has been shown in many studies to increase risk of cortical and posterior subcapsular (PSC) cataract (odds ratio [OR] = 1.5–2.5) [32-38]. A recent study found that nuclear cataract was not associated with obesity but was associated with the FTO obesity gene (Gene_ID: 79068; OMIM: 610966) in an Asian population. Although familial aggregation studies have shown a potential role for gene and environment interactions in nuclear cataract [40,41], research in this area is limited. The association of glutathione S-transferase with cataract has been shown to be modified by smoking and sunlight exposure. No whole genome association SNP studies of age-related cataract in unrelated individuals have been reported in the medical literature. The purpose of this study was to conduct a genome-wide association study (GWAS) for age-related cataract and to prioritize top hits for further follow-up.
The National Human Genome Research Institute (NHGRI)-funded electronic medical records and genomics (eMERGE) network implemented an electronic phenotype algorithm to select cataract cases and controls. Cataracts as a condition were selected by Marshfield Clinic as its primary eMERGE phenotype, and the algorithm, which uses diagnostic and procedure codes, was developed by the Marshfield Clinic Personalized Medicine Research Project (PMRP) investigators. The five sites in eMERGE-I include Marshfield Clinic, Group Health Research Institute, Vanderbilt University, Mayo Clinic, and Northwestern University. This study included four of the sites: Marshfield Clinic, Group Health Research Institute, Vanderbilt University, and Mayo Clinic. Using an algorithm for a specific phenotype, each participating site extracted study samples for a specific disease or phenotype from the electronic health records (EHR). Once samples had been selected and genotyped, they were available for phenotyping with additional algorithms. Thus, the cataract algorithm was deployed across the network. The cases and the controls had to meet the following inclusion criteria: The cases were age 50 years or older at the time of diagnosis or surgery, and the controls were age 50 years or older at the time of the most recent eye exam and had had an eye exam within the previous 5 years. The controls had no diagnostic codes for cataract or evidence of cataract surgery. The cases were identified as “surgical” or “diagnosis only.” Surgical cases had undergone a cataract extraction in at least one eye. The diagnosis-only cases were required to have either cataract diagnoses on two or more dates, or one diagnosis date plus one or more inclusion cataract terms found by natural language processing and optical character recognition (NLP/OCR). Cataract type was extracted from the notes using NLP/OCR with validation through manual chart abstraction [45,46].
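The inclusion criteria above can be sketched as a small rule-based classifier. This is an illustrative reading of the prose only, not the PMRP algorithm itself: the `Record` fields, the `classify` function, the label strings, and the handling of edge cases (for example, a patient with a single cataract code and no NLP/OCR hit is treated as excluded) are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MIN_AGE = 50             # minimum age for cases and controls (from the text)
MAX_EXAM_GAP_YEARS = 5   # controls need an eye exam within 5 years (from the text)


@dataclass
class Record:
    """Hypothetical per-patient summary derived from the EHR."""
    surgery_ages: List[float] = field(default_factory=list)    # age at each cataract extraction
    diagnosis_ages: List[float] = field(default_factory=list)  # age at each distinct diagnosis date
    nlp_term_hits: int = 0                                      # inclusion terms found by NLP/OCR
    age_at_last_eye_exam: Optional[float] = None
    years_since_last_eye_exam: Optional[float] = None


def classify(rec: Record) -> str:
    # Surgical case: cataract extraction in at least one eye.
    if rec.surgery_ages:
        return "surgical_case" if min(rec.surgery_ages) >= MIN_AGE else "excluded"

    # Diagnosis-only case: two or more diagnosis dates, or one date plus an NLP/OCR hit.
    n_dates = len(rec.diagnosis_ages)
    if n_dates >= 2 or (n_dates == 1 and rec.nlp_term_hits >= 1):
        return "diagnosis_only_case" if min(rec.diagnosis_ages) >= MIN_AGE else "excluded"

    # Assumption: one cataract code without supporting evidence is neither case nor control.
    if n_dates == 1:
        return "excluded"

    # Control: no cataract codes or surgery, age 50+ at a sufficiently recent eye exam.
    if (rec.age_at_last_eye_exam is not None
            and rec.years_since_last_eye_exam is not None
            and rec.age_at_last_eye_exam >= MIN_AGE
            and rec.years_since_last_eye_exam <= MAX_EXAM_GAP_YEARS):
        return "control"
    return "excluded"


examples = {
    "surgery at 67": classify(Record(surgery_ages=[67.0])),
    "two diagnoses": classify(Record(diagnosis_ages=[61.0, 63.5])),
    "recent normal exam": classify(Record(age_at_last_eye_exam=58.0, years_since_last_eye_exam=2.0)),
}
```

Checking surgery before diagnosis codes is a design choice made here so that every patient receives exactly one label; the source does not state the order in which the rules are evaluated.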
Genome-wide genotyping has been performed on approximately 17,000 samples across the network at the Broad Institute and at the Center for Inherited Disease Research (CIDR) using the Illumina 660W-Quad or 1M-Duo Beadchips (CIDR, Baltimore, MD). For this particular study, which includes predominantly individuals of European descent, we used only the Illumina 660W-Quad platform. This platform consists of 561,490 SNPs and 95,876 intensity-only probes. Genotyping calls were made at either CIDR or Broad using BeadStudio version 3.3.7. The eMERGE Cataract dataset pre-quality control (QC) included 7,535 DNA samples and 344 HapMap controls: 3,968 Marshfield Clinic, 2,379 Group Health, 986 Mayo, and 202 Vanderbilt BioVU. Data were cleaned using the eMERGE QC pipeline developed by the eMERGE Genomics Working Group. This process includes evaluation of the sample and marker call rate, gender mismatch, duplicate and HapMap concordance, batch effects, Hardy–Weinberg equilibrium, sample relatedness, and population stratification. After QC, 530,101 SNPs and 7,397 samples were used for analysis (see Table 1 for distribution by site). All genotype data and a detailed QC report for each individual site, as well as the merged eMERGE dataset, can be found on dbGaP, and the detailed eMERGE QC pipeline can be found in [47,48].
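Two of the marker-level checks named above, call rate and Hardy–Weinberg equilibrium, can be illustrated with a minimal sketch. The function names are hypothetical, and the one-degree-of-freedom chi-square test is a simple stand-in chosen for brevity; the text does not say which Hardy–Weinberg test or which thresholds the eMERGE pipeline applies, so none are assumed here.

```python
import math


def call_rate(genotypes):
    """Fraction of non-missing genotype calls; None marks a missing call."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg proportions from genotype counts.

    Returns (chi2, p_value). Illustrative only; exact tests are often
    preferred in practice, especially for rare alleles.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


rate = call_rate([0, 1, None, 2])            # one missing call out of four
balanced = hwe_chi_square(25, 50, 25)        # counts exactly in HWE proportions
no_hets = hwe_chi_square(50, 0, 50)          # heterozygote deficit
```

A marker whose call rate is low or whose genotype counts depart strongly from Hardy–Weinberg proportions (as in the `no_hets` case) would be flagged for review under whatever thresholds a given pipeline sets.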
Single-locus tests of association were performed using PLINK, assuming an additive genetic model, for all 530,101 SNPs in a total of 7,397 unrelated individuals (5,503 cases and 1,894 controls). We calculated principal components using the EIGENSTRAT program, which is based on principal components analysis and is used to detect and correct for population stratification in genome-wide association studies. We adjusted our analyses for the first three principal components (PCs) to avoid spurious associations caused by population stratification, and we present the results of the analysis adjusted by principal components 1–3 (PC1–3).
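To make the model concrete, the following sketch fits a logistic regression of case status on an additively coded genotype (0, 1, or 2 copies of the minor allele) plus covariates, and reports a Wald test for the SNP effect. It is a pure-Python illustration of this kind of test under stated assumptions, not PLINK's implementation; `solve`, `additive_logistic_test`, and the toy data are introduced here, and the single covariate stands in for PC1–3.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def additive_logistic_test(genotypes, covariates, status, n_iter=25):
    """Newton-Raphson logistic fit; returns (beta, se, p_value) for the SNP term.

    genotypes:  minor-allele counts 0/1/2 (additive coding)
    covariates: one list of covariate values (e.g. PCs) per individual
    status:     1 = case, 0 = control
    """
    rows = [[1.0, float(g)] + [float(c) for c in cov]
            for g, cov in zip(genotypes, covariates)]
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for x, y in zip(rows, status):
            eta = sum(b * v for b, v in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (y - p) * x[i]
                for j in range(k):
                    hess[i][j] += w * x[i] * x[j]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-10:
            break
    # Variance of the SNP effect is the [1][1] entry of the inverse information matrix.
    unit = [0.0] * k
    unit[1] = 1.0
    se = math.sqrt(solve(hess, unit)[1])
    z = beta[1] / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return beta[1], se, p_value


# Toy data (invented for illustration): case/control counts per genotype chosen so
# that the odds of being a case triple with each additional minor allele.
genotypes, covariates, status = [], [], []
for g, n_cases, n_controls in [(0, 10, 30), (1, 20, 20), (2, 30, 10)]:
    for y, n in ((1, n_cases), (0, n_controls)):
        for i in range(n):
            genotypes.append(g)
            status.append(y)
            covariates.append([1.0 if i % 2 == 0 else -1.0])  # balanced stand-in for a PC

beta_snp, se_snp, p_snp = additive_logistic_test(genotypes, covariates, status)
print(f"log odds ratio per allele = {beta_snp:.4f}, SE = {se_snp:.4f}, p = {p_snp:.2e}")
```

In a real analysis the covariate list for each individual would hold that person's first three principal component scores, and the test would be repeated for every SNP that passed QC.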
We also performed an age-at-diagnosis association analysis using cases only. Age at diagnosis is defined as the age when the first cataract diagnosis was made in the electronic health record. We performed both an unadjusted analysis and an analysis adjusted for PC1–3 using linear regression in PLINK. In Table 2 and Table 3, we report all p values <1×10−5. All associations identified by our analyses are suggestive and must be replicated in independent datasets because the signals did not reach a Bonferroni-corrected genome-wide statistical significance level.
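For reference, the Bonferroni threshold implied by the number of SNPs tested can be worked out directly. The sketch below assumes a family-wise error rate of 0.05, which the text does not state, and compares the resulting threshold with three of the smallest p values quoted in the Results.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that controls the family-wise error rate at alpha."""
    return alpha / n_tests


N_SNPS = 530_101   # SNPs analyzed after QC (from the text)
ALPHA = 0.05       # assumed family-wise error rate; not stated in the text

threshold = bonferroni_threshold(ALPHA, N_SNPS)   # roughly 9.4e-08

# Smallest p values reported for the two analyses (from the text).
top_hits = {"GAN": 2.42e-6, "ALDOB": 2.46e-6, "ACSS3": 6.39e-7}
passing = [gene for gene, p in top_hits.items() if p < threshold]
```

Under this assumption none of the listed signals falls below the threshold, which is consistent with describing the associations as suggestive.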
Figure 1 shows the Manhattan plots for the single-locus tests of association for the adjusted cataract case-control analysis (Figure 1A) and the adjusted age-at-diagnosis analysis (Figure 1B), and Figure 2 shows the corresponding QQ plots for each GWAS analysis. Our top hits in the adjusted case-control analysis include gigaxonin (GAN; Gene_ID: 8139, OMIM: 605379; p value = 2.42×10−6), which encodes a member of the cytoskeletal Broad-Complex, Tramtrack, and Bric a brac (BTB/kelch) repeat family. The encoded protein plays a role in neurofilament architecture and is involved in mediating the ubiquitination and degradation of some proteins. Defects in this gene are a cause of giant axonal neuropathy (GAN). Other potentially interesting findings include DNER (Gene_ID: 92737; OMIM: 607299; p value = 1.87×10−5), which encodes the Delta and Notch-like epidermal growth factor-related receptor, and EHHADH (Gene_ID: 1962; OMIM: 607037; p value = 2.80×10−5), which encodes enoyl-CoA hydratase/3-hydroxyacyl CoA dehydrogenase. Myocyte-specific enhancer factor 2C, also known as MADS box transcription enhancer factor 2, polypeptide C, is a protein that in humans is encoded by the MEF2C gene (Gene_ID: 4208; OMIM: 600662; p value = 7.26×10−5). MEF2C upregulates the expression of the homeodomain transcription factors DLX5 and DLX6, two transcription factors that are necessary for craniofacial development. This could be another interesting link to cataracts.
Several SNPs in or near ALDOB (Gene_ID: 229; OMIM: 612724; p value = 2.46×10−6), which encodes aldolase B, fructose-bisphosphate, were also associated with cataracts in our GWAS analysis. Mutations in this gene result in an autosomal recessive disorder of fructose intolerance, and cases of cataract have been reported in the first decade of life. Another interesting associated gene is MAP3K1 (Gene_ID: 4214; OMIM: 600982; p value = 1.33×10−5), which encodes mitogen-activated protein kinase kinase kinase 1. Molecular signatures of MAP3K1 have been shown to be important in embryonic eyelid closure in the mouse. In total, 45 SNPs were associated with cataract at p<1×10−5.
In the age-at-diagnosis analysis, our top hits include ACSS3 (Gene_ID: 79611; OMIM: 614356; p value = 6.39×10−7), which encodes acyl-CoA synthetase short-chain family member 3, and EPHA4 (Gene_ID: 2043; OMIM: 602188; p value = 7.03×10−5), which encodes ephrin type-A receptor 4. EPHA4 belongs to the ephrin receptor subfamily of the protein-tyrosine kinase family, along with EPHA2. EPH and EPH-related receptors have been implicated in mediating developmental events, especially in the nervous system.
This study is the first genome-wide association study in age-related cataract reported in the literature. Cataract in type 2 diabetes has been investigated, and a region on chromosome 3p14.4–3p14.2 was identified in a Han Chinese population. The five SNPs identified in that study do not show evidence of association in our eMERGE cataract GWAS. It is difficult to interpret these results, however, because age-related cataracts and cataracts in type 2 diabetics may be two different phenotypes, which may have disparate etiologies. In addition, our dataset does not include a large number of individuals with type 2 diabetes (see Table 1); thus, we were underpowered to explore this specific type of association. Other previously published research on gene mapping in cataracts supports a linkage region on chromosome 1 and association with EPHA2 [30,31]. In our GWAS, we did not see evidence for association with EPHA2, although we did see association with EPHA4. One significant difference in this study is the phenotyping of cases and controls based on electronic health records (EHR) in population-based cohorts, rather than family-based samples. However, our study, in addition to the literature, supports the suggestion of cataract-susceptibility loci on chromosome 1. Replication studies and larger sample sizes are needed to validate and confirm these findings.
Although the eMERGE network has demonstrated the utility of electronic phenotyping in EHR for several traits [57-61], there are inherent challenges with this approach. For ophthalmic conditions specifically, coded EHR information is extremely limited or, in some health systems, absent. Thus, sophisticated phenotyping strategies must be established [45,46]. Still, the success of the EHR and biobank approach for association studies is unprecedented. The ability to perform multiple GWAS simultaneously with no additional genotyping is an enormous benefit. Once a set of patient samples has been genotyped on a genome-wide association platform, those data can be reused for multiple additional genotype-phenotype association studies. In particular, the eMERGE network has done this for quantitative traits and clinical laboratory variables such as cholesterol, red blood cell indices, and white blood cell count. The additional effort is expended on creating electronic phenotyping algorithms, rather than collecting samples and genotyping. Thus, this is an enormous resource for subsequent genotype-phenotype association studies.
Future explorations of age-related cataract include validating and replicating the association results identified herein. Unfortunately, because stratifying cases and controls by eMERGE site would reduce the sample size and limit power, we did not have the opportunity to replicate these findings within eMERGE. The goal is to identify a similar study population where these results can be explored. In addition, we are beginning to investigate the role of gene–gene and gene–environment interactions associated with cataracts. Due to the complexity of the trait, we hypothesize that the genetic architecture will be similar to that of other complex traits: multigenic with a combination of genetic and environmental interactions.
As demonstrated by this and other studies, the beauty of using an electronic health record is the ability to reuse genotyped samples for various phenotypes. The eMERGE network has clearly demonstrated the success of this study design, and continues to demonstrate the strengths and limitations of this approach.
The eMERGE Network was initiated and funded by NHGRI, with additional funding from NIGMS through the following grants: U01HG004610 (Group Health Cooperative); U01HG004608 (Marshfield Clinic); U01HG04599 (Mayo Clinic); U01HG004609 (Northwestern University); U01HG04603 (Vanderbilt University, also serving as the Coordinating Center); U01HG006389 (Essentia Institute of Rural Health). The Northwest Institute of Medical Genetics is also supported by a State of Washington Life Sciences Discovery Fund award.