Introduction

Single-nucleotide polymorphisms (SNPs) are the most common type of genetic variation in the human genome and are widely used in association studies of complex diseases and quantitative traits. SNPs occur approximately once every 100–300 bases along the human genome and represent 90% of all genetic variation. They are found in both coding (gene) and noncoding regions of the genome, at different densities (Lee et al. 2005).

In the human genome, SNPs are abundant in noncoding regions, including regulatory and untranslated (UTR) regions as well as introns. Single-nucleotide variations may affect phenotypic functions and eventually contribute to disease development. SNPs in the regulatory regions may influence gene expression or transcription factor binding (Stenson et al. 2009; Kimura-Kataoka et al. 2012), while SNPs in the UTRs may modify transcriptional activity (Milanese et al. 2007), RNA stability and ribosomal translation of mRNA (Boffa et al. 2008). Polymorphisms in coding DNA, especially those that lead to an amino acid residue change in the protein product (nonsynonymous SNPs, nsSNPs), are of particular importance as they are responsible for nearly half of the known genetic variations related to human inherited disease (Hampe et al. 2007).

SNP association studies are useful for recognizing genomic variants that are responsible for specific phenotypes, such as complex diseases, quantitative traits and physiological responses to drugs (Andrawiss 2005; Takahashi et al. 2012). However, identifying these SNPs among the thousands of SNPs in candidate genes is a difficult problem (Zhernakova et al. 2009). Thus, it is essential to preselect the SNPs to be analysed according to their functional significance (Ilhan and Tezel 2013; Patnala et al. 2013) by using bioinformatics prediction tools, which make it possible to differentiate between neutral SNPs and those of likely functional importance.

Oestrogen receptor α (ER α) is a prototype of the nuclear transcription factors that regulate the expression of target genes. This protein controls a diverse set of essential functions such as reproduction, development and proliferation (Knobil and Neill 1994). ER α is encoded by the ESR1 gene, in which different promoters have been identified (Kos et al. 2001). The resulting transcripts differ in their 5′ UTR but not in their coding regions. Among the described variants, four encode the functional protein ER α1 (NM_000125.3, NM_001122740.1, NM_001122741.1 and NM_001122742.1).

In recent years, a multitude of SNPs in the ESR1 gene have been identified, and many association studies have determined the importance of some SNPs in different hormone-dependent diseases (Ding et al. 2010; Kim et al. 2011; Paskulin et al. 2013). It has also been reported that some polymorphisms of the ESR1 gene may affect the enhancer activity of the gene (Maruyama et al. 2000) or gene regulation (Albagha et al. 2001). However, few studies have been able to determine the mechanism by which these polymorphisms may act in disease development. In addition, a large number of SNPs in the ESR1 gene have not been studied and may be implicated in the aetiology of several diseases. Therefore, there is a need to prioritize the SNPs of the ESR1 gene and to determine their functional significance.

In this study, we present an in silico characterization of SNPs that can affect the function of the ESR1 gene. Nine different computational tools (Polyphen-2, SIFT, PROVEAN, SNAP, INPS, Net-O-Glyc, ESEfinder, UTRscan and TFsearch) with good prediction performance (sensitivity and specificity) were applied to identify candidate SNPs that are likely to affect protein function and that can be given priority in future association studies. These programs are based on different methods and use different data, including protein sequence, protein structure and protein stability. The use of multiple bioinformatics tools allows a more reliable annotation and classification of variants, since variants that receive the same prediction from many tools have stronger evidence for their functional effect.

Materials and methods

Datasets

The NCBI SNP database, dbSNP, available at www.ncbi.nlm.nih.gov (build 141, May 2014), was used to retrieve the list of candidate SNPs analysed in this study. The Medline database was used to review the literature up to May 2014 and to extract the SNP association studies regarding the ESR1 gene.

Prediction of phenotypic effects of nsSNP

The effect of the amino acid substitution on protein function was predicted using Polyphen-2 (http://www.bork.embl-heidelberg.de/PolyPhen/), SIFT (http://blocks.fhcrc.org/sift/SIFT.html), PROVEAN (http://provean.jcvi.org), SNAP (https://www.rostlab.org/services/snap/) and INPS (http://inps.biocomp.unibo.it/inps).

Also, the Net-O-Glyc server (http://www.cbs.dtu.dk/services/NetOGlyc) was used to determine the effect of nsSNPs on glycosylation sites of the ER α protein.
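As a minimal sketch of how variant protein sequences for these servers could be prepared, the Python helper below applies a substitution written in one-letter notation (for example V364E) to a reference sequence. The function name and the toy sequence are illustrative assumptions and are not part of any of the tools listed above.

```python
import re


def apply_substitution(sequence: str, change: str) -> str:
    """Return `sequence` with the substitution `change` (e.g. 'V364E') applied.

    Positions are 1-based, as in standard protein variant notation.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", change)
    if match is None:
        raise ValueError(f"unrecognised substitution: {change!r}")
    ref, pos, alt = match.group(1), int(match.group(2)), match.group(3)
    if not 1 <= pos <= len(sequence):
        raise ValueError(f"position {pos} is outside the sequence")
    if sequence[pos - 1] != ref:
        raise ValueError(
            f"expected {ref} at position {pos}, found {sequence[pos - 1]}"
        )
    # Rebuild the sequence with the alternative residue at the given position.
    return sequence[: pos - 1] + alt + sequence[pos:]


# Toy sequence for illustration only; it is not the ER alpha protein.
toy_sequence = "MTMTLHTK"
print(apply_substitution(toy_sequence, "T2A"))
```

Checking the reference residue before substituting guards against off-by-one errors when positions are copied from a database record.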

Polyphen-2 tool

Polyphen-2 (polymorphism phenotyping v2) uses a combination of structural and sequence homology analyses to predict whether an amino acid change is likely to be deleterious for the structure and function of the protein (Adzhubei et al. 2010). The program searches for 3D protein structures, performs multiple alignments of homologous sequences and analyses amino acid contact information in several protein structure databases. Position-specific independent count (PSIC) scores are then calculated for each of the two amino acid residues, and the difference between the two PSIC scores is computed. The higher the PSIC score difference, the greater the functional impact a particular amino acid substitution is likely to have.

SIFT tool

SIFT is a sequence homology-based tool that predicts the possible impact of an amino acid substitution on protein function. The program uses a multiple sequence alignment conservation approach to determine whether the substitution is tolerated by the protein or damaging to it. A normalized tolerance score is calculated for each substitution: the higher the score, the less functional impact the substitution is likely to have (Sim et al. 2012).

Protein variation effect analyser (PROVEAN) tool

PROVEAN measures the damaging effect of an amino acid change in protein sequences (Choi et al. 2012). The software computes a delta alignment score, that is, the change in the alignment score of the protein sequence against its homologous sequences with and without the variation. A score ≤ −2.5 indicates that the change is deleterious.
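The decision rule described here can be sketched in a few lines of Python. Only the −2.5 cut-off comes from the text; the function name, the "neutral" label for scores above the cut-off and the example scores are assumptions made for illustration.

```python
PROVEAN_CUTOFF = -2.5  # cut-off quoted in the text


def classify_provean(score: float, cutoff: float = PROVEAN_CUTOFF) -> str:
    """Label a delta alignment score as 'deleterious' (score <= cutoff) or 'neutral'."""
    return "deleterious" if score <= cutoff else "neutral"


# Hypothetical scores, for illustration only.
for example_score in (-4.1, -2.5, -0.3):
    print(example_score, classify_provean(example_score))
```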

Screening for nonacceptable polymorphisms (SNAP) tool

SNAP tool evaluates the effects of nonsynonymous substitutions on protein function based on a neural-network method that uses in silico derived protein information (e.g. secondary structure, evolutionary information for residue conservation, annotation, solvent accessibility, etc.). The network calculates a score for each substitution and then the scores will be translated into binary predictions of neutral or nonneutral effect (Bromberg and Rost 2007).

Impact of nonsynonymous mutations on protein stability (INPS)

INPS estimates the possible impact of nonsynonymous polymorphisms on protein stability from the protein sequence. It predicts the thermodynamic free energy change induced by the amino acid residue change by using two types of features, one describing the mutation type (mutability, molecular weight, hydrophobicity, etc.) and the other the evolutionary information (Fariselli et al. 2015).

Net-O-glyc

The Net-O-glyc 4.0 server produces neural network predictions of mucin-type GalNAc O-glycosylation sites in mammalian proteins. The program analyses the reference sequence as well as the sequences that carry the amino acid change. A substitution is predicted to have functional significance if different functional patterns are found between the sequences analysed. Sites are predicted as glycosylated if they have scores higher than 0.5 (Steentoft et al. 2013).
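The comparison between the reference and variant sequences can be illustrated with the following Python sketch. It assumes that per-position scores have already been obtained from the server and are supplied as plain lists; the function names and the example scores are hypothetical, and only the 0.5 threshold is taken from the text.

```python
from typing import Dict, List, Sequence, Set

GLYC_CUTOFF = 0.5  # sites scoring above this value are called glycosylated


def predicted_sites(scores: Sequence[float], cutoff: float = GLYC_CUTOFF) -> Set[int]:
    """Return the 1-based positions whose score exceeds the cutoff."""
    return {pos for pos, score in enumerate(scores, start=1) if score > cutoff}


def compare_glycosylation(
    ref_scores: Sequence[float], var_scores: Sequence[float]
) -> Dict[str, List[int]]:
    """List the sites gained and lost in the variant relative to the reference."""
    ref_sites = predicted_sites(ref_scores)
    var_sites = predicted_sites(var_scores)
    return {
        "gained": sorted(var_sites - ref_sites),
        "lost": sorted(ref_sites - var_sites),
    }


# Hypothetical scores for a three-residue stretch.
print(compare_glycosylation([0.1, 0.6, 0.4], [0.1, 0.6, 0.7]))
```

A substitution would be flagged as functionally significant under this sketch whenever either list is non-empty.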

Prediction of SNP effect on splicing events

ESEfinder (http://exon.cshl.edu/ESE/) analyses exonic sequences to predict putative exonic splicing enhancers (ESEs) responsive to the four human SR proteins (SF2/ASF, SC35, SRp40 and SRp55) (Cartegni et al. 2003). ESEfinder uses weight matrices corresponding to the different human SR protein motifs to identify putative ESEs in query sequences. The matrices are based on frequency values obtained from functional systematic evolution of ligands by exponential enrichment (SELEX) experiments and from the initial SELEX library, which was made by chemical synthesis.
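Weight-matrix scanning of this kind can be sketched in Python as follows. The three-position matrix, its values and the threshold are placeholders chosen for illustration; they are not the ESEfinder matrices, and the rule that a window counts as a hit when its score is at or above the threshold is an assumption.

```python
from typing import Dict, List, Tuple

# Hypothetical toy matrix: one dict of nucleotide weights per motif position.
TOY_MATRIX: List[Dict[str, float]] = [
    {"A": 1.0, "C": -1.0, "G": 0.5, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -0.5, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.5, "T": -1.0},
]


def scan(seq: str, matrix: List[Dict[str, float]], threshold: float) -> List[Tuple[int, float]]:
    """Score every window of the sequence and keep those at or above the threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start : start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


def compare_motifs(ref: str, alt: str, matrix: List[Dict[str, float]], threshold: float) -> Dict[str, List[int]]:
    """Report motif start positions (0-based) lost or gained in the variant sequence."""
    ref_pos = {start for start, _ in scan(ref, matrix, threshold)}
    alt_pos = {start for start, _ in scan(alt, matrix, threshold)}
    return {"lost": sorted(ref_pos - alt_pos), "gained": sorted(alt_pos - ref_pos)}


# Hypothetical exon fragment with a G>T change at the fifth base.
print(compare_motifs("TTACGTT", "TTACTTT", TOY_MATRIX, threshold=3.0))
```

Comparing hit positions between the two alleles is what distinguishes abolition of a putative site from creation of a new one.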

Prediction of SNP effect in regulatory region

The impact of SNPs in regulatory region was assessed using UTRscan (http://itbtools.ba.itb.cnr.it/utrscan), TFsearch (http://diyhpl.us/~bryan/irc/protocol-online/protocol-cache/TFSEARCH.html) and MicroSNiPER (http://epicenter.ie-freiburg.mpg.de/services/microsniper/).

UTRscan tool

UTRscan was used to predict the functional significance of noncoding SNPs in the UTRs (Pesole and Liuni 1999). The program compares the query sequences (with and without the nucleotide change) against the collection of functional sequence patterns (located in 5′ UTR and 3′ UTR sequences) of the UTRsite database (Grillo et al. 2010). If the two alleles of a UTR SNP show different functional patterns, the polymorphism is predicted to have functional significance.

TFsearch tool

TFsearch was used to predict the effect of SNPs on a putative DNA transcription factor-binding site of the ‘TRANSFAC’ databases (Heinemeyer et al. 1998).

MicroSNiPER tool

MicroSNiPER is a web-based tool that predicts the possible impact of SNPs on putative microRNA (miRNA) binding sites (Barenboim et al. 2010). This application uses a straightforward method to identify whether a 3′ UTR SNP will disrupt or create a miRNA binding site. MicroSNiPER offers high flexibility and a simple graphical representation of the results.
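The idea of a SNP disrupting or creating a binding site can be illustrated with a simplified seed-match check in Python. This is not the MicroSNiPER algorithm: it assumes that a site is defined only by perfect complementarity to miRNA positions 2 to 8, and the UTR fragments are hypothetical. The let-7a sequence is used purely as an example miRNA.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A"}


def seed_match_site(mirna: str) -> str:
    """Return the DNA-alphabet site complementary to miRNA positions 2-8 (the seed)."""
    seed = mirna[1:8]
    # Reverse-complement the RNA seed into the DNA alphabet of the UTR sequence.
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def snp_effect(utr_ref: str, utr_alt: str, mirna: str) -> str:
    """Classify a 3' UTR SNP as disrupting, creating or not changing a seed match."""
    site = seed_match_site(mirna)
    in_ref, in_alt = site in utr_ref, site in utr_alt
    if in_ref and not in_alt:
        return "disrupts"
    if in_alt and not in_ref:
        return "creates"
    return "no change"


LET7A = "UGAGGUAGUAGGUUGUAUAGUU"  # example miRNA (hsa-let-7a-5p)

# Hypothetical UTR fragments differing by a single C>G change.
print(seed_match_site(LET7A))
print(snp_effect("AAACTACCTCAAA", "AAACTACGTCAAA", LET7A))
```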

Results

The list of candidate SNPs of the ESR1 gene that was investigated in this work was retrieved from dbSNP database. A total of 362 SNPs were surveyed and prioritized. Among these SNPs, 87 were located in exons (45 were found to be nonsynonymous and 42 to be synonymous) and 275 were located in UTR and regulatory regions.

SNPs of the coding region of the ESR1 gene were analysed by different programs to select SNPs that have functional significance for protein structure or function. First, we used the SIFT program to prioritize the 43 nsSNPs of the ESR1 gene. The protein sequence, together with the mutation positions and amino acid residue variants associated with these missense nsSNPs, was submitted to the SIFT program. Among the SNPs analysed, 18 variations (41.9%) were found to be deleterious at a tolerance index threshold of 0.05. The most deleterious indexes were observed for the following SNPs: rs369520220, rs200924028, rs104893956, rs148034868, rs188957694, rs121913044, rs374786087, rs121913043, rs138891155 and rs397509428. It is interesting to note that the SNP rs104893956 is a nonsense nsSNP which leads to a premature stop codon (table 1).

Table 1 nsSNPs of the ESR1 gene predicted to have functional significance by Polyphen-2/SIFT/PROVEAN/SNAP.

Subsequently, we used the Polyphen-2 program. After computing the PSIC score for each nsSNP, 18 nsSNPs (41.9%) were predicted to significantly affect protein structure, with PSIC score differences ranging between 1.5 and 2.6 (table 1). When comparing the results of the evolutionary-based approach of the SIFT program with those of the structural approach of the Polyphen-2 program, we observed agreement between the two programs for 14 SNPs (32.5%).

The SNAP program was also used to assess the effect of the 43 nsSNPs on the protein sequence. Among the SNPs analysed, nine (20.9%) were found to be nonneutral (table 1). It is interesting to note that when these three programs were used together, we observed 53% agreement between them.

Nonsynonymous SNPs of the ESR1 gene were investigated by the PROVEAN program and eight nsSNPs (18.6%) were found to be deleterious. Seven of the eight nsSNPs were predicted to be functional SNPs by both SIFT and Polyphen approaches. An agreement between the four programs used was observed in three nsSNPs (rs121913044, rs121913043 and rs182943916) (table 1).
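Agreement between prediction tools can be tallied as in the Python sketch below. The variant identifiers (var1 to var4) and the sets assigned to each tool are placeholders, not results from this study; the function name is likewise an assumption.

```python
from collections import Counter
from typing import Dict, List, Set


def agreement(calls: Dict[str, Set[str]], min_tools: int) -> List[str]:
    """Return the variants flagged by at least `min_tools` of the tools in `calls`."""
    counts = Counter(variant for flagged in calls.values() for variant in flagged)
    return sorted(variant for variant, n in counts.items() if n >= min_tools)


# Placeholder calls: each tool maps to the set of variants it flags as damaging.
example_calls = {
    "SIFT": {"var1", "var2", "var3"},
    "PolyPhen-2": {"var1", "var2", "var4"},
    "SNAP": {"var1", "var3"},
    "PROVEAN": {"var1", "var2"},
}

print(agreement(example_calls, min_tools=4))  # flagged by all four tools
print(agreement(example_calls, min_tools=3))  # flagged by at least three tools
```

Lowering `min_tools` trades stringency for sensitivity, which is one way to grade candidate variants by the strength of the combined evidence.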

Then, the stability change of the ER α protein caused by nsSNPs was determined by the INPS server. Twenty-two nsSNPs (51.16%) were predicted to be destabilizing (DDG < 0 kcal/mol, a less favourable change) in comparison with the native structure. The most destabilizing nsSNPs were rs121913044 (V364E, DDG = −2.398 kcal/mol) and rs121913043 (C447A, DDG = −2.343 kcal/mol). This result agrees with the results of PolyPhen-2, SIFT, SNAP and PROVEAN, which predicted these two polymorphic sites to be damaging.
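The sign convention used here (negative DDG meaning a less favourable change) can be made explicit with a short Python sketch that filters and ranks variants. The two DDG values are those reported in the text; the third entry and the function name are hypothetical and included only to show the filtering.

```python
from typing import Dict, List, Tuple


def rank_destabilizing(ddg: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return variants with DDG < 0, most negative (most destabilizing) first."""
    negative = [(name, value) for name, value in ddg.items() if value < 0]
    return sorted(negative, key=lambda item: item[1])


ddg_values = {
    "V364E": -2.398,  # rs121913044, value reported in the text
    "C447A": -2.343,  # rs121913043, value reported in the text
    "X000Y": 0.150,   # hypothetical variant with DDG > 0, for illustration
}

for name, value in rank_destabilizing(ddg_values):
    print(f"{name}: {value:.3f} kcal/mol")
```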

The exonic SNPs of the ESR1 gene (87 SNPs) were analysed by the ESEfinder program to predict the effect of these variations on splicing. After comparing the functional elements for each SNP, 17 SNPs (19.5%) were found to change the functional pattern of a putative splicing site (table 2). Among these polymorphisms, eight SNPs (9.2%) abolish a splicing site, while the other nine SNPs (10.3%) lead to the creation of an exonic splicing site.

Table 2 ESEfinder results.

By combining the results of the SIFT, Polyphen-2, SNAP and ESEfinder programs, the following SNPs (4.6%) were found to have a functional significance by all programs at the same time: rs200924028, rs121913044, rs141662120 and rs138891155 (table 3).
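The proportions quoted for the coding-region analyses follow directly from the counts, as the Python check below shows. The counts and denominators (43 nsSNPs, 87 exonic SNPs) are taken from the text; the helper name and the labels are assumptions.

```python
def percent(part: int, total: int, digits: int = 1) -> float:
    """Return part/total as a percentage rounded to `digits` decimal places."""
    return round(100 * part / total, digits)


# (label, count, denominator) as reported in the text.
reported = [
    ("SIFT deleterious nsSNPs", 18, 43),
    ("SNAP nonneutral nsSNPs", 9, 43),
    ("PROVEAN deleterious nsSNPs", 8, 43),
    ("ESEfinder exonic SNPs changing a splicing pattern", 17, 87),
    ("SNPs flagged by SIFT, Polyphen-2, SNAP and ESEfinder", 4, 87),
]

for label, count, total in reported:
    print(f"{label}: {count}/{total} = {percent(count, total)}%")
```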

Table 3 List of the most important SNPs found by prediction and/or association studies results.

Finally, the protein sequence and the substitutions of the 43 nsSNPs were submitted to the Net-O-Glyc server. After comparing the functional elements for each SNP, four SNPs (9.3%) were found to change a putative O-glycosylation site, three of which created new glycosylation sites (table 4).

Three different programs were applied to survey and prioritize the SNPs in the UTR and regulatory regions of the ESR1 gene. Polymorphisms in the UTR affect gene expression by affecting the ribosomal translation of mRNA or by influencing the RNA half-life. The UTRscan program was used to predict which SNPs in the UTRs alter functional elements of the mRNA. Among the 110 SNPs in the mRNA UTRs of the four variants encoding the isoform ER α1 (NM_000125.3, NM_001122740.1, NM_001122741.1 and NM_001122742.1), one variation (rs9340771) in the 5′ UTR of NM_001122740.1 was related to a functional pattern change of the GY-BOX (table 5).

Table 4 nsSNPs of the ESR1 gene predicted to have functional significance by Net O-glyc.
Table 5 SNPs in UTR region predicted to have significance by UTRscan or MicroSNiPER.

Polymorphisms mapping to miRNA binding sites have been shown to disrupt the ability of miRNAs to target genes, resulting in differential mRNA and protein expression. The MicroSNiPER program was applied to predict the effect of 3′ UTR SNPs on miRNA binding sites. Seventy-nine SNPs were analysed, and nine (11.4%) were found to affect miRNA binding (table 5). Among these polymorphisms, four SNPs (5.1%) potentially abolish miRNA target sites, while the other five (6.3%) create putative miRNA target sites.

Finally, the TFsearch program was applied to predict the effect of SNPs in the regulatory region of the ESR1 gene on putative transcription factor-binding sites. A total of 167 SNPs were analysed, and 28 SNPs (16.8%) were found to change transcription factor-binding sites. Most of these variations were related to the variant NM_001122742.1 (22 SNPs, 20.8% of the SNPs of this variant), while 13 SNPs (27.1% of the SNPs of the variant) were related to NM_001122741.1 and 16 SNPs (35.5 and 34% of the SNPs of the variants NM_001122740.1 and NM_000125.3, respectively) were related to the variants NM_001122740.1 and NM_000125.3 (table 6).

Table 6 SNPs in regulatory region predicted to have significance by TFsearch.

The Medline database was reviewed to determine a list of functional SNPs from the data of association studies performed on the ESR1 gene. Polymorphisms of the ESR1 gene were associated with numerous diseases (coronary artery disease, cancer, diabetes, etc.) and physiological parameters such as cholesterol level, body height, total fat mass, height at menarche, testosterone level, etc. A high number of associations were observed with cancers, bone diseases and coronary artery disease (table 1 in electronic supplementary material at http://www.ias.ac.in/jgenet/).

Reviewing the literature, 89 SNPs of the ESR1 gene were identified as functional SNPs. Among these variations, only four SNPs (4.5%) are located in exons, while the others (95.5%) are in noncoding regions. Considering the number of positive studies associated with each of these functional SNPs, six SNPs (rs2077647, rs1801132, rs2228480, rs2234693, rs3020314 and rs9340799) were found to be the most important, with rs2234693 and rs9340799 at the top of this list.

By combining the results of SNP prioritization and association studies, six SNPs (1.6%) among the 362 SNPs analysed were found to be shared (rs2077647, rs851993, rs291740, rs2881766, rs6903180 and rs9478245) (table 3). All these SNPs, except rs2077647, are located in the regulatory region and were found to change putative binding sites for transcription factors, whereas rs2077647 is an nsSNP that was found to change a putative splicing site.

Discussion

Understanding the functions of SNPs will greatly help in understanding the genetics of the human phenotype variation, especially the genetic basis of human complex diseases. However, to identify functional SNPs from a pool containing both functional and neutral SNPs is challenging.

A number of studies have reported associations between ESR1 gene polymorphisms and many diseases, such as osteoporosis (Ioannidis et al. 2004; Harsløf et al. 2010; Luo et al. 2014), osteoarthritis (Tawonsawatruk et al. 2009; Riancho et al. 2010), breast cancer (Tapper et al. 2008; Ding et al. 2010), endometrial cancer (Sliwinski et al. 2010), brain disorders such as depressive disorders (Tsai et al. 2003) and Alzheimer’s disease (Lin et al. 2003; Monastero et al. 2006; Ma et al. 2009), and coronary artery disease (Wu et al. 2010). However, most of the SNPs of the ESR1 gene have not been studied yet. In this study, we surveyed and prioritized 362 SNPs of the ESR1 gene. Using different computational tools, we found that 18 nsSNPs were predicted to be deleterious by SIFT on the basis of evolutionary conservation, 18 nsSNPs might alter protein structure according to Polyphen-2, nine nsSNPs were nonneutral according to SNAP and eight nsSNPs were deleterious according to PROVEAN. Agreement between the four programs was observed for three nsSNPs (rs121913044, rs121913043 and rs182943916). This result suggests that these variants are the most likely to be damaging or deleterious. The analysis of protein stability change supports this finding. Using the INPS program, 22 nsSNPs were predicted to be destabilizing in comparison with the native structure, and the most destabilizing SNPs were rs121913044 (V364E) and rs121913043 (C447A).

Using ESEfinder, our results showed that 17 nsSNPs might change the functional pattern of a putative splicing site. Among the functional nsSNPs predicted, a coding nonsense SNP (rs104893956) was found. This variation is caused by a nucleotide change from C to T and leads to a stop codon gain. This result suggests that this SNP may carry a very high risk of being involved in some diseases, as it can truncate and even inactivate the ER protein.

Equally, our results showed that four nsSNPs (rs200924028, rs121913044, rs141662120 and rs138891155) might alter protein structure and function as well as splicing. This result suggests that these markers might have a high potential as candidate SNPs in association studies. Further, four nsSNPs (rs201617046, rs149308960, rs146924427 and rs201118302) were found to change putative glycosylation sites, three of which created new glycosylation sites. Protein glycosylation is an important posttranslational modification that confers both structural and functional properties on proteins. Moreover, studies have shown that up to 1.4% of known disease-causing missense mutations are predicted to give rise to gains of glycosylation, and for some of these mutations the novel glycans have been shown to be both necessary and sufficient to account for the deleterious impact of the mutation (Schulte am Esch et al. 2005; Vogt et al. 2007). It may therefore be suggested that these nsSNPs can be added to the list of candidate SNPs in association studies to determine their potential role in diseases.

Analysing SNPs in regulatory regions, we found that one SNP might change a functional binding motif in the 5′ UTR, nine SNPs might change the pattern of miRNA binding sites and 28 SNPs might modulate gene regulation.

Our predictions are in good agreement with previous reports, especially those demonstrating that the variation rs121913044 (V364E), a single amino acid substitution in the hormone-binding domain of ER α, allows the receptor to act as a strong dominant negative inhibitor of oestrogen action (McInerney et al. 1996), and that rs121913043 (C447A) causes a decoupling of the hormone binding and transcriptional activation functions of the receptor (Reese and Katzenellenbogen 1991). We also identified rs397509428 (Q375H) as a key SNP, a prediction supported by the fact that it corresponds to a substitution in the ligand-binding domain of ER α and causes complete oestrogen insensitivity and delayed puberty in women (Quaynor et al. 2013). Another functional SNP is rs104893956, which is characterized by a cytosine to thymine transition at codon 157 and results in a premature stop codon and oestrogen resistance (Smith et al. 1994).

Reviewing the literature, 89 SNPs were identified as functional SNPs, most of which were located in noncoding regions. This result is consistent with the view of Frazer et al. (2009), who suggested that disease risk-associated SNPs map predominantly to noncoding regions of the human genome. Among the most important polymorphic sites of the ESR1 gene are rs2234693 and rs9340799, which are located in the first intron and are separated by only 46 bp. The rs2234693 (T397C) polymorphism is caused by a T/C transition in intron 1, whereas the rs9340799 (G351A) polymorphism is caused by a G/A transition located 50 bp downstream of rs2234693 (Shearman et al. 2003; Pollak et al. 2004). These two polymorphisms of the ESR1 gene have been described and studied for possible association with several clinical outcomes including cardiovascular risk (Herrington et al. 2002; Alevizaki et al. 2007), multiple sclerosis (Niino et al. 2000; Kikuchi et al. 2002), osteoporosis (Harsløf et al. 2010; Kim et al. 2010), uterine leiomyomas (Al-Hendy and Salama 2006), cancer (Chattopadhyay et al. 2014) as well as type 2 diabetes, obesity (Speer et al. 2001), bone mineral density (Yamada et al. 2002), azoospermia or severe oligozoospermia (Kukuvitis et al. 2002; Suzuki et al. 2002; Lazaros et al. 2010) and systemic lupus erythematosus (Wang et al. 2010).

By combining the results of SNP prioritization and association studies, we arrived at six functional SNPs, among which only rs2077647 is located in the coding region. This SNP has already been reported to be associated with numerous diseases such as coronary artery disease (Peter et al. 2005, 2009), cancers (Anghel et al. 2010; Sonoda et al. 2010) and Alzheimer disease (Ma et al. 2009), as well as with the response to drug administration (Zhang et al. 2010). The correlation between our prediction results and data from association studies supports the results of this study and suggests that SNP-based pathogenicity detection tools can appropriately reflect the role of a disease-associated SNP. Since association studies and SNP prioritization are two nonredundant sources of knowledge, we think that a good correlation between them can support the use of computational tools for the selection of SNPs to be investigated by association studies.

Conclusion

Genetic screening of the ESR1 gene locus has revealed the existence of thousands of polymorphic sites, some of which alter the function of the receptor and have been associated with phenotypic traits and disease risk. However, given the high number of SNPs in this gene, association studies should be carried out on genetic variants that have functional significance. The correlation between our results and data from association studies suggests that the application of computational tools might provide an alternative approach to selecting functional SNPs for association studies. Since association studies and SNP prioritization are two nonredundant sources of knowledge, we think that a good correlation between them can support the use of computational tools for the selection of SNPs to be investigated by association studies.