Abstract
The analysis of a large number of human and mouse genes codifying for a populated cluster of transmembrane proteins revealed that some of the genes significantly vary in their primary nucleotide sequence inter-species and also intra-species. In spite of that divergence and of the fact that all these genes share a common parental function we asked the question of whether at DNA level they have some kind of common compositional structure, not evident from the analysis of their primary nucleotide sequence. To reveal the existence of gene clusters not based on primary sequence relationships we have analyzed 13574 human and 14047 mouse genes by the composon-clustering methodology. The data presented show that most of the genes from each one of the samples are distributed in 18 clusters sharing the common compositional features between the particular human and mouse clusters. It was observed, in addition, that between particular human and mouse clusters having similar composon-profiles large variations in gene population were detected as an indication that a significant amount of orthologs between both species differs in compositional features. A gene cluster containing exclusively genes codifying for transmembrane proteins, an important fraction of which belongs to the Rhodopsin G-protein coupled receptor superfamily, was also detected. This indicates that even though some of them display low sequence similarity, all of them, in both species, participate with similar compositional features in terms of composons. We conclude that in this family of transmembrane proteins in general and in the Rhodopsin G-protein coupled receptor in particular, the composon-clustering reveals the existence of a type of common compositional structure underlying the primary nucleotide sequence closely correlated to function.
Similar content being viewed by others
Avoid common mistakes on your manuscript.
Introduction
Transmembrane proteins (TMEM) are involved in diverse biological functions such as receptors, transporters, and enzymes, among other roles (Almen et al. 2009), having impact in terms of disease mechanisms and drug discovery. The analysis of the genome from several organisms including human indicates that 20–25% of all genes code for TMEM proteins (Jones 1998; Lander et al. 2001; Wallin and von Heijne 1998) although less than 1% of 3D structures are deposited in the protein data bank (Berman et al. 2002). It appears therefore of particular importance the developing of effective approaches for clustering, annotation, and modeling of TMEM genes. The methods currently developed are based on amino acid sequence homology and functional similarity (Almen et al. 2009; Foord et al. 2005; Horn et al. 2003). However, despite the primary sequence diversity observed in those genes, all they share a common function as TMEM genes as observed by gene ontology (GO) annotation. Thus, a question still unresolved is whether these genes share some common compositional structures at DNA level not evident from the analysis of their primary nucleotide sequence.
To resolve this issue, some tools have been developed in order to discover similarities and differences not based on primary sequence relationships. In order to perform gene clustering (Brameier and Wiuf 2007; Cho et al. 2009; Qu and Xu 2004; Yang et al. 2008) or gene function prediction (Chen and Xu 2005; Jensen et al. 2003; Nariai et al. 2007), the analysis of common compositional features between genes has become an important issue in biology because the annotation of protein function is important to understand life at the molecular level having implications, moreover, from a medical and pharmaceutical point of view (Radivojac et al. 2013). Recently, and in order to evaluate similarities and dissimilarities among functionally distinct sectors of genes independent of the reading frame and without previous knowledge of the expression or the gene function, a clustering methodology has been proposed that analyzes compositional features not based on primary sequence relationships (Fuertes et al. 2011). The data presented in this paper revealed evolutionary features of kinship or lack thereof that heretofore had not been recorded (Fuertes et al. 2016a, b). The similarities and dissimilarities in the usage of 14 items called composons (sets of nucleotide triplets, formed by all nearest neighbors having the same gross composition) were recorded as the comparison parameter and evaluated by reading the sequence in a fully overlapping way. It was found that most genes of both human and mouse gene samples were grouped in a small number of clusters (11 in human and 9 in mouse) having common compositional features in both species (Fuertes et al. 2011). Due to the small gene sample size used in that study and on the basis of the available information, we wondered whether those clusters represented the entire number of clusters in which the genome could be distributed and whether the genes from a given cluster could have some correlation with function. To answer that question, we believed it was necessary to expand the number of genes being studied. It was observed that using a 20-fold higher randomly selected sample (about 50% of the genome), the genes from both human and mouse were classified into 18 clusters having highly similar compositional profiles in both species. It was observed, however, that the gene populations in similar human and mouse clusters notably differed. It was observed, in addition, that practically all genes contained in cluster 16 codify for TMEM proteins, most of which were contained into the G-protein coupled receptor (GPCR) superfamily (as olfactory (ORs), vomeronasal (Vmnr), taste (Tas2r) or trace amine-associated receptor (TAAR) genes between others) and also into genes codifying for enzymes, transporters, solute carriers, etc. Interestingly, an important fraction of these human and mouse TMEM genes are notably dissimilar in their nucleotide sequence. The analysis suggests that the TMEM genes contain a type of common compositional structure underlying the primary nucleotide sequence closely correlated to function.
Materials and Methods
DNA Sequence Acquisition and Pre-processing
The human and mouse coding sequences (CDS) analyzed in this study were obtained from The National Centre for Biotechnology Information DNA database (NCBI) (Benson et al. 2010). The criterion of random sample selection was applied to select the human and mouse gene samples (Fuertes et al. 2011). CDSs shorter than 100 base pairs were discarded from the analysis since, as it has been observed (laboratory data), the error in the estimation of the nucleotide composition of the DNA sequence from the analysis of composon-usage significantly increases when the DNA length decreases. The entire exonic sequence of each gene was considered for analysis without taking into consideration the isoforms due to alternative splicing. The final samples included 13574 human and 14047 mouse genes.
Triplet Composon-Usage Frequencies
Triplet composon-usage frequencies were determined by reading the DNA sequences in a fully overlapping way and computing all nearest neighbors of each nucleotide along the gene sequence and grouping them into composons. A composon is defined as a set of nucleotide triplets with the same “gross composition” (Fuertes et al. 2011). The term “gross composition” is referred here to triplets constituted by the same type of nucleotides. For example, the triplets constituted by the nucleotides C and T in whatever order and number would have the same “gross composition” and all of them would constitute the composon <CT> and so on. Online Resource 1 shows the 14 composons that can be formed when sorting out the 64 triplets that can be obtained with nucleotides A, T, G, and C and the nomenclature of the composons used in this paper. The triplet nucleotide-usage was calculated using compseq, an application located in http://emboss.bioinformatics.nl/, the European Molecular Biology Open Software Suite (EMBOSS) (Rice et al. 2000). The parameters for compseq were, in this case, a word size = 3 (a triplet) and a frame of word to look = 0. That means that the sequence is red in a fully overlapping way.
The relationship between the composon-usage and the nucleotide composition of the DNA sequence (Fuertes et al. 2016b) would be done by the Eq. (1). Thus if we denote by N i the number of nucleotides of the type-i and assuming that each one of indexes i, j, k takes the values of A, G, T, or C then,
and the estimated length in bp of the DNA sequence N would, then, be, \(N = \sum _{i} N_{i}\).
As baseline we have used the expected composon-usage frequency average of a randomly generated DNA sequences of theoretical infinite length having the same composition for A, T, C, and G. On the average, in such sequences each triplet must appear with a frequency of 16‰. In degenerate composons containing six triplets per composon (see Online Resource 1), the expected composon-usage would be 94‰.
Composon-Clustering
The k-means clustering algorithm was implemented to cluster the composon-usage frequencies of the genes. The Pearson correlation coefficient, r, was the distance function used for clustering. The distance function measures the linear relationship between the composon-usage frequency of each gene and the composon-usage frequency of the cluster centroid, defined as the composon-usage average of all genes fitting into the cluster analyzed. The centroid of a cluster was previously defined for any pair of variables (McQueen 1967). To maximize the correspondence between genes fitting into a given composon-cluster, a high threshold for the distance function was selected for both species, r ≥ 0.948. The k-means algorithm was found in the website http://www.gepas.org (Al-Shahrour et al. 2006; Montaner et al. 2006). The software package MatLab (© 1984–2010 The MathWorks, Inc) was used for statistical tests and calculations. We will refer hereafter to the k-means clustering of composon-usages as composon-clustering.
Results
Gene Clustering of Human and Mouse Samples
The results of the analysis by composon-clustering of the sample containing 13574 and 14047 human and mouse genes, respectively, are illustrated in Figs. 1 and 2. As it can be observed, all genes were classified into 18 different composon-clusters both in human and mouse. It was noticed that clusters 2 and 5 in mouse were absent when a small number of genes were analyzed (Fuertes et al. 2011), however, they are present when the sample increased 20-fold. The profiles of clusters 1–9 are shown in Fig. 1, and the profiles 10–18 in Fig. 2. It was also observed that each human cluster has a highly similar counterpart in mouse. Actually, the composon-usage profile of the human and mouse centroids of similar clusters overlap with high correlation, r > 0.989, as an indication that the human and mouse genes from a given cluster share a high degree of composon-usage similarity. It was also observed that in human and mouse 90% of the analyzed genes were classified in 11 clusters, as an indication that in both species most genes share a small number of basic compositional features or architectures in terms of composon-usage. The numeric values of the composon-usage frequencies of each one of these profiles can be obtained from Online Resources 2 to 6.
Gene Populations in Human and Mouse Composon-Clusters
Figure 3a shows that there are notable differences in the number of genes classified into each one of the human and mouse clusters particularly in some of the most populated ones. As shown, the human composon-clusters 2, 5, 6, 8, 9, 12, 13, 15, and 17 have a number of genes higher than their counterparts in mouse. In contrast, human clusters 1, 3, 4, 7, 10, 11, 14, 16, and 18 have a number of genes lower than their counterparts in mouse. The mouse cluster 16 contains more than 50% of the genes that fit into its cluster counterpart in human. In both species, low differences in the number of genes per cluster were observed in clusters 8, 11, 12, 14, 15, and 18. Figure 3b shows the number of genes that differ between both species in each cluster. The highest differences were observed in the most populated clusters with the exception of cluster 8. On the other hand, the less gene-populated clusters have a similar number of genes in both species with the exception of clusters 13, 16, and 17. Since, as indicated in “Materials and Methods”, the criterion of random sample selection was applied to select the human and mouse gene samples and it is assumed that most human and mouse genes are orthologs (Mouse Genome Sequencing C et al. 2002), we think that the differences in gene population among cluster counterparts point to the hypothesis that those genes diverged notably in compositional features during the evolution of the orthologs in both species.
Clustering of TMEM Genes
The GO analysis performed a posteriori in all genes of the composon-cluster 16 reveals that (Fig. 4a) in human 95% of them are TMEM genes, 61% of which are GPCRs. In mouse, however, although the percentage of TMEM genes is also 95%, the percentage of GPCRs increases up to 84% (see Fig. 3b). In human 64% of all GPCRs are ORs genes against 72% in mouse (Fig. 4c). It can also be observed that the mouse had a higher percentage of Vmnr1 genes (11%) than in human (1%).
The data show the robustness of the proposed method for clustering TMEM genes based on the compositional features of the coding sequences since their composon-usage profile is characteristic of the genes contained in cluster 16 (see Fig. 2, panel 16), and despite the increasing number of released crystal structures showing the low sequence identity of the binding pocket of GPCRs (Yoshikawa et al. 2013) and the notable structural diversity of more than 800 human GPCRs (Fredriksson et al. 2003). It was observed that in both species the correlation of the composon-usage of the GPCR genes belonging to the TMEM class is very high (r > 0.994), as an indication that some type of evolutionary link in terms of nucleotide composition reflected in the composon-profile (see Eq. 1) must exist between all GPCRs of cluster 16.
The finding showing that nearly all human/mouse genes from cluster 16 show common composon-usage profiles in their nucleotide sequences is relevant because the clustering was performed without previous knowledge of their function or of their similarity. For example, in mouse OR genes the identity among the intact coding regions varies between 34 and 99% emphasizing the extreme diversity of the OR gene family (Godfrey et al. 2004). By functional and composon-clustering analysis, however, the OR genes can be classified as a single family suggesting that their common compositional architecture could be linked to function. As a matter of fact certain TMEM contained in cluster 16 acting as transporters, chemokine receptors, or solute carriers have in and between them low sequence identity (Mishra et al. 2014) but very high correlation in their composon-usage profiles (see Fig. 2, panel 16). It is most likely, therefore, that the composon-clustering method does not classify genes based on the similarity of sequences but on the presence or absence at DNA level of common interspersed compositional structures. This is an important outcome because some orphan GPCRs including a significant amount of intronless genes, as the Rhodopsin-like GPCRs, that are not classified into any of the known subfamilies. There is a need, therefore, of new bioinformatic approaches to address this issue (Alem et al. 2007).
Discussion
Several methods have been described that address issues related to the classification of protein sequences, as for example: (i) methods based on searching for short/long conserved fragments of TMEM proteins (Godfrey et al. 2004; Malnic et al. 2004); (ii) methods based on the knowledge of substrate specificities or the inclusion of energy functions on homology modeling of TMEM in general (Li et al. 2015; Yarov-Yarovoy et al. 2006) and of GPCRs in particular (Michino et al. 2009); (iii) using phylogenetic analyses 60% of GPCRs are clustered according to ligand preferences (Yang et al. 2012), etc. The novelty of the clustering approach presented in this paper is that it allows clustering genes from the composon-usage frequency information existing in the nucleotide sequence red in a fully overlapping way (Fuertes et al. 2011). The robustness of the method has been tested on other gene families as occur with composon-clusters 2 and 5 (Fuertes et al. 2016b) and also with a large amount of human-mouse orthologs that coevolve from a compositional cluster in mouse to a different compositional cluster in human (Fuertes et al. 2016a).
Since there is more than one way to read a stretch of DNA and there may be various layers of biological information in the DNA interspersed between, or superimposed on, the passages written in the genetic code (Pearson 2006), it has been previously shown a method of composon-clustering based on the analysis of the usage frequency of sets of base triplets called composons (Fuertes et al. 2011). The data presented in this paper show that about 14000 human genes and a similar number of mouse genes analyzed by the composon-clustering method were classified in 18 clusters and that 90% of the clustered genes fit into the 11 composon-clusters previously described (Fuertes et al. 2011). It was also observed that each one of the 18 human and mouse counterpart clusters share highly similar conserved composon-usage profiles (r > 0.989).
The occurrence of only a small number of composon-clusters both in human and mouse gene samples let us to address the conceptual particulars of the type of genomic information obtained by the analysis of the composon-usage frequency of a DNA sequence relative to the codon-usage. The codon reading and the composon-reading of the DNA compute base triplets but there is a major difference between the information provided by both systems of reading. While the analysis of codon-usage requires reading the DNA sequence in the proper open reading frame and in a non-overlapping way, the analysis of the composon-usage requires reading the DNA sequence in a fully overlapping way independently, therefore, of the open reading frame. Thus, while the number of base triplets of a DNA sequence after a codon reading is 1/3 of the number of bases of the sequence, the number of triplets after a composon-reading is equal to the number of bases of the DNA sequence since each base forms a triplet with their nearest neighbors. Thus, the information obtained by the fully overlapping reading is higher than that obtained by reading in a non-overlapping way. Thus, while the information provided by the genetic code is linked to the translated protein, the information provided by the composon-usage reading is linked to the nucleotide compositional characteristics that are embedded and distributed along the DNA sequence in terms of the number, location, and frequency of use of each one of the triplets (see Eq. 1). This information is, therefore, connected to the structural organization of the DNA sequences. From this point of view the genetic code and the composon-code are codes in a strict sense since in both cases an input determines an output (Pearson 2006) having redundant properties (Watson et al. 2008). The 64 codons of the genetic code codify for 20 amino acids and the 64 base triplets of the composon-code codify for 14 composons.
The finding that a large number of genes of the human and mouse genomes could be classified into a small number of composon-clusters and that the composon-profiles of the clusters from one species have similar counterparts in the clusters of the other species suggests that variations in the composon-usage and, therefore, variations in compositional organization might be, to a certain extent, evolutionarily restricted. This restriction would open the question of whether could this merely be attributable to evolutionary conservatism of the composon-profiles after gene duplication followed by mutation or does it reveal a gene-type-associated characteristic maintained for functional reasons? We think, moreover, that the existence of high and poor gene-populated clusters reinforces the hypothesis that the gene size of complex genomes resulted from differential increases in ploidy and from smaller scale duplication of pre-existing DNA sequences followed by sequence divergence and often, eventually, by translocation of sequence duplicates (Levine et al. 2006; Long et al. 2003; Ohno 1970) while maintaining their basic composon-profile. In fact, duplicated genes may accumulate large number of base substitutions and have, however, similar composon-usage profiles. If so, it could be expected, as shown in this paper, that (i) several genes of an organism originated from gene duplication could, to a significant extent, be grouped into similar composon-clusters in spite of their primary sequence divergence and that (ii) a large proportion of genes of a single species could fit into a small number of clusters. To our knowledge, this is the first time in which it has been shown that a large fraction of mouse and human genes can be classified into a small and a certain number of highly correlated compositional architectures.
In this context, we believe that one of the most relevant outcomes of the data presented in this paper about the genes contained in cluster 16 is that, in spite of their low degree in nucleotide sequence homology, that a posteriori was annotated by GO to a common molecular TMEM function, the composon-clustering was able to classify the genes in a single compositional cluster. Actually, most genes from cluster 16 codify for receptors activated by the binding of a repertory of different ligands including taste-sensitive compounds, odors, pheromones, hormones, and neurotransmitters that activate cellular signal transduction mechanisms via GPCRs.
The TMEM cluster of genes contains a large amount of orphan GPCRs encoded by intronless genes (as is the case of the Rhodopsin-like GPCRs) not classified in any of the known subfamilies (Alem et al. 2007). Thus, we believe that as far as the composon-cluster 16 concerns the analysis of the DNA by the composon-usage is able to identify a distinctive characteristic gene architecture interspersed along the primary gene sequence linked to their molecular function. Therefore, most likely, this type of clustering may be useful for gene annotation prediction. While the analysis of the primary sequence of the DNA would consider these genes not to be related, the composon-clustering revealed that they are highly correlated (r > 0.989). Due to the low degree of nucleotide sequence homology between some of the GPCRs and to the fact that the analysis of GPCRs is error prone and it needs to be followed up by a careful manual curation (Haitina et al. 2009), bioinformatics methods have been used for predicting the classification of GPCRs according to their amino acid sequence and pseudo amino acid composition approach (Gu et al. 2010; Qiu et al. 2009; Xiao et al. 2009). Therefore, just by the analysis of the DNA without the need for further curation it would be possible to classify GPCR genes. The computation of similarities and distances between the GPCR genes fitted into cluster 16 according to the composon-usage profiles could assist to classify them being an attractive model to study gene evolution and the identification of subfamilies. In fact, odorants and pheromones present in the environment are detected via ORs, TAARs, Vmnrs, and formyl peptide receptors, expressed by the sensory neurons in the epithelia of these organs (Ibarra-Soria et al. 2014). TAARs, which were initially identified as receptors from a specific group of biogenic amines in the brain, are actually expressed in the olfactory epithelia in mice and are able to be regarded as a second class of ORs (Niimura 2009). Most of these genes are grouped in composon-cluster 16. As described (Adipietro et al. 2012), a special class of GPCRs, as the ORs, has been subjected to rapid evolution between species as a result of gene duplication and gene loss. The difference in gene population between the mouse and human genes classified in cluster 16 may reflect this loss and the adaptation of the olfactory system to the environmental context of each species. Interestingly, even though the human ORs cover a similar receptor space as the mouse ORs, suggesting that the human olfactory system has retained the ability to recognize a broad spectrum of chemicals, it has lost nearly two-thirds of the mouse ORs (Zhang and Firestein 2002). Since as it was shown both, in mouse and human, the composon-cluster 16 includes hundreds of gene coding for ORs, it is most likely that the majority of the ORs correlates with the GPCR-Rhodopsin family.
The occurrence of highly gene-populated composon-clusters having similar compositional features suggests that the genes forming each cluster may share some sort of evolutionary linkage and that this occurrence may confer some type of selective advantage promoting, thus, their formation and maintenance as it occurs with other types of clustering (Yi et al. 2007). In addition, the existence of differences in gene population among clusters might also be connected with a differential selective advantage and history of the duplicated clustered genes being responsible for the formation of levels of clustering organization ranging from small clusters to large ones spanning hundreds of genes. (See Online Resource 1). In the less populated composon-clusters (<300 genes), as for example cluster 16, the linkage between the composon-clustering and the molecular function could at first sight be observed since most of the genes have a common parental function. In the more populated composon-clusters (300–3000 genes per cluster) even though they may have some sort of functional linkage, such a link is not at first sight viewed due to the wide number of specific functions of the genes that fit into those clusters.
If, as shown in the composon-cluster 16, the architecture of DNA sequences, as determined by the analysis of the similarities and dissimilarities in the composon-usage and independent of their sequence homology, may have some sort of linkage with functional properties. It is most likely that this type of analysis could add some light on the computational annotation of gene function prediction. In conclusion, we suggest that there is a need to improve the existing available tools that may contribute to decipher the challenging problem of protein annotation prediction (Radivojac et al. 2013). We think, in addition, that the reduced dimensionality of the composon-clustering classification may help to annotate in a fast way large number of genes having parental functions.
References
Adipietro KA, Mainland JD, Matsunami H (2012) Functional evolution of mammalian odorant receptors. Plos Genet 8:e1002821
Alem K, Louhichi A, Ladjama A, Rebai A (2007) In silico investigation of intronless rhodopsin-like G-protein coupled receptors (GPCR) in the human genome: features and classification. Arch Inst Pasteur Tunis 84:57
Almen MS, Nordstrom KJ, Fredriksson R, Schioth HB (2009) Mapping the human membrane proteome: a majority of the human membrane proteins can be classified according to function and evolutionary origin. BMC Biol 7:50
Al-Shahrour F, Minguez P, Tarraga J, Montaner D, Alloza E, Vaquerizas JM, Conde L, Blaschke C, Vera J, Dopazo J (2006) BABELOMICS: a systems biology perspective in the functional annotation of genome-scale experiments. Nucleic Acids Res 34:W472
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW (2010) GenBank. Nucleic Acids Res 38:D46
Berman HM, Battistuz T, Bhat TN, Bluhm WF, Bourne PE, Burkhardt K, Feng Z, Gilliland GL, Iype L, Jain S, Fagan P, Marvin J, Padilla D, Ravichandran V, Schneider B, Thanki N, Weissig H, Westbrook JD, Zardecki C (2002) The protein data bank. Acta Crystallogr D Biol Crystallogr 58:899
Brameier M, Wiuf C (2007) Co-clustering and visualization of gene expression data and gene ontology terms for Saccharomyces cerevisiae using self-organizing maps. J Biomed Inform 40:160
Chen Y, Xu D (2005) Genome-scale protein function prediction in yeast Saccharomyces cerevisiae through integrating multiple sources of high-throughput data. Pac Symp Biocomput 10:471–482
Cho YR, Zhang A, Xu X (2009) Semantic similarity based feature extraction from microarray expression data. Int J Data Min Bioinform 3:333
Foord SM, Bonner TI, Neubig RR, Rosser EM, Pin JP, Davenport AP, Spedding M, Harmar AJ (2005) International Union of Pharmacology. XLVI. G protein-coupled receptor list. Pharmacol Rev 57:279
Fredriksson R, Lagerstrom MC, Lundin LG, Schioth HB (2003) The G-protein-coupled receptors in the human genome form five main families. Phylogenetic analysis, paralogon groups, and fingerprints. Mol Pharmacol 63:1256
Fuertes MA, Perez JM, Zuckerkandl E, Alonso C (2011) Introns form compositional clusters in parallel with the compositional clusters of the coding sequences to which they pertain. J Mol Evol 72:1
Fuertes MA, Rodrigo JR, Alonso C (2016a) Do intron and coding sequences of some human-mouse orthologs evolve as a single unit? J Mol Evol 82:247
Fuertes MA, Rodrigo JR, Zuckerkandl E, Alonso C (2016b) The chromosomal and functional clustering of markedly divergent human-mouse orthologs run parallel to their compositional features. J DNA RNA Res 1:31
Godfrey PA, Malnic B, Buck LB (2004) The mouse olfactory receptor gene family. Proc Natl Acad Sci USA 101:2156
Gu Q, Ding YS, Zhang TL (2010) Prediction of G-protein-coupled receptor classes in low homology using Chou’s pseudo amino acid composition with approximate entropy and hydrophobicity patterns. Protein Pept Lett 17:559
Haitina T, Fredriksson R, Foord SM, Schioth HB, Gloriam DE (2009) The G protein-coupled receptor subset of the dog genome is more similar to that in humans than rodents. BMC Genom 10:24
Horn F, Bettler E, Oliveira L, Campagne F, Cohen FE, Vriend G (2003) GPCRDB information system for G protein-coupled receptors. Nucleic Acids Res 31:294
Ibarra-Soria X, Levitin MO, Saraiva LR, Logan DW (2014) The olfactory transcriptomes of mice. Plos Genet 10:e1004593
Jensen LJ, Gupta R, Staerfeldt HH, Brunak S (2003) Prediction of human protein function according to gene ontology categories. Bioinformatics 19:635
Jones DT (1998) Do transmembrane protein superfolds exist? FEBS Lett 423:281
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, Levine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann Y, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, Gibbs RA, Muzny DM, Scherer SE, Bouck JB, Sodergren EJ, Worley KC, Rives CM, Gorrell JH, Metzker ML, Naylor SL, Kucherlapati RS, Nelson DL, Weinstock GM, Sakaki Y, Fujiyama A, Hattori M, Yada T, Toyoda A, Itoh T, Kawagoe C, Watanabe H, Totoki Y, Taylor T, Weissenbach J, Heilig R, Saurin W, Artiguenave F, Brottier P, Bruls T, Pelletier E, Robert C, Wincker P, Smith DR, Doucette-Stamm L, Rubenfield M, Weinstock K, Lee HM, Dubois J, Rosenthal A, Platzer M, Nyakatura G, Taudien S, Rump A, Yang H, Yu J, Wang J, Huang G, Gu J, Hood L, Rowen L, Madan A, Qin S, Davis RW, Federspiel NA, Abola AP, Proctor MJ, Myers RM, Schmutz J, Dickson M, Grimwood J, Cox DR, Olson MV, Kaul R, Raymond C, Shimizu N, Kawasaki K, Minoshima S, Evans GA, Athanasiou M, Schultz R, Roe BA, Chen F, Pan H, Ramser J, Lehrach H, Reinhardt R, McCombie WR, de la Bastide M, Dedhia N, Blocker H, Hornischer K, Nordsiek G, Agarwala R, Aravind L, Bailey JA, Bateman A, Batzoglou S, Birney E, Bork P, Brown DG, Burge CB, Cerutti L, Chen HC, Church D, Clamp M, Copley RR, Doerks T, Eddy SR, Eichler EE, Furey TS, Galagan J, Gilbert JG, Harmon C, Hayashizaki Y, Haussler D, Hermjakob H, Hokamp K, Jang W, Johnson LS, Jones TA, Kasif S, Kaspryzk A, Kennedy S, Kent WJ, Kitts P, Koonin EV, Korf I, Kulp D, Lancet D, Lowe TM, McLysaght A, Mikkelsen T, Moran JV, Mulder N, Pollara VJ, Ponting CP, Schuler G, Schultz J, Slater G, Smit AF, Stupka E, Szustakowki J, Thierry-Mieg D, Thierry-Mieg J, Wagner L, Wallis J, Wheeler R, Williams A, Wolf YI, Wolfe KH, Yang SP, Yeh RF, Collins F, Guyer MS, Peterson J, Felsenfeld A, Wetterstrand KA, Patrinos A, Morgan MJ, de Jong P, Catanese JJ, Osoegawa K, Shizuya H, Choi S, Chen YJ, Szustakowki J, International Human Genome Sequencing C (2001) Initial sequencing and analysis of the human genome. Nature 409:860
Levine MT, Jones CD, Kern AD, Lindfors HA, Begun DJ (2006) Novel genes derived from noncoding DNA in Drosophila melanogaster are frequently X-linked and exhibit testis-biased expression. Proc Natl Acad Sci USA 103:9935
Li L, Li J, Xiao W, Li Y, Qin Y, Zhou S, Yang H (2015) Prediction the substrate specificities of membrane transport proteins based on support vector machine and hybrid features. IEEE/ACM Trans Comput Biol Bioinform 13:947–953
Long M, Betran E, Thornton K, Wang W (2003) The origin of new genes: glimpses from the young and old. Nat Rev Genet 4:865
Malnic B, Godfrey PA, Buck LB (2004) The human olfactory receptor gene family. Proc Natl Acad Sci USA 101:2584
McQueen B (1967) Some methods for classification and analysis of multivariate observations. In: UoC Press (ed) Proceedings of 5th Berkeley symposium on mathematical statistics and probability. University of California Press, Berkeley, pp 281–297
Michino M, Abola E, Participants GD, Brooks CL 3rd, Dixon JS, Moult J, Stevens RC (2009) Community-wide assessment of GPCR structure modelling and ligand docking: GPCR Dock 2008. Nat Rev Drug Discov 8:455
Mishra NK, Chang J, Zhao PX (2014) Prediction of membrane transport proteins and their substrate specificities using primary sequence information. Plos ONE 9:e100278
Montaner D, Tarraga J, Huerta-Cepas J, Burguet J, Vaquerizas JM, Conde L, Minguez P, Vera J, Mukherjee S, Valls J, Pujana MA, Alloza E, Herrero J, Al-Shahrour F, Dopazo J (2006) Next station in microarray data analysis: GEPAS. Nucleic Acids Res 34:W486
Mouse Genome Sequencing C, Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, Antonarakis SE, Attwood J, Baertsch R, Bailey J, Barlow K, Beck S, Berry E, Birren B, Bloom T, Bork P, Botcherby M, Bray N, Brent MR, Brown DG, Brown SD, Bult C, Burton J, Butler J, Campbell RD, Carninci P, Cawley S, Chiaromonte F, Chinwalla AT, Church DM, Clamp M, Clee C, Collins FS, Cook LL, Copley RR, Coulson A, Couronne O, Cuff J, Curwen V, Cutts T, Daly M, David R, Davies J, Delehaunty KD, Deri J, Dermitzakis ET, Dewey C, Dickens NJ, Diekhans M, Dodge S, Dubchak I, Dunn DM, Eddy SR, Elnitski L, Emes RD, Eswara P, Eyras E, Felsenfeld A, Fewell GA, Flicek P, Foley K, Frankel WN, Fulton LA, Fulton RS, Furey TS, Gage D, Gibbs RA, Glusman G, Gnerre S, Goldman N, Goodstadt L, Grafham D, Graves TA, Green ED, Gregory S, Guigo R, Guyer M, Hardison RC, Haussler D, Hayashizaki Y, Hillier LW, Hinrichs A, Hlavina W, Holzer T, Hsu F, Hua A, Hubbard T, Hunt A, Jackson I, Jaffe DB, Johnson LS, Jones M, Jones TA, Joy A, Kamal M, Karlsson EK, Karolchik D, Kasprzyk A, Kawai J, Keibler E, Kells C, Kent WJ, Kirby A, Kolbe DL, Korf I, Kucherlapati RS, Kulbokas EJ, Kulp D, Landers T, Leger JP, Leonard S, Letunic I, Levine R, Li J, Li M, Lloyd C, Lucas S, Ma B, Maglott DR, Mardis ER, Matthews L, Mauceli E, Mayer JH, McCarthy M, McCombie WR, McLaren S, McLay K, McPherson JD, Meldrim J, Meredith B, Mesirov JP, Miller W, Miner TL, Mongin E, Montgomery KT, Morgan M, Mott R, Mullikin JC, Muzny DM, Nash WE, Nelson JO, Nhan MN, Nicol R, Ning Z, Nusbaum C, O’Connor MJ, Okazaki Y, Oliver K, Overton-Larty E, Pachter L, Parra G, Pepin KH, Peterson J, Pevzner P, Plumb R, Pohl CS, Poliakov A, Ponce TC, Ponting CP, Potter S, Quail M, Reymond A, Roe BA, Roskin KM, Rubin EM, Rust AG, Santos R, Sapojnikov V, Schultz B, Schultz J, Schwartz MS, Schwartz S, Scott C, Seaman S, Searle S, Sharpe T, Sheridan A, Shownkeen R, Sims S, Singer JB, Slater G, Smit A, Smith DR, Spencer B, Stabenau A, Stange-Thomann N, Sugnet C, Suyama M, Tesler G, Thompson J, Torrents D, Trevaskis E, Tromp J, Ucla C, Ureta-Vidal A, Vinson JP, Von Niederhausern AC, Wade CM, Wall M, Weber RJ, Weiss RB, Wendl MC, West AP, Wetterstrand K, Wheeler R, Whelan S, Wierzbowski J, Willey D, Williams S, Wilson RK, Winter E, Worley KC, Wyman D, Yang S, Yang SP, Zdobnov EM, Zody MC, Lander ES (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420:520
Nariai N, Kolaczyk ED, Kasif S (2007) Probabilistic protein function prediction from heterogeneous genome-wide data. Plos ONE 2:e337
Niimura Y (2009) On the origin and evolution of vertebrate olfactory receptor genes: comparative genome analysis among 23 chordate species. Genome Biol Evol 1:34
Ohno S (1970) Evolution by gene duplication. Springer Science+Business Media, New York
Pearson H (2006) Genetic information: codes and enigmas. Nature 444:259
Qiu JD, Huang JH, Liang RP, Lu XQ (2009) Prediction of G-protein-coupled receptor classes based on the concept of Chou’s pseudo amino acid composition: an approach from discrete wavelet transform. Anal Biochem 390:68
Qu Y, Xu S (2004) Supervised cluster analysis for microarray data based on multivariate Gaussian mixture. Bioinformatics 20:1905
Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, Ben-Hur A, Pandey G, Yunes JM, Talwalkar AS, Repo S, Souza ML, Piovesan D, Casadio R, Wang Z, Cheng J, Fang H, Gough J, Koskinen P, Toronen P, Nokso-Koivisto J, Holm L, Cozzetto D, Buchan DW, Bryson K, Jones DT, Limaye B, Inamdar H, Datta A, Manjari SK, Joshi R, Chitale M, Kihara D, Lisewski AM, Erdin S, Venner E, Lichtarge O, Rentzsch R, Yang H, Romero AE, Bhat P, Paccanaro A, Hamp T, Kassner R, Seemayer S, Vicedo E, Schaefer C, Achten D, Auer F, Boehm A, Braun T, Hecht M, Heron M, Honigschmid P, Hopf TA, Kaufmann S, Kiening M, Krompass D, Landerer C, Mahlich Y, Roos M, Bjorne J, Salakoski T, Wong A, Shatkay H, Gatzmann F, Sommer I, Wass MN, Sternberg MJ, Skunca N, Supek F, Bosnjak M, Panov P, Dzeroski S, Smuc T, Kourmpetis YA, van Dijk AD, ter Braak CJ, Zhou Y, Gong Q, Dong X, Tian W, Falda M, Fontana P, Lavezzo E, Di Camillo B, Toppo S, Lan L, Djuric N, Guo Y, Vucetic S, Bairoch A, Linial M, Babbitt PC, Brenner SE, Orengo C, Rost B, Mooney SD, Friedberg I (2013) A large-scale evaluation of computational protein function prediction. Nat Methods 10:221
Rice P, Longden I, Bleasby ATG (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 16:276
Wallin E, von Heijne G (1998) Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci 7:1029
Watson JDBT, Bell SP, Gann A, Levine M, Oosick R (2008) Molecular biology of the gene. Pearson/Benjamin Cummings, San Francisco
Xiao X, Wang P, Chou KC (2009) GPCR-CA: a cellular automaton image approach for predicting G-protein-coupled receptor functional classes. J Comput Chem 30:1414
Yang D, Li Y, Xiao H, Liu Q, Zhang M, Zhu J, Ma W, Yao C, Wang J, Wang D, Guo Z, Yang B (2008) Gaining confidence in biological interpretation of the microarray data: the functional consistence of the significant GO categories. Bioinformatics 24:265
Yang J, Li S, Zhu T, Wang X, Zhang Z (2012) Discovery and validation of potential drug targets based on the phylogenetic evolution of GPCRs. Nat Sci 4:1109
Yarov-Yarovoy V, Schonbrun J, Baker D (2006) Multipass membrane protein structure prediction using Rosetta. Proteins 62:1010
Yi G, Sze SH, Thon MR (2007) Identifying clusters of functionally related genes in genomes. Bioinformatics 23:1053
Yoshikawa Y, Oishi S, Kubo T, Tanahara N, Fujii N, Furuya T (2013) Optimized method of G-protein-coupled receptor homology modeling: its application to the discovery of novel CXCR7 ligands. J Med Chem 56:4236
Zhang X, Firestein S (2002) The olfactory receptor gene superfamily of the mouse. Nat Neurosci 5:124
Funding
This work was supported by a program of the Instituto de Salud Carlos III-Redes Temáticas de Investigación Cooperativa en Salud (ISCIII-RETIC RD06/0021/0008 program) and Laboratorios LETI. An institutional grant from Fundación Ramón Areces is also acknowledged. The funders had no role in study, design, data collection and analysis, decision to publish, or preparation of the manuscript.
Author information
Authors and Affiliations
Corresponding author
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
About this article
Cite this article
Fuertes, M.A., Rodrigo, J.R. & Alonso, C. A Method for the Annotation of Functional Similarities of Coding DNA Sequences: the Case of a Populated Cluster of Transmembrane Proteins. J Mol Evol 84, 29–38 (2017). https://doi.org/10.1007/s00239-016-9763-7
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00239-016-9763-7