Introduction

Transmembrane proteins (TMEM) are involved in diverse biological functions such as receptors, transporters, and enzymes, among other roles (Almen et al. 2009), having impact in terms of disease mechanisms and drug discovery. The analysis of the genome from several organisms including human indicates that 20–25% of all genes code for TMEM proteins (Jones 1998; Lander et al. 2001; Wallin and von Heijne 1998) although less than 1% of 3D structures are deposited in the protein data bank (Berman et al. 2002). It appears therefore of particular importance the developing of effective approaches for clustering, annotation, and modeling of TMEM genes. The methods currently developed are based on amino acid sequence homology and functional similarity (Almen et al. 2009; Foord et al. 2005; Horn et al. 2003). However, despite the primary sequence diversity observed in those genes, all they share a common function as TMEM genes as observed by gene ontology (GO) annotation. Thus, a question still unresolved is whether these genes share some common compositional structures at DNA level not evident from the analysis of their primary nucleotide sequence.

To resolve this issue, some tools have been developed in order to discover similarities and differences not based on primary sequence relationships. In order to perform gene clustering (Brameier and Wiuf 2007; Cho et al. 2009; Qu and Xu 2004; Yang et al. 2008) or gene function prediction (Chen and Xu 2005; Jensen et al. 2003; Nariai et al. 2007), the analysis of common compositional features between genes has become an important issue in biology because the annotation of protein function is important to understand life at the molecular level having implications, moreover, from a medical and pharmaceutical point of view (Radivojac et al. 2013). Recently, and in order to evaluate similarities and dissimilarities among functionally distinct sectors of genes independent of the reading frame and without previous knowledge of the expression or the gene function, a clustering methodology has been proposed that analyzes compositional features not based on primary sequence relationships (Fuertes et al. 2011). The data presented in this paper revealed evolutionary features of kinship or lack thereof that heretofore had not been recorded (Fuertes et al. 2016a, b). The similarities and dissimilarities in the usage of 14 items called composons (sets of nucleotide triplets, formed by all nearest neighbors having the same gross composition) were recorded as the comparison parameter and evaluated by reading the sequence in a fully overlapping way. It was found that most genes of both human and mouse gene samples were grouped in a small number of clusters (11 in human and 9 in mouse) having common compositional features in both species (Fuertes et al. 2011). Due to the small gene sample size used in that study and on the basis of the available information, we wondered whether those clusters represented the entire number of clusters in which the genome could be distributed and whether the genes from a given cluster could have some correlation with function. To answer that question, we believed it was necessary to expand the number of genes being studied. It was observed that using a 20-fold higher randomly selected sample (about 50% of the genome), the genes from both human and mouse were classified into 18 clusters having highly similar compositional profiles in both species. It was observed, however, that the gene populations in similar human and mouse clusters notably differed. It was observed, in addition, that practically all genes contained in cluster 16 codify for TMEM proteins, most of which were contained into the G-protein coupled receptor (GPCR) superfamily (as olfactory (ORs), vomeronasal (Vmnr), taste (Tas2r) or trace amine-associated receptor (TAAR) genes between others) and also into genes codifying for enzymes, transporters, solute carriers, etc. Interestingly, an important fraction of these human and mouse TMEM genes are notably dissimilar in their nucleotide sequence. The analysis suggests that the TMEM genes contain a type of common compositional structure underlying the primary nucleotide sequence closely correlated to function.

Materials and Methods

DNA Sequence Acquisition and Pre-processing

The human and mouse coding sequences (CDS) analyzed in this study were obtained from The National Centre for Biotechnology Information DNA database (NCBI) (Benson et al. 2010). The criterion of random sample selection was applied to select the human and mouse gene samples (Fuertes et al. 2011). CDSs shorter than 100 base pairs were discarded from the analysis since, as it has been observed (laboratory data), the error in the estimation of the nucleotide composition of the DNA sequence from the analysis of composon-usage significantly increases when the DNA length decreases. The entire exonic sequence of each gene was considered for analysis without taking into consideration the isoforms due to alternative splicing. The final samples included 13574 human and 14047 mouse genes.

Triplet Composon-Usage Frequencies

Triplet composon-usage frequencies were determined by reading the DNA sequences in a fully overlapping way and computing all nearest neighbors of each nucleotide along the gene sequence and grouping them into composons. A composon is defined as a set of nucleotide triplets with the same “gross composition” (Fuertes et al. 2011). The term “gross composition” is referred here to triplets constituted by the same type of nucleotides. For example, the triplets constituted by the nucleotides C and T in whatever order and number would have the same “gross composition” and all of them would constitute the composon <CT> and so on. Online Resource 1 shows the 14 composons that can be formed when sorting out the 64 triplets that can be obtained with nucleotides A, T, G, and C and the nomenclature of the composons used in this paper. The triplet nucleotide-usage was calculated using compseq, an application located in http://emboss.bioinformatics.nl/, the European Molecular Biology Open Software Suite (EMBOSS) (Rice et al. 2000). The parameters for compseq were, in this case, a word size = 3 (a triplet) and a frame of word to look = 0. That means that the sequence is red in a fully overlapping way.

The relationship between the composon-usage and the nucleotide composition of the DNA sequence (Fuertes et al. 2016b) would be done by the Eq. (1). Thus if we denote by N i the number of nucleotides of the type-i and assuming that each one of indexes i, j, k takes the values of A, G, T, or C then,

$$N_{i} = x_{ < i > } + \frac{1}{2}\mathop \sum \limits_{j \ne i} x_{ < ij > } + \frac{1}{3}\mathop \sum \limits_{{\begin{array}{*{20}c} {j \ne i} \\ {k \ne j} \\ \end{array} }} x_{ < ijk > }$$
(1)

and the estimated length in bp of the DNA sequence N would, then, be, \(N = \sum _{i} N_{i}\).

As baseline we have used the expected composon-usage frequency average of a randomly generated DNA sequences of theoretical infinite length having the same composition for A, T, C, and G. On the average, in such sequences each triplet must appear with a frequency of 16‰. In degenerate composons containing six triplets per composon (see Online Resource 1), the expected composon-usage would be 94‰.

Composon-Clustering

The k-means clustering algorithm was implemented to cluster the composon-usage frequencies of the genes. The Pearson correlation coefficient, r, was the distance function used for clustering. The distance function measures the linear relationship between the composon-usage frequency of each gene and the composon-usage frequency of the cluster centroid, defined as the composon-usage average of all genes fitting into the cluster analyzed. The centroid of a cluster was previously defined for any pair of variables (McQueen 1967). To maximize the correspondence between genes fitting into a given composon-cluster, a high threshold for the distance function was selected for both species, r ≥ 0.948. The k-means algorithm was found in the website http://www.gepas.org (Al-Shahrour et al. 2006; Montaner et al. 2006). The software package MatLab (© 1984–2010 The MathWorks, Inc) was used for statistical tests and calculations. We will refer hereafter to the k-means clustering of composon-usages as composon-clustering.

Results

Gene Clustering of Human and Mouse Samples

The results of the analysis by composon-clustering of the sample containing 13574 and 14047 human and mouse genes, respectively, are illustrated in Figs. 1 and 2. As it can be observed, all genes were classified into 18 different composon-clusters both in human and mouse. It was noticed that clusters 2 and 5 in mouse were absent when a small number of genes were analyzed (Fuertes et al. 2011), however, they are present when the sample increased 20-fold. The profiles of clusters 1–9 are shown in Fig. 1, and the profiles 10–18 in Fig. 2. It was also observed that each human cluster has a highly similar counterpart in mouse. Actually, the composon-usage profile of the human and mouse centroids of similar clusters overlap with high correlation, r > 0.989, as an indication that the human and mouse genes from a given cluster share a high degree of composon-usage similarity. It was also observed that in human and mouse 90% of the analyzed genes were classified in 11 clusters, as an indication that in both species most genes share a small number of basic compositional features or architectures in terms of composon-usage. The numeric values of the composon-usage frequencies of each one of these profiles can be obtained from Online Resources 2 to 6.

Fig. 1
figure 1

Composon-usage profiles in human (black circle) and mouse (white square) from clusters 1 to 9. To avoid overlapping with the baseline the characteristic usage frequencies per cluster in human and mouse clusters ± 2 SD are represented in the upper insets: (black rectangle) frequency-usages higher, (double horizontal) lower, and (line) overlapping with the baseline

Fig. 2
figure 2

Composon-usage profiles in human (black circle) and mouse (white square) from clusters 10 to 18. The characteristic usage frequencies per cluster are as described in Fig. 1

Gene Populations in Human and Mouse Composon-Clusters

Figure 3a shows that there are notable differences in the number of genes classified into each one of the human and mouse clusters particularly in some of the most populated ones. As shown, the human composon-clusters 2, 5, 6, 8, 9, 12, 13, 15, and 17 have a number of genes higher than their counterparts in mouse. In contrast, human clusters 1, 3, 4, 7, 10, 11, 14, 16, and 18 have a number of genes lower than their counterparts in mouse. The mouse cluster 16 contains more than 50% of the genes that fit into its cluster counterpart in human. In both species, low differences in the number of genes per cluster were observed in clusters 8, 11, 12, 14, 15, and 18. Figure 3b shows the number of genes that differ between both species in each cluster. The highest differences were observed in the most populated clusters with the exception of cluster 8. On the other hand, the less gene-populated clusters have a similar number of genes in both species with the exception of clusters 13, 16, and 17. Since, as indicated in “Materials and Methods”, the criterion of random sample selection was applied to select the human and mouse gene samples and it is assumed that most human and mouse genes are orthologs (Mouse Genome Sequencing C et al. 2002), we think that the differences in gene population among cluster counterparts point to the hypothesis that those genes diverged notably in compositional features during the evolution of the orthologs in both species.

Fig. 3
figure 3

a Bar graph representing the number of genes per composon-cluster in human and mouse (gray and black bars, respectively). b Histogram representing the difference in the number of genes between similar composon-clusters when the number of human genes is higher than in mouse (gray bars) and when (black bars) the number of human genes is lower than in mouse (in absolute values)

Clustering of TMEM Genes

The GO analysis performed a posteriori in all genes of the composon-cluster 16 reveals that (Fig. 4a) in human 95% of them are TMEM genes, 61% of which are GPCRs. In mouse, however, although the percentage of TMEM genes is also 95%, the percentage of GPCRs increases up to 84% (see Fig. 3b). In human 64% of all GPCRs are ORs genes against 72% in mouse (Fig. 4c). It can also be observed that the mouse had a higher percentage of Vmnr1 genes (11%) than in human (1%).

Fig. 4
figure 4

Circular graphics showing the GO annotation of all genes in human composon-cluster 16 (column A) and in mouse composon-cluster 16 (column B). The upper row shows the percentages of the type of proteins found. The middle row shows the percentages of the type of transmembrane proteins found. The down row panel shows the percentages of the type of GPCRs found

The data show the robustness of the proposed method for clustering TMEM genes based on the compositional features of the coding sequences since their composon-usage profile is characteristic of the genes contained in cluster 16 (see Fig. 2, panel 16), and despite the increasing number of released crystal structures showing the low sequence identity of the binding pocket of GPCRs (Yoshikawa et al. 2013) and the notable structural diversity of more than 800 human GPCRs (Fredriksson et al. 2003). It was observed that in both species the correlation of the composon-usage of the GPCR genes belonging to the TMEM class is very high (r > 0.994), as an indication that some type of evolutionary link in terms of nucleotide composition reflected in the composon-profile (see Eq. 1) must exist between all GPCRs of cluster 16.

The finding showing that nearly all human/mouse genes from cluster 16 show common composon-usage profiles in their nucleotide sequences is relevant because the clustering was performed without previous knowledge of their function or of their similarity. For example, in mouse OR genes the identity among the intact coding regions varies between 34 and 99% emphasizing the extreme diversity of the OR gene family (Godfrey et al. 2004). By functional and composon-clustering analysis, however, the OR genes can be classified as a single family suggesting that their common compositional architecture could be linked to function. As a matter of fact certain TMEM contained in cluster 16 acting as transporters, chemokine receptors, or solute carriers have in and between them low sequence identity (Mishra et al. 2014) but very high correlation in their composon-usage profiles (see Fig. 2, panel 16). It is most likely, therefore, that the composon-clustering method does not classify genes based on the similarity of sequences but on the presence or absence at DNA level of common interspersed compositional structures. This is an important outcome because some orphan GPCRs including a significant amount of intronless genes, as the Rhodopsin-like GPCRs, that are not classified into any of the known subfamilies. There is a need, therefore, of new bioinformatic approaches to address this issue (Alem et al. 2007).

Discussion

Several methods have been described that address issues related to the classification of protein sequences, as for example: (i) methods based on searching for short/long conserved fragments of TMEM proteins (Godfrey et al. 2004; Malnic et al. 2004); (ii) methods based on the knowledge of substrate specificities or the inclusion of energy functions on homology modeling of TMEM in general (Li et al. 2015; Yarov-Yarovoy et al. 2006) and of GPCRs in particular (Michino et al. 2009); (iii) using phylogenetic analyses 60% of GPCRs are clustered according to ligand preferences (Yang et al. 2012), etc. The novelty of the clustering approach presented in this paper is that it allows clustering genes from the composon-usage frequency information existing in the nucleotide sequence red in a fully overlapping way (Fuertes et al. 2011). The robustness of the method has been tested on other gene families as occur with composon-clusters 2 and 5 (Fuertes et al. 2016b) and also with a large amount of human-mouse orthologs that coevolve from a compositional cluster in mouse to a different compositional cluster in human (Fuertes et al. 2016a).

Since there is more than one way to read a stretch of DNA and there may be various layers of biological information in the DNA interspersed between, or superimposed on, the passages written in the genetic code (Pearson 2006), it has been previously shown a method of composon-clustering based on the analysis of the usage frequency of sets of base triplets called composons (Fuertes et al. 2011). The data presented in this paper show that about 14000 human genes and a similar number of mouse genes analyzed by the composon-clustering method were classified in 18 clusters and that 90% of the clustered genes fit into the 11 composon-clusters previously described (Fuertes et al. 2011). It was also observed that each one of the 18 human and mouse counterpart clusters share highly similar conserved composon-usage profiles (r > 0.989).

The occurrence of only a small number of composon-clusters both in human and mouse gene samples let us to address the conceptual particulars of the type of genomic information obtained by the analysis of the composon-usage frequency of a DNA sequence relative to the codon-usage. The codon reading and the composon-reading of the DNA compute base triplets but there is a major difference between the information provided by both systems of reading. While the analysis of codon-usage requires reading the DNA sequence in the proper open reading frame and in a non-overlapping way, the analysis of the composon-usage requires reading the DNA sequence in a fully overlapping way independently, therefore, of the open reading frame. Thus, while the number of base triplets of a DNA sequence after a codon reading is 1/3 of the number of bases of the sequence, the number of triplets after a composon-reading is equal to the number of bases of the DNA sequence since each base forms a triplet with their nearest neighbors. Thus, the information obtained by the fully overlapping reading is higher than that obtained by reading in a non-overlapping way. Thus, while the information provided by the genetic code is linked to the translated protein, the information provided by the composon-usage reading is linked to the nucleotide compositional characteristics that are embedded and distributed along the DNA sequence in terms of the number, location, and frequency of use of each one of the triplets (see Eq. 1). This information is, therefore, connected to the structural organization of the DNA sequences. From this point of view the genetic code and the composon-code are codes in a strict sense since in both cases an input determines an output (Pearson 2006) having redundant properties (Watson et al. 2008). The 64 codons of the genetic code codify for 20 amino acids and the 64 base triplets of the composon-code codify for 14 composons.

The finding that a large number of genes of the human and mouse genomes could be classified into a small number of composon-clusters and that the composon-profiles of the clusters from one species have similar counterparts in the clusters of the other species suggests that variations in the composon-usage and, therefore, variations in compositional organization might be, to a certain extent, evolutionarily restricted. This restriction would open the question of whether could this merely be attributable to evolutionary conservatism of the composon-profiles after gene duplication followed by mutation or does it reveal a gene-type-associated characteristic maintained for functional reasons? We think, moreover, that the existence of high and poor gene-populated clusters reinforces the hypothesis that the gene size of complex genomes resulted from differential increases in ploidy and from smaller scale duplication of pre-existing DNA sequences followed by sequence divergence and often, eventually, by translocation of sequence duplicates (Levine et al. 2006; Long et al. 2003; Ohno 1970) while maintaining their basic composon-profile. In fact, duplicated genes may accumulate large number of base substitutions and have, however, similar composon-usage profiles. If so, it could be expected, as shown in this paper, that (i) several genes of an organism originated from gene duplication could, to a significant extent, be grouped into similar composon-clusters in spite of their primary sequence divergence and that (ii) a large proportion of genes of a single species could fit into a small number of clusters. To our knowledge, this is the first time in which it has been shown that a large fraction of mouse and human genes can be classified into a small and a certain number of highly correlated compositional architectures.

In this context, we believe that one of the most relevant outcomes of the data presented in this paper about the genes contained in cluster 16 is that, in spite of their low degree in nucleotide sequence homology, that a posteriori was annotated by GO to a common molecular TMEM function, the composon-clustering was able to classify the genes in a single compositional cluster. Actually, most genes from cluster 16 codify for receptors activated by the binding of a repertory of different ligands including taste-sensitive compounds, odors, pheromones, hormones, and neurotransmitters that activate cellular signal transduction mechanisms via GPCRs.

The TMEM cluster of genes contains a large amount of orphan GPCRs encoded by intronless genes (as is the case of the Rhodopsin-like GPCRs) not classified in any of the known subfamilies (Alem et al. 2007). Thus, we believe that as far as the composon-cluster 16 concerns the analysis of the DNA by the composon-usage is able to identify a distinctive characteristic gene architecture interspersed along the primary gene sequence linked to their molecular function. Therefore, most likely, this type of clustering may be useful for gene annotation prediction. While the analysis of the primary sequence of the DNA would consider these genes not to be related, the composon-clustering revealed that they are highly correlated (r > 0.989). Due to the low degree of nucleotide sequence homology between some of the GPCRs and to the fact that the analysis of GPCRs is error prone and it needs to be followed up by a careful manual curation (Haitina et al. 2009), bioinformatics methods have been used for predicting the classification of GPCRs according to their amino acid sequence and pseudo amino acid composition approach (Gu et al. 2010; Qiu et al. 2009; Xiao et al. 2009). Therefore, just by the analysis of the DNA without the need for further curation it would be possible to classify GPCR genes. The computation of similarities and distances between the GPCR genes fitted into cluster 16 according to the composon-usage profiles could assist to classify them being an attractive model to study gene evolution and the identification of subfamilies. In fact, odorants and pheromones present in the environment are detected via ORs, TAARs, Vmnrs, and formyl peptide receptors, expressed by the sensory neurons in the epithelia of these organs (Ibarra-Soria et al. 2014). TAARs, which were initially identified as receptors from a specific group of biogenic amines in the brain, are actually expressed in the olfactory epithelia in mice and are able to be regarded as a second class of ORs (Niimura 2009). Most of these genes are grouped in composon-cluster 16. As described (Adipietro et al. 2012), a special class of GPCRs, as the ORs, has been subjected to rapid evolution between species as a result of gene duplication and gene loss. The difference in gene population between the mouse and human genes classified in cluster 16 may reflect this loss and the adaptation of the olfactory system to the environmental context of each species. Interestingly, even though the human ORs cover a similar receptor space as the mouse ORs, suggesting that the human olfactory system has retained the ability to recognize a broad spectrum of chemicals, it has lost nearly two-thirds of the mouse ORs (Zhang and Firestein 2002). Since as it was shown both, in mouse and human, the composon-cluster 16 includes hundreds of gene coding for ORs, it is most likely that the majority of the ORs correlates with the GPCR-Rhodopsin family.

The occurrence of highly gene-populated composon-clusters having similar compositional features suggests that the genes forming each cluster may share some sort of evolutionary linkage and that this occurrence may confer some type of selective advantage promoting, thus, their formation and maintenance as it occurs with other types of clustering (Yi et al. 2007). In addition, the existence of differences in gene population among clusters might also be connected with a differential selective advantage and history of the duplicated clustered genes being responsible for the formation of levels of clustering organization ranging from small clusters to large ones spanning hundreds of genes. (See Online Resource 1). In the less populated composon-clusters (<300 genes), as for example cluster 16, the linkage between the composon-clustering and the molecular function could at first sight be observed since most of the genes have a common parental function. In the more populated composon-clusters (300–3000 genes per cluster) even though they may have some sort of functional linkage, such a link is not at first sight viewed due to the wide number of specific functions of the genes that fit into those clusters.

If, as shown in the composon-cluster 16, the architecture of DNA sequences, as determined by the analysis of the similarities and dissimilarities in the composon-usage and independent of their sequence homology, may have some sort of linkage with functional properties. It is most likely that this type of analysis could add some light on the computational annotation of gene function prediction. In conclusion, we suggest that there is a need to improve the existing available tools that may contribute to decipher the challenging problem of protein annotation prediction (Radivojac et al. 2013). We think, in addition, that the reduced dimensionality of the composon-clustering classification may help to annotate in a fast way large number of genes having parental functions.