1 Introduction

The interplay between the structure and function of genetic entities and the taxonomy of their bearers still challenges researchers. Much has been done in this area (see, e.g., [4, 5, 6, 14, 20] and many others). Obviously, the answer depends on the genetic matter taken into consideration: some entities show a strong prevalence of taxonomy over function [19], while others show a prevalence of function over taxonomy [3]. This paper aims to further the study of this interplay.

Here we study the relation between the structure encoded in 16S RNA (namely, its triplet frequency dictionary) and the taxonomy of the bearers of those molecules. In general, there are three entities: the structure of a genetic sequence, the function encoded in it, and the taxonomy of its bearer. Ultimately, we aim to study the interplay among all of them. To do so, one must rigorously define what a structure is. Hereafter we refer to a structure as the triplet frequency dictionary \(W^{(j)}\) of the \(j^{\text {th}}\) 16S RNA sequence.

The index j enumerates the genetic entities under consideration, with respect to their taxonomy. A frequency dictionary (also known as a k-mer ensemble) is a well-known object in studies of nucleotide sequences; it is an alternative to the widespread sequence comparison methods based on alignment. The main disadvantage of alignment is the need to set some informally determined parameters. In contrast, k-mer based methods are free from this disadvantage and thus provide a researcher with more reliable and formally defined results. The k-mer approach to sequence comparison has a long history and is still developing. Here we use the classical approach based on counting triplet frequencies over a sequence; some extensions of the method can be found in [13], see also [1, 2, 21].

One should not expect the anticipated patterns to depend on function, since here the function is the same: we study ribosomal RNA genes (specifically, 16S RNA), all of which encode the same function. Thus, we aim to reveal the dependence between the triplet composition of the genes and the taxonomy of their bearers.

To reveal the interplay between structure and taxonomy, we take the following steps:

  • choose the genetic entities with a clearly determined and controlled function;

  • convert each of them into a triplet frequency dictionary \(W^{(j)}\);

  • use up-to-date and powerful methods to cluster the points (frequency dictionaries) in the relevant metric space and identify the clusters;

  • check whether the taxonomy of the DNA donor organisms determines the composition of the clusters (if any).

Suppose that clusters are observed (otherwise there is no interplay). Three outcomes are then possible:

  1. the clusters are apparent, and each cluster comprises the sequences belonging to a specific taxon (or taxa);

  2. the clusters are apparent, and each cluster comprises the sequences belonging to organisms of various taxa (maybe, rather distant);

  3. a hierarchy in the cluster composition takes place: e.g., there are super-clusters gathering the higher taxa, with the fine pattern of each super-cluster determined by the lower taxonomic position of the organisms.

Here we present some preliminary results on the relation between the triplet composition of 16S RNA genes and the taxonomy of some bacteria. Ultimately, this work aims to reveal medically relevant effects in the appearance of such patterns.

A tool that retrieves knowledge from 16S pyrosequencing data and determines patterns characterizing healthy people vs. patients with various neurological diseases, or with a predisposition to them, would be of very high medical value. Reliable changes in the qualitative and quantitative diversity of the microbiota have been reported for inflammatory bowel diseases (Crohn’s disease and ulcerative colitis), Parkinson’s disease, Alzheimer’s disease, multiple sclerosis, and other neurodegenerative and neuroinflammatory diseases.

However, the lack of a correct and convenient interpretation leads to a severe increase in the time spent on analysis; one must rigorously follow the same protocol, which is not always possible. Nevertheless, the diagnosis of a number of gastroenterological, neurological, and possibly other diseases may be improved. In the future, this will contribute significantly to personalized medical care based on microbiota records. The most ambitious goal here is to create a preventive strategy for correcting the human microbiota through targeted drug prescription: either eliminating harmful microflora or activating the necessary one. It is also necessary to assess the adequacy of the correction carried out during such treatment.

2 Materials and Methods

2.1 Genetic Material

To reveal the interplay between structure and taxonomy over a set of bacterial 16S RNA genes we use the SILVA database (see footnote 1). It is a freely accessible database gathering ribosomal RNAs of a great variety of organisms, including bacteria. For the purposes of our study we downloaded 52474 sequences of bacterial 16S RNA, the RNA of the small ribosomal subunit. The distribution of the genes over taxa is extremely inhomogeneous: some higher taxa comprise a few species (or strains), while others comprise hundreds or more. Such bias results in a “signal loss”: numerous entries representing higher taxa with few species fail to produce a signal and only add noise that deteriorates the cluster pattern. To avoid this effect, we thinned the database by eliminating both over-represented and under-represented taxa. In this way we tried to balance the representation of various taxa in the dataset, so that the number of entries representing each lower taxon ranged from tens to a hundred. The final size of the database was 2143 entries. The taxonomic composition of the database is shown in Table 1. Of course, the composition of the dataset is far from ideally balanced; however, it represents the natural distribution of taxa to some extent. It should be borne in mind that any database is filled not according to nature, but following the preferences in the choice of species to be sequenced.
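The balancing step described above can be illustrated with a minimal sketch. It assumes that each entry is a (sequence identifier, taxon label) pair and that the thresholds are 10 and 100 entries per taxon; the function name, the thresholds, and the choice to subsample (rather than discard) over-represented taxa are assumptions made for illustration, not the procedure actually used for the dataset.

```python
import random
from collections import Counter, defaultdict


def balance_by_taxon(entries, min_count=10, max_count=100, seed=0):
    """Thin a list of (sequence_id, taxon) pairs.

    Assumed rule (hypothetical): taxa with fewer than `min_count` entries
    are dropped; taxa with more than `max_count` entries are randomly
    subsampled down to `max_count`.
    """
    rng = random.Random(seed)
    by_taxon = defaultdict(list)
    for seq_id, taxon in entries:
        by_taxon[taxon].append(seq_id)

    kept = []
    for taxon, ids in sorted(by_taxon.items()):
        if len(ids) < min_count:
            continue  # under-represented taxon: only adds noise
        if len(ids) > max_count:
            ids = rng.sample(ids, max_count)  # over-represented taxon
        kept.extend((seq_id, taxon) for seq_id in ids)
    return kept


def taxon_counts(entries):
    """Number of entries per taxon, handy for inspecting the balance."""
    return Counter(taxon for _, taxon in entries)
```

Calling `taxon_counts` before and after `balance_by_taxon` gives a quick view of how strongly the thinning changes the taxonomic composition.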

Table 1. Abundances of various taxa and genetic entries in the dataset; N stands for the number of genetic entities in the family.

2.2 Triplet Frequency Dictionary

A triplet frequency dictionary \(W^{(j)}\) is the list of all 64 triplets \(\omega _k\), \(k = \mathsf {AAA}, \ldots , \mathsf {TTT}\), each accompanied by its frequency \(f_{\omega _k}\); the index j enumerates the sequences in the dataset. To build a dictionary, place a reading frame of length 3 at the very beginning of a sequence and count all the triplets identified by the frame as it moves along the sequence from left to right (for definiteness) with a given step t. Within this paper, \(t=1\). The obvious constraint

$$\begin{aligned} \sum _{k=\mathsf {AAA}}^{\mathsf {TTT}} f_{\omega _k} = 1 \end{aligned}$$
(1)

holds true.
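The construction of a dictionary can be sketched as follows. This is a minimal illustration, not the authors' code: the function name is hypothetical, and two details not specified in the text are assumed, namely that U is read as T and that windows containing ambiguous symbols (e.g. N) are skipped.

```python
from itertools import product

# All 64 triplets in lexicographic order, AAA ... TTT.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]


def triplet_frequencies(sequence, step=1):
    """Return the triplet frequency dictionary of a nucleotide sequence.

    A reading frame of length 3 starts at the first symbol and moves
    from left to right with the given step (t = 1 in the paper).
    Assumptions: U is treated as T; windows with symbols outside ACGT
    are ignored.
    """
    seq = sequence.upper().replace("U", "T")
    counts = dict.fromkeys(TRIPLETS, 0)
    total = 0
    for i in range(0, len(seq) - 2, step):
        word = seq[i:i + 3]
        if word in counts:
            counts[word] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no valid triplets")
    # Normalisation makes the frequencies sum to 1, i.e. constraint (1).
    return {word: n / total for word, n in counts.items()}
```

For the toy sequence `ACGTACGT` with step 1 the frame identifies six triplets (ACG, CGT, GTA, TAC, ACG, CGT), so ACG and CGT each get the frequency 2/6 and GTA and TAC get 1/6.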

The transformation of a sequence into a triplet frequency dictionary converts the sequence into a point in a 63-dimensional metric space: constraint (1) allows one triplet to be eliminated, since only 63 of the frequencies are linearly independent. In theory, any triplet might be excluded from the analysis; in practice, we excluded the triplet \(\mathsf {CAC}\), since it has the smallest standard deviation over the dataset. The idea behind this choice is that the triplet with the minimal standard deviation contributes least to the distinguishability of the genetic entities.
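The exclusion of the least variable triplet can be sketched as below. The sketch assumes that the dictionaries are stacked into an array with one row per sequence and one column per triplet; the function name and the array layout are illustrative assumptions.

```python
import numpy as np


def drop_least_variable(freqs, names):
    """Remove the column (triplet) with the smallest standard deviation.

    freqs : array of shape (n_sequences, n_triplets), one row per sequence
    names : list of triplet labels, one per column
    Returns the reduced array, the remaining labels, and the dropped label.
    """
    freqs = np.asarray(freqs, dtype=float)
    sd = freqs.std(axis=0)           # standard deviation of each triplet
    k = int(np.argmin(sd))           # index of the least variable triplet
    reduced = np.delete(freqs, k, axis=1)
    kept = [name for i, name in enumerate(names) if i != k]
    return reduced, kept, names[k]
```

Applied to a 2143 by 64 matrix of frequencies, this would return a 2143 by 63 matrix; according to the text, the dropped label for the dataset studied here is CAC.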

The transformation maps a symbol sequence into a more convenient mathematical object, a point in a metric space, thus allowing effective methods of analysis to be applied. To do so, one must introduce a metric; hereafter we use the Euclidean metric

$$\begin{aligned} \rho \left( W^{(j)}, W^{(l)}\right) = \sqrt{\sum _{k=\mathsf {AAA}}^{\mathsf {TTT}} \left( f_{\omega _k}^{(j)} - f_{\omega _k}^{(l)}\right) ^2}{.} \end{aligned}$$
(2)

Thus, we investigate the distribution of the points corresponding to genetic sequences in this metric space revealing patterns and clusters, if any.
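For illustration, the matrix of all pairwise distances (2) can be computed in a few lines. The sketch below assumes the frequency dictionaries are stored as rows of an array; the function name is hypothetical.

```python
import numpy as np


def pairwise_euclidean(points):
    """Matrix of Euclidean distances between all rows of `points`.

    Uses the identity |a - b|^2 = |a|^2 + |b|^2 - 2 a.b, which avoids
    building an (n, n, d) intermediate array for large datasets.
    """
    x = np.asarray(points, dtype=float)
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    np.maximum(d2, 0.0, out=d2)   # clip tiny negatives caused by rounding
    np.fill_diagonal(d2, 0.0)     # the distance of a point to itself is 0
    return np.sqrt(d2)
```

For a dataset of about two thousand sequences the resulting square matrix is small enough to keep in memory, which is one reason to prefer this formulation over an explicit three-dimensional difference array.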

2.3 Clustering and Visualization

A huge variety of methods exists for clustering multidimensional data. We used k-means and the elastic map technique. k-means is a well-known linear clustering method [7, 9], so we focus on the elastic map technique. It is a non-linear statistical method based on the approximation of multidimensional data by a manifold of lower dimension; hereafter we use two-dimensional manifolds [8].
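As a reminder of how k-means operates on such points, a minimal sketch of the standard Lloyd iteration is given below. It is a generic textbook version, not the implementation used in this study; the function name, the random initialisation from data points, and the stopping rule are assumptions.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm for k-means clustering.

    points : array of shape (n, d); k : number of clusters.
    Returns (labels, centers). Initial centers are k distinct data
    points chosen at random (an assumption of this sketch).
    """
    x = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # Update step: each center moves to the mean of its members.
        new_centers = centers.copy()
        for c in range(k):
            members = x[labels == c]
            if len(members):
                new_centers[c] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Since the result of k-means depends on the initial centers, in practice one would typically run it with several seeds and keep the partition with the smallest within-cluster sum of squares.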

The idea of the elastic map method is to deform an initially flat manifold (a square in our case) so as to minimise the total deformation energy of the elastic manifold and of the mathematical springs that connect the data points to the manifold at their projection points. It is a powerful and efficient method for clustering and visualising multidimensional data.
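For orientation, the energy minimised by an elastic map is usually written in the literature on the method (cf. [8]) as a sum of three terms. The formulation below is this commonly used form and is given under the assumption that it applies here unchanged; the symbols \(U\), \(X\), \(y_i\), \(K_i\), \(\lambda \) and \(\mu \) are introduced for illustration only and are not used elsewhere in this paper.

$$\begin{aligned} U&= U^{(Y)} + U^{(E)} + U^{(R)}, \\ U^{(Y)}&= \frac{1}{|X|}\sum _{i}\sum _{x \in K_i} \left\| x - y_i\right\| ^2, \\ U^{(E)}&= \lambda \sum _{(i,i')} \left\| y_i - y_{i'}\right\| ^2, \\ U^{(R)}&= \mu \sum _{(i,i',i'')} \left\| y_i - 2y_{i'} + y_{i''}\right\| ^2{.} \end{aligned}$$

Here \(X\) is the set of data points, \(y_i\) are the nodes of the map, and \(K_i\) is the subset of data points for which \(y_i\) is the closest node; the second sum runs over pairs of adjacent nodes (edges), and the third over triples of consecutive nodes along a grid line (ribs) with \(y_{i'}\) in the middle. The term \(U^{(Y)}\) is the energy of the springs attaching the data to the map, while \(U^{(E)}\) and \(U^{(R)}\) penalise stretching and bending with the elasticity moduli \(\lambda \) and \(\mu \); smaller moduli give a softer map.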

Non-linear clustering of the genes was carried out with the local density technique. In simple words, the local density is the number of points in a small area of the map. To calculate the local density, we supply each data point on the map with the bell-shaped function

$$\begin{aligned} h\left( r, r^{(j)}\right) = \exp \left\{ -\dfrac{(r- r^{(j)})^2}{\sigma ^2}\right\} {,} \end{aligned}$$
(3)

where r is a point position on the map, \(r^{(j)}\) is the map coordinate of the \(j^{\text {th}}\) gene converted into a point through the triplet frequency transformation, and \(\sigma \) is the contrast parameter. Function (3) resembles a normal distribution density; however, it is not one.

Once all the data points on the map are supplied with function (3), one calculates the sum

$$\begin{aligned} H(r) = \sum _{j\in \varOmega }h\left( r, r^{(j)}\right) {.} \end{aligned}$$
(4)

Here \(\varOmega \) is the set of all the points representing the considered genes. Plotting function (4) over the map shows the density of the distribution of the points (see Fig. 1(b)).
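The sum (4) with the kernel (3) is straightforward to evaluate numerically. The sketch below assumes that the positions of the genes on the map are available as two-dimensional coordinates and that the squared difference in (3) is the squared Euclidean distance on the map; the function name and argument layout are hypothetical.

```python
import numpy as np


def local_density(grid_points, gene_points, sigma=0.25):
    """Evaluate H(r) from (4) at each row of `grid_points`.

    grid_points : array (m, 2), positions r where the density is wanted
    gene_points : array (n, 2), map coordinates r^(j) of the genes
    sigma       : contrast parameter of the kernel (3)
    """
    r = np.asarray(grid_points, dtype=float)
    rj = np.asarray(gene_points, dtype=float)
    # Squared distances between every grid position and every gene.
    d2 = ((r[:, None, :] - rj[None, :, :]) ** 2).sum(axis=-1)
    # Kernel (3) for every pair, summed over the genes as in (4).
    return np.exp(-d2 / sigma ** 2).sum(axis=1)
```

Evaluating the function on a regular grid covering the map and drawing the values as a heat map gives a picture of the kind referred to as the local density chart; a larger \(\sigma \) merges neighbouring peaks, while a smaller one splits them.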

3 Results and Discussion

16S RNAs are typically used in studies of the relations (including phylogeny) among bacteria [10, 18, 22]. Usually, the sequences are compared by alignment; here we present some preliminary results of structure identification obtained with an alignment-free approach, namely unsupervised clustering based on the elastic map technique.

To anticipate, we applied unsupervised clustering to reveal a pattern in the distribution of taxa of some bacteria; we aimed mainly to show that such a pattern exists. Figure 1 shows the raw distribution of 2143 16S RNA genes over the soft elastic map. Figure 1(a) shows the distribution itself, and Fig. 1(b) shows the same distribution together with the local density plotted over the same map.

Fig. 1. A distribution of 2143 points over the elastic map without local density indication (Fig. 1(a)) and with it (Fig. 1(b)).

Fig. 2. Individual distributions of various orders.

Figure 1(a) shows the overall distribution of the genes over the \(16\times 16\) soft elastic map, and Fig. 1(b) shows this distribution together with the local density. Of course, the cluster pattern depends on the contrast parameter \(\sigma \) from (3), and its choice is quite informal; we used the default value of 0.25. Doubtlessly, there is one highly dense cluster located at the right of the map, and there are three to four further clusters as well.

Figure 2 shows the individual distributions of specific orders over the elastic map. To produce them, we made invisible the markers of all genes except those belonging to a specific order; the elastic map and the local density chart, however, are built for the entire set of genes (2143 entries). For technical reasons, we had to merge two orders (Mycoplasmatales and Solibacterales) into a single map (see Fig. 2(i)).

We explored the distinguishability of rather high taxa through the clustering of bacterial 16S RNA genes converted into triplet frequency dictionaries. A question thus arises: what happens with lower taxa? In other words, if one applies the same procedure to a set of genes belonging, say, to the same order, what kind of clustering is observed? Again, two options are possible: the former is that lower taxa yield distinct clustering (regardless, at the first step, of the peculiarities of the cluster composition), and the latter is a decay of the cluster pattern into a more or less uniform distribution of the genes over the elastic map.

The first option means scalability of the cluster pattern observed through the triplet composition analysis of the genes; the second means the absence of fine structure in the distributions of lower taxa obtained with the triplet composition approach. Figure 3 illustrates the lower-level distinguishability of the genes for the Chlamydiales order. The dataset comprises three families of this order. Obviously, the distribution of the families is highly specific, and the species show significant specificity in their mutual location over the elastic map. The orders Acidobacteriales and Acidimicrobiales each comprise a single lower suborder, with 34 and 24 entries, respectively, so we omitted these orders from consideration here.

Fig. 3. Family distributions of the Chlamydiales order.

Space limitations make it impossible to show the in-order distributions for all five orders shown above; however, two other orders (Bacillales, 1118 entries, and Bacteroidia, 695 entries) comprise eight and seven families, respectively, so we studied the distribution of the families for them. Surprisingly, these two orders show opposite patterns. The Bacillales order shows three apparent clusters: the first one is the densest, and the two others are less dense. However, the distribution of the families over the clusters is close to uniform: the genera belonging to various families are distributed quite homogeneously over these three clusters. This means that no dependence between lower taxonomy and the triplet composition of the genes is observed for this order.

The cluster comprising the Chlamydiales order forms a clear and apparent group located separately from the other considered bacterial orders on the elastic map (see Fig. 3(f) and 3). Such isolation of pathogenic bacteria is a promising result for reliable diagnostics in the future. In contrast to Bacillales, the order Bacteroidia exhibits a clear specificity in the cluster composition. It comprises seven families, and they are distributed separately over the elastic map. The genes of this order yield four clusters; however, the genes of each family are separated from those of the others.
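One simple way to quantify how families are spread over clusters, as discussed for Bacillales and Bacteroidia, is to cross-tabulate family labels against cluster labels. The sketch below is only an illustration of such a check; the function name and the input format (two parallel lists of labels) are assumptions, and the text does not state how the homogeneity was actually assessed.

```python
from collections import Counter, defaultdict


def family_cluster_shares(families, clusters):
    """Share of each family's genes that falls into each cluster.

    families, clusters : parallel sequences with one label per gene.
    Returns {family: {cluster: share}}; the shares of a family sum to 1.
    """
    table = defaultdict(Counter)
    for family, cluster in zip(families, clusters):
        table[family][cluster] += 1

    shares = {}
    for family, row in table.items():
        total = sum(row.values())
        shares[family] = {cluster: n / total for cluster, n in row.items()}
    return shares
```

If every family has similar shares in all clusters, the families are spread homogeneously (the Bacillales-like pattern); if each family is concentrated in a single cluster, the cluster composition is family-specific (the Bacteroidia-like pattern).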

The ultimate goal is to identify and verify early predictors of some neurological diseases, in particular multiple sclerosis, through the analysis of the microbiota [11, 12, 15, 16, 17]. This ambitious goal requires a tool for fast and efficient analysis of genetic markers of the microbiota, and 16S RNA seems to be the best candidate. Diagnostics of the mentioned diseases requires a study of the normal pattern of gut microbiota occurrence; hence, we demonstrate the efficiency of the approach in the investigation of microbial populations and provide a reference for further medically relevant studies. Clustering and/or classification methods could complement current techniques for diagnostics and for the implementation of treatment strategies.

Here we present some preliminary results aimed at demonstrating the feasibility and efficiency of diagnostics based on 16S RNA analysis of the microbiota of healthy and sick people. To implement such a diagnostic tool, one should make sure that the genetic marker used to distinguish sick people from healthy ones really supports this distinguishability. The results provided here show the efficiency of such an approach, in principle. Doubtlessly, our current results do not constitute a diagnostic tool; they just confirm the feasibility of such a tool.

4 Conclusion

Here we explored the interplay between the triplet composition of bacterial 16S RNA genes of five orders and the taxonomy of those bacteria. Some preliminary results are presented to confirm the feasibility of triplet composition based clustering of bacterial 16S RNA genes for distinguishing various taxa in the 63-dimensional Euclidean space of triplet frequencies. The results show that various taxa differ in terms of triplet frequencies, so a more detailed and exhaustive investigation of the interplay is worthwhile. Moreover, the interplay is scalable: a transition from a higher taxon to lower ones reveals a new and finer structure in the clustering.

The study of the interplay between taxonomy and the k-tuple composition of genes is of great interest and value in itself. Moreover, these studies may contribute a lot to various applied areas including, e.g., medicine.

Thus, the design and implementation of a tool for early diagnostics of such hard-to-detect diseases, based on comparative analysis of formally identified structures in bacterial 16S RNA, is feasible.