
1 Introduction

Complex systems often exhibit forms of tangled organisation, characterised by intertwined relations among their parts. When modelling a complex system, it is common to associate variables with its atomic components; the actual dynamics of the system can then be represented by the interplay among groups of these variables, which we may call relevant subsets. The identification of relevant subsets in a complex system is a key issue in contexts as diverse as biology, neuroscience, environmental studies, big data analysis, economics and robotics, to mention just a few. In previous works, we have proposed a method based on information-theoretic measures that aims at identifying such relevant subsets (hereafter referred to as RSs). The technique, named the Dynamical Cluster Index (DCI) method, has been shown to effectively capture the dynamical organisation of complex systems of several kinds, such as genetic networks and chemical reactions [1, 14]. The method has proven superior to classical correlation-based techniques and can be applied to dynamical systems not necessarily in stationary states. Recently, we have proposed an operational definition of RS in terms of a filtering algorithm and suggested the use of transfer entropy for assessing the information flow among RSs [2]. The overall method makes it possible to identify the components of a complex system that are relevant for its dynamics, as well as their relations in terms of information flow. The detection of such a dynamical structure is a prerequisite for inferring a likely hierarchical organisation of the system. The identification of the RSs relies on a ranking based on a statistical index, which depends on a reference system. This reference system should provide a baseline for the assessment of the results and is usually generated by means of a stochastic model, so as to both capture the main statistical properties of the complex system to be analysed and provide a homogeneous reference, i.e. a system without RSs. A subtle yet crucial point concerns the robustness and significance of the results attained by the DCI method against the fluctuations induced by the sampling of the homogeneous system.

In this paper we address this issue and show that the results attained by the DCI method are valid in general and do not depend on the specific homogeneous system instances generated. We first briefly recall the basic notions and provide an overview of the DCI method in Sect. 2. In Sect. 3 we succinctly illustrate the sieving procedure used for identifying the RSs and their exchange of information. The robustness of the results is discussed in Sect. 4, where we present a thorough statistical analysis; the significance of the transfer entropy values between RSs is assessed in Sect. 5. Finally, in Sect. 6 we discuss further improvements of the method and conclude.

2 The Dynamical Cluster Index Method

The DCI method has been introduced in previous work and the interested reader can find details in [1, 12, 13]. For the sake of completeness, we provide a brief summary of the main notions and of the method itself. Let us consider a system modelled with a set U of n variables taking values in finite and discrete domains. The cluster index of a subset S of variables in U, \(S \subset U\), as defined by Tononi et al. [10], estimates the ratio between the amount of information integration among the variables in S and the mutual information between S and the rest of the system. These quantities are based on the Shannon entropy of both single elements and sets of elements in U. The entropy of an element \(x_i\) is defined as:

$$\begin{aligned} H(x_i) = -\sum \limits _{v \in V_i} p(v) \;\; log\; p(v) \end{aligned}$$
(1)

where \(V_i\) is the set of the possible values of \(x_i\) and p(v) the probability of occurrence of symbol v.
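As a concrete illustration, the following minimal sketch (our own code, not part of the original method) estimates the entropy of a single variable from an observed series, with probabilities estimated by relative frequencies as discussed below:

```python
# Minimal sketch (not the authors' code): estimating H(x_i) from an observed
# series, with probabilities estimated as relative frequencies.
import numpy as np

def entropy(series, base=2):
    """Shannon entropy of a discrete-valued 1-D sequence (bits by default)."""
    _, counts = np.unique(series, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p)) / np.log(base)

x = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 1])   # toy boolean trajectory
print(entropy(x))                              # ~0.971 bits, since p(1) = 0.6
```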

In this work we deal with observational data; therefore, probabilities are estimated by means of relative frequencies. The cluster index C(S) of a set S of k elements is defined as the ratio between the integration I(S) of S and the mutual information between S and the rest of the system \(U-S\). The integration of S is defined as:

$$\begin{aligned} I(S) = \sum \limits _{x \in S} H(x) - H(S) \end{aligned}$$
(2)

I(S) represents the deviation from statistical independence of the k elements in S. The mutual information \(M(S;U-S)\) is defined as:

$$\begin{aligned} M(S;U-S) \equiv H(S) - H(S | U-S) = H(S) + H(U-S) - H(S,U-S) \end{aligned}$$
(3)

where H(A|B) is the conditional entropy and H(A, B) the joint entropy. Note that H(A) denotes the entropy of the set A. Finally, the cluster index C(S) is defined as:

$$\begin{aligned} C(S) = \frac{I(S)}{M(S;U-S)} \end{aligned}$$
(4)
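For illustration, the sketch below (helper names are ours, not the authors' code) computes plug-in estimates of I(S), M(S;U-S) and C(S) from a matrix of observations, with one row per observation and one column per variable:

```python
# Illustrative sketch: plug-in estimates of I(S), M(S;U-S) and C(S).
import numpy as np

def H(data):
    """Entropy (bits) of the joint states of the columns of `data`."""
    _, counts = np.unique(data, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def integration(data, S):
    """I(S) = sum_{x in S} H(x) - H(S), with S a list of column indices."""
    return sum(H(data[:, [i]]) for i in S) - H(data[:, S])

def mutual_information(data, S):
    """M(S;U-S) = H(S) + H(U-S) - H(S, U-S)."""
    rest = [i for i in range(data.shape[1]) if i not in S]
    return H(data[:, S]) + H(data[:, rest]) - H(data)

def cluster_index(data, S):
    """C(S) = I(S) / M(S;U-S); returns None when M(S;U-S) vanishes."""
    M = mutual_information(data, S)
    return integration(data, S) / M if M > 0 else None

# Toy usage: columns 0 and 1 are copies of each other, columns 2 and 3 are
# independent noise, so I({0,1}) is high while M({0,1}; rest) stays small.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, 500)
data = np.column_stack([a, a, rng.integers(0, 2, 500), rng.integers(0, 2, 500)])
print(integration(data, [0, 1]), mutual_information(data, [0, 1]))
```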

Since C is defined as a ratio, it is undefined in all those cases where \(M(S;U-S)\) vanishes. In such cases, the subset S is statistically independent from the rest of the system and has to be analysed separately. As C(S) scales with the size of S, cluster index values of subsets of different size need to be normalized. To this aim, a reference system is defined, i.e. the homogeneous system \(U_h\), which keeps some statistical properties of the original system but does not contain clusters. There are several ways to generate \(U_h\), which will be discussed in Sect. 4. In general, for each subsystem size, the average integration \(\langle I_h \rangle\) and the average mutual information \(\langle M_h \rangle\) are computed on \(U_h\). The cluster index value of S is then normalized by means of the appropriate normalization constants:

$$\begin{aligned} C'(S) = \frac{I(S)}{\langle I_h \rangle } / \frac{M(S;U-S)}{\langle M_h \rangle } \end{aligned}$$
(5)

Furthermore, to assess the significance of the differences observed in the cluster index values, a statistical index \(T_c\) is computed:

$$\begin{aligned} T_c(S) = \frac{C'(S) - \langle C'_h \rangle }{\sigma (C'_h)} = \frac{C(S) - \langle C_h \rangle }{\sigma (C_h)} \end{aligned}$$
(6)

where \(\langle C_h \rangle \) and \(\sigma (C_h)\) (respectively \(\langle C'_h \rangle \) and \(\sigma (C'_h)\)) are the average and the standard deviation of the population of cluster indices (respectively normalized cluster indices) computed on subsets of the homogeneous system with the same size as S.
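Since, for subsets of a fixed size, \(C'\) is simply C rescaled by the constant \(\langle M_h \rangle / \langle I_h \rangle\), the two expressions in Eq. (6) coincide. The following minimal sketch exploits this equivalence, assuming that hom_I and hom_M are arrays of integration and mutual information values measured on same-size subsets of the homogeneous system (names are ours):

```python
# Minimal sketch of Eq. (6), using the right-hand (unnormalized) form.
import numpy as np

def t_c(I_S, M_S, hom_I, hom_M):
    """T_c(S) = (C(S) - <C_h>) / sigma(C_h)."""
    C_S = I_S / M_S
    C_h = hom_I / hom_M      # cluster indices of same-size homogeneous subsets
    return (C_S - C_h.mean()) / C_h.std(ddof=1)
```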

The search for RSs of a dynamical system by means of the DCI first requires the collection of observations of the values of the variables at different instants. As the DCI is computed on the basis of symbol frequencies, in principle it is not required to have a time series but just a collection of snapshots of the system variables. However, the results discussed in this paper have been obtained by analysing time series of discrete-time, discrete-state systems. In order to find the RSs, in principle all the possible subsets of system variables should be considered and their DCI computed. In practice, this exhaustive procedure is feasible in a reasonable amount of time only for small-size systems. Therefore, heuristics are required to address the study of large-size systems.

3 Relevant Subsets

The list of candidate RSs (CRSs) can be ranked according to the significance of their DCI. In many cases this analysis may return a huge list of entangled sets, so that a direct inspection would be required to explain their relevance. To this aim, we have introduced a post-processing sieving algorithm that reduces the overall number of CRSs to be manually inspected [2]. Its main limitation lies in the rule it applies: if a CRS A is a proper subset of a CRS B, only the one with the higher DCI is retained. Thus, only disjoint or partially overlapping CRSs survive (see the sketch below); this assumption implies that the remaining CRSs are not further decomposable, thereby forming the “building blocks” of the dynamical organisation of the system. The sieving algorithm enables us to provide a precise definition of RSs, avoiding the fuzziness typical of definitions of clusters in general. Of course, this operational definition is based on some assumptions, which might limit the outcome of a DCI-based analysis. The main one is that the ranking of the CRSs depends upon the homogeneous system: this issue is thoroughly discussed in Sect. 4.
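A rough sketch of the sieving rule as we read it (our own code, not the authors' implementation):

```python
# Walk the CRSs in decreasing T_c order and keep a candidate only if it is
# neither a proper subset nor a proper superset of an already-kept one; the
# survivors are therefore disjoint or partially overlapping.
def sieve(ranked_crs):
    """ranked_crs: list of (frozenset_of_variables, t_c), sorted by t_c desc."""
    kept = []
    for crs, score in ranked_crs:
        if not any(crs < other or crs > other for other, _ in kept):
            kept.append((crs, score))
    return kept

ranked = [(frozenset('ABC'), 5.1), (frozenset('AB'), 4.7), (frozenset('CD'), 3.9)]
print(sieve(ranked))   # {A,B,C} and {C,D} survive; {A,B} is dropped as a subset
```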

3.1 Information Flow Among RSs

Transfer entropy (TE) was introduced by Schreiber [6] as a measure to quantify information transfer between systems evolving in time. Let X and Y be two random variables representing the state transitions of two stochastic or deterministic systems, and let \(x_t\) and \(y_t\) be the values of X and Y at time t, respectively. Let us also suppose that the systems are Markovian processes of order 1, i.e. \(p(x_{t+1} | x_t, x_{t-1}, x_{t-2}, \ldots ) = p(x_{t+1} | x_t)\).Footnote 1 The transfer entropy \(T_{Y \rightarrow X}\) quantifies the amount of information that knowing \(y_t\) provides about \(x_{t+1}\), beyond what \(x_t\) already provides, and is defined as follows:

$$\begin{aligned} T_{Y\rightarrow X} =\sum _{X,Y} p(x_{t+1},x_t,y_t)~log~\frac{p(x_{t+1}|x_t,y_t)}{p(x_{t+1}|x_t)}, \end{aligned}$$
(7)

We note that the temporal dependency is not necessarily of unitary lag, i.e. \(t-1 \rightarrow t\): for a complete assessment of the statistical dependency of X on Y one should condition on \(t-1, t-2, \ldots , t-k\), where k is the order of the Markovian process. Nevertheless, in this paper (i) we analyse Markovian systems of order 1, whose behaviour depends only on the immediately previous step, and (ii) although TE is not a direct measure of causal effect, the use of a short history length and the generation of time series by means of perturbations make it possible to consider this measure as a way to infer causal effect [5].
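A minimal plug-in estimator of Eq. (7) for history length 1, given as a sketch for discrete-valued series (the function name and the toy check are ours):

```python
# Plug-in estimate of T_{Y->X} from two aligned discrete series x and y.
import numpy as np
from collections import Counter

def transfer_entropy(y, x, base=2):
    """Extra information y_t provides about x_{t+1} beyond x_t (in bits)."""
    triples = Counter(zip(x[1:], x[:-1], y[:-1]))   # (x_{t+1}, x_t, y_t)
    pairs_xy = Counter(zip(x[:-1], y[:-1]))         # (x_t, y_t)
    pairs_xx = Counter(zip(x[1:], x[:-1]))          # (x_{t+1}, x_t)
    singles = Counter(x[:-1])                       # x_t
    n = len(x) - 1
    te = 0.0
    for (x1, x0, y0), c in triples.items():
        p_joint = c / n
        p_cond_xy = c / pairs_xy[(x0, y0)]
        p_cond_x = pairs_xx[(x1, x0)] / singles[x0]
        te += p_joint * np.log(p_cond_xy / p_cond_x) / np.log(base)
    return te

# Toy check: x copies y with a one-step lag, so T_{Y->X} is ~1 bit, T_{X->Y} ~0.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 1000)
x = np.empty_like(y)
x[0], x[1:] = 0, y[:-1]
print(transfer_entropy(y, x), transfer_entropy(x, y))
```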

4 Robustness of the Method

In this section we assess the robustness of the results returned by the DCI method with respect to the variance introduced by the generation of the homogeneous system. We evaluated the ranking of the RSs and the transfer entropy values as a function of a sampled distribution of homogeneous system instances, and we also compared different ways of generating them. We first describe the test cases used in the analysis; subsequently, we discuss the results in terms of RS ranking and transfer entropy.

Table 1. The update rules of the boolean networks discussed in the text. Random(0.5) denotes a Bernoulli distribution with probability 0.5.

4.1 Test Cases

We chose five paradigmatic systems composed of 12 nodes, each updated either by means of a boolean function or randomly. The rationale behind the definition of such systems is that, despite their apparent simplicity, they exhibit nontrivial dynamics, as they are boolean networks, a modelling framework that has obtained remarkable results in simulating real gene regulatory networks [7–9, 11]. Nodes update their state in parallel and synchronously. The functional dependencies and the update rules of these systems are shown in Table 1. The size of these systems enables us to perform an exhaustive enumeration of all the possible groups, allowing their complete assessment. Each system is initially set to a random state and evolved in time for 500 steps. In order to avoid short attractors and to better observe the relationships among nodes, a perturbation is made every 10 steps; nodes are perturbed sequentially, though not in the same order. In Case 5 we also studied trajectories in which, at every step, each node has the same probability \(p=1.5\,\%\) of being perturbed. It is worth remarking that each perturbation is introduced after the system has recovered a stable dynamical situation.

The five instances share a common structure but differ in the specific dynamical organisation of some nodes. In Case 1, we consider two independent groups of three nodes (namely, group A and group B), obtained by assigning to each node the XOR of the other two nodes in its group. Case 2 derives from Case 1 by introducing in the first node of group B a further dependence on the last node of group A, hence introducing information transfer from group A to group B. Case 3 is a variant of Case 2 in which a functional dependence of the first node of group A on the last node of group B is also introduced. In Case 4, five heterogeneously linked nodes are influenced by group A. The combination of the dynamical rules of the nodes and their initial condition makes the dynamical behaviour of the sixth node always in phase with the triplet. Finally, Case 5 derives from Case 1 by adding two nodes whose dynamical behaviour directly depends on nodes of both group A and group B: these 8 nodes form a group clearly distinct from the remaining 4 random nodes, as they are interdependent and ruled by deterministic functions.
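For concreteness, the following hedged sketch generates a Case-1-like trajectory. The XOR rules of groups A and B follow the description above, whereas the Random(0.5) update of the remaining six nodes and the simplified perturbation schedule (flip one node every 10 steps, cycling over the nodes) are our assumptions rather than the exact protocol of Table 1:

```python
# Hedged sketch of a Case-1-like trajectory generator (assumptions noted above).
import numpy as np

GROUPS = ((0, 1, 2), (3, 4, 5))   # group A and group B

def step(state, rng):
    new = state.copy()
    for group in GROUPS:
        for i in group:
            a, b = (j for j in group if j != i)
            new[i] = state[a] ^ state[b]      # XOR of the other two nodes
    new[6:] = rng.integers(0, 2, 6)           # assumed Random(0.5) nodes
    return new

def trajectory(steps=500, seed=42):
    rng = np.random.default_rng(seed)
    state = rng.integers(0, 2, 12)
    series = [state.copy()]
    for t in range(1, steps):
        state = step(state, rng)
        if t % 10 == 0:                       # simplified periodic perturbation
            state[(t // 10) % 12] ^= 1
        series.append(state.copy())
    return np.array(series)

data = trajectory()
print(data.shape)                             # (500, 12)
```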

4.2 Relevant Sets Ranking

The usual way of computing the \(T_c\) value consists in generating an instance of a homogeneous system and computing the average integration and mutual information of its subsets of each size. These values are then used to assess the statistical significance of the DCI of a given subset of the system under observation. The homogeneous system can be generated according to different models and its time series can, of course, have different lengths. We checked the robustness of the results against both factors.

As for the model for generating the homogeneous system, we considered two possibilities that differ in the probability distribution used. Let \(s^1,s^2,\ldots \) be the time series of the system to be analysed, where \(s^i = (x^i_1,x^i_2,\ldots ,x^i_n)\). Let \(\hat{s}^i = (\hat{x}^i_1,\hat{x}^i_2,\ldots ,\hat{x}^i_n)\) be the generic state of the homogeneous system time series. In the following, without loss of generality, we suppose that the variables are boolean. The two distributions used for generating the homogeneous systems are the following:

  (i) Compute the frequency \(f_j\) of 1s for each variable \(x_j\) occurring in the series.Footnote 2 Generate a series of states in which the values occur according to the individual frequencies of the variables, i.e. the value of \(\hat{x}^i_j\) is sampled from a Bernoulli distribution with parameter \(f_j\).

  (ii) Compute the global frequency f of 1s occurring in the series. Generate a series of states in which the values occur according to this global frequency, i.e. the value of \(\hat{x}^i_j\) is sampled from a Bernoulli distribution with parameter f.

Both models capture the idea of preserving some statistical properties of the data to be analysed, while providing a baseline for the estimation of the main quantities of interest. In particular, randomness is introduced with the aim of avoiding structure, making it possible to compute the integration and mutual information of a system that has no relevant sets in its dynamics. The difference between the two models is that the first maintains the individual frequency of each variable, while the second only assumes an overall frequency of occurrence of the values and is therefore less accurate than the former.
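A sketch of the two generators (function names are ours; boolean data are assumed, encoded as a 0/1 matrix with one row per time step):

```python
# Model (i) preserves each variable's frequency of 1s; model (ii) only the
# global frequency of 1s.
import numpy as np

def homogeneous_model_i(data, seed=None):
    rng = np.random.default_rng(seed)
    f = data.mean(axis=0)                     # per-column frequency of 1s
    return (rng.random(data.shape) < f).astype(int)

def homogeneous_model_ii(data, seed=None):
    rng = np.random.default_rng(seed)
    f = data.mean()                           # overall frequency of 1s
    return (rng.random(data.shape) < f).astype(int)
```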

We compared the rankings produced by the two models, collecting statistics over 50 independent replicas of the homogeneous system. We first present the results of each model, assessing the robustness against its inherent variance, and subsequently compare the results of the two models.

Table 2. Results attained by using model (i). For each of the first five positions in the ranking, the group occurring most frequently is shown, along with its frequency.

Results for Model (i). Results of model (i) are shown in Table 2, where for each of the first five positions in the ranking, the group occurring most frequently is shown, along with its frequency in that position. The results shown in Table 2 are also confirmed by a statistic on the groups ranked in any of the top five positions: the most frequently occurring groups in each of the top five positions are also those with the highest probability of being ranked among the first five. These results are shown in the appendix (Table 8). The results for Case 1 are sharp, as the two independent groups of variables are always ranked in the two uppermost positions. The following positions in the ranking are occupied by their subsets. Results for Case 2 are also quite clear: the two dependent groups are ranked in the first positions and their interaction is captured by the detection of groups containing variables from both blocks (e.g. group {N5,N8,N9,N10}, ranked in the fourth position 20 % of the times). The dynamics of Case 3 is more complex than that of the previous cases and this is reflected in the rankings returned. In fact, the two blocks no longer emerge as candidate RSs; rather, their parts are ranked high. This phenomenon can be explained by observing that pairs of variables are usually far more integrated than triplets. However, the rankings obtained with model (i) are still able to capture the essence of the system structure. The dependence graph among the variables of Case 4 is rather intricate and so is its dynamics. Nevertheless, the analysis of Case 4 highlights some notable groups of interacting variables, such as {N4,N6} and {N3,N4,N5}. Finally, the results for Case 5 are surprisingly sharp: the two groups of variables (i.e. group A and group B) are identified, along with the two controlled nodes N6 and N7.

Table 3. Results attained by using model (i) after the application of the sieving algorithm. For each of the first five positions in the ranking, the group occurring most frequently is shown, along with its frequency.
Table 4. Statistics of rankings over 50 independent draws of model (i) homogeneous system. For each test case, the mean and standard deviation of the Spearman rank correlation coefficient is shown.

We observe that parts of the same candidate RS are often ranked in the first positions, thus obfuscating the organisation emerging from the analysis. For this reason, the sieving algorithm is applied, providing a clearer picture of the organisation in terms of RSs. An excerpt of the results of the application of the sieving algorithm is reported in Table 3 (see Table 9 in the appendix for complete data). The use of this algorithm makes it possible to clean up the picture of the dynamical organisation of each system and to identify its RSs. As an example, let us consider Case 5: the three main RSs are robustly ranked in the first positions.

The advantage of using the sieving algorithm might be undermined by the variance introduced by the homogeneous system. To estimate this variance, we compared the rankings by means of the Spearman rank correlation coefficient, which is a special case of the general correlation coefficient introduced by Kendall [4]. The coefficient is applied pairwise, considering two rankings r and s:

$$\begin{aligned} \rho _s = 1-\dfrac{6 \sum _{i=1}^n(r_i-s_i)^2}{n(n^2-1)} \end{aligned}$$
(8)

where \(r_i\) and \(s_i\) are the ranks of element i in the two rankings and n is the number of elements. Note that the coefficient is well defined only when the two rankings contain the same elements. By computing \(\rho _s\) for every possible pair of the 50 rankings and taking the average, we obtain the results of Table 4. Indeed, the rankings are rather stable. This observation also suggests the possibility of taking the average rank as the main input for the sieving algorithm, so as to dampen fluctuations due to sampling.
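A quick numerical check of Eq. (8) on two toy rankings (assuming, as required, that both rankings contain the same elements); for tie-free ranks it coincides with scipy.stats.spearmanr:

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_rho(r, s):
    r, s = np.asarray(r), np.asarray(s)
    n = len(r)
    return 1 - 6 * np.sum((r - s) ** 2) / (n * (n ** 2 - 1))

r = [1, 2, 3, 4, 5]
s = [2, 1, 3, 5, 4]
print(spearman_rho(r, s), spearmanr(r, s)[0])   # both 0.8
```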

Results for Model (ii). The results obtained by applying model (ii) for generating the homogeneous system do not significantly differ from those of model (i), both in terms of relative positions and rank correlation coefficient. For this reason we omit them.

Table 5. Comparison of the rankings between the two models. For each test case, the mean and standard deviation of the Spearman rank correlation coefficient is shown.
Table 6. Comparison of the rankings obtained with time series of different lengths. For each test case, the mean and standard deviation of the Spearman rank correlation coefficient is shown.

We conclude by observing that the rankings produced by the two models show negligible differences, as indicated by the average rank correlation coefficients reported in Table 5. The average is computed over all the possible pairs of rankings. This robustness, both within and between the models for generating the homogeneous system, indicates that the application of the DCI method is reliable and stable.

Data Series Length. We also assessed the robustness of the results as a function of both models (i) and (ii) and of the length of the homogeneous data series. We applied the DCI methodFootnote 3 with homogeneous data series of length 1, 5, 10, 20, 25 and 30 times that of the original series. For the sake of brevity, we just report the average rank correlation coefficient computed across all the possible pairs of rankings for all the data series lengths (see Table 6). We observe that the rankings are independent of the data series length. This result suggests that a good practice for the application of the DCI method is to generate a homogeneous system data series of the same length as the original one.

5 Transfer Entropy

Once the RSs have been identified, the information flow among them (or, at least, their correlation) can be quantified by means of TE. The data considered in these experiments are time series of the perturbed time evolution of a discrete system; therefore, the analysis made by means of TE can indeed provide meaningful insights into the information flow among RSs.

Table 7. Transfer entropy T between RSs in the five test cases. The values in the table represent \(T_{Y \rightarrow X}\), where Y is the element in the column and X in the row.

We computed the TE between every pair of RSs. To assess the significance of these values, we compared them with the TE computed over the same time series in which the observations are randomly permuted. Such a time series has the same statistical properties as the original one, except for the causal relations produced by the boolean update functions. For each time series, we generated 50 random shufflings and computed the TE between the RSs identified. These values were then used to compute a p-value assessing the significance of the TE values computed on the original time series. Results are shown in Table 7; the TE values corresponding to a p-value \(\le 0.05\) are in bold. Each entry of the table contains the TE value computed from the group in the column to the one in the row. We observe that the TE analysis captures the structure of the boolean systems: in each of the five cases, the significant values of TE correspond to the pairs of groups that actually exchange information. Notably, the raw value of TE might not be sufficiently informative by itself; indeed, there are low TE values that turn out to be significant (see, e.g. Case 4 in Table 7) and, conversely, non-negligible values that are not significant (see, e.g. Case 2 in Table 7).
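A sketch of this permutation test (our own helper, reusing the transfer_entropy function sketched in Sect. 3.1; x and y are assumed to be NumPy arrays): the same random permutation of the time steps is applied to both series, which preserves their joint statistics but destroys the temporal ordering.

```python
import numpy as np

def te_p_value(y, x, n_shuffles=50, seed=0):
    rng = np.random.default_rng(seed)
    observed = transfer_entropy(y, x)
    null = []
    for _ in range(n_shuffles):
        perm = rng.permutation(len(y))            # joint permutation of time steps
        null.append(transfer_entropy(y[perm], x[perm]))
    # p-value: fraction of surrogates with TE at least as large as observed
    # (the +1 smoothing is our choice, not the paper's)
    p = (np.sum(np.array(null) >= observed) + 1) / (n_shuffles + 1)
    return observed, p
```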

A quantitative comparison among groups of different size can be done by computing a normalised TE. According to [3], the normalised TE (NTE) is defined as:

$$\begin{aligned} NTE(Y \rightarrow X) = \frac{TE(Y \rightarrow X) - TE(Y_S \rightarrow X)}{h_x} \end{aligned}$$
(9)

where \(h_x = -\sum _{X} p(x_{t+1},x_t)~log~p(x_{t+1}|x_t)\) and \(TE(Y_S \rightarrow X)\) is the TE computed on a series obtained by randomly shuffling the observations in the data series, as previously described. The values of NTE are computed 50 times, each time using a different random shuffling of the original data; Table 10 in the appendix reports their mean and standard deviation. It is important to note that these results match quite precisely the functional relations introduced by the boolean functions that drive the dynamics of the system.
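A hedged sketch of Eq. (9), reusing the transfer_entropy function from Sect. 3.1; here we read \(Y_S\) as a shuffled surrogate of Y (one possible interpretation) and estimate \(h_x\) by plug-in frequencies:

```python
import numpy as np
from collections import Counter

def conditional_entropy_rate(x, base=2):
    """h_x = -sum p(x_{t+1}, x_t) log p(x_{t+1} | x_t)."""
    pairs = Counter(zip(x[1:], x[:-1]))
    singles = Counter(x[:-1])
    n = len(x) - 1
    return -sum((c / n) * np.log(c / singles[x0]) / np.log(base)
                for (x1, x0), c in pairs.items())

def normalised_te(y, x, n_shuffles=50, seed=0):
    rng = np.random.default_rng(seed)
    te_null = np.mean([transfer_entropy(rng.permutation(y), x)
                       for _ in range(n_shuffles)])
    return (transfer_entropy(y, x) - te_null) / conditional_entropy_rate(x)
```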

6 Conclusion and Future Work

In this work we have assessed the robustness of the DCI method against the sampling of the homogeneous system. Results show that the method is both robust and reliable. Indeed, such robustness is a requirement for applying the method to the identification of plausible and sound hypotheses on the organisation of a dynamical system. It is important to remark that we are interested in the organisation emerging in a system from its dynamics, rather than in its static relational structure. As ongoing work, we are devising an improvement of the DCI method that makes it possible to extract information on the hierarchical organisation of a complex system, thus not just identifying its RSs and the information flow among them, but also their possibly tangled hierarchical organisation.