1 Introduction

Network analysis is a popular and powerful tool for the modern analysis of complex systems [14, 15]. It is known to be very useful for technological, social, biological, and other complex systems. Nodes (vertices) of the network correspond to the elements of the complex system, and links (edges) of the network correspond to the interactions between elements. A measure of interaction between nodes gives the weights of the links. The resulting weighted graph represents the network model of the complex system. The structure of the network is defined by the data sets used to measure the links. In the present paper we consider network models generated by statistical data sets. Important examples are market networks and brain connectivity networks. The statistical origin of the data generates errors in decisions about network structures. These errors can lead to an erroneous interpretation of the network analysis. To our knowledge, the majority of existing publications in the field do not pay attention to this problem. The main goal of the present paper is to develop a general approach to network analysis of statistical data sets in order to handle the related statistical errors.

The financial market is known to be a complex system. The complexity of the system is reflected in the associated complete weighted graph. The minimum spanning tree (MST) of this graph was studied in [13] to extract the most valuable information from this complex network. This information can be extended with the use of the planar maximally filtered graph (PMFG), as suggested in [23]. Both procedures (MST and PMFG) can be considered as a filtering of a complex graph into a simpler relevant subgraph. Research in this direction is very active nowadays (see for example [24], where the state of the art is given). Another filtering procedure was proposed in [3]. As a result of this procedure a market graph (MG) is constructed. Maximum cliques (MC) and maximum independent sets (MIS) of the market graph give interesting information about financial market structures [4, 5] (for the calculation of MC and MIS see [17, 18]).

The financial market has a large element of randomness. A scientific approach to handling this randomness consists, among others, of the following connected stages:

  • Design of the model of the market network, choice of the filtered structural characteristic (FSC).

  • Identification of FSC from the observations, construction of appropriate statistical procedures.

  • Control of uncertainty of statistical procedures.

It is common knowledge that the prices and returns of stocks on a financial market are modeled by a stochastic process [21]. Complete information about this process is given by the associated probability space \((\varOmega , \mathfrak {I}, P)\). It follows from the Kolmogorov consistency theorem that the process is defined by the collection of its finite-dimensional joint distributions. To model the associated network one has to introduce a measure of interaction between stocks. Any measure of interaction (dependence) between stocks therefore has to be extracted from the joint distributions. This gives rise to the concepts of the true market network and the true FSC. Once the measure of interaction is defined one can go to the next stage: identification of the market network and FSC from observations. This gives rise to the concepts of the sample market network and the sample FSC. Control of uncertainty can then be based on the analysis of the difference between the true and sample market networks and between the true and sample FSC.

In the present paper we develop a general approach that generalizes some ideas from [1, 2, 6–9]. First, we propose a general approach to designing different models of the market network, based on the idea of a measure of association introduced in [10] and developed in [11]. We show that existing network models [13, 16, 19] can be obtained from this approach. Next, we show that statistical multiple decision theory is an appropriate theoretical basis for the identification of filtered structural characteristics (FSC). Finally, we introduce the conditional risk as a natural measure of quality in market network analysis.

The paper is organized as follows. In Sect. 2 we describe a class of measures of dependence that we call measures of association. In Sect. 3 we discuss the identification problem for filtered structural characteristics (FSC). In Sect. 4 we put market network analysis in the framework of multiple decision theory. In Sect. 5 we discuss the conditional risk as a measure of quality in market network analysis and give some illustrative examples.

2 Measures of Association

Many measures of dependence between two random variables have been proposed in the literature: Pearson correlation, Kruskal correlation, Kendall correlation, Spearman correlation, Fechner correlation and others [22]. Many of them can be put in the framework of the general concept proposed in [11]. According to Lehmann, random variables \(X,Y\) are positively dependent if

$$\begin{aligned} P(X \le x, Y \le y)\ge P(X\le x)P(Y \le y), \text{ for } \text{ all } \ \ (x,y) \in R^2 \end{aligned}$$
(1)

In terms of the joint distribution function this reads

$$\begin{aligned} F_{X, Y}(x,y)- F_{X}(x)F_{Y}(y)\ge 0, \text{ for } \text{ all } \ \ (x,y) \in R^2 \end{aligned}$$
(2)

Similarly, \(X,Y\) are negatively dependent if (1), (2) hold with the inequality sign reversed. The definition of positive dependence compares the probability of the product of events with the product of the probabilities of the events, in the sense that small values of \(Y\) tend to be associated with small values of \(X\) and (see below) large values of \(Y\) with large values of \(X\). Dependence measures based on this comparison will be called measures of association in this paper. In particular, the covariance between two random variables is a measure of association, as follows from the Hoeffding formula [11]:

$$\begin{aligned} \mathrm {Cov}(X, Y)= \int _{-\infty }^{\infty }\int _{-\infty }^{\infty }[F_{X, Y}(x,y)- F_{X}(x)F_{Y}(y)]dxdy \end{aligned}$$
(3)

It implies that if two random variables are positively dependent then their covariance, and therefore the Pearson correlation between them, is nonnegative. The converse is known to be true for a normal vector \((X,Y)\) [11]. This means that in the normal case nonnegativity of the correlation implies positive dependence of the random variables. This gives a strong additional justification for the use of Pearson correlation as a measure of dependence in the normal case.

The condition (1) is equivalent to any of the following conditions

$$ P(X \le x, Y \ge y)\le P(X\le x)P(Y \ge y), \text{ for } \text{ all } \ \ (x,y) \in R^2 $$
$$ P(X \ge x, Y \le y)\le P(X\ge x)P(Y \le y), \text{ for } \text{ all } \ \ (x,y) \in R^2 $$
$$ P(X \ge x, Y \ge y)\ge P(X\ge x)P(Y \ge y), \text{ for } \text{ all } \ \ (x,y) \in R^2 $$

Therefore if two variables \(X,Y\) are positively dependent then for any \(x,y \in R\) one has

$$ P( (X-x)(Y-y) >0) - P((X-x)(Y-y)<0) \ge 0 $$

This observation produces a family of different measures of association \(q(x,y)\):

$$\begin{aligned} q_{X,Y}(x,y)=P( (X-x)(Y-y) >0) - P((X-x)(Y-y)<0) \end{aligned}$$
(4)

For example, if \(x=\text{ med }(X)\), \(y=\text{ med }(Y)\) then one obtains the \(q\)-measure of association of Kruskal (the simplest measure of association in Kruskal's terminology). If \(x=E(X)\), \(y=E(Y)\) then one gets the sign correlation of Fechner [22]. In addition, as was proven by Lehmann, if two random variables are positively dependent then their Kendall and Spearman correlations are positive. Therefore measures of association constitute a large family of measures of dependence between two random variables. In what follows we will use the notation \(\gamma _{X,Y}\) for any measure of association for two random variables \(X\) and \(Y\).
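As a minimal sketch, the family (4) can be estimated from paired observations by replacing the probabilities with sample frequencies and the reference point \((x,y)\) with sample medians or sample means. The function names and the plug-in choice of reference point are illustrative assumptions, not part of the original text.

```python
from statistics import mean, median

def q_measure(xs, ys, x0, y0):
    """Sample analogue of q_{X,Y}(x0, y0) in (4).

    Frequency of (X - x0)(Y - y0) > 0 minus frequency of (X - x0)(Y - y0) < 0.
    """
    n = len(xs)
    pos = sum(1 for x, y in zip(xs, ys) if (x - x0) * (y - y0) > 0)
    neg = sum(1 for x, y in zip(xs, ys) if (x - x0) * (y - y0) < 0)
    return (pos - neg) / n

def kruskal_q(xs, ys):
    # Reference point: sample medians (plug-in for med(X), med(Y)).
    return q_measure(xs, ys, median(xs), median(ys))

def fechner_sign(xs, ys):
    # Reference point: sample means (plug-in for E(X), E(Y)).
    return q_measure(xs, ys, mean(xs), mean(ys))

# Toy data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.5, 1.0, 3.5, 3.0, 6.5, 5.0]
print(kruskal_q(xs, ys), fechner_sign(xs, ys))
```

Both estimates depend on the data only through the signs of the centered observations, which is why they are insensitive to the magnitude of extreme values.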

3 Identification Problem in Market Network Analysis

We model the financial market as a family of random variables \(X_{i}(t)\), where \(i = 1,2, \ldots ,N\), \(t =1, 2, \ldots ,n\). In this setting \(N\) is the number of stocks and \(n\) is the number of observations. The random variable \(X_{i}(t)\) for fixed \(i\), \(t\) describes the behavior of some numerical characteristic (price, return, volume and so on) of stock \(i\) at the moment \(t\). For a fixed \(i\) the sequence of random variables \((X_i(1), X_i(2), \ldots , X_i(n))\) describes the behavior of stock \(i\) over time. We assume that for a fixed \(i\) the random variables \(X_i(t)\) are independent and identically distributed as \(X_i\). This assumption is valid for stock returns and many other stock characteristics. The random vector \(X=(X_1, X_2,\ldots , X_N)\) gives a complete description of the market for the given numerical characteristic.

In this paper we consider only market network models based on the pairwise dependence of stocks. The nodes of the network are the stocks of the market, and the weight of the link between stocks \(i\) and \(j\), \(i \ne j\), is given by a measure of association \(\gamma _{i,j}\) for the random variables \(X_i\) and \(X_j\): \(\gamma _{i,j}=\gamma (X_i,X_j)\). We call the resulting network the true market network with measure of association \(\gamma \). For a given structural characteristic \(S\) (MST, PMFG, MG, MC, MIS and others) the true characteristic is obtained by filtration of the true market network. In general, the measure of association \(\gamma \) has to reflect the dependence between the random variables associated with the stocks. The choice of the measure of association is therefore connected with the joint distribution of the vector \((X_1, X_2,\ldots , X_N)\). The most popular measure of association used in the literature is Pearson correlation. Pearson correlation is known to be the most appropriate measure of association when the vector \((X_1, X_2,\ldots , X_N)\) has a multivariate normal distribution. When the distribution of this vector is not known one needs a more universal measure of association that is not tied to the form of the distribution. One such measure of association is the q-measure of Kruskal.

In practice, however, market networks are constructed from statistical data sets of observations. Let \(x_{i}(t)\) be an observation of the random variable \(X_i(t)\), \(i = 1,2, \ldots ,N\), \(t =1, 2, \ldots ,n\). For a given structural characteristic \(S\) (MST, PMFG, MG, MC, MIS and others) the main problem is to identify the true characteristic (associated with the true market network) from the observations. The traditional way of doing this in the literature can be described as follows: first one computes estimates \(\hat{\gamma }_{i,j}\) of the measures of association \(\gamma _{i,j}\); next one constructs the sample network as the weighted complete graph whose nodes are the stocks of the market and whose link weights are given by \(\hat{\gamma }_{i,j}\). Finally, the structural characteristic \(S\) is identified on the sample market network by the same filtration process as on the true market network. The described identification process can be considered as a statistical procedure for the identification of \(S\). But it is not the only statistical procedure that can be considered for the identification of \(S\). Moreover, it is not clear whether this procedure is the best possible, or even whether it is good from a statistical point of view. This question is crucial in our investigation.
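The traditional construction of the sample network can be sketched as follows. This is a minimal illustration that assumes Pearson correlation as the estimator \(\hat{\gamma }_{i,j}\) and stores the complete weighted graph as a dictionary keyed by pairs \((i,j)\), \(i<j\); the names `pearson` and `sample_network` are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Classical sample Pearson correlation of two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def sample_network(series, estimator=pearson):
    """Complete weighted graph: one estimated weight per pair i < j.

    series[i] holds the n observations x_i(1), ..., x_i(n) of stock i.
    """
    n_stocks = len(series)
    return {(i, j): estimator(series[i], series[j])
            for i, j in combinations(range(n_stocks), 2)}

# Toy observations for N = 3 stocks and n = 4 time points (invented data).
series = [[1.0, 2.0, 3.0, 4.0],
          [2.0, 4.0, 6.0, 8.0],
          [4.0, 3.0, 2.0, 1.0]]
network = sample_network(series)
print(network)
```

Any other estimator of a measure of association can be passed in place of `pearson`; the filtration step is then applied to the returned weights.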

4 Multiple Decision Theory

To answer the question above and define optimal statistical procedures for the identification of structural characteristics one needs to formulate the problem in the framework of mathematical statistics. Identification of a given structural characteristic (MST, PMFG, MG, MC, MIS and others) is equivalent to the selection of one particular structural characteristic from a finite family of possible ones. Any statistical identification procedure is therefore a multiple decision statistical procedure. Multiple decision theory is nowadays one of the active branches of mathematical statistics [12, 20]. In the framework of this theory the problem of identification of the FSC can be presented as follows. One has \(L\) hypotheses \(H_1, H_2, \ldots , H_L\) corresponding to the family of possible subgraphs associated with the FSC. A multiple decision statistical procedure \(\delta (x)\) is a map from the sample space of observations \(R^{N \times n}=\{x_i(t): \ i=1,2,\ldots ,N; \ t=1,2,\ldots ,n\}\) to the decision space \(D=\{(d_1,d_2,\ldots ,d_L)\}\), where \(d_j\) is the decision to accept the hypothesis \(H_j\), \(j=1,2,\ldots ,L\). According to Wald [25], the quality of the multiple decision statistical procedure \(\delta (x)\) is measured by its conditional risk. In our case the conditional risk \(R(H_k,\delta )\) can be written as

$$ R(H_k,\delta )=\sum _{j=1}^L w_{k,j} P_k(\delta (x)=d_j) $$

where \(w_{k,j}\) is the loss from the decision \(d_j\) when the true decision is \(d_k\), \(w_{k,k}=0\), and \(P_k(\delta (x)=d_j)\) is the probability of taking the decision \(d_j\) when the true decision is \(d_k\). The conditional risk can be used to compare different multiple decision statistical procedures for structural characteristic identification [7], and it is an appropriate measure of the statistical uncertainty of structural characteristics [6].
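The risk formula above is a weighted sum of decision probabilities, which the following sketch evaluates for a single true hypothesis \(H_k\). The loss row and the probabilities are invented numbers chosen only to make the arithmetic visible.

```python
def conditional_risk(losses_k, probs_k):
    """R(H_k, delta) = sum_j w_{k,j} * P_k(delta(x) = d_j).

    losses_k[j] is w_{k,j}; probs_k[j] is P_k(delta(x) = d_j).
    """
    if len(losses_k) != len(probs_k):
        raise ValueError("one loss and one probability per decision")
    return sum(w * p for w, p in zip(losses_k, probs_k))

# Hypothetical case with L = 3 decisions, true hypothesis H_1 (index 0).
losses = [0.0, 1.0, 2.0]   # w_{1,1} = 0: the correct decision costs nothing
probs = [0.7, 0.2, 0.1]    # how often the procedure takes each decision
risk = conditional_risk(losses, probs)
print(risk)
```

Here the risk is \(1 \cdot 0.2 + 2 \cdot 0.1 = 0.4\): the correct decision contributes nothing because \(w_{k,k}=0\).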

Example 1. Market graph. For a given threshold value \(\gamma _0\) the market graph [3] is obtained from the complete weighted graph (market network) by eliminating all edges with \(\gamma _{i,j} \le \gamma _0\), where \(\gamma _{i,j}\) is the measure of association between stocks \(i\) and \(j\). In this case the set of hypotheses is

$$\begin{aligned} \begin{array}{l} H_{1}:\gamma _{i, j} \le \gamma _{0},\forall (i, j), \ i < j,\\ H_{2}:\gamma _{12}>\gamma _{0},\gamma _{i, j}\le \gamma _{0},\forall (i, j)\ne (1,2), \ i < j, \\ H_{3}:\gamma _{12}>\gamma _{0},\gamma _{13}>\gamma _{0},\gamma _{i, j} \le {\gamma _{0}},\forall (i, j)\ne (1,2),(i, j)\ne (1,3),\\ \ldots \\ H_{L}:\gamma _{i, j}> \gamma _{0},\forall (i, j), \ i < j, \end{array} \end{aligned}$$
(5)

where \( L=2^M \) with \(M=N(N-1)/2 \). These hypotheses describe all possible market graphs. To identify the true market graph one needs to construct a multiple decision statistical procedure \(\delta (x)\) which will select one hypothesis from the set \(H_1, H_2, \ldots , H_L\).
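The thresholding rule of Example 1 and the count \(L=2^M\) can be sketched as below. The edge weights are invented, and the function names are hypothetical.

```python
def market_graph(weights, gamma0):
    """Keep edge (i, j) only if its association exceeds the threshold.

    Edges with weight <= gamma0 are eliminated, as in Example 1.
    """
    return {edge for edge, g in weights.items() if g > gamma0}

def number_of_market_graphs(n_stocks):
    """L = 2^M with M = N(N-1)/2: each pair is either an edge or not."""
    m = n_stocks * (n_stocks - 1) // 2
    return 2 ** m

# Hypothetical associations for N = 3 stocks.
weights = {(0, 1): 0.8, (0, 2): 0.3, (1, 2): 0.5}
print(market_graph(weights, gamma0=0.5))
print(number_of_market_graphs(3))
```

The number of hypotheses grows very quickly with \(N\), so a procedure \(\delta (x)\) cannot enumerate them in practice and has to decide edge by edge or by some other structured rule.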

Example 2. Minimum spanning tree (MST). The minimum spanning tree [13] is the spanning tree of the complete weighted graph (market network) with the maximal total association over the included edges. In this case one has \(L=N^{N-2}\) by Cayley's formula, and each hypothesis \(H_s\) can be associated with a multi-index \(s=(s_1,s_2,\ldots ,s_N, s_{N+1}, \ldots , s_{2N})\), \(s_j \in \{0,1\}\) (tree code).
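A minimal sketch of Example 2, assuming the Kruskal algorithm with edges processed from the strongest association to the weakest and a simple union-find structure to avoid cycles. The weights are invented and the function names are hypothetical.

```python
def max_association_spanning_tree(n_nodes, weights):
    """Kruskal algorithm on edges sorted by decreasing association."""
    parent = list(range(n_nodes))

    def find(u):
        # Follow parent links to the root, shortening the path on the way.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    ordered = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    for (i, j), _ in ordered:
        ri, rj = find(i), find(j)
        if ri != rj:            # edge joins two components: no cycle
            parent[ri] = rj
            tree.append((i, j))
    return sorted(tree)

def cayley_count(n_nodes):
    """Number of labeled spanning trees of the complete graph: N^(N-2)."""
    return n_nodes ** (n_nodes - 2)

# Hypothetical associations for N = 4 stocks.
weights = {(0, 1): 0.9, (0, 2): 0.2, (0, 3): 0.4,
           (1, 2): 0.7, (1, 3): 0.1, (2, 3): 0.6}
tree = max_association_spanning_tree(4, weights)
print(tree, cayley_count(4))
```

For these weights the three strongest edges already connect all four nodes, so the tree is the path 0-1-2-3, one of the 16 possible hypotheses.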

5 Conditional Risk

There are many ways to define the losses \(w_{k,j}\) and associated conditional risk. For example for a given structural characteristic \(S\) one can define a conditional risk by

$$\begin{aligned} R(S,\delta ) = \sum _{1 \le i < j \le N} [a_{i,j}P_{i,j}^a(S,\delta ) + b_{i,j}P_{i,j}^b(S,\delta )], \end{aligned}$$
(6)

where \(a_{i,j}\) is the loss from the erroneous inclusion of the edge \((i,j)\) in the structure \(S\), \(P_{i,j}^a(S,\delta )\) is the probability that the decision procedure \(\delta \) takes this decision, \(b_{i,j}\) is the loss from the erroneous non-inclusion of the edge \((i,j)\) in the structure \(S\), and \(P_{i,j}^b(S,\delta )\) is the probability that the decision procedure \(\delta \) takes this decision. The two terms in (6) can be considered as type I and type II statistical errors [12]. The value of the conditional risk \(R(S,\delta )\) depends essentially on the choice of the measure of association \(\gamma \), the distribution of the random vector \(X=(X_1, X_2, \ldots , X_N)\), the structural characteristic \(S\), the multiple decision statistical procedure \(\delta (x)\) used for structural characteristic identification, and the number of observations \(n\). To illustrate this dependence we present below some results of numerical experiments for the MST on the US stock market with \(N=100\), \(a_{i,j}=b_{i,j}=1/2\). The experiments show some intriguing properties of the associated conditional risk. Figure 1 shows the behavior of the conditional risk for Pearson correlation, two types of distributions (multivariate normal and Student distributions) and different numbers of observations. Figure 2 shows the behavior of the conditional risk for Kruskal correlation, the same types of distributions and different numbers of observations. In both cases the multiple decision statistical procedure is the Kruskal algorithm applied to the sample network (we use the classical estimates of the Pearson and Kruskal correlations).

Fig. 1. Conditional risk as a function of the number of observations for Pearson correlation. Solid line corresponds to the normal distribution. Dashed line corresponds to the Student distribution.

Fig. 2. Conditional risk as a function of the number of observations for Kruskal correlation. Solid line corresponds to the normal distribution. Dashed line corresponds to the Student distribution.
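The risk (6) for the MST can be approximated by simulation when the true network is known. The sketch below is not the experiment reported here: it assumes a hypothetical one-factor Gaussian model \(X_i = \beta _i F + \varepsilon _i\) with invented loadings in place of the US market covariance matrix, Pearson correlation as the measure of association, and the Kruskal algorithm on the sample network as the procedure \(\delta \).

```python
import random
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Classical sample Pearson correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def spanning_tree(n_nodes, weights):
    """Kruskal algorithm, strongest associations first; returns an edge set."""
    parent = list(range(n_nodes))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = set()
    for (i, j), _ in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.add((i, j))
    return tree

def estimate_risk(loadings, n_obs, n_trials=200, a=0.5, b=0.5, seed=0):
    """Monte Carlo estimate of (6) with a_{i,j} = a and b_{i,j} = b."""
    rng = random.Random(seed)
    n = len(loadings)
    pairs = list(combinations(range(n), 2))
    # True correlations implied by X_i = loadings[i] * F + eps_i, unit variances.
    true_w = {(i, j): loadings[i] * loadings[j]
              / sqrt((loadings[i] ** 2 + 1) * (loadings[j] ** 2 + 1))
              for i, j in pairs}
    true_tree = spanning_tree(n, true_w)
    total = 0.0
    for _ in range(n_trials):
        cols = [[] for _ in range(n)]
        for _ in range(n_obs):
            f = rng.gauss(0.0, 1.0)
            for i, beta in enumerate(loadings):
                cols[i].append(beta * f + rng.gauss(0.0, 1.0))
        sample_w = {(i, j): pearson(cols[i], cols[j]) for i, j in pairs}
        sample_tree = spanning_tree(n, sample_w)
        # a * (wrongly included edges) + b * (wrongly omitted edges)
        total += a * len(sample_tree - true_tree) + b * len(true_tree - sample_tree)
    return total / n_trials

loadings = [3.0, 1.0, 0.8, 0.6]      # invented factor loadings, N = 4
print(estimate_risk(loadings, n_obs=10), estimate_risk(loadings, n_obs=400))
```

Since the sample tree and the true tree both have \(N-1\) edges, the numbers of wrongly included and wrongly omitted edges coincide, so with \(a_{i,j}=b_{i,j}=1/2\) the risk equals the expected number of wrong edges in the sample MST.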

Table 1. Conditional risk for MST: Pearson correlation
Table 2. Conditional risk for MST: Kruskal correlation

Figure 1 shows that the conditional risk for Pearson correlation depends strongly on the type of distribution: Pearson correlation is a good measure of association for the normal distribution but not for the Student distribution. Figure 2 shows that the conditional risk for Kruskal correlation is stable with respect to the type of distribution. At the same time, Kruskal correlation is a more appropriate measure of association than Pearson correlation for the Student distribution. This suggests using the Kruskal measure of association for distributions with fat tails.

The values of the conditional risk for different distributions and numbers of observations are presented in Table 1 (Pearson correlation) and Table 2 (Kruskal correlation). All multivariate distributions in the tables have the same covariance matrix \(\varSigma \) (the covariance matrix of the 100 stocks of the US stock market) and are obtained by the transformation \(X=\varSigma ^{1/2}Z\), where \(Z=(Z_1,Z_2,\ldots ,Z_N)\) is a vector of normalized independent random variables with the same univariate distribution. The univariate distributions are the normal, truncated normal, uniform (platykurtic), a distribution with two modes (bimodal), a discrete distribution with 2 values (stable trend rare risk) and the Student distribution with 3 degrees of freedom. A detailed description of these distributions is given in [1]. Tables 1 and 2 confirm the stability of the conditional risk for Kruskal correlation. A comparative analysis of the conditional risk for Pearson and sign correlations in market graph construction is given in [1], where some interesting observations are described. The problem of optimality of multiple decision statistical procedures for market graph identification is discussed in [7]. It was proven in [7] that it is possible to construct statistical procedures with lower conditional risk than the statistical procedure based on the sample graph that is widely used in the literature. The dependence of the conditional risk on the filtered structural characteristic is investigated in [6].
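The construction of a random vector with a prescribed covariance matrix from independent normalized components can be sketched as follows. As an assumption made for simplicity, the sketch uses the lower-triangular Cholesky factor \(A\) with \(AA^{T}=\varSigma \) in place of the symmetric square root \(\varSigma ^{1/2}\); both give covariance \(\varSigma \), but for non-normal \(Z\) the resulting joint distributions need not coincide. The 2x2 matrix and the function names are invented, and only three of the univariate distributions are shown.

```python
import random
from math import sqrt

def cholesky(sigma):
    """Lower-triangular A with A A^T = sigma (sigma symmetric positive definite)."""
    n = len(sigma)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = sqrt(sigma[i][i] - s)
            else:
                low[i][j] = (sigma[i][j] - s) / low[j][j]
    return low

def standardized_draw(kind, rng):
    """One draw with mean 0 and variance 1 from the chosen univariate law."""
    if kind == "normal":
        return rng.gauss(0.0, 1.0)
    if kind == "uniform":
        # Uniform on [-sqrt(3), sqrt(3)] has variance 1.
        return rng.uniform(-sqrt(3.0), sqrt(3.0))
    if kind == "student3":
        # t_3 = N(0,1) / sqrt(chi2_3 / 3); Var(t_3) = 3, so divide by sqrt(3).
        chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3))
        return rng.gauss(0.0, 1.0) / sqrt(chi2 / 3.0) / sqrt(3.0)
    raise ValueError(kind)

def simulate(sigma, kind, n_obs, seed=0):
    """n_obs independent rows X = A Z with Cov(X) = sigma."""
    rng = random.Random(seed)
    low = cholesky(sigma)
    n = len(sigma)
    rows = []
    for _ in range(n_obs):
        z = [standardized_draw(kind, rng) for _ in range(n)]
        rows.append([sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)])
    return rows

sigma = [[4.0, 2.0], [2.0, 3.0]]     # invented covariance matrix
sample = simulate(sigma, "student3", n_obs=5)
print(cholesky(sigma))
```

Changing `kind` while keeping `sigma` fixed reproduces the design described above: the covariance structure stays the same and only the shape of the univariate distribution varies.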

6 Concluding Remark

The general approach to market network analysis for statistical data sets gives an appropriate theoretical basis for the investigation of different market network models. It makes it possible to design statistical procedures of good quality for the identification of structural characteristics of the network.