Introduction. The formation of sets of independent random variables is necessary for reducing the dimension of information processing problems and for synthesizing effective decision-making algorithms. A priori information about the independence of random variables makes it possible to improve the approximation properties of a nonparametric probability density estimate in comparison with kernel statistics for dependent variables [1,2,3]. The results obtained in these works are confirmed by a study of the asymptotic properties of a nonparametric kernel-type estimate of the equation of a separating surface in the pattern recognition problem [4].

The traditional method for testing the hypothesis of the independence of random variables is based on the Pearson χ2 test. However, its application includes a stage that is difficult to formalize (dividing the range of values of the random variables into multidimensional intervals) and requires large volumes of initial statistical data, which is associated with the transition from continuous to discrete random variables [5]. Methods for discretizing the interval of values of a one-dimensional random variable are considered in [6,7,8]. In [7, 9], formulas for discretizing the range of values of a multidimensional random variable with a normal distribution law are given, obtained from an analysis of the asymptotic properties of the histogram. The work [10] substantiates a method for optimal discretization of the range of values of a multidimensional random variable. This technique is consistent with the formula for choosing the number of sampling intervals for a one-dimensional random variable with a uniform distribution law (the Heyhold–Gaede formula) [11], and its implementation requires estimating the integral of the square of a multidimensional probability density. The first positive results in estimating this functional of the probability density were obtained in [12, 13]. Therefore, it was necessary to develop a new method for testing the hypothesis under consideration that avoids the problem of decomposing the domain of values of the random variables. A similar problem is solved when testing the hypothesis of the identity of the distribution laws of random variables on the basis of a nonparametric pattern recognition algorithm [14,15,16]. These works show that the problem of testing a hypothesis about the distributions of random variables can be replaced by the problem of testing a hypothesis about the equality of the pattern recognition error to a certain threshold value.

The purpose of this article is to develop a methodology for the formation of sets of independent components of a multidimensional random variable using a nonparametric algorithm for pattern recognition of kernel type that meets the criterion of maximum likelihood.

Testing the hypothesis of the independence of the components of a two-dimensional random variable. Let there be a sample \( V=\left({x}_1^i,{x}_2^i,i=1,\dots, n\right) \) of volume n formed from independent observations of a two-dimensional random variable x = (x1, x2). Observations x are extracted from general populations characterized by unknown probability densities p(x1)p(x2) or p(x1, x2). It is necessary to check the hypothesis of the independence of the random variables x1, x2:

$$ {H}_0:p\left({x}_1,{x}_2\right)\equiv p\left({x}_1\right)p\left({x}_2\right). $$
(1)

To test the hypothesis H0 (1), we will solve a two-alternative pattern recognition problem. The classes Ω1, Ω2 correspond to the domains of definition of the probability densities p(x1)p(x2), p(x1, x2). Under these conditions, the Bayesian decision rule corresponding to the criterion of maximum likelihood has the form

$$ m(x):\left\{\begin{array}{l}x\in {\Omega}_1\kern0.5em \mathrm{if}\kern0.5em p\left({x}_1,{x}_2\right)<p\left({x}_1\right)p\left({x}_2\right);\\ {}x\in {\Omega}_2\kern0.5em \mathrm{if}\kern0.5em p\left({x}_1,{x}_2\right)\ge p\left({x}_1\right)p\left({x}_2\right).\end{array}\right. $$
(2)

In contrast to the traditional formulation of the pattern recognition problem, in the synthesis of decision rule (2) under conditions of initial uncertainty no training sample is explicitly available. The probability densities p(x1)p(x2), p(x1, x2) are estimated from the sample V. For this, nonparametric probability density estimates of the Rosenblatt–Parzen type are used [1,2,3, 17, 18]:

$$ \overline{p}\left({x}_1\right)\overline{p}\left({x}_2\right)=\frac{1}{n^2{c}_1{c}_2}\sum \limits_{i=1}^n\sum \limits_{j=1}^n\Phi \left(\frac{x_1-{x}_1^i}{c_1}\right)\Phi \left(\frac{x_2-{x}_2^i}{c_2}\right); $$
(3)
$$ \overline{p}\left({x}_1,{x}_2\right)=\frac{1}{nc_1{c}_2}\sum \limits_{i=1}^n\Phi \left(\frac{x_1-{x}_1^i}{c_1}\right)\Phi \left(\frac{x_2-{x}_2^i}{c_2}\right), $$
(4)

where c1, c2 are the blur coefficients of the kernel functions Φ(uv), v = 1, 2.
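As a concrete illustration, the kernel statistics (3), (4) can be sketched in Python with a Gaussian kernel (the specific kernel is an assumption made here for illustration; any Φ satisfying the conditions listed below is admissible):

```python
import numpy as np

def gauss(u):
    """Gaussian kernel, one admissible choice for Phi."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def product_density(x1, x2, s1, s2, c1, c2):
    """Estimate p(x1)p(x2) via Eq. (3): the double sum over i, j
    factors into the product of two marginal Rosenblatt-Parzen
    estimates built from the samples s1, s2."""
    n = len(s1)
    p1 = gauss((x1 - s1) / c1).sum() / (n * c1)
    p2 = gauss((x2 - s2) / c2).sum() / (n * c2)
    return p1 * p2

def joint_density(x1, x2, s1, s2, c1, c2):
    """Estimate p(x1, x2) via Eq. (4): a single sum over the paired
    observations (s1[i], s2[i])."""
    n = len(s1)
    k = gauss((x1 - s1) / c1) * gauss((x2 - s2) / c2)
    return k.sum() / (n * c1 * c2)
```

Note that (3) is written as a double sum divided by n², which is algebraically identical to the product of the two one-dimensional estimates computed above.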

In statistics (3) and (4), the kernel functions Φ(uv) satisfy the conditions

$$ 0\le \Phi \left({u}_v\right)<\infty; \kern1em \Phi \left({u}_v\right)=\Phi \left(-{u}_v\right);\kern1em \int_{-\infty}^{+\infty}\Phi \left({u}_v\right){du}_v=1;\kern1em \int_{-\infty}^{+\infty }{u}_v^m\Phi \left({u}_v\right){du}_v<\infty; \kern1em 0\le m<\infty; \kern1em v=1,2. $$

The values of the kernel function blur coefficients c1, c2 decrease with an increase in the sample volume n of statistical data V.

The nonparametric statistics (3), (4) are asymptotically unbiased and consistent [1, 2, 18]. It was found that the minimum mean squared deviation of \( \overline{p}\left({x}_1\right)\overline{p}\left({x}_2\right) \) from p(x1)p(x2) tends to zero in proportion to n–4/5 as the volume n of initial statistical data increases. The order of convergence of the nonparametric estimate \( \overline{p}\left({x}_1,{x}_2\right) \) to the probability density p(x1, x2) is lower and amounts to n–4/6. Thus, a priori information on the independence of random variables increases the order of convergence of a nonparametric kernel-type probability density estimate.

Taking into account expressions (2)–(4), we write the nonparametric decision rule for the classification of random variables x = (x1, x2) as

$$ \overline{m}(x):\left\{\begin{array}{l}x\in {\Omega}_1\kern0.5em \mathrm{if}\kern0.5em \overline{p}\left({x}_1,{x}_2\right)<\overline{p}\left({x}_1\right)\overline{p}\left({x}_2\right);\\ {}x\in {\Omega}_2\kern0.5em \mathrm{if}\kern0.5em \overline{p}\left({x}_1,{x}_2\right)\ge \overline{p}\left({x}_1\right)\overline{p}\left({x}_2\right).\end{array}\right. $$
(5)

In the modification of the nonparametric algorithm for pattern recognition (5), the optimal blur coefficients c1, c2 of the kernel estimates of the probability densities \( \overline{p}\left({x}_1\right)\overline{p}\left({x}_2\right) \) and \( \overline{p}\left({x}_1,{x}_2\right) \) are selected based on the results of the analysis of their approximation properties.

The optimal value of the blur coefficient cv of the kernel functions of the nonparametric estimate of the one-dimensional probability density, under the condition of the minimum mean squared deviation of \( \overline{p}\left({x}_v\right) \) from p(xv), is determined by the equation [18]:

$$ {c}_v^{\ast }={\left[{\left\Vert \Phi (u)\right\Vert}^2/\left(n{\left\Vert {p}^{(2)}\left({x}_v\right)\right\Vert}^2\right)\right]}^{1/5}. $$
(6)

Here p(2)(xv) is the second derivative of the probability density p(xv) with respect to xv;

$$ {\left\Vert \Phi (u)\right\Vert}^2=\int_{-\infty}^{\infty }{\Phi}^2(u) du;\kern1em {\left\Vert {p}^{(2)}\left({x}_v\right)\right\Vert}^2=\int_{-\infty}^{\infty }{\left({p}^{(2)}\left({x}_v\right)\right)}^2{dx}_v. $$

After simple transformations, we represent Eq. (6) in the form

$$ {c}_v^{\ast }={\upbeta}_v{\sigma}_v{n}^{-1/5}, $$

where σv is the standard deviation of the random variable xv;

$$ {\upbeta}_v={\left[{\left\Vert \Phi (u)\right\Vert}^2/\left({\upsigma}_v^5{\left\Vert {p}^{(2)}\left({x}_v\right)\right\Vert}^2\right)\right]}^{1/5}. $$

The component of the functional βv, defined by the expression

$$ {\uplambda}_v={\upsigma}_v^5{\left\Vert {p}^{(2)}\left({x}_v\right)\right\Vert}^2, $$

is a constant for a number of unimodal probability densities. The value λv is determined by the form of the probability density and does not depend on its parameters [19,20,21,22,23,24,25,26,27].
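The constants entering Eq. (6) can be checked numerically for a simple reference case. The sketch below (an illustrative computation, not part of the authors' lognormal study) evaluates ‖Φ(u)‖² and ‖p(2)(xv)‖² for a standard normal density with a Gaussian kernel; the resulting β = (4/3)^{1/5} ≈ 1.06 is the well-known Silverman constant:

```python
import numpy as np

# Fine grid for accurate trapezoidal integration.
x = np.linspace(-10.0, 10.0, 200001)
phi = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)   # standard normal density
phi2 = (x**2 - 1.0) * phi                          # its second derivative

norm_K = np.trapz(phi**2, x)     # ||Phi(u)||^2  = 1/(2*sqrt(pi))
norm_p2 = np.trapz(phi2**2, x)   # ||p^(2)||^2   = 3/(8*sqrt(pi))

# Eq. for beta_v with sigma_v = 1 (standard normal):
beta = (norm_K / norm_p2) ** 0.2
print(round(beta, 4))            # ~1.0592, i.e. (4/3)^(1/5)
```

This agrees with the familiar rule of thumb c* ≈ 1.06 σ n^{−1/5} for near-normal data.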

According to the data of computational experiments for a family of one-dimensional lognormal distribution laws, the authors of this article have determined the parameter estimates

$$ \overline{\upbeta_v}=1.49\exp \left(-0.589{\overline{\upalpha_v}}^{0.857}\right)+0.021, $$

where \( {\overline{\upeta}}_v \) and \( {\overline{\updelta}}_v \) are the estimates of the coefficients of counter-excess and asymmetry of the random variable xv, v = 1, 2, respectively, and \( {\overline{\upalpha}}_v={\left({\overline{\updelta}}_v^2+{\overline{\upeta}}_v^2\right)}^{1/2} \).

The optimal kernel function blur coefficient in the statistic \( \overline{p}\left({x}_v\right) \) is estimated by the formula

$$ {\overline{c}}_v^{\ast }={\overline{\upbeta}}_v{\overline{\upsigma}}_v{n}^{-1/5},\kern1em v=1,2. $$
(7)

According to the proposed methodology and taking into account the results of research [27], for a two-dimensional random variable x = (x1, x2), the estimates of the optimal blur coefficients of the kernel statistic \( \overline{p}\left({x}_1,{x}_2\right) \) are determined by the expression

$$ {\overline{\overline{c}}}_v^{\ast }=\overline{\upbeta}{\overline{\upsigma}}_v{n}^{-1/6},\kern1em v=1,2. $$
(8)

In Eq. (8), the parameter \( \overline{\upbeta} \) takes on the value

$$ \overline{\upbeta}=1.498\exp \left(-0.524{\overline{\upalpha}}^{0.809}\right)+0.0356, $$

where

$$ \overline{\upalpha}={\left({\overline{\updelta}}_1^2+{\overline{\updelta}}_2^2+{\overline{\upeta}}_1^2+{\overline{\upeta}}_2^2\right)}^{1/2}. $$

The obtained estimates for the functionals βv, v = 1, 2, and β refine the research results [27,28,29].

Let us estimate the probabilities ρ1, ρ2 of pattern recognition errors for the nonparametric decision rule (5) from the statistical data V. In doing so, we take into account the estimates (7), (8) of the blur coefficients of the kernel statistics \( \overline{p}\left({x}_1\right)\overline{p}\left({x}_2\right),\overline{p}\left({x}_1,{x}_2\right). \)

The estimates \( {\overline{\uprho}}_t \) are calculated in the “sliding exam” mode on the sample V under the assumption that its elements belong to the class Ωt:

$$ {\overline{\uprho}}_t=\frac{1}{n}\sum \limits_{j=1}^n1\left(\delta (j),\overline{\delta}(j)\right), $$

where \( 1\left(\delta (j),\overline{\delta}(j)\right) \) is the indicator function; δ(j) = t indicates that \( {x}^j=\left({x}_1^j,{x}_2^j\right)\in {\Omega}_t \); \( \overline{\updelta}(j) \) is the “decision” of algorithm (5) on the membership of the situation xj in one of the classes Ωt, t = 1, 2.

When calculating \( {\overline{\uprho}}_t \) using the “sliding exam” methodology, the situation \( {x}^j=\left({x}_1^j,{x}_2^j\right) \) from the sample V, which is controlled by algorithm (5), is excluded from the process of generating statistics (3) and (4).

When forming the values \( {\overline{\uprho}}_t \), the indicator function is defined as

$$ 1\left(\updelta (j),\overline{\updelta}(j)\right)=\left\{\begin{array}{l}0\kern1em \mathrm{if}\kern1em \updelta (j)=\overline{\updelta}(j);\\ {}1\kern1em \mathrm{if}\kern1em \updelta (j)\ne \overline{\updelta}(j).\end{array}\right. $$
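A minimal Python sketch of the “sliding exam” estimates \( {\overline{\uprho}}_1,{\overline{\uprho}}_2 \) is given below (a Gaussian kernel and externally supplied blur coefficients are assumed; the selection formulas (7), (8) are omitted for brevity):

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def sliding_exam_errors(s1, s2, c1, c2):
    """Leave-one-out ("sliding exam") estimates of rho_1, rho_2.
    For each held-out point the remaining n-1 observations form the
    kernel statistics (3), (4), and rule (5) assigns a class."""
    n = len(s1)
    err1 = err2 = 0
    for j in range(n):
        keep = np.arange(n) != j                # exclude the examined point
        k1 = gauss((s1[j] - s1[keep]) / c1)
        k2 = gauss((s2[j] - s2[keep]) / c2)
        m = n - 1
        p_prod = (k1.sum() / (m * c1)) * (k2.sum() / (m * c2))   # Eq. (3)
        p_joint = (k1 * k2).sum() / (m * c1 * c2)                # Eq. (4)
        decided_1 = p_joint < p_prod            # decision rule (5)
        # indicator function: 1 when the assumed class label disagrees
        # with the decision of the algorithm
        err1 += 0 if decided_1 else 1           # elements assumed in class 1
        err2 += 1 if decided_1 else 0           # elements assumed in class 2
    return err1 / n, err2 / n
```

Since every point is assigned to exactly one class, the two estimates always satisfy \( {\overline{\uprho}}_1+{\overline{\uprho}}_2=1 \) in this two-alternative setting.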

Let us compare the values \( {\overline{\uprho}}_1,{\overline{\uprho}}_2 \) obtained under the assumption that the elements of the sample V belong to the classes Ω1, Ω2, respectively. The hypothesis H0 is accepted if \( {\overline{\uprho}}_1<{\overline{\uprho}}_2. \) This assertion is valid because, when the random variables are independent, the relation \( \overline{p}\left({x}_1\right)\overline{p}\left({x}_2\right)>\overline{p}\left({x}_1,{x}_2\right) \) holds in the domains of definition Ω1, Ω2 of the probability density estimates with the estimated error probability \( {\overline{\uprho}}_1. \) Otherwise, when \( {\overline{\uprho}}_2<{\overline{\uprho}}_1, \) the random variables x1, x2 are dependent.

The probabilities of Bayesian errors ρ1, ρ2 of class recognition are linear functionals of the probability densities p(x1)p(x2) and p(x1, x2), respectively. Since the nonparametric estimates of the indicated probability densities have the properties of asymptotic convergence [4, 18], the asymptotic convergence of the statistical estimates \( {\overline{\uprho}}_1,{\overline{\uprho}}_2\kern0.5em \mathrm{to}\kern0.5em {\uprho}_1,{\uprho}_2 \) is assumed.

With bounded volumes n of the initial sample V, the problem of confidence estimation of the probabilities of pattern recognition errors arises. To solve it, one can use the traditional method of confidence estimation of probabilities [5] or the Kolmogorov–Smirnov test [30], in which the deviation \( {\overline{D}}_{12}=\left|{\overline{\uprho}}_1-{\overline{\uprho}}_2\right| \) is compared with the threshold value Dβ = [− ln(β0/2)/n]1/2. Here β0 is the probability (risk) of rejecting the hypothesis \( {\overline{H}}_0:{\uprho}_1<{\uprho}_2. \) If the relation \( {\overline{D}}_{12}<{D}_{\upbeta} \) holds, then the hypothesis \( {\overline{H}}_0 \) is accepted and the risk of rejecting it does not exceed β0. When \( {\overline{D}}_{12}\ge {D}_{\upbeta} \), the hypothesis \( {\overline{H}}_0 \) is rejected.
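The threshold Dβ is a one-line computation; the sketch below reproduces the values used in the experimental section (Dβ ≈ 0.192 for n = 100 and Dβ ≈ 0.086 for n = 500 at β0 = 0.05):

```python
import math

def d_threshold(beta0, n):
    """Kolmogorov-Smirnov-type threshold from the text:
    D_beta = sqrt(-ln(beta0 / 2) / n)."""
    return math.sqrt(-math.log(beta0 / 2.0) / n)

print(round(d_threshold(0.05, 100), 3))   # 0.192
print(round(d_threshold(0.05, 500), 3))   # 0.086
```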

Methods for the formation of sets of components of a multidimensional random variable. Suppose there is a sample of observations \( V=\left({x}_v^i,v=1,\dots, k,i=1,\dots, n\right) \) of volume n, composed of statistically independent observations of the components of a multidimensional random variable x = (xv, v = 1, ..., k). The form of the probability density p(x) is a priori unknown. From the statistical data of the sample V, using the above proposed hypothesis testing criterion

$$ {H}_{vj}:p\left({x}_v,{x}_j\right)\equiv p\left({x}_v\right)p\left({x}_j\right) $$
(9)

for the components xv, xj, v, j = 1, ..., k, v > j, it is required to form sets of independent random variables x(t) = (xv, v ∈ It), t = 1, ..., m. Here It is the set of indices of the components that make up the set x(t), and the number m of sets of components of the random variable x is unknown.

The proposed technique consists of three stages.

Stage 1 In accordance with the above recommendations, the hypothesis Hvj of type (9) is tested for each pair of components (xv, xj) of a multidimensional random variable x = (xv, v = 1, ..., k). The number of such pairs is 0.5k(k – 1).

Stage 2 Based on the results of stage 1, an information graph G(X, A) is constructed, where X is the set of vertices corresponding to the components of the random variable x; A is the set of edges. If the hypothesis Hvj holds, that is, the components xv, xj are independent, then there is an edge between two vertices xv, xj.

Stage 3 Analyze the information graph G(X, A) and determine the complete subgraphs G(Xt, At), t = 1, ..., m. If the components of the random variable x are independent, then each pair of vertices of the subgraph G(Xt, At) is connected by an edge. Complete subgraphs are detected by algorithms that decompose the original graph on the basis of its adjacency matrix [31]. The components xv, v ∈ It, corresponding to the vertices of the complete subgraph G(Xt, At) form a set of independent random variables. In this case, several variants of decomposing the information graph may be found.
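Stage 3 can be illustrated with a brute-force search for maximal complete subgraphs, which is adequate for a small number of components (the decomposition algorithms of [31] would be preferable for large graphs):

```python
from itertools import combinations

def maximal_cliques(adj):
    """Enumerate maximal complete subgraphs of a small undirected graph
    given by its adjacency matrix (brute force over vertex subsets)."""
    k = len(adj)
    cliques = []
    for size in range(k, 0, -1):            # largest subsets first
        for subset in combinations(range(k), size):
            if all(adj[u][v] for u, v in combinations(subset, 2)):
                # keep only subsets not contained in a found clique
                if not any(set(subset) <= c for c in cliques):
                    cliques.append(set(subset))
    return cliques

# Graph of the experiment under transformation (10): vertices 0..3
# correspond to x1..x4; edges join independent pairs (all except (x1, x4)).
adj = [[0, 1, 1, 0],
       [1, 0, 1, 1],
       [1, 1, 0, 1],
       [0, 1, 1, 0]]
print(maximal_cliques(adj))   # [{0, 1, 2}, {1, 2, 3}], i.e. (x1,x2,x3), (x2,x3,x4)
```

The two maximal cliques found here match the two variants of independent-component sets reported in the analysis of Fig. 1a.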

Analysis of the results of computational experiments. Let us check the efficiency of the proposed method by analyzing the data of a computational experiment, investigating how the procedure for forming sets of independent components depends on the volume n of the initial statistical data and on the degree of dependence of the random variables.

In the design of the computational experiments, the components x1, x2, x3 of the multidimensional random variable x = (x1, x2, x3, x4) in the statistical data \( V=\left({x}_1^i,\dots, {x}_4^i,i=1,\dots, n\right) \) are assumed to be independent. Their values are generated by sensors with the uniform distribution law p(x1) = R(3; 1) and the normal laws p(x2) = N(3; 1), p(x3) = N(3; 1), where R(3; 1) and N(3; 1) denote distribution laws with mathematical expectation and standard deviation equal to three and one, respectively. The values of the component x4 are determined by one of the dependences specified by the conditions of the study:

$$ {x}_4=\upvarphi \left({x}_1\right)={x}_1^2-6{x}_1+10+\upvarepsilon; $$
(10)
$$ {x}_4={\upvarphi}_4\left({x}_1,{x}_2\right)={x}_1^2-{x}_2^2+6\left({x}_2-{x}_1\right)+20+\upvarepsilon . $$
(11)

Here ε is the value of a random variable with the normal distribution law N(0; 1).

Sensors of the random variables x2, x3 with normal distribution laws are formed on the basis of the expression

$$ {x}_v=3+\left(\sum \limits_{l=1}^r{\upvarepsilon}_0^l-0.5r\right)6/\sqrt{3r},\kern1em v=2,3, $$

where \( {\varepsilon}_0^l \) are values of random variables with a uniform probability density on the interval [0; 1]; r = 12.
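The data-generation scheme of the experiment can be sketched in Python (an illustrative re-implementation; the original study used Delphi-7, and R(3; 1) is read here as the uniform law with mathematical expectation 3 and standard deviation 1):

```python
import math
import random

random.seed(1)  # the paper seeds via Delphi's Randomize; any seed works here

def normal_31(r=12):
    """CLT-based N(3; 1) sensor from the text: sum of r uniform values,
    centered and scaled by 6 / sqrt(3 r)."""
    s = sum(random.random() for _ in range(r))
    return 3.0 + (s - 0.5 * r) * 6.0 / math.sqrt(3.0 * r)

def sample(n):
    """One data set of the experiment: x1 ~ R(3; 1) uniform,
    x2, x3 ~ N(3; 1), and x4 from dependence (10)."""
    data = []
    for _ in range(n):
        # uniform with mean 3 and std 1 -> interval [3 - sqrt(3), 3 + sqrt(3)]
        x1 = 3.0 + (random.random() - 0.5) * 2.0 * math.sqrt(3.0)
        x2 = normal_31()
        x3 = normal_31()
        eps = normal_31() - 3.0                  # N(0; 1) noise
        x4 = x1**2 - 6.0 * x1 + 10.0 + eps       # Eq. (10)
        data.append((x1, x2, x3, x4))
    return data
```

With r = 12 the scale factor 6/√(3r) equals 1, so the generator reduces to the classical sum-of-twelve-uniforms approximation to the normal law.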

The computational experiments are implemented in the Delphi-7 programming environment. To generate a random variable ε0 ∈ (0; 1) with a uniform distribution law, we use the standard function random and the procedure Randomize, which uses the time of day as the basis for generating pseudo-random numbers with a uniform distribution law.

The volume n of the initial statistical data in the computational experiments was 100, 500, and 1000. For a specific volume n of initial data, the parameters \( {\overline{\uprho}}_1,{\overline{\uprho}}_2,\kern0.5em \overline{D} \) were computed for dependences (10), (11). The values \( {\overline{\uprho}}_1,{\overline{\uprho}}_2 \) were calculated 10 times and then averaged (cf. Table 1).

Table 1 Results of Testing Hypotheses about the Independence of Paired Combinations of a Four-Dimensional Random Variable

With relatively small volumes n = 100 of initial statistical data and using dependence (10), the values \( {\overline{\uprho}}_1 \) and \( {\overline{\uprho}}_2 \) for the pairs of random variables (x1, x2), (x1, x3), (x2, x3), (x2, x4), and (x3, x4) are not reliably different. Under these conditions, the values \( {\overline{D}}_{12},{\overline{D}}_{13},{\overline{D}}_{23},{\overline{D}}_{24},{\overline{D}}_{34} \) are less than the threshold Dβ = 0.192 with the risk β0 = 0.05 of rejecting the hypothesis \( {\overline{H}}_0. \) According to the conditions of the computational experiment, these paired sets of components of the random variable are a priori independent. With small volumes of initial data there is thus an uncertainty in decision making, which follows from the analysis of the results of testing the hypotheses under study. This is explained by the difference in the convergence conditions, for example, of the nonparametric probability density estimates \( \overline{p}\left({x}_1\right)\overline{p}\left({x}_2\right) \) and \( \overline{p}\left({x}_1,{x}_2\right), \) which is confirmed by the results of the studies in [1,2,3].

If the paired combinations of random components are dependent, then the proposed method, under the conditions of the considered computational experiment, unambiguously rejects the hypothesis \( {\overline{H}}_0. \) This conclusion is valid for the components (x1, x4) when dependence (10) is used in the computational experiment. In this case, the inequality \( {\overline{\uprho}}_1>{\overline{\uprho}}_2 \) holds, i.e., the condition \( \overline{p}\left({x}_1,{x}_4\right)>\overline{p}\left({x}_1\right)\overline{p}\left({x}_4\right) \) is satisfied, and the decision made according to rule (5) is valid. The specified condition is reliable, since \( {\overline{D}}_{14}>{D}_{\upbeta} \) (\( {\overline{D}}_{14}=0.5 \), Dβ = 0.192). With the complication of the dependences between the random variables when transformation (11) is used with n = 100, the above regularities are preserved.

With an increase in the volume n of the initial statistical data V, the efficiency of the proposed method for testing hypotheses about the independence of random variables increases. With a volume of initial data n = 500, the hypothesis of the independence of the paired combinations of random variables (x1, x2), (x1, x3), (x2, x3), (x3, x4) is accepted, since the corresponding pattern recognition errors of decision rule (5) satisfy the relation \( {\overline{\uprho}}_1<{\overline{\uprho}}_2. \) This relation is reliable for the paired combinations (x1, x3), (x3, x4) when dependence (10) is used in the computational experiment with the risk β0 = 0.05 of rejecting the hypothesis \( {\overline{H}}_0. \) The dependence between the components (x1, x4) is even more significant: \( {\overline{D}}_{14}=0.616 \) with Dβ = 0.086. As the relationship (11) between the random variables becomes more complex, a reliable dependence is observed between (x1, x4), (x2, x4), as well as a reliable independence of the components (x1, x2), (x3, x4).

With a volume of initial data n = 1000, the patterns noted for n = 500 are basically preserved. The developed methodology enables detection of dependences contradicting the hypothesis \( {\overline{H}}_0 \) with a high level of reliability when transformations (10), (11) are used in the computational experiment. If the components of the random variable x are a priori independent, then the inequality \( {\overline{\uprho}}_1<{\overline{\uprho}}_2 \) holds; however, under the conditions of the considered computational experiment, it can be reliable or unreliable.

Based on the data in Table 1, we form an information graph G(X, A), where X is the set of vertices that correspond to the components xv, v = 1, ..., 4, of the random variable x; A is the set of edges between the vertices of the graph. There is an edge between two vertices xv, xj if the corresponding components are independent (cf. Fig. 1).

Fig. 1. Information graphs G(X, A) for n = 1000 using transformations (10), (11) (a, b, respectively).

When using transformation (10), there are two versions of complete subgraphs, which correspond to sets of independent components (x1, x2, x3) and (x2, x3, x4) (cf. Fig. 1a). The choice of a specific option is determined by the method of subsequent processing of the initial data. If transformation (11) is applied, then one set of independent random variables (x1, x2, x3) is found (cf. Fig. 1b). The results obtained correspond to the conditions for designing a computational experiment based on transformations (10), (11).

Conclusion. In the proposed method, the formation of sets of independent components of a multidimensional random variable is reduced to testing hypotheses about the independence of their paired combinations. To solve this problem, a two-dimensional nonparametric kernel-type pattern recognition algorithm is used, which corresponds to the criterion of maximum likelihood. This approach bypasses the problem of decomposing the range of values of the random variables into multidimensional intervals. When optimizing the nonparametric pattern recognition algorithm, it is advisable to use fast algorithms for selecting the blur coefficients, which is especially important when processing large volumes of data. Based on the results of testing hypotheses about the independence of two-dimensional random variables, an information graph can be built whose vertices correspond to random variables and whose edges denote their independence. Decomposition of the information graph into complete subgraphs enables detection of different variants of sets of independent components of a multidimensional random variable. The choice of a particular set is determined by the adopted procedure for the subsequent processing of the initial data.

According to the results of the computational experiment, it was established that the proposed technique is especially sensitive to the detection of dependent random variables, both for small and for large volumes of initial statistical data. For small volumes of values of a four-dimensional random variable (n < 500), it is impossible to make an unambiguous decision about the independence of random variables. For n ≥ 500, under the conditions of the computational experiment, the dependent components of a four-dimensional random variable are reliably detected, while independent random variables are determined with varying degrees of reliability under limited volumes of initial statistical data. A promising direction for further research is comparison of the proposed method with the traditional one based on the Pearson criterion with different formulas for discretizing the range of values of random variables.

The results obtained form the basis for the synthesis of a multi-level nonparametric system, where each level corresponds to a specific set of independent random variables. Such systems are efficient when processing large data volumes.

The study was carried out with the financial support of the Russian Foundation for Basic Research, the Government of the Krasnoyarsk Territory, and the Krasnoyarsk Regional Science Foundation within the framework of the scientific project No. 20-41-240001.