1 Introduction

The number of HIV-infected people in Kazakhstan is growing, and public health increasingly faces issues that require rapid solutions and the analysis of large volumes of data. Beyond its pandemic potential, the problem entails a wide range of social, medical and economic consequences that call for prompt measures. The disease is especially dangerous because it primarily affects the young, working-age population of the republic.

According to UNAIDS experts (2021), more than 37.6 million people worldwide are living with HIV, and 770 thousand have died from AIDS-related diseases. Although the data of the joint UN programme show that the HIV pandemic is relatively stable, the incidence rate is still very high.

The coronavirus pandemic of 2019–2021 is a clear example of the danger of infections getting out of control, and it showed all of humanity that such challenges can be overcome through unprecedented measures taken by all states.

HIV infection, although it spreads in a different pattern, remains very dangerous, since successful treatment has not yet led to the complete recovery of patients in remission. According to the Republican Center for the Prevention and Control of AIDS, as of September 30, 2020, a cumulative total of 27,100 cases of HIV infection had been registered, of which 16,344 were men, 10,756 were women, and 146 were children. In addition, 4,464 children were born to women with a positive HIV status [1].

The new strategy of the Joint United Nations Programme on HIV/AIDS (UNAIDS) commits to ending the global AIDS epidemic by 2030. This was reflected in the state program for the development of the healthcare system of the Republic of Kazakhstan “Health” for 2016–2030. The relevance of this issue also stems from the need to study the nature of the HIV epidemic in Kazakhstan, especially in population groups with a high risk of infection.

To prevent epidemiological outbreaks, methods of in-depth analysis are used that allow early detection of morbidity in the population. Medicine in Kazakhstan has come to recognize the need to introduce statistical processing into all areas of its activity. However, the widespread introduction of statistical processing tools brought an understanding that qualitative analysis alone is not enough, and that data visualization processes also require more detailed, in-depth study. It is necessary not only to know the software packages for statistical analysis, but also to adapt them to each specific case. Accordingly, research on medical data that studies the nature of morbidity in a whole group of people should rely on an integration of methods and a universal approach. The most important task of a researcher conducting medical research is the choice of a specific method of statistical data analysis.

The scale and complexity of health information systems have increased dramatically, and their development and management are difficult to control. Traditional and simple methods of mathematical statistics struggle with the problems caused by the explosive growth of data and information, which adversely affects the management of medical information service systems. Therefore, the collection of software data is especially important for guiding software development and maintenance [1].

With the rapid development of computer, information and storage technologies, it has become possible to store large amounts of data [2]. Data mining technology can search for and extract potentially valuable knowledge from large amounts of data. Database technology is the science of software that manages databases. The data in a database are analyzed by studying the methods of structuring, designing and applying the data [3]. As information technology develops, the scale, scope, and depth of database applications continue to expand, leading to the phenomenon of “rich data and bad information” [4]. Data mining is defined as the process of searching for patterns in a large amount of incomplete, fuzzy, random data [5]. Data mining is a very active area of research in the field of databases and artificial intelligence [5,6,7,8]. strategies, contributing to sustainable development [9].

In data mining technology, the recognition parameters and the choice of coefficients are analyzed in detail, after which a data mining model is derived [10].

To analyze large anonymous data about patients, the authors propose a method based on the technology of processing and structuring case data. With this method, key information can be extracted accurately and efficiently in each specific case using a special model [11]. Studies on the identifiability of mathematical models are illustrated with an epidemiological model of co-infection with HIV and tuberculosis [12]. The problem of identifying model parameters is reduced to minimizing a quadratic objective functional. Since nonlinear systems are considered, the solution of inverse problems of epidemiology can be ambiguous; therefore, approaches to analyzing the identifiability of inverse problems are described. These approaches make it possible to establish which of the unknown parameters (or their combinations) can be unambiguously and stably restored from the available additional information [13]. The coefficients of the epidemiological model describe the characteristics of the population and the development of the disease. The inverse problem of identifying parameters in a mathematical model is reduced to the problem of minimizing the objective function that characterizes the squared deviation of statistical data from experimental data. A set of statistical and optimization algorithms identifies the parameters with a relative accuracy of 30%. The results can be used by healthcare organizations to predict the epidemic of infectious diseases in a given region by comparing simulation data with historical data [14].
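The idea of reducing parameter identification to the minimization of a quadratic objective can be illustrated with a minimal Python sketch. It is not the co-infection model of the cited works: it assumes a toy exponential growth model with one unknown rate, synthetic observations generated from a known rate, and a golden-section search as the optimizer. All function names and numbers are hypothetical.

```python
import math

def model(t, rate, initial=100.0):
    """Toy model (assumption): exponential growth of case counts."""
    return initial * math.exp(rate * t)

def objective(rate, times, observed):
    """Quadratic objective: sum of squared deviations model vs. data."""
    return sum((model(t, rate) - y) ** 2 for t, y in zip(times, observed))

def golden_section_min(f, lo, hi, tol=1e-8):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    while hi - lo > tol:
        if f(c) < f(d):
            hi = d          # minimum lies in [lo, d]
        else:
            lo = c          # minimum lies in [c, hi]
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
    return (lo + hi) / 2

# Synthetic "observations" generated from a known rate (illustration only).
times = [0, 1, 2, 3, 4, 5]
true_rate = 0.3
observed = [model(t, true_rate) for t in times]

estimated_rate = golden_section_min(
    lambda r: objective(r, times, observed), 0.0, 1.0
)
print(round(estimated_rate, 4))
```

With noise-free synthetic data the search recovers the generating rate; with real data and a nonlinear multi-parameter model, the minimum may not be unique, which is the identifiability issue discussed above.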

The use of statistical methods for the analysis of medical information is currently not widespread in Kazakhstan, so the purpose of our research was to analyze, predict and predetermine the epidemiological situation using Data Mining technology.

2 Materials and Methods

As the object of study, we selected data on the incidence of HIV infection in the Republic of Kazakhstan over a 10-year period (2010–2020). The incidence data were classified using Big Data and data mining analysis. For data analysis we used the Statistica software package: Statistica Base, Statistica Advanced, the Data Mining tools, and SANN automated neural networks. Modern clustering methods made it possible to perform the analysis in graphical form, based on the single-linkage method. Clustering the data in graphical form reduced the analysis time and made it possible to develop an algorithm for predicting the incidence.

The practical significance and relevance of applying cluster analysis to data is beyond doubt, since in the modern information society, data and the results of their analysis play an increasingly important role, and clustering allows you to better understand these data.

3 Results and Discussion

The processing of experimental data was carried out on a computer in statistical packages.

Fig. 1. Line chart of HIV incidence in the Republic of Kazakhstan for 2010–2020.

A line graph of the incidence of HIV infection (the number of patients and carriers) (Fig. 1) was built from the aggregate data for the population of the Republic of Kazakhstan over a 10-year period (2010–2020). The abscissa shows the years of the study, and the ordinate shows the absolute numbers of HIV-infected people (100,000 people). The graph shows a steady trend in incidence over 2010–2013. Since 2014, the situation has been deteriorating, with incidence almost doubling. In 2019, the incidence is several times higher than in the initial years of the study and reaches a peak. By 2020, however, there is a slight decrease in incidence. This is explained, on the one hand, by the deterioration of the information collection system during the pandemic of the new coronavirus infection and, on the other hand, by its consequences in the form of deaths. Thus, the long-term dynamics of HIV incidence in the Republic of Kazakhstan show a rapid rise from 2013 to 2019 and a decline in 2019–2020.

The observed decrease in the vertical transmission path to 1.3% does not mean that this trend is absolute, since it fluctuates: an improvement in results is followed by a gradual deterioration. Based on the analysis of the line graph of HIV incidence, the following groups of years can be distinguished:

  • Years of moderate recovery (2010–2013);

  • Years of high growth (2013–2019);

  • Recession years (2019–2020);

  • Interim years (2014, 2016, 2018).

The sample mean value of the observed variable is determined by formula (1):

$$ \overline{x} = \frac{{\sum\limits_{i = 1}^n {x_i } }}{n} $$
(1)

where n is the sample size (true number of observations of variable x).

The median divides the ordered sample into two equal halves: half of the values lie above it and half below. The mode is the most frequently occurring value in the dataset.

Sample variance characterizes the variability of a variable and is calculated by formula (2):

$$ \overline{S}_x^2 = \frac{{\sum\limits_{i = 1}^n {(x_i - \overline{x})^2 } }}{n - 1} $$
(2)

where \(\overline{x}\) is the sample mean.

The variance ranges from 0 to infinity. A value of 0 means no variability: the variable is constant.
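Formulas (1) and (2), together with the median and mode, can be checked with a few lines of Python. This is a minimal sketch that uses only the standard library; the sample values are hypothetical and are not taken from the study data.

```python
import statistics

# Hypothetical sample, for illustration only (not the study's data).
values = [12, 15, 15, 18, 20]

n = len(values)

# Formula (1): sample mean.
mean = sum(values) / n

# Formula (2): sample variance with the n - 1 denominator.
variance = sum((x - mean) ** 2 for x in values) / (n - 1)

# Median and mode from the standard library.
median = statistics.median(values)
mode = statistics.mode(values)

print(mean, median, mode, variance)
```

The hand-written variance agrees with `statistics.variance`, which also uses the n − 1 denominator.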

The original data file contains information about HIV-infected people in 16 regions and 2 cities of the Republic of Kazakhstan. The purpose of the cluster analysis is to partition the regions into clusters and to identify the cluster that corresponds to risk groups. Cluster analysis is considered one of the main effective and widely used methods for solving this problem.

We classify the 16 regions with a hierarchical cluster analysis procedure, using the Euclidean distance (Euclidean distances) as the proximity measure and the Single Linkage (nearest-neighbor) method to merge clusters. In this method, the distance between two clusters is determined by the distance between their two closest objects. As a result, objects are strung together, and the resulting clusters tend to form long chains. The natural number of clusters was determined by successively combining regions into clusters. The order in which the regions were combined is shown in a hierarchical tree (Fig. 2).

$$ \pi_{ij} = \frac{a_i + a_j }{{2b_{ij} }} $$
(3)

where \(a_i\), \(a_j\) are the average intracluster distances of classes \(i\) and \(j\) (\(i \ne j\)), and \(b_{ij}\) is the average intercluster distance between the same classes. The estimate of the natural partition is made according to the following formula:

$$ S = \frac{1}{k}\sum_{i = 1}^{k} {\max_{j} \pi_{ij}} $$
(4)

Fig. 2. Classification of the regions of the Republic of Kazakhstan by the incidence of the population from 2010 to 2020.

The value for identical objects is taken to be one. The partition estimates obtained with the above algorithm do not exceed 1. Accordingly, when all objects are united into a single cluster, the estimate equals one.

The Single Linkage method, more commonly known as the nearest-neighbor method, is conceptually the simplest. The algorithm first finds the two closest objects and combines them into the primary cluster. Each subsequent object joins the cluster to which it is closest.
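The nearest-neighbor procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Statistica implementation: the function names are hypothetical, and the four labelled points are invented coordinates rather than regional incidence data.

```python
import math
from itertools import combinations

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(points):
    """Agglomerative clustering with single linkage.

    points: dict mapping a label to a coordinate tuple.
    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    clusters = [frozenset([label]) for label in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Single linkage: cluster distance is the distance
            # between the two closest members.
            d = min(euclidean(points[x], points[y]) for x in a for y in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((sorted(a), sorted(b), d))
    return merges

# Hypothetical objects (illustration only).
regions = {"A": (1.0, 2.0), "B": (1.5, 2.0), "C": (5.0, 6.0), "D": (5.0, 7.0)}

merges = single_linkage(regions)
for left, right, dist in merges:
    print(left, right, round(dist, 3))
```

For n objects the procedure always performs n − 1 merges before everything is united into one cluster, and the recorded merge distances are the heights at which branches join in a dendrogram.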

To determine the natural number of clusters into which the collection of objects is divided, the set was divided into a given number of classes at each level of the hierarchical clustering. For each pair of clusters, the degree of their internal connectivity was assessed, based on the average intracluster distance calculated for each cluster.

The ratio of the average intra-cluster distance to the inter-cluster distance is taken as an estimate of connectivity.
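The connectivity estimate can be illustrated with a short Python sketch. It follows one possible reading of formulas (3) and (4), which is an assumption: the connectivity of two clusters is the sum of their average intracluster distances divided by twice the average intercluster distance, and the partition estimate averages, over all k clusters, the largest connectivity each cluster has with any other cluster. The function names and the sample coordinates are hypothetical, and a single-object cluster is assumed to have an intracluster distance of 0.

```python
import math
from itertools import combinations, product

def mean_intra(cluster):
    """Average pairwise distance inside one cluster (0.0 for a singleton)."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def mean_inter(c1, c2):
    """Average distance between members of two different clusters."""
    pairs = list(product(c1, c2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def connectivity(c1, c2):
    """Assumed reading of formula (3): (a_i + a_j) / (2 * b_ij)."""
    return (mean_intra(c1) + mean_intra(c2)) / (2 * mean_inter(c1, c2))

def partition_estimate(clusters):
    """Assumed reading of formula (4); requires at least two clusters."""
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            connectivity(clusters[i], clusters[j]) for j in range(k) if j != i
        )
    return total / k

# Hypothetical partition into two well-separated clusters.
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 0.0), (10.0, 2.0)],
]

pi_12 = connectivity(clusters[0], clusters[1])
s = partition_estimate(clusters)
print(round(pi_12, 4), round(s, 4))
```

Under this reading, compact and well-separated clusters give small values, while values approaching 1 indicate that the clusters are about as spread out internally as they are far apart.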

On the dendrogram, horizontal lines mark the distances (in arbitrary units) at which objects are combined into clusters. The horizontal axis represents the observations, and the vertical axis represents the linkage distance.

In the first steps, a cluster of regions of Kazakhstan is formed: (Atyrau region, Mangystau region, Aktobe region). Next, the cluster (West Kazakhstan region, Kyzylorda region) is formed; the distance between these regions is greater than between those merged in the previous steps. Then the cluster (Pavlodar region, Kostanay region) is combined with the cluster (North Kazakhstan region, Astana). After that, the clusters (Karaganda region, Almaty), (Akmola region, East Kazakhstan region), etc. are combined into one cluster. The process ends with the union of all objects into one cluster. Judging by the dendrogram, three clusters can be distinguished in this case (Table 1).

Table 1. Matrix of Euclidean distances between clusters.
Table 2. Composition and content of clusters by incidence of the population of the Republic of Kazakhstan for a 10-year period (2010–2020).

Figure 3 illustrates that when the proximity measure is cut off at the level of 250, 3 clusters stand out. The composition of the resulting clusters is determined in Table 2.
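Cutting a single-linkage tree at a fixed level can be sketched in Python as follows. For single linkage, cutting at a threshold is equivalent to grouping objects that are connected by a chain of pairwise distances not exceeding the threshold, so the sketch uses a small union-find structure instead of building the whole tree. The threshold of 250 comes from the text; the object labels and one-dimensional values are hypothetical and chosen only so that three groups appear.

```python
import math
from itertools import combinations

def cut_single_linkage(points, threshold):
    """Clusters obtained by cutting a single-linkage tree at `threshold`.

    points: dict mapping a label to a coordinate tuple.
    Returns a sorted list of sorted label lists.
    """
    parent = {label: label for label in points}

    def find(x):
        # Find the representative of x with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Join every pair of objects closer than the cut level.
    for a, b in combinations(points, 2):
        if math.dist(points[a], points[b]) <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for label in points:
        groups.setdefault(find(label), []).append(label)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical one-dimensional values (illustration only).
objects = {
    "R1": (100.0,),
    "R2": (180.0,),
    "R3": (600.0,),
    "R4": (700.0,),
    "R5": (1500.0,),
}

clusters_at_250 = cut_single_linkage(objects, 250)
print(clusters_at_250)
```

Raising the cut level merges groups and lowering it splits them, so the number of clusters depends directly on the chosen level.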

After analyzing the features of the obtained clusters and comparing the average values of HIV-infected persons by class in the regions, we obtained the following results:

The first cluster as a whole is characterized by an average level of HIV infection among the adult population and vulnerable groups of the population (drug users and convicts), and it accounts for a share of the sexual transmission of HIV infection;

The second cluster is characterized by a lower rate of HIV infection than cluster 1. It nevertheless includes a group of people at high risk who suffer from alcoholism, drug addiction and other social diseases;

The third cluster reflects the incidence and routes of transmission of HIV infection among injecting drug users, as well as among those infected through sexual transmission and intrauterine transmission from an infected mother to the fetus. Figure 4 also shows significant differences between the three groups of regions.

Fig. 3. Graph of average values for each cluster.

Fig. 4. Graph of the association scheme by steps.

Results of tree clustering. Steps are plotted along the horizontal axis on the diagram, distances are plotted along the vertical axis. In total, the algorithm took 16 steps to combine all objects into one cluster.

The resulting classification revealed clusters with a high growth of HIV-infected people in the regions united in cluster 1. The results of statistical forecasting, obtained by combining regions into homogeneous groups and solving inverse problems, showed that injecting drug users are predictors of incidence. Data mining showed that this population group continues to stimulate the growth of the HIV epidemic. The increase in the proportion of coinfections, in which sexually transmitted infections play an important role, is a cause of serious concern.

4 Conclusion

Thus, the following conclusions can be drawn from the results of the study:

  • The analysis of the incidence of HIV in the Republic of Kazakhstan over a 10-year period (2010–2020) revealed a sharp increase in HIV-infected people and a steady trend in incidence.

  • The cluster classification algorithm revealed the internal connectivity between objects and showed the correctness of the mathematical model of HIV epidemiology.

  • Data mining showed a continuing increase in the incidence of HIV infection in Kazakhstan.

  • The results of statistical forecasting revealed the predictors of morbidity, which form a high-risk group and stimulate the growth of the HIV epidemic.

  • Clustering and the nature of the resulting clusters will make it possible to form from their aggregates a special base for modeling, optimizing and selecting specific antiretroviral drugs and therapeutic regimens in the fight against HIV infection in Kazakhstan.