Using Data Mining Technology in Monitoring and Modeling the Epidemiological Situation of the Human Immunodeficiency Virus in Kazakhstan

Kubegenova, A. D.; Kubegenov, E. S.; Gumarova, Zh. M.; Kamalova, Gaukhar A.; Zhazykbaeva, G. M.

doi:10.1007/978-3-031-21340-3_6

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1703)

Included in the following conference series:

International Scientific and Practical Conference on Information Technologies and Intelligent Decision Making Systems


Abstract

In this article, data mining technology, machine learning methods and cluster analysis are used to regularize the identification problem, and algorithms are described for numerically solving the inverse problem for a mathematical model of the spread of a socially significant disease, the human immunodeficiency virus (HIV), in Kazakhstan. Data mining is especially relevant to modeling the HIV situation, since it is the basis on which maps of the short-term incidence forecast for Kazakhstan and its regions are compiled. The article discusses statistical data on the spread of HIV in Kazakhstan over a 10-year period (2010–2020). Information technologies, including data mining, allowed the authors to characterize the morbidity graph, identify risks, and test statistical predictors of morbidity. The main part of the article describes an algorithm for numerically solving the inverse problem and the construction of a mathematical model of HIV epidemiology by classifying regions into homogeneous groups. Data mining classification methods were used to process the HIV incidence data and analyze the situation in the regions. The incidence forecast for the population of Kazakhstan is carried out using the Statistica software package. An efficient algorithm for the numerical solution of the inverse problem will allow the developments to be tested on real data.



1 Introduction

With the growing number of HIV-infected people in Kazakhstan, public health increasingly faces issues that require immediate solutions and the analysis of large volumes of data. Beyond the danger of its pandemic nature, this problem entails a wide range of social, medical and economic consequences that require prompt measures. The disease is dangerous above all because it primarily affects the young and able-bodied population of the republic.

According to UNAIDS experts (2021), more than 37.6 million people in the world are living with HIV, and 770 thousand have died from AIDS-related diseases. Although the data of the Joint UN Programme show that the HIV pandemic is relatively stable, the incidence rate is still very high.

The coronavirus pandemic of 2019–2021 is a clear example of the danger of infections becoming uncontrolled; it showed all of humanity challenges that can be overcome through unprecedented measures taken by all states.

The problem of HIV infection, although it has a different distribution pattern, continues to be very dangerous, since successful treatment has not yet led to the complete recovery of patients in remission. According to the Republican Center for the Prevention and Control of AIDS, as of September 30, 2020, 27,100 cases of HIV infection had been registered cumulatively, of which 16,344 were men, 10,756 were women, and 146 were children. In addition, 4,464 children were born to women with a positive HIV status [1].

The new strategy of the Joint United Nations Programme on HIV/AIDS (UNAIDS) commits to ending the world's AIDS epidemic by 2030. This is reflected in the state program for the development of the healthcare system of the Republic of Kazakhstan “Health” for 2016–2030. The relevance of this issue also stems from the need to study the nature of the HIV epidemic in Kazakhstan, especially in population groups with a high risk of infection.

To prevent epidemiological outbreaks, in-depth analysis methods are used that allow early detection of morbidity in the population. Medicine in Kazakhstan has come to recognize the need to introduce statistical processing into all areas of its activity. However, with the widespread introduction of statistical processing tools came the understanding that not only qualitative analysis is needed, but also a more detailed and in-depth study of data visualization processes. It is necessary not only to know the software packages for statistical analysis, but also to adapt them to each specific case. In this regard, research on medical data that studies the nature of morbidity in a whole group of people should rest on the integration of methods and a universal approach. The most important task of a researcher conducting medical research is the choice of a specific method of statistical data analysis.

The scale and complexity of the health information system have increased dramatically, and its development and management are difficult to control. With traditional methods and simple methods of mathematical statistics, it is difficult to solve the problems caused by the explosive growth of data and information, which adversely affects the management of the medical information service system. Therefore, to guide the development and maintenance of software engineering, the collection of software data is especially important [1].

With the rapid development of computer, information and storage technologies, it has become possible to store large amounts of data [2]. Data mining technology can search for and extract potentially valuable knowledge from large amounts of data. Database technology is the science of software that manages databases. The data in a database are analyzed by studying methods of structuring, designing and applying the data [3]. With the rapid development of information technology, the scale, scope, and depth of database applications continue to expand, leading to the phenomenon of “rich data and bad information” [4]. Data mining is defined as the process of searching for patterns in a large volume of incomplete, fuzzy, random data [5]. Data mining is a very active area of research in the field of databases and artificial intelligence [5,6,7,8]. strategies, contributing to sustainable development [9].

In data mining technology, the recognition parameters and the choice of coefficients are analyzed in detail, after which a data mining model is derived [10].

To analyze large volumes of anonymized patient data, a method based on processing and structuring case data has been proposed. With this method, key information can be extracted accurately and efficiently in each specific case using a special model [11]. Studies on the identifiability of mathematical models are illustrated by a mathematical model of epidemiology (co-infection with HIV and tuberculosis) [12]. The problem of identifying the model parameters is reduced to minimizing a quadratic objective functional. Since nonlinear systems are considered, the solution of inverse problems of epidemiology can be non-unique, so approaches to analyzing the identifiability of inverse problems are described. These approaches make it possible to establish which of the unknown parameters (or their combinations) can be uniquely and stably recovered from the available additional information [13]. The coefficients of the epidemiological model describe the characteristics of the population and the development of the disease. The inverse problem of identifying parameters in a mathematical model is reduced to minimizing an objective function that characterizes the squared deviation of the model output from the observed statistical data. A combination of statistical and optimization algorithms identifies the parameters with a relative accuracy of 30%. The results can be used by healthcare organizations to predict epidemics of infectious diseases in a given region by comparing simulation data with historical data [14].
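The reduction of parameter identification to the minimization of a quadratic functional can be illustrated with a minimal sketch. It is not the model used in the cited works: the logistic growth equation, the single unknown coefficient `beta`, the population size `N_POP`, and the synthetic observations are all assumptions chosen only to keep the example short.

```python
def simulate(beta, i0, n_pop, n_years, substeps=10):
    """Euler-integrate dI/dt = beta * I * (1 - I / n_pop); return yearly values."""
    dt = 1.0 / substeps
    current = float(i0)
    yearly = [current]
    for _ in range(n_years):
        for _ in range(substeps):
            current += dt * beta * current * (1.0 - current / n_pop)
        yearly.append(current)
    return yearly


def objective(beta, observed, n_pop):
    """Quadratic functional: sum of squared deviations of the model from the data."""
    modeled = simulate(beta, observed[0], n_pop, len(observed) - 1)
    return sum((m - o) ** 2 for m, o in zip(modeled, observed))


def fit_beta(observed, n_pop, low=0.0, high=2.0, tol=1e-6):
    """Golden-section search for the beta that minimizes the objective."""
    ratio = (5 ** 0.5 - 1) / 2
    while high - low > tol:
        left = high - ratio * (high - low)
        right = low + ratio * (high - low)
        if objective(left, observed, n_pop) < objective(right, observed, n_pop):
            high = right  # the minimum lies in [low, right]
        else:
            low = left    # the minimum lies in [left, high]
    return (low + high) / 2


# Synthetic "observations" generated with a known coefficient (hypothetical numbers).
TRUE_BETA = 0.4
N_POP = 1000.0
observed = simulate(TRUE_BETA, 10.0, N_POP, 10)
estimated_beta = fit_beta(observed, N_POP)
print(f"estimated beta = {estimated_beta:.4f}")
```

With noise-free synthetic data the objective has a single minimum in `beta`, so a one-dimensional search is sufficient here. Real incidence data and models with several coefficients would require a multidimensional optimizer and the identifiability analysis discussed above.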

Statistical methods for the analysis of medical information are currently not widely used in Kazakhstan, so the purpose of our research was to analyze and predict the epidemiological situation using Data Mining technology.

2 Materials and Methods

Data on the incidence of HIV infection in the Republic of Kazakhstan over a 10-year period (2010–2020) were selected as the object of study. The data on the incidence of the population were classified using Data Mining analysis of big data. As a tool for data analysis, we used the Statistica software package: Statistica Base, Statistica Advanced, the Data Mining tools, and SANN automated neural networks. Clustering methods made it possible to perform the analysis in graphical form, based on the single linkage method. Clustering the data in graphical form reduced the analysis time and made it possible to develop an algorithm for predicting the incidence.

The practical significance and relevance of applying cluster analysis to data is beyond doubt: in the modern information society, data and the results of their analysis play an increasingly important role, and clustering makes these data easier to understand.

3 Results and Discussion

The processing of experimental data was carried out on a computer in statistical packages.

A line graph of the incidence of HIV infection (the number of patients and carriers) was built from the aggregate data for the population of the Republic of Kazakhstan over a 10-year period (2010–2020) (Fig. 1). The abscissa shows the years of the study, and the ordinate shows the absolute numbers of HIV-infected people (100,000 people). The graph shows a steady trend in incidence over the period 2010–2013. Since 2014, the situation has deteriorated, with incidence almost doubling. In 2019, incidence is several times higher than in the initial years of the study and reaches a peak. By 2020, however, a slight decrease in incidence is observed. This decrease is explained, on the one hand, by the deterioration of the information collection system during the pandemic of the new coronavirus infection and, on the other hand, by its consequences in the form of deaths. Thus, the long-term dynamics of HIV incidence in the Republic of Kazakhstan show a rapid rise from 2013 to 2019 and a decline in 2019–2020.

The observed decrease in vertical transmission to 1.3% does not mean that this trend is absolute, since it fluctuates: an improvement in results is followed by a gradual deterioration. Based on the analysis of the line graph of the incidence of HIV infection, the following groups of years can be distinguished:

1) years of moderate recovery (2010–2013);
2) years of high growth (2013–2019);
3) recession years (2019–2020);
4) interim years (2014, 2016, 2018).

The sample mean value of the observed variable is determined by formula (1):

$$ \overline{x} = \frac{{\sum\limits_{i = 1}^n {x_i } }}{n} $$

(1)

where n is the sample size (true number of observations of variable x).

The median is the value that divides the ordered sample into two equal halves, with half of the observations above it and half below. The mode is the most frequently occurring value in the dataset.

Sample variance characterizes the variability of a variable and is calculated by formula (2):

$$ \overline{S}_x^2 = \frac{{\sum\limits_{i = 1}^n {(x_i - \overline{x})^2 } }}{n - 1} $$

(2)

where $\overline{x}$ is the sample mean.

The variance ranges from 0 to infinity. A value of 0 means no variability: the variable is constant.
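The descriptive statistics defined above can be checked with a few lines of Python. This is a minimal sketch: the list `cases` is hypothetical illustration data, not the incidence data of the study, and the hand-written functions simply follow formulas (1) and (2).

```python
import statistics


def sample_mean(values):
    """Formula (1): the sum of the observations divided by the sample size n."""
    return sum(values) / len(values)


def sample_variance(values):
    """Formula (2): squared deviations from the mean, divided by n - 1."""
    mean = sample_mean(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)


# Hypothetical yearly case counts, used only for illustration.
cases = [12, 15, 15, 18, 20, 22, 31]

mean_value = sample_mean(cases)
variance_value = sample_variance(cases)
median_value = statistics.median(cases)  # middle value of the ordered sample
mode_value = statistics.mode(cases)      # most frequently occurring value

print(mean_value, median_value, mode_value, round(variance_value, 3))
```

The standard-library function `statistics.variance` uses the same n - 1 denominator, so it can serve as a cross-check of formula (2).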

The original data file contains information about HIV-infected people in 16 regions and 2 cities of the Republic of Kazakhstan. The purpose of the cluster analysis is to partition the regions into clusters and to determine which clusters correspond to risk groups. Cluster analysis is considered one of the most effective and widely used methods for solving this problem.

We classify the 16 regions using a hierarchical cluster analysis procedure, with the Euclidean distance (Euclidean distances) as the proximity measure and the Single Linkage (nearest neighbor) method to merge clusters. In this method, the distance between two clusters is the distance between their two closest objects, and the two clusters with the smallest such distance are linked at each step. Linked clusters are subsequently treated as single elements. This rule tends to string objects together, so the resulting clusters are represented by long chains. The natural number of clusters was determined by successively combining regions into clusters. The order in which regions are combined into clusters is shown in a hierarchical tree (Fig. 2).

The connectivity of a pair of clusters is estimated by formula (3):

$$ \pi_{ij} = \frac{a_i + a_j }{{2b_{ij} }} $$

(3)

where $a_i$, $a_j$ are the average intracluster distances of classes i and j ($i \ne j$), and $b_{ij}$ is the average intercluster distance between the same classes. The estimate of the natural partition is made according to the following formula:

$$ S = \frac{1}{k}\sum_{i = 1}^{k} {\max_{j \ne i} \pi_{ij}} $$

(4)

where k is the number of clusters.

Identical values in objects are taken to be equal to one. The partitions obtained with the above algorithm have estimates equal to or not exceeding one. Accordingly, we can conclude that all objects united into one cluster ultimately give an estimate of one.

The Single Linkage method, more commonly known as the Nearest-Neighbor method, is conceptually the simplest. The algorithm searches for the two closest objects and combines them to form the primary cluster. Each subsequent object joins the cluster to which it is closest.
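The merging rule described above can be sketched in a few lines of Python. This is a naive illustration of single-linkage agglomeration, not the Statistica implementation used in the study; the two-dimensional `points` are hypothetical and stand in for the regional incidence vectors.

```python
import math


def single_linkage(points):
    """Return the merge history as (members_a, members_b, distance) tuples."""
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Single linkage: distance between the two closest members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((sorted(clusters[x]), sorted(clusters[y]), d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges


# Hypothetical objects; each tuple plays the role of one region's feature vector.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (11.0, 0.0)]
merges = single_linkage(points)
for members_a, members_b, distance in merges:
    print(members_a, "+", members_b, "at distance", round(distance, 3))
```

The merge history is exactly the information a dendrogram displays: n objects are joined in n - 1 steps, and under single linkage the merge distances never decrease from one step to the next.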

To determine the natural number of clusters into which the collection of objects is divided, the set was divided into a given number of classes at each level of the hierarchical clustering. For each pair of clusters, the degree of their internal connectivity was assessed. For this purpose, the average intracluster distance was calculated for each cluster.

The ratio of the average intra-cluster distance to the inter-cluster distance is taken as an estimate of connectivity.
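A small sketch can make this connectivity estimate concrete. It rests on several assumptions: the intracluster distance of a cluster is taken as the mean pairwise distance between its members (zero for a single object), the intercluster distance as the mean distance over all cross pairs, and the partition score as the average over clusters of the largest pairwise ratio, which is only one possible reading of formula (4). The `points` are hypothetical.

```python
import math
from itertools import combinations


def avg_intra(points, cluster):
    """Mean pairwise distance inside one cluster (0.0 for a single object)."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(points[i], points[j]) for i, j in pairs) / len(pairs)


def avg_inter(points, a, b):
    """Mean distance over all pairs with one object in each cluster."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def connectivity(points, a, b):
    """Formula (3): (a_i + a_j) / (2 * b_ij)."""
    return (avg_intra(points, a) + avg_intra(points, b)) / (2 * avg_inter(points, a, b))


def partition_score(points, clusters):
    """Assumed reading of formula (4); requires at least two clusters."""
    k = len(clusters)
    worst = [
        max(connectivity(points, clusters[i], clusters[j]) for j in range(k) if j != i)
        for i in range(k)
    ]
    return sum(worst) / k


# Hypothetical objects forming two well-separated pairs.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = partition_score(points, [[0, 1], [2, 3]])  # compact, distant clusters
bad = partition_score(points, [[0, 2], [1, 3]])   # stretched, interleaved clusters
print(round(good, 3), round(bad, 3))
```

Under these assumptions a lower score indicates clusters that are compact relative to their separation, so the partition into the two natural pairs scores lower than the interleaved one.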

On the dendrogram, the distances (in arbitrary units) at which objects are combined into clusters are marked by the horizontal lines. The horizontal axis represents the observations, and the vertical axis the linkage distance.

In the first steps, the following cluster of regions of Kazakhstan is formed: (Atyrau region, Mangystau region, Aktobe region). Next, the cluster (West Kazakhstan region, Kyzylorda region) is formed; the distance between these regions is greater than between those merged in the previous steps. The following cluster, (Pavlodar region, Kostanay region), is combined with the cluster (North Kazakhstan region, Astana). Further, the clusters (Karaganda region, Almaty) and (Akmola region, East Kazakhstan region), etc., are combined into one cluster. The process ends with the union of all objects into one cluster. Judging by the dendrogram, three clusters can be distinguished in this case (Table 1).

Table 1. Matrix of Euclidean distances between clusters.


Table 2. Composition and content of clusters by incidence of the population of the Republic of Kazakhstan for a 10-year period (2010–2020).


Figure 3 illustrates that when the proximity measure is cut off at the level of 250, three clusters stand out. The composition of the resulting clusters is given in Table 2.
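Cutting a single-linkage tree at a given level can be sketched without building the tree at all, because the clusters at cutoff h are the connected components of the graph that links every pair of objects at distance no greater than h. The one-dimensional `points` below are hypothetical values chosen so that a cutoff of 250 yields three clusters; they are not the regional data of Table 2.

```python
import math


def clusters_at_cutoff(points, cutoff):
    """Single-linkage clusters at the given cutoff, found with union-find."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= cutoff:
                parent[find(i)] = find(j)  # link the two components

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical one-dimensional objects (for example, case counts per region).
points = [(100.0,), (180.0,), (600.0,), (700.0,), (1500.0,)]
three_clusters = clusters_at_cutoff(points, 250.0)
print(three_clusters)
```

Raising the cutoff merges more objects and lowering it splits them, which is why the choice of the cut-off level determines the number of clusters read from the dendrogram.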

After analyzing the features of the obtained clusters and comparing the average values of HIV-infected persons by class in the regions, we obtained the following results:

The first cluster as a whole is characterized by an average level of HIV infection among the adult population and vulnerable groups of the population (drug users, convicts), with a notable share of sexual transmission of HIV infection;

The second cluster is characterized by a low rate of HIV infection compared to cluster 1. However, the second cluster includes a high-risk group of people suffering from alcoholism, drug addiction and other social diseases;

The third cluster reflects the incidence and routes of transmission of HIV infection among injecting drug users, as well as among those infected through sexual transmission and intrauterine transmission from a sick mother to the fetus. Figure 4 also shows significant differences between the three groups of regions.

Results of tree clustering. Steps are plotted along the horizontal axis on the diagram, distances are plotted along the vertical axis. In total, the algorithm took 16 steps to combine all objects into one cluster.

The resulting classification revealed a high growth in the number of HIV-infected people in the regions united in cluster 1. The results of statistical forecasting, obtained by combining regions into homogeneous groups and solving inverse problems, showed that injecting drug use is a predictor of incidence. Processing with Data Mining showed that this population group continues to drive the growth of the HIV epidemic. The increase in the proportion of coinfections, in whose structure sexually transmitted infections play an important role, causes serious concern.

4 Conclusion

Thus, the following conclusions can be drawn from the results of the study:

The analysis of the incidence of HIV in the Republic of Kazakhstan over a 10-year period (2010–2020) revealed a sharp increase in the number of HIV-infected people and a steady trend in incidence.
The cluster classification algorithm revealed the internal connectivity between objects and showed the correctness of the mathematical model of HIV epidemiology.
Processing with Data Mining showed a continuing increase in the incidence of HIV infection in Kazakhstan.
The results of statistical forecasting revealed the predictors of morbidity that define a high-risk group and drive the growth of the HIV epidemic.
Clustering and the nature of the resulting clusters will make it possible to form from them a special base for modeling, optimizing and selecting specific antiretroviral drugs and therapeutic regimens in the fight against HIV infection in Kazakhstan.

References

1. Statistical collection: Health of the population of the Republic of Kazakhstan and the activities of healthcare organizations in 2020. https://amanbol.kz/news/vich-v-kazakhstane-dannye/ https://masa.media/ru/site/
2. Cui, Z., Yan, C.: Deep integration of health information service system and data mining analysis technology. Applied Mathematics and Nonlinear Sciences 5(2), 443–452 (2020)
3. Xinyi, W.: The role of data mining technology in advertising marketing. J. Phys.: Conf. Ser. 1744, 042202 (2021)
4. Yang, J., Li, Y., Liu, Q.: Brief introduction of medical database and data mining technology in big data era. J. Evid. Based Med. 13, 57–69 (2021)
5. Jianguo, L., Sheng, Z.: Application research of data mining technology in personal privacy protection and material data analysis. Integrated Ferroelectrics 216(1), 29–42 (2021)
6. Bijalwan, V., Kumar, V., Kumari, P.: KNN based machine learning approach for text and document mining. Int. J. Database Theory and Application 7(1), 61–70 (2014)
7. Yukselturk, E., Ozekes, S., Turel, Y.K.: Predicting dropout student: an application of data mining methods in an online education program. European Journal of Open, Distance and E-learning 17(1), 118–133 (2014)
8. He, W., Yan, G., Xu, L.D.: Developing vehicular data cloud services in the IoT environment. IEEE Transactions on Industrial Informatics 10(2), 1587–1595 (2014)
9. Peña-Ayala, A.: Educational data mining: A survey and a data mining-based analysis of recent works. Expert Systems with Applications 41(4), 1432–1462 (2014)
10. Liu, L.: Development and Application of Computer Data Mining Technology. In: International Conference on Applications and Techniques in Cyber Intelligence ATCI 2019. Advances in Intelligent Systems and Computing, vol. 1017. Springer, Cham (2020)
11. Liu, M., Qu, M., Zhao, B.: Research and citation analysis of data mining technology based on bayes algorithm. Mobile Netw. Appl. 22, 418–426 (2017). https://doi.org/10.1007/s11036-016-0797-2
12. Zhenhua, H., et al.: Analysis of COVID-19 spread characteristics and infection numbers based on large-scale structured case data. Scientia Sinica Informationis 50(12), 1882 (2020)
13. Kabanikhin, S.I.: Determination of the coefficients of nonlinear ordinary differential equations systems using additional statistical information. Int. J. Mathe. Physics 10(1), 36–42 (2019)
14. Kabanikhin, S.I., Krivorotko, O.I.: Mathematical Modeling of the Wuhan COVID-2019 Epidemic and Inverse Problems. Comput. Math. and Math. Phys. 60, 1889–1899 (2020). https://doi.org/10.1134/S0965542520110068
15. Kabanikhin, S., Olga, K., Victoriya, K.: A combined numerical algorithm for reconstructing the mathematical model for tuberculosis transmission with control programs. Journal of Inverse and Ill-posed Problems 26(1), 121–131 (2018)


Author information

Authors and Affiliations

West Kazakhstan Agrarian Technical University named after Zhangir Khan, Uralsk, 090009, Kazakhstan
A. D. Kubegenova, E. S. Kubegenov, Zh. M. Gumarova, Gaukhar A. Kamalova & G. M. Zhazykbaeva

Corresponding author

Correspondence to Zh. M. Gumarova.

Editor information

Editors and Affiliations

National Research University Moscow Power Engineering Institute, Moscow, Russia
Arthur Gibadullin

About this paper

Cite this paper

Kubegenova, A.D., Kubegenov, E.S., Gumarova, Z.M., Kamalova, G.A., Zhazykbaeva, G.M. (2022). Using Data Mining Technology in Monitoring and Modeling the Epidemiological Situation of the Human Immunodeficiency Virus in Kazakhstan. In: Gibadullin, A. (eds) Information Technologies and Intelligent Decision Making Systems. ITIDMS 2021. Communications in Computer and Information Science, vol 1703. Springer, Cham. https://doi.org/10.1007/978-3-031-21340-3_6


DOI: https://doi.org/10.1007/978-3-031-21340-3_6
Published: 10 December 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21339-7
Online ISBN: 978-3-031-21340-3
eBook Packages: Computer Science, Computer Science (R0)
