1 Introduction

Global warming has been shown to contribute to the rising burden of insect-vector-borne human diseases, especially in tropical countries [1, 2]. The world has long been fighting mosquito-borne diseases, which remain a significant threat to health worldwide [3, 4]. There is an ongoing effort to control and ultimately eliminate insect-borne diseases such as malaria [5,6,7]. Apart from Christopher’s work of 1911 [8], several models have recently been proposed for predicting malaria spread and abundance. Malaria prediction and treatment are difficult and complicated because of mosquito-driven transmission and the complex ecology involved [9,10,11,12].

Malaria is one of the most dangerous illnesses in Africa, especially in Nigeria [13, 14]. Reliable and early parasite-based diagnosis, identification of symptoms, prescription, and further malaria control measures, for example the use of insecticide-treated bed nets and indoor residual spraying, are crucial to reducing the incidence of malaria in Nigeria. Many donors have helped in this regard by distributing treated bed nets in Nigeria. The resistance of Plasmodium falciparum to commonly used anti-malarial medications necessitates combination therapy, potentially at extra cost, in addition to accurate laboratory diagnosis [15]. To prevent the indiscriminate use of anti-malarial medicines, the World Health Organization (WHO) stresses that the presence of malaria in the human body be confirmed before anti-malarial drugs are used. In Africa, and especially in Nigeria, however, the presumptive treatment of malaria and other related fevers is still common because of a lack of medical personnel, technical expertise, and effective laboratory infrastructure. These problems have caused delays in the diagnosis of other severe febrile diseases, and presumptive treatment sometimes results in the misuse or abuse of anti-malarial drugs [16, 17].

Malaria is an ancient disease, as old as humanity itself, and it places a social, economic, and health burden on people around the world [9]. The disease is common in warm and humid nations and, despite having existed for centuries, it remains a prominent public health challenge in many countries. WHO declared malaria endemic in 109 countries in 2008, with 243 million malaria cases recorded and a large number of deaths, mainly among children under 5 years of age (WHO).

Nigeria has experienced a high incidence of malaria in recent years. Its healthcare system holds a large quantity of data, yet little or no effort has been made to apply the rich information in these data to life-threatening problems in the medical diagnosis of diverse diseases [10]. Data mining remains one of the most significant approaches among the various techniques for the prediction or diagnosis of diseases [18]. Researchers have proposed many prediction models based on environmental factors [19,20,21,22,23]. However, the use of big data for predictions based on both environmental factors and clinical conditions has not been explored [23].

Therefore, this study aims to predict the incidence of malaria from clinical and environmental factors using an LSTM prediction model in the R programming language environment. The paper also conducts a comparative performance evaluation of the prediction algorithms. The LSTM model was applied to big data to predict malaria prevalence in the study areas, using a dataset collected from those areas. The deep learning method was chosen for its strength in prediction and forecasting. The predictions are based on past values, so a relationship has to be established between the fluctuations that occur in those values.

2 Big Data in the Healthcare System

Advances in information technology have increased the amount of data used by modern establishments, making data science an essential tool for managing data in any organization [24,25,26]. For instance, an organization with a huge amount of user data needs data science to improve the way it collects, accumulates, and processes that data. The company may use various systematic techniques to process the user data and obtain useful outcomes. Data mining and big data are closely interrelated with data science. Data science is of great importance for understanding and processing real data because it incorporates statistics and machine learning with their various procedures [27, 28]. The concepts and methods of information science, computer science, mathematics, and statistics are all embedded in the field of data science [28, 29].

The 21st century is an age of big data affecting every aspect of human life, including biology and medicine [30, 31]. The move from paper medical records to Electronic Health Record (EHR) systems has resulted in an unprecedented increase in data [32, 33]. Big data thus offers a great opportunity for doctors, epidemiologists, and specialists in health policy to make evidence-driven decisions that will eventually enhance patient care [34, 35]. “Big data is not only a modern reality for the biomedical scientist but an imperative that needs to be fully grasped and used in the search for new knowledge” [36,37,38].

Data in the healthcare sector are usually huge and not easy to handle [38,39,40]. This results from the enormous growth of data in the healthcare sector, the rate at which data are produced, and the variety of data in the healthcare system [41, 42]. The way data are captured, stored, analyzed, and retrieved in the healthcare sector has shifted rapidly from the old paper-based storage technique to digital techniques and methods [42, 43].

The large volume and complexity of these data make them very difficult to process and interpret with long-established traditional methods. The application of advanced technologies, including virtualization and cloud computing, therefore allows large-scale and effective data processing in the healthcare system, rapidly turning healthcare into a big data industry. In addition, improvements in information and communication technologies (ICT) now bring varied data from new sources into the healthcare system.

These sources include the Global Positioning System (GPS), gene sequence data, log files, Radio Frequency Identification (RFID) devices, smart meters, and social media posts. The increasing rate at which data are produced from these sources increases the amount of data in the healthcare system [42, 44]. As a result, storing, processing, and analyzing the data with long-established traditional data processing applications becomes tedious [45]. Nevertheless, modern methods and techniques, together with advanced computational technology, have been used to store, manage, and extract value from broad and varied data in the healthcare system in real time [42, 44, 45]. The healthcare sector has thus become a big data industry, and big data now provides massive opportunities for the healthcare system [46].

Improvements in information technology and data computing have greatly changed population-based research by enabling easy access to huge amounts of data. Such linked databases are sometimes referred to as “big data” [47, 48]. In order to make efficient use of these data for clinical or public health research, researchers need to look beyond the traditional surveillance model, because working with big data differs from performing narrow analyses of treatment-oriented clinical data. It therefore becomes expedient to leverage big data so that it accurately reflects the heterogeneous population it represents [42, 45]. This endeavor needs an agile research environment that can keep up with rapid advances in computing technology, combining data at all times while using new methods to reduce their complexity [42, 48].

Big data in healthcare has been recognized for enabling timely disease detection and diagnosis and better control of disease outbreaks. Predictive analysis in healthcare can be achieved more easily with the aid of big data [49, 57]. In the United States of America, for example, predictive analytics has been used to enhance disease response management and to gain a deeper understanding of diseases [50]. In Australia, a health-based framework for broad data collection has been widely used to enhance patient care and prevent insurance fraud [51]. Data from social media, sensors, and air quality monitoring are currently being studied with machine learning models to predict asthma [23]. In general, the use of broad health data in predictive analysis has attracted considerable research interest in recent years [52]. Modern scalable platforms such as Hadoop MapReduce and Apache Spark have been used to manage computationally complex data. Spark does not come with a file system of its own, so it must be combined with the Hadoop file system or with a cloud-based file system [53]. Hadoop takes a long time to run computationally complex machine learning algorithms [54, 55], whereas Apache Spark has been reported to be up to 100 times faster. Big data also reveals trends and patterns that make it easier to diagnose and treat patients.

In today’s digital world, big data could be key to controlling malaria outbreaks, but the criteria for successful data collection and global analysis need to be clear. In this study, data and algorithms available in digital form are used for prediction and monitoring. For example, in the battle against endemic malaria it is vital to consider both the clinical and the environmental factors in the places where the infection has spread and been detected. Nonetheless, it is equally important that these data and algorithms are used in a protected manner, following data security laws and with proper regard for privacy and confidentiality. Failure to do so would undermine public confidence, making people less likely to follow public health advice or recommendations and more likely to have poor health outcomes [56].

Hence, the exploitation of big data in malaria-endemic areas will bring about improved care at minimum cost and with good patient satisfaction. In this work, the LSTM model was applied to big data to estimate the prevalence of malaria in the study areas, using a dataset obtained from those areas. The deep learning method was chosen for its power in prediction and forecasting. The estimates are based on previous values, so a relationship must be established between the variations that occur in those values.

2.1 Clinical Data

The clinical data used for this study were collected from three General Hospitals in Irepodun Local Government Area (LGA) of Kwara State (Omu-Aran, Oro, and Agbamu). These health centers (HCs) were set up by the Kwara State Government of Nigeria and were used to collect malaria samples from endemic sites within Irepodun LGA. The three hospitals were chosen because they are government-owned, are authorized to control endemic malaria in Kwara State, and are located in highly endemic areas. The hospitals were well equipped with malaria test kits, and their personnel were well trained in the testing and treatment of malaria. The hospital data attributes are the number of positive cases, the number of asymptomatic cases, average age, and sex, recorded per month. The data used for this analysis covered three years, from the beginning of 2017 to the end of 2019.

2.2 Environmental Data

It is essential to know the environmental risk in malaria-endemic areas, even at a very local scale. The environmental factors most often considered for estimating malaria abundance are temperature, relative humidity, rainfall, and the vegetative index. Environmental data were obtained from MiNet’s daily forecast and from satellites. From the MiNet forecast, the MiNet01AB product provided 7-day composite images of day and night temperature at a spatial resolution of 1.23 km × 1.23 km. The Nigerian Meteorological Agency’s data acquisition center reported average relative humidity at 2 m and 8 m above ground level. The vegetative index was obtained at a resolution of 1.5 km × 1.5 km. Rainfall was measured as the amount of rainfall in each region. The following information was also systematically collected: the vegetation index (NDVI) measured within 1 km of the site with SPOT 5, the number of residents, the existence of abandoned projects, and the existence of abandoned construction with tools or holes that could serve as breeding sites for Anopheles mosquitoes.

3 Methodology

Hochreiter [29] introduced Long Short-Term Memory (LSTM) to tackle the reliability and speed problems of the Recurrent Neural Network (RNN). The two are very similar, but LSTM introduces a new concept: interaction within each module, or cell. An RNN is a neural network with a notion of time, in which edges span adjacent time steps. It shares parameters across the sequence of time steps and allows a node to connect to itself through self-loops over time. At step t, the model receives the input values and the node’s previous hidden value and uses them to determine the new hidden state. The final output value therefore depends on the current input and on the prior state. LSTM was designed to address the vanishing-gradient problem. Unlike a plain RNN, which uses a simple hidden-layer structure, LSTM uses memory blocks with a linear self-loop that lets gradients flow through long sequences.
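For reference, the recurrent update described above is commonly written in the following textbook form. It is not a formula taken from this study: the symbols \( w_{v} \) and \( b_{v} \) are introduced here only for illustration, while \( V_{t} \) and \( {\text{X}}_{t} \) follow the notation used for the LSTM equations in Sect. 3.1.

$$ V_{t} = tanh\left( {w_{v} \left[ {V_{t - 1} ,{\text{X}}_{t} } \right] + b_{v} } \right) $$

Because the same \( w_{v} \) and \( b_{v} \) are reused at every time step, the gradient with respect to early inputs passes through a long product of similar factors, which is how it can shrink towards zero over long sequences.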

The memory block contains self-connected memory cells that memorize the temporal state, and three gates that control the flow of information into and out of the block. This helps the model manage both long-term and short-term correlations in a time series. The LSTM, rather than a simple RNN, is computed at each time t. As depicted in Fig. 1, LSTM has a chain-like structure of repeating memory blocks (cells), each with four interacting layers, which lets it retain information over long periods. The key component, the cell state, runs along the chain largely unchanged and allows information to flow forward. Gates with sigmoid activations can add information to, or remove it from, the cell state. This gating mechanism lets LSTM avoid the long-term dependency problem. Each gate applies a set of matrix operations with its own weights.

Fig. 1. Development of the Long Short-Term Memory (LSTM)

3.1 The Staging Procedure of LSTM Is as Follows

In the first stage, a sigmoid layer identifies the information that is no longer needed. It takes the current input \( \left( {{\text{X}}_{t} } \right) \) at time \( t \) and the previous output \( \left( {V_{t - 1} } \right) \) at time \( t - 1 \), and decides which part of the old cell state is discarded. This stage is called the forget gate \( \left( {f_{t} } \right) \). The gate \( f_{t} \) is a vector with values between 0 and 1, one for each number in the cell state \( \left( {C_{t - 1} } \right) \).

$$ f_{t} = \sigma \left( {w_{f} \left[ {V_{t - 1} ,{\text{X}}_{t} } \right] + b_{f} } \right) $$
(1)

Where

  • \( \sigma \) is the sigmoid function, and \( w_{f} \) and \( b_{f} \) are the weights and bias of the forget gate, respectively.

The second stage decides which new information from the current input \( {\text{X}}_{t} \) is stored in the cell state and which is ignored. It has two layers: a sigmoid layer and a \( tanh \) layer. The sigmoid layer \( m_{t} \) decides, with values between 0 and 1, whether the new information should be updated. The \( tanh \) layer \( N_{t} \) produces candidate values between −1 and 1, weighted by their level of importance. The two are then combined with the old cell state, as presented in Eq. 4, to form the new cell state. Here \( w_{m} \), \( b_{m} \) and \( w_{n} \), \( b_{n} \) are the weights and biases of the sigmoid and \( tanh \) layers, respectively.

$$ m_{t} = \sigma \left( {w_{m} \left[ {V_{t - 1} ,{\text{X}}_{t} } \right] + b_{m} } \right) $$
(2)
$$ N_{t} = tanh\left( {w_{n} \left[ {V_{t - 1} ,{\text{X}}_{t} } \right] + b_{n} } \right) $$
(3)
$$ C_{t} = C_{t - 1 } f_{t} + N_{t} m_{t} $$
(4)
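As a quick numerical check of Eq. 4, consider a single cell with hypothetical values chosen only for illustration: \( C_{t - 1} = 0.8 \), \( f_{t} = 0.9 \), \( N_{t} = 0.5 \), and \( m_{t} = 0.2 \).

$$ C_{t} = 0.8 \times 0.9 + 0.5 \times 0.2 = 0.72 + 0.10 = 0.82 $$

In this example the forget gate keeps 90% of the old cell state, while the input gate admits only 20% of the candidate value.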

In the final stage, the output \( V_{t} \) is obtained by multiplying the output of the sigmoid gate \( Q_{t} \) by \( tanh\left( {C_{t} } \right) \), the \( tanh \) of the new cell state.

$$ Q_{t} = \sigma \left( {w_{q} \left[ {V_{t - 1} ,{\text{X}}_{t} } \right] + b_{q} } \right) $$
(5)
$$ V_{t} = Q_{t} tanh\left( {C_{t} } \right) $$
(6)

\( w_{q} \) and \( b_{q} \) are the weights and biases for the output gate, respectively.
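The six equations can be traced in a few lines of code. The sketch below is a minimal illustration in plain Python of one time step for a single memory cell. It is not the R implementation used in the study. It assumes that each gate has its own weight vector, written here as `w_f`, `w_m`, `w_n`, and `w_q`, and all parameter values are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gate_input(w, b, v_prev, x_t):
    """Compute w [V_{t-1}, X_t] + b for a single unit.

    w is a weight list whose first entry multiplies V_{t-1} and whose
    remaining entries multiply the components of X_t.
    """
    concat = [v_prev] + list(x_t)
    return sum(wi * ci for wi, ci in zip(w, concat)) + b


def lstm_step(x_t, v_prev, c_prev, p):
    """One LSTM time step for a single memory cell, following Eqs. 1-6."""
    f_t = sigmoid(gate_input(p["w_f"], p["b_f"], v_prev, x_t))    # Eq. 1
    m_t = sigmoid(gate_input(p["w_m"], p["b_m"], v_prev, x_t))    # Eq. 2
    n_t = math.tanh(gate_input(p["w_n"], p["b_n"], v_prev, x_t))  # Eq. 3
    c_t = c_prev * f_t + n_t * m_t                                # Eq. 4
    q_t = sigmoid(gate_input(p["w_q"], p["b_q"], v_prev, x_t))    # Eq. 5
    v_t = q_t * math.tanh(c_t)                                    # Eq. 6
    return v_t, c_t


# Hypothetical parameters: all weights and biases zero, two input features.
params = {name: [0.0, 0.0, 0.0] for name in ("w_f", "w_m", "w_n", "w_q")}
params.update({name: 0.0 for name in ("b_f", "b_m", "b_n", "b_q")})

v, c = lstm_step(x_t=[0.3, 0.7], v_prev=0.0, c_prev=1.0, p=params)
print(v, c)
```

With all parameters set to zero, every sigmoid gate evaluates to 0.5 and the candidate value to 0, so the step simply halves the old cell state. A trained model would replace these zeros with learned weights and would use vectors of cells rather than a single one.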

3.2 Model

The inputs used to predict malaria fever were the clinical factors, including asymptomatic cases, and environmental data such as temperature, relative humidity, rainfall, and the vegetative index. The recorded rainfall, the day and night temperatures, and the enhanced vegetation index at a resolution of 1.5 km were each averaged to give monthly values. About 7% of the values were missing; this was handled during pre-processing with multivariate imputation by chained equations. The clinical and environmental data were taken from 2017 to 2019 at each location (Omu-Aran, Oro, and Agbamu) in Irepodun LGA, as presented in Fig. 1.
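The monthly averaging step can be sketched as follows. This is an illustrative Python example, not the pipeline used in the study: the function name `monthly_means`, the record layout, and the sample readings are all hypothetical. Missing readings are simply skipped here, which is a simplification and not the imputation procedure described above.

```python
from collections import defaultdict
from statistics import mean


def monthly_means(daily_records):
    """Average daily readings into one value per calendar month.

    daily_records: iterable of (iso_date, value) pairs, where iso_date is
    a 'YYYY-MM-DD' string and value is a float or None when missing.
    Returns a dict mapping 'YYYY-MM' to the mean of the observed values.
    """
    buckets = defaultdict(list)
    for iso_date, value in daily_records:
        if value is not None:          # skip missing readings (simplification)
            buckets[iso_date[:7]].append(value)
    return {month: mean(values) for month, values in sorted(buckets.items())}


# Hypothetical daily temperature readings in degrees Celsius.
readings = [
    ("2017-01-01", 30.0),
    ("2017-01-02", None),
    ("2017-01-03", 32.0),
    ("2017-02-01", 28.0),
]
monthly = monthly_means(readings)
print(monthly)
```

The same function could be applied separately to rainfall, day temperature, night temperature, and the vegetation index to obtain one aligned monthly series per variable and site.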

The model was trained on the preprocessed data from January 2017 to December 2019. After a number of iterations, the model was then extended over 2 years to provide a monthly malaria abundance forecast for 24 months, based on the historical environmental and clinical variables. The proposed system was implemented on an open-source, cloud-based big data platform using the Apache Spark framework. Apache Kafka was used for real-time streaming and batch processing of the data on Spark. The performance measures for the model were computed, and model results were produced for each geographic location.
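One common way to turn a monthly series into training examples and then into a multi-month forecast is a sliding window combined with recursive prediction. The Python sketch below illustrates that idea under stated assumptions: the paper does not say which windowing or forecasting scheme was used, the helper names are hypothetical, and `predict_next` is a stand-in for a trained LSTM.

```python
def make_windows(series, window):
    """Split a series into (past `window` values, next value) training pairs."""
    pairs = []
    for start in range(len(series) - window):
        pairs.append((series[start:start + window], series[start + window]))
    return pairs


def recursive_forecast(history, predict_next, window, horizon):
    """Forecast `horizon` steps ahead, feeding each prediction back as input."""
    buffer = list(history[-window:])
    forecasts = []
    for _ in range(horizon):
        next_value = predict_next(buffer)
        forecasts.append(next_value)
        buffer = buffer[1:] + [next_value]   # slide the window forward
    return forecasts


# Toy series and a placeholder predictor (last value plus one), for illustration only.
pairs = make_windows([1, 2, 3, 4, 5], window=3)
forecast = recursive_forecast([1, 2, 3], lambda w: w[-1] + 1, window=3, horizon=24)
print(pairs)
print(len(forecast))
```

In recursive forecasting, errors can accumulate because later predictions depend on earlier ones, so a 24-month horizon is usually checked against held-out data where that is available.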

The experiment was split into two phases: i) offline batch processing and ii) online streaming. In the first phase, the data were pre-processed and used for training. Preprocessing was first optimized by applying multivariate imputation by chained equations to handle missing data points in each region’s environmental and clinical time series. The LSTM classifier was then trained on the result. In the second phase, real-time analysis was carried out by gathering batches from the message brokers, and results were predicted for a given duration. The findings were finally stored on the Hadoop Distributed File System for online analysis.
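The second phase, in which records arrive continuously and are analyzed in batches, can be illustrated with a small generator. This Python sketch is only a stand-in for the idea of micro-batching: it does not use the Apache Kafka or Apache Spark APIs, and the function name `micro_batches` and the batch size are hypothetical.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group an incoming stream of records into lists of at most batch_size."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:        # stream exhausted
            return
        yield batch


# Hypothetical stream of seven incoming records, processed three at a time.
batches = list(micro_batches(range(7), batch_size=3))
print(batches)
```

In a real deployment the stream would be a message-broker consumer, and each batch would be passed to the trained model for prediction before the results are written to storage.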

4 Results and Discussion

As described in Sect. 3.2, the experiment was run in two phases: i) offline batch processing, in which the data were pre-processed, missing values were imputed, and the LSTM classifier was trained; and ii) online streaming, in which batches collected from the message brokers were analyzed in real time and results were projected for a given period. The findings were stored on the Hadoop file system for online reporting.

Throughout the analysis, the variables showed a seasonal dependence on environmental and clinical factors. For example, the number of asymptomatic cases and the vegetative index are time series, following [32]. The research therefore included the number of asymptomatic patients and their diagnoses. Using data from January 2017 to December 2019, the study predicted malaria cases for 24 months. The Omu-Aran site reported the highest number of cases, which may be attributed to the large number of people residing in the area and its high population density compared with the other sites. The maximum rainfall was 2.3 mm, and the maximum day and night temperatures differed significantly from those of the other two locations, by 2.5 °C. The results for the vegetative index were also significantly different from those locations. General Hospital, Agbamu reported the smallest number of malaria cases. Two explanations are likely: i) the area’s population is very small relative to the other two sites, and ii) residents may have been using local herbs to manage malaria, which would reduce the number of cases reported at the hospital compared with the other two areas.

Precise health prediction models are very important for improved and more effective care and are useful for planning and decision-making. The results of the study show that clinical and environmental factors are informative and play a crucial role in forecasting malaria. Individuals with asymptomatic malaria were present in the studied region, and this was verified with laboratory tests with high predictive capacity. Environmental factors play a significant role in the prediction of malaria: rainfall normally occurs from around July to October each year, and the incidence of malaria in these months was higher than in the other months of the year. This is attributed to the fact that the rainy season favors the breeding of Anopheles gambiae mosquitoes. Factors such as the presence of a watercourse near a residential area and a higher vegetative index, both correlated with higher vector density, support a higher prevalence of malaria. Table 1 displays the statistical findings for malaria cases at the selected sites.

Table 1. Results of malaria incidences in the selected areas

4.1 Performance Evaluation Metrics

The basic evaluation criteria and assessment are based on the classification accuracy level using True Negative (TN), True Positive (TP), False Negative (FN), and False Positive (FP) (Table 2).

Table 2. Definition of performance evaluation metrics
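The four metrics reported in this section can be computed from the confusion-matrix counts as sketched below. The formulas are the standard definitions of accuracy, sensitivity, specificity, and precision, assumed here to match those in Table 2. The function name and the counts are hypothetical and are not results from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute standard metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,      # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),      # share of actual positives detected
        "specificity": tn / (tn + fp),      # share of actual negatives detected
        "precision": tp / (tp + fp),        # share of predicted positives that are correct
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
print(metrics)
```

The sketch assumes that every denominator is non-zero, which holds whenever both classes occur in the data and the classifier predicts at least one positive case.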

Table 3 shows the performance evaluation for the three study locations. Agbamu has the highest accuracy, 98.34%, with 99.37% sensitivity, 93.78% specificity, and 98.14% precision. Oro has an accuracy of 97.94%, with 98.45% sensitivity, 94.05% specificity, and 97.56% precision. Omu-Aran has the lowest accuracy, 97.47%, with 98.05% sensitivity, 94.67% specificity, and 97.56% precision.

Table 3. Performance evaluation for the metrics

Table 4 compares the proposed method with other machine learning methods. The proposed method outperformed the other methods, possibly because introducing big data into LSTM deep learning yields better accuracy. Random Forest follows with an accuracy of 93.94%, and ANN is the least accurate of all the methods, with an accuracy of 25.83%.

Table 4. Comparison of the proposed method with other machine learning

5 Conclusions

Malaria fever is a social problem because it can cause widespread harm and personal suffering. It is also a worldwide problem faced by many developing nations. Hence, much research has been conducted to minimize the social, economic, and physical losses by predicting the spread of malaria in endemic areas. The purpose of this study was to design a prediction model, using deep learning techniques and big data, that is more appropriate than existing models by using environmental and clinical variables. Environmental and clinical factors play a significant role in the prediction of endemic malaria. Although malaria prediction using environmental and clinical variables with big data analysis is a new approach, it has proved important and very useful in predicting endemic malaria. Accurate predictive models are valuable for allocating medical resources for prevention and for estimating the impact of the disease. More accurate malaria prediction and diagnosis are necessary to improve treatment and its value in practice. The proposed method outperformed the other machine learning methods compared, with up to 98.34% accuracy, 94.67% specificity, and 98.14% precision across the study sites. Future work can cover other areas within the state in order to draw more accurate conclusions about the features that cause malaria within Kwara State. Feature selection algorithms such as genetic algorithms can also be used.