1 Introduction

In the last month of 2019, in the city of Wuhan, China, a local outbreak was identified of an illness presenting with symptoms such as cough, fever, sore throat, shortness of breath, and fatigue, with pneumonia that could evolve into severe acute respiratory syndrome and possibly be fatal. It was soon discovered that the disease was caused by a new coronavirus, SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2), and that its contagiousness and clinical course would make it a threat to health systems around the world. The disease, named COVID-19, quickly spread across four continents, and on March 11, 2020 it was declared a pandemic by the World Health Organization (WHO) [20]. Little more than a year after the pandemic was recognized, COVID-19 has reached more than 130,000,000 people and caused more than 2,800,000 deaths [98]. These numbers continue to grow. The biggest public health crisis in decades had begun. On the one hand, COVID-19 exposed the epidemiological fragility of an extremely connected world, using tourist and commercial routes to reach the most diverse populations. On the other hand, global connection proved its value once again, with scientists from various fields and countries collaborating in the urgent effort to understand, characterize, prevent, detect, and contain the virus. A large body of research thus began on epidemiological and pathophysiological aspects, drug development, virus detection tests, vaccines, and case prediction and control. Science advanced by leaps and bounds: within just one year, a new disease emerged, its causative agent was identified, its genetic material was sequenced, and the first viable vaccines appeared. Despite these achievements, one year after the WHO recognized the pandemic, the disease continues to spread and presents a wide range of possible clinical manifestations.
The virus has new and even more transmissible variants (REF). The most viable form of control since the beginning of the pandemic continues to be case identification, tracking, and contact isolation. The ability to identify the presence of the pathogen plays an important role both in preventing the spread of the disease and in combating it adequately. A delayed diagnosis can postpone proper patient care, hinder recovery, and, above all, allow undiagnosed infected people to circulate in society and spread the virus. The most widely accepted test for diagnosing COVID-19 is RT-PCR (reverse transcription polymerase chain reaction); however, the procedures for this test take several hours [24], and the result can take days to become available. In addition, the virus may be present and transmissible even when the RT-PCR test is negative, depending on how long after infection the test was performed. Understanding more about the behavior of the virus in populations (identifying risk groups and more vulnerable social groups) and about its spatial and temporal spread in a region has been, since the beginning of the pandemic, a factor in reducing the impact of the virus. This understanding has allowed more precise isolation, protection, and vaccination measures, and it has been decisive for the economic, social, and administrative decisions of governments seeking to contain the COVID-19 pandemic. In this context, a relatively new area of public health, digital epidemiology, has gained space and recognition, providing effective monitoring of confirmed cases, cumulative counts, and excess deaths. Moreover, the possibility of using machine learning to make temporal and spatial predictions about the occurrence of COVID-19 has firmly brought artificial intelligence into the healthcare field.
This chapter explores some of the major studies on COVID-19 forecasting by compartmental, statistical, machine learning, and hybrid approaches.

This chapter is organized as follows: in Sect. 2, we present the theoretical basis and a review of compartmental forecasting models; in Sect. 3, we detail the forecasting approaches based on statistical learning and present the basis of the main machine learning methods applied to COVID-19 forecasting, as well as state-of-the-art works selected for their academic relevance, i.e., the number of citations and the impact factor of the journals and books; finally, in Sect. 4, we present our final considerations and general conclusions.

2 Forecasting by Statistical Learning and Compartment Models

With the outbreak of the 2019 coronavirus disease, many researchers became interested in modeling it mathematically. Many of these studies use compartmental models based on differential equations, of which there are two types: ordinary differential equations (ODEs) and partial differential equations (PDEs). The techniques for solving each type of model, and the methods for simulating them numerically, differ.
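As a point of reference for the ODE-based models reviewed below, the classical SEIR system without vital dynamics can be written as follows. The symbols are notation assumed here for illustration (β is the transmission rate, σ the rate of progression from exposed to infectious, γ the recovery rate, and N the constant total population); they are not taken from any single cited study, and each reviewed model adds or modifies compartments.

```latex
\begin{aligned}
\frac{dS}{dt} &= -\beta \frac{S I}{N}, \\
\frac{dE}{dt} &= \beta \frac{S I}{N} - \sigma E, \\
\frac{dI}{dt} &= \sigma E - \gamma I, \\
\frac{dR}{dt} &= \gamma I,
\end{aligned}
\qquad S + E + I + R = N, \qquad R_0 = \frac{\beta}{\gamma}.
```

Summing the four equations gives zero, which is why N stays constant in this basic form.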

The following are some studies that have used mathematical modeling to understand how disease dynamics work and even make predictions using computational techniques associated with these models.

Among the ODE-based compartmental models, Sarkar et al. [85] developed a 6-compartment model that extends the classical SEIR model to predict COVID-19 dynamics. A sensitivity analysis was conducted to identify the parameters with the greatest influence on the infected population. For this purpose, partial rank correlation coefficients (PRCC) were computed for all input parameters with respect to the variable I (infected or symptomatic individuals). The numerical implementation was done in FORTRAN, using the method of least squares (MMQ) to fit the daily cases of the disease.

Suba et al. [89] developed a model based on ODEs and also implemented it by means of the method of least squares. In this work, seven models were developed; the model parameters were found using an Excel spreadsheet and least squares, and the graphs were plotted in MATLAB. The sensitivity analysis used real data from Tamil Nadu. The study obtained good results with simple methods, but the system is sensitive to changes in the basic reproduction number R0, which alter the whole system.

Some other studies also carried out the numerical implementation in MATLAB. Zhong et al. [105] used this software to numerically solve their system of differential equations, and real data were used to predict the number of infected. The study produced predictions of the epidemic under different scenarios, with different levels of anti-epidemic measures and medical care represented by the beta and gamma rates, and handled unreliable data through objective analysis. However, the prediction is limited by the data and their reliability, because data from before January 18, 2020 should be used with caution.

Mandal et al. [59] also used MATLAB to solve the system of differential equations that describes their proposed SEIQR model, using the fourth-order Runge–Kutta method (RK4). The study includes a theoretical analysis and numerical simulation, as well as a stability analysis and an estimate of R0. The prediction is sensitive to some parametric conditions, and since human behavior is uncertain, changes in the parameter space alter the curves of COVID-19 cases. The prediction is therefore short term.

MATLAB was also used by Jiang [37]. This work first used the simulation repository built into the NetLogo software to create a SIR model of virus transmission. The simulation took place in a closed environment (Small World) and assumed no vital dynamics, i.e., no one was born or died of natural causes. The MATLAB function fmincon was used to optimize the parameters of the proposed model, and MATLAB's ode45 function was used to solve the ODE system numerically and fit the curves. The values obtained with this function, as well as the simulated curves, were quite consistent with the real data. The model was built for the USA and for Hubei, China. For the USA, a model without vital dynamics was used due to lack of data.
In reality, the parameters change over time. Data on asymptomatic individuals arrive late, which makes it difficult to establish a SEIR-based model for fitting and prediction. For Hubei, the prediction does not match the real situation. Finally, none of the models separates infected people into isolated and non-isolated individuals, or according to whether they received effective treatment.

Massonis et al. [60] reviewed SIR and SEIR models described by systems of ODEs, evaluating their structural identifiability (the ability to provide insight into their unknown parameters) and observability (the ability to infer unmeasured states). A total of 255 articles were evaluated, 98 with SIR models and 157 with SEIR models, and a list of 36 model structures was compiled. The ability of the models to provide reliable information was assessed using the theoretical concepts of structural identifiability and reliability control. STRIKE-GOLDD, an open-source toolbox, and the GenSSI2 MATLAB toolbox were used as analysis tools; for some models, the Observability Test code in Maple, Identifiability Analysis in Mathematica, SIAN in Maple, and others were also used. Most models found in the literature have identifiable parameters. Allowing for variability in an unknown parameter often improves the observability and/or the identifiability of the model. This work contributes a detailed analysis of the structural identifiability and observability of a large set of compartmental COVID-19 models presented in the recent literature.

To model and predict the evolution of COVID-19 in Brazil, Bastos and Cajueiro [11] proposed two models, SIRD and SIRASD, described by ODEs. MATLAB's ode45 function was again used to solve the ODE system numerically and fit the curves. Although this method controls the error by assuming fourth-order accuracy, it takes its steps using a fifth-order accurate formula.
Data from the Brazilian Institute of Geography and Statistics (IBGE) were used as the initial condition, and the case data came from the Brazilian Ministry of Health (February 25 to March 30, 2020). For the estimation procedure, the loss functions were minimized using the optimize.least_squares method from the SciPy Python library, with the Cauchy loss and a scaling parameter. Notably, although the SIRASD model predicts a higher number of infected than the SIRD model, it also predicts a lower peak of symptomatic infected individuals, who are the ones requiring medical attention. This model is well suited to short-term prediction for Brazil. The methodology was able to estimate the number of asymptomatic individuals, who may not be fully represented in the data. However, because the study was done at the beginning of the pandemic, little data was available and the actual number of infected people was underreported. In addition, the study did not take demographic effects into account and assumed that there was no reinfection. The SIRASD model proved sensitive to the initial condition for asymptomatic individuals. Because too few tests are performed to map the entire population, it is necessary to work with assumptions.
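The robust estimation step described above can be sketched with SciPy's `least_squares`, using `loss="cauchy"` and the scaling parameter `f_scale`. This is a minimal illustration, not the SIRD or SIRASD fit of the cited study: it assumes a much simpler log-linear model of early exponential growth, and the function names, the synthetic counts, and the chosen `f_scale` are all made up for the example.

```python
import numpy as np
from scipy.optimize import least_squares


def log_growth_residuals(params, t, log_cases):
    """Residuals of a log-linear growth model: log c(t) = log_c0 + rate * t."""
    log_c0, rate = params
    return log_c0 + rate * t - log_cases


def fit_growth(t, cases, guess=(0.0, 0.1), f_scale=0.5):
    """Robust fit of (log_c0, rate) using the Cauchy loss.

    f_scale is the residual size (in log units) beyond which a point
    is progressively down-weighted as an outlier.
    """
    result = least_squares(log_growth_residuals, guess, loss="cauchy",
                           f_scale=f_scale, args=(t, np.log(cases)))
    return result.x


# Made-up daily counts: exact exponential growth plus one misreported day.
t = np.arange(30.0)
cases = 10.0 * np.exp(0.2 * t)
cases[15] *= 5.0  # a single gross outlier

log_c0, rate = fit_growth(t, cases)
print(f"estimated initial cases: {np.exp(log_c0):.2f}, growth rate: {rate:.4f}")
```

With the Cauchy loss the misreported day has only a small influence on the estimates, whereas an ordinary squared-error loss would weight it in full. The same call pattern applies when the residual function wraps an ODE solver, at the cost of a harder optimization problem.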

Ivorra et al. [34] modeled the spread of the coronavirus in China taking undetected infections into account. They built a deterministic SEIHRD model, which has low computational complexity and can be analyzed and interpreted using ODE theory. The system is solved numerically with the fourth-order Runge–Kutta method (RK4) using a 4 h time step. Both the Runge–Kutta method and the WASF-GA algorithm were implemented in Java. A deterministic model is advantageous when little data is available, and the methodology was aimed precisely at addressing this limitation. A robust approach to deal with overfitting of the model parameters with respect to the reported data was also created. However, the results are unsatisfactory because the estimation was done at an early stage of the epidemic.
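The fourth-order Runge–Kutta scheme mentioned above can be sketched in a few lines of plain Python. The example below is only an illustration under stated assumptions: it integrates a basic SEIR system rather than the SEIHRD model of the cited study, the parameter values and initial state are invented, and only the 4 h time step is taken from the text.

```python
def seir_rhs(state, beta, sigma, gamma):
    """Right-hand side of a basic SEIR model without vital dynamics."""
    s, e, i, r = state
    n = s + e + i + r
    new_exposed = beta * s * i / n
    return (-new_exposed,
            new_exposed - sigma * e,
            sigma * e - gamma * i,
            gamma * i)


def rk4_step(f, state, dt, *params):
    """Advance an autonomous ODE system by one classical RK4 step."""
    k1 = f(state, *params)
    k2 = f(tuple(x + 0.5 * dt * k for x, k in zip(state, k1)), *params)
    k3 = f(tuple(x + 0.5 * dt * k for x, k in zip(state, k2)), *params)
    k4 = f(tuple(x + dt * k for x, k in zip(state, k3)), *params)
    return tuple(x + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for x, a, b, c, d in zip(state, k1, k2, k3, k4))


def integrate(f, state, days, dt, *params):
    """Return the list of states from t = 0 to t = days in steps of dt."""
    trajectory = [state]
    for _ in range(round(days / dt)):
        state = rk4_step(f, state, dt, *params)
        trajectory.append(state)
    return trajectory


# Hypothetical values: 10,000 people, one initial case, a 4-hour time step.
dt = 4.0 / 24.0
trajectory = integrate(seir_rhs, (9999.0, 0.0, 1.0, 0.0), 120, dt,
                       0.5, 0.2, 0.2)  # beta, sigma, gamma
peak_infectious = max(state[2] for state in trajectory)
print(f"peak number of infectious individuals: {peak_infectious:.0f}")
```

Because the four derivatives sum to zero, each RK4 step preserves the total population up to rounding error, which is a convenient sanity check on any implementation.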

Ambikapathy and Krishnamurthy [6] developed and validated a mathematical model to assess the impact of various scenarios on COVID-19 transmission in India. They proposed a compartmental ODE model incorporating actual case data from 14 countries: China, Italy, Germany, France, USA, UK, Sweden, Netherlands, Austria, Canada, Australia, Malaysia, Singapore, and India. The model was applied to predict transmission in India, and the highest-exposure settings, such as transit stations and shopping malls, were evaluated. It was validated against the infections reported in the adopted period and used to predict future cases in the above countries over a 65-day period (IndiaSim implementation). Different intervention strategies were examined, with lockdown periods of 4, 14, 21, 42, and 60 days. The model captures the infection dynamics in each country to a considerable extent and can predict future cases. Describing the models with an ODE system is advantageous because controls can be applied to the model and their effects computed. Nevertheless, the model suffers from numerical errors because, at the beginning of the disease, the S compartment has a high value while the I and R compartments have very low values. In addition, the model proposed for India assumes no community spread of the disease until the first week of March 2020, and the dynamic prediction interval is limited (110 days). The model will need to be updated.

A SIRD model was also developed to predict the position of the epidemic peak in a predictive analysis of COVID-19 in China, Italy, and France. For stochastic evolution, the Python SciPy package was used. For Italy, the prediction of the epidemic peak with the nonlinear fit strategy is robust. Simulations showed that the recovery rate is the same for China as for Italy, but the infection and mortality rates appear to differ.
This model showed that cultural factors influence the infection rate, which varies from one country to another. Its limitation is sensitivity to the data, so its results change from one country to another. When the numerical fit was adjusted, the data reported for the outbreak in France were found to be still too preliminary to justify a meaningful fit of this kind.

The authors of [48] introduced a SIRD model described by a system of ODEs to analyze the behavior of COVID-19 in the USA, Germany, UK, and Russia. The system was solved using numerical methods, and the data agree well with the model. The model predicts the peak of the epidemic in each country and compares the results obtained. The prediction for Germany was the most optimistic.

In another paper, Khajanchi et al. [42] developed two mathematical models, described by systems of ODEs, of the dynamics of the virus in China, and constructed curves for the number of infected, recovered, and dead. The optimal values of the model parameters, those that best describe the statistical data, were found. World Health Organization (WHO) data were used to obtain the model parameters, yielding good agreement between the statistical data and the model curves. This good fit indicates that the mathematical model is well suited to describing coronavirus infection.

Hamzah et al. [31] developed a framework called CoronaTracker to manage and track COVID-19 data. The framework uses a SEIR compartmental model to predict the COVID-19 outbreak inside and outside China based on daily observations, and it analyzes how news influences people's behavior, both politically and economically. Johns Hopkins University (JHU), World Health Organization (WHO), and Ding Xiang Yuan databases were used as data sources. The data collected in CoronaTracker are available on the data lakes platform.
The SciPy implementation was used for numerical simulation, with odeint for numerical integration. The study showed that the spread of the outbreak is influenced by each country's social policy. The platform has an easy-to-use interface in which citizens can register their feelings and express their opinions about news articles. CoronaTracker can help governments and authorities disseminate articles, provide updates on the situation, and promote good personal hygiene. A limitation of this study is that the Johns Hopkins University (JHU) data lacked an initial number of exposed individuals.

Varotsos and Krapivin [93] created a COVID-19 decision-making system (CDMS) for the USA, Brazil, Russia, and Greece. For the deterministic component, a compartmental SPRD model was created, similar to the classic SIRD model in the literature, described by ODEs and with parameters determined from the reported data. For the stochastic component, the classes were represented with a stability indicator that characterizes the trend of COVID-19 propagation. Numerical evaluations were done by the SARD block using stochastic reports on the state of disease effects. The study showed that temperature and humidity slowly affect the course of the pandemic. The spread of the disease and the loss of income due to the pandemic affect each country differently. Official data from Russia and Greece were analyzed to show the outcomes of the pandemic. The risk of infection and mortality increases with population density. The study is limited by the lack of enough data to make it reliable. In practice, it was impossible to coordinate measures to contain the COVID-19 pandemic under conditions of high uncertainty.

Sadun [81] developed a compartmental SEIR model.
The study develops strategies for estimating the reproduction number R0 and concludes that there is no direct way to measure it. For three versions of the classical SEIR model, the estimated value of R0 depends on the length of the latency period. Published estimates of the reproduction number should therefore be viewed with skepticism, and the latency of COVID-19 needs to be understood. Since R0 cannot be measured directly, what one can do is measure the time scale of the exponential growth of the pandemic and try to estimate R0 from it.

Ndaïrou et al. [67] created the SEIPAHRF model to understand the transmission dynamics of COVID-19 in Wuhan. The model modifies the classical SEIR model by introducing super-spreader (P), asymptomatic (A), hospitalized (H), and fatality (F) classes. To study the basic reproduction number, a generation matrix was used, along with a sensitivity analysis. The local stability of the model was also studied. The theoretical findings and numerical results fit the actual data well and reflect the reality in Wuhan, China. The model can be used to study other countries where outbreaks are growing. However, the data available at the beginning of the study were limited, since it was early in the epidemic.

To model and predict the dynamics of the COVID-19 pandemic in India, Sarkar et al. [85] created a 6-compartment model that extends the standard SEIR model. It separates the coronavirus-infected population from susceptible individuals before clinical symptoms develop. The study also showed that quarantine decreases contact between uninfected and infected individuals, which lowers the contact rate and can effectively reduce R0. A sensitivity analysis was performed to identify the parameters with the greatest influence on the clinically infected population.
This sensitivity analysis evaluated partial rank correlation coefficients (PRCC) for all input parameters with respect to the variable I (infected or symptomatic individuals). The indices were evaluated at six time points: 30, 45, 60, 75, 90, and 100 days before steady state. The analysis identified six of the nine parameters as the most influential. The actual daily COVID-19 data were fitted using the least squares method (MMQ), which locally minimizes the sum of squared errors. The numerical implementation was done in FORTRAN. The model provides an important tool for assessing the consequences of possible policies, including social distancing and lockdown. Because of the short time scale, demographic effects are not considered.

Abou-Ismail [1] focused on explaining and mathematically simplifying three models: SIR, SEIR, and SUQC (susceptible, unquarantined, quarantined, and confirmed). The goal was to understand the nature of the pandemic and to measure the impact of social distancing through mathematical models, by analyzing the systems of ODEs that describe the disease.
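The PRCC technique used in several of the studies above can be sketched as follows: rank-transform the sampled parameters and the model output, remove from each parameter and from the output the linear effect of the other parameters, and correlate the two sets of residuals. The code is a minimal illustration under stated assumptions. The response function, the parameter ranges, and the sample size are invented, and a real analysis would replace `output` with the simulated value of I at a chosen time.

```python
import numpy as np


def rank(values):
    """Ranks 1..n (ties are not averaged, which is fine for continuous samples)."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks


def prcc(samples, output):
    """PRCC of each column of `samples` (n_runs x n_params) with `output`."""
    x = np.column_stack([rank(col) for col in samples.T])
    y = rank(output)
    n, k = x.shape
    coeffs = np.empty(k)
    for j in range(k):
        # Design matrix: intercept plus the ranks of all other parameters.
        others = np.column_stack([np.ones(n), np.delete(x, j, axis=1)])
        # Residuals after removing the linear effect of the other parameters.
        res_x = x[:, j] - others @ np.linalg.lstsq(others, x[:, j], rcond=None)[0]
        res_y = y - others @ np.linalg.lstsq(others, y, rcond=None)[0]
        coeffs[j] = np.corrcoef(res_x, res_y)[0, 1]
    return coeffs


# Hypothetical experiment: three sampled parameters, the third has no effect.
rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=(500, 3))
output = (np.exp(samples[:, 0]) - 2.0 * samples[:, 1]
          + 0.05 * rng.normal(size=500))
coeffs = prcc(samples, output)
print(np.round(coeffs, 2))
```

A coefficient near +1 or -1 indicates a strong monotone influence of that parameter on the output, while a value near 0 indicates little influence once the other parameters are accounted for.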

For mathematical analysis and numerical study, Viguerie et al. [95] created a new framework for understanding compartmental models by means of equilibrium equations similar to those found in continuum mechanics (Lotka–Volterra type). The model is of SEIRD type, and differential equation models were used to derive and analyze R0. For ODE-based models the basic reproduction number R0 is well defined, but its extension to a PDE-based model is not clear because of the influence of diffusion. Therefore, this work derived the ODE version of the PDE model and evaluated its performance with numerical tests. The numerical tests used either the implicit second-order backward differentiation formula (BDF2) or the implicit first-order backward Euler method. Picard linearization was performed at each time step, and the iterative generalized minimal residual method (GMRES) with Jacobi preconditioning was used to solve all linear systems. PDE models are advantageous in that they allow a continuous spatial description of the relevant dynamics, so the dynamics can be described in time and space at all scales. Although models described by ODEs are limited in describing spatial information, they are effective in describing the temporal dynamics of the system. In this model, births and deaths from causes other than COVID-19 are not considered.

Khoshnaw et al. [47] used MATLAB's System Biology Toolbox (SBedit) package to compute the dynamics of the model classes, and thus obtained a better understanding and identification of the key critical model parameters. This made it possible to understand the impact of transmission rate and contact for New York. However, having several different models implies that the critical elements of each model must be identified. Furthermore, the model cannot simply be extrapolated to conditions in another country; its parameters must be estimated for the new conditions.
Using the same MATLAB package to obtain numerical solutions and calculate local sensitivities, Khoshnaw et al. [46] developed another model. Sensitivity analysis was done with the dynamics of the biological system modeled by the law of mass action. The study concluded that the most influential factors in the spread of the coronavirus are: (1) the rate of person-to-person transmission, (2) the rate of quarantined exposure, and (3) the rate of transition from exposed to infected individuals.

MATLAB was also used to numerically solve the compartmental model, described by nonlinear differential equations, proposed by Ahmed et al. [3]. For the logistic model, the fitVirus function was used. Combining mathematical models and computer simulations is an effective tool that provides better understanding and good numerical predictions of the model states. However, in this study the number of exposed people in quarantine becomes stable after 40 days, whereas the number of recovered people increases rapidly and stabilizes slowly. Having many approaches to identifying the estimates and understanding the disease makes the issue murky. However, this study identified the critical parameters of the model, helping to understand the overall issue more effectively and broadly.

Shao et al. [87] performed numerical simulations using MATLAB code. In this study, two time-delayed dynamic models were used to track COVID-19. The time-delayed dynamic coronavirus pneumonia model (TDD-NCP) introduced a delay process into the differential equations to describe the latent period of the epidemic and can be used to predict the trend of the coronavirus outbreak. The Fudan-Chinese Center for Disease Control and Prevention (Fudan-CCDC) model was established to determine the kernel functions in the TDD-NCP model from the public data of the CCDC, and it is suggested that this time delay model be used to fit the real data.
The advantage of the Fudan-CCDC model is that it can track the initial date of the epidemic when I(t0) is provided. Moreover, the model can reconstruct parameters such as the growth rate and the “isolation rate,” and predict the cumulative number of confirmed cases in some cities in China. However, because this work was done in early March, there was still little knowledge about the disease and little data on confirmed cases.

Rajagopal et al. [75] developed a SEIRD model with integer-order and fractional-order differential equations to describe the coronavirus in Italy. The fractional model is of the Caputo type, the most popular and most widely used for real problems. The model parameters are estimated to find their optimal values, and the number of infected, the number of deaths, and the associated root mean square error (RMSE) are also considered. The fractional model gives more realistic predictions and has smaller modeling errors, so it agrees with the actual data from Italy better than the classical model.

Nabi et al. [65] proposed a SCEAQHR model for predicting cases in Cameroon. The model includes a new class for individuals who quarantined imperfectly and disregarded lockdown policies. The model parameters were estimated with real-time data, followed by a projection of the disease evolution. The model is described by Caputo fractional differential equations, and the existence and uniqueness of the solutions are presented. The optimization is based on the trust-region-reflective (TRR) algorithm, which evolved from the Levenberg–Marquardt algorithm. The numerical implementation uses MATLAB's lsqcurvefit function. The partial rank correlation coefficient (PRCC) method was used to quantify the dominant mechanisms. The optimization is robust in solving nonlinear least squares problems.

Roda et al. [78] used the Akaike information criterion (AIC) for model selection and analyzed the predictions of the SIR and SEIR models. The SIR model outperformed the SEIR model in representing the information contained in the confirmed case data. The model was calibrated using a Markov chain Monte Carlo (MCMC) algorithm with data from January 21 to February 4 from the city of Wuhan, China. The authors state that data from before January 23 are unreliable and incomplete. The model is not identifiable, because a group of its parameters cannot be determined solely from the data provided during calibration, and this affects its reliability.
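Model selection with the AIC can be illustrated with a small calculation. The sketch below assumes the common least-squares form of the criterion, AIC = n ln(RSS/n) + 2k up to an additive constant, where RSS is the residual sum of squares, n the number of observations, and k the number of fitted parameters. This form presumes independent Gaussian errors and is not necessarily the likelihood used in the cited study; the counts and the two fits are made up.

```python
import math


def aic_least_squares(observed, predicted, n_params):
    """AIC for a least-squares fit with Gaussian errors (additive constant dropped)."""
    n = len(observed)
    rss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return n * math.log(rss / n) + 2 * n_params


# Made-up confirmed-case counts and two hypothetical model fits.
observed = [10, 20, 40, 80]
fit_a = [11, 19, 41, 79]  # simpler model, assumed to have 2 fitted parameters
fit_b = [10, 19, 41, 79]  # slightly closer fit, assumed to have 4 parameters

aic_a = aic_least_squares(observed, fit_a, n_params=2)
aic_b = aic_least_squares(observed, fit_b, n_params=4)
preferred = "A" if aic_a < aic_b else "B"
print(f"AIC A = {aic_a:.3f}, AIC B = {aic_b:.3f}, preferred model: {preferred}")
```

In this toy case the simpler model is preferred: the small improvement in fit of model B does not offset the penalty for its two extra parameters, which mirrors how a SIR model can be selected over a SEIR model when the data do not support the additional compartment.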

Din et al. [22] presented a new three-compartment model (PIQ), described by PDEs, for COVID-19 transmission. To study stability, the Atangana–Baleanu–Caputo (ABC) fractional derivative of arbitrary order was used. The Banach and Guo–Krasnoselskii fixed point theorems were used to prove the existence of solutions of the model. The numerical simulations were done using the Adams–Bashforth (AB) method with fractional differentiation, a sophisticated and powerful tool for investigating nonlinear problems. The model is shown mathematically to be well defined.

Through a system of ordinary differential equations, the disease is contextualized by social parameters in order to understand how it spreads, how the epidemics that affect society can be controlled, and, from that, how to create preventive measures. Examples of this type of model are the modified SEIR models proposed by Yang et al. [101]; the SEIR (Susceptible, Exposed, Infectious, Recovered) model with an age-structured quarantine class and two types of control measures, used to analyze the effects of control policies on the coronavirus epidemic in Brazil [15]; and the age-structured SEIRQ (Susceptible, Exposed, Infectious, Recovered, Quarantine) model proposed by Gondim and Machado [29]. The latter aims to analyze optimal quarantine strategies in order to support decision-making by health managers.

Regarding statistical epidemiological models, Sarkar et al. [85] propose a mathematical model to monitor the dynamics of six compartments: Susceptible (S), Asymptomatic (A), Recovered (R), Infected (I), Isolated Infected (Iq), and Quarantined Susceptible (Sq), collectively denoted SARIIqSq. The authors applied their proposal to real data on the COVID-19 pandemic in India. Starting from the date of the first COVID-19 case reported in India, they simulated the SARIIqSq model for 260 days for each state and for India as a whole to study the dynamics of the disease. They statistically confirmed that reducing the contact rate between uninfected and infected individuals by quarantining susceptible individuals can effectively reduce the basic reproduction number. They also demonstrate that eliminating the ongoing SARS-CoV-2 pandemic is possible by combining restrictive social distancing with contact tracing. However, the authors also emphasize the uncertainty of the available authentic data, especially concerning the accurate baseline number of infected individuals, which is affected by underreporting and may lead to equivocal outcomes and predictions that are off by orders of magnitude.

Ndaïrou et al. [67] propose a novel epidemiological compartmental model that takes into account the super-spreading phenomenon of some individuals. They also consider a fatality compartment, related to death due to the virus infection. The constant total population size N is subdivided into eight epidemiological classes: Susceptible (S), Exposed (E), Symptomatic and Infectious (I), Super-Spreaders (P), Infectious but Asymptomatic (A), Hospitalized (H), Recovery (R), and Fatality (F). The model reached a reasonably good approximation of the reality of the Wuhan outbreak, predicting a decrease in the daily number of confirmed cases of the disease. It also fits the real data on daily confirmed deaths well. The model can be considered useful for settings other than Wuhan, China, since the number of hospitalized individuals is relevant as an estimate of the Intensive Care Units (ICUs) needed.

Khajanchi and Sarkar [43] developed a new compartmental model to explain the transmission dynamics of COVID-19. They calibrated their model with daily COVID-19 data for four Indian states: Jharkhand, Gujarat, Andhra Pradesh, and Chandigarh. They studied the feasible equilibria of the proposed model and their stability with respect to the basic reproduction number R0. The disease-free equilibrium becomes stable and the endemic equilibrium becomes unstable when the recovery rate of infected individuals increases, but if the disease transmission rate remains higher, the endemic equilibrium always remains stable. The proposed model obtained R0 > 1 for all the Indian states studied, suggesting a significant outbreak. The model is also able to provide short-term COVID-19 forecasts.

Samui et al. [84] proposed a deterministic ordinary differential equation model able to represent the overall dynamics of SARS-CoV-2. They stratified the total human population into four compartments to formulate the SAIU model: susceptible or uninfected individuals (S); asymptomatic individuals (A), who are pauci-symptomatic or clinically undetected; reported symptomatic infectious individuals (I), who are reported by the public health service; and unreported symptomatic infectious individuals (U), who are clinically ill but not reported. This model assumes that infected individuals who have been reported are no longer associated with new infections, as they are isolated or transferred to Intensive Care Units (ICU). Thus, only infectious individuals belonging to I(t) or U(t) spread or transmit the disease. The authors designed the SAIU model to study the transmission dynamics of COVID-19 based on the data available for India from January 30, 2020 to April 30, 2020. Based on the estimated data, the SAIU model predicts the outbreak of COVID-19 and computes the basic reproduction number R0. The authors assessed the sensitivity indices of R0, given that R0 expresses the initial disease transmission and the sensitivity indices describe the relative importance of the various parameters in coronavirus transmission. The SAIU model showed the persistence of the disease for R0 > 1. The endemic equilibrium point E∗, for this study, was locally asymptotically stable for R0 > 1.

Khajanchi et al. [44] extended the classical deterministic susceptible-exposed-infectious-removed (SEIR) compartmental model by introducing contact tracing and hospitalization strategies to study the epidemiological properties of Covid-19. They calibrated their mathematical model using data on confirmed cases in India and estimated the basic reproduction number for the disease transmission. The authors used their calibrated epidemic model for short-term prediction in four provinces and in the Republic of India as a whole. The simulation of the calibrated model captured the increasing growth patterns for three provinces, namely Delhi, Maharashtra, and West Bengal, and for the Republic of India, whereas for the province of Kerala the model fit was not as good as for the other states and for India overall. Model simulation and prediction suggest that Covid-19 has the potential to exhibit oscillatory but controllable dynamics in the near future, provided that social distancing is maintained and home isolation and hospitalization are effective. The proposed model forecasts that isolation or hospitalization of the symptomatic population, under stringent hygiene safeguards and social distancing, is considerably effective. Finally, Khajanchi et al. [44] give evidence that the size and duration of an epidemic can be considerably affected by the timely implementation of a hospitalization or isolation program.

The classic mathematical models of epidemiological prediction are quite useful but deterministic: they describe only the average behavior of the epidemic, which makes it difficult to quantify uncertainty. Wang et al. [97] proposed an analysis of the spatial structure and dynamics of the spread of Covid-19, providing a spatio-temporal prediction of the Covid-19 outbreak in the USA. Kapoor et al. [39] investigated large-scale spatio-temporal prediction using graph neural networks and human mobility data in US counties. Through this method and the spatio-temporal information, the model learns the epidemiological dynamics. Tomar and Gupta [92] proposed a spatio-temporal approach to control and monitor Covid-19 using LSTM (Long Short-Term Memory) neural networks and curve fitting to predict cases. Ren et al. [76] used Ecological Niche Models (ENM) to gather epidemiological and socioeconomic data, aiming to accurately predict the risk areas for Covid-19 infection. Yesilkanat [102] carried out a spatio-temporal study of 190 countries using the Random Forest method and compared the estimates with the actual number of cases of the disease. Also using a spatio-temporal approach, Pourghasemi et al. [70] performed risk mapping, change detection, and trend analysis of the spread of Covid-19 in Iran using regression and machine learning. Roy et al. [79] developed a short-term prediction model for the new coronavirus using canonical ARIMA (Autoregressive Integrated Moving Average), with disease risk analysis performed through weighted overlay analysis in geographic information systems.

3 Forecasting by Machine Learning and Hybrid Approaches

Several efforts have been made to aid Covid-19 screening and monitoring. Dong et al. [25] created an online interactive dashboard to visualize Covid-19 cases and deaths in real time, providing researchers, health authorities, and the general public with a tool to track cases as the disease progresses. With the rapid spread of the coronavirus, the need to classify infected patients and to determine which individuals were more vulnerable to the disease also grew. Therefore, Xie et al. [100] proposed a clinical prediction model for patient mortality based on multivariable logistic regression, to improve the use of limited healthcare resources and to estimate the patient's survival rate. Furthermore, in order to aid diagnosis, Feng et al. [26] developed the online calculator S-COVID-19-P, based on Lasso regression, for early identification of suspected Covid-19 pneumonia on admission of adult patients with fever. Jin et al. [38] proposed a deep learning-based system for the rapid diagnosis of Covid-19 with precision comparable to that of experienced radiologists, which can accurately classify pneumonia, CAP (Community-Acquired Pneumonia), influenza A and B, and Covid-19. They used LASSO to find the 12 most discriminating features for distinguishing Covid-19 from other pneumonias. Gomes et al. [28] proposed a system to support the diagnosis of Covid-19 by analyzing chest X-ray images, capable of differentiating Covid-19 from bacterial and viral pneumonias using texture-based image representation and classification by Random Forests. Unlike other, more complex Covid-19 X-ray feature extraction approaches [7, 8, 12, 19, 33, 35, 45, 53, 54, 63, 66, 96], Gomes et al. [28] avoided deep learning-based solutions and adopted texture and shape features to provide users with a low-cost web-based computational environment able to serve several simultaneous users without overloading network resources.

In order to find a new way to perform early, efficient, and accurate control and screening of suspected individuals, Meng et al. [62] created the Covid-19 Diagnostic Aid APP to calculate the probability of infection from simple, easily obtained laboratory test results. Screening a large number of suspected cases in this way could optimize the diagnostic process and save medical resources. Barbosa et al. [10], considering that RNA testing is not always available in many regions of the world due to the scarcity of inputs, created HegIA, an intelligent system based on Bayesian networks and Random Forests to aid the diagnosis of Covid-19 from the results of 24 blood tests. Its performance is close to that of RT-PCR (Reverse Transcription Polymerase Chain Reaction) for symptomatic individuals, even though it does not search for coronavirus RNA [10]. HegIA is a fully functional system, available for free use, to provide low-cost rapid testing.

Several works have used Evolutionary Computing and Swarm Intelligence methods to automatically adjust compartmental models [61, 71, 73, 83]. Putra and Khozin Mu’tamar [71] automatically estimated the parameters of the Susceptible, Infected, Recovered (SIR) model using the Particle Swarm Optimization (PSO) algorithm. Their results suggest that the proposed method is able to tune SIR models precisely when compared with other analytical approaches. Similarly, Mbuvha and Marwala [61] calibrated a SIR model to South Africa’s reported Covid-19 cases, taking into account several scenarios of the reproduction number R0 for reporting infections and estimating healthcare resources. They assumed that the reported confirmed cases represent between 0.2% and 1% of the total infected population. The authors also assumed that the SIR model parameters are fixed, albeit across multiple ranges. However, they acknowledged the uncertainty around the SIR parameters and proposed, as future work, a Bayesian treatment using Markov Chain Monte Carlo techniques.
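The idea of tuning SIR parameters with a swarm-based optimizer can be sketched as follows. This is a minimal global-best PSO written for illustration under stated assumptions, not the implementation of Putra and Khozin Mu’tamar [71]: the inertia and acceleration coefficients, the parameter bounds, the squared-error objective, and the synthetic "observed" curve are all hypothetical choices.

```python
import random


def sir_infected(beta, gamma, days, i0=0.001, dt=0.25):
    """Daily infected fraction of a classical SIR model (forward Euler)."""
    s, i = 1.0 - i0, i0
    steps = int(round(1.0 / dt))
    series = [i]
    for _ in range(days):
        for _ in range(steps):
            new_inf = beta * s * i * dt
            s -= new_inf
            i += new_inf - gamma * i * dt
        series.append(i)
    return series


def sse(params, observed):
    """Sum of squared errors between simulated and observed infected fractions."""
    beta, gamma = params
    simulated = sir_infected(beta, gamma, len(observed) - 1)
    return sum((a - b) ** 2 for a, b in zip(simulated, observed))


def pso(objective, bounds, n_particles=20, n_iter=60, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal global-best particle swarm optimizer over box-constrained parameters."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * len(bounds) for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]              # personal best positions
    best_val = [objective(p) for p in pos]      # personal best objective values
    g = min(range(n_particles), key=lambda k: best_val[k])
    g_pos, g_val = best_pos[g][:], best_val[g]  # global best so far
    for _ in range(n_iter):
        for k in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[k][d] = (w * vel[k][d]
                             + c1 * r1 * (best_pos[k][d] - pos[k][d])
                             + c2 * r2 * (g_pos[d] - pos[k][d]))
                # Move the particle and clamp it to the search box.
                pos[k][d] = min(hi, max(lo, pos[k][d] + vel[k][d]))
            val = objective(pos[k])
            if val < best_val[k]:
                best_val[k], best_pos[k] = val, pos[k][:]
                if val < g_val:
                    g_val, g_pos = val, pos[k][:]
    return g_pos, g_val


# Synthetic "observed" curve generated from known (hypothetical) parameters.
observed = sir_infected(0.35, 0.12, days=60)
search_box = [(0.05, 1.0), (0.01, 0.5)]  # assumed bounds for beta and gamma
(beta_hat, gamma_hat), err = pso(lambda p: sse(p, observed), search_box)
print(f"beta = {beta_hat:.3f}, gamma = {gamma_hat:.3f}, SSE = {err:.2e}")
```

Because the synthetic data are generated by the same model that is being fitted, this setup only checks that the optimizer can recover known parameters; with real case counts, noise and underreporting would have to be taken into account.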

Qi et al. [73] investigated the influence of daily average temperature (AT) and daily average relative humidity (ARH) on the occurrence of Covid-19 in 31 Chinese provinces, mainly in Hubei. The authors collected daily counts of laboratory-confirmed cases in all provinces in China from the official reports of the National Health Commission of the People’s Republic of China, from December 1, 2019 to February 11, 2020 for Hubei province and from January 20, 2020 to February 11, 2020 for the other provinces. Tibet was not included in the model, since only one case was reported there during the 23-day period. The meteorological data, including the AT and ARH of each provincial capital, were retrieved from Weather Underground. Although this study suggests that both daily temperature and relative humidity influenced the occurrence of COVID-19 in Hubei province and in some other provinces, the association between COVID-19 and AT and ARH was not consistent across the provinces. The authors found spatial heterogeneity in COVID-19 incidence, as well as in its relationship with daily AT and ARH, among provinces in Mainland China.

Salgotra et al. [83] propose prediction models based on genetic programming (GP) for confirmed cases and deaths across the three most affected states in India: Maharashtra, Gujarat, and Delhi. The authors also applied the model to forecast Covid-19 cases in India as a whole. The proposed prediction models are presented as explicit formulas. The authors studied the importance of the prediction variables as well. Statistical parameters and metrics were used to evaluate and validate the evolved models. According to the authors, the genetic programming models proved to be highly reliable for Covid-19 cases in India.

Rahimi et al. [74] present a systematic review of Computational Intelligence algorithms for Covid-19 forecasting. They searched Web of Science (WoS) and Scopus for publications matching the following keywords: forecasting, prediction, Covid-19, and coronavirus. The authors selected 920 technical research articles presenting just algorithmic descriptions, review articles, conference papers, case studies, and works able to provide managerial insights, published until October 10, 2020. The authors focused on papers indexed by the Web of Science. Rahimi et al. [74] categorized the main forecasting works according to the following classification of the algorithms:

  • Simple Moving Average [16] as defined by Maleki and Arellano-Valle [55], Maleki and Nematollahi [58], Zarrin et al. [103], Maleki et al. [56], and Hajrajabi and Maleki [30];

  • Auto-Regressive Integrated Moving Average (ARIMA) [5, 50, 64, 80, 88];

  • Two-piece distributions based on the scale [57];

  • Logistic functions: S-shaped functions to model epidemiological curves [17, 52, 72];

  • Regression Methods [4, 36, 77, 90, 94];

  • Canonical neural networks [27, 64, 91];

  • Deep learning methods based on Convolutional Neural Networks (CNNs) [13, 51, 86];

  • Deep learning methods based on Long-Short Term Memory (LSTM) neural networks [9, 18];

  • Genetic programming [82, 83];

  • Classical and modified compartment models: SIR, SEIR, and SIRD [2, 14, 41, 69].
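As a brief illustration of one family in the classification above, the logistic (S-shaped) function can be sketched as follows. The three-parameter form and the parameter values are assumptions made for illustration and are not taken from any of the cited works: k is the final cumulative number of cases, r the growth rate, and t0 the inflection day, at which cumulative cases reach k/2 and daily new cases peak.

```python
import math


def logistic(t, k, r, t0):
    """Cumulative cases at day t: k / (1 + exp(-r * (t - t0)))."""
    return k / (1.0 + math.exp(-r * (t - t0)))


def daily_new_cases(k, r, t0, days):
    """Day-over-day increments of the logistic curve for days 1..days."""
    return [logistic(t, k, r, t0) - logistic(t - 1, k, r, t0)
            for t in range(1, days + 1)]


# Hypothetical parameters: 10,000 final cases, growth rate 0.2 per day,
# inflection point at day 30.
new_cases = daily_new_cases(k=10_000, r=0.2, t0=30, days=60)
# new_cases[0] corresponds to day 1, hence the offset of one.
peak_day = 1 + max(range(len(new_cases)), key=new_cases.__getitem__)
print(f"cumulative cases at t0: {logistic(30, 10_000, 0.2, 30):.0f}")
print(f"day with most new cases: {peak_day}")
```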

Tamang et al. [91] used artificial neural network-based curve fitting techniques to predict and forecast Covid-19 cases and deaths in India, USA, France, and United Kingdom, considering the progressive trends of China and South Korea. The authors considered three cases to analyze the Covid-19 outbreak: (1) forecasting according to the present trend of rising cases in the different countries; (2) one-week forecasting following the improvement trends of China and South Korea; and (3) forecasting assuming that the improvement trends of China and South Korea had been followed one week earlier. According to the authors, to reduce infection rates and flatten the epidemiological curves, these countries would require fewer days under the forecast that follows China's trend, and more days, with steadier progress, under the forecast that follows South Korea’s trend. In addition, it can also be concluded that, following the trend of China, countries with a greater number of cases could improve in fewer days, possibly with stricter measures of social isolation, distancing, and confinement. South Korea’s trend, in turn, is toward slower and steadier control, which could be more effective in the initial stage, when fewer cases have been reported. All conclusions were drawn from the predictions obtained with a multilayer perceptron artificial neural network. Although the case data used in the study come from reliable sources, the predictions depend on the conditions and techniques applied. Their experimental results suggest that artificial neural networks are able to forecast future cases of the COVID-19 outbreak for practically any country at low error rates.

Huang et al. [32] propose a new multiple-input CNN deep neural network model to predict the cumulative number of confirmed cases of Covid-19. The cumulative number of confirmed cases on the following day is predicted from the values over the previous 5 days of the total number of confirmed cases, total new confirmed cases, total cured cases, total new cured cases, total deaths, and total new deaths. Datasets from seven heavily affected Chinese cities in the provinces of Hubei, Guangdong, and Zhejiang were used for training and forecasting. Data on confirmed cases of COVID-19 from January 23, 2020 to March 2, 2020 were obtained from the media outlet Surging News Network and from the World Health Organization. Two evaluation indexes were used: the mean absolute error (MAE) and the root mean square error (RMSE). According to the authors, the proposed algorithm can quickly use small datasets to establish models with high predictive precision, which is a considerable advantage over other models with similar characteristics. The authors verified and compared different deep learning algorithms, and assessed the accuracy and reliability of the proposed algorithm by predicting the future trend of Covid-19. In addition, experiments for several of the most seriously affected cities in China indicated that the prediction model in this study had the lowest error rate among its tested equivalents. As future work, the authors envisage using deep learning networks with a mixed structure, seeking to build more accurate models that can be applied to more countries.
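The way a 5-day history is turned into training samples for next-day prediction can be sketched as below. This covers only the data preparation step, not the CNN of Huang et al. [32]; the function name, the feature layout, and the toy numbers are hypothetical.

```python
def make_windows(series, window=5, target_index=0):
    """Turn a multivariate daily series into (inputs, target) training pairs.

    series: list of per-day feature lists, e.g. [confirmed, deaths, ...].
    Each input is the `window` consecutive days before the target day; the
    target is feature `target_index` (here: cumulative confirmed cases) on
    the following day.
    """
    samples = []
    for end in range(window, len(series)):
        inputs = [row[:] for row in series[end - window:end]]
        target = series[end][target_index]
        samples.append((inputs, target))
    return samples


# Hypothetical toy data: [cumulative confirmed, cumulative deaths] for 8 days.
toy = [[10, 0], [14, 0], [21, 1], [30, 1], [44, 2], [60, 3], [85, 4], [110, 6]]
pairs = make_windows(toy, window=5)
for inputs, target in pairs:
    print([row[0] for row in inputs], "->", target)
```

Each resulting pair can then be fed to any regressor; a multiple-input network would typically receive each feature column of the window as a separate input branch.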

Distante et al. [23] modeled the spread of Covid-19 using Chinese data and used the model to predict the epidemic curve in each Italian region, which provided better information on the peaks of new daily cases. According to the authors, the forecast portion of the curve allows a better prediction of active cases with the SEIR model, by computing the position of the peak of active cases for each Italian region. Interestingly, training on Chinese data and using the acquired knowledge to forecast the spread of Covid-19 in Italy yielded good forecasting results, considering the mean average precision between official Italian data and the forecast. SEIR models may fit better than other compartment models since they are based on the dynamics of the complete curve. The authors therefore consider the proposed approach valid, since the predictive model learns from the dynamics of Covid-19 in China and exploits this knowledge to predict future daily cases in Italy.

Wieczorek et al. [99] propose a predictive model based on a deep 7-layer neural network trained by the NAdam method to predict the number of infected cases. The authors used a dataset provided by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University on its GitHub page. This dataset is compiled from the following sources: (a) World Health Organization (WHO); (b) European Centre for Disease Prevention and Control (ECDC); (c) DXY.cn; (d) COVID Tracking Project; (e) National Health Commission of the People’s Republic of China (NHC); (f) China CDC (CCDC); (g) Washington State Department of Health; (h) other smaller, regional US health departments. The predictive model was able to predict new cases with very high accuracy, above 99% in some geographic regions. However, the authors noted that analysts should take into account several factors able to influence the epidemiological curves: the behavior of the population in a given region, the behavior of the governments of given countries, and access to knowledge and medical equipment. The neural network-based predictor employs a unified architecture: according to the experimental results, the architecture does not need to be changed for each region or country. Accuracy for most regions is around 87.70%. However, the authors believe that dedicated architectures should be used to account for differences among countries, such as population and government behaviors.

Kırbaş et al. [49] modeled cumulative confirmed COVID-19 cases in eight European countries (Denmark, Belgium, Germany, France, United Kingdom, Finland, Switzerland, and Turkey) using Auto-Regressive Integrated Moving Average (ARIMA), Nonlinear Autoregression Neural Network (NARNN), and Long-Short Term Memory (LSTM) approaches. They evaluated the models with six performance metrics: MSE, PSNR, RMSE, NRMSE, MAPE, and SMAPE. The datasets were acquired from the European Centre for Disease Prevention and Control. Data were taken from the day the first case was seen in each country, so the number of data points varies by country: the series cover 67, 90, 97, 100, 94, 90, 68, and 55 days, respectively, and end on 3 May 2020. According to the results, the LSTM approach performed considerably better than ARIMA and NARNN. The lowest number of cases was observed in Finland during the epidemic, while the highest rate of increase was observed in the United Kingdom. According to the 2-week prospective estimation study, the rate of increase in total cases was expected to decrease slightly in many countries. Since the work was carried out entirely on the basis of statistical data and methodologies, the effects of social distancing and similar measures, compliance with hygiene rules, and lockdowns were ignored. Nevertheless, according to the results on real data, the authors considered the predictions satisfactory.

Pal et al. [68] proposed using the local data trend with a shallow Long Short-Term Memory (LSTM) based neural network combined with a fuzzy rule-based system to predict the long-term risk of a country. The country-specific neural networks are optimized using Bayesian optimization. The authors used a dataset (https://github.com/datasets/covid-19) that included date, country, the number of confirmed cases, the number of recovered cases, and the total number of deaths. These data were combined with weather data (https://darksky.net/): humidity, dew, ozone, precipitation, maximum temperature, minimum temperature, and UV, in order to analyze the effect of weather. The authors considered the mean and standard deviation over different cities of a country. The data span the period from 22-01-2020 to 02-08-2020. The authors propose using country-specific optimized networks for accurate prediction, since this approach seems suitable for small and uncertain datasets. Combining the overall optimized LSTMs, they noticed that shallow networks perform better than deep neural networks. The authors also noticed that the weather data do not affect the forecasting accuracy.

Zeroual et al. [104] performed a comparative study of five deep learning methods to forecast the number of new cases and recovered cases: simple Recurrent Neural Network (RNN), Long short-term memory (LSTM), Bidirectional LSTM (BiLSTM), Gated recurrent units (GRUs), and Variational AutoEncoder (VAE). These methods were applied to the global forecasting of Covid-19 cases based on a small volume of data. The study is based on daily confirmed and recovered cases collected from six highly impacted countries, namely Italy, Spain, France, China, USA, and Australia. The parameter values of the deep learning models are selected so that the loss function is minimized during training. The authors adopted the Adam optimizer. In the testing stage, the previously constructed models with the selected parameters are used to forecast the number of COVID cases. The accuracy of the models was verified by comparing the forecast data with real data via different statistical indicators, including RMSE, MAE, MAPE, and RMSLE (Root Mean Squared Log Error). The datasets cover the period from the start of COVID-19 records for the respective countries, i.e., 22 January 2020, until 17 June 2020. These datasets are made publicly available by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (https://github.com/CSSEGISandData). Results demonstrate that the Variational AutoEncoder achieved the best forecasting performance in comparison to the other models.
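The four indicators named above have standard definitions, which can be sketched as follows. The formulas are the usual textbook forms and are assumed here, since the exact conventions adopted by Zeroual et al. [104] are not restated in this review; the sample values are hypothetical.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmsle(actual, predicted):
    """Root mean squared log error, using log(1 + x) so that zeros are allowed."""
    return math.sqrt(
        sum((math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted))
        / len(actual)
    )


# Hypothetical daily case counts and forecasts.
actual = [100, 200, 400]
predicted = [110, 190, 400]
for name, metric in [("MAE", mae), ("RMSE", rmse), ("MAPE", mape), ("RMSLE", rmsle)]:
    print(f"{name}: {metric(actual, predicted):.4f}")
```

MAE and RMSE are expressed in the units of the data, so they are hard to compare across countries with very different case counts; MAPE and RMSLE are relative measures and are less sensitive to scale.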

Kapoor et al. [40] propose a novel spatio-temporal forecasting approach for Covid-19 case prediction based on Graph Neural Networks and mobility data. Unlike time series forecasting models, the proposed model learns from a single large-scale spatio-temporal graph, where nodes represent the region-level human mobility, spatial edges represent the human mobility-based inter-region connectivity, and temporal edges represent node features through time. The authors applied their method to the US county-level COVID-19 dataset. They observed that the spatial and temporal information leveraged by the graph neural network allows the model to learn considerably complex dynamics. They reported a 6% reduction in RMSLE and an absolute Pearson correlation improvement from 0.9978 to 0.998 in comparison with state-of-the-art models. According to the authors, graph-based deep learning approaches combined with mobility data can be very useful for understanding the spread and evolution of Covid-19.

de Lima et al. [21] proposed a system for real-time surveillance, forecasting, and spatial visualization of Covid-19, named COVID-SGIS. As a case study, the forecasting system was applied to monitor Brazil. The system captures routinely reported Covid-19 information for the 27 federative units from the Brazil.io database. It uses data on confirmed Covid-19 cases notified through Brazil’s National Notification System, SINAN, from March to May 2020. ARIMA time series models were integrated to forecast the cumulative number of Covid-19 cases and deaths. The outputs include 6-day forecasts, presented graphically for each federative unit in Brazil separately, with the corresponding 95% confidence interval. The worst and the best scenarios are both presented. The overall percentage error between the forecast values and the actual values varied between 2.56% and 6.50%. For the days when the observed values fell outside the forecast interval, the percentage errors in relation to the worst-case scenario were below 5%. Considering the good results obtained with the proposed tool, the authors claimed that the proposed method for dynamic forecasting may be used to guide social policies and plan direct interventions in a cost-effective, concise, and robust manner.
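A 6-day forecast of cumulative cases with an approximate 95% interval can be sketched as follows. The ARIMA order used in COVID-SGIS is not stated in this review, so the ARIMA(1,1,0)-with-drift model, the least-squares fitting, the function names, and the toy series below are assumptions for illustration; the interval ignores parameter uncertainty, and a production system would normally rely on a dedicated statistics library.

```python
import math


def fit_arima_110(series):
    """Fit ARIMA(1,1,0) with drift by least squares on the first differences.

    Model on differences d_t = y_t - y_{t-1}:  d_t = c + phi * d_{t-1} + e_t.
    Returns (c, phi, sigma), where sigma is the residual standard deviation.
    """
    d = [b - a for a, b in zip(series, series[1:])]
    x, y = d[:-1], d[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    phi = sxy / sxx if sxx > 0 else 0.0
    c = my - phi * mx
    resid = [yi - (c + phi * xi) for xi, yi in zip(x, y)]
    sigma = math.sqrt(sum(e * e for e in resid) / max(n - 2, 1))
    return c, phi, sigma


def forecast(series, horizon=6, z=1.96):
    """Return a list of (point, lower, upper) forecasts for the next `horizon` days."""
    c, phi, sigma = fit_arima_110(series)
    level, diff = series[-1], series[-1] - series[-2]
    psi, var_sum, out = 0.0, 0.0, []
    for _ in range(horizon):
        diff = c + phi * diff        # forecast of the next daily increment
        level += diff                # integrate back to cumulative cases
        psi = 1.0 + phi * psi        # psi_j = 1 + phi + ... + phi^j
        var_sum += psi ** 2          # forecast variance = sigma^2 * sum of psi_j^2
        half = z * sigma * math.sqrt(var_sum)
        out.append((level, level - half, level + half))
    return out


# Hypothetical cumulative case counts for ten consecutive days.
cases = [100, 108, 121, 130, 146, 155, 171, 180, 199, 210]
for day, (point, low, high) in enumerate(forecast(cases), start=1):
    print(f"day +{day}: {point:.1f} (95% interval {low:.1f} to {high:.1f})")
```

The interval widens with the forecast horizon because the errors of the daily increments accumulate in the cumulative series, which is one reason why short horizons such as six days are attractive for operational use.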

4 Conclusions

COVID-19 assumed pandemic status soon after it was discovered, as it spread to several countries around the world. It drew attention for its ease of transmission and for exposing the vulnerabilities of health systems around the world. The individuals who were infected and their families were left with pain and suffering and with the after-effects of the disease. Although there are vaccines, there is still no proven effective drug against the disease, so following safety protocols and practicing social isolation remain indispensable. In addition to hygiene practices such as the use of masks and hand-washing, the use of models to understand the behavior of the disease and even to predict it helps to shed light on the next steps to be taken in this pandemic.

The representation of a disease through mathematical models facilitates monitoring and can help analyze and understand disease dynamics through key characteristics. Through a system of equations, it is possible to model a disease and contribute to a quantitative understanding of it. These characteristics become useful information on how the spread of the disease works. They also make it possible to understand how to build temporal predictions and thus help to create measures to control and prevent COVID-19. However, this type of modeling requires some assumptions, such as assuming that disease transmission occurs homogeneously, or selecting only one among several climatic factors. These assumptions limit the model’s ability to predict, and if more features were added, the model would lose robustness.

With the large amount of data available, and thanks to advances in processing speed and storage technologies, Artificial Intelligence is increasingly strong and present in several areas. As a result, the use of machine learning techniques to obtain insights from these data has grown. These models are applied in several areas: in economics, to predict the performance of a stock in the stock market; in banking; in e-commerce, to determine whether a customer will like a product; and in health, as in the present work on digital epidemiology. However, models with this kind of approach are black boxes, that is, they are not intelligible to experts, because their goal is only to correctly map inputs to outputs. Another approach using Artificial Intelligence relies on hybrid models, i.e., models that combine machine learning with statistical models. In doing so, they combine the advantages of each type of model in order to obtain a more robust prediction model. Another type of hybrid model combines compartmental models and machine learning. With this approach, the model does not have as good a prediction quality as purely machine learning-based models; however, the compartmental component aids human understanding of the epidemiological phenomena, while the machine learning component can return accurate predictions, so that the combination supports both human learning and accurate prediction. All these approaches are very important to support temporal and spatio-temporal prediction of cases and deaths, since these solutions can shed light on strategies to assist decision-making by health managers.

Finally, COVID-19 brings with it all the challenges of a new disease with only 1 year of existence. In facing this unknown, science makes use of its whole arsenal. At a time when there is no extensive background to teach how the disease behaves, daily experience determines the adjustment and creation of clinical protocols. Predicting the temporal and spatial behavior of COVID-19 through machine learning becomes a valuable tool to guide strategies, policies, and hope.