Introduction

The newly discovered coronavirus disease (COVID-19) is an infectious disease with common symptoms such as fever, tiredness, body aches, nasal congestion, runny nose, sore throat, diarrhoea and dry cough. The outbreak of COVID-19 began in China in December 2019, with the first case found in Wuhan. In March 2020, the World Health Organization (WHO) declared it a pandemic, exposing the world to a public health emergency.

Governments of various countries across the globe have taken several preventive measures to limit the spread of COVID infection. These measures include social distancing, wearing masks and face shields, frequent sanitization, quarantine after travel and home isolation in case of suspected symptoms. In India, the first confirmed case was reported in Kerala on 30-Jan-2020. The patient, a student at Wuhan University who had travelled to India, tested positive for the virus. India’s first death was confirmed in Karnataka (Kalaburgi) on 12-Mar-2020. The man, who had returned from Saudi Arabia and had a history of hypertension, diabetes and asthma, succumbed to the disease.

Some of the worst affected cities in India are Delhi, Mumbai, Chennai and Bangalore. Bangalore’s first case was confirmed on 08-Mar-2020, when a software engineer who had returned from Austin, US, along with his wife and daughter, tested positive. The first death was that of a 70-year-old woman from Chikkabalapur in Bangalore, on 24-Mar-2020. She had travelled to Mecca and arrived in India on 14 March.

Chennai’s first COVID case was confirmed on 07-Mar-2020, in a man who had travelled to Chennai from Oman. He was admitted to hospital on 05-Mar-2020 with complaints of fever and cough, and the reports on 7 March showed that he was positive. Chennai’s first deaths were confirmed on 05-Apr-2020, when a 71-year-old man from Ramanathpuram and a 60-year-old man from Washermanpet died at the Government hospital, Chennai.

Mumbai’s first case was confirmed on 11-Mar-2020, when a couple from Andheri tested positive after returning from a trip to Dubai and Abu Dhabi. The first death was confirmed on 17-Mar-2020, that of a 71-year-old man with a history of high blood pressure who had returned from Dubai. He developed pneumonia, inflammation of the heart muscles and an increased heart rate, leading to death.

A 45-year-old person from East Delhi, with a history of travel from Italy, was the first confirmed case in Delhi on 02-Mar-2020. Delhi’s first death was that of a 68-year-old woman who contracted the virus from her son, who had returned from Switzerland. It was confirmed on 14-Mar-2020.

Researchers are trying to contribute in every possible way towards a solution. While conventional methods are precise and deterministic, artificial intelligence (AI) techniques can yield high-quality predictive models. In this study, the authors use a publicly available dataset that contains information on positive, recovered, active and death cases in four metropolitan cities of India over 97 days (from 26 April to 31 July 2020). A detailed statistical analysis has been performed on the data procured. Because the data show a strong relationship with values from past days, an autoregressive model for active cases with the previous 10 lag days as input has been developed, together with an artificial neural network (ANN) model to capture any non-linearity in the data. Both models have been verified in the training as well as the testing periods. Methods that lead to reliable prediction of the spread of COVID-19 would be of great help in taking preventive measures to minimize its spread, deaths and active cases and to maximize recoveries.

Literature

Ahmed [1] carried out an exhaustive review of the epidemiological evidence, clinical manifestations, investigations and treatment of COVID patients admitted to various hospitals in Wuhan city and other parts of China.

In an effort to expand screening capacity, Luo et al. [2] reviewed advances and challenges in the rapid detection of COVID-19 by targeting nucleic acids, antigens or antibodies. They summarized some of the effective treatments and vaccines against COVID-19. They also discussed the possible reduction of viral progression following ongoing clinical trials of interventions.

Poletto et al. [3] published a comment highlighting some of the important discoveries that resulted from applying predictive modelling to diverse data sources. These results had an impact on clinical and policy decisions.

The resource in [4] provides expert-curated information on SARS-CoV-2 (the novel coronavirus) and COVID-19 (the disease) that helps the research and health communities to work together. All these resources are free to access and include clear guidelines for clinicians and patients.

Shah et al. [5] proposed a generalized SEIR model of COVID-19 to study the behaviour of its transmission under different control strategies. The model considered all possible cases of human-to-human transmission and formulated the reproduction number to analyze the accuracy of the transmission dynamics of the coronavirus outbreak. Further, they applied optimal control theory to demonstrate the impact of various intervention strategies: quarantine, isolation of infected individuals, immunity boosters and hospitalization.

An epidemic model describing the spread of COVID-19 in a population was formulated by Arino and Portet [6]. This model considered an Erlang distribution of sojourn times in the incubating, symptomatically infectious and asymptomatically infectious compartments.

Fong et al. [7] proposed a methodology that embraces three virtues: (1) augmenting the existing data, (2) using a panel selection to pick the best forecasting model from many existing models and (3) tuning the parameters of an individual forecasting model so that the accuracy of data mining is as high as possible.

Shah et al. [8] further attempted to assess the impact of inter-state travel, foreign travel and public health interventions imposed by the US Government in response to the COVID-19 pandemic. They developed a compartmental model with disjoint, mutually exclusive compartments to study the transmission dynamics of the coronavirus. The formulation of the system of non-linear differential equations, the computation of the basic reproduction number \(R_{0}\) and the stability of the model at the equilibrium points were discussed in detail.

A visionary perspective on data usage and management for infectious diseases is provided by Wong et al. [9]. They highlighted that there is ample opportunity for researchers to use artificial intelligence methods to enable reliable, data-oriented disease monitoring in this information age. They concluded that AI methods, together with reliable data management platforms, will enable effective analysis of infectious disease and will provide surveillance data to support risk and resource analysis for government agencies, healthcare service providers and medical professionals in the future.

Dey [10] developed a time series model for the total number of infected cases in India, considering data from 3rd to 7th March 2020. Two models developed during the initial days of COVID were discarded because they lost their statistical validity. A third model, a third-degree polynomial, has remained stable since 8 April, with \(R^{2} > 0.998\) consistently. This model is used for forecasting the total number of confirmed COVID cases, after a cautionary discussion of triggers that would invalidate the model.

Hu et al. [11] proposed artificial intelligence (AI)-inspired methods for real-time forecasting of COVID-19, estimating the size, length and ending time of the epidemic across China. They developed a modified stacked auto-encoder for modelling the transmission dynamics of the epidemic and applied it to forecast the confirmed cases of COVID-19 across China in real time. The data for this study, collected from WHO, covered 11 January to 27 February 2020.

Car et al. [12] transformed a time series dataset into a regression dataset and used it to train a multilayer perceptron (MLP) artificial neural network (ANN). By training on this dataset, they tried to obtain a worldwide model of the maximal number of patients across all locations in each time unit. Hyperparameters were varied using a grid search algorithm, and a total of 48,384 ANNs were trained. Their results showed high robustness for the deceased patient model, good robustness for the confirmed patient model and low robustness for the recovered patient model.

For our present study, we collected data from [13,14,15].

Data

Various sources track the coronavirus data. They are updated at different times and gathered in different ways, so the data may differ from source to source. As of 21 August 2020, the WHO website reports 21,294,845 confirmed cases and 761,779 confirmed deaths, with a total of 216 countries/territories affected by this respiratory disease. The revised WHO guidelines on public health surveillance for COVID-19 (13-08-2020) emphasize the importance of collecting metadata for the analysis and interpretation of surveillance data. The data has been collected from [13,14,15] to study the trend in four urban cities, Bangalore, Delhi, Chennai and Mumbai, along with India as a whole. Four major parameters, namely confirmed, recovered, active and death cases, have been considered.

The basic statistics of the data for the period of 113 days (26 April to 17 August) for four metropolitan cities of India and for India as a whole are given in Table 10.1. The data has been plotted in Fig. 10.1 to reveal any visible pattern and to give a visual impression of the distribution.

Table 10.1 Basic statistics of COVID-19 data for all different types such as positive, active, recovered and death cases of all India and its four important metropolitan cities for the period of 113 days from 26 April to 17 August

Fig. 10.1 Observed data plot of active cases in India and four metropolitan cities, namely Bengaluru, Chennai, Mumbai and Delhi, for the period of 113 days (Source: Own)

The basic statistics in Table 10.1 show that the data considered for the present study are non-Gaussian: in every case the skewness is not close to 0, except for all India, and even there the kurtosis is not close to 3. This non-Gaussian nature is the most important property of the COVID-19 data. Hence, even though the mean and the standard deviation are valuable descriptors, the assumption of normality (Gaussianity) does not apply when the severity of the spread is in question. This can further be seen in the form of the probability distributions. The data of the four cases are normalized using the relation \(D_{i} = \left( {d_{i} - m_{i} } \right)/s_{i}\) where \(D_{i}\) is the normalized data, \(d_{i}\) is the actual data, \(m_{i}\) is the average of the given sample length, and \(s_{i}\) is the standard deviation of the given sample length for each i for the period of 113 days from 26 April to 17 August. Because this normalization is only a linear shift and rescaling, the skewness and kurtosis remain the same after it is applied. The normalized data has been plotted as a histogram to reveal patterns in the distribution, so that the model can be chosen appropriately; the result is shown in Fig. 10.2.

Fig. 10.2 Histogram plot of COVID-19 active cases of all India and four metropolitan cities mentioned in the diagram (Source: Own)
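The normalization \(D_{i} = (d_{i} - m_{i})/s_{i}\) and the moment check described above can be illustrated with a minimal sketch. The series `cases` is made up for illustration and is not the chapter's data, the function names are hypothetical, and the population (divide by n) form of the moments is an assumption, since the text does not say which estimator was used.

```python
import math

def moments(x):
    """Return mean, population standard deviation, skewness and kurtosis."""
    n = len(x)
    m = sum(x) / n
    s = math.sqrt(sum((v - m) ** 2 for v in x) / n)
    skew = sum(((v - m) / s) ** 3 for v in x) / n
    kurt = sum(((v - m) / s) ** 4 for v in x) / n  # a Gaussian has kurtosis 3
    return m, s, skew, kurt

def normalize(x):
    """Apply D = (d - m) / s to every value of the sequence."""
    m, s, _, _ = moments(x)
    return [(v - m) / s for v in x]

# Made-up, right-skewed counts (assumption: illustrative only).
cases = [120, 135, 150, 180, 230, 310, 450, 700, 1100, 1800]

m, s, skew, kurt = moments(cases)
z = normalize(cases)
zm, zs, zskew, zkurt = moments(z)

print(f"raw:        skewness={skew:.3f} kurtosis={kurt:.3f}")
print(f"normalized: skewness={zskew:.3f} kurtosis={zkurt:.3f}")
```

The normalized series has mean 0 and standard deviation 1, while its skewness and kurtosis match those of the raw series, which is why normalization cannot make skewed data Gaussian.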

A common assumption for any time series is that it is a stationary random process. This helps in defining the long-term average and long-term deviation, which serve as reference values in modelling and forecasting exercises. For a stationary process, basic statistical parameters such as the mean and standard deviation over a long period are independent of time. If they vary widely over time, the stationarity assumption is not valid. Figure 10.3 shows the non-stationarity of the long-term average and long-term deviation of the COVID-19 active cases data for all India and its metropolitan cities, obtained by changing the sample length. In all cases, the average does not settle to a constant value as the number of samples increases. The data considered for modelling in the present study are the active cases of all India and the four metropolitan cities mentioned in Table 10.1.

Fig. 10.3 Running mean and the standard deviation of COVID-19 data of the active cases for the data from 26 April to 17 August of 2020 for all India and four metropolitan cities of India (Source: Own)
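The running-statistics check behind Fig. 10.3 can be sketched as follows: the mean and standard deviation are recomputed as the sample length grows, and a stationary series should settle to constant values. Both series below are made up for illustration, and `running_stats` is a hypothetical helper, not the authors' code.

```python
import math

def running_stats(x):
    """Mean and population standard deviation of x[:n] for n = 1..len(x)."""
    means, sds = [], []
    total = 0.0
    total_sq = 0.0
    for n, v in enumerate(x, start=1):
        total += v
        total_sq += v * v
        mean = total / n
        var = max(total_sq / n - mean * mean, 0.0)  # guard against round-off
        means.append(mean)
        sds.append(math.sqrt(var))
    return means, sds

# Made-up series (assumption): steady 5% daily growth versus a flat, oscillating one.
growing = [100 * 1.05 ** t for t in range(60)]
flat = [100 + (5 if t % 2 else -5) for t in range(60)]

g_mean, g_sd = running_stats(growing)
f_mean, f_sd = running_stats(flat)

print("growing: mean after 30 and 60 days:", round(g_mean[29], 1), round(g_mean[59], 1))
print("flat:    mean after 30 and 60 days:", round(f_mean[29], 1), round(f_mean[59], 1))
```

For the growing series the running mean keeps drifting upward with sample length, which is the behaviour the text describes for the active cases; for the flat series it settles.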

Methodology

To model any data, one has to understand the hidden structure or pattern in it. To understand the active cases data well, the data has been normalized using its own mean and standard deviation, and a detailed analysis has been carried out to reveal the pattern. The autocorrelation function indicates a strong connection between the data and its lags up to at least 10 days, as shown in Fig. 10.4. The data has been divided into a training period (26 April to 31 July) with sample size 97 and a testing period (1 August to 31 August) with sample size 31. In the training period, each model is fitted with the number of parameters kept below 50% of the sample size; otherwise, the model would amount to a polynomial fit over the entire length. The model is validated using measures such as the root mean square error and the coefficient of determination. If all the measures stay well within the confidence bands, the model is applied and checked again in the testing period.

Fig. 10.4 Sample autocorrelation function of COVID-19 active cases of all India and its four important metropolitan cities (Source: Own)
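A sample autocorrelation function of the kind plotted in Fig. 10.4 can be computed with the short sketch below. It uses the standard biased estimator (each lag divided by the lag-0 sum of squares), which is an assumption, since the text does not state the estimator. The linear-trend series is made up and only stands in for a steadily rising count.

```python
def sample_acf(x, max_lag):
    """Sample autocorrelation r_k for k = 0..max_lag (biased estimator)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(max_lag + 1)
    ]

# Made-up series of 97 values with a steady upward trend (assumption).
series = [1000 + 50 * t for t in range(97)]
acf = sample_acf(series, max_lag=10)

for lag, r in enumerate(acf):
    print(f"lag {lag:2d}: r = {r:.3f}")
```

For a trending series like this one the autocorrelation decays slowly with lag and is still above the 0.6 level quoted in the text at lag 10, which is the kind of pattern that motivates including ten lags.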

Model 1: Auto-Regressive (AR) Model

Based on the autocorrelation function plotted in Fig. 10.4, an auto-regressive (AR) model for the present active cases is constructed, with the past 10 lag days as the variables in the regression equation. This choice reflects the strong correlation that exists up to 10 lag days, beyond which it starts to decrease. For the sample size considered for modelling, a correlation of 0.6 is taken as significant. Hence, information up to lag 10 has been incorporated in modelling the active cases of all India and the four metropolitan cities considered in the present study. As mentioned in the previous section, the regression model has been trained on the sample of 97 days (from 26 April to 31 July) using the following equation:

$$\begin{aligned} A_{t} & = B_{0} + B_{1} A_{t - 1} + B_{2} A_{t - 2} + B_{3} A_{t - 3} + B_{4} A_{t - 4} + B_{5} A_{t - 5} \\ & \quad + B_{6} A_{t - 6} + B_{7} A_{t - 7} + B_{8} A_{t - 8} + B_{9} A_{t - 9} + B_{10} A_{t - 10} \\ \end{aligned}$$
(10.1)

Here, \(A_{t}\) is the number of active cases on day t. The parameters of Eq. (10.1) obtained while training the model for all India and the four metropolitan cities are given in Table 10.2.

Table 10.2 Parameters of the regression Eq. (10.1) for all India and four metropolitan cities of India
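As an illustration of how the parameters of Eq. (10.1) might be estimated, the sketch below fits the eleven coefficients (intercept plus ten lag coefficients) by ordinary least squares with NumPy. The fitting method is an assumption, since the text does not state how the regression was solved. The training series is synthetic, and `fit_ar` and `predict_next` are hypothetical helper names.

```python
import numpy as np

def fit_ar(series, p=10):
    """Least-squares estimate of [B0, B1, ..., Bp] for an AR(p) model."""
    a = np.asarray(series, dtype=float)
    # The row for day t holds [1, A_{t-1}, A_{t-2}, ..., A_{t-p}].
    rows = [np.concatenate(([1.0], a[t - p:t][::-1])) for t in range(p, len(a))]
    X = np.array(rows)
    y = a[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict_next(history, coef):
    """One-step-ahead value from the last p observations in history."""
    p = len(coef) - 1
    lags = np.asarray(history[-p:], dtype=float)[::-1]
    return float(coef[0] + coef[1:] @ lags)

# Synthetic 97-day training series (assumption: not the chapter's data).
rng = np.random.default_rng(0)
days = np.arange(97)
train = 1000 * np.exp(0.03 * days) + rng.normal(0, 20, size=97)

coef = fit_ar(train, p=10)
fitted = np.array([predict_next(train[:k], coef) for k in range(10, 97)])
cc = np.corrcoef(fitted, train[10:])[0, 1]

print("number of parameters:", coef.size)
print("correlation between fit and data:", round(float(cc), 4))
```

With 11 parameters and a training sample of 97 days, the model stays well below the 50% parameter limit mentioned in the methodology.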

The comparison between the actual data, i.e. the number of active cases, and the AR model fit is shown in Fig. 10.5. Basic statistics such as the average and standard deviation of the data have been compared with those of the model, and measures such as the correlation coefficient (CC) have been computed and listed in Table 10.3. Figure 10.5 shows that the model matches the observed data closely; hence, the same model can be checked on the testing period data before it is used in forecasting.

Fig. 10.5 Comparison between the observed data and the model fit of active cases of all India and four metropolitan cities (Source: Own)

Table 10.3 Comparison between the observed data and the model parameters for the training period

Model 2: Artificial Neural Network (ANN)

The non-Gaussianity of the data cannot be ignored, as there may be some non-linearity hidden in the data; this is visible in the data distributions plotted as histograms. Hence, an artificial neural network (ANN) model has been used, with 6 lag days as the inputs in the input layer, one hidden layer and one output in the output layer. The network has been trained in the training period using the back-propagation algorithm with the sigmoid activation function. In total, 49 weights are used while training the network. The network used for modelling is shown in Fig. 10.6. In the diagram, I represents the input layer, H represents the hidden layer, and O represents the output layer.

Fig. 10.6 Network used with six inputs in the input layer (I), one hidden layer with six neurons (H) and one output in output layer (O) (Source: Own)
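The network in Fig. 10.6 can be sketched as a single forward pass, together with a parameter count. One way to arrive at the 49 weights quoted in the text is to count the bias terms along with the connection weights of a fully connected 6-6-1 network; this reading is an assumption, as is the use of a sigmoid at the output. The weights below are random placeholders, the inputs are made-up values scaled to (0, 1), and back-propagation training is not shown.

```python
import math
import random

N_IN, N_HID, N_OUT = 6, 6, 1

def count_parameters(n_in, n_hid, n_out):
    """Connection weights plus biases of a fully connected n_in-n_hid-n_out net."""
    return (n_in * n_hid + n_hid) + (n_hid * n_out + n_out)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_ih, b_h, w_ho, b_o):
    """Forward pass: inputs -> sigmoid hidden layer -> single sigmoid output."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_ih, b_h)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_ho, hidden)) + b_o)

# Random placeholder weights (assumption: untrained, for illustration only).
random.seed(0)
w_ih = [[random.uniform(-0.5, 0.5) for _ in range(N_IN)] for _ in range(N_HID)]
b_h = [random.uniform(-0.5, 0.5) for _ in range(N_HID)]
w_ho = [random.uniform(-0.5, 0.5) for _ in range(N_HID)]
b_o = random.uniform(-0.5, 0.5)

lags = [0.10, 0.12, 0.15, 0.19, 0.24, 0.30]  # six lagged values, scaled to (0, 1)
n_params = count_parameters(N_IN, N_HID, N_OUT)
output = forward(lags, w_ih, b_h, w_ho, b_o)

print("trainable parameters:", n_params)
print("network output (scaled units):", round(output, 4))
```

Because a sigmoid output lies between 0 and 1, the active case counts would need to be scaled into that range before training and scaled back afterwards.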

The network has been tried on the active cases data of only two regions, all India and Bengaluru, for comparison with the regression model of the previous section. The comparison between the actual data and the network model for these two cases is shown in Fig. 10.7.

Fig. 10.7 Comparison between the actual data and the model fit using ANN model

The comparison of moments for the ANN model is shown in Table 10.4. The agreement between the actual data and the network fit is good, even though only 6 lag days were used as inputs when training the network over the entire training length. Both Model 1, the auto-regressive model, and the ANN model perform well in the training period; they must also be checked in the testing period. The model that performs better in the testing period can be considered for future forecasting.

Table 10.4 Comparison between the observed data and the model parameters for the training period

Results and Discussion

Both the AR model and the ANN model perform nearly the same in the modelling or training period. In this section, the two models are compared in the testing period. Each model has its own advantages and disadvantages, and the model of interest is the one that performs better in the testing period of 31 days (1 Aug to 31 Aug). The AR model has been tested for all five regions considered in the present study, whereas the ANN model has been applied to only two regions (all India and Bengaluru). Table 10.5 shows the comparison between the AR model and the ANN model for these two regions:

Table 10.5 Day-wise comparison of the forecast for the period of 31 days (1 Aug to 31 Aug) using AR and ANN model with observed data

Table 10.5 lists the day-wise forecast of the number of active cases alongside the observed data, so that each day's forecast can be checked against the actual value. A day-wise listing alone may not be an adequate measure of agreement. Hence, Table 10.6 gives the root mean square error (RMSE) between the observed data and the forecasts of the AR and ANN models, the correlation coefficient (CC) between the actual data and the forecast, and the performance parameter (PP) between the actual data and the forecast, which are considered the best parameters for the comparison.

Table 10.6 Comparison in terms of RMSE between observed and forecast values for a period of 31 days, from 1 August to 31 August
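The two measures that the text defines by name, RMSE and CC, can be computed as in the sketch below. The observed and forecast values are made up for illustration and are not taken from Table 10.5. The performance parameter (PP) is left out because the text does not give its definition.

```python
import math

def rmse(observed, forecast):
    """Root mean square error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - f) ** 2 for o, f in zip(observed, forecast)) / n)

def correlation(observed, forecast):
    """Pearson correlation coefficient between observed and forecast values."""
    n = len(observed)
    mo = sum(observed) / n
    mf = sum(forecast) / n
    cov = sum((o - mo) * (f - mf) for o, f in zip(observed, forecast))
    so = math.sqrt(sum((o - mo) ** 2 for o in observed))
    sf = math.sqrt(sum((f - mf) ** 2 for f in forecast))
    return cov / (so * sf)

# Made-up values (assumption: illustrative only).
observed = [100.0, 110.0, 125.0, 140.0]
forecast = [102.0, 108.0, 127.0, 138.0]

error = rmse(observed, forecast)
cc = correlation(observed, forecast)

print("RMSE:", error)
print("CC:", round(cc, 3))
```

RMSE is expressed in the same units as the case counts, so it should be read relative to the size of the series, whereas CC is scale-free; reporting both guards against a forecast that tracks the shape of the data but is offset in level.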

The measures used to compare the models do not take the same values in the training and testing periods. Hence, the measures must be checked in both periods. In the two cases where both models were applied, India and Bengaluru, the two methods perform similarly, with only minor variations in the measures, so either model can be used to forecast the active cases. In the remaining cases, only the AR model has been tried and tested, and its measures are well within the significance bands. Hence, the models can be accepted for modelling as well as forecasting of COVID-19 data. Owing to limitations of time and the scope of the work, the models have been tried only on the active cases.

Conclusion

In the present study, AR and ANN models have been applied to the COVID-19 pandemic, specifically to the active cases. Before applying these models, an exhaustive statistical analysis was performed to understand the nature and pattern of the data. Stationarity was checked by plotting the running mean and standard deviation to see whether they converge to a particular value, and the data was found to be non-stationary. An AR model using 10 lag days, which correlate strongly with the data, was then fitted and tested in the testing or forecast period. To capture any non-linear structure in the data, an ANN model was constructed, and the two models were compared to identify the better model for forecasting other cases such as death cases and recovered cases. Both models performed well in terms of the measures used. Hence, either model can be used for the remaining cases. In future work, the link between active cases, recovered cases and death cases has to be established and, if possible, a combined ANN model with three outputs developed to forecast all three cases at once. Future work will concentrate on this combined study.