
1 Introduction

In machine learning, patterns are observed using statistics, and a model is trained through continuous iterations until a satisfactory data prediction is obtained. This process most commonly synthesizes useful concepts from historical data. Machine learning is a method of training machines, i.e., computers, to make predictions based on training data and experience. Applications of machine learning are limitless, from the health care industry to the statistical conditions of a country. Machine learning is not limited to a particular field; it can be used to improve the existing knowledge of a field by learning from previous data and predictions. In brief, machine learning is an application of artificial intelligence that automates analytical model building by using an algorithm that iteratively learns from data without being explicitly programmed (Sharma et al., 2018). We can predict future outcomes by analyzing observations in time-series order. Patterns such as seasonality, trend, irregularity and cyclicity are used as features to predict upcoming values of the variable of interest. There are various applications of time-series forecasting, such as earthquake prediction and stock market prediction. The performance of time-series forecasting models can be compared by evaluating their error rates; the most commonly used error metric is the root mean squared error.

2 Literature Review

Stephanie et al. (2020) in their research work analyzed the impact of COVID-19 with respect to geographical differences over features like population density, age distribution, diagnostic capacity, etc. Leeb et al. (2020) in their research work analyzed the impact of COVID-19 on a specific age group, i.e., school-going children.

Lim et al. (2020) in their research work analyzed the impact of COVID-19 on a specific section, i.e., interns working in a university hospital. Hayashi et al. (2020) in their research work analyzed the changed seasonal pattern of the influenza virus and SARS-CoV-2 due to COVID-19 rules and regulations.

Wilson et al. (2020) in their research work applied a clustering approach to the impact of COVID-19 on a university campus. Hawas (2020) in his research work performed time-series prediction of daily COVID-19 infection rates in Brazil using recurrent neural networks (RNN).

Alonso et al. (2020) in their research work discussed the various challenges of the post-COVID-19 era and proposed strategies for them. Lee and Lin (2020) in their research work analyzed the relationship of COVID-19 precautions with other common infections and studied the impact of their outbursts due to COVID-19. Cardil and de-Miguel (2020) proposed a scenario in their research work where COVID-19 rules and regulations could directly interfere with disaster response and cause more damage from natural disasters. Filimonau et al. (2020) in their research work observed a specific workforce segment and its commitment toward the job during the pandemic.

3 Methodology

A metric is a measurement of error between grouped observations that express the same phenomenon (Sharma, 2015). The most commonly used metric is the root mean squared error (RMSE), which is also used in this paper.

The root mean squared error (RMSE) is calculated as the square root of the average of squared differences between actual and predicted observations, as shown in Eq. 1. RMSE penalizes large errors because each error is squared before the average is taken.

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum\limits_{i = 1}^{n} \left( \hat{y}_{i} - y_{i} \right)^{2}}$$
(1)

where \(\hat{y}_{1} ,\hat{y}_{2} , \ldots ,\hat{y}_{n}\) are predicted values, \(y_{1} ,y_{2} , \ldots ,y_{n}\) are observed values and n is the number of observations.
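A minimal sketch of how the RMSE in Eq. 1 can be computed in Python with NumPy; the observed and predicted arrays below are hypothetical placeholders, not values from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values (Eq. 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Hypothetical observed and predicted case counts, for illustration only.
observed = [100, 120, 150, 200]
predicted = [110, 118, 160, 190]
print(rmse(observed, predicted))  # ≈ 8.72
```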

Three time-series CSV file datasets were fetched from Kaggle, dating from January 22 to March 15. The three datasets are confirmed, deaths and recovered cases. There are a total of 451 rows and 58 columns in each of these three datasets, with features such as province/state, country, latitude, longitude and dates from January 22, 2020 to March 15, 2020. The analysis has been done on all the features excluding the latitude and the longitude of the given locations.

The three CSV file datasets were loaded into a Jupyter notebook using the Pandas library and were converted into data frames, which helped in representing the data as in-memory 2D tables. All the columns were extracted from the datasets using the .keys() function, and from these, the date columns were extracted using the .loc() function. The analyses performed on the datasets and then appended into lists were the total confirmed cases, total deaths and total recovered cases. Using these values from the lists, we calculated the mortality rate (total deaths/total confirmed) and the recovery rate (total recovered/total confirmed). All the dates and cases were converted into N-dimensional arrays using NumPy. The dates stored in the datasets were type cast from integers into date-time format for better visualization. Using the .loc() function on the arrays, the latest confirmed cases, latest death cases and latest recovered cases were displayed, i.e., from March 5, 2020 to March 15, 2020. The total number of confirmed cases per country was calculated. The unique values of provinces/states were stored, and it was observed that many not-a-number (NaN) values were assigned to these provinces/states; these were removed using the .pop() function. The top ten countries with the greatest number of cases were calculated. As China was the first country to be affected by this deadly disease, it had the highest number of cases; to analyze this situation, a comparison was made between China and the rest of the world on the basis of the total number of confirmed cases. The dataset was then pre-processed. As the models deal with dependent and independent variables, the dataset was split into a training set and a test set using train test split: 70% of the dataset is used for training and 30% is kept for testing.
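A minimal sketch, in Python, of the loading and preprocessing steps described above. The file names, the assumption that the first four columns are the non-date columns, and the chronological 70/30 split are illustrative assumptions and may differ from the actual notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file names; the actual Kaggle CSVs may be named differently.
confirmed_df = pd.read_csv("time_series_confirmed.csv")
deaths_df = pd.read_csv("time_series_deaths.csv")
recovered_df = pd.read_csv("time_series_recovered.csv")

# The date columns follow the four non-date columns
# (province/state, country, latitude, longitude) described above.
date_cols = confirmed_df.keys()[4:]

# Daily world totals and the derived mortality and recovery rates.
total_confirmed = confirmed_df[date_cols].sum(axis=0).values
total_deaths = deaths_df[date_cols].sum(axis=0).values
total_recovered = recovered_df[date_cols].sum(axis=0).values
mortality_rate = total_deaths / total_confirmed
recovery_rate = total_recovered / total_confirmed

# Days elapsed since the first observation, reshaped for the regressors.
days_since = np.arange(len(date_cols)).reshape(-1, 1)

# 70/30 train-test split; shuffle=False (a chronological split) is an
# assumption, since the text only specifies the 70/30 proportion.
X_train, X_test, y_train, y_test = train_test_split(
    days_since, total_confirmed, test_size=0.30, shuffle=False)
```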

Polynomial Regression A polynomial function is used with the concept of curve fitting to forecast the variable of interest as shown in Eq. 2

$$f\left( x \right) = c_{0} + c_{1} x + c_{2} x^{2} + \cdots + c_{n} x^{n}$$
(2)

where n is the degree of the polynomial and c is a set of coefficients.
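A brief sketch of polynomial curve fitting with NumPy, consistent with Eq. 2; the synthetic series and the chosen degree are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

# Synthetic stand-in for the cumulative confirmed-case series (hypothetical).
days_since = np.arange(60)
total_confirmed = 500 * np.exp(0.05 * days_since)

# Fit a polynomial of degree 4; the degree here is illustrative.
coeffs = np.polyfit(days_since, total_confirmed, deg=4)
poly = np.poly1d(coeffs)

# Forecast the next five days.
future_days = np.arange(60, 65)
print(poly(future_days))
```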

Support Vector Machine Regression Polynomial, sigmoid and RBF (Gaussian) functions have been set as kernels, which are evaluated to produce the optimal function for the most appropriate prediction (Sharma & Shrivastav, 2020). The equation of the hyperplane is shown in Eq. 3, and the Lagrangian form in Eq. 4 is minimized with respect to w and b, where w determines the width of the margin and b is the bias constant.

$$g\left( x \right) = w^{T} x + b$$
(3)
$$J\left( {w,b,\alpha } \right) = \frac{1}{2}w^{T} w - \sum\limits_{i = 1}^{N} \alpha_{i} d_{i} \left( {w^{T} x_{i} + b} \right) + \sum\limits_{i = 1}^{N} \alpha_{i}$$
(4)
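The following sketch shows how an SVR with the poly, sigmoid and RBF kernels mentioned above could be fit with scikit-learn; the hyperparameters and synthetic data are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical training data: X = days elapsed, y = cumulative cases.
X = np.arange(60).reshape(-1, 1)
y = 500 * np.exp(0.05 * np.arange(60))

# Fit one SVR per kernel; C, gamma and epsilon are illustrative values.
models = {k: SVR(kernel=k, C=1e4, gamma=0.01, epsilon=1.0).fit(X, y)
          for k in ("poly", "sigmoid", "rbf")}

# Forecast the next five days with, e.g., the RBF kernel.
future = np.arange(60, 65).reshape(-1, 1)
print(models["rbf"].predict(future))
```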

Holt’s Linear Model The method proposed by Holt involves two smoothing relations, i.e., trend (\(b_{t}\)) and level (\(\ell_{t}\)), with a forecast equation (\(\hat{y}_{t + h|t}\)) as shown in Eqs. 5–7.

$$\hat{y}_{t + h|t} = \ell_{t} + hb_{t}$$
(5)
$$\ell_{t} = \alpha y_{t} + \left( {1 - \alpha } \right)\left( {\ell_{t - 1} + b_{t - 1} } \right)$$
(6)
$$b_{t} = \beta^{*} \left( {\ell_{t} - \ell_{t - 1} } \right) + \left( {1 - \beta^{*} } \right)b_{t - 1}$$
(7)

where \(0 \le \alpha \le 1\) (level smoothing parameter) and \(0 \le \beta^{*} \le 1\) (trend smoothing parameter).
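A minimal sketch of Holt's linear trend method (Eqs. 5–7) using statsmodels; the synthetic series is a stand-in for the confirmed-case counts, and the smoothing parameters are estimated from the data rather than set by hand.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import Holt

# Synthetic stand-in for the daily cumulative confirmed-case series.
y = pd.Series(500 * np.exp(0.05 * np.arange(60)))

# Fit Holt's linear trend method; alpha and beta* are estimated from the data.
fit = Holt(y).fit()
print(fit.forecast(5))  # forecasts for the next five days
```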

Holt’s Winter Model The method proposed by Holt and Winters involves three smoothing relations, i.e., trend (\(b_{t}\)), level (\(\ell_{t}\)) and season (\(s_{t}\)), with a forecast equation (\(\hat{y}_{t + h|t}\)) as shown in Eqs. 8–11.

$$\hat{y}_{t + h|t} = \ell_{t} + hb_{t} + s_{t + h - m\left( {k + 1} \right)}$$
(8)
$$\ell_{t} = \alpha \left( {y_{t} - s_{t - m} } \right) + \left( {1 - \alpha } \right)\left( {\ell_{t - 1} + b_{t - 1} } \right)$$
(9)
$$b_{t} = \beta^{*} \left( {\ell_{t} - \ell_{t - 1} } \right) + \left( {1 - \beta^{*} } \right)b_{t - 1}$$
(10)
$$s_{t} = \gamma \left( {y_{t} - \ell_{t - 1} - b_{t - 1} } \right) + \left( {1 - \gamma } \right)s_{t - m}$$
(11)

where \(k\) is the integer part of \(\left( {h - 1} \right)/m\), \(0 \le \alpha \le 1\) (level smoothing parameter), \(0 \le \beta^{*} \le 1\) (trend smoothing parameter) and \(0 \le \gamma \le 1\) (seasonal smoothing parameter).
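A short sketch of the Holt–Winters method (Eqs. 8–11) with statsmodels; the weekly seasonal period (m = 7) and the synthetic series are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic daily series; a weekly seasonal period (m = 7) is assumed here.
y = pd.Series(500 * np.exp(0.05 * np.arange(60)))

fit = ExponentialSmoothing(y, trend="add", seasonal="add",
                           seasonal_periods=7).fit()
print(fit.forecast(5))  # forecasts for the next five days
```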

AutoRegressive Model (AR Model) In an autoregressive model (AR model), the variable of interest is forecast using a linear combination of its own past values. An AR model of order p is shown in Eq. 12.

$$y_{t} = c + \phi_{1} y_{t - 1} + \phi_{2} y_{t - 2} + \cdots + \phi_{p} y_{t - p} + \varepsilon_{t}$$
(12)

where \(\varepsilon_{t}\) is white noise.
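A minimal sketch of fitting an AR(p) model (Eq. 12) with statsmodels; the lag order p = 5 and the synthetic series are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.ar_model import AutoReg

# Synthetic stand-in series; the lag order p = 5 is illustrative.
y = pd.Series(500 * np.exp(0.05 * np.arange(60)))

fit = AutoReg(y, lags=5).fit()
print(fit.predict(start=len(y), end=len(y) + 4))  # next five days
```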

Moving Average Model (MA Model) In a moving average model (MA model), the variable of interest is forecast in a regression-like model using past forecast errors, as shown in Eq. 13.

$$y_{t} = c + \varepsilon_{t} + \theta_{1} \varepsilon_{t - 1} + \theta_{2} \varepsilon_{t - 2} + \cdots + \theta_{q} \varepsilon_{t - q}$$
(13)

where \(\varepsilon_{t}\) is white noise.
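One way to fit a pure MA(q) model (Eq. 13) is as an ARIMA model with zero autoregressive and differencing orders. The sketch below uses statsmodels; q = 2 and the synthetic series are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# A pure MA(q) model expressed as ARIMA(0, 0, q); q = 2 is illustrative.
y = pd.Series(500 * np.exp(0.05 * np.arange(60)))

fit = ARIMA(y, order=(0, 0, 2)).fit()
print(fit.forecast(5))  # forecasts for the next five days
```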

Autoregressive Integrated Moving Average Model (ARIMA Model) The autoregressive integrated moving average model (ARIMA model) combines the autoregressive and moving average models on a differenced series, as shown in Eq. 14. It follows the same stationarity and invertibility conditions as the autoregressive and moving average models.

$$y_{t}^{^{\prime}} = c + \phi_{1} y_{t - 1}^{^{\prime}} + \cdots + \phi_{p} y_{t - p}^{^{\prime}} + \theta_{1} \varepsilon_{t - 1} + \cdots + \theta_{q} \varepsilon_{t - q} + \varepsilon_{t}$$
(14)

where \(y_{t}^{^{\prime}}\) is the differenced series.
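A minimal sketch of an ARIMA(p, d, q) fit (Eq. 14) with statsmodels; the order (2, 1, 2) and the synthetic series are illustrative assumptions, not the paper's tuned configuration.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# The order (p, d, q) = (2, 1, 2) is illustrative, not the paper's order.
y = pd.Series(500 * np.exp(0.05 * np.arange(60)))

fit = ARIMA(y, order=(2, 1, 2)).fit()
print(fit.forecast(5))  # forecasts for the next five days
```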

Facebook’s Prophet Model Facebook’s Prophet model decomposes the series into components such as nonlinear trend, seasonality, holiday effects and idiosyncratic changes, as shown in Eq. 15.

$$y\left( t \right) = g\left( t \right) + s\left( t \right) + h\left( t \right) + e\left( t \right)$$
(15)

where \(g\left( t \right)\) is the trend component that models non-periodic changes, \(s\left( t \right)\) is the seasonality component that represents periodic changes, \(h\left( t \right)\) ties in the effects of holidays and \(e\left( t \right)\) covers idiosyncratic changes not captured by the model.
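A minimal sketch of fitting Prophet to a daily series and forecasting five days ahead with its uncertainty bounds; the package name (prophet, formerly fbprophet) and the synthetic data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from prophet import Prophet  # older releases ship as `fbprophet`

# Prophet expects a data frame with columns `ds` (dates) and `y` (values);
# the series below is a synthetic stand-in for the confirmed-case counts.
dates = pd.date_range("2020-01-22", periods=60, freq="D")
df = pd.DataFrame({"ds": dates, "y": 500 * np.exp(0.05 * np.arange(60))})

m = Prophet()
m.fit(df)
future = m.make_future_dataframe(periods=5)  # extend five days ahead
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail(5))
```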

The methodology section gives a stepwise algorithmic description of the systematic study conducted and the trend analysis performed on the COVID-19 dataset. The various algorithms applied are described above, and the conclusions are presented at the end. The methodology reflects the technical analysis of disease trends and the direction of the pandemic with respect to time: as the elapsed time increases, the spread and the number of affected individuals also increase. The overall methodology is shown in Fig. 1.

Fig. 1 Methodology

4 Result

4.1 Polynomial Regression

As shown in Fig. 2, we have predicted the trend of confirmed cases using polynomial regression for the next five days, i.e., September 24, 2020 to September 29, 2020.

Fig. 2 Polynomial regression

4.2 Support Vector Machine Regressor

As shown in Fig. 3, we have predicted the trend of confirmed cases using the support vector machine regressor for the next five days, i.e., September 24, 2020 to September 29, 2020. The root mean squared error (RMSE) was observed to be 1,005,399.938. Predicted confirmed cases are 7,387,905, 7,606,222, 7,830,090, 8,059,624, 8,294,945. This model performed the worst.

Fig. 3 Support vector machine regressor

4.3 Holt’s Linear Model

As shown in Fig. 4, we have predicted the trend of confirmed cases using Holt’s linear model for the next five days, i.e., September 24, 2020 to September 29, 2020. The root mean squared error (RMSE) was observed to be 113,382.878. Predicted confirmed cases are 5,909,289, 6,006,013, 6,102,737, 6,199,461, 6,296,186.

Fig. 4 Holt’s linear model

4.4 Holt’s Winter Model

As shown in Fig. 5, we have predicted the trend of confirmed cases using Holt’s winter model for the next five days, i.e., September 24, 2020 to September 29, 2020. The root mean squared error (RMSE) was observed to be 224,526.107. Predicted confirmed cases are 6,142,376, 6,284,535, 6,412,858, 6,583,385, 6,740,125.

Fig. 5 Holt’s winter model

4.5 AR Model

As shown in Fig. 6, we have predicted the trend of confirmed cases using the autoregressive model (AR model) for the next five days, i.e., September 24, 2020 to September 29, 2020. The root mean squared error (RMSE) was observed to be 134,715.533. Predicted confirmed cases are 5,954,376, 6,057,034, 6,160,121, 6,263,644, 6,367,599.

Fig. 6 AR model

4.6 MA Model

As shown in Fig. 7, we have predicted the trend of confirmed cases using the moving average model (MA model) for the next five days, i.e., September 24, 2020 to September 29, 2020. The root mean squared error (RMSE) was observed to be 37,850.063. Predicted confirmed cases are 5,682,521, 5,760,792, 5,839,125, 5,914,426, 5,989,599. This model performed second best.

Fig. 7 MA model

4.7 ARIMA Model

As shown in Fig. 8, we have predicted the trend of confirmed cases using the autoregressive integrated moving average model (ARIMA model) for the next five days, i.e., September 24, 2020 to September 29, 2020. The root mean squared error (RMSE) was observed to be 112,111.823. Predicted confirmed cases are 5,912,914, 6,012,312, 6,112,132, 6,212,373, 6,313,037.

Fig. 8 ARIMA model

4.8 Facebook’s Prophet Model

As shown in Fig. 9, we have predicted the trend of confirmed cases using Facebook’s Prophet model for the next five days, i.e., September 24, 2020 to September 29, 2020. The root mean squared error (RMSE) was observed to be 36,248.027. Predicted confirmed cases are 5,598,438, 5,674,748, 5,751,535, 5,825,041, 5,899,729, whereas the upper bounds for the respective days are 5,666,743, 5,745,622, 5,824,853, 5,897,610, 5,968,758. This model performed the best.

Fig. 9 Facebook’s Prophet model

4.9 Average of All Models

The averages of all the models’ predictions for confirmed cases in the period between September 24 and 28 are observed as 6,017,199, 6,127,316, 6,236,641, 6,350,228, 6,462,896 (Figs. 10 and 11).

Fig. 10 Root mean squared error of all models

Fig. 11 Prediction of all models

5 Conclusion and Future Work

From the analysis and the prediction, it is evident that the coronavirus is growing at a very rapid pace. It raises a serious concern in the world as its effects are catastrophic. It is a deadly virus that has taken many lives and continues to do the same. It is the need of the hour to understand its effects on the whole world. This project helps the government in analyzing the situation and predicting future outcomes so that preventive measures can be taken to contain and prevent this widely spreading disease.

The following work can be considered as future work:

  1. Applying geospatial analysis, i.e., manipulating the data with the help of the longitude and latitude of a place, which would help contain an area in which a large-scale outbreak has taken place so that the outbreak stays localized to that area and does not spread any further.

  2. Using a graphical user interface, which would help in analyzing the situation in a more user-friendly way.