1 Introduction

The novel coronavirus infectious disease (COVID-19), reported to have started in China in early December 2019 [30], has spread to almost all countries and territories worldwide. This epidemic was identified by World Health Organization (WHO) as a global pandemic in March 2020. As of December 31, 2021, this global pandemic results 281,808,270 confirmed cases, and 5,411,759 confirmed deaths worldwide [27] including 53,795,407 confirmed cases, and 820,355 deaths in the USA [8].

Although the reported headline counts presently show an overall decreasing trend, it may not be so in every country. The stated trends appear to be different in different regions of the world [27]. An in-depth separate country and/or state/territory-specific analysis of the spread of this virus over time may reveal useful information on the spread of this virus for policymakers. Furthermore, accurate forecast models of these counts may be useful for policymakers of healthcare organizations and government agencies in addressing and controlling the spread of the virus.

Since the beginning of this pandemic in late 2019 or early 2020, forecast models were developed for forecasting various COVID-19 counts including cases and deaths [1,2,3,4,5, 11,12,13,14,15,16, 18, 19, 21, 24, 25, 28]. Some of these authors have used Box-Jenkins methodologies to develop autoregressive integrated moving average (ARIMA) models for COVID-19 counts of cases and deaths [2, 4, 5, 12, 13, 15, 18, 21, 24, 25]. Although ARIMA models for both COVID-19 cases and deaths are modeled separately, a detailed exploratory analysis reveals that a short-term forecast model for deaths may take advantage of case counts as inputs. At the patient level, case positivity is identified first, and deaths follow. Thus, known number of cases may be used to predict future deaths. In this paper, we have endeavored to develop such a model (referred to as the Transfer function model) using Box-Jenkins methodology to predict COVID-19 deaths based on past values of this series (referred to as output series) as well as the past values of COVID-19 cases (referred to as input series) based on USA data obtained from Center for Disease Control (CDC) web site [9, 10]. Transfer function ARIMA models have been used in the context of other applications [17, 20, 29] where a forecast model for building construction costs was developed using influences of house price and work Volume [29], and another paper deals with flood prediction using rainfall and water discharge [17]. These transfer function modeling techniques can easily be generalized to model COVID-19 deaths using cases as inputs, and this is what we have attempted to accomplish in this paper. A brief description of our data, the proposed transfer function model, and details of the steps used in fitting this transfer function model are in Sect. 2. The final forecast model is developed in Sect. 3, and the performance of the model is evaluated in Sect. 4 followed by concluding remarks in Sect. 5. For computational purposes, we have used SAS/STAT software [22], and in particular, SAS PROC ARIMA [23].

2 Methodology

2.1 Data

We have downloaded data on COVID-19 daily counts of infected cases and deaths in the USA from the CDC website [9, 10]. Since the characteristics of observed data have changed over time due to various interventions including government mandates and vaccinations, one may argue in favor of using the most recent data values for forecasting future values instead of using all available data. With this in mind, we have chosen a subset to include data values since July 24, 2021, the date when the USA achieved at least 50% vaccination rate [8]. The subset data from July 24, 2021, to December 31, 2021, results in a time series data of \(n = 161\) days for the development of our model. The sample size of 161 appears to be at least a moderate size to develop an ARIMA model. It is noted in PROC ARIMA’s [23] data requirement section that a minimum of 30 observations are needed for the development of an ARIMA model.

2.2 Transfer Function Model

Transfer function models were shown to perform better with respect some model comparison metrics than purely ARIMA models when explanatory input series is available [29]. For details, the reader may review the paper [29] in the context of a different application than the COVID-19 data considered here. To simplify the notation of the transfer function model considered in this paper, let \(y_{t}\) and \(x_{t}\) be the tth value of the output series (daily death counts) and input series (daily cases), respectively. The observed series \(y_{t}\) and \(x_{t}\) are often nonstationary and need transformation and/or differences to convert to stationary series. For modeling purposes later, we shall use the natural logarithmic transformation of both output and input series, and let \(\lambda_{t} = \log_{e} \left( {y_{t} } \right)\) and \(\beta_{t} = \log_{e} \left( {x_{t} } \right)\) denote these logarithmic series. Furthermore, we shall let \(z_{t}\) and \(w_{t}\) denote the stationary series obtained from \(\lambda_{t}\) and \(\beta_{t}\) after appropriate differencing which will be explained in our model development section. Then, a transfer function model may be developed using the following three steps.

Step 1: Identification of an ARIMA Model for Input Series \(\beta_{t}\).

For forecasting future values of the logarithmic output series \(\lambda_{t}\), future values of the logarithmic input series \(\beta_{t}\) are needed and hence the need of an adequate ARIMA model for \(\beta_{t}\)(or for a differenced \(\beta_{t}\) if \(\beta_{t}\) is not a stationary series). This should be done independently of the output series at this step. Here, all necessary calculations and model development steps are performed using SAS/STAT software [22]. We will take advantage of various options of IDENTIFY and ESTIMATE statements available in PROC ARIMA [23] of SAS/STAT Software [22] to develop a forecast model for the logarithmic input series \(\beta_{t}\). We shall let \(z_{t}\) and \(w_{t}\) denote the stationary series obtained from \(\lambda_{t}\) and \(\beta_{t}\) after appropriate differencing, if needed, of \(\lambda_{t}\) and \(\beta_{t}\) series, respectively.

Step 2: Pre-whitening of \(\lambda_{t}\) and \(\beta_{t}\) Preliminary Transfer Function Model

Using the ARIMA model of the input series described in step 1, both \(z_{t}\) and \(w_{t}\) are filtered (a process known as pre-whitening). The cross-correlations between these two pre-whitened series are then used to determine a preliminary transfer function model of the following form:

$$ z_{t} = \mu + \frac{C\omega (B)}{{\delta (B)}}B^{b} w_{t} + \eta_{t} $$
(1)

In the above model (1), µ is a constant, C is an unknown scale parameter, \(\omega (B)\) represents the polynomial terms in backshift operator B for the moving average of \(w_{t}\) (with \(B^{u} \left( {w_{t} } \right) = w_{t - u} )\)) where the averaging operators are determined based on the past \(w_{t}\) values that influence \(z_{t}\), \(\delta (B)\) represents the autoregressive operators of \(z_{t}\) where the autoregressive backshift operators are determined by the number of past \(z_{t}\) values that are related to \(z_{t}\), and b (known as delay parameter) is the number of periods it takes for the pre-whitened input series \(w_{t}\) to affect the pre-whitened output series \(z_{t}\). The choice of both \(\omega (B)\) and \(\delta (B)\) is dependent on cross-correlations between the pre-whitened series mentioned above just prior to Eq. (1). The residual term \(\eta_{{\text{t}}}\) of the model (1) must be uncorrelated with the pre-whitened \(w_{t}\) series, a necessary condition for the validity of transfer function modeling. If \(\eta_{{\text{t}}}\) and pre-whitened \(w_{t}\) are uncorrelated and \(\eta_{t}\) autocorrelations are insignificant (i.e., large p-value of white noise test), then model (1) is taken as the final transfer function model. However, often this does not happen in practice, and the white noise test often results in significant autocorrelations. When \(\eta_{t}\) autocorrelations under model (1) are significant, one needs to identify a model for \(\eta_{t}\) of the form

$$ \eta_{t} = \frac{\theta (B)}{{\phi (B)}}\varepsilon_{t} $$

and revise tentative model (1) to the final transfer function model following the description offered below in Step 3.

Step 3: Final Transfer Function Model

If \(\eta_{t}\) residuals under (1) are uncorrelated with pre-whitened \(w_{t}\) series, but \(\eta_{t}\) autocorrelations are significant, then it is necessary to identify a model that describes \(\eta_{t}\) and modify model (1) accordingly to a final transfer function model of the following form:

$$ z_{t} = \mu + \frac{C\omega (B)}{{\delta (B)}}B^{b} w_{t} + \eta_{t} \quad {\text{with}}\quad \eta_{t} = \frac{\theta (B)}{{\phi (B)}}\varepsilon_{t} $$
(2)

where \(\varepsilon_{t}\) is assumed to satisfy the white noise assumption (i.e., \(\varepsilon_{t}\) autocorrelations are insignificant), and

$$ \phi (B) = \left( {1 - \phi_{1} B - \phi_{2} B^{2} - \phi_{p} B^{p} } \right) $$

is the autoregressive operator of order p of \(\eta_{t}\), and

$$ \theta (B) = \left( {1 - \theta_{1} B - \theta_{2} B^{2} - \theta_{q} B^{q} } \right) $$

is the moving-average operator of order q of \(\varepsilon_{t}\).

In our next subsection, we have taken advantage of SAS’s PROC ARIMA [23] to fit the above transfer function model to our data described in Sect. 2.1.

3 Parameter Estimation of Transfer Function Model

As noted above, we take advantage of PROC ARIMA’s [23] identify and estimate options to develop the final transfer function model for our data described in Sect. 2.1. We first note that both the logarithmic output series \(\lambda_{t}\) and input series \(\beta_{t}\) are nonstationary. Using various plots and statistical stationarity tests offered as options in PROC ARIMA, we have concluded that the following differences achieve stationarity for both series:

$$ z_{t} = \left( {\lambda_{t} - \lambda_{t- 1} } \right) - \left( {\lambda_{t - 7} - \lambda_{t - 8} } \right) $$
$$ w_{t} = \left( {\beta_{t} - \beta_{t - 1} } \right) - \left( {\beta_{t - 7} - \beta_{t - 8} } \right) $$

As noted above in Sect. 2.2, our first task here is to find an appropriate model for the differenced input series \(w_{t}\). A careful review of outputs generated by PROC ARIMA’s identify statement on \(w_{t}\) series, we have concluded that the following model fits well to \(w_{t}\) where we have used \(e_{t}\) to denote estimates of residual \(\varepsilon_{t}\).

$$ w_{t} = e_{t} - 0.56842e_{t - 1} - 0.23588e_{t - 2} + 0.15397e_{t - 3} - 0.26602e_{t - 7} + 0.18503e_{t - 9} $$
(3)

The above model allows forecasting future values of logarithmic input series \(\beta_{t}\). In our selection of the above model, we have utilized ESCAF, MINIC, and SCAN options of PROC ARIMA’s IDENTIFY statement to get several suggested models and then have used various model comparison criteria (AIC, SBC, etc.) to choose the model (3). For a visual check of the goodness-of-fit of the model (3), a plot of observed and predicted values of \(\beta_{t}\) series is displayed in Fig. 1. Both the observed and fitted values are kept on a logarithmic scale at this stage. Figure 1 shows that the fitted moving average model (3) mimics the observed case counts very closely.

Fig. 1
figure 1

Observed and predicted values of ln (Cases) under model (3)

Using the estimated parameter values reported for model (3), we have looked at the cross-correlations between pre-whitened values of \(z_{t}\) and \(w_{t}\). These cross-correlations and guidelines available in the literature [6, 7, 26] are utilized to determine the appropriate backshift polynomial terms in \(\omega (B)\) and \(\delta (B)\), and the delay parameter b involved in the preliminary transfer function model (1). Keeping only statistically significant terms at a 5% level of significance, we obtained the following estimated preliminary transfer function model:

$$ z_{t} = \frac{{0.57157\left( {1 - 0.58395 B^{3} } \right)}}{{\left( {1 - 0.59833 B^{1} } \right)}}B^{0} w_{t} + a_{t} $$
(4)

The residual values \(a_{t}\) in (4) are estimates of \(\eta_{{\text{t}}}\) mentioned in model (3). These residuals were found to be uncorrelated with pre-whitened \(w_{t}\), see Table 1 which shows all cross-correlations between \(a_{t}\) and pre-whitened \(w_{t}\) series are insignificant. The p-values of chi-square Q statistics are much higher than 0.05 for lags up to k corresponding to k = 5, 11, 17, 23, 29 as displayed in Table 1 and thus satisfies the necessary condition for the validity of the transfer function modeling. However, \(a_{t}\) autocorrelations (i.e., estimated \(\eta_{t}\) autocorrelations) under model (4) are significant and thus need further modeling. We then proceeded to Step 3 to identify an ARIMA model for estimated \(\eta_{{\text{t}}}\) series and concluded the following final transfer function model:

$$\begin{aligned} z_{t} &= \frac{{0.62463\left( {1 - 0.32292 B^{3} } \right)}}{{\left( {1 - 0.49046 B^{1} } \right)}}B^{0} w_{t} \\ &\quad+ \frac{{\left( {1 - 0.45811 B^{7} } \right)}}{{\left( {1 + 0.33825 B^{1} + 0.36882 B^{2} + 0.26713 B^{4} } \right)}}e_{t} \end{aligned}$$
(5)
Table 1 SAS output on cross-correlation test of residuals with logarithmic input series log(xt) under preliminary transfer function model (4)

The fitted transfer function model (5) appears to be an adequate model based on the following residual diagnostics measures:

  1. 1.

    residual series \(e_{t}\) in (5) is uncorrelated with pre-whitened \(w_{t}\)(p-values of all chi-square tests in Table 2 are greater than 0.05 indicating insignificant cross-correlations) and,

  2. 2.

    \(e_{t}\) residuals of the fitted model (5) are uncorrelated (p-values of all reported tests in Table 3 are greater than 0.05 implying that the residual series satisfies the white noise assumption).

Table 2 SAS output on cross-correlation test of residuals with logarithmic input series log(xt) under final transfer function model (5)
Table 3 SAS output on autocorrelation check of residuals of model (5)

Note that the failure of either of these two conditions in (1) and (2) would indicate the inadequacy of the fitted model.

A plot of the observed and predicted values of \(\log_{e} \left( {y_{t} } \right)\) series obtained under model (5) is displayed in Fig. 2. The plot is a visual display of the goodness-of-fit of the model (5) to the observed \(\log_{e} \left( {y_{t} } \right)\) series, and model (5) appears to be a good fitting transfer function model.

Fig. 2
figure 2

Observed and predicted values ln (Deaths) under fitted transfer function model (5)

In our next section, the fitted transfer function model (5) is used for forecasting future values of the output series \(\log_{e} \left( {y_{t} } \right)\) using \(\log_{e} \left( {x_{t} } \right)\) as the input series.

4 Performance of Fitted Model and Forecasts

This section addresses the performance of the fitted transfer function model (5) using plots and summary values based on some statistical criteria. To evaluate the forecasting performance of the model (5), we have estimated the model parameters using all but the most recent 30 data values (i.e., the training set consists of data from 7/24/21 to 12/1/21). The most recent 30 data values from 12/2/21 to 12/31/21 then constitute our test data. The fitted model for train data is used to predict the most recent 30 values (test data) that were not used for model fitting. This allows us to study the model’s performance by comparing the observed and predicted values for these 30 observations. Since the test data mentioned here are not used for parameter estimation of the forecast model, some statistical summary measures on forecast errors from the test data may indicate how well the model forecast future data. Note that, in our prediction, we have used the observed, not estimated, values of the input series since these values are available. Also, SAS reports predicted logarithmic deaths, not the predicted deaths due to logarithmic transformation before differencing. Table 4 displays the MAE (= mean of absolute values of (predicted—observed)), RMSE (= square root of the mean of squared values of (predicted—observed)) and MAPE (mean of the absolute values of (predicted- observed)/observed)) obtained for training and test data under the transfer function model (5). Note that all three measures are on a logarithmic scale. We expect our final fitted model to forecast future values with similar summarized forecast errors as those reported in Table 4.

Table 4 Some Goodness-of-fit Measures of model (5) for train and test data

The transfer function model is then fitted to all data prior to forecasting future values. To forecast future values of \(\log_{e} \left( {y_{t} } \right)\) using model (5), we need to use past and future values of \(\log_{e} \left( {x_{t} } \right)\). Since future values of \(\log_{e} \left( {x_{t} } \right)\) are not available, SAS’s PROC ARIMA, by default, uses fitted model (1) to predict future values of \(\log_{e} \left( {x_{t} } \right)\) and used them in the fitted transfer function model (5) for the prediction of future values of \(\log_{e} \left( {y_{t} } \right)\). We forecast 30 future values of \(\log_{e} \left( {y_{t} } \right)\) and displayed these values in Fig. 3 along with 95% prediction intervals.

Fig. 3
figure 3

Thirty days observed and predicted values and thirty days forecasts values of ln(Deaths) with 95% confidence intervals under model (5)

We like to conclude this section by comparing the fitted transfer function model to the pure ARIMA model for death counts ignoring case counts. One may argue if such an ARIMA model for death counts alone (ignoring case counts) can perform better than the proposed transfer function model that utilizes case counts as input series for the prediction of death counts. To address this, one needs to find an appropriate ARIMA model and compare this ARIMA model to the proposed transfer function model with respect to some model comparison criteria.

By looking at all suggested models of IDENTIFY statement of PROC ARIMA and using model comparison criteria, we have concluded that the following model fits best to the differenced series \(z_{t}\):

$$ z_{t} = - 0.41814z_{t - 7} + e_{t} - 0.30448e_{t - 1} - 0.4746e_{t - 2} $$
(6)

A plot of predicted and observed values of ln(Deaths) under model (6) is displayed in Fig. 4 to demonstrate the goodness-of-fit of the above ARIMA model for the ln(Deaths) series ignoring case counts.

Fig. 4
figure 4

Observed and predicted values of ln(Deaths) under model (6)

The MAE, RMSE, and MAPE calculated for the fitted ARIMA model (6) and transfer function model (5) are reported in Table 5. For all three measures, the transfer function model (5) has lower values compared to the ARIMA model (6), see the percentage reduction values of approximately 14% in the third row of Table 5. This implies that the utilization of case counts as input series results in a better forecast model here for our data.

Table 5 Comparison of transfer function model and ARIMA model with respect to some criteria

5 Concluding Remarks

In this paper, we have developed a transfer function model for COVID-19 reported death counts in the USA using reported case counts as input where we have used data beginning July 24, 2021, which is the date when the vaccination rate exceeded 50% in the USA. The results indicated that the transfer function model generates a better forecast model as opposed to the pure ARIMA model as demonstrated by some goodness-of-fit criteria in Table 5. The third row of Table 5 shows that the percentage reduction in MAE, RMSE, and MAPE due to transfer function modeling compared to the best ARIMA model for the same data is about 14% each. Overall, the findings in this paper contribute to the body of knowledge and demonstrate that the appropriate application of transfer function models can increase forecasting accuracy.

We like to note some limitations of our model. We have developed the proposed transfer function model for predicting COVID-19 deaths in the USA using reported case counts as input series. We have used all data reported on CDC’s website from July 24, 2021, to December 31, 2021. The data download date was January 10, 2022. We have assumed that all data during this period are reported by our download time. There may be minor data discrepancies if data were downloaded after January 10, 2022, due to late reporting to CDC. As a result, the estimated parameters may be different than what we have reported here. However, the overall model structure is expected to be the same.

We also like to note that there may be some data reporting errors. A closer look at the data reveals that reported cases and deaths on Saturdays and Sundays were often lower than those on Monday thru Friday of the weeks for which data were used for our analysis (among 23 weeks of our data, 22 weeks reported lower deaths on Sundays compared to other days. One may reasonably consider this as a serious data reporting/recording error. We do not see any reason for the number of deaths and cases to be lowest often on Saturdays and Sundays than on other days of the week. Nonetheless, we have used the data as it is without making any adjustments. Results and findings here may be different if data were to correct, if possible, the noted discrepancy.

The modeling technique described in this paper can be extended to include other input series (e.g., COVID-19 hospitalization counts) which may result in further improvements.