Introduction

In the deterministic analysis, the reservoir operation problem is solved for a particular (known) inflow sequence among numerous other possible sequences. Hence, the effect of variation in the inflow pattern is not reflected in the solution obtained from the deterministic analysis, which represents only a small part of a wider range of possible reservoir behaviors. In the implicit analysis, the randomness of the inflows to the reservoir, which makes the reservoir operation a stochastic process, is considered indirectly by solving the operation problem for a number of possible sequences of these random inflows in a deterministic framework. This results in a number of release sequences, and a regression model is used with these sequences to obtain an average operating policy.

The major issue in the implicit analysis is the selection of a proper regression model, particularly the selection of the specific independent variables to be used in the regression equation from a number of candidate variables. This selection poses a formidable task for a multireservoir system, where a number of combinations of the independent variables are possible. In general, there are no specific criteria for selecting a particular regression model, and the selection is usually based on the significance of the regression coefficients in terms of certain statistical parameters.

In this study, a generalized approach is proposed for the selection of appropriate variables in the regression in the implicit stochastic Dynamic Programming (DP) framework. The approach selects the appropriate model from judgements based on the statistical and physical significance of the presence of a variable in the regression model and on performance based on objective function values obtained through simulation. In the regression analysis, a fundamental contribution is made by introducing two concepts: (1) all possible combinations of independent variables and (2) stepwise inclusion and elimination of independent variables in the regression equations, in order to determine the strength and character of the relationship between the optimal storage state (dependent variable) and the initial storage states and inflows (independent variables) in reservoir management.

Literature survey

The implicit approach was first suggested by Hall and Howell (1963). As stated by McKerchar (1975), this approach makes one important assumption: that the average operating policy obtained is equivalent to the policy obtained from an explicit stochastic dynamic programming model, which by definition optimizes the expected value of the objective function. Although no mathematical proof is available, this assumption appears reasonable for the case where the loss function is smooth and convex. Hall and Howell (1963) suggested that, as the historic inflow records are of limited length, synthetic inflow sequences which resemble the historic data in terms of certain statistical parameters could be used in deriving the average operating policy.

Hall and Howell (1963) used deterministic dynamic programming to find the optimal pattern of release decisions for known sequences of inflow data and combined these release data into a mean operating policy using a regression model. The variability of inflow was thus considered implicitly. Young (1967) used the implicit approach to determine the operating rules for the nonseasonal operation of a single reservoir. Minimization of the losses associated with the failure to supply target outputs was used as the objective function, and the optimal releases were determined for a simulated sequence of inflow data. A multiple regression technique was used to express the optimal releases as a function of the storage and inflows of the current period. McKerchar (1975) extended the model of Young (1967) to determine the monthly release policies of a two-reservoir system. A multivariate streamflow model was used to generate synthetic sequences of inflows. In the multiple regression analysis, he considered the optimal monthly releases as a linear function of the two storages only (inflow variables were not considered) and obtained twelve sets of equations, one for each month. Trott (1979) used the implicit approach to determine the real-time operating rules for a multireservoir system. A Linear Programming–Dynamic Programming model was used to determine the optimal operating policies which maximize the on-peak energy generation. Historical inflows were used in place of generated data. He considered release as a linear function of storage, the previous period's inflow, the estimated current period's inflow and releases from other reservoirs in the system. A two-stage least squares technique was used for the multiple regression analysis.

Bhaskar and Whitlatch (1980) used the implicit approach for the optimal operation of a single reservoir and emphasized that the policies derived from the regression analysis should be verified for their performance through simulation. They used both linear and non-linear regression models in which the storage and inflows of the current and previous periods were the independent variables. A stepwise regression procedure was adopted to determine the regression coefficients. Karamouz and Houck (1982) examined the applicability of the operating rules obtained through regression. They proposed an iterative technique to refine the regression coefficients in which the DP–regression–simulation cycle was executed repeatedly, and applied the technique to the annual and monthly operation of a single reservoir. Karamouz et al. (1992) applied implicit stochastic optimization (ISO) schemes through a three-step cycle to improve the initial operating rules for a multireservoir system. The optimal solutions were then analyzed in a regression procedure to obtain a set of operating rules.

Celeste et al. (2009) used ISO to determine monthly operating rules for a reservoir system located in the semiarid Northeast of Brazil. ISO employs a deterministic optimization model to find optimal reservoir allocations under several possible inflow scenarios. Liu et al. (2014) derived operating rules using a simulation-based optimization method in the context of implicit stochastic optimization of China’s Three Gorges Reservoir system. Parameter uncertainty for the reservoir operating rules was analyzed using Linear Regression (LR) and Bayesian simulation (BS), and it was established that LR performed less well than BS. Celeste et al. (2016) assessed the suitability of ISO against Stochastic Dynamic Programming (SDP) for setting up reservoir release policies and established that the performance of both SDP and ISO is superior to that of the Standard Operating Procedure followed and close to that of perfect-forecast deterministic optimization (PFDO). Furthermore, the simple ISO was shown to perform similarly to the more complex SDP. Moreira and Celeste (2017) applied ISO to develop monthly operating rules using the forecast of the mean inflow for a future horizon instead of the current-month inflow. With a hundred different 100-year monthly synthetically generated inflow scenarios, release policies were also derived using SDP and PFDO. The comparison between ISO and SDP showed small differences between the two, justifying the adoption of ISO for its simpler mathematics as opposed to SDP. Celeste and Ahmed (2018) focused on assessing the applicability of an ISO procedure to derive rule curves for two dams of contrasting reservoir scales in terms of physical and operational characteristics. They established that ISO provided operating rules similar to, and even less vulnerable than, those derived by stochastic dynamic programming. Avila et al. (2020) applied ISO with copula functions to simulate long-term operating policies for a hydropower reservoir located in the Northeastern region of Brazil.
Overall, ISO is considered one of the most reliable techniques for deriving long-term reservoir operating rules. This review shows that, in ISO for reservoir management, some researchers used regression analysis and others used synthetically generated inflow sequences to deal with the stochasticity of inflow.

Study area

Fig. 1 Damodar valley system

In this study, the Damodar Valley (DV) system, a four-reservoir multi-purpose system in Jharkhand, India, has been considered. The four reservoirs in the DV system are Konar, Tilaiya, Panchet and Maithon, as shown in Fig. 1. Konar and Panchet are located on the river Damodar, and Tilaiya and Maithon are on the river Barakar, a tributary of the Damodar. Durgapur Barrage is located 17 km downstream of the confluence of the Damodar and the Barakar near Dishergarh in Bardhaman District, West Bengal. Tilaiya, Panchet and Maithon are associated with hydroelectric power plants of capacities 4 MW, 40 MW and 60 MW respectively. The total catchment area of the river Damodar and its tributaries is about 6159 km\(^{2}\). The catchment is irregular in shape and somewhat elongated in the lower reach.

The mean annual rainfall is 125.9 cm in the Barakar basin, 127.1 cm in the Damodar basin and 132.89 cm in the lower valley (downstream of the confluence of the Damodar and the Barakar). About 82% of the total rainfall occurs during the south-west monsoon period (July to September). During the post-monsoon (October–November) and pre-monsoon (April–May) months the amount of rainfall is about 8% and 7% respectively. During the summer season (May) the mean daily maximum temperature exceeds 40 \(^{\circ }\text {C}\), and in winter the mean daily maximum temperature is below about 24 \(^{\circ }\text {C}\). Due to high temperature, humidity and wind, almost 50% of the total annual evaporation takes place during March to June. The mean relative humidity in the area is almost 80% during the south-west monsoon period.

Fig. 2 Monthly inflow distribution during monsoon period

In the DV system there is Municipal and Industrial (M&I) demand at: (1) downstream of Konar up to Panchet, (2) downstream of Tilaiya up to Maithon and (3) downstream of Panchet and Maithon up to Durgapur Barrage. Although M&I demand varies from reservoir to reservoir, it is constant for a particular reservoir throughout the year. Irrigation demand is placed downstream of Durgapur Barrage and varies throughout the year. For the sustenance of aquatic life and ecology in the streams, the quantity of water released for M&I use is considered adequate to meet this mandatory requirement, and a minimum quantity (2.1 m\(^{3}\)/s) is to be released from the barrage to maintain the river ecology.

Statistical parameters (maximum, minimum, mean, standard deviation, skewness and correlation coefficient) for all monthly inflows have been computed from the historical inflows of 56 years (1962–2017) and are presented in Table 1. With the help of a multivariate gamma AR(1) model, a 1000-year synthetic sequence resembling the historic sequence has been generated. The multivariate AR(1) model proposed by Matalas (1967) preserves the lag-zero and lag-one autocorrelations and cross-correlations. In this study, the randomness of the inflows (hydrologic time series) has been tested by computing the lag-zero autocorrelation, which is close to zero. The monthly mean values show that the maximum inflow into the system takes place from June to October, and the maximum and minimum values of these months also indicate very high variation. In these five months an appreciable amount of skewness is also observed, which indicates a non-normal distribution. Some months (July, August, October, November) show strong correlation. The variation of monthly inflows into the four reservoirs during June to October is presented in Fig. 2.
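The summary statistics listed above can be computed along the following lines. This is a minimal sketch, not the authors' code: the function name `monthly_inflow_statistics`, the (years × 12) array layout, the moment-based skewness estimator and the choice of a lag-one serial correlation (each month against the preceding month) are all assumptions made for illustration, and the demonstration data are random numbers rather than DV inflows.

```python
import numpy as np


def monthly_inflow_statistics(inflow):
    """Summary statistics for a (n_years, 12) array of monthly inflows.

    Hypothetical helper: layout and estimators are assumptions.
    """
    inflow = np.asarray(inflow, dtype=float)
    mean = inflow.mean(axis=0)
    centered = inflow - mean
    m2 = (centered ** 2).mean(axis=0)
    m3 = (centered ** 3).mean(axis=0)
    # Moment-based skewness; set to 0 where a month has no variance.
    skew = np.divide(m3, m2 ** 1.5, out=np.zeros_like(m2), where=m2 > 0)

    # Lag-one correlation: month m against the month just before it,
    # using the record flattened into chronological order.
    series = inflow.ravel()
    lag1 = np.empty(12)
    for m in range(12):
        idx = np.arange(m, series.size, 12)
        idx = idx[idx >= 1]  # the very first month has no predecessor
        lag1[m] = np.corrcoef(series[idx], series[idx - 1])[0, 1]

    return {
        "max": inflow.max(axis=0),
        "min": inflow.min(axis=0),
        "mean": mean,
        "std": inflow.std(axis=0, ddof=1),
        "skew": skew,
        "lag1": lag1,
    }


# Illustrative data only: 56 years of random gamma-distributed "inflows".
rng = np.random.default_rng(0)
demo_inflow = rng.gamma(shape=2.0, scale=50.0, size=(56, 12))
stats = monthly_inflow_statistics(demo_inflow)
```

Each entry of `stats` is a length-12 array, one value per calendar month, which corresponds to one row of a table such as Table 1 for a single reservoir.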

Table 1 Statistical parameters of inflow of DV system

Methodology

In this paper, optimal operating policies of a multi-purpose multi-reservoir system are developed through an implicit optimization method. At the beginning, two variable selection techniques, (1) All Possible Regression (APR) and (2) Stepwise Regression (SR), are discussed. It is assumed that the regression equations are linear and that a constant term (intercept) is present in the equation. The decomposition models and the simulation technique are then described in relation to the DV system. Finally, the approach is applied to the DV system with a number of synthetic inflow sequences generated by the multivariate gamma AR(1) model as input, and the results are presented.

Here a subset of independent (predictor) variables is to be selected for use in a regression equation from among the many potential variables present in the system. Different methods for subset selection are available which use different statistics for the test of significance, and in practice they may not all lead to the same solution when applied to the same problem (Draper and Smith 1968; Kennedy and Gentle 1980). Hence, the performance of different model building methods is to be examined before selecting the final equation. In the present study, two methods widely used for regression analysis in statistics, APR and SR, have been applied to reservoir management and tested.

All possible regression (APR) technique

One of the most comprehensive but cumbersome ways of selecting subset regression is to compute all possible subset regressions. In all possible regression, the pool of q variables is divided into subsets which involve n variables, \(n=1,2,\ldots q.\) In each subset, regressions are done for all possible combinations of the predictors, and each regression equation is ordered according to some criterion; usually the criterion is the value of \(R^{2}\) (coefficient of determination) achieved by the least square fit (Draper and Smith 1968, Chapter-8). The leaders in this ordering within each subset are then selected for further examination and a decision is made as to which equation is best to be used, after examining the \(R^{2}\) [correlation coefficients computed for optimal storage states (dependent variables), initial storage states and inflows into reservoir] values.

The major drawback of this approach is the number of regressions that must be calculated: \(2^{q}-1\) for a pool of q predictor variables. Contrasting with the computational effort involved is the valuable information provided by examining all possible subsets. Most other variable selection procedures yield only one or perhaps a few candidates for the final prediction equation. By examining summary information one can identify the better subsets and then choose one or more according to the nature and purpose of the investigation. Considerable attention has been directed at identifying the best (largest \(R^{2}\) or, equivalently, smallest residual sum of squares) subsets of a given size; i.e., of all subsets having a fixed number of predictor variables, identifying the one that has the largest \(R^{2}\) (Gunst and Mason 1980, Chapter-8).

The computational burden of evaluating all possible regressions, or the selection criteria for all of them, has led to research into more efficient means of determining the best subsets. The disadvantage of such best-subset procedures is that only the single best subset of a given size is identified, whereas other good subsets that are only trivially inferior to the best one might not be identified, yet these other subsets may be preferable due to problem-related considerations.
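The enumeration described in this section can be sketched as follows. The code is an illustration under stated assumptions, not the implementation used in the study: the function names `r_squared` and `all_possible_regressions` are hypothetical, predictors are indexed from 0 (rather than 1 to 9 as in the later sections), and the data are synthetic.

```python
from itertools import combinations

import numpy as np


def r_squared(X, y):
    """R^2 of a least-squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def all_possible_regressions(X, y):
    """Fit every non-empty subset of predictors; keep the leader of each size.

    Returns (best, n_fits), where best[n] = (subset, R^2) for subset size n.
    """
    q = X.shape[1]
    best = {}
    n_fits = 0
    for n in range(1, q + 1):
        for subset in combinations(range(q), n):
            r2 = r_squared(X[:, list(subset)], y)
            n_fits += 1
            if n not in best or r2 > best[n][1]:
                best[n] = (subset, r2)
    return best, n_fits


# Synthetic example: only predictors 0 and 2 actually drive the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 2.0 + 0.8 * X[:, 0] + 0.5 * X[:, 2] + 0.05 * rng.normal(size=200)

best, n_fits = all_possible_regressions(X, y)
for n, (subset, r2) in best.items():
    print(n, subset, round(100 * r2, 2))  # R^2 reported in percent
```

With q = 4 candidate predictors the loop performs \(2^{4}-1=15\) fits; with the 9 candidates considered for the DV system it would perform 511 fits per reservoir and month. A fuller version would retain several leading subsets of each size rather than only the single best, which addresses the drawback noted above.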

Stepwise regression (SR) technique

The stepwise regression procedure is a selection technique that sequentially adds or deletes single predictor variables to or from the regression equation, based on a certain criterion. Since a series of steps is involved and each step leads directly to the next, this method involves the calculation of a much smaller number of equations than the \(2^{q}-1\) required for the APR approach.

At each step of the SR, a predictor variable is added to the regression if its t-statistic is the largest one calculated and exceeds the critical value, \(t_{c}\). Next, each predictor variable already chosen is reconsidered and eliminated from the selected subset if its t-statistic is the smallest one and does not exceed \(t_{c}\). This process continues until all q predictor variables are in the equation or until the selection criteria are no longer met. At any step, the SR method allows one to judge the contribution of each predictor in the subset as though it were the most recent variable to enter the regression equation.

The advantages of SR are that (1) a variable which does not make any significant contribution is eliminated at the entry level for a given step and (2) it makes more power and information readily available than the ordinary multiple regression method. The major disadvantage of SR is that the end result is a single subset which may not necessarily be the best for a given size (Gunst and Mason 1980, Chapter-8).
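The add-then-reconsider loop described above might be sketched as below. This is a simplified illustration, not the procedure used in the study: the names `t_statistics` and `stepwise_regression` are hypothetical, absolute t-values are compared against a single critical value `t_c` for both entry and removal (an assumption), predictors are indexed from 0, and the data are synthetic.

```python
import numpy as np


def t_statistics(X, y):
    """t-statistics of the slope coefficients in a fit with an intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sigma2 = float(resid @ resid) / (n - A.shape[1])
    cov = sigma2 * np.linalg.inv(A.T @ A)
    return (coef / np.sqrt(np.diag(cov)))[1:]  # drop the intercept


def stepwise_regression(X, y, t_c=2.0, max_steps=100):
    """Return the sorted indices of the predictors retained by the procedure."""
    q = X.shape[1]
    selected = []
    for _ in range(max_steps):  # guard against cycling
        changed = False
        # Entry step: add the candidate with the largest |t|, if above t_c.
        candidates = [j for j in range(q) if j not in selected]
        if candidates:
            t_new = {
                j: abs(t_statistics(X[:, selected + [j]], y)[-1])
                for j in candidates
            }
            j_best = max(t_new, key=t_new.get)
            if t_new[j_best] > t_c:
                selected.append(j_best)
                changed = True
        # Removal step: drop the variable with the smallest |t|, if not above t_c.
        if selected:
            t_in = np.abs(t_statistics(X[:, selected], y))
            k = int(np.argmin(t_in))
            if t_in[k] <= t_c:
                del selected[k]
                changed = True
        if not changed:
            break
    return sorted(selected)


# Synthetic example: only predictors 0 and 3 actually drive the response.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = 1.0 + 0.9 * X[:, 0] + 0.6 * X[:, 3] + 0.1 * rng.normal(size=200)
selected = stepwise_regression(X, y, t_c=2.0)
print(selected)
```

Each pass fits at most q small regressions, so the total work grows far more slowly than the \(2^{q}-1\) fits of APR. The price, as noted above, is that only one subset is returned, and with a loose critical value an irrelevant predictor can occasionally enter by chance.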

Application to DV reservoir system

In the implicit stochastic DP framework, the operation problem of the DV system of reservoirs is first solved for 100 sets of synthetic inflow sequences (generated by the multivariate gamma AR(1) model), using (1) the Discrete Dynamic Programming (DDP) algorithm proposed by Bellman and Dreyfus (1962), considering one reservoir at a time, and (2) the Discrete Differential Dynamic Programming (DDDP) algorithm proposed by Heidari et al. (1971), with the help of initial trajectories obtained from DDP. Both are based on the following system dynamics (Eqs. 1–4), system constraints (Eqs. 5–6), performance function (Eq. 7), objective function (Eq. 8) and recursive equation (Eq. 9):

$$\begin{aligned} x_{t+1}(1)= & {} x_{t}(1)+y_{t}(1)-u_{t}(1)-ev_{t}(1)\end{aligned}$$
(1)
$$\begin{aligned} x_{t+1}(2)= & {} x_{t}(2)+y_{t}(2)-u_{t}(2)-ev_{t}(2)\end{aligned}$$
(2)
$$\begin{aligned} x_{t+1}(3)= & {} x_{t}(3)+y_{t}(3)+u_{t}(1)-en_{t}(3)-u_{t}(3)-ev_{t}(3) \end{aligned}$$
(3)
$$\begin{aligned} x_{t+1}(4)= & {} x_{t}(4)+y_{t}(4)+u_{t}(2)-en_{t}(4)-u_{t}(4)-ev_{t}(4) \end{aligned}$$
(4)
$$\begin{aligned}&{\mathbf {C}}_{\min }\le {\mathbf {x}}_{t}\le {\mathbf {C}}_{\max }\quad ({\text {on storage}}) \end{aligned}$$
(5)
$$\begin{aligned}&{\mathbf {u}}_{t,\min }\le {\mathbf {u}}_{t}\le {\mathbf {u}}_{t,\max }\quad ({\text {on release}}) \end{aligned}$$
(6)
$$\begin{aligned} L_{t}= & {} [(D_{t}(1)-u_{t}(1))/D_{t}(1)]^{2}+[(D_{t}(2)-u_{t}(2))/D_{t}(2)]^{2}+[(D_{t}^{c}-u_{t}(3)-u_{t}(4))/D_{t}^{c}]^{2} \end{aligned}$$
(7)
$$\begin{aligned}&\min _{u_{t}} \sum _{t=1}^{N}L_{t}(u_{t}) \end{aligned}$$
(8)
$$\begin{aligned}&V_{t}({\mathbf {x}}_{t}^{i}) =\min _{{\mathbf {x}}_{t+1}^{k}}\{L_{t}+V_{t+1}({\mathbf {x}}_{t+1}^{k})\}\qquad i=1,2,\ldots ,I,\qquad t=1,2,\ldots ,N \end{aligned}$$
(9)

\(x_{t}(i)\), \(u_{t}(i)\), \(y_{t}(i)\) and \(ev_{t}(i)\) represent the volume of storage, the volume of release, the volume of inflow and the evaporation loss of the ith reservoir during time period t, and \(en_{t}(i)\) represents the en-route loss in the reach between the \((i-2)\)th reservoir and the ith reservoir. \({\mathbf {C}}_{\min }\) and \({\mathbf {C}}_{\max }\) represent the vectors of minimum and maximum storage capacities of the four reservoirs, \({\mathbf {u}}_{t,\min }\) is the vector of minimum mandatory releases, and \({\mathbf {u}}_{t,\max }\) is the vector of maximum permissible releases. \(L_{t}\) is the single-stage loss function, \(D_{t}(1)\) and \(D_{t}(2)\) represent the water supply target levels for reservoirs Konar and Tilaiya respectively, and \(D_{t}^{c}\) represents the combined water supply target level for reservoirs Panchet and Maithon. In Eq. 9, \(V_{N+1}({\mathbf {x}}_{N+1}^{k})=0\) for all k (number of iterations), and \({\mathbf {x}}_{t+1}^{i}\), \({\mathbf {x}}_{t+1}^{k}\) and \({\mathbf {u}}_{t}={\mathbf {f}}({\mathbf {x}}_{t}^{i},{\mathbf {x}}_{t+1}^{k})\) must be feasible. The bracketed numbers 1, 2, 3 and 4 attached to the notation represent reservoirs Konar, Tilaiya, Panchet and Maithon respectively. In the system dynamics (Eqs. 1–4), seepage loss has been neglected and the reduction of live storage space due to sedimentation is assumed to be zero.
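To make the continuity equations and the loss function concrete, the sketch below evaluates Eqs. 1–4 and Eq. 7 for one stage. It is illustrative only: the function names `step_storage` and `stage_loss` are hypothetical, arrays are indexed 0 to 3 for Konar, Tilaiya, Panchet and Maithon, and all numerical values are made up rather than taken from the DV system.

```python
import numpy as np


def step_storage(x, y, u, ev, en):
    """One-stage mass balance for the four reservoirs (Eqs. 1-4).

    Index 0..3 = Konar, Tilaiya, Panchet, Maithon.
    en[0] and en[1] are unused: en-route loss applies only to the
    reaches feeding the two downstream reservoirs.
    """
    x_next = np.empty(4)
    x_next[0] = x[0] + y[0] - u[0] - ev[0]                          # Eq. 1
    x_next[1] = x[1] + y[1] - u[1] - ev[1]                          # Eq. 2
    x_next[2] = x[2] + y[2] + u[0] - en[2] - u[2] - ev[2]           # Eq. 3
    x_next[3] = x[3] + y[3] + u[1] - en[3] - u[3] - ev[3]           # Eq. 4
    return x_next


def stage_loss(u, d1, d2, dc):
    """Single-stage loss L_t (Eq. 7): squared relative deficits."""
    return (
        ((d1 - u[0]) / d1) ** 2
        + ((d2 - u[1]) / d2) ** 2
        + ((dc - u[2] - u[3]) / dc) ** 2
    )


# Made-up volumes in arbitrary consistent units.
x = np.array([100.0, 200.0, 300.0, 400.0])   # initial storage
y = np.array([10.0, 20.0, 30.0, 40.0])       # inflow
u = np.array([5.0, 10.0, 15.0, 20.0])        # release
ev = np.array([1.0, 2.0, 3.0, 4.0])          # evaporation loss
en = np.array([0.0, 0.0, 1.0, 2.0])          # en-route loss

x_next = step_storage(x, y, u, ev, en)       # [104, 208, 316, 424]
loss = stage_loss(u, d1=10.0, d2=20.0, dc=50.0)  # 0.25 + 0.25 + 0.09 = 0.59
```

Note how the release of Konar enters the balance of Panchet and the release of Tilaiya enters that of Maithon, which is the coupling that the decomposition models later exploit. The storage and release bounds of Eqs. 5–6 would be checked on `x_next` and `u` before a transition is accepted in the DP recursion.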

From the 100 sets of optimal trajectories obtained from DDDP, the average operating policies are determined for each reservoir for each month. As the final storage state (\({\mathbf {x}}_{t+1}\)) during stage t is used as the decision variable in the deterministic model, the same is used as the dependent variable (response) in the regression analysis. The independent (predictor) variables are \({\mathbf {x}}_{t}(i)\), \({\mathbf {y}}_{t}(i)\), \(i=1,\ldots ,M\) and \(y_{t-1}(j)\) for the particular reservoir j. For convenience, these independent variables are assigned names as follows:

Independent variable | \(x_{t}(1)\) | \(x_{t}(2)\) | \(x_{t}(3)\) | \(x_{t}(4)\) | \(y_{t}(1)\) | \(y_{t}(2)\) | \(y_{t}(3)\) | \(y_{t}(4)\) | \(y_{t-1}(j)\)
Name assigned | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

This makes a total of 9 candidate independent variables. A linear equation (with a constant term) is assumed for the regression. In the present case, dealing with non-linear equations is extremely difficult because the number of predictor variables is large, and so it is not attempted here.

For the selection of the appropriate predictor variables from the total of 9 candidate variables, the proposed approach starts with the APR as described in “All possible regression (APR) technique”. In APR, out of several subsets, four subsets (SS1, SS2, SS3 and SS4, with 2, 4, 6 and 8 independent variables respectively) are used for comparison. Then the SR (as described in “Stepwise regression (SR) technique”) is applied and continued until all 9 independent variables are in the equation or the selection criterion (critical value, \(t_{c}\)) is no longer met. As these two techniques deal only with the statistical significance of the regressions, another group of subsets is formulated based on the decomposition of the original four-reservoir problem into smaller sub-problems, which is described next.

Decomposition model

Decomposition models are based on the physical significance of the system variables in terms of continuity of flow and configuration of the system. In this study, for the DV system, the continuity equations (Eqs. 1–4) indicate that for any reservoir the final storage is primarily a function of the initial storage state and the inflow state of that reservoir during stage t. As a reservoir is of finite size, any amount above the maximum storage capacity will be spilled. Similarly, as the downstream channel is of finite capacity, the outflow will not exceed the optimal channel capacity. In this problem the downstream demand is predefined and the sum of the squares of the deviations of outflow from target is minimized. Considering the integrated operation of the reservoirs, the other influencing variables are the states of the other reservoirs upstream or downstream, connected in series or parallel, although their order of influence cannot be judged precisely. Based on these considerations, four different configurations of decomposition models (DM-A, DM-B, DM-C and DM-D) are constructed, as shown in Fig. 3.

Fig. 3 Decomposition models with different configurations

DM-A represents the simplest (elementary) model, which uses a single reservoir unit; the subset contains only two predictor variables, storage and inflow. Four models (DM-A21, DM-A22, DM-A23 and DM-A24) are constructed in this configuration. The DM-B configuration is constructed considering a unit of two reservoirs, and the subsets consist of four independent variables. Two models (DM-B41 and DM-B42) are constructed in this configuration. The DM-C configuration consists of three reservoirs and, accordingly, the subset contains six predictor variables. In this configuration also, two models are constructed (DM-C61 and DM-C62). Finally, DM-D81 considers the original configuration of the four-reservoir system and, accordingly, the subset contains eight independent variables (Table 2).

Table 2 Regression equations of different decomposition models

For the downstream reservoirs, \(y_{t}^{'}\) represents the net inflow whereas \(y_{t}\) represents the local (intermediate) inflow. Each of them (\(y_{t}\) and \(y_{t}^{'}\)) is used in the equation separately to observe the corresponding effects. In the present case, the contribution of \(y_{t}\) to the net inflow \(y_{t}^{'}\) is much larger than that of the upstream release. Also, the presence of \(y_{t}\) in the regression equation avoids the forecast errors associated with the inflow to the upstream reservoir. The previous period’s inflow instead of the current period’s inflow is also tested. However, this is not tested for the other three configurations (DM-B, DM-C and DM-D). In Table 2, the numbers 1, 2, 3 and 4 represent reservoirs Konar, Tilaiya, Panchet and Maithon respectively, and \(b_{j,t}(i)\) is the jth regression coefficient for the ith reservoir for stage t. The decomposition models (DM-A, DM-B, DM-C and DM-D) thus amount to the stepwise inclusion of different reservoir units while maintaining the spatial continuity of the system.
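The regression equations themselves are those listed in Table 2. Purely as a hedged illustration of the notation, and assuming that the constant term is indexed \(j=0\) (an assumption, not stated in the text), the elementary DM-A configuration with the storage and inflow of a single reservoir as predictors would take a form such as

$$\begin{aligned} x_{t+1}(i)=b_{0,t}(i)+b_{1,t}(i)\,x_{t}(i)+b_{2,t}(i)\,y_{t}(i) \end{aligned}$$

with \(y_{t}(i)\) replaced by \(y_{t}^{'}(i)\) or \(y_{t-1}(i)\) in the variants that use the net inflow or the previous period’s inflow. The higher configurations would extend the right-hand side with the storage and inflow terms of the additional reservoir units.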

In the decomposition models, reservoir units are included in the system one after another while maintaining the spatial continuity. The main advantage of the decomposition models is that the effect of one reservoir on another can be studied clearly. The main disadvantage is that, as the number of reservoirs in the system increases, the number of regression equations grows, and the number of predictor variables increases with the number of reservoirs in the different configurations. The APR and the SR serve as guidelines, in terms of statistical significance, for selecting the appropriate decomposition model.

Simulation

From the regression analysis, a model can be selected by considering the physical and statistical significance. However, the real test of how good the resulting regression model is will depend on the ability of the model to predict the dependent variable for observations on the independent variables that were not used in estimating the regression coefficients (Haan 1979, Chapter-10). To make a comparison of this nature, the historical inflow sequence, which is not used in forming the data set for regression, is used with the regression coefficients obtained from the different decomposition models (DM-A21, DM-A22, DM-A23, DM-A24, DM-B41, DM-B42, DM-C61, DM-C62, DM-D81). The resulting release sequences are then used to compute the average values of the objective function (Eq. 8). With the help of the simulation model, the performance and predictive efficiency of the different regression models are computed, and among models having similar objective function values the one with fewer independent variables is selected.
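The simulation step can be sketched for a single reservoir operated with a DM-A type rule. This is a simplified illustration, not the simulation model of the study: the function `simulate_rule`, its arguments, the way bounds and spill are enforced, and all numbers are assumptions; the real system couples four reservoirs through Eqs. 1–4 and uses the loss of Eq. 7.

```python
import numpy as np


def simulate_rule(b, x0, inflow, evap, demand, c_min, c_max, u_min, u_max):
    """Simulate one reservoir under a linear storage rule.

    b is a (n_periods, 3) array of coefficients (intercept, storage, inflow),
    e.g. one row per month; rows are reused cyclically. The rule predicts
    the end-of-stage storage; the release then follows from continuity.
    Returns (total squared relative deficit, storage path, release path).
    Simplification: storage is not re-checked against c_min after the
    release has been clipped to [u_min, u_max].
    """
    x = float(x0)
    storages, releases, total_loss = [x], [], 0.0
    for t in range(len(inflow)):
        b0, b1, b2 = b[t % len(b)]
        x_target = float(np.clip(b0 + b1 * x + b2 * inflow[t], c_min, c_max))
        available = x + inflow[t] - evap[t]
        u = float(np.clip(available - x_target, u_min, u_max))
        x_next = available - u
        x_next -= max(x_next - c_max, 0.0)  # spill anything above capacity
        total_loss += ((demand[t] - u) / demand[t]) ** 2
        storages.append(x_next)
        releases.append(u)
        x = x_next
    return total_loss, np.array(storages), np.array(releases)


# Made-up example: compare two hypothetical rules on the same inflow record.
inflow = [30.0, 80.0, 120.0, 60.0, 20.0, 10.0]
evap = [2.0] * 6
demand = [40.0] * 6
rule_a = np.array([[0.0, 1.0, 0.0]])      # hold storage constant
rule_b = np.array([[50.0, 0.8, 0.3]])     # illustrative coefficients only
results = {
    name: simulate_rule(b, 300.0, inflow, evap, demand, 50.0, 500.0, 0.0, 150.0)[0]
    for name, b in {"rule_a": rule_a, "rule_b": rule_b}.items()
}
preferred = min(results, key=results.get)
```

The comparison at the end mirrors the selection logic described above: each candidate regression model is run against an inflow record that was not used to fit it, and the objective function values are then compared across models.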

Results and discussion

APR technique

The APR technique yields a number of solutions based on different subsets instead of a particular solution. As mentioned in “All possible regression (APR) technique”, the best of each subset is selected based on the largest value of \(R^{2}\) (in percent) for each reservoir for each month. For subsets SS1, SS2, SS3 and SS4, the corresponding \(R^{2}\) values obtained from the decomposition models DM-A, DM-B, DM-C and DM-D having subsets of the same size have also been computed. The variable names corresponding to the largest \(R^{2}\) values obtained from those subsets and decomposition models are presented in Table 3. In all the following tables, variable numbers are written without separating commas.

Table 3 Selected variables from different subsets and decomposition models based on highest \(R^{2}\) value

From Table 3, it is observed that most of the variance in the response can be explained by two or at most four variables, and that the regression model should contain the two primary variables, storage and inflow, for the particular reservoir. The table also demonstrates the limitation of a selection process based on the highest \(R^{2}\) value: e.g., for the month of February and for the subset of two variables, the highest \(R^{2}\) criterion selects variables 1 [\(x_{t}(1)\)] and 2 [\(x_{t}(2)\)], whereas the corresponding decomposition model with variables 1 [\(x_{t}(1)\)] and 5 [\(y_{t}(1)\)] yields almost the same \(R^{2}\) value but contains variables that are physically more significant than the subset of variables 1 and 2.

It is also observed that the 9th variable, i.e., the previous period’s inflow for the corresponding reservoir, appears only from the fourth subset onward. This indicates that the predictor variable \(y_{t-1}\) has very little influence on the response, and so its presence in the regression models can be avoided. This is partly due to the presence of both \(y_{t}\) and \(y_{t-1}\) in the data matrix, which are correlated. In the subset of two predictor variables, \(y_{t}\) is always present but \(y_{t-1}\) is not. However, at this point \(y_{t-1}\) is not discarded as a predictor variable, and it is retained for further examination.

SR technique

Table 4 Selected variables from stepwise regression

In the SR technique a particular model is selected based on the t-statistic. In Table 4 the variables selected are shown in terms of their numbers for each reservoir for each month. It is observed from Table 4 that the \(x_{t}\) and \(y_{t}\) values for the corresponding reservoirs (1, 5 for reservoir 1 and 2, 6 for reservoir 2) are present in the selected models for almost every month. Regarding the downstream reservoirs, Table 4 shows that the \(x_{t}\) and \(y_{t}\) values for reservoir 3 (numbers 3, 7) and reservoir 4 (numbers 4, 8) are almost always present in the models. The 9th variable (\(y_{t-1}\)) occurs in some months for different reservoirs. In a few places \(y_{t-1}\) is present in the absence of \(y_{t}\) for a particular reservoir. In general, although the effect of \(y_{t-1}\) is not very prominent, it is retained for further examination. Regarding the number of variables in a subset, in most cases stepwise regression suggests a subset of at least four variables (sometimes it includes all the variables).

In general, the results from APR and SR always indicate the use of the elementary variables (storage and inflow). Regarding the inclusion of other system variables, the models suggested by these two techniques are often similar to the decomposition models, although in some cases they suggest the inclusion of variables of lesser physical importance. Hence, the selection of the final regression model will be based on the set of decomposition models, starting from the simplest model with two variables up to the full model with all reservoirs.

Decomposition models and simulation

Table 5 Decomposition models in terms of objective function value

The decomposition models are based on the physical significance of the system variables, and from the APR and SR analysis it is found that these models also have acceptable statistical significance. Hence, the final selection of a particular regression equation can be made from these decomposition models. The APR analysis suggests models with at most four variables, whereas the SR method sometimes suggests larger models. Before the final selection, another comparison is made among these decomposition models through simulation. All nine regression models are used with the historical inflow sequences (1961–2018) to determine the corresponding releases, and the values of the objective function are computed as shown in Table 5. The lowest value is obtained from DM-B41.

Fig. 4 Trajectories of different decomposition models

The storage behaviors of the four reservoirs corresponding to the different decomposition models are shown in Fig. 4. The trajectories are shown for the first five years only. Although no specific inferences can be drawn, these figures show that the trajectories obtained from DM-A23 (based on \(y_{t-1}\)) are somewhat different from the others, which are almost identical. Because the M&I demand on the two upstream reservoirs is fixed whereas the irrigation demand on the two downstream reservoirs is variable, the nature of the trajectories of the upstream reservoirs differs from that of the downstream reservoirs.

Selection of the final model from competing models of different size is actually a compromise between two opposing criteria (Draper and Smith 1968, Chapter-6): (1) to select a model with a small number of variables because of the difficulties involved in obtaining information on a large number of predictors (specifically for inflows) and (2) to select a model with as many variables as possible for better representation of the system and for better prediction. The simpler models (DM-A and DM-B) have the advantage of using fewer variables: for any reservoir, decisions are based on the state of that particular reservoir. Forecast errors or numerical errors associated with the state variables of the other reservoirs do not affect the decision much. However, these models cannot incorporate other significant configurational aspects, such as the joint operation of downstream reservoirs, which may sometimes have greater consequences.

On the other hand, higher models (DM-C and DM-D) give a better approximation of the system with regard to the system configuration and the influence of other reservoirs on a particular reservoir. However, the numerical errors associated with the use of regression are also larger for higher models. For example, for the complete model DM-D81, during a particular stage t and for any reservoir, the final storage state \(x_{t+1}\) is based on the storage states \(x_{t}\) of all reservoirs, which were themselves computed from regression equations during stage \(t-1\). Thus, as stage t increases, numerical errors also increase. Also, as the equations for DM-D81 involve a larger number of random variables \(y_{t}\), forecast errors associated with these variables will also affect the decision, and the error will be propagated.

Based on statistical significance, physical significance and objective function values obtained through simulation, and other justifications discussed above, model DM-B41 is selected.

Conclusion

In the implicit stochastic DP framework, a generalized approach is proposed for the selection of appropriate variables in the regression. The variable selection techniques based on statistical significance, APR and SR, are powerful tools, but variables that are important from the physical justification of the problem may sometimes be discarded. So, care should be taken regarding the final selection. The approach selects the appropriate model from judgments based on the statistical and physical significance of the presence of a variable in the regression model, the performance of the model through simulation, and the point of view of utility. The statistical performance of the different decomposition models, which are based on the spatial decomposition of the original problem and maintain the essence of the physical significance of the operation problem, is also quite justified and acceptable. Therefore, the final selection can be made from this configuration of models.

Obtaining precise information about some variables (e.g., inflow) is very difficult and there is a possibility of propagation of error; therefore, models with fewer predictor variables are selected from a group of models with similar performance. Here simulation serves as an important basis for selecting among models that are almost similar in terms of statistical or physical significance, and DM-B41 is selected as the most preferred model. All these conclusions are system dependent in general and true for the DV system in particular.