1 Introduction

The first thing that springs to mind for understanding, forecasting, and improving the behavior of complex experiments for real-life phenomena, industrial applications, and scientific investigations is a data-based model. Designing and modeling a studied experiment are the two key stages for this purpose. The main purpose of the first stage, designing the experiment, is the selection of a representative dataset that provides precise information and a correct understanding of the most significant features and behavior of the phenomenon under experimentation (cf. Elsawah 2021a). Modeling the collected representative dataset, i.e., screening the relationship between input factors and their responses, is the second stage; it can be used to estimate unknown parameters and predict the behavior of the studied phenomenon, and thus guide the investigators to improve the inputs or experimental conditions for optimizing the corresponding outputs (cf. Elsawah 2021b). This logical idea is a classical methodology that is extensively used in computer and physical experiments (cf. Elsawah 2023a, b). For example, it is used in industry for designing the process, reducing the process time, improving the quality of the products by reducing variability and increasing reliability, and reducing the overall costs (cf. Elsawah 2022a).

Efficient designing and modeling methods are able to capture the maximum amount of valuable (accurate) information about the behavior of a given experiment. An efficient model can thus be established based on the optimal representative dataset to screen the relationship between the inputs and their corresponding responses; this model can be used to estimate significant unknown parameters without bias and with minimum variance and to forecast the future behavior of the studied phenomenon. In contrast, inefficient designing and/or modeling methods can neither produce useful and correct information nor provide accurate estimation or prediction (Elsawah 2022b). Practice has demonstrated that effectively designing and modeling experiments are hard problems that experimenters may face in many real-life applications. Despite the fact that many approaches have been offered, the challenge faced by experimenters is still daunting.

The main obstacle to improving designing and modeling methods is that researchers improve the methods of each stage independently. On the one hand, the design of experiment approach (Fisher 1935) and its subsequent developments are used to improve the first stage, designing the experiments, and many efficient methods have been given to optimally select representative datasets. On the other hand, the power of the modeling approach and its corresponding methods, such as machine learning (Samuel 1959), is used to improve the second stage, modeling the experiments. However, these two approaches are complementary rather than alternative, and their power can be merged so that they support each other. The combination of design of experiment and modeling has recently attracted the attention of researchers (cf. Lujan-Moreno et al. 2018; Salmaso et al. 2022).

Even though there is an obvious link between the design of experiment and modeling, there are surprisingly few papers addressing the potential usefulness of a combination of the two concepts. For instance, Staelin (2003) used the principles of design of experiment to identify optimal or nearly optimal initial parameter settings in an example of support vector machines; Packianather et al. (2000) applied the Taguchi design approach to optimize the design parameters in an example of neural networks; Sukthomya and Tannock (2005), Ortiz-Rodriguez et al. (2006), and Balestrassi et al. (2009) all reached the conclusion that the design of experiment approach allows for gaining a profound understanding of the effects of parameters on the network performance and hence enables better parameter adjustments. The existing work compares or combines the two concepts in specific areas of interest or for specific problem investigations (cf. for example Mohamed et al. 2023; Prasath et al. 2021, 2022), but a paper producing a generalizable assessment of how the two methodologies can be applied jointly to develop a new efficient designing-modeling approach has not been put forward so far, and the work on this topic is limited. Readers who are interested in learning more about new approaches for designing or modeling experiments may refer to Sikirica et al. (2023), Iordanis et al. (2022), Zhang et al. (2022) and Elsawah (2017a, 2017b).

Consider an experiment with p input factors \( X_1,X_2,\ldots ,X_p\) and only one output factor Y, for which the experimenter wants to estimate the true model \(Y=F(X_1,X_2,\ldots ,X_p)\) that gives the relationship between the p input factors and their corresponding responses. The classical modeling technique estimates the model \(Y=F(X_1,X_2,\ldots ,X_p)\) in one step based on a selected representative dataset, an \(n\times p\) data matrix obtained by selecting n different values from the range of each input factor \({X_i},~i=1,\ldots ,p\). However, the accuracy of the approximate model \({\widehat{Y}}={\widehat{F}}(X_1,X_2,\ldots ,X_p)\) is in many cases not good, especially when there is little or no prior information about the true model. Therefore, the logical idea is that the weight of importance of each input factor needs to be taken into consideration, and the sub-models between the most important input factors and their corresponding responses need to be investigated more closely. This paper presents a sequential designing-modeling technique (SeqST) that takes the weight of the importance of each input factor into consideration. The power of the combination of the sequential design of experiment approach and the sequential modeling approach is investigated. The input factors are added to the proposed technique and modeled sequentially, one input factor at each stage, according to their importance (i.e., expected influence on the output), while each remaining input factor is kept fixed at a given point (value) that has the highest influence based on prior knowledge or an initial experiment (cf. Sect. 3 for more details). Based on this simple introduction of the new proposed SeqST, the following logical questions may arise: How to rank the input factors in order of importance? How to find the point of the highest influence for each factor? What is the effect of the total number of training points on the performance of the SeqST? What is the effect of the number of training points in each stage on the performance of the SeqST? What is the effect of the order of the importance of the input factors on the performance of the SeqST? What is the effect of the gap between the importance of the input factors on the performance of the SeqST? This paper tries to answer these questions in order to investigate the performance of the proposed SeqST for different scenarios, which gives benchmarks to guide experimenters in effectively designing and modeling their experiments. The power of the new proposed SeqST is measured by comparing its performance with that of the classical modeling technique, the single-stage technique (SinST).

The rest of this paper is organized as follows. Section 2 presents the new proposed SeqST. Measuring the importance of each factor and finding the point with the highest influence for each factor are discussed in Sect. 3. Section 4 gives an illustrative example based on the discussions in Sects. 2 and 3. The performance of the new proposed SeqST is compared with the performance of the SinST using linear and non-linear models in Sect. 5. Section 6 gives further investigations of the performance of the proposed SeqST using different scenarios for the number of training points and the order of the importance of the input factors. We close with the conclusion and future work in Sect. 7.

2 The new proposed sequential stages designing-modeling technique

Consider an experiment with p input factors \( X_1,X_2,\ldots ,X_p\) and only one output factor Y and the experimenter wants to find the meta-model \({\widehat{Y}}={\widehat{F}}(X_1,X_2,\ldots ,X_p)\) that gives the relationship between the p input factors and their corresponding responses. This paper presents a step-by-step technique for incorporating design of experiment approach into modeling approach and adapting it to address some drawbacks of the existing techniques. Due to the limitation of the space and for a clear explanation, the new proposed SeqST uses the regression model from the modeling approach, which is the most basic strategy in the modeling approach and its success is more conducive to the proliferation of other advanced models. However, many different models can be used to extend this study. The new proposed SeqST is given by the following steps:

  • \(\underline{{\varvec{Preparation stage:}}}\) Rank the p inputs \(X_1,X_2,\ldots ,X_p\) according to their importance, i.e., influence on the output. Let \(X_{1:p}\ggg X_{2:p}\ggg \cdots \ggg X_{p:p}\) be the corresponding importance order of the p input factors, where \(X_{1:p}\) is the input with the highest importance and \(X_{p:p}\) is the input with the lowest importance. Determine the most important level (value) of each input factor, i.e., the value of each factor that has the highest influence. Let \(x^*_{1:p},x^*_{2:p},\ldots ,\) and \(x^*_{p:p}\) be the p highest influence levels of the p input factors \(X_{1:p},~X_{2:p},\ldots ,\) and \(X_{p:p},\) respectively. It is worth mentioning that the importance (or influence) of the input factors and their levels that have the highest influences can be given based on expert knowledge or on prior information obtained by investigating an initial small experiment. If there is no prior information, Sect. 3 investigates a theoretical method to estimate the importance of each factor and the point with the highest influence for each factor.

  • \(\underline{{\varvec{First designing-modeling stage:}}}\) Generate the first-stage dataset (design) that is an \(n_1\times p\) data matrix \({\textbf{U}}_1=\left[ \textbf{D}_{1},X^*_{2:p},\ldots ,X^*_{p:p}\right] ,\) where \(\textbf{D}_{1}=\left( x^{(1)}_{1:p},\ldots ,x^{(n_1)}_{1:p}\right) ^T\) is an optimal design from the experimental design viewpoint over the domain of the highest importance input factor \(X_{1:p}\) and \(X^*_{k:p}=(x^*_{k:p},\ldots ,x^*_{k:p})^T\) is a vector that all of its \(n_1\) values are fixed to the highest importance level value \(x^*_{k:p}\) of the kth input factor \(X_{k:p}\) for \(k=2,\ldots ,p.\) Calculate the first-stage observed output vector via a physical experiment or exact output vector via a computer experiment, say \(Y_1=F({\textbf{U}}_{1}).\) Find the first-stage meta-model \(\widehat{F_1}\) that is the approximate model for the relationship between the \(\textbf{D}_{1}\) in the first-stage design \({\textbf{U}}_{1}\) and the corresponding first-stage observed output factor \(Y_1=F({\textbf{U}}_{1}).\)

  • \(\underline{{\varvec{Second designing-modeling stage:}}}\) Generate the second-stage design that is an \(n_2\times p\) data matrix \({\textbf{U}}_2=\left[ \textbf{D}_{2},X^*_{3:p},\ldots ,X^*_{p:p}\right] ,\) where \(\textbf{D}_{2}=\left( \begin{array}{cccc} x^{(1)}_{1:p}&{}\ldots &{}x^{(n_2)}_{1:p} \\ x^{(1)}_{2:p}&{}\ldots &{}x^{(n_2)}_{2:p} \\ \end{array}\right) ^T\) is an optimal design from the experimental design viewpoint over the domain of the first two highest importance input factors \(X_{1:p}\) and \(X_{2:p},\) and \(X^*_{k:p}=(x^*_{k:p},\ldots ,x^*_{k:p})^T\) is a vector that all of its \(n_2\) values are fixed to the highest importance level value \(x^*_{k:p}\) of the kth input factor \(X_{k:p}\) for \(k=3,\ldots ,p.\) Calculate the second-stage observed output vector via a physical experiment or exact output factor via a computer experiment, say \(Y_2=F({\textbf{U}}_{2}).\) Find the second-stage meta-model \(\widehat{F_2}\) that is the approximate model for the relationship between the \(\textbf{D}_{2}\) in the second-stage design \({\textbf{U}}_{2}\) and the corresponding second-stage observed output factor \(Y_2=F({\textbf{U}}_{2}).\)

  • \({\underline{{\varvec{Third designing-modeling stage:}}}}\) Generate the third-stage design that is an \(n_3\times p\) data matrix \({\textbf{U}}_3=\left[ \textbf{D}_{3},X^*_{4:p},\ldots ,X^*_{p:p}\right] ,\) where \(\textbf{D}_{3}=\left( \begin{array}{cccc} x^{(1)}_{1:p}&{}\ldots &{}x^{(n_3)}_{1:p} \\ x^{(1)}_{2:p}&{}\ldots &{}x^{(n_3)}_{2:p} \\ x^{(1)}_{3:p}&{}\ldots &{}x^{(n_3)}_{3:p} \\ \end{array}\right) ^T\) is an optimal design from the experimental design viewpoint over the domain of the first three highest influence inputs \(X_{1:p},X_{2:p}\) and \(X_{3:p},\) and \(X^*_{k:p}=(x^*_{k:p},\ldots ,x^*_{k:p})^T\) is a vector that all of its \(n_3\) values are fixed to the highest influence level value \(x^*_{k:p}\) of the kth input for \(k=4,\ldots ,p.\) Calculate the third-stage observed output vector via a physical experiment or exact output vector via a computer experiment, say \(Y_3=F({\textbf{U}}_{3}).\) Find the third-stage meta-model \(\widehat{F_3}\) that is the approximate model for the relationship between the \(\textbf{D}_{3}\) in the third-stage design \({\textbf{U}}_{3}\) and the corresponding third-stage observed output vector \(Y_3=F({\textbf{U}}_{3}).\)

  • \(\underline{{\varvec{P-th designing-modeling stage}}}\) Repeat the above systematic strategy up to the last stage as follows. Generate the pth-stage design that is an \(n_p\times p\) data matrix \({\textbf{U}}_p=\left[ \textbf{D}_{p}\right] ,\) where \(\textbf{D}_{p}=\left( \begin{array}{cccc} x^{(1)}_{1:p}&{}\ldots &{}x^{(n_p)}_{1:p} \\ \vdots &{}\vdots &{}\vdots \\ x^{(1)}_{p:p}&{}\ldots &{}x^{(n_p)}_{p:p} \\ \end{array}\right) ^T\) is an optimal design from the experimental design viewpoint over the domain of all the p inputs \(X_{1:p},X_{2:p},\ldots ,X_{p:p}.\) Calculate the pth-stage observed output vector via a physical experiment or exact output vector via a computer experiment, say \(Y_p=F({\textbf{U}}_{p}).\) Find the pth-stage meta-model \(\widehat{F_p}\) that is the approximate model for the relationship between the \(\textbf{D}_{p}\) in the pth-stage design \({\textbf{U}}_{p}\) and the corresponding pth-stage observed output vector \(Y_p=F({\textbf{U}}_{p}).\)

  • \(\underline{{\varvec{Final Meta-Model:}}}\) To define the overall meta-model, we use the idea of the weighted average for the coefficients of the factors in the meta-models \(\widehat{F_1},\ldots ,\widehat{F_p}.\) For instance, as illustrated in Fig. 1, consider an experiment with three factors without interactions in which the three meta-models are the following polynomial models: \(\widehat{F_1}=\beta _1+a_{11}X_1+a_{12}X^2_1,\) \(\widehat{F_2}=\beta _2+a_{21}X_1+a_{22}X^2_1+b_{21}X_2+b_{22}X^2_2\) and \(\widehat{F_3}=\beta _3+a_{31}X_1+a_{32}X^2_1+b_{31}X_2+b_{32}X^2_2+c_{31}X_3+c_{32}X^2_3.\) Each coefficient is averaged over the stages in which its term appears, so the overall meta-model is the weighted average given as follows:

    $$\begin{aligned} {\widehat{F}}= & {} \frac{1}{3} \sum _{k= 1}^{3} {\beta }_k+\left( \sum _{k = 1}^{3}\frac{a_{k1}}{3}\right) {X_1}+\left( \sum _{k = 1}^{3}\frac{a_{k2}}{3}\right) {X^2_1} +\left( \sum _{k = 2}^{3}\frac{b_{k1}}{2}\right) {X_2} \\{} & {} +\left( \sum _{k = 2}^{3}\frac{b_{k2}}{2}\right) {X^2_2} +c_{31}{X_3}+c_{32}{X^2_3}. \end{aligned}$$
Fig. 1 The main idea of the weighted average to get the final meta-model
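
The averaging rule of the final meta-model step can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `average_meta_models` and the dictionary representation of a meta-model (term label mapped to coefficient) are assumptions made here, and a term that is absent from a stage model is skipped rather than counted as zero, which is one possible reading of the rule. The coefficients in the usage lines are the fitted stage values reported in Sect. 4.

```python
from collections import defaultdict


def average_meta_models(stage_models):
    """Average each coefficient over the stage models that contain its term.

    stage_models: list of dicts mapping a term label (e.g. "X1^2") to its
    fitted coefficient. The dict layout is a hypothetical representation.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for model in stage_models:
        for term, coef in model.items():
            sums[term] += coef
            counts[term] += 1
    # Equal weights: divide by the number of stages in which the term appears.
    return {term: sums[term] / counts[term] for term in sums}


# Stage meta-models with the coefficients reported in Sect. 4.
F1 = {"1": 278.3730, "X1": 30.4786}
F2 = {"1": 278.1910, "X1": 31.5193, "X1^2": 1.6511,
      "X2": 15.1802, "X2^2": 0.0123}
F3 = {"1": 275.0492, "X1": 29.9669, "X1^2": 1.5248,
      "X2": 15.8355, "X2^2": 0.0124, "X3": 0.2970, "X3^2": 0.0018}

overall = average_meta_models([F1, F2, F3])
for term, coef in overall.items():
    print(f"{term}: {coef:.4f}")
```

Keeping the models as term-to-coefficient mappings makes the rule independent of the polynomial degree, so the same function would apply if interaction terms were added to later stages.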

A natural question now arises: how should the optimal design (dataset), from the experimental design viewpoint, be selected for each stage over the domain of the input factors in that stage? An efficient way of selecting optimal representative training datasets for the new proposed SeqST is to make use of the techniques of the experimental design approach. The optimal selection of experimental points (a design) from an experimental region so as to provide valuable information about a given experiment is one of the hardest problems investigators may face, especially when there is no prior information about the model structure between the inputs and the corresponding outputs. An intuitive idea to overcome this problem is to scatter the representative training points in an intelligent manner so that they cover the experimental region well, which is called a space-filling design (cf. Elsawah 2022c). Among strategies coined for computer experiments, Latin hypercube designs (LHDs) (McKay et al. 1979; Iman and Conover 1980) have become very popular. Other strategies include orthogonal arrays (Owen 1992) and Hammersley designs (Diwekar and Kalagnanam 1997; Hammersley 1960). To illustrate their popularity, Fig. 1a in Viana (2013) (cf. Fig. 2) shows an approximate number of publications that referred to at least one of these three techniques. An LHD spreads its representative training points throughout the region with as few gaps or holes as possible (cf. Fig. 3), and thus it gives a good representation of the experimental region even with few points. LHDs play an important role in computer simulation (cf. Husslage et al. 2011; Fang et al. 2006; Elsawah and Gong 2023). Therefore, LHDs are used in this study. It is pertinent to point out that the new proposed SeqST can also be carried out utilizing uniform designs, which are a class of optimal space-filling designs that are currently extensively used in a variety of practical applications (cf. Elsawah and Vishwakarma 2022).

Fig. 2 Number of papers published per year. Data obtained from the Google Scholar database in the week of March 4, 2013 (cf. Fig. 1a in Viana 2013)

Fig. 3 Latin hypercube 20 points in two dimensions and three dimensions
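
To make the construction of the stage designs \({\textbf{U}}_k\) concrete, the following Python sketch generates a plain random LHD and appends the fixed columns \(X^*_{k+1:p},\ldots ,X^*_{p:p}\). It is a sketch under stated assumptions: the function names `latin_hypercube` and `stage_design` are hypothetical, the factors are assumed to be already sorted by importance, and the LHD is a simple random one with no further optimization (e.g., no maximin or uniformity criterion), so it is not necessarily the optimal design referred to in Sect. 2.

```python
import random


def latin_hypercube(n, bounds, rng):
    """Random n-point LHD: each factor's range is cut into n equal strata
    and every stratum receives exactly one point."""
    columns = []
    for lb, ub in bounds:
        strata = list(range(n))
        rng.shuffle(strata)  # independent random stratum order per factor
        columns.append([lb + (s + rng.random()) * (ub - lb) / n
                        for s in strata])
    return [list(row) for row in zip(*columns)]


def stage_design(k, n_k, bounds, x_star, rng):
    """Stage-k design U_k = [D_k, X*_{k+1:p}, ..., X*_{p:p}].

    bounds and x_star are assumed to be ordered by decreasing importance.
    """
    d_k = latin_hypercube(n_k, bounds[:k], rng)
    # The remaining p - k factors are held at their highest-influence levels.
    return [row + list(x_star[k:]) for row in d_k]


rng = random.Random(0)
bounds = [(0.0, 1.0)] * 3
x_star = [0.5470, 0.5225, 0.5005]  # levels reported in Sect. 4
U2 = stage_design(2, 16, bounds, x_star, rng)
print(len(U2), U2[0])
```

For \(k=p\) the list of fixed columns is empty and the function returns an LHD over all p factors, which corresponds to the last stage.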

3 On the importance of the input factors and their points

The following logical question comes to mind after reading the preparation stage of the new proposed SeqST: If there is no prior information, how to determine the order of the importance of the factors and the points with the highest influence for each factor? This section tries to provide an answer to this significant question for computer experiments. Consider a computer experiment with p independent input factors \(X_{i}\in [LB_i,UB_i],~1\le i\le p,\) and let \(x^*_{k}\) be the point with the highest influence for the kth factor \(X_{k},~1\le k\le p.\) From a physics point of view, the points \(x^*_{k}\) with the highest influence can be defined as the Mass Centers (MCs). The MC is a point that causes a rigid body to maintain its equilibrium state. Within a solid Q with volume V, if the mass distribution is continuous with density \(\rho \), the integral of the weighted position coordinates of the points relative to the center of mass R satisfies

$$\begin{aligned} \iiint _Q\rho (r)(r-R) \,\textrm{d}V= 0, \end{aligned}$$
(1)

where r is the vector representing the position of a point with respect to a fixed origin and the solution of coordinate R is given as follows:

$$\begin{aligned} R = \frac{1}{M}\iiint _Q\rho (r)r\,\textrm{d}V, \end{aligned}$$
(2)

where M is the total mass of the solid. For further details, the reader may refer to Mark (2009). If the body is formed by a function from mathematics viewpoint, its volume has a uniform density distributed state with a constant \(\rho (r).\) Therefore, for a function with p factors \(F(X_1,X_2,\ldots ,X_p)\), (2) can be rewritten as follows:

$$\begin{aligned} R = \frac{1}{M}\mathop {\int \cdots \int \cdots \int }\limits _{\text {p integrals}} F(X_1,X_2,\ldots ,X_p)\,\textrm{d}V. \end{aligned}$$
(3)

The point \(x^*_{k}\) with the highest influence for the factor \(X_k\) is defined as the point that divides the function into two parts with same mass, \(M_L=M_R\) (cf. Fig. 4). Therefore, from (3), we get

$$\begin{aligned} \frac{1}{R}\mathop {\int \cdots \int \cdots \int }\limits _{\text {p integrals}} F(X_1,X_2,\ldots ,X_p)\,\textrm{d}V_L = \frac{1}{R}\mathop {\int \cdots \int \cdots \int }\limits _{\text {p integrals}} F(X_1,X_2,\ldots ,X_p)\,\textrm{d}V_R. \end{aligned}$$
(4)

From (4), the point \(x^*_{k}\) with the highest influence for the factor \(X_k\) is the solution of the following equation:

$$\begin{aligned} \begin{aligned}&\int _{LB_p}^{UB_p}\ldots \int _{LB_k}^{x_k^*} \ldots \int _{LB_1}^{UB_1} F(X_1,\ldots ,X_k,\ldots ,X_p) \ \,\textrm{d}X_1\ldots \,\textrm{d}X_k\ldots \,\textrm{d}X_p \\&\quad = \int _{LB_p}^{UB_p}\ldots \int _{x_k^*}^{UB_k} \ldots \int _{LB_1}^{UB_1} F(X_1,\ldots ,X_k,\ldots ,X_p) \ \,\textrm{d}X_1\ldots \,\textrm{d}X_k\ldots \,\textrm{d}X_p. \end{aligned} \end{aligned}$$
(5)

Using the calculated points \(x^*_{k},~1\le k\le p\) with the highest impacts, the importance of the factor \(X_k\) can be measured by its corresponding area as follows:

$$\begin{aligned} A(X_k) = \left| \int _{LB_k}^{UB_k} F(x_1^*,x_2^*,\ldots ,X_k,\ldots ,x_p^*) \,\textrm{d}X_k\right| . \end{aligned}$$
(6)

The p areas \(A(X_k),~1\le k\le p\) need to be calculated and sorted in decreasing order as follows:

$$\begin{aligned} A(X_{1:p})> A(X_{2:p})> \cdots > A(X_{p:p}), \end{aligned}$$

where \(X_{1:p}\) is the input with the highest importance and \(X_{p:p}\) is the input with the lowest importance.

Fig. 4 The mass centers of a function
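
Equation (5) rarely has a closed-form solution, so in practice \(x^*_{k}\) would be found numerically. The Python sketch below is one possible way of doing this, not a prescribed algorithm: the function name `half_mass_point`, the midpoint-rule grid, and the resolution `m` are assumptions made here, and the code assumes that F is positive over the whole domain so that the "mass" interpretation applies. The example function is the computer experiment used in Sect. 4.

```python
import itertools


def half_mass_point(F, bounds, k, m=40):
    """Approximate the solution x*_k of Eq. (5) for factor k (0-based).

    The domain is cut into m cells per factor, F is evaluated at the cell
    midpoints, and the mass of each slice along factor k is accumulated
    until half of the total mass is reached. Assumes F > 0 on the domain.
    """
    p = len(bounds)
    mids = [[lb + (j + 0.5) * (ub - lb) / m for j in range(m)]
            for lb, ub in bounds]
    others = [mids[i] for i in range(p) if i != k]

    slice_mass = []
    for xk in mids[k]:
        total = 0.0
        for rest in itertools.product(*others):
            point = list(rest)
            point.insert(k, xk)  # put factor k back in its position
            total += F(*point)
        slice_mass.append(total)  # constant cell volume cancels in Eq. (5)

    half = sum(slice_mass) / 2.0
    lb, ub = bounds[k]
    width = (ub - lb) / m
    running = 0.0
    for j, mass in enumerate(slice_mass):
        if running + mass >= half:
            # Linear interpolation inside the slice that crosses half mass.
            return lb + (j + (half - running) / mass) * width
        running += mass
    return ub


def F_example(x1, x2, x3):
    """Computer experiment of Sect. 4 on [0, 1]^3."""
    return (200 + 5 * x1**2 + 100 * x1 + x2**2 / 25 + 50 * x2
            + x3**2 / 175 + x3)


unit_cube = [(0.0, 1.0)] * 3
x_stars = [half_mass_point(F_example, unit_cube, k) for k in range(3)]
print([round(x, 4) for x in x_stars])
```

The cost grows as \(m^p\) function evaluations per factor, so for larger p a Monte Carlo estimate of the slice masses would be a more practical choice than a full grid.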

4 An illustrative example

The above-mentioned steps and discussions in Sects. 2 and 3 are used and explained using LHDs, polynomial regression models, and the following computer experiment:

$$\begin{aligned} Y=F(X_1,X_2,X_3)=200+5X_1^{2}+100X_1+\frac{1}{25} X_2^{2}+50X_2+\frac{1}{175} X_3^{2}+X_3,~0\le X_i\le 1,~1\le i\le 3. \end{aligned}$$

Based on (5), the points with the highest impacts for the factors \(X_1,\) \(X_2\), and \(X_3\) are calculated as follows \(x^*_{1}=0.5470,\) \(x^*_{2}=0.5225\), and \(x^*_{3}=0.5005,\) respectively. Based on (6), the corresponding areas for the factors \(X_1,\) \(X_2\), and \(X_3\) are calculated as follows \(A(X_1)=51.67,\) \(A(X_2) =25.01\), and \(A(X_3)=0.5019,\) respectively. Therefore, the order of the importance of the input factors is given as follows \(X_{1}\ggg X_{2}\ggg X_{3}.\) Table 1 gives LHDs with 11, 16, and 20 points for the first, second, and third stages, respectively, and their corresponding outputs. From Table 1 and the proposed SeqST in Sect. 2, we get

  • The first meta-model \(\widehat{F_1}\) gives the following relationship between the LHD \(\textbf{D}_{1}=[X_{1}]\) and the corresponding output \({Y}_1=F({\textbf{U}}_{1})\):

    $$\begin{aligned} \widehat{F_1}=278.3730+30.4786X_1. \end{aligned}$$
  • The second meta-model \(\widehat{F_2}\) gives the following relationship between the LHD \(\textbf{D}_{2}=[X_{1}~X_{2}]\) and the corresponding output \({Y}_2=F({\textbf{U}}_{2})\):

    $$\begin{aligned} \widehat{F_2}=278.1910+31.5193X_1+1.6511X^2_1+15.1802X_2+0.0123X^2_2. \end{aligned}$$
  • The third meta-model \(\widehat{F_3}\) gives the following relationship between the LHD \(\textbf{D}_{3}=[X_{1}~X_{2}~X_{3}]\) and the corresponding output \({Y}_3=F({\textbf{U}}_{3})\):

    $$\begin{aligned} \widehat{F_3}= & {} 275.0492+29.9669X_1+1.5248X^2_1+15.8355X_2\\{} & {} +0.0124X^2_2+0.2970X_3+0.0018X^2_3. \end{aligned}$$

Therefore, the overall meta-model is the weighted average that is given as follows:

$$\begin{aligned} {\widehat{F}}_{SeqST}= & {} 277.2204+30.6549X_1+15.5078X_2+0.2970X_3+1.5880X_1^2\\{} & {} +0.0123X_2^2+0.0018X_3^2. \end{aligned}$$

To test the performance of this meta-model, the SinST is used to find another meta-model using an LHD \({\textbf{U}}\) with the same number of points in the three stages of the SeqST, i.e., \(n=n_1+n_2+n_3=11+16+20=47\). Table 2 gives an LHD \({\textbf{U}}=[X_{1}~X_{2}~X_{3}]\) and the corresponding output \({Y}=F({\textbf{U}})\). From Table 2, the meta-model that describes the relationship between \({\textbf{U}}\) and \({Y}=F({\textbf{U}})\) is given as follows:

$$\begin{aligned} {\widehat{F}}_{SinST}= & {} 277.2204+28.9878X_1+14.4962X_2+0.2900X_3\\{} & {} +1.4974X_1^2+0.0120X_2^2+0.0017X_3^2. \end{aligned}$$

Figure 5 gives all the 47 values of \({F}({\textbf{U}}_{\textrm{test}}),\) \({\widehat{F}}_{SeqST}({\textbf{U}}_{\textrm{test}})\) and \({\widehat{F}}_{SinST}({\textbf{U}}_{\textrm{test}})\) and the absolute differences between each two of them using an LHD \({\textbf{U}}_{\textrm{test}}\) with 47 points as a testing dataset. The results show that the values of \({\widehat{F}}_{SeqST}({\textbf{U}}_{\textrm{test}})\) are closer to \({F}({\textbf{U}}_{\textrm{test}})\) than the values of \({\widehat{F}}_{SinST}({\textbf{U}}_{\textrm{test}}).\) Moreover, the mean square error (MSE), \(MSE =\frac{1}{n}\sum _{i = 1}^{n} (F_i - \widehat{F_i})^2,\) of these two meta-models are given as follows:

$$\begin{aligned} MSE_{SeqST}=1.0875\times 10^3<MSE_{SinST}=1.6838\times 10^3. \end{aligned}$$

Therefore, for this example the SeqST is more accurate than the SinST: its MSE is about 35% lower.

Table 1 The three-stage designs and their corresponding observed outputs for the SeqST for the illustrative example
Table 2 The single-stage design and its corresponding observed outputs for the SinST for the illustrative example
Fig. 5 All the 47 values of \({F}({\textbf{U}}_{\textrm{test}})\), \({\widehat{F}}_{SeqST}({\textbf{U}}_{\textrm{test}})\) and \({\widehat{F}}_{SinST}({\textbf{U}}_{\textrm{test}})\) (down) and the absolute differences between each two of them (up) using an LHD \({\textbf{U}}_{\textrm{test}}\) with 47 points as a testing dataset for the illustrative example

5 The performance assessment of the new proposed SeqST

To evaluate the performance of our proposed methodology, we consider the following four examples: two linear models and two non-linear models. The first linear model is the so-called pullulan production model. Although pullulan has been produced commercially since 1978, the production mechanism on the genetic level is still far from being fully understood. As a result, only empirical models can be built to optimize pullulan production. One of these models is derived by Goksungur et al. (2005) as follows:

$$\begin{aligned} Y_1= & {} -29.851+1.189X_1+0.057X_2+5.086X_3-0.011X^2_1 -0.0000607X^2_2-1.3633X^2_3 \\{} & {} -0.000296X_1X_2+0.0263X_1X_3. \end{aligned}$$

This model predicts the final concentration of pullulan (g/L) as a function of the initial substrate concentration (\(X_1\)), the speed of agitation (\(X_2\)), and the airflow rate (\(X_3\)). The ranges of variation of the independent variables are \(X_1\in [30\,\,70]\) g/L, \(X_2\in [200\,\,600]\) rpm, and \(X_3\in [1\,\,3]\) vvm. The range of variation of the dependent variable \(Y_1\) is \([4.96\,\,17]\). The second linear model is the so-called Goldprice model that has been studied by Andre et al. (2000) and Ranjan et al. (2008). The Goldprice function is given by

$$\begin{aligned} Y_2= & {} \left[ 1+(X_1+X_2+1)^2\left( 19-14X_1+3X^2_1-14X_2+6X_1X_2+3X^2_2\right) \right] \\{} & {} \times \left[ 30+(2X_1-2X_2)^2\left( 18-32X_1+12X^2_1+48X_2-36X_1X_2+27X^2_2\right) \right] , \end{aligned}$$

where the two input factors \(X_1\) and \(X_2\) are defined on the domain \([-2\,\,2]\times [-2\,\, 2].\)

The first non-linear model is an equation selected for its very different topology and non-linearity compared to the first two models. It is given as follows:

$$\begin{aligned} Y_3=\frac{\ln (X_1)(\sin X_2+4)}{\exp (X_3)}+\ln (X_1){\exp (X_3)}, \end{aligned}$$

where the ranges of variation of the independent variables are \(X_1\in [ 0.1\,\, 10],~X_2\in [-\pi /2\,\,\pi /2],\) and \(X_3\in [0\,\,1]\), leading to a variation of the dependent variable \(Y_3\) in the range of \([-13.82\,\,13.82].\) The second non-linear model is given as follows:

$$\begin{aligned} Y_4=\exp (X_1)+\sin (X_2)+X_3^7, \end{aligned}$$

where the range of variation of the independent variables is \([0\,\,1].\)
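
For readers who wish to reproduce the comparison, the four benchmark models can be transcribed directly into Python as below. The function names `y1` to `y4` are chosen here for illustration, and the formulas are transcribed as printed above, including the second factor of \(Y_2\). The last lines evaluate \(Y_3\) at the corners \((0.1,\pi /2,0)\) and \((10,\pi /2,0)\), which is a quick way to check the stated range \([-13.82\,\,13.82]\).

```python
import math


def y1(x1, x2, x3):
    """Pullulan production model; x1 in [30, 70], x2 in [200, 600], x3 in [1, 3]."""
    return (-29.851 + 1.189 * x1 + 0.057 * x2 + 5.086 * x3
            - 0.011 * x1**2 - 0.0000607 * x2**2 - 1.3633 * x3**2
            - 0.000296 * x1 * x2 + 0.0263 * x1 * x3)


def y2(x1, x2):
    """Goldprice model as printed in the text; x1, x2 in [-2, 2]."""
    first = 1 + (x1 + x2 + 1)**2 * (
        19 - 14 * x1 + 3 * x1**2 - 14 * x2 + 6 * x1 * x2 + 3 * x2**2)
    second = 30 + (2 * x1 - 2 * x2)**2 * (
        18 - 32 * x1 + 12 * x1**2 + 48 * x2 - 36 * x1 * x2 + 27 * x2**2)
    return first * second


def y3(x1, x2, x3):
    """First non-linear model; x1 in [0.1, 10], x2 in [-pi/2, pi/2], x3 in [0, 1]."""
    return (math.log(x1) * (math.sin(x2) + 4) / math.exp(x3)
            + math.log(x1) * math.exp(x3))


def y4(x1, x2, x3):
    """Second non-linear model; all inputs in [0, 1]."""
    return math.exp(x1) + math.sin(x2) + x3**7


# Corner values of Y3 that give the extremes of its stated range.
y3_low = y3(0.1, math.pi / 2, 0.0)
y3_high = y3(10.0, math.pi / 2, 0.0)
print(round(y3_low, 2), round(y3_high, 2))
```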

A comparison study between the mean squared error (MSE), \(MSE =\frac{1}{n}\sum _{i = 1}^{n} (F_i - \widehat{F_i})^2,\) and mean absolute error (MAE), \(MAE =\frac{1}{n}\sum _{i = 1}^{n} |F_i - \widehat{F_i}|,\) of the meta-models using the new proposed SeqST and the classical SinST is given based on the above-mentioned four models using the LHDs as training and testing datasets, the polynomial models as the fitting models, and the medians of the ranges of the input factors as the points with the highest impacts. To have a fair comparison study between the SeqST and SinST, the number of representative training points for SinST is selected to be equal to the total number of representative training points in all the p stages of the SeqST, i.e., \(n=n_1+n_2+\cdots +n_p.\) Since the representative training datasets and the representative testing datasets (LHDs) are not deterministic for a given n, the minimum, mean, median, and \(95\%\) confidence interval (\(95\%\)CI) of the MSEs and MAEs of the approximate meta-models of the above-mentioned four models using the SeqST and SinST based on about 5000 different randomly generated representative training datasets and representative testing datasets are given in Table 3 to investigate the behavior of the SeqST for any randomly generated representative datasets. From Table 3, we get the following:

  • The new proposed SeqST is better than the classical SinST for all the four models, where the values of the MSE and MAE via the SeqST are smaller than their values via the SinST. The SeqST is better than the SinST for 5000 different training and testing datasets, where the minimum, mean, and median of about 5000 MSE and MAE values via the SeqST are less than their values via the SinST for all the cases.

  • The gaps among the impacts of the input factors are largest for \(Y_3\), followed by \(Y_4\), \(Y_1\), and \(Y_2\), in that order. The performance of the SeqST follows the same ranking, \(Y_3\succeq Y_4\succeq Y_1\succeq Y_2\) (where \(\succeq \) means "performs better than"): the percentage differences between the minimum, mean, and median of the MSEs (and MAEs) for the SeqST and SinST are largest for \(Y_3\), followed by \(Y_4\), \(Y_1\), and \(Y_2\). That is, when there are significant gaps among the impacts of the input factors, the accuracy of the SeqST increases.

Table 3 The simulation results for the performance of the SeqST and SinST using the above-mentioned models \(Y_1,Y_2,Y_3\) and \(Y_4\) via 5000 repetitions
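
The two error measures and the summary statistics used in the comparison can be written compactly in Python. The sketch below is illustrative only: the function names `mse`, `mae`, and `summarize` are assumptions, and the confidence interval is left out because its construction is not specified in the text. In a simulation study, `summarize` would be applied to the list of MSE (or MAE) values collected over the repetitions.

```python
import statistics


def mse(true_values, predictions):
    """Mean squared error: (1/n) * sum (F_i - F_hat_i)^2."""
    n = len(true_values)
    return sum((f - fh) ** 2 for f, fh in zip(true_values, predictions)) / n


def mae(true_values, predictions):
    """Mean absolute error: (1/n) * sum |F_i - F_hat_i|."""
    n = len(true_values)
    return sum(abs(f - fh) for f, fh in zip(true_values, predictions)) / n


def summarize(errors):
    """Minimum, mean, and median of the errors over repeated runs."""
    return {
        "min": min(errors),
        "mean": statistics.mean(errors),
        "median": statistics.median(errors),
    }


# Toy values, for illustration only (not results from the paper).
truth = [1.0, 2.0, 3.0]
fitted = [1.0, 2.0, 5.0]
print(mse(truth, fitted), mae(truth, fitted))
print(summarize([3.0, 1.0, 2.0]))
```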

6 Further interesting investigation for the performance of the SeqST

The above results raise the following new questions: What is the effect of the order of the importance of the input factors on the accuracy of the new proposed SeqST? What is the effect of the gaps among the importance of the input factors on the accuracy of the new proposed SeqST? What is the effect of the total number of points on the accuracy of the new proposed SeqST? What is the effect of the number of points in each stage on the accuracy of the new proposed SeqST? The answers to these questions provide benchmarks for the optimal use of the new proposed SeqST. This section tries to answer these and other related questions based on computer experiments.

Consider the following non-linear model:

$$\begin{aligned} Y_5= -e^{-\left( X_{1}+0.5\right) ^{2}}-2 e^{-\left( {X}_{2}-0.5\right) ^{2}}-4 e^{-\left( {X}_{3}+3\right) ^{2}},~0\le X_i\le 1,~1\le i\le 3. \end{aligned}$$

Figure 6 investigates the importance of the three input factors for the model \(Y_5\). From Fig. 6 and based on the area under each curve, we get that \(X_2 \ggg X_1 \ggg X_3\) is the order of the importance of the input factors of \(Y_5.\) To check the effect of the number of points in each stage and the order of the importance of the input factors on the accuracy of the SeqST, different numbers of points in each stage are used as follows: \(10\le n_i\le 100,~1\le i\le 3\) and \(n_1+n_2+n_3=120\). Figures 7, 8, and 9 give the MSE values of the SeqST for the model \(Y_5\) using different numbers of points in each stage based on the following three different orders of importance: \(X_2 \ggg X_1 \ggg X_3\) (correct order), \(X_1 \ggg X_2\ggg X_3\) (wrong order), and \(X_3 \ggg X_2 \ggg X_1\) (wrong order), respectively. Figure 10 gives a comparison study between the SeqST and SinST based on different numbers of points in each of the three stages for \(Y_5,\) where the number of points n in the SinST is equal to the total number of points in the three stages of the SeqST, i.e., \(n=n_1+n_2+n_3.\) From Figs. 7, 8, 9, and 10, we conclude that:

  • The MSE values under the correct order of importance are smaller than the MSE values under either wrong order for any number of points in each stage: the MSE values range over about \((0.06,\,0.073),~(0.24,\,0.65)\), and \((0.43,\,0.63)\) for \(X_2 \ggg X_1 \ggg X_3\) (correct order), \(X_1 \ggg X_2\ggg X_3\) (wrong order), and \(X_3 \ggg X_2 \ggg X_1\) (wrong order), respectively. Therefore, it is recommended to carefully check the order of importance before using the new proposed SeqST.

  • The new proposed SeqST is better than the classical SinST for any number of points, i.e., the MSE values of the SeqST are smaller than those of the SinST. The advantage of the SeqST is largest for a small number of points (cf. Fig. 10). Therefore, the new proposed SeqST is particularly recommended for a small number of points (experiments with few trials).
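As a rough check of the importance order reported for \(Y_5\), the following Python sketch computes the area under the absolute value of each additive component of \(Y_5\) on \([0,1]\). Reading "the area under each curve" in this way is an assumption, since the exact importance measure behind Fig. 6 is not restated in this section, and the names `y5_components` and `component_areas` are hypothetical.

```python
import math

def y5_components(x):
    """Additive components of Y_5, each evaluated at the same scalar x in [0, 1]."""
    return (
        -math.exp(-(x + 0.5) ** 2),        # contribution of X_1
        -2.0 * math.exp(-(x - 0.5) ** 2),  # contribution of X_2
        -4.0 * math.exp(-(x + 3.0) ** 2),  # contribution of X_3
    )

def y5(x1, x2, x3):
    """Full model Y_5 as the sum of its three components."""
    return (y5_components(x1)[0]
            + y5_components(x2)[1]
            + y5_components(x3)[2])

def component_areas(m=1000):
    """Trapezoidal area under |g_i(x)| on [0, 1] for each component g_i.

    Assumption: this is only a proxy for the importance measure of Fig. 6.
    """
    h = 1.0 / m
    areas = [0.0, 0.0, 0.0]
    for k in range(m + 1):
        weight = 0.5 if k in (0, m) else 1.0  # trapezoid end-point weights
        for i, g in enumerate(y5_components(k * h)):
            areas[i] += weight * abs(g) * h
    return areas

areas = component_areas()
# Sort factor labels by decreasing area
order = [f"X{i + 1}" for i in sorted(range(3), key=lambda i: -areas[i])]
print(order, areas)
```

Under this assumed measure, the areas are approximately 1.85 for \(X_2\), 0.39 for \(X_1\), and \(8\times 10^{-5}\) for \(X_3\), which agrees with the order \(X_2 \ggg X_1 \ggg X_3\) read from Fig. 6.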

Moreover, the discussions of the models \(Y_1,~Y_2,~Y_3\), and \(Y_4\) in Sect. 5 show that the accuracy of the new proposed SeqST increases when there are significant gaps among the impacts of the input factors. The following discussion investigates this observation further using two different types of gaps among the impacts of the input factors. The first type is the power gap, which is investigated using the following model:

$$\begin{aligned} Y_6=X_1^{\alpha _{1}}+X_2^{\alpha _{2}}+X_3^{\alpha _{3}},~0\le X_i\le 1,~1\le \alpha _i\le 8,~1\le i\le 3,~3\le \alpha _1+\alpha _2+\alpha _3\le 10. \end{aligned}$$

The second type is the coefficient gap, which is investigated using the following model:

$$\begin{aligned} Y_7=\beta _{1} X_1+\beta _{2}X_2+\beta _{3} X_3,~0\le X_i\le 1,~1\le \beta _i\le 18,~1\le i\le 3,~3\le \beta _1+\beta _2+\beta _3\le 20. \end{aligned}$$

Fig. 6 The importance of the inputs for the model \(Y_5\)

Fig. 7 The MSE for order \(X_2 \ggg X_1 \ggg X_3\) (correct order) for the model \(Y_5\)

Fig. 8 The MSE for order \(X_1 \ggg X_2\ggg X_3\) (wrong order) for the model \(Y_5\)

Figures 11 and 12 give the differences between the median MSE values of the new proposed SeqST and those of the classical SinST, based on about 5000 different randomly generated representative training and testing datasets, for different powers and coefficients of the models \(Y_6\) and \(Y_7,\) respectively. The order is taken here as \(X_1 \ggg X_2 \ggg X_3.\) That is, the powers consistent with this order satisfy \(\alpha _1<\alpha _2<\alpha _3,\) whereas the coefficients consistent with this order satisfy \(\beta _1>\beta _2>\beta _3.\) From Figs. 11 and 12, we conclude that:

  • When there are big gaps among the impacts of the input factors, the performance of the new proposed SeqST is much better than the performance of the classical SinST. Note that, since \(0\le X_i\le 1,\) big gaps among the powers \(\alpha _1,\alpha _2,\) and \( \alpha _3\) correspond to small gaps among the impacts of the input factors \(X_1,~X_2,\) and \(X_3,\) and vice versa. In contrast, big gaps among the coefficients \(\beta _1,~\beta _2,\) and \(\beta _3\) correspond to big gaps among the impacts of the input factors \(X_1,~X_2,\) and \(X_3,\) and vice versa. Therefore, it is recommended to use the new proposed SeqST for experiments with large gaps among the impacts of their input factors.

  • For small powers \(\alpha _1\) and \(\alpha _2\) and a large power \(\alpha _3\) (i.e., the correct order of importance), the performance of the new proposed SeqST is much better than its performance for a small power \(\alpha _3\) (i.e., a wrong order of importance). Similarly, for large coefficients \(\beta _1\) and \(\beta _2\) and a small coefficient \(\beta _3\) (i.e., the correct order of importance), the performance of the SeqST is much better than its performance for a large coefficient \(\beta _3\) (i.e., a wrong order of importance). This confirms the conclusion stated above: it is recommended to carefully check the order of importance before using the new proposed SeqST.
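The parameter ranges of \(Y_6\) and \(Y_7\) can be made concrete with a short Python sketch. It assumes integer powers and coefficients, which this section does not state explicitly, and it only enumerates the admissible settings and flags those consistent with the order \(X_1 \ggg X_2 \ggg X_3\); it does not reproduce the SeqST or SinST fits. The helper names `admissible`, `y6`, and `y7` are hypothetical.

```python
from itertools import product

def admissible(upper, total):
    """Integer triples (v1, v2, v3) with 1 <= v_i <= upper and v1 + v2 + v3 <= total.

    Assumption: only integer values are considered.
    """
    return [t for t in product(range(1, upper + 1), repeat=3) if sum(t) <= total]

def y6(x, alpha):
    """Power-gap model Y_6 = X_1^a1 + X_2^a2 + X_3^a3."""
    return sum(xi ** a for xi, a in zip(x, alpha))

def y7(x, beta):
    """Coefficient-gap model Y_7 = b1*X_1 + b2*X_2 + b3*X_3."""
    return sum(b * xi for xi, b in zip(x, beta))

alphas = admissible(upper=8, total=10)   # settings for Y_6
betas = admissible(upper=18, total=20)   # settings for Y_7

# Settings consistent with the assumed order X_1 >>> X_2 >>> X_3
correct_alphas = [a for a in alphas if a[0] < a[1] < a[2]]
correct_betas = [b for b in betas if b[0] > b[1] > b[2]]

print(len(alphas), len(betas), len(correct_alphas), len(correct_betas))
```

Under the integer assumption, the upper bounds 8 and 18 follow automatically from the sum constraints, because the other two parameters are each at least 1.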

To further investigate the effect of the number of points in each stage on the accuracy of the new proposed SeqST, consider the following model:

$$\begin{aligned} Y_8=X_1^{4}+\frac{1}{2} X_2^{4}+\frac{1}{3} X_3^{4},~0\le X_i\le 1,~1\le i\le 3. \end{aligned}$$

Figure 13 investigates the importance of the three input factors for the model \(Y_8\). From Fig. 13 and based on the area under each curve, the order of importance of the input factors of \(Y_8\) is \(X_1 \ggg X_2 \ggg X_3\). Table 4 gives the MSE and MAE values for \(Y_8\) based on the correct order of importance and different numbers of training points in each stage. Table 4 also gives the MSE and MAE values for the above-mentioned models \(Y_4\) and \(Y_5.\) From Table 4, we conclude that \(n_1>n_2>n_3\) is the best selection of the numbers of training points in the three stages. Therefore, it is recommended to use the new proposed SeqST with a descending order of the numbers of training points in its stages.
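To illustrate what a descending allocation looks like, the following Python sketch enumerates stage sizes under the constraints used earlier for \(Y_5\), namely \(10\le n_i\le 100\) and \(n_1+n_2+n_3=120\), and keeps those with \(n_1>n_2>n_3\). The grid step of 10 points is an assumption made only to keep the list short, the function name `stage_allocations` is hypothetical, and the allocations actually reported in Table 4 may differ.

```python
def stage_allocations(total=120, low=10, high=100, step=10):
    """All (n1, n2, n3) on a grid of width `step` with low <= n_i <= high and sum == total.

    Assumption: the grid step is illustrative and not taken from the paper.
    """
    allocations = []
    for n1 in range(low, high + 1, step):
        for n2 in range(low, high + 1, step):
            n3 = total - n1 - n2
            # n3 must respect the bounds and lie on the same grid
            if low <= n3 <= high and (n3 - low) % step == 0:
                allocations.append((n1, n2, n3))
    return allocations

allocations = stage_allocations()
# Allocations that give more points to more important factors
descending = [a for a in allocations if a[0] > a[1] > a[2]]

print(len(allocations), len(descending))
for n1, n2, n3 in descending:
    print(n1, n2, n3)
```

On this assumed grid, only a small fraction of the feasible allocations are strictly descending, so the recommendation \(n_1>n_2>n_3\) narrows the choice of stage sizes considerably.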

Fig. 9 The MSE for order \(X_3 \ggg X_2 \ggg X_1\) (wrong order) for the model \(Y_5\)

Fig. 10 The difference \(\text{MSE}_{\text{SeqST}}-\text{MSE}_{\text{SinST}}\) for the model \(Y_5\)

Fig. 11 The difference between the median MSE of the SeqST and the median MSE of the SinST for the model \(Y_6\)

Fig. 12 The difference between the median MSE of the SeqST and the median MSE of the SinST for the model \(Y_7\)

Fig. 13 The importance of the inputs for the model \(Y_8\)

Table 4 The MSE and MAE for different number of points

7 Conclusion and future work

This paper proposes a new sequential stage technique (SeqST) for designing and modeling experiments when the input factors are not equally important. In the new proposed SeqST, the input factors are added to the process and modeled sequentially according to their importance: one input factor is added at each stage, while each remaining input factor is kept fixed at a given point that has the highest influence. The new proposed SeqST is compared with the classical single-stage technique (SinST). The effects of the order of importance of the input factors, the number of training points in each stage, the total number of training points, and the gaps among the influences of the input factors on the performance of the new proposed SeqST are investigated. This study provides a benchmark that guides experimenters in effectively designing and modeling their experiments. The main results show that:

  • The performance of the new proposed SeqST is better than the performance of the classical SinST under different experimental conditions and scenarios.

  • The performance gap between the new proposed SeqST and the classical SinST is larger for a small number of training points than for a large number of training points.

  • The performance gap between the new proposed SeqST and the SinST is larger for experiments with large gaps among the impacts of their factors than for experiments with small gaps among the impacts of their factors.

  • The new proposed SeqST has a good performance using the correct order of the importance of the input factors.

  • The new proposed SeqST has a good performance using a descending order of the numbers of the training points in its stages.

Therefore, we conclude that the new proposed SeqST is highly recommended for experiments with few trials and/or large gaps between the importance of their factors, provided that it is used with the correct order of importance of the input factors and a descending order of the numbers of training points in its stages.

During this work, the following new ideas for future work have arisen. The first author is working on them, and some theoretical and simulation results have been obtained. However, more time and effort are needed to develop them into high-quality research papers with significant results.

  • This paper is a first step toward more future work in this regard. For instance, there is a significant need to study the behavior of the new proposed SeqST more deeply from a theoretical point of view. In this study, LHDs are used as training and testing datasets and the polynomial model is used as the fitting model. This raises the following questions: What is the effect of the type of training and testing datasets on the performance of the new proposed SeqST? What is the effect of the type of fitting model on the performance of the new proposed SeqST? Is the new proposed SeqST still applicable to implicit functional relationships in engineering without prior information? In future work, the performance of the new proposed SeqST under various types of optimal experimental designs, such as uniform designs, orthogonal arrays, and D-optimal designs, and under various types of machine learning modeling techniques will be investigated.

  • Elsawah (2022d, cf. its Sect. 5) presented a mixture factor-weight WD (MFWWD) as a new criterion for constructing new uniform mixture factor-weight experimental designs (training and testing datasets) when the input factors are not equally important. In future work, the classical SinST using the new uniform mixture factor-weight experimental designs will be compared with the new proposed SeqST using classical uniform designs in all of its stages.