Introduction

Recombinant proteins are widely used in diverse fields in the laboratory and in industry, and many applications require sufficient amounts of high-quality proteins in terms of purity and activity [1]. In addition, the use of therapeutic recombinant proteins (biopharmaceuticals) has increased significantly since the introduction of recombinant human insulin in 1982 [2]. Nowadays, biopharmaceuticals are also being used for the treatment of a variety of diseases including cancer and metabolic disorders [3]. Interestingly, approximately 50% of all new medicines are classified as biopharmaceuticals [4, 5] and, according to the report Biopharmaceuticals Market by Type and Application: Global Opportunity Analysis and Industry Forecast, 2018–2025 (https://www.researchandmarkets.com/research/qh3vxr/global?w=4), the total market of biopharmaceuticals reached US$186,470 million in 2017 and is estimated to exceed $500,000 million by 2025. In addition, there are over 400 marketed recombinant pharmaceutical products, while more than 1300 are undergoing clinical trials [4, 6]. Recombinant proteins are also a prerequisite and vital component of several drug design projects, while crystallographic studies in these projects require hundreds of milligrams of purified protein samples [6, 7]. Therefore, the main goal of industrial and academic research laboratories is to produce high amounts of pure and functional proteins at a reasonable cost [8, 9].

Even though recombinant proteins can be expressed in both prokaryotic and eukaryotic systems [10, 11], Escherichia coli is the first choice for the production of non-glycosylated recombinant proteins at industrial scale due to its rapid replication, low cost, simplicity, and Food and Drug Administration (FDA)-approved status for human applications [12, 13]. In addition, PCR-based cloning represents one of the most essential tools in recombinant protein production technology. Theoretically, using E. coli as an expression host and PCR as a cloning method, the production of a recombinant protein is a straightforward process, i.e., the gene of interest (GOI) is cloned into an expression plasmid vector and, subsequently, it is inserted into an expression host. Following the induction of protein expression in the host’s cells with the appropriate inducer (e.g., IPTG), the recombinant protein is purified and, subsequently, its biophysical properties and activity are determined [14]. Moreover, current recombinant DNA techniques make the synthesis of recombinant constructs straightforward: the construction of a recombinant plasmid requires only a thermocycler, primers, and DNA polymerases and, therefore, any basic molecular biology laboratory can synthesize several kilobase pairs of DNA in less than a week [15]. However, in practice several things may go wrong: low amounts of PCR product, insufficient ligation, formation of inclusion bodies, background proteins during purification, low activity, or difficulties in obtaining protein crystals [16]. In addition, protein stability and low purification yields are challenges that must be resolved when E. coli and other microbes are used as hosts to produce recombinant proteins [17]. Therefore, the optimization of reaction conditions may be needed in one or more of the following processes that are used in recombinant protein biotechnology: (i) amplification of the GOI using PCR; (ii) ligation of the GOI with an expression plasmid vector; (iii) expression of the target protein in high amounts and in a soluble form; (iv) isolation of the target protein in a pure and active form; (v) assessment of protein activity; (vi) identification of conditions to obtain a single protein crystal.

The majority of methods that are used in recombinant protein biotechnology are developed and subsequently optimized using the time-consuming and inefficient one-factor-at-a-time (OFAT) approach [18], which examines the effect of only one factor at a time. However, several biochemical processes are affected by the interactions of the experimental variables (factors). The best approach to examine the effect of multiple factors, as well as the effect of their interactions on a process, is the statistical design of experiments (or simply Design of Experiments—DoE) approach. DoE approaches have been successfully used in the development, optimization, and assessment of the robustness of many biochemical processes [19], including those that are employed in recombinant protein biotechnology [20]. However, when browsing through the literature using the keywords “optimization” and “recombinant proteins” from 2017 to 2019, I found that more than 2000 papers have in their title the word “optimization” or “effect of,” while fewer than 10% of them used statistics-based approaches to optimize a process. Interestingly, the process optimization in most cases was carried out using the OFAT approach.

The aim of this review is to discuss the potential applications of DoE approaches at every step of recombinant protein biotechnology, from the construction of an expression plasmid vector to crystallographic studies, and to summarize the recent progress in this growing field. Initially, the basic principles of DoE are presented. Then, the main factors affecting each step of recombinant protein production, purification, and characterization are discussed. This review focuses on the production of recombinant proteins specifically in E. coli, which is one of the organisms of choice for the production of recombinant proteins including biopharmaceuticals. This is probably the first review that extensively examines the use of DoE in all steps of recombinant protein biotechnology, from the construction of a plasmid vector to crystallographic studies.

Optimization of a Process

An essential question, to begin with, is: what is the optimization of a process? According to the English Oxford dictionary [21], optimization is “the action or process of making the best of something” or “the action or process of rendering optimal; the state or condition of being optimal.” Moreover, the online business dictionary (http://www.businessdictionary.com) defines optimization as “finding an alternative with the most cost effective or highest achievable performance under the given constraints, by maximizing desired factors and minimizing undesired ones.” Thus, according to these definitions, in order to optimize a process in recombinant protein biotechnology the experimenter should (i) examine the effect of multiple factors (variables) on the response (e.g., protein expression in a soluble form, purity, activity), in order to exclude the unimportant ones, and subsequently (ii) find the optimum combination of the important factors that maximizes the response.

Optimization Approaches: One-Factor-at-a-Time Approach Versus Statistically Designed Experiments

The majority of methods that are used in recombinant protein technology are developed and optimized using the traditional OFAT approach [18]. However, using the OFAT design, the experimenter gets information about one factor in each experimental trial [22] and, therefore, this approach is time-consuming especially when a large number of factors must be evaluated. The main disadvantage of the OFAT approach is that it does not examine the effects of the interactions among the experimental factors on a process, i.e., whether one factor influences the effect of another factor on a process (response). On the other hand, optimization studies can be conducted by varying several factors at the same time and examining both their effects and the effects of their interactions on a process (response) using statistical-based experimental approaches [23]. Overall, DoE is an organized approach that provides more reliable and useful information per experiment compared to the OFAT approach.

Theory and Steps of Design of Experiments

DoE approaches are employed in both the early and late stages of bioprocess development [24]. DoE uses statistical experimental methods and varies the factors that affect a process simultaneously over a specific set of experiments. The results are subsequently analyzed using a mathematical model, which gives significant information about the effect of each factor and the effect of the interactions between factors on the process (response), facilitating optimization of the process [25]. The main advantage of DoE approaches is that they use only a minimum number of experiments to examine simultaneously the effect of many parameters on a process, while biases are avoided [26]. Most importantly, DoE is not only cheaper but also faster than OFAT because it provides optimized information content from a small number of experiments [26, 27]. The theory and potential applications of DoE approaches are extensively described in many textbooks [28,29,30].

Several statistical software packages, such as Design-Expert (Stat-Ease Inc, MN, USA), JMP (SAS Institute Inc., Cary, NC, USA), Minitab (Minitab Inc, PA, USA), and ECHIP (ECHIP Inc, DE, USA), are available that guide the user through the design of experiments and the analysis of the results. It should be noted that these software packages require only basic knowledge of statistics, design of experiments principles, elementary optimization methods, and regression modeling techniques. In addition, these software packages are able to develop mathematical models that demonstrate the relationships between factors and the response(s). In my laboratory, we routinely use Design-Expert software (https://www.statease.com/software/design-expert/) to design the experiments, analyze the data, and visualize the results. Design-Expert has been specifically developed for performing DoE and includes a variety of experimental designs including full factorial, fractional factorial, Plackett–Burman design, Taguchi Orthogonal Array, several types of response surface methodology, mixture designs, combined designs, etc., while it contains test matrices for testing up to 50 factors. The statistical significance of the test factors on the response is assessed using analysis of variance (ANOVA). The data are fitted on a mathematical model and graphical tools are employed to identify the impact of each factor on the response and reveal abnormalities in the data. It should be pointed out that other software packages, such as Minitab (http://www.minitab.com), JMP (https://www.jmp.com), and ECHIP (http://www.experimentationbydesign.com), contain the same/similar tools and features as Design-Expert and the selection of a software package for DoE purposes is a matter of personal choice.

Usually, a bioprocess is affected by a large number of factors and, therefore, DoE is carried out in two stages. During the first stage (screening experiments), the factors that have a statistically significant effect on the process are identified using a factorial design (discussed below), in order to reduce the number of factors to a manageable one [25]. Once the important parameters are identified, an optimization step is performed using the response surface methodology (RSM) in order to identify the optimum combination of factors that maximize the response (discussed below). The exclusion of insignificant factors during the first stage reduces the experimental effort required in the second stage [31].

Before beginning the discussion on the applications of DoE in recombinant protein biotechnology, it is essential to give a short description of the terms that are widely used in DoE; these are summarized in Table 1.

Table 1 Vocabulary of DoE

The specific steps taken to optimize a process are described in the following paragraphs.

Stage 1: Screening Experiments—Identification of Significant Factors

A fundamental question that should be answered is which is the most suitable experimental design for optimization studies? The answer is that the choice of experimental design is dependent on the number and the type (categorical or continuous) of factors that have to be evaluated as well as on the previous knowledge about the protein of interest [18]. Another important question that should also be answered is what factors should be tested? In general, the choice of initial factors and the range of their values should be based on either literature examples with the same or similar proteins or previous experience expressing recombinant proteins [18].

Factorial Designs

Usually, to identify the variables (factors) that significantly affect a process (response), a 2-level factorial approach is employed. In general, 2-level approaches are those in which all factors have only two values (a high and a low value), and these approaches are often referred to as “screening or preliminary experiments.” A 2-level approach could be either full or fractional depending on whether all or a fraction of all possible combinations of factors are tested. A 2-level full factorial design (\(2^{k}\), where k is the number of factors and k ≥ 2) examines all the possible combinations of factors. In a 2-level fractional factorial design (\(2_{R}^{k - p}\) fractional factorial, where k is the number of factors, p indicates the size of the fraction of the \(2^{k}\) full factorial, and R is the resolution of the method) only a \(\left( {1/2} \right)^{p}\) fraction of the total number (\(2^{k}\)) of combinations is examined (e.g., the \(2^{k-1}\) and \(2^{k-2}\) designs require only one half and one quarter of the experiments, respectively). The resolution (R) of the method illustrates how clearly the effects can be separated in a design (the higher the better), and resolution IV designs are usually employed.

Two-level approaches are useful for highlighting the critical factors for further detailed study using RSM (discussed further below). An example of a \(2_{III}^{7 - 4}\) fractional approach that is widely used in screening experiments and requires only 8 experiments (a \(2^{7}\) full factorial design requires 128 experiments) is illustrated in Table 2. Using a fractional factorial approach is also beneficial when a large number (> 4) of variables must be examined, in which case a \(2^{k}\) full factorial demands a high number of experiments and, therefore, a high cost [32]. The fraction of experiments to be carried out is defined by the aforementioned software packages based on the number and type of variables that need to be examined. More details about fractional factorial designs can be found in Ref. [33].

Table 2 Experimental matrix of a \(2_{III}^{7 - 4}\) fractional factorial for 7 variables studied at two levels
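For readers who prefer to generate such a matrix programmatically rather than copy it from a software package, the minimal Python sketch below builds the eight runs of a \(2_{III}^{7 - 4}\) design from a full \(2^{3}\) factorial in factors A, B, and C and the standard generators D = AB, E = AC, F = BC, G = ABC. The factor labels are illustrative; any DoE package will produce an equivalent (possibly reordered) matrix.

```python
from itertools import product

# Base full factorial in A, B, C (coded levels -1/+1)
base_runs = list(product([-1, 1], repeat=3))

# Standard generators for a 2^(7-4) resolution III design: D=AB, E=AC, F=BC, G=ABC
rows = []
for a, b, c in base_runs:
    d, e, f, g = a * b, a * c, b * c, a * b * c
    rows.append((a, b, c, d, e, f, g))

print("Run  A  B  C  D  E  F  G")
for i, row in enumerate(rows, start=1):
    print(f"{i:>3} " + " ".join(f"{x:+d}" for x in row))
```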

A variety of algorithms, such as Plackett–Burman [34] and Taguchi orthogonal array [35], are also available which guide the selection of the fraction to be tested. Plackett–Burman design (PBD) is a small-sized two-level factorial experimental design that is widely used to identify large main effects. PBD identifies the important effects among N variables in N + 1 experiments (where N + 1 is a multiple of 4) without recourse to the interaction effects between and among the variables. Thus, PBD just screens the design space to detect large main effects [36]. As the number of factors increases, full and fractional factorial approaches become impractical and expensive since a large number of experiments must be carried out. To overcome this problem, Taguchi introduced the orthogonal array, a specially designed method, to study the entire space of variables using a smaller number of experiments. Taguchi proposed the use of the signal-to-noise (S/N) ratio as a measurable value instead of the standard deviation because, as the mean decreases, the standard deviation also decreases and vice versa [37]. However, Taguchi designs are more complicated and should only be used by experimenters who are familiar with the complex aliasing issues behind the designs.
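As an illustration of the PBD construction, the sketch below assembles the classical 12-run Plackett–Burman matrix (up to 11 two-level factors in 12 runs) by cyclically shifting the generator row published in standard DoE references and appending a final run with all factors at their low level. This is only the textbook construction; DoE software such as Design-Expert or Minitab produces equivalent matrices directly.

```python
# Classical 12-run Plackett-Burman design: up to 11 two-level factors in 12 runs.
# Rows 1-11 are successive cyclic shifts of the published generator row;
# row 12 sets every factor to its low (-1) level.
generator = [+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1]

row = generator[:]
design = [row[:]]
for _ in range(10):
    row = [row[-1]] + row[:-1]      # shift one position to the right
    design.append(row[:])
design.append([-1] * 11)

for run, levels in enumerate(design, start=1):
    print(f"Run {run:>2}: " + " ".join(f"{x:+d}" for x in levels))
```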

Identification of Significant Factors

As aforementioned, an essential step in the optimization of a process is the identification of the factors that have a statistically significant effect on the response. For example, the statistical significance of the seven variables of Table 2, i.e., A, B, C, D, E, F, and G, on a response can be initially evaluated using a half-normal probability plot (Fig. 1a). A half-normal probability plot is a plot of the absolute values of the estimated effects against their cumulative normal probabilities [38]; unimportant factors are those that have near-zero effects (i.e., they have a normal distribution centered near zero; factors D, E, and C in the example of Fig. 1a), while important factors are those whose effects are significantly removed from zero, i.e., factors A, G, B, and F in the example of Fig. 1a.

Fig. 1

An example of identification of the factors that have a statistically significant effect on a process (response). a Adjusted half-normal probability plot with significant factors selected: A, G, B, and F. The farther a factor is from the diagonal line, the greater its influence on the response. b The Pareto chart identifies factors A, G, B, and F as lying above the Bonferroni limit; these are designated as significant coefficients

The magnitude of the effect of each factor on the response is more clearly illustrated in a Pareto chart (Fig. 1b), which is a bar chart that rank-orders the effect of each factor by its magnitude. Pareto charts compare the t-value of each effect against two limit lines, i.e., the Bonferroni limit line (top line) and the t-limit line (bottom line). Coefficients with a t-value above the Bonferroni line are significant factors; coefficients with a t-value between the Bonferroni line and the t-limit line are termed “coefficients likely to be significant,” while coefficients with a t-value below the t-limit line are statistically insignificant and should be excluded from the analysis [39]. The Pareto chart of the example in Fig. 1b shows that factors A, G, B, and F lie above the Bonferroni limit and are considered highly likely to be statistically significant.
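As a sketch of how the effect estimates behind such plots are obtained, the snippet below computes, for each column of a coded two-level design matrix, the difference between the mean response at the high level and at the low level, and then ranks the absolute effects; half-normal and Pareto plots are simply graphical displays of these ranked estimates, with significance limits added by the DoE software. The design matrix and response values here are made up for illustration.

```python
import numpy as np

# Coded 2^3 full factorial in factors A, B, C (hypothetical example data)
X = np.array([[-1, -1, -1],
              [ 1, -1, -1],
              [-1,  1, -1],
              [ 1,  1, -1],
              [-1, -1,  1],
              [ 1, -1,  1],
              [-1,  1,  1],
              [ 1,  1,  1]], dtype=float)
y = np.array([5.2, 9.1, 6.0, 12.3, 5.5, 9.8, 6.4, 13.0])   # hypothetical responses

factors = ["A", "B", "C"]
effects = {}
for j, name in enumerate(factors):
    high = y[X[:, j] == 1].mean()
    low = y[X[:, j] == -1].mean()
    effects[name] = high - low           # main-effect estimate

# Rank factors by the magnitude of their effect (largest first)
for name, eff in sorted(effects.items(), key=lambda kv: -abs(kv[1])):
    print(f"Factor {name}: effect = {eff:+.2f}")
```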

Stage 2: Optimization Experiments Using Response Surface Methodology

Following identification of the factors that have a statistically significant effect on a process, RSM is usually applied to identify the best combination of these factors that maximizes the response. The goals of RSM are to (i) develop a mathematical model that describes how the variables (factors) and the interactions between variables affect the response and (ii) determine the values of all variables that optimize the response [40].

Response surfaces are typically second-order polynomial models, and the central composite design (CCD) is usually used. A CCD is composed of (i) a fractional factorial (or full factorial) design; (ii) an additional design (often a star design in which experimental points are at a distance α from the center); and (iii) a central point. CCD is an efficient design that is ideal for sequential experimentation and allows a reasonable amount of information to test the “lack of fit” using a small number of design points. Besides CCD, there are many experimental designs for RSM, such as the Box–Behnken design (BBD) and the small composite design (SCD). In a CCD, all factors are studied at five levels (− α, − 1, 0, + 1, + α), where for a rotatable design \(\alpha = (2^{k})^{1/4}\); for two, three, and four variables the value of alpha is therefore 1.41, 1.68, and 2, respectively. The theory and the mathematical part of DoE approaches including RSM are extensively discussed in many books (see [41] and references cited therein).

Table 3 illustrates the experimental setup of a four-factor-five level CCD (four factors are examined at 5 levels; − 2, − 1, 0, 1, 2). It should be noted that a limitation of RSM is that only continuous factors can be examined, while if categorical factors are added the design will be duplicated for every combination of the categorical factor levels.

Table 3 Central composite design of 4 independent variables that are examined at 5 levels (− 2, − 1, 0, + 1, + 2) for process optimization
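The sketch below enumerates the coded design points of a rotatable CCD for k continuous factors: the \(2^{k}\) factorial corners, the 2k axial (star) points at ±α with \(\alpha = (2^{k})^{1/4}\), and replicated center points (replication of the center point to estimate pure error is common practice; six replicates for four factors, used in the example, is one common choice and gives a 30-run design, but the exact number is a design decision). For k = 4 this reproduces the five coded levels (− 2, − 1, 0, + 1, + 2) of Table 3.

```python
from itertools import product

def ccd_points(k: int, center_replicates: int = 1):
    """Coded design points of a rotatable central composite design."""
    alpha = (2 ** k) ** 0.25                       # rotatability: alpha = (2^k)^(1/4)
    corners = [list(p) for p in product([-1.0, 1.0], repeat=k)]
    axial = []
    for j in range(k):
        for sign in (-alpha, alpha):
            point = [0.0] * k
            point[j] = sign
            axial.append(point)
    center = [[0.0] * k for _ in range(center_replicates)]
    return corners + axial + center

points = ccd_points(k=4, center_replicates=6)      # 16 corners + 8 axial + 6 center = 30 runs
levels = sorted({v for p in points for v in p})
print(f"total runs = {len(points)}, coded levels = {levels}")
```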

The experimental data obtained from the design (e.g., Table 3) are subsequently fitted on a second-order polynomial model (Eq. 1).

$$Y = \beta_{0} + \sum \beta_{i} x_{i} + \sum \beta_{ij} x_{i} x_{j} + \sum \beta_{ii} x_{i}^{2},$$
(1)

where Y is the measured response variable, β0 is a constant, βi, βij, and βii are the regression coefficients of the model, and xi and xj represent the independent variables in coded values.
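For illustration, the sketch below estimates the coefficients of Eq. (1) by ordinary least squares for two coded factors; dedicated DoE software performs the same regression and adds the ANOVA and diagnostics discussed below. The design points and responses are hypothetical.

```python
import numpy as np

# Hypothetical coded settings (x1, x2) and measured responses from an RSM design
X = np.array([[-1, -1], [1, -1], [-1, 1], [1, 1],
              [-1.41, 0], [1.41, 0], [0, -1.41], [0, 1.41],
              [0, 0], [0, 0], [0, 0]], dtype=float)
y = np.array([8.2, 9.5, 10.1, 12.8, 7.9, 11.0, 9.0, 11.8, 12.3, 12.6, 12.4])

x1, x2 = X[:, 0], X[:, 1]
# Model matrix for Eq. (1): intercept, linear, interaction, and quadratic terms
M = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

beta, *_ = np.linalg.lstsq(M, y, rcond=None)
terms = ["b0", "x1", "x2", "x1*x2", "x1^2", "x2^2"]
for name, coef in zip(terms, beta):
    print(f"{name:>6}: {coef:+.3f}")
```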

The second-order polynomial coefficients are estimated using a software package, e.g., Design-Expert. An example of a second-order equation (mathematical model) obtained during the optimization of a process is illustrated below (Eq. 2). In this example, the effects of four variables, namely A, B, C, and D, as well as the effects of their interactions, on the response (Y) were examined.

$$\begin{aligned} Y\left[ {\text{response}} \right] & = + 12.41 - 0.86A - 1.39B - 1.61C + 1.04D \\ & \quad + 0.076AB - 0.19AC - 0.14AD + 0.13BC \\ & \quad - 0.71BD - 0.69CD - 1.09A^{2} - 0.18B^{2} \\ & \quad - 1.90C^{2} - 0.79D^{2}. \end{aligned}$$
(2)

In Eq. (2), plus (+) and minus (−) symbols show whether a model term has a positive or negative effect on the response.

Graphical Representation of the Interactive Effects of Variables on the Response

After the generation of the mathematical model, it is possible to predict the response for any possible combination of the factors that are tested within the experimental region (domain) even for those experiments that have not been actually carried out. Thus, the response at any point in the experimental domain can be predicted and, therefore, a graphical representation can be easily obtained. Usually, the fitted second-order equations (e.g., Eq. 2) are presented as a two-dimensional representation (contour plots, Fig. 2a) or as a three-dimensional plot (response surface plots, Fig. 2b).

Fig. 2

An example of a contour plot (a) and a response surface plot (b) showing the effect of reaction temperature and time on the response, adapted from [129] (modified). The figure illustrates that one factor influences the effect of the other factor on the response. In this example, the maximum response (~ 13.3 relative units) was obtained at a reaction temperature between 23 and 26 °C when the reaction time was set at ~ 8 h

These plots are graphical representations of the relationship among three variables, i.e., two independent variables (while the others are kept at their center points) and the response. A 3D response surface plot (Fig. 2b) is obtained by plotting two independent variables on the x- and y-axes (the others are kept at their center points), while the response is shown as a smooth 3D surface. The plots indicate the direction in which the original design must be displaced in order to reach the optimal conditions. By looking at the plots of Fig. 2, it is easy to see that the best reaction time changes according to the reaction temperature (and vice versa), while a reaction time above 8 h and a temperature above 26 °C have a negative effect on the response.
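As an illustration of how such a fitted model is explored, the sketch below evaluates the model of Eq. (2) over a grid of coded values of A and B, with C and D held at their center points (0), and reports the combination giving the largest predicted response; contour and surface plots are simply graphical views of the same grid. The grid range of − 2 to + 2 coded units is an assumption corresponding to a typical CCD region.

```python
import numpy as np

def predicted_response(A, B, C=0.0, D=0.0):
    """Second-order model of Eq. (2); all factors in coded units."""
    return (12.41 - 0.86 * A - 1.39 * B - 1.61 * C + 1.04 * D
            + 0.076 * A * B - 0.19 * A * C - 0.14 * A * D
            + 0.13 * B * C - 0.71 * B * D - 0.69 * C * D
            - 1.09 * A ** 2 - 0.18 * B ** 2 - 1.90 * C ** 2 - 0.79 * D ** 2)

grid = np.linspace(-2, 2, 81)                      # assumed coded range of the design
A, B = np.meshgrid(grid, grid)
Y = predicted_response(A, B)                       # C and D fixed at their center points

i, j = np.unravel_index(np.argmax(Y), Y.shape)
print(f"Maximum predicted response {Y[i, j]:.2f} at A = {A[i, j]:+.2f}, B = {B[i, j]:+.2f}")
```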

Validation of the Mathematical Model

The evaluation of the quality of the fitted mathematical model, as well as of the effect of the factors that are examined and the effect of their interactions on the response(s), is usually carried out with analysis of variance (ANOVA). Briefly, ANOVA uses F-tests (F = variation between sample means/variation within the samples) to compare the mean values between the factors that are examined and determines whether any of those means are statistically significantly different from each other (for more information about ANOVA see Ref. [42]). To determine whether each main effect is statistically significant, the p value for each term is compared to the significance level to assess the null hypothesis, and a significance level of 0.05 is usually used. Overall, the p value of each factor (term) should be < 0.05 to be significant, and in several cases the insignificant factors (p > 0.05) are excluded from the model. Initially, the quality of the model is evaluated by the F value and p value of the model. In general, a high model F value indicates that more of the variance can be explained by the model (the higher the better), whereas the p value of the model should be strongly significant (< 0.05). The lack of fit is also used to determine whether the model fits the data well. If the model does not fit the data well, the lack of fit will be significant (p < 0.05). The significance of the mathematical model is also assessed from two determination coefficients (R-squared or R2), namely the “adjusted” R2 and the “predicted” R2. The adjusted R2 indicates the amount of variation around the mean explained by the model, while the predicted R2 indicates how well a response value is predicted by the model. In general, the higher the R2, the better the model fits the data, and an R2 > 0.6 is required. Finally, the quality of the model is evaluated by the Adequate Precision, which is a signal-to-noise ratio; a ratio greater than 4 is required (for more information about ANOVA see https://www.statease.com/docs/v11/navigation/anova-rsm.html).
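As a sketch of how the determination coefficients mentioned above are computed, the function below calculates R2, adjusted R2, and predicted R2 (via the PRESS statistic, i.e., leave-one-out prediction errors) for any least-squares model matrix M and response vector y; DoE packages report these same quantities alongside the full ANOVA table. The one-factor quadratic example data are hypothetical, and the function could equally be applied to the M and y of the earlier fitting sketch.

```python
import numpy as np

def fit_diagnostics(M, y):
    """Return R^2, adjusted R^2, and predicted R^2 (via PRESS) for an
    ordinary least-squares fit of model matrix M to response vector y."""
    n, p = M.shape
    beta, *_ = np.linalg.lstsq(M, y, rcond=None)
    residuals = y - M @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    r2_adj = 1.0 - (ss_res / (n - p)) / (ss_tot / (n - 1))
    # PRESS: squared leave-one-out prediction errors from the hat-matrix leverages
    leverage = np.diag(M @ np.linalg.pinv(M.T @ M) @ M.T)
    press = float(((residuals / (1.0 - leverage)) ** 2).sum())
    r2_pred = 1.0 - press / ss_tot
    return r2, r2_adj, r2_pred

# Hypothetical one-factor quadratic fit, just to exercise the function
x = np.array([-1.41, -1.0, -1.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.41])
y = np.array([6.1, 7.8, 8.1, 9.9, 10.2, 10.0, 9.1, 8.9, 7.4])
M = np.column_stack([np.ones_like(x), x, x ** 2])
r2, r2_adj, r2_pred = fit_diagnostics(M, y)
print(f"R2 = {r2:.3f}, adjusted R2 = {r2_adj:.3f}, predicted R2 = {r2_pred:.3f}")
```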

Overall, the adequacy of the mathematical model is evaluated using the “lack-of-fit” test and the “Adj R-squared,” as well as using:

  (i)

    The normal (%) probability plot of the “Studentized” residuals that shows whether the data are normally distributed or not. Figure 3a shows an example of the evaluation of a mathematical model using the normal (%) probability plot and, as can be seen, the errors are normally distributed.

    Fig. 3

    An example of diagnostic plots that are used to evaluate the accuracy of the mathematical models in RSM, adapted from [129]. a Normal (%) probability plot of the “Studentized” residuals for the model. b Predicted (by the model) values of the response versus actual (experimental) values

  (ii)

    The “predicted vs actual” plot, which shows whether the actual values (experimental data) are in agreement with the predicted values. In other words, the predicted vs. actual plot shows how well the model fits the data. For a perfect fit, all the points would lie on a straight line [30, 43]. Figure 3b shows an example where the experimental and predicted values are in good agreement.

Incomplete Factorial Designs: An Alternative Approach

In several cases, a high number of variables have to be tested, and therefore a high number of experiments is required. Thus, performing a full or fractional factorial design as well as RSM is impractical, especially when a large number of categorical factors must be examined. To this end, incomplete factorial (IF) designs were developed to test only a part of a large full factorial design when a large number of combinations of factors must be examined [44, 45]. Thus, any design that is developed by removing experimental conditions from a full factorial design is an IF. Even though, according to this definition, a fractional factorial design can be designated as an IF design, a main difference between the two designs is that the term “fractional factorial” refers to incomplete factorials that share the balance property of the corresponding full factorial approach, while an IF involves fewer experimental combinations which are not balanced. Thus, all fractional factorials are IF designs, but not all IF designs are fractional factorials [46]. IF designs that are not fractional factorials involve fewer experimental conditions, and they provide an economical and effective way to assess the effect of different possible factors and identify those most likely to be essential, which is beneficial especially when experimental costs are high [46]. Another major advantage of IF designs is that both categorical and continuous factors can be examined simultaneously, while each factor can be examined at more than two levels [18, 47]. Most importantly, a freeware online software called SAmBA (http://www.igs.cnrs-mrs.fr/samba) has been developed for the design of IF approaches. An example of an IF design is provided and discussed in the section “Application of Design of Experiments in Recombinant Protein Biotechnology” (paragraph “Ligation of the Insert with the Vector”).
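To give a flavor of the idea behind IF designs, the sketch below enumerates a full factorial of mixed categorical and continuous factors and randomly searches for a small subset of runs in which every level of every factor appears as evenly as possible. This is only a simple illustration of the balancing concept and is not the algorithm used by SAmBA, which should be preferred for real designs; the factors, levels, and run numbers shown are hypothetical.

```python
import random
from itertools import product

# Hypothetical factors and levels (mixed categorical/continuous), for illustration only
factors = {
    "insert_to_vector": ["1:1", "3:1", "5:1"],
    "temperature_C": [4, 16, 22],
    "time_h": [1, 4, 16],
    "total_DNA_ng": [50, 100, 200],
}
names = list(factors)
full = list(product(*factors.values()))            # 3*3*3*3 = 81 combinations
n_runs = 16                                        # size of the incomplete design

def imbalance(subset):
    """How unevenly the levels of each factor appear in the subset (0 = perfectly even)."""
    score = 0
    for j, name in enumerate(names):
        counts = [sum(1 for run in subset if run[j] == lvl) for lvl in factors[name]]
        score += max(counts) - min(counts)
    return score

random.seed(1)
best = min((random.sample(full, n_runs) for _ in range(20000)), key=imbalance)
print(f"Selected {n_runs} of {len(full)} runs, imbalance score = {imbalance(best)}")
for run in best:
    print(dict(zip(names, run)))
```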

Application of Design of Experiments in Recombinant Protein Biotechnology

Table 4 summarizes the DoE approaches that are commonly employed in the optimization of processes that are used in recombinant protein biotechnology. It should be noted that for screening experiments, two-level factorial designs are very common and more economical compared to 3- or higher-level factorial designs, and due to their simpler structure they are more interpretable in practice [30]. IF designs are probably a better choice when the effect of 2 or 3 categorical factors must be examined; however, previous experience with the protein of interest is required. Overall, a statistical design should be carefully selected based on (i) the available resources and equipment and (ii) the existing information about the protein of interest.

Table 4 DoE designs that are commonly used in recombinant protein biotechnology

The potential applications of DoE designs in recombinant protein biotechnology are extensively discussed in the following paragraphs.

Construction of Recombinant Plasmids

The first step in producing recombinant DNA is to identify and isolate the target DNA and vector DNA. Although several techniques have been developed for generating recombinant DNA sequences, including TA cloning [48], ligation-independent cloning [49, 50], recombinase-dependent cloning [51,52,53], and PCR-mediated cloning [54,55,56], PCR-based cloning is routinely used in molecular cloning [57,58,59]. In addition, PCR primers that introduce restriction enzyme sites on the insert’s sequence are usually employed. This review focuses on traditional PCR-based cloning, and the basic steps that are followed during this technique are described below.

PCR Amplification of the Gene of Interest

The isolation of pure, intact, and high-quality DNA is essential for molecular biology studies [60]. Although PCR cloning is routinely used in molecular cloning [57], there are no general guidelines for setting up a PCR reaction. Because of the potential interactions among the components of PCR, optimization of PCR conditions is usually carried out by changing one or more factors that are known to affect the primer–DNA interaction and primer extension. It is well known that the concentration of Mg2+ ions, the pH of the reaction buffer, and the annealing temperature influence the amplification of a DNA fragment in PCR. In addition, the interactions of some reagents (factors) influence the amplification of the GOI (response). For example, dNTPs chelate Mg2+ ions, and therefore an increase in the concentration of dNTPs will reduce the concentration of free Mg2+ ions in the reaction mixture [61]. To this end, DoE approaches have been successfully employed in many cases to optimize the conditions of PCR reactions. Boleda et al. [62] used a two-step approach to optimize PCR of DNA blood spots. In the first step, using a \(2^{5}\) fractional factorial approach, the DNA concentration and Mg2+ were identified as the factors that significantly affect the response (DNA amplification). Subsequently, RSM was employed to identify the optimum concentrations of the two factors. In another study, the optimum concentrations of dNTPs, Mg2+, and primers were identified using a full factorial approach and a three-dimensional simplex [63]. DoE approaches have also been employed for the optimization of reaction conditions of quantitative PCR [64], real-time PCR [65, 66], and digital PCR [67] assays. Therefore, it is suggested that the conditions of any PCR assay, including cloning PCR, could be optimized by statistically designed experiments. Following PCR, the amplified GOI is analyzed on an agarose gel and recovered using a commercially available gel extraction kit.

Digestion with Restriction Enzymes

Subsequently, both the PCR-amplified GOI (insert) and the vector are digested with the same restriction enzymes in order to create complementary cohesive (sticky) ends. Restriction digestions are carried out according to the manufacturer’s instructions, and optimization at this point is limited to the duration of the reaction and/or the amount of enzyme [68].

Ligation of the Insert with the Vector

This step is usually catalyzed by a DNA ligase, and optimizing ligation efficiency is essential to cloning experiments [69]. Even though some ligation mathematical models have previously been reported [70,71,72,73], they are either too specific or too general to be used as a universal tool to improve ligation efficiency. The factors that significantly affect ligation are the molar insert-to-vector ratio, the ligation temperature and duration, and the total DNA concentration. However, optimization of ligation reactions is usually carried out using the OFAT approach. It has been suggested that a more generic but easily altered strategy is needed to improve DNA ligation [69]. In my laboratory, we have developed an IF design composed of 16 combinations of 3 insert-to-vector ratios, 3 ligation temperatures, 3 durations of ligation, and 3 total DNA amounts, as illustrated in Table 5, to identify the best combination of these factors for the ligation of a plasmid vector with a PCR-amplified gene. The ligation efficiency is monitored using the Lig-PCR method (i.e., the ligation reactions are monitored using PCR and primers that are present in the majority of vectors), as previously described in [74]. This straightforward approach examines all factors affecting ligation efficiency and provides, in less than 2 days, a positive answer to the ligation query. In the case of a negative result (no ligation), a significant amount of time can be saved.

Table 5 Incomplete factorial approach for the ligation of a plasmid vector with the gene of interest (insert)

Transformation of Competent Cells

The ligation product is subsequently inserted into a cloning strain (e.g., DH5α) to ensure the stable amplification of recombinant DNA [75], using a standard transformation protocol (see [76] and references cited therein). In general, the introduction of foreign DNA into bacteria using either electroporation or chemical transformation is affected by many factors including electrical parameters [77] (only for electroporation), washing buffer, cell wall weakening agents [78], cell density (optical density at 600 nm, OD600nm) [79], duration of heat shock, medium composition, and the presence of some co-factors (e.g., DMSO) [76]. However, optimization of transformation conditions is usually carried out using OFAT approaches. Even though the optimization of transformation conditions for cloning experiments using DoE has not been reported, fractional factorial approaches have been successfully employed to identify the factors that significantly affect the transformation efficiency of bacteria for other purposes (e.g., drug development). For example, Yildirim et al. [80] evaluated the effect of five factors (cell density, voltage, resistance, plasmid DNA concentration, and Mg2+ concentration) on the transformation efficiency of Acinetobacter baumannii using a three-level fractional factorial approach, and the transformation efficiency was increased fourfold. Thus, DoE approaches could probably be used as a tool to maximize the transformation efficiency of bacteria during cloning experiments.

Expression of Recombinant Proteins in E. coli

E. coli expression systems provide an inexpensive, robust, and flexible platform appropriate for the production of recombinant proteins at both industrial and laboratory scales. Optimum expression conditions for each construct must be identified for the maximal production of soluble protein. A recombinant protein is expressed in the microenvironment of E. coli which may differ from that of its native source in terms of folding mechanism, pH, co-factors, ionic strength, and redox potential. These factors affect both protein stability and solubility, while in several cases recombinant proteins are expressed in the form of inclusion bodies [81,82,83].

In general, soluble expression of recombinant proteins is affected not only by the expression host strain and expression vector, but also by expression conditions including induction temperature and time, the concentration of inducer, and the composition of the culture medium [18, 84]. The yield and solubility of recombinant proteins can therefore be increased by optimizing these factors, and several expression conditions are usually tested [85]. One of the standard procedures, when setting out to express a recombinant protein, is to test different culture conditions and media because this is easy, cheap, and has been proven to have an impact on protein solubility levels [20]. Among the factors affecting protein expression and/or solubility in E. coli, the induction temperature and time are probably the most important ones, because these variables, in most cases, interact. In bacteria, a slower and longer induction promotes the expression of several proteins in a soluble form, and this approach requires a low temperature [86]. The magnitude of induction is also an important factor that affects both the expression and solubility of recombinant proteins. An insufficient concentration of inducer (e.g., IPTG) may result in low protein expression, whereas the addition of a high concentration of inducer can result in reduced cell growth and/or recombinant protein yield [87]. The soluble expression of a recombinant protein is also affected by the expression host [88], and thus multiple E. coli strains that facilitate the expression of membrane proteins, proteins with rare codons, proteins with disulfide bonds, proteins that are otherwise toxic to the cell, etc., are commercially available (see [18] and references cited therein). Cell density before induction also has an impact on the soluble expression levels of recombinant proteins. Although induction is usually performed at early to mid-log phase, some proteins require induction at late-log phase [89] or stationary phase [90]. To facilitate recombinant protein expression in a soluble form and to accelerate the characterization of protein structure and function, a variety of affinity and solubility tags have also been employed [17], including small peptides [e.g., hexahistidine (6 × His-tag), FLAG] and large peptides/proteins [e.g., glutathione-S-transferase (GST), maltose-binding protein (MBP)] [91]. The main factors affecting the soluble expression of recombinant proteins have been extensively reviewed elsewhere [14, 18].

Even though several criteria must be considered when expressing recombinant proteins in E. coli, optimization experiments are usually carried out using the traditional OFAT approach [92]. On the other hand, DoE approaches have successfully been employed for the identification of factors that have a statistically significant effect on the soluble expression of recombinant proteins. Usually, optimization of soluble expression goes from screening experiments to ascertain what variables have an effect (i.e., a full or fractional factorial design is initially employed looking at cell line, and possibly media and additives, along with OD600nm at induction, temperature, time and IPTG concentration, as variables to determine which factors are significant) and then to RSM to determine the optimum values (levels) of the most significant variables [16, 18].

DoE designs and especially RSM have been successfully employed to optimize the culture conditions and/or culture medium composition for a variety of recombinant proteins such as recombinant scFv antibody [93], human Interferon-beta [94], DT386-BR2 [95], receptor activator of nuclear factor (NF)-κB ligand (RANKL) [96], superoxide dismutase [97], pneumolysin from Streptococcus pneumoniae [32], pneumococcal surface adhesin A [98, 99], TaqI endonuclease [100], pyruvate oxidase [101], sea anemone neurotoxin [102], heparinase I [103], lipase KV1 [104], and glutaryl-7-aminocephalosporanic acid acylase [105]. The potential applications of DoE in the soluble expression of recombinant proteins have been extensively reviewed elsewhere [2, 116]. In my laboratory, we have successfully employed RSM for the optimization of soluble expression of several recombinant proteins including tumor necrosis factor-alpha (TNF-α) [106], RANKL [107], heme oxygenase-1 (HO-1) [108], and human rhinovirus-3C protease (HRV3CP) [109]. In each case, preliminary experiments were performed in order to identify the best expression host for each target protein as well as to identify the factors that have a statistically significant effect on both the yield and soluble expression of the protein of interest. Subsequently, the culture conditions that maximize the soluble expression of each recombinant protein (i.e., TNF-α, RANKL, HO-1, and HRV3CP) were identified using RSM. The use of DoE approaches in recombinant protein expression with examples of media and culture conditions optimization has been recently reviewed in Ref. [40].

Even though RSM has been successfully employed to maximize the soluble expression of several recombinant proteins, this design has several limitations, especially when the number of test variables is high [98]. Moreover, RSM is a fine-tuning technique, i.e., it is used to identify the optimum combination of the independent variables that maximize the response. Therefore, as mentioned above, preliminary experiments are essential to identify the factors that significantly affect the response, while only continuous factors can be examined [11]. To this end, in my laboratory we have recently developed an IF approach that we called IF-STTI (Incomplete Factorial-Strain/Time/Temperature/Inducer) [109] to identify the best combination of the four most important factors (i.e., expression host, temperature and duration of induction, and IPTG concentration) affecting the soluble expression of recombinant proteins in E. coli [18] in a single experiment. In detail, IF-STTI is composed of 24 different combinations of three expression strains, three post-induction temperatures, four induction times, and three IPTG concentrations. The design was validated with three GST-tagged recombinant proteins, i.e., TNF-α, RANKL, and HRV3CP. The results obtained from this design were subsequently compared with those obtained using RSM and, interestingly, the soluble expression levels of the three tested proteins were close to those obtained by RSM. Most importantly, we demonstrated that the IF-STTI design is an accurate and straightforward method as it provides, in only 24 experiments, the same information regarding the interactions of variables on the soluble expression of recombinant proteins as would a full factorial design (108 experiments) or RSM (30 experiments). Another advantage of IF-STTI compared to the \(2^{k}\) factorial designs is that all variables may be examined at more than two levels.

Two incomplete factorial designs called “InFFact” [110] that is made of 12 combinations of 4 E. coli strains, 3 media, and 3 expression temperatures (full factorial 36 combinations) and “Fusion-InFFact” [111] that is composed of 24 combinations of 4 expression strains, 3 media, 3 expression temperatures, and 5 N-terminal tags (full factorial 180 combinations) have also been reported. Both methods have been successfully employed to determine the conditions that maximize the soluble expression of several recombinant proteins in E. coli.

Purification of Recombinant Proteins

The ultimate objective of protein purification for therapeutic or analytical applications is to achieve both high yield and purity [112]. Thus, it has been suggested that the protein of interest should be produced as a fusion to an affinity tag because tags facilitate the purification of any protein, in one step, without any prior knowledge about the protein of interest and do not affect its biochemical or biological activity [113]. To this end, a variety of affinity purification methods have been developed (for a review on affinity tags see Ref. [17]). In general, affinity purification of recombinant proteins depends on two factors: (i) the ability of a protein to bind to an affinity matrix which is composed of a substrate attached to a solid support, e.g., Sepharose, agarose, or resin and (ii) the ability to recover the protein from the affinity matrix. Elution is usually carried out either using a soluble substrate that competes for binding sites (competitive elution) or, in some cases, by cleavage between the protein and affinity matrix with a specific protease [114, 115].

A typical protein purification includes several operating parameters that can affect the yield and purity of the protein of interest. To achieve both high purity and high yield of the target protein, it is essential to examine the relationship between these two goals and the purification factors and to optimize purification conditions accordingly. The final yield and purity of a protein are affected by multiple factors including the composition of the sample to be loaded, the chromatography medium, the purification method, and the binding, wash, and elution conditions [116]. Moreover, the final purity and recovery of the protein can be optimized by controlling the operating conditions such as flow rate, ionic strength gradient, sample load, physical properties of the adsorbent matrix, column dimensions, and the ratio of the protein to the column size [112]. In addition, it is important to take into account the effect of buffer composition on protein stability and purification yield, and it is beneficial to decide what the ideal final buffer would be. Therefore, optimization of purification processes can be time-consuming [116] and, despite the obvious advantages of DoE approaches over the OFAT approach, optimization of the purification conditions of recombinant proteins is usually carried out using the latter method. For example, during optimization of immobilized metal affinity chromatography (IMAC) protocols using the OFAT approach, a different volume of metal-chelated resin and concentration of imidazole in the washing and elution buffers are tested in each experiment. However, the OFAT approach does not take into account the effects of interactions among purification factors on the purity and yield of the target protein. Because every protein is different, the optimum purification conditions must be identified for each protein [9].

Nevertheless, the significant advantages of DoE over the OFAT approach in recombinant protein purification are highlighted in recent publications. A two-step DoE approach has been used for the affinity purification of recombinant 6 × His human erythropoietin (hEPO). During the first step of this approach, it was demonstrated that the ratio of loaded protein to resin significantly affects both protein purity and yield. Subsequently, in the second step, the optimal purification conditions (i.e., the amount of resin and the wash/elution conditions) of 6 × His-hEPO were identified using RSM. This two-step DoE-optimized purification approach resulted in a 45% yield and a 90% purity of recombinant 6 × His-tagged hEPO [117].

A two-step DoE approach has also been employed for the optimization of the purification conditions of a recombinant single-chain variable fragment against the type 1 insulin-like growth factor receptor (IGF-1R) using the capto-L affinity chromatography medium [118]. In an initial step, the effect of seven variables, including the pH value of the buffer and the concentration of the following additives: NaCl, urea, arginine, trehalose, polyethylene glycol (PEG), and dextran, on both IGF-1R aggregation and recovery was evaluated using a 2-level fractional factorial approach. Trehalose concentration and pH were identified as the main factors and, subsequently, the purification conditions were optimized using a central composite circumscribed design. Overall, a total yield of 77% and a 98.5% purity of the final product were achieved.

In another study, Amadeo et al. [119] employed RSM in order to identify the best combination of the critical factors, i.e., sample pH, the ratio of loaded protein to resin, and residence time, that affect the purity and yield of recombinant human erythropoietin using Blue Sepharose as an affinity matrix. An 88% recovery and a 71.5% purity of the protein of interest were achieved after optimization of the purification conditions. RSM has also been successfully employed for the optimization of the purification conditions (PEG and salt concentration, pH value and/or concentration of the purification buffer) in aqueous two-phase systems for several enzymes, including glucose dehydrogenase (GDH) from Bacillus subtilis [120] and d-galactose dehydrogenase (GalDH) from Pseudomonas fluorescens AK92 [120, 121].

In my laboratory, we have recently reported an IF approach composed of 16 different combinations of three resin volumes, three glycerol and four DTT concentrations in the purification buffers, and three incubation times of cell lysate with resin, in order to determine the optimal purification conditions for GST-tagged HRV3CP [109]. The 16 combinations of these factors were selected out of the 108 combinations (3 × 3 × 4 × 3) of the full factorial design using the SAmBA freeware. The results revealed that the recovery of the protease was increased by 15% (compared to the protease recovery before optimization), while the proteins that were previously co-purified (before optimization of the purification conditions) with the target protein (GST-HRV3CP) were eliminated. Our method was validated further using another two GST-tagged recombinant proteins, i.e., GST-TNF-α and GST-RANKL, and the yields of the two proteins were increased by 11% and 10%, respectively [109].

Based on the examples described above, DoE approaches could overcome the limitations of the traditional OFAT approach for the optimization of the purification conditions of any recombinant protein. As the purification of any protein is affected by multiple factors (variables), the OFAT approach often fails to identify the optimal purification conditions because it examines only a limited part of the experimental space and, most importantly, it does not examine the combined effects of all the factors involved. The specific steps that are followed during the optimization of purification conditions of recombinant proteins using DoE, as well as specific guidelines for the execution and analysis of experiments, are described in Ref. [116].

Assessment of Protein Activity

Following purification of the target protein, in several cases its activity should be assessed. Moreover, enzymes are important drug targets, and the determination of their kinetic parameters is essential in areas such as drug development, clinical diagnosis, and biotechnology research. To identify potential therapeutics that inhibit the function of proteins/enzymes implicated in the pathogenesis and development of diseases, it is essential to design, develop, and validate biological assays for high-throughput screening (HTS). Developing sensitive biological assays suitable for HTS requires identification of the factors affecting assay performance and robustness, and the correct design of a biological assay is essential to derive the correct information and to collect data suitable for analysis and modeling [122]. Thus, the development of reliable biological assays for the identification and validation of potential therapeutics is essential in the various stages of drug development [123,124,125].

Depending on the assay format and the nature of the protein that is studied in each case, different variables, i.e., assay conditions, should be examined. For example, typical factors that affect the enzyme activity include the composition, concentration, and pH of the reaction buffer, type and concentration of enzyme, ionic strength, as well as the type and concentration of substrate, reaction conditions (assay incubation time and temperature), and appropriate assay technology. Likewise, in non-enzymatic protein assays, e.g., ligand-binding assays, several factors should also be examined including the dilution of the protein, assay incubation time and temperature, viscosity, and ionic strength, while in several cases buffer additives should be included in order to facilitate protein stability or to improve ligand solubility [126, 127]. Several factors can be optimized during the development, optimization, and assessment of the robustness of ELISA-based ligand-binding assays including conditions associated with samples and calibrators and conditions associated with the detection of the analyte, such as substrate development time [125, 128].

A major concern during the development and optimization of a biological assay is the selection of factors to be tested as well as their ranges to be used. In general, assay optimization determines how a range of experimental conditions can affect assay performance and is an essential step to find the value that each variable should have to produce the best possible response. Usually, if there is literature available, the experimenter begins with the reaction conditions and factors published previously to be needed for the activity of the same or a similar protein/enzyme. The reaction conditions and the concentration range of the selected factors to be examined should then be selected carefully and should be large enough to cause a clear alteration in the measured response, but not so large that the process will ‘fall off a cliff’ and produce unusable data [129].

Although assay methodologies have been extensively reviewed, because each protein/enzyme is different, further modification of a procedure is often required (i) to adjust the assay conditions to the special features of the protein/enzyme of interest or (ii) to develop an assay for a newly discovered protein/enzyme [130]. It has been suggested that a DoE study must be carried out before the validation of an assay for early identification of the factors that significantly affect assay performance and robustness [128]. However, a survey carried out by HTStec in 2009 [131] revealed that optimization of assays is carried out using the traditional OFAT approach because most researchers believe that DoE designs are very difficult to employ. However, using the traditional OFAT approach it usually takes at least 4 months to develop an assay, and therefore assay development can become a bottleneck in drug discovery projects [122]. To this end, we have recently reported the steps any researcher could follow to develop, optimize, and define the design space for the determination of enzyme activity using a two-step DoE methodology, including guidelines for (i) identification of the factors that significantly affect the activity of the enzyme to be studied and (ii) execution of experiments and data analysis, using HRV3CP in a 96-well plate format assay as an example [129]. Briefly, a \(2^{8 - 4}\) fractional factorial design was initially employed to assess the effect of eight factors: one categorical factor, i.e., buffer composition (Tris–HCl or HEPES), and seven continuous factors including reaction pH, temperature, and time, as well as the concentrations of NaCl, DTT, EDTA, and glycerol, on protease activity. The results of the screening DoE were used to eliminate non-significant factors using the half-normal probability and Pareto charts, as described in the section “Theory and Steps of Design of Experiments” (paragraph “Identification of Significant Factors”). Our analysis revealed that only the pH of the buffer, the incubation time, and the concentrations of both DTT and glycerol produced significant effects on the activity of HRV3CP. Subsequently, we employed RSM to determine the optimal combination of the four statistically significant variables that produces the maximum HRV3CP activity, and a 1.5-fold increase in the activity of the protease was achieved [129].

It should be noted that the quality of an HTS assay is usually assessed using the Z-prime (Z′) statistical test, which takes into account both the signal window and the assay variability [132]. The Z′-factor is calculated based on the following equation (Eq. 3):

$$Z^{\prime} = 1 - 3 \times \frac{{\sigma_{p} + \sigma_{n} }}{{\mu_{p} - \mu_{n} }},$$
(3)

where σp and σn are the standard deviations of the positive and negative controls, respectively, and μp and μn are their respective mean values.

In general, the Z′ factor is the most widely used statistic for assessing the quality of an assay. A Z′ of 1 is the theoretical ideal; in practice an assay can never reach exactly 1, and Z′ can never exceed 1.0. An assay with a Z′ between 0.5 and 1.0 is considered excellent, whereas a Z′ between 0 and 0.5 indicates a marginal assay. A Z′ factor below 0 means that the assay is not suitable for HTS [132]. In the aforementioned example, the Z′ factor increased from 0.78 (before optimization) to 0.92 (after optimization), and thus the assay is suitable for HTS of HRV3CP inhibitors [129].
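As a minimal illustration, the Z′ calculation of Eq. 3 and the quality thresholds described above can be implemented in a few lines of Python; the control readings below are made-up values used only for demonstration, not data from [129].

```python
# A minimal sketch: computing the Z'-factor of Eq. 3 from replicate control
# measurements and applying the quality thresholds discussed above.
import numpy as np

def z_prime(positive, negative):
    """Z' = 1 - 3*(sigma_p + sigma_n) / |mu_p - mu_n|."""
    p = np.asarray(positive, dtype=float)
    n = np.asarray(negative, dtype=float)
    return 1 - 3 * (p.std(ddof=1) + n.std(ddof=1)) / abs(p.mean() - n.mean())

def assay_quality(z):
    if z >= 0.5:
        return "excellent (suitable for HTS)"
    if z > 0:
        return "marginal"
    return "not suitable for HTS"

positive_controls = [980, 1010, 995, 1005, 990, 1000]  # e.g., uninhibited enzyme wells
negative_controls = [102, 98, 105, 95, 100, 101]       # e.g., no-enzyme wells
z = z_prime(positive_controls, negative_controls)
print(f"Z' = {z:.2f} ({assay_quality(z)})")
```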

Even though DoE designs have significant advantages over the OFAT approach in assay optimization and in assessing the robustness of a method, only a limited number of publications in the literature utilize these designs for assay development, optimization, and validation. Nevertheless, DoE approaches have been successfully used for the optimization of several assay conditions and for the determination of the kinetic constants of various enzymes, including glucose oxidase [133], the enzymes involved in the synthesis of precorrin-2 [134], and hydrolases [135], for the development and validation of a cell-based bioassay for the detection of anti-drug neutralizing antibodies in human serum [136], for the optimization of various immunoassays [137,138,139,140], as well as for evaluating the robustness of a ligand-binding assay [128] and other assays [141]. A detailed tutorial describing the use of DoE approaches in non-enzymatic assay optimization has been reported previously [122].

Protein Crystallography

In drug discovery projects, crystallization of the target protein(s) implicated in the pathogenesis of a disease together with a potential therapeutic is an essential step for identifying the interactions between the two molecules. These interactions are translated into a picture of where and how a drug molecule binds to the target protein(s) and acts as an inhibitor, an agonist, or a modulator [142].

A major issue in protein crystallization is that a large number of parameters must be tested to identify the conditions that yield a single large crystal for the collection of X-ray data [143, 144]. Biochemical, chemical, and physical factors, such as genetic modifications of the protein, the type of precipitants and salts, their concentrations, the pH of the buffer, and the temperature of the environment, may have an impact on the crystallization process. Because each protein has a unique primary structure, it is quite challenging to determine a priori the crystallization conditions that will yield a crystal [145], and therefore adapted methods are employed to enable the growth of the appropriate crystals [146].

To this end, the conditions for protein crystallization have traditionally been identified using two DoE designs, namely incomplete factorial experiments (IFE) [145, 147] and sparse matrix sampling (SMS) [144, 148]. The incomplete factorial approach was introduced in protein crystallography in 1979 [44] as a powerful tool for identifying the factors and conditions that need to be varied to obtain crystals. The goals of this approach are to (i) identify the important factors that influence the crystallization of the target protein and (ii) reduce the total number of crystallization conditions compared with full factorial designs [44]. The IFE approach is an essential tool, especially when there is not enough protein to test a large number of crystallization conditions, as it provides sufficient information about the important factors from a small number of experiments [149]. The SMS method was initially reported in 1991 by Jancarik and Kim [144], and their original screen, plus a wide range of variations, has since been commercialized [150]. The sparse matrix approach uses three categories of major variables: pH and buffer materials, additives, and precipitating agents. The ranges of buffer, pH, additive, and precipitant conditions are derived empirically from conditions known from past experience to have resulted in protein crystallization. A simple computational sketch of the incomplete factorial sampling idea is given below.
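To illustrate the logic of an incomplete factorial screen, the following Python sketch samples a balanced subset of conditions from the full factorial of four hypothetical crystallization factors. The factor levels, the subset size, and the balance criterion are illustrative assumptions and do not reproduce any published screen.

```python
# A minimal sketch (illustrative only, not a published screen): sampling an
# incomplete factorial crystallization screen. The hypothetical factor levels
# below define a 4 x 4 x 4 x 2 = 128-condition full factorial; a random,
# balanced subset covers every level of every factor in far fewer drops.
import itertools
import random

levels = {
    "precipitant": ["PEG 4000", "PEG 8000", "ammonium sulfate", "MPD"],
    "salt": ["NaCl", "MgCl2", "CaCl2", "LiCl"],
    "pH": [4.5, 6.0, 7.5, 9.0],
    "temperature (C)": [4, 20],
}

def incomplete_factorial(levels, n_conditions, seed=0):
    """Draw n_conditions from the full factorial, re-sampling until every
    level of every factor appears at least once (a simple balance check)."""
    rng = random.Random(seed)
    full = list(itertools.product(*levels.values()))
    while True:
        subset = rng.sample(full, n_conditions)
        columns = list(zip(*subset))
        if all(set(col) == set(vals) for col, vals in zip(columns, levels.values())):
            return [dict(zip(levels.keys(), row)) for row in subset]

screen = incomplete_factorial(levels, n_conditions=24)
for i, condition in enumerate(screen, start=1):
    print(i, condition)
```

In practice, dedicated incomplete factorial software also balances the frequency of each level and minimizes correlations between factors across the selected conditions; the simple re-sampling loop above is only the most basic approximation of that idea.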

Following the screening crystallization experiments, a set of optimization methods is usually applied to improve the quality of the crystals. Further details regarding the IFE and SMS methods and the associated optimization techniques can be found in the literature and are beyond the scope of this review. Importantly, crystallization techniques that are based on DoE are continually being refined. For example, Dinć et al. [151] reported the “Associative Experimental Design (AED)” approach for the optimization of protein crystallization conditions. The main advantage of this approach is that, following analysis of preliminary experiments, AED generates candidate cocktails, i.e., novel conditions that lead to crystals (see also [151] and references cited therein).

Conclusion

Recombinant proteins are essential tools in the biomedical, pharmaceutical, and biological industries, and the production of soluble and functional recombinant proteins is the ultimate goal in protein biotechnology. Several recombinant proteins are being used as drugs (biopharmaceuticals), and demand for them in the pharmaceutical industry is expected to increase in the coming years because biopharmaceuticals have been successfully used for the prevention, detection, and treatment of diseases. To meet this growing demand, it is essential to produce recombinant proteins in high amounts and in a pure and active form. Owing to the unique properties of each protein and the complex interactions among the reagents in the experiments, it is almost impossible for one set of reaction conditions to be optimal in all cases. Optimization of the processes used in recombinant protein biotechnology is usually carried out with the traditional OFAT approach, which is not only time-consuming but also incapable of identifying the true optimal conditions, as it does not examine the interactions between the factors affecting the desired response(s). In contrast, DoE designs are gaining ground for the optimization of all processes of recombinant protein biotechnology, including construction of the recombinant plasmid vector, protein production and expression, purification, assessment of activity, and crystallography, as summarized in Table 6, because they require fewer experiments, and therefore less time, for the amount of information obtained, while in the case of negative results a significant amount of time can be saved. Most importantly, DoE designs can provide models that may assist in (i) identifying the factors that have a statistically significant effect on a process and (ii) studying the interactions between different variables and predicting the maximized response in all processes of recombinant protein technology.

Table 6 Applications of DoE in the main processes that are used in recombinant protein biotechnology