1 Introduction

The heavy use of fossil fuels in transportation and power generation is the leading contributor to the greenhouse effect. Petroleum reserves are also finite: at the current production rate, oil reserves are expected to be depleted in less than half a century [1, 2]. Renewable and environmentally friendly energy resources may therefore play an important role in alleviating the problems associated with fossil fuels (health hazards, acid rain, global warming and reservoir exhaustion). In this regard, biofuels obtained from seed oil (ester-based, oxygenated, sulfur-free, renewable, non-toxic and biodegradable fuels) could be an alternative solution [3,4,5,6,7,8,9]. Oil extracted from seeds is the major source of biofuel. In the last 10 years, more than 350 oil-containing seeds (palm, coconut, jatropha, sunflower, rapeseed, soybean, jojoba, karanja, neem, moringa, castor, cotton, etc.) have been identified as biofuel feedstocks. However, many of these feedstocks are edible and serve as human food [10, 11]. Therefore, non-edible feedstocks (e.g., papaya seeds, date seeds, microalgae biomass and others) have recently attracted attention worldwide since they do not compete with human consumption [12,13,14]. Moreover, many studies have shown that seed oils can be used to produce a variety of other valuable products such as biolubricants, biosolvents, cosmetics and beauty products (e.g., demulcent skin care products, hair conditioners, bath oils and makeup) [15,16,17].

The papaya (Carica papaya) is the fourth most traded tropical fruit worldwide. Almost 75% of the world's papaya is produced in India, Bangladesh, Brazil, Indonesia, Nigeria and Mexico [13]. Among these, India alone contributes 42% of world production (about 3 million tons per year) [18]. The papaya fruit is rich in nutrients such as vitamins C and A, magnesium, folate, fiber and potassium [16]. The seeds are black or dark brown, soft and round, with a strong smell. The literature reports that dry papaya seed contains around 30% lipids, 28% proteins and 22% fibers [19]. The fatty acids in the seeds (carbon chain lengths from C14 to C22) consist mainly of myristic, oleic, stearic, linoleic and palmitic acids [20]. A papaya fruit usually weighs from 200 to 3000 g, and approximately 15% of its wet weight is seed. Since the seeds are not eaten, this fraction of the fruit biomass is discarded as waste, although it could be used as a feedstock for the synthesis of valuable bioproducts [21, 22]. To make papaya seeds more useful, it is therefore important to examine and evaluate the production of papaya seed oil through an extraction methodology.

Several extraction techniques have been used to obtain useful oils from the seeds of various plants, and the choice of extraction method is an important factor in producing high-quality oils. Solvent extraction is one option, but it depends on solvent selectivity and requires high temperatures [23]. Cold press extraction is the conventional technique for oil extraction but has a low yield [24]. Supercritical fluid extraction (SFE) and enzyme-assisted aqueous oil extraction are currently used to extract plant oils and offer convenient features [25,26,27,28,29]. Soxhlet extraction with hexane is also widely used for oil extraction from seeds [23, 30,31,32,33,34]. However, to achieve a high oil yield, the Soxhlet extraction process needs to be operated at optimal conditions, including extraction time and seed particle size.

Most reports on the extraction of papaya seed oil adopt either a one-factor-at-a-time (OFAT) strategy or response surface methodology (RSM) via central composite design (CCD) or Box–Behnken design (BBD) to optimize the process variables [20, 35]. Optimization here means finding the point that minimizes or maximizes a response or output. OFAT is a trial-and-error approach that overlooks interactive effects among the process parameters and is therefore unable to optimize a process reliably. Although RSM is efficient, can handle problems with large numbers of design variables, considers interaction effects and requires little parameter tuning, it may give only a locally optimal solution [36,37,38,39,40]. To overcome this limitation, a method that searches for the global optimum is preferable.

Popular algorithms for global optimization include the genetic algorithm (GA) [41], simulated annealing (SA) [42, 43], ant colony optimization (ACO) [44], particle swarm optimization (PSO) [45], combinatorial optimization [46] and harmony search (HS) [47]. However, their large number of tuning parameters, whose settings are often disputed, makes these methods tedious to apply. Recently, a new metaheuristic optimization algorithm, the crow search algorithm (CSA), has been reported that can overcome these drawbacks [48]. It is a nature-inspired metaheuristic for solving global optimization problems in engineering research [48,49,50,51,52,53,54,55,56,57]. It has only two tuning parameters (flight length and awareness probability), which distinguishes it from other well-known algorithms.

RSM is a well-known and effective method for describing the relationships between input and output parameters. However, it may provide a suboptimal solution. RSM coupled with CSA, on the other hand, is an efficient platform for optimizing the independent variables to maximize the response variables or outputs, because the CSA searches the fitted model for a global rather than a local optimum. Only a limited number of past studies have used this integrated approach to optimize process variables; for example, water jet cutting process parameters were optimized by RSM coupled with CSA [54]. However, to our knowledge, this integrated platform has not been applied to extraction-based optimization problems. It is therefore worthwhile to optimize the extraction conditions of papaya seed oil using RSM coupled with CSA.

This paper focuses on optimizing the Soxhlet extraction parameters (extraction time and seed particle size) for papaya oil yield using RSM coupled with CSA. Initially, RSM with a CCD approach was adopted to develop a quadratic regression model for predicting oil yield. The performance of the RSM model was also compared with that of the widely used generalized linear model (GLM) [58]. The GLM allows regression models to be fitted to univariate response data that follow a very general class of distributions called the exponential family, which includes the normal, binomial, Poisson, geometric, negative binomial, exponential, gamma and inverse normal distributions [59]. The better of the two regression models was then used to search for the optimum combination of the input variables using CSA to obtain a global optimum solution, and these results were compared with those of the desirability function-based optimization approach. The desirability function approach is widely used for factor optimization in engineering research [60,61,62]. In this approach, each anticipated output is converted into a unitless value (d), which varies from 0 to 1; the desirability of the output rises with the value of d. Finally, the predicted operating conditions were verified by conducting triplicate experiments, and the produced oil was analyzed using a GC–MS method.

2 Materials and Methods

2.1 Seed Preparation

The raw ripe papaya seeds were obtained from Al-Jazira supermarket in Bahrain. The seeds were dehydrated in a laboratory oven at 50 °C for 24 h. After drying, the seeds were ground with a coffee grinder and separated into five different sizes. Four sieves of the following sizes were used to isolate the seed particles: 0.85 mm, 1.18 mm, 1.40 mm and 2.00 mm. The fifth size was the full seed size, which is approximately 3.75 mm. The ground seeds of each size were placed in small bottles and stored in a refrigerator until use.

2.2 Papaya Seed Oil Extraction

The seed oil was extracted in a Soxhlet apparatus using hexane as the solvent at 80 °C. Hexane (purity ≥ 99%) was obtained from Sigma-Aldrich. A typical Soxhlet apparatus is composed of: (1) a percolator (boiler and reflux) that circulates the solvent, (2) a thimble (porous thick filter paper) that holds the solid to be extracted and (3) a siphon mechanism that periodically empties the thimble. Briefly, the ground seed of each size (2 g) is placed in a thimble, which is placed inside the extraction chamber. Hexane (125 mL in a 250 mL distillation flask) is evaporated at 80 °C. The hexane vapor travels up the distillation arm, condenses while passing through the condenser and drips onto the ground seeds in the thimble. The chamber containing the seed powder slowly fills with warm hexane, and the seed oil dissolves in the hexane. As soon as the Soxhlet chamber is full, it is emptied by the siphon, and the oil–hexane mixture returns to the distillation flask. The thimble acts as a filter, so no solid material moves to the still pot. This cycle is repeated many times.

2.3 Oil Separation

The oil–hexane mixture was poured into a conical flask and clamped in a water bath at 80 °C. The entire separation process took place in a fume hood. The mixture was evaporated until a viscous yellow residue remained. The volume and weight of the extract (oil) were measured using a pipette and an electronic balance, respectively. The percentage yield of oil extraction was determined using Eq. (1).

$$ \% {\text{Yield}} = \frac{{{\text{Mass}}\;{\text{of}}\;{\text{the}}\;{\text{oil}}}}{{{\text{Mass}}\;{\text{of}}\;{\text{the}}\;{\text{initial}}\;{\text{sample}}}} \times 100 $$
(1)

2.4 Response Surface Methodology (RSM)

RSM is a bias-less statistical method for investigating the relationship between the output (response) and input variables, as well as for optimizing the relevant processes [13, 63,64,65,66]. The central composite design (CCD), one of the standard RSM designs, was employed in this study. The seed size and the extraction time were taken as the two independent (input) factors, while the percentage yield of oil extraction was taken as the response. The behavior of the process is characterized by the second-order multiple regression model in Eq. (2):

$$ y = \beta_{ 0} + \sum\limits_{i = 1}^{N} {\beta_{i} } x_{i} + \sum\limits_{i = 1}^{N} {\beta_{ii} } x_{i}^{2} + \sum\limits_{i < j} {\sum {\beta_{ij} } x_{i} x_{j} + \varepsilon } $$
(2)

where \( y \) denotes the predicted output, \( x_{i} \) denotes the coded factors, \( \beta_{0} \) denotes the intercept term, \( \beta_{i} \) denotes the linear effect, \( \beta_{ii} \) denotes the squared effect, \( \beta_{ij} \) denotes the interaction effect and \( \varepsilon \) denotes the residual.

The relationship between natural and coded variables can be expressed as Eq. (3).

$$ {\text{Coded}}\;{\text{value}} = \frac{{{\text{Natural}}\;{\text{value}} - {\text{Mean}}}}{\text{Range/2}} $$
(3)
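The conversion in Eq. (3) and its inverse can be sketched in a few lines of Python. This is only an illustration: the function names and the factor levels used in the example (a low level of 2 h and a high level of 6 h) are hypothetical and are not taken from Table 1.

```python
def to_coded(natural, low, high):
    """Eq. (3): map a natural value to coded units (low -> -1, high -> +1)."""
    mean = (low + high) / 2.0
    half_range = (high - low) / 2.0
    return (natural - mean) / half_range


def to_natural(coded, low, high):
    """Inverse of Eq. (3): map a coded value back to natural units."""
    mean = (low + high) / 2.0
    half_range = (high - low) / 2.0
    return mean + coded * half_range


# Hypothetical factor levels (assumed for illustration only).
LOW_H, HIGH_H = 2.0, 6.0
print(to_coded(4.0, LOW_H, HIGH_H))      # center of the range
print(to_natural(1.414, LOW_H, HIGH_H))  # axial point beyond the high level
```

Because the axial levels (± 1.414) lie outside the low and high levels, their natural values fall outside the interval between the two factorial levels.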

The experimental design matrix (total number of runs, coded/un-coded variables, range and levels of two independent variables, combination of two variables, etc.) was generated using Minitab v.18 as shown in Table 1.

Table 1 CCD for optimizing size of seed and time of extraction. For each parameter, five coded levels were considered: extremely low (− 1.414), low (− 1), center (0), high (+ 1) and extremely high (+ 1.414)
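For readers without access to Minitab, the coded points of a two-factor CCD with the five levels listed above can be generated with a short Python sketch. The function name and the number of center replicates (three) are assumptions made for illustration, and the points are listed in standard order rather than in the randomized run order a design package would produce.

```python
import itertools


def central_composite_design(n_factors=2, n_center=3):
    """Return coded CCD runs: factorial, axial and center points."""
    # Rotatable axial distance; equals 1.414 for two factors.
    alpha = (2 ** n_factors) ** 0.25
    factorial = [list(p) for p in
                 itertools.product([-1.0, 1.0], repeat=n_factors)]
    axial = []
    for i in range(n_factors):
        for sign in (-1.0, 1.0):
            point = [0.0] * n_factors
            point[i] = sign * alpha
            axial.append(point)
    # n_center is a hypothetical choice, not a value from the study.
    center = [[0.0] * n_factors for _ in range(n_center)]
    return factorial + axial + center


for run, (t, s) in enumerate(central_composite_design(), start=1):
    print(run, round(t, 3), round(s, 3))
```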

2.5 Generalized Linear Model (GLM)

The GLM can be written as:

$$ g\left( {\mu_{i} } \right) = g\left[ {E\left( {y_{i} } \right)} \right] = \varvec{x}_{\varvec{i}}^{{\prime }}\varvec{\beta} $$
(4)

where \( E\left( y \right) \) is the expected value (or expectation function) of the response y, \( \varvec{x}_{\varvec{i}} \) is a vector of covariates for the ith observation and \( \varvec{\beta} \) is the vector of regression coefficients. Every generalized linear model has three components: a response variable distribution, a linear predictor that involves the covariates and a link function g (identity, logit, log, etc.) that connects the linear predictor to the natural mean of the response variable [59]. Depending on the choice of the link function g, a GLM can include a nonlinear model.

2.6 Crow Search Algorithm (CSA)

The best developed regression model (obtained from RSM or GLM) was utilized to find the global optimum combination of the input factors via crow search algorithm (CSA) using MATLAB v.16. The optimization codes of CSA were based on the following steps [36, 49, 50]:

  1. Initialization of parameters: The parameters include the flock size (N), flight length (fl), awareness probability (AP), maximum number of iterations (\( {\text{iter}}_{ \hbox{max} } \)) and any other parameters that are considered constants.

  2. Initialization of memory and position: The N crows are positioned randomly in a search space of dimension d, where d denotes the number of decision parameters, which gives the following matrix. Each crow (row) in the matrix represents a viable solution to the problem.

    $$ {\text{Crows}} = \left[ {\begin{array}{*{20}c} {x_{1}^{1} \;x_{2}^{1} } & \cdots & {x_{d}^{1} } \\ \vdots & \ddots & \vdots \\ {x_{1}^{N} \;x_{2}^{N} } & \cdots & {x_{d}^{N} } \\ \end{array} } \right] $$
    (5)

    The memory of the crow is initialized. Since the crow has no experience, the food is assumed to be hidden in the initial position.

    $$ {\text{Memory}} = \left[ {\begin{array}{*{20}c} {m_{1}^{1} \;m_{2}^{1} } & \cdots & {m_{d}^{1} } \\ \vdots & \ddots & \vdots \\ {m_{1}^{N} \;m_{2}^{N} } & \cdots & {m_{d}^{N} } \\ \end{array} } \right] $$
    (6)
  3. Evaluation of the fitness function: The quality of each initial position is computed by evaluating the fitness function at the decision variables of each crow.

  4. Generation of new positions: Each crow (crow i, for example) selects a random crow in the flock to follow (crow j). Either crow i reaches the position of the food hidden by crow j, or crow j notices that it is being followed and misleads crow i, which then moves to a random position, according to Eq. (7). Here, \( r_{i} \) and \( r_{j} \) are random numbers uniformly distributed between 0 and 1.

    $$ x^{{i,{\text{iter}} + 1}} = \left\{ {\begin{array}{*{20}l} {x^{{i,{\text{iter}}}} + r_{i} \times fl^{{i,{\text{iter}}}} \times \left( {m^{{j,{\text{iter}}}} - x^{{i,{\text{iter}}}} } \right)} \hfill & {{\text{when}}\;r_{j} \ge {\text{AP}}^{{j,{\text{iter}}}} } \hfill \\ {{\text{a}}\;{\text{random}}\;{\text{position}}} \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right. $$
    (7)
  5. Testing the viability of the new positions: The feasibility of each crow's new position is checked against the constraints (limits). If the new position is feasible, the crow updates its position; otherwise, it stays in its old position.

  6. Evaluation of the fitness function at the new positions.

  7. Memory update: The crow updates its memory as follows:

    $$ m^{{i,{\text{iter}} + 1}} = \left\{ {\begin{array}{*{20}l} {x^{{i,{\text{iter}} + 1}} } \hfill & {{\text{if}}\;f\left( {x^{{i, {\text{iter}} + 1}} } \right)\;{\text{is}}\;{\text{better}}\;{\text{than}}\;f\left( {m^{{i,{\text{iter}}}} } \right)} \hfill \\ {m^{{i,{\text{iter}}}} } \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right. $$
    (8)

    where \( f( \cdot ) \) is the objective function value.

  8. Termination criterion: Steps 4 to 7 are repeated until the maximum number of iterations (\( {\text{iter}}_{ \hbox{max} } \)) is reached. The best position in the memory gives the best objective function value and hence the solution of the optimization problem.
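The eight steps above can be sketched in plain Python as follows. This is a minimal illustration written for this text, not the MATLAB code used in the study: the function name `crow_search`, its default settings, the fixed random seed and the toy objective are all assumptions, and the sketch maximizes the objective over a simple box of lower and upper limits.

```python
import random


def crow_search(objective, bounds, n_crows=50, fl=2.0, ap=0.1,
                iter_max=200, seed=0):
    """Maximize `objective` over the box `bounds`; return (position, value)."""
    rng = random.Random(seed)

    def random_position():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def is_feasible(x):
        return all(lo <= xi <= hi for xi, (lo, hi) in zip(x, bounds))

    # Steps 1-3: random start, memory = start positions, evaluate fitness.
    positions = [random_position() for _ in range(n_crows)]
    memory = [list(p) for p in positions]
    memory_fit = [objective(p) for p in memory]

    for _ in range(iter_max):
        # Step 4: each crow i follows a randomly chosen crow j (Eq. 7).
        candidates = []
        for i in range(n_crows):
            j = rng.randrange(n_crows)
            if rng.random() >= ap:
                r = rng.random()
                step = [xi + r * fl * (mj - xi)
                        for xi, mj in zip(positions[i], memory[j])]
            else:
                step = random_position()
            candidates.append(step)
        for i, step in enumerate(candidates):
            # Step 5: keep the old position when the move is infeasible.
            if is_feasible(step):
                positions[i] = step
            # Steps 6-7: evaluate and update memory if improved (Eq. 8).
            fit = objective(positions[i])
            if fit > memory_fit[i]:
                memory[i] = list(positions[i])
                memory_fit[i] = fit

    # Step 8: the best remembered position is the reported solution.
    best = max(range(n_crows), key=lambda i: memory_fit[i])
    return memory[best], memory_fit[best]


def toy_objective(p):
    """Hypothetical test function with its maximum of 0 at (0.5, -0.25)."""
    return -(p[0] - 0.5) ** 2 - (p[1] + 0.25) ** 2


best_pos, best_val = crow_search(toy_objective, [(-1.0, 1.0), (-1.0, 1.0)])
print(best_pos, best_val)
```

New positions are generated from the memory of the current iteration before any memory is updated, which mirrors the order of steps 4 to 7. Only the flight length and the awareness probability need tuning; the flock size and iteration count mainly trade run time against accuracy.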

2.7 Oil Characterization

To characterize the oil components, the fatty acid contents of the papaya oil were examined using GC–MS [67]. The instrument is composed of a PerkinElmer Clarus® 600 GC coupled with an MSD and a capillary column (30 m long × 0.25 mm ID and thickness of 0.25 μm). The sample (1 µL) was introduced in split mode (40:1) at 240 °C with a He flow of 1.2 mL/min. Prior to injection, the papaya oil was diluted 25 times with pure hexane. The following temperature program was used to separate the fatty acids: 110 °C (4 min), 10 °C/min to 150 °C, then 3.9 °C/min to 230 °C (5 min). The individual components of the oil were identified using the built-in chemical library and quantified by percent area normalization.

2.8 Statistical Significance

Analysis of variance (ANOVA) was used to assess statistical significance. A p-value less than 0.1 denotes significance at the 10% level, while a p-value less than 0.05 denotes significance at the 5% level. All experiments were conducted in duplicate. Minitab (version 18.0) and SPSS (version 17.0) software were used to develop the RSM and GLM models, respectively, while MATLAB v.16 was used for the CSA-based analysis.

3 Results and Discussion

3.1 Development of Regression Models

To find the optimum combination of seed particle size and extraction time for maximizing the yield of papaya seed oil, two predictive models were implemented using RSM with CCD and regression analysis techniques. Table 2 shows the highest values of oil yield obtained for each experimental run. The yields differ between runs, as expected, because the extraction conditions were different for each run (except for Runs 2, 6 and 7). For RSM-based model development, the experimental data were analyzed by multiple regression using Minitab v.18. The predictive model for oil yield in terms of coded variables is given by Eq. (9), where y denotes the yield, T the time and S the seed particle size.

$$ y = 16.27 + 2.508 T - 2.812 S - 2.619 T^{2} + 3.501 S^{2} - 1.99 TS $$
(9)
Table 2 Oil yield (%) obtained in the central composite design (CCD) in duplicate sets of experiments

Table 3 presents the ANOVA data for the second-order response surface model. The F-value for the model was high (F = 10.13) with a very low probability (p = 0.0000), which indicates that the model is highly significant. The significance of each term was assessed by its p-value. In the model, all the linear terms (T and S) and quadratic terms (\( S^{2} \) and \( T^{2} \)) were statistically significant. The interaction effect was not significant, since its p-value was slightly > 0.1. These effects were also confirmed using the normal plot of standardized effects shown in Fig. 1. Generally, terms located far from the line are significant. The figure shows that the terms T, S, \( T^{2} \) and \( S^{2} \) are located far from the line and are therefore significant. The Pareto chart in Fig. 2 shows the significance of the variables even more clearly, with the square of particle size having the largest effect, followed by particle size and extraction time. The interaction effect between particle size and extraction time falls below the reference line at 2.120 indicated in the chart.

Table 3 Analysis of variance (ANOVA) for response surface second-order model
Fig. 1

Normal plot for the standardized effects of seed particle size and extraction time on percent oil yield. The significant terms (deep red) are located far from the diagonal line, while the insignificant term (blue) lies relatively closer to the line

Fig. 2

Pareto chart of the effects of seed particle size and extraction time on percent oil yield. Each bar denotes a specific term. The bars for significant terms cross the vertical red reference line (at 2.120), while the bar for the insignificant term falls short of it, as shown in the chart

Similarly, for GLM development, the experimental data were analyzed by multiple linear regression using SPSS (version 17.0). The model for oil yield in terms of coded parameters can be written as follows:

$$ y = 17.982 + 2.734 {\text{T}} - 3.568 {\text{S}} - 1.600{\text{TS}} $$
(10)

where y, T, S and TS represent the oil yield, time, seed particle size and interaction between time and particle size, respectively.

3.2 Evaluation and Models Comparison

The response surface and generalized linear models were analyzed and compared with each other to find the predictive model that gives the more accurate results. The data in Table 2 indicate that the relative error, RE (%), was lower for RSM than for GLM, indicating better performance of the response surface model. To evaluate the models further, they were compared using two performance criteria: root mean squared error (RMSE) and mean absolute error (MAE). The values of RMSE and MAE are presented in Table 4. Based on RMSE, the RSM model performs better than the GLM, with a performance enhancement of 33.75%. Similarly, based on MAE, the RSM model outperforms the GLM model, with a performance enhancement of 34.9%.
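The two criteria, and one plausible way of expressing the performance enhancement, can be sketched in Python as below. The function names and the numbers in the example are hypothetical and are not the data of Table 2 or Table 4. The enhancement is assumed here to be the relative reduction in error with respect to the weaker model, since the text does not state the formula explicitly.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between paired observations and predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2
                         for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error between paired observations and predictions."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


def enhancement_percent(error_reference, error_candidate):
    """Assumed definition: relative error reduction versus the reference."""
    return 100.0 * (error_reference - error_candidate) / error_reference


# Hypothetical yields (%) for four runs and two competing models.
observed = [10.0, 20.0, 30.0, 25.0]
model_a = [11.0, 19.0, 29.0, 27.0]
model_b = [13.0, 17.0, 33.0, 21.0]

print(rmse(observed, model_a), mae(observed, model_a))
print(enhancement_percent(rmse(observed, model_b), rmse(observed, model_a)))
print(enhancement_percent(mae(observed, model_b), mae(observed, model_a)))
```

RMSE penalizes large individual deviations more strongly than MAE, so reporting both gives a fuller picture of how the two models differ.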

Table 4 Performance comparison between RSM and GLM

Figure 3 shows the parity plots, which relate the predicted and experimental oil yield data. In Fig. 3a (for RSM), the points are scattered around the line with \( R^{2} \) = 76%, indicating that the deviations between the experimental and model-predicted data are comparatively high. In Fig. 3b (for GLM), the \( R^{2} \) value is lower still (41.3%), indicating that the model fails to explain about 60% of the variation in the data. It has been reported that the \( R^{2} \) value in the parity plot for a biological system should not be < 0.75 [68]. Accordingly, although the RSM results are satisfactory, further investigation with a wider range of experimental data is required to attain better predictions.

Fig. 3

Model-predicted versus experimental percent oil yield via response surface methodology (RSM) (a) and generalized linear model (GLM) (b) approaches. The \( R^{2} \) value for RSM is higher than that for GLM

3.3 Effect of Environmental Conditions on Percent Oil Yield

To examine the effect of environmental conditions (e.g., extraction time and seed particle size) on the response (percent oil yield) in RSM, main effect and interaction plots were created, as depicted in Fig. 4. Figure 4a shows that the response changes with the level of each factor, indicating that both parameters are important for papaya oil production. The percent oil yield increases with extraction time and then reaches a plateau. This is expected, since a longer contact time between the solvent and the solute (seed particle) favors diffusional mass transfer. In the case of particle size (S), the percent oil yield (y) is maximum at a very small size (coded value of − 1). A small size means more surface area for the solvent and easier diffusion. When the particle size increases, the solvent takes longer to diffuse through the particle, and as a result the percent yield decreases. Accordingly, the minimum extraction yield was observed at the maximum particle size. The interaction plot (Fig. 4b) visualizes the interaction of the two input variables, extraction time and seed particle size, on the percent oil yield. It shows the trend of percent yield with time for three seed particle sizes, − 1, 0 and 1, which correspond to 1.18, 1.4 and 2 mm, respectively. The figure shows that the two-way interaction effect between the factors (T and S) on the percent oil yield was low, and the blue line (corresponding to the size of 1.18 mm, the smallest of the three) does not intersect the other lines. These observations agree well with the ANOVA results in Table 3 (p-value = 0.11).

Fig. 4

Main effect (a) and interaction (b) plots of the independent parameters (extraction time and seed particle size) on percent oil yield. Both parameters appear important since different levels of each factor affect the response (percent oil yield) differently. The interaction plot shows how the factors influence each other; here, the interaction effect between the factors (T and S) on the output appears to be low

3.4 Crow Search Algorithm (CSA) Coupled with RSM for Global Optimal Solution

In this study, the polynomial model obtained from RSM (Eq. 9) was used as the fitness function to obtain the global optimum solution, since the RSM model was found to be superior. The code was run in MATLAB v.16 using an awareness probability (AP) of 0.1, flight length (fl) of 2, flock size (N) of 100 and a maximum of 150 iterations. A convergence plot of the fitness value against the iteration number was generated, as shown in Fig. 5. The convergence plot reflects the combined effects of all independent factors simultaneously and shows the output approaching a stable value. The non-smooth line at the beginning results from the random generation of positions in the early iterations, which may be far from each other. With further iterations the results get closer, and after 54 iterations the optimum point is reached. The optimum coded combination of S and T is − 1.414 and 1.013, which corresponds to a particle size of 0.85 mm and an extraction time of 6.5 h, respectively, giving a maximum yield of 29.95%.
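As an independent cross-check of the reported optimum, Eq. (9) can simply be evaluated on a fine grid over the coded design region. The Python sketch below does this by brute force; it is not the CSA itself, and the grid resolution and variable names are arbitrary choices made for illustration.

```python
def predicted_yield(t, s):
    """Eq. (9): predicted oil yield (%) for coded time t and coded size s."""
    return (16.27 + 2.508 * t - 2.812 * s
            - 2.619 * t ** 2 + 3.501 * s ** 2 - 1.99 * t * s)


ALPHA = 1.414  # axial limit of the coded design region
STEPS = 400    # hypothetical grid resolution per factor

grid = [-ALPHA + 2 * ALPHA * k / STEPS for k in range(STEPS + 1)]

# Exhaustive search over the coded square [-1.414, 1.414] x [-1.414, 1.414].
best_yield, best_t, best_s = max(
    (predicted_yield(t, s), t, s) for t in grid for s in grid
)
print(round(best_yield, 2), round(best_t, 3), round(best_s, 3))
```

The grid maximum lies on the lower boundary of the particle size (S = − 1.414) at a coded time close to 1, with a predicted yield of about 29.95%, which is consistent with the optimum reported above. Because the coefficient of \( S^{2} \) in Eq. (9) is positive, the maximum with respect to S must lie on a boundary of the design region rather than in its interior.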

Fig. 5

Convergence rate of CSA in finding the optimal solution. This plot reflects the combined effects of all input factors simultaneously and shows a stable output. After 54 iterations, a stable optimum point is achieved

The results obtained from the integrated RSM–CSA approach were compared with those of response surface methodology coupled with a desirability function (RSM–DF). For this purpose, oil yield is treated as a higher-the-better characteristic. When searching for a maximum, the desirability (\( d_{i} \), for the ith targeted output) can be written as follows:

$$ d_{i} = \left\{ {\begin{array}{*{20}l} 0 \hfill &\quad {y_{i} < L} \hfill \\ {\left( {\frac{{y_{i} - L}}{{U - L}}} \right)^{w} } \hfill &\quad {L \le y_{i} \le U} \hfill \\ 1 \hfill &\quad {y_{i} > U} \hfill \\ \end{array} } \right., $$
(11)

where w represents the weight (w = 1 in this study), L and U denote the lower and upper values, respectively, and \( y_{i} \) denotes the ith response. The response is entirely unsatisfactory when d = 0 and ideal when d = 1. To obtain a single optimal point, a response optimizer plot (the integrated RSM–DF approach) was constructed using Minitab 18, as presented in Fig. 6. The particle size (S) and extraction time (T) obtained by the RSM–DF approach are almost the same as those found by the RSM–CSA technique, with a very high desirability value (d = 1), indicating that the RSM–DF-based optimization was highly favorable and thereby supported the global optimal conditions. Since CSA is based on random generation, a small variation is observed between runs. Therefore, 100 separate runs were conducted, and the mean and standard deviation of the results are reported in Table 5. The very low standard deviation of \( 5 \times 10^{-5} \) indicates that the optimal point obtained by CSA is robust and reproducible.
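The higher-the-better desirability transform of Eq. (11) is simple enough to sketch directly in Python. The function name and the lower and upper values used in the example (L = 10% and U = 30%) are hypothetical; the actual limits are set inside the response optimizer and are not given in the text.

```python
def desirability_larger_is_better(y, lower, upper, weight=1.0):
    """Eq. (11): map a response y to a desirability d between 0 and 1."""
    if y < lower:
        return 0.0
    if y > upper:
        return 1.0
    return ((y - lower) / (upper - lower)) ** weight


# Hypothetical limits for oil yield (%), for illustration only.
L_YIELD, U_YIELD = 10.0, 30.0
for y in (5.0, 20.0, 35.0):
    print(y, desirability_larger_is_better(y, L_YIELD, U_YIELD))
```

With w = 1 the desirability rises linearly between L and U; a weight above 1 makes the function stricter near L, while a weight below 1 makes it more lenient.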

Fig. 6

Response optimizer plot for finding the optimum conditions to maximize percent oil yield. Optimum coded values for both parameters are shown in red (above the figure), while the predicted percent oil yield is presented in blue (left side of the figure). The value of the desirability function (d) is also shown on the left side of the figure

Table 5 Statistical result obtained at optimized conditions by CSA (100 runs)

3.5 Validation in Optimal Conditions and Characterization of Extracted Oil

To validate the optimal settings for the oil yield, triplicate experiments were conducted at a seed particle size of 0.85 mm and an extraction time of 6.5 h. The predicted and experimental yields were 29.95 and 31.1%, respectively, with < 5% error, as shown in Table 6. Thus, the optimized conditions obtained by the integrated RSM–CSA platform for predicting the oil yield are robust. The extracted oil was characterized by profiling its fatty acid composition using GC–MS (Supplemental Figure S1), and the quantified values are presented in Table 7. The main fatty acid in the papaya oil was oleic acid (75.0%), followed by myristic (19.6%), palmitic (2.95%) and linoleic (1.85%) acids. The data agree well with previous reports [16, 17]. The high oleic acid content gives the oil sufficient stability for food frying applications, similar to high-oleic vegetable oils on the market such as safflower (77%) and canola (75%). The intake of oils with a high oleic acid content may have nutritional and health benefits, including a reduction in oxidative stress in vivo [69]. In addition, for the production of high-quality biodiesel, biolubricants or beauty products, the amount of fatty acids, especially oleic acid, present in the oil plays a key role [13].
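The stated error bound can be checked with a short calculation. The text does not give the exact definition of the error, so the expression below assumes the relative error is taken with respect to the experimental yield \( y_{\text{exp}} \), with \( y_{\text{pred}} \) denoting the predicted yield:

$$ {\text{Error}}\;(\% ) = \frac{{\left| {y_{\text{exp}} - y_{\text{pred}} } \right|}}{{y_{\text{exp}} }} \times 100 = \frac{{\left| {31.1 - 29.95} \right|}}{31.1} \times 100 \approx 3.7 $$

Taking the predicted yield as the denominator instead gives about 3.8%, so the error is below 5% under either convention.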

Table 6 Predicted versus experimental oil yield data for: (a) RSM and (b) GLM
Table 7 Fatty acids found in papaya oil by GC–MS analysis

4 Conclusions

The study demonstrated the effects of Soxhlet process variables on the percent yield of oil from papaya seed waste. RSM with CCD and GLM were compared as predictive models for the oil yield. The performance of the RSM model (based on RE, \( R^{2} \), MAE and RMSE) was found to be superior to that of the GLM. The ANOVA (in RSM) indicates that both the linear (T and S) and quadratic (\( T^{2} \), \( S^{2} \)) terms are significantly associated with the percent yield of oil (all p-values < 0.05), while the interaction effect between the factors (T and S) was found to be insignificant (p-value = 0.11). To find the global optimal solution for these two parameters, RSM was then integrated with CSA. The optimal extraction time of 6.5 h and particle size of 0.85 mm give a maximum predicted yield of 29.95%. Similar results were obtained by the desirability function-based optimization approach, indicating that the integrated RSM–CSA approach is robust. The predicted optimal conditions were validated experimentally with < 5% error. Finally, the GC–MS results confirmed the oil composition. Overall, CSA integrated with RSM has been applied successfully, to our knowledge for the first time, to an extraction-based factorial optimization problem. This integrated platform could be used in the future as a tool for optimizing the extraction of oil from other non-edible feedstocks such as date seeds and microalgae biomass. Further, it might plausibly be used for other complex engineering processes where factor optimization is required in order to maximize or minimize single or multiple objectives. Some of our current efforts are aligned in these directions.