1 Introduction

An alternative source of energy attracts attention due to the scarcity of energy resources [1]. Energy sources such as wind energy, solar energy, nuclear energy, geothermal energy, hydrogen gas, tidal energy, biomass energy, and biofuels were considered non-depleting sustainable energy sources. Various studies were being carried out on these energy sources to find the efficient extraction of energy. Computational models using fuzzy and artificial intelligence approach were applied in wind energy generation [2,3,4,5].

Among these energy resources, biomass and biofuels are the instantaneous sources of energy compared with other mentioned sources [6]. Biomass feedstock comprises purpose-grown crops, residues of crops, wood, algae, fatty acids, edible plant oils, and wastes from sewage, and food [7]. Choosing appropriate input feedstock and process conditions from the wide range of source of biomass claims a strong knowledge [8].

The thermo-chemical conversion process was the most preferred common technique to derive valuable products from waste biomass. Hydrothermal process, pyrolysis, and hydro-deoxygenation are the most preferred methods used to produce liquid hydrocarbons from waste. The hydrothermal process has three classifications based on the desire of product such as hydrothermal liquefaction (HTL), hydrothermal gasification (HTG), and hydrothermal carbonization (HTC) [9]. The HTL process was preferred to convert wet biomass into bio-oil under moderate temperature (200–380 °C), pressure (5–20 MPa), and time (15–60 min).

Various wet biomasses like microalgae, wood biomass, agriculture waste, sewage sludge, etc. are converted into bio-oil through the HTL process. Biomass undergoes numerous reactions like hydrolysis, oxidation, reduction, depolymerization, dehydration, deoxygenation, and repolymerization [10]. It was hard to predict the route of the mechanism since it undergoes multiple reactions. Apart from multiple reactions, HTL has multiple parameters like biomass composition (carbon, hydrogen, nitrogen, sulfur, oxygen, moisture, ash, etc.), and experimental conditions (temperature, time, pressure, catalyst load, solvent-biomass ratio, etc.) which decide the product quantity and quality (bio-oil, bio-char, calorific value, etc.). So far researchers have used response surface methodology (RSM) to predict the responses between the independent variables on bio-oil yield. Since HTL has multiple parameters, RSM has its own restriction towards the number of independent variables to be compared for better bio-oil yield.

Intelligent computational models bridge the gap between conventional research methodologies and modern methodologies by providing the ability of optimization for improved experimental results [11,12,13,14,15]. Intelligent soft computing models provide high accuracy compared with hard computing systems [16,17,18,19]. Thus, novel computational methods need to be brought into research to predict the accurate HTL response. Though various optimization algorithms [12, 20, 21] are available in forecast and prediction of output, machine learning algorithms–based Artificial Intelligence Systems [14, 17, 22,23,24] would be a suitable alternative.

Machine learning makes the systems learn autonomously from the inputs given to it and learn to improve the efficiency based on the change in parameters. Machine learning algorithms were applied in various disciplines depending upon the applications [25]. Osarogiagbon et al. [26] have applied machine learning algorithms for identifying hazardous events in petroleum industry. Detailed methodologies of machine learning were reported [27] for assessing the performance of building structural designs. Machine learning and data mining combo were applied [28] for the application of manufacturing industries. Another application of machine learning was discussed in [29] emphasizing the significance of machine learning in chemical heuristics. The meta-algorithmic approach was suggested [30] to predict the climatic change in geographical regions and provided a machine learning method for proper urban planning. Numerical analysis and investigation of machine learning optimization techniques were applied in measuring the strength of shear connectors [11, 31,32,33,34,35].

Neuro fuzzy systems (NFS)—a combination of fuzzy logic and neural networks—work based on the rules specified during initialization of the system. NFS were implemented in [36,37,38,39] to find the strength of shears. Since NFS works only based on the rules, topological change makes the algorithm weaker [16]. An artificial neural network (ANN) tries to pretend the biological behavior of the human brain. Applications of ANNs include recognizing patterns of a specific environment [40,41,42,43,44,45], where supervised machine learning algorithms such as random forest and support vector machine (SVM) solve the problems present in accessing huge data clustering and classification [22, 46,47,48,49]. Random forest approach produced improved accuracy in classification and prediction compared with SVM [50]. Random forest approach generates a decision tree for classification and regression from the given training data set. Multivariate random forest (MVRF) creates an enhanced multi-set of classification of nonlinear dataset features from the given training dataset [51]. Retrieving meaningful information from the largely classified multi-dataset requires complex ranking methods for a decision-making system. The rank aggregation strategy computes weightage based on the distances between the sub-queries from the dataset available and results in the rank by combining multiple sub-query results.

In this study, a novel Decision Support System (DSS) was proposed to obtain the optimal bio-oil yield from the given biomass and to optimize the HTL process parameters. The exclusive “Supervised Multivariate Random Forest (SMVRF)” classifier function and weighted rank aggregation (WRA) function were developed using Python programming. The SMVRF function classifies the nonlinear biomass training dataset and stores classified results in the cloud repository for further access. The WRA function assigns weights to the liquefaction features based on the nearest weight distance and results in the aggregated ranks for the features. The Decision Vector Block (DVB) 1 and 2 were created to access data from the cloud repository and act as an interface with SMVRF and WRA functions. Based on the ranks obtained for features, DVB2 suggests the bio-oil yield and HHV.

Using DSS in optimization of process-specific yield and quality of the product obtained is the main novelty involved in this study. Most of the other studies report data trend models based on regression fits. These models were further used to predict responses for any given variable counts. In this study, emphasis was given to important variables based on ranking system. Instead of considering a simple regression system, this study used weighted rank–based decision-making process. When multiparameter-dependent processes like HTL were analyzed using machine algorithm, it is important to consider ranked parameters rather than considering all the variables. This DSS algorithm considers this important point and programs were written accordingly to consider the weights and to take smart decisions.

2 Materials and methods

2.1 Biomass characterization

The biomasses used in this study were cultivated and/or collected within SSN College of Engineering campus, Chennai, Tamil Nadu, India. The collected biomass was dried and ground into a fine powder. The biomass characteristics such as %C, %H, %N, %O, %moisture, and %ash were determined and used as input parameters. The moisture content (ASTM, 2006) and ash content (ASTM, 1995) of collected biomasses were analyzed as per ASTM standards. The elemental analysis (carbon, hydrogen, nitrogen, sulfur, and oxygen) was estimated using an elemental analyzer (Perkin-Elmer 2400 series CHNS analyzer).

2.2 Hydrothermal liquefaction

The collected biomasses were individually liquefied in a hydrothermal reactor (4598, Parr reactor). The reactor is made up of stainless steel with 250-mL capacity and is comprised of an auto temperature controller unit. The leftover residues in the reactor in between each experiment were washed with water at 400 °C for 4 h. Precisely, 15 g of biomass (various feedstock used in this study) was heated at varying temperatures of 200–340 °C for 1 h with 200 mL of solvent (ethanol, water, and acetone). The reactions were performed at the nitrogen atmosphere (5 MPa) and the reactor stirred at 720 rpm. After the HTL experiments, the reactor was allowed to cool down and the gaseous products were collected in airtight bags. Later, dichloromethane (DCM) was used to collect bio-oil from the dark brown slurry obtained. The excess DCM in the bio-oil was removed and bio-oil was concentrated using a rotary evaporator. The bio-oil yield was estimated using Eq. (1).

$$ Bio- oil\ yield\ \left( wt\%\right)=\frac{Mass\ of\ bio- oil\ (g)}{Mass\ of\ biomass\ (g)}\times 100 $$
(1)

2.3 Machine learning methods

Data pertaining to the biomass characteristics, HTL process conditions, and bio-oil yield with its HHV value were collected from previously published reports [52,53,54,55,56,57,58,59,60,61,62]. This study was carried out with three modules (Fig. 1): (i) biomass input and processing module, (ii) liquefaction module, and (iii) yield prediction module. New biomass input features were compared with the training set data stored in the repository with the help of the SMVRF function. Classified results from SMVRF were fed as an input to the liquefaction module through the Decision Vector Block (DVB) 1. WRA for each liquefaction was created and given as input to the DVB2. The predicted optimum bio-oil yield and HHV were displayed and saved in the repository via DVB2.

Fig. 1
figure 1

Workflow diagram of Decision Support System (DSS) used in this study

The training dataset contains features of biomass such as %C, %H, %N, %O, % M, and %ash which were extracted from the “.csv” (comma-separated values) file format and given as an input to the SMVRF function developed in “Python programming”; the “Pandas” library package was used for processing data and the “Numpy” library package was used to compute the features stored as multidimensional array values.

This function generates missing value if any by the nearest neighbor method [63]. The SMVRF function was designed with 6 features of biomass and 5 features of the liquefaction module and creates a subset %C = {C1, C2, C3, ….Ci}, where C1, C2, C3, ….Ci were the integer array variables assigned for %C. Similarly, variables were assigned for other biomass features (%H, %N, %O, %M, and %ash). The liquefaction features have the subset of integer array for T, P, t, and Sb (T, temperature; P, pressure; t, time; Sb-S/B (solvent-biomass ratio)) and label variables for CL and Sol (CL, catalyst; Sol, solvent). Classified feature variables were stored in the cloud repository. This function would assign variables autonomously for additional feature inclusion if necessary to the dataset.

DVB handles the input queries of the test data, transfers the query to the SMVRF subsets, and collects the similarity feature variables from the training dataset. WRA assigns weights for each feature variable in the liquefaction block and the aggregate rank was generated based on the weights of each feature.

In this study, 400 training dataset were created and tested with 20 test datasets. Weights for each test datasets were assigned with WRA, the distance between the WRA in each multivariate subset were calculated using the Euclidean Distance algorithm [64] and cross-validated with the training dataset, and the final results were generated.

3 Results and discussion

3.1 Influence of biomass characteristics on bio-oil yield and higher heating value

Biomass characteristics such as %C, %H, %N, %O, %moisture, and %ash were considered in this study as these are the main influencing factors in the yield and characteristics of the bio-oil. Figure 2 represents the consolidated statistical trend of the input dataset used in this study. Two output responses %Y and HHV were compared for the 6 aforementioned biomass characteristics. It is evident that %C, %H, and %N have a positive influence on %Y and HHV, whereas %O, %moisture, and %ash were indirectly proportional to %Y and HHV. Many previous reports used in this study are strongly supporting this trend. Ulva prolifera had carbon (46.2%), hydrogen (7.4%), nitrogen (3%), and oxygen (43.2%) content and upon liquefaction resulted in 26.7 wt% of bio-oil yield [65]. Liquefaction of castor residue comprised carbon (43.59%), hydrogen (5.56%), nitrogen (4.69%), and oxygen (46.16%) content resulting in 21.2 wt% of bio-oil yield [66]. Carbon and hydrogen are the main elements that form most of the hydrocarbons available in bio-oil. Hence, their presence in a higher proportion of the biomass is always favorable. The higher the internal oxygen, the lower will be the yield and quality of bio-oil, since oxygen contributes to the oxidation of carbon during the HTL process. Similarly, moisture content has adverse effects on bio-oil yield and quality. Higher ash content in the biomass would increase the production of bio-char rather than bio-oil. Hence, the increase in %A decreases %Y and HHV. Liquefaction of less ash containing biomass in presence of catalyst increases the bio-oil yield with less solid residue formation [67].

Fig. 2
figure 2

Consolidated statistical trend of input parameters against %Y and HHV used in this study

3.2 Correlation between variables

Analyzing the correlation between various parameters (both biomass characteristics and liquefaction conditions) and their statistical significance is highly important in deciding the weighted ranks by the WRA function. Figure 3 shows the Pearson correlation matrix between various parameters.

Fig. 3
figure 3

Pearson correlation matrix

The carbon content in the biomass was correlated to %Y (p<0.01) with a coefficient value of 0.468. This shows that %C is the main influencing factor of %Y. Similarly, %O was correlated to %Y (p<0.01) with a coefficient of − 0.526. The negative correlation shows that the increase in %O considerably reduces the bio-oil yield. In HTL process conditions, temperature, catalyst loading, and time were correlated positively with p<0.01 and coefficient values of 0.454, 0.089, and 0.129, respectively. Temperature and %C correlated positively with a coefficient value of 0.716 (p<0.05). This shows that for an increasing amount of %C in biomass, the HTL temperature must be maintained high for higher bio-oil yield.

Furthermore %H also correlated positively (p<0.05) with temperature (coefficient: 0.678). %Y was also positively correlated to %N, catalyst loading, and time (p<0.05) with coefficient values of 0.580, 0.089, and 0.129, respectively. All these trends are similar to the consolidated statistical trend of input dataset and reports published in previous studies. Recently, SundarRajan et al. (2020) liquefied Scenedesmus abundans at a reaction temperature of 350 °C and the resulted bio-oil yield was 35.5 wt%. Similarly, another researcher liquefied Nannochloropsis biomass at a reaction temperature of 250 °C which resulted in 31.4 wt% of bio-oil and 5.3 wt% of bio-char yield [58]. Also, liquefaction of Chlorella vulgaris biomass resulted in 41.1 wt% of bio-oil yield at a reaction temperature of 350 °C.

3.3 Weighted rank aggregation of parameters

In a total number of 11 parameters (6 biomass, 5 liquefaction), the WRA algorithm gave weighted ranks for each aggregate and the final ranking was given as cumulative weightage. As per the ranking system, Rank 1 was given to the most influencing parameter based on bio-oil %Y and HHV. Figure 4 presents the frequency histogram of the parameters ranked (1 to 6) most significant.

Fig. 4
figure 4

Frequency histogram plot of weighted rank aggregation parameters used in this study (1 to 6)

These 6 parameters are statistically highly influential compared to the other 5 parameters. These parameters mostly decide %Y and HHV values. As discussed in the previous sections, %C is highly important with the first weightage rank 373 times in the 400 training datasets. Also, %O and %A ranked 2 and 3 with a frequency of 357 and 358, respectively.

An increase in oxygen content of bio-oil will trigger an increase in viscosity and a decrease in the HHV of bio-oil [59, 68]. The presence of high oxygen content in bio-oil also leads to bio-oil aging as it was stored in metal containers [69]. Oxygen content in bio-oil will initiate condensation and esterification reactions if the bio-oil is stored for a longer duration resulting in water impurity formation and thus reduce the HHV of bio-oil [70]. High carbon content and hydrogen content in bio-oil were due to the presence of hydrocarbons, esters, and aromatic compounds. They provide stability to bio-oil over time and temperature. Few volatile compounds may evaporate but the carbon content of bio-oil will not change much [57]. Biomass with high ash content will lead to a higher amount of bio-char formation [55, 56].

WRA algorithm works based on the Euclidean distance comparison analogy. Each dataset was compared to find the distance between each similar parameter, and if the distance was found to be the same, a similar ranking weightage was given; if not the ranking weightage was calculated based on the nearest distance parameter. This was repeated “n” times based on the iterations and all the 400 training dataset were run in this algorithm “n” times. The results generated after “n” iterations were fed to the DVB2 and stored in the repository.

3.4 Prediction of optimum parameters

The core decision-making part of this system is DVB2. To optimize HTL process conditions, a set of 20 data were given as testing input. At the initial stage of the DSS workflow, the input biomass characteristics of the testing dataset were correlated with derived trends stored in the cloud repository. Based on the comparison, weightage ranks were given to the DVB2. In DVB2, the optimum HTL process parameter data trend was compared with freshly derived trends and residuals were calculated. The derivation was repeated “n” times until the optimum values for a given bio-oil yield were similar to that of an input testing dataset. Once the residual fit was satisfactory, the predicted optimum parameter dataset was stored in the repository as another training dataset. From this optimum HTL parameter dataset, %Y and HHV values were calculated by DSS and given as output.

Figure 5 shows the residual plots derived for 5 numerical parameters of the HTL process. From the figure, it is evident that except catalyst loading, all other parameters fit well with 95% confidence levels. This may be due to the complexity of predicting catalyst loading when a theoretical parameter type of catalyst should be decided arbitrarily in the initial steps. The higher R2 values of the other 4 parameters showed that the DSS prediction of optimum parameters fit well with actual parameters. Also, it is noted that the DSS will predict the optimum HTL parameters from any given biomass characteristic parameter input when bio-oil yield can be fixed by the user.

Fig. 5
figure 5

Comparison of predicted data of HTL process parameters with actual values available in the test data

3.5 Validation of Decision Support System

The HTL process parameters predicted by DSS were validated using laboratory experiments and the results are shown in Table 1. Ten various biomass feedstocks were considered for validation experiments and their biomass characteristics were given as input in DSS. The derived optimum parameters were set in laboratory experiments and %Y and HHV values of obtained bio-oil were determined. Similarly, the DSS also predicted %Y and HHV values of each biomass-derived bio-oil. % accuracy values were calculated by Eq. 2:

$$ \% Accuracy=100-\left(\frac{Actual\%Y\sim Predicted\%Y}{Actual}\times 100\right) $$
(2)
Table 1 Decision Support System (DSS) performance based on experimental validation

From the table, it is evident that (i) HTL temperature increased considerably when biomass had high carbon content. (ii) DSS as selected catalysts and solvents based on training dataset and catalyst loading and S/B ratio were predicted for the selected compound. (iii) %Y decreased for Sargassum tenerrimum due to its high %O (56.2) and %ash (24.5). (iv) %Y was low for Chlorella vulgaris, castor residue, Cypress wood, and pinewood due to their higher moisture content (10.2, 10.8, 11.2, and 18.4, respectively). (v) For Amphiroa fragilissima, the bio-oil yield was reduced due to higher %ash (17.6). (vi) All the predicted results were more than 94% accurate with the actual %Y and HHV.

3.6 Validation of targeted DSS model

A targeted DSS model predicts the optimum HTL parameters for the %Y value required by the user. Five biomass feedstocks (Scenedesmus obliquus, Scenedesmus abundans, Chlorella vulgaris, Sargassum tenerrimum, and Prosopis juliflora) were chosen from the previous validation study. The bio-oil yield target was set to 35%. Experiments were carried out with DSS predicted HTL process parameter values and %Y values were determined. On comparing with the target value, all the results showed a %accuracy of more than 95%. This shows the robustness and suitability of DSS in the optimization of HTL.

4 Conclusion

This study used a machine learning approach–based artificial intelligence tool named as Decision Support System (DSS) to optimize the hydrothermal liquefaction process of various biomass feedstocks. Four hundred dataset were used as training data and 20 were used as testing data. Pearson matrix showed a high correlation between %C and %H with %Y. Ranks were given based on weightage. %C was found out to be the most influential factor. More than 94% of accuracy was seen in the results that were predicted. About 95% accuracy was achieved for targeted optimization. This system is highly suitable for the optimization of HTL process parameters based on biomass characteristics.