Introduction

Global warming, the gradual increase in the Earth’s surface temperature due to the emission of greenhouse gases, has become one of the most pressing issues facing humanity [1, 2]. In recent years, scientists have been exploring various methods to reduce greenhouse gas emissions and slow down the pace of global warming [3,4,5]. One promising area of research is the use of biomass hydrothermal liquefaction (HTL) as a means of converting organic matter into a renewable energy source [6, 7]. Biomass HTL is a process that involves subjecting biomass, such as agricultural and forestry residues, to high temperatures and pressure in the presence of water [8, 9]. The process results in the conversion of the biomass into a liquid bio-oil, which can be used as a substitute for fossil fuels. Biomass HTL has several advantages over other forms of bioenergy production, including its ability to produce a high-quality fuel that is compatible with existing infrastructure [10].

Current research on biomass HTL is focused on improving the efficiency of the process, increasing the yield of bio-oil, and optimizing the quality of the final product. Researchers are also exploring the potential of different types of biomass and developing new catalysts to enhance the reaction. Additionally, there is growing interest in the use of biomass HTL as a means of reducing greenhouse gas emissions from industries such as agriculture and forestry. Biomass HTL thus shows great promise as a sustainable and renewable energy source that could help mitigate the effects of global warming, and continued research is expected to make the technology increasingly efficient and economically viable, paving the way for a cleaner, more sustainable future [11, 12]. However, the complex and non-linear nature of the HTL process makes it challenging to model and optimize using traditional methods.

Machine learning (ML) has emerged as a powerful tool for modeling and optimization of complex processes, and recent research has shown that machine learning can be applied to HTL to improve its efficiency and sustainability [13]. ML models can learn from large datasets of HTL process variables and performance metrics to accurately predict process behavior and optimize operating conditions [14, 15]. Current research in ML for HTL is focused on developing models that can accurately predict HTL process performance and optimize process parameters, such as temperature, pressure, and residence time. This involves developing ML algorithms that can handle the large and complex datasets generated during HTL experiments and simulations and that can account for the non-linear behavior of the HTL process. In addition, ML models can be used to predict the chemical and physical properties of HTL products, which can be used to optimize downstream processing and identify potential applications for the resulting fuels and chemicals. ML can also be used to predict the environmental impacts of the HTL process and to identify strategies for reducing its carbon footprint. The application of ML to HTL modeling is an active area of research and has the potential to significantly improve the efficiency and sustainability of the process. As the technology matures, it is likely that ML will play an increasingly important role in the development and optimization of HTL and other biomass conversion technologies.

Previous studies have been conducted by various researchers on the application of ML in the HTL process. For example, Zhang [16] studied ML prediction and optimization of bio-oil production from HTL of algae. Shafizadeh [17] adopted ML to predict and optimize the HTL of biomass. Katongtung et al. [18] predicted bio-crude yields and higher heating values of products from HTL of wet biomass and wastes. The aforementioned studies share common feature inputs but differ in the number and type of data sets utilized. These studies demonstrate that ML models exhibit high predictive accuracy and can effectively capture the complex relationships among the variables. Moreover, the models can identify various factors that mutually affect each other.

Hence, the objective of this study is to investigate the effectiveness of ML in predicting the bio-crude yield and calorific value of the HTL process, utilizing biological property input features rarely used in previous studies. Specifically, the input features in this study include the cellulose, hemicellulose, and lignin contents, and their impact on the outputs is analyzed using the Shapley value method.

Materials and Methods

Data Collection and Pre-processing

The process of data collection and pre-processing involved several stages. Initially, raw data was collected from the HTL process, and the relevant input features, such as cellulose, hemicellulose, and lignin, were identified and extracted. Subsequently, the data underwent pre-processing procedures to identify and eliminate any inconsistencies, outliers, or missing values. The pre-processed data was then normalized and split into training and testing sets to ensure model accuracy and prevent overfitting. Finally, feature selection techniques were employed to identify the most significant input features, using the Shapley value method, for predicting the bio-crude yield and calorific value.

The dataset utilized in this study consists of lignocellulosic biomass, with a total of 215 data points, comprising both dependent variables (bio-crude yield and calorific value) and their corresponding independent features. The data was collected from existing publications and is further detailed in supplementary material Table S1. Table 1 presents a comprehensive list of 17 input features, derived from feedstock characteristics on a dry basis and from operating conditions. The table provides the name of each feature, along with its associated numerical and statistical values.

Table 1 Set of input features or variables

Machine learning (ML) algorithms often require input data to be standardized and normally distributed for optimal performance. To address this issue in the present study, standardization of all 17 input features was conducted using Yeo-Johnson’s transformation [19], as shown in Eq. 1:

$${\widehat{x}}_{f,g} = \left\{\begin{array}{ll}\dfrac{{({\stackrel{\sim}{x}}_{f,g}+1)}^{\theta }-1}{\theta }, & \mathrm{if}\ \theta \ne 0,\ {\stackrel{\sim}{x}}_{f,g}\ge 0 \\ \ln({\stackrel{\sim}{x}}_{f,g}+1), & \mathrm{if}\ \theta =0,\ {\stackrel{\sim}{x}}_{f,g}\ge 0\\ -\dfrac{{(-{\stackrel{\sim}{x}}_{f,g}+1)}^{2-\theta }-1}{2-\theta }, & \mathrm{if}\ \theta \ne 2,\ {\stackrel{\sim}{x}}_{f,g}<0\\ -\ln(-{\stackrel{\sim}{x}}_{f,g}+1), & \mathrm{if}\ \theta =2,\ {\stackrel{\sim}{x}}_{f,g}<0\end{array}\right.$$
(1)

Here, \(\widehat{x}\) represents the transformed data obtained from the original data \(\stackrel{\sim}{x}\) for a particular feature g, with \(\theta\) being the transformation factor derived from maximum likelihood estimation. The subscript f ranges from 1 to N, representing the individual data points for each feature. Following this transformation, the standardized data was further normalized using Eq. 2:

$$x_{f,g}=\frac{{\widehat x}_{f,g}-{\overline x}_g}{\sigma_g}$$
(2)

Here, x represents the normalized data obtained from \(\widehat{x}\), while \(\stackrel{-}{x}\) and σ represent the mean and standard deviation of each input feature g, respectively [20].
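As an illustrative sketch (not the authors' original code), the transformation and normalization of Eqs. 1 and 2 can be reproduced with scikit-learn's PowerTransformer, which fits the Yeo-Johnson factor by maximum likelihood and standardizes the transformed data; the feature values below are hypothetical placeholders:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical raw feature matrix: rows are data points (f), columns are features (g)
X_raw = np.array([[310.0, 45.2, 22.1],
                  [280.0, 38.7, 25.4],
                  [350.0, 50.1, 18.9]])

# Yeo-Johnson transformation (Eq. 1) with the factor fitted by maximum likelihood,
# followed by zero-mean, unit-variance scaling (Eq. 2)
pt = PowerTransformer(method="yeo-johnson", standardize=True)
X_norm = pt.fit_transform(X_raw)

print(pt.lambdas_)   # fitted transformation factors, one per feature
print(X_norm)
```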

The definition of bio-crude may vary in the literature, leading to different methods of calculating bio-crude yields. In this study, bio-crude is defined as the organic fraction extracted/separated from the aqueous-phase (water-soluble) fraction using organic solvents, such as acetone or dichloromethane. To ensure consistency in calculations, the bio-crude yield (\({\widetilde y}_{\mathrm{BCY}}\)) was recalculated using Eq. 3:

$${\widetilde y}_{\mathrm{BCY}}=m_{\mathrm{BCY}}/M$$
(3)

Here, \(m_{\mathrm{BCY}}\) represents the mass of bio-crude, while M represents the initial mass of feedstock on a dry basis. Additionally, the higher heating value (HHV) of the bio-crude was recalculated using Eq. 4 [18]:

$$\mathrm{HHV}=0.338\mathrm C+1.428(\mathrm H-\mathrm O/8)$$
(4)

The variables C, H, and O represent the carbon, hydrogen, and oxygen contents of the bio-crude (w/w). Subsequently, the two target outputs, \(\widetilde{y}\), were scaled and normalized by their respective maximum values, denoted as \(\acute{y}\), according to Eq. 5:

$$y_{t,h}={\widetilde y}_{t,h}/{\acute{y}}_h$$
(5)

Here, y represents the normalized data of an output h, which refers to the bio-crude yield (BCY) and higher heating value (HHV) of the bio-crude.
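For illustration, the target preparation in Eqs. 3–5 can be written compactly as follows; the numerical values are hypothetical and serve only to show the calculation, not data from this study:

```python
import numpy as np

def biocrude_yield(m_bcy, m_feed_dry):
    """Bio-crude yield (Eq. 3): mass of bio-crude over the dry feedstock mass."""
    return m_bcy / m_feed_dry

def hhv_biocrude(c, h, o):
    """HHV of the bio-crude (Eq. 4) from its C, H, and O contents."""
    return 0.338 * c + 1.428 * (h - o / 8.0)

# Hypothetical targets for two samples
bcy = np.array([biocrude_yield(12.5, 40.0), biocrude_yield(15.0, 42.0)])
hhv = np.array([hhv_biocrude(72.0, 7.5, 18.0), hhv_biocrude(70.0, 7.2, 20.0)])

# Eq. 5: normalize each target by its maximum value
y = np.column_stack([bcy / bcy.max(), hhv / hhv.max()])
```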

ML Model Development

This study utilized advanced ML algorithms, including multilayer perceptron (MLP), kernel ridge regression (KRR), random forest (RF), and extreme gradient boosting (XGB), which were implemented using the scikit-learn library [21] in the Python environment. The XGB implementation was obtained from the open-source code developed by Chen and Guestrin [22]. All code was run on a MacBook with a 1.1 GHz dual-core Intel Core m3 processor. To evaluate the performance of these algorithms on a dataset consisting of 17 input features and 2 target outputs, the k-fold cross-validation method was employed. This involved randomly dividing the entire dataset into k groups or folds, where each fold was used once as a test dataset while the remaining k−1 folds were used as a training dataset; the process was repeated k times so that every fold served as the test set exactly once. In this study, 10 folds (i.e., k = 10) were used to ensure that each fold was large enough to represent the statistical properties of the entire dataset [23]. Because the entire dataset was thereby used for both training and testing, the prediction accuracy of each algorithm was obtained in an unbiased manner, allowing a fair comparison of the algorithms and identification of the one that performed best on the given dataset.
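A minimal sketch of the 10-fold procedure for a single target is given below; `X` and `y` are random placeholders standing in for the pre-processed features and a normalized target (e.g., BCY), and an untuned XGB model is used only for brevity:

```python
import numpy as np
from sklearn.model_selection import KFold
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(215, 17))          # placeholder for the 215 x 17 pre-processed features
y = rng.uniform(0.2, 0.6, size=215)     # placeholder for one normalized target (e.g., BCY)

kf = KFold(n_splits=10, shuffle=True, random_state=42)
fold_rmse = []

for train_idx, test_idx in kf.split(X):
    model = XGBRegressor()                         # hyperparameters are tuned separately
    model.fit(X[train_idx], y[train_idx])          # train on the remaining 9 folds
    y_pred = model.predict(X[test_idx])            # predict on the held-out fold
    fold_rmse.append(np.sqrt(np.mean((y[test_idx] - y_pred) ** 2)))

print(f"Mean RMSE over 10 folds: {np.mean(fold_rmse):.4f}")
```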

To evaluate the accuracy of the model’s predictions, two metrics were used: the coefficient of determination (R2) and the root-mean-square error (RMSE). R2 is a statistical measure used in regression models to determine the proportion of variance in the output parameters that can be explained by the input parameters [24]. On the other hand, RMSE is a measure of the differences between the real values and their corresponding predicted values. In general, a higher R2 and a lower RMSE indicate better predictive performance [25, 26].

The equations for calculating R2 and RMSE are provided below as Eqs. (6) and (7), respectively, following the work of [19]:

$${R}_{a}^{2}=1-\sum _{1}^{n}{{(y}_{a,b}-{f}_{a,b}\left(x\right))}^{2}/\sum _{1}^{n}{{(y}_{a,b}-{\stackrel{-}{y}}_{b})}^{2}$$
(6)
$${\mathrm{RMSE}}_a={(\frac1n\times\sum_1^n{{(y}_{a,b}-f_{a,b}\left(x\right))}^2)}^{1/2}$$
(7)

where y is the observed value, f(x) is the predicted value, ȳ is the mean of the observed values, and n is the number of observations in the dataset. In addition to R2 and RMSE, the normalized root-mean-square error (NRMSE) was also calculated in this study to account for differences in the scale of the test datasets used in each fold. NRMSE is defined as the ratio of RMSE to the mean value of the observed data and is calculated using Eq. 8:

$$\mathrm{NRMSE}=\mathrm{RMSE}/\overset-y$$
(8)

where ȳ is the mean value of the observed data. The use of NRMSE allows for a more accurate comparison of the performance of the different models, as it takes into account the variability in the scale of the test datasets.

The NRMSE formulation also follows [19] and was used in this study as an additional metric for assessing the performance of the models.
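The three metrics can be computed as in the following sketch, using scikit-learn where available; the observed and predicted arrays are illustrative only:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([30.1, 25.4, 41.2, 35.8])   # hypothetical observed values for one fold
y_pred = np.array([28.9, 27.0, 39.5, 36.6])   # hypothetical predicted values

r2 = r2_score(y_true, y_pred)                        # Eq. 6
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # Eq. 7
nrmse = rmse / y_true.mean()                         # Eq. 8: RMSE over the mean observed value

print(f"R2 = {r2:.4f}, RMSE = {rmse:.4f}, NRMSE = {nrmse:.4f}")
```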

Selected ML Algorithm

In this study, four advanced ML algorithms, namely, MLP, KRR, RF, and XGB, were used to analyze the HTL datasets. Each algorithm has its own mathematical principles and parameters, which were optimized using a full-factorial grid search in combination with a nested 10-fold cross-validation method [27, 28]. The optimized hyperparameters of all the algorithms are summarized in supplementary material Table S2.
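A hedged sketch of such a nested grid search is shown below; the parameter grid and the RF estimator are illustrative and do not correspond to the grids reported in Table S2:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(215, 17))          # placeholder features
y = rng.uniform(0.2, 0.6, size=215)     # placeholder target

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}   # illustrative grid

inner_cv = KFold(n_splits=10, shuffle=True, random_state=1)   # hyperparameter selection
outer_cv = KFold(n_splits=10, shuffle=True, random_state=2)   # unbiased performance estimate

search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      cv=inner_cv, scoring="neg_root_mean_squared_error")
nested_scores = cross_val_score(search, X, y, cv=outer_cv,
                                scoring="neg_root_mean_squared_error")
print(f"Nested 10-fold CV RMSE: {-nested_scores.mean():.4f}")
```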

An MLP is a type of fully connected, feedforward artificial neural network (ANN) that has found widespread use in various applications due to its ability to learn complex relationships between input and output data. While the term MLP is sometimes used interchangeably with the more general term feedforward ANN, it more strictly refers to ANNs composed of multiple layers of perceptrons with threshold activation. The MLP architecture consists of at least three layers of nodes, including an input layer, one or more hidden layers, and an output layer. The nodes in each layer, except for the input layer, are modeled as neurons that use a non-linear activation function. The multiple layers and non-linear activation functions employed by MLPs enable them to learn complex and non-linear relationships in data, making them more powerful than linear perceptrons in handling data that is not linearly separable [29].

The backpropagation algorithm is a supervised learning technique that is commonly used to train MLPs. During training, the algorithm adjusts the weights between nodes in the network by iteratively computing the gradient of the loss function with respect to the weights and then updating the weights in the opposite direction of the gradient. This process is repeated until the network converges to a satisfactory solution. MLPs are thus a powerful and widely used class of ANNs that exploit multiple layers and non-linear activation functions to learn complex relationships in data; their ability to handle non-linearly separable data has made them particularly useful in various applications, and the backpropagation algorithm is commonly used to train them [30].
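A minimal scikit-learn configuration of an MLP regressor consistent with this description is sketched below; the layer sizes and settings are illustrative, not the tuned values of Table S2:

```python
from sklearn.neural_network import MLPRegressor

# Illustrative configuration only; the tuned hyperparameters are listed in Table S2
mlp = MLPRegressor(hidden_layer_sizes=(64, 32),   # two hidden layers of neurons
                   activation="relu",             # non-linear activation function
                   solver="adam",                 # gradient-based training (backpropagation)
                   max_iter=2000,
                   random_state=0)
# Usage: mlp.fit(X_train, y_train); y_pred = mlp.predict(X_test)
```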

KRR is a machine learning algorithm that combines the principles of ridge regression and classification with the kernel trick. Specifically, KRR performs linear least squares with L2 norm regularization in the space induced by a given kernel and the input data. The kernel trick allows KRR to implicitly transform the input data into a higher-dimensional feature space, where a linear function can be used to model the relationship between the input features and the output variable. This is achieved by using a kernel function to compute the similarity between pairs of data points in the input space and then projecting these similarities into the higher-dimensional feature space. By using a non-linear kernel function, KRR is able to model non-linear relationships between the input features and output variables. This corresponds to a non-linear function in the original input space. The regularization parameter in KRR controls the balance between model complexity and generalization performance and can be tuned to optimize the model for a specific dataset. KRR is a powerful machine learning algorithm that combines the strengths of ridge regression, classification, and the kernel trick to learn a non-linear function that can model complex relationships between the input features and output variables. The choice of kernel function and regularization parameter can significantly impact the performance of the model and must be carefully tuned to optimize performance on a given dataset [31, 32].
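A corresponding illustrative KRR configuration in scikit-learn is sketched below; the kernel and regularization values are placeholders rather than the tuned hyperparameters:

```python
from sklearn.kernel_ridge import KernelRidge

# Illustrative settings; kernel choice and regularization strength must be tuned (Table S2)
krr = KernelRidge(kernel="rbf",   # non-linear kernel implicitly maps inputs to a feature space
                  alpha=1.0,      # L2 regularization strength
                  gamma=0.1)      # RBF kernel width
# Usage: krr.fit(X_train, y_train); y_pred = krr.predict(X_test)
```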

RF regression is a widely used supervised learning algorithm that leverages an ensemble learning method for regression tasks. Ensemble learning is a powerful technique that involves combining the predictions of multiple machine learning algorithms to improve the accuracy and generalization performance of the resulting model. RF algorithm constructs a collection of decision trees, where each tree is built using a random subset of the input data and a random subset of the input features. During prediction, the algorithm generates predictions from each decision tree and then aggregates them to produce a final prediction that is more accurate and less prone to overfitting than a single decision tree. The benefits of the RF algorithm include its ability to handle high-dimensional and complex data, its resilience to overfitting, and its ability to provide information on feature importance, making it an effective tool for feature selection. Moreover, the RF algorithm can handle both regression and classification tasks [33].

RF is a powerful machine learning algorithm that utilizes an ensemble learning method to combine the predictions of multiple decision trees, enabling it to model complex relationships between input features and output variables with high accuracy and robustness. The algorithm’s ability to handle high-dimensional data, its resilience to overfitting, and its ability to provide information on feature importance make it a valuable tool for a wide range of regression tasks [34].
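An illustrative RF regressor configuration reflecting these ideas is sketched below (placeholder settings, not the tuned values of Table S2):

```python
from sklearn.ensemble import RandomForestRegressor

# Illustrative settings; the tuned values are listed in Table S2
rf = RandomForestRegressor(n_estimators=300,      # number of decision trees in the ensemble
                           max_features="sqrt",   # random subset of features at each split
                           bootstrap=True,        # random (bootstrapped) subset of data per tree
                           random_state=0)
# After fitting, rf.feature_importances_ gives a feature importance ranking
```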

XGB is a popular supervised learning algorithm that uses an ensemble learning method to improve the accuracy of predictions. XGB is an extension of the gradient boosting algorithm, which iteratively trains a sequence of weak models and combines their predictions to make a final prediction. The XGB algorithm incorporates several advanced techniques to improve performance and scalability. It uses a technique called gradient boosting to iteratively add weak models to the ensemble and a loss function to optimize the model during training. The algorithm also employs regularization techniques, such as L1 and L2 regularization, to prevent overfitting. One key feature of XGB is its ability to handle missing data and categorical variables by encoding them in a unique way. This allows the algorithm to handle a wide range of data types and to extract meaningful features from complex datasets [35].

XGB has been used successfully in a variety of applications, including natural language processing, image recognition, and time series forecasting. It is also highly scalable, making it well suited for large-scale datasets. XGB is a powerful machine learning algorithm that uses an ensemble learning method to improve the accuracy of predictions. The algorithm incorporates several advanced techniques, including gradient boosting, regularization, and unique encoding of missing and categorical data. XGB has shown strong performance across a range of applications and is highly scalable, making it a valuable tool for a wide range of machine learning tasks [36].
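An illustrative XGB regressor configuration is sketched below; the settings are placeholders rather than the tuned hyperparameters reported in Table S2:

```python
from xgboost import XGBRegressor

# Illustrative settings; the tuned values are listed in Table S2
xgb_model = XGBRegressor(n_estimators=500,     # boosting rounds (weak trees added sequentially)
                         learning_rate=0.05,   # shrinkage applied to each tree's contribution
                         max_depth=4,          # depth of each weak learner
                         reg_alpha=0.0,        # L1 regularization
                         reg_lambda=1.0,       # L2 regularization
                         random_state=0)
# Missing input values are routed along learned default directions during tree construction
```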

Feature Evaluation and Interpretation

The ML models were optimized by tuning their hyperparameters using a full grid search methodology in conjunction with 10-fold cross-validation. To determine the relative importance of features in predicting the HHV and BCY, the built-in feature-importance function of each algorithm was used, and the features were subsequently ranked based on their significance in accurately predicting the desired outcomes. The Shapley method was then employed to investigate the impact of input features on the output. In the context of machine learning, it measures the relative importance of input features in predicting the output: the Shapley value for each feature represents the average marginal contribution of that feature across all possible feature combinations, capturing its relative importance within the model's decision-making process. Analyzing the Shapley values attributed to each input feature therefore provides insight into the impact and significance of individual features on the model output, which supports informed feature selection, model refinement, and deeper interpretability [37]. Accordingly, in this study, the Shapley method was employed as a tool for exploring and understanding the impact of input features on the output [38].
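A minimal sketch of this analysis with the SHAP library is given below; the data and model are random placeholders, and TreeExplainer is assumed here because the best-performing model (XGB) is tree-based:

```python
import numpy as np
import shap
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(215, 17))           # placeholder for the pre-processed input features
y = rng.uniform(0.2, 0.6, size=215)      # placeholder for a normalized target (e.g., BCY)

model = XGBRegressor().fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # one Shapley value per sample and per feature

# Global importance: mean absolute SHAP value per feature, then rank descending
importance = np.abs(shap_values).mean(axis=0)
ranking = np.argsort(importance)[::-1]
print(ranking[:5])                       # indices of the five most influential features
```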

Results and Discussion

Model Selection and Accuracy

Table 2 presents the predictive accuracy results of the XGB, RF, MLP, and KRR models, each optimized using their respective best hyperparameters. The accuracy of each model was evaluated using appropriate performance metrics, such as NRMSE, RMSE, and R2. The presented results provide insights into the performance of these models in predicting the desired outcome and can help in selecting the best model for a given application.

Table 2 Prediction accuracy in terms of R2, RMSE, and NRMSE from 10-fold cross-validation of XGB, RF, MLP, and KRR algorithms for all three cases of input features

The XGB model exhibited the best overall performance, with an R2 value of 0.8861 for BCY and 0.8286 for HHV, and an RMSE of 1.9936 for BCY and 1.6586 for HHV. In contrast, the RF model achieved an R2 value of 0.8103 for BCY and 0.7049 for HHV, with an RMSE of 2.3986 for BCY and 2.2456 for HHV. The MLP model yielded an R2 value of 0.8103 for BCY and 0.7681 for HHV, with an RMSE of 2.7617 for BCY and 2.4247 for HHV. Finally, the KRR model exhibited the lowest accuracy among the models tested in this study, with an R2 value of 0.7887 for BCY and 0.7681 for HHV and an RMSE of 3.0816 for BCY and 2.0068 for HHV. Overall, the XGB model demonstrated the highest accuracy and the lowest error in this study.

To provide further insight into the prediction performance, Fig. 1 illustrates scatter plots of predicted values versus actual (test) values for both BCY and HHV in the training and testing phases. The black trend line indicates the positions where the predicted values equal the test values, the green band represents a 10% error range, and the blue band denotes a 20% error range. This visualization enables an intuitive comparison of how accurately each model predicts both BCY and HHV of the bio-crude. It is evident that the prediction accuracy was higher in the training phase than in the testing phase. For BCY, the training case had R2 values within 0.98–0.99 and NRMSE between 0.03 and 0.09, while the test case provided an R2 of 0.78–0.88 and an NRMSE of 0.08–0.12. For HHV, the training case offered an R2 within 0.97–0.99 and an NRMSE between 0.00 and 0.05, while the test case had an R2 of 0.70–0.82 and an NRMSE of 0.05–0.08. The XGB algorithm thus demonstrates promising effectiveness in developing predictive models for the yield and HHV of bio-crudes obtained from HTL of lignocellulosic biomass, utilizing input variables derived from feedstock characteristics, such as biological and elemental properties, as well as operating conditions. The level of precision achieved by the XGB model is comparable to that of models specifically developed for certain biomass types or of other ML models applied to different biomass conversion techniques. Therefore, the XGB algorithm is a valuable tool for facilitating model development and optimizing the conversion of lignocellulosic biomass into bio-crude.
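A minimal matplotlib sketch of such a parity plot with 10% and 20% error bands is shown below, using hypothetical values rather than the study's data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical observed (test) and predicted values for one target (e.g., BCY)
y_test = np.array([25.0, 32.5, 40.1, 28.7, 35.2])
y_pred = np.array([26.1, 30.8, 41.5, 27.9, 33.0])

lims = np.array([y_test.min(), y_test.max()])
plt.scatter(y_test, y_pred, label="predictions")
plt.plot(lims, lims, "k-", label="parity (predicted = observed)")   # black trend line
plt.fill_between(lims, 0.9 * lims, 1.1 * lims, color="green", alpha=0.2,
                 label="10% error band")
plt.fill_between(lims, 0.8 * lims, 1.2 * lims, color="blue", alpha=0.1,
                 label="20% error band")
plt.xlabel("Observed BCY")
plt.ylabel("Predicted BCY")
plt.legend()
plt.show()
```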

Fig. 1

Distribution of prediction data against the test dataset in ten-fold cross-validation of both BCY and HHV using four different machine learning models, namely, XGB (a, e), random forest (RF) (b, f), multilayer perceptron (MLP) (c, g), and kernel ridge regression (KRR) (d, h)

SHAP Summary Plot

The SHAP summary plot is a function in the SHAP (SHapley Additive exPlanations) library for visualizing the summary of Shapley values for a set of features in a machine learning model. The plot displays the importance and impact of each feature on the model predictions and how they contribute to the final outcome. The function generates a bee swarm plot in which features are ranked by importance, and the magnitude and direction of the Shapley values are shown using colored points. Positive Shapley values indicate that the corresponding feature increases the prediction, while negative values imply the opposite. The plot can help identify the most relevant features and clarify the relationship between them and the model's output.
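A short, self-contained sketch of producing such a bee swarm plot with shap.summary_plot is given below; the model, data, and feature names are placeholders:

```python
import numpy as np
import shap
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(215, 17))                          # placeholder features
y = rng.uniform(0.2, 0.6, size=215)                     # placeholder target
feature_names = [f"feature_{i}" for i in range(17)]     # placeholder feature names

model = XGBRegressor().fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)

# Bee swarm summary: features ranked by importance; each point is one sample,
# colored by the feature value, with position giving the SHAP value (impact)
shap.summary_plot(shap_values, X, feature_names=feature_names, plot_type="dot")
```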

Fig. 2

Feature importance by SHAP value for a BCY and b HHV

Figure 2 illustrates the order of impact of the input variables on the outcomes. Specifically, Fig. 2a displays the order of effect of the input variables on BCY, highlighting that the input feature with the most significant impact on BCY is T, followed by H, N, Lig, and W. This order of effect is primarily attributed to the characteristics of the feedstock. On the other hand, Fig. 2b exhibits the order of effect of the input variables on HHV; in this case, the feature with the greatest influence on HHV is T, followed by ash, time, Hemi_cel, and Cel. These variables are primarily affected by the operating conditions. The importance of the five features that have the greatest impact on both BCY and HHV is described below, while other, less important features are shown in Supplementary Material Table S3.

Temperature

Temperature is one of the most important variables that affect the BCY and HHV in the HTL process of lignocellulosic biomass. The effects of temperature on the HTL process can be summarized as follows:

Bio-crude yield (BCY): The BCY increases with increasing temperature. This is because the higher temperatures promote the breakdown of complex biomass molecules into simpler compounds, which increases the bio-crude yield. However, excessive heating can also cause degradation of the bio-crude, resulting in a lower yield. Generally, the optimal temperature range for maximum bio-crude yield is between 250 and 350 °C.

Higher heating value (HHV): The HHV of the bio-crude also increases with increasing temperature. This is because the higher temperatures lead to the production of more stable and energy-dense compounds, such as aromatics, which increases the HHV. However, excessive heating can also cause the formation of less stable compounds, resulting in a lower HHV. Generally, the optimal temperature range for maximum HHV is between 300 and 400 °C.

The temperature variables are therefore crucial in the HTL process as they can impact the bio-crude yield and HHV, which are important factors for the economic viability of the process. Proper control of temperature variables can also improve the quality of the bio-crude, making it more suitable for use in a variety of applications, including fuel and chemical production [39,40,41].

Ash Content

Ash content is an important variable to consider in the HTL process of lignocellulosic biomass as it can affect the BCY and HHV in several ways. The effects of ash variables on the HTL process can be summarized as follows:

BCY: The presence of ash in the biomass feedstock can reduce the BCY by interfering with the thermal decomposition of biomass compounds. This is because ash can act as a heat sink, which reduces the temperature in the reaction zone and inhibits the formation of bio-crude. In addition, ash can catalyze unwanted reactions that lead to the formation of tars and other undesirable compounds, further reducing the BCY.

HHV: The presence of ash in the bio-crude can reduce its HHV by diluting the energy content of the bio-crude. This is because ash is mostly composed of inorganic materials that do not contribute to the energy content of the bio-crude. In addition, ash can also cause fouling and corrosion of equipment, which can increase maintenance costs and reduce the overall efficiency of the HTL process.

Therefore, it is important to minimize the ash content in the biomass feedstock to maximize the bio-crude yield and HHV in the HTL process. This can be achieved by using clean biomass feedstocks with low ash content or by pre-treating the biomass to remove or reduce the ash content before HTL. Proper monitoring and control of ash variables can also improve the overall efficiency and economic viability of the HTL process [42].

Time

Reaction time is a significant consideration when undertaking the HTL process of lignocellulosic biomass as it can affect the BCY and HHV in several ways. The effects of time variables on the HTL process can be summarized as follows:

BCY: The BCY increases with increasing reaction time up to a certain point, beyond which it starts to plateau or decrease. This is because the longer reaction time allows for a more complete breakdown of the complex biomass molecules into simpler compounds, which increases the BCY. However, excessive reaction time can also cause degradation of the bio-crude, resulting in a lower yield. Generally, the optimal reaction time for maximum BCY is between 30 and 60 min.

HHV: The HHV of the bio-crude also increases with increasing reaction time up to a certain point beyond which it starts to plateau or decrease. This is because the longer reaction time allows for a more complete conversion of the biomass into energy-dense compounds, such as aromatics, which increases the HHV. However, excessive reaction time can also cause the formation of less stable compounds, resulting in a lower HHV. Generally, the optimal reaction time for maximum HHV is between 30 and 60 min.

The time variables are therefore crucial in the HTL process as they can impact the bio-crude yield and HHV, which are important factors for the economic viability of the process. Proper control of time variables can also improve the quality of the bio-crude, making it more suitable for use in a variety of applications, including fuel and chemical production. In addition, shorter reaction times can also increase the overall efficiency of the HTL process, reducing the capital and operating costs associated with longer reaction times [43].

Hydrogen Content

The variable of hydrogen content is an important consideration in the HTL process of lignocellulosic biomass as it can affect the BCY and HHV in several ways. The effects of hydrogen content variables on the HTL process can be summarized as follows:

BCY: The BCY increases with increasing hydrogen content up to a certain point, beyond which it starts to plateau or decrease. This is because the higher hydrogen content promotes the hydrogenation of the biomass compounds, which increases the BCY. However, excessive hydrogenation can also cause degradation of the bio-crude, resulting in a lower yield. Generally, the optimal hydrogen content for maximum BCY is between 10 and 20 wt%.

HHV: The HHV of the bio-crude also increases with increasing hydrogen content up to a certain point beyond which it starts to plateau or decrease. This is because the higher hydrogen content leads to the production of more stable and energy-dense compounds, such as aromatics, which increases the HHV. However, excessive hydrogenation can also cause the formation of less stable compounds, resulting in a lower HHV. Generally, the optimal hydrogen content for maximum HHV is between 10 and 20 wt%.

The hydrogen content variables are therefore crucial in the HTL process as they can impact the bio-crude yield and HHV, which are important factors for the economic viability of the process. Proper control of hydrogen content variables can also improve the quality of the bio-crude, making it more suitable for use in a variety of applications, including fuel and chemical production. In addition, hydrogenation can also increase the overall efficiency of the HTL process, reducing the capital and operating costs associated with longer reaction times or higher temperatures [44].

Nitrogen Content

The variable of nitrogen content is important to assess in the HTL process of lignocellulosic biomass, as it can affect BCY and HHV in various ways. The effects of nitrogen content variables on the HTL process can be summarized as follows:

BCY: The BCY decreases with increasing nitrogen content. This is because nitrogen compounds are less reactive than other biomass components and can act as catalyst poisons, reducing the effectiveness of the HTL process. Additionally, nitrogen can also lead to the formation of tar-like substances, which can clog the reactor and decrease BCY. Therefore, proper control of nitrogen content is important for maximizing BCY.

HHV: The HHV of the bio-crude decreases with increasing nitrogen content. This is because nitrogen-containing compounds have a lower HHV than other biomass components, such as lignin and cellulose. Therefore, higher nitrogen content in the bio-crude can decrease its overall energy density and decrease its usefulness as a fuel. However, the effect of nitrogen content on HHV can be mitigated by separating nitrogen-rich compounds from the bio-crude.

The nitrogen content variables are therefore crucial in the HTL process as they can impact the BCY and HHV, which are important factors for the economic viability of the process. Proper control of nitrogen content variables can also improve the quality of the bio-crude, making it more suitable for use in a variety of applications, including fuel and chemical production. In addition, minimizing nitrogen content can increase the overall efficiency of the HTL process, reducing the capital and operating costs associated with longer reaction times or higher temperatures [44, 45].

Water

The amount of water used plays a pivotal role in the HTL process of lignocellulosic biomass, and it can have a substantial impact on both BCY and HHV in various ways. The effects of water variables on the HTL process can be summarized as follows:

BCY: The BCY generally increases with an increasing S/BL ratio (the solvent, i.e., water, to biomass loading) up to a certain point, beyond which the yield may begin to decrease. This is because water acts as a solvent and catalyst in the HTL process, facilitating the liquefaction of biomass and promoting the formation of bio-crude. However, excessive water can also lead to reduced reaction rates and increased energy consumption, as well as dilution of the bio-crude, which can reduce its yield. Therefore, an optimal S/BL ratio is important for maximizing bio-crude yield.

HHV: The HHV of the bio-crude generally decreases with increasing the S/BL ratio. This is because increasing the water content can lead to more extensive degradation of the bio-crude, resulting in lower energy density. Additionally, higher S/BL ratios can also increase the concentration of inorganic elements in the bio-crude, which can reduce its HHV. Therefore, smaller S/BL ratios may be advantageous in maintaining the quality of the bio-crude and maximizing its energy density.

The water variables are therefore crucial in the HTL process as they can impact the BCY and HHV, which are important factors for the economic viability of the process. Proper control of water variables can also improve the quality of the bio-crude, making it more suitable for use in a variety of applications, including fuel and chemical production. In addition, an optimal S/BL ratio can lead to higher productivity and lower operating costs, making the process more competitive. However, the optimal S/BL ratio may vary depending on the specific biomass feedstock and process conditions and should be determined through experimental optimization [43, 46].

Cellulose

Cellulose is a major component of lignocellulosic biomass and plays a significant role in determining the BCY and HHV in the HTL process. The effects and importance of cellulose variables on BCY and HHV can be summarized as follows:

BCY: The presence and amount of cellulose in the biomass feedstock can significantly affect the bio-crude yield in the HTL process. Cellulose is a complex polysaccharide that requires hydrolysis and solubilization to be liquefied into bio-crude. Therefore, biomass with a higher cellulose content is expected to have a higher bio-crude yield. However, excessive cellulose content can lead to the formation of solid residues, reducing the yield of bio-crude. Therefore, the optimal cellulose content in the biomass feedstock should be determined experimentally to maximize BCY.

HHV: The heating value of the bio-crude is also influenced by the cellulose content of the biomass feedstock. Cellulose is a high-energy component of biomass, and its presence can increase the energy density of the bio-crude. However, the degree of cellulose degradation and solubilization during the HTL process can also impact the HHV of the bio-crude. Excessive degradation of cellulose can lead to the formation of low-energy compounds, reducing the HHV of the bio-crude. Therefore, the optimal cellulose content and processing conditions should be determined to maximize the HHV of the bio-crude.

In summary, cellulose content is an important variable that can significantly impact the BCY and HHV in the HTL process. Higher cellulose content can increase the BCY and energy density, but excessive cellulose can lead to the formation of solid residues and low-energy compounds. Therefore, the optimal cellulose content and processing conditions should be determined experimentally for each biomass feedstock to achieve the best results [47, 48].

Hemicellulose

Hemicellulose is another major component of lignocellulosic biomass and plays an important role in determining the BCY and HHV in the HTL process. The effects and importance of hemicellulose variables on BCY and HHV can be summarized as follows:

BCY: Hemicellulose is a complex polysaccharide that can be easily hydrolyzed and solubilized during the HTL process, making it an important contributor to BCY. Biomass with a higher hemicellulose content is expected to have a higher BCY due to its ease of solubilization. However, excessive hemicellulose content can lead to the formation of solid residues, reducing the yield of bio-crude. Therefore, the optimal hemicellulose content in the biomass feedstock should be determined experimentally to maximize BCY.

HHV: The heating value of the bio-crude is also influenced by the hemicellulose content of the biomass feedstock. Hemicellulose is a high-energy component of biomass, and its presence can increase the energy density of the bio-crude. However, the degree of hemicellulose degradation and solubilization during the HTL process can also impact the HHV of the bio-crude. Excessive degradation of hemicellulose can lead to the formation of low-energy compounds, reducing the HHV of the bio-crude. Therefore, the optimal hemicellulose content and processing conditions should be determined to maximize the HHV of the bio-crude.

In summary, hemicellulose content is an important variable that can significantly impact the bio-crude yield and HHV in the HTL process. Higher hemicellulose content can increase the BCY and energy density, but excessive hemicellulose can lead to the formation of solid residues and low-energy compounds. Therefore, the optimal hemicellulose content and processing conditions should be determined experimentally for each biomass feedstock to achieve the best results [48].

Lignin

Lignin is a major component of lignocellulosic biomass and plays an important role in determining the BCY and HHV in the HTL process. The effects and importance of lignin variables on BCY and HHV can be summarized as follows:

BCY: Lignin is a complex polymer that is relatively resistant to hydrolysis and solubilization during the HTL process. As a result, lignin content in the biomass feedstock has a negative impact on bio-crude yield. A higher lignin content can lead to the formation of solid residues, reducing the yield of bio-crude. Therefore, it is important to consider the lignin content of the biomass feedstock when selecting the feedstock for HTL process.

HHV: The heating value of the bio-crude is also influenced by the lignin content of the biomass feedstock. Lignin is a high-energy component of biomass, and its presence can increase the energy density of the bio-crude. However, the degree of lignin degradation and solubilization during the HTL process can also impact the HHV of the bio-crude. Excessive degradation of lignin can lead to the formation of low-energy compounds, reducing the HHV of the bio-crude. Therefore, the optimal lignin content and processing conditions should be determined to maximize the HHV of the bio-crude.

In summary, lignin content is an important variable that can significantly impact the bio-crude yield and HHV in the HTL process. Higher lignin content can reduce the bio-crude yield but increase the energy density, while excessive degradation of lignin can lead to low-energy compounds and reduced HHV. Therefore, the optimal lignin content and processing conditions should be determined experimentally for each biomass feedstock to achieve the best results [48].

User Interface

The user interface (UI) serves as the bridge between the user and the product, facilitating interaction. It primarily emphasizes appearance, design, and user-friendliness, aiming to ensure ease of use and avoid complexity. In this research, the UI plays a crucial role in enhancing user convenience when utilizing the developed ML model; users are not required to possess programming knowledge to use the model effectively. Figure 3 displays the UI screen for predicting bio-crude oil yield and higher heating value in the HTL process, into which biomass data can be entered for value prediction.

Fig. 3

User interface of the HTL process

Figure 3 illustrates that in order to predict the values of bio-crude oil yield or higher heating value, it is necessary to input complete data for both operating conditions and elemental properties. For the biological properties, users have the option to enter values selectively within the cellulose, hemicellulose, and lignin groups or input all values for comprehensive predictions.

Limitations of this Study

A limitation of this study is the relatively small number of data points used in the model, which may result in lower accuracy compared with previous research by Djandja et al. [49], who investigated machine learning prediction of bio-oil yield during solvothermal liquefaction of lignocellulosic biowaste. However, in machine learning, results depend on factors such as the number and novelty of input features, as well as on data management and the ratio of training to testing data. Consequently, despite addressing similar topics, machine learning studies can differ considerably.

Another limitation of this study is that the variables influencing the output are described solely in one dimension through the SHAP summary plot. While this method effectively reveals the importance and the positive or negative effects of different variables on the output, the authors recognize the significance of elucidating the relationships among the various variables influencing the output. This could be achieved by employing partial dependence plots or SHAP dependence plots in future studies to comprehensively explain the relationships among the variables affecting the output.

Conclusions

This study aimed to predict the bio-crude oil yields and calorific values resulting from HTL of lignocellulosic biomass using a machine learning approach. Feature selection, employing the Shapley value method, identified significant input features from a dataset comprising 215 data points, 17 input features, and 2 target outputs. The extreme gradient boosting algorithm provided the best performance, followed by random forest and multilayer perceptron, whereas kernel ridge regression exhibited lower accuracy. For the bio-crude yield, temperature was found to have the most significant impact, followed by hydrogen, nitrogen, and lignin contents and the amount of water used. Meanwhile, for the calorific value, temperature also emerged as the most influential feature on model predictions, followed by ash content, reaction time, and hemicellulose and cellulose contents. The analysis of feature effects and interactions proved significant for understanding the HTL system.