1 Introduction

In the modern world, non-renewable energy sources such as fossil fuels are used to meet the world’s energy needs [1]. Fossil fuels, however, have significant drawbacks, including an adverse impact on the environment [2,3]. Rising industrialization and population growth are depleting fossil fuel reserves, and the combustion of these fuels is one of the primary contributors to greenhouse gas emissions and climate change. It is therefore essential to investigate and optimize alternative energy sources, including biodiesel, bioethanol, and hydrogen [4,5,6]. Cellulose and hemicellulose, the polysaccharide fractions of lignocellulosic biomass, play an important role in bioethanol production [7,8]. The conversion of cellulose into glucose and then ethanol requires an efficient hydrolysis process [9], and hence an effective pre-treatment [10]. Pre-treatment alters the structure of lignocellulose by removing hemicellulose and lignin. Several pre-treatment techniques, such as enzymatic hydrolysis, acid hydrolysis, and alkali hydrolysis, have been used [11]. Lignin removal requires high chemical concentrations and temperatures, and the formation of metabolic inhibitory compounds under these conditions is considered a drawback.

Hydrolysis of lignocellulose for bioethanol production can be carried out enzymatically or with acid [12,13,14]. Acid hydrolysis can also serve as a pre-treatment for lignocellulosic materials. Sulfuric acid, hydrochloric acid, and citric acid convert hemicellulose into glucose at higher reaction rates and lower cost. At moderate temperatures of acid hydrolysis, saccharification suffers because of the decomposition of glucose [15]. At high temperatures and low pH, acid hydrolysis requires high-cost equipment. Neutralization of the acid hydrolysate before fermentation is therefore important [16]. Intelligent computerized models bridge the gap between research approaches and the latest methods by bringing optimization capability to advanced experimental results [17,18]. Compared with hard computerized models, intelligent user models offer high accuracy in predicting results [17,19]. A user-friendly computerized model therefore plays an important role in optimizing bioethanol production and predicting glucose and ethanol yields. Machine learning algorithms perform better than other optimization algorithms [20,21]. Food crops, lignocellulosic biomass, and waste are first- and second-generation feedstocks, whereas microalgae constitute the third and fourth generations. For third- and fourth-generation feedstocks, the Artificial Intelligence Decision-Making System (AIDMS) algorithm proposed in this work can be applied to optimize anaerobic digestion process parameters for biogas yield.

Machine learning is a favorable tool for modeling and optimizing bioethanol production. It can process data and make predictions without prior knowledge of the process, which makes it useful for bioethanol production [22,23]. AI models applied to various processes have achieved better results [24]. A machine learning algorithm developed to estimate bioethanol production from the fermentation of glucose gave good results [25]. A machine learning algorithm lets the system analyze the given inputs and change the process conditions to improve efficiency [26]. Machine learning algorithms are applied in several fields, such as the chemical industry, the petroleum industry, manufacturing, and climate studies [17,27,28].

In this work, an Artificial Intelligence Decision-Making System (AIDMS) is proposed to obtain the optimal bioethanol yield from given biomass characteristics and process conditions. The simple linear regression (SLR) and weighted rank order aggregate (WROA) functions were developed in Python. The SLR function categorizes the biomass training dataset, and the categorized results are stored in a cloud repository. The WROA function allocates a weighted rank to the biomass characteristics and process conditions based on distance weights. A judgment block (JB) was created to receive data from the cloud repository; it acts as the interface to the SLR and WROA functions. AIDMS is used to optimize the biomass characteristics, saccharification conditions, and fermentation conditions. Regression analysis has been reported in other research articles; in this study, a rank order was additionally assigned to cellulose, hemicellulose, lignin, and the saccharification and fermentation conditions according to their importance for bioethanol production. The AIDMS algorithm weighs each parameter according to the rank order aggregate and makes its decisions accordingly.

2 Materials and methods

This study complied with relevant institutional, national, and international guidelines and legislation.

2.1 Collection and preparation of biomass

The biomass feedstocks used in this study, namely cotton stalk, wheat stalk, rice straw, banana plant waste, corn cob, olive tree, potato peel waste, and coconut shells, were collected from a farm around SSN College of Engineering, Kalavakkam, Chennai, Tamil Nadu, India (12.7517° N, 80.2033° E). The collected feedstocks were washed, dried in a hot air oven at 40 °C until a constant weight was reached, ground to 70-mesh size, and stored in plastic bags in a dry place until further use. The biomass characteristics and the saccharification and fermentation conditions act as the input. The moisture (%) and ash (%) contents of the feedstocks were tested according to the American Society for Testing and Materials standard (ASTM D 2974-07a), and the cellulose (%), hemicellulose (%), and lignin (%) contents were analyzed using the Neutral Detergent Fiber (NDF) method [29,30].

2.2 Dilute acid hydrolysis

The upper and lower limits of the operational parameters (dilute acid concentration, temperature, and retention time) were fixed based on previous research. For the acid hydrolysis experiments, a constant solid-to-liquid ratio of 5% (wt/vol) was maintained throughout, and the acid concentration, temperature, and retention time were varied over 0.89–5% (vol/vol), 70–120 °C, and 20–120 min, respectively [31]. All experiments were carried out in screw-capped Pyrex bottles, and deionized water was added to the mixture to adjust the solid-to-liquid ratio. After cooling to room temperature, the solid and liquid fractions were separated using a Buchner funnel. The reducing sugar concentration of the hydrolysate (liquid fraction) was determined by a modified dinitrosalicylic acid (DNS) assay [32] using a UV–Vis spectrophotometer (UV1800, Shimadzu, Japan) at 540 nm [33]. The hydrolysate required for fermentation to ethanol was prepared by the acid hydrolysis process described above.

2.3 Fermentation

The hydrolysate obtained by the method described in Section 2.2 was fermented, and its ethanol production potential was analyzed. Fermentation was carried out in 150 mL Erlenmeyer flasks containing 100 mL of medium on a shaker at 150 rev/min. The operating temperature and retention time were varied over 25–35 °C and 24–55 h, respectively. Saccharomyces cerevisiae at 4% (v/v) was used as inoculum (10 mL) with a fermentation medium containing 3 g/L yeast extract, 5 g/L peptone, 2 g/L KH2PO4, 0.2 g/L MgCl2, and 0.2 g/L CaCl2. Samples were withdrawn periodically to estimate the amount of ethanol produced by the spectrophotometric potassium dichromate method [34].

2.4 Dataset pre-processing

Data pre-processing steps are applied to the training dataset.

2.4.1 Cleaning of biomass dataset

This step identifies and corrects errors in the dataset and predicts missing values using the oversampling method.
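
As a rough illustration of the cleaning step, the sketch below fills a missing value by copying it from the most similar complete record, following the nearest-neighbor idea mentioned in Section 2.5. The function name `fill_missing`, the record layout, and the sample values are assumptions made for illustration; they are not taken from the study's code.

```python
import math

def fill_missing(records):
    """Fill None entries from the nearest complete record.

    Distance is Euclidean, computed only over the fields that the
    incomplete record actually has.
    """
    complete = [r for r in records if None not in r]
    filled = []
    for rec in records:
        if None not in rec:
            filled.append(list(rec))
            continue
        known = [i for i, v in enumerate(rec) if v is not None]
        nearest = min(
            complete,
            key=lambda c: math.sqrt(sum((c[i] - rec[i]) ** 2 for i in known)),
        )
        filled.append([nearest[i] if v is None else v for i, v in enumerate(rec)])
    return filled

# Hypothetical rows: cellulose (%), hemicellulose (%), lignin (%)
rows = [
    [40.0, 14.0, 29.0],
    [50.0, 11.0, 12.0],
    [41.0, None, 28.0],   # hemicellulose is missing
]
cleaned = fill_missing(rows)
```

The sketch assumes at least one complete record exists; a production version would need to handle the case where every record has gaps.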

2.4.2 Normalization of dataset

This step splits the data into training and test datasets. By standardizing the range of the independent variables (features) within the dataset, data normalization makes the prediction model more consistent and comparable and produces more reliable output.

2.4.3 Dataset feature scaling

Scaling the features ensures that every characteristic receives equal consideration during the learning process. Without scaling, learning could be dominated by features with larger numerical ranges, leading to skewed results. Scaling removes this bias so that each feature contributes equally to the model predictions. Together, these three pre-processing steps improve the accuracy of the results.
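
The normalization, splitting, and scaling steps can be sketched as follows. This is a minimal illustration that assumes min-max scaling to [0, 1] and a random hold-out split; the study does not state which scaler or split ratio was used, and the function names and sample points are hypothetical.

```python
import random

def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Illustrative points inside the ranges given in Section 2.2
temperature_c = [70, 84, 100, 120]      # saccharification temperature (°C)
acid_conc = [0.89, 2.0, 3.5, 5.0]       # acid concentration (% v/v)

scaled_temp = min_max_scale(temperature_c)
scaled_acid = min_max_scale(acid_conc)
train, test = train_test_split(list(zip(scaled_temp, scaled_acid)),
                               test_fraction=0.25)
```

After scaling, temperature (tens of degrees) and acid concentration (a few percent) both lie between 0 and 1, so neither dominates a distance-based calculation.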

2.5 Artificial intelligence algorithm

The algorithm architecture includes the support vector machine (SVM), a supervised machine learning approach that can be applied to regression and classification problems. The primary goal of SVM is to identify the optimal boundary, known as a hyperplane, for classifying the data. It can handle smaller datasets and can approximate decision boundaries effectively even in the absence of a universal approximation theorem, resulting in good accuracy [35]. An artificial neural network (ANN) was not included in this study because it requires a larger labeled dataset for training, which results in lower accuracy than SVM. SVM is therefore an effective and simple tool for optimizing the dilute acid hydrolysis process. The process parameters temperature and time represent the changes in temperature and time during the process. To optimize the dilute acid hydrolysis process parameters for glucose and ethanol yields, an Artificial Intelligence Decision-Making System (AIDMS) was developed. The data on biomass characteristics, saccharification conditions, and fermentation conditions were collected from various research articles. The system consists of four segments, as shown in Fig. 1: the biomass characteristics segment, the pre-processing segment, the process conditions segment, and the yield segment.

Fig. 1 Artificial Intelligence Decision-Making System (AIDMS) used in the study

Bayesian optimization was used to tune the hyperparameters because it builds a probability model of the objective function instead of evaluating every possible hyperparameter combination, which improves efficiency [36]. The model reduces the number of evaluations required by directing the search toward promising hyperparameters, and it is more efficient than other tuning methods such as random search and grid search [37]. Its main advantages are that it determines the next candidate to evaluate and tunes automatically [38]. The hyperparameters like cellulose (%), hemicellulose (%), lignin (%), acid concentration (%), saccharification–time (min), saccharification–temperature (°C), glucose (g/L), fermentation–time (h), fermentation–temperature (°C), and ethanol (g/L) were tuned using this model. This improves the performance and interpretability of the AIDMS algorithm and reduces overfitting.

Biomass characteristics were compared with the training dataset using an SLR function classifier and stored in a storage block called the cloud repository. The results from the SLR act as the input to the bioethanol production module via JB-1. The WROA for the process conditions is created and acts as the input to JB-2. The optimum predicted glucose and ethanol yields were stored in the storage block, i.e., the cloud repository. The training dataset consists of biomass feedstocks such as rice straw, rice husk, sugarcane bagasse, olive tree, and cotton stalk; it was obtained in the TensorFlow (“.csv”) file format and acts as the input to the SLR function developed in Python [39]. The sqlite3 module is used to operate the database. The pandas library is used to manipulate and analyze the dataset [40]. A nearest-neighbor approach is used to fill in missing data.
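
The storage block can be pictured with a small sqlite3 sketch. An in-memory database stands in for the cloud repository here, and the table name, column names, and helper functions are assumptions for illustration only. The cotton stalk values are those quoted in Section 3.1 [51].

```python
import sqlite3

# In-memory database used as a local stand-in for the cloud repository.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE predictions (
           biomass TEXT,
           cellulose REAL, hemicellulose REAL, lignin REAL,
           glucose_g_per_l REAL
       )"""
)

def store_prediction(conn, biomass, cellulose, hemicellulose, lignin, glucose):
    """Insert one record using a parameterized query."""
    conn.execute(
        "INSERT INTO predictions VALUES (?, ?, ?, ?, ?)",
        (biomass, cellulose, hemicellulose, lignin, glucose),
    )
    conn.commit()

def load_by_biomass(conn, biomass):
    """Return all stored rows for a given biomass name."""
    cur = conn.execute(
        "SELECT cellulose, hemicellulose, lignin, glucose_g_per_l "
        "FROM predictions WHERE biomass = ?",
        (biomass,),
    )
    return cur.fetchall()

store_prediction(conn, "cotton stalk", 40.1, 13.6, 29.4, 29.4)
rows = load_by_biomass(conn, "cotton stalk")
```

Stored predictions can later be read back and appended to the training data, which is the role the text assigns to the repository.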

The SLR function considers three biomass characteristics, cellulose (%), hemicellulose (%), and lignin (%); three process conditions for saccharification, dilute acid concentration (%), temperature (T in °C), and time (t in min); and two process conditions for fermentation, temperature (T in °C) and time (t in h). The SLR function generates an integer subset. For cellulose (%), the subset is C (%) = (C1, C2, …, Ci), where Ci is the cellulose integer characteristic variable. The same kind of integer set is generated for hemicellulose (%) as H (%) = (H1, H2, …, Hi) and for lignin (%) as L (%) = (L1, L2, …, Li). The saccharification and fermentation process conditions also have integer condition variables: S-conc., S-T, and S-t for saccharification (acid concentration, saccharification temperature, and saccharification time), and F-T and F-t for fermentation (fermentation temperature and fermentation time). JB deals with the test dataset. WROA allocates a rank order to the biomass characteristics and process conditions according to their importance in the production of bioethanol. The algorithm generated 250 data points for training, which are detailed in Table A1 (Annexure), and an additional 20 data points for testing. The Euclidean distance was used in the WROA to obtain the weighted rank for the biomass characteristics and process conditions, and the results were validated against the training dataset to obtain the output [41]. The AIDMS algorithm analyzes every parameter in the dataset to predict the glucose and ethanol yields. The data predicted by the AIDMS algorithm are stored in the storage block of the cloud repository as additional training data. During the optimization procedure, N datasets are predicted and stored in the cloud. This indicates that the AIDMS algorithm improves the quality of the dataset.
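
For readers unfamiliar with simple linear regression, the sketch below fits y = intercept + slope · x by ordinary least squares for a single predictor. It is a generic textbook form, not the study's SLR function; the names `fit_slr` and `predict` and the data values are hypothetical.

```python
def fit_slr(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x_new):
    """Evaluate the fitted line at a new predictor value."""
    return intercept + slope * x_new

# Hypothetical data: cellulose (%) versus glucose yield (g/L)
cellulose = [30.0, 35.0, 40.0, 45.0, 50.0]
glucose = [21.0, 24.0, 27.0, 30.0, 33.0]

slope, intercept = fit_slr(cellulose, glucose)
estimate = predict(slope, intercept, 42.0)
```

A positive slope corresponds to the direct relationship between cellulose content and glucose yield that the study reports.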

3 Results and discussion

3.1 Influence of biomass characteristics and process conditions on glucose and ethanol yields

The biomass characteristics cellulose (%), hemicellulose (%), and lignin (%) were considered the most influential parameters for glucose and ethanol yields. Figures 2 and 3 present the statistical analysis of the input data used in the study. Two responses, the glucose and ethanol yields, were compared with the five process conditions S-conc., S-T, S-t, F-T, and F-t. Dilute acid hydrolysis is used in the AIDMS algorithm because of its higher glucose yield. A mineral acid such as H2SO4 can be used to hydrolyze cellulose for ethanol production. Concentrated H2SO4 is 96% by weight in aqueous solution [42]. It provides a high concentration of H+ and is considered a strong catalyst; because of its low water content, it would drive the equilibrium towards the reactants [43]. The benefit of producing ethanol from cellulose is the inexhaustible supply of cellulosic biomass [44]. Acid hydrolysis of cellulose using sulfuric acid has two steps. In the first step, the acid breaks the β-1,4-glycosidic bond, and in the second step, cellulose is converted to glucose [45,46,47]. Acid hydrolysis of hemicellulose also consists of two steps. In the first step, 0.7% H2SO4 was used at 190 °C to hydrolyze the hemicellulose. In the second step, 0.4% H2SO4 was used at 215 °C to obtain the cellulose fraction. Finally, the hydrolyzates are removed before fermentation [48,49].

Fig. 2 Influence of biomass characteristics and saccharification conditions on glucose yield (S-saccharification)

Fig. 3 Influence of biomass characteristics and fermentation conditions on ethanol yield (F-fermentation)

During acid hydrolysis, lignin retains more acid, which makes it difficult to use as a feedstock, so lignin is left as residue [50]. Many of the previous research articles used in this study support this analysis. Cotton stalk biomass with cellulose (40.1%), hemicellulose (13.6%), and lignin (29.4%) gave a glucose yield of 29.4 g/L in bioethanol production [51]. Curcuma longa biomass with cellulose (50%), hemicellulose (11.2%), and lignin (12%) gave a glucose yield of 31 g/L [52]. Cellulose is a polymer of glucose units and is therefore the key component for glucose and ethanol yields. Hence, a higher cellulose content in the biomass is favorable. During the dilute acid hydrolysis-saccharification and fermentation processes, cellulose (%) is directly proportional to the glucose and ethanol yields, whereas hemicellulose (%) and lignin (%) are inversely proportional to them [53]. In the saccharification process, the glucose level reaches its maximum at low temperature (70–84 °C). The literature reports that the glucose yield decreases when temperature and acid concentration increase together [42]. The maximum glucose yield was attained at a time of 20–32 min. In this analysis, temperatures < 120 °C and times < 120 min were used to limit the production of hydroxymethylfurfural (HMF). Below 120 °C and 120 min, HMF production is reported to be less than 5% [31]. In the fermentation analysis, temperature and time are directly correlated with ethanol yield.

3.2 Correlation coefficient parameters

Analyzing the correlation coefficients between the biomass characteristics and process conditions, and their influence on bioethanol production, is important for determining the rank order with the WROA function. The Pearson matrix gives the correlation coefficients by measuring the strength of the relationship between every pair of variables in the biomass dataset. It uses the value of one parameter to predict the value of another. A correlation coefficient is a quantitative measure of both the direction and the strength of the tendency of two variables to vary together, and it can be used to test whether the relationship between them is significant. Correlations may be positive, negative, or absent. Positive correlation values show the importance of the biomass characteristics and the dilute acid saccharification and fermentation process parameters in optimizing bioethanol yields. A negative correlation indicates a detrimental effect on production. The AIDMS algorithm uses the correlation coefficient of every parameter to improve its performance. Figure 4 shows the Pearson correlation coefficients for glucose yield. The cellulose (%) content of the biomass is positively correlated with glucose, with a correlation coefficient of 0.968. This value shows the importance of cellulose in the production of bioethanol. Hemicellulose (%) and lignin (%) are inversely correlated with glucose yield, with coefficients of −0.698 and −0.883 (p > 0.05). The saccharification process conditions of temperature, acid concentration, and time are inversely correlated with glucose yield, with coefficients of −0.844, −0.798, and −0.768 (p > 0.05). As time increases, the glucose yield decreases. Except for cellulose (%), all the biomass characteristics and process conditions have negative correlation values. For saccharification, the coefficient for cellulose (%) ranges from 0.968 to 0.956, hemicellulose (%) from −0.698 to −0.690, lignin (%) from −0.877 to −0.883, acid concentration (%) from −0.794 to −0.798, temperature (°C) from −0.844 to −0.840, and time (min) from −0.768 to −0.772.
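
The Pearson coefficient between two variables x and y is r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²). The sketch below computes it for each feature against a target column. The data are hypothetical and chosen only to show one positive and one negative correlation; they are not the study's dataset.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_with_target(features, target):
    """Map each feature name to its correlation with the target column."""
    return {name: pearson_r(values, target) for name, values in features.items()}

# Hypothetical feature columns and glucose yields (g/L)
features = {
    "cellulose": [30.0, 35.0, 40.0, 45.0, 50.0],
    "lignin": [30.0, 27.0, 22.0, 18.0, 12.0],
}
glucose = [21.0, 24.5, 26.0, 30.0, 33.5]

r_values = correlation_with_target(features, glucose)
```

In this toy data, cellulose correlates positively and lignin negatively with glucose, mirroring the signs reported in Fig. 4.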

Fig. 4 Correlation coefficient of glucose analysis using Pearson

Figure 5 shows that, for the fermentation conditions, cellulose (%), temperature, and time have positive correlation coefficients of 0.904, 0.848, and 0.823, respectively (p < 0.05). Hemicellulose (%) and lignin (%) have negative correlation coefficients of −0.551 and −0.753 (p < 0.05). For fermentation, the coefficient for cellulose (%) ranges from 0.904 to 0.905, hemicellulose (%) from −0.551 to −0.557, lignin (%) from −0.753 to −0.758, temperature (°C) from 0.845 to 0.848, and time (h) from 0.823 to 0.825.

Fig. 5 Correlation coefficient of ethanol yield analysis using Pearson

3.3 Weighted rank order aggregate (WROA)

Considering three biomass characteristics, three saccharification process conditions, and two fermentation process conditions, a total of 11 parameters (six in the glucose analysis and five in the ethanol analysis) were given a rank order using the weighted rank order aggregate function. Based on the WROA, the first rank was given to cellulose (%) in both the glucose and ethanol yield analyses. Figure 6 shows the rank order for the biomass characteristics and saccharification conditions. Cellulose (%) is the most important factor, taking the first rank 232 times in the 250-point training dataset. Temperature was ranked second, with a weightage of 227 times. The third rank was given to acid concentration (%), with a weightage of 224 times. High cellulose content increases the glucose yield and provides a good yield at lower temperature and time. Figure 7 shows the WROA for the biomass characteristics and fermentation conditions. Cellulose (%) took first place with a weightage of 229 times. Second and third place were given to temperature and time, with weightages of 220 and 219 times, respectively. The Euclidean distance method was used in the AIDMS algorithm to calculate the distance between similar parameter data. A weighted rank was given to data points having the same distance. Where the distance between similar data varied, the weighted rank was assigned according to the nearest-neighbor distance. This process was repeated for a total of 250 dataset runs in the WROA algorithm. After these trials, the outputs were fed to JB-2 and finally stored in the storage block of the cloud repository.
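
The aggregation step, counting how often each parameter takes a given rank across many runs, can be sketched as below. This is a Borda-style positional tally chosen for illustration, not the authors' WROA implementation: the distance-based ranking within each run is not reproduced, and the function name, weights, and the three example runs are assumptions.

```python
from collections import Counter, defaultdict

def aggregate_ranks(rankings):
    """Aggregate per-run rankings.

    rankings: list of lists, each ordering parameter names from most to
    least important for one run.
    Returns (first_place_counts, weighted_scores).
    """
    first_place = Counter(r[0] for r in rankings)
    scores = defaultdict(int)
    for r in rankings:
        n = len(r)
        for position, name in enumerate(r):
            scores[name] += n - position      # top rank gets the largest weight
    return first_place, dict(scores)

# Hypothetical rankings from three runs
runs = [
    ["cellulose", "temperature", "acid_conc"],
    ["cellulose", "acid_conc", "temperature"],
    ["temperature", "cellulose", "acid_conc"],
]

first_place, scores = aggregate_ranks(runs)
overall = sorted(scores, key=scores.get, reverse=True)
```

The first-place counter plays the role of the "ranked first N times" frequencies plotted in Figs. 6 and 7, while the weighted scores give one overall ordering.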

Fig. 6 Frequency histogram plot of WROA for glucose yield

Fig. 7 Frequency histogram plot of WROA for ethanol yield

3.4 Prediction of biomass characteristics and process conditions

The JB-2 block makes the final decision. To optimize the bioethanol production conditions, 20 data points were supplied as testing data. At the start of the AIDMS workflow, the testing dataset containing the biomass characteristics is given as input, and the correlation values obtained are stored in the storage block (SB) of the cloud repository. Based on their similarity, the rank order is passed to JB-2. In JB-2, the optimum process conditions are compared with the stored dataset to calculate the residuals. This process is repeated until the best result is attained for the given yield, which should be close to the test data. The predicted data are stored in the SB of the cloud repository as additional training data, so the residual fit is suitable. The glucose and ethanol yields for bioethanol production can thus be calculated using the AIDMS algorithm.

Figure 8 shows the residual plot of biomass characteristics and process conditions for bioethanol yield. The figure shows that all parameters fit well, with an accuracy of > 94%. The AIDMS algorithm gives a high R2 value and fits the experimental data well. The algorithm predicts the process conditions from the biomass name and characteristics when the glucose yield is fixed. To evaluate the accuracy of the AIDMS algorithm, coefficient of determination (R2) values were obtained by comparing the 20-point test dataset with the AIDMS-predicted dataset. A cross-validation step is used to evaluate the performance of the AIDMS model; it checks the robustness and generalization to new data of the model chosen for deployment. The AIDMS algorithm predicts ethanol production effectively across different feedstocks because it reviews the dataset repeatedly until it attains the desired output. Comparing the AIDMS-predicted dataset with the experimental dataset gave 94% accuracy. The AIDMS algorithm can learn from several feedstock compositions and generalizes well to novel feedstock types because it captures fundamental patterns. Because the process parameter conditions vary, the algorithm can learn the relationship between the conditions and the bioethanol yield across various setups. With regard to scaling, the algorithm can adapt to both industrial-scale and lab-scale experiments by improving its accuracy during validation.
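
The coefficient of determination compares the residual sum of squares with the total sum of squares, R2 = 1 − SS_res / SS_tot. The sketch below shows the standard calculation on hypothetical actual and predicted yields; the values are illustrative and do not come from the test dataset.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def residuals(actual, predicted):
    """Per-point differences between actual and predicted values."""
    return [a - p for a, p in zip(actual, predicted)]

# Hypothetical glucose yields (g/L)
actual = [20.0, 25.0, 30.0, 35.0]
predicted = [21.0, 24.0, 31.0, 35.0]

r2 = r_squared(actual, predicted)
res = residuals(actual, predicted)
```

An R2 close to 1 means the predictions explain most of the variation in the measured yields, and the residual list is what a residual plot such as Fig. 8 displays.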

Fig. 8 Comparison of predicted data of acid hydrolysis-saccharification and fermentation conditions from the actual data available in the test dataset

Real-world industrial setting applications include the following:

1. Optimization of dilute acid hydrolysis process parameters to obtain glucose and ethanol yields by using the AIDMS algorithm.

2. Maintenance of the predicted dataset in the cloud repository.

3. Improvement of efficiency by analyzing the process parameter dataset.

Potential impact on bioethanol efficiency includes the following:

1. Maximum glucose and ethanol yields through optimization of the saccharification and fermentation process parameters.

2. Reduced cost, because the AIDMS algorithm can be used instead of high-cost equipment.

3. Adaptability of the AIDMS algorithm to any feedstock.

The limitations of this study are data collection, data quality, and interpretation of results. Data collection takes time and resources, which may affect the validity and scope of the findings. Data quality can be improved by supplementing the training dataset with additional copies of the minority classes; this can be repeated to improve the accuracy of the results. Difficulties in interpreting the results can be reduced by using SVM for classification and regression, which improves the predictions of the AIDMS algorithm.

3.5 Validation of AIDMS

The biomass characteristics and process conditions predicted by the AIDMS algorithm were validated against the experimental results shown in Table 1. Five biomass feedstocks were used in the validation experiments, and their biomass characteristics and process conditions were given as input to AIDMS. The process conditions were set in the experiments, and the glucose and ethanol yields were determined. The AIDMS likewise predicted the glucose and ethanol yields for the five biomass feedstocks. The accuracy (%) was calculated from Eq. 1:

$$\mathrm{Accuracy}\,(\%)=100-\left(\frac{\text{Experimental data}-\text{Predicted data from AIDMS}}{\text{Experimental data}}\right)\times 100$$
(1)
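
Eq. 1 translates directly into a short function. The sketch below follows the signed form of the equation as written; the function name and the sample yields are hypothetical.

```python
def accuracy_percent(experimental, predicted):
    """Accuracy (%) as defined in Eq. 1 (signed difference, as written)."""
    return 100 - (experimental - predicted) / experimental * 100

# Hypothetical glucose yields (g/L): measured 30.0, predicted 28.5
acc = accuracy_percent(30.0, 28.5)
```

Because the difference is signed, a prediction above the measured value gives a result over 100%. Taking the absolute value of the difference would avoid this, but that variant is an assumption and is not part of Eq. 1.
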
Table 1 Performance of Artificial Intelligence Decision-Making System (AIDMS) based on experimental validation

From Table 1, it is shown that

I. When the saccharification temperature increased, the glucose yield decreased.

II. The AIDMS algorithm uses biomass characteristics as the training dataset, and the process conditions were predicted for the selected biomass.

III. With cellulose at 44.95%, hemicellulose at 55.31%, lignin at 22.71%, and an acid concentration of 5%, rice straw waste biomass yielded low glucose and ethanol because of its high hemicellulose, lignin, and acid concentration.

IV. Glucose and ethanol yields were low for banana plant waste biomass because of its low cellulose content (30.13%) and high hemicellulose (42.01%), lignin (27%), and acid concentration (5%). As the saccharification time increases, the glucose yield decreases.

3.6 Validation of targeted AIDMS

A targeted AIDMS model predicts the biomass characteristics and process conditions for glucose and ethanol yields specified by the user. Biomass feedstocks such as cotton stalk, wheat straw, olive tree, potato peel waste, rice straw, and sugarcane bagasse were selected from various studies to verify the algorithm. The targeted glucose yield was set in order to determine the process conditions and ethanol yield. Compared against the targeted glucose yield, the results obtained were > 94% accurate. The training dataset of 250 acid hydrolysis biomass records contains the collected data as well as the data predicted by the AIDMS model, so the accuracy on the training dataset is higher than on the test dataset [54]. The accuracy obtained on the training dataset is 96%, compared with 94% on the test dataset. This indicates that the Artificial Intelligence Decision-Making System performs well in optimizing the dilute acid hydrolysis process.
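
One simple way to picture the targeted mode is a lookup that returns the stored record whose glucose yield is closest to the user's target for a given biomass. This nearest-match lookup is an assumption standing in for the AIDMS decision logic, and the record fields and values below are hypothetical.

```python
def closest_record(records, biomass, target_glucose):
    """Return the stored record for `biomass` whose glucose yield is
    nearest to the target, or None if the biomass is not in the store."""
    candidates = [r for r in records if r["biomass"] == biomass]
    if not candidates:
        return None
    return min(candidates, key=lambda r: abs(r["glucose"] - target_glucose))

# Hypothetical stored records (acid conc. in % v/v, temperature in °C,
# time in min, glucose in g/L)
records = [
    {"biomass": "cotton stalk", "acid_conc": 1.0, "temp": 75, "time": 25, "glucose": 29.0},
    {"biomass": "cotton stalk", "acid_conc": 3.0, "temp": 100, "time": 60, "glucose": 24.0},
    {"biomass": "rice straw", "acid_conc": 5.0, "temp": 120, "time": 90, "glucose": 18.0},
]

match = closest_record(records, "cotton stalk", target_glucose=28.0)
```

The process conditions of the matched record would then serve as the suggested settings for the targeted yield.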

4 Conclusion

This study uses machine learning to develop an algorithm called the Artificial Intelligence Decision-Making System (AIDMS) to optimize the production of bioethanol from different biomass feedstocks. A total of 250 data points act as training data and 20 as test data. The Pearson correlation coefficient matrix shows how the biomass characteristics and process conditions correlate with the glucose and ethanol yields. A weighted rank order aggregate was used to find the most important factors in determining the yield. The accuracy (%) was determined from the actual values and the values predicted by the AIDMS algorithm and was found to be more than 94%. An accuracy of 97% was attained for cotton stalk and banana plant waste, and 96% for wheat straw. This artificial intelligence-based system plays an important role in optimizing bioethanol production. In the future, a master data management system for bioethanol production can be developed with the AIDMS algorithm for implementation in industry. The algorithm can also be applied to optimize hydrothermal liquefaction and hydrothermal gasification.