Introduction

In modern analytical chemistry, some instrumental techniques, such as chromatographic, spectroscopic, and voltammetric techniques, can produce multivariate data (containing many variables) for each sample analyzed. Therefore, chemometric techniques, such as principal component analysis (PCA), principal component regression (PCR), and partial least squares regression (PLSR) [1, 2], are commonly applied for multivariate data analysis. However, not all the variables generated by an instrumental technique are important or correlated to the parameter of interest, so it can be useful to reduce the number of variables by selecting the most relevant ones or eliminating irrelevant, noisy, or unreliable ones. Variable selection can, for instance, improve the performance of multivariate calibration or classification models (giving better predictions), generate models that are more easily understood, and help to simplify the analytical instrumentation, reducing its cost [3, 4].

Several strategies for variable selection have been used in chemical analyses, such as forward selection, backward elimination, stepwise regression, Jack-knife PLSR (JK-PLSR), interval PLSR, and genetic algorithms. In forward selection, the variables are selected sequentially, one by one, based on the prediction performance of the resulting calibration model. First, all variables are evaluated singly and the variable that results in the lowest prediction error is selected. Next, all combinations of two variables containing the first selected variable are evaluated, and the combination that gives the lowest prediction error is selected. This process continues until adding new variables no longer decreases the prediction error. The advantages of this approach are that it is fast and it is not based on the global model [5, 6].
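The classical forward selection loop described above can be sketched as follows. This is only an illustration under stated assumptions: the prediction error is taken to be a leave-one-out RMSE of an ordinary least-squares model, and the names `loo_rmse` and `forward_selection`, as well as the synthetic data, are hypothetical and not taken from the cited works.

```python
import numpy as np


def loo_rmse(X, y):
    """Leave-one-out RMSE of an ordinary least-squares fit with intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        errors[i] = y[i] - A[i] @ coef
    return float(np.sqrt(np.mean(errors ** 2)))


def forward_selection(X, y):
    """Add one variable at a time while the cross-validated error decreases."""
    selected, best = [], np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        # Evaluate every remaining variable together with those already selected
        score, j = min((loo_rmse(X[:, selected + [j]], y), j) for j in remaining)
        if score >= best:  # stop when no candidate lowers the error
            break
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected, best


# Synthetic demonstration: only columns 1 and 4 carry information about y
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))
y = 2.0 * X[:, 1] - 1.0 * X[:, 4] + 0.05 * rng.normal(size=30)
selected, rmsecv = forward_selection(X, y)
print(selected, round(rmsecv, 3))
```

Each pass refits the model once per remaining variable, so the cost grows with the number of variables times the number of selection steps.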

Laser-induced breakdown spectroscopy (LIBS) is a relatively new analytical technique for optical emission spectroscopy. It is based on the application of laser pulses on a limited region of the sample, with the resulting ablation and/or excitation of a small amount of material and the formation of a transient plasma. An appropriate detection system is then used to analyze the emitted radiation, which can be correlated to the chemical composition of the sample [7–10]. A LIBS instrument is usually composed of a laser source, an optical system to drive and focus the laser radiation and to collect the plasma radiation, and a detection system (wavelength selector coupled to a detector) [7–14]. Actively Q-switched Nd:YAG lasers [15, 16] and echelle polychromators coupled to intensified, gated, and cooled charge-coupled device (CCD) cameras [17, 18] have been frequently used in LIBS. However, instruments having microchip lasers [19, 20] and conventional grating polychromators with non-intensified linear sensor arrays [21, 22] have been alternatively used in LIBS in order to reduce instrumentation costs and dimensions.

The metallurgical industry is one of the most important fields of LIBS applications [9, 10], with several studies on different types of alloys such as brass [23], gold [24], copper [25], aluminum [26], and, mainly, steel [21, 22, 27]. Some elements added to steel, such as manganese, play an important role in improving its mechanical and chemical properties. Therefore, a rapid and precise analytical method for determining the content of these elements in steel is desirable, allowing better control and monitoring of the steel-making process.

This work presents a new method for forward variable selection and calibration and its evaluation for manganese determination in steel by LIBS, using a compact and low-cost instrumentation and different integration times for spectra acquisition. The results from the new method were compared to those obtained with JK-PLSR.

Materials and methods

Instrumentation

A compact and low-cost LIBS analyzer was employed in this work, which has, among other optical components, a Standa STA-01-8 microchip laser (1053 nm wavelength, 600 μJ pulse energy, 470 ps pulse duration, and 100 Hz pulse repetition rate), a sample holder coupled to a Standa 8MT30-50 translation stage, and a B&W Tek Exemplar LS mini-spectrometer with a classical Czerny-Turner polychromator and a 2048 pixel non-gated, non-intensified, and non-cooled CCD sensor array (200 to 850 nm spectral range, 1.2 nm resolution). More details about the instrument were given in a previous work [28].

Samples

Sixty steel samples were analyzed using the LIBS instrument. The manganese contents in the samples were previously determined by inductively coupled plasma optical emission spectroscopy, with results ranging from 0.106 to 1.696 wt%.

Experimental procedure for LIBS

The samples were pretreated by polishing to remove any surface contamination. Each sample was analyzed using three different integration times (no delay time) and four replicates for each integration time. For each replicate, a single spectrum was obtained by firing laser pulses (at 100 Hz) on a different region of the sample surface, while the emitted radiation was integrated for 80, 400, or 1000 ms (integration of about 8, 40, or 100 plasmas, respectively). The translation stage was continuously displaced at 2.5 mm s⁻¹ during the spectra acquisition to avoid sample perforation and plasma extinction [29]. The four replicate spectra were averaged, so that one average spectrum represented each sample at each integration time (60 spectra for each integration time).
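The acquisition arithmetic above can be checked with a few lines of code. The sketch below uses only the repetition rate, stage speed, and integration times given in the text. The travelled distance per spectrum is derived here for illustration and is not a figure reported in the paper, and the function name and the random placeholder array are hypothetical.

```python
import numpy as np

REP_RATE_HZ = 100.0      # laser pulse repetition rate given in the text
STAGE_SPEED_MM_S = 2.5   # translation stage speed given in the text


def acquisition_summary(integration_ms):
    """Return (laser pulses integrated, distance travelled in mm) for one spectrum."""
    seconds = integration_ms / 1000.0
    return REP_RATE_HZ * seconds, STAGE_SPEED_MM_S * seconds


for t_ms in (80, 400, 1000):
    pulses, track_mm = acquisition_summary(t_ms)
    print(f"{t_ms} ms: about {pulses:.0f} plasmas over {track_mm:.1f} mm")

# Averaging the replicates: placeholder data with shape (samples, replicates, pixels)
replicates = np.random.default_rng(0).random((60, 4, 2048))
mean_spectra = replicates.mean(axis=1)  # one averaged spectrum per sample
print(mean_spectra.shape)
```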

Data analysis

A software program was developed to carry out the new method for forward variable selection and calibration, written in Microsoft Visual Basic 2008 Express Edition. The software execution accomplished the following steps:

  (i) Linear or quadratic regressions of the dependent variable (the manganese concentration, Y) against every independent variable (the LIBS spectra, X), registering all independent variables with coefficients of determination (R²) higher than or equal to 0.9

  (ii) Linear or quadratic regressions of the dependent variable against all possible combinations of two independent variables, in the form of (X_a/X_b), registering all combinations with R² higher than or equal to 0.9

  (iii) Linear or quadratic regressions of the dependent variable against all possible combinations of two independent variables with wavelength differences (Δλ) of up to 20 nm, in the form of (X_a − X_b), registering all combinations with R² higher than or equal to 0.9

  (iv) Sorting the results from the first three steps in descending order of R²

  (v) Linear or quadratic regressions of the dependent variable against sequential combinations (sum) of the independent variables from the fourth step, registering the root mean square error of cross-validation (RMSECV) and trying to minimize it
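Steps (i) to (iv) can be sketched in Python as shown below. This is not the authors' Visual Basic program, only a minimal illustration. The function names, the synthetic four-channel data, and the wavelength values are hypothetical, and the reading of "positive slope" as a positive derivative at both ends of the observed data range is an assumption.

```python
import numpy as np
from itertools import permutations


def constrained_r2(x, y, degree):
    """R^2 of a polynomial fit of y on x, or None if a shape constraint fails."""
    coef = np.polyfit(x, y, degree)
    # Assumed reading of "positive slope": derivative > 0 at both ends of the data
    end_slopes = np.polyval(np.polyder(coef), [x.min(), x.max()])
    if np.any(end_slopes <= 0):
        return None
    if degree == 2 and coef[0] >= 0:  # a quadratic must be concave down
        return None
    resid = y - np.polyval(coef, x)
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def candidate_features(X, wavelengths, max_dl):
    """Yield (label, feature) pairs for steps (i), (ii), and (iii)."""
    n_var = X.shape[1]
    for a in range(n_var):
        yield ("single", a), X[:, a]                      # step (i)
    for a, b in permutations(range(n_var), 2):
        if np.all(X[:, b] != 0):
            yield ("ratio", a, b), X[:, a] / X[:, b]      # step (ii)
        if abs(wavelengths[a] - wavelengths[b]) <= max_dl:
            yield ("diff", a, b), X[:, a] - X[:, b]       # step (iii)


def screen(X, y, wavelengths, degree=1, r2_min=0.9, max_dl=20.0):
    """Return (R^2, label, feature) tuples sorted by descending R^2 (step iv)."""
    kept = []
    for label, feature in candidate_features(X, wavelengths, max_dl):
        r2 = constrained_r2(feature, y, degree)
        if r2 is not None and r2 >= r2_min:
            kept.append((r2, label, feature))
    kept.sort(key=lambda item: item[0], reverse=True)
    return kept


# Synthetic demonstration with four channels
rng = np.random.default_rng(1)
n = 40
y = rng.uniform(0.1, 1.7, n)       # stand-in for reference concentrations
gain = rng.uniform(0.6, 1.4, n)    # shot-to-shot signal fluctuation
X = np.column_stack([
    rng.uniform(0.5, 1.5, n),      # unrelated channel
    gain * y,                      # analyte-like channel
    gain,                          # internal-standard-like channel
    rng.uniform(0.5, 1.5, n),      # unrelated channel
])
wavelengths = np.array([280.0, 293.0, 298.0, 400.0])
ranked = screen(X, y, wavelengths)
best_r2, best_label, _ = ranked[0]
print(best_label, round(best_r2, 4))
```

In this toy data set the ratio of the analyte-like channel to the internal-standard-like channel cancels the common fluctuation, which is the effect the division in step (ii) is meant to capture. With 2048 pixels, the pair loop covers about four million ordered pairs, so a practical implementation would need to vectorize it.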

The regressions were limited to positive slopes (linear and quadratic) and downward concavity (quadratic). The use of quadratic regressions is justifiable, in this case, due to the relatively high manganese concentrations in the samples. In the second and third steps, all possible combinations are tested, independently of the results from the first step. In the first three steps, a lower limit of 0.9 for R² was used to avoid overfitting in the fifth step. In the second and third steps, subtraction and division of variables were carried out to simulate some operations usually applied to spectroscopic data, such as baseline subtraction, normalization, and internal standardization. In the third step, the maximum difference between the wavelengths of the combined variables was set to 20 nm in order to limit the subtraction to variables close to each other (in terms of wavelength), since atomic spectra usually present narrow peaks and a baseline subtraction is normally made using a baseline point close to the peak. In the fifth step, the software calculates the RMSECV for the first independent variable or combination of variables from the fourth step. Then, it combines (sums) the first and second independent variables or combinations of variables from the fourth step and calculates the new RMSECV. If the RMSECV decreases, the new combination of variables is kept. If not, the second variable or combination of variables is discarded. Then, the third independent variable or combination of variables from the fourth step is added to the combination and the new RMSECV is calculated. Again, if the RMSECV decreases, the new combination of variables is kept; if not, the third variable or combination of variables is discarded, and so on. In the end, the combination of variables with the lowest RMSECV is selected. No data pretreatment was used. A flow diagram of the new method for forward variable selection and calibration is shown in Fig. 1.
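The fifth step, as described in this paragraph, amounts to a greedy accumulation of candidates. A minimal sketch follows, assuming that the candidates are already sorted by descending R², that RMSECV is computed by leave-one-out, and that the slope and concavity constraints are omitted for brevity. The function names and the synthetic candidates are hypothetical.

```python
import numpy as np


def rmsecv(x, y, degree=1):
    """Leave-one-out RMSE of a polynomial calibration of y on one feature x."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        coef = np.polyfit(x[keep], y[keep], degree)
        errors[i] = y[i] - np.polyval(coef, x[i])
    return float(np.sqrt(np.mean(errors ** 2)))


def greedy_sum(features, y, degree=1):
    """Sum candidates in order, keeping each one only if the RMSECV decreases."""
    kept = [0]
    total = features[0].copy()
    best = rmsecv(total, y, degree)
    for k in range(1, len(features)):
        trial = total + features[k]
        score = rmsecv(trial, y, degree)
        if score < best:  # otherwise the candidate is discarded
            kept.append(k)
            total, best = trial, score
    return kept, total, best


# Synthetic demonstration: two informative candidates and one pure-noise candidate
rng = np.random.default_rng(2)
n = 40
y = rng.uniform(0.1, 1.7, n)
features = [
    y + rng.normal(scale=0.05, size=n),
    y + rng.normal(scale=0.05, size=n),
    rng.normal(scale=5.0, size=n),
]
kept, combined, best = greedy_sum(features, y)
print(kept, round(best, 4))
```

Because each candidate is tested once and in a fixed order, the number of cross-validated fits grows only linearly with the number of preselected candidates.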

Fig. 1 Flow diagram of the new method for forward variable selection and calibration

For comparison purposes, multivariate calibration models were also constructed using JK-PLSR [30, 31], with data pretreatment by mean centering of the LIBS spectra. The JK-PLSR was applied repeatedly until the RMSECV values no longer decreased. In this case, The Unscrambler 9.7 software (CAMO) was used. The calibration models were compared to each other in terms of the number of selected variables, RMSECV, and root mean square error of prediction (RMSEP).
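The two figures of merit are not written out in the text. Under their usual definitions, which are assumed here, they can be expressed as follows, where y is the reference manganese content, ŷ with subscript (−i) is the prediction for calibration sample i from a model built without that sample, and n_cal and n_val are the numbers of calibration and external validation samples (the symbols are introduced here for illustration only):

```latex
\mathrm{RMSECV} = \sqrt{\frac{1}{n_{\mathrm{cal}}} \sum_{i=1}^{n_{\mathrm{cal}}} \left( y_i - \hat{y}_{(-i)} \right)^2},
\qquad
\mathrm{RMSEP} = \sqrt{\frac{1}{n_{\mathrm{val}}} \sum_{j=1}^{n_{\mathrm{val}}} \left( y_j - \hat{y}_j \right)^2}
```

Both quantities carry the units of the reference values, here wt%.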

For both cases (new method and JK-PLSR), the spectra were analyzed in four ways: first, the spectra from each of the three integration times were analyzed separately, and then the spectra from all integration times were analyzed together. In the last case, each sample was represented by three spectra, placed together as one single row of data. In all cases, the spectra from 40 samples were used for the construction of the calibration models, with full cross-validation (leave-one-out), and the spectra from 20 samples were used for external validation. The samples were randomly divided into the two groups.
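The data arrangement for the combined analysis and the random split can be sketched as follows. The arrays are random placeholders standing in for the averaged spectra, the variable names are hypothetical, and the seed is arbitrary. The text does not say how the random division was generated.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_pixels = 60, 2048
integration_times_ms = (80, 400, 1000)

# Placeholder arrays standing in for the averaged spectra at each integration time
spectra = {t: rng.random((n_samples, n_pixels)) for t in integration_times_ms}

# One row per sample: the three spectra placed side by side
X_all = np.hstack([spectra[t] for t in integration_times_ms])

# Random split into 40 calibration and 20 external validation samples
order = rng.permutation(n_samples)
cal_idx, val_idx = order[:40], order[40:]
X_cal, X_val = X_all[cal_idx], X_all[val_idx]
print(X_all.shape, X_cal.shape, X_val.shape)
```

Using the same index split for every integration time keeps the calibration and validation sets comparable across the four analyses.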

Results and discussion

The LIBS spectra

Figure 2a shows the raw LIBS spectra obtained for one of the steel samples at the three integration times studied. The spectra were deliberately offset vertically from one another in the figure to make them easier to distinguish. As expected, the longer the integration time, the higher the emission intensities. The integration times were chosen to maximize the emission signals in specific wavelength ranges within the spectral range of the LIBS instrument. As can be seen from the figure, high emission intensities were obtained between about 225 and 280 nm for 80 ms, between about 280 and 385 nm for 400 ms (although some signals were lost at lower wavelengths by overload), and above about 385 nm for 1000 ms (again with some signals lost at lower wavelengths by overload). Figure 2b shows a direct linear correlation between the peak height and the integration time, evaluated here for the peak at 495.67 nm (baseline subtracted against 481.29 nm). As can also be seen, due to the relatively poor resolution of the spectrometer for atomic spectroscopy, it is practically impossible to identify individual atomic emission peaks in the spectra, which makes the evaluation of the spectra more complex and reveals the need for judicious variable selection to obtain good quantitative results.

Fig. 2 (a) Raw LIBS spectra obtained for one steel sample using 80, 400, and 1000 ms integration times (from bottom to top) and (b) the variation of the peak height at 495.67 nm (baseline subtracted against 481.29 nm) with the integration time

The new method for forward variable selection and calibration

As can be seen (see the Data analysis subsection), there are many differences between the new method for forward selection and a classical forward selection method. First, in the first three steps of the new method, only single variables (from all variables) and combinations of two variables (using division or subtraction) are evaluated, with no further variable combination. Second, in the fifth step of the new method, only specific combinations of variables are evaluated, based on the preselection of variables from the first three steps and on their R² values. Finally, the new method not only selects the variables but also generates a mathematical combination of the selected variables based on basic mathematical operations, whose results are calibrated against the dependent variable by linear or quadratic regression. That is, the new method avoids the use of complex multivariate calibration methods, such as PCR and PLSR.

The combinations of variables selected using the new method for forward selection are given in Table 1. Different selections were evaluated using the new method with linear or quadratic regression and the spectra obtained with different integration times. As can be seen, no combination using the subtraction operation was present in any final selection. Although some combinations with subtraction had been preliminarily selected (in the third step of the new method), none were kept in the final selection (in the fifth step) because none of them led to a decrease in the RMSECV values. As can also be seen, and as expected, the integration time of the LIBS spectra affected the selected variables. Spectra with 80 ms integration time led to the selection of variables mainly between 250 and 300 nm, spectra with 400 ms led to the selection of variables between 280 and 490 nm, and spectra with 1000 ms led to the selection of variables only above 350 nm. These values are in accordance with the wavelength ranges intensified at each of these integration times. When the spectra from all integration times were used together, the new method selected variables from the spectra obtained with all integration times. However, in this case, most of the selected variables were from the spectra obtained with 400 ms, indicating that the spectra obtained with this integration time contained information better correlated to the manganese content. Of all the selected variables, the most important ones (most correlated to the manganese content) were those between 287.58 and 294.99 nm.

Table 1 Combinations of variables selected using the new method for forward selection

Although it is difficult to identify peaks in LIBS spectra with relatively low resolution, such as those obtained in this work, it was possible to assign some manganese emission lines to the wavelengths of all variables selected as numerators in Table 1. These assignments, given in Table 2, show that the variables are related to the chemical information under study (the manganese content), even though they had been selected by mathematical calculations. Figure 3 graphically shows some of the variables selected for the LIBS spectra obtained with the 400 ms integration time. As can be seen in Table 1, the six variables shown in Fig. 3 between 292 and 295 nm were divided by the five variables shown between 297 and 299 nm. The six variables between 292 and 295 nm are related to some manganese emission lines, as shown in Table 2, and the five variables between 297 and 299 nm are related to some iron emission lines (seven strong emission lines from iron between 296.6 and 299.5 nm [32]). Therefore, the combination of these variables simulated the division of the emission intensities of some manganese peaks by the emission intensities of some iron peaks, with the iron acting as an internal standard. The analytical curves generated by the new method for forward selection and calibration, using linear regression with the spectra from 400 ms integration time and using quadratic regression with the spectra from all integration times together, are shown in Fig. 4; both contain combinations of the variables highlighted in Fig. 3.

Table 2 Assignments of some manganese emission lines to the variables selected as numerators by the new method for forward selection

Fig. 3 Part of the LIBS spectra obtained for all calibration samples at 400 ms integration time showing some variables (vertical lines) selected by the new method for forward selection

Fig. 4 Analytical curves generated by the new method for forward selection and calibration using (a) linear regression and the spectra from 400 ms integration time (R² = 0.996) and (b) quadratic regression and the spectra from all integration times (R² = 0.998)

Manganese determination

The results of the manganese determination using the combinations of variables shown in Table 1 and selected with the new method for forward selection are given in Table 3, compared to those obtained with JK-PLSR. Different types of data pretreatment were evaluated, such as mean centering, Savitzky-Golay first derivative, multiplicative scatter correction (MSC), and standard normal variate (SNV) [1, 2]. For the new method, data pretreatment did not significantly improve the results of the manganese determination, so none was employed. For the JK-PLSR, the best results, shown in Table 3, were obtained using only mean centering of the LIBS spectra. As can be seen, the RMSECV and RMSEP values from the models obtained with the new method were all lower than those obtained with JK-PLSR for any given integration time, even though the new method used a significantly lower number of selected variables (between 6 and 31) and no data pretreatment. The root mean square errors (considering both RMSECV and RMSEP) and the numbers of selected variables from the new method were, on average, about 2.3 times and 22.7 times lower, respectively, than those from JK-PLSR. It is important to point out that, due to the complexity of optical emission spectra from multi-elemental samples, such as steel samples, it is very difficult to select such a low number of specific variables from low-resolution LIBS spectra and attain such good analytical results. In addition, considering only the results from each single integration time, the best results were obtained from the spectra with the 400 ms integration time, for both the new method and JK-PLSR. Plots of predicted versus reference manganese concentrations obtained from the spectra with 400 ms integration time and using the new method with linear regression and JK-PLSR, with external validation, are given in Fig. 5, showing the better correlation for the results obtained with the new method. However, the best results overall, with the lowest RMSECV and RMSEP values, were obtained using the new method with quadratic regression and the spectra from all integration times together. These results show that using LIBS spectra from multiple integration times, so that maximized emission signals are obtained over a wider wavelength range, can improve the performance of calibration models under specific variable selection methods, especially with low-cost LIBS instruments. Compared to other works reported in the literature for manganese determination in steel using LIBS [33, 34], within a similar concentration range, the results reported here presented lower root mean square errors and higher reliability, as they were obtained from a much larger sample set.

Table 3 Comparison between the different calibration models with selected variables using full cross-validation and external validation

Fig. 5 Plots of predicted versus reference manganese concentrations, using external validation, obtained from (a) the new method for forward selection and calibration with linear regression (R² = 0.991) and (b) JK-PLSR (R² = 0.959), both from LIBS spectra with a 400 ms integration time

Conclusion

A new method for forward variable selection and calibration was developed and evaluated for manganese determination in steel, using LIBS data obtained with multiple integration times from a compact and low-cost instrument. Acquiring spectra at multiple integration times was useful to obtain maximized emission intensities over a wider spectral range. The variables selected by the new method as numerators in the combinations of variables could all be assigned to strong emission lines of manganese, showing that the calculations involved in the new method led to the selection of variables correlated to the chemical information under study. Compared to JK-PLSR, the new method presented a better prediction performance for the manganese content at all integration times, using significantly lower numbers of selected variables, no data pretreatment, and a simpler mathematical calculation. Additionally, the prediction performance of the new method could be further improved using the spectra from all integration times together, which can be especially useful for low-resolution LIBS spectra acquired by low-cost instruments. This analysis mode, with the use of spectra from multiple integration times to improve the results of multivariate calibration with variable selection, is unexplored in LIBS and can open a new window of analytical applications using LIBS.