Introduction

Water and gas sampling of natural thermal discharges is the most commonly used technique for exploring prospective geothermal sources. Soil characterization, on the other hand, is probably the least adopted survey method in geothermal areas. However, in geothermal areas where surface discharges are relatively few and the extent of the geothermal field is not known, soil sampling has proven to be a very effective tool [25]. The elevated temperature of geothermal reservoirs increases the mobility of volatile vapor-borne species, e.g., mercury, arsenic, antimony, boron, and ammonia; as a result, these species migrate upward through the permeable zone and are adsorbed in the upper soil matrix. In fact, anomalous concentrations of Hg and its mineralization in various high-temperature geothermal areas around the world are well documented, and Hg mapping is an effective tool for exploring potential geothermal fields [12, 22, 26, 27, 39]. Although abundant literature is available on the mapping of Hg in high-temperature geothermal areas, no study has been carried out on the overall classification of soils near thermal manifestations and their comparison with soils away from the thermal discharges.

Analytical techniques like atomic absorption spectroscopy (AAS), X-ray fluorescence spectroscopy (XRFS), inductively coupled plasma atomic emission spectroscopy (ICP-AES), and gravimetric analysis are often used for qualitative and quantitative variety discrimination of soil [40]. However, these techniques are costly and time consuming and cannot be applied in the field. Laser-induced breakdown spectroscopy (LIBS) is an elemental emission-based spectroscopic technique that has an inherent capability for both qualitative and quantitative study of solid, liquid, or gaseous samples and can be used in situ, in the field, and, if required, remotely. In LIBS, a short laser pulse of high energy is focussed on the sample surface to produce a micro-plasma. The resulting emission lines from the atomic, ionic, and molecular fragments created by the plasma are then resolved both optically and spectrally to produce a spectrum of intensity against wavelength (Harmon, 2009). LIBS is being considered a front runner in green chemical analysis due to its unique features, such as real-time analysis, its pseudo non-destructive nature, minimal to no sample preparation, high sensitivity to low atomic weight elements, and the capability to carry out close-in as well as stand-off detection. As a consequence, over the last two decades, LIBS has been widely applied in a variety of fields, such as environmental monitoring [1, 2], biomedical applications [11, 19], archeological investigations [4], pharmaceutical industries [23, 24], extra-terrestrial explorations [6, 7, 20, 28, 29], hazardous materials identification [9, 10, 14], nuclear fuel characterization [17, 30, 32, 33], and geological material characterization [3, 5, 15, 16].

The application of multivariate chemometric methods in conjunction with LIBS data has recently shown tremendous potential in the field of soil analysis. Principal component analysis (PCA) is one of the most widely used chemometric procedures for multivariate data systems. Such methods reduce the dimension of the input data so that the complete information is described with considerably fewer variables than were originally present, thereby revealing the simple underlying structure within a complex input dataset [8]. A chemometrics-LIBS combination can be utilized for both qualitative and quantitative assessment of samples. Qualitative applications include a study by Zhang et al. [42] on the classification of slag samples using partial least squares discriminant analysis (PLS-DA). Fink et al. [13] and Unnikrishnan et al. [38] used PCA for identification of polymers and related materials. Yueh et al. [41] used hierarchical cluster analysis (HCA) for tissue classification, and Sirven et al. [35] studied the feasibility of rock identification on the Mars surface by applying PCA, PLS-DA, and soft independent modeling of class analogy (SIMCA) to LIBS data. However, relatively few studies are available on the classification of soil samples using PCA-LIBS coupled methods [34, 40].

In the present work, we have applied PCA to LIBS spectral data to classify soil samples collected from the Manuguru geothermal area of the Telangana state, India. Samples collected close to thermal discharges were compared with samples taken away from the discharge areas to look for distinguishing characteristics in the concentrations of specific major, trace, and rare earth elements. This work addresses the use of PCA in an indigenously developed modified form to identify the emission lines significantly responsible for variety classification in LIBS spectral analysis. The developed method employs a relatively simple spectral truncation-based PCA technique. Several PCA score classification plots representing different sets of wavelength ranges were evaluated. These plots were based on truncated spectra obtained by applying a threshold radius on the corresponding PCA loading plot. Traditional PCA was also applied to the chemical data set to further corroborate the results obtained from the LIBS data.

General geology of the study area

The soil samples were collected from the Manuguru geothermal area, located in the Khammam district of the Telangana state, India. The district forms a part of the Godavari river basin. The study area contains several thermal water manifestations with temperatures in the range of 36–76 °C. The Godavari basin, a NNW–SSE trending graben on a Precambrian platform, is filled with Gondwana sedimentary formations. This area is known for coal exploration, and there are several opencast coal mines operated by Singareni Collieries Company Ltd. (SCCL). Almost all of the thermal manifestations are from bore wells drilled for coal exploration. These thermal discharges are located near the Pagaderu, Gollakatur, ST colony, Shantinagar, and Kodichenkuntala villages of the Manuguru administrative division. Geological mapping at a scale of 1:25,000 was carried out over an area of 80 km² in and around the study area. The geological map along with the sample location points is shown in Fig. 1. Lower Gondwana sedimentary formations of the Permian period rest unconformably over Proterozoic Pakhal meta-sediments, which form the basement rock of the study area. The Talchir formation, the Barakar formation, the Barren Measures formation, and the Kamthi formation collectively form the lower Gondwana subgroup. The simplified litho-stratigraphy of the study area is given in Table 1. Among the different structures, well-defined bedding planes and cross beddings are observed in the sandstone. Two sets of joints are observed in the sandstone: one strikes N30°W with a dip towards the east, and the other strikes E–W with a dip towards the south. Apart from the joints, faults are the main diastrophic structures controlling hot water movement within the basin. The fault system is dominated by dip faults, although some faults in the study area are oblique. Some faults extend beyond the limit of the Gondwana basin into the Pakhal Group, whereas others are restricted within the basin. 
The geothermal manifestations seem to be confined to the NE–SW trending fault. A total of 20 samples were collected from the B horizon (subsurface layer, 9–15-cm depth) of the soil.

Fig. 1

Soil sample location points along with the geological map in the Manuguru geothermal field. The red circle indicates the location of thermal manifestations

Table 1 Litho-stratigraphy of the study area

Methods

To ensure a homogeneous soil composition, the samples were dried, ground, and sieved to a particle size of 80 mesh. The samples were divided into three sets, one for recording LIBS spectra and the other two for chemical composition analysis by inductively coupled plasma optical emission spectroscopy (ICP-OES) and instrumental neutron activation analysis (INAA).

LIBS

A common configuration of the LIBS system was used in this study [33]. Figure 2 presents a schematic diagram of the experimental setup. Laser pulses from an Nd:YAG laser (Brilliant B, Quantel, France) with a 532-nm wavelength, a 6-ns pulse duration, and a maximum energy of 440 mJ were focussed onto the sample surface using a plano-convex lens (f = 10 cm) to produce a laser-induced micro-plasma. The samples were translated using a motorized linear XYZ translator (Velmex, USA) to ensure that each laser pulse hit a fresh portion of the sample. The emitted light was collected at 45° with respect to the laser pulse propagation through a collimator (CC52, Andor, UK). The collected light was fed to an echelle spectrometer (Mechelle, ME5000, Andor, UK) through an optical fiber. The echelle spectrometer covers the 200–975-nm wavelength region and has a spectral resolution of ~ 4750 CSR (λ/∆λ) at a 50-μm entrance slit width. The spectrograph is equipped with an ICCD (iStar, Andor, UK, 1024 × 1024 pixels), which is synchronized with the Q-switch of the laser to control the acquisition time delay (td) and the detector gate width (tg). Digital spectra recording and control of the delay generation were carried out with data acquisition software (Solis 4.28). For the present study, a laser energy of 50 mJ, td = 1.2 μs, and tg = 50 μs were used. The soil samples were initially pelletized into 3-cm diameter pellets by applying a pressure of 2 × 10⁹ Pa for 5 min using a pellet press. Each LIBS analysis consisted of an accumulation of 60 laser shots taken in scanning mode over a 15-mm straight line, and three replicate spectra were recorded for every individual sample. These three replicate spectra were then averaged to obtain a spectrum representing the overall sample surface.

Fig. 2

The experimental setup of LIBS used for the present study

Chemical analysis

Thermal waters contain variable amounts of trace metals (e.g., Li, Rb, Cs, B, Sb, Cu, Pb, Mn, Co), rare earth elements (e.g., Sc, Ce, Eu, Tb), and transition metals (e.g., Hf, Ta) which are usually not present in detectable amounts in natural groundwater systems. High-temperature thermal waters show elevated concentrations of these trace and rare earth elements depending upon the extent of rock-water interaction at the high reservoir temperature ([18, 31] and the references cited therein). The elevated concentration of trace and rare earth elements in the thermal waters prompted us to analyze the concentrations of some of these elements in the soil samples collected from the geothermal area to check their preferential distribution.

Quantitative analysis of Fe, Na, K, Mn, Cu, B, As, Hg, Sb, Li, Pb, and Co in the soil samples was carried out by ICP-OES (Model: ACTIVA, M/S HORIBA Scientific). Prior to the ICP-OES analysis, the soil samples were digested as per the methodology described by Kumar et al. [21]. The silica content of the soil samples was analyzed by the conventional gravimetric method [36]. The concentrations of Sc, Rb, Cs, Ce, Eu, Tb, Hf, Ta, and Th were determined by instrumental neutron activation analysis (INAA). For INAA, 50 g of each soil sample was dried in an oven and carefully ground using an agate mortar; 20 mg of the powdered sample was then doubly sealed in aluminum foil and irradiated in the self-serve facility of the DHRUVA reactor, Mumbai, with a neutron flux of 10¹³ cm⁻² s⁻¹ for 6 h. IAEA reference materials (RMs) SL-1 and Soil-7 were used as reference and control standards, respectively. Gamma-ray measurements were carried out after the appropriate cooling time by using an HPGe detector coupled with a computer-assisted multichannel analyzer [37].

Results and discussion

Characterization and PCA of LIBS spectra

PCA was applied to the LIBS data to identify the variations in the spectral data and to interpret the data in terms of a subset of those variations. A linear mathematical transformation reduces these variations to a smaller set of principal components (PCs). The first PC (PC1) contains the largest variance of the data set, followed by the second PC (PC2), the third PC (PC3), and so on. In the current study, PCA was applied to transform the LIBS spectra of all 20 samples into several PCs. PC1, PC2, and PC3 explained 73%, 18%, and 8.9% of the variance, respectively, implying that three PCs collectively explained 99.9% of the total variance of the original dataset. Figure 3 b to d compare the PCs obtained by applying PCA to the whole spectra with a typical LIBS spectrum of a soil sample (Fig. 3 a). One drawback of this PCA procedure was the frequent crashing of the software due to the excessive data load on the algorithm. The spectrograph-detector system generates > 16,000 wavelength (pixel) values per spectrum covering the 200–975-nm region. Hence, applying the PCA model to the whole LIBS spectra meant introducing 20 (samples) × 16,000 pixel data points, i.e., 320,000 input values. This placed a heavy load on the program. To reduce this load, regions with no or minimal contribution to the overall analytical results needed to be removed. Figure 3 a clearly shows the lack of significant emission lines beyond 550 nm. This was also reflected in PC1, PC2, and PC3, which showed no significant contribution in the 550–975-nm region; hence, this region was irrelevant to the variety classification exercise and was omitted. Figure 4 a and b compare the PCA bi-plots obtained using the whole 232–1000-nm region and the 232–550-nm region, respectively; the two were found to be almost identical. Both Fig. 4 a and b show two different clusters. Soil samples very near to the thermal manifestations (MU-1, MU-4, MU-8, MU-10, MU-11, MU-13, MU-18) fall in one cluster (square symbols), whereas samples collected away from the thermal manifestations (MU-3, MU-12, MU-14, MU-15, MU-17, MU-20) fall in a different cluster (circle symbols). The MU-6 sample, although collected near the thermal manifestations, did not fall in the thermal cluster. Anomalous clustering was shown by only three samples (MU-7, MU-2, and MU-16), which were collected away from the thermal manifestations but fell in the proximity of the thermal cluster in the PCA plot.
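The wavelength cropping and the explained-variance bookkeeping described above can be sketched in a few lines. The following is a minimal illustration on synthetic spectra, not the software used in this work: the function names `crop_spectra` and `pca`, the single Gaussian line at 300 nm, and the mean-centring before the SVD are all assumptions made for the example.

```python
import numpy as np

def crop_spectra(wavelengths, spectra, lo_nm, hi_nm):
    """Keep only the pixels whose wavelength lies in [lo_nm, hi_nm]."""
    keep = (wavelengths >= lo_nm) & (wavelengths <= hi_nm)
    return wavelengths[keep], spectra[:, keep]

def pca(spectra, n_components=3):
    """PCA via SVD of the mean-centred (samples x pixels) matrix."""
    centred = spectra - spectra.mean(axis=0)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    explained = s**2 / np.sum(s**2)            # fraction of variance per PC
    scores = u[:, :n_components] * s[:n_components]
    loadings = vt[:n_components].T             # (pixels x components)
    return scores, loadings, explained[:n_components]

# Synthetic stand-in for 20 spectra of 16,000 pixels (NOT measured data).
rng = np.random.default_rng(0)
wavelengths = np.linspace(232.0, 1000.0, 16000)
line = np.exp(-0.5 * ((wavelengths - 300.0) / 0.5) ** 2)   # one emission line
group = np.array([1.0] * 10 + [2.0] * 10)                  # two sample groups
spectra = group[:, None] * line + 0.01 * rng.random((20, wavelengths.size))
spectra[:, wavelengths > 550.0] = 0.0                      # no lines beyond 550 nm

wl_cut, spectra_cut = crop_spectra(wavelengths, spectra, 232.0, 550.0)
scores_full, _, var_full = pca(spectra)
scores_cut, loadings_cut, var_cut = pca(spectra_cut)

print("pixels before/after cropping:", spectra.shape[1], spectra_cut.shape[1])
print("explained variance (cropped):", np.round(var_cut, 3))
```

Because the pixels beyond 550 nm carry no variance in this synthetic case, the PC1 scores from the full and cropped matrices agree up to sign, which mirrors the near-identical plots in Fig. 4 a and b.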

Fig. 3

a A complete LIBS spectrum of the soil sample showing the 232–1000-nm region. b–d A comparison of different PCs on the whole LIBS spectra

Fig. 4

a PCA bi-plot covering all spectral regions 232–1000 nm. b PCA bi-plot covering the 232–550-nm region

To ascertain the reason behind the differential clustering of soil samples depending upon their proximity to the thermal waters, the loadings of PC1 and PC2 were examined (Fig. 5). Since the echellogram consisted of a huge number of pixels, the total number of pixels in the PCA model was still ~ 9000 even after eliminating the 550–975-nm region. With this many pixels, it became virtually impossible to identify which emission lines were chiefly influencing the clustering in Fig. 4 b. To make the model more efficient in distinguishing the major influencing variables or pixels, the irrelevant pixel input data needed to be reduced. When the LIBS spectrum is used as the input data set in a PCA model, a weak emission line can be an effective variable for classification, whereas a strong one can be irrelevant. Singh et al. [33] developed and successfully used a modified partial least squares regression (PLSR) algorithm, known as analytical spectral-dependent PLSR (ASD-PLSR), for qualitative analysis of glass samples. The core idea of the algorithm was to delete the irrelevant spectral region. Based on a similar principle, in the present work, we have developed a spectral truncation method, known as analytical spectral-dependent truncation, which is guided by the Euclidean distance of each loading from the origin of the loading plot (Fig. 5).

Fig. 5

PCA bi-plot of loading factors along with different Tc values

The loading plot is simply the Cartesian plot of the individual scalar values of the loading vectors in the XY-plane defined by the respective principal components. The direction and distance from the origin of each point (i.e., the representation of the corresponding variable) indicate the direction and magnitude of that variable's contribution to the variety classification. In other words, the points close to the origin of the loading plot are irrelevant to variety classification. Hence, variables or pixels having a near-zero Euclidean (Cartesian) distance from the origin can be removed. To select only those wavelength regions or pixels which lie at a significant distance from the origin and contribute significantly to variety classification, an arbitrary parameter named the PCA coefficient threshold (Tc) was introduced, whose value can vary from 0 to the maximum distance in the loading plot. A program was written to convert each LIBS spectrum to a modified spectrum by setting to 0 all emission intensities whose distance in the loading plot was less than the selected Tc, while leaving the remaining intensities unaltered. This procedure removes the intensities of the pixels that are irrelevant for classification and retains only the relevant intensities in the spectra. Figure 5 shows six arbitrarily chosen Tc circles. These circles indicate that if, say, Tc = 0.08 is chosen, then all pixels having a loading Euclidean distance of less than 0.08 will be replaced by 0. A Tc value of 0 means that no spectral region or pixel is deleted, i.e., the PCA plot of the whole spectra is obtained unchanged. The Tc value was varied from 0 to 0.4 in increments of 0.005, and a separate PCA plot was constructed for each Tc value. Figure 6 shows PCA plots at four arbitrary Tc values. 
As the Tc value increased from 0 to 0.02, 0.08, and 0.2, the PCA classification changed. At Tc = 0, only the MU-6 sample was an outlier from the thermal soil grouping. At Tc = 0.02, when some of the irrelevant intensities had been removed, a very similar grouping was observed, but two non-thermal samples now gave false positive results. At Tc = 0.08, the clustering became less obvious and the degree of separation between the two groups of soil samples (collected near and away from the thermal manifestations) decreased, indicating a significant loss of relevant information or intensities. At Tc > 0.08, the separate clustering of the soil samples disappeared altogether. These observations indicated that Tc must not exceed 0.02 for proper variety classification; at higher thresholds, pixels required by the PCA classification are removed and the classification fails. Although the chance of false positive results increased with increasing Tc, this methodology helped us to separate the relevant pixels from the irrelevant ones, thereby enabling us to identify the emission lines chiefly responsible for the variety classification.
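The truncation step can be illustrated with a short sketch. It is an assumption-laden example rather than the program written for this study: the helper names (`pc_loadings`, `loading_distance`, `truncate_spectra`), the synthetic 500-pixel spectra, and the choice to take the loading distances from the PCA of the untruncated spectra are all hypothetical.

```python
import numpy as np

def pc_loadings(spectra, n_components=2):
    """Loadings (pixels x components) from an SVD of the mean-centred spectra."""
    centred = spectra - spectra.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt[:n_components].T

def loading_distance(loadings):
    """Euclidean distance of each pixel from the origin of the PC1-PC2 loading plot."""
    return np.hypot(loadings[:, 0], loadings[:, 1])

def truncate_spectra(spectra, loadings, tc):
    """Set to 0 every pixel whose loading distance is below tc; keep the rest unaltered."""
    keep = loading_distance(loadings) >= tc
    return np.where(keep, spectra, 0.0), keep

# Synthetic stand-in data (NOT measured spectra): 20 samples, 500 pixels, one line.
rng = np.random.default_rng(1)
n_pixels = 500
pixel = np.arange(n_pixels)
line = np.exp(-0.5 * ((pixel - 120) / 3.0) ** 2)
group = np.array([1.0] * 10 + [2.0] * 10)
spectra = group[:, None] * line + 0.01 * rng.random((20, n_pixels))

loadings = pc_loadings(spectra)
tc_values = np.linspace(0.0, 0.4, 81)      # 0 to 0.4 in steps of 0.005
retained = []
for tc in tc_values:
    truncated, keep = truncate_spectra(spectra, loadings, tc)
    retained.append(int(keep.sum()))
    # A full workflow would rerun PCA on `truncated` here and inspect the score plot.

truncated_008, keep_008 = truncate_spectra(spectra, loadings, 0.08)
print("pixels retained at Tc = 0, 0.02, 0.08:", retained[0], retained[4], retained[16])
```

Zeroing pixels instead of deleting them keeps every truncated spectrum the same length as the original, so the surviving intensities stay aligned with their wavelengths.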

Fig. 6

PCA bi-plots of loading factors having different Tc values

Figure 7 shows the PC1 and PC2 loadings obtained from PCA of the truncated spectral data using a Tc of 0.02. For identification purposes, the truncated spectrum of an arbitrarily chosen sample (MU-11) is also shown in Fig. 7. By examining the scalar values in PC1 and PC2 and comparing them with the truncated spectra, the variables or emission lines chiefly responsible for the PCA variety classification were identified. Table 2 lists the emission lines in the LIBS spectra responsible for the variety classification in the PCA plot shown in Fig. 4 b. Interestingly, Table 2 mainly contains ionic lines rather than atomic lines. The occurrence of these ionic lines was not an indication of the plasma's ionic nature; rather, the spectrally pure lines (in a heavily spectrally impure emission spectrum) that play a major role in variety classification happened to be ionic in nature. The use of a spectrograph with higher resolution might settle this question by resolving many of these emission lines. It was observed that PC1 and PC2 had greater loadings for the emission lines of a few elements, i.e., Fe, Ca, Co, Sr, Al, Si, Ti, Ru, Mn, Mg, Cu, and Eu, which caused the preferential clustering shown in the PCA plot (Fig. 4). However, a higher loading of these emission lines did not necessarily imply higher intensities (a higher intensity being proportional to a higher concentration of the element) of the abovementioned elements in the soil samples near the thermal manifestations. Rather, the concentrations of all these elements taken together caused the emission pattern of the soil samples near the thermal manifestations to differ from that of the other samples, resulting in differential clustering in PCA. Simply put, the overall distribution of Fe, Ca, Co, Sr, Al, Si, Ti, Ru, Mn, Mg, Cu, and Eu relative to the other elements present in the soil samples near the thermal water resulted in this separate grouping.
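One way to go from the retained pixels to candidate emission lines is to group consecutive retained pixels into wavelength windows and check which reference lines fall inside them. The sketch below is purely illustrative: `retained_windows`, `lines_in_windows`, and the two-entry `CANDIDATE_LINES` table (using the B (I) and Fe (II) wavelengths quoted later in this paper) are hypothetical, and a real assignment would rely on a full spectral line database.

```python
import numpy as np

# Hypothetical reference table; only the two lines quoted in this paper are listed.
CANDIDATE_LINES = {"B I": 249.677, "Fe II": 249.913}

def retained_windows(wavelengths, keep):
    """Return (start_nm, end_nm) for each run of consecutive retained pixels."""
    padded = np.concatenate(([False], keep, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, stops = edges[::2], edges[1::2]      # stops are exclusive indices
    return [(float(wavelengths[a]), float(wavelengths[b - 1]))
            for a, b in zip(starts, stops)]

def lines_in_windows(windows, candidates):
    """List the candidate lines whose wavelength falls inside any window."""
    hits = []
    for start, end in windows:
        for label, nm in candidates.items():
            if start <= nm <= end:
                hits.append((label, nm))
    return hits

# Illustrative pixel grid with 0.05-nm spacing and an assumed retained-pixel mask.
wavelengths = np.linspace(249.0, 251.0, 41)
keep = np.abs(wavelengths - 249.9) < 0.08        # three pixels around 249.9 nm

windows = retained_windows(wavelengths, keep)
hits = lines_in_windows(windows, CANDIDATE_LINES)
print(windows, hits)
```

In this toy case the retained window covers the Fe (II) line but not the neighbouring B (I) line, so only the former is reported.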

Fig. 7

PC1 and PC2 loadings obtained from PCA of truncated spectral data using the Tc = 0.02 parameter along with the truncated spectra of the MU-11 sample in the 230–450-nm region

Table 2 The emission lines and the corresponding spectral regions of the majorly influencing elements in the soil samples near the thermal manifestations

PCA of the chemical data

To corroborate the findings obtained from the LIBS analysis, PCA was performed on the chemical analysis data. The concentrations of the corresponding elements are given in Table 3. The PCA bi-plot (Fig. 8) using PC1 and PC2 showed the existence of two different clusters. In Fig. 8, the soil samples close to the thermal water manifestations are shown as squares, whereas circles denote the soil samples collected away from the thermal manifestations. The tip of each arrow in Fig. 8 indicates the loading value of the respective elemental concentration. Figure 8 thus shows the PCA plot of the samples and the PCA plot of the elemental concentrations in a single graph. The higher loadings of boron, cesium, rubidium, and cobalt taken together made the soil samples close to the thermal water very distinct from the other samples. Although the separate grouping of soil samples near the thermal manifestations was strikingly similar in the LIBS and chemical analysis methods, there were a few mismatches in the clustered samples. In Fig. 8, the MU-10 sample did not fall into the thermal cluster, whereas MU-14 and MU-15 fell within the thermal cluster although they represented soil samples away from the thermal manifestations. The MU-10 sample had nearly twice the Na and K concentrations, along with an abnormally low Cu concentration, compared with the other thermal soil samples. These unusual elemental patterns moved the MU-10 sample away from the thermal soil cluster. On the other hand, the MU-14 and MU-15 samples had some elements in a concentration range similar to that of the thermal samples and others in a range similar to that of the non-thermal samples. This pattern caused the PCA model to misinterpret the two samples as part of the thermal cluster. A complete analysis of other trace and bulk elements would probably correct this PCA misinterpretation.
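A bi-plot of this kind combines sample scores and element loadings from the same decomposition. The sketch below shows one common way to obtain both; it is not the procedure documented in this paper. In particular, the autoscaling step (zero mean, unit variance per element) is an assumption, the function names are hypothetical, and the concentration values are randomly generated placeholders, not the data of Table 3.

```python
import numpy as np

def autoscale(x):
    """Centre each element column and scale it to unit variance (assumed preprocessing)."""
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

def biplot_coordinates(x):
    """Return PC1-PC2 sample scores, element loadings, and explained variance."""
    z = autoscale(x)
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    scores = u[:, :2] * s[:2]               # one point per sample
    loadings = vt[:2].T                     # one arrow tip per element
    explained = (s**2 / np.sum(s**2))[:2]
    return scores, loadings, explained

# Made-up concentrations for 20 samples (NOT Table 3): the first 10 samples
# mimic soils near thermal manifestations with elevated B, Cs, Rb, and Co.
elements = ["B", "Cs", "Rb", "Co", "Na", "K"]
rng = np.random.default_rng(2)
conc = rng.normal(loc=[50, 5, 80, 10, 1.0, 2.0],
                  scale=[5, 0.5, 8, 1, 0.1, 0.2], size=(20, 6))
conc[:10, :4] *= 2.0

scores, loadings, explained = biplot_coordinates(conc)
for name, (l1, l2) in zip(elements, loadings):
    print(f"{name:>2}: PC1 loading {l1:+.2f}, PC2 loading {l2:+.2f}")
```

Autoscaling is used here so that elements present at percentage levels do not dominate elements present at ppm levels; without it, the loadings would mostly reflect the elements with the largest absolute concentrations.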

Table 3 Chemical analysis results of the soil samples collected from the study area
Fig. 8

A PCA bi-plot using chemical analysis results shows the clustering of soil samples near the thermal manifestations

It is interesting to note that the set of elements assumed to play a significant role in the clustering process differs slightly between the LIBS and chemical analysis data. This is due to some intrinsic shortcomings of LIBS compared to traditional chemical analysis techniques and vice versa. In chemical analysis, the matrix is generally separated and the elements are measured sequentially, whereas in LIBS, the spectrum of the whole matrix is taken, so the overall composition plays a vital role. The spectral intensities of the major elements (e.g., Fe, Mg, Si) were so high compared to those of the trace elements that, in the PCA, the small intensities were unable to play any significant role in the clustering process. Moreover, some elements like boron (B), which was present in a significant amount (~ 100 ppm) and was a known influencer in the chemical analysis, were found to be insignificant in the LIBS analysis due to heavy spectral interference from iron (which is present in percentage amounts). The B (I) 249.677-nm line suffered heavy interference from the Fe (II) 249.913-nm emission line. Cs and Rb, being alkali metals, are emission-poor systems, and hence, they were not seen in the LIBS spectrum with significant intensity. In short, Cs and Rb also could not play any significant role in the clustering process. In principle, the elements that distinguish soil samples close to the thermal manifestations from samples far away from them should not depend on the analytical technique employed. However, the LIBS system (resolution = 0.05 nm) used in the present study is unable to detect some of the signals from trace elements due to instrumental limitations and the intrinsic nature of the emission lines. 
Moreover, when the current study was conceived, the authors had no guidance on which elements should be analyzed chemically, as there was no prior study of soil analysis in the geothermal area. The only reference available was the analysis of the thermal waters, which were found to be enriched in certain trace metals (e.g., Li, Rb, Cs, B, Sb, Cu, Pb, Mn, Co), rare earth elements (e.g., Sc, Ce, Eu, Tb), and transition metals (e.g., Hf, Ta) compared to the non-thermal groundwater ([18, 31], and the references cited therein). The elevated concentration of trace and rare earth elements in the thermal waters prompted us to analyze the concentrations of some of these elements in the soil samples collected from the geothermal area to check their preferential distribution. This is the main reason that some of the elements found to be important in the LIBS analysis (Ca, Al, Ru, Mg, and Sr) were excluded from the chemical analysis. However, the available chemical data were found to be sufficient for the classification purpose. Chemical analysis of additional elements would likely have resolved this discrepancy. Thus, combining the results obtained from both the LIBS and chemical analyses, it can be concluded that the relative difference in the concentrations of a few elements (i.e., B, Cs, Rb, Fe, Co, Sr, Al, Si, Ti, Ru, Mn, Mg, Cu, and Eu) in the soil samples near the thermal manifestations has made them characteristically different from the soil samples situated away from the thermal springs.

Conclusion

This study demonstrates, for the first time, the potential use of LIBS coupled with PCA as a tool for identification and discrimination of soil samples in a geothermal area through a geochemical fingerprinting approach. It also shows that the indigenously developed analytical spectral-dependent truncation-based PCA method, applied to the LIBS spectra using only the relevant pixels, helped to identify the most influential emission lines. Combining the results obtained from both the LIBS and chemical analyses, it can be concluded that soil samples near the thermal manifestations had a distinctly different concentration pattern for some trace and bulk elements (B, Sr, Cs, Rb, Fe, Co, Al, Si, Ti, Ru, Mn, Mg, Cu, and Eu) compared to the soil samples located away from the thermal manifestations. The PCA-based LIBS method was not completely robust, as it misinterpreted four of the 20 samples, implying 80% accuracy. The chemical analysis-based PCA method was also not completely robust, as 17 of the 20 samples were clearly differentiated, implying 85% accuracy. Yet, considering the time and cost involved in using multiple analytical techniques to obtain the chemical data, the LIBS method, despite its comparatively lower success rate, appears to be a simple and good alternative that can be adopted in the field and can generate real-time data. Due to the limited number of samples with known geothermal linkage, validation of this method using an unknown sample could not be carried out in the present work, but the results demonstrate the successful application of the LIBS-PCA combination for fast classification of geothermal soil samples, avoiding the rigorous dissolution required by other analytical techniques.