1 Introduction

Comprehensive geochemical and isotopic datasets for rocks and sediments have opened new avenues for studying the evolution of the continental crust (Haughton et al. 1991; Schwab 2003; Joshi et al. 2022a), the genesis of rocks (Joshi et al. 2017, 2022b), the reconstruction of paleoclimatic and paleogeographic conditions (Ramirez-Herrera et al. 2007; Tripathy et al. 2014; Shaji et al. 2022), and the estimation of provenance (Lipp et al. 2020; Hifzurrahman et al. 2023; Banerji et al. 2022a; Joshi et al. 2021a). The geochemical characteristics of fine-grained sediments and sedimentary rocks are controlled primarily by their source rocks and then by interactions with fluids that convert source-rock particles into solutes and/or newly formed minerals through weathering (Nesbitt 1979; Nesbitt et al. 1980; Taylor and McLennan 1985). Their compositions are further modified by hydrodynamic sorting, interaction with porewater during burial, and cation exchange with ambient waters (Fedo et al. 1995; Nesbitt et al. 1996; Garzanti et al. 2009; Garzanti 2016; Lipp et al. 2020). Despite the abundance of large geochemical datasets, separating the contributions of these factors to the composition of sediments and sedimentary rocks remains difficult, and handling such large datasets has itself been a challenge for sedimentary geochemists (Lipp et al. 2020).
Consequently, multivariate statistical approaches (MSA), including principal component analysis (PCA), discriminant analysis, linear discriminant analysis, factor analysis, singular value decomposition, and hierarchical cluster analysis, have recently emerged as a prominent strategy for handling large data arrays. PCA is a multivariate statistical method used to reduce dimensionality and identify patterns or trends in data (Reid and Spencer 2009), whereas discriminant analysis distinguishes between two or more groups based on their characteristics (Chien and Lautz 2018). According to Braun et al. (2013), linear discriminant analysis identifies linear combinations of features that discriminate between two or more groups. Factor analysis identifies and quantifies the underlying factors that explain the observed correlations among a set of variables (Hoseinzade and Mokhtari 2017). Singular value decomposition is related to PCA and is applied to the covariance matrix of the data to find the principal components (PCs; Chen et al. 2015). Hierarchical cluster analysis builds a hierarchy of clusters by grouping similar elements together (Jiang et al. 2015).

The use of these statistical techniques has not only provided insights into provenance (Ohta 2004; Pe-Piper et al. 2008; Tolosana-Delgado et al. 2018; Armstrong-Altrin 2020; Banerji et al. 2022b; McManus et al. 2020) but has also clarified the influence of hydrodynamic fractionation and sorting in the sedimentary environment (Pe-Piper et al. 2008). Through systematic mathematical transformations, multivariate analysis expresses most of the variation in the data within a smaller, more tractable number of dimensions (Gazley et al. 2015). The approach accounts for multiple factors that simultaneously affect data variability (Borůvka et al. 2005), giving it an advantage over univariate and bivariate methods, which are prone to distortion from repeated statistical tests (Manly 1997). Multivariate statistical analysis deals with more than two variables at once. It is used to decipher correlations within large and complex datasets by reducing the number of variables while retaining the crucial information (Nadiri et al. 2013). It has proved advantageous across diverse geochemical investigations and is widely used to examine geochemical data from stream sediments, soils, and estuarine sediments in order to detect mineralization or contamination (Chork and Salminen 1993; Dominech et al. 2022; Paternie et al. 2023). In petrology, it has been pivotal in identifying diagenetic processes and in understanding how provenance affects the bulk chemistry of rocks and sediments (Hakstege et al. 1992).
Furthermore, the application of multivariate statistical analysis has enabled the characterization of fluvial deposits through the scrutiny of intricate and heterogeneous geochemical datasets (Helvoort et al. 2005). Notably, Garcia et al. (2020) successfully showcased the method's efficacy in differentiating depositional paleo-environments on the basis of geochemical data.

In the present study, we used PCA to re-evaluate previously published geochemical datasets from two distinct systems: the Shimla and Chail metasediments (SCM) of the Himachal Himalayas, Himachal Pradesh (Joshi et al. 2021c), and the Diu Island mudflat sediments (DMS) of southern Saurashtra, Gujarat (Banerji et al. 2021a). The primary objective was to unravel the provenance of these two systems, one sedimentary and the other metasedimentary in nature.

2 Principal component analysis (PCA)

In recent times, advances in computational capability have led to the widespread adoption of PCA across diverse geoscientific investigations. The origins of PCA trace back to Pearson's groundwork in 1901, which was subsequently refined by Hotelling in 1933. PCA is a multivariate statistical method for dimensionality reduction that is widely used in statistical analysis and machine learning. Its prime objective is to transform high-dimensional data into a lower-dimensional representation that retains most of the valuable information while discarding less relevant detail (Hongyu et al. 2016). PCA assumes that the relationships between variables are linear and that the data are normally distributed. It also has limitations: it summarizes a large data matrix through a square matrix, some information is lost in the reduction, the resulting components can be harder to interpret than the original variables, and it is not suitable for columns with many missing values (Lee 2010).

As a robust multivariate statistical exploratory tool, PCA empowers researchers to navigate data variability adeptly. Its prowess becomes particularly pronounced when grappling with extensive datasets, where intricate interdependencies among variables render interpretation and comprehension a formidable challenge. The core objective of PCA involves the transformation of a comprehensive set of potentially correlated variables into a more succinct set of uncorrelated variables known as PCs. These PCs efficiently encapsulate and preserve the pivotal information within the original dataset (Wishart et al. 2013; Sunkari and Abu 2019). Each PC epitomizes a linear amalgamation of the original variables, weighted in accordance with their contributions to elucidate the variance along a specific orthogonal dimension, sequenced in descending order (Geladi and Grahn 1996). Through this mechanism, PCA streamlines data representation while retaining its intrinsic characteristics, thereby simplifying the comprehension and visualization of intricate variable relationships.

Within PCA, the linear and non-linear relationships among samples are shown by scatter plots of scores, in which each data point corresponds to a distinct sample. Samples with similar chemical attributes cluster close together. Conversely, the associations between variables are described by the loadings, which show how the concentration values of the different chemical elements are combined to form the scores. The magnitude and sign of the loadings reveal the importance of specific elements in shaping the overall variance and thereby the correlations between elements. Variables that plot close together are positively correlated, while those positioned in diagonally opposite quadrants are negatively correlated. The first PC (PC1) accounts for the largest share of variance within the dataset and represents the primary axis of variability. The second PC (PC2) captures the largest share of the remaining variance in an orthogonal direction, a pattern that continues through subsequent components (Geladi and Grahn 1996; Makvandi et al. 2016). Each PC provides a distinct perspective on the data, and together the leading PCs explain the majority of the dataset's variability.

PCA is a crucial tool for high-dimensional data (Ueki and Iwamori 2017; Corcoran et al. 2019; Henrichs et al. 2019), where the trends and patterns of the dataset often elude direct observation and graphical representations are difficult to read. By reducing the dimensionality of the data and highlighting the main sources of variation, PCA makes intricate datasets more manageable to decode and interpret.

3 Geological background of metasediments and mudflat sediments

In the present study, the geochemical datasets from the metasediments of the Lesser Himalaya and the mudflat sediments of Diu Island were studied using PCA in order to decipher the sediment source and the dominant factor controlling the sedimentary system of each region. The published geochemical datasets, of nearly 30 samples each, for the Shimla and Chail metasediments (hereafter SCM) from the Lesser Himalayas (Joshi et al. 2021a) and the Diu Island mudflat sediment (hereafter DMS) core (Banerji et al. 2021a) were analysed. By applying PCA to the metasediments (SCM) and mudflat sediments (DMS), we aim to investigate the plausible sources and mechanisms responsible for the patterns and correlations in the geochemical variables.

3.1 Shimla and Chail metasediments (SCM)

The expansive Himalayan mountain range has been methodically subdivided into distinctive litho-tectonic units for the purpose of geological classification. Among these units are the Sub-Himalaya, Lesser Himalaya Sediments (LHS), Lesser Himalayan Crystalline sequence (LHCS), Higher Himalaya Crystalline Sequence (HHCS), and Tethyan Himalaya (Chambers et al. 2008; Bhargava et al. 2011; Law et al. 2013). Notably, Ahmad et al. (2000), employing isotopic markers, further refined the LHS into the relatively youthful outer zone and the elder inner zone. The outer Lesser Himalaya sediments were attributed to a dominant provenance from Meso- to Neo-proterozoic sources. Interestingly, both outer LHS and HHCS sediments exhibited congruent depositional ages and isotopic traits, hinting at a shared origin (Parrish and Hodges 1996; Ahmad et al. 2000; Richards et al. 2005). In contrast, the inner Lesser Himalaya sediments emerged as products largely sourced from Late-Archean to Paleoproterozoic origins.

Within this intricate geological landscape, the Chail, Shimla (Simla), and Blaini series exist as moderate to weakly metamorphosed strata, lying beneath the medium-grade Jutogh group of rocks. A stratigraphic marker in the form of stromatolite-bearing limestone horizons led Raha and Sastry (1982) to propose an upper Riphean age for the Shimla group, with 40Ar/39Ar mica dating yielding a maximum depositional age of 860 Ma (Frank 2001; Bhargava et al. 2011). The Shimla group exhibits a diverse composition comprising slates, greywackes, quartzites, and carbonates, with origins attributed to sedimentation via north-directed turbidity currents (Valdiya 1970; Srikantia and Sharma 1971, 1976; Sinha 1978; Joshi et al. 2021b). In contrast, the Chail Group (figure 1a) encompasses phyllites, phyllitic quartzite, psammitic and pelitic schists, orthoquartzites, arkose, chlorite schist, limestones, and meta-basic rocks, affiliating it with the Lesser Himalaya (Valdiya 1980). The outer LHS sediments, comprising formations like Chail, Tal, Krol, and others (figure 1), have been ascribed Neoproterozoic to Cambrian depositional ages (Richards et al. 2005).

Figure 1

Geological map of the southern Saurashtra coast surrounding the active mudflat of Diu, Gujarat (modified after Banerji et al. 2019 and Pant and Juyal 1993); the filled square indicates the sampling site (DM) (Banerji et al. 2021a, b). Modified geological map of the Sutlej section of the Himachal Lesser Himalaya (Thakur 1992; Vannay and Grasemann 1998; Richards et al. 2005); sample locations are shown as red squares after Joshi et al. (2021a, b, c).

3.2 Diu Island mudflat sediments (DMS)

The mudflat of Diu Island is located along the southern Saurashtra coast of western Gujarat (figure 1). The majority of the Saurashtra peninsula comprises basalts and their derivatives belonging to the Deccan Trap Formation of the Upper Cretaceous (Bhonde and Bhatt 2009). Unlike those of the Deccan plateau of west-central India, the Saurashtra Deccan basalts are distinguished by the thickness of their tholeiitic flood basalts (Najafi et al. 1981), the dominance of granophyre and rhyolite, volcano-plutonic complexes (Naushad et al. 2019), and pervasive compositional variety (Melluso et al. 1995; Sheth et al. 2011, 2012a). The Saurashtra Deccan basalts are unconformably overlain by the Gaj and Dwarka formations of the Tertiary and by the Miliolite, Chaya, Katpur, and Mahuva formations of the Quaternary (Pandey et al. 2007). The Gaj Formation consists of marly limestone rich in fossils (foraminifera, echinodermata, lamellibranchs, and gastropods) and is well exposed near the Una–Veraval Road. The Dwarka Formation is exposed only near Jafrabad, SW of Diu Island (Verma and Mathur 1979a). Miliolites are considered to be of both marine and aeolian origin and are distinguished based on their sedimentary structures and quantitative faunal characteristics (Verma 1982). They include only pelletoid (and oolitic) calc-arenites and associated micrites and are devoid of megafossils (Verma and Moitra 1975). The miliolite exposures are found near the Machundri river section (Verma and Mathur 1979b). Coastal rocks such as dead coral reefs, oyster beds, and other highly fossiliferous limestones are included in the Chaya Formation, whose age ranges from the late Pleistocene to the Holocene (Gupta 1972; Gupta and Amin 1974).
The Katpur and Mahuva formations were deposited during the Holocene epoch; the former includes oxidized and pedogenised tidal flat clays/silts, while the latter comprises freshwater alluvium (sand and clays), coastal deposits, lime mud, and calcareous sand with marine shells (Mathur et al. 1987; Bhatt 2003; Pandey et al. 2007).

4 Computational methodology

PCA is a statistical technique that uses an orthogonal transformation to convert a set of potentially correlated observations into a set of linearly uncorrelated variables. These uncorrelated variables, referred to as PCs, help render high-dimensional datasets as easily recognisable 2D or 3D patterns. PCA on a matrix of n variables and m samples proceeds through the following steps.

4.1 Data preparation and mean centering

Standardization is a pivotal step in PCA because it allows variables with different scales to contribute equally to the analysis, giving all variables a comparable range and variability. It involves two steps: the data are scaled so that each variable lies on a common scale, and they are centered by subtracting the mean of each variable, which places the data at the origin of the PCs. The method of standardization depends on the characteristics of the data. For instance, using the median or the median absolute deviation can reduce the influence of outliers. For geochemical datasets, analytical uncertainties also need careful attention so that they are properly accounted for in the analysis.

In some cases, the raw data require preprocessing with log ratios, particularly when the data are constrained to a constant sum, as with percentages or parts per million (Aitchison 1986). This transformation accounts for the imposed constraint and makes the data amenable to PCA. In this study, standardization based on the mean and standard deviation was preferred for its simplicity and effectiveness. It is carried out using equation (1), which prepares the data for the subsequent PCA.
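As an illustration of the log-ratio preprocessing mentioned above, the following minimal sketch applies the centered log-ratio (clr) transform, one of the log-ratio transforms associated with Aitchison's approach. The text does not state which log-ratio would be used, so the choice of clr, the function name `clr`, and the three-part composition are assumptions made purely for illustration.

```python
import math


def clr(composition):
    """Centered log-ratio transform of one sample; every part must be > 0."""
    logs = [math.log(part) for part in composition]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean of the parts
    return [value - mean_log for value in logs]


# Hypothetical three-part composition in wt% (illustrative values only).
sample = [60.0, 30.0, 10.0]
transformed = clr(sample)

# The same composition expressed as fractions gives the same clr values,
# because the transform depends only on the ratios between parts.
rescaled = clr([0.6, 0.3, 0.1])
```

The clr values of a sample sum to zero, which is a quick check that the transform was applied to the whole composition.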

$$Standardization=\left(\frac{data-Mean\, (data)}{Standard\, deviation \,(data)}\right).$$
(1)

The mean is a measure of central tendency that represents the average value of a set of numbers. It is obtained by summing all the values in a dataset and dividing that sum by the total number of data points. It is distinct from the median (the middle value in a sorted list of data) and the mode (the most frequently occurring value), although all three coincide for a symmetric, unimodal distribution. The formula for calculating the mean of a dataset is as follows:

$$Mean=\sum\limits^n_{i=1}\frac{X_i}{n},$$
(2)

where \({X}_{i}\) is the \({i}{{\text{th}}}\) data point of the variable \(X\), \({X}_{m}\) is the mean of \(X\), and \(n\) is the number of data points.

The standard deviation is a statistical measure that quantifies the spread or dispersion of a dataset in relation to its mean value. It is computed as the square root of the variance, which represents the average squared deviation of each data point from the mean. By determining the distance of each data point from the mean, the standard deviation assesses how much the values in the dataset deviate from the average value. The formula to calculate the standard deviation is as follows:

$$Standard\, deviation=\sqrt{\sum\limits _{i=1}^{n}\frac{{\left({X}_{i}-{X}_{m}\right)}^{2}}{n-1}}.$$
(3)

Adjustment here refers to mean centering: the mean of each variable is subtracted from every data point of that variable so that the adjusted data are centered on the origin. The adjusted value of each data point is obtained using equation (4), and the mean-adjusted data are then used in the covariance calculation and in the final projection onto the PCs.

$$Adjusted \,data_{i}={X}_{i}-{X}_{m},\quad i=1,\dots ,n$$
(4)

where \({X}_{i}\) is the \({i}\text{th}\) data point of the variable \(X\), \({X}_{m}\) is the mean of \(X\), and \(n\) is the total number of data points in the dataset.
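The data preparation steps of this section can be sketched in a few lines of Python. This is a minimal illustration rather than the MATLAB algorithm used in the study: it assumes the usual z-score form of standardization, in which the mean of a variable is subtracted from each value and the result is divided by the sample standard deviation (with \(n-1\) in the denominator, as in equation 3). The function names and the four concentration values are hypothetical.

```python
import math


def mean(values):
    """Arithmetic mean (equation 2)."""
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation with n - 1 in the denominator (equation 3)."""
    m = mean(values)
    return math.sqrt(sum((x - m) ** 2 for x in values) / (len(values) - 1))


def center(values):
    """Mean-adjusted data: each value minus the mean of its variable."""
    m = mean(values)
    return [x - m for x in values]


def standardize(values):
    """Assumed z-score form: centered values divided by the standard deviation."""
    s = sample_std(values)
    return [d / s for d in center(values)]


# Hypothetical concentrations of one element in four samples.
x = [2.0, 4.0, 4.0, 6.0]
x_centered = center(x)
z = standardize(x)
```

After standardization each variable has zero mean and unit standard deviation, so elements reported in wt% and in ppm contribute on the same footing.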

4.2 Variance and covariance

The scaled data obtained in section 4.1 are used to compute the covariance, which measures how two variables vary together. A positive covariance indicates that the variables tend to increase and decrease together, while a negative covariance indicates that they vary in opposite directions. The variance is the special case of the covariance of a variable with itself, so the covariance matrix summarizes both the spread of each variable and the relationships between variables. The variance of a variable X and the covariance between any two variables X and Y can be calculated using the following equations:

$${\rm Var}(X,X)=\sum\limits_{i=1}^{n}\frac{{\left({X}_{i}-{X}_{m}\right)}^{2}}{n-1}$$
(5)

and

$${\rm Cov}(X,Y)=\sum \limits_{i=1}^{n}\frac{\left\{\left({X}_{i}-{X}_{m}\right)\left({Y}_{i}-{Y}_{m}\right)\right\}}{n-1}$$
(6)

where \({X}_{i}\) and \({Y}_{i}\) are the \({i}\text{th}\) data points of the variables \(X\) and \(Y\), \({X}_{m}\) and \({Y}_{m}\) are the means of \(X\) and \(Y\), and \(n\) is the total number of data points in the dataset.
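Equations (5) and (6) can be sketched as follows. The helper names and the two short variables are hypothetical; the only assumption beyond the text is that the covariance matrix is assembled by evaluating equation (6) for every pair of variables, which places the variances of equation (5) on the diagonal.

```python
def covariance(x, y):
    """Sample covariance of two equally long variables (equation 6)."""
    n = len(x)
    x_m = sum(x) / n
    y_m = sum(y) / n
    return sum((xi - x_m) * (yi - y_m) for xi, yi in zip(x, y)) / (n - 1)


def covariance_matrix(variables):
    """Square matrix of covariances; the diagonal holds the variances (equation 5)."""
    return [[covariance(a, b) for b in variables] for a in variables]


# Hypothetical values of two variables measured on four samples.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 9.0]
cov = covariance_matrix([x, y])

# Reversing one variable flips the sign of the covariance.
cov_reversed = covariance(x, list(reversed(y)))
```

The matrix is symmetric because Cov(X, Y) equals Cov(Y, X), which is what allows the eigen decomposition to yield orthogonal components.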

4.3 Eigen decomposition

The covariance matrix provides the necessary information to calculate the eigenvalues and eigenvectors, which are essential in PCA. These eigenvalues and eigenvectors play a crucial role in representing the overall variability of the dataset. Eigenvalues and eigenvectors always come in pairs, where eigenvalues determine the magnitude or importance of each PC, and eigenvectors demonstrate the direction of the data with the largest variance in the dataset. The eigenvector associated with the highest eigenvalue corresponds to the first PC, which accounts for the greatest possible variance in the dataset. Subsequent PCs have progressively lower variances, capturing less and less of the total variability in the data. Eigenvalues and eigenvectors can be calculated using the following procedure:

$${\text{det}}\left(A-\lambda I\right)=0,\qquad \left(A-\lambda I\right)X=0$$
(7)

where \(A\) is the covariance matrix, \(I\) is the identity matrix, \(\lambda\) is an eigenvalue, and \(X\) is the corresponding eigenvector.
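For a covariance matrix of two variables, the eigenvalues and eigenvectors can be worked out in closed form, which makes the procedure easy to follow. The sketch below solves the characteristic equation \({\text{det}}(A-\lambda I)=0\) for a symmetric 2 × 2 matrix and then finds a unit vector satisfying \((A-\lambda I)X=0\) for each eigenvalue. The function name and the example matrix are hypothetical, and real datasets with many variables would normally rely on a numerical eigen solver instead.

```python
import math


def eigen_symmetric_2x2(a):
    """Eigenpairs of a symmetric matrix [[p, q], [q, r]], largest eigenvalue first."""
    p, q, r = a[0][0], a[0][1], a[1][1]
    if q == 0.0:
        # Already diagonal: the axes themselves are the eigenvectors.
        pairs = [(p, (1.0, 0.0)), (r, (0.0, 1.0))]
        return sorted(pairs, key=lambda pair: pair[0], reverse=True)
    # det(A - lambda*I) = lambda^2 - (p + r)*lambda + (p*r - q^2) = 0
    half_trace = (p + r) / 2.0
    root = math.hypot((p - r) / 2.0, q)
    pairs = []
    for lam in (half_trace + root, half_trace - root):
        # (A - lambda*I) v = 0 is satisfied by v = (q, lambda - p).
        norm = math.hypot(q, lam - p)
        pairs.append((lam, (q / norm, (lam - p) / norm)))
    return pairs


# Hypothetical covariance matrix of two variables.
matrix = [[2.0, 1.0], [1.0, 2.0]]
pairs = eigen_symmetric_2x2(matrix)
(lam1, v1), (lam2, v2) = pairs
```

Here the first eigenvector points along the direction in which both variables increase together, and its eigenvalue is the variance of the data along that direction.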

4.4 Selection of principal components

After computing the eigenpairs (eigenvalues and eigenvectors), they are sorted in descending order of eigenvalue. This sorting allows the desired number of PCs to be selected for dimensionality reduction. Typically, the eigenvectors with the largest eigenvalues are chosen as the feature vectors, as they capture the most variance in the data. The selection can be made by plotting the cumulative sum of the eigenvalues and identifying the point where the explained variance reaches a satisfactory level. Once the desired PCs (feature vectors) are identified, the transposed feature vector is multiplied by the transposed adjusted data (the data centered on the means) to express the original data in the new lower-dimensional space. This transformation (equation 8) retains as much of the relevant information as possible while reducing dimensionality, enabling easier visualization and analysis.

$$Final\, data=Row \,feature \,vector\times Row\, data \,adjusted$$
(8)

where row feature vector is the matrix of selected eigenvectors, transposed so that each row is one eigenvector, and row data adjusted is the mean-adjusted data, transposed so that each row is one variable and each column is one sample.
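The selection and projection steps can be sketched as below. The two-variable toy data, the eigenpairs (which are those of the covariance matrix of that toy data), the 0.85 threshold, and the function names are all assumptions for illustration; the text leaves the "satisfactory level" of explained variance to the analyst.

```python
import math


def select_components(eigenpairs, threshold):
    """Keep leading eigenpairs until the cumulative explained variance reaches threshold."""
    ordered = sorted(eigenpairs, key=lambda pair: pair[0], reverse=True)
    total = sum(value for value, _ in ordered)
    kept, cumulative = [], 0.0
    for value, vector in ordered:
        kept.append((value, vector))
        cumulative += value / total
        if cumulative >= threshold:
            break
    return kept


def project(row_feature_vector, row_data_adjusted):
    """Final data = row feature vector x row data adjusted (equation 8)."""
    samples = list(zip(*row_data_adjusted))  # one tuple of variable values per sample
    return [[sum(w * x for w, x in zip(vector, sample)) for sample in samples]
            for vector in row_feature_vector]


s = 1.0 / math.sqrt(2.0)
# Mean-adjusted toy data: two variables (rows) measured on four samples (columns).
row_data_adjusted = [[1.5, -1.5, 0.5, -0.5],
                     [1.5, -1.5, -0.5, 0.5]]
# Eigenpairs of the covariance matrix of the toy data, deliberately unsorted.
eigenpairs = [(1.0 / 3.0, (s, -s)), (3.0, (s, s))]

kept = select_components(eigenpairs, threshold=0.85)
row_feature_vector = [vector for _, vector in kept]
final_data = project(row_feature_vector, row_data_adjusted)
```

With these values a single PC is retained, and each column of `final_data` is the score of one sample on that PC.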

In this study, a MATLAB-based computational algorithm is developed to compute RQ-mode PCA, following the steps mentioned in flowchart diagrams (figures 2 and 3). R-mode PCA is primarily based on variables and is suitable for identifying associations between variables and a set of observations (elements). It processes the covariance matrix and creates new orthogonal linear combinations that preserve the variance of the original variables. These new combinations account for successively decreasing portions of the variance, allowing for dimensionality reduction. On the other hand, Q-mode PCA is primarily based on observations (samples) and is suitable for characterizing samples. It analyzes the covariance matrix to identify patterns and relationships among samples.

Figure 2

The process to carry out the PCA of any given dataset in general.

Figure 3

The process to carry out the PCA of the given dataset in MATLAB (where p is the number of samples and q is the number of variables).

RQ-mode PCA is a method that calculates both variables and object loadings simultaneously, combining aspects of both R-mode and Q-mode PCA. For this particular study, RQ-mode PCA computation is chosen due to its detailed analysis of sedimentary processes. It allows for the characterization of different elements and their associations with the process under investigation. Furthermore, it enables the identification of different rock types based on geochemical datasets. Using RQ-mode PCA, the researchers can gain valuable insights into the complex relationships and patterns within the geochemical dataset, aiding in the understanding and interpretation of sedimentary processes and rock types.

5 Results and discussion

In the present study, PCA was applied to two distinct geochemical datasets, the SCM and the DMS, with the aim of understanding the dominant processes influencing these sites. For the SCM dataset, a total of 30 geochemical data points, including major and trace elemental compositions, were analysed from the previous study (Joshi et al. 2021c). The major and trace elements retain important clues to sedimentary processes in the SCM locality. For the DMS dataset, major elements and selected trace elements were used to delineate the processes acting on the region on a temporal scale (Banerji et al. 2021a). The variations in elemental composition in the DMS dataset reflect processes tracked by geochemical proxies, such as in-situ productivity, paleo-weathering, and sediment source. The detailed implications of these geochemical proxies on a temporal scale are discussed in Banerji et al. (2021a).

The developed algorithm was subsequently implemented on the SCM and DMS datasets, and the eigenvalues of the PCs were calculated. The contributions of each PC to the total variance of the dataset were estimated, and these results are presented in the Supplementary file (tables S1 and S2). These contributions provide valuable insights into the importance of each PC in explaining the variability within the dataset and can help to identify the most significant factors or processes influencing the SCM and DMS localities.

This research helps us identify the major geochemical factors that influence sediment composition and indicate the sedimentary settings in which the sediments formed. We must normalize the data before doing the PCA in order to avoid any kind of error during the analysis and to make the contribution of each variable proportional to the analysis; otherwise, it might influence the geochemical trends.

5.1 Shimla and Chail metasediments (SCM)

In PCA of the SCM dataset, a scree plot (figure 4a) was generated, showing a total of 29 PCs. The scores of the observations were depicted as symbols, while the loadings of the different elements were plotted in figure 4(b and c). From the scree plot, it was evident that the first nine PCs showed an elbow point, which collectively accounted for 85.53% of the total variance. The contributions of these nine PCs were as follows: PC1 (20.86%), PC2 (19.75%), PC3 (11.90%), PC4 (7.42%), PC5 (7.06%), PC6 (5.56%), PC7 (5.35%), PC8 (4.67%), and PC9 (2.96%).
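The cumulative percentages quoted for the SCM dataset follow directly from the individual contributions listed above, as the short check below shows. The dictionary simply restates the reported values; the 10% cut-off used in the last step is only a convenient way of separating the three leading PCs from the rest.

```python
# Contributions (%) of the first nine PCs of the SCM dataset, as reported in the text.
contributions = {
    "PC1": 20.86, "PC2": 19.75, "PC3": 11.90,
    "PC4": 7.42, "PC5": 7.06, "PC6": 5.56,
    "PC7": 5.35, "PC8": 4.67, "PC9": 2.96,
}

# Running total of explained variance, PC by PC.
cumulative = []
running = 0.0
for name, percent in contributions.items():
    running += percent
    cumulative.append((name, round(running, 2)))

first_three = cumulative[2][1]   # PC1 + PC2 + PC3
first_nine = cumulative[-1][1]   # PC1 through PC9

# PCs that each explain at least 10% of the total variance.
leading = [name for name, percent in contributions.items() if percent >= 10.0]
```

The running total reproduces the 85.53% quoted for the first nine PCs, and only the first three PCs individually exceed 10%.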

Figure 4

(a) Scree plot for eigenvalues against PC, (b) PC1 vs. PC2 biplot, and (c) PC2 vs. PC3 biplot for metasediments from Chail–Shimla group.

Considering that most of the PCs had contributions <10%, we focused on explaining the total variability of the data using the first three PCs: PC1, PC2, and PC3. These three PCs together accounted for 52.51% of the total variation in the dataset. Additionally, for simplicity and to highlight the most significant relationships, only two combinations of PCs were taken into consideration: PC1 vs. PC2 and PC2 vs. PC3 (figure 4b and c). These plots reveal the major patterns and associations between variables in the dataset. By selecting the first three PCs and plotting these specific combinations, the researchers aimed to capture the most important information while reducing the complexity of the analysis.

Upon careful analysis of the contribution from PC1, it becomes clear that a significant number of major oxides and trace elements in the dataset demonstrate a positive correlation with PC1. Additionally, PC1 exhibits positive loadings for elements such as K, Rb, Th, Ba, and LREEs (Light Rare Earth Elements), which are typically considered incompatible elements and are indicative of rocks with felsic composition. Interestingly, SiO2 shows a slight negative loading on PC1. This, in combination with the positive loadings for incompatible elements, might suggest an intermediate source for the rocks in the dataset. Furthermore, major oxides and trace elements that have a strong affinity with mafic to intermediate rocks exhibit positive loadings on both PC1 and PC3, while showing negative loadings on PC2 (as seen in figure 4b and c). These loading patterns provide valuable insights into the relationships and characteristics of different rock types present in the dataset, helping to identify their composition and potential sources.

Due to their higher compatibility, K2O, Na2O, and CaO are typically enriched in feldspars. The relative enrichment of K2O and depletion of Na2O along PC1 and PC2 suggest that K-feldspar is the primary repository of potassium and the predominant feldspar in the SCM, as compared to sodic plagioclase. This observation aligns with the higher K2O/Na2O ratios found in bulk rock geochemistry (Joshi et al. 2021b). Furthermore, the close association of Al2O3 and TiO2 along the positive PC2 axis suggests that phyllosilicates are the main carriers of these elements in the SCM. The fact that both phyllosilicates and K-feldspar display positive loadings along PC2 and PC3 further supports their role as major reservoirs for K2O, Al2O3, and TiO2, as also noted by Joshi et al. (2021b) based on oxide correlations.

Studies have indicated that heavy minerals, such as zircon, apatite, and titanite, which possess higher partition coefficients for Rare Earth Elements (REEs), can influence the concentration of trace elements in the SCM (Armstrong-Altrin et al. 2012). The enrichment of Light Rare Earth Elements (LREEs) and Heavy Rare Earth Elements (HREEs) along both PC1 and PC2 suggests that these accessory minerals control the REE budget of the SCM. The distribution of least mobile incompatible elements, such as REEs, HFSEs (High Field Strength Elements), Th and Y, can reflect the provenance of the sediments and help differentiate between various lithologies (McLennan 1989; Cullers 1994; Taylor and McLennan 1995; Large et al. 2018). The enrichment of Th, U, Zr, and Sc along positive PC1 and PC2 suggests the influence of reworking and recycling of felsic to intermediate sources. The slight negative loading of SiO2 with PC1, along with the positive loadings of MgO, Fe2O3, Co, Ni, Th, and U, might indicate the possible contribution of intermediate rocks as a source for the studied sediments. These findings shed light on the origin and composition of the SCM sediments, providing valuable insights into the processes that have shaped their geochemical characteristics.

5.2 Diu mudflat sediments (DMS)

In the PCA of the DMS dataset, the scree plot (figure 5a) displayed a total of 13 PCs. It showed an elbow point at the first three PCs, which collectively accounted for 79.30% of the total variance. The contributions of these three PCs were as follows: PC1 (48.94%), PC2 (15.64%), and PC3 (14.72%). Because PC2 and PC3 explain comparable amounts of variance, two combinations of PCs were considered for further analysis: PC1 vs. PC2 and PC2 vs. PC3 (figure 5b and c). In the biplots for these combinations, the scores of the observations are displayed as symbols and the loadings of the different elements are plotted. The loadings on PC1 show that the oxides account for most of the variation in this component compared to the other elements. Together, PC1 and PC2 account for 64.58% of the total variability in the dataset. These findings indicate the major factors contributing to the variations in the DMS dataset and allow a better understanding of the geochemical characteristics of this region.

Figure 5. (a) Scree plot of eigenvalues against PC, (b) PC1 vs. PC2 biplot, and (c) PC2 vs. PC3 biplot for Diu mudflats.

Total Organic Carbon (TOC) and Cu both correlate positively with PC1 and PC2 in the PCA of the DMS dataset. In marine, coastal, and lacustrine sediments, TOC is widely regarded as an indicator of in-situ productivity (Tribovillard et al. 2006; Chandana et al. 2017; Banerji et al. 2019, 2021b), although it is susceptible to degradation over time. Cu is delivered to the sediments through organometallic complexes and serves as an additional proxy for in-situ productivity. The close association between TOC and Cu, and their enrichment towards the positive axes of both PC1 and PC2, indicates that variations in in-situ productivity have had a significant impact on the geochemical composition of the DMS sediments.

The enrichment of Cu, Ba, TiO2, Co, and Ni along the positive axis of PC1 suggests that these elements derive from similar lithologies. Hayashi et al. (1997) associated TiO2 with minerals such as olivine, pyroxene, hornblende, biotite, and ilmenite. The ferromagnesian trace elements Cr, Ni, and Co generally behave similarly during magmatic processes, although weathering may fractionate them (Feng and Kerrich 1990). Nevertheless, they are more abundant in mafic igneous rocks and their weathering products (Armstrong-Altrin et al. 2004; Joshi et al. 2021c). The simultaneous enrichment of Ni and Cr in the floodplain sediments of the Cauvery River has been interpreted as a possible indication of a mafic origin (Singh and Rajamani 2001). Furthermore, the concurrent enrichment of Fe2O3, CaO, and SiO2 along the negative axes of PC1 and PC2 points to the possible influence of an intermediate rock source. The hinterland of the Saurashtra peninsula comprises Deccan basalts, trachyte, rhyolite, granophyre, and pitchstone dykes, which are associated with mafic dolerite dykes at Sirohi-Palitana (Chatterjee and Bhattacharji 2001) and picritic dykes at Dedan (Krishnamacharlu 1972). In addition, granophyre, rhyolite, and obsidian have been reported at Barda (Cucciniello et al. 2019), and a sequence of rhyolite, pitchstone, and basaltic andesite lava flows at Osham (Sheth et al. 2012). A combination of these rock types and other sources (Banerji et al. 2021a) likely produced the intermediate and mafic rock signatures in the DMS geochemical data.

The enrichment of K2O and MgO along the positive axes of PC2 and PC3 indicates enhanced weathering intensities. Climate plays a pivotal role in sediment weathering, while other factors, such as the nature of the source rocks, microbes, and relief, also significantly influence the geochemical composition of sediments (Nesbitt and Young 1982; Taylor and McLennan 1985; McLennan et al. 1993; Joshi 2014; Madhavaraju et al. 2016). Notably, K2O and MgO normalized to Al2O3 are extensively used as paleo-weathering proxies in sediment cores from coastal, marine, and lake environments (Banerji et al. 2017, 2019, 2021b; Bhushan et al. 2018). Furthermore, the positive axis of PC3 shows enrichment in Al2O3, SiO2, and Na2O, suggesting a prevalence of clayey textures derived from terrestrial sources, particularly plagioclase minerals. These findings emphasize the significant contribution of sediment sources originating from the hinterland of the Saurashtra peninsula.
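The Al2O3 normalization mentioned above amounts to a simple ratio per sample. The sketch below is illustrative only: the function name `al_normalized_proxies` and the oxide values are hypothetical and are not taken from the DMS dataset, and how the ratios are read as weathering indicators depends on the setting and on the cited studies.

```python
def al_normalized_proxies(oxides_wt_pct):
    """Return K2O/Al2O3 and MgO/Al2O3 for one sample.

    `oxides_wt_pct` maps oxide names to concentrations in wt%.
    """
    al2o3 = oxides_wt_pct["Al2O3"]
    if al2o3 <= 0:
        raise ValueError("Al2O3 must be positive to normalize against it")
    return {
        "K2O/Al2O3": oxides_wt_pct["K2O"] / al2o3,
        "MgO/Al2O3": oxides_wt_pct["MgO"] / al2o3,
    }

# Hypothetical samples (wt%), for illustration only.
samples = [
    {"Al2O3": 15.0, "K2O": 3.0, "MgO": 1.5},
    {"Al2O3": 12.0, "K2O": 1.2, "MgO": 2.4},
]

proxies = [al_normalized_proxies(sample) for sample in samples]
```

Normalizing to Al2O3 expresses each oxide relative to a comparatively immobile component, so down-core changes in the ratios are less affected by dilution than the raw concentrations.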

In summary, PC1 is controlled mainly by in-situ productivity and the mafic source, with a smaller contribution from the intermediate source. PC2 is influenced by weathering proxies, while PC3 is predominantly governed by the clayey fraction derived from plagioclase minerals in the Saurashtra peninsula. Together, these factors shape the geochemical composition and variations observed in the sediments under study.

6 Conclusions

Geological processes govern the elemental assemblages recorded in geochemical datasets, but the large volume of data reflecting multiple geochemical processes makes these assemblages difficult to interpret. From our provenance study using high-dimensional bulk-sediment geochemical data from Lesser Himalayan rocks (Shimla and Chail groups) and the Diu mudflats, we draw the following conclusions:

  • In the metasediments of the LHS region, the eigenvectors of the PCs show that accessory minerals play a crucial role in controlling the trace element budget. Reworking and recycling of felsic to intermediate sources is also suggested for the studied sedimentary cover in the Lesser Himalayan region.

  • In the Diu mudflats, the first three PCs indicate an intermediate to mafic source associated with processes involving olivine and pyroxene. The outcome of the present work underscores the significance and applicability of PCA to high-dimensional geochemical datasets in the geosciences.