1 Introduction

Adulteration is a growing concern that impacts enterprises, producers, consumers, and the economy in general [1]. Estimated losses due to adulteration of products are higher than USD$250 billion, and in the food industry, the losses exceed USD$49 billion in adulterated products [2]. Adulteration includes the replacement of the original material with lower-cost products, defective material, or residues of the same or different plants, harmful substances, or synthetic products that do not meet official standards. Food adulteration may be either intentional (direct) aiming to obtain a financial profit or unintentional (indirect) due to a defective production process. In any case, food adulteration represents fraud and therefore constitutes an illegal practice [3].

Among the food products susceptible to adulteration, spices and condiments are high-value products with an international market, and are widely used as flavorings and as food and beverage colorings [4]. Regarding the spices market, between 2014 and 2015, the US FDA (Food and Drug Administration) reported more than 20 adulteration cases in condiments and food spices, causing their withdrawal from the market due to latent danger to consumers [5, 6]. Previous studies have identified cassava, corn starch, and wheat as frequent adulterants in powdered condiments that include garlic, ginger, onion, black pepper, cumin, and seasonings in general. As an example, recent studies have identified starch as the main adulterant used in black pepper (up to 50% of dry weight) [7]. Cheaper adulterants are added to obtain an economic benefit by diluting more valuable ingredients, at the expense of consumers’ health [5, 7].

Among the most popular spices, paprika (Capsicum annuum var. Longum) is a spice native to the Americas that is currently cultivated throughout the world. Originally, paprika powder was obtained from the pericarp of the fruit to be used as food coloring, seasoning, and flavoring. Currently, oleoresin is also extracted from this fruit to be used as a natural dye in the food, poultry, dairy, feed, canning, bakery, and cosmetic industries, among others [8]. Paprika oleoresin is characterized by its high content of vitamin C and carotenoids (\(\beta\)-carotene, \(\beta\)-cryptoxanthin, capsanthin, capsorubin, violaxanthin, and others), which are the pigments responsible for its characteristic colors, while capsaicinoids are the spicy compounds present in the fruit [9, 10].

According to Guillen et al. [11], the anatomical parts of the Capsicum annuum fruit are the peduncle, pericarp, seed, and placenta. The main part used to obtain the powdered dye should be the pericarp, and it is commonly adulterated with the addition of other parts of the same fruit (e.g., peduncle, seed, and placenta) to reduce production costs [4]. Furthermore, the fraudulent addition of other materials that are not considered food additives has been identified, for example, lead oxide and synthetic dyes [12]. Such illicit actions generate exorbitant profits for those who practice them and represent a significant threat to the health of consumers [3, 4]. The health hazard that an impure food product represents encourages the adoption of robust strategies to detect fraudulent adulteration.

Numerous techniques have been developed to detect and estimate adulteration that vary depending on the product, the method employed to measure discriminative properties, and the techniques employed to analyze samples. Table 1 presents some examples of popular adulteration detection methods.

Table 1 Common techniques to detect and estimate adulteration in food products

Among the techniques mentioned in Table 1, spectroscopy in general, and near-infrared spectroscopy (NIRS) in particular, is attracting interest due to advantages that include low cost, simplicity, non-destructive measurement, and speed [4, 17]. NIRS measures the molecular vibrations of target products through the absorption of light quanta, which generates a spectral signature (“fingerprint”) that is reproducible, is distinct for different raw materials, and, in many cases, can be employed to determine the purity or the level of adulteration of food products [2]. In other words, molecular vibrations are related to the conformation, structure, molecular interactions, and chemical bonds of materials, and NIRS probes these chemical bonds through the overtones and combination bands of specific functional groups [29, 30]. Previous studies report the usefulness of NIRS to ensure effective food supply surveillance and its ability to reduce or detect food forgery or adulteration, mainly in the spectral range between 780 and 2500 nm (approximately 12 800 to 4000 \(\textrm{cm}^{-1}\)) [2, 19, 20, 31].
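The correspondence between the wavelength scale (nm) and the wavenumber scale (\(\textrm{cm}^{-1}\)) used for NIR ranges can be checked with a short calculation. The following Python sketch is only illustrative; the function name is hypothetical and does not come from the original study.

```python
def nm_to_wavenumber(wavelength_nm: float) -> float:
    """Convert a wavelength in nanometres to a wavenumber in cm^-1."""
    if wavelength_nm <= 0:
        raise ValueError("wavelength must be positive")
    # 1 cm = 1e7 nm, so wavenumber (cm^-1) = 1e7 / wavelength (nm)
    return 1.0e7 / wavelength_nm


# Limits of the NIR region quoted in the text (780 to 2500 nm)
nir_limits_cm1 = [nm_to_wavenumber(w) for w in (780.0, 2500.0)]

# Limits of the detector range reported for the instrument (1100 to 2500 nm)
detector_limits_cm1 = [nm_to_wavenumber(w) for w in (1100.0, 2500.0)]
```

Because the conversion is reciprocal, the shortest wavelength corresponds to the largest wavenumber, so a range written in nm appears in reverse order when written in \(\textrm{cm}^{-1}\).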

On the other hand, NIRS presents some drawbacks, e.g., wide absorption bands, weak absorption peaks, serious multicollinearity, and, in some cases, reflection artifacts caused by the geometry of the fruits [32]. To address these problems, chemometric techniques are commonly employed to extract information using tools inherited from signal and image processing: pre-processing before creating a model, the extraction or selection of variables, and, finally, modeling the relationship between the input variables and the properties to be measured [33]. Likewise, it is well known that, for better practical results, it is necessary to select wavelengths and use pre-processing methods to remove non-informative variables, thereby producing simpler and more accurate models [34,35,36,37]. Some of the most popular and efficient models in NIRS are principal component regression (PCR) and partial least squares regression (PLSR) [38]; additionally, the nonlinearity of neural networks is commonly advantageous for certain problems [39]. In fact, due to the high dimensionality and spatial complexity of the matrices extracted from food products using NIRS, in certain applications it is important to use tools that extract nonlinear information, and self-organizing methods commonly provide efficient solutions.

Relevant examples of chemometric models that deal with nonlinearities are neural network approaches, such as the well-known multilayer perceptron and recurrent neural networks (RNN) [40]. In particular, among the RNN models, long short-term memory (LSTM) networks preserve information from previous states in hidden layers and use it in the prediction and classification processes [41]. Initially developed for speech recognition, LSTM networks have aroused great interest due to their potential for applications in spectrum discrimination [42]. The combination of RNN and spectral profiles has made it possible to address various food-related tasks, such as the prediction of storage time in black tea [43], the quantification of Clostridium sporogenes spores in food products [44], and the detection of moisture levels in individual corn seeds [45], among others.

In this paper, a methodology is proposed for the evaluation of chemometric techniques that employ NIRS to estimate the percentage of two common adulterants in paprika powder, namely peduncle and seeds. Additionally, three of the most popular chemometric techniques were compared using the proposed methodology, demonstrating that NIR spectroscopy can be employed to detect adulterants that are part of the ground red peppers used to produce paprika powder. Finally, the experiments provide evidence of the feasibility of using partial least squares regression, multilayer perceptron, and LSTM networks to estimate the percentage of adulteration in paprika powder.

2 Materials and methods

2.1 Raw material

A local spice producer provided a sample composed of whole mature (Paprika) ground red peppers. The sample was cleaned, and all fruits with visual defects on the surface were removed. The resulting materials were stored in plastic bags, hermetically sealed and in dark conditions, to reduce the possibility of damage from moisture or light.

Fig. 1 Paprika chili parts used in the study

2.2 Experimental methodology

The experimental methodology followed in the present study is shown in Fig. 2 and detailed in the subsequent subsections. In general, samples were prepared to measure separately the target product (pericarp) and the adulterants (pedicel or peduncle, and seed cake). Then, NIR spectral profiles were extracted, and a standard pre-treatment was applied to the measurements. Models were built from these samples, and performance metrics were computed for comparison.

Fig. 2 Experimental procedure to evaluate the paprika adulterant prediction models

2.3 Sample preparation

Eleven kilograms of Paprika ground peppers were dried in dark conditions, and at \(60\,^{\circ }\textrm{C}\), for 30 days, until 10% moisture was obtained. The fruits were then divided into pericarp, seeds, and peduncle, removing all the placenta from the previously separated parts.

The pericarp and peduncle were milled, using a hammer mill company SRL at 1450 RPM, and sieved with a 4-mm ASTM sieve. The seeds were passed through the oil extraction process, the oil was separated, and the seed cake was milled similarly to the pericarp and peduncle. The powder produced for each subproduct was stored separately in hermetically sealed plastic bags and maintained under dark conditions.

2.4 Adulteration of samples

Different treatments were prepared as shown in Table 2, where the target pericarp powder was adulterated with peduncle and seed cake in different percentages. In both cases, adulterated samples were mixed using an Ultra-Turrax (IKA T25) at 4500 rpm for 1 min.

Table 2 Experimental treatment for each sample case

2.5 NIRS profiles extraction

The different treatments, prepared from pericarp, peduncle, and seed cake, were measured in 30 repetitions each, giving a total of 630 samples (that is, 21 treatments). The spectral profile was determined for each of the 630 measurements.

The measurement of each sample followed the methodology reported by Yoplac et al. [33]. In this study, a Unity Scientific NIR spectrometer (SpectraStar 2500XL, USA) was used, equipped with a tungsten halogen lamp as a light source and an InGaAs (indium gallium arsenide) detector covering the range from 1100 to 2500 nm, with a resolution of 1 nm.

Measurements were made in reflectance mode applied directly to the mixtures without pre-treatment or manipulation, using a quartz cuvette of \(3.5\,\textrm{cm}\) internal diameter and \(1.0\,\textrm{cm}\) thickness, to which \(3.2 \pm 0.3\,\textrm{g}\) of the sample was added.

2.6 Pre-treatment

As reported in [39, 46], in most cases the extracted spectral profiles contain noise and variability due to capture conditions, and spectral enhancements, such as smoothing, centering, and normalization, are required to clean up the profiles.

In this process, the following combinations of standard pre-treatments were applied:

  • Smoothing. The spectra were smoothed using a second-order Savitzky–Golay filter with a frame length of eleven points, according to Eq. (1):

    $$\begin{aligned} x'=\frac{1}{N} \sum ^{n}_{\lambda =1}{C_{\lambda }(x_{\lambda })}, \end{aligned}$$
    (1)

    where x′ is the smoothed profile; x is the original spectrum; \(C_{\lambda }\) is the convolution coefficient applied at position \(\lambda\) of the window; \(\lambda\) is the wavelength in analysis; and N is the normalizing integer of the convolution.

  • Centering and normalization. The distribution of samples is centered and normalized according to Eq. (2), to reduce the variation in the baseline due to the dispersion of light.

    $$\begin{aligned} x'_{\lambda }=\frac{x_{\lambda }-\bar{x}}{S_{x,\lambda }}, \end{aligned}$$
    (2)

    where x, \(x'\) are the original and corrected profiles, respectively; \(\lambda\) is the wavelength to be analyzed; and \(S_{x,\lambda }\) is the standard deviation of the profiles at a specific wavelength.
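The two pre-treatments above can be sketched in a few lines of Python (the study itself used MATLAB, so this is not the authors' code). The sketch assumes the tabulated Savitzky–Golay weights for a quadratic polynomial and an 11-point window, whose normalizing integer is 429, and it reads Eq. (2) as column-wise autoscaling, with the mean and the sample standard deviation computed per wavelength across samples. Both the function names and the handling of the window edges are assumptions.

```python
from statistics import mean, stdev

# Tabulated Savitzky-Golay convolution weights for a quadratic polynomial
# and an 11-point window; they sum to the normalizing integer 429.
SG_COEFFS = (-36, 9, 44, 69, 84, 89, 84, 69, 44, 9, -36)
SG_NORM = 429


def savgol_smooth(spectrum):
    """Smooth one spectrum with the 11-point quadratic filter (Eq. 1).

    Assumption: the five points at each edge, where a full window is
    not available, are left unchanged.
    """
    half = len(SG_COEFFS) // 2
    smoothed = list(spectrum)
    for j in range(half, len(spectrum) - half):
        window = spectrum[j - half : j + half + 1]
        smoothed[j] = sum(c * x for c, x in zip(SG_COEFFS, window)) / SG_NORM
    return smoothed


def autoscale(spectra):
    """Center and scale each wavelength (column) across samples (Eq. 2).

    Assumption: the mean and the sample standard deviation are taken
    per wavelength; constant columns are mapped to zero.
    """
    columns = list(zip(*spectra))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds)]
        for row in spectra
    ]
```

A typical use would be `autoscale([savgol_smooth(s) for s in spectra])`, where `spectra` is a list of equal-length absorbance lists, one per sample.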

2.7 Models training

The models were trained by implementing functions and routines in the mathematical software MATLAB (R2022a). This stage was divided into the two steps detailed in Sects. 2.7.1 and 2.7.2.

2.7.1 Full models training

Using the profiles corrected in the previous stage, adulteration-prediction models were trained for one or two adulterants at a time. The models implemented included the following:

  • Partial least squares regression (PLSR). This is one of the most widely used methods to predict food properties from hyperspectral images and vibrational spectrometry data [39, 47]. PLSR takes an input matrix X, in our case with dimensions \([m \times n]\), where m is the number of observations and n is the number of wavelengths, and an output vector Y, which contains the percentage of adulterant. X and Y are decomposed, by projection, into new directions, with the constraint that the decomposition must describe the joint variation of both variables as much as possible. After the decomposition, a regression step is performed in which the decomposed X and Y are used to calculate a regression model, called the full model [48]; see Eq. (3).

    $$\begin{aligned} Y = X\beta + e , \end{aligned}$$
    (3)

    where X and Y are the input and output variables, \(\beta\) is the vector of regression coefficients, and e is the error. For the implementation of this model, the PLS toolbox of MATLAB was used.

  • Multilayer perceptron (MLP). This type of supervised learning network is widely used for prediction and classification and is generally composed of three layers [39]. The input layer receives the intensity values, which are distributed through a transfer function to the processing elements (neurons) of the second, or hidden, layer. Commonly, in the hidden layer, the incoming values are transformed by a nonlinear sigmoid transfer function and propagated to the third, or output, layer, where the prediction results are obtained [49]. The architecture of the multilayer perceptron is depicted in Fig. 3.

  • Recurrent neural network (RNN)-based regression. The regression model based on recurrent networks used long short-term memory (LSTM) networks. Following the pyramidal principle proposed by Vázquez et al. [39], the model consisted of an input layer with i input neurons, an LSTM layer with j units, a fully connected layer, and a regression layer; this network structure is illustrated in Fig. 4. The training was carried out using the stochastic gradient descent with momentum (SGDM) algorithm, which computes the gradients of the loss function with respect to the weights and adjusts the weights to minimize the loss, over 600 epochs with a learning rate of 0.005 (LearnRate = 0.005). The training was repeated 30 times in a k-fold (\(K=5\)) cross-validation strategy, and the results of each run were stored for the later calculation of metrics.
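To make the PLSR step and Eq. (3) more concrete, the following Python sketch fits a single-response PLS model with the classical NIPALS iteration and returns the regression vector \(\beta\). It is a minimal illustration written with NumPy, not the MATLAB PLS toolbox used in the study; the function names and the choice of NIPALS are assumptions.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model with NIPALS.

    Returns (beta, x_mean, y_mean) so that y is approximated by
    (X - x_mean) @ beta + y_mean.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                 # direction of maximum covariance with y
        norm = np.linalg.norm(w)
        if norm < 1e-12:
            break                     # nothing left in X that covaries with y
        w = w / norm
        t = Xk @ w                    # scores
        tt = t @ t
        p = Xk.T @ t / tt             # X loadings
        c = (yk @ t) / tt             # y loading
        Xk = Xk - np.outer(t, p)      # deflate X
        yk = yk - c * t               # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    if not W:
        return np.zeros(X.shape[1]), x_mean, y_mean
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, x_mean, y_mean


def pls1_predict(X, beta, x_mean, y_mean):
    """Predict the response for new spectra with a fitted model."""
    return (np.asarray(X, dtype=float) - x_mean) @ beta + y_mean
```

In this setting, `X` would hold the pre-treated spectra (one row per sample, one column per wavelength) and `y` the adulterant percentage; the number of components is a tuning parameter that the text does not report.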

Fig. 3 Architecture of the multilayer perceptron employed in experiments

Fig. 4 Structure for the RNN-LSTM regression model

2.7.2 Model optimization

According to Blanco and Villarroya [29], the analytical information contained in the wide and often overlapping bands in the NIR spectrum is hardly selective. For this reason, it is important to choose relevant bands in actual chemometric applications. In this sense, relevant variables were determined.

This step starts with the selection of the relevant variables. Although many selection methods exist, the \(\beta\) coefficient method was used in this experiment. This method ranks the variables by their contribution to the PLSR regression model, as measured by their coefficients \(\beta\), following the strategy applied by [48]. Consequently, the result of this stage is a new set of spectral profiles, named trimmed profiles, which contain only the intensity values of the relevant variables.

Finally, the models were optimized according to the work of [33, 39], and new models were trained [48] with the trimmed profiles.
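The idea behind the \(\beta\) coefficient selection can be illustrated with a short Python sketch that keeps the wavelengths whose regression coefficients have the largest absolute value and builds the trimmed profiles from them. The selection rule shown here (keeping a fixed fraction of the variables) and the parameter `keep_fraction` are hypothetical; the criterion actually applied in the study follows [48] and is not detailed in the text.

```python
def select_by_beta(wavelengths, beta, keep_fraction=0.2):
    """Return the indices of the wavelengths with the largest |beta|.

    keep_fraction is a hypothetical tuning parameter; the indices are
    returned in spectral order.
    """
    if len(wavelengths) != len(beta):
        raise ValueError("wavelengths and beta must have the same length")
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, round(keep_fraction * len(beta)))
    ranked = sorted(range(len(beta)), key=lambda i: abs(beta[i]), reverse=True)
    return sorted(ranked[:n_keep])


def trim_profiles(spectra, selected):
    """Keep only the selected columns (wavelengths) of every spectrum."""
    return [[row[i] for i in selected] for row in spectra]
```

The trimmed profiles returned by `trim_profiles` would then be used to retrain each model, as described for the optimized models.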

2.8 Models comparison

The different models were tested using a k-fold cross-validation strategy with \(k=5\). Then, in the same way as [3, 33, 39], the root-mean-square error (\(RMSE_{cv}\)), the coefficient of determination (\(R^2_{cv}\)), and the ratio of performance to deviation (\(RPD_{cv}\)) were computed. These statistical metrics are defined in Eqs. (4)–(6).

$$\begin{aligned} \textrm{RMSE}_\textrm{cv}= & {} \sqrt{\frac{1}{n}\sum _{i=1}^{n}(y_i-\hat{y}_i)^2}, \end{aligned}$$
(4)
$$\begin{aligned} R^2_\textrm{cv}= & {} 1-\frac{\sum _{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum _{i=1}^{n}(y_i-\bar{y})^2}, \end{aligned}$$
(5)
$$\begin{aligned} {\text {RPD}}_{\text{cv}}= & {} \frac {\text{SD}} {{\text {RMSE}}_{\text{cv}}}, \end{aligned}$$
(6)

where \(\hat{y}_i\) and \(y_i\) are the predicted and reference percentages of adulteration of the ith sample, respectively; \(\bar{y}\) is the mean of the reference values; n is the number of samples; and SD is the standard deviation of Y. The sub-index \(_\textrm{cv}\) indicates that the statistical measure was computed following the cross-validation strategy.
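The three metrics, together with a simple 5-fold partition, can be sketched in Python as follows. The sketch uses the standard definition of the coefficient of determination, \(R^2 = 1 - SS_{res}/SS_{tot}\), and assumes that SD is the sample standard deviation of the reference values; the function names and the shuffling seed are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev


def kfold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and split them into k disjoint test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def rmse(y_true, y_pred):
    """Root-mean-square error between reference and predicted values."""
    return sqrt(mean([(yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)]))


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


def rpd(y_true, y_pred):
    """Ratio of performance to deviation, SD / RMSE (sample SD assumed)."""
    return stdev(y_true) / rmse(y_true, y_pred)
```

In a cross-validation run, each fold returned by `kfold_indices` would serve once as the test set, and the metrics would be computed on the predictions collected for the held-out samples.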

3 Results and discussion

3.1 NIR spectra profile

Figure 5a shows all spectral profiles collected from the different samples, including distinct treatments and repetitions; a direct relation between wavelength and absorbance was observed. Figure 5b shows the mean spectral profiles of the materials used; the peaks of the three profiles appear at similar locations. The difference between the average profiles lies in the average level of absorbance: the pericarp profiles present the highest absorbance, followed by the peduncle, and the seed cake profile presents the lowest. Similarly, nine peaks are common to all parts, around 1205, 1460, 1725, 1761, 1930, 2100, 2303, 2351, and 2485 nm. Among these peaks, the first five are related to absorbance bands in the overtone region, and the remaining four lie in the combination region.

Fig. 5 Spectral profiles for (a) the whole sample set including repetitions and (b) average sample profiles for each paprika part (peduncle, pericarp, and seed cake)

The absorption peaks at 1460 and 1930 nm are related to the –OH stretch and the –OH stretch/deformation combination wavebands, respectively, both mainly due to the presence of water in the samples. The peak at 2100 nm lies within the waveband of –NH deformation, associated with the presence of proteins and/or peptides; this same peak also lies within the range of the C–O and O–H stretching combination, related to the presence of carbohydrates. The peaks at 1205, 1725, and 1761 nm lie within the ranges of –C\(\textrm{H}_2\) and –C\(\textrm{H}_3\) stretch, related to the presence of lipids. Finally, the peaks at 2303, 2352, and 2485 nm correspond to the wavebands of methylene and –CH stretch, also related to the presence of lipids [50].

The absorption bands of capsicum composition can be analyzed and identified by their spectral behavior; when mixed with other components, such as water, carbohydrates, proteins, and lipid content, the characteristics may change. This observation is supported by research papers, such as [3], which evaluate the adulteration of spices using NIR to detect spectra that can increase or decrease depending on the variation in composition. Research articles such as [11] evaluate the pericarp and the non-edible portion (seeds, placenta, and interlocular septum) of two capsicum varieties, finding differences in the amounts of capsaicin and dihydrocapsaicin. These differences may explain the variations in spectra in the NIR range observed in our study. This, coupled with the highly sensitive nature of NIR spectroscopy, gives it the potential to differentiate mixtures, but also the potential to detect differences between varieties. Studies like [9], that use HPLC, show the variations in the carotenoid composition in capsicum varieties that can be determined using NIRS.

Whereas the spectral differences can be attributed to the composition of the mix, there are no specific functional groups that qualitatively differentiate the capsicum samples without an appropriate statistical multivariate analysis.

3.2 Models building

3.3 Partial least square regressions

Figure 6 shows the results obtained with the PLSR model when predicting the different percentages of pericarp powder adulterated with peduncle and seed cake. Figure 6a and b show the regression for the pericarp powder adulterated with different percentages of the peduncle using all the wavelengths of the NIR spectrometer. The most relevant wavelengths, used to generate the optimized PLSR model, are shown in Fig. 6c and d. Finally, the real versus estimated levels of adulteration for the optimized PLSR models, when the different adulterants are added to the mixture, are shown in Fig. 6e and f.

Fig. 6 Evaluation of the PLSR-based models for adulterant prediction

Regarding the most relevant wavelengths, those in the range between 1600 and 2000 nm stand out, with the most prominent absorption bands at 1725 nm and 1761 nm, which are related to the C=O functional group. Similar wavelengths have already been documented and linked to capsaicin and other capsaicinoids [10]. Partitioning the full spectrum to reduce random noise and computational complexity, while extracting the available information to enhance the model’s capacity, is a recommended practice for achieving better prediction models [34].

3.4 Multilayer perceptron models

On the other hand, Fig. 7 shows the results of multilayer perceptron models, both the complete and optimized models, to predict the adulteration with the adulterants in pericarp powder; Fig. 7a–d for peduncle adulteration, and Fig. 7b–d when seed cake is used.

Fig. 7 Multilayer perceptron models for adulterants prediction

3.5 Long short-term memory-based models

Finally, Fig. 8a and b show the results of the regression model based on recurrent neural networks, using the LSTM network, for the prediction of pericarp powder adulterated with peduncle and seed cake, respectively.

Fig. 8 LSTM models for adulterants prediction

The significance of comparing learning methods lies in the fact that each product exhibits different behaviors. In our case, when evaluating capsicum, where capsaicin and dihydrocapsaicin are the predominant compounds, we must not overlook the presence of other compounds that should be considered during the evaluation. Compounds such as colorants have been documented in these capsicum varieties [10], which generates the need for prediction models to be tailored specifically to each product.

3.6 Models comparison

The results of the predictive models built with the full and optimized spectral profiles support the feasibility of NIRS to quantify the adulteration of pericarp powder with peduncle and seed cake. Figure 9 shows box plots of the statistical metrics for each of the five analyzed models. According to Fig. 9a and b, the prediction models obtained \(R^2\) higher than 0.96 or 0.90 for the content of the peduncle or seed cake, respectively. The prediction of the peduncle content showed a poorer fit, mainly when relevant variables were used. Furthermore, fully optimized PLSR models present lower variability in \(R^2\) compared to neural network-based models (Fig. 9a and b).

Fig. 9 Models’ metrics for adulterants prediction

The optimized PLSR models obtained the lowest RMSE values for the prediction of both adulterant percentages, peduncle powder (6.23) and seed cake (5.76); therefore, this model showed greater precision in the calibration set and classification. In the same way, Fig. 9c and d show that the variability of the RMSE in the neural network-based models, MLP and LSTM, was comparatively higher.

Finally, the RPD values in Fig. 9e and f showed that all models could be considered reliable (\(\textrm{RPD} > 2\)); this was mainly true for PLSR models. However, this metric, which compares the estimation error with the standard deviation of the reference values, exhibits significant variation for neural network-based models, which leads to the conclusion that models based on neural networks may not fit properly in these experimental settings. Other studies that evaluated models like those used here found that RMSE values were better when using neural networks; this occurs when working with samples with high humidity, such as corn [45], or with dry samples containing different active compounds, such as tea [43].

The optimized PLSR model showed the best predictive indicators in the study; this coincides with the results reported by [19] in their study of dry goods and by [6] in onion powder. Both studies affirm that combining NIR with appropriate multivariate analysis can produce reliable results. These findings emphasize the need to create specific models for each product, considering the unique composition of the product under analysis. In the case of the Capsicum genus, the generation of capsaicin and dihydrocapsaicin oleoresins is specific to the pericarp, with the shape and composition differing in the peduncle and seeds, which could be the reason for the differentiation in the NIR spectra.

4 Conclusions

In this paper, the feasibility of estimating the level of adulterant in paprika powder was studied. The proposed methodology was evaluated on pericarp powder adulterated with peduncle and seed cake powders, using NIRS in conjunction with PLSR and neural network-based models. The most significant wavelengths for adulterant estimation were found within the range of 1600–2000 nm, with the most notable absorption bands at 1725 nm and 1761 nm, which are related to the functional group C=O. In general, the predictive models based on PLSR outperformed those based on neural networks, although all of them obtained \(R^2\) values higher than 0.95. Based on RMSE and RPD values, optimized PLSR was shown to be the most effective among all predictive models. In conclusion, NIR spectrometry can be used to estimate the level of adulteration in paprika powder when the adulterants include peduncle or seed cake powder, and the highest performance was achieved by coupling it to PLSR rather than to models based on neural networks.

Further research may include the exploration of automatic methods for wavelength selection, to automate the whole process. It may also be interesting to explore more sophisticated algorithms that involve higher-dimensional representations of knowledge, at the expense of higher computational resources. Finally, the application of the proposed methodology to other food products, such as garlic, ginger, onion, black pepper, and other seasonings, remains to be considered.