Introduction

The soy (Glycine max (L) Merril) is an essential component on animal feeding and on human nutrition. Soy is a complete food, because it has contents such as proteins (38%), carbohydrates (27%), lipids (19%), moisture (11%), and ashes (5%), which can present vitamins, mineral salts, and fiber (Embrapa 2017). With soy, therefore, it is possible to produce a diversity of products for human alimentation, e.g, cooked soybeans, diet foods, non-greasy flour, soy butter, soy flakes, and soy beverages (Rigo et al. 2015). Moreover, soy is an alternative for biodiesel production (Castanheira et al. 2015).

Soy-based beverages (SBBs) are made from the hydrosoluble soy extract (EHS) or from the isolated soy protein (PI), both different producing methods start with soybeans and end up producing toxic followed by the addition of fruit juice and other additives. Soy extract comes from an aqueous emulsion after soybean hydration, with an adequate technological procedure aiming the preservation of proteins, a part of soluble carbohydrates, unsaturated fats, vitamins, and minerals naturally present on soy. It is possible to add some ingredients to give color and different flavors, then turning the product into a beverage similar to a juice, whereas isolated soy protein results from degreased soy flour extraction, after a previous removal of the grains with no protein content. Isolated soy protein has approximately 90% of proteins in dry base. Despite having more proteins than soy extract, being more concentrated, diluting isolated soy protein not only makes a consumable product but also basically gives the same properties of EHS, and the final product offers protein amounts similar to extraction by portion (Cabral et al. 1981).

Through the determination of major components, also known as centesimal composition (moisture, protein, ash, lipid, carbohydrates, and fiber), it is possible to make the evaluation of nutritional quality on foods. In the case of soy-based beverage, regular analysis include determination of volatile substances, ashes, lipids, proteins, sugars, and food fiber (Zenebon et al. 2008); hence, the importance of the determination of total sugars. However, traditional methods of analysis are usually slow, produce toxic residues, and require materials, equipment, and many chemical reagents that are dangerous for both analyst and environment.

With that said, attenuated total reflectance Fourier transform infrared (ATR-FTIR), combined with a multivariate analysis by partial least square regression methods, is an alternative for physical-chemical analysis of soy-based beverages, which are fast, with low cost, without chemical reagents, and without sample preparation. There are examples on literature where quantitative determination of antioxidants in agricultural products happens after the combination of those techniques cited above (Cozzolino 2015): nitrogen rate determination on soy sauce (Xu 2016), soy content measurement that is added on meat (Jiang et al. 2017), and quantitative analysis of lectine adulteration with soy (Monakhova et al. 2015).

The difficulty of using medial infrared to the quantification of analytes in complex matrices is that there are overpositioning of spectra bands. To solve this withdraw, we applied multivariate calibration methods, as the technique of partial least square (PLS). Once using deterministic algorithms, i.e., interval partial least square regression (iPLS) and synergy of interval partial least square (siPLS), it is possible to select relevant spectral regions, improving then the efficiency of the multivariate calibration technique applied on infrared data. In this way, it is feasible to make regression models with better efficiency and fewer errors on calibration and prevision (Muller et al. 2011; Muller et al. 2012; Almeida et al. 2015; Ruschel et al. 2015; Hu et al. 2017).

According to Instituto Adolfo Lutz, the determination of total sugars in soy-based beverage is currently done through a titration method called Lane-Eynon (Zenebon et al. 2008). Once this method is slow, needs many chemical reagents, turning the analysis expensive, and not environmentally friendly, an alternative is the focus of the present work, which aims to develop a method to determine total sugars by ATR-FTIR associated to multivariate calibration, on raw sample, in other words, not using any pretreatments on the beverage. In order to produce models with fewer prevision errors of total sugars, we applied selection methods of variables as iPLS and siPLS.

Material and Methods

Twenty (n = 20) soy-based beverages of different brands (n = 3) and flavors (pineapple, orange, orange with mango, apple, passion fruit, peach, grape, strawberry, yellow fruits, and original soy milk) were obtained in the local market. The 20 samples acquired represent more than 80% of the samples that can be found in Brazil. From each one of the bought samples, three more samples were prepared with a crescent addition of water (5, 10, and 20%), summing 80 samples. Water was added to expand the detection range of total sugars in the products, since the drinks can be diluted as fraud, and thus, we can quantify within the range that we are acting.

Reference data to total sugar rate, expressed in percentage (w/w) of total sugars in glucose, were obtained through the reference method Lane-Eynon (Zenebon et al. 2008); the method is based on the digestion of the protein-free beverage sample, followed by titration against the boiling Fehling A and B solutions, until they pass from blue to colorless.

The analysis of all samples happened on a Fourier transformed infrared spectrometer (FTIR) Perkin Elmer Spectrum 400, with resolution of 1 cm−1 and 16 scanning with accessory of attenuated total reflectance on medial infrared region, spectral range of 1899 to 900 cm−1. Analysis happened in triplicate. The pretreatment consisted in spectrum smoothing using Savitsky-Golay (13 points window, polynomial of first degree), then normalizing it, applying the first derived (using Savitsky-Golay with a window of 13 points, polynomial of first degree) and calculating the average for all triplicates through ChemoStat® software (Helfer et al. 2015). With algorithms of PLS, iPLS, and siPLS, we constructed the regression models by employing iToolbox package in Matlab 8.1 environment. By applying Kennard-Stone algorithm, calibration and prediction sets were selected (Kennard and Stone 1969).

The iPLS is an extension developed for the PLS, where a partial least square regression is made at each equidistant interval over the full extent of the spectrum. In this way, the importance of the information in the different spectral subdivisions is evaluated, from where it is possible to identify and select the interval whose variables show the most important information. A PLS model is built for each interval, and the results are presented in graph form, thus facilitating comparison with the entire spectral range (Xiaobo et al. 2010).

The siPLS is an extension of the iPLS algorithm. This algorithm consists of the division of spectrum into equidistant regions (intervals) followed by combinations of these intervals, allowing models to obtain better determination coefficients (R2) and smaller calibration errors and prediction errors than those found by iPLS.

To obtain models of calibration and prevision, regression techniques of PLS, iPLS, and siPLS were applied. In order to use all spectrum information, PLS was applied over the whole spectrum (1899 to 900 cm−1). On the iPLS algorithm application, spectra were divided into two, four, and eight equidistant intervals, whereas on siPLS, the spectra were divided into eight intervals and combined in 2 by 2, 3 by 3, and 4 by 4.

The models were determined by the lower values of RMSECV and root mean square error of prediction (RMSEP) and higher determination coefficients (R2). In order to choose the number of latent variables (LVs), we evaluated the error of the internal cross-calibration RMSECV models. To evaluate the predictive capacity between the models, we use the RMSEP external forecast set error (Ferreira 2015).

Results and Discussion

The descriptive statistics of the total sugars can be found in Tables 1 and 2. According to reference method Lane-Eynon, sample results range from 1.00 to 7.92% (w/w) of total sugars in glucose.

Table 1 Descriptive statistics of the total sugars by method Lane-Eynon
Table 2 Descriptive statistics of the total sugars by Lane-Eynon method

Infrared spectra from all 80 samples of SBB used to develop regression models are shown in Fig. 1. In Fig. 2, we can see the effect of the dilutions with increasing water addition (5, 10, and 20%) in a sample, which extends the detection range of total sugars. Table 3 shows the values of RMSECV, RMSEP, R2cal, R2prev, and number of latent variables (LVs) from the best-developed methods.

Fig. 1
figure 1

Infrared spectra of the 80 soy-based beverage (SBB) samples

Fig. 2
figure 2

Infrared spectra of a sample to show the effect of the dilution. a Sample in nature, no water added. b Sample with addition of 5% water. c Sample with addition of 10% water. d Sample with addition of 20% water

Table 3 Merit figures for the different partial least square (PLS) models

To the PLS regression, which makes use of the spectral region from intended studied sets, the best result presented a RMSECV of 0.1770% for 9 LVs and R2cal of 0.9821. For the prevision samples, PLS model got RMSEP of 0.1199% and R2prev of 0.9492.

On other hand, iPLS divides the spectrum in equidistant regions and develops a model for each subinterval. Therefore, iPLS model indicates which spectrum region is more relevant and eliminates spectral regions that does not have relevant information about the researched analyte. The best iPLS presented smaller value of RMSECV when compared against PLS model, however a bigger value of RMSEP. The model constructed with the interval corresponding to a range of 1399–900 cm−1 (interval 2 of the spectrum divided into two parts) produced the best values, with RMSECV of 0.1678% for seven LVs and R2cal of 0.9839. To the samples of prevision, PLS model got RMSEP of 0.1326% and R2prev of 0.9320. The selected spectral region includes axial C-O stretching absorption that are characteristically from sugars (since they are polyalcohols), ranging from 1260 to 1050 cm−1 (Pavia et al. 2009), showing that it is the most representative band on total sugar determination on samples of soy-based beverages. Thus, it is not possible to determine that the model used the specific region, because the band is very wide.

By the fact that the iPLS regression model present a limitation of not ensuring selectivity, since it does not evaluate information distributed in more than one interval and the best-selected interval was wide, the regression by siPLS turns to be a good alternative, because it allows interval combinations that are either continuous or not. In comparison with previous models, this one present smaller error, as a result of the interval synergism that enables intervals’ selection where places better specific absorption of the studied analyte.

The best siPLS model was the one that divided the spectrum in eight parts and combined the subintervals 6 and 7 (corresponding to spectral intervals of 1025–1150 and 1151–1476 cm−1, with 6 LVs), which selected intervals and presented RMSECV = 0.1963%, RMSEP = 0.1325%, and determination coefficients (R2cal and R2prev) of 0.9779 and 0.9337, respectively. Selected intervals included axial C–O stretching absorption bands (1260–1050 cm−1), characteristic of sugars and polyalcohol itself (Colthup et al. 1990; Pavia et al. 2009). Then, it indicates that this is a representative band on the determination of total sugars on samples of soy-based beverages, because it selected a specific absorption region of the analytes.

On the presented results basis, iPLS with seven LVs and siPLS with six LVs are the most adequate models to use on total sugar determination, expressing results in percentage (w/w). As previously presented on graphics of LVs versus RMSECV and LVs versus RMSEP (Figs. 3 and 4), it is clear that the error reduces largely, and then stabilizes, indicating adequate number of LVs for the studied models, and also because the errors decrease is inexpressive when new LVs are included on the model, which could super adjust it. It is important to say that the matrix of analysis (SBB) is definitely complex component wise: proteins, vitamins, fats, metals, flavoring, dyes, and sugars such as fructose, glucose, sucrose, thickeners as gum and other carbohydrates, which increases the model number of LVs.

Fig. 3
figure 3

a Root mean square error of prediction (RMSEP) values versus number of LVs and b root mean square error of cross-validation (RMSECV) values versus number of LVs for interval partial least square (iPLS) model, using the interval 2, which corresponds to a range of 1399–900 cm−1, on the determination of total sugars in soy-based beverages (SBBs) through attenuated total reflectance Fourier transform infrared (ATR-FTIR)

Fig. 4
figure 4

a Root mean square error of prediction (RMSEP) values versus number of LVs and b root mean square error of cross-validation (RMSECV) values versus numbers of LVs for synergy of interval partial least square (siPLS) model using intervals 6 and 7, which correspond to spectra intervals 1025–1150 and 1151–1476 cm−1, on the determination of total sugars in soy-based beverages (SBBs) by attenuated total reflectance Fourier transform infrared (ATR-FTIR)

Other fact that justifies choosing such models is that the chosen absorption region includes the C–O stretch absorption band, majoritarian chemical function on molecules studied samples.

Figures 5 and 6 present a correlation between real value and expected value of iPLS and siPLS models here developed: in Fig. 5, on interval 2, and on Fig. 6, on the subintervals 6 and 7, which show themselves models with the best percentage (w/w) determination capability of total sugars in soy-based beverages.

Fig. 5
figure 5

Reference values by Lane-Eynon method versus expected values from interval partial least square (iPLS) model for total sugars using interval 2, interval that corresponds from 1399 to 900 cm−1

Fig. 6
figure 6

Reference values by Lane-Eynon method versus expected values from synergy of interval partial least square (siPLS) model for total sugars using intervals 6 and 7, spectral bands 1025–1150 and 1151–1476 cm−1

Conclusion

After combining ATR-FTIR spectroscopy with regression algorithms as PLS, iPLS, and siPLS, developing multivariate models became possible on the determination of total sugars on soy-based beverage, without sample pretreatment. Algorithms iPLS and siPLS showed to be the most suitable on the variable selection, then combining them to regions with more relevant information for the intended quantification. Both iPLS and siPLS were efficient, although RMSECV value remained on the same magnitude, there was a reduction on the number of LVs, improving robustness of the model. The proposed model may be considered as simple, fast, and environmentally friendly, because it does not requires chemical reagents. Therefore, turning it a low-cost model is useful on the percentage (w/w) determination of total sugars in glucose on SBB.