Introduction

Sesame (Sesamum indicum L.) is one of the oldest domesticated oilseed crops [1]. It is grown in tropical and sub-tropical areas over 7.54 million hectares worldwide, yielding 3.3 million tons of seed [2].China especially is one of the most important producers of sesame, accounting for approximately a 25% share in the world’s total production [3]. Pure and genuine Chinese sesame oil is dark brown with high contents of oleic, linoleic, palmitic and stearic acids, as well as low concentrations of palmitoleic, linolenic and eicosenoic acids [4, 5]. Sesame oil is highly valued for its unique and pleasant flavor and high nutrition and health functions. In China, pure and authentic sesame oil is traditionally regarded as a high-quality vegetable oil and serves as an important source of edible and cooking oils. Unfortunately, in the domestic market, pure sesame oil is sometimes adulterated with some cheaper vegetable oils, such as soybean oil, corn oil, rapeseed oil and palm oil. Therefore, a continuing demand exists for rapid, inexpensive and effective techniques for routine authenticity analysis of sesame oil products.

The quality and chemical composition of sesame oil depend on various factors, including varieties, fields, processing and storage, producing areas, etc. [613]. Numerous researches have been devoted to the investigation of the chemical compositions and properties of sesame oils influenced by such factors. Such investigations are crucial for understanding the quality of sesame oils. However, it is still difficult to find a few marker components that can unambiguously indicate the origins and species of raw seeds. In traditional sensory analysis [14, 15], the quality of oils can be evaluated by professional tasters on a sensory board. The tasters are experienced specialists who evaluate product quality based on the flavor, color, and taste of oil. Because it may take years and is very expensive to train a skilled oil taster, it would be attractive to evaluate oil quality by some nonhuman techniques.

In recent years, it has seen increasing interest in applying spectroscopic and chemometric tools to food authentication [1618]. The principle of such techniques is that chemical compositions of complicated samples are characterized by multi-channel analytical signals; then useful information concerning food quality can be extracted by multivariate analysis methods. Compared with the most widely used near infrared (NIR) spectroscopy techniques, mid-infrared possesses some advantages, including similar speed of measurement, moderate instrument cost and relative ease of sample presentation, especially for liquid and paste samples [1821].

This work was aimed at developing a reliable and quick analytical method for authentication of Chinese sesame oils by Fourier-transform infrared (FTIR) spectroscopy and chemometric methods, focusing mainly on the following aspects. First, the sampling procedure should be representative and comprehensive to include most if not all of the significant variations likely to be encountered in the authentic materials [18, 19]. In this paper, real and pure sesame oils from the six main producing areas in China were collected to make up a sufficient training set. Second, it was pointed out that food authentication can be reduced to the question of whether an unknown object should be accepted or rejected by a class of interest. The objective of a class model is to describe the representative samples belonging to the class and answer the above question based on predictions of the unknown objects. In these cases, the commonly used discriminant analysis (DA) methods devoted to differentiating two or more predefined classes are of limited use because food authentication will be confronted with various known and unknown adulterations [22]. Therefore, instead of DA, class modeling techniques (CMT) [22, 23] were used in this paper. Finally, based on the above two considerations, the objective of data analysis was to enhance class models to detect even very subtle signal difference caused by low doping concentrations. Soft independent modeling of class analogy (SIMCA) [24] and a recently proposed partial least squares class model (PLSCM) [25] were used to develop class models. To remove undesirable factors in the raw data, smoothing [26], derivatives [26] and standard normal variate (SNV) [27] were investigated to optimize data preprocessing. For data analysis, given the performances of different models are similar or have no significant differences, the models with least complexity and least preprocessing procedures were sought to ensure the generalization of models.

Materials and Methods

Sample Preparation

A set of 95 pure and authentic sesame oil samples from the six most important sesame producing areas of China were collected from domestic markets, including Henan (18), Hebei (16), Anhui (18), Hubei (16), Shandong (15) and Jiangsu (12). All the pure sesame oil samples were obtained from the cold pressing of sesame seeds. Samples were stored in a cool, dark area before spectrometry analysis.

Eighteen cheaper and common vegetable oils, including rapeseed oil (5), soybean oil (6), palm oil (5) and peanut oil (2) were purchased from domestic markets for adulteration. Then adulterated sesame oil samples were prepared by mixing the pure sesame oils with different ratios of the cheaper oils ranging from 2 to 20%. For each doping level, a supersaturated design [28] was used to select 18 pure sesame oil samples, each of which was adulterated with one of the 18 cheaper oils. Therefore, with doping levels at 2, 3, 4, 5, 8, 10 and 20 percent (w/w), 126 adulterated sesame oil samples were prepared and used as negative test samples. The doped samples are stored under the same conditions as the pure sesame oil samples before spectrometric analysis.

FTIR Spectrometric Analysis

The FTIR transmission spectra were measured within the spectral range from 4,000 to 400 cm−1 on a Nicolet Avatar 360 infrared spectrophotometer (Thermo Scientific, Waltham, MA) with a DTGS KBr detector. No further preprocessing of the oil samples was performed, and the spectra were measured in a KBr demountable absorption cell (150 μm) without any solvents. The resolution was 4 cm−1, and the scanning interval was 1.929 cm−1. Therefore, each spectrum had 1,868 individual data points for chemometric analysis. The scanning time was set to be 64, because an increase in scanning time did not significantly improve the quality of signal. Some of the raw spectra are demonstrated in Fig. 1.

Fig. 1
figure 1

FTIR spectra of pure and adulterated sesame oil; the doping levels ranged from 2 to 20%

Chemometric Data Analysis

The reliability of class models depends largely on the representativity of training samples. Therefore, robust principal component analysis (rPCA) [29, 30] was performed to detect and exclude the outliers. A robust PCA model can avoid the masking effects caused by the coexistence of multiple outliers. For pure sesame oils, the DUPLEX method [31] was used to split the measured spectra data into a representative training set and a test set. DUPLEX selects two samples with largest distance and puts them in the training set, then selects two other samples with largest distance among the remaining samples and puts them in the test set, and so on. By alternatively selecting the spectral data for the calibration set and the test set, DUPLEX gives data in the test set with a distribution almost equal to that of the training set.

Different data preprocessing techniques were investigated to optimize the training and predicting performance of class models. Smoothing was often used to reduce random noise in spectral data and enhance the signal-to-noise ratio (SNR). The polynomial fitting algorithm by Savitzky and Golay [26] was adopted for this purpose considering its popularity and simplicity. Moreover, taking derivatives can enhance spectral resolution and remove baseline and background, so first and second order derivatives were also investigated. Because derivatives tend to degrade the SNR by enhancing noise, the derivative spectra were also computed by polynomial fitting algorithms [26]. Standard normal variate (SNV) [27] was originally proposed to reduce scattering effects and was also proved to be effective in correcting the interference caused by variations of optical path. Therefore, SNV tests were performed to standardize the spectral variations caused by different liquid membrane thickness.

The classical SIMCA [24] and a recently proposed PLSCM method [25] were used to developing class models for authentication of pure sesame oils. For one-class problems, SIMCA describes the structure of the training samples by the principal components (PC) space spanned by a few significant PCs. The residual error is then subject to an F test procedure for estimating the decision region. The residual error tends to be underestimated when it is computed directly from PCA of the training samples. It was recognized that SIMCA could lead to a large number of objects that are wrongly rejected (a large α-error), therefore, the strategy suggested in Ref. [24] were adopted. By using scores predicted by leave-one-out cross-validation (LOOCV) instead of the original scores obtained after PCA on the class objects, this procedure was shown to lead to a reduction of the number of false outliers. PLSCM was demonstrated to have a comparable performance to SIMCA. It builds a PLS regression model relating the spectral variables to a response vector of 1 with all the elements being ones. The vector 1 used as a response vector means all the objects in the same class should be distributed as close to each other as possible. While SIMCA projects the data onto a few PCs explaining most of the data variances, PLSCM considers both the explained variances and compactness of a class. The predicting error of the response variable is assumed to have a normal distribution and used to deduce the decision region. Because PLSCM can be performed in the framework of multivariate calibration, the determination of the model complexity and the decision region for PLSCM seems more straightforward than for SIMCA. More details concerning the optimization of parameters in SIMCA and PLSCM will be presented later.

Results and Discussion

As shown in Fig. 1, the spectra of adulterated and pure sesame oils [32] have very similar absorbance bands in the range of 400–4,000 cm−1. Peak assignments for spectra of pure sesame oil as shown in Fig. 1 are listed in Table 1. Smoothing, first and second S-G derivatives and SNV were used to preprocess the raw spectra. Some of the raw and preprocessed spectra of pure and adulterated sesame oil are demonstrated in Figs. 2, 3. A significant difference between pure and adulterated sesame oil is the absorbance at 3,423 cm−1, however, the difference becomes very subtle with low doping concentrations. This can be further seen from the spectra of adulteration by peanut oils ranging from 2 to 20 percent in Figs. 2, 3. Therefore, class models are necessary to extract the detailed information from spectral data for characterizing pure sesame oils.

Table 1 Peak assignments for pure sesame oil
Fig. 2
figure 2

FTIR spectra of sesame oil adulterated with different levels of peanut oil; the difference in absorbance at 3,423 cm−1 becomes very difficult to distinguish with the naked eye with low doping concentrations

Fig. 3
figure 3

Preprocessed spectra of sesame oil adulterated with peanut oil of different levels by smoothing, taking derivatives and SNV

Because the FTIR spectral data were high-dimensional (for the raw spectra, p = 1,868), the improved rPCA [30] was used, which was shown to be numerically more stable for high-dimensional data and had a much lower computational cost. According to the computed score distance (SD) and orthogonal distance (OD), an rPCA diagnosis plot classifies the samples into four groups: regular data (with small SD and small OD), good PCA-leverage points (with large SD and small OD), orthogonal outliers (with small SD and large OD), and bad PCA-leverage points (with large SD and large OD). Outlier detection was performed on the raw spectra by rPCA with a significance level of 0.05. The number of PCs was determined by robust pooled predicted residual sum of squares (PRESS) values. Because the first seven PCs accounted for 95.35% of the total data variance, rPCA used seven PCs for outlier detection. The rPCA diagnosis plot of the 95 pure sesame oil samples is shown in Fig. 4. OD is a measure of the distance from the sample to the model space spanned by selected PCs and SD describes the sample dispersion in the class projected onto the model space. Therefore, both orthogonal outliers and bad PCA-leverage points should be excluded from the training set. Because the pure sesame oil samples came from different producing areas, there might be considerable difference in the contents of different chemical components. Therefore, good PCA-leverage points should be reserved to represent the spectral variations among the pure sesame oil samples from different producing areas. In Fig. 4, four orthogonal outliers were detected and excluded. The DUPLEX algorithm was then used to split the remaining 91 pure sesame oil samples into a training set with 70 samples and a test set of 21 samples. Therefore, the test set had 21 positive samples and 126 negative samples.

Fig. 4
figure 4

Robust PCA outlier diagnosis plots for 95 pure sesame oil samples from Henan (18), Hebei (16), Anhui (18), Hubei (16), Shandong (15) and Jiangsu (12) of China

PLSCM and SIMCA were applied to model the pure sesame oil samples. For SIMCA, the decision region proposed in [24] was adopted to reduce the risk of having a large number of objects wrongly rejected. Cross validation was performed to evaluate the number of significant PCs, the criterion of 95% total explained variances was also considered. For PLSCM, Monte Carlo cross validation (MCCV) [33] with 10% samples left out was used to determine the number of PLS components and the sampling time was 100. The PRESS values by MCCV were subject to the F test proposed in [34, 35]. As suggested, a significance of 0.25 was adopted to select the least number of latent variables with a PRESS value not significantly larger than the minimum value according to the F test. Sensitivity and specificity were used to evaluate the performance of different models and preprocessing options. The prediction results of test samples by SIMCA and PLSCM were summarized in Table 2. Seen from Table 2, preprocessing generally improved the classification performance in terms of sensitivity and specificity. However, the models based on smoothed spectra had inferior performance, which might be attributed to the possible loss of detailed frequency information [36]. First derivative, second derivative and SNV significantly enhanced the class models by reducing the baseline and backgrounds. The model complexity of SIMCA and PLSCM models based on such preprocessing was less than those based on the smoothed and the raw spectra. For both PLSCM and SIMCA, the best class models were obtained by taking second derivative spectra and the prediction results were demonstrated in Figs. 5, 6. In Figs. 5b, 6b, the adulterated sesame oil samples were arranged according to an ascending doping concentration, namely 2, 3, 4, 5, 8, 10, and 20%. The wrongly accepted samples (false positives) had a doping concentration of 2% and all the adulterated samples with a doping concentration equal to or higher than 3% were correctly detected. Moreover, as expected, the distance of an adulterated sample to the critical value increases with doping concentration, indicating the identification of a doping level over 3% was straightforward. The comparison of different preprocessing methods demonstrated that the spectral variations caused by scattering effects and baseline shifts played a more important role than an inferior SNR.

Table 2 Results by SIMCA and PLSCM with different data preprocessing methods
Fig. 5
figure 5

Predictions of test samples by second derivative SIMCA with 6 PCs for a 21 pure sesame oils and b 126 adulterated sesame oils in an ascending order of doping levels

Fig. 6
figure 6

Predictions of test samples by second derivative PLSCM with 7 components for a 21 pure sesame oils and b 126 adulterated sesame oils in an ascending order of doping levels

Conclusions

This paper has demonstrated that FTIR combined with chemometric class models provide a reliable and practical method for detecting adulterated sesame oils. Although it is difficult to perform an exhaustive sampling of all types of pure sesame soils and potential adulterations, PLS and SIMCA proved themselves to be successful at authenticating pure sesame oil samples from the six main producing areas in China. Moreover, the data pretreatments of derivatives and SNV were effective in reducing the influence of spectral variations caused by baseline shifts and scattering effects. The combination of FTIR and class modeling methods were able to detect adulterations with doping concentrations over 3%, which is enough to identify most of the current adulterations in Chinese markets. Wavelength selection might be useful to further enhance the class models to detect even lower levels of adulteration, however, care should be taken when performing a wavelength selection. Because too few spectral variables would fail in characterizing oils sufficiently, the use of a class model would be restricted. How to make a tradeoff between sufficient characterization of samples and proper variable selection/reduction will be the focus of our future investigations. Moreover, because the diverse origins of samples can increase model complexity and degrade model performance, we will attempt to combine clustering methods with class models to tackle this problem.