Objective chemical fingerprinting of oil spills by partial least-squares discriminant analysis

Gómez-Carracedo, M. P.; Ferré, J.; Andrade, J. M.; Fernández-Varela, R.; Boqué, R.

doi:10.1007/s00216-012-6008-5

Original Paper
Published: 25 April 2012

Volume 403, pages 2027–2037, (2012)
Analytical and Bioanalytical Chemistry

Abstract

An objective method based on partial least-squares discriminant analysis (PLS-DA) was used to assign an oil lump collected on the coastline to a suspected source. The approach is an add-on to current US and European oil fingerprinting standard procedures that are based on lengthy and rather subjective visual comparison of chromatograms. The procedure required an initial variable selection step using the selectivity ratio index (SRI) followed by a PLS-DA model. From the model, a “matching decision diagram” was established that yielded the four possible decisions that may arise from standard procedures (i.e., match, non-match, probable match, and inconclusive). The decision diagram included two limits, one derived from the Q-residuals of the samples of the target class and the other derived from the predicted y of the PLS model. The method was used to classify 45 oil lumps collected on the Galician coast after the Prestige wreck. The results compared satisfactorily with those from the standard methods.

Introduction

Oil spills affect the environmental conditions of ecosystems, seriously threatening the overall life cycle of every species. To enforce liability actions for oil lumps, both identification of the source of the spilled hydrocarbon and monitoring of its fate are required. However, the large variability in the composition of crude oils and the complex ageing processes oil suffers in nature, especially at sea, make identification of the most probable source of a spillage a difficult problem. The US Coast Guard Research and Development Center pioneered the first systematic analytical procedure after increased social environmental awareness and resulting regulations [1, 2]. In the last four decades, abundant literature on the fate, effects, and sources of spilled oils and petroleum products has been published [3–10]. The Nordic countries have collectively developed the so-called “Nordtest protocol” that used chemical fingerprinting techniques introduced in 1983 [11]. Since 1991 the Nordtest method on oil spill identification constituted a de facto international standard for oil spill identification [12, 13].

Further efforts were made on a European scale (the “Eurocrude” project, European crude oil identification system) to establish an oil-spill identification procedure based on proprietary software [14]. Its analytical procedure, similar to the Nordtest method, uses a sample to identify the pollution culprit. In 2002, Nordtest revised its methodology [12, 13] to provide a European standard for oil spill identification [15]. These analytical methods are very similar to the US-ASTM counterparts [16, 17] and the US-EPA GC-based techniques, which are intended to measure industrial chemicals [18, 19]. Modifications of these methods have been incorporated in multiple oil spill characterization procedures used over the past 40 years [7].

In essence, all analytical procedures rely on direct comparison of the spillage to several suspected sources [12, 13, 15] or to large databases of oil samples, which in turn are created by measuring hundreds of chromatographic and/or spectrometric variables. They focus on the molecular selectivity yielded by gas chromatography [7], either with flame-ionization detection (GC–FID) [20–22] or mass spectrometric detection [23–27]. Then, identification is based on tiered visual comparisons of chromatograms, n-alkane distributions, bar charts of PAH concentration, and double plots of diagnostic ratios among the chromatograms of the suspected sources and the study sample [12, 13, 15]. The main disadvantages are that the procedures are time-consuming (ca. 90 min per injection) and require highly skilled personnel, and that both the interpretation of the results and the conclusions drawn from visual comparison are somewhat subjective. The difficulties encountered in the matching process are reflected in the four results that can be reported when a sample is compared with one or several suspected source oils: “match”, “non-match”, “probable match”, and “inconclusive”.

Multivariate chemometric methods have rarely been applied to oil fingerprinting. Pasadakis et al. [28] discriminated among samples of petroleum spills and identified the refinery fractions that existed in the spills by using principal-component analysis (PCA) and k-means clustering. Lavine et al. [29] implemented a genetic algorithm for pattern recognition of fuel oil GC data. Doble et al. [30] used artificial neural networks to recognize premium and regular gasolines from GC–MS data. Urdal et al. [31] used four statistical methods to separate weathered crude oils on the basis of their geographic origin. Gaines et al. [32] used PCA to reduce the number of GC–MS peak ratios to differentiate among diesel fuel samples. In addition, several pattern-recognition methods were used to evaluate the origin of oils. Fonseca et al. [33] used Kohonen self-organizing maps (SOMs) to classify samples of crude oils on the basis of GC–MS descriptors in terms of geographical origin. Borges et al. [34] used unsupervised SOMs with a consensus criterion to classify weathered crude oil samples. Fernández-Varela et al. used MOLMAP [6] and Procrustes rotation (a variable selection technique) combined with PCA, cluster analysis, and self-organizing maps [35] to study PAHs (polycyclic aromatic hydrocarbons) associated with oil spills. Grueiro-Noche et al. [36] used three-way analysis (catenated-PCA, matrix-augmented principal-components analysis, parallel factor analysis, and Procrustes rotation) to assess the major sources of PAHs in seawater. Multiway methods were also used by Christensen et al. [37] and Arancibia et al. [38] to screen oil samples by excitation–emission fluorescence and phosphorescence, respectively. Zorzetti and Harynuk [39] estimated exposure time on the basis of the composition of a weathered sample of gasoline as observed by two-dimensional gas chromatography (GC × GC), and partial least-squares (PLS), nonlinear PLS (PolyPLS), and locally weighted regression (LWR). 
Finally, Lobão et al. [40] identified the source of a marine oil-spill by using geochemical and chemometric techniques, for example PCA and hierarchical component analysis (HCA).

In this work, multivariate supervised pattern-recognition is proposed as an alternative to visual comparison of chromatograms in order to assign an unknown sample to its most probable source. The multivariate approach consists of a variable-selection step using the selectivity ratio index (SRI) followed by partial least-squares discriminant analysis (PLS-DA). A “matching decision diagram” is derived that is used to designate the samples as “match”, “non-match”, “probable match”, and “inconclusive”. The procedure was used to ascertain whether a set of unknown oil lumps appearing on the NW Galician coastline (NW Spain) originated from a suspected source oil, in particular from the tanker Prestige–Nassau. The 153 analytical variables were the relative areas of 121 compounds measured by GC–FID and GC–MS, with 32 diagnostic ratios calculated among selected molecules. Classification by use of PLS-DA-based methodology was validated by comparison with standard procedures [12, 13, 15].

Materials and methods

Apparatus

Gas chromatography–flame-ionization detection

An HP 6890 instrument (Agilent Technologies, Palo Alto, CA, USA) with a split/splitless injector, a flame-ionization detector, and an HP-5 fused-silica high-resolution capillary column (J&W Scientific, Folsom, CA, USA) 30 m long × 0.25 mm i.d., 0.25 μm film thickness, was used. The analysis followed standard procedures [12, 13] and in-house statistical studies were conducted to validate the method [35]. Operating conditions were: starting oven temperature, 40 °C, held isothermally for 5 min, then raised to 310 °C at 6 °C min⁻¹ and held isothermally for 30 min. Carrier gas: helium, 2 mL min⁻¹ constant flow. Injector and detector temperatures were 275 °C and 325 °C, respectively. The detector was fueled with 360 mL min⁻¹ air flow and 30 mL min⁻¹ hydrogen flow. Injection was performed in the splitless mode; the injected sample volume was 1 μL. In total, 32 compounds (from n-C₁₀ to n-C₄₀, plus pristane and phytane) and four different ratios (n-C₁₇/pristane, n-C₁₈/phytane, n-C₁₇/n-C₁₈, pristane/phytane) were considered.
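As a quick check of the instrument time implied by this oven programme, the run length can be worked out from the two hold times and the ramp. The Python sketch below is illustrative only: the function name and its arguments are hypothetical, it assumes a single linear ramp, and it ignores oven cool-down and equilibration between injections.

```python
def oven_program_minutes(start_c, initial_hold_min, final_c,
                         ramp_c_per_min, final_hold_min):
    """Total run time (min) of a single-ramp oven programme."""
    ramp_min = (final_c - start_c) / ramp_c_per_min  # time spent heating
    return initial_hold_min + ramp_min + final_hold_min


# GC-FID programme described above:
# 40 °C held 5 min, 6 °C/min up to 310 °C, held 30 min.
gc_fid_minutes = oven_program_minutes(40, 5, 310, 6, 30)
print(gc_fid_minutes)  # 5 + 45 + 30 = 80 min
```

The programme itself therefore lasts 80 min, which is of the same order as the ca. 90 min per injection quoted in the Introduction.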

Gas chromatography–mass spectrometry

An HP 6890 instrument (Agilent Technologies) with a pulsed splitless injector, an HP 5973 mass spectrometry detector, and an HP-5MS fused silica capillary column (J&W Scientific) 60 m long × 0.25 mm i.d., 0.25 μm film thickness, was used. The m/z range for MS analysis was 40–440 amu. The analytical procedure was implemented by following SINTEF’s recommendations [13] after a detailed study of the robustness of the analytical conditions [41]. Operating conditions were: starting oven temperature, 40 °C, held isothermally for 1 min, then raised to 300 °C at 6 °C min⁻¹ and held isothermally for 30 min. Carrier gas: helium, 1 mL min⁻¹ constant flow. Injector and transfer line temperatures were 300 °C and 280 °C, respectively. Ionization energy: 70 eV; ion-source temperature, 230 °C. Injection was performed in the pulsed splitless mode; the injected sample volume was 1 μL. SIM (Selected Ion Monitoring) mode was used throughout. The US-EPA target polycyclic aromatic hydrocarbons (PAHs) and, in particular, the petroleum-specific alkylated (C₁–C₄) homologues of selected PAHs (the decahydronaphthalene, naphthalene, benzothiophene, fluorene, dibenzothiophene, phenanthrene, fluoranthene, and chrysene series) were analyzed [12, 13, 15]. In total, 50 PAHs were analyzed. These are detailed in Table S1 of the Electronic Supplementary Material.

In total, 39 biomarkers were considered: 19 hopanes, 13 steranes and diasteranes, and seven triaromatic steroids (Table S2, Electronic Supplementary Material). In total, 28 diagnostic ratios were calculated, as is conventional [13, 15]; these are reported in Table S3 of the Electronic Supplementary Material. Note that because we are dealing with a regulated issue, the analytical procedures must adhere to official guidelines [11, 15] and, thus, only minor adjustments to achieve optimization are allowed. We could not, therefore, perform every analysis on a single GC–MS instrument, although this is technically possible.

Samples

Two types of sample were considered:

1. Controlled spillages of six types of oil in special containers whose weathering processes were monitored over time. These were four crude oils (Ashtart, Brent, Maya, and Sahara Blend), a “marine fuel oil” (IFO), and the original fuel oil from the tanker Prestige. The original products without weathering and the weathered samples were analyzed. More details on controlled spillages can be found elsewhere [6, 35]. In total, 102 samples were obtained; 75 % of these (13 samples from each of the six oils) were used as a training set and the other samples (four from each oil) were used as the validation set. The most weathered samples were all included in the training set so that all the validation samples were bracketed by those in the training set.
2. Samples taken from 45 oil lumps beached on the Galician shoreline. These will be referred to as “beaches” and were studied by the chromatographic tiered standard procedures. They were taken after a major accident in the “Galician international corridor for hazardous goods”. This is among the most important routes for oil tankers worldwide. Most carriers transporting oil cargo from Central America, the Persian Gulf, and Africa to refineries in Northern Europe sail past Cape Fisterra (Galicia, NW Spain), only 25 miles off the shoreline. Accidental oil spills from tankers, accidental ballast release, and residues from ship bilge cleanup cause chronic pollution by hydrocarbons, which damages the marine environment. The assigned source of the samples is shown in Table 1 as “Prestige” (positive match to the Prestige source), “No Prestige” (non-match to that source), “probable match”, and “inconclusive” (no decision can be drawn), in accordance with normal terminology.
Table 1 Description of the samples taken from the Galician shoreline and their classifications by use of PLS-DA with and without variable selection

The chromatograms for all samples were reviewed manually to ensure correct integration for the target compounds. The integrated areas were then exported, by use of the chromatographs’ proprietary software, to a binary format, which can be read by any common spreadsheet application. The statistical studies were made using the PLS_Toolbox 6.0, under Matlab R2010a. The SRI subroutine was written in-house.

Partial least-squares discriminant analysis

Partial least-squares (PLS) regression is a multivariate latent-variable method that relates one dependent variable or predictand (y) to a set of independent variables, or predictors, X. Partial least-squares discriminant analysis (PLS-DA) is the application of PLS to classification problems in which y is a vector that codifies the class of each sample. The class label of an unknown sample is decided on the basis of the y value predicted by the PLS model. Ideally, the predicted y should be close to the coded class values (either 0 or 1 in this work). In practice, it is a real number and different approaches can be used to convert the predicted y into a class label [42–45].

In this work, PLS-DA was used as a class-modeling technique. The objective was to discover a set of latent variables that modeled the class of interest, here the Prestige samples, while maximizing discrimination from the other classes [46]. The class limits were defined both from the y-predictions of the PLS-DA model and from the sum of the squared x-residuals (Q-value or Q-residual, which describes the non-modeled part of the data). The limits for the y-predictions were set at mean + 3SD, where SD is the standard deviation of the cross-validated predictions for the samples of the class of interest in the training set. The Q-residual for each sample was calculated as the sum of the squared differences between the preprocessed X-data and the X-data predicted by the PLS-DA model for the selected number of factors. An upper limit for the Q-value was calculated from the standard deviation of the Q-residuals of the target class. The joint use of the y-cut-off value and the upper Q-limit enabled use of an ad-hoc diagram to take decisions that emulate the four conclusions that may arise from the standard procedures. A sample with unusual characteristics will produce either an extreme y-prediction, a large Q-value, or both, and so its classification as belonging to the target class Prestige might be questioned (i.e., “probable match” or “inconclusive”). Use of the diagram is discussed in the section “PLS-DA classification of oil lumps”, below.
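The two quantities that define the class limits can be sketched in a few lines of Python with NumPy. This is not the PLS_Toolbox code used in the study; the function names are hypothetical, and the scores T and X-loadings P are assumed to come from an already fitted PLS-DA model with A factors. The multiplier k of the standard deviation is left as a parameter.

```python
import numpy as np


def q_residuals(X, T, P):
    """Q-residual of each sample: sum of squared X-residuals.

    X : (N, n_vars) preprocessed data
    T : (N, A) scores and P : (n_vars, A) X-loadings of the fitted model
    (both assumed to be supplied by the PLS-DA software).
    """
    E = X - T @ P.T              # part of X not reproduced by the A factors
    return np.sum(E ** 2, axis=1)


def upper_limit(values, k):
    """Limit of the form mean + k * SD (sample SD, ddof=1)."""
    values = np.asarray(values, dtype=float)
    return values.mean() + k * values.std(ddof=1)


# Toy illustration with made-up numbers (not data from the study)
T_demo = np.array([[1.0], [2.0]])
P_demo = np.array([[1.0], [0.0]])
X_demo = np.array([[1.0, 0.0], [2.0, 1.0]])
print(q_residuals(X_demo, T_demo, P_demo))   # second sample is not fully modeled
print(upper_limit([1.0, 2.0, 3.0], k=3))
```

Whether the sample or the population standard deviation was used in the original work is not stated; `ddof=1` is an assumption of this sketch.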

Variable selection with the selectivity ratio index

PLS-DA models were improved by selecting the variables with the highest discriminant power. The optimum variable subset will, hopefully, associate specific variables with specific groups of samples, improve understanding and/or interpretation of classification patterns, and reduce the risk of misclassification as a result of noise and redundant data [47]. In the work discussed in this paper the selectivity ratio index (SRI) was used [48, 49]. The SRI is defined as the ratio of the explained variance (v_ex,p) to the residual variance (v_res,p) of a variable, p:

$$ \mathrm{SRI}_p = v_{\mathrm{ex},p} / v_{\mathrm{res},p} \tag{1} $$

For a given PLS-DA model, a target projection model is calculated as:

$$ \mathbf{X} = \mathbf{t}_{\mathrm{TP}} \mathbf{p}_{\mathrm{TP}}^{\mathrm{T}} + \mathbf{E}_{\mathrm{TP}} = \mathbf{X}_{\mathrm{TP}} + \mathbf{E}_{\mathrm{TP}} \tag{2} $$

where t_TP (N × 1) are the target-projected scores (N = number of samples) and p_TP (P × 1) are the target-projected loadings (P = number of variables). These are obtained as:

$$ \mathbf{t}_{\mathrm{TP}} = \mathbf{X} \mathbf{b}_{\mathrm{PLS}} / \lVert \mathbf{b}_{\mathrm{PLS}} \rVert \tag{3} $$

$$ \mathbf{p}_{\mathrm{TP}}^{\mathrm{T}} = \mathbf{t}_{\mathrm{TP}}^{\mathrm{T}} \mathbf{X} / \left( \mathbf{t}_{\mathrm{TP}}^{\mathrm{T}} \mathbf{t}_{\mathrm{TP}} \right) \tag{4} $$

where b_PLS (P × 1) are the regression coefficients of the PLS-DA model calculated for A factors. From Eq. (2), the explained variance for variable p, v_ex,p, is calculated from the pth column of X_TP and the residual variance for variable p, v_res,p, is calculated from the pth column of E_TP.

A vector of SRI values (for each PLS model) is made so that all variables can be compared graphically. A high SRI means that the variable has a high discriminative power to separate the groups considered in the PLS-DA model.
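Equations (1)–(4) translate almost line by line into NumPy. The sketch below is not the in-house Matlab subroutine; the function name is hypothetical, X is assumed to be already preprocessed (e.g., autoscaled), and the column-wise sums of squares are used for the explained and residual variances (a common divisor would cancel in the ratio).

```python
import numpy as np


def selectivity_ratio(X, b_pls):
    """SRI of every variable, following Eqs. (1)-(4).

    X     : (N, P) preprocessed data matrix
    b_pls : (P,) regression vector of the PLS-DA model
    """
    X = np.asarray(X, dtype=float)
    b = np.asarray(b_pls, dtype=float).ravel()
    t_tp = X @ b / np.linalg.norm(b)       # Eq. (3): target-projected scores
    p_tp = X.T @ t_tp / (t_tp @ t_tp)      # Eq. (4): target-projected loadings
    X_tp = np.outer(t_tp, p_tp)            # Eq. (2): explained part of X
    E_tp = X - X_tp                        # Eq. (2): residual part of X
    v_ex = np.sum(X_tp ** 2, axis=0)       # explained variance per variable
    v_res = np.sum(E_tp ** 2, axis=0)      # residual variance per variable
    return v_ex / v_res                    # Eq. (1)


# Toy example with two orthogonal, centred variables (made-up numbers)
X_demo = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])
print(selectivity_ratio(X_demo, [2.0, 1.0]))   # first variable dominates
```

A variable that is reproduced perfectly by the target projection has zero residual variance, so its SRI is infinite; a production version would need to guard against that division.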

Results and discussion

Preliminary PLS-DA with all the variables

A preliminary “Prestige vs all (Ashtart, Brent, Maya, Sahara Blend and IFO)” PLS-DA model was developed with the samples from the controlled spillages, considering the 153 chromatographic (autoscaled) variables. The model failed to discriminate Prestige samples from the rest, and most samples from the beaches could not be classified reliably because they had very large Q-residuals (Table 1). Subsequent “one class vs one class” PLS-DA models revealed that discriminating between Prestige and IFO was the most difficult challenge. The reason is that the IFO blend was very similar to the Prestige oil. Hence, a PLS-DA model was developed to discriminate between Prestige (coded as “1”) and IFO (coded as “0”) samples. The traditional RMSE (Root Mean Square Error) obtained by leave-one-out cross-validation (LOO-CV) was used to decide on the optimum number of latent variables (LV). Figure 1 shows the PLS-DA sample scores for the first and second latent variables (which explained 68.7 % of the variance of X and 99.4 % of the variance of y). The IFO samples could be clearly distinguished from the Prestige samples. However, the Ashtart, Brent, Maya, and Sahara oil samples projected on to the latent variables overlapped with the IFO and Prestige samples. The clear trend of the Sahara samples is noteworthy because it is related to the weathering process: simultaneous smaller scores in LV1 and larger scores in LV2 denote the most weathered samples. This trend could be visualized for this product because it had the highest content of the most volatile compounds among all the products, thus leading to the fastest evaporation among them. This is evidence that the weathering process largely affects the composition of the original sample and, thus, its identification is difficult unless these variations are minimized by variable selection. 
Again, similar to the “Prestige vs all” model, a large number of validation samples and beaches had extremely large Q-residuals, mainly because of an excess of variables unrelated to the class being studied.
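The choice of the number of latent variables from the LOO-CV error can be illustrated with a compact PLS1 (NIPALS) implementation in NumPy. This is a sketch under stated assumptions, not the PLS_Toolbox routine used in the study: the function names are hypothetical, X is autoscaled and y mean-centred inside each cross-validation split, and no safeguards are included for constant columns.

```python
import numpy as np


def pls1_fit(X, y, n_lv):
    """NIPALS PLS1 on centred/scaled X and centred y; returns regression vector b."""
    X = np.array(X, dtype=float)    # copies, so the caller's arrays are not deflated
    y = np.array(y, dtype=float)
    W, P, c = [], [], []
    for _ in range(n_lv):
        w = X.T @ y
        w /= np.linalg.norm(w)      # weight vector
        t = X @ w                   # scores
        tt = t @ t
        p = X.T @ t / tt            # X-loadings
        c_a = y @ t / tt            # y-loading
        X -= np.outer(t, p)         # deflate X
        y -= c_a * t                # deflate y
        W.append(w)
        P.append(p)
        c.append(c_a)
    W, P, c = np.array(W).T, np.array(P).T, np.array(c)
    return W @ np.linalg.solve(P.T @ W, c)


def loo_rmse(X, y, n_lv):
    """Leave-one-out RMSE of prediction for a model with n_lv latent variables."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        X_tr, y_tr = X[keep], y[keep]
        mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0, ddof=1)   # autoscaling
        y_mu = y_tr.mean()
        b = pls1_fit((X_tr - mu) / sd, y_tr - y_mu, n_lv)
        y_hat = ((X[i] - mu) / sd) @ b + y_mu
        errors[i] = y[i] - y_hat
    return np.sqrt(np.mean(errors ** 2))
```

In use, y would hold the class codes (1 for Prestige, 0 for IFO) and `loo_rmse` would be evaluated for 1, 2, 3, … latent variables, keeping the smallest number beyond which the error no longer decreases appreciably.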

PLS-DA model with variable selection

Variable selection using SRI was performed to avoid the overlap among the samples in the IFO-Prestige model. To discover the most discriminant chromatographic variables, a PLS-DA model was calculated for each of the fifteen combinations of two products (Ashtart–Brent, Ashtart–Maya, etc.). In these models, a single latent variable coped with most of the information (55–70 % of X and 94–99 % of y) so the SRI value of each chromatographic variable was calculated for each of the 15 models developed with only one latent variable.

To make the SRI values comparable among the different models, the (1 × 153) vector of SRIs from each model (see the section “Variable selection with the selectivity ratio index”) was divided by its maximum. Figure 2 shows the SRI value of each variable summed over all the models. To the best of our knowledge, there is no unique criterion for selection of the optimum number of variables to be retained and many different options exist [49–51]. In this work, sets of variables containing different numbers of variables were used to develop PLS-DA models. Each set was associated with a percentile of the maximum SRI value; for instance, two variables corresponded to the 95th percentile of the SRIs whereas 18 variables corresponded to the 60th percentile. We selected 12 variables (ca. 68th percentile, the upper third of the SRI values) because they yielded good separation between the classes and good predictions. Using fewer or more variables made the classes overlap in the training step. The selected variables were one aliphatic hydrocarbon (n-C₁₈), one aromatic PAH (D4), six biomarkers, and four diagnostic ratios.
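The scaling, summation, and thresholding steps described above can be sketched as follows. The function names are hypothetical, and the selection rule is an assumption: “percentile of the maximum SRI value” is read here as keeping the variables whose summed SRI is at least a given fraction of the largest summed SRI, which is only one possible interpretation of the text.

```python
from itertools import combinations

import numpy as np

OILS = ["Ashtart", "Brent", "Maya", "Sahara Blend", "IFO", "Prestige"]
PAIRS = list(combinations(OILS, 2))    # the 15 two-product PLS-DA models


def summed_sri(sri_by_model):
    """Divide each model's SRI vector by its maximum, then sum over models."""
    total = 0.0
    for sri in sri_by_model:
        sri = np.asarray(sri, dtype=float)
        total = total + sri / sri.max()
    return total


def select_by_fraction_of_max(summed, fraction):
    """Indices of variables whose summed SRI is >= fraction * maximum (assumed rule)."""
    summed = np.asarray(summed, dtype=float)
    return np.flatnonzero(summed >= fraction * summed.max())


# Toy example: two models, three variables (made-up SRI values)
demo = summed_sri([[1.0, 2.0, 4.0], [10.0, 5.0, 0.0]])
print(len(PAIRS), demo, select_by_fraction_of_max(demo, 0.9))
```

In the study each SRI vector has 153 entries and there are 15 of them, one per pair in `PAIRS`; lowering the fraction retains progressively more variables.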

The n-C₁₈ aliphatic hydrocarbon was selected from the 32 initial ones (plus four additional ratios which are usually considered for weathering studies). This compound is an intermediate linear chain, relatively stable through weathering processes, which is very useful for differentiating between heavy products (IFO and Prestige, relative areas approximately 0.55–0.85 when normalized against n-C₂₅) and crude oils (relative areas ranging from 1.6–2.4). This reflected the different composition of the lightest fractions amongst crude oils and refined products.

The PAH that helped most to differentiate among the classes was C₄-dibenzothiophene (abbreviation “D4”), a polycyclic aromatic sulfur heterocycle with four methyl substituents, which reflects the differences in the total sulfur content of the oils, for example: Prestige, 2.28 %; Ashtart, 1 %; and Sahara, 0.05 %.

Six biomarkers (out of the 39 initial ones) were included in the reduced subset, namely, 27Ts, 29Ts, 29bbS(217), 28bbR(218), 29bbR(218), and 29bbS(218). They correspond to two hopanes and four steranes and diasteranes.

Four diagnostic ratios were selected from the 28 initial ones: %27Ts, %29Ts, %D2/P2, and %D3/P3. Interestingly, the first two were calculated between the hopane biomarkers that had been selected above whereas %D2/P2 and %D3/P3 correspond to ratios between PAHs (phenanthrenes and dibenzothiophenes). It is also worth noting that the two PAH ratios had been proposed previously as useful indicators to match spillages [5] and were used to describe oil depletion, to differentiate two oils, and to correlate spilled-oil sediments with a source oil [10]. They had also been used to differentiate a high-sulfur heavy Iranian cargo crude from a low-sulfur pre-spill background [5]. Further, the suitability of %D2/P2 as source indicator was proposed in cases with only moderate degradation whereas %D3/P3 had been suggested in cases with severe weathering [5]. They therefore seem to have been selected here to differentiate between samples with moderate and severe weathering.

A PLS-DA model between Prestige and IFO was calculated with the 12 selected autoscaled variables. Figure 3 shows the scores for the first two LVs (94.6 % and 98.7 % of explained variance of X and y, respectively). The model did not overfit the data, because the RMSEC was 0.058 and the LOO-CV error was 0.069. The sensitivity (defined as the ability of the model to correctly recognize objects belonging to the gth class), specificity (defined as the capability of the gth class to reject objects of all other classes), and precision (defined as the ability of a classification model to not include objects of other classes in the considered class) were excellent (100 %) for both training and validation samples. A y cut-off value of 0.57 was calculated by using the average plus three times the standard deviation of the predictions for the Prestige class using LOO-CV. This limit was set as a separation between the Prestige samples and the samples of the other types of oil.
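The three figures of merit quoted above follow directly from the counts of correct and incorrect assignments for the class of interest. The Python sketch below uses a hypothetical function name and made-up labels; it only illustrates the definitions and does not reproduce the study's data.

```python
def class_metrics(true_labels, predicted_labels, target):
    """Sensitivity, specificity and precision for one target class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == target and p == target for t, p in pairs)   # target, recognized
    fn = sum(t == target and p != target for t, p in pairs)   # target, missed
    fp = sum(t != target and p == target for t, p in pairs)   # other, wrongly accepted
    tn = sum(t != target and p != target for t, p in pairs)   # other, rejected
    # Note: a class with no true or no predicted members gives a zero denominator.
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Made-up example: every sample is assigned to its true class
truth = ["Prestige", "Prestige", "IFO", "IFO"]
print(class_metrics(truth, truth, target="Prestige"))
```

When every sample is assigned to its true class, as reported for the training and validation sets, all three values equal 1 (100 %).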

Note that the six types of product do not overlap (Fig. 3) and that the percentage of X-block variance explained by the model increased remarkably, by ca. 26 percentage points (from 68.7 % to 94.6 %), compared with the model calculated with the 153 variables, emphasizing the benefit of the variable selection in this classification model. After variable selection by SRI the groups could be differentiated clearly along the first LV but with fewer analytical variables. The loadings on the first LV for the reduced subset of variables revealed an opposition between all the biomarkers (plus the two diagnostic ratios calculated with them, %27Ts and %29Ts) and the other variables: aliphatic and aromatic hydrocarbons (plus the two diagnostic ratios calculated with the PAHs: %D2/P2 and %D3/P3). Thus, the highest values for n-C₁₈, D4, %D2/P2, and %D3/P3 (0.85, 5.80, 0.34, and 0.39, respectively) became associated with Prestige whereas the highest values for the biomarkers (0.21, 0.18, 0.18, 0.30, 0.29, 0.25, 0.34, and 0.15, for 27Ts, 29Ts, 29bbS(217), 28bbR(218), 29bbR(218), 29bbS(218), %27Ts, and %29Ts, respectively) defined the IFO class (Table 2). The most important variable on LV2 was %D3/P3. This ratio is, thus, confirmed as a good marker for describing the most weathered samples of IFO and Prestige. In addition, %D2/P2 seemed more linked to intermediate weathered samples.

Table 2 Average values for the selected variables and each class of sample

The Ashtart, Brent, Maya, and Sahara classes were projected on the PLS-DA model (Fig. 3). The clear separation among the classes is evidence that the variables selected by SRI acted as good markers for these products also. Table 2 shows the average values found for the selected variables along the classes. It can be seen that, although it is possible to characterize some classes by a variable (e.g. Brent by D4, Maya by %D3/P3, Ashtart by 28bbR (218), and Sahara by %27Ts and D4), it is preferable to use the overall reduced set.

PLS-DA classification of oil lumps

The PLS-DA model was used to classify a set of samples taken on Galician beaches whose “target” assignations were obtained from chromatography-based standard procedures [12, 13, 15].

In contrast with the training and validation samples, the origin and time of weathering of which were controlled, the oil lumps from the coastline may have originated from different sources and suffered a range of weathering processes, because of their different drifting times at the sea and eventual mixture with other products. This may mean that even when a lump originated from the Prestige wreck, its chemical composition after a period of weathering might not be close to that of the original product (or to the samples studied after controlled weathering). This phenomenon would eventually be reflected by the Q-residuals and the y predictions of the PLS-DA model.

In order to adhere as much as possible to standardized procedures based on visual comparisons of the chromatograms, a “matching decision diagram” (Fig. 4) was set up to mimic the terminology of the standard approach. A “match” (i.e. a sample identified positively as Prestige) occurred when PLS-DA classified a sample in class “1” (Prestige), with a low Q-residual. A “non-match” (i.e. not from the Prestige) occurred when PLS-DA classified a sample in class “0”, with a large Q-residual. This means that the sample contained variation not previously modeled by PLS-DA. A “probable match” was assigned when PLS-DA predicted a sample as class “1” but with a high Q-residual, and an “inconclusive” result was assigned when a sample did not fulfill any of the previous criteria (i.e., a classification of “0” but with no new characteristics and, so, a low Q-residual). The main separation between the classes is given by the cut-off value derived from the PLS-DA model (vertical dotted line in the figure). The upper Q-residual limit was set as the average Q-residuals of the Prestige samples on training (LOO-CV) plus 30 times their standard deviation.
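The four regions of the matching decision diagram amount to two threshold comparisons, which can be written as a small Python function. The function name is hypothetical, the treatment of values lying exactly on a limit is an assumption, and the Q-limit used in the example is a made-up number; only the y cut-off of 0.57 is taken from the text.

```python
def matching_decision(y_pred, q_resid, y_cutoff, q_limit):
    """Return the decision of the matching diagram for one sample."""
    is_target = y_pred >= y_cutoff    # predicted as class "1" (Prestige)
    low_q = q_resid <= q_limit        # no sizeable non-modeled variation
    if is_target and low_q:
        return "match"
    if is_target:
        return "probable match"       # class "1" but high Q-residual
    if low_q:
        return "inconclusive"         # class "0" and low Q-residual
    return "non-match"                # class "0" and high Q-residual


# Hypothetical samples; q_limit=10.0 is an illustrative value only
for y_hat, q in [(0.95, 2.0), (0.80, 25.0), (0.20, 3.0), (0.10, 40.0)]:
    print(y_hat, q, matching_decision(y_hat, q, y_cutoff=0.57, q_limit=10.0))
```

The four example samples fall, in order, in the “match”, “probable match”, “inconclusive”, and “non-match” regions.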

The model thus obtained from the reduced set of variables, and two LVs, was highly satisfactory because it classified the validation samples and beach samples correctly. From inspection of Tables 1 and 3 it is apparent that the predictions for the 45 beach samples were more accurate than those obtained using the overall set of variables. Note that the PLS-DA method leads to more assignations than the visual comparison approach, because the former assigned only eight samples as “inconclusive” or “probable match”, whereas the latter had 18 samples in the two categories combined. As a consequence, the number of probable matches derived from the standard method is higher (15 samples) than for the PLS-DA method (five samples). In addition, no samples considered by PLS-DA as Prestige or No Prestige were regarded as No Prestige or Prestige, respectively, by the standard approach, which is an encouraging result.

Table 3 Contingency table derived from the PLS-DA predictions compared with the assignments derived from standard procedures

The five beach samples regarded as probably originating from the Prestige wreck (probable match in the standard approach) had PLS predictions higher than the cut-off value of 0.576 in Fig. 4 but large Q-residuals. They were regarded as non-match by the standard approach. This was not surprising, because of their closeness to the “No Prestige” class in Fig. 4. On the other hand, sample B13 is, clearly, very close to the decision value and, indeed, all its experimental variables were quite similar to the average values for the Prestige samples. Three samples were regarded as “inconclusive” in Fig. 4 (Portiño4, Lira1, and Farolira3). They agreed with the standard approach, except for Farolira3, which was regarded as a probable match by the visual approach.

Close inspection of those two categories revealed that the “probable match” samples had, in general, measured values very close to, or slightly higher than, the average Prestige values. In contrast, “inconclusive” samples had clearly higher values (except for n-C₁₈ and D4, which were clearly lower) than the Prestige values. For instance, 0.22 vs 0.17 (27Ts), 0.17 vs 0.13 (%29Ts), and 0.42 vs 0.39 (%D3/P3) for “inconclusive” and Prestige samples, respectively.

Regarding the samples assigned as “not from the Prestige” (non-match), all the PLS-DA classifications were correct except for a sample that the visual approach regarded as a probable match. Clear ordering was also observed in this class in Fig. 4, because the higher the Q-residual was, the higher the n-C₁₈, %D2/P2, and %D3/P3 values were. For instance, values for n-C₁₈ and %D2/P2 were 0.96 and 0.39, and 0.70 and 0.41 for Langosteira3 and Louro3 respectively.

Finally, it is worth noting that PLS-DA classified more samples as from the Prestige than the standard procedure, probably because of the large number of inconclusive and probable match assignments obtained in the standard approach (which, in turn, reflects the difficulty of decision-making solely by visual comparison). This is not really a critical drawback, because in liability studies false positives would be preferred to false negatives (because more detailed studies should clarify the false assignment).

Conclusions

This work implemented an objective procedure for matching an oil spill with a suspected source. It is based on PLS-DA and was exemplified by a case study in which 45 oil lumps of unknown origin collected on Galician beaches were compared with a suspected source of the oil (the Prestige wreck). A preliminary variable-selection step using the SRI reduced the 153 initial chromatographic variables to 12. A PLS-DA model was then developed that led to a “matching decision diagram”. Two cut-off values were established, one from the PLS-DA predictions and the other from the Q-residuals. The diagram leads to four possible decisions: match, non-match, probable match, and inconclusive. In this particular study the PLS-DA model matched more samples to the Prestige oil than the standard approach did. This may be a consequence of the larger number of inconclusive statements produced by the latter approach. Nevertheless, this was not considered a serious drawback because in environmental liability studies “false positives” would be preferred to “false negatives”, given their very different consequences.

It is worth noting that the presented approach is not based on a one-to-one comparison of the unknown sample with every suspected reference source and every weathered reference sample, as is done in standard procedures (which currently involves a huge workload). Instead, it compares the sample with a whole reference set that includes original and weathered reference samples and defines the class boundaries of the suspected source in multivariate space. Because the sample is assigned to the class of the suspected source if it falls within the class limits, no exact match of the sample to a single reference is required. Therefore, the PLS-DA approach presented here seems an effective screening method for reducing laboratory workload.

Acknowledgments

M.P.G-C acknowledges the Galician Government (Xunta de Galicia) for a postdoctoral “Ángeles Alvariño” Research Contract and for a post-doc grant, partially supported by EU FEDER funds, for a stay at the URV. Research Grant 07MDS031103PR (Xunta de Galicia) is also acknowledged.

Author information

Authors and Affiliations

Department of Analytical Chemistry, University of A Coruña, Campus da Zapateira s/n, 15008, A Coruña, Spain
M. P. Gómez-Carracedo, J. M. Andrade & R. Fernández-Varela
Department of Analytical Chemistry and Organic Chemistry, Universitat Rovira i Virgili, Campus Sescelades, 43007, Tarragona, Spain
J. Ferré & R. Boqué

Corresponding author

Correspondence to J. M. Andrade.

Electronic supplementary material

ESM 1 (PDF 120 kb)

About this article

Cite this article

Gómez-Carracedo, M.P., Ferré, J., Andrade, J.M. et al. Objective chemical fingerprinting of oil spills by partial least-squares discriminant analysis. Anal Bioanal Chem 403, 2027–2037 (2012). https://doi.org/10.1007/s00216-012-6008-5

Received: 03 February 2012
Revised: 30 March 2012
Accepted: 30 March 2012
Published: 25 April 2012
Issue Date: June 2012
DOI: https://doi.org/10.1007/s00216-012-6008-5
