Introduction

Accurate peak detection in chromatographic signals is critical to downstream data analysis and is a fundamental step in both qualitative and quantitative analysis. LC–MS is one of the most powerful tools for analyzing complex samples, but deconvolving peaks in extracted ion chromatograms (EICs) remains challenging for existing software tools [1]. The determination of peak parameters is often seriously affected by overlapped peaks. Several techniques have been developed for peak detection, and they generally follow one of two strategies: derivative-based detection or pattern matching [2]. Derivative-based peak detection relies on the fact that the first derivative of a peak has a zero-crossing at its local maximum, or that the second derivative is negative in the region of a peak. To avoid false positives, a threshold on slope or amplitude is often imposed on the zero-crossings and negative regions, so that only candidates exceeding a predetermined minimum are retained [3, 4]. The peakfit package relies on the first derivative to find peaks and resolve signals [5]. The best-known pattern-matching method for peak detection is MassSpecWavelet [6], which is based on the continuous wavelet transform (CWT) and maintains a low false positive rate. Zhang has applied CWT-based peak detection in baselineWavelet [7], alignDE [8], MSPA [9], and CAMS [10], and an improved peak detection method named MSPD [11] has been proposed recently. Compared with other methods, namely Cromwell [12], Limpic [13], Lms [14], and PROcess [15], CWT provides the best average performance [16]. As signal complexity increases, MassSpecWavelet remains effective, whereas derivative-based methods require more preprocessing, such as baseline correction and smoothing. When complex samples are analyzed, overlapped peaks frequently appear, and it is difficult to extract accurate quantitative information from them. MassSpecWavelet may fail to detect some of the peaks within an overlapped cluster, which means that vital information may be lost and errors introduced.

The peak model is important for the deconvolution of overlapped peaks. The mathematical model should be flexible enough to fit peaks of different shapes. In the literature, peak models mainly include the exponentially modified Gaussian (EMG), the polynomial modified Gaussian (PMG), the hybrid of Gaussian and truncated exponential functions (EGH), and the bi-Gaussian mixture model [17–19]. These models are designed to fit particular peak shapes, such as asymmetric, fronting, and tailing peaks. In addition, they have more than four parameters, which cannot easily be determined, so their versatility suffers. For these reasons, they are not well suited to the automatic detection of overlapped peaks [20, 21].

This study aims to develop an automatic peak detection method for both normal and overlapped peaks in analytical signals. CWT-based pattern matching is used for peak detection; it can be applied directly to raw chromatograms without baseline correction or smoothing, and it identifies peaks accurately [6]. The segments of the signal that contain overlapped peaks are extracted for deconvolution. A genetic algorithm (GA) is then used to optimize the positions and widths of the overlapped peaks so that good solutions are obtained in acceptable time, and the balance between population size and number of iterations is tuned by grid search. Using the GA results, Gaussian fitting and trapezoidal integration are employed to calculate the height and area of each fitted peak. To obtain accurate peak parameters, the baseline can be corrected with a linear model [22] or airPLS [23] if necessary, and the proposed method, recursive wavelet peak detection (RWPD), is then applied to the baseline-corrected signal. If the residual that remains after the detected peaks are subtracted from the raw signal is large, undetected peaks may exist. CWT peak detection is therefore performed recursively on the residual signal until the residual is small enough. The process terminates when no new positions are detected or when the detected positions are approximately the same as those found in the previous round. If new peak positions are detected, the above steps are repeated to obtain a better fit. The flow chart of RWPD is shown in Fig. 1.

Fig. 1

Flow chart describing the framework of RWPD

Theory

Recursive Peak Detection via Continuous Wavelet Transforms

Identifying peak positions is a critical step in the analysis of analytical signals. One of the most popular techniques is based on the CWT, which analyzes a signal at a chosen frequency or set of frequencies (scales) and has been widely applied to peak detection [24–30].

Wavelet theory is based on a family of basis functions that are continuously differentiable and have zero mean. The scaled and translated versions of the mother wavelet \( \psi \) are defined as follows:

$$ \psi_{a,b} (t) = \frac{1}{\sqrt a }\psi \left(\frac{t - b}{a}\right),\,a \in R^{ + } ,b \in R, $$
(1)

where a and b are the dilation (or scale) and translation parameters, respectively. The CWT can be written as:

$$ C(a,b) = \int_{-\infty}^{+\infty} s(t)\,\psi_{a,b}(t)\,\mathrm{d}t, $$
(2)

where s(t) denotes the given signal, and C(a,b) is the 2D matrix of wavelet coefficients [31]. The Mexican hat wavelet resembles Gaussian and Lorentzian peak shapes: it is symmetric and has one major positive lobe. It is therefore selected as the mother wavelet and is described mathematically as:

$$ \psi (x) = \left( \frac{2}{\sqrt{3}}\,\pi^{-\frac{1}{4}} \right) (1 - x^{2})\, e^{-\frac{x^{2}}{2}}, $$
(3)

where \( \psi (x) \) represents the Mexican hat wavelet.
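As an illustration of Eqs. (1)–(3), the following minimal Python sketch evaluates a discretized CWT with the Mexican hat wavelet on a synthetic signal. It is not the implementation used in this study: the function names, the truncation of the wavelet at ±5a, the treatment of samples outside the signal as zero, and the test signal are all assumptions made for the example.

```python
import math

def mexican_hat(x):
    """Mexican hat mother wavelet of Eq. (3)."""
    c = 2.0 / (math.sqrt(3.0) * math.pi ** 0.25)
    return c * (1.0 - x * x) * math.exp(-0.5 * x * x)

def cwt(signal, scales):
    """Discretized Eq. (2): one row of coefficients C(a, b) per scale a."""
    n = len(signal)
    rows = []
    for a in scales:
        half = int(5 * a)  # assumption: the wavelet is negligible beyond 5a
        kernel = [mexican_hat(k / a) / math.sqrt(a)
                  for k in range(-half, half + 1)]
        row = []
        for b in range(n):
            acc = 0.0
            for j, w in enumerate(kernel):
                t = b + j - half
                if 0 <= t < n:  # samples outside the signal count as zero
                    acc += signal[t] * w
            row.append(acc)
        rows.append(row)
    return rows

# Hypothetical test signal: one Gaussian peak at t = 60 on a constant offset.
signal = [5.0 + 10.0 * math.exp(-0.5 * ((t - 60) / 4.0) ** 2)
          for t in range(120)]
scales = [2, 4, 6]
coeffs = cwt(signal, scales)

# Position of the largest coefficient at scale a = 4.
peak_index = max(range(len(signal)), key=lambda b: coeffs[1][b])
print(peak_index)
```

Because the wavelet has zero mean, the constant offset contributes very little to the coefficients away from the signal edges, and the largest coefficient at scale 4 falls on the peak apex at index 60. This is the property that allows CWT-based detection to work on raw signals without prior baseline removal.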

The peak identification process can be divided into four steps: (1) identify the ridges by linking the local maxima in the 2D matrix of CWT coefficients; (2) define the signal-to-noise ratio (SNR); (3) identify the peaks based on the ridge lines; and (4) refine the estimates of the peak parameters.

The peak width can be estimated roughly from the optimal scale of the peak in wavelet space. The estimate is based on a CWT with the Haar wavelet, which improves the SNR of the derivative calculation [7]. In Fig. 2, the initial estimates of peak positions (Mexican hat wavelet) and peak widths (Haar wavelet) are marked with solid squares and circles, respectively.

Fig. 2

Peak positions are detected with the Mexican hat wavelet and marked with solid squares. Peak widths are estimated by calculating the derivative with the Haar wavelet. For each peak, the derivative obtained with the optimal Haar wavelet is shown at the bottom of the figure

CWT-based peak detection methods such as MassSpecWavelet can detect most peaks in a signal (see Fig. 3), and the results are robust and accurate. The main shortcoming of MassSpecWavelet, however, is that it does not handle overlapped peaks well. If a relatively weak peak is overlapped by strong peaks on both sides, MassSpecWavelet may fail to detect it. The following solution is proposed in this study. Peak positions in the raw analytical signal are detected by CWT, and the detected peaks are fitted by GA and Gaussian fitting (see the next section for details). The fitted signal is then subtracted from the raw signal. The residual signal contains the undetected peaks with little or no influence from the large overlapping peaks, so it can be searched for undetected peaks by CWT. The new positions are combined with those from the first fit to determine all peak positions in the raw signal, and the procedure is repeated until no new peaks are detected. GA, Gaussian fitting, and integration are used to obtain the peak widths, heights, and areas. This procedure is called recursive wavelet peak detection (RWPD), and it remedies the major shortcoming of MassSpecWavelet. With RWPD, every peak in an analytical signal, whether weak, strong, or overlapped, can be determined effectively.
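The recursion described above can be sketched in a few lines of Python. The sketch is deliberately simplified and should not be read as the RWPD implementation: the detector is a single-scale Mexican hat filter with a fixed threshold rather than ridge-based CWT detection, the peak width is assumed known and shared by all peaks, and plain least squares replaces the GA-based fit. All function names, parameter values, and the noise-free test signal are hypothetical.

```python
import math

def gauss(t, mu, sigma):
    return math.exp(-0.5 * ((t - mu) / sigma) ** 2)

def mexh(x):
    return (1.0 - x * x) * math.exp(-0.5 * x * x)

def detect(sig, scale, thresh):
    """Stand-in detector: local maxima of single-scale wavelet coefficients."""
    n, half = len(sig), int(5 * scale)
    c = [sum(sig[b + k] * mexh(k / scale)
             for k in range(-half, half + 1) if 0 <= b + k < n)
         for b in range(n)]
    return [b for b in range(1, n - 1)
            if c[b] > thresh and c[b] > c[b - 1] and c[b] >= c[b + 1]]

def solve(A, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    m = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for i in range(m):
        p = max(range(i, m), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, m):
            f = M[r][i] / M[i][i]
            for col in range(i, m + 1):
                M[r][col] -= f * M[i][col]
    x = [0.0] * m
    for i in reversed(range(m)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, m))
        x[i] = (M[i][m] - s) / M[i][i]
    return x

def fit_heights(sig, centers, sigma):
    """Least-squares heights for Gaussians with fixed centers and width."""
    X = [[gauss(t, mu, sigma) for mu in centers] for t in range(len(sig))]
    m = len(centers)
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(m)] for i in range(m)]
    Xty = [sum(r[i] * y for r, y in zip(X, sig)) for i in range(m)]
    h = solve(XtX, Xty)
    fitted = [sum(hi * xi for hi, xi in zip(h, r)) for r in X]
    return h, fitted

def rwpd(sig, sigma=5.0, scale=5.0, thresh=2.0, min_sep=3, max_rounds=10):
    """Detect, fit, subtract, and search the residual until nothing new appears."""
    centers, heights, residual = [], [], list(sig)
    for _ in range(max_rounds):
        new = [p for p in detect(residual, scale, thresh)
               if all(abs(p - q) > min_sep for q in centers)]
        if not new:
            break
        centers = sorted(centers + new)
        heights, fitted = fit_heights(sig, centers, sigma)
        residual = [y - f for y, f in zip(sig, fitted)]
    return centers, heights

# Hypothetical doublet: a weak peak at 112 on the shoulder of a strong peak at 100.
sig = [10.0 * gauss(t, 100, 5.0) + 2.0 * gauss(t, 112, 5.0) for t in range(200)]
first_pass = detect(sig, 5.0, 2.0)
centers, heights = rwpd(sig)
print(first_pass, centers, heights)
```

In this example a single detection pass reports only the strong peak, whereas the recursive loop recovers the weak shoulder peak from the residual of the first fit and then refits both peaks together.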

Fig. 3

Results of peak detection on simulated data. The solid squares are the peak positions estimated by CWT with the Mexican hat wavelet. The segments in the dotted boxes are overlapped peaks, which are deconvolved by RWPD

Deconvolve Overlapped Peaks by Genetic Algorithm

GA is based on Darwin's well-known evolutionary principle of survival of the fittest. J. Holland introduced GA in 1975, and it has since been used in mathematics, physics, and chemistry. In chemistry, it has been used to predict chromatographic retention times in LC [32–36] and to align peaks in 1H-NMR and IR spectra [32, 37].

As an optimization method, GA has several advantages over local search techniques: (a) it is simple, efficient, and accurate in computation; (b) it is a global optimization method that avoids being trapped in local optima; (c) it selects the most suitable solution from a series of candidate solutions; and (d) it can handle different kinds of search space, whether continuous or discrete, and whether or not derivatives exist [38, 39]. Owing to these advantages, GA is suitable for optimizing the parameters of each peak within a group of overlapped peaks.

An advantage of GA over the peakfit package is that bounds can be set on each parameter to narrow the search range. The initial input parameters are the peak positions and widths estimated in the previous step. The difference between the fitted signal \( \hat{y} \) and the raw signal y is regarded as the error. To minimize the error, the negative of its squared L2 norm (the sum of squared residuals) is used as the fitness function:

$$ {\text{Fitness function}} = - (||y - \hat{y}||^{2} ). $$
(4)
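To make the role of Eq. (4) and of the parameter bounds concrete, the sketch below runs a small real-coded GA on a synthetic two-peak signal. The source does not specify the GA operators, so tournament selection, blend crossover, clipped Gaussian mutation, and elitism are assumptions, as are the function names, the bounds, and the population settings. For brevity the sketch handles exactly two peaks and obtains their heights by closed-form least squares inside the fitness evaluation.

```python
import math
import random

def basis(mu, sigma, n):
    """Unit-height Gaussian sampled at t = 0, 1, ..., n - 1."""
    return [math.exp(-0.5 * ((t - mu) / sigma) ** 2) for t in range(n)]

def fitness(params, y):
    """Eq. (4): negative squared L2 norm of the residual (two peaks only)."""
    m1, s1, m2, s2 = params
    g1, g2 = basis(m1, s1, len(y)), basis(m2, s2, len(y))
    a11 = sum(u * u for u in g1)
    a12 = sum(u * v for u, v in zip(g1, g2))
    a22 = sum(v * v for v in g2)
    b1 = sum(u * yi for u, yi in zip(g1, y))
    b2 = sum(v * yi for v, yi in zip(g2, y))
    det = a11 * a22 - a12 * a12
    if det <= 0.0:  # degenerate pair of basis functions
        return float("-inf")
    h1 = (b1 * a22 - b2 * a12) / det
    h2 = (a11 * b2 - a12 * b1) / det
    return -sum((yi - h1 * u - h2 * v) ** 2 for yi, u, v in zip(y, g1, g2))

def run_ga(y, bounds, pop_size=40, generations=60, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        scored = sorted(((fitness(p, y), p) for p in pop),
                        key=lambda s: s[0], reverse=True)
        history.append(scored[0][0])
        nxt = [scored[0][1][:]]  # elitism: the best individual survives unchanged
        while len(nxt) < pop_size:
            pa = max(rng.sample(scored, 3), key=lambda s: s[0])[1]
            pb = max(rng.sample(scored, 3), key=lambda s: s[0])[1]
            w = rng.random()
            child = []
            for a, b, (lo, hi) in zip(pa, pb, bounds):
                g = w * a + (1.0 - w) * b  # blend crossover
                if rng.random() < 0.2:  # Gaussian mutation
                    g += rng.gauss(0.0, 0.05 * (hi - lo))
                child.append(min(hi, max(lo, g)))  # keep the gene inside its bounds
            nxt.append(child)
        pop = nxt
    best_fit, best = max(((fitness(p, y), p) for p in pop), key=lambda s: s[0])
    history.append(best_fit)
    return best, history

# Hypothetical doublet with heights 10 and 3, positions 30 and 42, widths 4 and 5.
n = 80
y = [10.0 * u + 3.0 * v for u, v in zip(basis(30.0, 4.0, n), basis(42.0, 5.0, n))]
# Bounds on (position 1, width 1, position 2, width 2) around rough initial estimates.
bounds = [(25.0, 35.0), (2.0, 8.0), (37.0, 47.0), (2.0, 8.0)]
best, history = run_ga(y, bounds)
print(best, history[-1])
```

With elitism the best fitness never decreases from one generation to the next, so the recorded history shows directly how the fitting error shrinks as the number of iterations grows.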

Baseline Correction for Better Quantification

It is difficult to obtain accurate areas of overlapped peaks directly when a non-zero baseline is present. To calculate accurate peak areas by integration, two baseline correction algorithms are introduced.

In the first, referred to here as the polynomial fitting method, the baseline is modeled as a straight line through the data points at the start and end of the signal. The first 10 % and the last 10 % of the points are selected, and a linear model is fitted to them; the fitted line is defined as the baseline. The baseline-free signal is then obtained by subtracting the baseline from the raw signal [22].
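A minimal sketch of this linear baseline correction is given below. The function name, the default fraction argument, and the synthetic signal are illustrative assumptions; the ordinary least-squares line is written out explicitly so that the example needs only the standard library.

```python
import math

def linear_baseline(signal, fraction=0.10):
    """Fit a straight line to the first and last `fraction` of the points."""
    n = len(signal)
    k = max(2, int(round(fraction * n)))
    idx = list(range(k)) + list(range(n - k, n))
    xs = [float(i) for i in idx]
    ys = [signal[i] for i in idx]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (v - my) for x, v in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    baseline = [intercept + slope * i for i in range(n)]
    corrected = [s - b for s, b in zip(signal, baseline)]
    return baseline, corrected

# Hypothetical signal: a Gaussian peak of height 10 on a sloping baseline.
raw = [2.0 + 0.01 * t + 10.0 * math.exp(-0.5 * ((t - 100) / 5.0) ** 2)
       for t in range(200)]
baseline, corrected = linear_baseline(raw)
print(round(baseline[0], 3), round(corrected[100], 3))
```

The approach works only if the selected end regions are free of peaks, as they are in this example; a peak inside either region would bias the fitted line.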

When the signal is simple, the polynomial fitting method can be used to remove the baseline. If the baseline drift is complex, however, the polynomial fitting method performs poorly, and more flexible methods such as adaptive iteratively reweighted penalized least squares (airPLS) [23], MPLS [40], SirQR [41], and ATEB [42] can replace it. airPLS is a simple yet flexible, effective, and fast algorithm for estimating the baseline. Its parameter lambda is adjustable: the larger lambda is, the smoother the baseline. When the baseline is close to a linear function, lambda can be set to \( 10^{2} \); when the baseline is close to a quadratic function, it is generally set to \( 10^{4} \). airPLS has been applied to chromatograms, Raman spectra, and NMR signals, and it performs better than other baseline correction methods.

Extract Important Features of Peaks

Least squares fitting and Gaussian fitting can be used together to infer the peak heights, as shown in Eq. (5), where X denotes the matrix of values computed from the Gaussian model (one column per peak), y denotes the measured data, and h denotes the corresponding peak heights:

$$ \begin{gathered} y = Xh, \\ h = (X^{\prime} X)^{-1} X^{\prime} y. \end{gathered} $$
(5)

The area of each peak can be calculated by trapezoidal integration of the Gaussian peak model multiplied by its estimated height.
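The height and area calculations can be illustrated with the short sketch below, which solves the normal equations of Eq. (5) for a noise-free two-peak signal and then integrates each fitted peak with the trapezoidal rule. The function names, the peak parameters, and the use of a hand-written linear solver are assumptions made to keep the example self-contained; positions and widths are taken as already known, as they would be after the GA step.

```python
import math

def gaussian(t, mu, sigma):
    return math.exp(-0.5 * ((t - mu) / sigma) ** 2)

def solve(A, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    m = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for i in range(m):
        p = max(range(i, m), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, m):
            f = M[r][i] / M[i][i]
            for col in range(i, m + 1):
                M[r][col] -= f * M[i][col]
    x = [0.0] * m
    for i in reversed(range(m)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, m))
        x[i] = (M[i][m] - s) / M[i][i]
    return x

def peak_heights(y, ts, peaks):
    """Eq. (5): h = (X'X)^(-1) X'y, with one unit-height Gaussian per column of X."""
    X = [[gaussian(t, mu, s) for mu, s in peaks] for t in ts]
    m = len(peaks)
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(m)] for i in range(m)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(m)]
    return solve(XtX, Xty)

def peak_area(ts, height, mu, sigma):
    """Trapezoidal integration of height times the Gaussian peak model."""
    vals = [height * gaussian(t, mu, sigma) for t in ts]
    return sum(0.5 * (vals[i] + vals[i + 1]) * (ts[i + 1] - ts[i])
               for i in range(len(ts) - 1))

# Hypothetical overlapped doublet with known positions and widths.
ts = [float(t) for t in range(200)]
peaks = [(90.0, 6.0), (105.0, 4.0)]
y = [8.0 * gaussian(t, 90.0, 6.0) + 3.0 * gaussian(t, 105.0, 4.0) for t in ts]
heights = peak_heights(y, ts, peaks)
areas = [peak_area(ts, h, mu, s) for h, (mu, s) in zip(heights, peaks)]
print(heights, areas)
```

For a Gaussian peak that lies fully inside the integration window, the trapezoidal area can be checked against the closed form \( h\sigma\sqrt{2\pi} \), which gives a convenient sanity check on the numerical result.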

Experimental

Both simulated and real data are used to benchmark the performance of RWPD, and the segments containing overlapped peaks are extracted and resolved in turn.

Simulated Data

The simulated data consist of the Gaussian peaks shown in Fig. 3, to which 1 % random noise has been added. The resolution of the triplet of overlapped peaks near 850 and of the doublet of overlapped peaks near 600 is described in Sects. 4.1 and 4.5, respectively.

LC–MS Data Set of Isomers of Diaminotoluenes

2,4-Diaminotoluene (2,4-DAT) is widely used as an intermediate in the synthesis of dyes, and it is carcinogenic and genotoxic [43–45]. To avoid interference from the non-banned structural isomers (2,3-DAT, 2,6-DAT, and 3,4-DAT) in the determination of 2,4-DAT [46], an effective LC–MS method combined with RWPD was established. It also effectively addresses the false positive problem in the analysis of 2,4-DAT. The chemical structures of the DAT isomers are shown in Fig. 4.

Fig. 4

Chemical structures of 2,3-DAT, 2,4-DAT, 2,6-DAT, and 3,4-DAT

Chemicals and Reagents

2,3-DAT, 2,4-DAT, 2,6-DAT, and 3,4-DAT were purchased from Dr. Ehrenstorfer GmbH (Germany). HPLC-grade acetonitrile and HPLC-grade methyl alcohol were purchased from Merck (Germany). LC–MS grade formic acid was purchased from Sigma (USA). The water used in all tests was purified with a Milli-Q water purification system (Millipore, Bedford, MA, USA).

Preparation of Standard Solutions

Standard stock solutions of the analytes (2,3-DAT, 2,4-DAT, 2,6-DAT, and 3,4-DAT) were prepared in methyl alcohol. Mixed standard solutions were prepared by mixing the stock solutions and diluting appropriately with methyl alcohol. Their concentrations were 2.04, 1.94, 2.18, and 2.45 µg mL−1, respectively.

Apparatus

All sample analyses were carried out on a UPLC-IT-TOF-MS system (Shimadzu, Tokyo, Japan). LC experiments were conducted on a Shimadzu (Kyoto, Japan) ultrahigh performance liquid chromatography (UPLC) system consisting of a solvent delivery pump (LC-30AD), an auto-sampler (SIL-30AC), a DGU-20A5R degasser, a photodiode array detector (SPD-M20A), a communication base module (CBM-20A), and a column oven (CTO-30A). Chromatographic separation was carried out on a Shim-pack XR-ODS column (1.6 µm, 2.0 mm I.D. × 75 mm) using gradient elution with mobile phase A (0.1 % formic acid) and mobile phase B (acetonitrile). The gradient was as follows: 0–3 min, a linear gradient from 5 % B to 10 % B; 3–8 min, a linear gradient to 90 % B; 8–8.01 min, a linear gradient back to 5 % B. The injection volume was 5 µL, the flow rate was 0.4 mL min−1, and PDA detection was performed from 190 to 800 nm. The sample chamber in the autosampler was maintained at 4 °C, while the column was kept at 40 °C. The whole analysis lasted 10 min.

Mass spectral data for the compounds were obtained using a Shimadzu ITTOF mass spectrometer equipped with an electrospray ionization (ESI) source operated in the positive ionization mode. Liquid nitrogen was used as the nebulizing gas at a flow rate of 1.5 L min−1, and the drying gas (N2) pressure was 0.1 MPa. The interface and detector voltages were set at 4.5 and 1.56 kV, respectively. The CDL voltage was set in constant mode (optimized by autotuning), and the CDL temperature was 200 °C. Mass spectrometry was conducted in the full scan and automatic multiple stage fragmentation scan modes over an m/z range of 100–500 for MS1. The ion accumulation time was set at 10 ms. Argon was used as the collision gas. Trifluoroacetic acid (TFA) sodium solution was used as the standard sample for calibrating the instrument over the entire mass range (m/z 100–2000). Data processing was performed using the LC–MS Solution software (version 3.70).

LC–MS Data Set of FaahKO

The faahKO package contains quantitated LC/MS peaks from the spinal cords of six wild-type and six FAAH knockout mice. The data are a subset of the original data covering m/z 200–600 and 2500–4500 s, collected in positive ionization mode. The extracted ion chromatogram (EIC) in Fig. 7a is from a FAAH knockout sample, and its m/z range is 429.0–429.5. The EIC in Fig. 7c is from a wild-type sample, and its m/z range is 575–575.5 [47, 48].

Results and Discussion

RWPD can fit the parameters of each peak in a signal. Here, undetected peaks and overlapped peaks are selected to test its performance.

Results and Comparisons with Previous Methods on Simulated Data Set

The peakfit package is capable of measuring peak positions and heights accurately; however, peak widths and areas are accurate only when the peak shapes are approximately Gaussian or Lorentzian. The fitting results for the simulated data obtained by RWPD and peakfit are compared in Fig. 5 and Table 1.

Fig. 5

Deconvolution results of simulated data by RWPD. The solid squares denote the peak positions detected by the Mexican hat wavelet. Black lines, red dashed lines, and blue dotted lines represent raw signals, fitted signals, and fitted peaks, respectively

Table 1 Comparison of peak parameters: expected values and estimates by RWPD and the peakfit package

Table 1 shows that the fitting error of RWPD is lower than that of peakfit. For strong peaks, both methods perform well, but for weak peaks, RWPD is more accurate and closer to the expected values than peakfit.

Results of LC–MS Data Sets

The molecular ions of the DAT isomers at m/z 123.0912 were analyzed using RWPD. Four peaks were observed, and the fitted signal is consistent with the EIC in Fig. 6. Each fitted peak in the EIC corresponds to a different structural isomer; in order, they are 3,4-DAT, 2,3-DAT, 2,6-DAT, and 2,4-DAT. Each DAT sample was analyzed by LC–MS, and the retention times were consistent with the fitted peaks in the EIC. Quantitative analysis is based on integrating the area under each fitted curve to estimate the relative abundance.

Fig. 6

Deconvolution of the DATs EIC into Gaussian peak shapes by RWPD. Peaks with different elution times correspond to different isomeric structures. The lines and solid squares have the same meaning as given in Fig. 5

The EICs of the faahKO data set were extracted with the XCMS package. The overlapped peaks in these EICs are also used to benchmark the performance of RWPD, and the results are shown in Fig. 7. The fitted signal matches the raw (or baseline-corrected) signal well, which indicates that almost all of the peak information in the overlapped peaks has been correctly extracted.

Fig. 7

Deconvolution results of EICs by RWPD. The segments in the dotted boxes are overlapped peaks, and they will be deconvolved by RWPD. The lines and solid squares have the same meaning as given in Fig. 5

The Choice of Peak Model

Gaussian and Lorentzian functions are used as peak models to fit overlapped peaks. Based on the characteristics of the signals, the Gaussian function is generally used as the peak model for chromatograms and the Lorentzian function for spectra. Both have few parameters and give a reasonable fit to most experimental peaks in this study. Although real signals are more complex and are always affected by random noise or baseline drift, these effects can be handled by our method. In cases of severe peak distortion, tailing, or heavy overlap in real applications, these models may fail, and other peak functions may need to be implemented.

Balance between Accuracy and Computation Speed

The choice of population size and number of iterations affects the final result of GA. With too small a population or too few iterations, the search may stop before the optimal solution is found. In theory, a larger population or more iterations maintains population diversity, so the fitted results are closer to the true values and the error is lower. However, this consumes considerable computational resources and reduces search efficiency. A suitable population size and number of iterations should therefore be determined to achieve an accurate solution within acceptable time. The number of iterations serves as the convergence criterion: GA terminates when the maximum number of iterations has been reached.

The relationship among population size, maximum number of iterations, and fitting error is shown in Fig. 8. The fitting error is treated as a function of population size and maximum number of iterations, and it is calculated for different combinations of the two. Both range from 1 to 300 in steps of 10. The data have been smoothed to show the trends more clearly. The trend is clear: the fitting error decreases as population size and number of iterations increase, especially while the population size is below 100 and the number of iterations is below 150. This relationship has been tested on many overlapped peaks. With a population size of 100 and 150 iterations, the method achieves an acceptable fitting error.

Fig. 8

Fitting error and its relationship with population sizes and iterations

Recursive Wavelet Peak Detection of Overlapped Signal

RWPD can extract each peak from an overlapped signal. In the simulated data, there is a group of overlapped peaks near 600, and the weak peak in it is not detected in Fig. 3. The residual signal after the first iteration is shown in Fig. 9b. CWT detection is applied to the residual repeatedly until no new position appears. The peak positions and widths are determined by combining the new positions with the results of the first iteration, and Gaussian fitting and the other steps described above are then used to obtain accurate fitting results. As shown in Fig. 9c, the previously undetected peak is detected by RWPD. Table 2 compares the results of the first and second iterations; the fit after the second iteration is more accurate. The fitted signal matches the raw signal better, and the fitting error decreases rapidly.

Fig. 9

Results of RWPD on the simulated data in the dotted boxes in Fig. 3. a In the first iteration, only strong peaks are detected and fitted; b peak positions are detected in the residual signal; c the results of (a) and (b) are combined, and each peak is fitted again. The black line in (b) denotes the residual signal. The lines and solid squares have the same meaning as given in Fig. 5

Table 2 Comparison of the fitting results of simulated data in the first and second iterations by RWPD

Conclusion

In this study, we present a practical peak detection method that combines RWPD with heuristic optimization. RWPD is proposed for estimating the positions of overlapped peaks, and it performs significantly better than traditional CWT-based peak detection. Heuristic optimization is used to optimize the important features of the peaks, including positions, widths, heights, and areas. The initial value and bounds of each parameter to be optimized are obtained from RWPD. The results on the simulated and LC–MS data sets show that RWPD gives more accurate positions and smaller relative errors than MassSpecWavelet and peakfit, especially for overlapped peaks. Our method is therefore suitable for extracting features of scientific interest from complex analytical signals.