Outlier Detection for Mass Spectrometric Data

Cho, HyungJun; Eo, Soo-Heang

doi:10.1007/978-1-4939-3106-4_5

HyungJun Cho Ph.D.³ &
Soo-Heang Eo Ph.D.³

Part of the book series: Methods in Molecular Biology ((MIMB,volume 1362))

4037 Accesses
1 Citations

Abstract

Mass spectrometry data are often generated from various biological or chemical experiments. However, due to technical reasons, outlying observations are often obtained, some of which may be extreme. Identifying the causes of outlying observations is important in the analysis of replicated MS data because elaborate pre-processing is essential in order to obtain successful analyses with reliable results, and because manual outlier detection is a time-consuming pre-processing step. It is natural to measure the variability of observations using standard deviation or interquartile range calculations, and in this work, these criteria for identifying outliers are presented. However, the low replicability and the heterogeneity of variability are often obstacles to outlier detection. Therefore, quantile regression methods for identifying outliers with low replication are also presented. The procedures are illustrated with artificial and real examples, while a software program is introduced to demonstrate how to apply these procedures in the R environment system.

Access provided by CONRICYT – Journals CONACYT. Download protocol PDF

Genomic Outlier Detection in High-Throughput Data Analysis

Multivariate Outlier Identification Based on Robust Estimators of Location and Scatter

Detection Methods for Outlier Samples

Key words

1 Introduction

Mass spectrometry (MS) data are often generated from various biological or chemical experiments. Such large amounts of data are usually analyzed automatically in a computing process that consists of pre-processing, significance testing, classification, and clustering. Elaborate pre-processing is essential to obtain successful analyses with reliable results. A key pre-processing step is the detection of outliers, which may have extreme values due to technical reasons [1]. Possible outlying observations need to be examined carefully, and then corrected for or eliminated if necessary. However, as the manual examination of all observations for outliers is time-consuming, possible outliers must be detected automatically.

An outlier is an observation that falls well above or well below the overall bulk of the data [2–4]. A natural approach to detect outliers is to investigate the distribution of the observations and evaluate the outlying degrees of potential outliers. The investigation can be conducted for each peptide because the distributions of observations of peptides may differ substantially. It is natural to measure the variability of observations for each peptide by calculating the standard deviation (SD) or interquartile range (IQR) of each sample [5].

The SD and IQR criteria may produce unreliable outcomes in the case of a few replicates. Furthermore, they are not applicable for duplicated samples. Another, perhaps naive, approach for detecting outliers statistically involves constructing lower and upper fences of differences between two samples for all peptides. A suspected outlier is then an observation whose value is either smaller than the lower fence or greater than the upper fence. However, this may generate a spurious result because variability is heterogeneous in high-throughput data generated even from MS experiments. Naive outlier detection methods such as these ignore the heterogeneity of variability, and may often miss true outliers at high levels and select false outliers at low levels. If a number of technical replicates for each peptide under the same biological condition can be obtained in MS experiments, a search for outliers can be conducted for each peptide. However, only a small number of replicates are usually subjected to MS experiments due to the high cost of experiments and the limited supply of biological samples. Instead, a more elaborate approach for detecting outliers with low false-positive and false-negative rates in MS data is to utilize quantile regression, which is especially useful when the number of technical replicates is small. The outlier detection procedures are illustrated in the next section, with artificial and real datasets in the R environment system.

2 Outlier Detection Methods

Suppose that there are n replicated samples and p peptides in an MS dataset. Then let x _ij be the i-th replicated observation for the j-th peptide from experiments under the same biological or experimental condition, where i = 1, …, n and j = 1, …, p and let \( {y}_{ij}={ \log}_2\left({x}_{ij}\right) \). Typically, n is small and p is very large in high-throughput data, i.e., p ≫ n. We introduce the standard deviation, interquartile range, and quantile regression approaches for identifying outliers in this section.

2.1 Standard Deviation Criteria

The standard deviation describes the distance between the data and the mean, thus providing a measure of the variability of the data. The standard deviation s is defined as the square root of the sum of squared deviations divided by the sample size minus 1, i.e., \( s={\displaystyle {\sum}_i{\left({y}_i-\overline{y}\right)}^2/\left(n-1\right)} \). The z-score for an observation is the number of standard deviations that it falls away from the mean. A positive z-score indicates the observation is above the mean, while a negative z-score indicates that the observation is below the mean. For sample data, an observation from a bell-shaped distribution is a potential outlier if its z-score < −3 or > +3. The z-score criterion for identifying outliers is summarized below:

1.
Compute the standard deviation, s _j for each peptide j, and then z-score \( {z}_{ij}=\left({y}_{ij}-{\overline{y}}_j\right)/{s}_j \), where \( {\overline{y}}_j \) and s _j are the sample mean and standard deviation, respectively.
2.
For each peptide j, observation y _ij is flagged as an outlier if \( {z}_{ij}<-k \) or \( {z}_{ij}>k \), where \( k=2 \) or 3.
3.
This z-score criterion works well when the data follows a bell-shaped, normal distribution. Thus, the thresholds k = 2 and 3 indicate that 95 and 99.7 % of the observations fall within 2 and 3 SDs of the mean, respectively.

Grubbs et al. [6] developed a more elaborate procedure, where the threshold is more precise, and outliers are removed recursively. This is the Grubbs’ test, and its method for identifying outliers is summarized below:

1.
Compute the test statistic \( {G}_{ij}={ \max}_{i=1,\dots, n}\left|{y}_{ij}-{\overline{y}}_j\right|/{s}_j \), where the sample mean is \( {\overline{y}}_j \) and standard deviation is s _j for peptide j.
2.
For each peptide j, observation y _ij is flagged as an outlier if \( {G}_{ij}>c \), where c is the critical value (see Note 1 ).
3.
Remove the detected outlier, and then repeat steps 1–3 until no further outliers are detected.

If \( n=2 \), the statistic is always \( 1/\sqrt{n} \); thus, this test is applicable for \( n>2 \). Grubbs’ test is based on the assumption of normality; therefore, one should first verify that the data could be reasonably approximated by a normal distribution before applying the test. Grubbs’ test detects one outlier at a time. This outlier is expunged from the dataset and the test is reiterated until no further outliers are detected. However, multiple iterations change the probabilities of detection, and the test should not be used for sample sizes of six or less since it frequently tags most of the points as outliers [7].

2.2 Interquartile Range Criteria

The p-th percentile is a value such that p percentages of the observations fall at, or below, a certain value. Three useful percentiles are the quartiles. The first quartile Q ₁ is the 25th percentile, where the lowest 25 % of the data fall below it. The second quartile Q ₂ is the 50th percentile, which is the median. The third quartile Q ₃ is the 75th percentile, and the highest 25 % of the data exists above it. The quartiles split the data into four parts, each containing quarter (25 %) of the observations. The interquartile range (IQR) is the distance between the third and first quartiles, i.e., \( \mathrm{I}\mathrm{Q}\mathrm{R}={Q}_3-{Q}_1 \). An observation is declared an outlier if it is greater than 1.5 IQR below the first quartile or more than 1.5 IQR above the third quartile. Thus, the lower and upper fences for outliers are \( {Q}_1-1.5\mathrm{I}\mathrm{Q}\mathrm{R} \) and \( {Q}_3+1.5\kern0.24em \mathrm{I}\mathrm{Q}\mathrm{R} \) [8]. This IQR criterion for identifying outliers is summarized as follows:

1.
Compute the first and third quartiles, Q _1j and Q _3j, for each peptide j, and then its IQR: \( {\mathrm{IQR}}_j={Q}_{3j}-{Q}_{1j} \).
2.
For each peptide j, observation y _ij is flagged as an outlier if \( {y}_{ij}<{Q}_{1j}-k\kern0.24em {\mathrm{IQR}}_j \) or \( {y}_{ij}>{Q}_{3j}+k\kern0.24em {\mathrm{IQR}}_j \), where k = 1.5 or 3.

In this IQR criterion, a coefficient k determines the strictness of capturing outlying observations. Values of k = 1.5 or 3 are often used. A larger value of k selects outlying observations more conservatively.

The distribution of observations may not be symmetric about the median, but instead may be skewed to the left or the right, implying that the middle of the first and third quartiles is in fact not the median. Thus, the distance from the first quartile to the median is significantly different of that from the third quartile to the median. In this situation, IQR can be too large for one side and too small for the other.

As an alternative, the semi-interquartile range (SIQR) can be more effective. That is, the left and right SIQRs are used rather than IQR. This SIQR criterion for identifying outliers is summarized as follows:

1.
Compute the first, second, and third quartiles, Q _1j, Q _2j, and Q _3j, for each peptide j, and then its SIQR: SIQR_j ^L = Q _2j − Q _1j and SIQR_j ^U = Q _3j − Q _2j .
2.
For each peptide j, observation y _ij is flagged as an outlier if \( {y}_{ij}<{Q}_{1j}-2k\kern0.24em {\mathrm{SIQR}}_j^{\mathrm{L}} \) or \( {y}_{ij}>{Q}_{3j}+2k\kern0.24em {\mathrm{SIQR}}_j^{\mathrm{U}} \), where k = 1.5 or 3.

2.3 Quantile Regression Approaches

The above IQR and SD criteria require for the data to follow a normal distribution, and for the sample sizes to be large enough (see Note 2 ). However, the assumptions may not be satisfied for some MS analyses, and in particular, the sample size is often small (see Note 3 ).

In duplicated experiments (n = 2), two observed values for each peptide should be theoretically identical, but are not identical in practice due to their variability; however, they should not differ substantially. The tolerance of the difference between the two observed values from the same condition is not constant because their variability is heterogeneous. The variability of high-throughput data depends on the intensity levels.

Lower and upper fences can be constructed for detecting outliers using quantile regression in an M–A plot with M and A values in vertical and horizontal axes, respectively, where M _j is the difference between replicated samples for j and A _j is the average, i.e., \( {M}_j={y}_{1j}-{y}_{2j}={ \log}_2\left({x}_{1j}/{x}_{2j}\right) \) and \( {A}_j=\left({y}_{1j}+{y}_{2j}\right)/2=\left(1/2\right){ \log}_2\left({x}_{1j}{x}_{2j}\right) \) to detect the outliers accounting for the heterogeneity of variability [9]. By applying the regression, we compute the 0.25 and 0.75 quantile estimates, Q ₁(A) and Q ₃(A), of the differences, M, depending on the levels, A. Then we construct the lower and upper fences: \( {Q}_1(A)-1.5\kern0.24em \mathrm{I}\mathrm{Q}\mathrm{R}(A) \) and \( {Q}_3(A)+1.5\kern0.24em \mathrm{I}\mathrm{Q}\mathrm{R}(A) \), where \( \mathrm{I}\mathrm{Q}\mathrm{R}(A)={Q}_3(A)-{Q}_1(A) \). To obtain quantile estimates that depend on the levels more flexibly, nonlinear or nonparametric quantile regression can be utilized [10]. This quantile regression approach [1], called the OutlierD algorithm, is summarized as follows:

1.
Generate an M–A plot with M and A values in vertical and horizontal axes, respectively, where M _j is the difference between replicated samples for j and A _j is the average.
2.
Apply linear, nonlinear, or nonparametric regression and then compute the 0.25 and 0.75 quantile estimates, Q ₁(A) and Q ₃(A), of the differences, M, depending on the levels, A.
3.
Construct the lower and upper fences: \( {Q}_1(A)-k\kern0.24em \mathrm{I}\mathrm{Q}\mathrm{R}(A) \) and \( {Q}_3(A)+k\kern0.24em \mathrm{I}\mathrm{Q}\mathrm{R}(A) \), where \( \mathrm{I}\mathrm{Q}\mathrm{R}(A)={Q}_3(A)-{Q}_1(A) \) and k = 1.5 or 3.
4.
Peptide j is claimed as containing an outlying observation if \( {M}_j<{Q}_1\left({A}_j\right)-k\kern0.24em \mathrm{I}\mathrm{Q}\mathrm{R}\left({A}_j\right) \) or \( {M}_j>{Q}_3\left({A}_j\right)+k\kern0.24em \mathrm{I}\mathrm{Q}\mathrm{R}\left({A}_j\right) \), where k = 1.5 or 3.

A larger value of k selects outliers more conservatively. In this approach, one of the two samples is outlying, but which one is not known.

In multiple experiments \( \left(n\ge 2\right) \), it is natural to search for outliers based on all observed values in a high-dimensional space. An outlier will be at a very large distance from the center of the distribution of a peptide. The cutoffs of distances for classification of outliers depend on the degree of variability from the center. The degree of variability is dependent on intensity levels, and the center can be defined as a 45° line from the origin. More flexibly, the center can be obtained by principal component analysis (PCA ) [11]. The first principal component (PC) becomes the center of each intensity level, i.e., a new axis for intensity levels. The experiments are replicated under the same biological and technical condition; hence, the first PC can explain most variations. It implies that it is enough to use the first PC practically. An outlier will be at a large distance from its projection. Following the notations for applying quantile regression, we can define the distance of peptide j to the projection as M _j and the length of the projection on the new axis as A _j. Then the first and third quantiles can be obtained by applying quantile regression on an M–A plot with M and A on the vertical and horizontal axes, respectively. The quantile regression algorithm that uses this projection [7] is called the OutlierDM algorithm, and is summarized as follows:

1.
Shift the sample means to the origin (0, …, 0), i.e., \( {y}_{ij}^{*}={y}_{ij}-{\overline{y}}_i \).
2.
Find the first PC vector v using principal component analysis (PCA ) on the space of y ^*₁ , …, y ^*_n .
3.
Obtain the projection of a vector \( {y}_j^{*}=\left({y}_{1j}^{*},,\dots,, {y}_{nj}^{*}\right) \) of each peptide j on v, where \( j=1,\dots, p \).
4.
Compute the signed length, A _j, of the projection and the length, M _j of the difference between a vector of peptide j and the projection, where \( j=1,\dots, p \).
5.
Obtain the first and third quantile values Q ₁(A) and Q ₃(A), on an M–A plot using a quantile regression approach. Then calculate \( \mathrm{I}\mathrm{Q}\mathrm{R}(A)={Q}_3(A)-{Q}_1(A) \).
6.
Construct the lower and upper fences, \( \mathrm{L}\mathrm{B}(A)={Q}_1(A)-k\kern0.24em \mathrm{I}\mathrm{Q}\mathrm{R}(A)\; and \) \( \mathrm{U}\mathrm{B}(A)={Q}_3(A)+k\kern0.24em \mathrm{I}\mathrm{Q}\mathrm{R}(A) \), where k = 1.5 or 3.
7.
Peptide j is claimed as containing one or more outlying observations if it is located above the upper fence or under the lower fence.

This projection quantile regression approach utilizes all of the multiple replicates simultaneously, and a high-dimensional problem reduces to two-dimensional one that can easily be solved. Note that the quantile regression approaches only determines whether each peptide contains one or more outliers, but not which observation is an outlier. A visual approach (see Note 4 ) is useful to identify which observation(s) of the selected peptide is (are) outlying, is illustrated in the next section.

3 Illustrations

In this section, we illustrate how to detect outliers in two cases with artificial and real examples by using an analysis written in R package OutlierDM [12] (see Note 5 ). The first case is illustrated with an artificial dataset to detect outlying samples for each peptide, while the second case uses a real dataset to detect the peptides containing at least one outlying observation when the number of replicates is small.

3.1 When the Number of Replicates Is Sufficiently Large

The primary purpose of outlier detection in MS data is to determine which observations for each peptide are outlying. If the number of replicates is large enough (see Note 2 ), one of the SD and IQR criteria can seek out the outliers within each peptide. For illustration, an artificial data set with 200 peptides and 15 samples is generated [7]. This dataset (called “toy”) contains ten peptides, and each of them has one outlying observation. This toy dataset can be called within the R package OutlierDM by using the following commands:

> library(OutlierDM)
> data(toy)

To detect outlying observations using the Grubbs’ test with significance level 0.01, the function odm() of OutlierDM can be called as follows:

> fit = odm(x = toy, method = “grubbs”, alpha = 0.01)
> fit

These R commands create an object fit using the three input arguments, dataset used (x = toy), outlier detection method (method = “grubbs”), and significance level (alpha = 0.01), and then display a table consisting of dots (test statistics for the detected outliers) for the first six peptides as an output (Fig. 1).

In the output, the first column is the row number and the second column indicates whether each peptide contains one or more outlying observations, shown as TRUE. Columns G1–G15 give the test statistics for the detected outliers, while the dots for non-outliers. To see all the peptides, the function output(fit) can be conducted in the R environment. In this example, 12 peptides were flagged as containing one or more outlying observations, two of which were flagged falsely. In the first six peptides shown, peptide 3 found two outlying observations, but one of them was flagged falsely. The other five peptides detected all the outlying samples correctly. The detected outlier for each peptide can be shown graphically by the function oneplot():

> oneplot(fit, i = 1)

The object fit was generated from the function odm() and index “i” indicates the row number corresponding to a peptide. Figure 2 shows the dot plot of log2-transformed data points with one outlier (marked by an asterisk) detected by the Grubbs’ test with significance level 0.01 (see Note 4 ).

3.2 When the Number of Replicates Is Small

We would like to know which observations for each peptide are outlying, but for cases where the number of replicates is small (see Note 3 ). In these events, a quantile regression approach can be utilized to detect the peptides having at least one outlying observation. For illustration, we consider a real-life dataset obtained from three replicated LC/MS/MS experiments with 922 peptides (n = 3 and p = 922). The details regarding the experiment can be found in refs. 1 and 7. This dataset can be called up by the following command:

> data(lcms3)

We first illustrate how to detect outliers under the duplicated experiment (n = 2). For instance, consider the first two replicates of the “lcms3” dataset and apply the OutlierD algorithm to the duplicated data set:

> fit2 = odm(x = lcms3[,1:2], method = “pair”, k = 3)
> outliers(fit2)
> plot(fit2)

The argument method = “pair” is for the OutlierD algorithm and k = 3 is a threshold (i.e., a coefficient) used within IQR. Using the function outliers(fit2) generates the output shown in Fig. 3. In this output, the first column indicates the row numbers of the peptides containing an outlier observation. The next columns consist of log2-transformed values (N ₁ and N ₂), A and M values, the first and third quartiles (Q ₁ and Q ₃), and lower and upper bounds (LB and UB), respectively. Figure 4 shows the M–A plot from the object fit2 and the superimposed lines separate outlying peptides from normally observed peptides.

Next, we use all the three replicates simultaneously to detect outliers (n = 3). The number of replicates is still small, so the SD and IQR criteria are not applicable. In this case, the OutlierDM algorithm is applied to the lcms3 dataset:

> fit3 = odm(lcms3, method = “proj”, k = 3)
> outliers(fit3)
> plot(fit3)

The argument method = “proj” is for OutlierDM and k = 3 is again a threshold used by IQR. Using the function outliers(fit3) generates the output Fig. 5. In this output, the first column indicates the row numbers of the peptides containing an outlier observation. The next columns consist of log2-transformed values (N ₁, N ₂, and N ₃), A and M values, the first and third quartiles (Q ₁ and Q ₃), and lower and upper bounds (LB and UB), respectively. Figure 6 shows the M–A plot from the object fit3 and the superimposed lines separate outlying peptides from normally observed peptides.

After detecting the outlying peptides, their raw data points can be plotted to see which observations are furthest from the others using (see Note 4 ):

> oneplot(fit3, i = 18)

This generates the dot plot of the log2-transformed values for the 18th peptide, as shown in Fig. 7. It is seen that one observation is far from the other two for the 18th peptide.

4 Notes

1.
In Grubbs’ test, the critical value c is \( \frac{n-1}{\sqrt{n}}\sqrt{t_{\frac{\alpha }{2n},n-2}^2/\left(n-2+{t}_{\frac{\alpha }{2n},n-2}^2\right)}, \) where \( {t}_{\frac{\alpha }{2n},n-2}^2 \) is a t-distribution with a degree of freedom n − 2 and significance level α/2n.
2.
The standard deviation (SD) and IQR criteria are used to detect outliers for each peptide. These require a sample size greater than six: n > 6.
3.
The quantile regression approaches are used to detect peptides containing one or more outliers when a sample size is small, usually, n ≤ 6. They also work for a sample size of two (n = 2).
4.
After detecting peptides containing one or more outliers using a quantile regression approach, a visual analysis such as a dot plot can be used to reveal which observations are outlying for a selected peptide.
5.
A software program OutlierDM [7] based in the R environment system is available at http://www.r-project.org/package=OutlierDM for conducting outlier detection.

References

Cho H, Lee JW, Kim Y-J et al (2008) OutlierD: an R package for outlier detection using quantile regression on mass spectrometry data. Bioinformatics 24:882–884
Article CAS PubMed Google Scholar
Su X, Tsai C-L (2011) Outlier detection. WIREs Data Mining Knowl Discov 1:261–268
Article Google Scholar
Barnett V, Lewis T (1994) Outliers in statistical data, 3rd edn. Wiley, New York
Google Scholar
Aggarwal CC (2013) Outlier analysis. Springer, New York
Book Google Scholar
Zimek A, Schubert E, Kriegel H-P (2012) A survey on unsupervised outlier detection in high-dimensional numerical data. Stat Anal Data Min 5:363–387
Article Google Scholar
Grubbs FE (1950) Sample criteria for testing outlying observations. Ann Math Statist 21:27–58
Article Google Scholar
Eo S-H, Pak D, Choi J, Cho H (2012) Outlier detection using projection quantile regression for mass spectrometry data with low replication. BMC Res Notes 5:246
Article Google Scholar
Tukey JW (1976) Exploratory data analysis. Addison-Wesley, Boston, MA
Google Scholar
Dudoit S, Yang YH, Callow MJ, Speed TP (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat Sin 12:111–139
Google Scholar
Koenker R (2005) Quantile regression. Cambridge University Press, Cambridge
Book Google Scholar
Jolliffe IT (2005) Principal component analysis, 2nd edn. Springer, New York
Google Scholar
Min H-K, Hyung S-W, Shin J-W et al (2007) Ultrahigh-pressure dual online solid phase extraction/capillary reverse-phase liquid chromatography/tandem mass spectrometry (DO-SPE/cRPLC/MS/MS): a versatile separation platform for high-throughput and highly sensitive proteomic analyses. Electrophoresis 28:1012–1021
Article CAS PubMed Google Scholar

Download references

Acknowledgement

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2010-0007936).

Author information

Authors and Affiliations

Department of Statistics, Korea University, Anam-dong 5ga, Seongbuk-gu, Seoul, 136-701, South Korea
HyungJun Cho Ph.D. & Soo-Heang Eo Ph.D.

Authors

HyungJun Cho Ph.D.
View author publications
You can also search for this author in PubMed Google Scholar
Soo-Heang Eo Ph.D.
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to HyungJun Cho Ph.D. .

Editor information

Editors and Affiliations

Department of Medical Statistics, University Medical Center Göttingen, Göttingen, Germany
Klaus Jung

Rights and permissions

Reprints and permissions

Copyright information

About this protocol

Cite this protocol

Cho, H., Eo, SH. (2016). Outlier Detection for Mass Spectrometric Data. In: Jung, K. (eds) Statistical Analysis in Proteomics. Methods in Molecular Biology, vol 1362. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3106-4_5

Download citation

DOI: https://doi.org/10.1007/978-1-4939-3106-4_5
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-3105-7
Online ISBN: 978-1-4939-3106-4
eBook Packages: Springer Protocols

Publish with us

Policies and ethics

Outlier Detection for Mass Spectrometric Data

Abstract

Similar content being viewed by others

Genomic Outlier Detection in High-Throughput Data Analysis