Introduction

Endometriosis is a benign gynecological disorder, defined by the presence of endometrial-like tissue outside the uterus [1]. It is classified into minimal, mild, moderate, and severe stages (stage I, II, III, and IV, respectively) based on the phenotype, size, and site of lesions and the presence of adhesions [2]. Lesions can appear as three different entities: superficial plaques (peritoneal endometriosis), infiltrative nodules (deep endometriosis), or endometrioma (ovarian endometriotic cysts) [3]. These three different manifestations of endometriosis may have different origins [4] and therefore different biomarkers.

Currently, the only way to definitively diagnose the disease is by visual inspection of the peritoneal cavity through laparoscopy, preferably combined with histological confirmation of endometrial glands and stroma [5]. Vaginal ultrasound and magnetic resonance imaging (MRI) have sufficient diagnostic power to detect more advanced cases presenting with deeply infiltrative lesions and/or endometriomas, but not to diagnose peritoneal endometriosis and endometriosis-associated adhesions [6]. This lack of a noninvasive diagnostic test contributes to a diagnostic delay of 8–11 years [7]. To reduce this delay, a semi- or noninvasive diagnostic test for endometriosis is needed. Currently, neither a single biomarker nor a panel of biomarkers in peripheral blood has been validated as a diagnostic test for endometriosis [8]. The most frequently investigated marker is CA-125; however, on its own it lacks the sensitivity or specificity to act as a replacement test for laparoscopy [9]. The endometriosis research field is in great need of biomarker discovery to allow development of a successful noninvasive diagnostic test for endometriosis [10].

Discovery of new endometriosis biomarkers has traditionally been conducted using a hypothesis-driven approach in which one or a few biomarkers are investigated because of their putative role in the disease process [11]. The combination of multiple biomarkers in a biomarker panel may be necessary to capture the proteins that are systemically up- or downregulated as a consequence of the complex dynamics of the endometriosis disease process. Multiplex analysis allows the parallel measurement of a number of proteins in a low sample volume [12] and is therefore more rapid and cost-effective than conventional singleplex enzyme-linked immunosorbent assays (ELISAs) [13]. The advantages of multiplex immunoassay techniques over mass spectrometry-based methods are their higher detection sensitivities (pg/ml), user-friendliness, lower cost, and the direct identification of the biomarkers without further need of dedicated sample pretreatment, which aids the transition to the validation phase [14, 15]. Multiplex immunoassay technologies, either bead-based or array-based, have been used on serum/plasma in endometriosis research, but these studies focused mainly on cytokine arrays and measured only between 4 and 27 analytes simultaneously [16].

The goal of this study was to discover new biomarkers for endometriosis in peripheral blood using a pooled approach combined with antibody array technology allowing the investigation of a large (up to 1000) and highly diverse number of proteins (cytokines, chemokines, adipokines, growth factors, angiogenic factors, proteases, soluble receptors, soluble adhesion molecules, etc.).

Materials and Methods

Peripheral Blood Plasma Sample Collection

Peripheral blood plasma samples were selected from our endometriosis biobank at the Leuven University Fertility Centre. Patients with symptoms of pelvic pain and/or infertility underwent laparoscopy to determine the presence or absence of endometriosis. Blood samples were collected in ethylenediaminetetraacetic acid (EDTA) tubes before anesthesia at the time of laparoscopy and processed within a maximum of 1 h according to our standard operating procedures (SOPs) [17]. The tubes were centrifuged at 1400 g for 10 min at 4 °C, and the supernatant was aliquoted and stored at −80 °C until use. Patients had signed written informed consent before recruitment, and the study protocol was approved by the Medical Ethics Committee UZ KU Leuven/Research (S56979).

Patient Selection

Plasma samples from patients not using hormonal medication, with laparoscopically confirmed presence (n = 68) or absence (n = 35) of endometriosis, were selected (Table 1). Samples were selected based on availability of 1 ml of plasma, aiming to include ten samples per cycle phase (menstrual, follicular, luteal) in each of three groups: controls, ASRM stage I–II endometriosis patients, and ASRM stage III–IV endometriosis patients. Only patients with the necessary information about basic patient characteristics, menstrual cycle phase, endometriosis stage, and phenotype were selected. Control patients had symptoms suggestive of endometriosis (pelvic pain and/or infertility), but had no previous diagnosis of endometriosis and/or surgery related to endometriosis, and did not have any signs of macroscopic endometriosis at laparoscopy when the samples were obtained. Patients with ASRM stage I–II endometriosis (n = 31) had only superficial peritoneal endometriosis, undetectable on preoperative ultrasound. Patients with ASRM stage III–IV endometriosis (n = 37) had either deep nodules, endometrioma, or a combination of both, and had endometriosis that was detectable on preoperative ultrasound in 17 out of 37 cases.

Table 1 Patient characteristics

Approach for Biomarker Discovery and Verification

A common approach for proteomic biomarker discovery, aimed at reducing the cost and complexity of experiments, is the initial pooling of samples to reduce biological variation. Sample pooling was performed in our study as is often preferred in microarray experiments to reduce subject-to-subject variation and measure a large group of individuals using relatively few arrays [18]. Plasma samples were pooled according to cycle phase, disease stage, and disease phenotype, resulting in a total of 24 pools (Fig. 1). To verify the initial findings obtained after analysis of the pooled samples, we investigated the most interesting proteins in the individual samples that had been included in the pools.

Fig. 1

Detailed overview of pooling schedule. A pool comprising plasma of all controls (pool 1) and plasma of all endometriosis cases (pool 2) was made. Plasma samples from patients with endometriosis were also divided into a pool comprising all stage I–II samples (pool 3) and all stage III–IV samples (pool 4). Further, sub-pools were made according to cycle phase (pools 5–7 for controls, pools 8–10 for all endometriosis cases, pools 11–13 for stage I–II endometriosis cases, and pools 14–16 for stage III–IV endometriosis cases). Finally, for stage III–IV endometriosis, pools were made according to disease phenotype (pools 17–24). Triangles indicate pooling according to disease stage. Squares indicate pooling according to disease phenotype. OMA signifies endometrioma.

Overview of Experiments

Figure 2 provides a flow chart of the performed experiments. Briefly, as explained above, plasma samples were pooled according to endometriosis status, cycle phase, and disease phenotype. Pools were investigated for possible endometriosis biomarkers using the L-series 1000 Human Antibody Array (RayBiotech, Norcross, GA, USA) and later with the Quantibody 660 (RayBiotech, Norcross, GA, USA). To verify the promising biomarkers detected in the pooled experiment, individual samples were analyzed using a custom-made 10-plex array (RayBiotech, Norcross, GA, USA). To verify further discrepancies, the gold standard for protein detection, i.e., single ELISA, was used.

Fig. 2

Overview of the sequence of experiments and pooling schedule. Sample pools and individual samples were analyzed using four different immunoassay techniques

Immunoassay Methods

RayBio L-Series Human Antibody Array 1000

The RayBio L-series Human Antibody Array 1000 (L-series 1000) allows detection of 1000 proteins from pathways associated with endometriosis (cytokines, chemokines, adipokines, growth factors, angiogenic factors, proteases, soluble receptors, soluble adhesion molecules, etc.). Antibody spots are printed onto the array in duplicate.

The 24 sample pools were diluted fivefold and dialyzed to remove endogenous biotin before being biotinylated. Biotinylated samples were incubated on the RayBio L-series Human Antibody Array L-507 and L-493 slides (RayBiotech) at a 1:5 dilution, before being visualized with a streptavidin-Cy3 conjugate using a GenePix 4000B scanner (Molecular Devices, San Jose, CA, USA) at 532 nm and a PMT setting of 450. PMT stands for photomultiplier tube, a device that converts the fluorescent light into a numeric value. The PMT setting was chosen based on the following criteria: no saturation of signals, no excess background, visualization of most features, and positive control spots showing three distinct signal intensities (POS1 > POS2 > POS3). Image analysis was done with GenePix Pro software (Molecular Devices). Median fluorescence signal intensities minus local background were imported into the RayBio Analysis Tool, which carried out the normalization across arrays, allowing direct (semiquantitative) comparison of signal intensities. A 1.5-fold increase or decrease in signal intensity for a given analyte between groups was considered a measurable and significant difference, provided that the duplicate signals were two standard deviations above the mean background (according to the manufacturer’s manual).
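The detectability and fold-change rules described above can be sketched as follows. This is a minimal Python illustration and not the RayBio Analysis Tool's implementation: the function names, the use of the sample standard deviation, the choice of background values, the inclusive boundary at exactly 1.5, and the treatment of a decrease as a ratio of at most 1/1.5 are all assumptions made for the example.

```python
from statistics import mean, stdev

FOLD_CHANGE_CUTOFF = 1.5  # threshold stated in the text


def is_detectable(duplicate_signals, background_values, n_sd=2.0):
    """Return True if every duplicate spot exceeds mean background + n_sd * SD.

    Assumption: the sample standard deviation of the supplied background
    values is used; the manufacturer's exact definition may differ.
    """
    threshold = mean(background_values) + n_sd * stdev(background_values)
    return all(signal > threshold for signal in duplicate_signals)


def classify_fold_change(case_signal, control_signal, cutoff=FOLD_CHANGE_CUTOFF):
    """Label an analyte 'up', 'down' or 'unchanged' from two normalized signals.

    Assumption: a decrease is called when the case/control ratio is at most
    1 / cutoff, mirroring the increase rule.
    """
    ratio = case_signal / control_signal
    if ratio >= cutoff:
        return "up"
    if ratio <= 1.0 / cutoff:
        return "down"
    return "unchanged"
```

Because each pool is measured once, a rule of this kind yields a label per analyte but no estimate of uncertainty, which is why the threshold cannot be read as a statistical test.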

This experiment was conducted in-house in our laboratories at KU Leuven.

RayBio Quantibody 660 Array

Due to the limited reproducibility between runs of the L-series 1000 experiment, the discovery phase was repeated using the Quantibody 660 array (RayBiotech). This assay consists of 18 array slides allowing the measurement of 660 non-overlapping cytokines, chemokines, growth factors, angiogenic factors, etc. It is characterized by a sandwich design and a quadruplicate spot template which confers higher assay reproducibility. Contrary to the L-series 1000, the Quantibody 660 allows quantitation of the proteins by inclusion of a standard curve on each slide.

Four sample pools (pools 1–4) were diluted twofold and applied onto the 18 Quantibody array slides. Results were obtained according to the manufacturer’s protocol. Briefly, after binding of the analytes to the capture antibodies, a biotin-labeled detection antibody cocktail was added. Streptavidin-conjugated Cy3 equivalent dye was then added and visualized with a laser scanner. Normalization of signal intensities across arrays was done using the ratio of the positive control spots POS1/POS2. To account for outliers in fluorescence measurements, any value differing by 35% from the average fluorescence minus background was removed, and the spots were re-averaged using the remaining three (manufacturer’s instructions). Each slide contained a standard curve, allowing the calculation of protein concentrations for each sample. Only proteins with values between the limit of detection (LOD; 3.33 SD above blank) and max (highest standard) were considered for differential expression analysis. A 1.5-fold increase or decrease in concentration for a given analyte between groups was considered a measurable and significant difference.
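The outlier rule for quadruplicate spots and the LOD-to-max window can be illustrated with a short Python sketch. It is an interpretation under stated assumptions rather than the manufacturer's procedure: the function names are hypothetical, at most one spot (the one furthest from the mean of all four) is dropped, "differing by 35%" is read as a relative deviation greater than 35%, and both range bounds are treated as inclusive.

```python
from statistics import mean


def reaverage_quadruplicate(spots, max_rel_dev=0.35):
    """Average background-corrected replicate spots, dropping one outlier.

    Assumption: only the single spot furthest from the mean of all
    replicates is removed, and only if its relative deviation from that
    mean exceeds max_rel_dev.
    """
    avg = mean(spots)
    if avg == 0:
        return avg  # relative deviation is undefined; keep the plain mean
    worst = max(spots, key=lambda s: abs(s - avg))
    if abs(worst - avg) / abs(avg) > max_rel_dev:
        remaining = list(spots)
        remaining.remove(worst)
        return mean(remaining)
    return avg


def within_quantifiable_range(concentration, lod, highest_standard):
    """Return True if a concentration lies between the LOD and the top standard.

    Assumption: both bounds are inclusive.
    """
    return lod <= concentration <= highest_standard
```

For example, replicate values of 100, 105, 95, and 200 would be re-averaged to 100 after removal of the fourth spot, whereas 100, 105, 95, and 110 would keep all four spots.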

This experiment was conducted at the RayBiotech laboratories in Norcross, GA, USA.

RayBio Custom-Made 10-Plex

To verify the ten relevant proteins identified in the Quantibody 660 experiment (see Results section for more details), four sample pools (pools 1–4) and the individual samples constituting the pools were analyzed using a multiplex immunoassay for ten proteins (10-plex array) custom-made by RayBiotech. As with the Quantibody 660, the custom-made 10-plex has a sandwich design and a quadruplicate spot template. As quality control of the custom array, it was confirmed that for each protein the standard curve was present, linear, and offered > 4 distinguishable points between blank and max. Additionally, the ten targets were pretested for cross-reactivity with the antibodies and the standard. Quality control cross-checks were done one at a time for all antibodies, first in a hand-spotted membrane array and then confirmed with a run of the standard curves for the targets on the glass slide after printing.

Four sample pools (pools 1–4) and the individual samples (n = 103) constituting the pools were analyzed on custom-made slides (RayBiotech) for ten proteins selected based on the Quantibody 660 experiment (XIAP, MMP-13, Prolactin, CD48, CEA, DNAM-1, IL-31, WISP-1, GASP-2, PF4). The protocol was the same as for the Quantibody 660 (see above).

This experiment was conducted at the RayBiotech laboratories in Norcross, GA, USA.

RayBio Single ELISAs

Four sample pools (pools 1–4) and the individual samples (n = 103) constituting the pools were analyzed using single ELISAs (RayBiotech) for four proteins: XIAP, CD48, CEA, and GASP-2 (reason for selection of these four proteins is explained in the Results section), according to the manufacturer’s protocol. CD48 and GASP-2 were custom-made ELISAs. Each sample was measured in duplicate and according to manufacturer’s instructions.

This experiment was conducted in our laboratory at KU Leuven (XIAP, CEA) and at the RayBiotech laboratories in Norcross, GA, USA (CD48, GASP-2).

Data Interpretation and Statistics

For pools, a 1.5-fold change in median fluorescence signal intensity (L-series 1000) or concentration (Quantibody 660) between sample pools was considered a measurable and significant difference, provided that the signals were above background (see above).

For individual samples, Mann-Whitney tests were conducted to compare control and endometriosis samples in GraphPad Prism Version 6 (GraphPad Software, San Diego, CA, USA). Correlation between individual custom 10-plex array and ELISA measurements was assessed using Spearman correlation (GraphPad Prism).
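As an illustration of the rank-based correlation named above, the following Python sketch computes a Spearman coefficient as the Pearson correlation of average ranks. It is a generic textbook formulation and makes no claim to reproduce GraphPad Prism's internals; the function names are hypothetical, and constant inputs (zero rank variance) are not handled.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)
```

Working on ranks rather than raw concentrations makes the coefficient insensitive to differences in absolute scale between two assays, so it measures only whether the assays order the samples similarly.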

Results

Experiment 1: L-Series 1000

Nine hundred twenty nine out of 1000 proteins had both duplicate median fluorescence signals 2 standard deviations above background. Of those, 104 proteins had a 1.5-fold change between the endometriosis pool (pool 2) and the control pool (pool 1). Twenty-four proteins were upregulated (1.5–3.7-fold change), and 80 proteins were downregulated (1.5–3.4-fold change) in the pool containing plasma from the endometriosis patients (pool 2) compared to the pooled control samples (pool 1). When comparing the stage I–II pool (pool 3) with the control pool (pool 1), overall, 908 proteins out of the platform of 1000 were detectable (signals 2 standard deviations above background); and 210 proteins were upregulated (1.5–4.4-fold change), while 18 proteins were downregulated (1.5–2.8-fold change) in the stage I–II pool. When comparing the stage III–IV pool (pool 4) with the control pool (pool 1), 914 proteins had both signals 2 standard deviations above background. Three hundred twenty proteins were upregulated (1.5–11-fold change), while 14 proteins were downregulated (1.5–2.5-fold change) in the stage III–IV pool.

It was striking that when comparing the endometriosis pool (pool 2) with the control pool (pool 1), a larger number of downregulated than upregulated proteins was found. In contrast, when comparing the ASRM stage I–II pool (pool 3) or the ASRM stage III–IV pool (pool 4) with the control pool (pool 1), the majority of differentially expressed (DE) proteins were upregulated instead of downregulated. This lack of consistency was especially clear as, out of the 80 proteins that were downregulated in the total endometriosis pool (pool 2) versus the control pool (pool 1), only four proteins were also downregulated in pool 3 (stage I–II) and pool 4 (stage III–IV) versus control pool 1 (TECK/CCL25, Livin, CD23, PTN).

More detailed analysis of the results across all 24 sample pools revealed that the proteins that were DE in the “all disease” (pool 2) versus “all control” pool (pool 1) were not consistently DE when comparing pools according to cycle phase (pools 5–10), according to a combination of cycle phase and disease stage (pools 11–16), or according to phenotype (pools 17–24). We had expected that the trends seen in the “all disease” (pool 2) versus “all control” pool (pool 1) would reappear when comparing the disease and control pools of the follicular (pools 5 and 8), luteal (pools 6 and 9), and menstrual phase (pools 7 and 10). Instead, we found that for 37 of the 104 DE proteins of pool 2 versus pool 1, no DE was found in the follicular, luteal, or menstrual phase. For 45 of the 104 DE proteins, differential expression was found in only one of the cycle phase comparisons. This lack of consistency was even more obvious for the stage I–II (pool 3) versus control (pool 1) comparison, where for 187 of the 228 DE proteins, none of the comparisons per cycle phase showed differential expression. For the stage III–IV pool (pool 4) versus control (pool 1), of the 334 DE proteins, 231 were not DE in any of the three cycle phases.

We illustrate this problem with the example of TRANCE, a protein that was 2.9-fold downregulated in the disease (pool 2) versus control pool (pool 1). However, the normalized fluorescence intensity for TRANCE was much higher in samples from women with endometriosis stage I–II (pool 3) and stage III–IV (pool 4) than in samples from all patients with endometriosis together (pool 2) (Fig. 3a). Thus, the downregulation seen in pool 2 versus 1 was not recapitulated when comparing either pool 3 or pool 4 versus pool 1. The normalized fluorescence for TRANCE tended to be lower in control pools from the three cycle phases (pools 5–7) than in the pool with all controls (pool 1), while for the disease pools (pool 2 and 8–10), the opposite trend was seen (Fig. 3b). The downregulation that was seen in the endometriosis pool (pool 2) versus the control pool (pool 1) was only partly recapitulated in the menstrual endometriosis pool (pool 10) versus menstrual control pool (pool 7) (Fig. 3b, red bars), whereas no change was seen in the follicular phase, and an opposite trend (upregulation) was observed in the luteal phase pools. When evaluating the results according to disease stage, for endometriosis ASRM stage I–II, the sub-pools according to cycle phase (pools 11–13) corresponded well with the endometriosis ASRM stage I–II pool from all cycle phases (pool 3), whereas for endometriosis ASRM stage III–IV, the sub-pools according to cycle phase (pools 14–16) were lower than the pool with all endometriosis ASRM stage III–IV samples (pool 4) (Fig. 3c). It should be noted that TRANCE is only an illustration of the conflicting results that we found and that similar problems were encountered for other proteins (data not shown).

Fig. 3

Example of a protein of which the fluorescence intensities showed different trends in different sample pools. (a) For the disease and control pools, regardless of cycle phase and disease stage. (b) According to cycle phase. (c) According to ASRM endometriosis stage

As we witnessed this lack of consistency in fluorescence intensities of sub-pools and fold change trends, we wondered about the reproducibility of the L-series 1000 assay. To assess the reproducibility of the L-series 1000 results, the experiment was repeated for pools 1–4 (all control, all disease, all stage I–II, and all stage III–IV pools). Several inconsistencies with regard to the original experiment were found.

Firstly, we noticed that the fluorescence signal was lower in the second run than in the first run when scanning at the same scanner settings (Supplementary Fig. 1). When scanning the second run at a higher PMT setting, we did not find a higher number of DE proteins, and therefore we used the original PMT settings for the repeated experiment.

For only 11 of the 104 DE proteins in the original run (CXCR3, Follistatin-like 1, CXCR6, PLUNC, IL-21, CXCR5 /BLR-1, MIG, FGF-8, TIM-1, TFPI, CD23), the obtained fold change of the endometriosis pool versus the control pool (pool 2 versus 1) was comparable (fold change of > 1.5 in same direction) in the second run. For stage I–II versus control, this was the case for 56 of the originally 228 DE proteins. For stage III–IV versus control, this was the case for 113 of the originally 334 DE proteins. The proteins that corresponded between both runs were not systematically the ones with the highest fold change difference in the first nor the second run; therefore we did not see the need to question our cutoff of 1.5-fold change.
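The between-run comparison described above (a fold change beyond the 1.5 cutoff in the same direction in both runs) can be sketched in Python as follows. The function names and the dictionary layout are hypothetical, the boundary at exactly 1.5 is treated as inclusive, and a decrease is assumed to correspond to a ratio of at most 1/1.5.

```python
def direction(fold_change, cutoff=1.5):
    """Return 'up', 'down', or None for a case/control ratio.

    Assumption: downregulation corresponds to a ratio of at most 1 / cutoff.
    """
    if fold_change >= cutoff:
        return "up"
    if fold_change <= 1.0 / cutoff:
        return "down"
    return None


def concordant_proteins(run1, run2, cutoff=1.5):
    """List proteins that are DE in run1 and DE in the same direction in run2.

    run1 and run2 map protein names to case/control ratios (hypothetical layout).
    """
    concordant = []
    for protein, fc1 in run1.items():
        d1 = direction(fc1, cutoff)
        if d1 is None or protein not in run2:
            continue
        if direction(run2[protein], cutoff) == d1:
            concordant.append(protein)
    return concordant
```

Expressed as proportions, the counts reported above correspond to roughly 10.6% (11/104), 24.6% (56/228), and 33.8% (113/334) of the originally DE proteins being reproduced in the second run.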

We concluded that we could not confirm the results of the first run in the second run and therefore decided that this assay lacked the consistency and reproducibility in order to confidently select potential endometriosis biomarkers. This lack of reproducibility may be a consequence of variability after the biotinylation step, the use of only one antibody instead of a sandwich, and the lack of sufficient normalization. These aspects are addressed in depth in the Discussion section. As a result, we used in experiment 2 another multiplex platform for biomarker discovery (Quantibody 660) with higher expected consistency/reproducibility than the L-series 1000 as explained in the next section.

Experiment 2: Quantibody 660

Because of their sandwich design and quadruplicate spot template, the Quantibody arrays have been reported to have a lower coefficient of variation than the L-series 1000 [14], expected to lead to better consistency and reproducibility. This Quantibody 660 assay allows detection of 660 proteins and partially overlaps with the L-series 1000 assay, namely, 330 proteins are identical between both assays. Due to cost restraints, we only re-evaluated pools 1–4 with this multiplex immunoassay. We only took into consideration the 507 proteins that were between LOD and max (highest standard). Two hundred eighty proteins were upregulated (> 1.5-fold change, between LOD and maximum) and 29 proteins downregulated in the endometriosis pool (pool 2) versus the control pool (pool 1). For pool 3 (stage I–II) versus pool 1, 235 proteins were upregulated in endometriosis, while 38 proteins were downregulated. For pool 4 (stage III–IV) versus pool 1, 241 proteins were upregulated in endometriosis, while 44 proteins were downregulated. Out of the 309 proteins that were DE in pool 2 versus pool 1, 221 were also DE in pool 3 versus pool 1, and 223 were DE in pool 4 versus pool 1. One hundred eighty out of 309 proteins were DE in both pool 3 and pool 4 versus pool 1.

Ten proteins were selected for further evaluation of which seven were upregulated and three downregulated in different pathways that are known to be important in endometriosis (Table 2). These proteins were selected based on their fold change of endometriosis (pool 2) versus control (pool 1) and their relevance to the pathogenesis of endometriosis (Table 2).

Table 2 Proteins selected from the Quantibody 660 experiment for verification, based on fold change and biological process and technical verification result of the custom 10-plex

Experiment 3: Custom 10-Plex

We wanted to evaluate whether the results obtained from the Quantibody 660 experiment could be verified using a smaller-scale multiplex immunoassay on both the sample pools and the individual samples making up the pools. While ELISAs are the gold standard for protein measurement, for three of the ten selected proteins, no single ELISA was readily available (CD48, GASP-2, and prolactin). These assays would have to be custom-made or ordered from a different company. Since ordering from a different company could confound the data through the use of different antibody pairs, we preferred the custom-made option. Using a custom-made 10-plex allowed us to save sample volume and to keep the assay type unchanged, since a 10-plex follows the same design principle as the Quantibody 660, only with different combinations of antibodies.

Experiment 3.1: Results for Sample Pools

Of the ten proteins selected for validation, only four were still differentially expressed using the custom 10-plex when comparing the endometriosis (pool 2) and control pool (pool 1), namely, XIAP (4.3-fold ↑), CD48 (2.0-fold ↑), DNAM-1 (11.7-fold ↑), and IL-31 (4.8-fold ↑) (Table 2). For MMP-13, prolactin, CEA, and PF4, the fold change between disease and control pool approximated 1. For CEA all values were above the detection limit. In the case of GASP-2, the value of the control pool was below the detection limit. Lastly, for WISP-1 an increased instead of decreased ratio was found. In general, the absolute concentrations of the proteins differed widely from the results of the Quantibody 660 experiment, reflecting the influence of the change in assay composition (data not shown).

Experiment 3.2: Results for Individual Samples Constituting the Sample Pools

Univariate statistical analysis for the individual samples making up the pools only showed a statistically significant difference between the individual control and disease samples for IL-31 (p = 0.03; upregulation) (Fig. 4). The area under the ROC curve (AUC) for IL-31 was 0.6337 (95% CI: 0.5174–0.7499). For PF4, a borderline significant (p = 0.059; downregulation) difference was observed between individual control and disease samples. The other proteins did not show significant changes between the individual endometriosis cases and controls.
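The area under the ROC curve is directly related to the Mann-Whitney statistic: it equals U divided by the number of case-control pairs, that is, the probability that a randomly chosen case has a higher value than a randomly chosen control, with ties counted as one half. The Python sketch below illustrates this relationship; the function name is hypothetical, and the confidence interval reported above is not reproduced here.

```python
def auc_from_pairs(controls, cases):
    """AUC as the fraction of case-control pairs in which the case is higher.

    Ties contribute 0.5. This equals the Mann-Whitney U statistic for the
    cases divided by len(cases) * len(controls).
    """
    wins = 0.0
    for case in cases:
        for control in controls:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(cases) * len(controls))
```

An AUC of 0.5 indicates no discrimination between groups and 1.0 indicates perfect separation, which places the value of about 0.63 reported for IL-31 closer to the former.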

Fig. 4

The ten proteins measured by a custom-made 10-plex immunoassay in individual samples making up the pools. Only IL-31 showed a significant difference between the control and endometriosis samples. PF4 showed borderline significance. P values were obtained by Mann-Whitney test

Taking into account experiments 2–3, there was only a limited overlap between the Quantibody 660 array and the custom-made experiment. For only four proteins, the fold changes of the original sample pools were recapitulated using the custom 10-plex. For XIAP, CD48, and DNAM-1, the results of the pooled samples analyzed by the custom 10-plex corresponded well with the Quantibody 660 results, but for individual samples, no significant changes were found. When looking at the individual samples constituting the pools, only IL-31 had a significant difference between the endometriosis and control samples. Therefore, we could only label IL-31 as truly verified. To verify whether the discrepancies were due to the custom format, we chose four proteins for further evaluation using the gold standard for protein measurements, namely, single ELISA. Due to sample volume restriction and financial cost, we could only analyze four proteins.

Experiment 4: Single ELISA

To verify the outcome of the custom 10-plex and to further investigate the discrepancy between the custom-made 10-plex and the Quantibody 660 results, we performed the gold standard of single ELISA on all samples for four proteins: GASP-2, CEA, XIAP, and CD48. We hypothesized that the use of the gold standard would shed light on whether the Quantibody 660 results or the custom 10-plex results were more reliable. We did not include IL-31, because we considered this protein verified. We investigated the discrepancy on two levels:

1) Between the pooled samples of the two techniques (Quantibody 660 and custom 10-plex). This was the case for GASP-2 and CEA. GASP-2 was below the detection limit for the control pool in the custom-made assay and in general had lower concentrations in the custom-made assay than in the original Quantibody 660 experiment. CEA was above the detection limit for all samples and did not show any difference between control and endometriosis samples in the custom-made assay, while it had been upregulated in the Quantibody 660 experiment.

2) Between the pools and the individual samples (XIAP and CD48). XIAP and CD48 were both upregulated in the endometriosis pools in both the Quantibody 660 and custom 10-plex experiments, but not in the individual endometriosis samples in the custom 10-plex.

Several problems were encountered using single ELISAs. For XIAP, more than 70% of the values were below the detection limit using the single ELISA. For CD48, 48% of samples were not measurable, in contrast with the custom array, where only 8.5% of values were below the limit of detection. For CEA, on the other hand, samples had to be diluted 1:64 before falling within the linear range of the standard curve. This resulted in sample values around 500 ng/ml, which is much higher than expected from clinical practice. GASP-2 was the only protein that could be reliably measured with the provided single ELISAs, but our analysis did not show a difference between the endometriosis and control samples (data not shown). Furthermore, the correlation between ELISA and custom-array concentrations was low for GASP-2 (Spearman r = 0.16; Fig. 5).

Fig. 5

Correlation of GASP-2 measurements between custom array and single ELISA. Regression line (black) illustrates for GASP-2 the correlation between the measurements from the single ELISA (x-axis) and the measurements from the custom array (y-axis). Blue diagonal line represents 100% agreement between assays

In summary, the results of the single ELISA experiments did not shed more light on which assay was more reliable.

Discussion

There is a great need for a diagnostic test for endometriosis to reduce the diagnostic delay, but so far no peripheral blood biomarkers have been validated for clinical use. We aimed to address this problem by using multiplex immunoassays. However, our efforts were hampered by a lack of repeatability when using the multiplex immunoassays.

Our paper represents, to the best of our knowledge, the first report on the evaluation of large-scale multiplex immunoassays for plasma biomarkers of endometriosis. A version of the L-series 1000 detecting a lower number of targets (L-507) has previously been used on peritoneal fluid, but the group did not verify their results in individual patients [19]. The strength of our study lies in its systematic approach and in the use of several immunoassay platforms, where the reproducibility of each of the experiments was evaluated using a different immunoassay platform. The L-series 1000 protein platform showed a lack of consistency in fold change trends between different sample pools and between runs of the same pools. This inconsistency cannot be explained by biological variation, but only by technical variability of the L-series 1000 measurements. The Quantibody 660 showed a better correspondence between fold change trends in the different sample pools. However, differential expression of only four out of ten proteins (selected from the Quantibody 660 analysis) could be reproduced in the sample pools using a custom-made multiplex assay. In individual samples, only one out of ten proteins (IL-31) showed a significant difference between control and disease. To determine whether the custom 10-plex possibly lacked some of the sensitivity/specificity to pick up subtle differences, we verified the multiplex data with single ELISAs, but we did not find the same trends. A second strength of the study is the patient cohort: only ultrasound-negative patients with peritoneal endometriosis were selected for the stage I–II pools, as these patients are most in need of a diagnostic test because their endometriosis cannot be diagnosed by ultrasound [17, 20].

Sample pooling can be perceived as both a strength and a weakness. The approach to pool samples is common in RNA microarray studies where it has been used to identify biomarkers or expression patterns shared between individuals [18]. Pooling is implemented to reduce natural biological variation within a sample [21] and to reduce financial cost. Furthermore, it is able to reduce the handling of samples in processes that are time- or sample-consuming. In general, it is useful to observe trends between two pools of relatively well-characterized patient groups. However, as natural biological variation is leveled out, this also happens for disease-specific alterations, resulting in false-negative results [22]. In contrast, an outlier protein in a single sample may affect the entire pool and cause a false-positive result [22]. We chose to pool samples instead of choosing a low number of cases and controls, which has been the case for some other studies using antibody arrays [19, 23, 24], because we wanted to focus on general disease-specific alterations rather than individual differences.

We acknowledge that our study has several weaknesses. Firstly, to interpret the results, the commonly used threshold of a 1.5-fold change was chosen. This cutoff has limitations, as it essentially ignores the standard error, even though low-abundance proteins that approach the detection limit are known to show higher variability than high-abundance proteins [25, 26]. We chose this threshold because the pooling of samples precluded the use of statistics in the discovery phase of the antibody array experiments. A second limitation is that, due to cost constraints, we did not include all 24 sample pools in the Quantibody 660 experiment, but only the four most important pools (all controls, all endometriosis, all stage I–II, and all stage III–IV). Thirdly, due to limited consistency and conflicting results from previously published studies on peripheral blood protein biomarkers for endometriosis [9], no reliable estimate of an expected effect size could be made to allow a sample size calculation.
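
The limitation of a fixed fold-change cutoff can be made concrete with a short sketch. The signal values, the relative measurement errors, and the symmetric treatment of downregulation (fold change at or below 1/1.5) are illustrative assumptions and do not describe the exact analysis settings of this study.

```python
def passes_cutoff(fc, cutoff=1.5):
    """Flag a protein as changed if it is up by >= cutoff or down by <= 1/cutoff.
    The symmetric downregulation rule is an assumption made for illustration."""
    return fc >= cutoff or fc <= 1.0 / cutoff

def fold_change_range(case, control, rel_error):
    """Range of fold changes compatible with a relative measurement error
    (for example a coefficient of variation) applied to both signals."""
    low = case * (1 - rel_error) / (control * (1 + rel_error))
    high = case * (1 + rel_error) / (control * (1 - rel_error))
    return low, high

# Hypothetical signals: both proteins show the same nominal fold change of 1.6.
high_abundance = {"case": 1600.0, "control": 1000.0, "rel_error": 0.02}
low_abundance = {"case": 16.0, "control": 10.0, "rel_error": 0.30}

results = {}
for name, p in (("high", high_abundance), ("low", low_abundance)):
    fc = p["case"] / p["control"]
    results[name] = (fc, fold_change_range(p["case"], p["control"], p["rel_error"]))
    low, high = results[name][1]
    print(f"{name}-abundance: FC={fc:.2f}, flagged={passes_cutoff(fc)}, "
          f"plausible range {low:.2f} to {high:.2f}")
```

Both hypothetical proteins pass the 1.5-fold cutoff, but only for the high-abundance protein does the whole plausible range stay above 1.5. For the low-abundance protein the range includes 1.0, that is, no change at all.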

In the context of endometriosis, multiplex immunoassays have mainly been used in studies analyzing peritoneal fluid and to a lesser extent in reports assessing peripheral blood plasma [16]. Most multiplex experiments have implemented bead-based immunoassays detecting a limited number of cytokines, but there is no consensus on whether they are useful for the noninvasive diagnosis of endometriosis [9]. Cytokines often present with variable measurements due to their short half-life and the influence of pre-analytical variables [16]. Our approach for discovery of new endometriosis biomarkers encompassed the use of a large antibody array (L-series 1000 and later Quantibody 660) identifying proteins in different pathways known to be involved in the pathogenesis of endometriosis, including apoptosis, ECM breakdown/remodeling, female hormones, immune response, cell adhesion, glycoproteins, cytokines, Wnt signaling, serine protease- and metalloprotease-inhibitor activity, angiogenesis, and chemotaxis. For initial technical verification, we chose one marker from each pathway, as we estimated that a combination of proteins from different pathways would be more useful in a biomarker panel (see also Table 2). We only succeeded in verifying the potential utility of IL-31. IL-31 is part of the gp130/IL-6 cytokine family and exerts pleiotropic effects on the immune system [27]. It has been cited as a potential biomarker for endometrial cancer [28], but has not yet been investigated as a biomarker for endometriosis [9]. IL-31 on its own will not suffice as a diagnostic test for endometriosis, but it could be investigated in combination with other markers for endometriosis. Further research on this marker in independent patient populations is necessary.

In therapeutic areas outside reproductive medicine, antibody arrays, including those manufactured by RayBiotech, have also been used for biomarker discovery. It is relevant to consider to what extent semiquantitative multiplex assays have led to the discovery and validation of new and robust biomarkers that can be used in clinical practice. Part of the L-series 1000, namely, the L-507 slide, has been used for biomarker screening in plasma or serum in patients with heart failure with preserved ejection fraction [23], ovarian cancer [29], colorectal cancer [30], pancreatic ductal adenocarcinoma [31], and hepatocellular carcinoma [32]. Several slides of the Quantibody array were used to investigate inflammatory proteins in serum samples from American tegumentary leishmaniasis patients [33] and to identify potential serum biomarkers for the discrimination of neurodegenerative parkinsonian disorders in the initial screening stage [34]. However, none of these studies has led to a clinically approved biomarker test so far. In some of these reports, the array results were not reproducible, which has been cited as a potential reason for failure of biomarker validation [34, 35]. In studies where a different patient cohort was used for validation, the failure to reproduce some of the original array trends with another technical platform could not be solely attributed to analytical variability, but could also be due to a shift in patient cohort composition [32, 35]. Using an approach similar to ours, Mahlknecht et al. applied a Quantibody array to pooled samples and found a significant difference for only two out of seven proteins selected for further investigation in the individual samples making up the sample pools [34].

There are several possible reasons for a lack of reproducibility in array experiments [36]. A known challenge in using multiplex immunoassays is optimizing the detection of all analytes with a single dilution and a single diluent [37]. Furthermore, variation in the immunoassay manufacturing process, such as antibody spot printing, confers imprecision and variability [38]. The different antibody array designs also bring about different levels of variation.

The L-series 1000 employs a strategy of direct sample labeling with biotin, eliminating the need for a detection antibody. This practice allows upscaling of the number of targets, but the use of only a capture antibody reduces the specificity. An additional disadvantage of direct labeling is the possible masking of the epitope and the difficulty of obtaining homogeneous labeling among high-abundance and low-abundance proteins [39, 40]. Furthermore, batch-to-batch differences in biotinylation can occur, so that samples are biotinylated more or less efficiently from one batch to another [41]. This may have been the problem between our first and second L-series 1000 experiments, where the repeated experiment showed overall lower fluorescence values at the same PMT settings. A second disadvantage of the L-series 1000 is the lack of an internal spike-in control in the sample for array normalization. Such a control would have allowed us to account for sample changes during the dialysis and biotinylation steps before the sample is incubated on the array, but it was not available. Instead, the positive control spots (different amounts of biotinylated protein) are compared across slides to normalize for differences in Cy3 labeling and in fluorescence measurements of the scanner.
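
The principle of normalizing across slides by means of positive control spots can be sketched as follows. This is a generic illustration under the assumption that each slide is rescaled by the ratio of mean positive control signals; it is not the manufacturer's analysis software, and all function names and values are hypothetical.

```python
from statistics import mean

def normalization_factor(reference_controls, slide_controls):
    """Ratio of the mean positive control signal on the reference slide
    to the mean positive control signal on the slide to be normalized."""
    return mean(reference_controls) / mean(slide_controls)

def normalize_slide(signals, reference_controls, slide_controls):
    """Rescale every spot on a slide by the positive control ratio."""
    factor = normalization_factor(reference_controls, slide_controls)
    return {name: value * factor for name, value in signals.items()}

# Hypothetical fluorescence values (arbitrary units), not study data.
reference_controls = [20000.0, 10000.0, 5000.0]  # positive control spots, slide 1
slide2_controls = [10000.0, 5000.0, 2500.0]      # same spots, slide 2 reads half as bright
slide2_signals = {"protein_A": 400.0, "protein_B": 1250.0}

factor = normalization_factor(reference_controls, slide2_controls)
normalized = normalize_slide(slide2_signals, reference_controls, slide2_controls)
print(factor, normalized)
```

Because the positive control spots are printed on the slide, a correction of this kind addresses differences in Cy3 labeling and scanning only. It cannot correct for changes introduced into the sample itself during dialysis or biotinylation, which is why a spike-in control in the sample would be needed.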

While the Quantibody 660 performs better because of its sandwich approach, it also lacks an internal control, although this is less necessary than for the L-series 1000 because the sample undergoes less handling (only dilution) before it is placed onto the array. Although the Quantibody 660 does not have the limitations of the L-series 1000, we did see differences between the Quantibody 660, the custom array, and the single ELISAs. This can be explained by the use of different assay formats. We expected that these changes would mostly affect the absolute quantitation of the proteins while keeping the general trends (up or down in endometriosis) constant, but this was not the case in our experiments. Changing the matrix of the run can alter the limits of detection (LODs) of various targets, and while the antibodies will not cross-react, the interactions can be altered. For example, in the Quantibody there is an extended matrix in the sample (other antibodies), while in ELISA the only detection agent is the detection antibody, and additionally there is a much larger printed surface for binding. As the use of different antibody pairs by different manufacturers may induce variability, we opted to use ELISA kits from the same company as the initial arrays.

Robustness of immunological assays depends highly on the choice of antibody, and consistent performance of antibodies in immunoassays remains an unmet need in too many cases [42]. It is well known that assays from different manufacturers for the same analyte can result in differences in reported absolute concentrations [43]. Moreover, between assays from the same manufacturer, variation in assay results can be attributed to lot-to-lot ELISA variability or to sample or kit storage time [44]. There is a lack of standardization and harmonization of clinically approved diagnostic procedures (in vitro diagnostics, IVD), and this lack is even more common in the less strictly regulated immunoassays for research use only (RUO) [43].

In conclusion, we attempted to discover new endometriosis biomarkers using a pooled-sample approach with large antibody arrays that detect a variety of proteins. Due to heterogeneity resulting from sample pooling and to a lack of reproducibility, we could only verify IL-31 as a possible marker that could be useful in a biomarker panel for endometriosis. In our experience, the antibody arrays that we used may not yet be sufficiently mature for the discovery of clinically useful biomarkers. Further research should determine whether other antibody arrays are more reliable for robust marker discovery and subsequent validation.