Introduction

Non-Hodgkin lymphoma is a malignant tumor that, in developed countries, ranks seventh in incidence and ninth in mortality among males and sixth in incidence and seventh in mortality among females [1]. The incidence and mortality rates of non-Hodgkin lymphoma are also increasing in developing countries. Diffuse large B-cell lymphoma (DLBCL) is the most common form of lymphoma, accounting for 30–40 % of newly diagnosed non-Hodgkin lymphomas [2]; however, there are no satisfactory biomarkers available to screen for DLBCL. Detecting cancers at their earliest stages of development is critical for improving cure rates. Fewer than one half of all patients diagnosed with DLBCL are cured with conventional chemotherapy, and DLBCL is a heterogeneous disease with significant variations in clinical outcome [3]. Prognostic stratification is based on the clinical risk factors used to determine the International Prognostic Index (IPI), which incorporates patient age at diagnosis, tumor stage, serum lactate dehydrogenase level, performance status, and the number of extranodal sites [4]. However, the IPI is not completely reliable for identifying high-risk patients because of the intrinsic molecular heterogeneity of this disease [5]. Therefore, the identification of prognostic serum biomarkers that can be used to reliably stratify DLBCL patients is highly important for improving patient outcomes.

Recently, the focus of cancer research has expanded from the analysis of the genetic information encoded by the human genome to the analysis of protein expression. ProteinChip is a proteomics technique that is based on surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF-MS) [6]. The ProteinChip array surface binds proteins non-selectively, enabling high-throughput protein profiling and providing a sensitive system for detecting and analyzing multiple proteins. This technology has been applied successfully to the discovery of biomarkers for a number of cancers [7–9].

The purpose of this study was to use SELDI-TOF-MS and bioinformatics techniques to generate and analyze protein profiles in serum samples from patients with DLBCL and to identify prognostic markers for these patients.

Materials and methods

Patients and samples

This study was approved by the Ethics Committee of Zhejiang University (China), and written informed consent was obtained from all patients and donors prior to inclusion. In total, 178 individuals who presented to the Second Affiliated Hospital, Zhejiang University, College of Medicine (China) were included in our study: 38 untreated patients diagnosed with early stage DLBCL, 13 patients with inflammatory lymphadenopathy, 35 healthy donors, 53 post-treatment non-relapsed DLBCL patients, and 39 relapsed DLBCL patients. The diagnoses of malignant tumors and inflammatory lymphadenopathy were confirmed by biopsy pathology. The patients with early stage DLBCL were classified as stage I according to the Ann Arbor staging system for lymphomas. All the enrolled post-treatment patients had received standard R-CHOP chemotherapy. Each group was age- and sex-matched with a corresponding control group.

Sample preparation and SELDI-TOF-MS analysis

The Protein Biological System (PBS) II ProteinChip plus SELDI-TOF-MS and the CM10 ProteinChip were purchased from Ciphergen Biosystems (Fremont, CA, USA). Sinapinic acid (SPA) was purchased from Fluka (Buchs, Switzerland). All other reagents were purchased from Sigma (Sigma-Aldrich; St. Louis, MO, USA). Serum samples were collected from patients in the early morning before breakfast and were stored at room temperature for 1–2 h before centrifugation at 3000 rpm for 10 min. The samples were then stored at −80 °C until required.

Prior to analysis, the samples were thawed and centrifuged at 10,000 rpm for 2 min. The supernatants were removed, and 5 μl of each sample was added to 10 μl of 0.5 % U9 buffer (9 M urea, 0.2 % CHAPS, 0.1 % dithiothreitol) in a 96-well plate. Samples were then incubated for 30 min at 4 °C on a platform shaker set at 600 rpm.

The ProteinChip array cassette was placed in a 96-well bioprocessor, and 200 μl sodium acetate (NaAc; 50 mM, pH 4.0) was added to each well. The cassette was incubated for 5 min at 4 °C with shaking at 600 rpm. The liquid was removed and retained, and the procedure was repeated. A further 185 μl NaAc was added to each well, and the cassette was incubated for 2 min at 4 °C with shaking at 600 rpm. Samples retained from different patients were added to separate wells (100 μl/well), and the cassette was incubated for 1 h at 4 °C with shaking at 600 rpm. After removing excess liquid, the array was washed twice with 200 μl NaAc for 5 min with shaking at 600 rpm and then rinsed twice with high-performance liquid chromatography (HPLC)-grade water.

Prior to SELDI-TOF-MS analysis, 1 μl of a saturated solution of SPA energy-absorbing molecules was applied onto each chip array. The array surface was allowed to air-dry, and the application was repeated. The CM10 ProteinChip is a weak cation exchange array: the negatively charged carboxylate groups on its surface interact with the positive charges exposed on target proteins and therefore bind cationic proteins. The ProteinChip was placed on the PBS II mass spectrometer reader, automatic baseline correction was applied, and peaks between 2000 and 20,000 Da were detected automatically. Peaks with mass-to-charge ratios (m/z) <2000 were mainly due to ion noise from the matrix and were excluded. Ciphergen ProteinChip Software v. 3.2 was used to analyze the data.

Bioinformatics and statistical analysis

Data analysis was performed using the Zhejiang University Cancer Institute-ProteinChip Data Analysis System (ZUCI-PDAS) on a MATLAB Web Server v. 1.2.4 (MathWorks Inc., Natick, MA, USA). The undecimated discrete wavelet transform (UDWT) method was applied to denoise the signals using the Rice Wavelet Toolbox v. 2.4. Baseline correction was achieved by aligning the spectra with a monotone local minimum curve, and mass calibration was carried out by adjusting the intensity scale to three peaks that were present in all the spectra. Detection and quantification of the peaks were performed using an algorithm that evaluated the maximum height of every denoised, baseline-corrected, and calibrated mass spectrum. The minimum height of each peak was set at 10. The peaks were filtered to maintain a signal-to-noise ratio (SNR) >3, where the SNR was defined as the ratio of the height of the peak above the baseline to the wavelet-defined noise. To match peaks across the spectra, peaks with a relative mass difference ≤0.3 % were pooled and defined as a cluster. The value of any peak that was absent from a cluster was set to the maximum height of that cluster. Normalization was performed only within peak clusters.
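The filtering and peak-matching rules above can be illustrated with a minimal sketch. It is not the ZUCI-PDAS implementation, whose details are not given here. It assumes a greedy pass over sorted m/z values in which each peak is compared with the lowest-mass peak of the current cluster, and it treats the minimum height of 10 as inclusive. The function names and sample values are hypothetical.

```python
def filter_by_snr(peaks, min_snr=3.0, min_height=10.0):
    """Keep (mz, height, noise) peaks with height >= min_height and SNR > min_snr.

    SNR is taken as height above baseline divided by the noise estimate.
    """
    return [p for p in peaks if p[1] >= min_height and p[1] / p[2] > min_snr]


def cluster_peaks(mz_values, tol=0.003):
    """Group m/z values whose relative mass difference is <= tol (0.3 %).

    Assumption: greedy pass over sorted values, each peak compared with
    the first (lowest-mass) peak of the current cluster.
    """
    clusters = []
    for mz in sorted(mz_values):
        if clusters and (mz - clusters[-1][0]) / clusters[-1][0] <= tol:
            clusters[-1].append(mz)
        else:
            clusters.append([mz])
    return clusters


# Hypothetical peaks pooled from several spectra:
# (m/z, height above baseline, wavelet-defined noise)
peaks = [
    (6443.0, 42.0, 4.0),
    (6450.5, 38.0, 5.0),
    (5913.2, 12.0, 6.0),  # SNR = 2, removed by the filter
    (5920.0, 25.0, 2.5),
]
kept = filter_by_snr(peaks)
clusters = cluster_peaks([p[0] for p in kept])
```

With these sample values, the peaks at 6443.0 and 6450.5 differ by about 0.12 % and fall into one cluster, whereas the peak at 5920.0 forms its own cluster.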

The non-linear support vector machine (SVM) classifier was based on the shareware program OSU_SVM v. 3.00 Toolbox of Junshui Ma and Yi Zhao. A radial basis function (RBF) kernel, a gamma parameter of 0.6, and a cost of constraint violation of 19 were used to distinguish different groups of data. The leave-one-out cross-validation (LOOCV) approach was applied to estimate the accuracy of the classifier. This approach assigns one sample as the test set and uses the remaining samples as the training set. The process is repeated until each sample has served as the test set.
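The leave-one-out procedure can be sketched in a few lines. The example below is illustrative only and does not reproduce the OSU_SVM toolbox. To stay self-contained it replaces the SVM with a simple kernel-mean classifier that reuses the RBF kernel with gamma = 0.6; the cost parameter plays no role in this stand-in. All function names and data values are hypothetical.

```python
import math


def rbf_kernel(x, y, gamma=0.6):
    """Radial basis function kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kernel_mean_predict(train_x, train_y, x, gamma=0.6):
    """Stand-in classifier (NOT an SVM): choose the class whose training
    samples have the highest mean RBF similarity to x."""
    scores = {}
    for label in set(train_y):
        members = [xi for xi, yi in zip(train_x, train_y) if yi == label]
        scores[label] = sum(rbf_kernel(xi, x, gamma) for xi in members) / len(members)
    return max(scores, key=scores.get)


def loocv_accuracy(samples, labels, predict):
    """Hold out each sample once, train on the rest, and count correct calls."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Hypothetical normalized intensities for two peaks per sample
samples = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [1.9, 2.1], [2.0, 1.8], [2.2, 2.0]]
labels = [0, 0, 0, 1, 1, 1]
accuracy = loocv_accuracy(samples, labels, kernel_mean_predict)
```

Because every sample is held out exactly once, the number of training rounds equals the number of samples, which is affordable for cohorts of the size used here.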

The Wilcoxon rank-sum test was used to estimate the potential of each peak for distinguishing different groups of data. The 10 peaks with the smallest P values were selected for further analysis. Each of the 1023 non-empty combinations of the 10 peaks was analyzed by the SVM using the LOOCV method. The combinations that showed the highest accuracy in distinguishing different groups of data were selected as potential biomarkers. The SVM model with the highest Youden’s Index was selected for the detection of DLBCL.
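The figure of 1023 follows from counting every non-empty subset of 10 peaks (2^10 − 1). A minimal sketch of such an exhaustive search is given below. The peak labels and the scoring function are placeholders; in the study, the score of a subset would come from the LOOCV performance of an SVM trained on that subset.

```python
from itertools import combinations


def nonempty_subsets(items):
    """Yield every non-empty combination of the given items."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


def best_subset(items, score):
    """Return the subset with the highest score (the first one wins ties)."""
    return max(nonempty_subsets(items), key=score)


# Placeholder labels standing in for the 10 top-ranked peaks
peaks = [f"peak_{i}" for i in range(1, 11)]
n_subsets = sum(1 for _ in nonempty_subsets(peaks))  # 2**10 - 1


def toy_score(subset):
    """Hypothetical score: rewards two 'informative' peaks, penalizes size."""
    return len(set(subset) & {"peak_1", "peak_3"}) - 0.01 * len(subset)


chosen = best_subset(peaks, toy_score)
```

Exhaustive enumeration is practical only because the candidate list is capped at 10 peaks; the number of subsets doubles with each additional candidate.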

The reproducibility of the SELDI-TOF-MS spectra was verified using a quality control sample. The mass accuracy was <0.1 % according to the all-in-one peptide molecular mass standard (Ciphergen Biosystems).

Results

Differential proteomics pattern of DLBCL versus normal controls (model 1)

SVM analysis using the LOOCV method revealed nine protein peaks (m/z: 6443, 5913, 6198, 4098, 7775, 9293, 5946, 5977, and 4628) that could be used to distinguish DLBCL from healthy individuals with an accuracy of 95.89 % (70/73) (Table 1, Figs. 1 and 2). The diagnostic pattern constructed using SVM analysis, including the nine proteins of model 1, showed the maximum Youden’s Index (YI). Among these peaks, those at 6443, 6198, 4098, 9293, 5977, and 4628 m/z were found at high intensity in normal individuals, but at low intensity in DLBCL patients. In contrast, the peaks at 5913, 7775, and 5946 m/z were detected at significantly higher levels in DLBCL patients compared with those in normal individuals (Table 2). Representative peaks are shown in Fig. 1. The sensitivity and specificity of detection were 94.29 % (33/35) and 97.37 % (37/38), respectively (Table 1).
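As a quick arithmetic check, the reported percentages can be recomputed from the fractions given above. The sketch below assumes the usual definitions, with Youden's Index taken as sensitivity plus specificity minus 1; the function name is hypothetical, and the counts are those implied by 33/35 and 37/38.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, accuracy, and Youden's Index."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    youden = sensitivity + specificity - 1
    return sensitivity, specificity, accuracy, youden


# Counts implied by the reported fractions for model 1: 33/35 and 37/38
sens, spec, acc, yi = diagnostic_metrics(tp=33, fn=2, tn=37, fp=1)
print(f"{sens:.2%} {spec:.2%} {acc:.2%} {yi:.4f}")
# prints: 94.29% 97.37% 95.89% 0.9165
```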

Table 1 The cross-validation blind test results of the test set in model 1 (cases)
Fig. 1 Protein mass peak 6443 was expressed at a significantly higher level in normal individuals (group 0) than in DLBCL patients (group 1) (p = 8.1241 × 10⁻¹¹)

Fig. 2 Protein mass peak 5913 was expressed at a significantly higher level in DLBCL patients (group 1) than in normal individuals (group 0) (p = 1.1708 × 10⁻¹⁰)

Table 2 Comparison of the nine selected differential protein mass peaks between normal individuals and DLBCL patients in model 1; mean and standard deviation (SD)

Differential proteomics pattern of DLBCL versus inflammatory lymphadenopathy patients (model 2)

Through LOOCV, support vector machine (SVM) analysis was used to screen out three protein peaks (m/z: 3942, 6639, and 4121) to build model 2, which could be used to define the peptide patterns for distinguishing DLBCL from patients with inflammatory lymphadenopathy. Combinations with the highest accuracy were chosen as potential biomarkers, and the SVM model with the highest Youden’s Index (YI) was used as the diagnostic model (Table 3). The accuracy of the blind prediction was 94.12 % (48/51). The sensitivity and specificity of the detection were 84.62 % (11/13) and 97.37 % (37/38), respectively (Table 4).

Table 3 Comparison of the three selected protein mass peaks between inflammatory lymphadenopathy patients and DLBCL patients in model 2; mean and standard deviation (SD)
Table 4 The cross-validation blind test results of the test set in model 2 (cases)

Differential proteomics pattern of inflammatory lymphadenopathy patients versus normal individuals (model 3)

Model 3, formed by six protein peaks, could be used to distinguish patients with inflammatory lymphadenopathy from healthy individuals with an accuracy of 97.92 % (47/48) (Table 5), as evaluated by LOOCV. The descriptive statistics of these six peaks are shown in Table 6.

Table 5 The cross-validation blind test results of the test set in model 3 (cases)
Table 6 Comparison of the six selected differential protein mass peaks between normal individuals and inflammatory lymphadenopathy patients in model 3; mean and standard deviation (SD)

Differential proteomics pattern of non-relapse DLBCL patients versus relapse DLBCL patients (model 4)

Model 4, formed by three protein peaks, could be used to distinguish non-relapsed DLBCL patients from relapsed DLBCL patients with an accuracy of 84.78 % (78/92) (Table 7), as evaluated by LOOCV. The descriptive statistics of these three peaks are shown in Table 8.

Table 7 The cross-validation blind test results of the test set in model 4 (cases)
Table 8 Comparison of the three selected protein mass peaks between non-relapsed and relapsed DLBCL patients in model 4; mean and standard deviation (SD)

Discussion

ProteinChip technology has enabled the application of high-throughput proteomics to the discovery of potential biomarkers of a variety of diseases. The advantages of this technique include the simultaneous analysis of small samples on a single chip, a relatively short experimental duration, and the ability to separate and identify peptides expressed on the chip surface based on differences between their intensities, without a requirement for purification [10].

Support vector machines (SVMs) are a machine learning approach originally proposed and developed by Vladimir Vapnik in 1995. The approach is based on the structural risk minimization (SRM) principle and can be used to solve pattern recognition problems involving small sample sizes, nonlinearity, or high-dimensional data. The central principle is generalization: the classifier is required to perform well not only on the training samples but also on previously unseen test samples. Although this principle had been recognized previously, the development of the SVM enabled its effective application. Because biological systems are intrinsically complex and high-throughput data sets typically have many more dimensions than samples, SVMs are well suited to the analysis of high-throughput biological data.

Reducing the number of false protein peaks created through random variation remains one of the challenges in analyzing SELDI-TOF-MS-generated data [11]. To overcome this problem, we developed a bioinformatics tool based on an algorithm that was able to identify most of the true, reproducible peaks. This tool was designed using the ZUCI-PDAS with UDWT denoising and included baseline correction, peak detection, biomarker selection, and the establishment and evaluation of differential SVM patterns based on statistical theory. The SVM can be used to address problems such as generalization in pattern recognition, pattern selection, and over-fitting with medium and small sample sizes; it is therefore applicable to finite samples, rather than to sample sizes approaching infinity, and gives an optimal solution with the available information. In theory, because the SVM obtains a global optimum, it avoids the problem of local extrema that affects artificial neural network (ANN) methods. The algorithm maps the input data into a high-dimensional feature space through a non-linear transformation, and the non-linear discriminant function is created by constructing a linear discriminant function (a separating hyperplane) in that feature space. This property enables excellent generalization and effectively addresses the dimensionality problem. The LOOCV method is used to estimate the accuracy of the classifier. In combination, these steps help to ensure that the selection of biomarkers is not influenced by systematic biases. This bioinformatics tool has been used previously by our group to produce a protein fingerprint in the diagnosis of papillary thyroid carcinoma [12] and to identify candidate molecular markers in oral squamous cell carcinoma [13], gastric cancer [14], non-small cell lung cancer [15, 16], and colorectal cancer [17].
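For reference, the standard textbook form of an SVM decision function with a radial basis function kernel is shown below. The symbols are not defined elsewhere in this paper and are introduced here only for illustration: the x_i are training samples with labels y_i of +1 or −1, the α_i and b are learned coefficients, γ is the kernel parameter (0.6 in this study), and C is the cost of constraint violation (19 in this study).

```latex
f(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i=1}^{n} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b \right),
\qquad
K(\mathbf{x}_i, \mathbf{x}) = \exp\!\left( -\gamma \, \lVert \mathbf{x}_i - \mathbf{x} \rVert^{2} \right),
\qquad
0 \le \alpha_i \le C
```

The sum inside the sign function is linear in the kernel values, which is the sense in which a linear discriminant in feature space yields a non-linear discriminant in the original peak-intensity space.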

In this study, we used SELDI-TOF-MS and bioinformatics techniques to analyze the sera of patients with DLBCL and built four models that can be used to discriminate between different groups of patients. Model 1 could be used to distinguish DLBCL from healthy individuals with an accuracy of 95.89 % (70/73). Model 2 could be used to distinguish DLBCL from patients with inflammatory lymphadenopathy with an accuracy of 94.12 % (48/51). Model 3 could be used to distinguish patients with inflammatory lymphadenopathy from healthy individuals with an accuracy of 97.92 % (47/48). Model 4 could be used to distinguish non-relapsed DLBCL patients from relapsed DLBCL patients with an accuracy of 84.78 % (78/92). The diagnostic pattern constructed using the SVM showed the maximum YI, and the four patterns were validated by LOOCV.

Different biomarkers are often detected by using different bioinformatics methods, such as conventional SVMs, discriminant analysis, and ANNs. The LOOCV SVM method has relatively high sensitivity and specificity; therefore, we applied this method of analysis to data obtained using the SELDI-TOF-MS and bioinformatics approaches to successfully build models of serum protein fingerprints in DLBCL. Further research is required to explore the peptides identified in our study in greater detail and to confirm our findings in a larger cohort of study samples.