Keywords

1 Introduction

With the development of high-throughput sequencing technology, a vast amount of biological data of different categories have been generated by The Cancer Genome Atlas (TCGA). They provide us more opportunities to learn the biological mechanism of complex diseases [1].

Detecting features from biological data is an effective way to illuminate the underlying mechanism of diseases. A variety of feature extraction methods have been widely used to analyze the gene expression data. For instance, least absolute shrinkage and selection operator (LASSO), penalized matrix decomposition (PMD) and sparse principal component analysis (SPCA) are commonly used methods of feature extraction. Roth V. used the generalized LASSO method to feature selection problems for microarray data [2]. Liu carried differential expression analysis on RNA-seq count data based on PMD [3]. Lass et al. applied SPCA to clustering and feature selection problems [4]. Although LASSO, PMD and SPCA have achieved satisfactory performance on explaining the gene expression, they still have some defects in multi-omics feature extraction. These conventional feature extraction methods which can only identify genomic feature from single type of genomic feature cannot handle the integrated TCGA datasets.

Recently, many particle swarm optimization (PSO) based methods have been proposed for determining SNP-SNP interactions [5], gene features selection [6], and cancer classifications [7]. PSO is a population-based search algorithm of adaptive evolution, which proposed by Kennedy and Eberhart in 1995 [8]. Owing to its simple structure and fast convergence, PSO has become an important evolutionary algorithm. In recent years, numerous studies have been carried out to improve the performance of PSO. Kennedy and Mendes have conducted a deep research on population structure and particle behavior, founding that topology has a profound impact on particle behavior [9]. Liu et al. proposed SFPSO (Scale-Free PSO) [10]. Gao proposed SIPSO (Selectively-informed Particle Swarm Optimization), which employed scale-free network to simulate the population structure and greatly improved the optimization process [11]. The DMSPSO proposed by Zhao, used random dynamic changed population structure which greatly improved the ability of local search [12].

However, conventional improvement on PSO algorithm suffers from the limited particle population structure. For example, SFPSO and SIPSO generate the population structure before experiments which cannot embody the dynamic changes in the process of iteration. DMSPSO achieves the dynamic changes in population structure to a certain extent, but the population structure building becomes a completely random process which is unable to fit in with the actual optimization problems.

In this paper, we propose an improved PSO-based algorithm with dynamic scale-free network, named DSFPSO, to detect multi-omics features. The innovations of DSFPSO are the introduction of scale-free network and velocity updating strategies. We employ scale-free network as its population structure which can be dynamically adjusted in the process of iteration. Three types of velocity updating strategies are used in DSFPSO for fully considering the heterogeneity of particles and the connecting between neighbors. Specifically, to utilize the difference of gene expression based on different levels of multi-omics data, we employ the ranking function to extract the most effective gene features. To evaluate the validity of DSFPSO, experiments applied on CRC are handled by DSFPSO and other compared methods. The identified genes are appraised by gene function analysis and pathway analysis. Results show that the novel method can identify CRC-associated features effectively.

2 Methods

2.1 Standard PSO Algorithm

PSO is similar to other evolutionary algorithms which use the concepts of “groups” and “evolution” [13]. The speed of each particle can be dynamically adjusted according to the particle itself and its peers’ experience based on the fitness value. Based on the fitness of the position, each particle will move to a better place and obtain the optimal solution of optimization problems.

Standard PSO algorithm can be illustrated as follows.

  1. Step1:

    Initialize the particle velocity and position;

  2. Step2:

    Evaluate the fitness of each particle;

  3. Step3:

    Decide whether to update personal and group best positions by comparing the fitness;

  4. Step4:

    Update the position and speed of the particles;

  5. Step5:

    If not meet the ending condition, then return to Step2.

2.2 DSFPSO on Multi-omics Data

The flowchart of the proposed method is shown in Fig. 1. We will describe DSFPSO in details on six aspects.

Fig. 1.
figure 1

The flowchart of DSFPSO.

2.2.1 Initializing Particles with Multi-omics Data

According to the characteristics of the omics data, we integrate the data as genomics and clinical information matrices. The whole genome matrix is the search space of particles while the clinical information matrix is used for the test of particle fitness.

Based on the above mapping of multi-omics data, the position of particle \( i \) at iteration \( t \) can be illustrated as

$$ \begin{array}{*{20}c} {Position_{t} (i) = (x_{i1}^{t} , \cdots ,x_{ik}^{t} , \cdots ,x_{iK}^{t} )} \\ {i \in \{ 1,2, \cdots ,I\} } \\ {k \in \{ 1,2, \cdots ,K\} } \\ {t \in \{ 1,2, \cdots ,T\} } \\ {x_{ik}^{t} \in \{ 1,2, \cdots ,M\} } \\ \end{array} $$
(1)

where \( I,K,T,M \) represents the number of particles, combination dimension of genomic features, iteration, and gene features in the genome datasets, respectively. \( x_{ik}^{t} \) is the selected genomic feature of particle \( i \) at iteration \( t \) in \( k \) dimensional space.

The speed of particle \( i \) at iteration \( t \) can be defined as

$$ \begin{array}{*{20}c} {Velocity_{t} (i) = (v_{i1}^{t} , \cdots ,v_{ik}^{t} , \cdots ,v_{iK}^{t} )} \\ {v_{ik}^{t} \in [1 - M,M - 1]} \\ \end{array} $$
(2)

where \( v_{ik}^{t} \) is the speed of \( x_{ik}^{t} \).

Similarly, before the first iteration, \( Position_{t} (i) \), \( Velocity_{t} (i) \), \( Pbest_{t} (i) \), \( Neibest_{t} (i) \), \( Gbest_{t} (i) \) are assigned a random value in their domain respectively.

2.2.2 Analysis of the Fitness Function

Since mutual information does not need to assume the distribution of genomics data and can effectively measure the nonlinear relationship between genetic characteristics [14], we employ it as fitness function, which can be formulated as

$$ MI(X;Y) = H(X) + H(Y) - H(X,Y) $$
(3)

Therefore, higher mutual information value denotes strong association between the genetic characteristic combination and the clinical information.

2.2.3 Updating the Dynamic Scale-Free Network

In order to fully utilize the properties of particles and experimental data, we have adopted a new strategy of link growth and selecting.

In one iteration, we make the out of network particles in fitness descending order and select new particles with higher fitness from these particles to join the network. Then these new particles will choose excellent neighbors from the network particles with the same sort processing.

In the dynamic process of scale-free network building, the particles position and population structure will be dynamically updated with the join of new particles in the solution space. Furthermore, we select the excellent new particles according to fitness value instead of the basic scale-free network adding new points without selection, which greatly improve the reliability of particles information exchange.

2.2.4 Updating the Particle Speed

In DSFPSO, the scale-free network building is synchronized with the solving iteration. Accordingly, particles have the difference of “in” and “out” of the network in the process of scale-free network building, so the two kinds of particles should be treated differently using different velocity updating strategies. The velocity updating equations can be formulated as

$$ \begin{aligned} & v_{ik}^{t + 1} = \left\{ {\begin{array}{*{20}c} {\eta \cdot (v_{ik}^{t} + \frac{1}{{k_{i} }}\sum\limits_{j \in N(i)} {rand(0,\phi ) \cdot (pbx_{jk}^{t} - x_{pk}^{t} )),} } & {^{{{\prime \prime }}} in^{{{\prime \prime }}} } \\ {w_{ik}^{t} \cdot v_{ik}^{t} + rand(0,c1) \cdot (pbx_{ik}^{t} - x_{ik}^{t} ) + rand(0,c2) \cdot (gbx_{ik}^{t} - x_{ik}^{t} ),} & {^{{{\prime \prime }}} {\text{out}}^{{{\prime \prime }}} } \\ \end{array} } \right. \\ & v_{ik}^{t + 1} = \left\{ {\begin{array}{*{20}c} {v_{ik}^{t + 1} } & {v_{ik}^{t + 1} \in [1 - M,M - 1]} \\ {rand(1 - M,M - 1)} & {v_{ik}^{t + 1} \notin [1 - M,M - 1]} \\ \end{array} } \right. \\ & \eta = \frac{2}{{\left| {2 - \phi - \sqrt {\phi^{2} - 4\phi } } \right|}} \\ & \phi = c_{1} + c_{2} > 4 \\ & w_{ik}^{t} = b - iter \cdot (b - a)/n \\ \end{aligned} $$
(4)

where \( \eta \) is learning rate, \( c_{1} \) and \( c_{2} \) are acceleration coefficients. \( w_{ik}^{t} \) is dynamic inertia weight balancing the capability between global and local search, \( rand(a,b) \) is random number between \( a \) and \( b \), \( N(i) \) denotes the neighbors of the particle \( i \), \( K_{i} \) is the number of neighbors for particle \( i \).

Based on the speed updating of particles, the position updating equation can be formulated as

$$ \begin{aligned} x_{ik}^{t + 1} & = x_{ik}^{t} + v_{ik}^{t + 1} \\ x_{ik}^{t + 1} & = \left\{ {\begin{array}{*{20}c} {x_{ik}^{t + 1} } & {x_{ik}^{t + 1} \in [1,M]} \\ {\text{int} (rand(1,M))} & {x_{ik}^{t + 1} \notin [1,M]} \\ \end{array} } \right. \\ \end{aligned} $$
(5)

2.2.5 Updating Personal Best Position, Neighbor Best Position and Group Best Position

In DSFPSO, particle’s personal best position will be updated by the position with the maximum mutual information. The specific equations can be formulated as

$$ \begin{aligned} & Pbest_{t + 1} = \left\{ {\begin{array}{*{20}c} {Position_{t} (i)} & {MI(Position_{t} (i);Y) = Val} \\ {Pbest_{t} (i)} & {MI(Pbest_{t} (i);Y) = Val} \\ \end{array} } \right. \\ & Val = \hbox{max} (MI(Position_{t} (i);Y),\,MI(Pbest_{t} (i);Y)) \\ \end{aligned} $$
(6)

Similarly, the group best position updating equations can be written as

$$ \begin{aligned} & Gbest_{t + 1} = \left\{ {\begin{array}{*{20}c} {Pbest_{t + 1} (i)} & {MI(Pbest_{t + 1} (i);Y) = Val} \\ {Gbest_{t} (i)} & {MI(Gbest_{t} (i);Y) = Val} \\ \end{array} } \right. \\ & Val = \hbox{max} (Pbest_{t + 1} (i);Y),\,MI(Gbest_{t} (i);Y)) \\ \end{aligned} $$
(7)

And the neighbor best position updating equations can be written as

$$ \begin{aligned} & Neibest_{t + 1} = \left\{ {\begin{array}{*{20}c} {Position_{t} (j)} & {MI(Position_{t} (j);Y) = Val} \\ {Neibest_{t} (i)} & {MI(Neibest_{t} ;Y) = Val} \\ \end{array} } \right. \\ & Val = \hbox{max} (MI(Position_{t} (j);Y),\,MI(Neibest_{t} (i);Y)) \\ & j \in N(i) \\ \end{aligned} $$
(8)

2.2.6 Finding Final Results

In genomics data, each gene may have several genetic characteristics due to the differences of gene expression. In the results of DSFPSO, a gene may have a variety of genomic characteristics or may not. In this paper, we resort scoring strategies to extract gene features based on the score of gene expression [15]. The scoring function can be described as

$$ \begin{aligned} & Score1(i) = rank(i) \cdot (n - i + 1) \\ & Score2(j) = \sum\limits_{i \in G} {Score1(i)} \\ \end{aligned} $$
(9)

where \( rank(i) \) represents the rank value of genomic features \( i \), \( n \) is the total order value of all the gene characteristics,\( G \) is the expression set of each gene.

3 Results

3.1 TCGA CRC Data

TCGA CRC data can be obtained from its web portal (https://tcga-data.nci.nih.gov/docs/publications/tcga/). Data used in this paper is the integrated data which has been preprocessed by Lee [16] (http://genomeportal.stanford.edu/tcga-crc/pages/datainformation). Considering the experiment needs, we carry discretization on somatic mutations and methylation data,which greatly improved the stability of the experiment.

The CRC data of TCGA used in this paper from 197 samples contains 5,188 genomic features of 1325 genes, including copy number variation, somatic mutations, methylation data and gene expression data (Fig. 2).

Fig. 2.
figure 2

The CRC data of TCGA.

3.2 Gene Enrichment Analysis

ToppGene is a one-stop portal for gene list enrichment analysis and candidate gene prioritization based on functional annotations and protein interactions network.

To show the effectiveness of DSFPSO, we carry out GO enrichment analysis using ToppGene (https://toppgene.cchmc.org/enrichment.jsp) and compare the results on the same data set, including PSO,SIPSO, LASSO, PMD and SPCA. We input the top 500 genes identified by these methods into the ToppGene Suite, respectively, whose threshold value of the p-value is set to 0.001 and other parameters are set as default. Table 1 lists the top 10 closely related GO terms found by ToppGene. From this table, we can see that the term of “positive regulation of gene expression” has the lowest P-Value (9.38E-19), so it is considered as the most probable enrichment item. Furthermore, we notice that in the term of “regulation of multicellular organismal development” PSO outperforms DSFPSO and in the term of “regulation of transcription by RNA polymerase II” PMD outperforms DSFPSO. In general, DSFPSO shows better performance than SIPSO, PSO, LASSO, PMD and SPCA in majority results.

Table 1. The closely related GO terms found by toppgene.

3.3 KEGG Pathway Analysis

KEGG (Kyoto Encyclopedia of Genes and Genomes) is a database which systematacially analyzes the function of gene to reveal the genetic and chemical blueprint of life [17].

In this study, we use DAVID (https://David-d.ncifcrf.gov/) on KEGG pathway to analyze the results. The top 10 CRC-associated pathways are shown in Table 2. Among them, Pathways in cancer and Colorectal cancer are obviously correlated with cancers. [18] indicates that PI3 K-Akt signaling pathway play an important role in inflammation-induced colorectal carcinogenesis. PI3 K-Akt signaling pathway links intimately with cellular metabolism and has great influence on cancer biological behavior [19]. The FoxO signaling pathway plays a central role in diverse physiological processes including cellular energy storage, growth and survival, among others [20]. [21] suggests that FOXO3a is a relevant mediator of the cytotoxic effects of cisplatin in colon cancer cells. Adherens junction pathway plays a critical role in cellular adhesion, glandular differentiation, and cellular proliferation. The function of this pathway correlated proteins is compromised in a number of intestinal diseases, including ulcerative colitis that has an increased incidence for colorectal cancer [22].

Table 2. The top 10 CRC-associated pathways.

3.4 Analysis of Gene Function

In order to evaluate the algorithm’s performance and explore the correlation between genes and the pathogenesis of colorectal cancer, we carry out detailed analysis on 10 CRC-related genes among top identified 50 genes. The gene function descriptions are shown in Table 3.

Table 3. The function of genes identified by DSFPSO.

CSMD1 alterations can correlate with earlier clinical presentation in colorectal tumors, thus further implicating CSMD1 as a tumor suppressor gene [23]. Loss of CSMD1 may contribute to the poor prognosis of colorectal cancer patients [24]. [25] indicates that KBTBD11 influences colorectal cancer risk, especially in interaction with an MYC-regulated SNP rs6983267. WRN promoter methylation connects mucinous differentiation, microsatellite instability and CpG island methylator phenotype in colorectal cancer [26]. SUZ12 mRNA expression in the CRC tissues is significantly increased than in the non-cancerous tissue. Increased SUZ12 mRNA expression is directly correlated with primary tumor size, regional lymph node metastases, distant metastasis and AJCC stage. Furthermore, CRC patients with higher level of SUZ12 showed a worse disease-free survival (DFS) [27]. CDX2 is mutated in a colorectal cancer with normal APC/β-catenin signaling [28, 29] shows that CDX2 specifies intestinal development and homeostasis and is considered a tumor suppressor in colorectal carcinogenesis.

4 Conclusions

Considering traditional PSO algorithms usually take equal treatment of all particles and ignore the disadvantages related to the heterogeneity of population structure, we propose an improved PSO algorithm named as DSFPSO to identify gene features of complex diseases. This algorithm dynamically adjusts population structure according to the particles status in the process of iteration.

With fitness of particles as a standard for preferred link selection, DSFPSO realizes the true meaning of PSO for dynamic scale-free network. Moreover,this is the first time for PSO algorithm introduced into multi-omics data analysis with CRC data provided by TCGA as the experiment data and filtering results through scoring strategies. Experimental results show that DSFPSO can be convergent to global optimization quickly and find CRC-associated genes, which will provide valid references for early diagnosis, effective treatment and prognostic guidance of colorectal cancer. To explore correlations among differentially expressed genes is left as our future work.