Keywords

1 Introduction

SNP-SNP interactions have been receiving increasing attention in understanding the mechanism underlying susceptibility to complex diseases, such as heart disease, Alzheimer’s disease, cancer, type 2 diabetes and many others [1]. Detection of SNP-SNP interactions is therefore becomes an urgent, front-burner problem in the field of Bioinformatics. Though lots of algorithms have been proposed for inferring SNP-SNP interactions, the algorithmic development is still ongoing due to their computational and mathematical complexities, including the fitness function that decides how well an SNP combination contributes to the phenotype, the limitation of prior knowledge in interpreting complex genetic architecture of a disease, the intensive computational burden imposed by the enormous search space resulted from the problems of “high dimensional and small sample size” and “combinatorial explosion”, which obviously hinder further applications of current algorithms in genome-wide association studies.

Recently, several particle swarm optimization (PSO) based methods have been proposed for determining SNP-SNP interactions [213]. Yang et al. [2] used a binary PSO to evaluate the risk of breast cancer, with odds ratio as its fitness function, which is known as OR-BPSO for short. Based on the OR-BPSO, Chang et al. [3] presented a odds ratio discrete binary PSO (OR-DBPSO) for inferring SNP-SNP interactions with the quantitative phenotype. Chuang et al. [4] also combined PSO and odds ratio to explore combinations of SNP-SNP interactions in breast cancer. They then developed a chaotic PSO (CPSO) for breast cancer association studies and indeed identified a SNP combination [5]. In order to enhance the reliability of the PSO in determining SNP-SNP interactions associated with breast cancer, they improved the PSO (IPSO) and proved that the IPSO is highly reliable than the OR-BPSO [6]. Recently, they presented a gauss chaotic map PSO (Gauss-PSO) and used it to detect the best association with breast cancer [7]. They confirmed that the Gauss-PSO was capable of identifying higher difference values between cases and controls than both the PSO and the CPSO. By analyzing the performance of the PSO and the CPSO, Yang et al. [8] presented a double-bottom chaotic map PSO (DBM-PSO) to overcome their respective limitations. Later, the DBM-PSO is used to infer SNP-SNP interactions based on chi-square test [9]. Hwang et al. [10] developed a complementary-logic PSO (CLPSO) to increase the efficiency of determining significant SNP-SNP interactions in case-control study. Wu et al. [11] proposed a PSO based algorithm and found SNP interactions of rennin-angiotensin system genes against hypertension. Nevertheless, above mentioned methods mainly focus on identifying a best genotype-genotype of a SNP-SNP interaction among possible genotype combinations of SNP combinations, but not several SNP-SNP interactions among possible SNP combinations, which may lose important clues for the exploration of causative factors of complex diseases. In addition, limited sample size of SNP data obviously affects computational accuracies of these methods. More recently, Ma et al. [12] proposed PSOMiner for the detection of SNP-SNP interactions, which is a generalized PSO algorithm based on the fitness function of mutual information. Shang et al. [13] presented an improved method IOBLPSO for detecting SNP-SNP interactions by using opposition-based learning PSO. IOBLPSO has ensured the ability of global searching and prevented premature convergence. However, above methods treated all particles equally and adopt a single learning strategy, overlooking the heterogeneity of particles. Gao et al. [14] employed scale-free networks to represent the population structure of swarms and proposed a selectively informed PSO (SIPSO), where particles select different learning strategies based on their degrees: a densely connected hub particle gets full information from all of its neighbors (fully-informed) while a non-hub particle with low degree can only follow a single yet best performed neighbor (single-informed). Experiment results show that SIPSO is able to significantly improve the optimization performance.

In this paper, we combine SIPSO and mutual information to determine SNP-SNP interactions. The highlights of SIPSO are the introductions of scale-free networks as its population structure, two different learning strategies as its interaction modes, and mutual information as its fitness function. Experiments are performed on several real and simulation data sets. Results demonstrate that SIPSO is promising in inferring SNP-SNP interactions, and might be an alternative to existing methods.

2 Methods

2.1 Scale-Free Network

A scale-free network is a network whose degree distribution follows a power law, at least asymptotically. So far, many networks have been reported to be scale-free, for instance, world wide web, internet, neural networks, and so on. Liu et al. studied the influence of scale-free population structure on the performance of PSO, showing that the scale-free PSO outperforms the traditional PSO, due to only a few particles being densely connected hubs and most particles being low degree non-hubs, resulting in high heterogeneity of degrees of particles [14, 15].

In SIPSO, particles are considered as nodes of a scale-free network. This network is generated by Barabasi-Albert model [16], which incorporates two ingredients including growth and preferential attachment. Specifically, at first \( m_{0} \) fully connected particles are constructed as a complete network; at each time step a new particle is added to the network, which is connected to \( m \) (\( m < m_{0} \)) nodes with a specified probability. This probability \( p_{i} \) of the new particle being connected to an existing particle \( i \) depends upon degree of \( i \), i.e., \( k_{i} \), which can be written as

$$ p_{i} = \frac{{k_{i} }}{{\sum\limits_{j} {k_{j} } }}, $$
(1)

where \( j \) runs over all the existing particles. Here \( m_{0} \) is set to 4 and \( m \) is set to 2.

2.2 Mutual Information

The SIPSO introduces mutual information to decide which SNP combinations are SNP-SNP interactions, and to measure how much the effects of captured SNP-SNP interactions to the phenotype are, since mutual information has widely been used as a promising measure for measuring dependence of two variables without a complex modeling. The formula of mutual information can be written as

$$ MI\left( {S;Y} \right) = H\left( S \right) + H\left( Y \right) - H\left( {S,Y} \right), $$
(2)

where \( H\left( S \right) \) is the entropy of \( S \), \( H\left( Y \right) \) is the entropy of \( Y \), and \( H\left( {S,Y} \right) \) is the joint entropy of both \( S \) and \( Y \), \( S \) is a position of a particle indicating a SNP combination, \( Y \) is the phenotype.

The entropy and the joint entropy are defined as,

$$ H\left( S \right) = - \sum\limits_{{j_{1} = 1}}^{3} { \cdots \sum\limits_{{j_{K} = 1}}^{3} {\left( {p\left( {s_{{j_{1} }} , \cdots ,s_{{j_{K} }} } \right)\; \cdot \;\log p\left( {s_{{j_{1} }} , \cdots ,s_{{j_{K} }} } \right)} \right)} } , $$
(3)
$$ H\left( Y \right) = - \sum\limits_{j = 0}^{1} {\left( {p\left( {y_{j} } \right)\; \cdot \;\log p\left( {y_{j} } \right)} \right)} , $$
(4)
$$ H\left( {S,Y} \right) = - \sum\limits_{{j_{1} = 1}}^{3} { \cdots \sum\limits_{{j_{K} = 1}}^{3} {\sum\limits_{j = 0}^{1} {\left( {p\left( {s_{{j_{1} }} , \cdots ,s_{{j_{K} }} ,y_{j} } \right)\; \cdot \;\log p\left( {s_{{j_{1} }} , \cdots ,s_{{j_{K} }} ,y_{j} } \right)} \right)} } } , $$
(5)

where \( K \) is the considered order of SNP-SNP interactions, i.e., the number of SNPs in a SNP combination. \( s \) is the genotype of a SNP coded as \( \left\{ {1,2,3} \right\} \), corresponding to homozygous common genotype, heterozygous genotype, and homozygous minor genotype. \( y \) is the label of a sample coded as \( \left\{ {0,1} \right\} \), corresponding to control and case. \( p\left( \cdot \right) \) is the probability distribution function.

Obviously, higher mutual information value indicates stronger association between the phenotype and the SNP combination.

2.3 Selectively Informed Particle Swarm Optimization (SIPSO)

The SIPSO, first proposed by Gao et al. [14] that takes into account both the scale-free population structure and selectively informed learning strategies, is a typical swarm intelligence algorithm that derives the inspiration from the self-organization and adaptation in flocking phenomena. Experimental comparisons showed that SIPSO outperforms several peer algorithms in terms of solution quality, convergence speed, and success rate. We combine SIPSO and mutual information to determine SNP-SNP interactions in this paper.

For SIPSO, let \( X_{i}^{t} = \left[ {x_{i1}^{t} , \cdots ,x_{iK}^{t} } \right] \) and \( V_{i}^{t} = \left[ {v_{i1}^{t} , \cdots ,v_{iK}^{t} } \right] \) be the position and the velocity of particle \( i \) at iteration \( t \), where \( t \in \left[ {1,T} \right] \), \( i \in \left[ {1,I} \right] \), \( T \) is the number of iterations, \( I \) is the number of particles, \( x_{iK}^{t} \) is the selected \( K_{th} \) SNP index of particle \( i \) at iteration \( t \), and \( x_{iK}^{t} \in \left[ {1,M} \right] \), \( M \) is the number of SNPs in the data set. \( v_{iK}^{t} \) is the velocity of \( x_{iK}^{t} \). At each iteration, each particle update its position and velocity according to the following equations:

$$ \tilde{V}_{i}^{t + 1} = \left\{ {\begin{array}{*{20}c} {\eta \cdot \left( {V_{i}^{t} + \frac{1}{{\left| {{\rm N}\left( i \right)} \right|}}\sum\limits_{{j \in {\rm N}\left( i \right)}} {\left( {U\left( {0,c_{1} + c_{2} } \right) \cdot \left( {P_{j}^{t} - X_{i}^{t} } \right)} \right)} } \right)} & {\left| {{\rm N}\left( i \right)} \right| > \lambda } \\ {\eta \cdot \left( {V_{i}^{t} + U\left( {0,c_{1} } \right) \cdot \left( {P_{i}^{t} - X_{i}^{t} } \right) + U\left( {0,c_{2} } \right) \cdot \left( {G_{i}^{t} - X_{i}^{t} } \right)} \right)} & {\left| {{\rm N}\left( i \right)} \right| \le \lambda } \\ \end{array} } \right., $$
(6)
$$ V_{i}^{t + 1} = \left\{ {\begin{array}{*{20}c} {\tilde{V}_{i}^{t + 1} } & {\tilde{V}_{i}^{t + 1} \in \left[ {1 - M,M - 1} \right]} \\ {U\left( {1 - M,M - 1} \right)} & {\tilde{V}_{i}^{t + 1} \notin \left[ {1 - M,M - 1} \right]} \\ \end{array} } \right., $$
(7)
$$ \tilde{X}_{i}^{t + 1} = X_{i}^{t} + V_{i}^{t} , $$
(8)
$$ X_{i}^{t + 1} = \left\{ {\begin{array}{*{20}c} {\text{int} \left( {\tilde{X}_{i}^{t + 1} } \right)} & {\tilde{X}_{i}^{t + 1} \in \left[ {1,M} \right]} \\ {\text{int} \left( {U\left( {1,M} \right)} \right)} & {\tilde{X}_{i}^{t + 1} \notin \left[ {1,M} \right]} \\ \end{array} } \right., $$
(9)
$$ \eta = \frac{2}{{\left| {2 - \left( {c_{1} + c_{2} } \right) - \sqrt {\left( {c_{1} + c_{2} } \right)^{2} - 4 \cdot \left( {c_{1} + c_{2} } \right)} } \right|}}. $$
(10)

Here, \( P_{i}^{t} \) is the best historical position of particle \( i \) until iteration \( t \), \( G_{i}^{t} \) is the best historical position of neighbors of particle \( i \) at iteration \( t \). \( c_{1} \) and \( c_{2} \) are the acceleration coefficients controlling how far a particle moves at an iteration. \( \eta \) is the learning rate. \( U\left( {a,b} \right) \) is a random number drawn at each iteration from the uniform distribution \( \left[ {a,b} \right] \). \( {\rm N}\left( i \right) \) is the neighbors of particle \( i \) and \( \left| {{\rm N}\left( i \right)} \right| \) is the degree of particle \( i \) (e.g. the number of neighbors). \( \text{int} \left( \cdot \right) \) is an integral function. \( \lambda \) is the threshold to determine a particle fully-informed or single-informed.

For constructing a positive feedback mechanism, \( P_{i}^{t} \) and \( G_{i}^{t} \) should be updated at every iteration. Specifically, whether \( P_{i}^{t + 1} \) is updated to \( X_{i}^{t} \) or \( P_{i}^{t} \) depends on their mutual information values, which can be defined as,

$$ P_{i}^{t + 1} = \left\{ {\begin{array}{*{20}c} {P_{i}^{t} } & {MI\left( {P_{i}^{t} ;Y} \right) = \hbox{max} \left( {MI\left( {P_{i}^{t} ;Y} \right),MI\left( {X_{i}^{t} ;Y} \right)} \right)} \\ {X_{i}^{t} } & {MI\left( {X_{i}^{t} ;Y} \right) = \hbox{max} \left( {MI\left( {P_{i}^{t} ;Y} \right),MI\left( {X_{i}^{t} ;Y} \right)} \right)} \\ \end{array} } \right.. $$
(11)

Similarly, by evaluating mutual information values of \( {\rm N}\left( i \right) \) and \( G_{i}^{t} \), \( G_{i}^{t + 1} \) is updated to one of them with the higher value, which can be written as,

$$ G_{i}^{t + 1} = \left\{ {\begin{array}{*{20}c} {G_{i}^{t} } & {MI\left( {G_{i}^{t} ;Y} \right) = \hbox{max} \left( {MI\left( {G_{i}^{t} ;Y} \right),MI\left( {N\left( i \right);Y} \right)} \right)} \\ {N\left( i \right)} & {MI\left( {N\left( i \right);Y} \right) = \hbox{max} \left( {MI\left( {G_{i}^{t} ;Y} \right),MI\left( {N\left( i \right);Y} \right)} \right)} \\ \end{array} } \right.. $$
(12)

While completing the iteration process, \( P_{i}^{T} \) with descending mutual information values are recorded as the detected SNP-SNP interactions of SIPSO.

3 Results and Discussion

3.1 Simulation and Real Data Sets

We exemplify 4 benchmark models of SNP-SNP interactions for the experiments [1721], detailed parameters of which are recorded in Table 1. Specifically, Model 1 is a model that display both marginal effects and interactive effect, the penetrance of which increases only when both SNPs have at least one minor allele [17, 18]; Model 2 is a model showing both marginal effects and interactive effect, the additional minor allele at each locus of which does not further increase the penetrance [17]. Model 3 is also a model displaying both marginal effects and interactive effect, which assumes that the minor allele in one SNP has the marginal effect, however, the marginal effect is inversed while minor alleles in both SNPs are present [17]; Model 4 is a model that shows only interactive effects, which is directly cited from the reference [20]. Model 4 is exemplified here since it provides a high degree of complexity to challenge ability of a method in detecting SNP-SNP interactions. For each model, 50 data sets are simulated by epiSIM [22], each containing 4000 samples with the ratio of cases and controls being 1. For each data set, random SNPs are set with their MAFs chosen from \( \left[ {0.05,0.5} \right] \) uniformly.

Table 1. Details of 4 models of SNP-SNP interactions. Prevalence is the proportion of samples that occur a disease. Penetrance is the probability of the occurrence of a disease given a particular genotype. MAF(\( \alpha \)) is the minor allele frequency of \( \alpha \). \( AA \), \( Aa \), and \( aa \) are homozygous common genotype, heterozygous genotype, and homozygous minor genotype.

A real data set of age-related macular degeneration (AMD) is used for testing the practical ability of SIPSO. AMD, refers to pathological changes in the central area of the retina, is the most important cause of irreversible visual loss in elderly populations, and is considered as a complex disease having multiple SNP-SNP interactions [18]. The AMD data set contains 103611 SNPs genotyped with 96 cases and 50 controls, which has been widely used as a benchmark data set in the field of testing methods of detecting SNP-SNP interactions [12, 13, 17, 18, 21, 2325].

3.2 Experiments on Simulation Data Sets

Three PSO based methods for inferring SNP-SNP interactions are used for this comparison study with the SIPSO. They are DBM-PSO, PSOMiner and IOBLPSO. For a fair comparison, parameter settings of these methods are the same. Specifically, the number of particles \( I \) and the number of iterations \( T \) are respective set to 200 and 20; acceleration factors \( c_{1} \) and \( c_{2} \) are chosen their appropriate settings 2.05 based on previous extensive analysis [26]; the inertia weights for compared methods are set to 0.65; the threshold to determine a particle fully-informed or single-informed \( \lambda \) is set to 4.

Detection power is used to evaluate the performance of SIPSO by applying SIPSO on four groups of simulation data sets, each of which consists of a SNP-SNP interaction model. Power of these compared methods on simulation data sets is reported in Fig. 1. It is seen that performance of the SIPSO is comparable, and sometimes outperforms to those of compared methods. Specifically, power of the SIPSO is much higher than that of DBM-PSO and PSOMiner; nevertheless, its power is behind that of IOBLPSO in three models except the Model 3, since the IOBLPSO introduces the strategy of opposition-leaning, which implies that introducing opposition-learning to SIPSO might also improve the performance of SIPSO, which is a direction; the DBM-PSO has the worst performance on all models because it only focus on identifying genotype combinations among SNP combinations; the PSOMiner is indeed a general PSO that used in determining SNP-SNP interactions, and hence has poor power.

Fig. 1.
figure 1

Power of compared methods on simulation data sets.

3.3 Application to Real Data Set

The SIPSO is applied on AMD data set four times with parameter settings \( \left( {I,T} \right) \) being \( \left( {10000,500} \right) \), \( \left( {10000,1000} \right) \), \( \left( {20000,500} \right) \) and \( \left( {20000,1000} \right) \). The top 10 captured SNP-SNP interactions that might be associated with the AMD are reported in Table 2 with descending mutual information values.

Table 2. Top 10 captured SNP-SNP interactions associated with AMD. CFH: complement factor H. MPP7: palmitoylated membrane protein 7. PEBP4: phosphatidylethanolamine binding protein 4. R3HDM1: R3H domain (binds single-stranded nucleic acids). N/A: no gene is available. Chr: Chromosome.

It is interesting that all reported SNP-SNP interactions contain either rs380390 or rs1329428. This is because that these two SNPs have strongest main effects among all tested SNPs, and have already proofed to be significantly associated with AMD [18]. Both of them are in the CFH gene, and there are biologically plausible mechanisms for the involvement of CFH in AMD. CFH is a regulator that actives the alternative pathway of the complement cascade, the mutations in which can lead to an imbalance in normal homeostasis of the complement system [24]. This phenomenon is thought to account for substantial tissue damage in AMD [23, 25, 27].

Other SNPs listed in the second column of the table might be identified due to their partners, i.e., rs380390 and rs1329428, since strong main effects of them leads to their combinations with other SNPs almost displaying strong interactive effect. This also indicates that SIPSO is sensitive to those SNPs displaying strong main effects. Further studies with the use of large scale case-control samples are needed to confirm whether these SNPs have true associations with AMD. We hope that several clues could be provided from this paper for the exploration of causative factors of AMD.

4 Conclusions

SNP-SNP interactions, also known as epistatic interactions or epistasis, is nonlinear interactive effects of SNPs, which is now believed to be one of the causative patterns of complex diseases, and is recognized fundamentally important for understanding the mechanism of disease causing genetic variation. Though many works have been done for inferring SNP-SNP interactions, the algorithmic development is still ongoing due to their mathematical and computational challenges. For instance, the limited fitness functions, the complex genetic architecture, and the intensive computational burden. By considering the advantages of the PSO including simplicity, effectiveness and low computational cost, we introduce the SIPSO with the mutual information to determine SNP-SNP interactions in this manuscript. The SIPSO improves the traditional PSO by modifying the population structure as a scale-free network and altering the interaction modes to different learning strategies. That is to say, learning strategy of each particle depends on its degree: a densely connected hub particle gets full information from all of its neighbors while a non-hub particle with low degree can only follow a single yet best performed neighbor. Experiments are performed on both simulation and real data sets, which show that SIPSO is promising in inferring SNP-SNP interactions, and might be an alternative to existing methods. The Matlab version of SIPSO software package is available online at http://www.bdmb-web.cn/index.php?m=content&c=index&a=show&catid=37&id=99.

There are several merits and demerits for SIPLO introducing to identify SNP-SNP interactions. The first advantage is that the SIPSO taking into account the individual’s heterogeneity, could balance the exploration and the exploitation in the optimization process. The second advantage is that mutual information is an effective and efficient fitness function in measuring SNP-SNP interactions. Nevertheless, it still has lots of limitations. Firstly, it is only a beneficial exploration of SIPSO applying to determine SNP-SNP interactions, its performance is simply comparable to compared methods, and sometime inferior to compared methods, which indicates that the SIPSO needs more improvement while applying to the field of Bioinformatics. Secondly, SIPSO is sensitive to those SNPs displaying strong main effects. These limitations will inspire us to continue working in the future.