
1 Introduction

With the development of high-throughput sequencing technologies, it has become widely acknowledged that the single nucleotide polymorphism (SNP) is one of the most common forms of genetic variation in the human genome. SNPs often affect complex diseases through their nonlinear interactions, known as epistatic interactions or epistasis [1]. Epistasis has attracted extensive attention in studies of the pathogenic mechanisms of non-Mendelian diseases such as hypertension, diabetes, Alzheimer’s disease and many others [2]. Although much effort has been devoted to detecting SNP-SNP interactions, the development of algorithms for this task is still ongoing because of its computational and statistical complexity, which stems from the complexity of pathogenesis, the complexity of genetic models and the influence of environmental factors.

Recently, a number of methods based on generic ant colony optimization (ACO) have been proposed [1, 3–8] to detect epistatic interactions. For instance, Christmas et al. [6] used a generic ACO algorithm to identify epistatic interactions in type 2 diabetes data; their results indicate that ACO is able to find statistically significant epistatic interactions. Wang et al. [4] proposed AntEpiSeeker, which is based on ACO and uses a two-stage optimization procedure for detecting SNP-SNP interactions. Though it enhances the power of ACO algorithms, it is sensitive to main-effect SNPs and requires a large number of ants over numerous iterations to obtain acceptable results. Shang et al. [1] proposed AntMiner, which incorporates heuristic information into ACO for epistasis detection. Although it improves computational efficiency and solution accuracy, it is still very time-consuming. Jing et al. [5] proposed MACOED, a multi-objective ACO method for detecting SNP epistasis, which may be criticized for the complexity of its multi-objective computation and the randomness of its search strategy.

In this study, we propose IACO, a method for detecting SNP-SNP interactions that is based on ant colony optimization and a newly introduced fitness function, Svalue. Svalue combines Bayesian networks and mutual information to evaluate, effectively and efficiently, how well a SNP-SNP combination is associated with the phenotype. In addition, a memory-based strategy is employed to improve the performance of IACO: it prevents optimal solutions that have already been identified from being lost and speeds up the convergence of the algorithm. Experiments with IACO are performed on simulation data sets and on a real data set of age-related macular degeneration (AMD). We also compare IACO with other representative methods, namely AntEpiSeeker [4], AntMiner [1], and MACOED [5]. The results demonstrate that IACO outperforms the others in inferring SNP-SNP interactions. Moreover, the application of IACO to the AMD data set may provide some new clues for the study of AMD. IACO might therefore be an alternative to existing methods for inferring epistatic interactions.

2 Methods

2.1 Mathematical Description of the Problem

In a genome-wide association study (GWAS) [9], SNPs are bi-allelic markers whose major and minor alleles are denoted by capital letters (e.g., A) and lowercase letters (e.g., a) respectively. There are three genotypes for each SNP: the homozygous common genotype (AA), the heterozygous genotype (Aa), and the homozygous minor genotype (aa) [1]. The most common way of representing SNP data is to collect them in a matrix in which each column represents a SNP and each row holds the genotypes of one sample. The matrix elements 1, 2 and 3 represent AA, Aa and aa respectively. In addition, the case or control status of each sample is collected in a label matrix whose rows correspond to the rows of the SNP matrix. The label of a sample is either 1 (case) or 2 (control) [10]. Mathematically, genome-wide SNP interaction research can be described as a search over combinations of multiple SNPs in high-dimensional data with a small sample size, with the aim of finding combinations that predict the phenotype of a sample well.
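As an illustration of this encoding, the following minimal sketch builds a toy SNP matrix and label vector with NumPy. The data values and the variable names (`genotypes`, `labels`, `combo`) are hypothetical and are not taken from the IACO software.

```python
import numpy as np

# Hypothetical toy data: 6 samples (rows) x 4 SNPs (columns).
# Genotype codes follow the text: 1 = AA, 2 = Aa, 3 = aa.
genotypes = np.array([
    [1, 2, 1, 3],
    [2, 2, 1, 1],
    [1, 3, 2, 1],
    [3, 1, 1, 2],
    [1, 1, 2, 2],
    [2, 1, 3, 1],
])

# One label per sample, aligned with the rows: 1 = case, 2 = control.
labels = np.array([1, 1, 1, 2, 2, 2])

# A candidate K-SNP combination is a tuple of column indices (here K = 2).
combo = (0, 2)
sub = genotypes[:, list(combo)]  # genotypes of the chosen SNPs, shape (6, 2)
```

Under this representation, evaluating a SNP combination only requires the selected columns and the label vector.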

2.2 Traditional Ant Colony Optimization

Swarm intelligence algorithms [1, 3–8, 10–12] are optimization algorithms for large-scale problems that handle discrete optimization by simulating the collective behavior of animals. As a swarm intelligence algorithm, ACO takes its inspiration from the foraging behavior of some ant species [1] and has already been used to solve the traveling salesman problem, graph coloring and other problems.

In the process of foraging, ants communicate with each other by depositing pheromones on the ground, and these pheromones gradually evaporate over time. Subsequent ants choose their path according to the pheromones and tend to prefer paths with higher pheromone levels [1]. As time passes, almost all ants come to choose the shortest path between the food source and their nest.

In this study, the m ants are denoted by \( \left\{ {m_{1} ,m_{2} \cdots m_{m} } \right\} \), each of which is used to detect a K-SNP interaction at each iteration, where K is a user-specified order of epistatic interactions.

The ants choose the next position according to a probability function (PF). The PF gives the probability that ant k moves from position i to position j at iteration t, and is denoted by \( P_{k}^{ij} \left( t \right) \). The PF of traditional ant colony optimization is defined as

$$ P_{k}^{ij} (t) = \left\{ \begin{aligned} & \frac{{\tau_{ij} (t)^{\alpha } \eta_{ij}^{\beta } }}{{\sum\nolimits_{{u \in U_{k} (t)}} {\tau_{iu} (t)^{\alpha } \eta_{iu}^{\beta } } }}\, & j \in U_{k} (t) \\ & 0\, & \text{otherwise} \\ \end{aligned} \right. $$
(1)

where \( \tau_{ij} \left( t \right) \) is the pheromone level on the path from position i to position j at iteration t, and \( \eta_{ij} \) is the heuristic information for moving from position i to position j. The parameters \( \alpha \) and \( \beta \) control the relative importance of pheromones and heuristic information respectively. \( U_{k} \left( t \right) \) is the set of positions that have not yet been selected by ant k at iteration t.
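The probability function in Eq. (1) can be sketched as follows. This is a minimal illustration rather than the authors' implementation; the function name `transition_probabilities` and its arguments are hypothetical. It takes row i of the pheromone and heuristic matrices and assigns zero probability to positions outside \( U_{k} \left( t \right) \).

```python
import numpy as np

def transition_probabilities(tau_row, eta_row, unvisited, alpha=1.0, beta=1.0):
    """Eq. (1): probability of moving from the current position i to each j.

    tau_row, eta_row: pheromone and heuristic values tau_ij, eta_ij for fixed i.
    unvisited: indices of positions not yet selected by the ant (U_k(t)).
    """
    tau_row = np.asarray(tau_row, dtype=float)
    eta_row = np.asarray(eta_row, dtype=float)
    weights = np.zeros_like(tau_row)
    idx = list(unvisited)
    # Only unvisited positions receive a non-zero weight.
    weights[idx] = tau_row[idx] ** alpha * eta_row[idx] ** beta
    total = weights.sum()
    if total == 0:
        raise ValueError("no reachable position with positive weight")
    return weights / total

p = transition_probabilities([1.0, 2.0, 3.0, 4.0], [1, 1, 1, 1], {1, 3})
```

With equal heuristic values, the two unvisited positions share the probability mass in proportion to their pheromone levels (2 : 4).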

Pheromones will evaporate in the process of movement. Pheromones of position i to position j at iteration t + 1 are updated according to

$$ \tau_{ij} \left( {t + 1} \right) = \left( {1 - \rho } \right)\tau_{ij} \left( t \right) + \Delta \tau_{ij} \left( t \right) $$
(2)

where \( \rho \) is a user-defined evaporation coefficient and \( \Delta \tau_{ij} \left( t \right) \) is an increment of pheromones from position i to position j at iteration t. The \( \Delta \tau_{ij} \left( t \right) \) is defined as

$$ \Delta \tau_{ij} \left( t \right) = \sum\limits_{k = 1}^{m} {\Delta \tau_{ij}^{k} } \left( t \right) $$
(3)

where \( m \) is the number of ants and \( \Delta \tau_{ij}^{k} \left( t \right) \) is the amount of pheromone deposited by ant k on the path from position i to position j at iteration t, which is defined as

$$ \Delta \tau_{ij}^{k} \left( t \right)\, = \left\{ \begin{aligned} & \frac{Q}{{S_{k} \left( t \right)\,}} & \text{if ant } k \text{ traverses the path } ij \text{ at iteration } t \\ & 0\, & \text{otherwise} \\ \end{aligned} \right. $$
(4)

where \( Q \) is a user-defined positive constant and \( S_{k} \left( t \right) \) is the path length for ant k at iteration t.
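Equations (2) to (4) can be combined into a single update step, sketched below under stated assumptions: each ant's tour is given as a list of visited positions, paths are treated as directed edges, and the names `update_pheromones`, `tours` and `lengths` are hypothetical.

```python
import numpy as np

def update_pheromones(tau, tours, lengths, rho=0.2, Q=1.0):
    """Evaporate and deposit pheromone following Eqs. (2)-(4).

    tau: square matrix of pheromone levels tau_ij(t).
    tours: one list of visited positions per ant.
    lengths: path length S_k(t) of each ant's tour.
    """
    tau = np.asarray(tau, dtype=float)
    delta = np.zeros_like(tau)
    for tour, length in zip(tours, lengths):
        # Eq. (4): ant k deposits Q / S_k(t) on every edge it traversed.
        for i, j in zip(tour[:-1], tour[1:]):
            delta[i, j] += Q / length  # Eq. (3): sum over ants
    # Eq. (2): evaporation followed by deposition.
    return (1.0 - rho) * tau + delta

new_tau = update_pheromones(np.ones((3, 3)), [[0, 1, 2], [0, 1]], [2.0, 4.0],
                            rho=0.5, Q=1.0)
```

Edges used by several ants, or by ants with short tours, accumulate more pheromone, while unused edges only evaporate.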

2.3 Evaluation Measure Svalue

2.3.1 Mutual Information

Information entropy measures uncertainty: the higher the uncertainty, the higher the information entropy. Mutual information [12, 13] is a measure based on information entropy that quantifies the dependence between a SNP-SNP combination and the phenotype. IACO employs mutual information to measure how relevant a SNP-SNP interaction is to the phenotype. Mutual information can be written as

$$ MI\left( {S;Y} \right) = H\left( S \right) + H\left( Y \right) - H\left( {S,Y} \right) $$
(5)

where H(S) is the entropy of S, H(Y) is the entropy of Y, and H(S,Y) is the joint entropy of S and Y. Here S denotes a SNP combination and Y is the phenotype.

The entropy and the joint entropy are defined as

$$ H\left( S \right) = - \sum\limits_{{j_{1} = 1}}^{3} { \cdots \sum\limits_{{j_{K} = 1}}^{3} {\left( {p\left( {s_{{j_{1} }} , \cdots ,s_{{j_{K} }} } \right) \cdot \log \, p\left( {s_{{j_{1} }} , \cdots ,s_{{j_{K} }} } \right)} \right)} } $$
(6)
$$ H\left( Y \right) = - \sum\limits_{j = 0}^{1} {\left( {p\left( {y_{j} } \right) \cdot \log \, p\left( {y_{j} } \right)} \right)} $$
(7)
$$ H\left( {S,Y} \right) = - \sum\limits_{{j_{1} = 1}}^{3} \cdots \sum\limits_{{j_{K} = 1}}^{3} {\sum\limits_{j = 0}^{1} {\left( {p\left( {s_{{j_{1} }} , \cdots ,s_{{j_{K} }} ,y_{j} } \right) \cdot \log \, p\left( {s_{{j_{1} }} , \cdots ,s_{{j_{K} }} ,y_{j} } \right)} \right)} } $$
(8)

where s is the genotype of a SNP, coded as {1, 2, 3} for the homozygous common, heterozygous and homozygous minor genotypes respectively, and y is the label of a sample, coded as {0, 1} for control and case respectively. \( p\left( \cdot \right) \) is the probability distribution function.

The higher the mutual information score, the stronger the association between the SNP subset and the disease.
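A minimal sketch of Eqs. (5) to (8) is given below. It assumes that the probabilities are estimated by empirical frequencies and that the natural logarithm is used (the base is not specified in the text); genotype combinations that never occur contribute nothing to the sums. The function names are hypothetical.

```python
import numpy as np
from collections import Counter

def entropy(symbols):
    """Empirical entropy of a sequence of hashable symbols (natural log)."""
    counts = np.array(list(Counter(symbols).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def mutual_information(genotypes, labels, combo):
    """MI(S;Y) = H(S) + H(Y) - H(S,Y) for the SNP columns listed in combo."""
    s = [tuple(row) for row in np.asarray(genotypes)[:, list(combo)]]
    y = list(labels)
    sy = [a + (b,) for a, b in zip(s, y)]  # joint symbol (s_1, ..., s_K, y)
    return entropy(s) + entropy(y) - entropy(sy)

# Toy check: a SNP that determines the label versus one that is independent.
labels = [0, 0, 1, 1]
mi_dependent = mutual_information([[1], [1], [3], [3]], labels, (0,))
mi_independent = mutual_information([[1], [3], [1], [3]], labels, (0,))
```

In this toy case the fully informative SNP attains MI = log 2, the entropy of the balanced label, while the independent SNP attains MI = 0.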

2.3.2 Bayesian Network

In recent years, with the rapid development of machine learning, much research has been devoted to statistical learning methods such as the Bayesian network (BN) [14, 15]. Many measures based on Bayesian networks are used to evaluate the correlation between SNP-SNP combinations and disease. In this work, we adopt a BN score to measure the correlation between the disease and different SNP combinations. Specifically, in the GWAS BN, the genotypes and phenotypes are represented as a set of nodes, and their conditional dependences are represented as a set of edges. Following previous studies [15], we choose the BN score as follows:

$$ BN = \sum\limits_{i = 1}^{I} {\left( {\sum\limits_{b = 1}^{{r_{i} + 1}} {{ \log }\left( b \right){ - }\sum\limits_{j = 1}^{J} {\sum\limits_{d = 1}^{{r_{ij} }} {{ \log }\left( d \right)} } } } \right)} \, $$
(9)

where I is the number of genotype combinations of the SNP nodes (if l SNP nodes are connected to the disease node y, the number of combinations is \( 3^{l} \), since a SNP node takes the value 0, 1, or 2), J is the number of states of the disease node y (two for all samples), \( r_{i} \) is the number of cases in which the SNP nodes take the \( i \)-th combination, and \( r_{ij} \) is the number of cases in which the disease node takes the \( j \)-th state and its parents take the \( i \)-th combination [5]. For this measure, a lower BN value implies greater relevance of the SNP-SNP combination to the phenotype.
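Eq. (9) can be evaluated without explicit loops over b and d, because \( \sum_{b = 1}^{n} \log b = \log n! \), which is available as `math.lgamma(n + 1)`. The sketch below relies on that identity and interprets \( r_{i} \) and \( r_{ij} \) as sample counts; the function name `bn_score` is hypothetical. Combinations with no samples have \( r_{i} = 0 \) and contribute zero, so only observed combinations are counted.

```python
import math
from collections import Counter

def bn_score(genotypes, labels, combo):
    """BN score of Eq. (9) for the SNP columns listed in combo."""
    combo_counts = Counter()  # r_i: samples with the i-th genotype combination
    joint_counts = Counter()  # r_ij: samples with combination i and state j
    for row, y in zip(genotypes, labels):
        key = tuple(row[c] for c in combo)
        combo_counts[key] += 1
        joint_counts[(key, y)] += 1
    # sum_{b=1}^{r_i + 1} log(b) = log((r_i + 1)!) = lgamma(r_i + 2)
    score = sum(math.lgamma(r_i + 2) for r_i in combo_counts.values())
    # sum_{d=1}^{r_ij} log(d) = log(r_ij!) = lgamma(r_ij + 1)
    score -= sum(math.lgamma(r_ij + 1) for r_ij in joint_counts.values())
    return score

labels = [0, 0, 1, 1]
bn_dependent = bn_score([[1], [1], [3], [3]], labels, (0,))
bn_independent = bn_score([[1], [3], [1], [3]], labels, (0,))
```

On this toy data the SNP that determines the label obtains the lower score (2 log 3 versus 2 log 6), in line with the statement that a lower BN value indicates greater relevance.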

2.3.3 Svalue

Most previous studies used a single-objective evaluation method to evaluate the correlation between SNP-SNP interactions and the phenotype. However, because of the complexity of some diseases, an individual evaluation method performs differently under different disease models. In this work, we adopt a new evaluation measure, Svalue, which is based on both the Bayesian network score and mutual information. As mentioned above, the higher the mutual information score and the lower the BN score, the stronger the association between the SNP subset and the disease. However, different SNP-SNP combinations cannot be compared directly using the two measures, because the measures point in opposite directions and differ in order of magnitude. We therefore propose a new evaluation measure as follows:

$$ Svalue\left( A \right)\,{ = }\,\theta \frac{MI}{BN} $$
(10)

where MI and BN are the mutual information score and the BN score of the SNP combination A respectively, and \( \theta \) is a user-specified constant. The larger the Svalue, the stronger the association between the SNP-SNP combination and the phenotype.
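As a brief illustration of Eq. (10), the sketch below ranks three candidate combinations by Svalue. The MI and BN numbers and the SNP names are invented for illustration only, and the guard against a non-positive BN score is an added assumption.

```python
def svalue(mi, bn, theta=10000.0):
    """Eq. (10): Svalue = theta * MI / BN."""
    if bn <= 0:
        raise ValueError("BN score must be positive")
    return theta * mi / bn

# Hypothetical (MI, BN) scores for three candidate SNP pairs.
candidates = {
    ("snpA", "snpB"): (0.020, 2750.0),
    ("snpA", "snpC"): (0.004, 2790.0),
    ("snpD", "snpE"): (0.001, 2800.0),
}

# Rank the candidates in descending order of Svalue.
ranked = sorted(candidates, key=lambda c: svalue(*candidates[c]), reverse=True)
```

Because MI is rewarded in the numerator and BN is penalized in the denominator, a single descending sort respects both criteria, and \( \theta \) only rescales the result.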

2.4 Improved Ant Colony Optimization (IACO)

2.4.1 Path Selection Strategy

The IACO algorithm is based on the generic ant colony optimization (ACO) algorithm. In this work, the pheromone information is stored in a matrix whose number of columns equals the number of SNPs in the data set. In order to treat each SNP equally, the pheromone of every SNP is set to 1 at the start of the algorithm. The ants then choose SNPs independently on the basis of pheromones and heuristic information. The ant-decision rule by which ant k selects SNP i at iteration t is denoted by \( P_{k}^{i} \left( t \right) \). The PF of IACO is defined as

$$ P_{k}^{i} \left( t \right)\, = \,\left\{ \begin{aligned} & R\, & q \le q_{0} \\ & S\, & q > \,q_{0} \\ \end{aligned} \right. $$
(11)

where q is a random number drawn uniformly from (0,1), and \( q_{0} \) is a user-defined threshold that controls the rate of convergence and helps to avoid becoming trapped in a locally optimal solution. R and S are the rules by which ants select SNPs, described as

$$ S = \left\{ \begin{aligned} & 1\, & i\, = \,\mathrm{rand}\left( {U_{k} \left( t \right)} \right) \\ & 0\, & \text{otherwise} \\ \end{aligned} \right. $$
(12)
$$ R = \left\{ \begin{aligned} & \frac{{\tau_{i} (t)^{\alpha } \eta_{i}^{\beta } }}{{\sum\limits_{{u \in U_{k} (t)}} {\tau_{u} (t)^{\alpha } \eta_{u}^{\beta } } }}\, & i \in U_{k} (t) \\ & 0\, & otherwise \\ \end{aligned} \right. $$
(13)

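where \( \tau_{i} \left( t \right) \) is the pheromone level of SNP i at iteration t, \( \eta_{i} \) is the heuristic information of SNP i, and \( U_{k} \left( t \right) \) is the set of SNPs that have not yet been selected by ant k at iteration t.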
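The selection rule of Eqs. (11) to (13) can be sketched as follows. The sketch reads rule S as a uniformly random choice among the SNPs in \( U_{k} \left( t \right) \) and rule R as pheromone-proportional sampling; the function names and the use of NumPy's random generator are assumptions, not part of the published software.

```python
import numpy as np

def select_snp(tau, eta, unvisited, q0=0.6, alpha=1.0, beta=1.0, rng=None):
    """Choose one SNP for an ant according to Eqs. (11)-(13)."""
    rng = np.random.default_rng() if rng is None else rng
    tau = np.asarray(tau, dtype=float)
    eta = np.asarray(eta, dtype=float)
    candidates = np.array(sorted(unvisited))
    if rng.random() <= q0:
        # Rule R, Eq. (13): sample in proportion to tau^alpha * eta^beta.
        w = tau[candidates] ** alpha * eta[candidates] ** beta
        return int(rng.choice(candidates, p=w / w.sum()))
    # Rule S, Eq. (12): uniformly random unvisited SNP.
    return int(rng.choice(candidates))

def build_combination(tau, eta, K=2, **kwargs):
    """Let one ant pick K distinct SNPs, i.e. one K-SNP combination."""
    unvisited = set(range(len(tau)))
    chosen = []
    for _ in range(K):
        snp = select_snp(tau, eta, unvisited, **kwargs)
        chosen.append(snp)
        unvisited.remove(snp)
    return tuple(sorted(chosen))

rng = np.random.default_rng(0)
combo = build_combination(np.ones(6), np.ones(6), K=2, rng=rng)
```

With \( q_{0} = 0.6 \), roughly 60% of the selections follow the pheromone trail and the remainder explore the SNP set at random.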

2.4.2 Pheromone Updating Strategy

Each ant finds one K-order SNP-SNP combination by the end of each iteration, but the strength of association with the disease differs between combinations. We therefore adopt a new pheromone update function that changes the pheromones of the SNPs according to the combinations in which they appear:

$$ \tau_{i} \left( {t + 1} \right) = \left( {1 - \rho } \right)\tau_{i} \left( t \right) + \Delta \tau_{i} \left( t \right) $$
(14)

where \( \Delta \tau_{i} \left( t \right) \) is the variation of pheromones, which is described as

$$ \Delta \tau_{i} \left( t \right) = \sum\limits_{k = 1}^{m} {\Delta \tau_{i}^{k} } \left( t \right) $$
(15)
$$ \Delta \tau_{i}^{k} \left( t \right) = \left\{ \begin{aligned} & Svalue(A)\, & k \in M_{i} (t) \\ & 0\, & otherwise \\ \end{aligned} \right. $$
(16)

In these equations, \( M_{i} \left( t \right) \) is the set of ants that select SNP i at iteration t, A is the SNP-SNP combination identified by ant k at iteration t, and Svalue(A) is the Svalue of A.
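A minimal sketch of the update in Eqs. (14) to (16) follows. It assumes that each ant's combination is given as a tuple of SNP indices together with its precomputed Svalue; the function name `update_snp_pheromones` is hypothetical.

```python
import numpy as np

def update_snp_pheromones(tau, combos, svalues, rho=0.2):
    """Per-SNP pheromone update following Eqs. (14)-(16).

    combos: the SNP combination (tuple of indices) found by each ant.
    svalues: Svalue of each ant's combination, in the same order.
    """
    tau = np.asarray(tau, dtype=float)
    delta = np.zeros_like(tau)
    for combo, s in zip(combos, svalues):
        for snp in combo:
            # Eq. (16): ant k adds Svalue(A) to every SNP it selected;
            # Eq. (15): contributions are summed over all ants.
            delta[snp] += s
    # Eq. (14): evaporation plus the accumulated increment.
    return (1.0 - rho) * tau + delta

new_tau = update_snp_pheromones(np.ones(4), [(0, 1), (1, 2)], [0.5, 1.0], rho=0.2)
```

In this example SNP 1 appears in both combinations and ends with the highest pheromone level, whereas SNP 3, selected by no ant, only evaporates.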

2.4.3 Memory Based Strategy

In this work, we rank all SNP-SNP combinations identified by the ants at iteration t in descending order of Svalue. The ranked combinations are denoted by \( A_{1} ,A_{2} , \cdots A_{m} \), and plotting Svalue against rank gives the points \( (1,Svalue(A_{1} )),(2,Svalue(A_{2} )), \cdots (m,Svalue(A_{m} )) \). We then find an inflection point f and retain the points before it as the optimal solutions of this iteration, where f is defined as

$$ f = \mathop {\arg \max }\limits_{3 \le g \le m} \left( {\left( {Svalue\left( {A_{g} } \right) - Svalue\left( {A_{g - 1} } \right)} \right) - \left( {Svalue\left( {A_{g - 1} } \right) - Svalue\left( {A_{g - 2} } \right)} \right)} \right) $$
(17)

In order to preserve the optimal solutions that have already been identified, speed up convergence and avoid recomputing the Svalue of solutions that have already been evaluated, we introduce a memory strategy that stores the optimal solutions of each iteration in a set of suspected solutions. In each subsequent iteration, the stored solutions are compared with the solutions identified in the current iteration to obtain the updated set of suspected solutions. All suspected solutions remaining after the last iteration are passed to the next stage of IACO for further processing [5].
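The inflection point of Eq. (17) and one possible form of the memory update can be sketched as follows. The text does not specify exactly how stored and new solutions are merged, so the `update_memory` function below is only an assumption: it keeps the combinations ranked before the inflection point and adds them to a dictionary that maps each combination to its Svalue, which also serves as a cache.

```python
def inflection_point(sorted_svalues):
    """Eq. (17): 1-based rank g (3 <= g <= m) with the largest second difference."""
    s = sorted_svalues
    m = len(s)
    if m < 3:
        raise ValueError("at least three ranked solutions are required")
    best_g, best_val = 3, float("-inf")
    for g in range(3, m + 1):
        # s[g - 1] is Svalue(A_g) because Python lists are 0-based.
        second_diff = (s[g - 1] - s[g - 2]) - (s[g - 2] - s[g - 3])
        if second_diff > best_val:
            best_g, best_val = g, second_diff
    return best_g

def update_memory(memory, iteration_results):
    """Merge this iteration's retained solutions into the memory (assumed rule)."""
    ranked = sorted(iteration_results.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) >= 3:
        f = inflection_point([v for _, v in ranked])
        ranked = ranked[: f - 1]  # keep A_1 ... A_{f-1}
    merged = dict(memory)
    merged.update(ranked)
    return merged

f = inflection_point([10.0, 9.5, 9.0, 3.0, 2.5, 2.0])
memory = update_memory(
    {(0, 1): 12.0},
    {(2, 3): 10.0, (4, 5): 9.5, (6, 7): 9.0, (8, 9): 3.0, (1, 2): 2.5, (3, 4): 2.0},
)
```

Looking up a combination in `memory` before scoring it is one way to avoid recomputing Svalue for solutions that have already been evaluated.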

3 Results and Discussion

3.1 Simulation and Real Data Sets

We use four benchmark models of SNP-SNP interactions for the experiments [14, 16–19]. The model parameters are defined as follows: penetrance is the probability of the occurrence of a disease given a particular genotype, and the penetrance functions of the four models are reported in Fig. 1; MAF(\( \alpha \)) is the minor allele frequency of \( \alpha \); AA, Aa and aa are the homozygous common, heterozygous and homozygous minor genotypes; and prevalence is the proportion of individuals in which the disease occurs.

Model 1 displays both marginal effects and an interactive effect, and its penetrance increases only when both SNPs have at least one minor allele [14, 16]; its MAF(a), MAF(b) and prevalence are 0.300, 0.200 and 0.100 respectively. Model 2 also shows both marginal effects and an interactive effect, but an additional minor allele at each locus does not further increase the penetrance [14]; its MAF(a), MAF(b) and prevalence are 0.400, 0.400 and 0.050 respectively. Model 3 likewise displays both marginal effects and an interactive effect; it assumes that the minor allele in one SNP has a marginal effect, which is reversed when minor alleles are present in both SNPs [14]; its MAF(a), MAF(b) and prevalence are 0.400, 0.200 and 0.010 respectively. Model 4 shows only interactive effects and is taken directly from reference [18]. It is included because its high degree of complexity challenges the ability of a method to detect SNP-SNP interactions; its MAF(a), MAF(b) and prevalence are 0.400, 0.400 and 0.171 respectively.

For each model, 50 data sets are simulated with epiSIM [11], each containing 4000 samples with a case-to-control ratio of 1. For each data set, random SNPs are generated with their MAFs drawn uniformly from \( \left[ {0.05,0.5} \right] \).

Fig. 1. Penetrance functions of four models.

A real data set of age-related macular degeneration (AMD) is used to test the practical ability of IACO. AMD, which refers to pathological changes in the central area of the retina, is the most important cause of irreversible visual loss in elderly populations and is considered a complex disease involving multiple SNP-SNP interactions [16]. The AMD data set contains 103611 SNPs genotyped in 96 cases and 50 controls, and has been widely used as a benchmark data set for testing methods that detect SNP-SNP interactions [1, 10, 12–14, 16, 19, 20].

3.2 Experiments on Simulation Data Sets

Three methods for detecting SNP-SNP interactions are compared with IACO: AntEpiSeeker, AntMiner and MACOED. For a fair comparison, the same parameters are used for all four methods. The number of ants m and the number of iterations T are set to 500 and 10 respectively; the dimension K for IACO is set to 2; the heuristic information \( \eta \) is set to 1; the parameters controlling the importance of pheromones, \( \alpha \), and of heuristic information, \( \beta \), are both set to 1; the threshold \( q_{0} \) that controls the rate of convergence is set to 0.6; the evaporation coefficient \( \rho \) is set to 0.2; and the constant \( \theta \) is set to 10000.

In order to evaluate the performance of the algorithm, we use power as the evaluation criterion, defined as the proportion of the 50 data sets in which all SNPs connected to the disease are recognized. The power of the four compared methods on the simulation data sets is reported in Fig. 2. The power of IACO is much higher than that of AntEpiSeeker on all models, much higher than that of MACOED except on Model 4, and much higher than that of AntMiner except on Model 1 and Model 2. AntMiner benefits on these two models from incorporating heuristic information into ACO, but its running time is significantly higher than that of IACO. AntEpiSeeker performs worse on all models because it focuses on identifying the models of pathogenic SNPs. MACOED introduces logistic regression to evaluate SNP-SNP combinations, which is unstable across parameter settings, and its main problem is that its search strategy degenerates into a stochastic search. In terms of computation time, AntEpiSeeker, MACOED and IACO are of the same order of magnitude and far faster than AntMiner. Taken together, IACO outperforms the other methods in detection power on large data sets and shows good stability. In future research, we will consider introducing heuristic information into IACO to improve its power.
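The power criterion can be sketched as below, under the assumption that a data set counts as a success when at least one reported combination contains all of the SNPs connected to the disease. The function name `detection_power` and the toy inputs are hypothetical.

```python
def detection_power(reported, true_snps):
    """Fraction of data sets in which some reported combination
    contains every SNP connected to the disease.

    reported: one list of reported SNP combinations per data set.
    true_snps: indices of the SNPs connected to the disease.
    """
    true_snps = set(true_snps)
    hits = sum(
        1 for combos in reported
        if any(true_snps <= set(c) for c in combos)
    )
    return hits / len(reported)

# Four toy data sets; the connected SNPs are 0 and 1.
power = detection_power(
    [[(0, 1)], [(0, 2), (3, 4)], [(1, 0), (5, 6)], []],
    true_snps=(0, 1),
)
```

Here two of the four toy data sets recover the pair (0, 1), so the sketch returns a power of 0.5; in the experiments above the denominator would be the 50 data sets per model.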

Fig. 2. Power of compared methods on simulation data sets.

3.3 Application to Real Data Set

IACO is applied to the AMD data set four times with the parameter settings \( \left( {m,T} \right) \) being \( \left( {10000,500} \right) \), \( \left( {10000,1000} \right) \), \( \left( {20000,500} \right) \) and \( \left( {20000,1000} \right) \). The captured SNP-SNP interactions that might be associated with AMD are reported in Table 1 in descending order of Svalue.

Table 1. Top 11 captured SNP-SNP interactions associated with AMD. CFH: complement factor H. CRSP8: cofactor required for Sp1 transcriptional. JMJD2C: jumonji domain containing 2C. ISCA1: HESB like domain containing 2. N/A: no gene is available. Chr: Chromosome.

All reported SNP-SNP interactions contain either rs380390 or rs1329428, except for the pair in the fourth row. The reason is that these two SNPs have the strongest main effects among all tested SNPs, and both have already been shown to be significantly associated with AMD [4]. Both are located in the CFH gene, which lies on chromosome 1 in a region repeatedly linked to AMD in family-based studies [21].

The SNPs in the second column of the table might have been identified only because they are combined with rs380390 or rs1329428, since the strong main effects of these two SNPs cause almost all of their combinations with other SNPs to display a strong interactive effect. This also indicates that IACO is sensitive to SNPs with strong main effects. It is worth noting that our method finds that the combination of rs1368863 and rs1740752 also displays a strong interactive effect. These SNPs are not mapped to any gene (N/A); they are located in noncoding regions of chromosomes 11 and 10, and may contribute to AMD through mutation or by affecting other disease-causing genes. This observation may provide some new clues for the study of AMD.

4 Conclusions

Epistasis has attracted extensive attention in exploring the mechanisms underlying susceptibility to complex diseases. In this work, we proposed IACO, an algorithm for detecting SNP-SNP interactions that is based on ant colony optimization and a newly introduced fitness function, Svalue, which combines Bayesian networks and mutual information. Furthermore, a memory-based strategy is employed to improve the performance of IACO by preventing optimal solutions that have already been identified from being lost. Experiments with IACO were performed on simulation data sets and on a real data set of age-related macular degeneration (AMD), and IACO was compared with other representative methods, namely AntEpiSeeker [4], AntMiner [1], and MACOED [5]. The results demonstrate that IACO outperforms the others in detecting SNP-SNP interactions. Moreover, the application of IACO to the AMD data set may provide some new clues for the study of AMD. The Matlab version of the IACO software package is available online at http://www.bdmb-web.cn/index.php?m=content&c=index&a=show&catid=37&id=98.

IACO has several advantages for identifying epistatic interactions. First, it introduces a new fitness function, Svalue, based on Bayesian networks and mutual information, which unifies the two evaluation measures into a single score and improves the handling of optimal solutions. Second, it adopts a memory-based strategy for handling the optimal solutions, which ranks the Svalues of the suspected solutions together with those of the SNP-SNP combinations identified at the current iteration. One merit of the memory-based strategy is that the optimal solutions of the previous iteration are reused in the current iteration, which prevents optimal solutions that have already been found from being lost, speeds up convergence and avoids recomputing the Svalue of solutions that have already been evaluated. Third, we use a dynamic pheromone updating strategy to update the pheromones of the SNPs: SNPs that are more strongly associated with the disease receive more pheromone than those with weak relevance, which effectively improves the convergence speed of the algorithm.

Although IACO performed well in detecting epistatic interactions on both simulation and real data sets, some limitations remain. For instance, we set the heuristic information \( \eta \) to 1, which may degrade the performance of IACO. In future studies, we will consider introducing heuristic information into IACO for handling GWAS data sets. IACO is an ongoing research project, and we will continue to improve it.