Introduction

The rising incidence of microbial disease in animals and plants has made the discovery of new antimicrobial agents a research priority. Over the past years, microorganisms have developed resistance to antibiotics [1], affecting both human health and the agricultural economy [2]. The continuing emergence of pathogens, together with resistance against antibiotics, has led researchers to search for new antimicrobial compounds from diverse sources. Among these compounds, antimicrobial peptides (AMPs) appear to be among the most promising candidates for clinical development as inhibitors of microbial activity because of their target specificity, speed of action and role in the innate immunity of the organism [3]. One of the most important arguments for introducing AMPs as antibacterial agents is that bacteria may not readily develop resistance to them [4], owing to their receptor-independent mechanism of action [5]. Antibiotic use in aquaculture is discouraged because of the associated problems of resistance development and consumer health risk [6]. Many synthetic antibiotics, viz. nitrofurans and fluoroquinolones, and other antimicrobial agents such as malachite green are banned because of toxicity, carcinogenicity, sensitivity or bioaccumulation leading to adverse effects on human health.

Aquaculture, also known as “underwater agriculture,” is important for food, employment and revenue [7]. It covers around 350 species, including 34 finfish (piscean), 8 crustacean and 12 molluscan species with a production of more than 100,000 tonnes, and has an economic value of more than US$215 billion, around 0.3 % of the world economy [8–10]. However, the economic loss in fisheries reported in [8] is around US$50 billion per year. The most significant contributor to these losses is infectious disease; a conservative estimate puts the loss at tens of billions of dollars over the last 20 years [7].

AMPs are low-molecular-weight proteins of about 12–100 amino acid residues, usually positively charged. Since 40–50 % of their residues are hydrophobic, they fold into an amphipathic structure that makes them soluble in both aqueous and lipid environments [3]. AMPs kill targeted microbes through a mechanism in which some cationic peptides are electrostatically attracted to the negatively charged phospholipids of microbial membranes and integrate into the cell membrane, resulting in membrane disintegration [11]. AMP genes are expressed in different cell types of the host [12] and also support innate immunity [13]. AMPs have been derived from many sources, and fish are a source of many kinds of AMPs [14]. AMPs have a broad spectrum, acting against gram-positive and gram-negative bacteria, viruses and fungi. They also act as mediators of inflammation, cell proliferation, angiogenesis, wound healing, chemotaxis, immune induction and protease–antiprotease balance [15, 16]. They are also called “natural antibiotics” [15, 17].

According to the World Organization for Animal Health, seven of the nine major diseases in fish are infectious diseases that cause mortality. To combat them, antibiotics are used heavily, often exceeding the limits prescribed by statutory norms, e.g., tetracycline (0.1 mg/kg), oxytetracycline (0.1 mg/kg), trimethoprim (0.05 mg/kg) and oxolinic acid (0.3 mg/kg). Thus, there is a pressing need to look for alternatives to antibiotics [18]. Classes of AMPs found in fish include defensins, cathelicidins, hepcidins, histone-derived peptides and a fish-specific class of the cecropin family called piscidins [13]. Since fishes depend much more on their innate immune defenses than mammals do, they are a potentially rich source of therapeutic molecules against mammalian viruses, pathogenic fungi, fibrosarcoma and bacterial biofilm in mastitis [19]. The wide applicability of such lead molecules, together with recombinant production, gives them considerable industrial potential; a computational approach can therefore be pivotal in accelerating such investigations.

Existing AMP prediction servers [20, 21] are generic and were trained with various classifiers on all available multispecies, wet-laboratory-validated AMP data. Recently, a species-specific approach that does not compromise prediction accuracy was reported for cattle [22]. No such species-specific approach has been reported for fish so far. With the advent of low-cost sequencing technology, a plethora of genomic data is available. If machine learning techniques are offered through a user-friendly server, the cost of AMP discovery can be drastically reduced by narrowing down the number of putative AMPs that must go through antimicrobial assays in the laboratory. In silico prediction is therefore imperative. The present work aims to develop an AMP prediction server using machine learning techniques, specifically artificial neural networks (ANNs) and support vector machines (SVMs).

Materials and Methods

Data Collection and Preprocessing

A number of AMP databases exist. An extensive search of LAMP [23], CAMP [21, 24], PenBase [25], EROP [26] and APD2 [27] was made for experimentally validated AMPs specific to fishes and crustaceans, yielding a total of 308 AMPs. To remove redundancy, the sequences were clustered with the CD-HIT program [28], whose “longest sequence first” list removal algorithm removes sequences above a certain identity threshold. Finally, 151 AMPs were selected as the positive dataset for the study. Owing to the lack of experimentally proven non-antimicrobial peptides for fishes, peptides synthesized from fish mitochondria and other intracellular locations, excluding the secretory proteins, were considered non-antimicrobial peptides [29] and taken as the negative set. The eukaryotic mitochondrial genome is believed to mimic prokaryotic genome features, in line with the endosymbiont hypothesis of a prokaryotic origin of mitochondria during the course of evolution [30].
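The “longest sequence first” redundancy removal can be illustrated with a minimal Python sketch. It is not CD-HIT itself: `difflib.SequenceMatcher` from the standard library stands in for CD-HIT's alignment-based identity, and the 0.9 threshold, the function name `remove_redundant` and the example peptides are assumptions made purely for illustration (the identity threshold used in the study is not stated).

```python
from difflib import SequenceMatcher


def remove_redundant(sequences, threshold=0.9):
    """Greedy 'longest sequence first' filtering.

    Sequences are visited from longest to shortest. A sequence is kept as a
    cluster representative only if its similarity to every representative
    kept so far is below `threshold`; otherwise it is treated as redundant.
    NOTE: SequenceMatcher.ratio() is a stand-in similarity measure, not the
    identity definition used by CD-HIT.
    """
    representatives = []
    for seq in sorted(sequences, key=len, reverse=True):
        is_redundant = any(
            SequenceMatcher(None, rep, seq).ratio() >= threshold
            for rep in representatives
        )
        if not is_redundant:
            representatives.append(seq)
    return representatives


# Hypothetical peptides: the first is a one-residue truncation of the second.
peptides = [
    "GWGSFFKKAAHVGKHVGKAALTHY",
    "GWGSFFKKAAHVGKHVGKAALTHYL",
    "MKTLLLAAA",
]
non_redundant = remove_redundant(peptides)
print(non_redundant)
```

In this toy input the truncated peptide is absorbed by its longer near-duplicate, while the unrelated short peptide survives as its own representative.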

Code was written in Perl to split the data into windows of 30 residues, which is the average length of AMPs. The split sequences were validated in silico for their antimicrobial peptide property at antimicrobial peptide prediction servers like ABP2 [20] and CAMP [21]. A total of 31 features, selected on the basis of p values at the 5 % level of significance from Chi-square values obtained using the STATISTICA version 6.0 software package [31], were considered for model building. These features comprised the amino acid composition (AAC) of all the amino acids and physicochemical properties: molecular weight, theoretical pI, number of carbon, hydrogen, sulfur, oxygen and nitrogen atoms, half-life, instability index, aliphatic index and grand average of hydropathy (GRAVY). These parameters were calculated for batches of AMPs and non-AMPs using the ProtParam module of BioPerl [32]. The workflow (Fig. 1) illustrates the procedure followed.
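The windowing and two of the sequence-derived features can be sketched in Python as follows. The study itself used Perl and the BioPerl ProtParam module, so this is only an illustration. The text does not say whether the windows overlap or how a trailing fragment is handled; the sketch assumes consecutive non-overlapping windows and drops any fragment shorter than 30 residues. GRAVY is computed with the standard Kyte-Doolittle hydropathy scale, and all function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, the scale conventionally used for GRAVY.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def split_windows(sequence, size=30):
    """Cut a sequence into consecutive, non-overlapping windows.

    Assumption: a trailing fragment shorter than `size` is discarded.
    """
    return [sequence[i:i + size]
            for i in range(0, len(sequence) - size + 1, size)]


def amino_acid_composition(peptide):
    """Percentage of each of the 20 standard amino acids in the peptide."""
    counts = Counter(peptide)
    return {aa: 100.0 * counts[aa] / len(peptide) for aa in AMINO_ACIDS}


def gravy(peptide):
    """Grand average of hydropathy: mean hydropathy value over all residues."""
    return sum(HYDROPATHY[aa] for aa in peptide) / len(peptide)


windows = split_windows("A" * 65)      # two full windows, 5 residues dropped
composition = amino_acid_composition("GGKK")
print(len(windows), composition["G"], composition["K"], gravy("GGKK"))
```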

Fig. 1 Workflow of the fish AMP Web server

The study was divided into three parts, viz. prediction based on the N-terminal, the C-terminal and the full sequence. The C-terminus is chiefly responsible for interaction with the negatively charged bacterial membrane and is involved in penetration, while the N-terminus is involved in interaction with intracellular components, thus hindering the metabolic functions of bacteria [12].

Model Development

Various models were tried for the N-terminal, the C-terminal and the full sequence using SVM light [33] and STATISTICA version 6.0 [31]. The neural network is a powerful machine learning technique widely used to solve a variety of problems in pattern recognition, prediction, optimization, associative memory, etc. An artificial neuron consists of inputs (like synapses) that are multiplied by weights (the strengths of the respective signals) and then combined by a mathematical function that determines the activation of the neuron. The output of the artificial neuron is computed by another function, which sometimes depends on a certain threshold. ANNs combine artificial neurons to process information [34]. This technique was applied in the present study to predict antibacterial peptides from fishes, and the ANN models were developed in STATISTICA [31].

ANNs with the back propagation algorithm were used to build the prediction model [35], although this approach was found to overfit and to underestimate the actual prediction error, especially for small sample sizes. The sum of squared errors (SSE) was used as the error function. The two most popular and widely used networks, namely the multilayer perceptron (MLP) and the radial basis function (RBF) network, were trained using three learning algorithms, viz. the gradient descent algorithm (GDA), Broyden–Fletcher–Goldfarb–Shanno (BFGS) and the conjugate gradient descent algorithm (CGDA), with a view to minimizing the sum of squared errors of the network output. Several learning rates [35] were considered for training the networks and adjusting the weights. For hidden units and output units, several activation functions, viz. identity, tanh, logistic, exponential and sine, were tried. The performance of the trained networks was assessed by computing different measures.
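As a minimal illustration of the artificial neuron and the squared-error function described above, the following Python sketch computes a weighted sum of inputs, applies an activation function, and evaluates the sum of squared errors. It is not the STATISTICA implementation; the function names, the bias term and the numeric values are assumptions, and only three of the activation functions named above are shown.

```python
import math


def identity(z):
    return z


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias=0.0, activation=math.tanh):
    """Weighted sum of the inputs followed by an activation function."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


def sum_squared_error(targets, outputs):
    """Sum of squared differences between target and network outputs."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))


# Hypothetical two-input neuron: 0.5 * 1.0 + (-0.25) * 2.0 = 0.0
inputs, weights = [1.0, 2.0], [0.5, -0.25]
for name, fn in [("identity", identity), ("tanh", math.tanh),
                 ("logistic", logistic)]:
    print(name, neuron_output(inputs, weights, activation=fn))
```

Training algorithms such as gradient descent or BFGS then adjust the weights so that the error function over the training set decreases.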

Support vector machines (SVMs) are a group of supervised learning methods applied to classification or regression. The SVM is a nonlinear extension of the generalized portrait algorithm and is based on statistical learning theory and optimization theory [36]. For \({\mathbf{x}}\), a vector in n-dimensional input space, and \(K\left( {{\mathbf{x}}_{i} ,{\mathbf{x}}_{j} } \right)\) the kernel function, some choices of kernel function are as follows:

$${\text{(a)}}\quad K\left( {{\mathbf{x}}_{i} ,{\mathbf{x}}_{j} } \right) = {\mathbf{x}}_{i}^{T} {\mathbf{x}}_{j} \quad \left( {{\text{Linear}}\,{\text{SVM}}} \right)$$
$${\text{(b)}}\quad K\left( {{\mathbf{x}}_{i} ,{\mathbf{x}}_{j} } \right) = \left( {\gamma \,{\mathbf{x}}_{i}^{T} {\mathbf{x}}_{j} + r} \right)^{d} \quad \left( {{\text{Polynomial}}\,{\text{SVM}}\,{\text{of}}\,{\text{degree}}\,d} \right)$$
$${\text{(c)}}\quad K\left( {{\mathbf{x}}_{i} ,{\mathbf{x}}_{j} } \right) = \exp \,\left\{ { - \gamma ||{\mathbf{x}}_{i} - {\mathbf{x}}_{j} ||^{2} } \right\}\quad \left( {{\text{Radial}}\,{\text{Basis}}\,{\text{function}}\,{\text{Kernel}}} \right)$$
$${\text{(d)}}\quad K\left( {{\mathbf{x}}_{i} ,{\mathbf{x}}_{j} } \right) = \tanh \,\left( {\gamma \,{\mathbf{x}}_{i}^{T} {\mathbf{x}}_{j} + r} \right)\quad \left( {\text{Sigmoid}} \right)$$

where \(r\), \(d\) and \(\gamma > 0\) are the kernel parameters.
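The four kernels (a) to (d) translate directly into code. The following pure-Python sketch is an illustration only, not the SVM light implementation; the function names and the default parameter values are assumptions.

```python
import math


def dot(x, y):
    """Inner product x^T y of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, gamma=1.0, r=0.0, d=2):
    return (gamma * dot(x, y) + r) ** d


def rbf_kernel(x, y, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, gamma=1.0, r=0.0):
    return math.tanh(gamma * dot(x, y) + r)


# Hypothetical feature vectors
xi, xj = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(xi, xj))                            # 1*3 + 2*4
print(polynomial_kernel(xi, xj, gamma=1.0, r=1.0, d=2))
print(rbf_kernel(xi, xi, gamma=0.5))                    # identical vectors
print(sigmoid_kernel(xi, xj, gamma=0.01, r=0.0))
```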

In the present study, SVM light was used to predict fish AMPs with different kernel functions, viz. polynomial of degree 2 and 3, RBF and sigmoid. The whole analysis was performed for three cases, viz. full sequence, N-terminus residues and C-terminus residues.

Model Evaluation

For error estimation, the fivefold cross-validation technique [37] was applied: the whole dataset was divided into five sets, each with an almost equal number of peptides. Four sets were pooled as the training set, while the fifth was taken as the test set. This was repeated five times so that each set served once as the test set. The developed models were evaluated using sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), false-positive rate (FPR), false discovery rate (FDR), F1 score, accuracy and Matthews correlation coefficient (MCC). The sensitivity indicates the “quantity” of predictions, i.e., the proportion of real positives correctly predicted. The specificity indicates the “quality” of predictions, i.e., the proportion of real negatives correctly predicted. The PPV is the proportion of true positives among predicted positives (“the success rate”), while the NPV is the proportion of true negatives among predicted negatives. The FDR of a set of predictions is the expected proportion of false positives among the predicted positives. The F1 score is the harmonic mean of PPV and sensitivity. The MCC can be seen as the correlation coefficient between the observed and predicted binary classifications.

$$\left\{ \begin{array}{ll} \text{Sensitivity} = \dfrac{\text{TP}}{\text{TP} + \text{FN}} \times 100 & \text{Specificity} = \dfrac{\text{TN}}{\text{FP} + \text{TN}} \times 100 \\ \text{PPV} = \dfrac{\text{TP}}{\text{TP} + \text{FP}} \times 100 & \text{NPV} = \dfrac{\text{TN}}{\text{TN} + \text{FN}} \times 100 \\ \text{FPR} = \dfrac{\text{FP}}{\text{FP} + \text{TN}} & \text{FDR} = \dfrac{\text{FP}}{\text{TP} + \text{FP}} = 1 - \text{PPV} \\ F1 = \dfrac{2\,\text{TP}}{2\,\text{TP} + \text{FP} + \text{FN}} & \\ \text{Accuracy} = \dfrac{\text{TP} + \text{TN}}{\text{TP} + \text{FP} + \text{TN} + \text{FN}} \times 100 & \\ \text{MCC} = \dfrac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{\left( \text{TP} + \text{FP} \right)\left( \text{TP} + \text{FN} \right)\left( \text{TN} + \text{FP} \right)\left( \text{TN} + \text{FN} \right)}} \times 100 & \end{array} \right.$$

where TP = true positive (correctly identified as positive); TN = true negative (correctly identified as negative); FP = false positive (incorrectly identified as positive); FN = false negative (incorrectly identified as negative).
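The evaluation procedure can be sketched in Python as follows. The fold assignment is a simple round-robin split (an assumption; the study does not describe how peptides were assigned to the five sets, and real data would normally be shuffled first). The measures are returned as fractions, so multiplying by 100 gives the percentages of the formulas above. The confusion counts in the example are hypothetical and are not taken from the study.

```python
import math


def five_fold_indices(n_items, k=5):
    """Assign indices 0..n_items-1 to k folds of near-equal size (round-robin)."""
    return [list(range(fold, n_items, k)) for fold in range(k)]


def evaluation_measures(tp, tn, fp, fn):
    """Evaluation measures as fractions; assumes every denominator is nonzero."""
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "fpr": fp / (fp + tn),
        "fdr": fp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Each fold serves once as the test set; the other four form the training set.
folds = five_fold_indices(12)
for test_fold in folds:
    train = [i for fold in folds if fold is not test_fold for i in fold]
    print("test:", test_fold, "train size:", len(train))

# Hypothetical confusion counts, for illustration only.
measures = evaluation_measures(tp=96, tn=98, fp=2, fn=4)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```

Note that FPR equals 1 minus specificity and FDR equals 1 minus PPV when all quantities are expressed as fractions, which provides a quick consistency check on any reported set of measures.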

Results and Discussion

The positive (antimicrobial peptide) and negative (non-antimicrobial peptide) datasets were compared on the basis of the average percentage composition of amino acids (Fig. 2). The amino acid G was strongly dominant in AMPs. The amino acids K and R were also markedly more abundant in AMPs than in non-AMPs. Conversely, residues such as D, Q, E, S and V were markedly more abundant in non-AMPs than in AMPs, in accordance with previously reported research [20].

Fig. 2 Comparison of average percentage of amino acids in antibacterial and non-antibacterial peptides (amino acids are denoted by their standard one-letter codes)

Classification models were developed with both ANN and SVM machine learning techniques in order to find the models with the best accuracy. For ANN, STATISTICA version 6.0 [31] was used. Supplementary Table 1 shows the features selected for model development along with their p values.

Table 1 Performance of the models for full sequence residues using SVM and ANN methodology

The models MLP 31-13-2, MLP 31-18-2 and MLP 31-15-2, with accuracies of 94, 93 and 89 %, were found to be the best for full sequence, N-terminus residues and C-terminus residues, respectively. For all three models, the best training algorithm was BFGS, the best error function was entropy and the best output layer function was softmax. The hidden layer activation function was exponential for the full sequence and for N-terminus residues, and tanh for C-terminus residues. The best models were selected on the basis of the measures discussed in the previous section; the results for full sequence, N-terminus residues and C-terminus residues are shown in Tables 1, 2 and 3, respectively.

Table 2 Performance of the models for N-terminus residues using SVM and ANN methodology
Table 3 Performance of the models for C-terminus using SVM and ANN methodology

The SVM models were developed using SVM light version 6.2. Perl scripts were written to fine-tune the parameters for final model selection. SVM achieved 97, 99 and 97 % accuracy for full sequence, N-terminus residues and C-terminus residues, respectively, outperforming ANN in all three cases. This is consistent with other studies in which SVM methodology is reported to perform better than ANN [20, 38].

The SVM models were generated with five kernels, viz. linear, polynomial of degree 2, polynomial of degree 3, RBF and sigmoid, for all three sets (full sequence, N-terminus residues and C-terminus residues). Training stopped when the maximum number of iterations was reached or when the error fell below 0.0001.

For full sequence, the SVM model generated with the RBF kernel function was considered the best, with the c parameter set to 45.0 and 66 support vectors. The values of sensitivity, specificity, accuracy, MCC, PPV, NPV, FPR, FDR and F1 were 0.96, 0.98, 0.97, 0.94, 0.98, 0.96, 0.02, 0.02 and 0.97, respectively (Table 1).

Similarly, for N-terminus residues, the linear kernel function was considered the best, with the c parameter set to 17.50. The values of sensitivity, specificity, accuracy, MCC, PPV, NPV, FPR, FDR and F1 were 0.99, 0.98, 0.99, 0.97, 0.98, 0.99, 0.02, 0.02 and 0.99, respectively (Table 2). For C-terminus residues, the polynomial kernel function of degree 2 was found to be the best, with the c parameter set to 12. The values of sensitivity, specificity, accuracy, MCC, PPV, NPV, FPR, FDR and F1 were 0.97, 0.97, 0.97, 0.94, 0.97, 0.97, 0.03, 0.03 and 0.97, respectively (Table 3).

Generic AMP prediction servers report lower accuracies: ABP2 reports 92.14 % with SVM methodology [20], while the CAMP server reports 93.2 % with a random forest approach and 91.5 % with an SVM approach [21]. The present fish-specific AMP prediction server gives a prediction accuracy of up to 99 %. The higher accuracy may be due to the diversification of AMPs during the course of evolution along species-specific parameters.

AMPs predicted through this server can be modified for better antimicrobial activity for industrial use. This can be done by modifying their polypeptide structure, thereby changing their biological and chemical activity. Such modifications can be made at the genetic level to generate novel antimicrobial peptides. This approach is gaining ground; for example, the AMP nisin has been approved by the FDA. AMPs have wide applications in food additives, cosmetics, ointments, injections, etc., and characterization of antimicrobial peptides is therefore highly warranted [39].

The present server can be used for the discovery of novel AMPs not only from currently available genomic and proteomic data but also from future fish genome and proteome data. Fish AMPs are notable for their wide range of applications compared with those of other phyla. In human diseases such as Crohn’s disease and Kostmann’s syndrome, AMPs from rainbow trout have shown promising results as immunomodulators [40]. Similarly, antitumor activity has been reported for AMPs from tilapia [41] and grouper fish [42, 43]. Fish AMPs from rainbow trout have been reported to increase the efficacy of vaccination when used as an adjuvant [44]. The issue of formalin toxicity in developing inactivated vaccines can be resolved by substituting formalin with fish AMPs; e.g., a grouper fish AMP has been reported to inactivate NNV (nervous necrosis virus) [45]. Red sea bream fish AMPs have been used for non-contaminated coatings of food packages that are effective against both gram-negative and gram-positive bacteria [46]. Such wide-ranging applications need wet laboratory validation of their efficacy in each setting. All three cases (N-terminal, C-terminal and full sequence) were compared to show the trend of specificity and sensitivity using the training and test data (Fig. 3a–c).

Fig. 3 a Comparison of various methods in terms of specificity and sensitivity using the training and test data for N-terminal (Sp specificity, Sen sensitivity, Tr training set, Ts test set, Nt N-terminal, ANN artificial neural network, Lin linear, Pol-2 polynomial of degree 2, Pol-3 polynomial of degree 3, RBF radial basis function, Sig sigmoid). b Comparison of various methods in terms of specificity and sensitivity using the training and test data for C-terminal (Sp specificity, Sen sensitivity, Tr training set, Ts test set, Ct C-terminal, ANN artificial neural network, Lin linear, Pol-2 polynomial of degree 2, Pol-3 polynomial of degree 3, RBF radial basis function, Sig sigmoid). c Comparison of various methods in terms of specificity and sensitivity using the training and test data for full sequence (Sp specificity, Sen sensitivity, Tr training set, Ts test set, Full full sequence, ANN artificial neural network, Lin linear, Pol-2 polynomial of degree 2, Pol-3 polynomial of degree 3, RBF radial basis function, Sig sigmoid)

Web Implementation

The best models for the N-terminal, the C-terminal and the full sequence, selected with the above evaluation criteria, were implemented and made available at http://webapp.cabgrid.res.in/fishamp/. This user-friendly server is written in PHP and JavaScript, with an interface composed of dynamic Web pages driven by user input, and is served by Apache, an open source Web server. The server has six menus, viz. Home, Algorithm, Submission, Links, Tutorial and Team. The Submission page lets the user generate results from input options such as “Input Sequence” or “Upload File” and select one of the radio buttons captioned C-terminus, N-terminus and Full Sequence. The Tutorial page provides a “User guide” that helps the user work with the server effectively.

Conclusions

ANN and SVM methodologies have been used successfully for species-specific prediction of antimicrobial peptides in fishes for the first time. This is the world’s first AMP prediction server for fishes, and it is freely accessible at http://webapp.cabgrid.res.in/fishamp/ to the global research community. The methodology has an accuracy of 97 % for fish-specific AMP prediction of unknown peptides. This in silico approach can drastically reduce the time and cost of AMP discovery without compromising accuracy. The server can accelerate the discovery of lead AMP molecules with potential applications in diverse areas such as fish and human health, as substitutes for antibiotics, immunomodulators, antitumor agents, vaccine adjuvants and inactivators, and in packaged food.

Availability and Requirements

The antimicrobial peptide prediction tool for aquaculture industries is freely accessible for research purposes for nonprofit and academic organizations at http://webapp.cabgrid.res.in/fishamp/.