Introduction

Predicting the virulence of bacterial pathogens and their ability to cause disease is necessary for microbial risk assessment. Information on virulence is important for both the qualitative and the quantitative description of health outcomes. Thus, identification and characterization of virulence genes from whole genome sequences has become an important thrust area in recent years, with the ultimate aim of identifying novel drug targets, especially in light of pathogens rapidly acquiring drug resistance (Hasan et al. 2006).

The experimental approach of relating gene function to an organism's pathogenicity has its limits (Kuruvilla et al. 2002). Bioinformatics analysis can reveal the virulence potential of a genome-sequenced strain (Garg and Gupta 2008). A gene's contribution to phenotype is determined by the context of other genes present in the genome (Groth et al. 2008). Integration of expression data into sequence-based comparative analyses could thus provide new insights into the relation between genomic sequence and its function. If the gene expression patterns and associated gene networks are understood, we can better predict genes coding for virulence (Sridhar et al. 2007). It may even be possible to identify bacterial species that are not yet pathogenic but have the genetic repertoire to become so if particular genes or gene functions were acquired (Knell 2006). Gene expression pattern and network identification may thus become an important component of the identification and characterization of microbial hazards, including emerging pathogens, in the context of microbial risk assessment. The availability of microarray data for several bacterial organisms (Hubble et al. 2009) and the completion of the human genome project have advanced the study of host–pathogen interactions and drug discovery against threatening human pathogens. Genes that share similar expression patterns are assumed to be similar at the functional and metabolic level, performing the same type of metabolic function or participating in similar metabolic processes (van-Noort et al. 2003; Bergmann et al. 2004). In fact, computational methods are already in place to predict the role of the products of different genes as drug targets (Gray and Keck 1999; Sakharkar et al. 2004; Li and Lai 2007; Perumal et al. 2007; Sakharkar et al. 2008).

The present study aims to find new drug targets in Vibrio cholerae O1, the causative agent of cholera, considering that drug targets must be essential for the growth and viability of the pathogen and highly selective against the pathogen with respect to the human host (Galperin and Koonin 1999). The use of such assumptions in predicting a drug target is itself novel and has been tested here in V. cholerae, a Gram-negative human pathogen whose genome consists of two circular chromosomes of 2,961,146 bp and 1,072,314 bp. The V. cholerae O1 subgroup in particular has displayed a dynamic incidence pattern in the Indian subcontinent (Sur et al. 2006).

Methods

Resources

Known virulence genes were cataloged from published reports (Higgins et al. 1992; Lee et al. 1999; Camara et al. 2002; Zhu et al. 2002) and from the Virulence Factor Database (VFDB; http://www.mgc.ac.cn/VFs/; Yang et al. 2008). The sequences of these virulence genes were downloaded from the GenBank database of NCBI (ftp://ftp.ncbi.nlm.nih.gov). Microarray (cDNA) data for gene expression pattern analysis were downloaded from the Stanford Microarray Database (SMD; http://genome-www5.stanford.edu/). A list of essential genes of the pathogen was obtained from the Database of Essential Genes (DEG; http://tubic.tju.edu.cn/deg/; Zhang and Lin 2009).

Raw data processing and clustering

To predict virulence genes, we used gene expression data from different experiments available at SMD. Data corresponding to the control sequences, and to open reading frames (ORFs) whose expression values across the time points and conditions had a mean lower than 25%, were filtered out of the analysis. Analogous filtering schemes included accepting only those ORFs that showed at least a twofold change in expression level (Tamayo et al. 1999; Ulm et al. 2004). The remaining ORFs were clustered by gene expression with the K-means clustering algorithm using the Cluster tool (Eisen et al. 1998). As a rule of thumb, K = 500 was used for genes. Clusters containing at least one well-documented virulence gene were selected for further analysis, and all genes in those clusters were considered probable virulence genes.
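The twofold-change filter and the cluster-selection rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: it assumes expression values are log2 ratios (so a twofold change corresponds to a range of at least 1.0), it takes the cluster labels as already computed by some K-means implementation, and the function names and ORF identifiers are hypothetical.

```python
import math
from collections import defaultdict


def passes_fold_change(log2_ratios, min_fold=2.0):
    """Return True if the profile spans at least `min_fold` between its
    lowest and highest value (assumes log2 ratios; an assumption here)."""
    return max(log2_ratios) - min(log2_ratios) >= math.log2(min_fold)


def probable_virulence_genes(cluster_of, known_virulence):
    """Collect every gene that shares a cluster with a known virulence gene.

    cluster_of      -- dict mapping ORF id -> cluster label
    known_virulence -- set of ORF ids documented as virulence genes
    Returns (selected cluster labels, newly predicted ORF ids).
    """
    members = defaultdict(set)
    for orf, label in cluster_of.items():
        members[label].add(orf)

    # A cluster is selected if it contains at least one known virulence gene.
    selected = {label for label, orfs in members.items()
                if orfs & known_virulence}
    candidates = set()
    for label in selected:
        candidates |= members[label]
    return selected, candidates - known_virulence


# Toy data: hypothetical ORF ids with log2 expression ratios.
profiles = {
    "orfA": [0.1, 1.4, -0.2],   # wide range, kept
    "orfB": [0.0, 0.3, 0.2],    # narrow range, filtered out
    "orfC": [-1.0, 0.5, 0.9],   # wide range, kept
    "orfD": [2.0, 0.2, 0.1],    # wide range, kept
}
kept = {orf for orf, values in profiles.items() if passes_fold_change(values)}

# Cluster labels would normally come from K-means on the kept profiles.
cluster_of = {"orfA": 0, "orfC": 0, "orfD": 1}
selected, predicted = probable_virulence_genes(cluster_of, {"orfA"})
```

In this toy run, `orfC` is predicted because it shares cluster 0 with the known gene `orfA`, while `orfD` sits in a cluster without any known virulence gene and is not selected.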

Identification of drug targets

The selected genes were subjected to BLASTX (Altschul et al. 1997) against the Database of Essential Genes. An expectation value (E value) cutoff of 0.001 and a minimum bit-score cutoff of 100 were used as the baseline to identify the essential genes in the pathogen. These essential genes were then subjected to BLASTX against the human genome on the NCBI server (http://blast.ncbi.nlm.nih.gov/Blast.cgi). Genes with human homologs were excluded, and a list of nonhomologs was compiled. The finally selected genes were further analyzed using Blast2GO (Conesa et al. 2005) to predict the potential targets, i.e., surface proteins and participants in important metabolic pathways (Overington et al. 2006).
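As an illustration of how the E value and bit-score cutoffs might be applied, the sketch below filters BLAST hits supplied in the standard 12-column tabular format (`-outfmt 6` in BLAST+). It is a hedged example rather than the procedure actually used: the source does not state the output format or whether the cutoffs were inclusive, and the function name, gene ids, and hit values are hypothetical.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def queries_with_hits(tabular_text, max_evalue=1e-3, min_bitscore=100.0):
    """Return query ids having at least one hit that meets both cutoffs.

    Cutoffs are treated as inclusive here, which is an assumption.
    """
    reader = csv.DictReader(io.StringIO(tabular_text), fieldnames=FIELDS,
                            delimiter="\t")
    hits = set()
    for row in reader:
        if (float(row["evalue"]) <= max_evalue
                and float(row["bitscore"]) >= min_bitscore):
            hits.add(row["qseqid"])
    return hits


# Made-up hits: only geneA satisfies both cutoffs.
rows = [
    ("geneA", "DEG_x", "62.5", "200", "75", "0", "1", "600", "1", "200", "3e-40", "160"),
    ("geneB", "DEG_y", "30.1", "90", "60", "2", "10", "280", "5", "95", "0.5", "28.1"),
    ("geneC", "DEG_z", "45.0", "150", "80", "1", "1", "450", "1", "150", "1e-5", "85.0"),
]
text = "\n".join("\t".join(r) for r in rows)
essential = queries_with_hits(text)
```

Here `geneC` passes the E value cutoff but is rejected on bit score, which shows why the two thresholds are applied together.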

Results and discussion

Pharmaceutical research is under constant pressure to discover new antimicrobials because of the threat of resistance rapidly developing in target microbes. Identifying microbe-specific proteins to direct drug discovery and designing new drugs against previously known targets are the two popular means of combating this resistance. Modern tools of computational biology greatly enhance the speed and reliability of antimicrobial discovery. With the objective of identifying proteins potentially useful as drug targets, we relied on genomic data and a subtractive genomic approach. The results obtained using sequence and expression data of V. cholerae are presented below.

Gene expression data and virulence genes

The initial microarray dataset consisted of 5,760 ORFs, which was reduced to 5,222 after the control dataset and the ORFs not showing any variation across time points were eliminated. Further filtering based on the cutoffs described above reduced the number of ORFs to 4,169 at 16 time points at the end of the filtering stage. ORFs were arranged into clusters based on their gene expression profiles. All available ORFs could be divided into 500 clusters. Only clusters containing at least one well-documented virulence gene were selected for further analysis. A list of 155 virulence genes, prepared from previously published reports and from mining VFDB, was used as a reference for analysis of gene expression patterns in the clusters that contained them. K-means clustering helped identify genes that shared a similar gene expression pattern across experimental conditions and were thus considered probable virulence genes (van-Noort et al. 2003). This approach led to the identification of an additional 357 probable virulence genes. Thus, a total of 512 virulence-related genes were shortlisted. Each gene from these clusters was subjected to BLASTX against the Database of Essential Genes.

Virulence genes as drug targets

A similarity search of the virulence-related genes against DEG identified 102 of the 512 as also essential for V. cholerae, a large number compared with the experimentally reported essential genes (Higgins et al. 1992; Lee et al. 1999; Judson and Mekalanos 2000; Camara et al. 2002; Zhu et al. 2002). These genes were considered potential drug targets because of their essentiality for the growth and survival of the organism (Roemer et al. 2003).

To assess these 102 virulence genes as drug targets, we compared them with the human proteome and found that 66 of the 102 essential genes showed considerable similarity. Thus, only the 36 genes (Table 1) that did not show homology with human genes could be considered as drug targets (Sakharkar et al. 2004). These 36 genes were included in the gene ontology analysis using Blast2GO and were mapped and annotated as having important functions in the cell and critical locations inside or on the surface of the cell (Overington et al. 2006). More than 50% of the shortlisted genes were found to code for proteins involved in metabolic processes of biopolymers, including nucleic acids and proteins (Fig. 1). Another eight of the gene products were found to be involved in biosynthetic processes. The rest of the gene products were also found to be associated with vital metabolic processes of the cell. Blocking primary metabolic processes or the synthesis of macromolecules has direct implications for the discovery of new treatment strategies (Sharma et al. 2008; Sridhar et al. 2007). Furthermore, the largest number of gene products were found to carry hydrolase activity. In fact, hydrolase proteins and nucleic acid-binding proteins together constituted more than 50% of the present dataset (Fig. 2). Significantly, ten gene products were found to be associated with nucleotide, nucleobase, and nucleic acid metabolism processes, which corresponded to nine nucleotide-binding proteins (Fig. 2). Similarly, most of the gene products were mapped to intracellular locations, to some organelles, or to enzyme complexes (Fig. 3).
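The subtractive step and the gene counts reported in this section can be checked with a short Python sketch. The set difference below is only an illustration of the idea; the gene identifiers are placeholders sized to match the reported counts and do not correspond to real V. cholerae genes.

```python
def nonhomologous(essential_genes, genes_with_human_hits):
    """Subtractive step: essential genes without a human homolog."""
    return set(essential_genes) - set(genes_with_human_hits)


# Placeholder ids (hypothetical), sized to match the reported counts.
essential_genes = [f"gene{i:03d}" for i in range(102)]
genes_with_human_hits = essential_genes[:66]
targets = nonhomologous(essential_genes, genes_with_human_hits)

# Counts at each stage, taken from the text.
funnel = {
    "virulence-related": 155 + 357,            # known + newly predicted
    "essential (DEG hit)": len(essential_genes),
    "no human homolog": len(targets),
}
for stage, count in funnel.items():
    print(f"{stage}: {count}")
```

The three stages give 512, 102, and 36 genes, which agrees with the numbers stated in the text.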

Table 1 Putatively selected drug targets after eliminating those vitally essential virulence genes that shared homology to human genes

Fig. 1 Summary of the biological processes in which products of the predicted virulence genes are involved

Fig. 2 Molecular functions of the products of the predicted virulence genes

Fig. 3 Summary of the cellular components where products of the predicted virulence genes are localized

Clearly, all functions performed by these genes are very important for the growth and survival of V. cholerae. The gene products of VC2424, VC2425, and VCA0692 (Table 1) were considered the most appropriate drug targets because of their location at the cell surface and participation in cell wall synthesis (Hasan et al. 2006; Overington et al. 2006). The last of these has been implicated in the most vital processes of the cell (Table 1), such as replication, transcription, cellular defense (cell wall biosynthesis), and even pathogenicity. The other two (VC2424 and VC2425) encode surface antigens and are likely to be involved in invasion of the host and establishment of virulence. Because all three are present on the cell surface, they are readily accessible therapeutic targets.

Drug target identification is an important and sensitive first step of the drug discovery process, and a target must satisfy various selection criteria to pass to the next stage (Lipinski et al. 2001; Hefti 2008). Our results thus provide starting material for future drug discovery against V. cholerae, and we recommend that these 36 targets be experimentally validated. To the best of our knowledge, this is the first report on the use of both sequence analysis and gene expression data to identify putative drug targets in a pathogenic species, and both the results and the methods are likely to be of interest to scientists engaged in pharmaceutical research.

Conclusion

Various computational methods are available for the prediction of virulence genes and drug targets, but these methods mostly depend on either sequence-based homology or gene expression pattern analysis, and none of them is sufficient on its own for such purposes.

In this work, we used information on known virulence genes to predict other probable virulence genes from gene expression patterns, and then predicted their role as drug targets using a subtractive genomic approach. We found that the combination of these two approaches (gene expression pattern and sequence pattern) provides a very good method for finding virulence genes and assessing the role of their products as drug targets. We tested this approach on the sequence and gene expression data of V. cholerae and efficiently shortlisted 36 genes. The number of genes with drug target potential is low, but this set of 36 genes is likely to provide potent and convenient targets for new drugs against this pathogen.