1 Introduction

Cytochrome P450s (CYPs) are phase I biotransformation enzymes that catalyze the biotransformation of endogenous substrates and xenobiotics and account for the metabolism of about 75 % of clinically relevant drugs [1–3]. Most drug metabolism is mediated by only a few members of the CYP1, CYP2 and CYP3 families, mainly CYP3A4, 2D6, 2C9, 2C19, 1A1, 1A2, 1B1 and 2E1 [4, 5]. Identifying the CYP isoform that will preferentially interact with a new chemical entity (NCE) is part of the drug development process, as it helps to reduce adverse effects and increase the therapeutic efficacy of the molecule. It is therefore of prime concern to the pharmaceutical industry to anticipate whether it is worthwhile spending time and money taking a candidate through the drug development pipeline [6]. In the face of these practical needs, great efforts have been made to understand CYP structure and function. The general view that the function of an enzyme is determined by its structure applies well to the CYPs, and understanding the structure–function relationships may reveal information about metabolism by a specific CYP isoform [2]. Approaches to predicting CYP substrate specificity include experimental methods and in silico methods that employ various molecular modeling methodologies. The major benefit of in silico approaches is that they allow prediction without experimental determination and are thus particularly favored in the earlier stages of drug development [2].

Among the xenobiotic-metabolizing cytochrome P450s, CYP1 isoforms are known to metabolically activate pro-carcinogenic environmental chemicals, toxins and toxic drugs. The isoforms of this family share high sequence identity but display varying substrate specificity. CYP1A1 and CYP1A2 are approximately 70 % identical in sequence and share about 40 % sequence identity with CYP1B1 [7]. Among the CYP families identified thus far, the CYP1A subfamily appears to be the most highly conserved. CYP1A1, CYP1A2 and CYP1B1 contribute approximately 7, 10 and 3 % to the overall metabolism of drugs and approximately 20, 17 and 11 % to carcinogen activation reactions, respectively [8]. CYP1A2 is expressed mainly in the liver (~13–15 %), whereas CYP1A1 and CYP1B1 are expressed extra-hepatically. CYP1A1 and CYP1B1 are over-expressed in tumor tissues; the risk of cancer depends on the type of carcinogen and the length of exposure, and tissues such as skin, kidney, mammary gland and prostate are at major risk. Nutrition, drugs, genetic polymorphisms, environmental influences and smoking affect the level of participation of these isoforms in the metabolism of xenobiotics [9]. The aryl hydrocarbon receptor (AhR), a PAS domain protein, regulates the members of the CYP1 family: it binds the CYP1A1, CYP1A2 and CYP1B1 genes, which leads to their transcription and translation [10]. CYP1A1 preferentially metabolizes benzo[a]pyrene and other polycyclic aromatic hydrocarbons (PAHs) to their toxic derivatives, whereas CYP1A2 shows a preference for the oxidation of heterocyclic aromatic amines [11]. CYP1A1 is specifically responsible for the bioactivation of anticancer compounds such as 2-(4-amino-3-methylphenyl)-5-fluoro-benzothiazole (5F 203) and 5-aminoflavone (5-AF) [11–14].
The crystal structure of CYP1A2 reveals a compact, closed and narrow active site cavity that is highly adaptable for the positioning and oxidation of relatively large and planar substrates [15]. The structures of the three isoforms reveal distinct active site features that underlie the functional variability of these enzymes. Elucidating the differences in substrate specificity among CYP1A1, CYP1A2 and CYP1B1 has therefore been a matter of interest, as is evident from the numerous studies conducted in this regard.

The work presented here identifies and elucidates the structural features of the proteins that are responsible for differences in substrate specificity among the three isoforms of the CYP1 family of enzymes.

2 Materials and Methods

2.1 Selection of Protein Structure

The 3D structures of CYP1A1 (PDB ID: 4I8V) [10], CYP1A2 (PDB ID: 2HI4) [15] and CYP1B1 (PDB ID: 3PMO) [16], each co-crystallized with the inhibitor α-naphthoflavone, were retrieved from the Protein Data Bank (PDB).

2.2 Sequence Alignment

Identifying similarities and differences in the amino acid sequences of the three isoforms is one of the keys to characterizing substrate specificity. These differences are best revealed by aligning the three sequences with one another. The sequences of CYP1A1, CYP1A2 and CYP1B1 were retrieved from the UniProt Knowledgebase (UniProtKB) under the UniProt IDs P04798, P05177 and Q53TK1, respectively. The sequences and protein structures were first aligned in 3D space and visualized using PyMOL (The PyMOL Molecular Graphics System, Version 1.5.0.4, Schrödinger, LLC) to identify the similarities and differences among the three sequences, taking CYP1A1 as the reference.

2.3 Topology of CYP1 Family of Enzymes

Since the crystal structure of CYP1A1 was published only recently, its 3D topology and substrate recognition sequences/sites (SRSs) have not yet been characterized. Both these features are essential for explicating isoform specificity. Here we provide the 3D topology of CYP1A1 with its secondary structure elements and substrate recognition sequences (SRSs). The nomenclature common to the CYP family is used to name the secondary structure elements.

2.4 Comparative Crystal Structure Analysis

Average and residue-wise RMSD values, as well as B-factors, were computed for the structures of the three isoforms, CYP1A1 (PDB ID: 4I8V), CYP1A2 (PDB ID: 2HI4) and CYP1B1 (PDB ID: 3PMO), keeping the CYP1A1 structure as the reference. RMSD values were obtained for active site residues (around 8 Å) for (1) CYP1A1 with respect to CYP1A2, (2) CYP1A1 with respect to CYP1B1 and (3) CYP1A2 with respect to CYP1B1. All structures were superposed onto CYP1A1 (PDB ID: 4I8V), and the coordinates of the proteins were updated with the new positions acquired after superposition. Residue-wise and average RMSD values for all active site residues were calculated using formula (1), where v and w are the two sets of n superposed atomic coordinates being compared.

$${\text{RMSD}}({\text{v}},{\text{w}}) = \sqrt {\frac{1}{n}\sum\nolimits_{i = 1}^{n} {((v_{ix} - w_{ix} )^{2} + (v_{iy} - w_{iy} )^{2} + (v_{iz} - w_{iz} )^{2} )} }$$
(1)
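As an illustration of formula (1), the following minimal Python sketch computes the RMSD for two already superposed coordinate sets and then applies it residue by residue. It assumes that the atoms of each aligned residue pair have been matched one to one (for example, by using only backbone or C-alpha atoms); the function names, the pair labels and the coordinates are hypothetical and are not taken from the 4I8V, 2HI4 or 3PMO structures.

```python
import math

def rmsd(v, w):
    """RMSD between two equal-length lists of (x, y, z) coordinates, as in formula (1)."""
    if len(v) != len(w) or not v:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    total = 0.0
    for (vx, vy, vz), (wx, wy, wz) in zip(v, w):
        total += (vx - wx) ** 2 + (vy - wy) ** 2 + (vz - wz) ** 2
    return math.sqrt(total / len(v))

def residue_wise_rmsd(ref, other):
    """Per-pair RMSD for two dicts mapping a residue-pair label to matched atom coordinates."""
    return {pair: rmsd(ref[pair], other[pair]) for pair in ref if pair in other}

# Hypothetical, already superposed coordinates (invented for illustration only).
ref = {"pair_1": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
       "pair_2": [(2.0, 0.0, 0.0)]}
other = {"pair_1": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
         "pair_2": [(2.0, 3.0, 4.0)]}
per_residue = residue_wise_rmsd(ref, other)
```

In this toy input the first pair coincides exactly and the second is displaced by 3 Å and 4 Å along two axes, so the per-pair values are 0 Å and 5 Å. An average RMSD over the whole active site can be obtained by passing the concatenated coordinate lists to `rmsd`.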

3 Results

3.1 Sequence Alignment

Identifying similarities and differences in the amino acid sequences is one of the keys to characterizing substrate specificity. These differences are best revealed by the sequence alignment shown in Fig. 1. The alignment shows that the three sequences vary in length: CYP1A1, CYP1A2 and CYP1B1 have 512, 513 and 525 amino acids, respectively. Fig. 1 also shows the secondary structure topology of CYP1A1 and the identified substrate recognition sequence (SRS) regions. CYP1A1 shares 370/506 identical amino acid residues (73 % sequence identity) with CYP1A2. The sequence identity of CYP1A1 and CYP1A2 with CYP1B1 is approximately 40 % in both cases (193/466 and 186/468 identical amino acid residues, respectively).
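The identity figures above are ratios of identical residues to compared alignment columns. A minimal sketch of that calculation is given below; it assumes that columns containing a gap in either sequence are excluded from the count, which is one common convention and not necessarily the one used to produce the reported values. The function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return (identical, compared, percent) for two rows of one alignment.

    Columns with a gap in either row are skipped (an assumed convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    identical = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            identical += 1
    percent = 100.0 * identical / compared if compared else 0.0
    return identical, compared, percent

# Toy aligned fragments, not the real CYP1 sequences.
identical, compared, percent = percent_identity("MLFP-ISA", "MAFPQISE")
```

For the toy fragments, five of the seven gap-free columns match. Applying the same ratio to the reported counts, 370/506 corresponds to about 73 %, and 193/466 and 186/468 both correspond to roughly 40 %.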

Fig. 1

Sequence alignment of the three isoforms taking CYP1A1 as the reference structure. The six putative substrate recognition sequences (SRSs) are shown in blue boxes. Residues are colored according to the ClustalX2 color coding. Identical residues are marked with an asterisk above the alignment, while similar and less similar residues are marked with a colon and a dot, respectively. A black box marks every 50th amino acid residue in each of the three isoforms (Color figure online)

A primary question is which parts of the CYP1 proteins are involved in substrate recognition and binding and hence play an important role in substrate specificity. The diverse ligand specificity of the CYP1 family of enzymes can be attributed to highly variable sequence intervals among the isoforms, residues located at the active site cavity and the substrate recognition sequences (SRSs). Six putative SRSs were identified along the structure; together they constitute approximately 16 % of the total residues. The SRS regions are of prime concern because they play a decisive role in substrate identification and binding. The amino acid residues covered by the six SRS regions of the three isoforms are shown in Table 1. Comparison of the three sequences shows a high degree of variability between the SRS regions of CYP1A1 and CYP1A2 on the one hand and those of CYP1B1 on the other (Fig. 1 and Table 2).

Table 1 SRS regions in CYP1A1, CYP1A2, and CYP1B1
Table 2 Residues which are similar and differing in case of CYP1A1, CYP1A2, and CYP1B1
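The grouping of aligned positions into residues that are identical in all three isoforms, identical in two of them, or different in all three can be expressed as a simple rule. The sketch below is illustrative only: the function name and labels are hypothetical, and the four example columns are SRS1 positions taken from the description in this section.

```python
def classify_column(r1a1, r1a2, r1b1):
    """Classify one aligned column of CYP1A1, CYP1A2 and CYP1B1 residues."""
    if r1a1 == r1a2 == r1b1:
        return "identical in all three"
    if r1a1 == r1a2:
        return "identical in CYP1A1 and CYP1A2"
    if r1a1 == r1b1:
        return "identical in CYP1A1 and CYP1B1"
    if r1a2 == r1b1:
        return "identical in CYP1A2 and CYP1B1"
    return "different in all three"

# Aligned SRS1 columns as (CYP1A1, CYP1A2, CYP1B1) residue types.
columns = [
    ("ARG", "ARG", "ARG"),  # first residue of SRS1
    ("ASP", "ASP", "SER"),  # ASP (108/110) versus SER (119)
    ("PHE", "SER", "PHE"),  # PHE (112), SER (114), PHE (123)
    ("ASN", "ASP", "GLY"),  # ASN (117), ASP (119), GLY (128)
]
labels = [classify_column(*column) for column in columns]
```

Each column falls into exactly one class, which makes it straightforward to tabulate the positions that distinguish CYP1B1 from the two CYP1A isoforms.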

SRS1 starts with ARG (106), a basic amino acid, and ends with PHE (134), a non-polar amino acid, in all three protein sequences. Apart from these two residues, three amino acids are identical in SRS1: PRO (107) at the 2nd position, and GLY (118) and SER (120) at the 13th and 15th positions, respectively. CYP1A1 and CYP1A2 share a larger proportion of identical residues, but the orientation of the residues differs in 3D space, which is a key factor in substrate binding. Comparing CYP1A1 and CYP1A2 with CYP1B1 reveals high variability in both sequence and orientation. For example, the amino acid residues ASP (108/110), LEU (109/111), TYR (110/112), THR (111/113), THR (113), LEU (114), ILE (115) and GLN (116) are identical in CYP1A1 and CYP1A2, whereas CYP1B1 has SER (119), PHE (120), ALA (121), SER (122), ARG (124), VAL (125), VAL (126) and ARG (130) in their place (the numbers in brackets indicate the position of the residue in CYP1A1 followed by that in CYP1A2; the positions differ in CYP1A2 and CYP1B1). The residues that differ between CYP1A1 and CYP1A2 are PHE (112) and SER (114), SER (116) and THR (118), ASN (117) and ASP (119), MET (122) and LEU (124), and SER (123) and THR (125), respectively, as shown in Fig. 1. Residues identical in CYP1A1 and CYP1B1 but differing in CYP1A2 are PHE (112/123), which is SER (114) in CYP1A2, and MET (122/133), which is LEU (124) in CYP1A2, as shown in Fig. 1. Residues that differ in all three isoforms include ASN (117), ASP (119) and GLY (128), as well as SER (123), THR (125) and ALA (133), in CYP1A1, CYP1A2 and CYP1B1, respectively. SRS1 covers the highest number of residues of the six SRSs.

SRS2 contains two residues that are identical in all three isoforms: LEU (217), LEU (219) and LEU (224), and PHE (224), PHE (226) and PHE (231), in CYP1A1, CYP1A2 and CYP1B1, respectively. The residues VAL (218/220) and GLU (226/228) are conserved in CYP1A1 and CYP1A2, whereas CYP1B1 has LEU (225) and ARG (233) in their place, as shown in Fig. 1. The positions that differ in all three protein sequences are ASN (219) of CYP1A1, aligned with LYS (221) of CYP1A2 and SER (226) of CYP1B1, and ASN (222) of CYP1A1, aligned with HIS (224) of CYP1A2 and GLU (229) of CYP1B1, as shown in Fig. 1.

SRS3 has PHE residues clustered at three locations, the only residue type that is identical and aligned in all three protein sequences: PHE (251), PHE (258) and PHE (261) in CYP1A1; PHE (253), PHE (260) and PHE (263) in CYP1A2; and PHE (261), PHE (268) and PHE (271) in CYP1B1. Hence this part of SRS3 does not contribute significantly to substrate specificity but has implications for ligand binding. LYS (252) in CYP1A1 and LYS (254) in CYP1A2 are aligned, but CYP1B1 has GLU (262) at this position, which may contribute to the difference in substrate specificity between CYP1A1, CYP1A2 and CYP1B1. LEU (254) in CYP1A1 and LEU (264) in CYP1B1 are identical, whereas CYP1A2 has PHE (256) at this position. This may increase pi–pi stacking interactions with ligands in CYP1A2 compared with CYP1A1 and CYP1B1. The remaining aligned positions are more diverse among the three proteins: ASP (253), GLU (256), LYS (257), TYR (259), SER (260) and MET (262) in CYP1A1; ALA (255), GLN (258), ARG (259), LEU (261), TRP (262) and LEU (264) in CYP1A2; and GLN (264), ARG (267), ASN (268), SER (270), ASN (271) and ILE (273) in CYP1B1. This diversity may be a reason for the differences in substrate specificity among the three proteins, as shown in Fig. 1.

SRS4 lies within the I helix, the longest helix, which sits directly above the heme in the 3D structure and is subject to changes in flexibility. This SRS comprises nearly 17 residues, of which 7 (about 40 %; ASP, PHE, GLY, ALA, ASP, THR, THR) are identical and aligned in all three protein sequences, as shown in Fig. 1. The amino acid residues ASN (309/309), VAL (311/311), GLY (318/318), PHE (319/319), VAL (322/322) and THR (323/323) are identical in CYP1A1 and CYP1A2, whereas CYP1B1 has ALA (322), ILE (324), SER (331), GLN (332), LEU (335) and SER (336) at these positions. The positions that differ among all three proteins are ILE (308), VAL (308) and PRO (321); ILE (310), LEU (310) and THR (323); and LEU (312), ASN (312) and THR (325), in CYP1A1, CYP1A2 and CYP1B1, respectively; these may be important in substrate recognition and specificity.

SRS5 comprises only 6 residues. In all three proteins it starts with PHE (PHE (381) in CYP1A1, PHE (381) in CYP1A2 and PHE (395) in CYP1B1) and ends with ILE (ILE (386) in CYP1A1, ILE (386) in CYP1A2 and ILE (399) in CYP1B1). VAL (382) in CYP1A1 and VAL (395) in CYP1B1 are identical but are aligned with LEU (382) in CYP1A2, as shown in Fig. 1. As VAL and LEU are similar residues, this difference may not have a significant effect on substrate specificity, but it contributes to the binding of hydrophobic substrates.

SRS6 has the highest proportion of identical residues, 5 out of 7. ILE (493) in CYP1A1 and ILE (494) in CYP1A2 are identical but are aligned with SER (506) in CYP1B1. Likewise, MET (498) in CYP1A1 and MET (499) in CYP1A2 are identical but are aligned with ILE (511) in CYP1B1. The CYP1B1 residues have different properties from the aligned residues of CYP1A1 and CYP1A2: SER in CYP1B1 is polar whereas the aligned ILE is non-polar, and ILE in CYP1B1 is a branched aliphatic residue whereas the aligned MET has a sulfur-containing side chain. Such differing residues are proposed to play a crucial role in substrate specificity.

3.2 Topology of CYP1 Family of Enzymes

The enzymes of the CYP1 family display a similar CYP structural fold. The isoforms of the CYP1 family nevertheless possess structural features that are quite distinct from the architecture of the CYP2 and CYP3 families of enzymes [10]. Because the crystal structure of CYP1A1 was only recently deposited in the Protein Data Bank (PDB ID: 4I8V), its 3D topology and substrate recognition sequences have not yet been characterized. Both these features are essential for explicating isoform specificity. Here we provide the 3D topology of CYP1A1 with its secondary structure elements and six substrate recognition sequence (SRS) regions (Table 3), as shown in Fig. 2. The nomenclature common to the CYP family is used to name the secondary structure elements. CYP1A1 has 512 amino acid residues in its sequence. The structure starts at the 36th residue (GLY36) at the N-terminus and ends at the 512th residue (SER512) at the C-terminus. It possesses 19 helices, labeled from A to L, eight beta-sheets, and loops connecting the various elements. Figure 2 depicts the topology of CYP1A1 with the SRS regions (SRS1 to SRS6) and the various secondary structure elements marked.

Table 3 Substrate recognition sequences (SRSs) of CYP1A1
Fig. 2

Topology of CYP1A1 along with the secondary structure elements. Substrate recognition sequences (SRSs) are marked in blue colored boxes from SRS1 to SRS6 (Color figure online)

3.3 Comparative Crystal Structure Analysis

Identifying similarities and differences in macromolecular features is one of the main issues in characterizing differences in substrate specificity. A comparative crystal structure analysis of the three isoforms was therefore carried out to identify the similarities and differences in the protein structures, using RMSD and B-factor analysis.

3.3.1 RMSD Analysis

The average and residue-wise RMSD for the active site (binding pocket) of the three isoforms, CYP1A1, CYP1A2 and CYP1B1, was computed keeping CYP1A1 as the reference structure. Table 4 shows that the active site residues differ among the three isoforms in type, nature and location. These residues could account for differences in the type of substrate metabolized by the three isoforms and thus for their different substrate specificity profiles.
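Selecting the active site residues that enter such an analysis can be sketched as a distance filter. The example below assumes that the cutoff of about 8 Å given in Section 2.4 is measured from the atoms of the bound ligand, which the text does not state explicitly; the function name, the residue numbers and the coordinates are hypothetical.

```python
import math

def residues_within(protein_atoms, ligand_atoms, cutoff=8.0):
    """Residue numbers with at least one atom within `cutoff` Å of any ligand atom.

    protein_atoms: iterable of (residue_number, (x, y, z)) tuples.
    ligand_atoms: iterable of (x, y, z) tuples.
    """
    ligand_atoms = list(ligand_atoms)
    selected = set()
    for resnum, xyz in protein_atoms:
        if resnum in selected:
            continue  # residue already accepted through another atom
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_atoms):
            selected.add(resnum)
    return sorted(selected)

# Invented coordinates: residues 10 and 11 lie within 8 Å of the ligand atom, 12 does not.
protein = [(10, (0.0, 0.0, 0.0)), (10, (1.0, 0.0, 0.0)),
           (11, (7.9, 0.0, 0.0)), (12, (20.0, 0.0, 0.0))]
ligand = [(0.0, 0.0, 0.0)]
pocket = residues_within(protein, ligand)
```

Because a residue is accepted as soon as any one of its atoms is within the cutoff, large side chains that only partly line the pocket are still included; a stricter definition (for example, C-alpha atoms only) would give a smaller set.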

Table 4 Residues showing higher RMSD fluctuations

3.3.2 RMSD Analysis of CYP1A1 and CYP1A2

RMSD values for most residues were below 2.00 Å, except for three pairs of residues, as shown in Fig. 3. A few stretches of residues, such as the ASN221:THR223 and PHE258:PHE260 pairs of CYP1A1 and CYP1A2, displayed large RMSD values because of differences in orientation; these residues extend from the F helix to the I helix in the protein structure (Table 5). The residues that differ between the two proteins display the largest RMSD values, as is evident from the RMSD plot. The amino acid pairs with the largest RMSD values are listed in Table 6. The B–C loop comprises SRS1; it contains the identical residues ILE115, SER120 and PHE123 of CYP1A1, which are aligned with ILE117, SER122 and PHE125 of CYP1A2 and show low RMSD values. The remaining active site amino acids have values below 2.00 Å.

Fig. 3

RMSD for active site residues of three isoforms

Table 5 Comparative residue-wise analysis of CYP1A1, CYP1A2, and CYP1B1
Table 6 Active site residues showing higher B-factor values than average B-factor

3.3.3 RMSD Analysis of CYP1A1 and CYP1B1

The sequence identity between CYP1A1 and CYP1B1 is approximately 40 %. The amino acid pair ASN221 of CYP1A1 and ASN228 of CYP1B1, located in SRS2, showed the largest RMSD (Fig. 3) because these residues lie in a distorted region of the F helix in both proteins (Table 6). These residues might play a crucial role in substrate binding. In addition, the residue pairs MET121:MET132, LEU312:THR325 and PHE319:GLU332 also showed significant RMSD values and could affect the differential substrate binding potential of the proteins.

3.3.4 RMSD Analysis of CYP1A2 and CYP1B1

The sequence similarity between CYP1A2 and CYP1B1 is quite low, only about 38 %, so major differences in the active site residues are evident. Because of this low sequence similarity, the two isoforms show different preferences for the substrates they metabolize. RMSD analysis of the two isoforms showed that the residue pair THR118:SER127 (CYP1A2:CYP1B1), located in SRS1, has the largest deviation (2.7 Å; Table 6). The RMSD values of the active site residues are shown in Fig. 3.

3.3.5 B-factor Analysis of CYP1A1, CYP1A2 and CYP1B1

The B-factor of an atom in a crystal structure reflects how far it is displaced from its average position. High B-factors indicate higher mobility or flexibility of residues in the crystal structure. The average and C-alpha B-factors of the active site residues were examined closely to observe deviations in B-factor. The analysis revealed that the overall B-factors of the active site residues of CYP1A1 are considerably higher than those of the other two isoforms, followed by CYP1B1 and then CYP1A2, as shown in Fig. 4. It can be inferred that the active site of CYP1A1 is the most flexible among the three isoforms of the CYP1 family of enzymes, which might be an important factor for substrate specificity. The residues in the CYP1A1 active site that form part of a helix showed B-factor values higher than the overall average (Table 6). The most important and variable regions that could govern the access and egress of the substrate to and from the active site cavity are the I helix, the G helix, the loop connecting the H helix to the I helix, certain portions of the E′ and F helices and the loop connecting them, and the B′ and C helices and the loop encompassing them [10]. In CYP1A1, the residues ASN221, ASP320, LYS499 and LEU254 show B-factor values higher than the average B-factor (47.86). In the CYP1A2 structure, residues THR118, ASP320, THR321, LEU382 and ILE386 displayed B-factor values higher than the overall average (23.44), while residues VAL126, SER127, ASN265 and THR510 of CYP1B1 displayed B-factor values higher than the overall average (39.81). These residues deserve attention because of their high flexibility, and they might govern the interactions with substrates in the CYP active site. In all three isoforms, the B-factor values of all other active site residues were lower than the overall average.
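The comparison of residue B-factors with an average can be sketched as follows. This is a minimal illustration that reads C-alpha B-factors from fixed-column PDB `ATOM` records and flags the residues above the mean of the selected set. Chain `A`, the choice of averaging over the selected residues only, the function names and the demonstration records are all assumptions made for the example; the averages reported above may have been computed differently.

```python
def ca_bfactors(pdb_lines, residue_numbers, chain="A"):
    """Map residue number -> C-alpha B-factor for selected residues from PDB ATOM records."""
    wanted = set(residue_numbers)
    result = {}
    for line in pdb_lines:
        if not line.startswith("ATOM") or len(line) < 66:
            continue
        # PDB fixed columns: atom name 13-16, chain 22, residue number 23-26, B-factor 61-66.
        if line[12:16].strip() != "CA" or line[21] != chain:
            continue
        resnum = int(line[22:26])
        if resnum in wanted:
            result[resnum] = float(line[60:66])
    return result

def above_average(bfactors):
    """Return (mean, residue numbers whose B-factor exceeds the mean)."""
    if not bfactors:
        raise ValueError("no B-factors supplied")
    mean = sum(bfactors.values()) / len(bfactors)
    return mean, sorted(r for r, b in bfactors.items() if b > mean)

# Hypothetical ATOM records in fixed-column layout (all values are invented).
ATOM_FMT = "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f"
demo_lines = [
    ATOM_FMT % (1, " CA ", "ASN", "A", 221, 0.0, 0.0, 0.0, 1.0, 60.0),
    ATOM_FMT % (2, " CA ", "ASN", "A", 222, 3.8, 0.0, 0.0, 1.0, 40.0),
    ATOM_FMT % (3, " CA ", "THR", "A", 223, 7.6, 0.0, 0.0, 1.0, 50.0),
]
mean_b, flexible = above_average(ca_bfactors(demo_lines, [221, 222, 223]))
```

With a real structure file, `pdb_lines` would be the lines of the downloaded PDB entry and `residue_numbers` the active site selection. Records with alternate locations simply overwrite one another in this sketch, so a production analysis would need to choose one conformer explicitly.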

Fig. 4

B-factors for active site residues of the three isoforms (CYP1A1, CYP1A2 and CYP1B1)

4 Discussion

The present study concentrates on elucidating the protein structural features that account for differences in substrate specificity profiles among the three isoforms CYP1A1, CYP1A2 and CYP1B1. As part of the protein structure analysis, we aligned the sequences of the three isoforms to identify similarities and differences in the amino acid sequences and characterized the substrate recognition sequences (SRSs), which are essential for substrate binding, recognition and differential substrate specificity profiles. This serves as a primary screen that narrows the analysis to a handful of amino acid residues. Analyzing the amino acid residues that form part of the active site thus enables us to identify differences in amino acid residues within the same family. These residues are located on the periphery of the active site cavity and form a gate for the access and egress of the substrate; they are hence responsible for substrate binding and recognition. From the topology analysis and sequence alignment, we found that the amino acid residues forming part of SRS1, SRS2 and SRS3 are flexible and hence more prone to undergo changes upon entry and exit of substrates. The topology of CYP1A1 helps to identify the regions of the protein structure and the actual architecture of the protein in 3D space, which plays an essential role in substrate recognition, binding, access and egress. RMSD and B-factor analysis helped us to identify the amino acid residues that differ among the three isoforms and show the largest RMSD values. In the SRS regions, seven residue pairs for CYP1A1 and CYP1A2, six for CYP1A1 and CYP1B1, and seven for CYP1A2 and CYP1B1 showed high RMSD values. These residues form part of the B′ helix, the F, G or I helix and the loop connecting the B′ helix to the C helix; these regions form SRS1 to SRS4 in the protein structure. From the B-factor analysis, it was concluded that the residues that differ showed the greatest flexibility and mobility. In CYP1A1, CYP1A2 and CYP1B1, four, five and four residues, respectively, showed the highest B-factor values. These residues also form part of the B′ helix, the F, G or I helix and the loop connecting the B′ helix to the C helix, regions that cover SRS1 to SRS4 in the protein structure.

5 Conclusions

In the present study, we have investigated the reasons for the differences in substrate specificity profiles among the three isoforms of the cytochrome P450 1 family. A comparative protein structure analysis was carried out to elucidate these reasons, and it can be concluded that differences in the key active site residues of the three isoforms could be responsible for their substrate specificity profiles.

Sequence alignment is one of the main tools for identifying the similarities and differences in the amino acid sequences of the three isoforms. Here it revealed the regions of similarity and difference and served as a primary analysis that identified the differentiating residues among the three isoforms; it can thus give a more realistic view of the residues to concentrate on when explicating the differences in substrate specificity. The substrate recognition sequence (SRS) regions, which are implicated as essential for ligand binding and for governing substrate specificity, have been characterized for the three isoforms. This identification helps to concentrate on only those residues that form part of the active site cavity and are essential for ligand binding and substrate specificity. Using the insights from the sequence alignment, a comparative crystal structure analysis of the three isoforms was carried out with CYP1A1 as the reference structure, comprising RMSD and B-factor analysis of the active site residues. The RMSD analysis covered the active site and the residues that form part of the SRSs. The residues that differ among the isoforms showed the largest RMSD values. B-factor analysis of the active site residues revealed that the overall B-factor follows the order CYP1A1 > CYP1B1 > CYP1A2. The RMSD and B-factor results agreed well: the residues with the largest RMSD values also displayed higher B-factor values. Since the crystal structure of CYP1A1 was published only recently, its 3D topology was characterized taking into account the overall global structure of CYPs and the interaction with heme, and the substrate recognition sequences were marked. Hence, this initial comparative protein structure analysis can help researchers to concentrate on the regions of the protein structure that explain the differences in substrate specificity and thereby obtain a more realistic view of the substrate specificity profile.