Abstract
Sequence alignment is a common method for finding structurally conserved or similar regions in proteins. However, it is often inaccurate when the sequence identity between the sequences to be aligned is less than 30%. The reason is that, in such sequences, different residues may play similar structural roles, and these residues are aligned incorrectly when the alignment uses a substitution matrix over all 20 residue types. Based on the similarity of their physicochemical features, residues can be clustered into a few groups. With such simplified alphabets, the complexity of protein sequences is reduced while the key information encoded in the sequences is retained. As a result, the accuracy of sequence alignment might be improved if the residues are properly clustered. Here, using a database of aligned protein structures (DAPS), a new clustering method based on substitution scores is proposed for grouping residues, and substitution matrices are constructed at different levels of simplification. The validity of the reduced alphabets is confirmed by relative entropy analysis. The reduced alphabets are then applied to the recognition of structurally conserved or similar regions in proteins by sequence alignment. The results indicate that the accuracy or efficiency of sequence alignment can be improved with the optimal reduced alphabet, with N (the number of residue groups) around 9.
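To make the idea of a reduced alphabet concrete, the following minimal Python sketch maps a sequence onto residue groups and computes the relative entropy of aligned-pair frequencies before and after grouping, using the standard definition H = Σ q_ij log2(q_ij / (p_i p_j)). It is an illustration under stated assumptions, not the clustering method of this article: the grouping `EXAMPLE_GROUPS`, the function names, and the toy pair frequencies are all hypothetical and are not taken from DAPS.

```python
import math

# Hypothetical grouping, for illustration only (NOT the grouping derived in the article).
EXAMPLE_GROUPS = ["AVLIMC", "FWY", "ST", "NQ", "DE", "KRH", "G", "P"]


def build_mapping(groups):
    """Map each residue to a representative letter (the first letter of its group)."""
    return {res: group[0] for group in groups for res in group}


def reduce_sequence(seq, mapping):
    """Rewrite a sequence in the reduced alphabet; unknown symbols become 'X'."""
    return "".join(mapping.get(res, "X") for res in seq.upper())


def merge_pair_freq(pair_freq, mapping):
    """Sum aligned-pair frequencies q_ij over all residues falling in the same groups."""
    merged = {}
    for (a, b), q in pair_freq.items():
        key = (mapping[a], mapping[b])
        merged[key] = merged.get(key, 0.0) + q
    return merged


def relative_entropy(pair_freq):
    """H = sum_ij q_ij * log2(q_ij / (p_i * p_j)), with marginals taken from q itself."""
    left, right = {}, {}
    for (a, b), q in pair_freq.items():
        left[a] = left.get(a, 0.0) + q
        right[b] = right.get(b, 0.0) + q
    return sum(
        q * math.log2(q / (left[a] * right[b]))
        for (a, b), q in pair_freq.items()
        if q > 0.0
    )


# Toy aligned-pair frequencies over four residues (made-up numbers summing to 1).
toy_q = {
    ("A", "A"): 0.20, ("A", "V"): 0.05, ("V", "A"): 0.05, ("V", "V"): 0.20,
    ("D", "D"): 0.20, ("D", "E"): 0.05, ("E", "D"): 0.05, ("E", "E"): 0.20,
}

mapping = build_mapping(EXAMPLE_GROUPS)
reduced_seq = reduce_sequence("AVDE", mapping)
h_full = relative_entropy(toy_q)
h_reduced = relative_entropy(merge_pair_freq(toy_q, mapping))
print(reduced_seq, round(h_full, 3), round(h_reduced, 3))
```

Merging letters can never increase this relative entropy, so comparing `h_reduced` with `h_full` gives a rough indication of how much of the pairing information a given grouping retains.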
Additional information
Supported by the National Natural Science Foundation of China (Grant Nos. 90403120, 10474041 and 10021001) and the Nonlinear Project (973) of the NSM
Cite this article
Li, J., Wang, W. Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids. SCI CHINA SER C 50, 392–402 (2007). https://doi.org/10.1007/s11427-007-0023-3
DOI: https://doi.org/10.1007/s11427-007-0023-3