Tautomerism in chemical information management systems

Warr, Wendy A.

doi:10.1007/s10822-010-9338-4

Published: 06 April 2010

Volume 24, pages 497–520, (2010)

Journal of Computer-Aided Molecular Design

Abstract

Tautomerism has an impact on many of the processes in chemical information management systems including novelty checking during registration into chemical structure databases; storage of structures; exact and substructure searching in chemical structure databases; and depiction of structures retrieved by a search. The approaches taken by 27 different software vendors and database producers are compared. It is hoped that this comparison will act as a discussion document that could ultimately improve databases and software for researchers in the future.

Introduction

Tautomerism has implications for many of the computational procedures used in drug discovery. The dividing lines between chemical information, cheminformatics and computational chemistry are by no means clear [1, 2] but this article aims to discuss only some “chemical information” aspects of tautomerism, namely novelty checking during registration into chemical structure databases; storage of structures; exact and substructure searching in chemical structure databases; and depiction of structures retrieved by a search. Implications of tautomerism in computational chemistry (for example in ligand preparation and property prediction) are deliberately not addressed.

The systems and databases that have been studied are listed in Table 1. The list is chosen to include most of the well-known software companies; it is by no means comprehensive where databases are concerned. Some software organizations are also missing: a few failed to reply to repeated requests for information, or had inadequate Web sites, and one or two may possibly have been overlooked. There are inter-relationships between the organizations and databases in Table 1; some of the software vendors sell databases built by other organizations, and database vendors have selected their preferred chemical information management packages from the software vendors.

Table 1 Software and database vendors

The aim of this article is not to provide a blow-by-blow account of every tautomer feature offered by every vendor, nor to make unfair comparisons among companies that serve very different markets (e.g., molecular modelers, chemical catalog companies and patent searchers). It should also be noted that vendors of “out-of-the-box software” are in a rather different position from vendors of software toolkits. The latter may have the advantage of being able to offer multiple options from which customers can pick and choose, although those customers will have to expend some resource in building their own systems. Vendors of “out-of-the-box” software may have to make decisions on behalf of a majority of customers, although in these days of open systems, it may be possible to plug in optional components.

Most of the facts in this article were collected by questioning software vendors and database producers individually. Some vendors were quicker than others in supplying detailed lists of chemical structures, but it should not be assumed that other vendors have made a more cursory assessment of the subject, nor should it be assumed that anyone’s solution is the one and only correct answer. This article should be viewed as a discussion document that could ultimately improve databases and software for researchers in the future.

Chemical structure representation

Although there are very many databases and organizations, the number of “standards” for chemical structure representation is much smaller. Common ones are the Chemical Abstracts Service (CAS) connection table used in the REGISTRY system [3–15]; molfile, SDfile and other file standards developed by MDL, now Symyx [16, 17]; the International Union of Pure and Applied Chemistry (IUPAC) International Chemical Identifier, InChI [18]; and SMILES, developed by Daylight Chemical Information Systems [19, 20].

Through the use of a strict valence model, SMILES can represent molecular graphs, including tautomeric structures, in hydrogen-suppressed form, yielding very compact representations. These are suitable for database indexing and many related computational dictionary functions. Isomeric SMILES, which covers stereochemistry and isotopes, has further increased the utility of canonical SMILES. Note, however, that OpenEye canonical SMILES, Daylight canonical SMILES, SciTouch canonical SMILES and ChemAxon canonical SMILES are all independent unique descriptors. They cannot be used interchangeably as indices in cheminformatics.

The Morgan algorithm [3] underpins many of the systems in use today, and is the basis of the CAS REGISTRY database. It identifies atoms based on an extended connectivity value. The atom with the highest value becomes the first atom in the name, and its neighbors are then listed in descending order. Ties are resolved based on additional parameters, for example bond order, and atomic number. The original Morgan algorithm did not handle stereochemistry; SEMA (stereochemically unique naming algorithm) was developed to handle stereoisomers [21]. SEMA was adopted by MDL Information Systems (now Symyx Software). Symyx’s NEMA (newly enhanced Morgan algorithm) produces a unique name and key for a wider range of structures than SEMA [22]. The work of Wipke et al. [23] identified the value of a constitutional key and a stereo key. This approach has been incorporated into NEMA.
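The extended connectivity idea can be illustrated with a short sketch. This is the commonly taught form of the iteration only, not the CAS REGISTRY implementation: the function name and the adjacency-list input are assumptions made for illustration, and the tie-breaking on bond order and atomic number mentioned above is not shown.

```python
def morgan_extended_connectivity(adjacency):
    """Return extended-connectivity values for a hydrogen-suppressed graph.

    adjacency maps each atom index to the list of its neighbor indices.
    """
    # Start from the degree of each atom (number of heavy-atom neighbors).
    values = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(values.values()))
    while True:
        # The new value of an atom is the sum of its neighbors' current values.
        new_values = {
            atom: sum(values[n] for n in nbrs) for atom, nbrs in adjacency.items()
        }
        new_n_classes = len(set(new_values.values()))
        # Stop when an iteration no longer increases the number of distinct
        # values; the previous set of values is the one that is kept.
        if new_n_classes <= n_classes:
            return values
        values, n_classes = new_values, new_n_classes


# n-Pentane as a chain of five carbon atoms, 0-1-2-3-4.
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
ec_values = morgan_extended_connectivity(pentane)
first_atom = max(ec_values, key=ec_values.get)  # the central carbon
```

For the chain above the central atom receives the highest value and would therefore be numbered first; the two pairs of symmetry-equivalent atoms keep equal values, which is where the additional tie-breaking parameters come into play.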

InChI is an openly available, electronic format for exchanging chemical structure information over the Internet: a unique, linear identifier or “digital signature” [18, 24]. The InChI algorithm converts a chemical structure (in the form of its connection table) into a unique, alphanumeric string of characters. The program can also convert an InChI label back into a molecular structure. Two requirements must be fulfilled in doing this: different compounds must have different identifiers, with all the information needed to distinguish the structures; and any one compound must have only one identifier, including only the necessary information to identify that compound.

Since a given compound may be represented at different levels of detail, in order to create a robust expression of chemical identity the InChI team decided to create a hierarchical “layered” form of the Identifier, where each layer holds a distinct and separable class of structural information, with the layers ordered to provide successive structural refinement. In addition to basic connectivity and overall charge, the principal varieties of layers are mobile/fixed H-atoms (expresses tautomerism), isotopic composition and stereochemistry. The layered structure of the InChI allows future refinements with little or no change to the layers [25].

An InChIKey, a condensed digital representation of the identifier, can be generated based on a truncated SHA-256 hash [26] of the corresponding InChI layers. An InChIKey has two parts. The first block of 14 letters encodes the molecular skeleton (connectivity); the first eight letters of the second block encode stereochemistry and isotopes. Use of InChIKey allows searches based solely on atom connectivity (the first 14 characters). Tautomers have different structures, and different systematic names; those in Fig. 1 have identical InChIKeys but different NEMA keys. Mesomers do not exist separately and would ideally have the same identifier. Figure 2 is an example of mesomers with the same InChIKey but different NEMA keys.
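A connectivity-only comparison of two keys can be sketched as follows. The function names are hypothetical, the example keys are dummy strings that only imitate the layout (they are not keys of real compounds), and the handling of the characters after the first eight letters of the second block is an assumption based on the current standard InChIKey format.

```python
import re

# 14 letters, a hyphen, then a second block whose first 8 letters are kept
# separately; anything after those 8 letters is returned as an opaque trailer.
_INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([A-Z]{2})(?:-([A-Z]))?$")


def split_inchikey(key):
    """Split an InChIKey-shaped string into its blocks."""
    match = _INCHIKEY_RE.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    connectivity, stereo_isotope, flags, last = match.groups()
    return {
        "connectivity": connectivity,      # molecular skeleton
        "stereo_isotope": stereo_isotope,  # stereochemistry and isotopes
        "trailer": flags + (last or ""),   # remaining characters, not interpreted
    }


def same_skeleton(key_a, key_b):
    """True if two keys agree in the 14-letter connectivity block."""
    return split_inchikey(key_a)["connectivity"] == split_inchikey(key_b)["connectivity"]


# Dummy keys for illustration only.
key_1 = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
key_2 = "AAAAAAAAAAAAAA-CCCCCCCCSA-N"
print(same_skeleton(key_1, key_2))  # True: only the second block differs
```

Because the key is a hash, the comparison can only establish equality of the hashed layers; nothing about the structure itself can be read back from the letters.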

The National Cancer Institute Computer Aided Drug Design (NCI/CADD) identifiers are calculated for the Chemical Structure Lookup Service (CSLS) [27]. They are based on hashcodes calculated by the cheminformatics toolkit CACTVS. CACTVS hashcodes represent a chemical structure uniquely as a 16-digit hexadecimal number (64-bit unsigned), have a high sensitivity to structural features of a compound, and change if the connectivity changes. Structure normalization is performed for any incoming structure set to be registered, or searched by, in CSLS. Each parent structure is then subjected to a hashcode calculation to generate the NCI/CADD identifier [28].

The normalization has adjustable levels of sensitivity. The Fragment Isotope Charge Tautomer Stereo (FICTS) identifier is a representation of the exact structure drawing, sensitive to all the five features. The FICuS identifier is not sensitive to tautomers (“u” stands for “unsensitive”), and comes close to how chemists perceive a chemical. The uuuuu identifier links closely related forms. Currently there are eight identifier variants defined for a structure: FICTS, FICTu, FICuS, FICuu, uuuTS, uuuTu, uuuuS, and uuuuu. Three of them, FICTS, FICuS and uuuuu, are searchable for all the structure records in CSLS [28].
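The eight variant names follow a simple pattern: in the list above, fragment, isotope and charge sensitivity are always switched together, while tautomer and stereo sensitivity are switched independently. A minimal sketch that reproduces the names (the function names are hypothetical and this is not CACTVS code):

```python
from itertools import product


def ncicadd_variant_names():
    """Build the eight identifier variant names listed in the text."""
    # F, I and C are on or off as a group; T and S vary independently.
    return ["".join(parts) for parts in product(("FIC", "uuu"), ("T", "u"), ("S", "u"))]


def sensitive_features(variant):
    """Return the features a variant name is sensitive to, e.g. 'FICuS' -> F, I, C, S."""
    return {letter for letter in variant if letter != "u"}


print(ncicadd_variant_names())
# ['FICTS', 'FICTu', 'FICuS', 'FICuu', 'uuuTS', 'uuuTu', 'uuuuS', 'uuuuu']
```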

Issues

When registering a compound into a chemical database system or registry, it is usual to check first whether its structure is novel. If the compound can exist in multiple forms, “novelty check” (or “duplicate search”) must involve searching for all forms. This can be achieved in more than one way, for example, by storing all possible forms in the database (and probably indicating that they relate to just one compound), and doing an exact match search for the query molecule as drawn; or by storing just one form but ensuring that the existing and new structures are “normalized” in some way before comparison.
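The two strategies can be contrasted in a small sketch. Everything here is illustrative: the class and function names are hypothetical, structures are represented by plain name strings, and a lookup table stands in for a real tautomer engine.

```python
# Toy stand-in for a tautomer engine: each set lists forms that interconvert.
TAUTOMER_SETS = [frozenset({"2-hydroxypyridine", "2-pyridone"})]


def enumerate_forms(structure):
    """Return every known form of a structure, including the input itself."""
    for forms in TAUTOMER_SETS:
        if structure in forms:
            return forms
    return frozenset({structure})


def normalize(structure):
    """Pick one representative form (here, arbitrarily, the alphabetical minimum)."""
    return min(enumerate_forms(structure))


class StoreAllFormsRegistry:
    """Stores every form; novelty check is an exact match on the query as drawn."""

    def __init__(self):
        self._compound_of = {}  # form -> compound id
        self._next_id = 1

    def register(self, structure):
        if structure in self._compound_of:
            return self._compound_of[structure], False
        compound_id = self._next_id
        self._next_id += 1
        for form in enumerate_forms(structure):
            self._compound_of[form] = compound_id  # all forms point to one compound
        return compound_id, True


class StoreOneFormRegistry:
    """Stores one form; stored and incoming structures are normalized before comparison."""

    def __init__(self):
        self._compound_of = {}  # normalized form -> compound id

    def register(self, structure):
        key = normalize(structure)
        if key in self._compound_of:
            return self._compound_of[key], False
        compound_id = len(self._compound_of) + 1
        self._compound_of[key] = compound_id
        return compound_id, True
```

Both registries return a pair of compound identifier and novelty flag, and both report 2-hydroxypyridine as a duplicate once 2-pyridone has been registered. The difference lies in where the cost falls: the first stores more rows and needs no normalization at query time, the second stores less but depends on the normalization being applied identically to old and new structures.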

Whether or not all tautomers are stored, there may be reasons for selecting one of them as the preferred form. If one form only is required, how should it be chosen? Should it be the canonical tautomer described by some graph algorithm, or set of rules, or should the supposed major tautomer be stored? What algorithms and rules are currently in use?

If only one tautomeric form is stored, the query used in a substructure search might be modified to allow for the possibility of tautomerism. Alternatively the software developer may decide that it is up to the scientist doing the search to formulate queries that represent all the tautomeric forms being sought. Thus, substructure search is another challenge that can be addressed in more than one way. (Exact structure search is equivalent to the novelty checking procedure described above.)

Finally there is the decision of which tautomer or tautomers to display once the search is complete. Should this be the registered tautomer, the preferred tautomer (if any), the supposed major tautomer, the tautomer that best matches the query, or some combination of these options? Or should all possible tautomers be displayed, whether or not they are stored? This article seeks to address all these issues, but before that can be done it is necessary to establish precisely what is meant by tautomerism.

Definitions of tautomerism

The Symyx definition of tautomeric structures is given in Fig. 3. There can be multiple tautomeric groups in a single molecule. (Although Symyx software defaults to rules such as these, the rules are under user control and can be modified to suit local circumstances.) The Chemical Abstracts Service (CAS) definition [29] is similar (see Fig. 4) but has a broader range of elements. For tautomeric pyrazole derivatives and for tropolones, Chemical Abstracts selects a single preferred structure and index name (using a lowest locant principle, see Fig. 5), and assigns a single CAS Registry Number, even though these systems do not conform to the general equilibrium in Fig. 4 and are not currently normalized by the CAS Registry System. The IDBS basic definition (Fig. 6) is very similar to CAS’ except that M or Z can be carbon, and a negative charge can migrate instead of hydrogen (or a hydrogen isotope). CAS allows a positive charge to migrate.

The CambridgeSoft representation of the simplest form of tautomerism, proton-shift tautomerism, without terminal carbons, is given in Fig. 7. The parameter n allows for 1,3-, 1,5-, 1,7-shifts (see Fig. 8) and beyond. (This is not to imply that other organizations do not recognize more distant shifts than 1,3.) CambridgeSoft can supply a very wide definition including proton-shift tautomerism without terminal carbons (Fig. 7), proton-shift tautomerism with one terminal carbon (Fig. 9), proton-shift tautomerism, with higher unsaturation (Fig. 10), ring-chain “tautomerism” (Fig. 11), valence tautomerism (Fig. 12), “charge tautomers” (Fig. 13), “unreasonable tautomers” (Fig. 14), and “hidden tautomers” (Fig. 15).

IDBS’ documentation supplies the specific examples in Fig. 16. In addition it describes overlapping systems (Fig. 17) and adjacent and non-adjacent forms (Fig. 18). Non-adjacent forms arise because of overlapping tautomeric systems. In the example in Fig. 18, the overlapping systems belong to different tautomeric cases (imine-enamine and azo-hydrazone), but non-adjacency may also arise where the overlapping systems belong to the same tautomeric case (as in two overlapping keto-enol systems). Non-adjacent forms may only be interconverted via adjacent intermediate forms such as Structure 10 in Fig. 18. Unsaturated hydrazine 9 can only convert directly into hydrazone 10 (imine-enamine case). Azo 11 can only convert directly to hydrazone 10 (azo-hydrazone case). Thus 9 and 11 are non-adjacent forms that cannot convert directly into each other.
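In graph terms, the forms of a tautomeric set are nodes, direct interconversions are edges, and non-adjacent forms are nodes joined only through intermediates. The sketch below uses the structure numbers 9, 10 and 11 of Fig. 18 as labels; the function names are hypothetical and no vendor's implementation is implied.

```python
from collections import deque


def tautomer_set(start, direct_conversions):
    """Collect every form reachable from start through direct interconversions."""
    neighbors = {}
    for a, b in direct_conversions:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search over the interconversion graph
        form = queue.popleft()
        for other in neighbors.get(form, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen


def are_adjacent(a, b, direct_conversions):
    """True if two forms interconvert directly, in either direction."""
    return (a, b) in direct_conversions or (b, a) in direct_conversions


# 9 <-> 10 is the imine-enamine case, 10 <-> 11 the azo-hydrazone case.
fig_18 = [(9, 10), (10, 11)]
print(tautomer_set(9, fig_18))      # {9, 10, 11}
print(are_adjacent(9, 11, fig_18))  # False: related only via structure 10
```

A search that follows only direct conversions from the query would miss 11 when given 9; collecting the whole connected set is what allows every form to be found whichever one is drawn.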

Support for multiple types of tautomerism

We turn now to a discussion of which organizations recognize which sorts of tautomerism (assuming that they address the issue at all). Before elaborating on this, it is necessary to recognize that the approaches to tautomerism of experts in “informatics” (some might say “IT”) are quite different from those of computational chemists. For example, vendors of informatics software supply multiple chemical structure examples of the types of tautomerism they recognize; computational chemistry companies send lists of rules. Informatics experts think, at least partly, in graph theory terms; computational chemists talk of minimizing energies. This section thus tends to address informatics approaches. Organizations are discussed in a logical rather than alphabetic order in this section.

Limited support

The Protein Data Bank (PDB) does not explicitly include tautomers in its databases; it only archives what is in the crystal structure complexes, so there is no IT structure for dealing with tautomers. The Beilstein database under the CrossFire system is being phased out. Beilstein is offered online on STN and it is included in Elsevier’s new Reaxys system. In the past the database included a tautomer identifier (i.e., a compound was given a Beilstein Registry Number and individual tautomers were given further identifiers) but this feature has temporarily been disabled. Tautomer recognition is likely to be added to Reaxys before the end of 2010. Another system that still has limited tautomer support is InChI. InChI currently supports only 1,3-migration of hydrogen between heteroatoms; 1,5 migration and keto-enol tautomerism are currently being tested and may become available later as a special user-defined option. This has implications for Internet search engines which are dependent on InChI.

CambridgeSoft

CambridgeSoft is at the other extreme, having studied a great many possibilities for tautomerism (Figs. 7, 8, 9, 10, 11, 12, 13, 14, and 15). CambridgeSoft implements the broadest form of the rule in Fig. 7. Tautomers involving 1,3-shifts are recognized by many software systems; CambridgeSoft products recognize tautomeric systems with no size constraints. There is nothing “magical” about 1,3-, 1,5-, or 1,7-shifts (see Fig. 8); from a chemical perspective, this sort of proton-shift tautomerism is characterized by a 1,n system of alternating single and double bonds, and sometimes it includes bonds in an aromatic system.
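The 1,n pattern can be stated as a test on the bond orders along the path between the atom that carries the mobile hydrogen and the atom that receives it. The sketch below is an illustration under stated assumptions, not CambridgeSoft's algorithm: the function name is hypothetical, aromatic bonds are assumed to have been written in a Kekulé form, and higher unsaturation (triple or cumulated bonds) is not considered.

```python
def proton_shift_size(bond_orders):
    """Return n for a 1,n proton shift, or None if the path does not qualify.

    bond_orders lists the orders of the bonds along the path, starting at the
    atom bearing the mobile hydrogen and ending at the acceptor atom.
    """
    # The path must alternate single, double, single, double, ... and end on a
    # double bond, so it always has an even, non-zero number of bonds.
    if not bond_orders or len(bond_orders) % 2 != 0:
        return None
    for index, order in enumerate(bond_orders):
        expected = 1 if index % 2 == 0 else 2
        if order != expected:
            return None
    # A path with k bonds spans k + 1 atoms, numbered 1 to n.
    return len(bond_orders) + 1


print(proton_shift_size([1, 2]))              # 3, e.g. H-X-Y=Z
print(proton_shift_size([1, 2, 1, 2]))        # 5
print(proton_shift_size([1, 2, 1, 2, 1, 2]))  # 7
print(proton_shift_size([1, 1]))              # None: no alternation
```

Nothing in the test limits the length of the path, which mirrors the statement that there are no size constraints; a size limit, where a system imposes one, would be an extra condition on the returned value.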

In the category “proton-shift tautomerism, with one terminal carbon”, one of X or Z in Fig. 7 is a carbon atom. CambridgeSoft software recognizes keto-enol tautomers in this category but only for three-atom systems. They argue as follows. Consider the structures in Fig. 9. Structures 1 and 2 represent a keto-enol tautomeric pair. Structures 3 and 4 represent another keto-enol tautomeric pair. In contrast, structures 2 and 3 are related through a 1,5-shift where a hydrogen nominally shifts between the oxygen and the para carbon. If that sort of “extended keto-enol tautomerism” were to be allowed, then it would imply that structures 2 and 3 were also tautomers of each other, which is clearly unreasonable from a chemical perspective. Accordingly, tautomeric shifts involving a terminal carbon atom must be restricted to three-atom systems in CambridgeSoft logic.

CambridgeSoft software recognizes any amount of unsaturation across tautomeric systems of any size. Proton-shift tautomerism requires unsaturated bonds between the end points, but there are no restrictions from a chemical sense in the amount of unsaturation. From CambridgeSoft’s point of view, a ketene-ynol tautomerism (Fig. 10) involves two units of unsaturation across a three-atom system.

Some forms of tautomerism involve ring opening (Fig. 11). Carbohydrates are a classic example, and also phenolphthalein, but the general category is much broader than that. CambridgeSoft products currently do not recognize ring-chain tautomerism. It is not clear from a chemical information management perspective whether chemists would want to see structures of this type recognized as tautomers, even though they do match CambridgeSoft’s “dictionary definition”.

Some structures are able to interconvert without any change in the hydrogen bonding pattern (“valence tautomerism”, Fig. 12). CambridgeSoft claims that this is still a true tautomerism (rather than resonance) because the atoms of the structure do move relative to each other. Compounds of this type are said to have fluxional structures [30]. The prototypical example is bullvalene. CambridgeSoft products currently do not recognize valence tautomerism, even though it fits the dictionary definition.

Unlike tautomers, which are chemically distinct species that can be isolated under appropriate conditions (at least in theory), “charge tautomers” are simply representational variants of resonance forms (Fig. 13). They should be recognized as identical and not be treated as tautomers. CambridgeSoft software, however, does recognize these charge-shift resonance pairs, and as with tautomers there are no limitations to the size of the shift. In real compounds, and especially in dyes, the distances can be quite large.

There are certain types of tautomerism where the average chemist would agree that one compound in the “tautomer pair” could not possibly exist. Examples include “tautomers” that break a carboxylic acid or a nitro group (Fig. 14). While it is true that these tautomers are unreasonable, CambridgeSoft holds that this is not an issue that should be ignored: if a user did enter an “unreasonable” tautomer, the software should recognize that it is a tautomer of the more usual form. Some software is designed specifically to exclude recognition of this sort of tautomerism. CambridgeSoft software does recognize it, while admitting that it is indeed rarely encountered.

In the presence of multiple overlapping tautomeric centers the situation can get extremely complex. Consider the structures in Fig. 15. Few chemists would recognize that 5 and 8 are tautomers relative to each other. Structure 5 can interconvert with 6 through a series of keto-iminol tautomerisms. Structure 6 can interconvert with 7 through a 1,11-proton shift. Then 7 can interconvert to 8 through an inverse series of keto-iminol tautomerisms. So indeed 5 and 8 are tautomers, even though the net result is a shift of two hydrogen atoms from one end of the structure to the other. CambridgeSoft software recognizes this sort of tautomerism.

ACD/Labs

ACD/Labs supports keto-enol tautomerism but forms that are very unlikely in practice are not proposed. Thus if any of the structures in Fig. 19a is drawn by a user, the other two are proposed as options; that in Fig. 19b is not proposed, but if a user draws this structure, the three in Fig. 19a will be proposed, and can be used for registration and search. For acetone the minor enol form is not proposed by ACD/Labs procedures as this form is really minor and hardly detectable. Tropolone tautomerism is handled because it also falls in the keto-enol category. The “length” of the tautomeric system is not a problem for ACD/Labs: all six tautomers in Fig. 20 are generated, two of them corresponding to N/NH tautomerism and the other four to N/CH tautomerism (imine-enamine), an analog of keto-enol tautomerism.

The double bond may be in an aromatic system. ACD/Labs’ software recognizes the nitroso-oxime example in Fig. 21 but does not propose any tautomer for phenol since the keto form is hardly detectable. Both 2-hydroxypyridine and 2-pyridone are recognized and the pyridone is treated as the predominant form. Overlapping systems such as guanidine (Fig. 22) are handled.

IDBS

In IDBS’s software, tautomeric systems extending across a conjugated system of double bonds (along a chain or around a ring system) are detected, whether or not the conjugated system is aromatic. All forms in a tautomeric set, whether adjacent or non-adjacent (Fig. 18) will be found in an “exact by tautomer” search, whichever form is input as the query. Many, if not most, of the examples suggested by CambridgeSoft are handled, but one difference is that IDBS does support 1,5-shifts of the keto-enol type, something that CambridgeSoft claims is inappropriate.

Symyx

Symyx agrees with CambridgeSoft on proton-shift tautomerism, with or without terminal carbon, and on the unlimited size of a tautomeric system (Figs. 7, 8, and 9). The company agrees that proton-shift tautomerism with higher unsaturation is a valid example of tautomerism, but does not perceive it yet. Its customers have not complained, perhaps because energetics make the situation rare, but the company will consider implementing it in the future for completeness. Symyx considers Figs. 11, 12, 13, 14, and 15 to be stretching the definition of tautomerism and mostly degrading the usefulness of the term. Its software does, however, perceive “charge tautomers” as tautomers, even though the forms look more like mesomers, in line with CambridgeSoft’s treatment of these structures. The unreasonable tautomer where the carboxylic acid is “broken” (Fig. 14) is recognized by Symyx software but not the nitro group example. Symyx does not accept the “hidden tautomer” (Structures 5, 8 of Fig. 15); it is seen as two products that might rearrange under the right conditions. The software does not recognize the two isomers as tautomers but it does recognize them individually as tautomers of the intermediate poly-enol in Fig. 23.

Symyx plans to widen the definition of a tautomeric region allowing the detection of larger collections of atoms with a mobile hydrogen. This will allow the software to detect tautomeric relationships such as “A is a tautomer of B and B is a tautomer of C, so A is also a tautomer of C” (see Fig. 18). Currently if the user standardizes on B as the reference structure all is well and the relationship is detected, but if A or C is selected as the standard format then only B is detected as a tautomer.

OpenEye

OpenEye does not list specific forms of tautomerism because that is not the nature of its algorithm. It does not have a list of tautomeric forms recognized and successfully handled. Instead, it examines the atom types of the atoms in the molecule and their bonding patterns and then applies an algorithm designed to reproduce the chemistry of tautomerism. This algorithm attempts to be generally applicable and is not reliant on specific definitions of types of tautomers.

OpenEye’s tautomers program handles keto-enol tautomerism at the 1,3 level and, when asked for the unique form, gives the same output tautomer for both input tautomers. The behavior for ketones is analogously successful. There is a flag to turn this feature on or off in the software as it can lead to a noticeable explosion of tautomers that is not always desirable. In line with CambridgeSoft, OpenEye does not cover extended keto-enol tautomerism (1,5 shifts and beyond) but it places no restrictions on the size of the tautomeric system in other cases. For the structures in Fig. 9, OpenEye products generate Structure 1 as the unique form from any of the four starting structures (as long as the flag indicating keto-enol tautomerism is set); if the flag is not set, none of the four forms is tautomerized.

OpenEye software perceives tautomer zones (atoms of which the electrons and protons transfer). Each zone has an independent state. When zones overlap, they are merged into a single zone and the algorithm is applied to the entire zone. Aromaticity (and its perception) is considered in the OpenEye algorithm, and a user can request the software to produce tautomers that retain maximum aromaticity.
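The merging of overlapping zones can be sketched as a merge of overlapping sets of atom indices. This is a generic illustration, not OpenEye's code; the function name and the representation of a zone as a set of integers are assumptions.

```python
def merge_zones(zones):
    """Merge zones (sets of atom indices) that share at least one atom."""
    merged = []  # invariant: the zones in this list are pairwise disjoint
    for zone in zones:
        zone = set(zone)
        disjoint = []
        for existing in merged:
            if existing & zone:
                zone |= existing  # absorb every zone that overlaps the new one
            else:
                disjoint.append(existing)
        merged = disjoint + [zone]
    return merged


zones = [{1, 2, 3}, {5, 6}, {3, 4}, {6, 7, 8}, {10}]
print(sorted(sorted(zone) for zone in merge_zones(zones)))
# [[1, 2, 3, 4], [5, 6, 7, 8], [10]]
```

After merging, each remaining zone can be treated independently, which keeps the number of combinations to enumerate far smaller than if the whole molecule were handled as one system.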

OpenEye software both enumerates and canonicalizes seven of the classes of examples in Fig. 16: keto-enol, amide-imidic acid, amidine-amidine, diazoamino-diazoamino, nitrosamine-diazohydroxide, thioamide-iminothiol, and thionitrosamine. Tautomeric systems extending across a conjugated system of double bonds (along a chain or around a ring system) are detected, whether or not the conjugated system is aromatic. The imine-enamine, nitroso-oxime, azo-hydrazone, thioketo-thioenol, and thionitroso-thiooxime examples in Fig. 16 are not enumerated, and the non-adjacent tautomeric system of Fig. 18 is not recognized. Proton-shift tautomerism with higher unsaturation (Fig. 10), ring-opening (Fig. 11) and valence tautomerism (Fig. 12) are not addressed, but charge resonance forms are. Both “unreasonable” tautomers (Fig. 14) are recognized and the unique form is the one with the standard carboxylic acid or nitro group. CambridgeSoft is not alone in handling “hidden tautomers”: OpenEye software generates a full set of 68 tautomers whichever one of the four structures in Fig. 15 is input, and it generates Structure 5 as the unique tautomer given any one of the four (or indeed 68) starting molecules.

CACTVS

The CACTVS approach for generating tautomeric states of molecules is described in Ref. [31]. Molecular Networks’ MN.TAUTOMER is a stand-alone encapsulated script of Xemistry’s CACTVS, i.e., a rebranded CACTVS application. It is a rule-based enumerator of tautomeric forms of a chemical compound. The rules automatically detect substructures in a given molecule that can undergo a tautomeric transformation via SMIRKS transforms. In total, a set of 23 transformation rules is encoded in MN.TAUTOMER. Most of the important types of tautomerism are supported including:

simple and long-range enol/thioenol exchange
simple imine exchange
nonsubstituted heteroaromatic exchange
simple or long-range hetero atom hydrogen exchange
ketene/ynol exchange
nitro form/aci form of nitro compounds
simple nitroso/oxime exchange or nitroso/oxime exchange of aromatic systems
cyanuric acid, formamidinsulfonic acid, hydrogen cyanide/hydrogen isocyanide, phosphonic acid
sulfonamides, sulfonylsulfane, sulfonylphosphane.

In CACTVS, the tautomer rules are in principle configurable, and cover up to 1,11 H shifts, with long-range rules becoming more selective and less aggressive than short-range rules.

Chemical Abstracts Service

CAS requirements for the normalization of structures [29] are shown in Fig. 4. All possible tautomers are identified by a computer procedure that searches for potential endpoints, for instance, nitrogen or chalcogen atoms, which are double bonded to an atom acceptable as a centerpoint. When such two-atom sets are identified, the remaining attachments of the centerpoint are checked. If a potential endpoint bearing a mobile group is found, all qualifying endpoints and their mobile groups are included in the tautomer group, and the centerpoint-endpoint bonds are marked as tautomer bonds. So that “onium” substructures common in dyes can be normalized, substances that have mobile groups with a plus (+) charge are also recognized as tautomers. Keto and enol forms of a substance are not normalized to a preferred form in the CAS REGISTRY. CAS scientists enter into the REGISTRY whichever form of the keto-enol tautomer is represented in the article being indexed.
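The endpoint and centerpoint search described above can be sketched on a toy connection table. This is a simplified reading of the prose, not the CAS procedure: the names are hypothetical, carbon is assumed to be the only acceptable centerpoint (the full list is defined in Fig. 4), hydrogen is the only mobile group considered, and positively charged mobile groups are ignored.

```python
ENDPOINT_ELEMENTS = {"N", "O", "S", "Se", "Te"}  # nitrogen or a chalcogen
CENTERPOINT_ELEMENTS = {"C"}                     # assumption for this sketch


def find_tautomer_groups(atoms, bonds):
    """Return {centerpoint: sorted endpoints} for each tautomer group found.

    atoms maps an atom index to (element, attached hydrogen count);
    bonds maps a pair (a, b) of atom indices to a bond order.
    """
    groups = {}
    for (a, b), order in bonds.items():
        if order != 2:
            continue
        # Try both orientations of the double bond: endpoint=centerpoint.
        for endpoint, center in ((a, b), (b, a)):
            if atoms[endpoint][0] not in ENDPOINT_ELEMENTS:
                continue
            if atoms[center][0] not in CENTERPOINT_ELEMENTS:
                continue
            # Check the remaining attachments of the centerpoint for a
            # potential endpoint that bears a mobile hydrogen.
            donors = set()
            for (c, d), other_order in bonds.items():
                if other_order != 1 or center not in (c, d):
                    continue
                other = d if c == center else c
                element, h_count = atoms[other]
                if element in ENDPOINT_ELEMENTS and h_count > 0:
                    donors.add(other)
            if donors:
                groups.setdefault(center, set()).update(donors | {endpoint})
    return {center: sorted(endpoints) for center, endpoints in groups.items()}


# Acetamide, CH3-C(=O)-NH2: atom 1 is the centerpoint, O and N are endpoints.
acetamide_atoms = {0: ("C", 3), 1: ("C", 0), 2: ("O", 0), 3: ("N", 2)}
acetamide_bonds = {(0, 1): 1, (1, 2): 2, (1, 3): 1}
print(find_tautomer_groups(acetamide_atoms, acetamide_bonds))  # {1: [2, 3]}
```

Because carbon is never an endpoint in this scheme, acetone yields no tautomer group, which is consistent with keto and enol forms not being normalized to a preferred form.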

SciTouch

The Indigo organic chemistry tool kit from SciTouch [32] includes a data cartridge called Bingo. Bingo supports ring-chain tautomerism and user-defined constraints on tautomeric chains. The user can restrict the tautomer search by enabling conditions for boundary atoms in tautomeric chains. By default, there are three conditions:

Each boundary atom in the tautomeric chain must be one of N, O, P, S, As, Se, Sb, Te
Carbon not from an aromatic ring at one end of the tautomeric chain, and one of N, O, P, S at the other end
Carbon from an aromatic ring at one end of the tautomeric chain and one of N, O at the other end

Users may apply an arbitrary subset of these conditions for matching, and may also add their own conditions, based, like the three defaults, on atomic number and/or aromatic or aliphatic character.
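The three default conditions can be modeled as predicates on the two boundary atoms of a chain. The sketch below is not Bingo's API: all names are hypothetical, a boundary atom is represented as an (element, is_aromatic) pair, and a chain is assumed to be accepted if any one of the enabled conditions holds (the three defaults are mutually exclusive, so they cannot all be required at once).

```python
HETERO_BROAD = {"N", "O", "P", "S", "As", "Se", "Sb", "Te"}


def both_heteroatoms(end_a, end_b):
    """Condition 1: each boundary atom is one of N, O, P, S, As, Se, Sb, Te."""
    return end_a[0] in HETERO_BROAD and end_b[0] in HETERO_BROAD


def _carbon_and(aromatic_carbon, partner_elements):
    """Build a condition: carbon at one end, a listed element at the other."""
    def condition(end_a, end_b):
        for carbon, partner in ((end_a, end_b), (end_b, end_a)):
            if carbon == ("C", aromatic_carbon) and partner[0] in partner_elements:
                return True
        return False
    return condition


# Condition 2: non-aromatic carbon at one end, one of N, O, P, S at the other.
aliphatic_carbon_end = _carbon_and(False, {"N", "O", "P", "S"})
# Condition 3: aromatic carbon at one end, one of N, O at the other.
aromatic_carbon_end = _carbon_and(True, {"N", "O"})

DEFAULT_CONDITIONS = (both_heteroatoms, aliphatic_carbon_end, aromatic_carbon_end)


def chain_allowed(end_a, end_b, conditions=DEFAULT_CONDITIONS):
    """True if the boundary atoms satisfy at least one enabled condition."""
    return any(condition(end_a, end_b) for condition in conditions)


print(chain_allowed(("C", False), ("O", False)))  # True under condition 2
print(chain_allowed(("C", True), ("S", False)))   # False: aromatic carbon needs N or O
print(chain_allowed(("C", False), ("O", False), conditions=(both_heteroatoms,)))  # False
```

The last call shows the effect of enabling only a subset: with condition 1 alone, chains ending on carbon are excluded, so keto-enol pairs are no longer matched. A user-defined condition would simply be another function with the same two-argument shape added to the tuple.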

Schrödinger

In the 2010 release of Epik [33, 34], Schrödinger will cover 900 types of tautomerism with more than 4,000 tautomers listed (including many low population tautomers for identification purposes). While many types of non-aromatic tautomerism are covered, including various keto-enol tautomerisms and their nitrogen analogs, aromatic systems constitute more than 90% of the tautomerism covered, since these more commonly result in significantly populated tautomers in drug-like molecules. In addition, Epik’s protonation state adjustments cover an even larger number of types of tautomers by removing a proton from one location on the molecule and subsequently adding a proton somewhere else.

Others

ChemAxon tools allow for four different ways of handling tautomers; the types of tautomer recognized will depend on the method. In one method (customizing preferences in Standardizer) all transformations are defined by the user [35], an approach which allows for more esoteric transformations such as ring-chain tautomerism. A publication by Chemical Diversity Labs gives a detailed description of handling tautomerism in exact structure searching, in the ChemoSoft software environment [36]. Since then ChemoSoft has added “long distance” tautomer handling and other unrelated features. In Accelrys’ Pipeline Pilot the default substructures considered to calculate a tautomer score include amide, enamine, keto, enol, diamide, thioamide, aromatic bonds, and exocyclic double bonds. Customization is possible. The procedure is discussed at the appropriate point later in this article.

Registration

The recognition and treatment of tautomers is dependent on the specific task at hand. For example, the representation of a chemical substance in a corporate database might need to be independent of specific tautomeric form, whereas spectral properties often require the distinction between specific forms. Since tautomeric forms may have different signals in NMR spectra, ACD/Labs software generates all tautomers of an input structure and lets the user choose which tautomer(s) to use for the prediction of a spectrum.

Whether or not different tautomeric forms should be registered will depend on the use of the database and the business rules of the organization creating the system. Treating different tautomers as identical will always lose information. Whether that is appropriate depends on the given situation. A research laboratory specializing in ultra-low-temperature species almost certainly would want to keep tautomers distinct. On the other hand, a user keeping an inventory of a stockroom stored at room temperature would probably prefer to merge tautomers.

Reactions and mechanistic information are other cases. Chemists would be unlikely to use the unusual forms of carboxylic acid and nitro groups except in a reaction mechanism, and if such forms were used in a reaction, they probably should be registered unchanged. What is recorded in a reaction and what is put in a corporate registry are not necessarily the same. Symyx recommends registering the reaction as drawn, and also registering the structures after normalization, so that searching for what was done works, and queries that follow corporate business rules also work.

Some of the companies in this article provide software for systematic naming of compounds. Such software must be able to name any structure that is input; tautomerism is to some extent a non-issue. If the software can generate all tautomers it should be able to name all tautomers, should the user really want extra names in addition to the one for the input structure. If the software also has an algorithm for deriving a preferred tautomer, then it can name a preferred tautomer as well. The subject of chemical nomenclature is very complex and is outside the scope of this article. Registration approaches are discussed in this section in alphabetical order of vendor.

Accelrys

In a registration system built using Accelrys’s Pipeline Pilot, using the Pipeline Pilot Chemistry Cartridge to provide structure indexing and searching, there are two potential ways to recognize different tautomeric forms that may already be present in a database. In the first, the registration protocol takes the input molecule, enumerates all possible tautomeric forms of that molecule and stores all of these in an SDfile (along with the original input structure) as a single row in the database. In the second, the registration protocol calculates and stores the canonical tautomer for each molecule being registered into the database. The componentized nature of Pipeline Pilot makes it possible for users to configure the system to support the tautomeric matching behavior that their business rules require. The Enumerate Tautomers component in Pipeline Pilot takes input molecules and can enumerate all possible tautomers or it can convert the molecule to a canonical tautomer. The tautomer algorithm that the component uses is based on that of Sayle and Delany [37].

ACD/Labs

ACD/Labs software stores structures in the database as specific (fixed) tautomeric forms. The tautomer recognition tool generates possible tautomeric forms for a given structure according to a set of rules, estimates a preference for each specific form and ranks them as major, minor, or conditions dependent. For registration the software allows the user to choose the tautomeric form to store, but since the tautomer module is aimed at generation of preferred form(s), if a particular form is recognized as the major one, minor forms are not proposed for registration.