1 Molecular representations for machine learning

Molecular representations and features play an essential role in machine learning applications in the domains of chemistry, drug discovery, and materials science. These representations convert the structural and chemical information of molecules into a format that can be efficiently processed by computational models. In recent years, several reviews on these representations have been published [1,2,3] to give readers different perspectives on the pros and cons of known representations as well as how they are categorized. Although these works remain valuable, they do not cover more recent advances. Building partially on previous reviews and incorporating updated information and a holistic understanding of these representations, we conducted a comprehensive review of molecular representations for machine learning in bio-cheminformatics. Within the scope of this study, we focus on those that are commonly used in de novo molecular design and Quantitative Structure–Activity Relationship (QSAR) modeling. In this section, we introduce various molecular representations, classified by their characteristics into six groups: string-based, property-based, molecular fingerprints, language model-based, graph-based, and others.

1.1 String-based representations

String-based representations include all types that describe molecular bonds and structures using special symbols (e.g., ‘/’, ‘@’), alphabet letters (e.g., ‘H’, ‘C’), or any other non-numeric forms. One of the most widely used string-based molecular representations is the Simplified Molecular-input Line-Entry System (SMILES) [4] (Fig. 1). The SMILES representation of a molecule is a compact textual notation that encodes its molecular structure, where atoms are represented by chemical symbols (e.g., ‘S’ for sulfur, ‘O’ for oxygen) and bonds are represented by special symbols (e.g., ‘-’ for a single bond, ‘=’ for a double bond, and ‘:’ for an aromatic bond). SMILES has found extensive applications in cheminformatics and drug discovery due to its simplicity and ease of use. The SMILES notation follows a set of predefined rules and syntax, facilitating the conversion between molecular structures and textual representations. It enables the storage, retrieval, and manipulation of molecular information in databases and machine learning workflows. However, SMILES strings cannot be fed directly into most machine learning models; they must first be transformed into corresponding numeric forms (e.g., one-hot encodings) [5]. SMILES syntax is also redundant: multiple distinct SMILES strings can represent the same compound. Besides SMILES, there are several other string-based representations, such as the International Chemical Identifier (InChI) [6], the InChI Key [6], and SYBYL Line Notation (SLN) [7]. SELFIES (Self-Referencing Embedded Strings), introduced by Krenn et al. [8], is a more advanced and concise string-based representation developed with rules for molecular reconstruction.
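The redundancy of SMILES can be illustrated with RDKit: canonicalization collapses syntactically different SMILES strings for the same compound into a single canonical form. A minimal sketch, using toluene as the example molecule:

```python
from rdkit import Chem

# Two syntactically different SMILES strings for the same compound (toluene)
smiles_variants = ["Cc1ccccc1", "c1ccccc1C"]

# Parsing each string and re-writing it yields one canonical SMILES
canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in smiles_variants}
print(canonical)  # a single canonical SMILES remains
```

This canonicalization step is commonly applied as data cleaning before de novo design or QSAR modeling, so that duplicate structures do not leak between training and test sets.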

Fig. 1
figure 1

An example of string-based representations (Compound: Nicotine)

Most string-based representations are supported by the RDKit library [9], an open-source cheminformatics toolkit. SMARTS and SMIRKS are two specialized notations used for structural pattern searching and chemical reaction description, respectively. While SMARTS is supported by RDKit, SMIRKS, a hybrid notation based on SMILES and SMARTS, can be generated using Ambit-SMIRKS [10]. SLN (SYBYL Line Notation) is more versatile: it can express chemical structures, support searches, describe chemical reactions, and encode 3D structural information, whereas SMILES is designed primarily for representing 2D structures. Ambit-SLN [11] facilitates the processing of SLN conversion. Table 1 summarizes tools and software that support translation from one string-based representation to another.

Table 1 Tools and software that support string-based representations

1.2 Property-based representations

Property-based representations of molecules are numerical vectors or matrices that carry information on theoretically-derived molecular properties and characteristics (Fig. 2). Molecular descriptors, which are typical property-based representations, can be either continuous or categorical values computed based on the 2D or 3D structures of molecules [12]. These descriptors provide a quantitative representation of molecular structures, which can be directly used in machine learning tasks, exploratory data analysis, or structural similarity assessment.
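As an illustration, a few common 2D descriptors can be computed with RDKit. This is a minimal sketch; the particular descriptors chosen here are an arbitrary selection:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin

# A small, arbitrary selection of 2D molecular descriptors
features = {
    "MolWt": Descriptors.MolWt(mol),       # molecular weight
    "LogP": Descriptors.MolLogP(mol),      # Wildman-Crippen logP estimate
    "TPSA": Descriptors.TPSA(mol),         # topological polar surface area
    "HBD": Descriptors.NumHDonors(mol),    # hydrogen-bond donor count
}
```

Vectors of such descriptors can be fed directly into traditional machine learning models or used for exploratory data analysis.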

Fig. 2
figure 2

An example of property-based representations. The Coulomb matrix of ethene (C\(_2\)H\(_4\)) is constructed from \(N^2\) (where \(N=6\)) Coulomb potential values (\(M_{i,j}^{Coulomb}\)), computed from the atomic numbers and the pairwise inter-atomic distances of the constituent atoms as follows: \(M_{i,j}^{Coulomb}\) = \({\left\{ \begin{array}{ll} 0.5 \times Z_i^{2.4} \ \forall \ i=j \\ \frac{Z_i \times Z_j}{|R_i - R_j |} \ \forall \ i \ne j \end{array}\right. }\), where \(Z_i\) is the atomic number of atom i and \(|R_i-R_j|\) is the Euclidean distance between atoms i and j

Molecular descriptors cover a wide range of physicochemical properties, including topological, geometrical, electrostatic, and quantum-chemical properties. Many molecular descriptor sets (e.g., Chemopy, CDK, etc.) are defined by different groups of properties. Numerous non-commercial libraries [9, 13, 14], software [15], and web servers [16] support the computation of these descriptors. In addition to molecular descriptors, electrostatically computed matrices, such as the Coulomb matrix, the Ewald sum matrix, and the Sine matrix, also serve as property-based representations [14]. However, these matrices are expressed in a similar pattern to the adjacency matrix of the molecular graph. Table 2 summarizes tools and software that support property-based representations.
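The Coulomb matrix described in Fig. 2 can be sketched directly in NumPy. This is a minimal illustration of the standard definition; atomic numbers and Cartesian coordinates are supplied by the user:

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Coulomb matrix from atomic numbers Z and Cartesian coordinates R (N x 3)."""
    Z, R = np.asarray(Z, dtype=float), np.asarray(R, dtype=float)
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4  # diagonal: self-interaction term
            else:
                # off-diagonal: Coulomb repulsion between atoms i and j
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M

# Toy example: H2 with a bond length of 0.74 (coordinates in Angstrom)
M = coulomb_matrix([1, 1], [[0, 0, 0], [0, 0, 0.74]])
```

Note that the matrix is symmetric but not invariant to atom ordering, which is why sorted or eigenvalue-based variants are often used in practice.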

Table 2 Tools and software that support property-based representations

1.3 Molecular fingerprints

Molecular fingerprints, also known as chemical fingerprints or simply fingerprints, are numerical expressions that indicate the presence or absence of specific substructures. The fingerprint vector contains information about substructural patterns within a molecule (Fig. 3). The diversity of fragmentation methods for substructural hashing creates a variety of fingerprints. While most fingerprint vectors are binary, others are substructure-count vectors. A binary fingerprint reader scans the molecular structure and, upon detecting a substructure, counts it as ‘one (1)’, ignoring any subsequent occurrences of the same substructure. In contrast, substructure-count fingerprint readers count all occurrences of repetitive substructures, highlighting differences in the frequency of substructures. Fingerprint vectors are useful for similarity searches [20] and various machine learning tasks, except for de novo molecular design. The high computational cost of reconstructing a molecule from its fingerprint is the primary barrier to the applications of fingerprints in this field. Furthermore, these reconstruction methods often lack precision. E-State and Extended-Connectivity are typical examples of binary fingerprinting tools found in CDK and PubChem [15]. Klekota-Roth, AtomPairs2D, and Substructure fingerprints support both substructure-count and binary forms [15]. Since a fingerprint vector is simply a binary vector of annotated substructures, users can customize their fingerprints by defining which substructures should be detected. NC-MFP [21] is an example of a fingerprint customized for natural compounds. Similar to molecular descriptors, fingerprint vectors can be easily computed using different non-commercial libraries [9, 13], software [15], and web servers [16].
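For instance, a binary Morgan fingerprint can be computed with RDKit. A minimal sketch; the bit size and radius below are common but arbitrary choices:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin

# 2048-bit binary Morgan fingerprint with radius 2 (ECFP4-like)
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
on_bits = fp.GetNumOnBits()  # number of hashed substructures present
```

The resulting bit vector can be compared against other molecules with Tanimoto similarity or used directly as a feature vector.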

Fig. 3
figure 3

An example of structure-based representations. Aspirin’s Morgan fingerprints are a binary vector in which ‘one (1)’ and ‘zero (0)’ indicate the ‘presence’ and ‘absence’ of a defined substructure, respectively. A set of Morgan substructures is determined by the number of selected bits (e.g., 1024, 2048) and radius (r). The size of the substructure is associated with the radius

Table 3 provides information on tools and software that support typical molecular fingerprints. ChemDes [16] can currently be used to compute 59 commonly used types of fingerprints. Among these, MACCS (166 bits), PubChem (881 bits), Morgan (\(1024 \times n\) bits), and Substructure (307 bits) are frequently used for molecular featurization. The Klekota-Roth fingerprint (4860 bits), introduced by Klekota and Roth [22], creates high-dimensional sparse vectors, whereas the E-State fingerprint (79 bits) generates low-dimensional vectors. The number of bits and the radius (r) can be adjusted for the Extended-Connectivity Fingerprint (ECFP) [23], a more generalized and adaptable version of the Morgan fingerprint. The radius determines the size of the circular substructures: a radius of 2, for example, means that each substructure comprises all atoms within two bonds of a central atom. Varying the number of bits and the radius results in different fingerprints. ECFP is frequently followed by a number indicating the chosen diameter (twice the radius), such as ECFP2, ECFP4, and ECFP6, which correspond to radii of 1, 2, and 3 bonds, respectively. The performance of a downstream task is often influenced by the selected number of bits. It is important to choose a number of bits large enough to cover the most essential substructures in the chemical set, but an excessively large number can result in sparse vectors that slow down computation. Although there is no strict rule for selecting the number of bits, researchers commonly set it to a multiple of 512 (e.g., 1024, 2048). Additionally, ECFP is used to create language model-based representations, such as Mol2vec [28] and NPBERT [24]. An ECFP fingerprint can also be converted to an indexing vector and then transformed into an embedding matrix for a molecular property prediction task [25, 26].
The Natural Compound-Molecular Fingerprint (NC-MFP) [27], a fingerprint customized for natural compounds, is not readily available as a module. Developing and reimplementing NC-MFP (10,016 bits) is challenging because of the numerous disconnected processing stages and pieces of software required. Menke et al. [27] trained a deep neural network to encode Natural Product Fingerprint (NPFP) vectors, demonstrating that NC-MFP was less effective than NPFP in downstream tasks.

Table 3 Tools and software that support molecular fingerprints

1.4 Language model-based representations

Language model-based representations are continuous vectors or matrices created by ‘molecular encoders’ (Fig. 4). Molecular encoders are pre-trained models developed using a large set of molecules. During training, these molecular encoders learn the structural patterns and characteristics of molecules to map them to corresponding continuous forms, which are expected to be convertible back to their original structures. Examples of molecular encoders include Mol2vec [28], ChemBERTa [29], and NPBERT [24]. Most molecular encoders are developed using language models, where each molecule (defined by a specific set of substructures) is treated as a ‘sentence’ and its substructures are treated as ‘words’. A ‘valid molecule’ is analogous to a ‘meaningful sentence’, emphasizing the importance of the order of substructures. The molecular encoders learn the ‘grammar of molecules’ to create a vector space capable of effectively encoding any inputted molecule. The inputs for molecular encoders can include index vectors, one-hot vectors, graph-based matrices, or any other form readable by the model. The quality of the vector space depends on the volume of training data, the architecture used, and the training strategies. These language model-based representations are then used as inputs for downstream machine learning tasks. Their continuous nature enables more efficient optimization, for example through gradient-based methods [3].
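The ‘molecule as sentence’ analogy can be made concrete by extracting Morgan substructure identifiers, which play the role of ‘words’. This is a minimal sketch in the spirit of Mol2vec-style preprocessing, not the actual Mol2vec implementation:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CCO")  # ethanol

# bitInfo maps each Morgan substructure identifier to the atoms it covers
info = {}
AllChem.GetMorganFingerprint(mol, 1, bitInfo=info)

# The molecule becomes a "sentence" whose "words" are substructure identifiers
sentence = [str(identifier) for identifier in sorted(info)]
```

Such sentences can then be passed to a word-embedding model, which learns a continuous vector per substructure; a molecule's representation is typically the sum or mean of its substructure vectors.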

Fig. 4
figure 4

An example of language model-based representation

Table 4 presents tools and software that support language model-based representations. Mol2vec [28], a pre-trained model, was the first molecular encoder to convert molecules into corresponding language model-based features. It draws inspiration from Word2vec [30], a method for word embedding. These encoders are trained using language models and vast sources of data. For developing Mol2vec, Jaeger et al. [28] used nearly 20 million chemical structures as training samples, initially translated into ECFP vectors with 2048 bits and radii of 0 and 1. The Mol2vec encoder was trained with two approaches: Continuous Bag-of-Words (CBOW) and Skip-gram, resulting in two embedding sizes for molecules: 100- and 300-dimensional continuous vectors. Motivated by Mol2vec and aided by advanced deep learning architectures, various molecular encoders have been constructed for specific purposes. Examples include SMILES-BERT [31], MolBERT [32], ChemBERTa [29, 33], NPBERT [24], and FP-BERT [34], all developed using the Bidirectional Encoder Representations from Transformers (BERT) architecture [35]. Currently, BERT is one of the most robust Transformer architectures, employing self-supervised learning methods. SMILES-BERT, MolBERT, and ChemBERTa are designed to learn the syntax of SMILES for encoding input SMILES strings of molecules. The training datasets for SMILES-BERT, MolBERT, and ChemBERTa contained approximately 18 million, 1.6 million, and 77 million molecular structures, respectively. NPBERT and FP-BERT are trained to learn the ECFP fingerprints of the substructures according to their appearance orders in the molecule. The NPBERT training dataset was enriched with 250k structures of natural products and about 1.9 million ordinary chemical data points. FP-BERT was trained with roughly 2.0 million compounds.

Besides using SMILES, ChemBERTa [29, 33] has another version trained with SELFIES. Similarly, SELFormer [36] was designed to create representations from SELFIES using RoBERTa [37], a robustly optimized BERT variant. ChemFormer [38] was constructed using the Bidirectional and Auto-Regressive Transformer (BART) [39] architecture. Contrary to BERT-based models, BART-based models prioritize the correction of sequences that have been altered with random tokens instead of using masked language modeling in their pre-training phase. MoLFormer [40] was developed using the RoFormer [41] architecture, an enhanced Transformer version with rotary position embedding. X-MOL [42], a large-scale molecular encoder, was trained using a Transformer architecture with 12 pairs of Encoder-Decoder. MolMap [43] learned 1,456 molecular descriptors and 16,204-bit fingerprints from about 8.5 million molecules using a dual-path convolutional neural network [44] to create 3D fingerprint maps of size \(37\times 36\times 3\).

Language model-based representations can also be generated from graph-based encoders. The Hierarchical Molecular Graph Self-supervised Learning (HiMol) encoder [45] uses three levels of molecular graph information: node, motif, and graph. Initially, the input molecular graph (atom node-level) is fragmented into motifs to create motif-level nodes before adding a graph-level node. These three levels of a molecular graph’s features are learned by an encoder to create three corresponding representation levels. FunQG [46] is a molecular encoder trained with Quotient Graphs of Functional groups. Instead of using traditional molecular graphs constructed by a network of nodes, Hajiabolhassan et al. [46] considered each functional group as a specific node, resulting in more informative graphs. However, their representation learning is most useful for encoding heavy molecules with complex structures, as small molecules typically consist of a limited number of functional groups.

Table 4 Tools and software that support language model-based representations

1.5 Graph-based representations

Graph-based representations provide graphical expressions of the structural connectivity of molecules. In these representations, atoms are considered ‘nodes’ or ‘vertices’, and intramolecular bonds are considered ‘edges’ (Fig. 5). Thus, a molecule can be viewed as a graph \(\mathcal {G} = (\mathcal {V}, \mathcal {E})\), defined by a set of nodes (atoms) \(\mathcal {V}\) and a set of edges (bonds) \(\mathcal {E}\), where \(\mathcal {E} \subseteq \{ \{ v_i, v_j \} \mid v_i, v_j \in \mathcal {V} \text { and } v_i \ne v_j \}\). Employing molecular graphs helps to extract valuable information on molecular connectivity, such as substructures, symmetry, and functional groups, to predict possible molecular properties (e.g., toxicity, solubility) or to explain the origins of these properties (e.g., alert structures). To be processed by machine learning models, a molecule is transformed into a ‘node matrix’, an adjacency matrix indicating connections among all atoms within the molecule. In addition to the node matrix, modern graph-based neural networks utilize additional matrices of node or edge attributes to enhance learning efficiency. For molecular graphs, the ‘edge attribute matrix’ provides information on the types of bonds between atom pairs, while the ‘node attribute matrix’ includes additional atomic characteristics (e.g., element, orbital hybridization, charge status). To facilitate the learning process with graph-based representations, a number of graph-based deep learning architectures have been developed and continue to evolve, fully exploiting the potential of molecular graphs [47]. Numerous graph-based representations have been derived from molecular graphs, such as graph-embedding features [48,49,50]. The graph-based representation shown in Fig. 5 is just one of many possible graphs: the node order in the adjacency matrix can change depending on the graph traversal algorithm used, and a single molecule can have multiple graph representations tailored for specific tasks.
Some examples can be found in [51, 52].
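The node (adjacency) matrix and the node and edge attributes described above can be extracted with RDKit. A minimal sketch for ethanoic acid, considering heavy atoms only:

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)O")  # ethanoic acid (heavy atoms only)

adj = Chem.GetAdjacencyMatrix(mol)                # node (adjacency) matrix
atoms = [a.GetSymbol() for a in mol.GetAtoms()]   # node attributes (element)
edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx(), b.GetBondTypeAsDouble())
         for b in mol.GetBonds()]                 # edge attributes (bond order)
```

These arrays map directly onto the inputs expected by graph neural network libraries, where `adj` defines message-passing connectivity and `atoms`/`edges` become node and edge feature vectors.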

While graphs are inherently 2D data structures with no spatial relationships between elements, they can effectively encode 3D information and stereochemical details by incorporating such data into the node and edge features. Graph representations have significant advantages over linear notations due to their ability to naturally encode 3D information and the interpretability of all molecular subgraphs. However, there are also some disadvantages to using molecular graph representations for certain applications. Molecular graphs are inadequate for representing certain types of molecules, particularly those with delocalized bonds, polycentric bonds, ionic bonds, or metal-metal bonds. Organometallic compounds, for instance, cannot be effectively described by molecular graphs due to their complex bonding schemes. Hypergraphs offer a solution for handling multi-valent bonds by representing edges as sets of atoms, but their use is not widespread. Additionally, for molecules with constantly changing 3D structures, a single static graph representation is not meaningful and could hinder problem-solving. A significant challenge with graph-based representations is their lack of compactness, both in memory usage and size. Representing a molecular graph requires complex data structures that are harder to search than compact linear representations. As a graph grows larger, its memory requirements increase significantly. In contrast, linear notations provide more compact and memory-efficient molecular representations, making them easier to use for identity searches, though less effective for substructure searches.

Fig. 5
figure 5

An example of the graph representation. The node and edge matrices of ethanoic acid (CH\(_3\)COOH) are generated based on the connectivities (bonds) among atoms and their bond types (e.g., single, double, triple). Only heavy atoms (excluding hydrogen) are considered when creating these node and edge matrices

Table 5 summarizes tools and software that support graph representations. DeepChem [53] was initially launched as a community project focusing on the applications of deep learning in chemistry and drug discovery. Over the years, the project has expanded to encompass a broader range of applications in molecular science. It now provides an open-source Python library with useful modules for processing multiple molecular representations, including graphs. DGL-LifeSci, developed by Li et al. [54], is another open-source Python library that supports deep learning on graphs in life sciences. Surge, created by McKay et al. [55], is a fast command-line generator of molecular graphs; however, it struggles to process complex aromatic structures.

1.6 Other representations

In addition to the five types of molecular representations mentioned earlier, other formats, such as ‘3D voxelized’ and ‘image-based’ representations (Fig. 6), can also be employed for machine learning tasks [56]. However, their applications are somewhat restricted due to unaddressed limitations. The 3D voxelized representation creates 3D arrays that often exhibit high sparsity and dimensionality, and it is not invariant to molecular rotation, translation, or permutation [57,58,59]. Image-based representations, by contrast, typically convert small molecules into 2D images. Drawing on the success of Google’s Inception-ResNet [60], Goh et al. developed Chemception [61], a specialized approach for molecular embedding. Building on the Chemception concept, Bjerrum et al. [62] introduced another molecular encoder capable of creating five-band molecular images, offering more comprehensive information for downstream machine learning tasks. Table 6 lists the tools and software used in computing these alternative representations.

Table 5 Tools and software that support graph-based representations
Fig. 6
figure 6

An example of the other representations

1.7 De novo molecular design and property prediction

The variety of molecular representations provides researchers with numerous options for creating new computational frameworks. No study has conclusively shown that one representation is consistently superior, as model performance depends on many factors, including data volume, learning strategies, and the characteristics of the molecules. Molecular descriptors and fingerprints may be more appropriate for small datasets because they can be computed quickly and are compatible with traditional machine learning models. However, using these representations often requires feature engineering and selection, and because they are not invertible, they are restricted to property prediction tasks. In contrast, string-based representations are primarily used for de novo molecular design due to their invertibility, while graph-based representations are well-suited for handling large datasets with deep learning models, removing the need for feature engineering. Language model-based representations are particularly effective for exploratory data analysis of molecular structures and property prediction tasks. Their continuous nature allows for more efficient optimization of the learning process compared to other types, such as one-hot matrices and binary vectors. Additionally, as learnable representations, language model-based representations can be customized to distinguish between different classes of molecules, potentially improving the model’s performance. Table 7 summarizes molecular representations and their applicable tasks.

Table 6 Tools and software that support other representations
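A typical fingerprint-plus-traditional-model workflow for property prediction can be sketched as follows. The dataset here is a hypothetical toy set with made-up labels, purely for illustration:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy dataset: SMILES strings with made-up binary labels
smiles = ["CCO", "CCN", "CC(=O)O", "CCCC", "c1ccccc1", "c1ccccc1O"]
labels = [0, 0, 0, 0, 1, 1]

def featurize(s):
    """Convert a SMILES string into a 1024-bit Morgan fingerprint array."""
    fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=1024)
    return np.array(fp)

X = np.array([featurize(s) for s in smiles])
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)
predictions = model.predict(X)
```

In a real QSAR study the same skeleton applies, with a curated dataset, a held-out test split, and hyperparameter tuning replacing the toy components.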

Table 7 Eligible tasks for different molecular representations

2 Representations for structural preservation

Representations for structural preservation hold information about the atoms, bonds, connectivity, and coordinates of a molecule. They contain header information, atom information, and bond connections and types, followed by sections for more complex information.

2.1 Connection table

While graphs are fundamental for molecular representation, their connectivity matrices are not compact and scale quadratically with the number of atoms. The connection table (Ctab) provides a more structured format, comprising six parts: Counts line, Atom block, Bond block, Atom list block, Structural text descriptor block, and Properties block. The Counts line offers an overview of the structure by specifying the number of atoms, bonds, atom lists, and chirality presence, along with the version (V2000 or V3000). The Atom block lists atom identities, atomic symbols, mass differences, charges, stereochemistry, and associated hydrogens, often treating hydrogens implicitly to reduce size. The Bond block details atom connectivity and bond types, including bond order. These core blocks form the basis of the Ctab, which is extensible to include additional properties. Connection tables have become standard for handling chemical structural information due to their backward compatibility and widespread use, particularly in Molfile formats. Notably, connection tables are not file formats themselves but serve as the foundational structure for chemical table files (CTfiles).
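The fixed-width Counts line can be parsed directly. In the sketch below, RDKit generates a V2000 Molfile block for ethanol, and the atom and bond counts are read from columns 1-3 and 4-6 (a minimal illustration of the layout, not a full Molfile parser):

```python
from rdkit import Chem

# Generate a V2000 Molfile block for ethanol (hydrogens kept implicit)
molblock = Chem.MolToMolBlock(Chem.MolFromSmiles("CCO"))
lines = molblock.splitlines()

# The fourth line is the Counts line: columns 1-3 hold the atom count,
# columns 4-6 the bond count, and the version tag appears at the end.
counts_line = lines[3]
n_atoms = int(counts_line[0:3])
n_bonds = int(counts_line[3:6])
```

The Atom and Bond blocks follow the Counts line and use the same fixed-width convention, which is why the numbers of lines to read can be taken directly from `n_atoms` and `n_bonds`.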

2.2 The Molfile format

The Molfile (or CTfile) family utilizes connection tables to represent molecular structures. These formats were first developed by MDL Information Systems (MDL), later acquired by Symyx Technologies, and are now maintained by BIOVIA [66]. The CTfile specification is released openly, although users must register to download it. CTfiles are highly extensible, leading to the creation of a series of widely adopted file formats for transferring chemical information. The connection table (Ctab) is encapsulated within the Molfile format, which can be further integrated into a structure-data (SD) file that holds both structural information and additional property data for multiple molecules. Similarly, the Reaction file (RXNfile) [67] describes individual reactions, while the Reaction-Data file (RDfile) [66] stores either reactions or molecules along with their associated data. The Reaction Query file (RGfile) [68] is designed for handling queries, and the Extended Data file (XDfile) [67], which is XML-based, facilitates the transfer of structures or reactions along with their metadata. Further information on these file types and their structures is available in MDL documentation and cheminformatics textbooks. Although Molfiles contain rich structural information, they are not directly suitable for training machine learning models in their raw form and must be pre-processed into a machine-readable format (e.g., molecular fingerprints, descriptors). Figure 7 visualizes the key features of these CTfiles.
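Writing and reading SD files, including associated property data, is routine with RDKit. A minimal sketch; the `pIC50` field below is a hypothetical example property, not part of the SDF standard:

```python
import os
import tempfile
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")
mol.SetProp("_Name", "ethanol")
mol.SetProp("pIC50", "6.2")  # hypothetical property stored as an SD data field

# Write a one-record SD file to a temporary location
path = os.path.join(tempfile.mkdtemp(), "mols.sdf")
writer = Chem.SDWriter(path)
writer.write(mol)
writer.close()

# Read the record back; both the structure and the data field are recovered
read_back = next(iter(Chem.SDMolSupplier(path)))
value = read_back.GetProp("pIC50")
```

Because each SD record couples a full connection table with arbitrary named data fields, SD files are a common interchange format for labeled datasets used in property prediction.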

3 Representations for chemical reactions

Chemical reactions, which involve the transformation of one set of molecules into another under specific conditions, have been extensively documented, with around 127 million reactions recorded to date [69]. Recently, there has been renewed interest in developing models to predict reaction outcomes, plan synthetic routes, and analyze reaction networks [70]. While traditional graphical representations of reactions are common, they are not easily machine-readable. Thus, various machine-readable reaction data exchange formats (e.g., RXNfiles, RDfiles) have been developed. These formats are essential for applications in computer-aided synthesis design and autonomous discovery, accommodating the complexities and limitations of different molecular representations.

3.1 SMILES Reaction Kinetics Scheme

SMILES, which describes ordinary text-based molecular structures, has been extended to include the SMILES Reaction Kinetics Scheme (SMIRKS), a notation developed by Daylight Chemical Information Systems for describing generic chemical reaction transformations. SMIRKS extends both SMILES and SMARTS. While SMILES is used to represent specific molecules and SMARTS to define molecular patterns or substructures, SMIRKS is specifically designed to encode reaction transformations, identifying which atoms and bonds change during a reaction.

Fig. 7
figure 7

The MDL family of CTfiles are created based on the connection tables (Ctab). The connection table is specified by atom and bond blocks that describe the atoms and their corresponding connectivity. Molfiles and RXNfiles are used to describe single molecules and reactions, respectively. SDfiles and RDfiles store a series of structures or reactions and associated data. RGfiles are used to handle reaction queries. XDfiles are used for transferring structure or reaction data using the XML format

In Reaction SMILES, reactants, agents, and products are represented as SMILES strings separated by ‘>’ (reactants > agents > products); when no agents are specified, the two separators merge into ‘>>’. Atom mappings, which connect reactant atoms to product atoms, can be included, but additional information such as reaction centers or conditions is not supported. Other formats, such as RXNfiles and RDfiles, can store this additional metadata. SMIRKS describes generic reaction transformations by specifying reaction centers and the changes in bonds and atoms. It combines features of SMILES and SMARTS and imposes specific rules, such as the correspondence of mapped atoms and explicit hydrogens between reactants and products. SMIRKS patterns are then converted into reaction graphs for further use.
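A mapped reaction transformation of this kind can be applied with RDKit's reaction machinery. A minimal sketch of generic amide-bond formation, with the transformation pattern adapted from the style used in the RDKit documentation:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Generic amide-bond formation written as a mapped transformation:
# a carboxylic acid OH is replaced by a bond to an N-H nitrogen
rxn = AllChem.ReactionFromSmarts("[C:1](=[O:2])-[OD1].[N!H0:3]>>[C:1](=[O:2])[N:3]")

acid = Chem.MolFromSmiles("CC(=O)O")   # acetic acid
amine = Chem.MolFromSmiles("NC")       # methylamine
products = rxn.RunReactants((acid, amine))  # tuple of product sets per match

amide = products[0][0]  # first product of the first matched combination
```

The mapped atom classes (`:1`, `:2`, `:3`) identify which atoms persist across the transformation, which is exactly the information a plain Reaction SMILES string cannot express without atom maps.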

3.2 Reaction InChI

Reaction InChI (RInChI) [71, 72], developed between 2008 and 2018, provides a unique, order-invariant identifier for chemical reactions to aid reproducibility and consistency in reaction representation. Unlike Reaction SMILES, RInChI uses InChIs for the individual molecules and tracks structureless entities when InChIs cannot be generated. RInChI includes information about equilibrium, unbalanced, or multi-step reactions, and employs a layering system to describe distinct aspects of the reaction, such as solvents, catalysts, and reaction direction. This makes it particularly useful for identifying practically identical reactions conducted under specific conditions. An extension, ProcAuxInfo [73], allows for the storage of metadata such as yields and reaction conditions. While RInChI can identify duplicate reactions and supports efficient indexing and searching of reaction data, it lacks equivalents to SMARTS or SMIRKS, limiting its use for substructure searches and for encoding generic transformations.

As a standardized textual identifier for chemical reactions, RInChI facilitates the sharing and indexing of chemical reaction information by encoding the reactants, products, and, optionally, the agents involved. The RInChI system is designed to provide a unique and machine-readable representation of chemical reactions, making it easier to search for, retrieve, and exchange reaction data across different databases and platforms. It includes details about the reaction participants and can also capture information about the reaction conditions, ensuring consistency and interoperability in cheminformatics and related fields.

3.3 Other representations

Varnek et al. developed the Condensed Graph of Reactions (CGR) [74] to encode molecular structures in a matrix, identifying fragment occurrences and highlighting changes in atoms and bonds between reactants and products. This method was inspired by Fujita’s concept of imaginary transition states. CGRtools [75] was developed to support CGR.

The Bond-Electron (BE) matrix [76], proposed by Dugundji and Ugi, represents reactions in a matrix format. It has been employed by the EROS software [77] and the WODCA system [78] for reaction classification. The BE-matrix is an \(N\times N\) matrix, where N is the number of atoms in a molecule; diagonal entries denote free valence electrons, while off-diagonal entries indicate bond orders. Reactions are represented by an “R-matrix” that records bond changes, with positive values for bond formation and negative values for bond breakage. Adding the R-matrix to the reactants’ BE-matrix yields the products’ BE-matrix, providing an alternative way to represent reaction centers and illustrating the integration of detailed information into matrix representations.
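The arithmetic B + R = E can be illustrated with a toy three-atom system. The numbers are hypothetical, and the free-valence-electron diagonal entries are set to zero for brevity:

```python
import numpy as np

# Toy reactant BE-matrix B: bonds between atoms 0-1 and 1-2
B = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])

# R-matrix: break the 0-1 bond (-1) and form a 0-2 bond (+1)
R = np.array([[ 0, -1, 1],
              [-1,  0, 0],
              [ 1,  0, 0]])

E = B + R  # BE-matrix of the products
```

The nonzero entries of R directly mark the reaction center, which is what makes this matrix formalism attractive for automated reaction classification.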

Hierarchical Organization of Reactions through Attribute and Condition Eduction (HORACE) [79] utilizes a machine learning algorithm to classify chemical reactions, notable for its hierarchical reaction description. It captures both specific reaction instances and abstract reaction types using three abstraction levels. At the base level, it describes the partial order of atom types, establishing a hierarchy based on atom similarity. The next level characterizes molecules using functional groups, linking them to the reaction center. The top level specifies physicochemical properties, which describe the functional aspects of the corresponding structures. This hierarchical model provides a more comprehensive depiction of chemical reactions than purely structural approaches like SMILES.

Saller et al. introduced the InfoChem CLASSIFY algorithm [80], a method for reaction representation that has significantly influenced the development of rule-based synthesis planning methods [81, 82]. This approach identifies the reaction center by detecting atoms that change their implicit hydrogens, valency, \(\pi \)-electrons, atomic charges, or have bonds made or broken, mapping equivalent atoms in reactants and products. However, determining the reaction center is a key challenge [83,84,85]. To address this issue, the maximum common substructure (MCS) between reactants and products is first identified. Once found, hash codes for atoms in the reaction center are calculated using a modified Morgan algorithm [86], incorporating a wide range of properties such as atom type, valence, hydrogen count, \(\pi \)-electrons, aromaticity, and formal charges. These hash codes are then summed across reactants and one product to yield a unique reaction center representation. This description can be extended to include adjacent atoms for varying specificity: the reaction center alone provides a broad description, adding alpha atoms gives a medium description, and including further adjacent atoms results in a narrower, more specific description. These hash codes facilitate reaction classification and are used in later synthetic planning tools.
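
The core idea of detecting changed atoms and summing order-invariant per-atom hash codes can be sketched in a few lines. This is a simplified illustration, not the InfoChem implementation: bonds are hypothetical `(atom, atom, order)` triples, and the per-atom hash here uses only the element symbol, whereas the real algorithm folds in valence, charges, \(\pi\)-electrons, and aromaticity.

```python
import hashlib

def neighbor_profile(bonds, atom):
    """Sorted (neighbor, bond order) pairs describing one atom's bonding."""
    return sorted((b, o) if a == atom else (a, o)
                  for a, b, o in bonds if atom in (a, b))

def reaction_center(atoms, reactant_bonds, product_bonds):
    """Atoms whose bonding environment differs between the two sides."""
    return [a for a in atoms
            if neighbor_profile(reactant_bonds, a)
            != neighbor_profile(product_bonds, a)]

def center_hash(elements, center):
    """Order-invariant sum of per-atom hashes over the reaction center."""
    return sum(int(hashlib.sha256(elements[a].encode()).hexdigest(), 16) % 2**32
               for a in center)

# H2 + Cl2 -> 2 HCl: every atom gains or loses a bond,
# so all four atoms belong to the reaction center.
elements = {0: "H", 1: "H", 2: "Cl", 3: "Cl"}
reactant_bonds = [(0, 1, 1), (2, 3, 1)]
product_bonds = [(0, 2, 1), (1, 3, 1)]
center = reaction_center(list(elements), reactant_bonds, product_bonds)
code = center_hash(elements, center)
```

Because the hash codes are summed, the resulting code does not depend on atom numbering, which is what allows reactions written with different atom orderings to receive the same classification code.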

The concept of reaction fingerprints involves using binary vectors to capture the structural changes occurring in the reaction center. This method constructs fingerprints (e.g., ECFP variant [23]) and computes the difference between product and reactant vectors, optionally including agents. Patel et al. first discussed reaction vectors [87], which were later utilized in de novo design and classification approaches [88]. Schneider et al. employed difference fingerprints with the atom-pair variant to develop a prediction framework for classifying 50 reaction types [89]. While reaction fingerprints offer an alternative method to traditional reaction center detection and representation, they are difficult to convert back into reaction graphs, and handling stereochemistry remains an ongoing research topic [90]. Coley et al. developed RDChiral [90], an RDKit-based wrapper for managing stereochemistry in retrosynthetic template extraction and future approaches.
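
A difference fingerprint can be sketched by subtracting summed reactant feature counts from product feature counts. The substructure keys below are hypothetical stand-ins; real pipelines derive them from hashed fingerprints such as ECFP or atom pairs.

```python
# Minimal difference-fingerprint sketch: product counts minus reactant counts.
from collections import Counter

def difference_fingerprint(reactant_fps, product_fps):
    diff = Counter()
    for fp in product_fps:
        diff.update(fp)
    for fp in reactant_fps:
        diff.subtract(fp)
    # Keep only features whose count actually changed in the reaction.
    return {k: v for k, v in diff.items() if v != 0}

# Hypothetical keys for an esterification: an ester feature appears,
# acid and alcohol features disappear.
reactants = [Counter({"C(=O)O": 1}), Counter({"CO": 1})]
products = [Counter({"C(=O)OC": 1})]
print(difference_fingerprint(reactants, products))
# {'C(=O)OC': 1, 'C(=O)O': -1, 'CO': -1}
```

Features shared by both sides cancel out, so the vector concentrates on the transformation itself rather than on the unchanged scaffold.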

4 Representations for macromolecules

4.1 Peptides and proteins

Peptides and proteins are both constructed from amino acids (AAs). A single AA is characterized by an amine (-NH\(_2\)) group, a carboxyl (-COOH) group, and a distinct side chain. AAs are typically denoted by either a one-letter symbol or a three-letter abbreviation [91]. Although the Latin alphabet is sufficient to represent the 20 AAs in the genetic code, more symbols are required to represent the large number of naturally occurring AAs.

Peptides are biological sequences of 2 to 50 AAs connected to each other via peptide bonds. These sequences are involved in diverse biological activities, ranging from antibiotics to biological modulators. In 1994, Siani et al. developed the CHUCKLES method [92] to create SMILES for polymers based on their sequences and vice versa, facilitating Forward Translation (FT) in cheminformatics. The CHUCKLES method uses a lookup table that maps monomer sequences to their corresponding SMILES, with atoms involved in monomer bonds removed. This approach is suitable for oligomeric peptides and is integrated into BIOPEP-UWM [93]. CHORTLES [94], an upgraded version of CHUCKLES, was then created to deal with oligomeric mixtures.
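
The lookup-table idea behind CHUCKLES can be sketched as follows; the residue fragments below are illustrative (stereochemistry is ignored), not the published CHUCKLES table, but they show how fragments with the peptide-bond atoms pre-removed concatenate into a valid SMILES.

```python
# CHUCKLES-style sketch: sequence -> SMILES via a monomer lookup table.
# Each fragment omits the hydroxyl consumed when the peptide bond forms.
RESIDUE_SMILES = {
    "G": "NCC(=O)",      # glycine residue fragment
    "A": "NC(C)C(=O)",   # alanine residue fragment
}

def sequence_to_smiles(sequence):
    """Concatenate residue fragments and cap the C-terminus with a hydroxyl."""
    return "".join(RESIDUE_SMILES[aa] for aa in sequence) + "O"

print(sequence_to_smiles("GA"))  # NCC(=O)NC(C)C(=O)O
```

The reverse direction (SMILES back to a monomer sequence) is the harder half of the method, since it requires matching fragments against the atom graph rather than simple string concatenation.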

Hierarchical Editing Language for Macromolecules (HELM) [95, 96] and the Self-Contained Sequence Representation (SCSR) [97] are two prominent notations for describing a wide variety of macromolecules. While HELM utilizes SMILES, SCSR uses v3000 Molfiles. Conversion between these two types can be done by BIOVIA’s toolkit. Pfizer developed HELM under the auspices of the Pistoia Alliance to represent macromolecules composed of diverse structures (e.g., peptides, antibodies). Initially, HELM could only process molecules with well-defined structures, but the introduction of HELM2 expanded its capabilities to handle polymer mixtures and free-form annotations. HELM uses streamlined CHUCKLES and graphs to represent monomers in simple polymers and complex polymers, respectively. Its structure hierarchy reflects the granularity of the components: complex polymer, simple polymer, monomer, and atom. HELM is widely used by numerous pharmaceutical companies, public databases (e.g., ChEMBL), software (e.g., ChemDraw, ChemAxon), and toolkits (e.g., RDKit, Biomolecule Toolkit) [98].
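
The simple-polymer section of a HELM string can be read with a short parser. This sketch assumes the common `TYPEn{monomer.monomer...}` layout (e.g., `PEPTIDE1{A.G.K}$$$$`) and ignores the connection and annotation sections between the `$` delimiters.

```python
# Sketch: extract simple polymers and their monomer lists from a HELM string.
import re

def parse_simple_polymers(helm):
    polymers = {}
    # Polymer id (e.g. PEPTIDE1) followed by dot-separated monomers in braces.
    for name, monomers in re.findall(r"([A-Z]+\d+)\{([^}]*)\}", helm):
        polymers[name] = monomers.split(".")
    return polymers

print(parse_simple_polymers("PEPTIDE1{A.G.K}$$$$"))
# {'PEPTIDE1': ['A', 'G', 'K']}
```

A full HELM2 parser must additionally handle multi-character monomers in brackets, inline SMILES, connections, and annotations; production code should use a dedicated toolkit rather than a regex.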

4.2 Glycans

Glycans, or carbohydrates, refer to polymers such as oligosaccharides and polysaccharides that are built from multiple monosaccharides (monomers). These macromolecules play crucial roles in most biological processes, including cell-cell communication, immune response, and protein stabilization. In drug discovery, glycans are of particular interest for their potential as receptors, small-molecule glycomimetics, therapeutic glycopeptides, and vaccines. Oligosaccharides and polysaccharides are polymers composed of more than 3 and more than 20 monomers, respectively.

Glycan databases are essential for carbohydrate research, typically using monosaccharide-based notations to record structures [99,100,101,102]. However, these notations are inadequate for analyzing glycan-protein interactions, which require atom-based representations. To address this, several tools have been developed to translate monosaccharide-based notations into atom-based formats. The Web3 Unique Representation of Carbohydrate Structures (WURCS) [103] was created to provide a linear, unique notation compatible with the semantic web, integrating bioinformatics and cheminformatics features. The latest version of WURCS [104], used by GlyTouCan [105], the International Glycan Structure Repository, encodes the main carbon backbone of monosaccharide residues, backbone modifications, and linkage information, while also handling unspecified structures. Despite its widespread adoption in databases, WURCS remains unsupported by most cheminformatics software. Besides, other independent representations have been proposed to tackle specific issues [106].

4.3 Polymeric drugs

Polymers are used to deliver drug molecules. However, several polymers possess therapeutic activities themselves and are used as bioactive agents in treatments; these are known as polymeric drugs. The BigSMILES [107] syntax was recently created to encode diverse polymer structures, including homopolymers, random and block co-polymers, and complex connectivity types (e.g., linear, ring, and branched). The stochastic units of these polymers are marked by curly brackets, with repeated units separated by commas within the brackets. Since BigSMILES notation does not yet support canonicalization, several canonicalization methods have been proposed to eliminate multiplicity [108]. There are currently no practical applications for this notation, but its prospective applications in drug discovery modeling are promising.
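
The curly-bracket convention can be sketched with a small extractor. The example string is illustrative rather than drawn from the BigSMILES paper; the `[$]` tokens stand for the bonding descriptors that mark where repeat units connect.

```python
# Sketch: locate stochastic objects (curly-bracket units) in a BigSMILES
# string and split their comma-separated repeat units.
import re

def stochastic_units(bigsmiles):
    return [obj.split(",") for obj in re.findall(r"\{([^{}]*)\}", bigsmiles)]

# A hypothetical random copolymer with two repeat units.
print(stochastic_units("{[$]CC(C)[$],[$]CC(c1ccccc1)[$]}"))
# [['[$]CC(C)[$]', '[$]CC(c1ccccc1)[$]']]
```

Everything outside the brackets remains ordinary SMILES, which is why BigSMILES stays compatible with existing SMILES tooling for the deterministic parts of a polymer.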

5 Discussion

5.1 Representations for machine learning

Property-based representations are continuous or discrete numeric features computed by software or libraries. These features, such as molecular descriptors, can be used for various molecular prediction tasks, including solubility, bioactivity, and toxicity prediction. When using these features with distance-based algorithms (e.g., k-Nearest Neighbors) or linear algorithms (e.g., Logistic Regression, Support Vector Machines), data normalization is often required to ensure that all features contribute equally to the model. This normalization step helps improve the performance and convergence of these algorithms. In contrast, when tree-based algorithms (e.g., Random Forest, Extremely Randomized Trees, Gradient Boosting Machines) are employed, data normalization can be omitted. Tree-based methods inherently handle features with different scales and are robust to varying feature distributions. This makes them particularly advantageous for dealing with heterogeneous datasets where feature scaling might be challenging or unnecessary. Furthermore, property-based representations can be combined with ensemble methods to enhance prediction accuracy and robustness. By leveraging multiple algorithms, ensemble methods can capture a broader range of patterns and relationships within the data, leading to improved model performance. These representations can also be integrated with feature selection techniques to identify the most informative features, reducing dimensionality and potentially enhancing computational efficiency and interpretability.
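
The normalization step mentioned above can be sketched as a z-score standardization of descriptor columns; the descriptor values below are hypothetical, and libraries such as scikit-learn provide equivalent, more robust scalers.

```python
# Z-score standardization: scale each descriptor column to zero mean
# and unit variance before feeding distance-based or linear models.
from statistics import mean, stdev

def standardize_columns(rows):
    """Standardize each column of a row-major descriptor table."""
    cols = list(zip(*rows))
    scaled = []
    for col in cols:
        m, s = mean(col), stdev(col)
        scaled.append([(x - m) / s if s else 0.0 for x in col])
    return [list(r) for r in zip(*scaled)]

# Two hypothetical descriptors on very different scales (e.g. MW and logP).
data = [[180.2, 1.2], [342.3, -0.5], [58.1, 2.9]]
scaled = standardize_columns(data)
```

Without this step, the large-magnitude descriptor would dominate any Euclidean distance computation, which is precisely why tree-based models, which split on one feature at a time, can skip it.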

Unlike property-based representations, most molecular fingerprints are binary features with lengths that vary depending on the type used. Since these features are binary, data scaling is not necessary. However, distance-based machine learning algorithms are generally unsuitable for molecular fingerprints due to the lack of robust distance metrics for binary vectors. This limitation is especially applicable when handling unbalanced datasets. For example, the Synthetic Minority Over-sampling Technique (SMOTE) [109] is not suitable for molecular fingerprints because it relies on computing distances and interpolation, operations that are not well suited to binary data. Tree-based algorithms are more appropriate for molecular fingerprints because they can effectively manage binary features. These algorithms are capable of capturing complex relationships and interactions within the binary features without the need for distance metrics. Some advanced tree-based algorithms (e.g., eXtreme Gradient Boosting [110], LightGBM [111], and CatBoost [112]) are proficient in the management of unbalanced datasets and can integrate feature importance metrics to identify the most pertinent binary features. Ensemble learning techniques can also be used to improve the performance of tree-based algorithms [113]. Furthermore, the representation of molecular fingerprints can be optimized by integrating tree-based methods with feature selection and extraction techniques, thereby reducing dimensionality and enhancing computational efficiency.
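
The interpolation problem with SMOTE on fingerprints can be demonstrated directly: a point interpolated between two bit vectors is no longer binary. The two fingerprints below are arbitrary toy vectors.

```python
# SMOTE-style interpolation between two binary fingerprints.
import random

def smote_interpolate(a, b, rng):
    """Linear interpolation between two samples, as SMOTE does."""
    t = rng.random()
    return [x + t * (y - x) for x, y in zip(a, b)]

rng = random.Random(0)
fp1 = [1, 0, 1, 0, 1]
fp2 = [0, 1, 1, 0, 0]
synthetic = smote_interpolate(fp1, fp2, rng)
# Positions where the parents disagree receive fractional values, so the
# synthetic sample cannot be read as presence/absence of substructures.
```

Each fractional entry corresponds to a bit the two parents disagree on; since a substructure is either present or absent, the synthetic vector has no chemical interpretation.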

Language model-based representations are continuous numeric features generated by molecular encoders, which are pre-trained neural networks that map the substructural information of molecules into vectors of continuous values, known as molecular embeddings. Since each molecule is represented by a fixed-length vector, data scaling is unnecessary. For a given encoder, these molecular embeddings are generated based on a learnable distribution. As medium-dimensional continuous vectors or matrices, these embeddings are highly compatible with distance-based, linear, and tree-based models. Additionally, molecular decoders can reconstruct the corresponding molecular structures from the embeddings. Depending on their configuration, molecular decoders may translate the embeddings into either identical or slightly different structures. This reconstructability is crucial for de novo molecular design. Combining molecular generative models with one or more pre-defined networks for property prediction results in property-directed molecule generation systems. These systems generate molecules with desired properties through a multi-objective optimization process. Essentially, the molecular encoder learns a distribution, forming a chemical vector space. The embeddings created within this space can then be transformed into valid molecules with the desired properties.
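
The fixed-length property of such embeddings can be illustrated with a toy encoder that mean-pools per-token vectors. The embedding table here is random; in a real language model it is learned during pre-training, and tokenization is more sophisticated than per-character.

```python
# Toy fixed-length "embedding": mean-pool random per-character vectors.
import random

DIM = 8
rng = random.Random(42)
embedding_table = {}  # token -> vector, lazily initialized

def embed(smiles):
    """Return a DIM-length vector regardless of the SMILES length."""
    vectors = []
    for token in smiles:
        if token not in embedding_table:
            embedding_table[token] = [rng.gauss(0, 1) for _ in range(DIM)]
        vectors.append(embedding_table[token])
    return [sum(col) / len(vectors) for col in zip(*vectors)]

e1, e2 = embed("CCO"), embed("c1ccccc1O")
assert len(e1) == len(e2) == DIM  # same length for molecules of any size
```

This shape invariance is what makes embeddings directly usable by downstream distance-based, linear, and tree-based models without padding or truncation.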

5.2 Representations for chemical reactions

The SMIRKS, RInChI, and other representations for chemical reactions each have strengths and weaknesses. SMIRKS, an extension of the SMILES notation, excels in simplicity and can encode complex reaction rules in a text-based format, making it accessible for computational applications. However, its simplicity can be a drawback when dealing with intricate reactions or stereochemistry. On the other hand, RInChI, a reaction-specific version of the InChI system, offers a more standardized and detailed representation, capturing precise information about reactants, products, and conditions. This standardization aids in data sharing and interoperability but can be cumbersome to generate and interpret due to its complexity. Additionally, other representations, like reaction graphs, provide a visual and intuitive depiction of chemical reactions, highlighting connectivity and transformations between molecules. While these are beneficial for education and initial analysis, they may lack the depth and precision needed for advanced computational modeling. Ultimately, the choice of representation depends on the specific needs of the task, balancing ease of use, detail, and computational efficiency.

5.3 Representations for macromolecules

Representations for macromolecules offer advantages over purely atomic-based notations in developing modified drug peptides. Replacing natural L-amino acids (L-AA) with D-amino acids (D-AA), for instance, can enhance a peptide’s oral bioavailability. HELM simplifies these modifications by providing readability at the polymer level, whereas SMILES operates at the atomic level. These approaches advance the integration of cheminformatics and bioinformatics. However, translation errors between biological and chemical peptide notations have been confirmed, and solutions have been proposed to address them.

5.4 Limitations and challenges

Despite being essential in bioinformatics, cheminformatics, and drug discovery, molecular representations face several limitations and challenges. The molecular world is vast and complex, with many aspects still unknown to humans. Molecules exhibit a wide range of structures, from simple linear chains to highly complex branched and ring structures. Large molecules, especially those with intricate 3D configurations, are often inadequately represented by most current methods. Macromolecules, such as proteins and polymers, present additional difficulties due to their long chains, bulky structures, and significant molecular weights, complicating the processes of featurization and encoding. String-based representations, such as SMILES or InChI, offer simplified expressions for all molecules but may fail to accurately capture stereochemistry or conformational details. Graph-based representations include connectivity information but still struggle to represent 3D conformations. Because single bonds can rotate, a molecule can exist in multiple conformations, known as conformers. While conformational information is often ignored in some modeling tasks, it can be incorporated into the main graphs as node attributes. Representing molecules with full information on their chiral centers and stereoisomers requires substantial computational resources and specialized tools or software. Current cheminformatics toolkits and libraries can support property-based representations for small or medium-sized molecules but may be slow to process complex structures or unable to compute the physicochemical properties of large molecules. Language model-based representations play crucial roles in various tasks, including property-directed molecule generation, QSAR modeling, and other downstream machine learning tasks. The effectiveness of these representations largely depends on how the pre-trained molecular encoder is developed and can vary across different tasks. 
Table 8 summarizes all types of molecular representations, highlighting their advantages and disadvantages.

Table 8 All types of molecular representations with highlighted advantages and disadvantages

5.5 Future directions and emerging trends

Emerging trends and innovative molecular representations are transforming cheminformatics, particularly in drug discovery, by addressing the limitations of traditional methods. Recently, advanced graph-based deep learning architectures have been developed to tackle challenges in molecular property prediction, de novo molecular design, and representation learning. The introduction of Message Passing Neural Networks (MPNNs) and their learning mechanisms has significantly influenced the development of other deep learning architectures for molecular graphs [114]. Modern graph-based neural networks now incorporate not only connectivity information but also data on molecular structures, substructures, conformation, and properties. Additionally, quantum molecular graphs have emerged as promising alternatives for representing molecules based on quantum mechanical properties and wave functions [115,116,117]. The rise of transformers and self-attention mechanisms has spurred the development of novel language model-based representations, which can customize the structural patterns of groups of molecules [118]. Quantum computing has made significant progress in recent years, driven by advances in both hardware and algorithms. The potential applications of quantum computing in drug discovery have been extensively discussed [119,120,121]. While opinions on the practical benefits of quantum computing vary, most computational scientists agree that it can save time and effort by substantially accelerating modeling processes. This acceleration allows for the production of larger models with high generalizability in a shorter time. Quantum computing is also expected to enhance the processing of larger molecular graphs and speed up training and prediction phases. 
Moreover, pre-trained networks for language model-based representations can be trained on a significantly larger number of molecules than existing models, further enhancing their utility and effectiveness in cheminformatics and drug discovery.
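
The message-passing mechanism at the heart of MPNNs can be sketched in a few lines. In this toy version both the message and update functions are plain sums; real MPNNs replace them with learned neural networks, and the example graph and one-hot features are illustrative.

```python
# One round of sum-aggregation message passing on a molecular graph.
def message_passing_step(node_features, edges):
    """Each node aggregates its neighbors' features, then updates its own."""
    dim = len(next(iter(node_features.values())))
    messages = {v: [0.0] * dim for v in node_features}
    for u, v in edges:
        messages[u] = [m + x for m, x in zip(messages[u], node_features[v])]
        messages[v] = [m + x for m, x in zip(messages[v], node_features[u])]
    # Update: add the aggregated message to the current node state.
    return {v: [h + m for h, m in zip(node_features[v], messages[v])]
            for v in node_features}

# Ethanol heavy-atom graph C-C-O with one-hot element features [C, O].
features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
updated = message_passing_step(features, [(0, 1), (1, 2)])
```

Stacking several such rounds lets information propagate across the whole graph, which is how these networks capture substructure context beyond immediate neighbors.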

6 Conclusion

The role of molecular representations is pivotal since they provide a variety of methods for converting complex chemical structures into numerical formats that can be efficiently processed and analyzed. The selection of representation may significantly impact the outcomes of downstream tasks, requiring an appropriate balance between capturing relevant structural information and maintaining computational efficiency. Molecular representations facilitate various tasks, including similarity searches, virtual screening, and machine learning. In the future, the continued development of more efficient molecular representations will help improve the power of computational approaches and unlock novel directions in cheminformatics and drug discovery.